Toronto Metropolitan University
Browse

Speeding up calibration of latent Dirichlet allocation model to improve topic analysis in software engineering

Download (2.43 MB)
thesis
posted on 2021-05-24, 14:51 authored by Jorge Arturo Lopez
Extraction of topics from large text corpuses helps improve Software Engineering (SE) processes. Latent Dirichlet Allocation (LDA) represents one of the algorithmic tools to understand, search, exploit, and summarize a large corpus of data (documents), and it is often used to perform such analysis. However, calibration of the models is computationally expensive, especially if iterating over a large number of topics. Our goal is to create a simple formula allowing analysts to estimate the number of topics, so that the top X topics include the desired proportion of documents under study. We derived the formula from the empirical analysis of three SE-related text corpuses. We believe that practitioners can use our formula to expedite LDA analysis. The formula is also of interest to theoreticians, as it suggests that different SE text corpuses have similar underlying properties.

History

Language

English

Degree

  • Master of Science

Program

  • Computer Science

Granting Institution

Ryerson University

LAC Thesis Type

  • Thesis

Year

2017

Usage metrics

    Computer Science (Theses)

    Exports

    RefWorks
    BibTeX
    Ref. manager
    Endnote
    DataCite
    NLM
    DC