Speeding up calibration of latent Dirichlet allocation model to improve topic analysis in software engineering

Lopez, Jorge Arturo

doi:10.32920/ryerson.14665455.v1

Lopez_Jorge_Arturo.pdf (2.43 MB)

thesis

posted on 2021-05-24, 14:51 authored by Jorge Arturo Lopez

Extracting topics from large text corpora helps improve Software Engineering (SE) processes. Latent Dirichlet Allocation (LDA) is one of the algorithmic tools used to understand, search, exploit, and summarize a large corpus of documents, and it is often applied to this kind of analysis. However, calibrating the models is computationally expensive, especially when iterating over a large number of topics. Our goal is to create a simple formula that lets analysts estimate the number of topics such that the top X topics include the desired proportion of the documents under study. We derived the formula from an empirical analysis of three SE-related text corpora. We believe that practitioners can use our formula to expedite LDA analysis. The formula is also of interest to theoreticians, as it suggests that different SE text corpora have similar underlying properties.
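
The quantity the abstract refers to, the proportion of documents included in the top X topics, can be illustrated with a minimal sketch. This is not the thesis's formula, which the abstract does not state. It assumes each document has already been assigned to a single dominant topic by a fitted LDA model; that assignment rule and the function names `coverage_by_top_topics` and `smallest_x_for_target` are hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import List, Optional, Sequence


def coverage_by_top_topics(dominant_topics: Sequence[int]) -> List[float]:
    """Cumulative share of documents covered by the 1, 2, ... largest topics.

    Assumption: dominant_topics[i] is the topic id assigned to document i.
    """
    counts = Counter(dominant_topics)
    total = len(dominant_topics)
    shares: List[float] = []
    covered = 0
    # Walk topics from the most to the least populated one.
    for _, n_docs in counts.most_common():
        covered += n_docs
        shares.append(covered / total)
    return shares


def smallest_x_for_target(dominant_topics: Sequence[int],
                          target: float) -> Optional[int]:
    """Smallest X such that the top X topics cover at least `target` of the documents."""
    for x, share in enumerate(coverage_by_top_topics(dominant_topics), start=1):
        if share >= target:
            return x
    return None  # no documents, or target above 1.0


# Toy data: 10 documents spread over 4 topics (made-up values).
toy_assignments = [0, 0, 0, 0, 1, 1, 1, 2, 2, 3]
x_needed = smallest_x_for_target(toy_assignments, 0.8)
```

With the toy data above, the three largest topics hold 9 of the 10 documents, so `x_needed` is 3. Computing this curve normally requires fitting the model first; a formula of the kind the abstract describes would aim to estimate the number of topics without repeating that fit for many candidate values.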

Language

English

Degree

Master of Science

Program

Computer Science

Granting Institution

Ryerson University

LAC Thesis Type

Thesis

Year

2017

Keywords

Information retrieval -- Mathematical models; Software engineering; Data mining -- Mathematical models

Licence

In Copyright
