Toronto Metropolitan University
Incremental Text Clustering Algorithm Using Incremental Learning in COVID-19 Research Papers

thesis
posted on 2024-03-18, 16:57 authored by Mahfuja Nilufar
This thesis aims to build clusters of similar research papers. Text clustering for research articles is challenging because handling newly added papers normally requires re-clustering. An incremental clustering algorithm is presented for finding similar research papers in COVID-19 related literature. The proposed approach uses an incremental word embedding generation technique to extract feature vectors from the papers. The initial clustering is performed with the K-means algorithm using two NLP feature extraction models, TF-IDF and Word2vec. The clustering results show that Word2vec outperforms TF-IDF. Because the COVID-19 literature keeps growing, the ultimate focus is to add newly published papers to the existing clusters without re-clustering. The title, abstract, and full body of the papers are each considered when testing the proposed incremental algorithm. Clustering quality is evaluated with the Microsoft language similarity package, which shows that clustering on the full-text body outperforms clustering on the abstract or the title.
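The idea of adding a new paper to existing clusters without re-clustering can be illustrated with a minimal sketch. This is not the thesis's algorithm: it assumes TF-IDF vectors (rather than Word2vec embeddings), cosine similarity to cluster centroids, and a running-mean centroid update. The initial clusters are grouped by hand here in place of a K-means run, and every name (`IncrementalClusters`, `fit_idf`, the sample texts) is hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: naive whitespace tokenisation; a real pipeline would do more.
    return text.lower().split()

def fit_idf(docs):
    """Smoothed IDF weights computed once from the initial corpus (token lists)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}

def tfidf(tokens, idf):
    """L2-normalised sparse TF-IDF vector; terms outside the vocabulary are dropped."""
    tf = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    """Mean of a non-empty list of sparse vectors."""
    terms = set().union(*vectors)
    return {t: sum(v.get(t, 0.0) for v in vectors) / len(vectors) for t in terms}

class IncrementalClusters:
    """Keeps one centroid and one size per cluster; new items never trigger re-clustering."""

    def __init__(self, centroids, sizes):
        self.centroids = centroids
        self.sizes = sizes

    def add(self, vec):
        # Pick the most similar existing centroid.
        best = max(range(len(self.centroids)),
                   key=lambda k: cosine(vec, self.centroids[k]))
        # Update only that centroid as a running mean over its members.
        n, c = self.sizes[best], self.centroids[best]
        for t in set(c) | set(vec):
            c[t] = (c.get(t, 0.0) * n + vec.get(t, 0.0)) / (n + 1)
        self.sizes[best] = n + 1
        return best

# Hypothetical initial clusters (stand-in for the output of an initial K-means run).
initial = [
    ["vaccine trial efficacy antibody response", "vaccine dose antibody immunity"],
    ["lockdown mobility transmission model", "transmission model reproduction number"],
]
corpus = [tokenize(text) for group in initial for text in group]
idf = fit_idf(corpus)
model = IncrementalClusters(
    centroids=[centroid([tfidf(tokenize(t), idf) for t in group]) for group in initial],
    sizes=[len(group) for group in initial],
)

# Newly published papers are assigned one at a time.
first = model.add(tfidf(tokenize("antibody response after second vaccine dose"), idf))
second = model.add(tfidf(tokenize("new mobility data for the transmission model"), idf))
```

In this sketch the vocabulary and IDF weights are frozen after the initial fit, so words that appear only in new papers are ignored. The abstract's incremental word embedding generation step presumably addresses this kind of limitation, but its details are not given here.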

History

Language

eng

Degree

  • Master of Science

Program

  • Computer Science

Granting Institution

Ryerson University

LAC Thesis Type

  • Thesis

Thesis Advisor

Abdolreza Abhari

Year

2022
