posted on 2024-03-18, 16:57, authored by Mahfuja Nilufar
This thesis aims to build clusters of similar research papers. Text clustering for research articles is challenging because re-clustering is necessary to handle newly added papers. An incremental clustering algorithm is presented to find similar research papers in COVID-19 related literature. The proposed approach uses an incremental word embedding generation technique to extract feature vectors of the papers. The initial clustering is performed with the K-means algorithm on features from two NLP feature extraction models: TF-IDF and Word2vec. The clustering results show that Word2vec outperforms TF-IDF. Because the COVID-19 literature grows continuously, the ultimate focus is to add newly published papers to the existing clusters without re-clustering. The title, abstract, and full body of the papers are each considered when testing the proposed incremental algorithm. Clustering quality is evaluated with the Microsoft language similarity package; the evaluation shows that clustering on the full-text body outperforms clustering on the abstract or the title.
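The idea of adding a new paper to existing clusters without re-clustering can be illustrated with a minimal sketch. This is not the thesis's algorithm, whose details the abstract does not give. It assumes each paper is already represented by a feature vector, that a new paper joins the cluster whose centroid is most similar by cosine similarity, and that the centroid is then updated as a running mean. The function names `cosine` and `assign_incrementally` and the toy two-dimensional vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def assign_incrementally(centroids, counts, vector):
    """Assign `vector` to the most similar existing centroid and update that
    centroid as a running mean. Assumed update rule, for illustration only.
    Mutates `centroids` and `counts` in place and returns the cluster index."""
    best = max(range(len(centroids)), key=lambda k: cosine(centroids[k], vector))
    counts[best] += 1
    n = counts[best]
    centroids[best] = [c + (x - c) / n for c, x in zip(centroids[best], vector)]
    return best


# Toy state: two existing clusters, each built from one paper so far.
centroids = [[1.0, 0.0], [0.0, 1.0]]
counts = [1, 1]

# A newly published paper arrives; only one centroid is touched.
label = assign_incrementally(centroids, counts, [0.0, 3.0])
```

In this sketch the other clusters are left untouched, so the cost of adding a paper depends on the number of clusters rather than on the number of papers already clustered.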