Comparing PSO-based clustering over contextual vector embeddings to modern topic modeling

dc.contributor.author: Miles, Samuel
dc.contributor.author: Yao, Lixia
dc.contributor.author: Meng, Weilin
dc.contributor.author: Black, Christopher M.
dc.contributor.author: Miled, Zina Ben
dc.contributor.department: Electrical and Computer Engineering, School of Engineering and Technology
dc.date.accessioned: 2023-12-01T17:27:28Z
dc.date.available: 2023-12-01T17:27:28Z
dc.date.issued: 2022-05
dc.description.abstract: Efficient topic modeling is needed to support applications that aim at identifying main themes from a collection of documents. In the present paper, a reduced vector embedding representation and particle swarm optimization (PSO) are combined to develop a topic modeling strategy that is able to identify representative themes from a large collection of documents. Documents are encoded using a reduced, contextual vector embedding from a general-purpose pre-trained language model (sBERT). A modified PSO algorithm (pPSO) that tracks particle fitness on a dimension-by-dimension basis is then applied to these embeddings to create clusters of related documents. The proposed methodology is demonstrated on two datasets. The first dataset consists of posts from the online health forum r/Cancer and the second dataset is a standard benchmark for topic modeling which consists of a collection of messages posted to 20 different news groups. When compared to the state-of-the-art generative document models (i.e., ETM and NVDM), pPSO is able to produce interpretable clusters. The results indicate that pPSO is able to capture both common topics as well as emergent topics. Moreover, the topic coherence of pPSO is comparable to that of ETM and its topic diversity is comparable to NVDM. The assignment parity of pPSO on a document completion task exceeded 90% for the 20NewsGroups dataset. This rate drops to approximately 30% when pPSO is applied to the same Skip-Gram embedding derived from a limited, corpus-specific vocabulary which is used by ETM and NVDM.
dc.eprint.version: Final published version
dc.identifier.citation: Miles, S., Yao, L., Meng, W., Black, C. M., & Miled, Z. B. (2022). Comparing PSO-based clustering over contextual vector embeddings to modern topic modeling. Information Processing & Management, 59(3), 102921. https://doi.org/10.1016/j.ipm.2022.102921
dc.identifier.uri: https://hdl.handle.net/1805/37252
dc.language.iso: en_US
dc.publisher: Elsevier
dc.relation.isversionof: 10.1016/j.ipm.2022.102921
dc.relation.journal: Information Processing & Management
dc.rights: Attribution 4.0 International
dc.rights.uri: https://creativecommons.org/licenses/by/4.0
dc.source: Publisher
dc.subject: Topic modeling
dc.subject: Clustering
dc.subject: Vector embedding
dc.subject: PSO
dc.subject: ETM
dc.subject: NVDM
dc.title: Comparing PSO-based clustering over contextual vector embeddings to modern topic modeling
dc.type: Article
Files
Original bundle
Name: Miles2022Comparing-CCBY.pdf
Size: 569.14 KB
Format: Adobe Portable Document Format
License bundle
Name: license.txt
Size: 1.99 KB
Format: Item-specific license agreed upon to submission