A social and news media benchmark dataset for topic modeling

dc.contributor.author: Miles, Samuel
dc.contributor.author: Yao, Lixia
dc.contributor.author: Meng, Weilin
dc.contributor.author: Black, Christopher M.
dc.contributor.author: Ben-Miled, Zina
dc.contributor.department: Electrical and Computer Engineering, School of Engineering and Technology [en_US]
dc.date.accessioned: 2023-07-17T14:22:18Z
dc.date.available: 2023-07-17T14:22:18Z
dc.date.issued: 2022-07-04
dc.description.abstract: Topic modeling is an active research area with several unanswered questions. Recent research in this area focuses on using a vector embedding representation of the input text with both generative and evolutionary topic modeling techniques. Unfortunately, it is hard to compare different techniques when the underlying data and preprocessing steps that were used to develop the models are not available. This paper presents two secondary datasets that can help address this gap. These datasets are derived from two primary datasets. The first consists of 8,145 posts from the r/Cancer health forum and the second consists of 18,294 messages submitted to 20 different newsgroups. The same preprocessing procedure is applied to both datasets: punctuation, stop words and high-frequency words are removed. Each dataset is then clustered using three different topic modeling techniques (pPSO, ETM and NVDM) and three topic numbers (10, 20 and 30). In addition, for pPSO, two text embedding representations are considered: sBERT and Skipgram. The secondary datasets were originally developed in support of a comparative analysis of the aforementioned topic modeling techniques in a study titled “Comparing PSO-based Clustering over Contextual Vector Embeddings to Modern Topic Modeling” submitted to the Journal of Information Processing and Management. The present paper provides a detailed description of the two secondary datasets, including the unique identifier that can be used to retrieve the original documents, the preprocessing scripts, and the topic keywords generated by the three topic modeling techniques with varying topic numbers and embedding representations. As such, the datasets allow direct comparison with other topic modeling techniques. To further facilitate this process, the algorithm underlying the evolutionary topic modeling technique, pPSO, proposed by the authors is also provided. [en_US]
dc.eprint.version: Final published version [en_US]
dc.identifier.citation: Miles S, Yao L, Meng W, Black CM, Ben-Miled Z. A social and news media benchmark dataset for topic modeling. Data Brief. 2022;43:108442. Published 2022 Jul 4. doi:10.1016/j.dib.2022.108442 [en_US]
dc.identifier.uri: https://hdl.handle.net/1805/34411
dc.language.iso: en_US [en_US]
dc.publisher: Elsevier [en_US]
dc.relation.isversionof: 10.1016/j.dib.2022.108442 [en_US]
dc.relation.journal: Data in Brief [en_US]
dc.rights: Attribution 4.0 International [*]
dc.rights.uri: http://creativecommons.org/licenses/by/4.0/ [*]
dc.source: PMC [en_US]
dc.subject: Topic modeling [en_US]
dc.subject: Document embedding [en_US]
dc.subject: Health forums [en_US]
dc.title: A social and news media benchmark dataset for topic modeling [en_US]
dc.type: Article [en_US]
Files
Original bundle
Name: main.pdf
Size: 248.98 KB
Format: Adobe Portable Document Format
License bundle
Name: license.txt
Size: 1.99 KB
Format: Item-specific license agreed upon to submission
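
Illustrative note on the preprocessing named in the abstract (removal of punctuation, stop words and high-frequency words). The following is a minimal Python sketch of that kind of procedure, not the authors' preprocessing script, which is distributed with the dataset. The stop-word list, the function names `tokenize` and `preprocess`, and the `max_doc_fraction` threshold are all assumptions made for illustration; this record does not state how the authors define "high frequency".

```python
import string
from collections import Counter

# Hypothetical stop-word list; the list used for the dataset is not given in this record.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "with"}

# Map every ASCII punctuation character to a space so words stay separated.
_PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def tokenize(text):
    """Lowercase the text, replace punctuation with spaces, split on whitespace."""
    return text.lower().translate(_PUNCT_TABLE).split()


def preprocess(docs, max_doc_fraction=0.5):
    """Drop stop words, then drop words found in more than
    max_doc_fraction of the documents (an assumed definition of
    'high frequency')."""
    tokenized = [[t for t in tokenize(d) if t not in STOP_WORDS] for d in docs]

    # Document frequency: number of documents in which each word appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    cutoff = max_doc_fraction * len(docs)
    return [[t for t in tokens if doc_freq[t] <= cutoff] for tokens in tokenized]
```

Applying the same function with the same threshold to both corpora mirrors the abstract's point that one procedure is shared across the two datasets, which is what makes the resulting topic keywords comparable.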