IU Indianapolis ScholarWorks :: Browsing by Author "Miles, Samuel"

Browsing by Author "Miles, Samuel"


Comparing PSO-based clustering over contextual vector embeddings to modern topic modeling
(Elsevier, 2022-05) Miles, Samuel; Yao, Lixia; Meng, Weilin; Black, Christopher M.; Miled, Zina Ben; Electrical and Computer Engineering, School of Engineering and Technology
Efficient topic modeling is needed to support applications that aim at identifying main themes from a collection of documents. In the present paper, a reduced vector embedding representation and particle swarm optimization (PSO) are combined to develop a topic modeling strategy that is able to identify representative themes from a large collection of documents. Documents are encoded using a reduced, contextual vector embedding from a general-purpose pre-trained language model (sBERT). A modified PSO algorithm (pPSO) that tracks particle fitness on a dimension-by-dimension basis is then applied to these embeddings to create clusters of related documents. The proposed methodology is demonstrated on two datasets. The first dataset consists of posts from the online health forum r/Cancer and the second dataset is a standard benchmark for topic modeling which consists of a collection of messages posted to 20 different news groups. When compared to the state-of-the-art generative document models (i.e., ETM and NVDM), pPSO is able to produce interpretable clusters. The results indicate that pPSO is able to capture both common topics as well as emergent topics. Moreover, the topic coherence of pPSO is comparable to that of ETM and its topic diversity is comparable to NVDM. The assignment parity of pPSO on a document completion task exceeded 90% for the 20NewsGroups dataset. This rate drops to approximately 30% when pPSO is applied to the same Skip-Gram embedding derived from a limited, corpus-specific vocabulary which is used by ETM and NVDM.
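The clustering step described in this abstract can be illustrated with a minimal sketch. It uses a standard global-best PSO, not the authors' pPSO, whose dimension-by-dimension fitness tracking is not specified here. The function names, the hyperparameters (`w`, `c1`, `c2`), and the fitness function (mean distance to the nearest centroid) are all assumptions chosen for illustration, and plain lists of floats stand in for the reduced sBERT embeddings.

```python
import math
import random

def quantization_error(centroids, points):
    """Mean distance from each point to its nearest centroid (lower is better)."""
    return sum(min(math.dist(p, c) for c in centroids) for p in points) / len(points)

def pso_cluster(points, k, n_particles=10, n_iter=50, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Standard global-best PSO (an assumption, not pPSO); each particle holds k centroids."""
    rng = random.Random(seed)
    dim = len(points[0])
    # Initialise each particle's centroids at randomly chosen data points.
    pos = [[list(rng.choice(points)) for _ in range(k)] for _ in range(n_particles)]
    vel = [[[0.0] * dim for _ in range(k)] for _ in range(n_particles)]
    best_pos = [[c[:] for c in p] for p in pos]
    best_fit = [quantization_error(p, points) for p in pos]
    g = min(range(n_particles), key=best_fit.__getitem__)
    g_pos, g_fit = [c[:] for c in best_pos[g]], best_fit[g]
    for _ in range(n_iter):
        for i in range(n_particles):
            for j in range(k):
                for d in range(dim):
                    r1, r2 = rng.random(), rng.random()
                    # Inertia + pull toward personal best + pull toward global best.
                    vel[i][j][d] = (w * vel[i][j][d]
                                    + c1 * r1 * (best_pos[i][j][d] - pos[i][j][d])
                                    + c2 * r2 * (g_pos[j][d] - pos[i][j][d]))
                    pos[i][j][d] += vel[i][j][d]
            fit = quantization_error(pos[i], points)
            if fit < best_fit[i]:
                best_fit[i], best_pos[i] = fit, [c[:] for c in pos[i]]
                if fit < g_fit:
                    g_fit, g_pos = fit, [c[:] for c in pos[i]]
    # Assign every point to its nearest centroid in the best solution found.
    labels = [min(range(k), key=lambda j: math.dist(p, g_pos[j])) for p in points]
    return g_pos, labels, g_fit
```

Calling `pso_cluster(embeddings, k=10)` would return the best centroids found, one cluster label per document, and the final fitness. In this sketch each cluster plays the role of a topic; extracting topic keywords from the clusters is a separate step that is not shown.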
Comparing PSO-Based Clustering Over Contextual Vector Embeddings to Modern Topic Modeling
(2022-05) Miles, Samuel; Ben Miled, Zina; Salama, Paul; El-Sharkawy, Mohamed
Efficient topic modeling is needed to support applications that aim at identifying main themes from a collection of documents. In this thesis, a reduced vector embedding representation and particle swarm optimization (PSO) are combined to develop a topic modeling strategy that is able to identify representative themes from a large collection of documents. Documents are encoded using a reduced, contextual vector embedding from a general-purpose pre-trained language model (sBERT). A modified PSO algorithm (pPSO) that tracks particle fitness on a dimension-by-dimension basis is then applied to these embeddings to create clusters of related documents. The proposed methodology is demonstrated on three datasets across different domains. The first dataset consists of posts from the online health forum r/Cancer. The second dataset is a collection of NY Times abstracts and is used to compare the proposed model to LDA. The third is a standard benchmark dataset for topic modeling which consists of a collection of messages posted to 20 different news groups. It is used to compare state-of-the-art generative document models (i.e., ETM and NVDM) to pPSO. The results show that pPSO is able to produce interpretable clusters. Moreover, pPSO is able to capture both common topics as well as emergent topics. The topic coherence of pPSO is comparable to that of ETM and its topic diversity is comparable to NVDM. The assignment parity of pPSO on a document completion task exceeded 90% for the 20NewsGroups dataset. This rate drops to approximately 30% when pPSO is applied to the same Skip-Gram embedding derived from a limited, corpus-specific vocabulary which is used by ETM and NVDM.
Simulating Modern CPU Vulnerabilities on a 5-stage MIPS Pipeline Using Node-RED
(Springer, 2022-03) Miles, Samuel; McDonough, Corey; Michael, Emmanuel Obichukwu; Shankar Kumar, Valli Sanghami; Lee, John J.; Electrical and Computer Engineering, School of Engineering and Technology
This paper proposes a simulation of the 5-stage pipelined MIPS processor using Node-RED and illustrates the basic effects of modern CPU vulnerabilities. The study demonstrates the Spectre attack and the load value injection (LVI) transient-execution attack. For Spectre, the storing of secret data within the cache is shown. For LVI, the use of an attacker's injected page number after a page fault has occurred demonstrates the attack's ability to access the host secrets via the simulated memory hierarchy. The persistence of the secret data in the cache can also be observed for both attacks. The characteristics of such security vulnerabilities are successfully simulated with the proposed Node-RED-based processor simulator.
A social and news media benchmark dataset for topic modeling
(Elsevier, 2022-07-04) Miles, Samuel; Yao, Lixia; Meng, Weilin; Black, Christopher M.; Ben-Miled, Zina; Electrical and Computer Engineering, School of Engineering and Technology
Topic modeling is an active research area with several unanswered questions. The focus of recent research in this area is on the use of a vector embedding representation of the input text with both generative and evolutionary topic modeling techniques. Unfortunately, it is hard to compare different techniques when the underlying data and preprocessing steps that were used to develop the models are not available. This paper presents two secondary datasets that can help address this gap. These datasets are derived from two primary datasets. The first consists of 8145 posts from the r/Cancer health forum and the second consists of 18,294 messages submitted to 20 different news groups. The same preprocessing procedure is applied to both datasets by removing punctuation, stop words and high frequency words. Each dataset is then clustered using three different topic modeling techniques (pPSO, ETM and NVDM) and three topic numbers (10, 20 and 30). In addition, for pPSO two text embedding representations are considered: sBERT and Skip-Gram. The secondary datasets were originally developed in support of a comparative analysis of the aforementioned topic modeling techniques in a study titled “Comparing PSO-based Clustering over Contextual Vector Embeddings to Modern Topic Modeling” submitted to the Journal of Information Processing and Management. The present paper provides a detailed description of the two secondary datasets, including the unique identifier that can be used to retrieve the original documents, the pre-processing scripts, and the topic keywords generated by the three topic modeling techniques with varying topic numbers and embedding representations. As such, the datasets allow direct comparison with other topic modeling techniques. To further facilitate this process, the algorithm underlying the evolutionary topic modeling technique, pPSO, proposed by the authors is also provided.
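The preprocessing procedure named in this abstract (removing punctuation, stop words and high frequency words) could be sketched roughly as follows. This is not the authors' released script: the stop-word list, the lowercasing and whitespace tokenization, and the `max_doc_freq` threshold used to define "high frequency" are all assumptions made for illustration.

```python
import string
from collections import Counter

# Hypothetical stop-word list; the list actually used for the datasets is not given here.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "is", "in", "it"}

def preprocess(docs, max_doc_freq=0.5):
    """Strip punctuation, stop words, and high-frequency words from each document."""
    table = str.maketrans("", "", string.punctuation)
    tokenized = [
        [w for w in doc.lower().translate(table).split() if w not in STOP_WORDS]
        for doc in docs
    ]
    # Document frequency: the number of documents in which each word appears.
    doc_freq = Counter(w for tokens in tokenized for w in set(tokens))
    # Assumed rule: drop words that occur in more than max_doc_freq of the documents.
    limit = max_doc_freq * len(docs)
    return [[w for w in tokens if doc_freq[w] <= limit] for tokens in tokenized]
```

With this rule, a word such as "cancer" that appears in nearly every r/Cancer post would be dropped as uninformative, while rarer content words are kept.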
