IU Indianapolis ScholarWorks :: Browsing by Author "Jang, Hyeju"

Browsing by Author "Jang, Hyeju"

Now showing 1 - 8 of 8

Classification of Alzheimer’s Disease Leveraging Multi-task Machine Learning Analysis of Speech and Eye-Movement Data
(Frontiers Media, 2021-09-20) Jang, Hyeju; Soroski, Thomas; Rizzo, Matteo; Barral, Oswald; Harisinghani, Anuj; Newton-Mason, Sally; Granby, Saffrin; da Cunha Vasco, Thiago Monnerat Stutz; Lewis, Caitlin; Tutt, Pavan; Carenini, Giuseppe; Conati, Cristina; Field, Thalia S.; Computer Science, Luddy School of Informatics, Computing, and Engineering
Alzheimer’s disease (AD) is a progressive neurodegenerative condition that results in impaired performance in multiple cognitive domains. Preclinical changes in eye movements and language can occur with the disease, and progress alongside worsening cognition. In this article, we present the results from a machine learning analysis of a novel multimodal dataset for AD classification. The cohort includes data from two novel tasks not previously assessed in classification models for AD (pupil fixation and description of a pleasant past experience), as well as two established tasks (picture description and paragraph reading). Our dataset includes language and eye movement data from 79 memory clinic patients with diagnoses of mild-moderate AD, mild cognitive impairment (MCI), or subjective memory complaints (SMC), and 83 older adult controls. The analysis of the individual novel tasks showed similar classification accuracy when compared to established tasks, demonstrating their discriminative ability for memory clinic patients. Fusing the multimodal data across tasks yielded the highest overall AUC of 0.83 ± 0.01, indicating that the data from novel tasks are complementary to established tasks.
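The abstract above reports an AUC for data fused across tasks but does not say how the fusion was done. As a minimal sketch only, the following assumes late fusion by averaging per-task probabilities (an assumption, not the authors' stated method) and computes AUC as the fraction of patient/control pairs ranked correctly. The function names and the toy scores are hypothetical.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def late_fusion(task_scores):
    """Average the per-participant scores across tasks (assumed fusion rule)."""
    return [sum(column) / len(column) for column in zip(*task_scores)]


# Toy data: 1 = patient, 0 = control; scores are made up for illustration.
labels = [1, 1, 0, 0]
speech_scores = [0.9, 0.4, 0.5, 0.2]
eye_scores = [0.6, 0.7, 0.3, 0.6]
fused_scores = late_fusion([speech_scores, eye_scores])
```

On this toy data each single-task score misranks or ties one pair, while the averaged score separates both patients from both controls, which is the sense in which modalities can be complementary.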
Evaluating Web-Based Automatic Transcription for Alzheimer Speech Data: Transcript Comparison and Machine Learning Analysis
(JMIR, 2022) Soroski, Thomas; Vasco, Thiago da Cunha; Newton-Mason, Sally; Granby, Saffrin; Lewis, Caitlin; Harisinghani, Anuj; Rizzo, Matteo; Conati, Cristina; Murray, Gabriel; Carenini, Giuseppe; Field, Thalia S.; Jang, Hyeju; Computer Science, Luddy School of Informatics, Computing, and Engineering
Background: Speech data for medical research can be collected noninvasively and in large volumes. Speech analysis has shown promise in diagnosing neurodegenerative disease. To effectively leverage speech data, transcription is important, as there is valuable information contained in lexical content. Manual transcription, while highly accurate, limits the potential scalability and cost savings associated with language-based screening. Objective: To better understand the use of automatic transcription for classification of neurodegenerative disease, namely, Alzheimer disease (AD), mild cognitive impairment (MCI), or subjective memory complaints (SMC) versus healthy controls, we compared automatically generated transcripts against transcripts that went through manual correction. Methods: We recruited individuals from a memory clinic (“patients”) with a diagnosis of mild-to-moderate AD (n=44, 30%), MCI (n=20, 13%), or SMC (n=8, 5%), as well as healthy controls (n=77, 52%) living in the community. Participants were asked to describe a standardized picture, read a paragraph, and recall a pleasant life experience. We compared transcripts generated using Google speech-to-text software to manually verified transcripts by examining transcription confidence scores, transcription error rates, and machine learning classification accuracy. For the classification tasks, logistic regression, Gaussian naive Bayes, and random forests were used. Results: The transcription software showed higher confidence scores (P<.001) and lower error rates (P>.05) for speech from healthy controls compared with patients. Classification models using human-verified transcripts significantly (P<.001) outperformed automatically generated transcript models for both spontaneous speech tasks. This comparison showed no difference in the reading task. Manually adding pauses to transcripts had no impact on classification performance.
However, manually correcting both spontaneous speech tasks led to significantly higher performances in the machine learning models. Conclusions: We found that automatically transcribed speech data could be used to distinguish patients with a diagnosis of AD, MCI, or SMC from controls. We recommend a human verification step to improve the performance of automatic transcripts, especially for spontaneous tasks. Moreover, human verification can focus on correcting errors and adding punctuation to transcripts. However, manual addition of pauses is not needed, which can simplify the human verification step to more efficiently process large volumes of speech data.
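The abstract above compares transcripts by their transcription error rates without naming the metric. A common choice is word error rate (WER), the word-level edit distance divided by the reference length. The sketch below assumes WER is a reasonable stand-in; `word_error_rate` is a hypothetical helper, not code from the study.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)
```

Here the manually verified transcript would serve as `reference` and the automatic transcript as `hypothesis`.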
Natural language processing to evaluate texting conversations between patients and healthcare providers during COVID-19 Home-Based Care in Rwanda at scale
(Public Library of Science, 2025-01-15) Lester, Richard T.; Manson, Matthew; Semakula, Muhammed; Jang, Hyeju; Mugabo, Hassan; Magzari, Ali; Blackmer, Junhong Ma; Fattah, Fanan; Niyonsenga, Simon Pierre; Rwagasore, Edson; Ruranga, Charles; Remera, Eric; Ngabonziza, Jean Claude S.; Carenini, Giuseppe; Nsanzimana, Sabin; Computer Science, Luddy School of Informatics, Computing, and Engineering
Community isolation of patients with communicable infectious diseases limits the spread of pathogens, but our understanding of isolated patients' needs and challenges is incomplete. Rwanda deployed a digital health service nationally to assist public health clinicians to remotely monitor and support SARS-CoV-2 cases via their mobile phones using daily interactive short message service (SMS) check-ins. We aimed to assess the texting patterns and communicated topics to better understand patient experiences. We extracted data on all COVID-19 cases and exposed contacts who were enrolled in the WelTel text messaging program between March 18, 2020, and March 31, 2022, and linked demographic and clinical data from the national COVID-19 registry. A sample of the text conversation corpus was translated into English and labeled with topics of interest defined by medical experts. Multiple natural language processing (NLP) topic classification models were trained and compared using F1 scores. The best-performing models were applied to classify unlabeled conversations. A total of 33,081 isolated patients (mean age 33.9, range 0-100; 44% female), including 30,398 cases and 2,683 contacts, were registered in WelTel. Registered patients generated 12,119 interactive text conversations in Kinyarwanda (n = 8,183, 67%), English (n = 3,069, 25%) and other languages. Sufficiently trained large language models (LLMs) were unavailable for Kinyarwanda. Traditional machine learning (ML) models outperformed fine-tuned transformer architecture language models on the native untranslated language corpus; however, the reverse was observed for models trained on English-only data. The most frequently identified topics discussed included symptoms (69%), diagnostics (38%), social issues (19%), prevention (18%), healthcare logistics (16%), and treatment (8.5%). Education, advice, and triage on these topics were provided to patients.
Interactive text messaging can be used to remotely support isolated patients in pandemics at scale. NLP can help evaluate the medical and social factors that affect isolated patients which could ultimately inform precision public health responses to future pandemics.
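The abstract above says topic classification models were compared using F1 scores. Because a conversation can carry several topics, one plausible reading is a per-topic F1 averaged across topics (macro F1). The sketch below assumes that reading; the helper names and the toy labels are hypothetical.

```python
def f1_per_topic(gold, pred, topics):
    """Per-topic F1 for multi-label data; gold and pred are lists of topic sets."""
    scores = {}
    for topic in topics:
        tp = sum(1 for g, p in zip(gold, pred) if topic in g and topic in p)
        fp = sum(1 for g, p in zip(gold, pred) if topic not in g and topic in p)
        fn = sum(1 for g, p in zip(gold, pred) if topic in g and topic not in p)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the topic never appears.
        scores[topic] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(gold, pred, topics):
    """Unweighted mean of the per-topic F1 scores."""
    per_topic = f1_per_topic(gold, pred, topics)
    return sum(per_topic.values()) / len(per_topic)


# Toy labels for three conversations (illustrative only).
topics = ["symptoms", "diagnostics", "treatment"]
gold = [{"symptoms", "diagnostics"}, {"symptoms"}, {"treatment"}]
pred = [{"symptoms"}, {"symptoms", "treatment"}, {"treatment"}]
```

Macro averaging weights rare topics such as treatment equally with frequent ones such as symptoms; a micro average would instead be dominated by the frequent topics.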
Network Alignment Using Topological and Node Embedding Features
(2024-08) Almulhim, Aljohara; Al Hasan, Mohammad; Tuceryan, Mihran; Durresi, Arjan; Mukhopadhyay, Snehasis; Jang, Hyeju
In today's big data environment, the development of robust knowledge discovery solutions depends on integrating data from various sources. For example, intelligence agencies fuse data from multiple sources to identify criminal activities; e-commerce platforms consolidate user activities across platforms and devices to build better user profiles; scientists connect data from various modalities to develop new drugs and treatments. In all such activities, entities from different data sources need to be aligned: first, to ensure accurate analysis and, more importantly, to discover novel knowledge regarding these entities. If the data sources are networks, aligning entities from different sources leads to the task of network alignment, which is the focus of this thesis. The main objective of this task is to find an optimal one-to-one correspondence among nodes in two or more networks utilizing graph topology and node/edge attributes. In existing works, diverse computational schemes have been adopted for solving the network alignment task; these schemes include finding eigen-decomposition of similarity matrices, solving quadratic assignment problems via sub-gradient optimization, and designing iterative greedy matching techniques. Contemporary works approach this problem using a deep learning framework by learning node representations to identify matches. Node matching's key challenges include computational complexity and scalability. Moreover, privacy concerns or unavailability often prevent the utilization of node attributes in real-world scenarios. In light of this, we aim to solve this problem by relying solely on the graph structure, without the need for prior knowledge, external attributes, or guidance from landmark nodes. Clearly, topology-based matching emerges as a hard problem when compared to other network matching tasks. In this thesis, I propose two original works to solve the topology-based network alignment task.
The first work, Graphlet-based Alignment (Graphlet-Align), employs a topological approach to network alignment. Graphlet-Align represents each node with a local graphlet-count-based signature and uses that as a feature for deriving node-to-node similarity across a pair of networks. By using these similarity values in a bipartite matching algorithm, Graphlet-Align obtains a preliminary alignment. It then uses high-order information extending to the k-hop neighborhood of a node to further refine the alignment, achieving better accuracy. We validated Graphlet-Align's efficacy by applying it to various large real-world networks, achieving accuracy improvements ranging from 20% to 72% over state-of-the-art methods on both duplicated and noisy graphs. Expanding on this paradigm that focuses solely on topology for solving graph alignment, in my second work, I develop a self-supervised learning framework known as Self-Supervised Topological Alignment (SST-Align). SST-Align uses graphlet-based signatures to create self-supervised node alignment labels, and then uses those labels to generate node embedding vectors of both networks in a joint space from which the node alignment task can be effectively and accurately solved. It starts with an optimization process that applies average pooling on top of the extracted graphlet signature to construct an initial node assignment. Next, a self-supervised Siamese network architecture utilizes both the initial node assignment and graph convolutional networks to generate node embeddings through a contrastive loss. By applying kd-tree similarity to the two networks' embeddings, we achieve the final node mapping. Extensive testing on real-world graph alignment datasets shows that our methodology achieves competitive results compared to seven existing models in terms of node mapping accuracy.
Additionally, we conduct an ablation study to evaluate the two-stage accuracy, excluding the representation learning part and comparing the mapping accuracy accordingly. This thesis enhances the theoretical understanding of topological features in the analysis of graph data for the network alignment task, hence facilitating future advancements in the field.
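The first stage described above (a graphlet-count signature per node, then matching on signature similarity) can be illustrated with a heavily simplified sketch. It is not Graphlet-Align itself: it assumes only graphlets of up to three nodes, replaces the bipartite matching algorithm with a greedy one-to-one matcher, and omits the k-hop refinement. All function names are hypothetical.

```python
from itertools import combinations


def signature(adj, v):
    """Counts of the positions v occupies in graphlets of up to three nodes."""
    nbrs = adj[v]
    degree = len(nbrs)
    # Triangles through v: pairs of neighbours that are themselves connected.
    triangles = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    # Open two-edge paths with v in the middle.
    centre_paths = degree * (degree - 1) // 2 - triangles
    # Open two-edge paths with v at one end (each triangle is counted twice).
    end_paths = sum(len(adj[u]) - 1 for u in nbrs) - 2 * triangles
    return (degree, triangles, centre_paths, end_paths)


def greedy_align(adj_a, adj_b):
    """Greedy one-to-one matching on squared signature distance (a stand-in)."""
    sig_a = {v: signature(adj_a, v) for v in adj_a}
    sig_b = {v: signature(adj_b, v) for v in adj_b}
    candidates = [
        (sum((x - y) ** 2 for x, y in zip(sig_a[a], sig_b[b])), a, b)
        for a in adj_a
        for b in adj_b
    ]
    candidates.sort(key=lambda c: c[0])  # closest signatures first
    mapping, used = {}, set()
    for _, a, b in candidates:
        if a not in mapping and b not in used:
            mapping[a] = b
            used.add(b)
    return mapping


# Toy graph and a relabelled copy of it (the permutation is the ground truth).
graph_a = {0: {1, 2, 5}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {3}, 5: {0}}
truth = {0: "d", 1: "a", 2: "f", 3: "b", 4: "e", 5: "c"}
graph_b = {truth[v]: {truth[u] for u in nbrs} for v, nbrs in graph_a.items()}
alignment = greedy_align(graph_a, graph_b)
```

The toy graph was chosen so that every node has a distinct signature. On graphs with symmetric nodes, signatures tie and local counts alone cannot resolve the match, which is one reason a refinement step using wider neighbourhoods is useful.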
Software Vulnerability Detection Using Deep Learning
(2025-05) Sanchez, Edwin; Zou, Xukai; Li, Feng; Jang, Hyeju
Vulnerabilities in software have remained a critical issue at the forefront of cybersecurity for as long as the field has existed. As the cost of allowing these vulnerabilities to exist increases each year, so have the efforts to detect software vulnerabilities before they can become a problem. This paper focuses specifically on static analysis of source code. Previous methods have focused on hand-crafted detections for extremely specific vulnerability types; however, the recent explosion in Artificial Intelligence in the form of Large Language Models has led to a re-examination of the potential to identify common vulnerabilities more generally. This paper aims to apply common and cross-domain Deep Learning methods to examine whether these methods can be used to improve the state of the art in software vulnerability detection and classification. More specifically, the concepts of prompting and fine-tuning, as well as the loss function Additive Angular Margin Loss (originally designed for face recognition and classification tasks), are applied in a series of experiments and compared. Through experimentation, it has been found that simple and common prompting methods, as well as fine-tuning methods, are not enough on their own to perform reliable software vulnerability detection and classification.
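Additive Angular Margin Loss, mentioned in the abstract above, adds a fixed angle to the angle between a sample's embedding and its true-class weight vector before applying a scaled softmax cross-entropy. The sketch below shows the standard formulation for a single sample in plain Python. The scale and margin values are illustrative, the function name is hypothetical, and nothing here reflects the thesis's actual training setup.

```python
import math


def additive_angular_margin_loss(embedding, class_weights, target, scale=30.0, margin=0.5):
    """Cross-entropy over scaled cosines, with the margin added to the target angle."""

    def unit(vector):
        norm = math.sqrt(sum(x * x for x in vector))
        return [x / norm for x in vector]

    e = unit(embedding)
    logits = []
    for j, weights in enumerate(class_weights):
        cosine = sum(a * b for a, b in zip(e, unit(weights)))
        cosine = max(-1.0, min(1.0, cosine))  # guard acos against rounding error
        if j == target:
            # Penalise the true class: cos(theta + m) instead of cos(theta).
            cosine = math.cos(math.acos(cosine) + margin)
        logits.append(scale * cosine)
    # Numerically stable log-sum-exp, then negative log-probability of the target.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum - logits[target]


# Toy example: an embedding perfectly aligned with class 0.
weights = [[1.0, 0.0], [0.0, 1.0]]
with_margin = additive_angular_margin_loss([2.0, 0.0], weights, target=0, scale=10.0)
without_margin = additive_angular_margin_loss([2.0, 0.0], weights, target=0, scale=10.0, margin=0.0)
```

With `margin=0.0` the function reduces to ordinary softmax cross-entropy on normalized vectors. A positive margin raises the loss even for a correctly classified sample, which pushes classes further apart in angle. This sketch does not handle the case where the target angle plus the margin exceeds pi.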
T3-Vis: a visual analytic framework for Training and fine-Tuning Transformers in NLP
(ACL Anthology, 2021) Li, Raymond; Xiao, Wen; Wang, Lanjun; Jang, Hyeju; Carenini, Giuseppe; Computer Science, Luddy School of Informatics, Computing, and Engineering
Transformers are the dominant architecture in NLP, but their training and fine-tuning are still very challenging. In this paper, we present the design and implementation of a visual analytic framework for assisting researchers in this process, by providing them with valuable insights about the model’s intrinsic properties and behaviours. Our framework offers an intuitive overview that allows the user to explore different facets of the model (e.g., hidden states, attention) through interactive visualization, and provides a suite of built-in algorithms that compute the importance of model components and different parts of the input sequence. Case studies and feedback from a user focus group indicate that the framework is useful, and suggest several improvements. Our framework is available at: https://github.com/raymondzmc/T3-Vis.
Tracking COVID-19 Discourse on Twitter in North America: Infodemiology Study Using Topic Modeling and Aspect-Based Sentiment Analysis
(JMIR, 2021-02-10) Jang, Hyeju; Rempel, Emily; Roth, David; Carenini, Giuseppe; Janjua, Naveed Zafar; Computer Science, Luddy School of Informatics, Computing, and Engineering
Background: Social media is a rich source where we can learn about people's reactions to social issues. As COVID-19 has impacted people's lives, it is essential to capture how people react to public health interventions and understand their concerns. Objective: We aim to investigate people's reactions and concerns about COVID-19 in North America, especially in Canada. Methods: We analyzed COVID-19-related tweets using topic modeling and aspect-based sentiment analysis (ABSA), and interpreted the results with public health experts. To generate insights on the effectiveness of specific public health interventions for COVID-19, we compared timelines of topics discussed with the timing of implementation of interventions, synergistically including information on people's sentiment about COVID-19-related aspects in our analysis. In addition, to further investigate anti-Asian racism, we compared timelines of sentiments for Asians and Canadians. Results: Topic modeling identified 20 topics, and public health experts provided interpretations of the topics based on top-ranked words and representative tweets for each topic. The interpretation and timeline analysis showed that the discovered topics and their trends are highly related to public health promotions and interventions such as physical distancing, border restrictions, handwashing, staying home, and face coverings. After training the data using ABSA with human-in-the-loop, we obtained 545 aspect terms (eg, "vaccines," "economy," and "masks") and 60 opinion terms such as "infectious" (negative) and "professional" (positive), which were used for inference of sentiments of 20 key aspects selected by public health experts. The results showed negative sentiments related to the overall outbreak, misinformation and Asians, and positive sentiments related to physical distancing. Conclusions: Analyses using natural language processing techniques with domain expert involvement can produce useful information for public health.
This study is the first to analyze COVID-19-related tweets in Canada in comparison with tweets in the United States by using topic modeling and human-in-the-loop domain-specific ABSA. This kind of information could help public health agencies to understand public concerns as well as what public health messages are resonating in our populations who use Twitter, which can be helpful for public health agencies when designing a policy for new interventions.
Tracking Public Attitudes Toward COVID-19 Vaccination on Tweets in Canada: Using Aspect-Based Sentiment Analysis
(JMIR, 2022-03-29) Jang, Hyeju; Rempel, Emily; Roe, Ian; Adu, Prince; Carenini, Giuseppe; Janjua, Naveed Zafar; Computer Science, Luddy School of Informatics, Computing, and Engineering
Background: The development and approval of COVID-19 vaccines have generated optimism for the end of the COVID-19 pandemic and a return to normalcy. However, vaccine hesitancy, often fueled by misinformation, poses a major barrier to achieving herd immunity. Objective: We aim to investigate Twitter users' attitudes toward COVID-19 vaccination in Canada after vaccine rollout. Methods: We applied a weakly supervised aspect-based sentiment analysis (ABSA) technique, which involves the human-in-the-loop system, on COVID-19 vaccination-related tweets in Canada. Automatically generated aspect and opinion terms were manually corrected by public health experts to ensure the accuracy of the terms and make them more domain-specific. Then, based on these manually corrected terms, the system inferred sentiments toward the aspects. We observed sentiments toward key aspects related to COVID-19 vaccination, and investigated how sentiments toward "vaccination" changed over time. In addition, we analyzed the most retweeted or liked tweets by observing most frequent nouns and sentiments toward key aspects. Results: After applying the ABSA system, we obtained 170 aspect terms (eg, "immunity" and "pfizer") and 6775 opinion terms (eg, "trustworthy" for the positive sentiment and "jeopardize" for the negative sentiment). While manually verifying or editing these terms, our public health experts selected 20 key aspects related to COVID-19 vaccination for analysis. The sentiment analysis results for the 20 key aspects revealed negative sentiments related to "vaccine distribution," "side effects," "allergy," "reactions," and "anti-vaxxer," and positive sentiments related to "vaccine campaign," "vaccine candidates," and "immune response." These results indicate that the Twitter users express concerns about the safety of vaccines but still consider vaccines as the option to end the pandemic. 
In addition, compared to the sentiment of the remaining tweets, the most retweeted or liked tweets showed more positive sentiment overall toward key aspects (P<.001), especially vaccines (P<.001) and vaccination (P=.009). Further investigation of the most retweeted or liked tweets revealed two opposing trends in Twitter users who showed negative sentiments toward vaccines: the "anti-vaxxer" population that used negative sentiments as a means to discourage vaccination and the "Covid Zero" population that used negative sentiments to encourage vaccinations while critiquing the public health response. Conclusions: Our study examined public sentiments toward COVID-19 vaccination on tweets over an extended period in Canada. Our findings could inform public health agencies to design and implement interventions to promote vaccination.
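The abstract above describes inferring sentiment toward aspects from expert-corrected aspect and opinion terms. The study's system is weakly supervised; the sketch below is a much simpler lexicon-and-window stand-in that only illustrates how the two term lists can combine into per-aspect sentiment. The function name, the window rule, and the example sentence are assumptions; the example terms are taken from the abstract.

```python
import re


def aspect_sentiment(text, aspects, opinions, window=2):
    """Sum the polarity of opinion terms found within `window` tokens of each aspect."""
    tokens = re.findall(r"[a-z']+", text.lower())
    scores = {}
    for i, token in enumerate(tokens):
        if token in aspects:
            nearby = tokens[max(0, i - window): i + window + 1]
            polarity = sum(opinions.get(t, 0) for t in nearby)
            scores[token] = scores.get(token, 0) + polarity
    return scores


# Aspect and opinion terms drawn from the abstract; the sentence is invented.
aspects = {"vaccines", "immunity"}
opinions = {"trustworthy": 1, "jeopardize": -1}
result = aspect_sentiment(
    "Vaccines are trustworthy but delays jeopardize immunity", aspects, opinions
)
```

A fixed token window is a crude proxy for which opinion word modifies which aspect, so real systems typically rely on learned or syntactic associations instead.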
