Browsing by Author "Ding, Haoran"
Now showing 1 - 4 of 4
Item
Attention Mechanism with BERT for Content Annotation and Categorization of Pregnancy-Related Questions on a Community Q&A Site (IEEE, 2020-12)
Luo, Xiao; Ding, Haoran; Tang, Matthew; Gandhi, Priyanka; Zhang, Zhan; He, Zhe; Engineering Technology, School of Engineering and Technology

In recent years, the social web has been increasingly used for health information seeking, sharing, and subsequent health-related research. Women often use the Internet or social networking sites to seek information related to pregnancy in different stages. They may ask questions about birth control, trying to conceive, labor, or taking care of a newborn or baby. Classifying different types of questions about pregnancy information (e.g., before, during, and after pregnancy) can inform the design of social media and professional websites for pregnancy education and support. This research investigates the attention mechanism built into, or added on top of, the BERT model for classifying and annotating pregnancy-related questions posted on a community Q&A site. We evaluated two BERT-based models and compared them against traditional machine learning models for question classification. Most importantly, we investigated two attention mechanisms: the built-in self-attention mechanism of BERT and the additional attention layer on top of BERT for relevant term annotation. The classification performance showed that the BERT-based models worked better than the traditional models, and that BERT with an additional attention layer achieved higher overall precision than the basic BERT model. The results also showed that the two attention mechanisms annotate relevant content differently and could serve as feature selection methods for text mining in general.
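The architecture described above, a BERT encoder topped with an extra attention layer that both pools the sequence for classification and highlights relevant terms, can be sketched roughly as follows. This is an illustrative reconstruction, not the authors' released code; the model name, the three-way label scheme, and the layer sizes are assumptions.

```python
# Illustrative sketch only: BERT encoder with an additional attention layer
# for question classification (e.g., before / during / after pregnancy).
# Model name, dimensions, and labels are assumptions, not the authors' code.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class BertWithAttentionClassifier(nn.Module):
    def __init__(self, model_name="bert-base-uncased", num_labels=3):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        # Extra attention layer on top of BERT: scores each token, then pools.
        self.attn_score = nn.Linear(hidden, 1)
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, input_ids, attention_mask):
        states = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        scores = self.attn_score(states).squeeze(-1)            # (batch, seq)
        scores = scores.masked_fill(attention_mask == 0, -1e9)  # ignore padding
        weights = torch.softmax(scores, dim=-1)                 # token weights
        pooled = torch.bmm(weights.unsqueeze(1), states).squeeze(1)
        return self.classifier(pooled), weights

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(["Is it safe to exercise while trying to conceive?"],
                  return_tensors="pt", padding=True, truncation=True)
model = BertWithAttentionClassifier()
logits, token_weights = model(batch["input_ids"], batch["attention_mask"])
```

The returned token weights are what such an added layer would expose for term-level annotation, analogous to inspecting BERT's built-in self-attention.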
Item
Computational methods to automate the initial interpretation of lower extremity arterial Doppler and duplex carotid ultrasound studies (Elsevier, 2021)
Luo, Xiao; Ara, Lena; Ding, Haoran; Rollins, David; Motaganahalli, Raghu; Sawchuk, Alan P.; Surgery, School of Medicine

Background: Lower extremity arterial Doppler (LEAD) and duplex carotid ultrasound studies are used for the initial evaluation of peripheral arterial disease and carotid stenosis. However, intra- and inter-laboratory variability exists between interpreters, and other interpreter responsibilities can delay the timeliness of the report. To address these deficits, we examined whether machine learning algorithms could be used to classify these Doppler ultrasound studies.

Methods: We developed a hierarchical deep learning model to classify aortoiliac, femoropopliteal, and trifurcation disease in LEAD ultrasound studies and a random forest machine learning algorithm to classify the amount of carotid stenosis from duplex carotid ultrasound studies, using experienced physician interpretation in an active, credentialed vascular laboratory as the reference standard. Waveforms, pressures, flow velocities, and the presence of plaque were input into a hierarchical neural network. Artificial intelligence was developed to automate the interpretation of these LEAD and carotid duplex ultrasound studies. Statistical analysis was performed using the confusion matrix.

Results: We extracted 5761 LEAD ultrasound studies from 2015 to 2017 and 18,650 duplex carotid ultrasound studies from 2016 to 2018 from the Indiana University Health system. The artificial intelligence algorithms achieved 97.0% accuracy for predicting normal cases, 88.2% accuracy for aortoiliac disease, 90.1% accuracy for femoropopliteal disease, and 90.5% accuracy for trifurcation disease. For internal carotid artery stenosis, the accuracy was 99.2% for predicting 0% to 49% stenosis, 100% for predicting 50% to 69% stenosis, 100% for predicting >70% stenosis, and 100% for predicting occlusion. For common carotid artery stenosis, the accuracy was 99.9% for predicting 0% to 49% stenosis, 100% for predicting 50% to 99% stenosis, and 100% for predicting occlusion.

Conclusions: The machine learning models, using the collected blood pressure and waveform data from LEAD studies and the flow velocities and presence of plaque from duplex carotid ultrasound studies, proved reliable in differentiating normal from diseased arterial systems and accurate in classifying the extent of vascular disease.
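As a rough illustration of the carotid branch of such a pipeline, the sketch below fits a random forest to duplex-style features and evaluates it with a confusion matrix, mirroring the statistical analysis named above. The feature set, stenosis bins, and synthetic data are assumptions for demonstration only, not the study's dataset or code.

```python
# Minimal sketch (not the study's actual pipeline): a random forest grading
# internal carotid artery stenosis from duplex-style measurements.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500
# Hypothetical features: peak systolic velocity (cm/s), end-diastolic velocity
# (cm/s), ICA/CCA velocity ratio, and presence of plaque (0/1).
X = np.column_stack([
    rng.uniform(40, 450, n),   # PSV
    rng.uniform(10, 150, n),   # EDV
    rng.uniform(0.5, 6.0, n),  # ICA/CCA ratio
    rng.integers(0, 2, n),     # plaque present
])
# Hypothetical labels: 0 = 0-49%, 1 = 50-69%, 2 = >=70% stenosis, 3 = occlusion.
y = rng.integers(0, 4, n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
# Per-class accuracies like those reported above are derived from this matrix.
print(confusion_matrix(y_test, clf.predict(X_test)))
```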
Item
Large Language Models for Unsupervised Keyphrase Extraction and Biomedical Data Analytics (2024-08)
Ding, Haoran; Luo, Xiao; King, Brian; Zhang, Qingxue; Li, Lingxi

Natural Language Processing (NLP), a vital branch of artificial intelligence, is designed to equip computers with the ability to comprehend and manipulate human language, facilitating the extraction and utilization of textual data. NLP plays a crucial role in harnessing the vast quantities of textual data generated daily and extracting meaningful information from them. Among the various techniques, keyphrase extraction stands out for its ability to distill concise information from extensive texts, making it invaluable for summarizing and navigating content efficiently. Keyphrase extraction usually begins by generating candidate phrases and then ranking them to identify the most relevant ones. Methods can be categorized into supervised and unsupervised approaches. Supervised methods typically achieve higher accuracy because they are trained on labeled data, which allows them to capture and exploit patterns recognized during training. However, their dependency on extensive, well-annotated datasets limits their applicability in scenarios where such data is scarce or costly to obtain. Unsupervised methods, while free from the constraints of labeled data, face challenges in capturing deep semantic relationships within text, which can limit their effectiveness. Despite these challenges, unsupervised keyphrase extraction holds significant promise because of its scalability and lower barriers to entry, as it does not require labeled datasets. This approach is increasingly favored for its potential to aid in building extensive knowledge bases from unstructured data, which is particularly useful in domains where acquiring labeled data is impractical. As a result, unsupervised keyphrase extraction is not only a valuable tool for information retrieval but also a pivotal technology for the ongoing expansion of knowledge-driven applications in NLP. In this dissertation, we introduce three innovative unsupervised keyphrase extraction methods: AttentionRank, AGRank, and LLMRank. Additionally, we present a method for constructing knowledge graphs from unsupervised keyphrase extraction that leverages the self-attention mechanism.

The first study presents the AttentionRank model, which uses the self-attention of a pre-trained language model to derive underlying importance rankings of candidate phrases. The model also employs a cross-attention mechanism to assess the semantic relevance between each candidate phrase and the document, enhancing the phrase ranking process. AGRank, detailed in the second study, is a graph-based framework that merges deep learning techniques with graph theory. It constructs a candidate-phrase graph using mutual attentions from a pre-trained language model; both global document information and local phrase details are incorporated as enhanced nodes within the graph, and a graph algorithm ranks the candidate phrases. The third study, LLMRank, leverages the strengths of large language models (LLMs) and graph algorithms: it employs LLMs to generate keyphrase candidates and then integrates global information through the text's graphical structure to rerank the candidates, significantly improving keyphrase extraction performance. The fourth study explores how self-attention mechanisms can be used to extract keyphrases from medical literature and generate query-related phrase graphs, improving text retrieval visualization. The mutual attentions of medical entities, extracted using a pre-trained model, form the basis of the knowledge graph. Coupled with a specialized retrieval algorithm, this allows long-range connections between medical entities to be visualized alongside the supporting literature. In summary, our exploration of unsupervised keyphrase extraction and biomedical data analysis introduces novel methods and insights in NLP, particularly in information extraction. These contributions are crucial for the efficient processing of large text datasets and suggest avenues for future research and applications.
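To make the attention-based ranking idea concrete, here is a minimal sketch in the spirit of AttentionRank: candidate phrases are scored by how much self-attention their tokens receive from the rest of the document. It is a simplification, not the dissertation's actual algorithm; the candidate list is supplied by hand, and averaging attention over all layers and heads is an assumption.

```python
# Simplified attention-based keyphrase ranking (illustrative only).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

document = ("Unsupervised keyphrase extraction identifies the most important "
            "phrases in a document without labeled training data.")
candidates = ["keyphrase extraction", "document", "training data"]

batch = tokenizer(document, return_tensors="pt")
with torch.no_grad():
    out = model(**batch, output_attentions=True)

# Average the self-attention maps over layers and heads, then sum over query
# positions to get the total attention each token receives.
attn = torch.stack(out.attentions).mean(dim=0).mean(dim=1)[0]  # (seq, seq)
received = attn.sum(dim=0)                                     # (seq,)

doc_ids = batch["input_ids"][0].tolist()

def phrase_score(phrase):
    """Attention received by the phrase's token span within the document."""
    ids = tokenizer(phrase, add_special_tokens=False)["input_ids"]
    best = 0.0
    for i in range(len(doc_ids) - len(ids) + 1):
        if doc_ids[i:i + len(ids)] == ids:
            best = max(best, received[i:i + len(ids)].sum().item())
    return best

for phrase in sorted(candidates, key=phrase_score, reverse=True):
    print(f"{phrase}: {phrase_score(phrase):.3f}")
```

Graph-based variants such as AGRank build on the same attention signal but rank candidates over a phrase graph rather than by direct scores.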
Item
Using machine learning to detect sarcopenia from electronic health records (Sage, 2023-08-29)
Luo, Xiao; Ding, Haoran; Broyles, Andrea; Warden, Stuart J.; Moorthi, Ranjani N.; Imel, Erik A.; Physical Therapy, School of Health and Human Sciences

Introduction: Sarcopenia (low muscle mass and strength) causes dysmobility and loss of independence. Sarcopenia is often not directly coded or described in electronic health records (EHR). The objective was to improve sarcopenia detection using structured data from the EHR.

Methods: Adults undergoing musculoskeletal testing (December 2017-March 2020) were classified as meeting sarcopenia thresholds for 0 (controls), ≥1 (Sarcopenia-1), or ≥2 (Sarcopenia-2) tests. Electronic health record diagnoses, medications, and laboratory testing were extracted from the Indiana Network for Patient Care. Five machine learning models were applied to the EHR data for predicting sarcopenia.

Results: Of 1304 participants, 1055 were controls, 249 met Sarcopenia-1, and 76 met Sarcopenia-2. Sarcopenic participants were older, with higher fat mass, higher Charlson Comorbidity Index, and more chronic diseases. All models performed better for Sarcopenia-2 than for Sarcopenia-1. The top-performing models for Sarcopenia-1 were logistic regression [area under the curve (AUC) 71.59 (95% confidence interval [CI], 71.51-71.66)] and multi-layer perceptron [AUC 71.48 (95% CI, 71.00-71.97)]. The top-performing models for Sarcopenia-2 were logistic regression [AUC 91.44 (95% CI, 91.28-91.60)] and support vector machine [AUC 90.81 (95% CI, 88.41-93.20)]. For the best logistic regression model, important sarcopenia predictors included diabetes mellitus, digestive system complaints, signs and symptoms involving the nervous, musculoskeletal, and respiratory systems, metabolic disorders, and kidney or urinary tract disorders. Opioids, corticosteroids, and antihyperlipidemic drugs were also more common among sarcopenic participants.

Conclusions: Applying machine learning models, sarcopenia can be predicted from structured data in the EHR, which may be developed through future studies to facilitate large-scale early detection and intervention in clinical populations.
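For orientation, a bare-bones version of the logistic regression setup described above might look like the sketch below: structured EHR-style features, a scaled logistic regression, and an AUC score. The feature names and synthetic labels are placeholders, not the study's data, so the AUC it prints demonstrates only the workflow, not the reported results.

```python
# Minimal sketch (not the study's code): logistic regression on structured
# EHR-style features to predict sarcopenia, evaluated by AUC.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 1304  # cohort size reported in the abstract
# Hypothetical structured features; real features would come from diagnoses,
# medications, and laboratory results in the EHR.
X = pd.DataFrame({
    "age": rng.normal(65, 10, n),
    "fat_mass_kg": rng.normal(30, 8, n),
    "charlson_index": rng.integers(0, 10, n),
    "diabetes": rng.integers(0, 2, n),
    "opioid_rx": rng.integers(0, 2, n),
    "corticosteroid_rx": rng.integers(0, 2, n),
})
y = rng.integers(0, 2, n)  # 1 = meets a sarcopenia threshold (synthetic)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
```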