IU Indianapolis ScholarWorks :: Browsing by Subject "Text mining"

Browsing by Subject "Text mining"


Biomedical Literature Mining and Knowledge Discovery of Phenotyping Definitions
(2019-07) Binkheder, Samar Hussein; Jones, Josette; Li, Lang; Quinney, Sara Kay; Wu, Huanmei; Zhang, Chi
Phenotyping definitions are essential in cohort identification when conducting clinical research, but they become an obstacle when they are not readily available. Developing new definitions manually requires expert involvement that is labor-intensive, time-consuming, and unscalable. Moreover, automated approaches rely mostly on electronic health records’ data that suffer from bias, confounding, and incompleteness. Limited efforts have been made to use text-mining and data-driven approaches to automate the extraction and literature-based knowledge discovery of phenotyping definitions and to support their scalability. In this dissertation, we proposed a text-mining pipeline combining rule-based and machine-learning methods to automate retrieval, classification, and extraction of phenotyping definitions’ information from literature. To achieve this, we first developed an annotation guideline with ten dimensions to annotate sentences with evidence of phenotyping definitions' modalities, such as phenotypes and laboratories. Two annotators manually annotated a corpus of sentences (n=3,971) extracted from full-text observational studies’ methods sections (n=86). Percent and Kappa statistics showed high inter-annotator agreement on sentence-level annotations. Second, we constructed two validated text classifiers using our annotated corpora: abstract-level and full-text sentence-level. We applied the abstract-level classifier to a large-scale biomedical literature corpus of over 20 million abstracts published between 1975 and 2018 to classify positive abstracts (n=459,406). After retrieving their full-texts (n=120,868), we extracted sentences from their methods sections and used the full-text sentence-level classifier to extract positive sentences (n=2,745,416). Third, we performed a literature-based discovery utilizing the positively classified sentences. Lexica-based methods were used to recognize medical concepts in these sentences (n=19,423). Co-occurrence and association methods were used to identify and rank phenotype candidates that are associated with a phenotype of interest. We derived 12,616,465 associations from our large-scale corpus. Our literature-based associations and large-scale corpus contribute to building new data-driven phenotyping definitions and expanding existing definitions with minimal expert involvement.
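As a rough illustration of how sentence-level co-occurrence can be turned into a ranked list of candidates, the following Python sketch scores concepts against a phenotype of interest. The abstract does not name the association measure used; pointwise mutual information (PMI) is assumed here purely for illustration, and the function name `rank_associations` and the toy sentences are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def rank_associations(sentences, target):
    """Rank concepts by PMI with `target`.

    `sentences` is a list of sets, each holding the concepts recognized
    in one sentence. PMI is an assumed measure, not the one confirmed
    by the source.
    """
    n = len(sentences)
    concept_counts = Counter()
    pair_counts = Counter()
    for concepts in sentences:
        unique = set(concepts)
        concept_counts.update(unique)
        # Count each unordered pair once per sentence.
        for a, b in combinations(sorted(unique), 2):
            pair_counts[(a, b)] += 1

    scores = {}
    for (a, b), joint in pair_counts.items():
        if target not in (a, b):
            continue
        other = b if a == target else a
        # PMI = log2( P(a, b) / (P(a) * P(b)) )
        scores[other] = math.log2((joint * n) / (concept_counts[a] * concept_counts[b]))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Made-up toy data, not drawn from the dissertation's corpus.
sentences = [
    {"type 2 diabetes", "hba1c", "metformin"},
    {"type 2 diabetes", "hba1c"},
    {"hypertension", "blood pressure"},
    {"type 2 diabetes", "metformin", "hypertension"},
]
ranked = rank_associations(sentences, "type 2 diabetes")
for concept, score in ranked:
    print(f"{concept}: {score:.3f}")
```

Concepts that never share a sentence with the target receive no score at all, which is one reason a real pipeline would likely add frequency thresholds or a significance test on top of a raw association measure.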
Condition-specific differential subnetwork analysis for biological systems
(2015-04) Jhamb, Deepali; Liu, Xiaowen; Li, Lang; Liu, Yunlong; Palakal, Mathew J.; Stocum, David L.
Biological systems behave differently under different conditions. Advances in sequencing technology over the last decade have led to the generation of enormous amounts of condition-specific data. However, these measurements often fail to identify low abundance genes/proteins that can be biologically crucial. In this work, a novel text-mining system was first developed to extract condition-specific proteins from the biomedical literature. The literature-derived data was then combined with proteomics data to construct condition-specific protein interaction networks. Further, an innovative condition-specific differential analysis approach was designed to identify key differences, in the form of subnetworks, between any two given biological systems. The framework developed here was implemented to understand the differences between limb regeneration-competent Ambystoma mexicanum and -deficient Xenopus laevis. This study provides an exhaustive systems-level analysis to compare regeneration-competent and -deficient subnetworks to show how different molecular entities inter-connect with each other and are rewired during the formation of an accumulation blastema in regenerating axolotl limbs. This study also demonstrates the importance of literature-derived knowledge, specific to limb regeneration, to augment the systems biology analysis. Our findings show that although the proteins might be common between the two given biological conditions, they can have a high dissimilarity based on their biological and topological properties in the subnetwork. The knowledge gained from the distinguishing features of limb regeneration in amphibians can be used in the future to chemically induce regeneration in mammalian systems. The approach developed in this dissertation is scalable and adaptable to understand differential subnetworks between any two biological systems. This methodology will not only facilitate the understanding of biological processes and molecular functions which govern a given system but also provide novel insights into the pathophysiology of diseases/conditions.
Correction: PhenoDEF: a corpus for annotating sentences with information of phenotype definitions in biomedical literature
(BMC, 2022-07-20) Binkheder, Samar; Wu, Heng‑Yi; Quinney, Sara K.; Zhang, Shijun; Zitu, Md. Muntasir; Chiang, Chien‑Wei; Wang, Lei; Jones, Josette; Li, Lang; BioHealth Informatics, School of Informatics and Computing
PhenoDEF: a corpus for annotating sentences with information of phenotype definitions in biomedical literature. Binkheder S, Wu HY, Quinney SK, Zhang S, Zitu MM, Chiang CW, Wang L, Jones J, Li L. J Biomed Semantics. 2022 Jun 11;13(1):17. doi: 10.1186/s13326-022-00272-6. PMID: 35690873
Identify Opioid Use Problem
(2018-12) Alzeer, Abdullah Hamad; Jones, Josette; Dixon, Brian; Bair, Matthew; Liu, Xiaowen
The aim of this research is to design a new method to identify opioid use problems (OUP) among long-term opioid therapy patients at Indiana University Health using text mining and machine learning approaches. First, a systematic review was conducted to investigate the current variables, methods, and opioid problem definitions used in the literature. We identified 75 distinct variables in 9 models that mostly used ICD codes to identify OUP. The review concluded that using ICD codes alone may not be enough to determine the real size of the opioid problem and that more effort is needed to adopt other methods to understand the issue. Next, we developed a text mining approach to identify OUP and compared the results with the current conventional method of identifying OUP using ICD-9 codes. Following institutional review board approval and approval from the Regenstrief Institute, structured and unstructured data of 14,298 IUH patients were collected from the Indiana Network for Patient Care. Our text mining approach identified 127 opioid cases compared to 45 cases identified by ICD codes. We concluded that the text mining approach may be used successfully to identify OUP from patients’ clinical notes. Moreover, we developed a machine learning approach to identify OUP by analyzing patients’ clinical notes. Our model was able to classify positive OUP from clinical notes with a sensitivity of 88% on unseen data. We concluded that the machine learning approach may be used successfully to identify the opioid use problem from patients’ clinical notes.
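For readers unfamiliar with the metric reported above, sensitivity is the share of truly positive cases that a classifier flags as positive, TP / (TP + FN). The short Python sketch below computes it from paired labels; the function name and the label vectors are hypothetical and are not data from the study.

```python
def sensitivity(y_true, y_pred):
    """Return TP / (TP + FN) for binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp + fn == 0:
        raise ValueError("y_true contains no positive cases")
    return tp / (tp + fn)


# Made-up labels for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 1, 1, 0]
print(f"sensitivity = {sensitivity(y_true, y_pred):.2f}")
```

Sensitivity ignores false positives entirely, so it is normally read alongside specificity or precision when judging a case-finding model.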
Text mining and portal development for gene-specific publications on Alzheimer's disease and other neurodegenerative diseases
(Springer Nature, 2024-04-17) Liu, Jiannan; Wu, Huanmei; Robertson, Daniel H.; Zhang, Jie; Biomedical Engineering and Informatics, Luddy School of Informatics, Computing, and Engineering
Background: Tremendous research efforts have been made in the Alzheimer's disease (AD) field to understand the disease etiology and progression and to discover treatments for AD. Many mechanistic hypotheses, therapeutic targets and treatment strategies have been proposed in the last few decades. Reviewing previous work and staying current on this ever-growing body of AD publications is an essential yet difficult task for AD researchers. Methods: In this study, we designed and implemented a natural language processing (NLP) pipeline to extract gene-specific, neurodegenerative disease (ND)-focused information from the PubMed database. The collected publication information was filtered and cleaned to construct AD-related gene-specific publication profiles. Six categories of AD-related information are extracted from the processed publication data: publication trend by year, dementia type occurrence, brain region occurrence, mouse model information, keywords occurrence, and co-occurring genes. A user-friendly web portal is then developed using the Django framework to provide gene query functions and data visualizations for the generalized and summarized publication information. Results: By implementing the NLP pipeline, we extracted gene-specific ND-related publication information from the abstracts of the publications in the PubMed database. The results are summarized and visualized through an interactive web query portal. Multiple visualization windows display the ND publication trends, mouse models used, dementia types, involved brain regions, keywords to major AD-related biological processes, and co-occurring genes. Direct links to PubMed sites are provided for all recorded publications on the query result page of the web portal. Conclusion: The resulting portal is a valuable tool and data source for quickly querying and displaying AD publications tailored to users' research areas and gene targets of interest, which is especially convenient for users without informatics or text-mining skills. Our study will not only keep AD researchers updated on the progress of AD research and help them conduct preliminary examinations efficiently, but also offer additional support for hypothesis generation and validation, which will contribute significantly to the communication, dissemination, and progress of AD research.
Text Mining for Social Harm and Criminal Justice Applications
(2020-08) Pandey, Ritika; Mohler, George; Hasan, Mohammad Al; Mukhopadhyay, Snehasis
Increasing rates of social harm events and a plethora of text data call for text mining techniques, not only to better understand the causes of these events but also to develop optimal prevention strategies. In this work, we study three social harm issues: crime topic models, transitions into drug addiction and homicide investigation chronologies. Topic modeling for the categorization and analysis of crime report text allows for more nuanced categories of crime compared to official UCR categorizations. This study has important implications for hotspot policing. We investigate the extent to which topic models that improve coherence lead to higher levels of crime concentration. We further explore the transitions into drug addiction using Reddit data. We proposed a prediction model to classify users’ transitions from casual drug discussion forums to recovery drug discussion forums and the likelihood of such transitions. Through this study we offer insights into modern drug culture and provide tools with potential applications in combating opioid crises. Lastly, we present a knowledge graph based framework for homicide investigation chronologies that may aid investigators in analyzing homicide case data and also allow for post hoc analysis of key features that determine whether a homicide is ultimately solved. For this purpose we perform named entity recognition to determine witnesses, detectives and suspects from the chronology, use keyword expansion to identify various evidence types and finally link these entities and evidence to construct a homicide investigation knowledge graph. We compare the performance of several methodologies for these sub-tasks and analyze the association between network statistics of the knowledge graph and homicide solvability.
Translational drug interaction study using text mining technology
(2017-08-15) Wu, Heng-Yi; Jones, Josette; Li, Lang; Palakal, Mathew; Wu, Huanmei
Drug-Drug Interaction (DDI) is one of the major causes of adverse drug reaction (ADR) and has been demonstrated to threaten public health. It causes an estimated 195,000 hospitalizations and 74,000 emergency room visits each year in the USA alone. Current DDI research aims to investigate different scopes of drug interactions: molecular level of pharmacogenetics interaction (PG), pharmacokinetics interaction (PK), and clinical pharmacodynamics consequences (PD). All three types of experiments are important, but they play different roles in DDI research. As diverse disciplines and varied studies are involved, interaction evidence is often not available across all three types, which creates knowledge gaps that hinder both DDI and pharmacogenetics research. In this dissertation, we proposed to distinguish the three types of DDI evidence (in vitro PK, in vivo PK, and clinical PD studies) and identify all knowledge gaps in experimental evidence for them. This is a collective intelligence effort, whereby a text mining tool will be developed for the large-scale mining and analysis of drug-interaction information such that it can be applied to retrieve, categorize, and extract the information of DDI from published literature available on PubMed. To this end, three tasks will be done in this research work: First, the needed lexica, ontology, and corpora for distinguishing three different types of studies were prepared. Despite the lexica prepared in this work, a comprehensive dictionary for drug metabolites or reactions, which is critical to in vitro PK study, is still lacking in public databases. Thus, second, a named entity recognition tool will be proposed to identify drug metabolites and reactions in free text. Third, text mining tools for retrieving DDI articles and extracting DDI evidence are developed. In this work, the knowledge gaps across all three types of DDI evidence can be identified, and the gaps between knowledge of the molecular mechanisms underlying DDI and their clinical consequences can be closed with the result of DDI prediction using the retrieved drug-gene interaction information, exemplifying how the tools and methods can advance DDI pharmacogenetics research.
Using transfer learning-based causality extraction to mine latent factors for Sjögren’s syndrome from biomedical literature
(Cell Press, 2023-09) VanSchaik, Jack T.; Jain, Palak; Rajapuri, Anushri; Cheriyan, Biju; Thyvalikakath, Thankam P.; Chakraborty, Sunandan; Human-Centered Computing, School of Informatics and Computing
Understanding causality is a longstanding goal across many different domains. Different articles, such as those published in medical journals, disseminate newly discovered knowledge that is often causal. In this paper, we use this intuition to build a model that leverages causal relations to unearth factors related to Sjögren's syndrome from biomedical literature. Sjögren's syndrome is an autoimmune disease affecting up to 3.1 million Americans. Because the illness is uncommon, and its symptoms span different specialties and overlap with those of other autoimmune conditions such as rheumatoid arthritis, it is difficult for clinicians to diagnose the disease in a timely manner. Due to the lack of a dedicated dataset for causal relationships built from biomedical literature, we propose a transfer learning-based approach, where the relationship extraction model is trained on a wide variety of datasets. We conduct an empirical analysis of numerous neural network architectures and data transfer strategies for causal relation extraction. By conducting experiments with various contextual embedding layers and architectural components, we show that an ELECTRA-based sentence-level relation extraction model generalizes better than other architectures across varying web-based sources and annotation strategies. We use this empirical observation to create a pipeline for identifying causal sentences from literature text, extracting the causal relationships from causal sentences, and building a causal network consisting of latent factors related to Sjögren's syndrome. We show that our approach can retrieve such factors with high precision and recall values. Comparative experiments show that this approach leads to a 25% improvement in retrieval F1-score compared to several state-of-the-art biomedical models, including BioBERT and Gram-CNN. We apply this model to a corpus of research articles related to Sjögren's syndrome collected from PubMed to create a causal network for Sjögren's syndrome. The proposed causal network for Sjögren's syndrome will potentially help clinicians with a holistic knowledge base for faster diagnosis.
