IU Indianapolis ScholarWorks :: Browsing by Subject "Data mining"

Browsing by Subject "Data mining"

Now showing 1 - 10 of 30

Advanced natural language processing and temporal mining for clinical discovery
(2015-08-17) Mehrabi, Saeed; Jones, Josette F.; Palakal, Mathew J.; Chien, Stanley Yung-Ping; Liu, Xiaowen; Schmidt, C. Max
There has been a vast and growing amount of healthcare data, especially with the rapid adoption of electronic health records (EHRs) as a result of the HITECH Act of 2009. It is estimated that around 80% of clinical information resides in the unstructured narrative of an EHR. Recently, natural language processing (NLP) techniques have offered opportunities to extract information from unstructured clinical texts needed for various clinical applications. A popular method for enabling secondary uses of EHRs is information or concept extraction, a subtask of NLP that seeks to locate and classify elements within text based on the context. Extraction of clinical concepts without considering the context has many complications, including inaccurate diagnosis of patients and contamination of study cohorts. Identifying the negation status of a clinical concept and whether it belongs to the patient or to a family member are two of the challenges faced in context detection. A negation algorithm called Dependency Parser Negation (DEEPEN) has been developed in this research study by taking into account the dependency relationship between negation words and concepts within a sentence using the Stanford Dependency Parser. The study results demonstrate that DEEPEN can reduce the number of incorrect negation assignments for patients with positive findings, and therefore improve the identification of patients with the target clinical findings in EHRs. Additionally, an NLP system consisting of section segmentation and relation discovery was developed to identify patients' family history. To assess the generalizability of the negation and family history algorithms, data from a different clinical institution was used in both algorithm evaluations.
Applications of Data Mining in Healthcare
(2019-05) Peng, Bo; Mohler, George; Dundar, Murat; Zheng, Jiang Yu
With increases in the quantity and quality of healthcare related data, data mining tools have the potential to improve people’s standard of living through personalized and predictive medicine. In this thesis we improve the state-of-the-art in data mining for several problems in the healthcare domain. In problems such as drug-drug interaction prediction and Alzheimer’s Disease (AD) biomarker discovery and prioritization, current methods either require tedious feature engineering or have unsatisfactory performance. New effective computational tools are needed that can tackle these complex problems. In this dissertation, we develop new algorithms for two healthcare problems: high-order drug-drug interaction prediction and amyloid imaging biomarker prioritization in Alzheimer’s Disease. Drug-drug interactions (DDIs) and their associated adverse drug reactions (ADRs) represent a significant detriment to public health. Existing research on DDIs primarily focuses on pairwise DDI detection and prediction. Effective computational methods for high-order DDI prediction are desired. In this dissertation, I present a deep learning based model, D³I, for cardinality-invariant and order-invariant high-order DDI prediction. The proposed models achieve 0.740 F1 value and 0.847 AUC value on high-order DDI prediction, and outperform classical methods on order-2 DDI prediction. These results demonstrate the strong potential of D³I and deep learning based models in tackling the prediction problems of high-order DDIs and their induced ADRs. The second problem I consider in this thesis is amyloid imaging biomarker discovery, for which I propose an innovative machine learning paradigm enabling precision medicine in this domain. The paradigm tailors the imaging biomarker discovery process to individual characteristics of a given patient. I implement this paradigm using a newly developed learning-to-rank method, PLTR.
The PLTR model seamlessly integrates two objectives for joint optimization: pushing up relevant biomarkers and ranking among relevant biomarkers. The empirical study of PLTR conducted on the ADNI data yields promising results in identifying and prioritizing individual-specific amyloid imaging biomarkers based on the individual’s structural MRI data. The resulting top ranked imaging biomarkers have the potential to aid personalized diagnosis and disease subtyping.
An Automated System for Generating Situation-Specific Decision Support in Clinical Order Entry from Local Empirical Data
(2011-10-19) Klann, Jeffrey G.; Schadow, Gunther; Downs, Stephen M.; Finnell, John T.; Palakal, Mathew J.; Szolovits, Peter
Clinical Decision Support is one of the only aspects of health information technology that has demonstrated decreased costs and increased quality in healthcare delivery, yet it is extremely expensive and time-consuming to create, maintain, and localize. Consequently, a majority of health care systems do not utilize it, and even when it is available it is frequently incorrect. Therefore it is important to look beyond traditional guideline-based decision support to more readily available resources in order to bring this technology into widespread use. This study proposes that the wisdom of physicians within a practice is a rich, untapped knowledge source that can be harnessed for this purpose. I hypothesize and demonstrate that this wisdom is reflected by order entry data well enough to partially reconstruct the knowledge behind treatment decisions. Automated reconstruction of such knowledge is used to produce dynamic, situation-specific treatment suggestions, in a similar vein to Amazon.com shopping recommendations. This approach is appealing because: it is local (so it reflects local standards); it fits into workflow more readily than the traditional local-wisdom approach (viz. the curbside consult); and, it is free (the data are already being captured). This work develops several new machine-learning algorithms and novel applications of existing algorithms, focusing on an approach called Bayesian network structure learning. I develop: an approach to produce dynamic, rank-ordered situation-specific treatment menus from treatment data; statistical machinery to evaluate their accuracy using retrospective simulation; a novel algorithm which is an order of magnitude faster than existing algorithms; a principled approach to choosing smaller, more optimal, domain-specific subsystems; and a new method to discover temporal relationships in the data. 
The result is a comprehensive approach for extracting knowledge from order-entry data to produce situation-specific treatment menus, which is applied to order-entry data at Wishard Hospital in Indianapolis. Retrospective simulations find that, in a large variety of clinical situations, a short menu will contain the clinicians' desired next actions. A prospective survey additionally finds that such menus aid physicians in writing order sets (in completeness and speed). This study demonstrates that clinical knowledge can be successfully extracted from treatment data for decision support.
Big Data and Dysmenorrhea: What Questions Do Women and Men Ask About Menstrual Pain?
(Mary Ann Liebert, 2018-10) Chen, Chen X.; Groves, Doyle; Miller, Wendy R.; Carpenter, Janet S.; School of Nursing
BACKGROUND: Menstrual pain is highly prevalent among women of reproductive age. As the general public increasingly obtains health information online, Big Data from online platforms provide novel sources to understand the public's perspectives and information needs about menstrual pain. The study's purpose was to describe salient queries about dysmenorrhea using Big Data from a question and answer platform. MATERIALS AND METHODS: We performed text-mining of 1.9 billion queries from ChaCha, a United States-based question and answer platform. Dysmenorrhea-related queries were identified by using keyword searching. Each relevant query was split into token words (i.e., meaningful words or phrases) and stop words (i.e., not meaningful functional words). Word Adjacency Graph (WAG) modeling was used to detect clusters of queries and visualize the range of dysmenorrhea-related topics. We constructed two WAG models, one from queries by women of reproductive age and one from queries by men. Salient themes were identified through inspecting clusters of WAG models. RESULTS: We identified two subsets of queries: Subset 1 contained 507,327 queries from women aged 13-50 years. Subset 2 contained 113,888 queries from men aged 13 or above. WAG modeling revealed topic clusters for each subset. Between female and male subsets, topic clusters overlapped on dysmenorrhea symptoms and management. Among female queries, there were distinctive topics on approaching menstrual pain at school and menstrual pain-related conditions, while among male queries, there was a distinctive cluster of queries on menstrual pain from males' perspectives. CONCLUSIONS: Big Data mining of the ChaCha® question and answer service revealed a series of information needs among women and men on menstrual pain. Findings may be useful in structuring the content and informing the delivery platform for educational interventions.
Biomedical Literature Mining with Transitive Closure and Maximum Network Flow
(http://doi.acm.org/10.1145/1851476.1851552, 2011-05-15) Hoblitzell, Andrew P.; Mukhopadhyay, Snehasis; Xia, Yuni; Fang, Shiafoen
The biological literature is a huge and constantly increasing source of information which biologists may consult for information about their field, but the vast amount of data can sometimes become overwhelming. Medline, which makes a great amount of biological journal data available online, makes the development of automated text mining systems and hence “data-driven discovery” possible. This thesis examines current work in the field of text mining and biological literature, and then aims to mine documents pertaining to bone biology. The documents are retrieved from PubMed, and then direct associations between the terms are computed. Potentially novel transitive associations among biological objects are then discovered using the transitive closure algorithm and the maximum flow algorithm. The thesis discusses in detail the extraction of biological objects from the collected documents and the co-occurrence based text mining algorithm, the transitive closure algorithm, and the maximum network flow algorithm, which were then run to extract the potentially novel biological associations. Generated hypotheses (novel associations) were assigned significance scores for further validation by an expert bone biologist. Extension of the work into hypergraphs for enhanced meaning and accuracy is also examined in the thesis.
Bridging Text Mining and Bayesian Networks
(2011-03-09) Raghuram, Sandeep Mudabail; Xia, Yuni; Palakal, Mathew; Zou, Xukai, 1963-
After the initial network is constructed using an expert’s knowledge of the domain, Bayesian networks need to be updated as and when new data is observed. Literature mining is a very important source of this new data. In this work, we explore what kind of data needs to be extracted with a view to updating Bayesian networks, which existing technologies can be useful in achieving some of the goals, and what research is required to accomplish the remaining requirements. This thesis specifically deals with utilizing causal associations and experimental results which can be obtained from literature mining. However, these associations and numerical results cannot be directly integrated with the Bayesian network. The source of the literature and the perceived quality of research need to be factored into the process of integration, just as a human reading the literature would. This thesis presents a general methodology for updating a Bayesian network with the mined data. This methodology consists of solutions to some of the issues surrounding the task of integrating the causal associations with the Bayesian network and demonstrates the idea with a semiautomated software system.
Combinatorial analyses reveal cellular composition changes have different impacts on transcriptomic changes of cell type specific genes in Alzheimer’s Disease
(Springer Nature, 2021-01-11) Johnson, Travis S.; Xiang, Shunian; Dong, Tianhan; Huang, Zhi; Cheng, Michael; Wang, Tianfu; Yang, Kai; Ni, Dong; Huang, Kun; Zhang, Jie; Biostatistics, School of Public Health
Alzheimer’s disease (AD) brains are characterized by progressive neuron loss and gliosis. Previous studies of gene expression using bulk tissue samples often fail to consider changes in cell-type composition when comparing AD versus control, which can lead to differences in expression levels that are not due to transcriptional regulation. We mined five large transcriptomic AD datasets for conserved gene co-expression modules, then analyzed differential expression and differential co-expression within the modules between AD samples and controls. We performed cell-type deconvolution analysis to determine whether the observed differential expression was due to changes in cell-type proportions in the samples or to transcriptional regulation. Our findings were validated using four additional datasets. We discovered that the increased expression of microglia modules in the AD samples can be explained by increased microglia proportions in the AD samples. In contrast, decreased expression and perturbed co-expression within neuron modules in the AD samples were likely due in part to altered regulation of neuronal pathways. Several transcription factors that are differentially expressed in AD might account for such altered gene regulation. Similarly, changes in gene expression and co-expression within astrocyte modules could be attributed to combined effects of astrogliosis and astrocyte gene activation. Gene expression in the astrocyte modules was also strongly correlated with clinicopathological biomarkers. Through this work, we demonstrated that combinatorial analysis can delineate the origins of transcriptomic changes in bulk tissue data and shed light on key genes and pathways involved in AD.
A Compressed Data Collection System For Use In Wireless Sensor Networks
(2013-03-06) Erratt, Newlyn S.; Liang, Yao; Raje, Rajeev; Tuceryan, Mihran
One of the most common goals of a wireless sensor network is to collect sensor data. The goal of this thesis is to provide an easy to use and energy-efficient system for deploying data collection sensor networks. There are numerous challenges associated with deploying a wireless sensor network for collection of sensor data; among these challenges are reducing energy consumption and the fact that users interested in collecting data may not be familiar with software design. This thesis presents a complete system, comprised of the Compressed Data-stream Protocol and a general gateway for data collection in wireless sensor networks, which attempts to provide an easy to use, energy efficient and complete system for data collection in sensor networks. The Compressed Data-stream Protocol is a transport layer compression protocol with a primary goal, in this work, to reduce energy consumption. Energy consumption of the radio in wireless sensor network nodes is expensive and the Compressed Data-stream Protocol has been shown in simulations to reduce energy used on transmission and reception by around 26%. The general gateway has been designed in such a way as to make customization simple without requiring vast knowledge of sensor networks and software development. This, along with the modular nature of the Compressed Data-stream Protocol, enables the creation of an easy to deploy and easy to configure sensor network for data collection. Findings show that individual components work well and that the system as a whole performs without errors. This system, the components of which will eventually be released as open source, provides a platform for researchers purely interested in the data gathered to deploy a sensor network without being restricted to specific vendors of hardware.
Condition-specific differential subnetwork analysis for biological systems
(2015-04) Jhamb, Deepali; Liu, Xiaowen; Li, Lang; Liu, Yunlong; Palakal, Mathew J.; Stocum, David L.
Biological systems behave differently under different conditions. Advances in sequencing technology over the last decade have led to the generation of enormous amounts of condition-specific data. However, these measurements often fail to identify low abundance genes/proteins that can be biologically crucial. In this work, a novel text-mining system was first developed to extract condition-specific proteins from the biomedical literature. The literature-derived data was then combined with proteomics data to construct condition-specific protein interaction networks. Further, an innovative condition-specific differential analysis approach was designed to identify key differences, in the form of subnetworks, between any two given biological systems. The framework developed here was implemented to understand the differences between limb regeneration-competent Ambystoma mexicanum and regeneration-deficient Xenopus laevis. This study provides an exhaustive systems level analysis to compare regeneration-competent and regeneration-deficient subnetworks to show how different molecular entities inter-connect with each other and are rewired during the formation of an accumulation blastema in regenerating axolotl limbs. This study also demonstrates the importance of literature-derived knowledge, specific to limb regeneration, to augment the systems biology analysis. Our findings show that although the proteins might be common between the two given biological conditions, they can have a high dissimilarity based on their biological and topological properties in the subnetwork. The knowledge gained from the distinguishing features of limb regeneration in amphibians can be used in the future to chemically induce regeneration in mammalian systems. The approach developed in this dissertation is scalable and adaptable to understand differential subnetworks between any two biological systems.
This methodology will not only facilitate the understanding of biological processes and molecular functions which govern a given system but also provide novel intuitions about the pathophysiology of diseases/conditions.
Detecting significant genotype-phenotype association rules in bipolar disorder: market research meets complex genetics
(SpringerOpen, 2018-11-11) Breuer, René; Mattheisen, Manuel; Frank, Josef; Krumm, Bertram; Treutlein, Jens; Kassem, Layla; Strohmaier, Jana; Herms, Stefan; Mühleisen, Thomas W.; Degenhardt, Franziska; Cichon, Sven; Nöthen, Markus M.; Karypis, George; Kelsoe, John; Greenwood, Tiffany; Nievergelt, Caroline; Shilling, Paul; Shekhtman, Tatyana; Edenberg, Howard; Craig, David; Szelinger, Szabolcs; Nurnberger, John; Gershon, Elliot; Alliey‑Rodriguez, Ney; Zandi, Peter; Goes, Fernando; Schork, Nicholas; Smith, Erin; Koller, Daniel; Zhang, Peng; Badner, Judith; Berrettini, Wade; Bloss, Cinnamon; Byerley, William; Coryell, William; Foroud, Tatiana; Guo, Yirin; Hipolito, Maria; Keating, Brendan; Lawson, William; Liu, Chunyu; Mahon, Pamela; McInnis, Melvin; Murray, Sarah; Nwulia, Evaristus; Potash, James; Rice, John; Scheftner, William; Zöllner, Sebastian; McMahon, Francis J.; Rietschel, Marcella; Schulze, Thomas G.; Biochemistry and Molecular Biology, School of Medicine
BACKGROUND: Disentangling the etiology of common, complex diseases is a major challenge in genetic research. For bipolar disorder (BD), several genome-wide association studies (GWAS) have been performed. Similar to other complex disorders, major breakthroughs in explaining the high heritability of BD through GWAS have remained elusive. To overcome this dilemma, genetic research into BD has embraced a variety of strategies such as the formation of large consortia to increase sample size and sequencing approaches. Here we advocate a complementary approach making use of already existing GWAS data: a novel data mining procedure to identify yet undetected genotype-phenotype relationships. We adapted association rule mining, a data mining technique traditionally used in retail market research, to identify frequent and characteristic genotype patterns showing strong associations to phenotype clusters. We applied this strategy to three independent GWAS datasets from 2835 phenotypically characterized patients with BD. In a discovery step, 20,882 candidate association rules were extracted. RESULTS: Two of these rules, one associated with eating disorder and the other with anxiety, remained significant in an independent dataset after robust correction for multiple testing. Both showed considerable effect sizes (odds ratio ~ 3.4 and 3.0, respectively) and support previously reported molecular biological findings. CONCLUSION: Our approach detected novel specific genotype-phenotype relationships in BD that were missed by standard analyses like GWAS. While we developed and applied our method within the context of BD gene discovery, it may facilitate identifying highly specific genotype-phenotype relationships in subsets of genome-wide data sets of other complex phenotypes with similar epidemiological properties and challenges to gene discovery efforts.
