Biomedical Literature Mining and Knowledge Discovery of Phenotyping Definitions

Binkheder, Samar Hussein

dc.contributor.advisor	Jones, Josette
dc.contributor.author	Binkheder, Samar Hussein
dc.contributor.other	Li, Lang
dc.contributor.other	Quinney, Sara Kay
dc.contributor.other	Wu, Huanmei
dc.contributor.other	Zhang, Chi
dc.date.accessioned	2019-08-06T16:01:20Z
dc.date.available	2021-08-05T09:30:12Z
dc.date.issued	2019-07
dc.degree.date	2019	en_US
dc.degree.discipline
dc.degree.grantor	Indiana University	en_US
dc.degree.level	Ph.D.	en_US
dc.description	Indiana University-Purdue University Indianapolis (IUPUI)	en_US
dc.description.abstract	Phenotyping definitions are essential for cohort identification in clinical research, but they become an obstacle when they are not readily available. Developing new definitions manually requires expert involvement that is labor-intensive, time-consuming, and unscalable. Moreover, automated approaches rely mostly on electronic health record data, which suffer from bias, confounding, and incompleteness. Few efforts have used text-mining and data-driven approaches to automate the extraction and literature-based knowledge discovery of phenotyping definitions and to support their scalability. In this dissertation, we proposed a text-mining pipeline combining rule-based and machine-learning methods to automate retrieval, classification, and extraction of phenotyping definitions’ information from the literature. To achieve this, we first developed an annotation guideline with ten dimensions to annotate sentences with evidence of phenotyping definitions’ modalities, such as phenotypes and laboratories. Two annotators manually annotated a corpus of sentences (n=3,971) extracted from the methods sections of full-text observational studies (n=86). Percent agreement and Kappa statistics showed high inter-annotator agreement on sentence-level annotations. Second, we constructed two validated text classifiers using our annotated corpora: abstract-level and full-text sentence-level. We applied the abstract-level classifier to a large-scale biomedical literature corpus of over 20 million abstracts published between 1975 and 2018 to identify positive abstracts (n=459,406). After retrieving their full texts (n=120,868), we extracted sentences from their methods sections and used the full-text sentence-level classifier to extract positive sentences (n=2,745,416). Third, we performed literature-based discovery using the positively classified sentences. Lexicon-based methods were used to recognize medical concepts in these sentences (n=19,423). Co-occurrence and association methods were used to identify and rank phenotype candidates that are associated with a phenotype of interest. We derived 12,616,465 associations from our large-scale corpus. Our literature-based associations and large-scale corpus contribute to building new data-driven phenotyping definitions and expanding existing definitions with minimal expert involvement.	en_US
dc.identifier.uri	https://hdl.handle.net/1805/20201
dc.identifier.uri	http://dx.doi.org/10.7912/C2/956
dc.language.iso	en_US	en_US
dc.subject	Biomedical literature	en_US
dc.subject	Electronic Health Records	en_US
dc.subject	Information retrieval and extraction	en_US
dc.subject	Machine learning	en_US
dc.subject	Phenotyping definitions	en_US
dc.subject	Text mining	en_US
dc.title	Biomedical Literature Mining and Knowledge Discovery of Phenotyping Definitions	en_US
dc.type	Thesis

Files

Original bundle

Name: Binkheder_iupui_0104D_10368.pdf
Size: 6.59 MB
Format: Adobe Portable Document Format
Description:

License bundle

Name: license.txt
Size: 1.99 KB
Format: Item-specific license agreed upon to submission
Description:

Collections

Informatics School Theses and Dissertations