IU Indianapolis ScholarWorks :: Browsing by Subject "Computational biology and bioinformatics"

Classifying early infant feeding status from clinical notes using natural language processing and machine learning
(Springer Nature, 2024-04-03) Lemas, Dominick J.; Du, Xinsong; Rouhizadeh, Masoud; Lewis, Braeden; Frank, Simon; Wright, Lauren; Spirache, Alex; Gonzalez, Lisa; Cheves, Ryan; Magalhães, Marina; Zapata, Ruben; Reddy, Rahul; Xu, Ke; Parker, Leslie; Harle, Chris; Young, Bridget; Louis‑Jaques, Adetola; Zhang, Bouri; Thompson, Lindsay; Hogan, William R.; Modave, François; Health Policy and Management, Richard M. Fairbanks School of Public Health
The objective of this study is to develop and evaluate natural language processing (NLP) and machine learning models to predict infant feeding status from clinical notes in the Epic electronic health records system. The primary outcome was the classification of infant feeding status from clinical notes using Medical Subject Headings (MeSH) terms. Annotation of notes was completed using TeamTat to uniquely classify clinical notes according to infant feeding status. We trained 6 machine learning models to classify infant feeding status: logistic regression, random forest, XGBoost gradient descent, k-nearest neighbors, and support-vector classifier. Model comparison was evaluated based on overall accuracy, precision, recall, and F1 score. Our modeling corpus included an even number of clinical notes that was a balanced sample across each class. We manually reviewed 999 notes that represented 746 mother-infant dyads with a mean gestational age of 38.9 weeks and a mean maternal age of 26.6 years. The most frequent feeding status classification present for this study was exclusive breastfeeding [n = 183 (18.3%)], followed by exclusive formula bottle feeding [n = 146 (14.6%)], and exclusive feeding of expressed mother’s milk [n = 102 (10.2%)], with mixed feeding being the least frequent [n = 23 (2.3%)]. Our final analysis evaluated the classification of clinical notes as breast, formula/bottle, and missing. The machine learning models were trained on these three classes after performing balancing and down sampling. The XGBoost model outperformed all others by achieving an accuracy of 90.1%, a macro-averaged precision of 90.3%, a macro-averaged recall of 90.1%, and a macro-averaged F1 score of 90.1%. Our results demonstrate that natural language processing can be applied to clinical notes stored in the electronic health records to classify infant feeding status. 
Early identification of breastfeeding status using NLP on unstructured electronic health records data can be used to inform precision public health interventions focused on improving lactation support for postpartum patients.
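The evaluation metrics named in this abstract (overall accuracy and macro-averaged precision, recall, and F1) can be illustrated with a minimal sketch. This is not the authors' code: the function name `macro_scores` and the toy labels are hypothetical, and only the three class names (breast, formula, missing) come from the abstract.

```python
from collections import Counter


def macro_scores(y_true, y_pred):
    """Return (accuracy, macro precision, macro recall, macro F1)."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    precisions, recalls, f1s = [], [], []
    for c in labels:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)

    # Macro averaging weights every class equally, regardless of its size.
    n = len(labels)
    accuracy = sum(tp.values()) / len(y_true)
    return accuracy, sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy labels, invented for illustration only.
y_true = ["breast", "breast", "formula", "formula", "missing", "missing"]
y_pred = ["breast", "formula", "formula", "formula", "missing", "breast"]
acc, macro_p, macro_r, macro_f1 = macro_scores(y_true, y_pred)
print(f"accuracy={acc:.3f} precision={macro_p:.3f} "
      f"recall={macro_r:.3f} f1={macro_f1:.3f}")
```

Because macro averaging gives each class equal weight, it pairs naturally with the balanced, down-sampled corpus the abstract describes.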
Clinical exome sequencing efficacy and phenotypic expansions involving anomalous pulmonary venous return
(Springer Nature, 2023) Huth, Emily A.; Zhao, Xiaonan; Owen, Nichole; Luna, Pamela N.; Vogel, Ida; Dorf, Inger L. H.; Joss, Shelagh; Clayton-Smith, Jill; Parker, Michael J.; Louw, Jacoba J.; Gewillig, Marc; Breckpot, Jeroen; Kraus, Alison; Sasaki, Erina; Kini, Usha; Burgess, Trent; Tan, Tiong Y.; Armstrong, Ruth; Neas, Katherine; Ferrero, Giovanni B.; Brusco, Alfredo; Kerstjens-Frederikse, Wilhelmina S.; Rankin, Julia; Helvaty, Lindsey R.; Landis, Benjamin J.; Geddes, Gabrielle C.; McBride, Kim L.; Ware, Stephanie M.; Shaw, Chad A.; Lalani, Seema R.; Rosenfeld, Jill A.; Scott, Daryl A.; Medical and Molecular Genetics, School of Medicine
Anomalous pulmonary venous return (APVR) frequently occurs with other congenital heart defects (CHDs) or extra-cardiac anomalies. While some genetic causes have been identified, the optimal approach to genetic testing in individuals with APVR remains uncertain, and the etiology of most cases of APVR is unclear. Here, we analyzed molecular data from 49 individuals to determine the diagnostic yield of clinical exome sequencing (ES) for non-isolated APVR. A definitive or probable diagnosis was made for 8 of those individuals, yielding a diagnostic efficacy rate of 16.3%. We then analyzed molecular data from 62 individuals with APVR accrued from three databases to identify novel APVR genes. Based on data from this analysis, published case reports, mouse models, and/or similarity to known APVR genes as revealed by a machine learning algorithm, we identified 3 genes (EFTUD2, NAA15, and NKX2-1) for which there is sufficient evidence to support phenotypic expansion to include APVR. We also provide evidence that 3 recurrent copy number variants contribute to the development of APVR: proximal 1q21.1 microdeletions involving RBM8A and PDZK1, recurrent BP1-BP2 15q11.2 deletions, and central 22q11.2 deletions involving CRKL. Our results suggest that ES and chromosomal microarray analysis (or genome sequencing) should be considered for individuals with non-isolated APVR for whom a genetic etiology has not been identified, and that genetic testing to identify an independent genetic etiology of APVR is not warranted in individuals with EFTUD2-, NAA15-, and NKX2-1-related disorders.
Combinatorial analyses reveal cellular composition changes have different impacts on transcriptomic changes of cell type specific genes in Alzheimer’s Disease
(Springer Nature, 2021-01-11) Johnson, Travis S.; Xiang, Shunian; Dong, Tianhan; Huang, Zhi; Cheng, Michael; Wang, Tianfu; Yang, Kai; Ni, Dong; Huang, Kun; Zhang, Jie; Biostatistics, School of Public Health
Alzheimer’s disease (AD) brains are characterized by progressive neuron loss and gliosis. Previous studies of gene expression using bulk tissue samples often fail to consider changes in cell-type composition when comparing AD versus control, which can lead to differences in expression levels that are not due to transcriptional regulation. We mined five large transcriptomic AD datasets for conserved gene co-expression modules, then analyzed differential expression and differential co-expression within the modules between AD samples and controls. We performed cell-type deconvolution analysis to determine whether the observed differential expression was due to changes in cell-type proportions in the samples or to transcriptional regulation. Our findings were validated using four additional datasets. We discovered that the increased expression of microglia modules in the AD samples can be explained by increased microglia proportions in the AD samples. In contrast, decreased expression and perturbed co-expression within neuron modules in the AD samples were likely due in part to altered regulation of neuronal pathways. Several transcription factors that are differentially expressed in AD might account for such altered gene regulation. Similarly, changes in gene expression and co-expression within astrocyte modules could be attributed to combined effects of astrogliosis and astrocyte gene activation. Gene expression in the astrocyte modules was also strongly correlated with clinicopathological biomarkers. Through this work, we demonstrated that combinatorial analysis can delineate the origins of transcriptomic changes in bulk tissue data and shed light on key genes and pathways involved in AD.
Image segmentation of plexiform neurofibromas from a deep neural network using multiple b-value diffusion data
(Nature Publishing Group, 2020-10-20) Ho, Chang Y.; Kindler, John M.; Persohn, Scott; Kralik, Stephen F.; Robertson, Kent A.; Territo, Paul R.; Radiology and Imaging Sciences, School of Medicine
We assessed the accuracy of semi-automated tumor volume maps of plexiform neurofibroma (PN) generated by a deep neural network, compared to manual segmentation using diffusion weighted imaging (DWI) data. NF1 patients were recruited from a phase II clinical trial for the treatment of PN. Multiple b-value DWI was acquired over the largest PN. All DWI datasets were registered and intensity normalized prior to segmentation with a multi-spectral neural network classifier (MSNN). Manual segmentation of PN was performed on 3D-T2 images registered to diffusion images, and the resulting volumes were compared to MSNN volumes with the Sørensen-Dice coefficient. Intravoxel incoherent motion (IVIM) parameters were calculated from resulting volumes. 35 MRI scans were included from 14 subjects. Sørensen-Dice coefficient between the semi-automated and manual segmentation was 0.77 ± 0.016. Perfusion fraction (f) was significantly higher for tumor versus normal tissue (0.47 ± 0.42 vs. 0.30 ± 0.22, p = 0.02); similarly, true diffusion (D) was significantly higher for PN tumor versus normal (0.0018 ± 0.0003 vs. 0.0012 ± 0.0002, p < 0.0001). By contrast, the pseudodiffusion coefficient (D*) was significantly lower for PN tumor versus normal (0.024 ± 0.01 vs. 0.031 ± 0.005, p < 0.0001). Volumes generated by a neural network from multiple diffusion data on PNs demonstrated good correlation with manual volumes. IVIM analysis of multiple b-value diffusion data demonstrates significant differences between PN and normal tissue.
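The Sørensen-Dice coefficient used above to compare semi-automated and manual segmentations is 2|A ∩ B| / (|A| + |B|) for two binary masks A and B. The following is a minimal sketch of that formula, not the authors' pipeline: the function name `dice_coefficient`, the flattened-list mask representation, and the toy masks are all assumptions made for illustration.

```python
def dice_coefficient(mask_a, mask_b):
    """Sørensen-Dice overlap between two equal-length binary masks.

    Masks are flat sequences of 0/1 (or False/True) voxel labels.
    """
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same number of voxels")
    size_a = sum(1 for a in mask_a if a)
    size_b = sum(1 for b in mask_b if b)
    overlap = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    if size_a + size_b == 0:
        # Convention assumed here: two empty masks agree perfectly.
        return 1.0
    return 2 * overlap / (size_a + size_b)


# Toy six-voxel masks, invented for illustration only.
manual_mask = [1, 1, 1, 0, 0, 1]
network_mask = [1, 1, 0, 0, 1, 1]
dice = dice_coefficient(manual_mask, network_mask)
print(f"Dice = {dice:.2f}")
```

A value of 1 means the two masks coincide exactly and 0 means they share no voxels, so the reported 0.77 indicates substantial but incomplete overlap.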
Methane, arsenic, selenium and the origins of the DMSO reductase family
(Nature Publishing Group, 2020-07-02) Wells, Michael; Kanmanii, Narthana Jeganathar; Al Zadjali, Al Muatasim; Janecka, Jan E.; Basu, Partha; Oremland, Ronald S.; Stolz, John F.; Chemistry and Chemical Biology, School of Science
Mononuclear molybdoenzymes of the dimethyl sulfoxide reductase (DMSOR) family catalyze a number of reactions essential to the carbon, nitrogen, sulfur, arsenic, and selenium biogeochemical cycles. These enzymes are also ancient, with many lineages likely predating the divergence of the last universal common ancestor into the Bacteria and Archaea domains. We have constructed rooted phylogenies for over 1,550 representatives of the DMSOR family using maximum likelihood methods to investigate the evolution of the arsenic biogeochemical cycle. The phylogenetic analysis provides compelling evidence that formylmethanofuran dehydrogenase B subunits, which catalyze the reduction of CO2 to formate during hydrogenotrophic methanogenesis, constitute the most ancient lineage. Our analysis also provides robust support for selenocysteine as the ancestral ligand for the Mo/W atom. Finally, we demonstrate that anaerobic arsenite oxidase and respiratory arsenate reductase catalytic subunits represent a more ancient lineage of DMSORs compared to aerobic arsenite oxidase catalytic subunits, which evolved from the assimilatory nitrate reductase lineage. This provides substantial support for an active arsenic biogeochemical cycle on the anoxic Archean Earth. Our work emphasizes that the use of chalcophilic elements as substrates as well as the Mo/W ligand in DMSORs has indelibly shaped the diversification of these enzymes through deep time.
Multi-omics analysis identifies glioblastoma dependency on H3K9me3 methyltransferase activity
(Springer Nature, 2025-03-20) Xie, Qiqi; Du, Yuanning; Ghosh, Sugata; Rajendran, Saranya; Cohen-Gadol, Aaron A.; Baizabal, José-Manuel; Nephew, Kenneth P.; Han, Leng; Shen, Jia; Neurological Surgery, School of Medicine
Histone H3 lysine 9 dimethylation and trimethylation (H3K9me2/3) are prevalent in human genomes, especially in heterochromatin and specific euchromatic genes. Methylation of H3K9 is modulated by enzymes such as SUV39H1, SUV39H2, SETDB1, SETDB2, and EHMT1/2, which influence cancer progression. This study reveals differential expression of these six H3K9 methyltransferases in tumors, with SUV39H1, SUV39H2, and SETDB1 showing significant links to cancer phenotypes. We developed the “H3K9me3 MtSig” (H3K9me3 methyltransferases signature) based on these findings. H3K9me3 MtSig is unique to various tumors, with prognostic significance and associations with key signaling pathways, especially in glioblastoma (GBM). Elevated H3K9me3 MtSig was observed in GBM samples, correlating with the G2/M cell cycle and reduced immune responses. H3K9me3-mediated repetitive sequence silencing by H3K9me3 MtSig contributed to these phenotypes, and inhibiting H3K9me3 MtSig in patient-derived GBM cells suppressed proliferation and increased immune responses. H3K9me3 MtSig serves as an independent prognostic factor and potential therapeutic target.
Transcriptome-wide high-throughput mapping of protein–RNA occupancy profiles using POP-seq
(Springer Nature, 2021-01-13) Srivastava, Mansi; Srivastava, Rajneesh; Janga, Sarath Chandra; BioHealth Informatics, School of Informatics and Computing
Interaction between proteins and RNA is critical for post-transcriptional regulatory processes. Existing high-throughput methods based on crosslinking of the protein–RNA complexes and poly-A pulldown are reported to contribute to biases and are not readily amenable for identifying interaction sites on non-poly-A RNAs. We present Protein Occupancy Profile-Sequencing (POP-seq), a phase separation based method in three versions, one of which does not require crosslinking, thus providing unbiased protein occupancy profiles on the whole cell transcriptome without the requirement of poly-A pulldown. Our study demonstrates that ~68% of the total POP-seq peaks exhibited an overlap with publicly available protein–RNA interaction profiles of 97 RNA binding proteins (RBPs) in K562 cells. We show that POP-seq variants consistently capture protein–RNA interaction sites across a broad range of genes, including on transcripts encoding transcription factors (TFs), RNA-binding proteins (RBPs) and long non-coding RNAs (lncRNAs). POP-seq-identified peaks exhibited a significant enrichment (p value < 2.2e−16) for GWAS SNPs, phenotypic, clinically relevant germline as well as somatic variants reported in cancer genomes, suggesting the prevalence of uncharacterized genomic variation in protein occupied sites on RNA. We demonstrate that the abundance of POP-seq peaks increases with an increase in expression of lncRNAs, suggesting that highly expressed lncRNAs are likely to act as sponges for RBPs, contributing to the rewiring of the protein–RNA interaction network in cancer cells. Overall, our data support POP-seq as a robust and cost-effective method that could be applied to primary tissues for mapping global protein occupancies.
