Browsing by Subject "Computational biology and bioinformatics"

Classifying early infant feeding status from clinical notes using natural language processing and machine learning (Springer Nature, 2024-04-03)
Lemas, Dominick J.; Du, Xinsong; Rouhizadeh, Masoud; Lewis, Braeden; Frank, Simon; Wright, Lauren; Spirache, Alex; Gonzalez, Lisa; Cheves, Ryan; Magalhães, Marina; Zapata, Ruben; Reddy, Rahul; Xu, Ke; Parker, Leslie; Harle, Chris; Young, Bridget; Louis‑Jaques, Adetola; Zhang, Bouri; Thompson, Lindsay; Hogan, William R.; Modave, François
Health Policy and Management, Richard M. Fairbanks School of Public Health

The objective of this study is to develop and evaluate natural language processing (NLP) and machine learning models to predict infant feeding status from clinical notes in the Epic electronic health records system. The primary outcome was the classification of infant feeding status from clinical notes using Medical Subject Headings (MeSH) terms. Annotation of notes was completed using TeamTat to uniquely classify clinical notes according to infant feeding status. We trained 6 machine learning models to classify infant feeding status: logistic regression, random forest, XGBoost gradient descent, k-nearest neighbors, and support-vector classifier. Model comparison was evaluated based on overall accuracy, precision, recall, and F1 score. Our modeling corpus included an even number of clinical notes that was a balanced sample across each class. We manually reviewed 999 notes that represented 746 mother-infant dyads with a mean gestational age of 38.9 weeks and a mean maternal age of 26.6 years. The most frequent feeding status classification present for this study was exclusive breastfeeding [n = 183 (18.3%)], followed by exclusive formula bottle feeding [n = 146 (14.6%)], and exclusive feeding of expressed mother’s milk [n = 102 (10.2%)], with mixed feeding being the least frequent [n = 23 (2.3%)]. Our final analysis evaluated the classification of clinical notes as breast, formula/bottle, and missing.
The machine learning models were trained on these three classes after performing balancing and downsampling. The XGBoost model outperformed all others by achieving an accuracy of 90.1%, a macro-averaged precision of 90.3%, a macro-averaged recall of 90.1%, and a macro-averaged F1 score of 90.1%. Our results demonstrate that natural language processing can be applied to clinical notes stored in the electronic health records to classify infant feeding status. Early identification of breastfeeding status using NLP on unstructured electronic health records data can be used to inform precision public health interventions focused on improving lactation support for postpartum patients.

Combinatorial analyses reveal cellular composition changes have different impacts on transcriptomic changes of cell type specific genes in Alzheimer’s Disease (Springer Nature, 2021-01-11)
Johnson, Travis S.; Xiang, Shunian; Dong, Tianhan; Huang, Zhi; Cheng, Michael; Wang, Tianfu; Yang, Kai; Ni, Dong; Huang, Kun; Zhang, Jie
Biostatistics, School of Public Health

Alzheimer’s disease (AD) brains are characterized by progressive neuron loss and gliosis. Previous studies of gene expression using bulk tissue samples often fail to consider changes in cell-type composition when comparing AD versus control, which can lead to differences in expression levels that are not due to transcriptional regulation. We mined five large transcriptomic AD datasets for conserved gene co-expression modules, then analyzed differential expression and differential co-expression within the modules between AD samples and controls. We performed cell-type deconvolution analysis to determine whether the observed differential expression was due to changes in cell-type proportions in the samples or to transcriptional regulation. Our findings were validated using four additional datasets.
We discovered that the increased expression of microglia modules in the AD samples can be explained by increased microglia proportions in the AD samples. In contrast, decreased expression and perturbed co-expression within neuron modules in the AD samples were likely due in part to altered regulation of neuronal pathways. Several transcription factors that are differentially expressed in AD might account for such altered gene regulation. Similarly, changes in gene expression and co-expression within astrocyte modules could be attributed to combined effects of astrogliosis and astrocyte gene activation. Gene expression in the astrocyte modules was also strongly correlated with clinicopathological biomarkers. Through this work, we demonstrated that combinatorial analysis can delineate the origins of transcriptomic changes in bulk tissue data and shed light on key genes and pathways involved in AD.

Image segmentation of plexiform neurofibromas from a deep neural network using multiple b-value diffusion data (Nature Publishing Group, 2020-10-20)
Ho, Chang Y.; Kindler, John M.; Persohn, Scott; Kralik, Stephen F.; Robertson, Kent A.; Territo, Paul R.
Radiology and Imaging Sciences, School of Medicine

We assessed the accuracy of semi-automated tumor volume maps of plexiform neurofibroma (PN) generated by a deep neural network, compared to manual segmentation using diffusion weighted imaging (DWI) data. NF1 patients were recruited from a phase II clinical trial for the treatment of PN. Multiple b-value DWI was imaged over the largest PN. All DWI datasets were registered and intensity normalized prior to segmentation with a multi-spectral neural network classifier (MSNN). Manual volumes of PN were performed on 3D-T2 images registered to diffusion images and compared to MSNN volumes with the Sørensen-Dice coefficient. Intravoxel incoherent motion (IVIM) parameters were calculated from resulting volumes. Thirty-five MRI scans were included from 14 subjects.
The Sørensen-Dice coefficient between the semi-automated and manual segmentation was 0.77 ± 0.016. Perfusion fraction (f) was significantly higher for tumor versus normal tissue (0.47 ± 0.42 vs. 0.30 ± 0.22, p = 0.02); similarly, true diffusion (D) was significantly higher for PN tumor versus normal (0.0018 ± 0.0003 vs. 0.0012 ± 0.0002, p < 0.0001). By contrast, the pseudodiffusion coefficient (D*) was significantly lower for PN tumor versus normal (0.024 ± 0.01 vs. 0.031 ± 0.005, p < 0.0001). Volumes generated by a neural network from multiple diffusion data on PNs demonstrated good correlation with manual volumes. IVIM analysis of multiple b-value diffusion data demonstrates significant differences between PN and normal tissue.

Methane, arsenic, selenium and the origins of the DMSO reductase family (Nature Publishing Group, 2020-07-02)
Wells, Michael; Kanmanii, Narthana Jeganathar; Al Zadjali, Al Muatasim; Janecka, Jan E.; Basu, Partha; Oremland, Ronald S.; Stolz, John F.
Chemistry and Chemical Biology, School of Science

Mononuclear molybdoenzymes of the dimethyl sulfoxide reductase (DMSOR) family catalyze a number of reactions essential to the carbon, nitrogen, sulfur, arsenic, and selenium biogeochemical cycles. These enzymes are also ancient, with many lineages likely predating the divergence of the last universal common ancestor into the Bacteria and Archaea domains. We have constructed rooted phylogenies for over 1,550 representatives of the DMSOR family using maximum likelihood methods to investigate the evolution of the arsenic biogeochemical cycle. The phylogenetic analysis provides compelling evidence that formylmethanofuran dehydrogenase B subunits, which catalyze the reduction of CO2 to formate during hydrogenotrophic methanogenesis, constitute the most ancient lineage. Our analysis also provides robust support for selenocysteine as the ancestral ligand for the Mo/W atom.
Finally, we demonstrate that anaerobic arsenite oxidase and respiratory arsenate reductase catalytic subunits represent a more ancient lineage of DMSORs compared to aerobic arsenite oxidase catalytic subunits, which evolved from the assimilatory nitrate reductase lineage. This provides substantial support for an active arsenic biogeochemical cycle on the anoxic Archean Earth. Our work emphasizes that the use of chalcophilic elements as substrates as well as the Mo/W ligand in DMSORs has indelibly shaped the diversification of these enzymes through deep time.

Transcriptome-wide high-throughput mapping of protein–RNA occupancy profiles using POP-seq (Springer Nature, 2021-01-13)
Srivastava, Mansi; Srivastava, Rajneesh; Janga, Sarath Chandra
BioHealth Informatics, School of Informatics and Computing

Interaction between proteins and RNA is critical for post-transcriptional regulatory processes. Existing high-throughput methods based on crosslinking of the protein–RNA complexes and poly-A pulldown are reported to contribute to biases and are not readily amenable to identifying interaction sites on non-poly-A RNAs. We present Protein Occupancy Profile-Sequencing (POP-seq), a phase-separation-based method in three versions, one of which does not require crosslinking, thus providing unbiased protein occupancy profiles on the whole-cell transcriptome without the requirement of poly-A pulldown. Our study demonstrates that ~68% of the total POP-seq peaks exhibited an overlap with publicly available protein–RNA interaction profiles of 97 RNA binding proteins (RBPs) in K562 cells. We show that POP-seq variants consistently capture protein–RNA interaction sites across a broad range of genes, including on transcripts encoding transcription factors (TFs), RNA-binding proteins (RBPs) and long non-coding RNAs (lncRNAs).
POP-seq-identified peaks exhibited a significant enrichment (p value < 2.2e−16) for GWAS SNPs and for phenotypic, clinically relevant germline as well as somatic variants reported in cancer genomes, suggesting the prevalence of uncharacterized genomic variation in protein-occupied sites on RNA. We demonstrate that the abundance of POP-seq peaks increases with an increase in expression of lncRNAs, suggesting that highly expressed lncRNAs are likely to act as sponges for RBPs, contributing to the rewiring of the protein–RNA interaction network in cancer cells. Overall, our data support POP-seq as a robust and cost-effective method that could be applied to primary tissues for mapping global protein occupancies.