Browsing by Subject "Computational Biology"
Now showing 1 - 10 of 14
Item Advances in translational bioinformatics facilitate revealing the landscape of complex disease mechanisms (Springer (Biomed Central Ltd.), 2014) Yang, Jack Y.; Dunker, A. Keith; Liu, Jun S.; Qin, Xiang; Arabnia, Hamid R.; Yang, William; Niemierko, Andrzej; Chen, Zhongxue; Luo, Zuojie; Wang, Liangjiang; Liu, Yunlong; Xu, Dong; Deng, Youping; Tong, Weida; Yang, Mary Qu; Department of Biochemistry and Molecular Biology, IU School of Medicine
Advances of high-throughput technologies have rapidly produced more and more data from DNAs and RNAs to proteins, especially large volumes of genome-scale data. However, connection of the genomic information to cellular functions and biological behaviours relies on the development of effective approaches at higher systems level. In particular, advances in RNA-Seq technology have helped the studies of transcriptome, RNA expressed from the genome, while systems biology on the other hand provides more comprehensive pictures, from which genes and proteins actively interact to lead to cellular behaviours and physiological phenotypes. As biological interactions mediate many biological processes that are essential for cellular function or disease development, it is important to systematically identify genomic information including genetic mutations from GWAS (genome-wide association study), differentially expressed genes, bidirectional promoters, intrinsic disordered proteins (IDP) and protein interactions to gain deep insights into the underlying mechanisms of gene regulations and networks. Furthermore, bidirectional promoters can co-regulate many biological pathways, where the roles of bidirectional promoters can be studied systematically for identifying co-regulating genes at interactive network level.
Combining information from different but related studies can ultimately help revealing the landscape of molecular mechanisms underlying complex diseases such as cancer.
Item Development of Data-driven and AI-powered Systems Biology Methods for Understanding Human Disease (2024-08) Dang, Pengtao; Zhang, Chi; Salama, Paul; Cao, Sha; King, Brian; Ben-Miled, Zina
Systems biology dynamic models, which are based on differential equations, offer a flexible and accurate framework to explain physiological properties emerging from complex biochemical or biological systems. These models enable explicit quantification and interpretation, allowing for simulation and perturbation analysis to study biological features and their interactions, as well as understanding system progression and convergence under various initial conditions. However, their application in human disease systems is limited due to unknown kinetics parameters under disease conditions and a reductionist paradigm that fails to capture the complexity of diseases. Meanwhile, the advent of omics technologies provides high-resolution molecular measurements from single cells and spatially resolved samples, as well as comprehensive disease-specific molecular signatures from large patient cohorts. This wealth of data holds the promise for characterizing complex biological systems, necessitating advanced systems biology models and computational tools that can harness multi-omics data to reliably depict biological processes. However, this endeavor faces the challenge of nonlinear relationships between omics data and the system’s dynamic properties, such as the global or local low-rank gene expression patterns across cell types and the nonlinear complexities within transcriptional regulatory networks revealed by single-cell RNA sequencing.
The overall goal of this report is to develop new computational frameworks, AI-empowered methods, and related mathematical theories to explicitly represent and approximate the dynamics of complex biological systems by using biological omics data. Our aim is to unravel the intricacies of context-specific dynamic systems using multi-Omics data. Specifically, we solved two different but related computational tasks and enabled the first-of-its-kind methods to (1) identify local low-rank matrices from large omics data, and (2) a robust optimization strategy to approximate metabolic flux. Subsequently, we delve into the realm of data-driven and AI-powered systems biology, harnessing the power of statistical learning and artificial intelligence to approximate differential equations or their representations. This research endeavor not only contributes to the advancement of subspace modeling but also offers insights into a wide array of complex phenomena across diverse domains, with profound implications for computational biology and beyond.
Item Discovery and Interpretation of Subspace Structures in Omics Data by Low-Rank Representation (2022-10) Lu, Xiaoyu; Cao, Sha; Zhang, Chi; Yan, Jingwen; Zang, Yong
Biological functions in cells are highly complicated and heterogenous, and can be reflected by omics data, such as gene expression levels. Detecting subspace structures in omics data and understanding the diversity of the biological processes is essential to the full comprehension of biological mechanisms and complicated biological systems. In this thesis, we are developing novel statistical learning approaches to reveal the subspace structures in omics data. Specifically, we focus on three types of subspace structures: low-rank subspace, sparse subspace and covariates explainable subspace.
For low-rank subspace, we developed a semi-supervised model SSMD to detect cell type specific low-rank structures and predict their relative proportions across different tissue samples. SSMD is the first computational tool that utilizes semi-supervised identification of cell types and their marker genes specific to each mouse tissue transcriptomics data, for better understanding of the disease microenvironment and downstream disease mechanism. For sparsity-driven sparse subspace, we proposed a novel positive and unlabeled learning model, namely PLUS, that could identify cancer metastasis related genes, predict cancer metastasis status and specifically address the under-diagnosis issue in studying metastasis potential. We found that PLUS-predicted metastasis potential at diagnosis has a significantly strong association with patients’ progression-free survival in their follow-up data. Lastly, to discover the covariates explainable subspace, we proposed an analytical pipeline based on covariance regression, namely, scCovReg. We utilized scCovReg to detect the pathway level second-order variations using scRNA-Seq data in a statistically powerful manner, and to associate the second-order variations with important subject-level characteristics, such as disease status. In conclusion, we presented a set of state-of-the-art computational solutions for identifying sparse subspaces in omics data, which promise to provide insights into the mechanism in complex diseases.
Item Identification of functionally connected multi-omic biomarkers for Alzheimer’s disease using modularity-constrained Lasso (PLOS, 2020-06-17) Xie, Linhui; Varathan, Pradeep; Nho, Kwangsik; Saykin, Andrew J.; Salama, Paul; Yan, Jingwen; Radiology and Imaging Sciences, School of Medicine
Large-scale genome wide association studies (GWASs) have led to discovery of many genetic risk factors in Alzheimer’s disease (AD), such as APOE, TOMM40 and CLU.
Despite the significant progress, it remains a major challenge to functionally validate these genetic findings and translate them into targetable mechanisms. Integration of multiple types of molecular data is increasingly used to address this problem. In this paper, we proposed a modularity-constrained Lasso model to jointly analyze the genotype, gene expression and protein expression data for discovery of functionally connected multi-omic biomarkers in AD. With a prior network capturing the functional relationship between SNPs, genes and proteins, the newly introduced penalty term maximizes the global modularity of the subnetwork involving selected markers and encourages the selection of multi-omic markers with dense functional connectivity, instead of individual markers. We applied this new model to the real data collected in the ROS/MAP cohort where the cognitive performance was used as disease quantitative trait. A functionally connected subnetwork involving 276 multi-omic biomarkers, including SNPs, genes and proteins, was identified to bear predictive power. Within this subnetwork, multiple trans-omic paths from SNPs to genes and then proteins were observed. This suggests that cognitive performance deterioration in AD patients can be potentially a result of genetic variations due to their cascade effect on the downstream transcriptome and proteome level.
Item Method Development Involving Modeling Bacterial Metabolite Regulation of Vaginal Epithelial Cell Signaling in Bacterial Vaginosis (2022-04-28) Trinh, Alan; Brubaker, Douglas
BACKGROUND Bacterial vaginosis, which is the imbalance of normal vaginal microbiota, contributes to preterm delivery, vaginitis, and decreased drug efficacy. Despite metronidazole efficacy in reducing BV contributing organisms, BV continues to recur in 50% of patients.
Previous studies showing imidazole propionate’s role in the pathogenesis of type II diabetes suggest that similar metabolite-regulated pathways in vaginal microbiomes may be the key in pathogenesis of uterine diseases such as BV. Thus, the purpose of this study was to observe the relationship between vaginal metabolites, host or microbiome-derived, and transcriptomic responses in vaginal epithelial tissues stratified by vaginal microbiome composition (“microbiome group”). The hypothesis was that differences in vaginal microbiome composition result in differential regulation of metabolite-host pathway functional relationships. METHODS Transcript levels and metabolite concentrations precollected from 23 East African women were processed and analyzed via R. Transcriptomic data were converted into KEGG pathway enrichment scores via ssGSEA2.0, a package within R. Enrichment scores were correlated (Spearman) with metabolite levels by microbiome group and lactobacillus dominant phenotypes, and relationships were visualized via Heatmap3 and Cytoscape. RESULTS The results showed varying strengths in correlation among metabolites and KEGG pathway enrichment scores after filtering for strong correlations (|R| > 0.5) and significance (p < 0.05). Nonlactobacillus dominant microbiomes showed fewer strongly associated metabolite-KEGG pathway relationships compared to the lactobacillus dominant microbiome group, specifically the imidazole-related networks. CONCLUSIONS In this study, variations in significant correlations among metabolites and KEGG pathways suggest that microbiome diversity may contribute to how metabolites regulate host pathways in vaginal epithelial cells. The reduced pathway interactions observed in imidazole compounds suggest that dysregulation may contribute to recurrence of bacterial vaginosis.
This method of modelling could be used to characterize the regulation of critical pathways associated with the pathogenesis of bacterial vaginosis.
Item MutSignatures: an R package for extraction and analysis of cancer mutational signatures (Nature Publishing Group, 2020-10-26) Fantini, Damiano; Vidimar, Vania; Yu, Yanni; Condello, Salvatore; Meeks, Joshua J.; Obstetrics and Gynecology, School of Medicine
Cancer cells accumulate somatic mutations as a result of DNA damage, inaccurate repair and other mechanisms. Different genetic instability processes result in characteristic non-random patterns of DNA mutations, also known as mutational signatures. We developed mutSignatures, an integrated R-based computational framework aimed at deciphering DNA mutational signatures. Our software provides advanced functions for importing DNA variants, computing mutation types, and extracting mutational signatures via non-negative matrix factorization. Specifically, mutSignatures accepts multiple types of input data, is compatible with non-human genomes, and supports the analysis of non-standard mutation types, such as tetra-nucleotide mutation types. We applied mutSignatures to analyze somatic mutations found in smoking-related cancer datasets. We characterized mutational signatures that were consistent with those reported before in independent investigations. Our work demonstrates that selected mutational signatures correlated with specific clinical and molecular features across different cancer types, and revealed complementarity of specific mutational patterns that has not previously been identified.
In conclusion, we propose mutSignatures as a powerful open-source tool for detecting the molecular determinants of cancer and gathering insights into cancer biology and treatment.
Item A non-parametric Bayesian model for joint cell clustering and cluster matching: identification of anomalous sample phenotypes with random effects (Springer (Biomed Central Ltd.), 2014) Dundar, Murat; Akova, Ferit; Yerebakan, Halid Z.; Rajwa, Bartek; Department of Computer & Information Science, School of Science
BACKGROUND: Flow cytometry (FC)-based computer-aided diagnostics is an emerging technique utilizing modern multiparametric cytometry systems. The major difficulty in using machine-learning approaches for classification of FC data arises from limited access to a wide variety of anomalous samples for training. In consequence, any learning with an abundance of normal cases and a limited set of specific anomalous cases is biased towards the types of anomalies represented in the training set. Such models do not accurately identify anomalies, whether previously known or unknown, that may exist in future samples tested. Although one-class classifiers trained using only normal cases would avoid such a bias, robust sample characterization is critical for a generalizable model. Owing to sample heterogeneity and instrumental variability, arbitrary characterization of samples usually introduces feature noise that may lead to poor predictive performance. Herein, we present a non-parametric Bayesian algorithm called ASPIRE (anomalous sample phenotype identification with random effects) that identifies phenotypic differences across a batch of samples in the presence of random effects. Our approach involves simultaneous clustering of cellular measurements in individual samples and matching of discovered clusters across all samples in order to recover global clusters using probabilistic sampling techniques in a systematic way.
RESULTS: We demonstrate the performance of the proposed method in identifying anomalous samples in two different FC data sets, one of which represents a set of samples including acute myeloid leukemia (AML) cases, and the other a generic 5-parameter peripheral-blood immunophenotyping. Results are evaluated in terms of the area under the receiver operating characteristics curve (AUC). ASPIRE achieved AUCs of 0.99 and 1.0 on the AML and generic blood immunophenotyping data sets, respectively. CONCLUSIONS: These results demonstrate that anomalous samples can be identified by ASPIRE with almost perfect accuracy without a priori access to samples of anomalous subtypes in the training set. The ASPIRE approach is unique in its ability to form generalizations regarding normal and anomalous states given only very weak assumptions regarding sample characteristics and origin. Thus, ASPIRE could become highly instrumental in providing unique insights about observed biological phenomena in the absence of full information about the investigated samples.
Item Overlapping Genes Produce Proteins with Unusual Sequence Properties and Offer Insight into De Novo Protein Creation (American Society for Microbiology, 2009-10) Rancurel, Corinne; Khosravi, Mahvash; Dunker, A. Keith; Romero, Pedro R.; Karlin, David; Biochemistry and Molecular Biology, School of Medicine
It is widely assumed that new proteins are created by duplication, fusion, or fission of existing coding sequences. Another mechanism of protein birth is provided by overlapping genes. They are created de novo by mutations within a coding sequence that lead to the expression of a novel protein in another reading frame, a process called "overprinting." To investigate this mechanism, we have analyzed the sequences of the protein products of manually curated overlapping genes from 43 genera of unspliced RNA viruses infecting eukaryotes.
Overlapping proteins have a sequence composition globally biased toward disorder-promoting amino acids and are predicted to contain significantly more structural disorder than nonoverlapping proteins. By analyzing the phylogenetic distribution of overlapping proteins, we were able to confirm that 17 of these had been created de novo and to study them individually. Most proteins created de novo are orphans (i.e., restricted to one species or genus). Almost all are accessory proteins that play a role in viral pathogenicity or spread, rather than proteins central to viral replication or structure. Most proteins created de novo are predicted to be fully disordered and have a highly unusual sequence composition. This suggests that some viral overlapping reading frames encoding hypothetical proteins with highly biased composition, often discarded as noncoding, might in fact encode proteins. Some proteins created de novo are predicted to be ordered, however, and whenever a three-dimensional structure of such a protein has been solved, it corresponds to a fold previously unobserved, suggesting that the study of these proteins could enhance our knowledge of protein space.
Item Prediction of regulatory motifs from human Chip-sequencing data using a deep learning framework (Oxford University Press, 2019-09-05) Yang, Jinyu; Ma, Anjun; Hoppe, Adam D.; Wang, Cankun; Li, Yang; Zhang, Chi; Wang, Yan; Liu, Bingqiang; Ma, Qin; Medical and Molecular Genetics, School of Medicine
The identification of transcription factor binding sites and cis-regulatory motifs is a frontier whereupon the rules governing protein-DNA binding are being revealed. Here, we developed a new method (DEep Sequence and Shape mOtif or DESSO) for cis-regulatory motif prediction using deep neural networks and the binomial distribution model. DESSO outperformed existing tools, including DeepBind, in predicting motifs in 690 human ENCODE ChIP-sequencing datasets.
Furthermore, the deep-learning framework of DESSO expanded motif discovery beyond the state-of-the-art by allowing the identification of known and new protein-protein-DNA tethering interactions in human transcription factors (TFs). Specifically, 61 putative tethering interactions were identified among the 100 TFs expressed in the K562 cell line. In this work, the power of DESSO was further expanded by integrating the detection of DNA shape features. We found that shape information has strong predictive power for TF-DNA binding and provides new putative shape motif information for human TFs. Thus, DESSO improves in the identification and structural analysis of TF binding sites, by integrating the complexities of DNA binding into a deep-learning framework.
Item RareVar: A Framework for Detecting Low-Frequency Single-Nucleotide Variants (Mary Ann Liebert, Inc., 2017-07) Hao, Yangyang; Xuei, Xiaoling; Li, Lang; Nakshatri, Harikrishna; Edenberg, Howard J.; Liu, Yunlong; Medical and Molecular Genetics, School of Medicine
Accurate identification of low-frequency somatic point mutations in tumor samples has important clinical utilities. Although high-throughput sequencing technology enables capturing such variants while sequencing primary tumor samples, our ability for accurate detection is compromised when the variant frequency is close to the sequencer error rate. Most current experimental and bioinformatic strategies target mutations with ≥5% allele frequency, which limits our ability to understand the cancer etiology and tumor evolution. We present an experimental and computational modeling framework, RareVar, to reliably identify low-frequency single-nucleotide variants from high-throughput sequencing data under standard experimental protocols. RareVar protocol includes a benchmark design by pooling DNAs from already sequenced individuals at various concentrations to target variants at desired frequencies, 0.5%-3% in our case.
By applying a generalized, linear model-based, position-specific error model, followed by machine-learning-based variant calibration, our approach outperforms existing methods. Our method can be applied on most capture and sequencing platforms without modifying the experimental protocol.