Informatics School Theses and Dissertations

Permanent URI for this collection

Please go to "Informatics Graduate Theses and PhD Dissertations" to submit dissertations and theses for the School of Informatics and Computing, at:


Recent Submissions

Now showing 1 - 10 of 189
  • Item
    Computational Methods for Proteoform Identification and Characterization Using Top-Down Mass Spectrometry
    (2023-12) Chen, Wenrong; Yan, Jingwen; Wang, Juexin; Wan, Jun; Zang, Yong; Luo, Xiao; Liu, Xiaowen
    Proteoforms, distinct molecular forms of proteins, arise due to numerous factors such as genetic mutations, differential gene expression, alternative splicing, and a range of biological processes. These proteoforms are often characterized by primary structural variances such as amino acid substitutions, terminal truncations, and post-translational modifications (PTMs). Proteoforms from the same protein can manifest varied functional behaviors based on the specific alterations. The complexity inherent to proteoforms has elevated the significance of top-down mass spectrometry (MS) due to its proficiency in providing intricate sequence information for these intact proteoforms. During a typical top-down MS experiment, intact proteoforms are separated through platforms like liquid chromatography (LC) or capillary zone electrophoresis (CZE) prior to tandem mass spectrometry (MS/MS) analysis. Despite advancements in instruments and protocols for top-down MS, computational challenges persist, with software tool development still in its early stages. In this dissertation, our research revolves around three primary goals, all aimed at refining proteoform characterization. First, we bridge RNA-Seq with top-down MS for better proteoform identification. We propose TopPG, an innovative proteogenomic tool that, in contrast to traditional approaches, is tailored to generate proteoform sequence databases from genetic and splicing variations explicitly for top-down MS. Second, to boost the accuracy of proteoform detection, we utilize machine learning methods to predict proteoform retention and migration times in top-down MS, critically evaluating models in an area traditionally dominated by bottom-up MS methodologies. Lastly, recognizing the indispensable role of PTMs in cellular functions, we introduce PTM-TBA. This tool integrates the complementary strengths of both top-down and bottom-up MS, augmented with annotations, building a comprehensive strategy for precise PTM identification and localization.
  • Item
    Survivor-Centered Transformative Justice: An Approach to Designing Sociotechnical Systems Alongside Domestic Violence Stakeholders in US Muslim Communities
    (2023-08) Rabaan, Hawra; Dombrowski, Lynn; Bolchini, Davide; Brady, Erin; Khaja, Khadija; Schoenebeck, Sarita
    Domestic violence (DV) is a social, political, and legal problem that requires contextual examination. In the United States, earlier advocacy work focused on law reform to empower survivors in influencing the public and state to take DV seriously and provide resources to support and protect survivors. However, harm is still perpetuated systemically and socially for survivors, especially those from racial and religious minorities. In this dissertation, I focus on domestic violence within the US-based Muslim population due to the unique issues Muslim survivors face when dealing with governmental services and service providers (e.g., gendered Islamophobia, racial discrimination, punitive actions) and within the Muslim community itself (e.g., community trauma, faith leaders lacking appropriate training). This work incorporates three phases of research that utilize qualitative and design methods to examine the forms and dynamics of domestic violence, help-seeking and healing challenges, and survivor advocacy, abuser accountability, and community transformation interventions. I argue that to pursue justice for survivors in design research, a multifaceted approach rooted in principles from Islamic feminism, trauma-informed care, and restorative and transformative justice tenets is needed. Consequently, I propose Survivor-Centered Transformative Justice (SCTJ), a framework to discern individual and systemic harm, to understand how to design alongside victim-survivors, and to focus on victim-survivors' autonomy. I illustrate how SCTJ allows researchers and designers to account for individual inequalities, recognize communities' preferred approaches to pursuing justice, tackle the underlying conditions enabling harm, and provide interventions that alter, repair, and reduce harm within different scales of relationships. Additionally, I present the concept of healing structures, which aim to safeguard against harmful community practices, discriminatory laws, and practices while facilitating collective and survivor-centered interventions to promote healing. Lastly, I demonstrate the potential for design research to progress by taking a closer look into the belief systems, cultural values, and surrounding conditions that contribute to users' obtainable choices and decision-making processes, and by centering the needs of people at the margins. With this empirical, theoretical, and design work, I present insights that inform the HCI community at the intersection of social justice-oriented design, Islamic feminism, and gender-based violence.
  • Item
    Understanding Informational Practices and Exploring Data Collection Approaches for Quality of Life in Brain Injury Illness Management
    (2023-07) Masterson, Yamini Lalama Patnaik; Brady, Erin; Miller, Andrew D.; Toscos, Tammy; Hong, Youngbok; Gunter, Tracy D.
    Brain injury, a combination of medical injury, chronic illness, and impairment, affects more than 3.5 million people in the United States every year through an interplay of physiological, psychological, environmental, and cultural factors spanning clinical recovery, illness management, and personal recovery phases. The lack of a collaborative and integrated understanding across healthcare and accessibility communities led to treating brain injury as localized damage rather than as an individual response to ever-changing impairment and symptoms, with a primary focus on clinical recovery until recently. While self-tracking and management technologies have been widely successful in measuring individual symptoms, they have struggled to facilitate sensemaking and problem solving to achieve a consistent biopsychosocial awareness of illness. My dissertation addresses this gap through three aims: (1) investigate the current informational practices of individuals undergoing post-acute brain injury recovery, (2) explore technology-agnostic approaches for data collection and their impact on sensemaking processes and conceptual understanding of brain injury, and (3) develop guidelines for designing data collection tools that facilitate sensemaking in brain injury self-management. I achieve this through two longitudinal studies: an interview study that introduced participants to the framework on quality of life after traumatic brain injury (QoLIBRI) and a narrative study that used the QoLIBRI framework for structured journaling and for co-designing individualized data collection tools. The goal of this work is to improve the self-awareness of individuals with brain injury, enabling them to anticipate or recognize the occurrence of a challenge caused by impairment and then utilize assistive technologies to bypass the limitation. It also has implications for involving neurodiverse populations in research and technology design.
  • Item
    Celltyper: A Single-Cell Sequencing Marker Gene Tool Suite
    (2023-05) Paisley, Brianna Meadow; Liu, Yunlong; Yan, Jingwen; Cao, Sha; Wang, Juexin; Carfagna, Mark
    Single-cell RNA-sequencing (scRNA-seq) has enabled researchers to study interindividual cellular heterogeneity, to explore disease impact on cellular composition of tissue, and to identify novel cell subtypes. However, a major challenge in scRNA-seq analysis is to identify the cell type of individual cells. Accurate cell type identification is crucial for any scRNA-seq analysis to be valid, as incorrect cell type assignment will reduce statistical robustness and may lead to incorrect biological conclusions. Therefore, accurate and comprehensive cell type assignment is necessary for reliable biological insights into scRNA-seq datasets. With over 200 distinct cell types in humans alone, the concept of cell identity is broad. Even within the same cell type there exists heterogeneity due to cell cycle phase, cell state, cell subtypes, cell health, and the tissue microenvironment. This makes cell type classification a complicated biological problem requiring bioinformatics. One approach to classifying cell type identity is to use marker genes. Marker genes are genes specific for one or a few cell types. When coupled with bioinformatic methods, marker genes show promise for improving cell type classification. However, current scRNA-seq classification methods and databases use marker genes that are non-specific across sources, samples, and/or species, leading to bias and errors. Furthermore, many existing tools require manual intervention by the user to provide training datasets or the expected number and name of cell types, which can introduce selection bias. The selection bias negatively impacts the accuracy of cell type classification methods, as the model cannot extrapolate outside of the user inputs even when it is biologically meaningful to do so. In this dissertation, I developed CellTypeR, a suite of tools to explore the biology governing cell identity in a “normal” state for humans and mice. The work presented here accomplishes three aims: 1. Develop an ontology-standardized database of published marker gene literature; 2. Develop and apply a marker gene classification algorithm; and 3. Create a user interface and input data structure for scRNA-seq cell type prediction.
  • Item
    Transcriptome-Wide Methods for Functional and Structural Annotation of Long Non-Coding RNAs
    (2023-05) Daulatabad, Swapna Vidhur; Janga, Sarath Chandra; Reda, Khairi; Yan, Jingwen; Ye, Yuzhen
    Non-coding RNAs across the genome have been associated with various biological processes, ranging from regulation of splicing to remodeling of chromatin. Amongst the repertoire of non-coding sequences lies a critical species of RNAs called long non-coding RNAs (lncRNAs). LncRNAs significantly contribute to a large spectrum of human phenotypes, including cancers, heart failure, diabetes, and Alzheimer’s disease. This dissertation emphasizes the need to characterize the functional role of lncRNAs to improve our understanding of human diseases. This work consolidates a resource from multiple computational genomics and natural language processing-based approaches to advance our ability to functionally annotate hundreds of lncRNAs and their interactions, providing a one-stop platform for lncRNA functional annotation, dynamic interaction networks, and multi-faceted omics data visualization. RNA interactions are vital in various cellular processes, from transcription to RNA processing. These interactions dictate the functional scope of the RNA. However, the multifaceted functional nature of RNA stems from its ability to form secondary structures. Therefore, this work establishes a computational method to characterize RNA secondary structure by integrating SHAPE-seq and long-read sequencing, to further enhance our understanding of the role of RNA structure in modulating post-transcriptional regulatory processes and to decipher its influence at several layers of biological features, ranging from structure composition to consequent protein occupancy. This study will potentially impact the research community by providing methods, web interfaces, and computational pipelines, improving our functional understanding of long non-coding RNAs. This work also provides novel methods for integrating technologies like Oxford Nanopore-based long-read sequencing, RNA structure-probing methods, and machine learning. The approaches developed in this dissertation are scalable and adaptable to further investigate the functional and regulatory role of RNA and its structure. Overall, this study accelerates the development of RNA-based diagnostics and the identification of therapeutic targets in human disease.
  • Item
    Discovery and Interpretation of Subspace Structures in Omics Data by Low-Rank Representation
    (2022-10) Lu, Xiaoyu; Cao, Sha; Zhang, Chi; Yan, Jingwen; Zang, Yong
    Biological functions in cells are highly complicated and heterogeneous, and can be reflected by omics data, such as gene expression levels. Detecting subspace structures in omics data and understanding the diversity of the biological processes is essential to the full comprehension of biological mechanisms and complicated biological systems. In this thesis, we develop novel statistical learning approaches to reveal the subspace structures in omics data. Specifically, we focus on three types of subspace structures: low-rank subspace, sparse subspace, and covariates-explainable subspace. For low-rank subspace, we developed a semi-supervised model, SSMD, to detect cell type-specific low-rank structures and predict their relative proportions across different tissue samples. SSMD is the first computational tool that utilizes semi-supervised identification of cell types and their marker genes specific to each mouse tissue transcriptomics dataset, for better understanding of the disease microenvironment and downstream disease mechanism. For sparsity-driven sparse subspace, we proposed a novel positive and unlabeled learning model, namely PLUS, that can identify cancer metastasis-related genes, predict cancer metastasis status, and specifically address the under-diagnosis issue in studying metastasis potential. We found that the metastasis potential predicted by PLUS at diagnosis has a significantly strong association with patients’ progression-free survival in their follow-up data. Lastly, to discover the covariates-explainable subspace, we proposed an analytical pipeline based on covariance regression, namely scCovReg. We utilized scCovReg to detect pathway-level second-order variations using scRNA-Seq data in a statistically powerful manner, and to associate the second-order variations with important subject-level characteristics, such as disease status. In conclusion, we presented a set of state-of-the-art computational solutions for identifying sparse subspaces in omics data, which promise to provide insights into the mechanisms of complex diseases.
  • Item
    A Comprehensive Survey and Deep Learning-Based Prediction on G-quadruplex Formation and Biological Functions
    (2022-09) Fang, Shuyi; Wan, Jun; Liu, Yunlong; Yan, Jingwen; Zhang, Jie
    G-quadruplexes (G4s) are guanine-rich four-stranded DNA/RNA structures, which have been found throughout the human genome. G4s have been reported to affect chromatin structure and are involved in important biological processes at transcriptional and epigenetic levels. However, the underlying molecular mechanisms and the locations of G4s still remain elusive due to the complexity of G4s. Taking advantage of the development of high-throughput sequencing technologies and machine learning approaches, we conducted this comprehensive investigation of G4 structures, including the discovery of a novel marker for functional human hematopoietic stem cells, which prompted our interest in G4 structure; the exploration of associations between G4s and genomic factors by incorporating multi-omics data; and the development of a deep-learning-based G4 prediction tool with G4 motifs. First, we discovered ADGRG1 as a novel marker for functional human hematopoietic stem cells and its regulation through transcription activities. Our interest in G4s was stimulated during the transcription-related investigations. Next, we analyzed the genome-wide distribution properties of G4s and uncovered the associations of G4s with other epigenetic and transcriptional mechanisms to coordinate gene transcription. We found that G4 groups of different confidence levels correlated differently with epigenetic regulatory elements and revealed that G4 structures could correlate with gene expression in two opposite ways depending on their locations and forming strands. Some transcription factors were identified as over-represented with G4 emergence. We found distinct consensus sequences enriched in the G4 feet, with a high GC content in the feet of high-confidence G4s and a high TA content in solely predicted G4 feet. In the last part, we developed a novel deep-learning-based prediction tool for DNA G4s with G4 motifs. Considering the classical G4 motif, we applied a bi-directional LSTM model with an attention method, which captures sequential information and showed good performance in whole-genome level prediction of DNA G4s with the certified G4 pattern. Our comprehensive work investigated G4s along with their functions and prediction, and provided a better understanding of G4s at the multi-omics level and of computational information capture, riding the wave of deep learning.
  • Item
    Deciphering Gene Regulatory Mechanisms Through Multi-omics Integration
    (2022-09) Chen, Duojiao; Liu, Yunlong; Wan, Jun; Zhang, Chi; Yan, Jingwen
    Complex biological systems are composed of many regulatory components, which can be measured with the advent of genomics technology. Each molecular assay is normally designed to interrogate one aspect of the cell state. However, a comprehensive understanding of the regulatory mechanism requires characterization at multiple levels such as genome, epigenome, and transcriptome. Integration of multi-omics data is urgently needed for understanding the global regulatory mechanism of gene expression. In recent years, single-cell technology has offered unprecedented resolution for a deeper characterization of cellular diversity and states. High-quality single-cell suspensions from tissue biopsies are required for single-cell sequencing experiments. Tissue biopsies need to be processed as soon as they are collected to avoid gene expression changes and RNA degradation. Although cryopreservation is a feasible solution to preserve freshly isolated samples, its effect on transcriptome profiles still needs to be investigated. Investigation of multi-omics data at the single-cell level can provide new insights into biological processes. In addition to the common method of integrating multi-omics data, it is now also possible to simultaneously profile the transcriptome and epigenome at single-cell resolution, enhancing the power of discovering new gene regulatory interactions. In this dissertation, we integrated bulk RNA-seq with ATAC-seq and several additional assays and revealed the complex mechanisms of ER–E2 interaction with nucleosomes. A comparative analysis of fresh and frozen multiple myeloma single-cell RNA sequencing data led us to conclude that cryopreservation is a feasible protocol for preserving cells. We also analyzed the single-cell multiome data for mesenchymal stem cells. With the unified landscape from simultaneously profiling gene expression and chromatin accessibility, we discovered distinct osteogenic differentiation potential of mesenchymal stem cells and different associations with bone disease-related traits. We gained a deeper insight into the underlying gene regulatory mechanisms with this frontier single-cell multiome sequencing technique.
  • Item
    Intron Retention Induced Neoantigen as Biomarkers in Diseases
    (2022-08) Dong, Chuanpeng; Yan, Jingwen; Liu, Yunlong; Huang, Kun; Wan, Jun; Liu, Xiaowen
    Alternative splicing is a regulatory mechanism that generates multiple mRNA transcripts from a single gene, allowing significant expansion in proteome diversity. Disruption of splicing mechanisms has a large impact on the transcriptome and is a significant driver of complex diseases by producing condition-specific transcripts. Recent studies have reported that mis-spliced RNA transcripts can be another major source of neoantigens directly associated with immune responses. Particularly, aberrant peptides derived from unspliced introns can be presented by the major histocompatibility complex (MHC) class I molecules on the cell surface and elicit immunogenicity. In this dissertation, we first developed an integrated computational pipeline for identifying intron retention (IR)-induced neoantigens (IR-neoAg) from RNA sequencing (RNA-Seq) data. Our workflow also included a random forest classifier for prioritizing the neoepitopes with the highest likelihood of inducing a T cell response. Second, we analyzed IR neoantigens using RNA-Seq data for multiple myeloma patients from the MMRF study. Our results suggested that the IR-neoAg load could serve as a prognostic biomarker, and that immunosuppression in the myeloma microenvironment might offset the effect of an increasing neoantigen load. Third, we demonstrated that a high IR-neoAg load predicts better overall survival in TCGA pancreatic cancer patients. Moreover, our results indicated that the IR-neoAg load might be useful in identifying pancreatic cancer patients who might benefit from immune checkpoint blockade (ICB) therapy. Finally, we explored the association of IR-induced neo-peptides with neurodegenerative disease pathology and susceptibility. In conclusion, we presented a state-of-the-art computational solution for identifying IR-neoAgs, which might aid neoantigen-based vaccine development and the prediction of patient immunotherapy responses. Our studies provide remarkable insights into the roles of alternative splicing in complex diseases by directly mediating immune responses.
  • Item
    Computational Modeling of Cell and Tissue Level Metabolic Characterization of the Human Metabolic Network by Using scRNA-seq Data
    (2022-06) Alghamdi, Norah Saeed; Zhang, Chi; Cao, Sha; Yan, Jingwen; Jones, Josette
    The heterogeneity of metabolic pathways is a hallmark of many common disease types. Although several sources of knowledge on the core components of metabolic networks and sub-networks are now available, there are still limitations in our knowledge of the integrated behavior and metabolic reprogramming of the cell microenvironment. Metabolic changes can be characterized by different factors, and the changes differ from one cell to another because of their high plasticity. The large amount of single-cell and tissue data gained from disease tissue has the potential to provide information on a cell's functional state and its underlying phenotypic changes. Hence, advanced systems biology models and computational tools are in pressing need to empower reliable characterization of metabolic variations in disease by using scRNA-seq data. Our preliminary data include (1) a new computational method to estimate cell-wise metabolic flux and states from single-cell and tissue transcriptomics data, and (2) matched scRNA-seq data and metabolomics experiments on cells under perturbed biochemical conditions and knock-down of metabolic genes, both of which form the computational and experimental foundations of this project. In this dissertation, we proposed to develop a suite of novel computational methods, systems biology models, and quantitative metrics to bring the following unmet capabilities: (1) reconstruction of context-specific and subcellular-resolution metabolic networks for different disease types, (2) estimation of cell-/sample-wise metabolic flux by considering metabolic imbalance and metabolic exchange between cells in the disease microenvironment, and (3) a systematic evaluation of the functional impact of variations in gene expression, metabolite availability, and network structure on the context-specific metabolic network and flux. By implementing these methods using scRNA-seq data, we addressed the following outstanding biological questions: (i) identification of genes, metabolites, and network topology with high impact on metabolic variations, (ii) estimation of metabolic flux, and (iii) assessment of metabolic changes over the metabolic network. Successful execution of the proposed research provides a suite of computational capabilities to analyze metabolic variations that could be broadly utilized by the biomedical research community.