Department of BioHealth Informatics Works

Recent Submissions

Now showing 1 - 10 of 406
  • Item
    Characterization of Proteoform Post-Translational Modifications by Top-Down and Bottom-Up Mass Spectrometry in Conjunction with Annotations
    (American Chemical Society, 2023) Chen, Wenrong; Ding, Zhengming; Zang, Yong; Liu, Xiaowen; BioHealth Informatics, School of Informatics and Computing
    Many proteoforms can be produced from a gene due to genetic mutations, alternative splicing, post-translational modifications (PTMs), and other variations. PTMs in proteoforms play critical roles in cell signaling, protein degradation, and other biological processes. Mass spectrometry (MS) is the primary technique for investigating PTMs in proteoforms, and two alternative MS approaches, top-down and bottom-up, have complementary strengths. The combination of the two approaches has the potential to increase the sensitivity and accuracy in PTM identification and characterization. In addition, protein and PTM knowledge bases, such as UniProt, provide valuable information for PTM characterization and verification. Here, we present a software pipeline PTM-TBA (PTM characterization by Top-down and Bottom-up MS and Annotations) for identifying and localizing PTMs in proteoforms by integrating top-down and bottom-up MS as well as PTM annotations. We assessed PTM-TBA using a technical triplicate of bottom-up and top-down MS data of SW480 cells. On average, database search of the top-down MS data identified 2000 mass shifts, 814.5 (40.7%) of which were matched to 11 common PTMs and 423 of which were localized. Of the mass shifts identified by top-down MS, PTM-TBA verified 435 mass shifts using the bottom-up MS data and UniProt annotations.
  • Item
    Annotating and Detecting Topics in Social Media Forum and Modelling the Annotation to Derive Directions-A Case Study
    (Research Square, 2021) B., Athira; Jones, Josette; Idicula, Sumam Mary; Kulanthaivel, Anand; Zhang, Enming; BioHealth Informatics, School of Informatics and Computing
    The widespread influence of social media impacts every aspect of life, including the healthcare sector. Although medics and health professionals are the final decision makers, the advice and recommendations obtained from fellow patients are significant. In this context, the present paper explores the topics of discussion posted by breast cancer patients and survivors on online forums. The study examines an online forum, maps the discussion entries to several topics, and proposes a machine learning model based on a classification algorithm to characterize the topics. To explore the topics of breast cancer patients and survivors, approximately 1000 posts are selected and manually labeled with annotations. In contrast, millions of posts are available without labels. A semi-supervised learning technique is used to build the labels for the unlabeled data; the large data set is then classified using a deep learning algorithm. The deep learning algorithm BiLSTM with the BERT word embedding technique provided a better F1-score of 79.5%. This method is able to classify the following topics: medication reviews, clinician knowledge, various treatment options, seeking and providing support, diagnostic procedures, financial issues, and implications for everyday life. What matters the most for the patients is coping with everyday living as well as seeking and providing emotional and informational support. The approach and findings show the potential of studying social media to provide insight into patients' experiences with critical health problems such as cancer.
  • Item
    Genetic Regulation of Human isomiR Biogenesis
    (MDPI, 2023-09-04) Jiang, Guanglong; Reiter, Jill L.; Dong, Chuanpeng; Wang, Yue; Fang, Fang; Jiang, Zhaoyang; Liu, Yunlong; BioHealth Informatics, School of Informatics and Computing
    MicroRNAs play a critical role in regulating gene expression post-transcriptionally. Variations in mature microRNA sequences, known as isomiRs, arise from imprecise cleavage and nucleotide substitution or addition. These isomiRs can target different mRNAs or compete with their canonical counterparts, thereby expanding the scope of miRNA post-transcriptional regulation. Our study investigated the relationship between cis-acting single-nucleotide polymorphisms (SNPs) in precursor miRNA regions and isomiR composition, represented by the ratio of a specific 5'-isomiR subtype to all isomiRs identified for a particular mature miRNA. Significant associations between 95 SNP-isomiR pairs were identified. Of note, rs6505162 was significantly associated with both the 5'-extension of hsa-miR-423-3p and the 5'-trimming of hsa-miR-423-5p. Comparison of breast cancer and normal samples revealed that the expression of both isomiRs was significantly higher in tumors than in normal tissues. This study sheds light on the genetic regulation of isomiR maturation and advances our understanding of post-transcriptional regulation by microRNAs.
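    The isomiR composition measure described in the abstract above (the ratio of one 5'-isomiR subtype to all isomiRs of a mature miRNA) can be illustrated with a minimal sketch. The function name, the subtype labels, and the read counts are hypothetical, and the sketch assumes the canonical form is counted in the denominator; the paper's actual quantification pipeline is not described in this listing.

```python
def isomir_ratio(counts, subtype):
    """Share of reads from one 5'-isomiR subtype among all isomiR reads
    observed for a single mature miRNA (assumed definition)."""
    total = sum(counts.values())
    if total == 0:
        return None  # no reads observed, so the ratio is undefined
    return counts.get(subtype, 0) / total


# Hypothetical read counts for one mature miRNA in one sample.
example_counts = {"canonical": 60, "5p_extension": 30, "5p_trimming": 10}

extension_ratio = isomir_ratio(example_counts, "5p_extension")
print(f"5'-extension ratio: {extension_ratio:.2f}")
```

    A per-sample ratio of this kind is the sort of quantity that could then be tested for association with SNP genotypes.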
  • Item
    CETP and SGLT2 inhibitor combination therapy improves glycemic control
    (medRxiv, 2023-06-16) Khomtchouk, Bohdan B.; Sun, Patrick; Ditmarsch, Marc; Kastelein, John J. P.; Davidson, Michael H.; BioHealth Informatics, School of Informatics and Computing
    Importance: Cholesteryl ester transfer protein (CETP) inhibition has been associated with decreased risk of new-onset diabetes in past clinical trials exploring their efficacy in cardiovascular disease and can potentially be repurposed to treat metabolic disease. Notably, as an oral drug it can potentially be used to supplement existing oral drugs such as sodium-glucose cotransporter 2 (SGLT2) inhibitors before patients are required to take injectable drugs such as insulin. Objective: To identify whether CETP inhibitors could be used as an oral add-on to SGLT2 inhibition to improve glycemic control. Design, setting, and participants: 2×2 factorial Mendelian Randomization (MR) is performed on the general population of UK Biobank participants with European ancestry. Exposures: Previously constructed genetic scores for CETP and SGLT2 function are combined in a 2×2 factorial framework to characterize the associations between joint CETP and SGLT2 inhibition compared to either alone. Main outcomes and measures: Glycated hemoglobin and type-2 diabetes incidence. Results: Data on 233,765 UK Biobank participants suggest that individuals with genetic inhibition of both CETP and SGLT2 have significantly lower glycated hemoglobin levels (mmol/mol) than control (Effect size: -0.136; 95% CI: -0.190 to -0.081; p-value: 1.09E-06), SGLT2 inhibition alone (Effect size: -0.082; 95% CI: -0.140 to -0.024; p-value: 0.00558), and CETP inhibition alone (Effect size: -0.08479; 95% CI: -0.136 to -0.033; p-value: 0.00118). Furthermore, joint CETP and SGLT2 inhibition is associated with decreased incidence of diabetes (log-odds ratio) compared to control (Effect size: -0.068; 95% CI: -0.115 to -0.021; p-value: 4.44E-03) and SGLT2 inhibition alone (Effect size: -0.062; 95% CI: -0.112 to -0.012; p-value: 0.0149). Conclusions and relevance: Our results suggest that CETP and SGLT2 inhibitor therapy may improve glycemic control over SGLT2 inhibitors alone. Future clinical trials can explore whether CETP inhibitors can be repurposed to treat metabolic disease and provide an oral therapeutic option to benefit high-risk patients before escalation to injectable drugs such as insulin or glucagon-like peptide 1 (GLP1) receptor agonists.
  • Item
    Penguin: A Tool for Predicting Pseudouridine Sites in Direct RNA Nanopore Sequencing Data
    (Elsevier, 2022) Hassan, Doaa; Acevedo, Daniel; Daulatabad, Swapna Vidhur; Mir, Quoseena; Janga, Sarath Chandra; BioHealth Informatics, School of Informatics and Computing
    Pseudouridine is one of the most abundant RNA modifications, occurring when uridines are converted by Pseudouridine synthase proteins. It plays an important role in many biological processes and has been reported to have application in drug development. Recently, single-molecule sequencing techniques such as the direct RNA sequencing platform offered by Oxford Nanopore Technologies have enabled direct detection of RNA modifications on the molecule being sequenced. In this study, we introduce a tool called Penguin that integrates several machine learning (ML) models to identify RNA Pseudouridine sites on Nanopore direct RNA sequencing reads. Pseudouridine sites were identified on single molecule sequencing data collected from direct RNA sequencing resulting in 723K reads in Hek293 and 500K reads in Hela cell lines. Penguin extracts a set of features from the raw signal measured by the Oxford Nanopore and the corresponding basecalled k-mer. Those features are used to train the predictors included in Penguin, which, in turn, can predict whether the signal is modified by the presence of Pseudouridine sites in the testing phase. We have included various predictors in Penguin, including Support vector machines (SVM), Random Forest (RF), and Neural network (NN). The results on the two benchmark data sets for Hek293 and Hela cell lines show outstanding performance of Penguin either in random split testing or in independent validation testing. In random split testing, Penguin has been able to identify Pseudouridine sites with a high accuracy of 93.38% by applying SVM to Hek293 benchmark dataset. In independent validation testing, Penguin achieves an accuracy of 92.61% by training SVM with Hek293 benchmark dataset and testing it for identifying Pseudouridine sites on Hela benchmark dataset. Thus, Penguin outperforms the existing Pseudouridine predictors in the literature, with 16% higher accuracy in independent validation testing. Employing Penguin to predict Pseudouridine revealed a significant enrichment of “regulation of mRNA 3’-end processing” in Hek293 cell line and “positive regulation of transcription from RNA polymerase II promoter involved in cellular response to chemical stimulus” in Hela cell line. Penguin software and models are available on GitHub and can be readily employed for predicting Ψ sites from Nanopore direct RNA-sequencing datasets.
  • Item
    Development of a Citizen Science platform for Indiana's Cardiovascular Mortality Rates and Social Health Determinants
    (2023-12-17) Malempati, Thejomayi; Purkayastha, Saptarshi; Hamid, Zeyana
    Indiana carries a high cardiovascular disease (CVD) mortality burden, with over 82,373 CVD-attributed deaths from 2017 to 2021 according to vital statistics data from the Indiana Department of Health. The project explores the association between CVD mortality rates and social determinants of health, including gender, education level, occupation, and lifestyle factors, across 779 Zip Code Tabulation Areas (ZCTAs) in Indiana. Merging cardiovascular mortality rates from the Indiana Department of Health with socioeconomic attributes for each zip code, the study explores the complex intersections of place and society in determining health. Spatial data visualization maps geographic clusters exhibiting elevated death rates warranting priority attention. Statistical models estimate the extent of disproportionate mortality risk faced by disadvantaged groups after accounting for other factors. Overall, the project knits together conceptual underpinnings from medical geography, social epidemiology, and health informatics to put the spotlight on social determinants as pivotal upstream drivers of cardiovascular health disparities. The interactive heatmaps and dashboards will allow for citizen science and participation in understanding targeted interventions that may address root causes of the challenges in promoting health equity.
  • Item
    TopFD: A Proteoform Feature Detection Tool for Top–Down Proteomics
    (American Chemical Society, 2023) Basharat, Abdul Rehman; Zang, Yong; Sun, Liangliang; Liu, Xiaowen; BioHealth Informatics, School of Informatics and Computing
    Top-down liquid chromatography-mass spectrometry (LC-MS) analyzes intact proteoforms and generates mass spectra containing peaks of proteoforms with various isotopic compositions, charge states, and retention times. An essential step in top-down MS data analysis is proteoform feature detection, which aims to group these peaks into peak sets (features), each containing all peaks of a proteoform. Accurate protein feature detection enhances the accuracy in MS-based proteoform identification and quantification. Here, we present TopFD, a software tool for top-down MS feature detection that integrates algorithms for proteoform feature detection, feature boundary refinement, and machine learning models for proteoform feature evaluation. We performed extensive benchmarking of TopFD, ProMex, FlashDeconv, and Xtract using seven top-down MS data sets and demonstrated that TopFD outperforms other tools in feature accuracy, reproducibility, and feature abundance reproducibility.
  • Item
    The Cannabis sativa genetics and therapeutics relationship network: automatically associating cannabis-related genes to therapeutic properties through chemicals from cannabis literature
    (BMC, 2023-05-30) Jackson, Trever J.; Chakraborty, Sunandan; BioHealth Informatics, School of Informatics and Computing
    Background: Understanding the genome of Cannabis sativa holds significant scientific value due to the multi-faceted therapeutic nature of the plant. Links from cannabis gene to therapeutic property are important to establish gene targets for the optimization of specific therapeutic properties through selective breeding of cannabis strains. Our work establishes a resource for quickly obtaining a complete set of therapeutic properties and genes associated with any known cannabis chemical constituent, as well as relevant literature. Methods: State-of-the-art natural language processing (NLP) was used to automatically extract information from many cannabis-related publications, thus producing an undirected multipartite weighted-edge paragraph co-occurrence relationship network composed of two relationship types, gene-chemical and chemical-property. We also developed an interactive application to visualize sub-graphs of manageable size. Results: Two hundred thirty-four cannabis constituent chemicals, 352 therapeutic properties, and 124 genes from the Cannabis sativa genome form a multipartite network graph which transforms 29,817 cannabis-related research documents from PubMed Central into a network format that is easy to visualize and explore. Conclusion: Use of our network replaces time-consuming and labor-intensive manual extraction of information from the large amount of available cannabis literature. This streamlined information retrieval process will enhance the activities of cannabis breeders, cannabis researchers, organic biochemists, pharmaceutical researchers, and scientists in many other disciplines.
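    The idea of a weighted paragraph co-occurrence network with gene-chemical and chemical-property edges can be sketched roughly as follows. This is not the authors' pipeline: it swaps their NLP-based entity extraction for simple dictionary matching, and the function name, the placeholder terms (`gene1`, `chem1`, `prop1`, ...), and the sample paragraphs are all hypothetical.

```python
import re
from collections import Counter
from itertools import product


def cooccurrence_edges(paragraphs, genes, chemicals, properties):
    """Count, per edge, the number of paragraphs in which both endpoints occur.

    Only the two relationship types named in the abstract are built:
    gene-chemical and chemical-property. Term sets must be lowercase.
    """
    edges = Counter()
    for text in paragraphs:
        tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
        found_genes = tokens & genes
        found_chems = tokens & chemicals
        found_props = tokens & properties
        for gene, chem in product(found_genes, found_chems):
            edges[("gene-chemical", gene, chem)] += 1
        for chem, prop in product(found_chems, found_props):
            edges[("chemical-property", chem, prop)] += 1
    return edges


# Placeholder vocabulary and text, invented purely for illustration.
genes = {"gene1", "gene2"}
chemicals = {"chem1", "chem2"}
properties = {"prop1", "prop2"}
paragraphs = [
    "Gene1 was linked to chem1, and chem1 showed prop1.",
    "Chem1 showed prop1 and prop2.",
    "Gene2 was linked to chem2.",
]

edges = cooccurrence_edges(paragraphs, genes, chemicals, properties)
for (kind, source, target), weight in sorted(edges.items()):
    print(kind, source, target, weight)
```

    Edge weights here are plain paragraph counts; how the published network weights its edges is not stated in this listing.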
  • Item
    An answer recommendation framework for an online cancer community forum
    (Springer Nature, 2023-05-15) Athira, B.; Idicula, Sumam Mary; Jones, Josette; Kulanthaivel, Anand; BioHealth Informatics, School of Informatics and Computing
    Health community forums are online platforms for discussing various matters related to the management of illness. People are increasingly searching for answers online, particularly when they are diagnosed with life-threatening diseases such as cancer. People seek suggestions or advice through these platforms to make decisions during their treatments. However, locating the correct information or similar people is often a great challenge for them. In this scenario, this paper proposes an answer recommendation system for an online breast cancer community forum that provides guidance and valuable references to users while they make decisions. The answer is a summary of a topic already discussed in the forum, so that users do not need to go through all the answer posts, which span multiple pages, or initiate a thread once again. The answer recommendation system has three phases: a query similarity model to retrieve similar past queries, query-answer pair generation, and answer recommendation. The query similarity model is implemented as a Siamese network with a Bi-LSTM architecture, which achieves an F1-score of 85.5%. The paper also shows the efficacy of the transfer learning technique in generalizing the model to our breast cancer query-query pair data set. The query-answer pairs are generated by an extractive summarization technique that is based on an optimization algorithm. The effectiveness of the generated summary is evaluated against a manually generated summary, and the result shows a ROUGE-1 score of 49%.
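    For readers unfamiliar with the ROUGE-1 metric cited above, a minimal sketch of unigram-overlap scoring follows. The abstract does not say whether the reported 49% is precision, recall, or F1, nor which tokenization was used, so the function below returns all three under a simple lowercase word tokenization; the function name and example sentences are hypothetical.

```python
import re
from collections import Counter


def rouge_1(candidate, reference):
    """Return (precision, recall, f1) based on clipped unigram overlap."""
    cand = Counter(re.findall(r"\w+", candidate.lower()))
    ref = Counter(re.findall(r"\w+", reference.lower()))
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


# Invented sentences: a manual reference summary and a generated candidate.
reference = "the patient asked about treatment side effects"
candidate = "the patient asked about side effects of treatment"

precision, recall, f1 = rouge_1(candidate, reference)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```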
  • Item
    Dimension-agnostic and granularity-based spatially variable gene identification
    (Research Square, 2023-03-22) Wang, Juexin; Li, Jinpu; Kramer, Skyler; Su, Li; Chang, Yuzhou; Xu, Chunhui; Ma, Qin; Xu, Dong; BioHealth Informatics, School of Informatics and Computing
    Identifying spatially variable genes (SVGs) is critical in linking molecular cell functions with tissue phenotypes. Spatially resolved transcriptomics captures cellular-level gene expression with corresponding spatial coordinates in two or three dimensions and can be used to infer SVGs effectively. However, current computational methods may not achieve reliable results and often cannot handle three-dimensional spatial transcriptomic data. Here we introduce BSP (big-small patch), a spatial granularity-guided and non-parametric model to identify SVGs from two or three-dimensional spatial transcriptomics data in a fast and robust manner. This new method has been extensively tested in simulations, demonstrating superior accuracy, robustness, and high efficiency. BSP is further validated by substantiated biological discoveries in cancer, neural science, rheumatoid arthritis, and kidney studies with various types of spatial transcriptomics technologies.