Jake Chen

Recent Submissions

  • Item
    ProteoLens: a visual analytic tool for multi-scale database-driven biological network data mining
    (BioMed Central, 2008-08-12) Huan, Tianxiao; Sivachenko, Andrey Y.; Harrison, Scott H.; Chen, Jake Yue; Computer and Information Science, School of Science
    Background New systems biology studies require researchers to understand how interplay among myriad biomolecular entities is orchestrated in order to achieve high-level cellular and physiological functions. Many software tools have been developed in the past decade to help researchers visually navigate large networks of biomolecular interactions with built-in template-based query capabilities. To further advance researchers' ability to interrogate global physiological states of cells through multi-scale visual network explorations, new visualization software tools still need to be developed to empower the analysis. A robust visual data analysis platform driven by database management systems to perform bi-directional data processing-to-visualizations with declarative querying capabilities is needed. Results We developed ProteoLens as a Java-based visual analytic software tool for creating, annotating and exploring multi-scale biological networks. It supports direct database connectivity to either Oracle or PostgreSQL database tables/views, on which SQL statements using both Data Definition Language (DDL) and Data Manipulation Language (DML) may be specified. The robust query languages embedded directly within the visualization software help users to bring their network data into a visualization context for annotation and exploration. ProteoLens supports graph/network represented data in standard Graph Modeling Language (GML) formats, and this enables interoperation with a wide range of other visual layout tools.
The architectural design of ProteoLens enables the de-coupling of complex network data visualization tasks into two distinct phases: 1) creating network data association rules, which are mapping rules between network node IDs or edge IDs and data attributes such as functional annotations, expression levels, scores, synonyms, and descriptions; 2) applying network data association rules to build the network and perform the visual annotation of graph nodes and edges according to associated data values. We demonstrated the advantages of these new capabilities through three biological network visualization case studies: a human disease association network, a drug-target interaction network and a protein-peptide mapping network. Conclusion The architectural design of ProteoLens makes it suitable for bioinformatics expert data analysts who are experienced with relational database management to perform large-scale integrated network visual explorations. ProteoLens is a promising visual analytic platform that will facilitate knowledge discoveries in future network and systems biology studies.
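The two phases described in the abstract above can be illustrated with a minimal Python sketch. This is not ProteoLens code (the tool itself is Java-based and driven by SQL); the function names `build_network` and `apply_rules`, the node IDs, and the attribute values are all hypothetical and chosen only to show the separation between defining association rules and applying them.

```python
def build_network(edges):
    """Create an empty attribute dictionary for every node seen in the edge list."""
    nodes = {}
    for source, target in edges:
        nodes.setdefault(source, {})
        nodes.setdefault(target, {})
    return nodes


def apply_rules(nodes, rules):
    """Attach each rule's values to the matching nodes; unknown IDs are ignored."""
    for attribute, mapping in rules.items():
        for node_id, value in mapping.items():
            if node_id in nodes:
                nodes[node_id][attribute] = value
    return nodes


# Phase 1: association rules, i.e. mappings from node IDs to data attributes.
# All IDs and values here are made up for illustration.
rules = {
    "score": {"P1": 0.9, "P2": 0.4, "P9": 0.1},
    "description": {"P1": "kinase", "P3": "receptor"},
}

# Phase 2: build the network, then annotate its nodes with the rules.
edges = [("P1", "P2"), ("P2", "P3")]
annotated = apply_rules(build_network(edges), rules)
```

Keeping the rules separate from the network means the same rule set can be reapplied to a different edge list, which is the kind of de-coupling the abstract describes.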
  • Item
    PEPPI: a peptidomic database of human protein isoforms for proteomics experiments
    (BMC, 2010-10-07) Zhou, Ao; Zhang, Fan; Chen, Jake Yue; BioHealth Informatics, School of Informatics and Computing
    Background Protein isoform generation, which may derive from alternative splicing, genetic polymorphism, and posttranslational modification, is an essential source of molecular diversity in eukaryotic cells. Previous studies have shown that protein isoforms play critical roles in disease diagnosis, risk assessment, sub-typing, prognosis, and treatment outcome predictions. Understanding the types, presence, and abundance of different protein isoforms in different cellular and physiological conditions is a major task in functional proteomics, and may pave the way to molecular biomarker discovery of human diseases. In tandem mass spectrometry (MS/MS) based proteomics analysis, peptide peaks with exact matches to protein sequence records in the proteomics database may be identified with mass spectrometry (MS) search software. However, due to limited annotation and poor coverage of protein isoforms in proteomics databases, high throughput protein isoform identifications, particularly those arising from alternative splicing and genetic polymorphism, have not been possible. Results Therefore, we present the PEPtidomics Protein Isoform Database (PEPPI, http://bio.informatics.iupui.edu/peppi), a comprehensive database of computationally-synthesized human peptides that can identify protein isoforms derived from either alternatively spliced mRNA transcripts or SNP variations. We collected genome, pre-mRNA alternative splicing and SNP information from Ensembl. We synthesized in silico isoform transcripts that cover all exons and theoretically possible junctions of exons and introns, as well as all their variations derived from known SNPs. With three case studies, we further demonstrated that the database can help researchers discover and characterize new protein isoform biomarkers from experimental proteomics data. Conclusions We developed a new tool for the proteomics community to characterize protein isoforms from MS-based proteomics experiments.
By cataloguing each peptide configuration in the PEPPI database, users can study genetic variations and alternative splicing events at the proteome level. They can also batch-download peptide sequences in FASTA format to search for MS/MS spectra derived from human samples. The database can help generate novel hypotheses on molecular risk factors and molecular mechanisms of complex diseases, leading to identification of potentially highly specific protein isoform biomarkers.
  • Item
    Discovery of pathway biomarkers from coupled proteomics and systems biology methods
    (BMC, 2010-11-02) Zhang, Fan; Chen, Jake Yue; BioHealth Informatics, School of Informatics and Computing
    Background: Worldwide, breast cancer is the second most common type of cancer after lung cancer. Plasma proteome profiling may have a higher chance of identifying protein changes between plasma samples such as normal and breast cancer tissues. Breast cancer cell lines have long been used by researchers as a model system for identifying protein biomarkers. A comparison of the set of proteins which change in plasma with previously published findings from proteomic analysis of human breast cancer cell lines may identify with a higher confidence a subset of candidate protein biomarkers. Results: In this study, we analyzed a liquid chromatography (LC) coupled tandem mass spectrometry (MS/MS) proteomics dataset from plasma samples of 40 healthy women and 40 women diagnosed with breast cancer. Using two-sample t-statistics and a permutation procedure, we identified 254 statistically significant, differentially expressed proteins, among which 208 are over-expressed and 46 are under-expressed in breast cancer plasma. We validated this result against previously published proteomic results of human breast cancer cell lines and signaling pathways to derive 25 candidate protein biomarkers in a panel. Using pathway analysis, we observed that the 25 “activated” plasma proteins were present in several cancer pathways, including ‘Complement and coagulation cascades’, ‘Regulation of actin cytoskeleton’, and ‘Focal adhesion’, and match well with previously reported studies. Additional gene ontology analysis of the 25 proteins also showed that cellular metabolic process and response to external stimulus (especially proteolysis and acute inflammatory response) were enriched functional annotations of the proteins identified in the breast cancer plasma samples.
By cross-validation using two additional proteomics studies, we obtained 86% and 83% similarities in pathway-protein matrix between the first study and the two testing studies, which is much better than the similarity we measured with proteins. Conclusions: We presented a ‘systems biology’ method to identify, characterize, analyze and validate panel biomarkers in breast cancer proteomics data, which includes 1) t-statistics and a permutation process, 2) network, pathway and function annotation analysis, and 3) cross-validation of multiple studies. Our results showed that the systems biology approach is essential to understanding the molecular mechanisms of panel protein biomarkers.
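The first step named in the abstract above, a two-sample t-statistic combined with a permutation procedure, can be sketched for a single protein as follows. The abstract does not say which form of the t-statistic or how many permutations the authors used, so the Welch (unequal-variance) form, the 2000 permutations, the function names, and the abundance values below are all assumptions made for illustration.

```python
import random
import statistics


def welch_t(a, b):
    """Two-sample t-statistic in Welch's unequal-variance form (an assumption)."""
    standard_error = (statistics.variance(a) / len(a)
                      + statistics.variance(b) / len(b)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / standard_error


def permutation_p_value(a, b, n_permutations=2000, seed=0):
    """Two-sided p-value obtained by shuffling group labels and recomputing t."""
    rng = random.Random(seed)
    observed = abs(welch_t(a, b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        if abs(welch_t(perm_a, perm_b)) >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Made-up abundance values for one protein in two small groups.
healthy = [1.0, 1.2, 0.9, 1.1, 1.0, 0.8]
cancer = [2.1, 1.9, 2.4, 2.0, 2.2, 1.8]
p_value = permutation_p_value(healthy, cancer)
```

In a proteome-wide screen the same calculation would be repeated per protein, and some form of multiple-testing control would normally follow; the abstract does not specify which one was applied.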
  • Item
    A new approach to construct pathway connected networks and its application in dose responsive gene expression profiles of rat liver regulated by 2,4DNT
    (BMC, 2010-12-01) Chowbina, Sudhir; Deng, Youping; Ai, Junmei; Wu, Xiaogang; Guan, Xin; Wilbanks, Mitchell S.; Escalon, Barbara Lynn; Meyer, Sharon A.; Perkins, Edward J.; Chen, Jake Yue; BioHealth Informatics, School of Informatics and Computing
    Background Military and industrial activities have led to reported release of 2,4-dinitrotoluene (2,4DNT) into soil, groundwater or surface water. It has been reported that 2,4DNT can induce toxic effects on humans and other organisms. However, the mechanism of 2,4DNT induced toxicity is still unclear. Although a series of methods for gene network construction have been developed, few instances of applying such technology to generate pathway connected networks have been reported. Results Microarray analyses were conducted using liver tissue of rats collected 24 h after exposure to a single oral gavage with one of five concentrations of 2,4DNT. We observed a strong dose response of differentially expressed genes after 2,4DNT treatment. The most affected pathways included long-term depression, breast cancer regulation by stathmin1, WNT signaling, and PI3K signaling pathways. In addition, we propose a new approach to construct pathway connected networks regulated by 2,4DNT. We also observed clear dose response pathway networks regulated by 2,4DNT. Conclusions We developed a new method for constructing pathway connected networks. This new method was successfully applied to microarray data from liver tissue of 2,4DNT exposed animals and resulted in the identification of unique dose responsive biomarkers with regard to affected pathways.
  • Item
    "Super Gene Set" Causal Relationship Discovery from Functional Genomics Data
    (IEEE, 2018-11) Yue, Zongliang; Neylon, Michael T.; Nguyen, Thanh; Ratliff, Timothy; Chen, Jake Yue; BioHealth Informatics, School of Informatics and Computing
    In this article, we present a computational framework to identify "causal relationships" among super gene sets. For "causal relationships," we refer to both stimulatory and inhibitory regulatory relationships, whether through direct or indirect mechanisms. For super gene sets, we refer to "pathways, annotated lists, and gene signatures," or PAGs. To identify causal relationships among PAGs, we extend the previous work on identifying PAG-to-PAG regulatory relationships by further requiring them to be significantly enriched with gene-to-gene co-expression pairs across the two PAGs involved. This is achieved by developing a quantitative metric based on PAG-to-PAG Co-expressions (PPC), which we use to infer the likelihood that PAG-to-PAG relationships under examination are causal, either stimulatory or inhibitory. Since true causal relationships are unknown, we approximate the overall performance of inferring causal relationships with the performance of recalling known r-type PAG-to-PAG relationships from causal PAG-to-PAG inference, using a functional genomics benchmark dataset from the GEO database. We report an area-under-curve (AUC) performance of 0.81 for both precision and recall. By applying our framework to a myeloid-derived suppressor cells (MDSC) dataset, we further demonstrate that this framework is effective in helping build multi-scale biomolecular systems models with new insights on regulatory and causal links for downstream biological interpretations.
  • Item
    Graft-Versus-Host Disease-Free Antitumoral Signature After Allogeneic Donor Lymphocyte Injection Identified by Proteomics and Systems Biology
    (American Society of Clinical Oncology, 2019) Liu, Xiaowen; Yue, Zongliang; Cao, Yimou; Taylor, Lauren; Zhang, Qing; Choi, Sung W.; Hanash, Samir; Ito, Sawa; Chen, Jake Yue; Wu, Huanmei; Paczesny, Sophie; Pediatrics, School of Medicine
    PURPOSE: As a tumor immunotherapy, allogeneic hematopoietic cell transplantation with subsequent donor lymphocyte injection (DLI) aims to induce the graft-versus-tumor (GVT) effect but often also leads to acute graft-versus-host disease (GVHD). Plasma tests that can predict the likelihood of GVT without GVHD are still needed. PATIENTS AND METHODS: We first used an intact-protein analysis system to profile the plasma proteome post-DLI of patients who experienced GVT and acute GVHD for comparison with the proteome of patients who experienced GVT without GVHD in a training set. Our novel six-step systems biology analysis involved removing common proteins and GVHD-specific proteins, creating a protein-protein interaction network, calculating relevance and penalty scores, and visualizing candidate biomarkers in gene networks. We then performed a second proteomics experiment in a validation set of patients who experienced GVT without acute GVHD after DLI for comparison with the proteome of patients before DLI. We next combined the two experiments to define a biologically relevant signature of GVT without GVHD. An independent experiment with single-cell profiling in tumor antigen-activated T cells from a patient with post-hematopoietic cell transplantation relapse was performed. RESULTS: The approach provided a list of 46 proteins in the training set, and 30 proteins in the validation set were associated with GVT without GVHD. The combination of the two experiments defined a unique 61-protein signature of GVT without GVHD. Finally, the single-cell profiling in activated T cells found 43 of the 61 genes. Novel markers, such as RPL23, ILF2, CD58, and CRTAM, were identified and could be extended to other antitumoral responses. CONCLUSION: Our multiomic analysis provides, to our knowledge, the first human plasma signature for GVT without GVHD. Risk stratification on the basis of this signature would allow for customized treatment plans.
  • Item
    PAGER 2.0: an update to the pathway, annotated-list and gene-signature electronic repository for Human Network Biology
    (Oxford Academic, 2018-01-04) Yue, Zongliang; Zheng, Qi; Neylon, Michael T.; Yoo, Minjae; Shin, Jimin; Zhao, Zhiying; Tan, Aik Choon; Chen, Jake Yue; BioHealth Informatics, School of Informatics and Computing
    Integrative Gene-set, Network and Pathway Analysis (GNPA) is a powerful data analysis approach developed to help interpret high-throughput omics data. In PAGER 1.0, we demonstrated that researchers can gain unbiased and reproducible biological insights with the introduction of PAGs (Pathways, Annotated-lists and Gene-signatures) as the basic data representation elements. In PAGER 2.0, we improve the utility of integrative GNPA by significantly expanding the coverage of PAGs and PAG-to-PAG relationships in the database, defining a new metric to quantify PAG data qualities, and developing new software features to simplify online integrative GNPA. Specifically, we included 84 282 PAGs spanning 24 different data sources that cover human diseases, published gene-expression signatures, drug-gene, miRNA-gene interactions, pathways and tissue-specific gene expressions. We introduced a new normalized Cohesion Coefficient (nCoCo) score to assess the biological relevance of genes inside a PAG, and RP-score to rank genes and assign gene-specific weights inside a PAG. The companion web interface contains numerous features to help users query and navigate the database content. The database content can be freely downloaded and is compatible with third-party Gene Set Enrichment Analysis tools. We expect PAGER 2.0 to become a major resource in integrative GNPA. PAGER 2.0 is available at http://discovery.informatics.uab.edu/PAGER/.
  • Item
    Proteomic characterization reveals that MMP-3 correlates with bronchiolitis obliterans syndrome following allogeneic hematopoietic cell and lung transplantation
    (Wiley, 2016-08) Liu, Xiaowen; Yue, Zongliang; Yu, Jeffrey; Daguindau, Etienne; Kushekhar, Kushi; Zhang, Qing; Ogata, Yuko; Gafken, Philip R.; Inamoto, Yoshihiro; Gracon, Adam; Wilkes, David S.; Hansen, John A.; Lee, Stephanie J.; Chen, Jake Yue; Paczesny, Sophie; BioHealth Informatics, School of Informatics and Computing
    Improved diagnostic methods are needed for bronchiolitis obliterans syndrome (BOS), a serious complication after allogeneic hematopoietic cell transplantation (HCT) and lung transplantation. For candidate protein discovery, we compared plasma pools from HCT recipients with: BOS at onset (n=12), pulmonary infection (n=16), chronic graft-versus-host disease without pulmonary involvement (n=15), and no chronic complications post-HCT (n=15). Pools were labeled with different tags [isobaric Tags for Relative and Absolute Quantification (iTRAQ)], and two software tools identified differentially expressed proteins (≥1.5-fold change). Candidate proteins were further selected using a six-step computational biology approach. The diagnostic value of the lead candidate, matrix metalloproteinase-3 (MMP-3), was evaluated by ELISA in plasma of a verification cohort (n=112) with and without BOS following HCT (n=76) or lung transplantation (n=36). MMP-3 plasma concentrations differed significantly between patients with and without BOS (AUC=0.77). Thus, MMP-3 represents a potential non-invasive blood test for diagnosis of BOS.
  • Item
    HAPPI-2: a Comprehensive and High-quality Map of Human Annotated and Predicted Protein Interactions
    (BioMed Central, 2017-02-17) Chen, Jake Yue; Pandey, Ragini; Nguyen, Thanh M.; Department of Biohealth Informatics, School of Informatics and Computing
    BACKGROUND: Human protein-protein interaction (PPI) data is essential to network and systems biology studies. PPI data can help biochemists hypothesize how proteins form complexes by binding to each other, how extracellular signals propagate through post-translational modification of de-activated signaling molecules, and how chemical reactions are coupled by enzymes involved in a complex biological process. Our capability to develop good public database resources for human PPI data has a direct impact on the quality of future research on genome biology and medicine. RESULTS: The database of Human Annotated and Predicted Protein Interactions (HAPPI) version 2.0 is a major update to the original HAPPI 1.0 database. It contains 2,922,202 unique protein-protein interactions (PPI) linked by 23,060 human proteins, making it the most comprehensive database covering human PPI data today. These PPIs contain both physical/direct interactions and high-quality functional/indirect interactions. Compared with the HAPPI 1.0 database release, HAPPI database version 2.0 (HAPPI-2) represents a 485% increase in human PPI data coverage and a 73% increase in protein coverage. The revamped HAPPI web portal provides users with a friendly search, curation, and data retrieval interface, allowing them to retrieve human PPIs and available annotation information on the interaction type, interaction quality, interacting partner drug targeting data, and disease information. The updated HAPPI-2 can be freely accessed by academic users at http://discovery.informatics.uab.edu/HAPPI . CONCLUSIONS: While the underlying data for HAPPI-2 are integrated from diverse data sources, the new HAPPI-2 release represents a good balance between data coverage and data quality of human PPIs, making it ideally suited for network biology.
  • Item
    A method for identifying discriminative isoform-specific peptides for clinical proteomics application
    (BioMed Central, 2016-08-22) Zhang, Fan; Chen, Jake Yue; Department of Biohealth Informatics, IU School of Informatics and Computing
    BACKGROUND: Clinical proteomics application aims at solving a specific clinical problem within the context of a clinical study. It has been growing rapidly in the field of biomarker discovery, especially in the area of cancer diagnostics. Until recently, protein isoforms have not been viewed as a new class of early diagnostic biomarkers for clinical proteomics. A protein isoform is one of several different forms of the same protein. Different forms of a protein may be produced from single-nucleotide polymorphisms (SNPs), alternative splicing, or post-translational modifications (PTMs). Previous studies have shown that protein isoforms play critical roles in tumorigenesis, disease diagnosis, and prognosis. Identifying and characterizing protein isoforms are essential to the study of molecular mechanisms and early detection of complex diseases such as breast cancer. However, traditional methods used for protein isoform determination, such as EST sequencing, microarray profiling (exon array, exon-exon junction array), and mRNA next-generation sequencing, have limitations: 1) they do not operate at the protein level, 2) they provide no information on the connectivity of nonadjacent exons, 3) they do not capture SNPs and PTMs, and 4) they have low reproducibility. Moreover, clinical proteomics studies face computational challenges: 1) low sensitivity of instruments, 2) high data noise, and 3) high variability and low repeatability, even though recent advances in clinical proteomics technology, namely LC-MS/MS proteomics, have been used to identify candidate molecular biomarkers in a diverse range of samples, including cells, tissues, serum/plasma, and other types of body fluids. RESULTS: Therefore, in the paper, we presented a peptidomics method for identifying cancer-related and isoform-specific peptides for clinical proteomics application from LC-MS/MS.
First, we built a Peptidomic Database of Human Protein Isoforms, then created a peptidomics approach to perform a large-scale screen of breast cancer-associated alternative splicing isoform markers in clinical proteomics, and lastly performed four kinds of validations: biological validation (explainable index), exon array, statistical validation of independent samples, and extensive pathway analysis. CONCLUSIONS: Our results showed that alternative splicing isoform markers can act as independent markers of breast cancer and that the method for identifying cancer-specific protein isoform biomarkers from clinical proteomics application is an effective one for increasing the number of identified alternative splicing isoform markers in clinical proteomics.