Browsing by Author "Kou, Qiang"
Item Characterization of proteoforms with unknown post-translational modifications using the MIScore (ACS, 2016)
Kou, Qiang; Zhu, Binhai; Wu, Si; Ansong, Charles; Tolić, Nikola; Paša-Tolić, Ljiljana; Liu, Xiaowen; Department of Biohealth Informatics, School of Informatics and Computing
Various proteoforms may be generated from a single gene due to primary structure alterations (PSAs) such as genetic variations, alternative splicing, and post-translational modifications (PTMs). Top-down mass spectrometry is capable of analyzing intact proteins and identifying patterns of multiple PSAs, making it the method of choice for studying complex proteoforms. In top-down proteomics, proteoform identification is often performed by searching tandem mass spectra against a protein sequence database that contains only one reference protein sequence for each gene or transcript variant in a proteome. Because of the incompleteness of the protein database, an identified proteoform may contain unknown PSAs compared with the reference sequence. Proteoform characterization is to identify and localize PSAs in a proteoform. Although many software tools have been proposed for proteoform identification by top-down mass spectrometry, the characterization of proteoforms in identified proteoform–spectrum matches still relies mainly on manual annotation. We propose to use the Modification Identification Score (MIScore), which is based on Bayesian models, to automatically identify and localize PTMs in proteoforms. Experiments showed that the MIScore is accurate in identifying and localizing one or two modifications.

Item Complex Proteoform Identification Using Top-Down Mass Spectrometry (2018-12)
Kou, Qiang; Wu, Huanmei; Liu, Xiaowen; Liu, Yunlong; Al Hasan, Mohammad
Proteoforms are distinct protein molecule forms created by variations in genes, gene expression, and other biological processes.
Many proteoforms contain multiple primary structural alterations, including amino acid substitutions, terminal truncations, and posttranslational modifications. These primary structural alterations play a crucial role in determining protein functions: proteoforms from the same protein with different alterations may exhibit different functional behaviors. Because top-down mass spectrometry directly analyzes intact proteoforms and provides complete sequence information of proteoforms, it has become the method of choice for the identification of complex proteoforms. Although instruments and experimental protocols for top-down mass spectrometry have been advancing rapidly in the past several years, many computational problems in this area remain unsolved, and the development of software tools for analyzing such data is still at its very early stage. In this dissertation, we propose several novel algorithms for challenging computational problems in proteoform identification by top-down mass spectrometry. First, we present two approximate spectrum-based protein sequence filtering algorithms that quickly find a small number of candidate proteins from a large proteome database for a query mass spectrum. Second, we describe mass graph-based alignment algorithms that efficiently identify proteoforms with variable post-translational modifications and/or terminal truncations. Third, we propose a Markov chain Monte Carlo method for estimating the statistical significance of identified proteoform spectrum matches. They are the first efficient algorithms that take into account three types of alterations: variable post-translational modifications, unexpected alterations, and terminal truncations in proteoform identification. As a result, they are more sensitive and powerful than other existing methods that consider only one or two of the three types of alterations.
All the proposed algorithms have been incorporated into TopMG, a complete software pipeline for complex proteoform identification. Experimental results showed that TopMG significantly increases the number of identifications compared with other existing methods in proteome-level top-down mass spectrometry studies. TopMG will facilitate the applications of top-down mass spectrometry in many areas, such as the identification and quantification of clinically relevant proteoforms and the discovery of new proteoform biomarkers.

Item Deep Top-Down Proteomics Using Capillary Zone Electrophoresis-Tandem Mass Spectrometry: Identification of 5700 Proteoforms from the Escherichia coli Proteome (American Chemical Society, 2018-05-01)
McCool, Elijah N.; Lubeckyj, Rachele A.; Shen, Xiaojing; Chen, Daoyang; Kou, Qiang; Liu, Xiaowen; Sun, Liangliang; BioHealth Informatics, School of Informatics and Computing
Capillary zone electrophoresis (CZE)-tandem mass spectrometry (MS/MS) has been recognized as a useful tool for top-down proteomics. However, its performance for deep top-down proteomics is still dramatically lower than widely used reversed-phase liquid chromatography (RPLC)-MS/MS. We present an orthogonal multidimensional separation platform that couples size exclusion chromatography (SEC) and RPLC based protein prefractionation to CZE-MS/MS for deep top-down proteomics of Escherichia coli. The platform generated high peak capacity (∼4000) for separation of intact proteins, leading to the identification of 5700 proteoforms from the Escherichia coli proteome. The data represents a 10-fold improvement in the number of proteoform identifications compared with previous CZE-MS/MS studies and represents the largest bacterial top-down proteomics data set reported to date.
The performance of the CZE-MS/MS based platform is comparable to the state-of-the-art RPLC-MS/MS based systems in terms of the number of proteoform identifications and the instrument time.

Item Evaluation of top-down mass spectral identification with homologous protein sequences (Biomed Central, 2018-12-28)
Li, Ziwei; He, Bo; Kou, Qiang; Wang, Zhe; Wu, Si; Liu, Yunlong; Feng, Weixing; Liu, Xiaowen; Medical and Molecular Genetics, School of Medicine
BACKGROUND: Top-down mass spectrometry has unique advantages in identifying proteoforms with multiple post-translational modifications and/or unknown alterations. Most software tools in this area search top-down mass spectra against a protein sequence database for proteoform identification. When the species studied in a mass spectrometry experiment lacks its proteome sequence database, a homologous protein sequence database can be used for proteoform identification. The accuracy of homologous protein sequences affects the sensitivity of proteoform identification and the accuracy of mass shift localization. RESULTS: We tested TopPIC, a commonly used software tool for top-down mass spectral identification, on a top-down mass spectrometry data set of Escherichia coli K12 MG1655, and evaluated its performance using an Escherichia coli K12 MG1655 proteome database and a homologous protein database. The number of identified spectra with the homologous database was about half of that with the Escherichia coli K12 MG1655 database. We also tested TopPIC on a top-down mass spectrometry data set of human MCF-7 cells and obtained similar results.
CONCLUSIONS: Experimental results demonstrated that TopPIC is capable of identifying many proteoform spectrum matches and localizing unknown alterations using homologous protein sequences containing no more than 2 mutations.

Item Large-scale Top-down Proteomics Using Capillary Zone Electrophoresis Tandem Mass Spectrometry (MyJove Corporation, 2018-10-24)
McCool, Elijah N.; Lubeckyj, Rachele; Shen, Xiaojing; Kou, Qiang; Liu, Xiaowen; Sun, Liangliang; Computer and Information Science, School of Science
Capillary zone electrophoresis-electrospray ionization-tandem mass spectrometry (CZE-ESI-MS/MS) has been recognized as a useful tool for top-down proteomics that aims to characterize proteoforms in complex proteomes. However, the application of CZE-MS/MS for large-scale top-down proteomics has been impeded by the low sample-loading capacity and narrow separation window of CZE. Here, a protocol is described using CZE-MS/MS with a microliter-scale sample-loading volume and a 90-min separation window for large-scale top-down proteomics. The CZE-MS/MS platform is based on a linear polyacrylamide (LPA)-coated separation capillary with extremely low electroosmotic flow, a dynamic pH-junction-based online sample concentration method with a high efficiency for protein stacking, an electro-kinetically pumped sheath flow CE-MS interface with extremely high sensitivity, and an ion trap mass spectrometer with high mass resolution and scan speed. The platform can be used for the high-resolution characterization of simple intact protein samples and the large-scale characterization of proteoforms in various complex proteomes. As an example, a highly efficient separation of a standard protein mixture and a highly sensitive detection of many impurities using the platform is demonstrated.
As another example, this platform can produce over 500 proteoform and 190 protein identifications from an Escherichia coli proteome in a single CZE-MS/MS run.

Item A Markov chain Monte Carlo method for estimating the statistical significance of proteoform identifications by top-down mass spectrometry (ACS, 2019-03)
Kou, Qiang; Wang, Zhe; Lubeckyj, Rachele A.; Wu, Si; Liu, Xiaowen; BioHealth Informatics, School of Informatics and Computing
Top-down mass spectrometry is capable of identifying whole proteoform sequences with multiple post-translational modifications because it generates tandem mass spectra directly from intact proteoforms. Many software tools, such as ProSightPC, MSPathFinder, and TopMG, have been proposed for identifying proteoforms with modifications. In these tools, various methods are employed to estimate the statistical significance of identifications. However, most existing methods are designed for proteoform identifications without modifications, and the challenge remains for accurately estimating the statistical significance of proteoform identifications with modifications. Here we propose TopMCMC, a method that combines a Markov chain random walk algorithm and a greedy algorithm for assigning statistical significance to matches between spectra and protein sequences with variable modifications. Experimental results showed that TopMCMC achieved high accuracy in estimating E-values and false discovery rates of identifications in top-down mass spectrometry.
Coupled with TopMG, TopMCMC identified more spectra than the generating function method from an MCF-7 top-down mass spectrometry data set.

Item A mass graph-based approach for the identification of modified proteoforms using top-down tandem mass spectra (Oxford, 2017-05-01)
Kou, Qiang; Wu, Si; Tolić, Nikola; Paša-Tolić, Ljiljana; Liu, Yunlong; Liu, Xiaowen; BioHealth Informatics, School of Informatics and Computing
Motivation: Although proteomics has rapidly developed in the past decade, researchers are still in the early stage of exploring the world of complex proteoforms, which are protein products with various primary structure alterations resulting from gene mutations, alternative splicing, post-translational modifications, and other biological processes. Proteoform identification is essential to mapping proteoforms to their biological functions as well as discovering novel proteoforms and new protein functions. Top-down mass spectrometry is the method of choice for identifying complex proteoforms because it provides a 'bird's eye view' of intact proteoforms. The combinatorial explosion of various alterations on a protein may result in billions of possible proteoforms, making proteoform identification a challenging computational problem. Results: We propose a new data structure, called the mass graph, for efficient representation of proteoforms and design mass graph alignment algorithms. We developed TopMG, a mass graph-based software tool for proteoform identification by top-down mass spectrometry.
Experiments on top-down mass spectrometry datasets showed that TopMG outperformed existing methods in identifying complex proteoforms.

Item Mass graphs and their applications in top-down proteomics (2015)
Kou, Qiang; Wu, Si; Tolić, Nikola; Paša-Tolić, Ljiljana; Liu, Xiaowen; Department of Biohealth Informatics, School of Informatics and Computing
Although proteomics has made rapid progress in the past decade, researchers are still in the early stage of exploring the world of complex proteoforms, which are protein products with various primary structure alterations resulting from gene mutations, alternative splicing, post-translational modifications, and other biological processes. Proteoform identification is essential to mapping proteoforms to their biological functions as well as discovering novel proteoforms and new protein functions. Top-down mass spectrometry is the method of choice for identifying complex proteoforms because it provides a "bird view" of intact proteoforms. The combinatorial explosion of possible proteoforms, which may result in billions of possible proteoforms for one protein, makes proteoform identification a challenging computational problem. Here we propose a new data structure, called the mass graph, for efficiently representing proteoforms. In addition, we design mass graph alignment algorithms for proteoform identification by top-down mass spectrometry. Experiments on a histone H4 mass spectrometry data set showed that the proposed methods outperformed MS-Align-E in identifying complex proteoforms.

Item A new scoring function for top-down spectral deconvolution (Springer (Biomed Central Ltd.), 2014)
Kou, Qiang; Wu, Si; Liu, Xiaowen; Department of BioHealth Informatics, School of Informatics and Computing
BACKGROUND: Top-down mass spectrometry plays an important role in intact protein identification and characterization.
Top-down mass spectra are more complex than bottom-up mass spectra because they often contain many isotopomer envelopes from highly charged ions, which may overlap with one another. As a result, spectral deconvolution, which converts a complex top-down mass spectrum into a monoisotopic mass list, is a key step in top-down spectral interpretation. RESULTS: In this paper, we propose a new scoring function, L-score, for evaluating isotopomer envelopes. By combining L-score with MS-Deconv, a new software tool, MS-Deconv+, was developed for top-down spectral deconvolution. Experimental results showed that MS-Deconv+ outperformed existing software tools in top-down spectral deconvolution. CONCLUSIONS: L-score shows high discriminative ability in identification of isotopomer envelopes. Using L-score, MS-Deconv+ reports many correct monoisotopic masses missed by other software tools, which are valuable for proteoform identification and characterization.

Item Quantitative Top-Down Proteomics in Complex Samples Using Protein-Level Tandem Mass Tag Labeling (American Chemical Society, 2021-06-02)
Yu, Dahang; Wang, Zhe; Cupp-Sutton, Kellye A.; Guo, Yanting; Kou, Qiang; Smith, Kenneth; Liu, Xiaowen; Wu, Si; BioHealth Informatics, School of Informatics and Computing
Labeling approaches using isobaric chemical tags (e.g., isobaric tagging for relative and absolute quantification, iTRAQ and tandem mass tag, TMT) have been widely applied for the quantification of peptides and proteins in bottom-up MS. However, until recently, successful applications of these approaches to top-down proteomics have been limited because proteins tend to precipitate and “crash” out of solution during TMT labeling of complex samples making the quantification of such samples difficult. In this study, we report a top-down TMT MS platform for confidently identifying and quantifying low molecular weight intact proteoforms in complex biological samples.
To reduce the sample complexity and remove large proteins from complex samples, we developed a filter-SEC technique that combines a molecular weight cutoff filtration step with high-performance size exclusion chromatography (SEC) separation. No protein precipitation was observed in filtered samples under the intact protein-level TMT labeling conditions. The proposed top-down TMT MS platform enables high-throughput analysis of intact proteoforms, allowing for the identification and quantification of hundreds of intact proteoforms from Escherichia coli cell lysates. To our knowledge, this represents the first high-throughput TMT labeling-based, quantitative, top-down MS analysis suitable for complex biological samples.