Spectral Deconvolution, Feature Detection, and Proteoform Identification for Top-Down Proteomics
Abstract
Liquid chromatography-based mass spectrometry (LC-MS) is widely used for proteoform identification, characterization, and quantification. Bottom-up proteomics analyzes enzymatically digested peptides, whereas top-down proteomics examines intact proteoforms, enabling comprehensive identification of proteoforms that carry post-translational modifications (PTMs), genetic mutations, and alternative splicing events. In MS data, because elements occur as different isotopes, molecules of a protein with the same chemical composition and charge state produce a group of peaks with different mass-to-charge ratios (m/z), called an isotopic envelope. A top-down mass spectrum often contains hundreds of envelopes from high charge states, some of which overlap. This spectral complexity makes the analysis of top-down MS data computationally challenging. This dissertation introduces three new software tools, EnvCNN, TopFD, and TopDIA, for enhancing proteoform identification, characterization, and quantification in top-down MS data analysis. EnvCNN is a deep-learning model for evaluating isotopic envelopes of proteoforms and their fragments. The model aims to improve the accuracy of reported fragments, thus increasing the number of identified proteoforms and improving the reliability of proteoform identification and characterization. TopFD is a software tool for proteoform feature detection that groups all peaks of a proteoform in an LC-MS map into a single feature. TopFD outperforms existing tools in the accuracy and reproducibility of feature detection, thereby improving proteoform identification and quantification. TopDIA is the first software tool for proteoform identification by top-down data-independent acquisition MS (TD-DIA-MS).
Unlike conventional top-down data-dependent acquisition MS (TD-DDA-MS), which relies on intensity-based proteoform selection to generate fragment mass spectra, TD-DIA-MS fragments all proteoforms within predefined isolation windows, generating fragment mass spectra for every proteoform. TopDIA processes TD-DIA-MS data to generate demultiplexed pseudo spectra, which are searched against a protein database for proteoform identification, leading to a significant increase in the number of identified proteoforms compared with TD-DDA-MS. In summary, these new software tools help advance proteomics research by increasing the accuracy and comprehensiveness of proteoform analysis by top-down MS.
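The isotopic envelope mentioned in the abstract can be illustrated with a short calculation. The Python sketch below is not taken from EnvCNN, TopFD, or TopDIA; the function names are hypothetical, and it assumes that neighboring isotopic peaks differ in mass by roughly the 13C-12C mass difference, which is only an approximation for real proteoforms. It lists the theoretical m/z positions of the first few peaks of an envelope for a given monoisotopic mass and charge state.

```python
# Minimal sketch of isotopic envelope peak positions (illustrative only).
# Assumption: adjacent isotopic peaks differ by about the 13C-12C mass
# difference; real spacing varies slightly with elemental composition.

PROTON_MASS = 1.007276      # Da, mass of the charge-carrying proton
ISOTOPE_SPACING = 1.003355  # Da, approximate mass gap between isotopic peaks


def envelope_mz(monoisotopic_mass, charge, num_peaks=5):
    """Return theoretical m/z values for the first `num_peaks` isotopic peaks.

    `monoisotopic_mass` is the neutral monoisotopic mass in Da and `charge`
    is the positive charge state (number of added protons).
    """
    if charge <= 0:
        raise ValueError("charge must be a positive integer")
    return [
        (monoisotopic_mass + k * ISOTOPE_SPACING + charge * PROTON_MASS) / charge
        for k in range(num_peaks)
    ]


def peak_spacing(charge):
    """Return the m/z gap between neighboring peaks of one envelope."""
    if charge <= 0:
        raise ValueError("charge must be a positive integer")
    return ISOTOPE_SPACING / charge


if __name__ == "__main__":
    # Hypothetical 10 kDa proteoform observed at two charge states.
    for z in (5, 10):
        peaks = envelope_mz(10000.0, z, num_peaks=3)
        print(z, [round(mz, 4) for mz in peaks], round(peak_spacing(z), 4))
```

Because the gap between neighboring peaks shrinks as 1/z, envelopes from high charge states have closely spaced peaks, which is one reason overlapping envelopes in top-down spectra are hard to separate.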