Browsing by Author "Zhao, Huiying"
Now showing 1 - 8 of 8
Item Charting the Unexplored RNA-binding Protein Atlas of the Human Genome (Office of the Vice Chancellor for Research, 2012-04-13)
Zhao, Huiying; Yang, Yuedong; Janga, Sarath Chandra; Chen, Jason; Zhu, Heng; Kao, Cheng; Zhou, Yaoqi
Detecting protein-RNA interactions is challenging, both experimentally and computationally, because RNAs are large in number, diverse in cellular location and function, and flexible in structure. As a result, many RNA-binding proteins (RBPs) remain to be identified and characterized. Recently, we developed a bioinformatics tool called SPOT-Seq that integrates template-based structure prediction with RNA-binding affinity prediction to predict RBPs. Applying SPOT-Seq to the human genome doubles the number of RBPs from 2115 to 4296. Half of the novel (>2000) RBPs are poorly annotated or not annotated at all. The other half possess Gene Ontology leaf IDs that are associated with known RBPs. In particular, we identified 36 novel RBPs in cancer, cardiovascular, diabetes and neurodegenerative pathways and 26 novel RBPs associated with disease-causing SNPs. Half of these disease-associated, predicted novel RBPs are annotated to interact with known RBPs. The accuracy of the predicted novel RBPs is further supported by the same confirmation rate for novel and annotated RBPs in human proteome microarray experiments. The large number of predicted novel RBPs and their abundance in disease pathways and disease-causing SNPs are useful for hypothesis generation.
These predicted novel human RBPs (>2000), with confidence levels and their predicted complex structures with RNA, can be downloaded from http://sparks.informatics.iupui.edu (yqzhou@iupui.edu).

Item Discriminating between disease-causing and neutral non-frameshifting micro-INDELs by support vector machines by means of integrated sequence- and structure-based features (Office of the Vice Chancellor for Research, 2013-04-05)
Zhao, Huiying; Yang, Yuedong; Lin, Hai; Zhang, Xinjun; Mort, Matthew; Cooper, David N.; Liu, Yunlong; Zhou, Yaoqi
Micro-INDELs (insertions or deletions of ≤20 bp) constitute the second most frequent class of human gene mutation after single nucleotide variants. A significant portion of exonic INDELs are non-frameshifting (NFS), serving to insert or delete a discrete number of amino-acid residues. Despite the relative abundance of NFS-INDELs, their damaging effect on protein structure and function has gone largely unstudied, whilst bioinformatics tools for discriminating between disease-causing and neutral NFS-INDELs remain to be developed. We have developed such a technique (DDIG-in; Detecting DIsease-causing Genetic variations due to INDELs) by comparing the properties of disease-causing NFS-INDELs from the Human Gene Mutation Database (HGMD) with putatively neutral NFS-INDELs from the 1000 Genomes Project. Having considered 58 different sequence- and structure-based features, we found that predicted disordered regions around the NFS-INDEL region had the highest discriminative capability (disease versus neutral) with an Area Under the receiver-operating characteristic Curve (AUC) of 0.82 and a Matthews Correlation Coefficient (MCC) of 0.56. All features studied were combined by support vector machines (SVM) and selected by a greedy algorithm. The resulting SVM models were trained and tested by ten-fold cross-validation on the microdeletion dataset and independently tested on the microinsertion dataset, and vice versa.
The final SVM model for determining NFS-INDEL disease-causing probability was built on non-redundant datasets with a protein sequence identity cutoff of 35% and yielded an MCC value of 0.68, an accuracy of 84% and an AUC of 0.89. Predicted disease-causing probabilities exhibited a strong negative correlation with the average minor allele frequency (correlation coefficient, -0.84). DDIG-in, available at http://sparks.informatics.iupui.edu, can be used to estimate the disease-causing probability for a given NFS-INDEL.

Item Impact of human pathogenic micro-insertions and micro-deletions on post-transcriptional regulation (Oxford University Press, 2014-06-01)
Zhang, Xinjun; Lin, Hai; Zhao, Huiying; Hao, Yangyang; Mort, Matthew; Cooper, David N.; Zhou, Yaoqi; Liu, Yunlong
Department of Medical & Molecular Genetics, IU School of Medicine
Small insertions/deletions (INDELs) of ≤21 bp comprise 18% of all recorded mutations causing human inherited disease and are evident in 24% of documented Mendelian diseases. INDELs affect gene function in multiple ways: for example, by introducing premature stop codons that either lead to the production of truncated proteins or affect transcriptional efficiency. However, the means by which they impact post-transcriptional regulation, including alternative splicing, have not been fully evaluated. In this study, we collate disease-causing INDELs from the Human Gene Mutation Database (HGMD) and neutral INDELs from the 1000 Genomes Project. The potential of these two types of INDELs to affect binding-site affinity of RNA-binding proteins (RBPs) was then evaluated. We identified several sequence features that can distinguish disease-causing INDELs from neutral INDELs. Moreover, we built a machine-learning predictor called PinPor (predicting pathogenic small insertions and deletions affecting post-transcriptional regulation, http://watson.compbio.iupui.edu/pinpor/) to ascertain which newly observed INDELs are likely to be pathogenic.
Our results show that disease-causing INDELs are more likely to ablate RBP-binding sites and tend to affect more RBP-binding sites than neutral INDELs. Additionally, disease-causing INDELs give rise to greater deviations in binding affinity than neutral INDELs. We also demonstrated that disease-causing INDELs may be distinguished from neutral INDELs by several sequence features, such as their proximity to splice sites and their potential effects on RNA secondary structure. The PinPor predictor showed satisfactory performance in identifying numerous pathogenic INDELs, with a Matthews correlation coefficient (MCC) value of 0.51 and an accuracy of 0.75.

Item Prediction and validation of the unexplored RNA-binding protein atlas of the human proteome (2014-04)
Zhao, Huiying; Yang, Yuedong; Janga, Sarath Chandra; Kao, C. Cheng; Zhou, Yaoqi
Detecting protein-RNA interactions is challenging both experimentally and computationally because RNAs are large in number, diverse in cellular location and function, and flexible in structure. As a result, many RNA-binding proteins (RBPs) remain to be identified. Here, SPOT-Seq, a template-based function-prediction technique for RBPs, is applied to the human proteome, and its results are validated against a recent proteomic experimental discovery of 860 mRNA-binding proteins (mRBPs). The coverage (or sensitivity) is 42.6% for 1217 known RBPs annotated in the Gene Ontology and 43.6% for 860 newly discovered human mRBPs. This consistent sensitivity indicates the robust performance of SPOT-Seq for predicting RBPs. More importantly, SPOT-Seq detects 2418 novel RBPs in the human proteome, 291 of which were validated by the newly discovered mRBP set. Among the 291 validated novel RBPs, 61 are not homologous to any known RBPs. Successful validation of the predicted novel RBPs permits further analysis of their phenotypic roles in disease pathways.
The dataset of 2418 predicted novel RBPs, along with confidence levels and complex structures, is available at http://sparks-lab.org (in publications) for experimental confirmation and hypothesis generation.

Item Prediction and validation of the unexplored RNA-binding protein atlas of the human proteome (Wiley, 2014-04)
Zhao, Huiying; Yang, Yuedong; Janga, Sarath Chandra; Kao, C. Cheng; Zhou, Yaoqi
Department of Medicine, IU School of Medicine
Detecting protein-RNA interactions is challenging both experimentally and computationally because RNAs are large in number, diverse in cellular location and function, and flexible in structure. As a result, many RNA-binding proteins (RBPs) remain to be identified. Here, SPOT-Seq, a template-based function-prediction technique for RBPs, is applied to the human proteome, and its results are validated against a recent proteomic experimental discovery of 860 mRNA-binding proteins (mRBPs). The coverage (or sensitivity) is 42.6% for 1217 known RBPs annotated in the Gene Ontology and 43.6% for 860 newly discovered human mRBPs. This consistent sensitivity indicates the robust performance of SPOT-Seq for predicting RBPs. More importantly, SPOT-Seq detects 2418 novel RBPs in the human proteome, 291 of which were validated by the newly discovered mRBP set. Among the 291 validated novel RBPs, 61 are not homologous to any known RBPs. Successful validation of the predicted novel RBPs permits further analysis of their phenotypic roles in disease pathways. The dataset of 2418 predicted novel RBPs, along with confidence levels and complex structures, is available at http://sparks-lab.org (in publications) for experimental confirmation and hypothesis generation.

Item Protein function prediction by integrating sequence, structure and binding affinity information (2014-02-03)
Zhao, Huiying; Zhou, Yaoqi; Liu, Yunlong; Meroueh, Samy; Janga, Sarath Chandra
Proteins are nano-machines that work inside every living organism.
Functional disruption of one or several proteins is the cause of many diseases. However, the functions of most proteins are yet to be annotated, because inexpensive sequencing techniques have dramatically sped up the discovery of new protein sequences (265 million and counting) and experimental examination of every protein in all its possible functional categories is simply impractical. Thus, it is necessary to develop computational function-prediction tools that complement and guide experimental studies. In this study, we developed a series of predictors for highly accurate prediction of proteins with DNA-binding, RNA-binding and carbohydrate-binding capability. These predictors use a template-based technique that combines sequence and structural information with predicted binding affinity. Both sequence-based and structure-based approaches were developed. Results indicate the importance of binding affinity prediction for improving the sensitivity and precision of function prediction. Application of these methods to the human genome and structure genome targets demonstrated their usefulness in annotating proteins of unknown function and discovering moonlighting proteins with DNA-, RNA-, or carbohydrate-binding function. In addition, we investigated the disruption of protein functions by naturally occurring genetic variations due to insertions and deletions (INDELs). We found that protein structures are the most critical features in recognising disease-causing non-frameshifting INDELs.
The predictors for function predictions are available at http://sparks-lab.org/spot, and the predictor for classification of non-frameshifting INDELs is available at http://sparks-lab.org/ddig.

Item The Role of Semidisorder in Temperature Adaptation of Bacterial FlgM Proteins (Elsevier B.V., 2013-12-03)
Wang, Jihua; Yang, Yuedong; Cao, Zanxia; Li, Zhixiu; Zhao, Huiying; Zhou, Yaoqi
Department of Biochemistry & Molecular Biology, IU School of Medicine
Probabilities of disorder for FlgM proteins of 39 species whose optimal growth temperature ranges from 273 K (0°C) to 368 K (95°C) were predicted by a newly developed method called Sequence-based Prediction with Integrated NEural networks for Disorder (SPINE-D). We showed that the temperature-dependent behavior of FlgM proteins could be separated into two subgroups according to their sequence lengths. Only shorter sequences evolved to adapt to high temperatures (>318 K or 45°C). Their ability to adapt to high temperatures was achieved through a transition from a fully disordered state with little secondary structure to a semidisordered state with high predicted helical probability at the N-terminal region. The predicted results are consistent with available experimental data. An analysis of all orthologous protein families in 39 species suggests that such a transition from a fully disordered state to semidisordered and/or ordered states is one of the strategies employed by nature for adaptation to high temperatures.

Item SPOT-Seq-RNA: Predicting protein-RNA complex structure and RNA-binding function by fold recognition and binding affinity prediction (Springer, 2014)
Yang, Yuedong; Zhao, Huiying; Wang, Jihua; Zhou, Yaoqi
Department of BioHealth Informatics
RNA-binding proteins (RBPs) play key roles in RNA metabolism and post-transcriptional regulation.
Computational methods have been developed separately for prediction of RBPs and RNA-binding residues by machine-learning techniques and for prediction of protein-RNA complex structures by rigid or semiflexible structure-to-structure docking. Here, we describe a template-based technique called SPOT-Seq-RNA that integrates prediction of RBPs, RNA-binding residues, and protein-RNA complex structures into a single package. This integration is achieved by combining template-based structure-prediction software, SPARKS X, with binding affinity prediction software, DRNA. This tool yields reasonable sensitivity (46%) and high precision (84%) for an independent test set of 215 RBPs and 5,766 non-RBPs. SPOT-Seq-RNA is computationally efficient for genome-scale prediction of RBPs and protein-RNA complex structures. Its application to the human genome has revealed a similar sensitivity and an ability to uncover hundreds of novel RBPs beyond simple homology. The online server and downloadable version of SPOT-Seq-RNA are available at http://sparks-lab.org/server/SPOT-Seq-RNA/.
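
Several of the abstracts in this listing report sensitivity, precision, accuracy and the Matthews correlation coefficient (MCC). As a reading aid, the sketch below shows how these metrics relate to confusion-matrix counts, using the SPOT-Seq-RNA test set (215 RBPs, 5,766 non-RBPs, 46% sensitivity, 84% precision) as input. The function names are hypothetical, and the counts are back-calculated from the rounded percentages, so they are approximations (about 99 true positives and 19 false positives) rather than figures taken from the paper.

```python
from math import sqrt


def confusion_from_rates(n_pos, n_neg, sensitivity, precision):
    """Back-calculate approximate (tp, fp, tn, fn) from reported rates.

    Assumption: the reported percentages are rounded, so the counts
    returned here are estimates, not the study's actual counts.
    """
    tp = round(sensitivity * n_pos)               # sensitivity = tp / (tp + fn)
    fn = n_pos - tp
    fp = round(tp * (1 - precision) / precision)  # precision = tp / (tp + fp)
    tn = n_neg - fp
    return tp, fp, tn, fn


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def accuracy(tp, fp, tn, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fp + tn + fn)


if __name__ == "__main__":
    counts = confusion_from_rates(215, 5766, 0.46, 0.84)
    print("tp, fp, tn, fn =", counts)
    print("MCC      =", round(mcc(*counts), 3))
    print("accuracy =", round(accuracy(*counts), 3))
```

Because non-RBPs vastly outnumber RBPs in this test set, accuracy is dominated by true negatives; this is one reason class-balance-aware measures such as MCC are reported alongside accuracy in the abstracts above.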