IU Indianapolis ScholarWorks :: Browsing by Author "Zhou, Yaoqi"

Browsing by Author "Zhou, Yaoqi"

Now showing 1 - 10 of 31

Accurate single-sequence prediction of solvent accessible surface area using local and global features
(Wiley Blackwell (John Wiley & Sons), 2014-11) Faraggi, Eshel; Zhou, Yaoqi; Kloczkowski, Andrzej; Department of Biochemistry & Molecular Biology, IU School of Medicine
We present a new approach for predicting the Accessible Surface Area (ASA) using a General Neural Network (GENN). The novelty of the new approach lies in not using residue mutation profiles generated by multiple sequence alignments as descriptive inputs. Instead we use solely sequential window information and global features such as single-residue and two-residue compositions of the chain. The resulting predictor is both highly more efficient than sequence alignment-based predictors and of comparable accuracy to them. Introduction of the global inputs significantly helps achieve this comparable accuracy. The predictor, termed ASAquick, is tested on predicting the ASA of globular proteins and found to perform similarly well for so-called easy and hard cases indicating generalizability and possible usability for de-novo protein structure prediction. The source code and a Linux executables for GENN and ASAquick are available from Research and Information Systems at http://mamiris.com, from the SPARKS Lab at http://sparks-lab.org, and from the Battelle Center for Mathematical Medicine at http://mathmed.org.
BEST: Improved Prediction of B-Cell Epitopes from Antigen Sequences
(Public Library of Science, 2012) Gao, Jianzhao; Faraggi, Eshel; Zhou, Yaoqi; Ruan, Jishou; Kurgan, Lukasz; Physics, School of Science
Accurate identification of immunogenic regions in a given antigen chain is a difficult and actively pursued problem. Although accurate predictors for T-cell epitopes are already in place, the prediction of the B-cell epitopes requires further research. We overview the available approaches for the prediction of B-cell epitopes and propose a novel and accurate sequence-based solution. Our BEST (B-cell Epitope prediction using Support vector machine Tool) method predicts epitopes from antigen sequences, in contrast to some method that predict only from short sequence fragments, using a new architecture based on averaging selected scores generated from sliding 20-mers by a Support Vector Machine (SVM). The SVM predictor utilizes a comprehensive and custom designed set of inputs generated by combining information derived from the chain, sequence conservation, similarity to known (training) epitopes, and predicted secondary structure and relative solvent accessibility. Empirical evaluation on benchmark datasets demonstrates that BEST outperforms several modern sequence-based B-cell epitope predictors including ABCPred, method by Chen et al. (2007), BCPred, COBEpro, BayesB, and CBTOPE, when considering the predictions from antigen chains and from the chain fragments. Our method obtains a cross-validated area under the receiver operating characteristic curve (AUC) for the fragment-based prediction at 0.81 and 0.85, depending on the dataset. The AUCs of BEST on the benchmark sets of full antigen chains equal 0.57 and 0.6, which is significantly and slightly better than the next best method we tested. We also present case studies to contrast the propensity profiles generated by BEST and several other methods.
Charting the Unexplored RNA-binding Protein Atlas of the Human Genome
(Office of the Vice Chancellor for Research, 2012-04-13) Zhao, Huiying; Yang, Yuedong; Janga, Sarath Chandra; Chen, Jason; Zhu, Heng; Kao, Cheng; Zhou, Yaoqi
Detecting protein-RNA interactions is challenging–both experimentally and computationally– because RNAs are large in number, diverse in cellular location and function, and flexible in structure. As a result, many RNA-binding proteins (RBPs) remain to be identified and characterized. Recently, we developed a bioinformatics tool called SPOT-Seq that integrates template-based structure prediction with RNA-binding affinity prediction to predict RBPs. Application of SPOT-Seq to human genome leads to doubling of RBPs from 2115 to 4296. Half of novel (>2000) RBPs are poorly or not annotated. The other half possesses Gene Ontology leaf IDs that are associated with known RBPs. In particular, we identified 36 novel RBPs in cancer, cardiovascular, diabetes and neurodegenerative pathways and 26 novel RBPs associated with disease-causing SNPs. Half of these disease-associating, predicted novel RBPs are annotated to interact with known RBPs. Accuracy of predicted novel RBPs is further validated by same confirmation rate of novel and annotated RBPs in human proteome microarrays experiments. The large number of predicted novel RBPs and their abundance in disease pathways and disease-causing SNPs are useful for hypothesis generation. These predicted novel human RBPs (>2000) with confidence level and their predicted complex structures with RNA can be downloaded from http://sparks.informatics.iupui.edu (yqzhou@iupui.edu)
Computational protein design: assessment and applications
(2015) Li, Zhixiu; Zhou, Yaoqi
Computational protein design aims at designing amino acid sequences that can fold into a target structure and perform a desired function. Many computational design methods have been developed and their applications have been successful during past two decades. However, the success rate of protein design remains too low to be of a useful tool by biochemists whom are not an expert of computational biology. In this dissertation, we first developed novel computational assessment techniques to assess several state-of-the-art computational techniques. We found that significant progresses were made in several important measures by two new scoring functions from RosettaDesign and from OSCAR-design, respectively. We also developed the first machine-learning technique called SPIN that predicts a sequence profile compatible to a given structure with a novel nonlocal energy-based feature. The accuracy of predicted sequences is comparable to RosettaDesign in term of sequence identity to wild type sequences. In the last two application chapters, we have designed self-inhibitory peptides of Escherichia coli methionine aminopeptidase (EcMetAP) and de novo designed barstar. Several peptides were confirmed inhibition of EcMetAP at the micromole-range 50% inhibitory concentration. Meanwhile, the assessment of designed barstar sequences indicates the improvement of OSCAR-design over RosettaDesign.
DDIG-in: discriminating between disease-associated and neutral non-frameshifting micro-indels
(Springer Nature, 2013-03-13) Zhao, Huiying; Yang, Yuedong; Lin, Hai; Zhang, Xinjun; Mort, Matthew; Cooper, David N.; Liu, Yunlong; Zhou, Yaoqi; Medical and Molecular Genetics, School of Medicine
Micro-indels (insertions or deletions shorter than 21 bps) constitute the second most frequent class of human gene mutation after single nucleotide variants. Despite the relative abundance of non-frameshifting indels, their damaging effect on protein structure and function has gone largely unstudied. We have developed a support vector machine-based method named DDIG-in (Detecting disease-causing genetic variations due to indels) to prioritize non-frameshifting indels by comparing disease-associated mutations with putatively neutral mutations from the 1,000 Genomes Project. The final model gives good discrimination for indels and is robust against annotation errors. A webserver implementing DDIG-in is available at http://sparks-lab.org/ddig.
DescribePROT: database of amino acid-level protein structure and function predictions
(Oxford University Press, 2021-01-08) Zhao, Bi; Katuwawala, Akila; Oldfield, Christopher J.; Dunker, A. Keith; Faraggi, Eshel; Gsponer, Jörg; Kloczkowski, Andrzej; Malhis, Nawar; Mirdita, Milot; Obradovic, Zoran; Söding, Johannes; Steinegger, Martin; Zhou, Yaoqi; Kurgan, Lukasz; Medicine, School of Medicine
We present DescribePROT, the database of predicted amino acid-level descriptors of structure and function of proteins. DescribePROT delivers a comprehensive collection of 13 complementary descriptors predicted using 10 popular and accurate algorithms for 83 complete proteomes that cover key model organisms. The current version includes 7.8 billion predictions for close to 600 million amino acids in 1.4 million proteins. The descriptors encompass sequence conservation, position specific scoring matrix, secondary structure, solvent accessibility, intrinsic disorder, disordered linkers, signal peptides, MoRFs and interactions with proteins, DNA and RNAs. Users can search DescribePROT by the amino acid sequence and the UniProt accession number and entry name. The pre-computed results are made available instantaneously. The predictions can be accesses via an interactive graphical interface that allows simultaneous analysis of multiple descriptors and can be also downloaded in structured formats at the protein, proteome and whole database scale. The putative annotations included by DescriPROT are useful for a broad range of studies, including: investigations of protein function, applied projects focusing on therapeutics and diseases, and in the development of predictors for other protein sequence descriptors. Future releases will expand the coverage of DescribePROT. DescribePROT can be accessed at http://biomine.cs.vcu.edu/servers/DESCRIBEPROT/.
Direct prediction of profiles of sequences compatible to a protein structure by neural networks with fragment-based local and energy-based nonlocal profiles
(Wiley Online Library, 2014-10) Li, Zhixiu; Yang, Yuedong; Faraggi, Eshel; Zhou, Jian; Zhou, Yaoqi; Department of BioHealth Informatics, IU School of Informatics and Computing
Locating sequences compatible with a protein structural fold is the well-known inverse protein-folding problem. While significant progress has been made, the success rate of protein design remains low. As a result, a library of designed sequences or profile of sequences is currently employed for guiding experimental screening or directed evolution. Sequence profiles can be computationally predicted by iterative mutations of a random sequence to produce energy-optimized sequences, or by combining sequences of structurally similar fragments in a template library. The latter approach is computationally more efficient but yields less accurate profiles than the former because of lacking tertiary structural information. Here we present a method called SPIN that predicts Sequence Profiles by Integrated Neural network based on fragment-derived sequence profiles and structure-derived energy profiles. SPIN improves over the fragment-derived profile by 6.7% (from 23.6 to 30.3%) in sequence identity between predicted and wild-type sequences. The method also reduces the number of residues in low complex regions by 15.7% and has a significantly better balance of hydrophilic and hydrophobic residues at protein surface. The accuracy of sequence profiles obtained is comparable to those generated from the protein design program RosettaDesign 3.5. This highly efficient method for predicting sequence profiles from structures will be useful as a single-body scoring term for improving scoring functions used in protein design and fold recognition. It also complements protein design programs in guiding experimental design of the sequence library for screening and directed evolution of designed sequences. The SPIN server is available at http://sparks-lab.org.
Discriminating between disease-causing and neutral non-frameshifting micro-INDELs by support vector machines by means of integrated sequence- and structure-based features
(Office of the Vice Chancellor for Research, 2013-04-05) Zhao, Huiying; Yang, Yuedong; Lin, Hai; Zhang, Xinjun; Mort, Matthew; Cooper, David N.; Liu, Yunlong; Zhou, Yaoqi
Micro-INDELs (insertions or deletions of ≤20 bp) constitute the second most frequent class of human gene mutation after single nucleotide variants. A significant portion of exonic INDELs are non-frameshifting (NFS), serving to insert or delete a discrete number of amino-acid residues. Despite the relative abundance of NFS-INDELs, their damaging effect on protein structure and function has gone largely unstudied whilst bioinformatics tools for discriminating between disease-causing and neutral NFS-INDELs remain to be developed. We have developed such a technique (DDIG-in; Detecting DIsease-causing Genetic variations due to INDELs) by comparing the properties of disease-causing NFS-INDELs from the Human Gene Mutation Database (HGMD) with putatively neutral NFS-INDELs from the 1,000 Genomes Project. Having considered 58 different sequence- and structure-based features, we found that predicted disordered regions around the NFS-INDEL region had the highest discriminative capability (disease versus neutral) with an Area Under the receiver-operating characteristic Curve (AUC) of 0.82 and a Matthews Correlation Coefficient (MCC) of 0.56. All features studied were combined by support vector machines (SVM) and selected by a greedy algorithm. The resulting SVM models were trained and tested by ten-fold cross-validation on the microdeletion dataset and independently tested on the microinsertion dataset and vice versa. The final SVM model for determining NFS-INDEL disease-causing probability was built on non-redundant datasets with a protein sequence identity cutoff of 35% and yielded an MCC value of 0.68, an accuracy of 84% and an AUC of 0.89. Predicted disease-causing probabilities exhibited a strong negative correlation with the average minor allele frequency (correlation coefficient, -0.84). DDIG-in, available at http://sparks.informatics.iupui.edu, can be used to estimate the disease-causing probability for a given NFS-INDEL.
ExonImpact: Prioritizing Pathogenic Alternative Splicing Events
(Wiley, 2017-01) Li, Meng; Feng, Weixing; Zhang, Xinjun; Yang, Yuedong; Wang, Kejun; Mort, Matthew; Cooper, David N.; Wang, Yue; Zhou, Yaoqi; Liu, Yunlong; Medicine, School of Medicine
Alternative splicing (AS) is a closely regulated process that allows a single gene to encode multiple protein isoforms, thereby contributing to the diversity of the proteome. Dysregulation of the splicing process has been found to be associated with many inherited diseases. However, in amongst the pathogenic AS events there are numerous “passenger” events whose inclusion or exclusion does not lead to significant changes with respect to protein function. In this study, we evaluate the secondary and tertiary structural features of proteins associated with disease-causing and neutral AS events, and show that several structural features are strongly associated with the pathological impact of exon inclusion. We further develop a machine learning-based computational model, ExonImpact, for prioritizing and evaluating the functional consequences of hitherto uncharacterized AS events. We evaluated our model using several strategies including cross-validation, and data from the Gene-Tissue Expression (GTEx) and ClinVar databases. ExonImpact is freely available at http://watson.compbio.iupui.edu/ExonImpact
From sequence to structure, to function, and back again: Integrating knowledge-based approaches with physical intuitions for protein folding, binding, and design
(Office of the Vice Chancellor for Research, 2010-04-09) Zhou, Yaoqi
By combining physical and knowledge-based approaches, state-of-the-art bioinformatics tools are developed for protein structure prediction, function prediction (DNA binding) and structurebased protein and ligand design.

Browsing by Author "Zhou, Yaoqi"

Results Per Page

Sort Options