Discriminating between disease-causing and neutral non-frameshifting micro-INDELs by support vector machines by means of integrated sequence- and structure-based features

Micro-INDELs (insertions or deletions of ≤20 bp) constitute the second most frequent class of human gene mutation after single nucleotide variants. A significant portion of exonic INDELs are non-frameshifting (NFS), inserting or deleting a discrete number of amino-acid residues. Despite the relative abundance of NFS-INDELs, their damaging effect on protein structure and function has gone largely unstudied, and bioinformatics tools for discriminating between disease-causing and neutral NFS-INDELs remain to be developed. We have developed such a technique (DDIG-in; Detecting DIsease-causing Genetic variations due to INDELs) by comparing the properties of disease-causing NFS-INDELs from the Human Gene Mutation Database (HGMD) with putatively neutral NFS-INDELs from the 1000 Genomes Project. Having considered 58 different sequence- and structure-based features, we found that predicted disordered regions around the NFS-INDEL region had the highest discriminative capability (disease versus neutral), with an Area Under the receiver-operating characteristic Curve (AUC) of 0.82 and a Matthews Correlation Coefficient (MCC) of 0.56. All features studied were combined by support vector machines (SVM) and selected by a greedy algorithm. The resulting SVM models were trained and tested by ten-fold cross-validation on the microdeletion dataset and independently tested on the microinsertion dataset, and vice versa. The final SVM model for determining NFS-INDEL disease-causing probability was built on non-redundant datasets with a protein sequence identity cutoff of 35% and yielded an MCC of 0.68, an accuracy of 84% and an AUC of 0.89. Predicted disease-causing probabilities exhibited a strong negative correlation with the average minor allele frequency (correlation coefficient, -0.84). DDIG-in, available at http://sparks.informatics.iupui.edu, can be used to estimate the disease-causing probability for a given NFS-INDEL.
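For readers unfamiliar with the evaluation measures quoted above, the following is a minimal sketch of how accuracy, MCC and AUC can be computed from true labels and predicted disease-causing probabilities. It is not the DDIG-in implementation: the function names, the 0.5 decision threshold and the toy labels and scores are all hypothetical and chosen only for illustration.

```python
from math import sqrt


def confusion_counts(labels, scores, threshold=0.5):
    """Return (TP, TN, FP, FN), calling a variant disease-causing (1)
    when its score is >= threshold. The threshold is an assumption."""
    tp = tn = fp = fn = 0
    for y, s in zip(labels, scores):
        pred = 1 if s >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 0 and y == 0:
            tn += 1
        elif pred == 1 and y == 0:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of variants classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient; defined as 0 when the denominator is 0."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def auc(labels, scores):
    """AUC as the probability that a randomly chosen positive scores higher
    than a randomly chosen negative (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = disease-causing, 0 = neutral.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]

counts = confusion_counts(labels, scores)
print("accuracy:", accuracy(*counts))
print("MCC:", mcc(*counts))
print("AUC:", auc(labels, scores))
```

MCC and accuracy depend on the chosen threshold, whereas AUC summarises ranking quality over all thresholds, which is why the abstract reports them side by side.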

Description

poster abstract

Keywords

micro-INDELs, human gene mutation, non-frameshifting exonic INDELs, disease-causing non-frameshifting INDELs, neutral non-frameshifting INDELs, Detecting DIsease-causing Genetic variations due to INDELs

Cite As

Zhao, Huiying, Yuedong Yang, Hai Lin, Xinjun Zhang, Matthew Mort, David N. Cooper, Yunlong Liu, and Yaoqi Zhou. (2013, April 5). Discriminating between disease-causing and neutral non-frameshifting micro-INDELs by support vector machines by means of integrated sequence- and structure-based features. Poster session presented at IUPUI Research Day 2013, Indianapolis, Indiana.

Type

Poster

Permanent Link

https://hdl.handle.net/1805/7109

Collections

IUPUI Research Day 2013
