Discriminating between disease-causing and neutral non-frameshifting micro-INDELs by support vector machines by means of integrated sequence- and structure-based features
dc.contributor.author | Zhao, Huiying | |
dc.contributor.author | Yang, Yuedong | |
dc.contributor.author | Lin, Hai | |
dc.contributor.author | Zhang, Xinjun | |
dc.contributor.author | Mort, Matthew | |
dc.contributor.author | Cooper, David N. | |
dc.contributor.author | Liu, Yunlong | |
dc.contributor.author | Zhou, Yaoqi | |
dc.date.accessioned | 2015-10-02T13:16:21Z | |
dc.date.available | 2015-10-02T13:16:21Z | |
dc.date.issued | 2013-04-05 | |
dc.description | poster abstract | en_US |
dc.description.abstract | Micro-INDELs (insertions or deletions of ≤20 bp) constitute the second most frequent class of human gene mutation after single nucleotide variants. A significant portion of exonic INDELs are non-frameshifting (NFS), serving to insert or delete a discrete number of amino-acid residues. Despite the relative abundance of NFS-INDELs, their damaging effect on protein structure and function has gone largely unstudied, and bioinformatics tools for discriminating between disease-causing and neutral NFS-INDELs remain to be developed. We have developed such a technique (DDIG-in; Detecting DIsease-causing Genetic variations due to INDELs) by comparing the properties of disease-causing NFS-INDELs from the Human Gene Mutation Database (HGMD) with putatively neutral NFS-INDELs from the 1000 Genomes Project. Having considered 58 different sequence- and structure-based features, we found that predicted disordered regions around the NFS-INDEL region had the highest discriminative capability (disease versus neutral), with an Area Under the receiver-operating characteristic Curve (AUC) of 0.82 and a Matthews Correlation Coefficient (MCC) of 0.56. All features studied were combined by support vector machines (SVM) and selected by a greedy algorithm. The resulting SVM models were trained and tested by ten-fold cross-validation on the microdeletion dataset and independently tested on the microinsertion dataset, and vice versa. The final SVM model for determining NFS-INDEL disease-causing probability was built on non-redundant datasets with a protein sequence identity cutoff of 35% and yielded an MCC value of 0.68, an accuracy of 84% and an AUC of 0.89. Predicted disease-causing probabilities exhibited a strong negative correlation with the average minor allele frequency (correlation coefficient, -0.84). DDIG-in, available at http://sparks.informatics.iupui.edu, can be used to estimate the disease-causing probability for a given NFS-INDEL. | en_US |
dc.identifier.citation | Zhao, Huiying, Yuedong Yang, Hai Lin, Xinjun Zhang, Matthew Mort, David N. Cooper, Yunlong Liu, and Yaoqi Zhou. (2013, April 5). Discriminating between disease-causing and neutral non-frameshifting micro-INDELs by support vector machines by means of integrated sequence- and structure-based features. Poster session presented at IUPUI Research Day 2013, Indianapolis, Indiana. | en_US |
dc.identifier.uri | https://hdl.handle.net/1805/7109 | |
dc.language.iso | en_US | en_US |
dc.publisher | Office of the Vice Chancellor for Research | en_US |
dc.subject | micro-INDELs | en_US |
dc.subject | human gene mutation | en_US |
dc.subject | non-frameshifting exonic INDELs | en_US |
dc.subject | disease-causing non-frameshifting INDELs | en_US |
dc.subject | neutral non-frameshifting INDELs | en_US |
dc.subject | Detecting DIsease-causing Genetic variations due to INDELs | en_US |
dc.title | Discriminating between disease-causing and neutral non-frameshifting micro-INDELs by support vector machines by means of integrated sequence- and structure-based features | en_US |
dc.type | Poster | en_US |
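The abstract reports model quality as a Matthews Correlation Coefficient (MCC) and an accuracy. For readers unfamiliar with these metrics, the following is a minimal sketch of how both are computed from the four counts of a binary confusion matrix. The function names and the counts are hypothetical, chosen only to make the arithmetic easy to follow. They are not the poster's data and say nothing about how the authors computed their figures.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: report 0.0 when a row or column of the
    # confusion matrix is empty and the denominator vanishes.
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical, balanced counts (NOT taken from the poster):
# 100 disease-causing and 100 neutral variants, 16 errors in each class.
tp, tn, fp, fn = 84, 84, 16, 16

print(f"accuracy = {accuracy(tp, tn, fp, fn):.2f}")
print(f"MCC      = {mcc(tp, tn, fp, fn):.2f}")
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it remains informative when the disease-causing and neutral classes differ in size.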