Discriminating between disease-causing and neutral non-frameshifting micro-INDELs by support vector machines by means of integrated sequence- and structure-based features

dc.contributor.author: Zhao, Huiying
dc.contributor.author: Yang, Yuedong
dc.contributor.author: Lin, Hai
dc.contributor.author: Zhang, Xinjun
dc.contributor.author: Mort, Matthew
dc.contributor.author: Cooper, David N.
dc.contributor.author: Liu, Yunlong
dc.contributor.author: Zhou, Yaoqi
dc.date.accessioned: 2015-10-02T13:16:21Z
dc.date.available: 2015-10-02T13:16:21Z
dc.date.issued: 2013-04-05
dc.description: poster abstract [en_US]
dc.description.abstract: Micro-INDELs (insertions or deletions of ≤20 bp) constitute the second most frequent class of human gene mutation after single nucleotide variants. A significant portion of exonic INDELs are non-frameshifting (NFS), serving to insert or delete a discrete number of amino-acid residues. Despite the relative abundance of NFS-INDELs, their damaging effect on protein structure and function has gone largely unstudied whilst bioinformatics tools for discriminating between disease-causing and neutral NFS-INDELs remain to be developed. We have developed such a technique (DDIG-in; Detecting DIsease-causing Genetic variations due to INDELs) by comparing the properties of disease-causing NFS-INDELs from the Human Gene Mutation Database (HGMD) with putatively neutral NFS-INDELs from the 1,000 Genomes Project. Having considered 58 different sequence- and structure-based features, we found that predicted disordered regions around the NFS-INDEL region had the highest discriminative capability (disease versus neutral) with an Area Under the receiver-operating characteristic Curve (AUC) of 0.82 and a Matthews Correlation Coefficient (MCC) of 0.56. All features studied were combined by support vector machines (SVM) and selected by a greedy algorithm. The resulting SVM models were trained and tested by ten-fold cross-validation on the microdeletion dataset and independently tested on the microinsertion dataset and vice versa. The final SVM model for determining NFS-INDEL disease-causing probability was built on non-redundant datasets with a protein sequence identity cutoff of 35% and yielded an MCC value of 0.68, an accuracy of 84% and an AUC of 0.89. Predicted disease-causing probabilities exhibited a strong negative correlation with the average minor allele frequency (correlation coefficient, -0.84). DDIG-in, available at http://sparks.informatics.iupui.edu, can be used to estimate the disease-causing probability for a given NFS-INDEL. [en_US]
dc.identifier.citation: Zhao, Huiying, Yuedong Yang, Hai Lin, Xinjun Zhang, Matthew Mort, David N. Cooper, Yunlong Liu, and Yaoqi Zhou. (2013, April 5). Discriminating between disease-causing and neutral non-frameshifting micro-INDELs by support vector machines by means of integrated sequence- and structure-based features. Poster session presented at IUPUI Research Day 2013, Indianapolis, Indiana. [en_US]
dc.identifier.uri: https://hdl.handle.net/1805/7109
dc.language.iso: en_US [en_US]
dc.publisher: Office of the Vice Chancellor for Research [en_US]
dc.subject: micro-INDELs [en_US]
dc.subject: human gene mutation [en_US]
dc.subject: non-frameshifting exonic INDELs [en_US]
dc.subject: disease-causing non-frameshifting INDELs [en_US]
dc.subject: neutral non-frameshifting INDELs [en_US]
dc.subject: Detecting DIsease-causing Genetic variations due to INDELs [en_US]
dc.title: Discriminating between disease-causing and neutral non-frameshifting micro-INDELs by support vector machines by means of integrated sequence- and structure-based features [en_US]
dc.type: Poster [en_US]
Files
Original bundle
Name: Zhao-discriminating.pdf
Size: 48.23 KB
Format: Adobe Portable Document Format
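
The abstract reports performance as a Matthews Correlation Coefficient and says features were "selected by a greedy algorithm" without giving details. The following is only a minimal illustrative sketch of those two ideas, not the authors' implementation: the forward-selection strategy, the stopping rule (`min_gain`), the feature names, and the toy scoring function are all assumptions made for the example. In a realistic setting the `score` callable would return something like the cross-validated MCC of an SVM trained on the candidate feature subset.

```python
import math
from typing import Callable, FrozenSet, List, Sequence, Tuple


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal total is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def greedy_forward_selection(
    features: Sequence[str],
    score: Callable[[FrozenSet[str]], float],
    min_gain: float = 1e-9,
) -> Tuple[List[str], float]:
    """Add one feature at a time, keeping the one that raises the score most.

    Stops when no remaining feature improves the score by more than
    min_gain (a hypothetical stopping rule chosen for this sketch).
    """
    selected: List[str] = []
    remaining = list(features)
    best = float("-inf")
    while remaining:
        # Score every one-feature extension of the current subset.
        candidates = [(score(frozenset(selected + [f])), f) for f in remaining]
        top_score, top_feature = max(candidates, key=lambda c: c[0])
        if top_score - best <= min_gain:
            break
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return selected, best


# Toy stand-in for "cross-validated MCC of an SVM on this subset":
# each feature contributes a fixed amount, minus a small per-feature penalty.
# The names and weights are invented for illustration only.
TOY_WEIGHTS = {"disorder": 0.5, "conservation": 0.2, "noise": 0.0}


def toy_score(subset: FrozenSet[str]) -> float:
    return sum(TOY_WEIGHTS[f] for f in subset) - 0.01 * len(subset)


if __name__ == "__main__":
    chosen, final_score = greedy_forward_selection(list(TOY_WEIGHTS), toy_score)
    print(chosen, round(final_score, 2))
    print(mcc(tp=50, tn=50, fp=0, fn=0))
```

Passing the scoring function in as a parameter keeps the selection loop independent of any particular classifier or validation scheme, so the same loop could be reused with a different model or metric.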