Computational modeling for identification of low-frequency single nucleotide variants

dc.contributor.advisorLiu, Yunlong
dc.contributor.authorHao, Yangyang
dc.contributor.otherEdenberg, Howard J.
dc.contributor.otherLi, Lang
dc.contributor.otherNakshatr, Harikrishna
dc.date.accessioned2016-03-17T15:19:11Z
dc.date.available2016-03-18T09:30:31Z
dc.date.issued2015-11-16
dc.degree.date2016
dc.degree.disciplineMedical & Molecular Genetics.
dc.degree.grantorIndiana University.
dc.degree.levelPh.D.
dc.descriptionIndiana University-Purdue University Indianapolis (IUPUI)en_US
dc.description.abstractReliable detection of low-frequency single nucleotide variants (SNVs) is important in many applications. In cancer genetics, the frequencies of somatic variants from tumor biopsies tend to be low due to contamination with normal tissue and tumor heterogeneity. Circulating tumor DNA monitoring also faces the challenge of detecting low-frequency variants due to the small percentage of tumor DNA in blood. Moreover, in population genetics, although pooled sequencing is cost-effective compared with individual sequencing, pooling dilutes the signals of variants from any individual. Detection of low-frequency variants is difficult and can be confounded by multiple sources of error, especially next-generation sequencing artifacts. Existing methods are limited in sensitivity and mainly focus on frequencies around 5%; most fail to consider differential, context-specific sequencing artifacts. To address this challenge, we developed a computational and experimental framework, RareVar, to reliably identify low-frequency SNVs from high-throughput sequencing data. For optimized performance, RareVar used a supervised learning framework to model artifacts originating from different components of a specific sequencing pipeline. This is enabled by a customized, comprehensive benchmark dataset enriched with known low-frequency SNVs from the sequencing pipeline of interest. A genomic-context-specific sequencing error model was trained on the benchmark data to characterize the systematic sequencing artifacts and to derive the position-specific detection limit for sensitive low-frequency SNV detection. Further, a machine-learning algorithm used sequencing quality features to refine SNV candidates for higher specificity. RareVar outperformed existing approaches, especially at 0.5% to 5% frequency.
We further explored the influence of statistical modeling on position-specific error modeling and showed that the zero-inflated negative binomial was the best-performing statistical distribution. When the analyses were replicated on an Illumina MiSeq benchmark dataset, our method adapted seamlessly to technologies with different biochemistries. RareVar enables sensitive detection of low-frequency SNVs across different sequencing platforms and will facilitate research and clinical applications such as pooled sequencing, early cancer detection, prognostic assessment, metastatic monitoring, and identification of relapse or acquired resistance.en_US
dc.identifier.doi10.7912/C2PC73
dc.identifier.urihttps://hdl.handle.net/1805/8891
dc.identifier.urihttp://dx.doi.org/10.7912/C2/1952
dc.language.isoen_USen_US
dc.subjectLow-frequency variantsen_US
dc.subjectMachine-learningen_US
dc.subjectNext generation sequencingen_US
dc.subjectSNVsen_US
dc.subjectSomatic mutationsen_US
dc.subjectStatistical modelingen_US
dc.subject.lcshCancer -- Genetic aspects
dc.subject.lcshNucleotide sequence -- Statistical methods
dc.subject.lcshGenetics -- Statistics
dc.subject.lcshGenomics
dc.subject.lcshMachine learning -- Mathematical models
dc.subject.lcshMathematical optimization
dc.subject.lcshBiopsy
dc.subject.lcshPopulation genetics
dc.titleComputational modeling for identification of low-frequency single nucleotide variantsen_US
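Illustrative note (not part of the thesis record): the abstract describes deriving a position-specific detection limit from an error model and reports the zero-inflated negative binomial (ZINB) as the best-performing distribution. The sketch below shows one way such a limit could be computed from a ZINB error model. It is not RareVar's implementation: the function names, the mean/dispersion parameterization, the significance level alpha, and all numeric values are assumptions chosen for illustration.

```python
import math


def nb_pmf(k, mean, dispersion):
    """Negative binomial pmf, parameterized by mean and dispersion (size)."""
    if mean <= 0 or dispersion <= 0:
        raise ValueError("mean and dispersion must be positive")
    r = dispersion
    p = r / (r + mean)  # success probability implied by the mean
    log_pmf = (
        math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
        + r * math.log(p) + k * math.log1p(-p)
    )
    return math.exp(log_pmf)


def zinb_pmf(k, mean, dispersion, zero_inflation):
    """ZINB pmf: extra point mass at zero mixed with a negative binomial."""
    base = (1.0 - zero_inflation) * nb_pmf(k, mean, dispersion)
    return zero_inflation + base if k == 0 else base


def detection_limit(mean, dispersion, zero_inflation, alpha=1e-3,
                    max_count=100_000):
    """Smallest error-read count k with P(X >= k) <= alpha under the model.

    A site showing at least this many non-reference reads would be unlikely
    to arise from sequencing error alone (hypothetical decision rule).
    """
    cdf = 0.0  # running P(X <= k - 1)
    for k in range(max_count + 1):
        if 1.0 - cdf <= alpha:
            return k
        cdf += zinb_pmf(k, mean, dispersion, zero_inflation)
    raise ValueError("no detection limit found below max_count")


if __name__ == "__main__":
    # Hypothetical error-model parameters for a single genomic position.
    print(detection_limit(mean=2.0, dispersion=1.5, zero_inflation=0.3))
```

In a position-specific setting, the three parameters would be estimated separately for each genomic context from benchmark data, so positions that are prone to artifacts receive a higher limit than clean ones.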
Files
Original bundle
Name: Hao_iupui_0104D_10072.pdf
Size: 1.5 MB
Format: Adobe Portable Document Format