Computational modeling for identification of low-frequency single nucleotide variants

Hao, Yangyang

Files

Hao_iupui_0104D_10072.pdf (1.5 MB)

Date

2015-11-16

Authors

Hao, Yangyang

Language

American English

Committee Chair

Liu, Yunlong

Committee Members

Edenberg, Howard J.
Li, Lang
Nakshatri, Harikrishna

Degree

Ph.D.

Degree Year

2016

Department

Medical & Molecular Genetics

Grantor

Indiana University

Abstract

Reliable detection of low-frequency single nucleotide variants (SNVs) is important in many applications. In cancer genetics, the frequencies of somatic variants from tumor biopsies tend to be low due to contamination with normal tissue and tumor heterogeneity. Circulating tumor DNA monitoring also faces the challenge of detecting low-frequency variants due to the small percentage of tumor DNA in blood. Moreover, in population genetics, although pooled sequencing is cost-effective compared with individual sequencing, pooling dilutes the signals of variants from any individual. Detection of low-frequency variants is difficult and can be confounded by multiple sources of error, especially next-generation sequencing artifacts. Existing methods are limited in sensitivity and mainly focus on frequencies around 5%; most fail to consider differential, context-specific sequencing artifacts. To address this challenge, we developed a computational and experimental framework, RareVar, to reliably identify low-frequency SNVs from high-throughput sequencing data. For optimized performance, RareVar used a supervised learning framework to model artifacts originating from different components of a specific sequencing pipeline. This is enabled by a customized, comprehensive benchmark dataset enriched with known low-frequency SNVs from the sequencing pipeline of interest. A genomic-context-specific sequencing error model was trained on the benchmark data to characterize systematic sequencing artifacts and to derive a position-specific detection limit for sensitive low-frequency SNV detection. Further, a machine-learning algorithm used sequencing quality features to refine SNV candidates for higher specificity. RareVar outperformed existing approaches, especially at 0.5% to 5% frequency. We further explored the influence of statistical modeling on position-specific error modeling and showed that the zero-inflated negative binomial was the best-performing statistical distribution. When replicating the analyses on an Illumina MiSeq benchmark dataset, our method seamlessly adapted to technologies with different biochemistries. RareVar enables sensitive detection of low-frequency SNVs across different sequencing platforms and will facilitate research and clinical applications such as pooled sequencing, early cancer detection, prognostic assessment, metastatic monitoring, and identification of relapse or acquired resistance.
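
As an illustration of the idea of a position-specific detection limit under a zero-inflated negative binomial (ZINB) error model, the following is a minimal sketch and not the RareVar implementation. All function names, the parameterization (mean `mu`, dispersion `size`, zero-inflation weight `pi_zero`), and the significance level `alpha` are assumptions made for this example. The sketch returns the smallest non-reference read count whose probability of arising from sequencing error alone is at most `alpha`.

```python
import math


def nb_pmf(k, mu, size):
    """Negative binomial pmf with mean mu > 0 and dispersion size > 0."""
    p = size / (size + mu)  # success probability in the (size, p) form
    log_pmf = (math.lgamma(k + size) - math.lgamma(k + 1) - math.lgamma(size)
               + size * math.log(p) + k * math.log1p(-p))
    return math.exp(log_pmf)


def zinb_pmf(k, mu, size, pi_zero):
    """ZINB pmf: extra point mass pi_zero at zero mixed with a negative binomial."""
    base = (1.0 - pi_zero) * nb_pmf(k, mu, size)
    return pi_zero + base if k == 0 else base


def detection_limit(mu, size, pi_zero, alpha=1e-3, max_count=100000):
    """Smallest count k with P(error count >= k) <= alpha under the ZINB model.

    Hypothetical helper: mu, size and pi_zero stand for error-model parameters
    fitted for one genomic position (or sequence context) at a given depth.
    """
    if mu <= 0 or size <= 0 or not 0.0 <= pi_zero < 1.0:
        raise ValueError("require mu > 0, size > 0 and 0 <= pi_zero < 1")
    cdf = 0.0  # running value of P(X <= k - 1)
    for k in range(max_count + 1):
        if 1.0 - cdf <= alpha:  # upper tail P(X >= k)
            return k
        cdf += zinb_pmf(k, mu, size, pi_zero)
    raise ValueError("no detection limit found below max_count")


if __name__ == "__main__":
    # Illustrative parameters only; these are not values from the dissertation.
    print(detection_limit(mu=2.0, size=1.5, pi_zero=0.3))
```

In this sketch, a position with a higher expected error count receives a higher detection limit, which is the sense in which the limit is position-specific. A candidate SNV at that position would be called only if its supporting read count reaches the limit.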

Description

Indiana University-Purdue University Indianapolis (IUPUI)

Keywords

Low-frequency variants, Machine-learning, Next generation sequencing, SNVs, Somatic mutations, Statistical modeling

LC Subjects

Cancer -- Genetic aspects, Nucleotide sequence -- Statistical methods, Genetics -- Statistics, Genomics, Machine learning -- Mathematical models, Mathematical optimization, Biopsy, Population genetics

Permanent Link

https://hdl.handle.net/1805/8891
http://dx.doi.org/10.7912/C2/1952

Collections

Medical & Molecular Genetics Department Theses & Dissertations
