Browsing by Author "Hao, Yangyang"
Now showing 1 - 10 of 11
Item
Alt Event Finder: a tool for extracting alternative splicing events from RNA-seq data.
(BMC, 2012) Zhou, Ao; Breese, Marcus R.; Hao, Yangyang; Edenberg, Howard J.; Li, Lang; Skaar, Todd C.; Liu, Yunlong
BACKGROUND: Alternative splicing increases proteome diversity by expressing multiple gene isoforms that often differ in function. Identifying alternative splicing events from RNA-seq experiments is important for understanding the diversity of transcripts and for investigating the regulation of splicing.
RESULTS: We developed Alt Event Finder, a tool for identifying novel splicing events by using transcript annotation derived from genome-guided construction tools, such as Cufflinks and Scripture. With a proper combination of alignment and transcript reconstruction tools, Alt Event Finder is capable of identifying novel splicing events in the human genome. We further applied Alt Event Finder to a set of RNA-seq data from rat liver tissues, and identified dozens of novel cassette exon events whose splicing patterns changed after extensive alcohol exposure.
CONCLUSIONS: Alt Event Finder is capable of identifying de novo splicing events from data-driven transcript annotation, and is a useful tool for studying splicing regulation.

Item
Computational modeling for identification of low-frequency single nucleotide variants
(2015-11-16) Hao, Yangyang; Liu, Yunlong; Edenberg, Howard J.; Li, Lang; Nakshatri, Harikrishna
Reliable detection of low-frequency single nucleotide variants (SNVs) carries great significance in many applications. In cancer genetics, the frequencies of somatic variants from tumor biopsies tend to be low due to contamination with normal tissue and tumor heterogeneity. Circulating tumor DNA monitoring also faces the challenge of detecting low-frequency variants due to the small percentage of tumor DNA in blood.
Moreover, in population genetics, although pooled sequencing is cost-effective compared with individual sequencing, pooling dilutes the signals of variants from any individual. Detection of low-frequency variants is difficult and can be confounded by multiple sources of errors, especially next-generation sequencing artifacts. Existing methods are limited in sensitivity and mainly focus on frequencies around 5%; most fail to consider differential, context-specific sequencing artifacts. To face this challenge, we developed a computational and experimental framework, RareVar, to reliably identify low-frequency SNVs from high-throughput sequencing data. For optimized performance, RareVar utilized a supervised learning framework to model artifacts originating from different components of a specific sequencing pipeline. This is enabled by a customized, comprehensive benchmark dataset enriched with known low-frequency SNVs from the sequencing pipeline of interest. A genomic-context-specific sequencing error model was trained on the benchmark data to characterize the systematic sequencing artifacts and to derive the position-specific detection limit for sensitive low-frequency SNV detection. Further, a machine-learning algorithm utilized sequencing quality features to refine SNV candidates for higher specificity. RareVar outperformed existing approaches, especially at 0.5% to 5% frequency. We further explored the influence of statistical modeling on position-specific error modeling and showed the zero-inflated negative binomial to be the best-performing statistical distribution. When replicating analyses on an Illumina MiSeq benchmark dataset, our method seamlessly adapted to technologies with different biochemistries.
RareVar enables sensitive detection of low-frequency SNVs across different sequencing platforms and will facilitate research and clinical applications such as pooled sequencing, cancer early detection, prognostic assessment, metastatic monitoring, and relapses or acquired resistance identification.

Item
Dependence receptor UNC5A restricts luminal to basal breast cancer plasticity and metastasis
(BMC, 2018-05-02) Padua, Maria B.; Bhat-Nakshatri, Poornima; Anjanappa, Manjushree; Prasad, Mayuri S.; Hao, Yangyang; Rao, Xi; Liu, Sheng; Wan, Jun; Liu, Yunlong; McElyea, Kyle; Jacobsen, Max; Sandusky, George; Althouse, Sandra; Perkins, Susan; Nakshatri, Harikrishna; Surgery, School of Medicine
BACKGROUND: The majority of estrogen receptor-positive (ERα+) breast cancers respond to endocrine therapies. However, resistance to endocrine therapies occurs in 30% of cases, which may be due to altered ERα signaling and/or enhanced plasticity of cancer cells leading to breast cancer subtype conversion. The mechanisms leading to enhanced plasticity of ERα-positive cancer cells are unknown.
METHODS: We used short hairpin (sh)RNA and/or the CRISPR/Cas9 system to knock down the expression of the dependence receptor UNC5A in ERα+ MCF7 and T-47D cell lines. RNA-seq, quantitative reverse transcription polymerase chain reaction, chromatin immunoprecipitation, and Western blotting were used to measure the effect of UNC5A knockdown on basal and estradiol (E2)-regulated gene expression. Mammosphere assay, flow cytometry, and immunofluorescence were used to determine the role of UNC5A in restricting plasticity. Xenograft models were used to measure the effect of UNC5A knockdown on tumor growth and metastasis. Tissue microarray and immunohistochemistry were utilized to determine the prognostic value of UNC5A in breast cancer. Log-rank test, one-way, and two-way analysis of variance (ANOVA) were used for statistical analyses.
RESULTS: Knockdown of the E2-inducible UNC5A resulted in altered basal gene expression affecting plasma membrane integrity and ERα signaling, as evident from ligand-independent activity of ERα, altered turnover of phosphorylated ERα, unique E2-dependent expression of genes affecting histone demethylase activity, enhanced upregulation of E2-inducible genes such as BCL2, and E2-independent tumorigenesis accompanied by multiorgan metastases. UNC5A depletion led to the appearance of a luminal/basal hybrid phenotype supported by elevated expression of basal/stem cell-enriched ∆Np63, CD44, CD49f, epidermal growth factor receptor (EGFR), and the lymphatic vessel permeability factor NTN4, but lower expression of luminal/alveolar differentiation-associated ELF5 while maintaining functional ERα. In addition, UNC5A-depleted cells acquired bipotent luminal progenitor characteristics based on KRT14+/KRT19+ and CD49f+/EpCAM+ phenotype. Consistent with in vitro results, UNC5A expression negatively correlated with EGFR expression in breast tumors, and lower expression of UNC5A, particularly in ERα+/PR+/HER2- tumors, was associated with poor outcome.
CONCLUSION: These studies reveal an unexpected role of the axon guidance receptor UNC5A in fine-tuning ERα and EGFR signaling and the luminal progenitor status of hormone-sensitive breast cancers. Furthermore, UNC5A knockdown cells provide an ideal model system to investigate metastasis of ERα+ breast cancers.

Item
Impact of human pathogenic micro-insertions and micro-deletions on post-transcriptional regulation
(Oxford University Press, 2014-06-01) Zhang, Xinjun; Lin, Hai; Zhao, Huiying; Hao, Yangyang; Mort, Matthew; Cooper, David N.; Zhou, Yaoqi; Liu, Yunlong; Department of Medical & Molecular Genetics, IU School of Medicine
Small insertions/deletions (INDELs) of ≤21 bp comprise 18% of all recorded mutations causing human inherited disease and are evident in 24% of documented Mendelian diseases.
INDELs affect gene function in multiple ways: for example, by introducing premature stop codons that either lead to the production of truncated proteins or affect transcriptional efficiency. However, the means by which they impact post-transcriptional regulation, including alternative splicing, have not been fully evaluated. In this study, we collate disease-causing INDELs from the Human Gene Mutation Database (HGMD) and neutral INDELs from the 1000 Genomes Project. The potential of these two types of INDELs to affect binding-site affinity of RNA-binding proteins (RBPs) was then evaluated. We identified several sequence features that can distinguish disease-causing INDELs from neutral INDELs. Moreover, we built a machine-learning predictor called PinPor (predicting pathogenic small insertions and deletions affecting post-transcriptional regulation, http://watson.compbio.iupui.edu/pinpor/) to ascertain which newly observed INDELs are likely to be pathogenic. Our results show that disease-causing INDELs are more likely to ablate RBP-binding sites and tend to affect more RBP-binding sites than neutral INDELs. Additionally, disease-causing INDELs give rise to greater deviations in binding affinity than neutral INDELs. We also demonstrated that disease-causing INDELs may be distinguished from neutral INDELs by several sequence features, such as their proximity to splice sites and their potential effects on RNA secondary structure. 
This predictor showed satisfactory performance in identifying numerous pathogenic INDELs, with a Matthews correlation coefficient (MCC) value of 0.51 and an accuracy of 0.75.

Item
Improving alignment accuracy on homopolymer regions for semiconductor-based sequencing technologies
(BioMed Central, 2016-08-22) Feng, Weixing; Zhao, Sen; Xue, Dingkai; Song, Fengfei; Li, Ziwei; Chao, Duojiao; He, Bo; Hao, Yangyang; Wang, Yadong; Liu, Yunlong; Department of Medical and Molecular Genetics, IU School of Medicine
BACKGROUND: Ion Torrent and Ion Proton are semiconductor-based sequencing technologies that feature rapid sequencing speed and low upfront and operating costs, thanks to the avoidance of modified nucleotides and optical measurements. Despite these advantages, however, Ion semiconductor sequencing technologies suffer from much reduced sequencing accuracy at genomic loci with homopolymer repeats of the same nucleotide. This limitation significantly reduces their efficiency for biological applications that aim to accurately identify genetic variants.
RESULTS: In this study, we propose a Bayesian inference-based method that takes advantage of the signal distributions of the electrical voltages that are measured for all the homopolymers of a fixed length. By cross-referencing the length of homopolymers in the reference genome and the voltage signal distribution derived from the experiment, the proposed integrated model significantly improves the alignment accuracy around the homopolymer regions.
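The Bayesian idea summarized in the RESULTS above can be illustrated with a toy posterior over homopolymer length. This is only a sketch under strong assumptions and not the authors' model: the per-length voltage signal is assumed to be Gaussian, the prior simply favors the reference-genome length, and the names `length_posterior`, `signal_model`, and `match_prior`, along with all numeric values, are hypothetical.

```python
import math


def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def length_posterior(signal, ref_length, signal_model, match_prior=0.9):
    """Posterior probability of each candidate homopolymer length.

    signal       -- observed (normalized) voltage signal for one flow
    ref_length   -- homopolymer length at this locus in the reference genome
    signal_model -- dict: length -> (mean, sd) of the signal for that length
                    (assumed Gaussian; in practice estimated from the run)
    match_prior  -- assumed prior probability that the read matches the
                    reference length; the remainder is shared equally
                    among the other candidate lengths
    """
    others = [n for n in signal_model if n != ref_length]
    unnormalized = {}
    for n, (mean, sd) in signal_model.items():
        if n == ref_length:
            prior = match_prior if others else 1.0
        else:
            prior = (1.0 - match_prior) / len(others)
        unnormalized[n] = prior * gaussian_pdf(signal, mean, sd)
    total = sum(unnormalized.values())
    return {n: w / total for n, w in unnormalized.items()}


# Hypothetical signal distributions for lengths 3, 4 and 5.
model = {3: (3.0, 0.30), 4: (4.0, 0.35), 5: (5.0, 0.40)}
ambiguous = length_posterior(4.45, ref_length=4, signal_model=model)
clear = length_posterior(5.0, ref_length=4, signal_model=model)
print(max(ambiguous, key=ambiguous.get), max(clear, key=clear.get))
```

In this toy setting an ambiguous signal between two lengths is resolved toward the reference length, while a signal that clearly matches another length overrides the prior.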
CONCLUSIONS: Besides improving alignment accuracy on homopolymer regions for semiconductor-based sequencing technologies with the proposed model, similar strategies can also be used on other high-throughput sequencing technologies that share similar limitations.

Item
miR2Disease: a manually curated database for microRNA deregulation in human disease
(Oxford Academic, 2008-10-15) Jiang, Qinghua; Wang, Yadong; Hao, Yangyang; Juan, Liran; Teng, Mingxiang; Zhang, Xinjun; Li, Meimei; Wang, Guohua; Liu, Yunlong; Medicine, School of Medicine
‘miR2Disease’, a manually curated database, aims at providing a comprehensive resource of microRNA deregulation in various human diseases. The current version of miR2Disease documents 1939 curated relationships between 299 human microRNAs and 94 human diseases by reviewing more than 600 published papers. Around one-seventh of the microRNA–disease relationships represent the pathogenic roles of deregulated microRNA in human disease. Each entry in miR2Disease contains detailed information on a microRNA–disease relationship, including a microRNA ID, the disease name, a brief description of the microRNA–disease relationship, an expression pattern of the microRNA, the detection method for microRNA expression, experimentally verified target gene(s) of the microRNA and a literature reference. miR2Disease provides a user-friendly interface for convenient retrieval of each entry by microRNA ID, disease name, or target gene. In addition, miR2Disease offers a submission page that allows researchers to submit established microRNA–disease relationships that are not documented. Once approved by the submission review committee, the submitted records will be included in the database.
miR2Disease is freely available at http://www.miR2Disease.org.

Item
PASSPORT-seq: A Novel High-Throughput Bioassay to Functionally Test Polymorphisms in Micro-RNA Target Sites
(Frontiers Media, 2018-06-15) Ipe, Joseph; Collins, Kimberly S.; Hao, Yangyang; Gao, Hongyu; Bhatia, Puja; Gaedigk, Andrea; Liu, Yunlong; Skaar, Todd C.; Pharmacology and Toxicology, School of Medicine
Next-generation sequencing (NGS) studies have identified large numbers of genetic variants that are predicted to alter miRNA-mRNA interactions. We developed a novel high-throughput bioassay, PASSPORT-seq, that can functionally test hundreds of these variants in miRNA binding sites (mirSNPs) in parallel. The results are highly reproducible across both technical and biological replicates. The utility of the bioassay was demonstrated by testing 100 mirSNPs in HEK293, HepG2, and HeLa cells. The results for several of the variants were validated in all three cell lines using traditional individual luciferase assays. Fifty-five mirSNPs were functional in at least one of three cell lines (FDR ≤ 0.05); 11, 36, and 27 of them were functional in HEK293, HepG2, and HeLa cells, respectively. Only four of the variants were functional in all three cell lines, which demonstrates the cell-type specific effects of mirSNPs and the importance of testing the mirSNPs in multiple cell lines. Using PASSPORT-seq, we functionally tested 111 variants in the 3' UTR of 17 pharmacogenes that are predicted to alter miRNA regulation. Thirty-three of the variants tested were functional in at least one cell line.

Item
RareVar: A Framework for Detecting Low-Frequency Single-Nucleotide Variants
(Mary Ann Liebert, Inc., 2017-07) Hao, Yangyang; Xuei, Xiaoling; Li, Lang; Nakshatri, Harikrishna; Edenberg, Howard J.; Liu, Yunlong; Medical and Molecular Genetics, School of Medicine
Accurate identification of low-frequency somatic point mutations in tumor samples has important clinical utility.
Although high-throughput sequencing technology enables capturing such variants while sequencing primary tumor samples, our ability to detect them accurately is compromised when the variant frequency is close to the sequencer error rate. Most current experimental and bioinformatic strategies target mutations with ≥5% allele frequency, which limits our ability to understand cancer etiology and tumor evolution. We present an experimental and computational modeling framework, RareVar, to reliably identify low-frequency single-nucleotide variants from high-throughput sequencing data under standard experimental protocols. The RareVar protocol includes a benchmark design that pools DNA from already sequenced individuals at various concentrations to target variants at desired frequencies, 0.5%-3% in our case. By applying a position-specific error model based on a generalized linear model, followed by machine-learning-based variant calibration, our approach outperforms existing methods. Our method can be applied on most capture and sequencing platforms without modifying the experimental protocol.

Item
Statistical modeling for sensitive detection of low-frequency single nucleotide variants
(BioMed Central, 2016-08-22) Hao, Yangyang; Zhang, Pengyue; Xuei, Xiaoling; Nakshatri, Harikrishna; Edenberg, Howard J.; Li, Lang; Liu, Yunlong; Department of Medical and Molecular Genetics, IU School of Medicine
BACKGROUND: Sensitive detection of low-frequency single nucleotide variants carries great significance in many applications. In cancer genetics research, tumor biopsies are a mixture of normal and tumor cells from various subpopulations due to tumor heterogeneity. Thus, the frequencies of somatic variants from a subpopulation tend to be low. Liquid biopsies, which monitor circulating tumor DNA in blood to detect metastatic potential, also face the challenge of detecting low-frequency variants due to the small percentage of circulating tumor DNA in blood.
Moreover, in population genetics research, although pooled sequencing of a large number of individuals is cost-effective, pooling dilutes the signals of variants from any individual. Detection of low-frequency variants is difficult and can be confounded by sequencing artifacts. Existing methods are limited in sensitivity and mainly focus on frequencies around 2% to 5%; most fail to consider differential sequencing artifacts.
RESULTS: We aimed to push the frequency detection limit down close to the position-specific sequencing error rates by modeling the observed erroneous read counts with respect to genomic sequence contexts. Four distributions suitable for count data modeling (using generalized linear models) were extensively characterized in terms of their goodness-of-fit as well as their performance on real sequencing data benchmarks, which were specifically designed for testing detection of low-frequency variants; two sequencing technologies with significantly different chemistry mechanisms were used to explore systematic errors. We found that the zero-inflated negative binomial generalized linear model is superior to the other models tested, and the advantage is most evident in the 0.5% to 1% range. This method is also generalizable to different sequencing technologies. Under standard sequencing protocols and the depth given in the testing benchmarks, 95.3% recall and 79.9% precision for Ion Proton data, and 95.6% recall and 97.0% precision for Illumina MiSeq data, were achieved for SNVs with frequency ≥ 1%, while the detection limit is around 0.5%.
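To make the zero-inflated negative binomial (ZINB) error model named in the RESULTS above more concrete, the following minimal sketch evaluates a ZINB distribution for the erroneous read count at one position and derives a read-count cutoff from it. It is an illustration under stated assumptions, not the published implementation: the parameters `mu`, `size`, and `pi_zero` would in practice come from a fitted generalized linear model, the values used here are hypothetical, and the function names and the significance level `alpha` are introduced only for this example.

```python
import math


def nb_pmf(k, mu, size):
    """Negative binomial pmf parameterized by mean `mu` and dispersion `size`."""
    log_p = (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
             + size * math.log(size / (size + mu))
             + k * math.log(mu / (size + mu)))
    return math.exp(log_p)


def zinb_pmf(k, mu, size, pi_zero):
    """ZINB pmf: extra point mass `pi_zero` at zero mixed with a negative binomial."""
    base = (1.0 - pi_zero) * nb_pmf(k, mu, size)
    return pi_zero + base if k == 0 else base


def zinb_sf(k, mu, size, pi_zero):
    """Upper tail P(X >= k) under the ZINB error model."""
    return 1.0 - sum(zinb_pmf(i, mu, size, pi_zero) for i in range(k))


def min_variant_reads(mu, size, pi_zero, alpha=1e-3, max_k=10000):
    """Smallest count k whose upper-tail probability under the error model
    falls below `alpha`; counts at or above k are unlikely to be pure error."""
    for k in range(1, max_k + 1):
        if zinb_sf(k, mu, size, pi_zero) < alpha:
            return k
    return None


# Hypothetical positions: one with a low expected error count, one with a high one.
quiet_cutoff = min_variant_reads(mu=2.0, size=5.0, pi_zero=0.3)
noisy_cutoff = min_variant_reads(mu=8.0, size=5.0, pi_zero=0.3)
print(quiet_cutoff, noisy_cutoff)
```

Dividing such a cutoff by the sequencing depth at the position gives a rough position-specific frequency detection limit, which is why positions with a higher expected error count need more supporting reads before a variant can be called.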
CONCLUSIONS: Our method enables sensitive detection of low-frequency single nucleotide variants across different sequencing platforms and will facilitate research and clinical applications such as pooled sequencing, cancer early detection, prognostic assessment, metastatic monitoring, and relapses or acquired resistance identification.

Item
A system for detecting high impact-low frequency mutations in primary tumors and metastases
(Springer Nature, 2018-01-11) Anjanappa, Manjushree; Hao, Yangyang; Simpson, Edward R; Bhat-Nakshatri, Poornima; Nelson, Jennifer B; Tersey, Sarah A; Mirmira, Raghavendra G; Cohen-Gadol, Aaron A; Saadatzadeh, M. Reza; Li, Lang; Fang, Fang; Nephew, Kenneth P.; Miller, Kathy D.; Liu, Yunlong; Nakshatri, Harikrishna; Medical and Molecular Genetics, School of Medicine
Tumor complexity and intratumor heterogeneity contribute to subclonal diversity. Despite advances in next-generation sequencing (NGS) and bioinformatics, detecting rare mutations in primary tumors and metastases contributing to subclonal diversity is a challenge for precision genomics. Here, in order to identify rare mutations, we adapted a recently described epithelial reprogramming assay for short-term propagation of epithelial cells from primary and metastatic tumors. Using this approach, we expanded minor clones and obtained epithelial cell-specific DNA/RNA for quantitative NGS analysis. Comparative Ampliseq Comprehensive Cancer Panel sequence analyses were performed on DNA from unprocessed breast tumor and tumor cells propagated from the same tumor. We identified previously uncharacterized mutations present only in the cultured tumor cells, a subset of which has been reported in brain metastatic but not primary breast tumors. In addition, whole-genome sequencing identified mutations enriched in liver metastases of various cancers, including Notch pathway mutations/chromosomal inversions in 5/5 liver metastases, irrespective of cancer types.
Mutations/rearrangements in FHIT, involved in purine metabolism, were detected in 4/5 liver metastases, and the same four liver metastases shared mutations in 32 genes, including mutations of different HLA-DR family members affecting OX40 signaling pathway, which could impact the immune response to metastatic cells. Pathway analyses of all mutated genes in liver metastases showed aberrant tumor necrosis factor and transforming growth factor signaling in metastatic cells. Epigenetic regulators including KMT2C/MLL3 and ARID1B, which are mutated in >50% of hepatocellular carcinomas, were also mutated in liver metastases. Thus, irrespective of cancer types, organ-specific metastases may share common genomic aberrations. Since recent studies show independent evolution of primary tumors and metastases and in most cases mutation burden is higher in metastases than primary tumors, the method described here may allow early detection of subclonal somatic alterations associated with metastatic progression and potentially identify therapeutically actionable, metastasis-specific genomic aberrations.