Biostatistics Department Theses and Dissertations
Recent Submissions
Comparing Nanopore to MethylationEPIC Array and EM-Seq in DNA Methylation Detection (2024-12)
Brooks, Steven; Liu, Yunlong; Peng, Gang; Zhang, Pengyue
DNA methylation is an important biological process in epigenetics, and many methods have been developed to profile it. Recently, a growing number of studies have used Nanopore long-read sequencing technology for DNA methylation detection, in contrast to the widely used Infinium arrays and short-read whole genome sequencing (WGS) methods. In this study, we evaluate the performance of Nanopore sequencing in DNA methylation detection by comparing it to the Illumina MethylationEPIC microarray (EPIC) and Enzymatic Methyl-Sequencing (EM-Seq). We first compare Oxford Nanopore Technologies’ Nanopore with the MethylationEPIC array. Among the ~850,000 CpG sites covered by both methods, we observed high concordance (R ≥ 0.94 across all four samples). After downsampling Nanopore data from an average coverage of 26.6 reads per site to 10 reads per site, the correlation in CpG methylation remained high (R ≥ 0.935). Next, we compare Nanopore with EM-Seq in the context of low coverage. The lower CpG methylation correlation (R ≥ 0.8) can be attributed to reduced coverage of hypomethylated CpG sites by EM-Seq. Furthermore, we highlight Nanopore’s unique capabilities, including native DNA sequencing that can differentiate modification types and the use of long reads for haplotype phasing. Overall, Nanopore demonstrated high concordance with the EPIC array and more uniform coverage across the genome than EM-Seq.
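As a rough illustration of the downsampling comparison described above, the sketch below subsamples per-site read counts to a target depth and recomputes the Pearson correlation against array-style beta values. It is not the thesis's pipeline: the data are synthetic, and the function names (`downsample_counts`, `simulate_concordance`), the Beta/Poisson/normal simulation settings, and the 0.03 array noise level are all assumptions made for illustration.

```python
import numpy as np

def downsample_counts(meth, total, target, rng):
    """Subsample reads without replacement to at most `target` reads per site."""
    keep = np.minimum(total, target)
    # Draw `keep` reads from `meth` methylated and `total - meth` unmethylated reads.
    meth_ds = rng.hypergeometric(meth, total - meth, keep)
    return meth_ds, keep

def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])

def simulate_concordance(n_sites=5000, mean_cov=26.6, target=10, seed=0):
    """Synthetic example: sequencing-style vs array-style methylation levels."""
    rng = np.random.default_rng(seed)
    beta_true = rng.beta(0.5, 0.5, size=n_sites)           # assumed bimodal methylation
    array_beta = np.clip(beta_true + rng.normal(0.0, 0.03, size=n_sites), 0.0, 1.0)
    total = np.maximum(rng.poisson(mean_cov, size=n_sites), 1)  # at least one read per site
    meth = rng.binomial(total, beta_true)                   # methylated read counts
    meth_ds, keep = downsample_counts(meth, total, target, rng)
    r_full = pearson_r(meth / total, array_beta)
    r_down = pearson_r(meth_ds / keep, array_beta)
    return r_full, r_down

if __name__ == "__main__":
    r_full, r_down = simulate_concordance()
    print(f"R at full coverage: {r_full:.3f}; R after downsampling: {r_down:.3f}")
```

Sampling without replacement (hypergeometric) mimics discarding reads at random, so the downsampled methylation fraction stays unbiased while its variance grows.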
This study provides insights for researchers in selecting appropriate DNA methylation detection methods, considering factors such as cost, DNA input, and the complexity of downstream analysis.

Bayesian Adaptive Designs for Phase II Clinical Trials Evaluating Subgroup-Specific Treatment Effect (2024-12)
Shan, Mu; Zang, Yong; Han, Jiali; Tu, Wanzhu; Zhang, Pengyue
In Phase II clinical trials, particularly for molecularly targeted agents (MTAs) and biotherapies, there is a critical need to evaluate subgroup-specific treatment effects due to the heterogeneous nature of these therapies. This dissertation introduces two innovative Bayesian adaptive designs for biomarker-guided clinical trials: the Bayesian Order Constrained Adaptive (BOCA) design and the Bayesian Adaptive Marker-Stratified Design Using Calibrated Spike-and-Slab priors (SSS). The BOCA design addresses the limitations of the "one-size-fits-all" approach in non-randomized Phase II trials by efficiently detecting subgroup-specific treatment effects. It combines elements of enrichment and sequential designs, starting with an "all-comers" stage and transitioning to an enrichment stage based on interim analysis results. The decision to continue with either the marker-positive or marker-negative subgroup is guided by two posterior probabilities utilizing inherent ordering constraints. This adaptive approach enhances trial efficiency and cost-effectiveness while managing missing biomarker data. Comprehensive simulation studies show that the BOCA design outperforms conventional designs in detecting subgroup-specific treatment effects, making it a robust tool for Phase II trials. The SSS design improves the efficiency of marker-stratified designs (MSD) by leveraging clinical features of biomarkers and treatments. Patients are classified into marker-positive and marker-negative subgroups and randomized to receive either the MTA or a control treatment.
The SSS design uses spike-and-slab priors to dynamically share information on response rates across subgroups, governed by two posterior probabilities that assess similarities in response rates. Additionally, it incorporates a Bayesian multiple imputation method to address missing biomarker profiles. Simulation studies confirm that the SSS design exhibits favorable operating characteristics, surpassing conventional designs in evaluating subgroup-specific treatment effects. Both the BOCA and SSS designs represent significant advancements in Bayesian adaptive methodologies for Phase II trials. By addressing the limitations of traditional approaches, these designs enhance the evaluation of subgroup-specific treatment effects, contributing valuable methodologies to the field of personalized medicine.

Statistical Deep Learning of Multivariate Longitudinal Data (2024-11)
Li, Yunyi; Gao, Sujuan; Liu, Hao; Apostolova, Liana G.; Li, Xiaochun; Zhao, Yi
Various types of longitudinal data, including continuous, binary, and count data, are increasingly collected in numerous scientific research fields such as Alzheimer’s disease studies. Despite the wealth of data, the complex structure of multivariate longitudinal data presents significant modeling challenges. For years, scientific research has been actively exploring dynamic interactions among multiple components and how interventions can impact outcomes over time under complex underlying dynamics. However, statistical methods for modeling these dynamic changes and associations are still limited. To address these gaps, we propose a novel nonparametric method to describe the mean temporal changes of sparsely and irregularly observed multivariate longitudinal data. This method is based on an Ordinary Differential Equation (ODE) system approximated by neural networks.
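To make the idea of an ODE system approximated by a neural network more concrete, the following sketch integrates dy/dt = f(y), with f a small one-hidden-layer network, and evaluates the mean trajectory at irregular observation times. It is only an illustration under stated assumptions: the network size, the fixed-step Euler solver, and the names `mlp_rhs`, `solve_at_times`, and `masked_mse` are hypothetical and are not taken from the dissertation.

```python
import numpy as np

def mlp_rhs(y, params):
    """Right-hand side dy/dt = f(y), with f a one-hidden-layer tanh network."""
    W1, b1, W2, b2 = params
    return np.tanh(y @ W1 + b1) @ W2 + b2

def solve_at_times(y0, obs_times, params, t0=0.0, n_sub=20):
    """Euler-integrate the ODE and return the state at each (sorted) observation time."""
    y = np.asarray(y0, dtype=float)
    t = t0
    states = []
    for t_next in obs_times:
        h = (t_next - t) / n_sub          # step size adapts to irregular spacing
        for _ in range(n_sub):
            y = y + h * mlp_rhs(y, params)
        t = t_next
        states.append(y)
    return np.stack(states)

def masked_mse(pred, obs, mask):
    """Mean squared error over observed entries only (mask is True where observed)."""
    diff = (pred - obs)[mask]
    return float(np.mean(diff ** 2))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim, hidden = 2, 8
    params = (rng.normal(0, 0.3, (dim, hidden)), np.zeros(hidden),
              rng.normal(0, 0.3, (hidden, dim)), np.zeros(dim))
    times = np.array([0.3, 0.9, 2.5])      # irregular visit times
    pred = solve_at_times(np.array([1.0, -1.0]), times, params)
    obs = pred + rng.normal(0, 0.1, pred.shape)
    mask = np.array([[True, False], [True, True], [False, True]])  # sparse observations
    print(masked_mse(pred, obs, mask))
```

In practice the network weights would be fitted by minimizing a loss such as `masked_mse` with an automatic-differentiation library and an adaptive ODE solver; plain Euler steps are used here only to keep the sketch self-contained.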
Furthermore, we present a novel approach that treats the initial values of the ODEs as an unknown parameter vector, a departure from existing methods that either pre-specify the initial values or estimate them in an ad hoc manner. In the second topic, we propose deep latent ODE models. These models nonparametrically model latent temporal trends by an unknown function of an ODE system and parametrically estimate the effects of covariates using Bayesian approaches. To address the intractability of the posterior distribution of initial values, we employ a variational autoencoder (VAE) algorithm. The approximate posterior distribution is characterized by a recurrent neural network (RNN), and high-dimensional hyperparameters are estimated using the stochastic gradient descent method based on Kullback-Leibler (KL) divergence. Lastly, we propose Bayesian generalized random effects models for modeling longitudinal data from various distributions, including longitudinal counts and longitudinal binary outcomes. This model extends traditional generalized linear mixed effect models (GLMMs) to generalized semi-parametric mixed effect models. It assumes a nonparametric baseline function with a stochastic process prior, and parameters are estimated using the Bayesian approach. The proposed model is practical and can be applied to various types of longitudinal data, including longitudinal binary and count data. Neural ODE, RNN, variational inference, and KL divergence techniques are also applied in this project.

Identify Signature Genes/Pathways to Characterize Alzheimer's Disease Subtypes Based on Uncoupled Tauopathies and Cognitive Decline (2024-06)
Huang, Xiaoqing; Huang, Kun; Zhang, Jie; Johnson, Travis; Zhang, Jianjun
Alzheimer's disease (AD) is a slow-progressing dementia usually found in the elderly, with heterogeneous clinical phenotypes and possible underlying mechanisms.
Widespread tauopathy is one of the pathological hallmarks of AD brains, in which the microtubule-associated protein tau forms scar-like neurofibrillary tangles that kill neurons. However, subgroups of patients present tauopathy progression that does not match their cognitive decline. A detailed study of these so-called atypical AD patients allows for a deeper understanding of the various possible disease mechanisms and the factors contributing to disease vulnerability or resilience, which can help guide drug development and treatment strategies tailored to different subgroups, as well as establish foundations for disease prevention. By identifying specific molecular biomarkers associated with each subtype, I hope to help clinicians diagnose various AD subtypes at an earlier stage. In this work, I have performed transcriptomic and proteomic characterization of two atypical AD subtypes on two large AD/normal brain cohorts to further understand the role of tauopathy in the AD etiology, identified several pathways that are associated with the two phenotypes’ AD-resilient and AD-vulnerable characteristics, and tried to identify potential drug targets for the precision treatment of AD using extensive bioinformatic approaches. Meanwhile, two methodologies were developed and applied. One is a new type of interpretable deep learning model (ParsVNN), which couples the neural network architecture with the hierarchical structure of gene/protein pathways; it is leveraged to address the complexity and improve the interpretability by making its biological hierarchy simple and specific to the predicted subgroup. The other is a label-transferring approach using optimal transport from brain samples to blood samples, in the hope of finding serum biomarkers for atypical AD groups in live patients and predicting their disease progression in a non-invasive fashion.
In conclusion, the study improves our understanding of AD etiology and leads to more personalized care and disease prevention. It acknowledges the complexity of the disease and aims to uncover mechanistic distinctions within the broad Alzheimer’s disease spectrum.

Modified 3+3 Design for MTD Re-estimation (2024-06)
Zhang, Tianshu; Zang, Yong; Han, Yan; Liu, Ziyue
The 3+3 clinical trial design is one of the most popular dose-finding designs used in phase I oncology trials to identify the maximum tolerated dose (MTD) for new treatment regimens. While this design is widely used due to its simplicity, it has some notable limitations, including a maximum of six patients per dose level and fixed target toxicity rates. To address these issues, we propose a modified 3+3 design that extends the traditional 3+3 design by treating the remaining patients at the MTD level for additional dose-limiting toxicity (DLT) assessment. This modification allows for a more flexible and accurate way to identify the MTD, enhanced by the use of isotonic regression to calculate DLT rates. To compare the modified 3+3 design and the traditional 3+3 design, computer simulation studies have been carried out under various dose-toxicity scenarios. The results show that the modified 3+3 design yields higher accuracy in MTD identification.

Transparent and Efficient Designs for Clinical Trials (2024-05)
Qiu, Yingjie; Zhao, Yi; Zang, Yong; Perkins, Susan; Zhang, Pengyue; Yan, Jingwen
Modern early phase clinical trials are integral in assessing the efficacy and safety of new treatments. Traditional methodologies heavily rely on complex parametric models to determine dose-response relationships. They come with inherent challenges: difficulty in practical validation, potential for poor performance if parametric assumptions are inaccurately defined, and a heavy learning burden for medical practitioners.
The need for novel methods that bridge the gap between statistical robustness and clinical applicability is evident. To address these issues, we propose two transparent and efficient designs. The modified isotonic-regression-based phase I/II clinical trial design (mISO) and the utility-based model-free phase I/II design (UFO) represent innovative strides in identifying optimal doses for clinical trials. The mISO design, eschewing traditional parametric assumptions, offers a transparent and efficient method, adaptable to various dose-response curves and enhanced by the mISO-B extension for delayed outcomes. In parallel, the UFO design, specifically tailored for immunotherapy trials, diverges from complex models to employ a dynamic, utility-based approach. This approach continuously updates with trial data, optimizing dose allocation for each patient cohort. In comprehensive simulation studies, both designs demonstrated superior performance compared with existing methods. Several sequential methods populate the statistical literature, but there remains a notable gap in addressing secondary objectives without altering the primary aim. Addressing this, a two-stage design for randomized controlled trials sequentially testing superiority and noninferiority introduces a novel two-stage group sequential strategy. This strategy primarily aims to establish the superiority of a treatment, assessed at both interim and final stages. Uniquely, it shifts to test noninferiority only if the superiority criterion is not met at the end of the second stage. This dual-focus approach is particularly appreciated in clinical settings for its practical application.
Furthermore, it provides a valuable alternative in scenarios where achieving sufficient power for the superiority objective is hindered by limited participant recruitment, allowing the study to pivot towards demonstrating noninferiority.

Statistical Methods for Cancer Research (2024-01)
Han, Yan; Zhao, Yi; Tu, Wanzhu; Li, Yang; Zhang, Jianjun
Phase I/II clinical trial design is pivotal for achieving optimal therapeutic effect in immunotherapy and drug combination therapy for cancer treatment. Additionally, the identification of biomarkers associated with the risk of severe complications during cancer therapy is a crucial research area. This dissertation contains three related topics, which focus on adaptive Phase I/II clinical trial design and the identification of biomarkers relevant to cancer research. The first topic focuses on developing a two-stage nonparametric (TSNP) phase I/II clinical trial design to identify the optimal biological dose (OBD) of immunotherapy. We derive the closed-form estimates of the joint toxicity-efficacy response probabilities under the monotonically increasing constraint for the toxicity outcomes. The first stage of the design aims to explore the toxicity profile. The second stage aims to find the OBD through a utility function. The simulation results show that the TSNP design yields operating characteristics superior to those of the existing Bayesian parametric designs. User-friendly computational software is freely available to facilitate the application of the proposed design to real trials. The second topic focuses on dose optimization in drug-combination trials. We propose the Great Wall design, which employs a "divide-and-conquer" algorithm to address the issue of partial order of toxicity. It constructs a candidate set of the most promising dose combinations using the mean utility method.
The patients assigned to the candidate set are followed to collect the survival outcomes, and the final optimal dose combination is then selected to maximize the survival benefit. A simulation study confirmed the desirable operating characteristics of the Great Wall design, compared with other conventional phase I/II designs for drug-combination trials. The last topic of my dissertation is the prospective assessment of risk biomarkers of sinusoidal obstruction syndrome (SOS) after hematopoietic cell transplantation (HCT). We aimed to define risk groups for SOS occurrence using three proteins, L-Ficolin, Hyaluronic Acid (HA), and Stimulation-2 (ST2), by assessing SOS incidence at day 35 post-HCT and overall survival (OS) at day 100 post-HCT. We conclude that L-Ficolin, HA, and ST2 levels measured as early as three days post-HCT improved risk stratification for SOS occurrence and OS.

Bayesian Adaptive Designs for Early Phase Clinical Trials (2023-07)
Guo, Jiaying; Zang, Yong; Han, Jiali; Zhao, Yi; Ren, Jie
Delayed toxicity outcomes are common in phase I clinical trials, especially in oncology studies. They cause logistical difficulties, waste resources, and prolong the trial duration. We propose the time-to-event 3+3 (T-3+3) design to solve the delayed outcome issue for the 3+3 design. We convert the dose decision rules of the 3+3 design into a series of events. A transparent yet efficient Bayesian probability model is applied to calculate the probabilities of these events in the presence of delayed outcomes, taking the pending patients' informative remaining follow-up time into consideration. The T-3+3 design only models the information for the pending patients and seamlessly reduces to the conventional 3+3 design in the absence of delayed outcomes.
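For readers unfamiliar with the conventional 3+3 rules that the T-3+3 design builds on, the sketch below encodes one common variant of those rules as explicit decisions and simulates a single trial with fully observed outcomes. The function names, the "stop means the next lower dose is the MTD" convention, and the handling of the highest dose are assumptions for illustration; published 3+3 variants differ in these details, and the Bayesian treatment of delayed outcomes is not shown.

```python
import numpy as np

def three_plus_three_decision(n_treated, n_dlt):
    """Decision at the current dose under one common variant of the 3+3 rules."""
    if n_treated == 3:
        if n_dlt == 0:
            return "escalate"      # 0/3 DLTs: move to the next higher dose
        if n_dlt == 1:
            return "expand"        # 1/3 DLTs: treat 3 more patients at this dose
        return "stop"              # >=2/3 DLTs: dose exceeds the MTD
    if n_treated == 6:
        return "escalate" if n_dlt <= 1 else "stop"
    raise ValueError("the 3+3 design evaluates cohorts of 3 or 6 patients")

def simulate_trial(true_dlt_probs, rng):
    """Simulate one trial; return the selected MTD index, or -1 if no dose is tolerable."""
    dose = 0
    while True:
        n, dlt = 3, rng.binomial(3, true_dlt_probs[dose])
        action = three_plus_three_decision(n, dlt)
        if action == "expand":
            dlt += rng.binomial(3, true_dlt_probs[dose])
            n = 6
            action = three_plus_three_decision(n, dlt)
        if action == "stop":
            return dose - 1        # assumed convention: MTD is the next lower dose
        if dose == len(true_dlt_probs) - 1:
            return dose            # highest dose tolerated: declare it the MTD
        dose += 1

if __name__ == "__main__":
    rng = np.random.default_rng(2023)
    probs = [0.05, 0.10, 0.25, 0.45]   # hypothetical true DLT rates per dose level
    picks = [simulate_trial(probs, rng) for _ in range(2000)]
    print(np.bincount(np.array(picks) + 1, minlength=len(probs) + 1) / len(picks))
```

The printed vector gives the simulated selection frequency for "no dose" followed by each dose level, which is the kind of operating characteristic the simulation studies above compare across designs.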
We further extend the proposed method to the interval 3+3 (i3+3) design, an algorithm-based phase I dose-finding design built on simple but more comprehensive rules that account for the variability in the observed data. Similarly, the dose escalation/de-escalation decision is recommended by comparing the event probabilities, which are calculated by considering the ratio between the averaged follow-up time for at-risk patients and the total assessment window. We evaluate the operating characteristics of the proposed designs through simulation studies and compare them to existing methods. The umbrella trial is a clinical trial strategy that accommodates the paradigm shift towards personalized medicine by evaluating multiple investigational drugs in different subgroups of patients with the same disease. A Bayesian adaptive umbrella trial design is proposed to select effective targeted agents for different biomarker-based subgroups of patients. To facilitate treatment evaluation, the design uses a mixture regression model that jointly models short-term and long-term response outcomes. In addition, a data-driven latent class model is employed to adaptively combine subgroups into induced latent classes based on overall data heterogeneities, which improves the statistical power of the umbrella trial. To enhance individual ethics, the design includes a response-adaptive randomization scheme with early stopping rules for futility and superiority. Bayesian posterior probabilities are used to make these decisions.
Simulation studies demonstrate that the proposed design outperforms two conventional designs across a range of practical treatment-outcome scenarios.

Sparse Latent-Space Learning for High-Dimensional Data: Extensions and Applications (2023-05)
White, Alexander James; Cao, Sha; Tu, Wanzhu; Zhang, Chi; Zhao, Yi
The successful treatment and potential eradication of many complex diseases, such as cancer, begins with elucidating the convoluted mapping of molecular profiles to phenotypical manifestation. Our observed molecular profiles (e.g., genomics, transcriptomics, epigenomics) are often high-dimensional and are collected from patient samples falling into heterogeneous disease subtypes. Interpretable learning from such data calls for sparsity-driven models. This dissertation addresses the high dimensionality, sparsity, and heterogeneity issues when analyzing multiple-omics data, where each method is implemented with a concomitant R package. First, we examine challenges in submatrix identification, which aims to find subgroups of samples that behave similarly across a subset of features. We resolve issues such as two-way sparsity, non-orthogonality, and parameter tuning with an adaptive thresholding procedure on the singular vectors computed via orthogonal iteration. We validate the method with simulation analysis and apply it to an Alzheimer’s disease dataset. The second project focuses on modeling relationships between large, matched datasets. Exploring regression structures between large datasets can provide insights such as the effect of long-range epigenetic influences on gene expression. We present a high-dimensional version of mixture multivariate regression to detect patient clusters, each with different correlation structures of matched-omics datasets. Results are validated via simulation and applied to matched-omics datasets.
In the third project, we introduce a novel approach to modeling spatial transcriptomics (ST) data with a spatially penalized multinomial model of the expression counts. This method solves the low-rank structures of zero-inflated ST data with spatial smoothness constraints. We validate the model using manual cell structure annotations of human brain samples. We then apply this technique to additional ST datasets.

Insights in Response to Statewide COVID-19 Sampling in Indiana (2023-05)
Shields, David William, Jr.; Yiannoutsos, Constantin; Fadel, William; Bakoyannis, Giorgos
During 2020, the Indiana State Department of Health conducted a longitudinal study of the novel severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the cause of COVID-19 disease, to understand the number of past and current infections as well as the prevalence of disease in the State of Indiana by administering a survey to participants as well as testing for exposure to SARS-CoV-2. The study consisted of 3 waves of testing, spread months apart, each consisting of a random sample and a non-random sample. The non-random sample was used to ensure the sample population was representative of the state of Indiana and was used as a stratum in the logistic regression model, allowing for adjustment for nonresponse. The findings indicate that persons of non-White race and persons of Hispanic ethnicity had the highest risk of exposure to the virus. Understanding the disparity in health in various racial and ethnic populations and addressing how different communities are impacted by the pandemic, as well as working with the community, is paramount when attempting to mitigate a pandemic. In addition, understanding the data from the ambient pandemic when instituting measures to mitigate the spread of viruses is also extremely important for managing health emergencies such as the COVID-19 pandemic.