Biostatistics Department Theses and Dissertations

Recent Submissions

  • Item
    Statistical and Generative Methods for Brain Functional Connectivity Matrices
    (2026-05) Xu, Yixi; Zhao, Yi; Tu, Wanzhu; Sun, Dayu; Yan, Jingwen
    Functional connectivity (FC), derived from resting-state functional Magnetic Resonance Imaging (fMRI), serves as a powerful tool for revealing brain co-activation patterns, identifying functional networks, and understanding brain organization. FC is commonly represented by covariance matrices that characterize the temporal dependencies among predefined brain regions. Despite the richness of information covariance matrices provide, statistical modeling of such covariance matrices remains challenging due to dimensionality, limited interpretability, and the non-Euclidean geometry of symmetric positive definite (SPD) matrices. To address these challenges, we introduce a suite of novel statistical methods that enable interpretable and flexible modeling of FC-derived covariance matrices to advance the understanding of brain mechanisms. We first propose a causal mediation framework with a covariance matrix as a graph mediator. We define causal estimands under a structural equation modeling framework, introduce a low-rank representation of covariance matrices, and develop likelihood-based estimators for identifying both mediation effects and low-dimensional structures. Simulation studies demonstrate that the proposed method achieves comparable performance to existing approaches under various scenarios. The framework is applied to resting-state fMRI data to evaluate the mediation effect of FC in explaining differences in motor task performance by sex. Next, we propose a parsimonious clustering model that integrates a Mixture-of-Experts structure with a covariance regression framework. This approach clusters subjects based on their brain connectivity while allowing cluster membership to vary with subject-level covariates. Simulation results demonstrate the superior performance of the proposed method relative to existing methods. Application to fMRI data reveals clinically relevant subgroups along with associated cognitive and demographic characteristics.
Lastly, we address the inherent challenge of fMRI data scarcity by developing a geometry-aware generative modeling framework for functional connectivity data. Recognizing the non-Euclidean geometry of SPD matrices, we adopt a Log-Euclidean representation for generative modeling of FC matrices. We further introduce a novel integration of diffusion transformers (DiT) and a rectified-flow strategy to achieve scalable and efficient synthesis of realistic functional connectivity matrices.
  • Item
    Detecting Precise Adverse Drug Event (ADE) Signals from Real-World Data
    (2026-01) Shi, Yi; Zhang, Pengyue; Tu, Wanzhu; Zang, Yong; Nan, Hongmei
    Adverse drug events (ADEs) are a major public health burden, yet many risks remain undetected before drugs reach the market. While large-scale real-world data (RWD) offers a powerful resource for post-market surveillance, its use is hindered by methodological challenges like confounding, model misspecification, and high false positive rates. This dissertation develops and applies novel statistical methods to overcome these challenges. The research focuses on detecting nuanced drug safety signals, specifically identifying subpopulation-specific ADEs, timing-dependent drug-drug interactions (DDIs), and complex drug-drug-host interactions (DDHIs). Three models were developed and applied to a large U.S. administrative claims database. The Precision Mixture Risk Model (PMRM) uses a case-crossover design to find ADEs in specific patient subgroups while controlling confounding and false discovery rates (FDR). The Sensitive and Timing-awarE Model (STEM) identifies DDIs by accounting for the sequence of drug administration. Finally, the Trajectory-Informed Model (TIM), coupled with an optimal control selection strategy, detects DDHIs where risk is amplified in patients with specific characteristics. The models successfully identified numerous signals missed by traditional methods. PMRM revealed drugs posing risks only in distinct demographic and clinical subgroups. STEM detected substantially more DDI signals than conventional approaches, including interactions with timing-dependent risks. TIM identified thousands of potential DDIs and DDHIs, demonstrating that many adverse interactions manifest exclusively within patient subgroups defined by composite risk factors (e.g., age, sex). The proposed methods consistently showed superior detection power while maintaining rigorous FDR control. This dissertation delivers a robust statistical framework for precision pharmacovigilance. 
By effectively identifying complex drug-host, drug-drug, and drug-drug-host interactions from RWD, these models support a more personalized approach to prescribing. This work enables the anticipation and mitigation of ADE risks based on individual patient profiles, ultimately advancing drug safety.
  • Item
    Bayesian and Deep Learning Extensions to Dynamic Treatment Regime
    (2025-12) Zhang, Xuan; Tu, Wanzhu; Bakoyannis, Giorgos; Liang, Yao; Su, Jiang
    This dissertation introduces new statistical and machine learning methods for estimating optimal dynamic treatment regimes (DTRs) to aid the practice of personalized medicine. Traditional Q-learning often struggles in high-dimensional treatment spaces, especially with unbalanced treatment assignment, thus limiting its clinical utility. To address these challenges, I propose two extensions. The first is Bayesian Weighted Q-learning, which incorporates prior information from previous clinical trials to stabilize inference in small-sample or imbalanced settings. Adaptive Bayesian priors reduce bias from uneven allocation while improving interpretability. The second is a Conditional Variational Autoencoder (CVAE) Q-learning approach, which uses deep generative models to compress complex, high-dimensional treatment combinations into a low-dimensional latent space. This enables more accurate estimation of treatment effects and supports the discovery of optimal multi-treatment strategies. Both methods are integrated into multi-stage decision-making via backward induction and evaluated through extensive simulations and real-world data from the Systolic Blood Pressure Intervention Trial (SPRINT). Results show that the approaches can uncover clinically meaningful regimes, highlighting heterogeneity in treatment benefit across patient subgroups. To support practical use, I introduce the QlearningPlus R package, which implements standard Q-learning alongside the proposed extensions, providing a unified toolset for researchers and clinicians.
  • Item
    Exploration in Alzheimer's Disease and Epigenetic Age
    (2025-07) Robling, Charles Oliver; Huang, Kun; Fadel, William; Johnson, Travis
    Alzheimer’s Disease is a progressive neurodegenerative disease resulting in impaired cognition and function. The prevalence of Alzheimer’s Disease has increased steadily in the United States as the average lifespan has risen. Previous research has suggested that aging patterns are not identical for each person. Various epigenetic clock models have been developed to assess the biological or metabolic age of a person, regardless of chronological age, based on DNA methylation gene expression values. For the benefit of this analysis, we propose using epigenetic aging algorithms to estimate biological age in postmortem patients and merging these data with Alzheimer’s Disease data, exploring various correlations and relationships. The results show that biological age is not a significant predictor of Alzheimer’s Disease diagnosis.
  • Item
    A Bayesian Design For Platform Trials With Temporal Changes
    (2025-05) Zhang, Chen; Zang, Yong; Fadel, William F.; Zhang, Pengyue
    The platform trial, which aims to find the best treatment for a disease by sequentially investigating multiple treatments in a single trial, has become increasingly popular in recent decades. An inherent problem for a platform trial is how to borrow information from the non-concurrent controls to improve the efficiency of the statistical inference. The practical solution of directly combining all the control patients does not work due to the population heterogeneity between the concurrent and non-concurrent controls. Temporal changes are a significant source of that heterogeneity, as they affect patients’ responses over time. In this paper, we develop a Bayesian design to evaluate treatment effects of platform trials accounting for temporal changes. We treat each cohort of patients as a matching set and develop a conditional likelihood method to eliminate the impact of temporal changes. The performance of the proposed method is evaluated through simulation studies.
  • Item
    Transcriptomic Analysis of Survival of Pulmonary Arterial Hypertension Patients
    (2025-05) Gomez Aleman, Adrian; Liu, Yunlong; Schwantes-An, Tae-Hwi Linus; Fadel, William; Reiter, Jill
    Pulmonary arterial hypertension (PAH) is a rare and often fatal condition characterized by obliterative pulmonary artery (PA) remodeling, inflammation, and metabolic reprogramming leading to increased pulmonary vascular resistance (PVR) and right heart failure. To elucidate the genetic causes for disease risk, progression, and outcomes in PAH, many genetic studies, including genome-wide association studies (GWAS), have been conducted. These efforts culminated in identifying both rare and common genetic variants that alter the risk for developing PAH. However, the genetic underpinning of outcomes in PAH remains largely unidentified. To address this crucial gap in developing treatments for PAH, we used the PAH Biobank, which included over 1,000 patients with PAH from diverse genetic ancestry groups, to identify transcriptomic signatures that stratify the hazard of all-cause mortality among patients with PAH. Using available whole-blood RNA-Seq data, we conducted a survival analysis for all-cause mortality or transplant stratified by genetic ancestry groups using the Cox proportional hazards model. RNA-Seq data were quantified using SALMON and normalized using the DESeq2 package in R. Both normalized and tertile gene expression levels were tested for association with survival while adjusting for age at diagnosis, sex, type of PAH, PVR, neutrophils, and the 5 principal components in the survival analysis. A two-stage analysis with EUR as the discovery cohort and AFR and AMR as two independent replication cohorts was performed. A Bonferroni correction was applied to adjust for the number of discovery tests conducted. In total, there were 848 EUR (European genetic ancestry), 81 AFR (African genetic ancestry), and 103 AMR (Admixed American genetic ancestry) participants for analyses. In the discovery cohort, 45,915 genes were tested, and 8 genes were statistically significantly associated with the hazard.
Three gene associations (REXO2, FHL2, and CABP4) were replicated (p-value < 0.05 with an exact direction of effect on hazard) in both replication cohorts (AFR, AMR). Using one of the largest cohorts of patients with PAH, we identified three genes that are significantly associated with all-cause mortality across populations. These genes represent potential targets for therapeutic developments as well as for understanding the biological underpinning of progression in PAH.
  • Item
    Comparing Nanopore to MethylationEPIC Array and EM-Seq in DNA Methylation Detection
    (2024-12) Brooks, Steven; Liu, Yunlong; Peng, Gang; Zhang, Pengyue
    DNA methylation is an important biological process in epigenetics, and many methods have been developed to profile DNA methylation. Recently, a growing number of studies use Nanopore long-read sequencing technology in DNA methylation detection, in contrast to widely used Infinium arrays and short-read whole genome sequencing (WGS) methods. In this study, we evaluate the performance of Nanopore sequencing in DNA methylation detection by comparing it to the Illumina MethylationEPIC microarray (EPIC) and Enzymatic Methyl-Sequencing (EM-Seq). We first compare Oxford Nanopore Technologies’ Nanopore with the MethylationEPIC array. Among the ~850,000 CpG sites covered by both methods, we observed high concordance (R ≥ 0.94 across all four samples). After downsampling Nanopore data from an average coverage of 26.6 reads per site to 10 reads per site, the correlation in CpG methylation remained high (R ≥ 0.935). Next, we compare Nanopore with EM-Seq in the context of low coverage. The lower CpG methylation correlation (R ≥ 0.8) can be attributed to reduced coverage of hypomethylated CpG sites by EM-Seq. Furthermore, we highlight Nanopore’s unique capabilities, including native DNA sequencing that can differentiate modification types and the use of long reads for haplotype phasing. Overall, Nanopore demonstrated high concordance with the EPIC array and more uniform coverage across the genome than EM-Seq. This study provides insights for researchers in selecting appropriate DNA methylation detection methods, considering factors such as cost, DNA input, and the complexity of downstream analysis.
  • Item
    Bayesian Adaptive Designs for Phase II Clinical Trials Evaluating Subgroup-Specific Treatment Effect
    (2024-12) Shan, Mu; Zang, Yong; Han, Jiali; Tu, Wanzhu; Zhang, Pengyue
    In Phase II clinical trials, particularly for molecularly targeted agents (MTAs) and biotherapies, there is a critical need to evaluate subgroup-specific treatment effects due to the heterogeneous nature of these therapies. This dissertation introduces two innovative Bayesian adaptive designs for biomarker-guided clinical trials: the Bayesian Order Constrained Adaptive (BOCA) design and the Bayesian Adaptive Marker-Stratified Design Using Calibrated Spike-and-Slab priors (SSS). The BOCA design addresses the limitations of the "one-size-fits-all" approach in non-randomized Phase II trials by efficiently detecting subgroup-specific treatment effects. It combines elements of enrichment and sequential designs, starting with an "all-comers" stage and transitioning to an enrichment stage based on interim analysis results. The decision to continue with either the marker-positive or marker-negative subgroup is guided by two posterior probabilities utilizing inherent ordering constraints. This adaptive approach enhances trial efficiency and cost-effectiveness while managing missing biomarker data. Comprehensive simulation studies show that the BOCA design outperforms conventional designs in detecting subgroup-specific treatment effects, making it a robust tool for Phase II trials. The SSS design improves the efficiency of marker-stratified designs (MSD) by leveraging clinical features of biomarkers and treatments. Patients are classified into marker-positive and marker-negative subgroups and randomized to receive either the MTA or a control treatment. The SSS design uses spike-and-slab priors to dynamically share information on response rates across subgroups, governed by two posterior probabilities that assess similarities in response rates. Additionally, it incorporates a Bayesian multiple imputation method to address missing biomarker profiles. 
Simulation studies confirm that the SSS design exhibits favorable operating characteristics, surpassing conventional designs in evaluating subgroup-specific treatment effects. Both the BOCA and SSS designs represent significant advancements in Bayesian adaptive methodologies for Phase II trials. By addressing the limitations of traditional approaches, these designs enhance the evaluation of subgroup-specific treatment effects, contributing valuable methodologies to the field of personalized medicine.
  • Item
    Statistical Deep Learning of Multivariate Longitudinal Data
    (2024-11) Li, Yunyi; Gao, Sujuan; Liu, Hao; Apostolova, Liana G.; Li, Xiaochun; Zhao, Yi
    Nowadays, various types of longitudinal data, including continuous, binary, and count data, are increasingly collected in numerous scientific research fields such as Alzheimer’s disease studies. Despite the wealth of data, the complex structure of multivariate longitudinal data presents significant modeling challenges. For years, scientific research has been actively exploring dynamic interactions among multiple components and understanding how interventions can impact outcomes over time with complex underlying dynamics. However, statistical methods for modeling these dynamic changes and associations are still limited. To address these gaps, we propose a novel nonparametric method to describe the mean temporal changes of sparsely and irregularly observed multivariate longitudinal data. This method is based on an Ordinary Differential Equation (ODE) system approximated by neural networks. Furthermore, we present a novel approach that treats the initial values of ODEs as an unknown parameter vector, a departure from existing methods that either pre-specify the initial values or estimate them in an ad hoc manner. In the second topic, we propose deep latent ODE models. These models nonparametrically model latent temporal trends by an unknown function of an ODE system and parametrically estimate the effects of covariates using Bayesian approaches. To address the intractability of the posterior distribution of initial values, we employ a variational autoencoder (VAE) algorithm. The approximate posterior distribution is characterized by a recurrent neural network (RNN), and high-dimensional hyperparameters are estimated using the stochastic gradient descent method based on Kullback-Leibler (KL) divergence. Lastly, we propose Bayesian generalized random effects models for modeling longitudinal data from various distributions, including longitudinal counts and longitudinal binary outcomes.
This model extends traditional generalized linear mixed effect models (GLMMs) to generalized semi-parametric mixed effect models. It assumes a nonparametric baseline function with a stochastic process prior, and parameters are estimated using the Bayesian approach. The proposed model is practical and can be applied to various types of longitudinal data, including longitudinal binary and count data. Neural ODE, RNN, variational inference, and KL divergence techniques are also applied in this project.
  • Item
    Identify Signature Genes/Pathways to Characterize Alzheimer's Disease Subtypes Based on Uncoupled Tauopathies and Cognitive Decline
    (2024-06) Huang, Xiaoqing; Huang, Kun; Zhang, Jie; Johnson, Travis; Zhang, Jianjun
    Alzheimer's disease (AD) is a slow-progressing dementia usually found in the elderly, with heterogeneous clinical phenotypes and possible underlying mechanisms. Widespread tauopathy is one of the pathological hallmarks of AD brains, in which the microtubule-associated protein tau forms scar-like neurofibrillary tangles that kill neurons. However, subgroups of patients present tauopathy progression that does not match their cognitive decline. A detailed study of these so-called atypical AD patients allows for a deeper understanding of the various possible disease mechanisms and the factors contributing to disease vulnerability or resilience, which can help guide drug development and treatment strategies tailored to different subgroups, as well as establish foundations for disease prevention. By identifying specific molecular biomarkers associated with each subtype, I hope to help clinicians diagnose various AD subtypes at an earlier stage. In this work, I have performed transcriptomic and proteomic characterization of two atypical AD subtypes on two large AD/normal brain cohorts to further understand the role of tauopathy in the AD etiology, identified several pathways that are associated with the two phenotypes’ AD-resilient and AD-vulnerable characteristics, and tried to identify potential drug targets for the precision treatment of AD using extensive bioinformatic approaches. Meanwhile, two methodologies were developed and applied. One is a new type of interpretable deep learning model (ParsVNN), which couples the neural network architecture with the hierarchical structure of gene/protein pathways and is leveraged to address the complexity and improve the interpretability by making its biological hierarchy simple and specific to the predicted subgroup.
The other is a label-transfer approach using optimal transport from brain samples to blood samples in the hope of finding serum biomarkers for atypical AD groups in live patients and predicting their disease progression in a non-invasive fashion. In conclusion, the study improves our understanding of AD etiology and leads to more personalized care and disease prevention. It acknowledges the complexity of the disease and aims to uncover mechanistic distinctions within the broad Alzheimer’s disease spectrum.