Machine learning models incorporating genotype and ancestry improve severe asthma risk prediction
Date
Language
Embargo Lift Date
Department
Committee Members
Degree
Degree Year
Department
Grantor
Journal Title
Journal ISSN
Volume Title
Found At
Abstract
This study proposes a novel machine learning (ML)-based stacking technique that integrates Single Nucleotide Polymorphisms (SNPs) and inferred local ancestry (LA) to improve predictive accuracy in clinical outcomes. Asthma, particularly severe asthma (SA) with poor response to inhaled corticosteroids (ICS), serves as the case study to illustrate this approach. Using data from the Biorepository and Integrative Genomics (BIG) Initiative, which includes whole-exome sequenced data from a self-reported African American pediatric cohort (N=248), we develop an ML framework to predict ICS response. After SNP data preprocessing and LA estimation, we employ stratified 10-fold cross-validation, creating base pipelines for SNP and LA data, which are then combined in stacked pipelines to assess the effectiveness of integrating these distinct data types. The stacked SNP pipeline yields an AUC of 0.693 ± 0.066 and the stacked LA pipeline yields an AUC of 0.625 ± 0.103. The integration of LA with SNP data significantly improves predictive performance, boosting the AUC to 0.729 ± 0.048 (paired t-test p-value = 0.005). Pipelines using LA data alone shows comparable performance to those using SNP data alone. However, the most important contributing features are distinct between LA and SNP data demonstrating that these data types capture distinct sources of variation and could provide complementary insights. This study highlights the potential of stacking ML pipelines, based on feature selection techniques and along with logistic regression and random forest predictive models, to integrate SNP and LA data. Such holistic approach has the promise to improve predictive performance of medication response in complex conditions like SA. This approach has broader implications for advancing personalized medicine through the effective use of multifactorial data.
