Unsupervised representation learning on high-dimensional clinical data improves genomic discovery and prediction

dc.contributor.author: Yun, Taedong
dc.contributor.author: Cosentino, Justin
dc.contributor.author: Behsaz, Babak
dc.contributor.author: McCaw, Zachary R.
dc.contributor.author: Hill, Davin
dc.contributor.author: Luben, Robert
dc.contributor.author: Lai, Dongbing
dc.contributor.author: Bates, John
dc.contributor.author: Yang, Howard
dc.contributor.author: Schwantes-An, Tae-Hwi
dc.contributor.author: Zhou, Yuchen
dc.contributor.author: Khawaja, Anthony P.
dc.contributor.author: Carroll, Andrew
dc.contributor.author: Hobbs, Brian D.
dc.contributor.author: Cho, Michael H.
dc.contributor.author: McLean, Cory Y.
dc.contributor.author: Hormozdiari, Farhad
dc.contributor.department: Medical and Molecular Genetics, School of Medicine
dc.date.accessioned: 2024-10-15T08:28:32Z
dc.date.available: 2024-10-15T08:28:32Z
dc.date.issued: 2024
dc.description.abstract: Although high-dimensional clinical data (HDCD) are increasingly available in biobank-scale datasets, their use for genetic discovery remains challenging. Here we introduce an unsupervised deep learning model, Representation Learning for Genetic Discovery on Low-Dimensional Embeddings (REGLE), for discovering associations between genetic variants and HDCD. REGLE leverages variational autoencoders to compute nonlinear disentangled embeddings of HDCD, which become the inputs to genome-wide association studies (GWAS). REGLE can uncover features not captured by existing expert-defined features and enables the creation of accurate disease-specific polygenic risk scores (PRSs) in datasets with very few labeled data. We apply REGLE to perform GWAS on respiratory and circulatory HDCD: spirograms measuring lung function and photoplethysmograms measuring blood volume changes. REGLE replicates known loci while identifying others not previously detected. REGLE embeddings are predictive of overall survival, and PRSs constructed from REGLE loci improve disease prediction across multiple biobanks. Overall, REGLE embeddings contain clinically relevant information beyond that captured by existing expert-defined features, leading to improved genetic discovery and disease prediction.
dc.eprint.version: Final published version
dc.identifier.citation: Yun T, Cosentino J, Behsaz B, et al. Unsupervised representation learning on high-dimensional clinical data improves genomic discovery and prediction. Nat Genet. 2024;56(8):1604-1613. doi:10.1038/s41588-024-01831-6
dc.identifier.uri: https://hdl.handle.net/1805/43940
dc.language.iso: en_US
dc.publisher: Springer Nature
dc.relation.isversionof: 10.1038/s41588-024-01831-6
dc.relation.journal: Nature Genetics
dc.rights: Attribution 4.0 International
dc.rights.uri: http://creativecommons.org/licenses/by/4.0/
dc.source: PMC
dc.subject: Genome-wide association studies
dc.subject: Population genetics
dc.subject: Unsupervised machine learning
dc.subject: Multifactorial inheritance
dc.title: Unsupervised representation learning on high-dimensional clinical data improves genomic discovery and prediction
dc.type: Article
Files
Original bundle
Name:
Yun2024Unsupervised-CCBY.pdf
Size:
4.8 MB
Format:
Adobe Portable Document Format
License bundle
Name:
license.txt
Size:
2.04 KB
Format:
Item-specific license agreed upon to submission