LAmbDA: label ambiguous domain adaptation dataset integration reduces batch effects and improves subtype detection

Johnson, Travis S.; Wang, Tongxin; Huang, Zhi; Yu, Christina Y.; Wu, Yi; Han, Yatong; Zhang, Yan; Huang, Kun; Zhang, Jie

2022-01-12
2022-01-12
2019-04

Johnson, T. S., Wang, T., Huang, Z., Yu, C. Y., Wu, Y., Han, Y., Zhang, Y., Huang, K., & Zhang, J. (2019). LAmbDA: Label ambiguous domain adaptation dataset integration reduces batch effects and improves subtype detection. Bioinformatics, 35(22), 4696–4706. https://doi.org/10.1093/bioinformatics/btz295

1367-4803, 1460-2059
https://hdl.handle.net/1805/27408

Motivation
Rapid advances in single-cell RNA sequencing (scRNA-seq) have produced higher-resolution cellular subtypes in multiple tissues and species. Methods are increasingly needed across datasets and species to (i) remove systematic biases, (ii) model multiple datasets with ambiguous labels and (iii) classify cells and map cell type labels. However, most methods only address one of these problems on broad cell types or simulated data using a single model type. It is also important to address higher-resolution cellular subtypes, subtype labels from multiple datasets, models trained on multiple datasets simultaneously and generalizability beyond a single model type.

Results
We developed a species- and dataset-independent transfer learning framework (LAmbDA) to train models on multiple datasets (even from different species) and applied our framework to simulated, pancreas and brain scRNA-seq experiments. These models mapped corresponding cell types between datasets with inconsistent cell subtype labels while simultaneously reducing batch effects. We achieved high accuracy in labeling cellular subtypes (weighted accuracy simulated 1 datasets: 90%; simulated 2 datasets: 94%; pancreas datasets: 88% and brain datasets: 66%) using LAmbDA Feedforward 1 Layer Neural Network with bagging. This method achieved higher weighted accuracy in labeling cellular subtypes than two other state-of-the-art methods, scmap and CaSTLe, in brain (66% versus 60% and 32%). Furthermore, it achieved better performance in correctly predicting ambiguous cellular subtype labels across datasets in 88% of test cases compared with CaSTLe (63%), scmap (50%) and MetaNeighbor (50%). LAmbDA is model- and dataset-independent and generalizable to diverse data types, representing an advance in biocomputing.

en-US
Publisher Policy
RNA; LAmbDA; scRNA-seq
Article