Browsing by Author "Sun, Ju"
Evaluation of federated learning variations for COVID-19 diagnosis using chest radiographs from 42 US and European hospitals (Oxford University Press, 2022)
Peng, Le; Luo, Gaoxiang; Walker, Andrew; Zaiman, Zachary; Jones, Emma K.; Gupta, Hemant; Kersten, Kristopher; Burns, John L.; Harle, Christopher A.; Magoc, Tanja; Shickel, Benjamin; Steenburg, Scott D.; Loftus, Tyler; Melton, Genevieve B.; Wawira Gichoya, Judy; Sun, Ju; Tignanelli, Christopher J.; Radiology and Imaging Sciences, School of Medicine
Objective: Federated learning (FL) allows multiple distributed data holders to collaboratively learn a shared model without data sharing. However, individual health system data are heterogeneous. "Personalized" FL variations have been developed to counter data heterogeneity, but few have been evaluated using real-world healthcare data. The purpose of this study is to investigate the performance of a single-site versus a 3-client federated model using a previously described Coronavirus Disease 19 (COVID-19) diagnostic model. Additionally, to investigate the effect of system heterogeneity, we evaluate the performance of 4 FL variations. Materials and methods: We leverage a FL healthcare collaborative including data from 5 international healthcare systems (US and Europe) encompassing 42 hospitals. We implemented a COVID-19 computer vision diagnosis system using the Federated Averaging (FedAvg) algorithm implemented on Clara Train SDK 4.0. To study the effect of data heterogeneity, training data was pooled from 3 systems locally and federation was simulated. We compared a centralized/pooled model, versus FedAvg, and 3 personalized FL variations (FedProx, FedBN, and FedAMP). Results: We observed comparable model performance with respect to internal validation (local model: AUROC 0.94 vs FedAvg: 0.95, P = .5) and improved model generalizability with the FedAvg model (P < .05).
When investigating the effects of model heterogeneity, we observed poor performance with FedAvg on internal validation as compared to personalized FL algorithms. FedAvg did have improved generalizability compared to personalized FL algorithms. On average, FedBN had the best rank performance on internal and external validation. Conclusion: FedAvg can significantly improve the generalization of the model compared to other personalization FL algorithms; however, at the cost of poor internal validity. Personalized FL may offer an opportunity to develop both internal and externally validated algorithms.

Performance of a Chest Radiograph AI Diagnostic Tool for COVID-19: A Prospective Observational Study (Radiological Society of North America, 2022-06-01)
Sun, Ju; Peng, Le; Li, Taihui; Adila, Dyah; Zaiman, Zach; Melton-Meaux, Genevieve B.; Ingraham, Nicholas E.; Murray, Eric; Boley, Daniel; Switzer, Sean; Burns, John L.; Huang, Kun; Allen, Tadashi; Steenburg, Scott D.; Wawira Gichoya, Judy; Kummerfeld, Erich; Tignanelli, Christopher J.; Radiology and Imaging Sciences, School of Medicine
Purpose: To conduct a prospective observational study across 12 U.S. hospitals to evaluate real-time performance of an interpretable artificial intelligence (AI) model to detect COVID-19 on chest radiographs. Materials and methods: A total of 95 363 chest radiographs were included in model training, external validation, and real-time validation. The model was deployed as a clinical decision support system, and performance was prospectively evaluated. There were 5335 total real-time predictions and a COVID-19 prevalence of 4.8% (258 of 5335). Model performance was assessed with use of receiver operating characteristic analysis, precision-recall curves, and F1 score. Logistic regression was used to evaluate the association of race and sex with AI model diagnostic accuracy.
To compare model accuracy with the performance of board-certified radiologists, a third dataset of 1638 images was read independently by two radiologists. Results: Participants positive for COVID-19 had higher COVID-19 diagnostic scores than participants negative for COVID-19 (median, 0.1 [IQR, 0.0-0.8] vs 0.0 [IQR, 0.0-0.1], respectively; P < .001). Real-time model performance was unchanged over 19 weeks of implementation (area under the receiver operating characteristic curve, 0.70; 95% CI: 0.66, 0.73). Model sensitivity was higher in men than women (P = .01), whereas model specificity was higher in women (P = .001). Sensitivity was higher for Asian (P = .002) and Black (P = .046) participants compared with White participants. The COVID-19 AI diagnostic system had worse accuracy (63.5% correct) compared with radiologist predictions (radiologist 1 = 67.8% correct, radiologist 2 = 68.6% correct; McNemar P < .001 for both). Conclusion: AI-based tools have not yet reached full diagnostic potential for COVID-19 and underperform compared with radiologist prediction.