MedShift: Automated Identification of Shift Data for Medical Image Dataset Curation

dc.contributor.author: Guo, Xiaoyuan
dc.contributor.author: Wawira Gichoya, Judy
dc.contributor.author: Trivedi, Hari
dc.contributor.author: Purkayastha, Saptarshi
dc.contributor.author: Banerjee, Imon
dc.contributor.department: Biomedical Engineering and Informatics, Luddy School of Informatics, Computing, and Engineering
dc.date.accessioned: 2024-10-11T12:44:42Z
dc.date.available: 2024-10-11T12:44:42Z
dc.date.issued: 2023
dc.description.abstract: Automated curation of noisy external data in the medical domain has long been demanding, as AI technologies should be validated on various sources with clean annotated data. To curate a high-quality dataset, identifying variance between the internal and external sources is a fundamental step, as the data distributions from different sources can vary significantly and subsequently affect the performance of the AI models. The primary challenges in detecting data shifts are (1) access to private data across healthcare institutions for manual detection, and (2) the lack of automated approaches to learn efficient shift-data representations without training samples. To overcome these problems, we propose an automated pipeline called MedShift to detect the top-level shift samples and evaluate the significance of shift data without sharing data between the internal and external organizations. MedShift employs unsupervised anomaly detectors to learn the internal distribution and identify samples showing significant shiftness in external datasets, and compares their performance. To quantify the effects of the detected shift data, we train a multi-class classifier that learns internal domain knowledge and evaluate the classification performance for each class in external domains after dropping the shift data. We also propose a data quality metric to quantify the dissimilarity between the internal and external datasets. We verify the efficacy of MedShift with musculoskeletal radiograph (MURA) and chest X-ray datasets from more than one external source. Experiments show that our proposed shift data detection pipeline can help medical centers curate high-quality datasets more efficiently. The code can be found at https://github.com/XiaoyuanGuo/MedShift. An interface introduction video to visualize our results is available at https://youtu.be/V3BF0P1sxQE.
dc.eprint.version: Author's manuscript
dc.identifier.citation: Guo X, Gichoya JW, Trivedi H, Purkayastha S, Banerjee I. MedShift: Automated Identification of Shift Data for Medical Image Dataset Curation. IEEE J Biomed Health Inform. 2023;27(8):3936-3947. doi:10.1109/JBHI.2023.3275104
dc.identifier.uri: https://hdl.handle.net/1805/43905
dc.language.iso: en_US
dc.publisher: IEEE
dc.relation.isversionof: 10.1109/JBHI.2023.3275104
dc.relation.journal: IEEE Journal of Biomedical and Health Informatics
dc.rights: Publisher Policy
dc.source: PMC
dc.subject: Anomaly detection
dc.subject: Dataset curation
dc.subject: Medical shift data
dc.subject: X-ray
dc.subject: OOD detection
dc.title: MedShift: Automated Identification of Shift Data for Medical Image Dataset Curation
dc.type: Article
Files
Original bundle
Name: Guo2023MedShift-AAM.pdf
Size: 2.12 MB
Format: Adobe Portable Document Format