MedShift: identifying shift data for medical dataset curation

Guo, Xiaoyuan; Gichoya, Judy Wawira; Trivedi, Hari; Purkayastha, Saptarshi; Banerjee, Imon

dc.contributor.author	Guo, Xiaoyuan
dc.contributor.author	Gichoya, Judy Wawira
dc.contributor.author	Trivedi, Hari
dc.contributor.author	Purkayastha, Saptarshi
dc.contributor.author	Banerjee, Imon
dc.contributor.department	BioHealth Informatics, School of Informatics and Computing	en_US
dc.date.accessioned	2023-02-06T19:35:50Z
dc.date.available	2023-02-06T19:35:50Z
dc.date.issued	2021
dc.description.abstract	To curate a high-quality dataset, identifying data variance between internal and external sources is a fundamental and crucial step. However, methods to detect shift or variance in data have not been significantly researched. The main challenges are the lack of effective approaches for learning dense representations of a dataset and the difficulty of sharing private data across medical institutions. To overcome these problems, we propose a unified pipeline called MedShift to detect the top-level shift samples and thus facilitate medical data curation. Given an internal dataset A as the base source, we first train anomaly detectors for each class of dataset A to learn internal distributions in an unsupervised way. Second, without exchanging data across sources, we run the trained anomaly detectors on an external dataset B for each class. The data samples with high anomaly scores are identified as shift data. To quantify the shiftness of the external dataset, we cluster B's data into groups class-wise based on the obtained scores. We then train a multi-class classifier on A and measure the shiftness with the classifier's performance variance on B by gradually dropping the group with the largest anomaly score for each class. Additionally, we adapt a dataset quality metric to help inspect the distribution differences for multiple medical sources. We verify the efficacy of MedShift with musculoskeletal radiographs (MURA) and chest X-ray datasets from more than one external source. Experiments show our proposed shift data detection pipeline can be beneficial for medical centers to curate high-quality datasets more efficiently. A video introducing the interface for visualizing our results is available at https://youtu.be/V3BF0P1sxQE.	en_US
dc.eprint.version	Final published version	en_US
dc.identifier.citation	Guo, X., Gichoya, J. W., Trivedi, H., Purkayastha, S., & Banerjee, I. (2021). MedShift: Identifying shift data for medical dataset curation. https://doi.org/10.48550/ARXIV.2112.13885	en_US
dc.identifier.uri	https://hdl.handle.net/1805/31154
dc.language.iso	en_US	en_US
dc.relation.isversionof	10.48550/arXiv.2112.13885	en_US
dc.rights	Attribution-NonCommercial-NoDerivatives 4.0 International	*
dc.rights.uri	https://creativecommons.org/licenses/by-nc-nd/4.0	*
dc.source	ArXiv	en_US
dc.subject	Dataset curation	en_US
dc.subject	Medical shift data	en_US
dc.subject	Anomaly detection	en_US
dc.subject	OOD detection	en_US
dc.title	MedShift: identifying shift data for medical dataset curation	en_US
dc.type	Article	en_US
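
The pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the per-class "anomaly detector" here is just the distance to a class centroid, the class-wise clustering is replaced by equal-size rank bins over the scores, the classifier is a nearest-centroid rule, and the toy 2-D features and all function names are hypothetical. It only shows the flow: fit on internal dataset A, score external dataset B, group B by score per class, then drop the most anomalous group and watch the classifier's accuracy on B.

```python
import math
from statistics import mean


def fit_detectors(features_a, labels_a):
    """Fit one centroid per class of A (assumed stand-in for an anomaly detector)."""
    by_class = {}
    for x, y in zip(features_a, labels_a):
        by_class.setdefault(y, []).append(x)
    return {y: [mean(col) for col in zip(*rows)] for y, rows in by_class.items()}


def anomaly_scores(detectors, features_b, labels_b):
    """Score each sample of B by its distance to the centroid of its own class."""
    return [math.dist(x, detectors[y]) for x, y in zip(features_b, labels_b)]


def group_by_score(scores, labels_b, n_groups):
    """Per class, bin samples by score rank (assumed stand-in for clustering).

    Group 0 holds the least anomalous samples, n_groups - 1 the most anomalous.
    """
    groups = [0] * len(scores)
    for y in set(labels_b):
        idx = sorted(
            (i for i, label in enumerate(labels_b) if label == y),
            key=lambda i: scores[i],
        )
        for rank, i in enumerate(idx):
            groups[i] = rank * n_groups // len(idx)
    return groups


def predict(detectors, x):
    """Nearest-centroid classifier trained on A (assumed stand-in classifier)."""
    return min(detectors, key=lambda y: math.dist(x, detectors[y]))


def accuracy_while_dropping(detectors, features_b, labels_b, groups, n_groups):
    """Accuracy on B with all groups, then after dropping the top group, and so on."""
    curve = []
    for keep_below in range(n_groups, 0, -1):
        kept = [i for i, g in enumerate(groups) if g < keep_below]
        correct = sum(predict(detectors, features_b[i]) == labels_b[i] for i in kept)
        curve.append(correct / len(kept))
    return curve


# Toy internal dataset A: two well-separated classes.
features_a = [[0, 0], [0, 1], [1, 0], [1, 1], [4, 4], [4, 5], [5, 4], [5, 5]]
labels_a = [0, 0, 0, 0, 1, 1, 1, 1]

# Toy external dataset B: one shifted sample per class ([3, 3] and [2, 2]).
features_b = [[0.5, 0.5], [1, 1], [3, 3], [0, 1], [4.5, 4.5], [5, 5], [2, 2], [4, 5]]
labels_b = [0, 0, 0, 0, 1, 1, 1, 1]

detectors = fit_detectors(features_a, labels_a)
scores = anomaly_scores(detectors, features_b, labels_b)
groups = group_by_score(scores, labels_b, n_groups=2)
curve = accuracy_while_dropping(detectors, features_b, labels_b, groups, n_groups=2)
print(curve)
```

In this toy setting, accuracy on B rises once the highest-scoring group is dropped, which is the kind of performance change the abstract uses to quantify shiftness. Only scores and model outputs are computed on B, so no raw data from A needs to be shared with the external site.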

Files

Original bundle

Name: Guo2021MedShift-CCBYNCND.pdf
Size: 26.29 MB
Format: Adobe Portable Document Format
Description: Article

License bundle

Name: license.txt
Size: 1.99 KB
Format: Item-specific license agreed upon to submission
Description:

Collections

Open Access Policy Articles
Department of Biomedical Engineering and Informatics Works
Saptarshi Purkayastha