A Statistical Testing Procedure for Validating Class Labels

dc.contributor.author: Key, Melissa C.
dc.contributor.author: Boukai, Benzion
dc.contributor.department: Biostatistics, School of Public Health [en_US]
dc.date.accessioned: 2022-02-08T21:14:09Z
dc.date.available: 2022-02-08T21:14:09Z
dc.date.issued: 2020
dc.description.abstract: Motivated by an open problem of validating protein identities in label-free shotgun proteomics work-flows, we present a testing procedure to validate class/protein labels using available measurements across instances/peptides. More generally, we present a solution to the problem of identifying instances that are deemed, based on some distance (or quasi-distance) measure, as outliers relative to the subset of instances assigned to the same class. The proposed procedure is non-parametric and requires no specific distributional assumption on the measured distances. The only assumption underlying the testing procedure is that measured distances between instances within the same class are stochastically smaller than measured distances between instances from different classes. The test is shown to simultaneously control the Type I and Type II error probabilities whilst also controlling the overall error probability of the repeated testing invoked in the validation procedure of initial class labeling. The theoretical results are supplemented with results from an extensive numerical study, simulating a typical setup for labeling validation in proteomics work-flow applications. These results illustrate the applicability and viability of our method. Even with up to 25% of instances mislabeled, our testing procedure maintains a high specificity and greatly reduces the proportion of mislabeled instances. [en_US]
dc.eprint.version: Author's manuscript [en_US]
dc.identifier.citation: Key, M. C., & Boukai, B. (2020). A Statistical Testing Procedure for Validating Class Labels. ArXiv:2006.03025 [Stat]. http://arxiv.org/abs/2006.03025 [en_US]
dc.identifier.uri: https://hdl.handle.net/1805/27718
dc.language.iso: en [en_US]
dc.publisher: Arxiv [en_US]
dc.relation.journal: Arxiv [en_US]
dc.rights: Publisher Policy [en_US]
dc.source: ArXiv [en_US]
dc.subject: non-parametric [en_US]
dc.subject: hypothesis testing [en_US]
dc.subject: Bonferroni [en_US]
dc.title: A Statistical Testing Procedure for Validating Class Labels [en_US]
dc.type: Article [en_US]
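
The abstract refers to controlling the overall error probability of the repeated testing, and the record lists Bonferroni as a subject keyword. As a rough illustration of that one ingredient only, the following is a minimal sketch of a Bonferroni adjustment. It assumes per-instance p-values have already been obtained by some test; the function name `bonferroni_flags` and the example p-values are hypothetical, and this is not the authors' full procedure.

```python
def bonferroni_flags(p_values, alpha=0.05):
    """Return one boolean per test: True if that test is rejected
    at the Bonferroni-adjusted level alpha / m, where m is the
    number of tests. This bounds the family-wise error rate by alpha."""
    m = len(p_values)
    if m == 0:
        return []
    threshold = alpha / m
    return [p <= threshold for p in p_values]


# Hypothetical p-values, one per instance whose label is being checked.
example_p = [0.001, 0.020, 0.300, 0.004]
flags = bonferroni_flags(example_p, alpha=0.05)
print(flags)
```

With four tests and an overall level of 0.05, each individual test is compared against 0.05 / 4 = 0.0125, so only the first and last hypothetical p-values lead to a rejection.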
Files
Original bundle
Name: Key2020Statistical-preprint.pdf
Size: 486.01 KB
Format: Adobe Portable Document Format