An active learning pipeline to automatically identify candidate terms for a CDSS ontology-measures, experiments, and performance

Date
2025-08-23
Language
American English
Found At
medRxiv
Abstract

Objective: To explore new strategies that make the document selection process for active learning more transparent, reproducible, and effective. The ultimate goal is to leverage active learning to identify keyphrases that facilitate ontology development and construction, streamline the process, and support long-term maintenance.

Methods: The active learning pipeline used a BiLSTM-CRF model and over 2,900 abstracts retrieved from PubMed that are relevant to clinical decision support systems. We started model training with synthetically labeled abstracts and then used different strategies to select abstracts annotated by domain experts (gold standards). Random sampling served as the baseline. Recall and F-beta (beta = 1, 5, and 10) scores were used to compare the performance of the active learning pipeline under the different strategies.
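The abstract does not spell out the evaluation measure it names; as a reminder, the standard F-beta score it refers to weights recall beta times as heavily as precision, so beta = 5 and 10 emphasize recall. A minimal sketch:

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    # Standard F-beta: (1 + beta^2) * P * R / (beta^2 * P + R).
    # With beta = 1 this is the usual F1; large beta approaches raw recall.
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

For example, with precision 0.2 and recall 0.8, F1 is 0.32 while F10 is close to the recall value, which is why the weighted scores complement raw recall in recall-oriented annotation settings.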

Results: We tested four novel document-level uncertainty aggregation strategies (KPSum, KPAvg, DOCSum, and DOCAvg) that operate over standard token-level uncertainty scores such as Maximum Token Probability (MTP), Token Entropy (TE), and Margin. All strategies show significant improvement in the early active learning cycles (θ0 to θ2) for recall and F1. The systematic evaluations show that KPSum (actual order) yields consistent improvement in both recall and F1 and outperforms random sampling. Document order (actual versus reverse) does not appear to play a critical role in model learning and performance across strategies in our datasets, although for some strategies the actual order is slightly more effective. The weighted F-beta scores (beta = 5 and 10) provided results complementary to raw recall and F1 (beta = 1).
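The abstract names the four aggregation strategies but not their exact formulas. One plausible reading, sketched here under the assumption that Token Entropy is the base token-level score and that keyphrase tokens can be identified from the model's predictions (the helper names `doc_uncertainty` and `keyphrase_mask` are illustrative, not from the paper):

```python
import math

def token_entropy(probs):
    # Entropy of one token's predicted label distribution; higher = more uncertain.
    return -sum(p * math.log(p) for p in probs if p > 0)

def doc_uncertainty(token_probs, keyphrase_mask, strategy="KPSum"):
    # token_probs: per-token label probability distributions for one document.
    # keyphrase_mask: True for tokens the model tags as part of a keyphrase.
    scores = [token_entropy(p) for p in token_probs]
    kp = [s for s, m in zip(scores, keyphrase_mask) if m]
    if strategy == "KPSum":    # total uncertainty over keyphrase tokens only
        return sum(kp)
    if strategy == "KPAvg":    # mean uncertainty over keyphrase tokens only
        return sum(kp) / len(kp) if kp else 0.0
    if strategy == "DOCSum":   # total uncertainty over all tokens
        return sum(scores)
    if strategy == "DOCAvg":   # mean uncertainty over all tokens
        return sum(scores) / len(scores) if scores else 0.0
    raise ValueError(f"unknown strategy: {strategy}")
```

In an active learning cycle, each unlabeled document would be scored this way and the highest-uncertainty documents sent to the domain experts for annotation; the sum variants favor longer or keyphrase-dense documents, while the averages normalize that length effect.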

Conclusion: While prior work on uncertainty sampling typically focuses on token-level uncertainty metrics within generic NER tasks, our work advances this line of research by introducing a higher-level abstraction: document-level uncertainty aggregation. Embedded in a human-in-the-loop active learning pipeline, it can effectively prioritize high-impact documents, improve early-cycle recall, and reduce annotation effort. Our results show promise for automating part of ontology construction and maintenance, i.e., monitoring and screening new publications to identify candidate keyphrases. However, future work needs to improve model performance to make the pipeline usable in real-world operations.

Cite As
Alluri S, Komatineni K, Goli R, et al. An active learning pipeline to automatically identify candidate terms for a CDSS ontology-measures, experiments, and performance. Preprint. medRxiv. 2025;2025.04.15.25325868. Published 2025 Aug 23. doi:10.1101/2025.04.15.25325868
Source
PMC
Type
Article
Version
Preprint