Linkability measures to assess the data characteristics for record linkage

dc.contributor.author: Ong, Toan C.
dc.contributor.author: Hill, Andrew
dc.contributor.author: Kahn, Michael G.
dc.contributor.author: Lembcke, Lauren R.
dc.contributor.author: Schilling, Lisa M.
dc.contributor.author: Grannis, Shaun J.
dc.contributor.department: Medicine, School of Medicine
dc.date.accessioned: 2025-04-18T10:15:05Z
dc.date.available: 2025-04-18T10:15:05Z
dc.date.issued: 2024
dc.description.abstract: Objectives: Accurate record linkage (RL) enables consolidation and de-duplication of data from disparate datasets, resulting in more comprehensive and complete patient data. However, conducting RL with low-quality or unfit data can waste institutional resources on poor linkage results. We aim to evaluate data linkability to enhance the effectiveness of record linkage. Materials and methods: We describe a systematic approach using data fitness ("linkability") measures, defined as metrics that characterize the availability, discriminatory power, and distribution of potential variables for RL. We used the isolation forest algorithm to detect abnormal linkability values from 188 sites in Indiana and Colorado, and manually reviewed the data to understand the cause of anomalies. Results: We calculated 10 linkability metrics for 11 potential linkage variables (LVs) across 188 sites for a total of 20 680 linkability metrics. Potential LVs such as first name, last name, date of birth, and sex have low missing data rates, while Social Security Number varies widely in completeness among all sites. We investigated anomalous linkability values to identify their causes: many records having identical values in certain LVs, placeholder values disguising data missingness, and orphan records. Discussion: The fitness of a variable for RL is determined by its availability and its discriminatory power to uniquely identify individuals. These results highlight the need for awareness of placeholder values, which informs the selection of variables and methods to optimize RL performance. Conclusion: Evaluating linkability measures using the isolation forest algorithm to highlight anomalous findings can help identify fitness-for-use issues that must be addressed before initiating the RL process to ensure high-quality linkage outcomes.
dc.eprint.version: Final published version
dc.identifier.citation: Ong TC, Hill A, Kahn MG, Lembcke LR, Schilling LM, Grannis SJ. Linkability measures to assess the data characteristics for record linkage. J Am Med Inform Assoc. 2024;31(11):2651-2659. doi:10.1093/jamia/ocae248
dc.identifier.uri: https://hdl.handle.net/1805/47161
dc.language.iso: en_US
dc.publisher: Oxford University Press
dc.relation.isversionof: 10.1093/jamia/ocae248
dc.relation.journal: Journal of the American Medical Informatics Association
dc.rights: Attribution-NonCommercial 4.0 International
dc.rights.uri: http://creativecommons.org/licenses/by-nc/4.0/
dc.source: PMC
dc.subject: Data characteristics
dc.subject: Distributional measures
dc.subject: Intrinsic measures
dc.subject: Linkability
dc.subject: Record linkage
dc.title: Linkability measures to assess the data characteristics for record linkage
dc.type: Article
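
Illustrative note: the abstract describes linkability measures as characterizing the availability, discriminatory power, and distribution of a candidate linkage variable, and it warns that placeholder values can disguise missingness. The sketch below shows what such measures might look like for a single variable. It is not the authors' implementation: the four measure names, the `linkability` function, and the `PLACEHOLDERS` list are all hypothetical, and the article's 10 metrics are not enumerated in this record. The isolation forest step is not covered here.

```python
import math
from collections import Counter

# Hypothetical placeholder strings (an assumption for illustration);
# a real list would depend on the source systems being assessed.
PLACEHOLDERS = {"", "unknown", "n/a", "999-99-9999"}


def is_missing(value):
    """Treat None and known placeholder strings as missing."""
    return value is None or str(value).strip().lower() in PLACEHOLDERS


def linkability(values):
    """Return illustrative measures for one candidate linkage variable."""
    total = len(values)
    present = [v for v in values if not is_missing(v)]
    n = len(present)
    counts = Counter(present)
    # Shannon entropy (bits) of the observed value distribution.
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values()) if n else 0.0
    return {
        # Availability: share of records with no usable value.
        "missing_rate": (total - n) / total if total else 0.0,
        # Discriminatory power: distinct values per populated record.
        "distinct_ratio": len(counts) / n if n else 0.0,
        # Distribution: how evenly values are spread.
        "entropy_bits": entropy,
        # Distribution: share held by the single most common value.
        "max_value_share": max(counts.values()) / n if n else 0.0,
    }
```

For example, `linkability(["123-45-6789", "999-99-9999", None, "987-65-4321"])` counts the placeholder `999-99-9999` as missing, so the missing rate is 0.5 rather than 0.25. A high `max_value_share` on a variable expected to be near-unique is the kind of pattern (many records sharing one value) that the abstract reports finding on manual review.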
Files
Original bundle
Name: Ong2024Linkability-CCBYNC.pdf
Size: 1.11 MB
Format: Adobe Portable Document Format
License bundle
Name: license.txt
Size: 2.04 KB
Format: Item-specific license agreed upon to submission
Description: