Browsing by Subject "Record linkage"
A simple two-step procedure using the Fellegi-Sunter model for frequency-based record linkage
(Taylor & Francis, 2021-05-04) Xu, Huiping; Li, Xiaochun; Grannis, Shaun; Biostatistics, School of Public Health
The widely used Fellegi-Sunter model for probabilistic record linkage does not leverage information contained in field values and consequently leads to identical classification of match status regardless of whether records agree on rare or common values. Since agreement on rare values is less likely to occur by chance than agreement on common values, records agreeing on rare values are more likely to be matches. Existing frequency-based methods typically rely on knowledge of error probabilities associated with field values and frequencies of agreed field values among matches, often derived using prior studies or training data. When such information is unavailable, applications of these methods are challenging. In this paper, we propose a simple two-step procedure for frequency-based matching using the Fellegi-Sunter framework to overcome these challenges. Matching weights are adjusted based on frequency distributions of the agreed field values among matches and non-matches, estimated by the Fellegi-Sunter model without relying on prior studies or training data. Through a real-world application and simulation, our method is found to produce comparable or better performance than the unadjusted method.
Furthermore, frequency-based matching provides greater improvement in matching accuracy when using poorly discriminating fields, with diminished benefit as the discriminating power of matching fields increases.

Evaluating Two Approaches for Parameterizing the Fellegi-Sunter Patient Matching Algorithm to Optimize Accuracy
(Medinfo conference proceedings, 2019-08-25) Grannis, Shaun; Kasthurirathne, Suranga; Bo, Na; Huiping, Xu

Evaluation of real-world referential and probabilistic patient matching to advance patient identification strategy
(Oxford University Press, 2022) Grannis, Shaun J.; Williams, Jennifer L.; Kasthuri, Suranga; Murray, Molly; Xu, Huiping; Medicine, School of Medicine
Objective: This study sought both to support evidence-based patient identity policy development by illustrating an approach for formally evaluating operational matching methods, and also to characterize the performance of both referential and probabilistic patient matching algorithms using real-world demographic data. Materials and methods: We assessed matching accuracy for referential and probabilistic matching algorithms using a manually reviewed 30 000 record gold standard reference dataset derived from a large health information exchange containing over 47 million patient registrations. We applied referential and probabilistic algorithms to this dataset and compared the outputs to the gold standard. We computed performance metrics including sensitivity (recall), positive predictive value (precision), and F-score for each algorithm. Results: The probabilistic algorithm exhibited sensitivity, positive predictive value (PPV), and F-score of 0.6366, 0.9995, and 0.7778, respectively. The referential algorithm exhibited corresponding sensitivity, PPV, and F-score values of 0.9351, 0.9996, and 0.9663, respectively. Treating discordant and limited-data records as nonmatches increased referential match sensitivity to 0.9578.
Compared to the more traditional probabilistic approach, referential matching exhibits greater accuracy. Conclusions: Referential patient matching, an increasingly popular method among health IT vendors, demonstrated notably greater accuracy than a more traditional probabilistic approach without the adaptation of the algorithm to the data that the traditional probabilistic approach usually requires. Health IT policymakers, including the Office of the National Coordinator for Health Information Technology (ONC), should explore strategies to expand the evidence base for real-world matching system performance, given the need for an evidence-based patient identity strategy.

Evolving availability and standardization of patient attributes for matching
(Oxford University Press, 2023-10-12) Deng, Yu; Gleason, Lacey P.; Culbertson, Adam; Chen, Xiaotian; Bernstam, Elmer V.; Cullen, Theresa; Gouripeddi, Ramkiran; Harle, Christopher; Hesse, David F.; Kean, Jacob; Lee, John; Magoc, Tanja; Meeker, Daniella; Ong, Toan; Pathak, Jyotishman; Rosenman, Marc; Rusie, Laura K.; Shah, Akash J.; Shi, Lizheng; Thomas, Aaron; Trick, William E.; Grannis, Shaun; Kho, Abel; Health Policy and Management, Richard M. Fairbanks School of Public Health
Variation in availability, format, and standardization of patient attributes across health care organizations impacts patient-matching performance. We report on the changing nature of patient-matching features available from 2010 to 2020 across diverse care settings. We asked 38 health care provider organizations about their current patient attribute data-collection practices. All sites collected name, date of birth (DOB), address, and phone number. Name, DOB, current address, social security number (SSN), sex, and phone number were most commonly used for cross-provider patient matching.
Electronic health record queries for a subset of 20 participating sites revealed that DOB, first name, last name, city, and postal codes were highly available (>90%) across health care organizations and time. SSN declined slightly in the last years of the study period. Birth sex, gender identity, language, country full name, country abbreviation, health insurance number, ethnicity, cell phone number, email address, and weight increased over 50% from 2010 to 2020. Understanding the wide variation in available patient attributes across care settings in the United States can guide selection and standardization efforts for improved patient matching in the United States.

Privacy‐preserving record linkage across disparate institutions and datasets to enable a learning health system: The national COVID cohort collaborative (N3C) experience
(Wiley, 2024-01-11) Tachinardi, Umberto; Grannis, Shaun J.; Michael, Sam G.; Misquitta, Leonie; Dahlin, Jayme; Sheikh, Usman; Kho, Abel; Phua, Jasmin; Rogovin, Sara S.; Amor, Benjamin; Choudhury, Maya; Sparks, Philip; Mannaa, Amin; Ljazouli, Saad; Saltz, Joel; Prior, Fred; Baghal, Ahmen; Gersing, Kenneth; Embi, Peter J.; Medicine, School of Medicine
Introduction: Research driven by real-world clinical data is increasingly vital to enabling learning health systems, but integrating such data from across disparate health systems is challenging. As part of the NCATS National COVID Cohort Collaborative (N3C), the N3C Data Enclave was established as a centralized repository of deidentified and harmonized COVID-19 patient data from institutions across the US. However, making this data most useful for research requires linking it with information such as mortality data, images, and viral variants. The objective of this project was to establish privacy-preserving record linkage (PPRL) methods to ensure that patient-level EHR data remains secure and private when governance-approved linkages with other datasets occur.
Methods: Separate agreements and approval processes govern N3C data contribution and data access. The Linkage Honest Broker (LHB), an independent neutral party (the Regenstrief Institute), ensures data linkages are robust and secure by adding an extra layer of separation between protected health information and clinical data. The LHB's PPRL methods (including algorithms, processes, and governance) match patient records using "deidentified tokens," which are hashed combinations of identifier fields that define a match across data repositories without using patients' clear-text identifiers. Results: These methods enable three linkage functions: Deduplication, Linking Multiple Datasets, and Cohort Discovery. To date, two external repositories have been cross-linked. As of March 1, 2023, 43 sites have signed the LHB Agreement; 35 sites have sent tokens generated for 9 528 998 patients. In this initial cohort, the LHB identified 135 037 matches and 68 596 duplicates. Conclusion: This large-scale linkage study using deidentified datasets of varying characteristics established secure methods for protecting the privacy of N3C patient data when linked for research purposes. This technology has potential for use with registries for other diseases and conditions.

The Data-Adaptive Fellegi-Sunter Model for Probabilistic Record Linkage: Algorithm Development and Validation for Incorporating Missing Data and Field Selection
(JMIR Publications, 2022-09-29) Li, Xiaochun; Xu, Huiping; Grannis, Shaun; Biostatistics, School of Public Health
Background: Quality patient care requires comprehensive health care data from a broad set of sources. However, missing data in medical records and matching field selection are 2 real-world challenges in patient-record linkage.
Objective: In this study, we aimed to evaluate the extent to which incorporating the missing at random (MAR) assumption in the Fellegi-Sunter model and using data-driven selected fields improve patient-matching accuracy using real-world use cases. Methods: We adapted the Fellegi-Sunter model to accommodate missing data using the MAR assumption and compared the adaptation to the common strategy of treating missing values as disagreement with matching fields specified by experts or selected by data-driven methods. We used 4 use cases, each containing a random sample of record pairs with match statuses ascertained by manual reviews. Use cases included health information exchange (HIE) record deduplication, linkage of public health registry records to HIE, linkage of Social Security Death Master File records to HIE, and deduplication of newborn screening records, which represent real-world clinical and public health scenarios. Matching performance was evaluated using the sensitivity, specificity, positive predictive value, negative predictive value, and F1-score. Results: Incorporating the MAR assumption in the Fellegi-Sunter model maintained or improved F1-scores, regardless of whether matching fields were expert-specified or selected by data-driven methods. Combining the MAR assumption and data-driven fields optimized the F1-scores in the 4 use cases. Conclusions: MAR is a reasonable assumption in real-world record linkage applications: it maintains or improves F1-scores regardless of whether matching fields are expert-specified or data-driven. Data-driven selection of fields coupled with MAR achieves the best overall performance, which can be especially useful in privacy-preserving record linkage.
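
The contrast drawn in this last abstract, between treating a missing field as a disagreement and handling it under the MAR assumption, can be illustrated with a small sketch. This is not the authors' implementation. It assumes the usual Fellegi-Sunter log-ratio weights and assumes that, under MAR, a field missing in either record is simply left out of the pair's score. The function names and the m- and u-probabilities below are hypothetical values chosen for illustration only.

```python
import math

def fs_weight(agreements, m, u):
    """Sum Fellegi-Sunter log2 weights for one record pair.

    agreements maps field name -> True (agree), False (disagree),
    or None (value missing in at least one record).
    m[field] / u[field] are the agreement probabilities among
    matches / non-matches (hypothetical values in this sketch).
    """
    total = 0.0
    for field, agree in agreements.items():
        if agree is None:
            # Assumed MAR handling: a missing comparison contributes nothing.
            continue
        if agree:
            total += math.log2(m[field] / u[field])
        else:
            total += math.log2((1 - m[field]) / (1 - u[field]))
    return total

def fs_weight_missing_as_disagreement(agreements, m, u):
    """Common alternative: recode missing comparisons as disagreements."""
    recoded = {field: bool(agree) for field, agree in agreements.items()}
    return fs_weight(recoded, m, u)

# Hypothetical parameters, not taken from any of the studies listed here.
m = {"last_name": 0.95, "dob": 0.98, "ssn": 0.90}
u = {"last_name": 0.01, "dob": 0.005, "ssn": 0.001}

# A pair that agrees on name and DOB but has no SSN recorded.
pair = {"last_name": True, "dob": True, "ssn": None}

w_mar = fs_weight(pair, m, u)
w_dis = fs_weight_missing_as_disagreement(pair, m, u)
print(f"MAR weight: {w_mar:.2f}, missing-as-disagreement weight: {w_dis:.2f}")
```

In this sketch the two scores differ only by the SSN disagreement weight, so a pair with an unrecorded SSN is penalized under the missing-as-disagreement strategy but not under the assumed MAR handling.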