Citations: 20

Article Abstract

Background: Electronic health records (EHRs) in unstructured formats are valuable sources of information for research in both the clinical and biomedical domains. However, before such records can be used for research purposes, sensitive health information (SHI) must often be removed to protect patient privacy. Rule-based and machine learning-based methods have been shown to be effective in deidentification. However, very few studies have investigated the combination of transformer-based language models and rules.

Objective: The objective of this study is to develop a hybrid deidentification pipeline for Australian EHR text notes using rules and transformers. The study also aims to investigate the impact of pretrained word embedding and transformer-based language models.

Methods: In this study, we present a hybrid deidentification pipeline called OpenDeID, which was developed using an Australian multicenter EHR-based corpus called the OpenDeID corpus. The OpenDeID corpus consists of 2100 pathology reports with 38,414 SHI entities from 1833 patients. The OpenDeID pipeline incorporates a hybrid approach of associative rules, supervised deep learning, and pretrained language models.
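As a rough illustration of how rule output and model output can be combined in a hybrid deidentification step, the following minimal Python sketch merges spans from a learned tagger with spans from regular-expression rules and then redacts them. It is not the OpenDeID implementation: the `Span` type, the two patterns in `RULES`, the "model spans win on overlap" merge policy, and the bracketed placeholder format are all assumptions made for this example.

```python
import re
from typing import List, NamedTuple


class Span(NamedTuple):
    start: int   # inclusive character offset
    end: int     # exclusive character offset
    label: str   # SHI category, e.g. "NAME" or "DATE"
    source: str  # "model" or "rule"


# Hypothetical rules for illustration only; real rule sets are far larger.
RULES = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "PHONE": re.compile(r"\b\d{4} \d{4}\b"),
}


def rule_spans(text: str) -> List[Span]:
    """Return every span matched by the regular-expression rules."""
    spans = []
    for label, pattern in RULES.items():
        for match in pattern.finditer(text):
            spans.append(Span(match.start(), match.end(), label, "rule"))
    return spans


def merge_spans(model_spans: List[Span], rules: List[Span]) -> List[Span]:
    """Keep all model spans; add a rule span only if it overlaps nothing kept so far."""
    merged = list(model_spans)
    for r in rules:
        if not any(r.start < m.end and m.start < r.end for m in merged):
            merged.append(r)
    return sorted(merged)


def redact(text: str, spans: List[Span]) -> str:
    """Replace each span with a bracketed label placeholder."""
    pieces, cursor = [], 0
    for s in sorted(spans):
        pieces.append(text[cursor:s.start])
        pieces.append(f"[{s.label}]")
        cursor = s.end
    pieces.append(text[cursor:])
    return "".join(pieces)
```

In this sketch the model spans would come from a fine-tuned token classifier, and the rules act as a postprocessing safety net for patterns the model misses. Other merge policies (for example, letting rules override the model for highly regular formats) are equally plausible.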

Results: The OpenDeID pipeline achieved a best F-score of 0.9659 by fine-tuning the Discharge Summary BioBERT model and incorporating various preprocessing and postprocessing rules. The pipeline has been deployed at a large tertiary teaching hospital and has processed over 8000 unstructured EHR text notes in real time.
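For readers unfamiliar with the metric, an F-score is the harmonic mean of precision and recall. The sketch below computes it from entity-level counts. The abstract does not state the matching criterion or averaging scheme, so the strict matching used here (an entity counts as correct only if start, end, and label all agree) and the function names are assumptions made for this example.

```python
from typing import Iterable, Tuple

Entity = Tuple[int, int, str]  # (start offset, end offset, label)


def entity_counts(gold: Iterable[Entity], predicted: Iterable[Entity]):
    """Count true positives, false positives, and false negatives under strict matching."""
    gold_set, pred_set = set(gold), set(predicted)
    tp = len(gold_set & pred_set)
    fp = len(pred_set - gold_set)
    fn = len(gold_set - pred_set)
    return tp, fp, fn


def f_score(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """F-beta from counts; beta=1 gives the usual F1 (harmonic mean of precision and recall)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

With two of three gold entities found and one spurious prediction, precision and recall are both 2/3, so the F1 is also 2/3.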

Conclusions: The OpenDeID pipeline is a hybrid deidentification pipeline to deidentify SHI entities in unstructured EHR text notes. The pipeline has been evaluated on a large multicenter corpus. External validation will be undertaken as part of our future work to evaluate the effectiveness of the OpenDeID pipeline.

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10733816
DOI: http://dx.doi.org/10.2196/48145

Publication Analysis

Top Keywords

opendeid pipeline: 20
text notes: 16
hybrid deidentification: 12
deidentification pipeline: 12
ehr text: 12
opendeid: 9
electronic health: 8
rules transformers: 8
transformer-based language: 8
pipeline: 8

Similar Publications

Article Synopsis
  • It evaluates a hybrid deidentification strategy involving deep learning and contextual rules, tested on three different datasets containing medical reports and clinical narratives.
  • The findings show that the best-performing model, trained on 4,038 reports, achieved high F1-scores (0.9248 for strict and 0.9692 for relaxed settings), indicating strong deidentification capabilities that can protect patient information effectively.