Resilience of clinical text de-identified with "hiding in plain sight" to hostile reidentification attacks by human readers.

David S Carrell, Bradley A Malin, David J Cronkite, John S Aberdeen, Cheryl Clark, Muqun Rachel Li, Dikshya Bastakoty, Steve Nyemba, Lynette Hirschman

J Am Med Inform Assoc

Human Language Technology, MITRE Corporation, Bedford, Massachusetts, USA.

Published: July 2020


Objective: Effective, scalable de-identification of personally identifying information (PII) for information-rich clinical text is critical to support secondary use, but no method is 100% effective. The hiding-in-plain-sight (HIPS) approach attempts to solve this "residual PII problem." HIPS replaces PII tagged by a de-identification system with realistic but fictitious (resynthesized) content, making it harder to detect remaining unredacted PII.
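The HIPS idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the surrogate list, the `hips_resynthesize` function, the `span_of` helper, and the sample note are all hypothetical, and the tagger output is hard-coded to simulate one missed name.

```python
import random

# Hypothetical surrogate pool; a real system would draw from large, realistic lists.
SURROGATE_NAMES = ["Alice Morgan", "Brian Castillo", "Carla Nguyen"]


def span_of(text, phrase):
    """Return the (start, end) character span of the first occurrence of phrase."""
    start = text.index(phrase)
    return start, start + len(phrase)


def hips_resynthesize(text, tagged_spans, rng):
    """Replace every tagged span with a randomly chosen fictitious surrogate."""
    pieces = []
    cursor = 0
    for start, end in sorted(tagged_spans):
        pieces.append(text[cursor:start])          # untouched text before the span
        pieces.append(rng.choice(SURROGATE_NAMES))  # resynthesized content
        cursor = end
    pieces.append(text[cursor:])                    # untouched tail
    return "".join(pieces)


note = "Patient John Smith was seen by Dr. Mary Jones; follow-up with Jane Doe."

# Assumption for illustration: the tagger caught two names and missed "Jane Doe".
tagged = [span_of(note, "John Smith"), span_of(note, "Mary Jones")]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
deidentified = hips_resynthesize(note, tagged, rng)
print(deidentified)
```

In the output, the leaked name "Jane Doe" sits among surrogate names, so a reader cannot tell from the text alone which names are real.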

Materials And Methods: Using 2000 representative clinical documents from 2 healthcare settings (4000 total), we used a novel method to generate 2 de-identified 100-document corpora (200 documents total) in which PII tagged by a typical automated machine-learned tagger was replaced by HIPS-resynthesized content. Four readers conducted aggressive reidentification attacks to isolate leaked PII: 2 readers from within the originating institution and 2 external readers.

Results: Overall, mean recall of leaked PII was 26.8% and mean precision was 37.2%. Mean recall was 9% (mean precision = 37%) for patient ages, 32% (mean precision = 26%) for dates, 25% (mean precision = 37%) for doctor names, 45% (mean precision = 55%) for organization names, and 23% (mean precision = 57%) for patient names. Recall was 32% (precision = 40%) for internal and 22% (precision = 33%) for external readers.
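For readers less familiar with these metrics, the following Python sketch shows how recall and precision are conventionally computed for one reader's attack. The function name and the toy item labels are hypothetical, and the abstract does not describe the study's actual scoring procedure.

```python
def precision_recall(flagged, leaked):
    """Score one reader: `flagged` are items the reader marked as real PII,
    `leaked` are the items that truly are leaked (unredacted) PII."""
    flagged, leaked = set(flagged), set(leaked)
    true_positives = len(flagged & leaked)
    # Precision: share of the reader's guesses that were actually leaked PII.
    precision = true_positives / len(flagged) if flagged else 0.0
    # Recall: share of all leaked PII that the reader managed to find.
    recall = true_positives / len(leaked) if leaked else 0.0
    return precision, recall


# Toy data (hypothetical): 4 leaked items, reader flags 3 items, 1 of them correct.
leaked_items = {"name@doc1", "date@doc2", "age@doc3", "org@doc4"}
reader_flags = {"name@doc1", "name@doc7", "date@doc9"}

precision, recall = precision_recall(reader_flags, leaked_items)
undetected = 1.0 - recall  # share of leaked PII that stayed hidden from this reader
```

The complement of recall is the share of leaked PII a reader fails to find; with the reported mean recall of 26.8%, that share is 73.2%.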

Discussion And Conclusions: Approximately 70% of leaked PII "hiding" in a corpus de-identified with HIPS resynthesis is resilient to detection by human readers in a realistic, aggressive reidentification attack scenario. This rate is more than double that reported in previous studies but lower than the rate reported for an attack assisted by machine learning methods.

Full text (PMC): http://www.ncbi.nlm.nih.gov/pmc/articles/PMC7647331
DOI: http://dx.doi.org/10.1093/jamia/ocaa095

Similar Publications

Addressing contemporary threats in anonymised healthcare data using privacy engineering.

NPJ Digit Med

March 2025

University of California Berkeley, School of Information Science, Berkeley, CA, USA.

Sanjiv M Narayan, Nitin Kohli, Megan M Martin

Cyber-attacks on healthcare entities and leaks of personal identifiable information (PII) are a growing threat. However, it is now possible to learn sensitive characteristics of an individual without PII, by combining advances in artificial intelligence, analytics, and online repositories. We discuss privacy threats and privacy engineering solutions, emphasizing the selection of privacy enhancing technologies for various healthcare cases.


Simulating data breaches: Synthetic datasets for depicting personally identifiable information through scenario-based breaches.

Data Brief

February 2025

Kennesaw State University, United States.

Abhishek Sharma, May Bantan

With hackers relentlessly disrupting cyberspace and the day-to-day operations of organizations worldwide, there are also concerns related to Personally Identifiable Information (PII). Because breached data are often dumped on the clear web or the dark web, there are serious concerns about how threat actors worldwide can misuse them. This also raises the question of how hackers can build a profile of an individual, starting from one data leak and gathering more details with the help of Open Source Intelligence (OSINT).


Building a best-in-class automated de-identification tool for electronic health records through ensemble learning.

Patterns (N Y)

June 2021

nference, Cambridge, MA 02142, USA.

Karthik Murugadoss, Ajit Rajasekharan, Bradley Malin, Vineet Agarwal, Sairam Bade

The presence of personally identifiable information (PII) in natural language portions of electronic health records (EHRs) constrains their broad reuse. Despite continuous improvements in automated detection of PII, residual identifiers require manual validation and correction. Here, we describe an automated de-identification system that employs an ensemble architecture, incorporating attention-based deep-learning models and rule-based methods, supported by heuristics for detecting PII in EHR data.


The machine giveth and the machine taketh away: a parrot attack on clinical text deidentified with hiding in plain sight.

J Am Med Inform Assoc

December 2019

The MITRE Corp, Bedford, Massachusetts, USA.

David S Carrell, David J Cronkite, Muqun Rachel Li, Steve Nyemba, Bradley A Malin

Objective: Clinical corpora can be deidentified using a combination of machine-learned automated taggers and hiding in plain sight (HIPS) resynthesis. The latter replaces detected personally identifiable information (PII) with random surrogates, allowing leaked PII to blend in or "hide in plain sight." We evaluated the extent to which a malicious attacker could expose leaked PII in such a corpus.
