The machine giveth and the machine taketh away: a parrot attack on clinical text deidentified with hiding in plain sight.

David S Carrell, David J Cronkite, Muqun Rachel Li, Steve Nyemba, Bradley A Malin, John S Aberdeen, Lynette Hirschman

J Am Med Inform Assoc

The MITRE Corp, Bedford, Massachusetts, USA.

Published: December 2019

Objective: Clinical corpora can be deidentified using a combination of machine-learned automated taggers and hiding in plain sight (HIPS) resynthesis. The latter replaces detected personally identifiable information (PII) with random surrogates, allowing leaked PII to blend in or "hide in plain sight." We evaluated the extent to which a malicious attacker could expose leaked PII in such a corpus.
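
The HIPS step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the surrogate pools, the span format, the sample note, and the function name `hips_resynthesize` are all assumptions made for illustration. It shows only the core idea that tagged PII is swapped for random surrogates while PII the tagger missed stays in the text among them.

```python
import random

# Hypothetical surrogate pools (assumption); a real system would use far larger lists.
SURROGATES = {
    "NAME": ["Alice Romero", "Brian Okafor", "Carla Jensen"],
    "DATE": ["03/14/2011", "11/02/2009", "07/23/2012"],
}

def hips_resynthesize(text, tagged_spans, rng):
    """Replace each tagged (start, end, pii_type) span with a random surrogate.

    PII that the tagger missed is left untouched, so it sits among the surrogates.
    Spans are assumed not to overlap.
    """
    pieces = []
    cursor = 0
    for start, end, pii_type in sorted(tagged_spans):
        pieces.append(text[cursor:start])                   # text before the span
        pieces.append(rng.choice(SURROGATES[pii_type]))     # surrogate for the span
        cursor = end
    pieces.append(text[cursor:])                            # trailing text
    return "".join(pieces)

# Invented example note (not from the study corpus).
note = "Seen by Dr. John Smith on 01/05/2010; follow up with Mary Jones."

def span_of(substring, pii_type):
    """Locate a substring in the note and return it as a tagged span."""
    start = note.index(substring)
    return (start, start + len(substring), pii_type)

# Assume the tagger found the first name and the date but missed "Mary Jones".
spans = [span_of("John Smith", "NAME"), span_of("01/05/2010", "DATE")]
resynthesized = hips_resynthesize(note, spans, random.Random(0))
print(resynthesized)
```

In this toy note, "Mary Jones" is the leak: it survives resynthesis, but a reader cannot tell from the text alone whether it is real or a surrogate.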

Materials And Methods: We modeled a scenario where an institution (the defender) externally shared an 800-note corpus of actual outpatient clinical encounter notes from a large, integrated health care delivery system in Washington State. These notes were deidentified by a machine-learned PII tagger and HIPS resynthesis. A malicious attacker obtained the corpus and performed a parrot attack intended to expose leaked PII in it. Specifically, the attacker mimicked the defender's process by manually annotating all PII-like content in half of the released corpus, training a PII tagger on these data, and using the trained model to tag the remaining encounter notes. The attacker hypothesized that untagged identifiers would be leaked PII, discoverable by manual review. We evaluated the attacker's success using measures of leak-detection rate and accuracy.
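
The attack workflow can be sketched in a few lines, under strong simplifying assumptions. The "tagger" below is a toy that memorizes annotated strings, standing in for the machine-learned tagger used in the study, and the manual review step is represented by a pre-supplied list of PII-like strings per note. All names and data here are hypothetical.

```python
def train_tagger(annotated_notes):
    """Toy stand-in for a machine-learned tagger: memorize every annotated PII string."""
    lexicon = set()
    for pii_strings in annotated_notes:
        lexicon.update(pii_strings)
    return lexicon

def parrot_attack(annotated_half, held_out_half):
    """Return, per held-out note, the PII-like strings the attacker's tagger did not tag.

    annotated_half: list of lists of PII-like strings the attacker annotated by hand.
    held_out_half: dict mapping note id to the PII-like strings a manual reviewer
                   would see in that note (an assumption replacing real review).
    """
    lexicon = train_tagger(annotated_half)
    suspected = {}
    for note_id, candidates in held_out_half.items():
        untagged = set(candidates) - lexicon   # identifiers the tagger missed
        suspected[note_id] = sorted(untagged)  # hypothesized leaks
    return suspected

# Invented data: the attacker annotated surrogates in the first half of the corpus.
annotated_half = [["Alice Romero", "03/14/2011"], ["Brian Okafor"]]
# In the held-out note, "Mary Jones" is a real leak and "Carla Jensen" is a
# surrogate that the toy tagger never saw.
held_out_half = {"note_401": ["Alice Romero", "Carla Jensen", "Mary Jones"]}

suspected_leaks = parrot_attack(annotated_half, held_out_half)
print(suspected_leaks)
```

The toy output contains both a true leak and an unseen surrogate, which mirrors the two kinds of attacker hypotheses the evaluation has to separate: correct ones and wrong ones.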

Results: The attacker correctly hypothesized that 211 (68%) of 310 actual PII leaks in the corpus were leaks, and wrongly hypothesized that 191 resynthesized PII instances were also leaks. One-third of actual leaks remained undetected.
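
The reported counts can be checked with a short calculation. The leak-detection rate follows directly from the figures above; treating the share of correct hypotheses as a precision-style accuracy measure is an assumption here, since the abstract does not define its accuracy metric.

```python
# Counts reported in the Results paragraph.
actual_leaks = 310        # real PII instances that leaked into the corpus
detected_leaks = 211      # leaks the attacker correctly hypothesized
false_hypotheses = 191    # resynthesized surrogates wrongly hypothesized as leaks

# Leak-detection rate: share of actual leaks the attacker found.
detection_rate = detected_leaks / actual_leaks

# Assumed precision-style measure: share of the attacker's hypotheses that were right.
precision = detected_leaks / (detected_leaks + false_hypotheses)

# Leaks that stayed hidden.
undetected = actual_leaks - detected_leaks
undetected_rate = undetected / actual_leaks

print(f"detection rate: {detection_rate:.1%}")
print(f"precision (assumed definition): {precision:.1%}")
print(f"undetected leaks: {undetected} ({undetected_rate:.1%})")
```

Under that assumed definition, roughly half of the attacker's 402 hypotheses were correct, and 99 of the 310 leaks (about one-third) remained hidden.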

Discussion And Conclusion: A malicious parrot attack to reveal leaked PII in clinical text deidentified by machine-learned HIPS resynthesis can attenuate but not eliminate the protective effect of HIPS deidentification.

Full text (PMC): http://www.ncbi.nlm.nih.gov/pmc/articles/PMC6857511
DOI: http://dx.doi.org/10.1093/jamia/ocz114

Similar Publications

Addressing contemporary threats in anonymised healthcare data using privacy engineering.

NPJ Digit Med

March 2025

University of California Berkeley, School of Information Science, Berkeley, CA, USA.

Sanjiv M Narayan, Nitin Kohli, Megan M Martin

Cyber-attacks on healthcare entities and leaks of personal identifiable information (PII) are a growing threat. However, it is now possible to learn sensitive characteristics of an individual without PII, by combining advances in artificial intelligence, analytics, and online repositories. We discuss privacy threats and privacy engineering solutions, emphasizing the selection of privacy enhancing technologies for various healthcare cases.

Simulating data breaches: Synthetic datasets for depicting personally identifiable information through scenario-based breaches.

Data Brief

February 2025

Kennesaw State University, United States.

Abhishek Sharma, May Bantan

With hackers relentlessly disrupting cyberspace and the day-to-day operations of organizations worldwide, there are also concerns related to Personally Identifiable Information (PII). Due to the data breaches and the data getting dumped on the clear web or the dark web, there are serious concerns about how the different threat actors worldwide can misuse the data. Also, it raises the question of how hackers can create a profile of an individual starting from one data leak and getting more details on individuals with the help of Open Source Intelligence (OSINT).

Building a best-in-class automated de-identification tool for electronic health records through ensemble learning.

Patterns (N Y)

June 2021

nference, Cambridge, MA 02142, USA.

Karthik Murugadoss, Ajit Rajasekharan, Bradley Malin, Vineet Agarwal, Sairam Bade

The presence of personally identifiable information (PII) in natural language portions of electronic health records (EHRs) constrains their broad reuse. Despite continuous improvements in automated detection of PII, residual identifiers require manual validation and correction. Here, we describe an automated de-identification system that employs an ensemble architecture, incorporating attention-based deep-learning models and rule-based methods, supported by heuristics for detecting PII in EHR data.

Resilience of clinical text de-identified with "hiding in plain sight" to hostile reidentification attacks by human readers.

J Am Med Inform Assoc

July 2020

Human Language Technology, MITRE Corporation, Bedford, Massachusetts, USA.

David S Carrell, Bradley A Malin, David J Cronkite, John S Aberdeen, Cheryl Clark

Objective: Effective, scalable de-identification of personally identifying information (PII) for information-rich clinical text is critical to support secondary use, but no method is 100% effective. The hiding-in-plain-sight (HIPS) approach attempts to solve this "residual PII problem." HIPS replaces PII tagged by a de-identification system with realistic but fictitious (resynthesized) content, making it harder to detect remaining unredacted PII.
