Generative artificial intelligence for automated data extraction from unstructured medical text.

Nam Dao , Luisa Quesada , Syed Moin Hassan , Monica Iturrioz Campo , Shelsey Johnson , Suchandra Ghose , Raúl San José Estépar , Aaron Waxman , George Washko , Farbod N Rahaghi

JAMIA Open

Division of Pulmonary and Critical Care, Brigham and Women's Hospital, Boston, MA, United States.

Published: October 2025

Objectives: Unstructured data, such as procedure notes, contain valuable medical information that is frequently underutilized due to the labor-intensive nature of data extraction. This study aims to develop a generative artificial intelligence (GenAI) pipeline using an open-source Large Language Model (LLM) with built-in guardrails and a retry mechanism to extract data from unstructured right heart catheterization (RHC) notes while minimizing errors, including hallucinations.

Materials and Methods: A total of 220 RHC notes were randomly selected for pipeline development and 200 for validation from the Pulmonary Vascular Disease Registry. The pipeline comprised three main components: the Engineered Preload Framework (EPF), which integrated schemas and instructions; the LLM module, enhanced by reasoning capabilities; and the validation and retry mechanism, which ensured data accuracy through iterative self-correction. A clinical expert manually extracted data from the validation cohort to establish the ground truth. Pipeline performance was evaluated using precision, recall, and F1 score. Additionally, the dataset was stratified into quartiles to assess the pipeline's ability to handle varying levels of data availability.
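
The validation and retry mechanism described above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the field names, plausible ranges, prompt wording, the `llm` callable, and the crude "value must appear in the note" grounding check are all assumptions made for the example.

```python
import json
from typing import Callable, Optional

# Hypothetical schema (assumption): field name -> plausible (min, max) in mmHg.
SCHEMA = {"mean_pa_pressure": (5, 120), "pcwp": (0, 50)}


def build_prompt(note: str, errors: list[str]) -> str:
    """Combine instructions, schema fields, the note, and prior errors."""
    prompt = (
        "Extract these fields as JSON (use null if absent): "
        + ", ".join(SCHEMA)
        + "\n\nNote:\n"
        + note
    )
    if errors:
        # Feed validation failures back so the model can self-correct.
        prompt += "\n\nYour previous answer was rejected:\n" + "\n".join(errors)
    return prompt


def validate(raw: str, note: str) -> tuple[Optional[dict], list[str]]:
    """Return (parsed, errors); parsed is None when any check fails."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return None, ["top-level JSON value must be an object"]

    errors = [f"unexpected field: {name}" for name in data if name not in SCHEMA]
    for field, (low, high) in SCHEMA.items():
        if field not in data:
            errors.append(f"missing field: {field}")
            continue
        value = data[field]
        if value is None:  # null means "not reported in the note"
            continue
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            errors.append(f"{field}: expected a number or null")
        elif not low <= value <= high:
            errors.append(f"{field}: {value} outside plausible range")
        elif format(value, "g") not in note:
            # Crude grounding guardrail (assumption): reject values that do
            # not literally occur in the source text.
            errors.append(f"{field}: {value} not found in the note")
    return (None if errors else data), errors


def extract_with_retry(
    note: str, llm: Callable[[str], str], max_retries: int = 2
) -> Optional[dict]:
    """Query the model, validate the output, and retry with error feedback."""
    errors: list[str] = []
    for _ in range(max_retries + 1):
        data, errors = validate(llm(build_prompt(note, errors)), note)
        if data is not None:
            return data
    return None  # give up rather than return unvalidated output
```

Here `llm` is any function that maps a prompt string to the model's raw text reply, so the loop can be exercised with a stub before a real model is connected. Returning `None` after the final failed attempt favors a missed value over an unvalidated one, which matches the abstract's emphasis on minimizing hallucinations.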

Results: The pipeline achieved 99.0% precision, 85.0% recall, and a 91.5% F1 score, with an overall accuracy of 90% when evaluated at the note level. The most common error was missed values (5.2%), while hallucinations were the least frequent (<0.01%).
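
As a quick consistency check on the reported metrics, the F1 score is the harmonic mean of precision and recall. The short sketch below (the function name is illustrative) recomputes it from the precision and recall given above.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in the Results: 99.0% precision, 85.0% recall.
f1 = f1_score(0.990, 0.850)
print(f"F1 = {f1:.1%}")  # prints "F1 = 91.5%"
```

The result agrees with the reported 91.5% F1 score.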

Discussion and Conclusion: This study demonstrates the feasibility of a robust GenAI pipeline for automating structured data extraction from unstructured RHC procedure notes. The approach highlights the potential of LLMs in medical data mining to improve research efficiency and support clinical applications.

Full text (PMC): http://www.ncbi.nlm.nih.gov/pmc/articles/PMC12410982
DOI: http://dx.doi.org/10.1093/jamiaopen/ooaf097
