Large language models can accurately populate Vascular Quality Initiative procedural databases using narrative operative reports.

Colleen P Flanagan, Karen Trang, Joyce Nacario, Peter A Schneider, Warren J Gasper, Michael S Conte, Elizabeth C Wick, Allan M Conway

J Vasc Surg

Division of Vascular and Endovascular Surgery, Department of Surgery, University of California San Francisco, San Francisco, CA.

Published: April 2025

Objective: Participation in the Vascular Quality Initiative (VQI) provides important resources to surgeons, but centers' ability to participate is often limited by available time and data entry personnel. Large language models (LLMs) such as ChatGPT (OpenAI) are generative artificial intelligence products that may help bridge this gap. Trained on large volumes of data, these models are used for natural language processing and text generation. We evaluated the ability of LLMs to accurately populate VQI procedural databases using operative reports.

Methods: A single-center, retrospective study was performed using institutional VQI data from 2021 to 2023. The most recent procedures for carotid endarterectomy (CEA), endovascular aneurysm repair (EVAR), and infrainguinal lower extremity bypass (LEB) were analyzed using Versa, a HIPAA (Health Insurance Portability and Accountability Act)-compliant institutional version of ChatGPT. We created an automated function to analyze operative reports and generate a shareable VQI file using two models: gpt-35-turbo and gpt-4. The LLMs were accessed through a cloud-based programming interface. The outputs of both models were compared with VQI data for accuracy. We defined a metric as "unavailable" to the LLM if it was discussed by surgeons in <20% of operative reports.
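
The "unavailable" rule and the comparison against VQI data can be illustrated with a minimal sketch. Only the 20% threshold comes from the Methods; the field names, the record layout, and the helper functions below are hypothetical and are not the study's actual code.

```python
from typing import Dict, List, Optional

# Hypothetical layout: one dict per operative report, mapping a VQI field name
# to the value discussed in the report, or None when the report is silent.
Report = Dict[str, Optional[str]]

UNAVAILABLE_THRESHOLD = 0.20  # from the Methods: discussed in <20% of reports


def unavailable_fields(reports: List[Report], fields: List[str]) -> List[str]:
    """Return the fields discussed in fewer than 20% of the reports."""
    flagged = []
    for field in fields:
        discussed = sum(1 for r in reports if r.get(field) is not None)
        if discussed / len(reports) < UNAVAILABLE_THRESHOLD:
            flagged.append(field)
    return flagged


def report_accuracy(extracted: Report, reference: Report, fields: List[str]) -> float:
    """Fraction of the given fields where the LLM output matches the VQI entry."""
    matches = sum(1 for f in fields if extracted.get(f) == reference.get(f))
    return matches / len(fields)


# Illustrative data only: 10 reports, one of which mentions discharge status.
example_fields = ["anesthesia", "shunt_used", "discharge_status"]
example_reports = [
    {
        "anesthesia": "general",
        "shunt_used": "yes",
        "discharge_status": "home" if i == 0 else None,
    }
    for i in range(10)
]
flagged = unavailable_fields(example_reports, example_fields)
available = [f for f in example_fields if f not in flagged]
```

Here `discharge_status` is discussed in 1 of 10 reports (10%), so it is flagged, and accuracy can then be computed over the `available` fields only.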

Results: A total of 150 operative notes were analyzed, including 50 CEA, 50 EVAR, and 50 LEB. These procedural VQI databases included 25, 179, and 51 metrics, respectively. For all fields, gpt-35-turbo had a median accuracy of 84.0% for CEA (interquartile range [IQR]: 80.0%-88.0%), 92.2% for EVAR (IQR: 87.2%-94.0%), and 84.3% for LEB (IQR: 80.2%-88.1%). A total of 3 of 25, 6 of 179, and 7 of 51 VQI variables were unavailable in the operative reports, respectively. Excluding metric information routinely unavailable in operative reports, the median accuracy rate was 95.5% for each CEA procedure (IQR: 90.9%-100.0%), 94.8% for EVAR (IQR: 92.2%-98.5%), and 93.2% for LEB (IQR: 90.2%-96.4%). Across procedures, gpt-4 did not meaningfully improve performance compared with gpt-35-turbo (P = .97, .85, and .95 for CEA, EVAR, and LEB overall performance, respectively). The cost for 150 operative reports analyzed with gpt-35-turbo and gpt-4 was $0.12 and $3.39, respectively.
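
The summary statistics and costs above reduce to simple arithmetic, sketched below. The quartile method is an assumption (the abstract does not state which one was used), and `summarize` is a hypothetical helper. The observation that the CEA medians match whole-field counts is only an arithmetic check, not a claim made by the authors.

```python
from statistics import median, quantiles


def summarize(accuracies):
    """Median and interquartile range of per-report accuracies (fractions)."""
    # Assumption: "inclusive" quartiles; the abstract does not name a method.
    q1, _, q3 = quantiles(accuracies, n=4, method="inclusive")
    return median(accuracies), (q1, q3)


# The reported CEA medians are consistent with whole-field counts:
# 21 of 25 fields overall, and 21 of 22 once the 3 unavailable fields are excluded.
cea_all_fields = round(21 / 25 * 100, 1)      # 84.0
cea_available_only = round(21 / 22 * 100, 1)  # 95.5

# Average cost per operative report, from the reported totals for 150 reports.
cost_per_report_gpt35 = 0.12 / 150  # about $0.0008
cost_per_report_gpt4 = 3.39 / 150   # about $0.0226
cost_ratio = 3.39 / 0.12            # gpt-4 cost roughly 28 times as much
```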

Conclusions: LLMs can accurately populate VQI procedural databases with both structured and unstructured data, while incurring only minor processing costs. Increased workflow efficiency may improve centers' ability to participate successfully in the VQI. Further work examining other VQI databases and methods to increase accuracy is needed.

DOI: http://dx.doi.org/10.1016/j.jvs.2024.12.002

Similar Publications

Performance of vision language models for optic disc swelling identification on fundus photographs.

Front Digit Health

August 2025

Department of Ophthalmology, Stanford University, Palo Alto, CA, United States.

Kelvin Zhenghao Li, Tuyet Thao Nguyen, Heather E Moss

Introduction: Vision language models (VLMs) combine image analysis capabilities with large language models (LLMs). Because of their multimodal capabilities, VLMs offer a clinical advantage over image classification models for the diagnosis of optic disc swelling by allowing a consideration of clinical context. In this study, we compare the performance of non-specialty-trained VLMs with different prompts in the classification of optic disc swelling on fundus photographs.

Increased Time to Provider for Patients With a Non-English Language Preference: A Retrospective Cohort Study.

J Am Coll Emerg Physicians Open

October 2025

Department of Emergency Medicine, University of Michigan, Ann Arbor, Michigan, USA.

Asmaa Rimawi, Anne Sung, Morgan Pike, Erica Lin, David A Haidar

Objectives: We assessed time to provider (TTP) for patients with a non-English language preference (NELP) compared to patients with an English language preference (ELP) in the emergency department (ED).

Methods: We conducted a retrospective cohort study of adults presenting between 2019 and 2023 to a large urban ED. We used a 2-step classification that first identified NELP from patients' reported language at registration, followed by identification in the narrative text of the triage note.

A Pure Transformer Pretraining Framework on Text-attributed Graphs.

Proc Mach Learn Res

November 2024

Michigan State University.

Yu Song, Haitao Mao, Jiachen Xiao, Jingzhe Liu, Zhikai Chen

Pretraining plays a pivotal role in acquiring generalized knowledge from large-scale data, achieving remarkable successes as evidenced by large models in CV and NLP. However, progress in the graph domain remains limited due to fundamental challenges represented by feature heterogeneity and structural heterogeneity. Recent efforts have been made to address feature heterogeneity via Large Language Models (LLMs) on text-attributed graphs (TAGs) by generating fixed-length text representations as node features.

Artificial Intelligence in Cardiac Treatment Decision-Making: An Evaluation of the Performance of ChatGPT Versus the Heart Team in Coronary Revascularization.

Rev Cardiovasc Med

August 2025

Cardiovascular Surgery Department, Ankara Bilkent City Hospital, 06800 Ankara, Turkey.

Serkan Mola, Alp Yıldırım, Enis Burak Gül

Background: This study aimed to investigate the performance of two versions of ChatGPT (o1 and 4o) in making decisions about coronary revascularization and to compare the recommendations of these versions with those of a multidisciplinary Heart Team. Moreover, the study aimed to assess whether the decisions generated by ChatGPT, based on the internal knowledge base of the system and clinical guidelines, align with expert recommendations in real-world coronary artery disease management. Given the increasing prevalence and processing capabilities of large language models, such as ChatGPT, this comparison offers insights into the potential applicability of these systems in complex clinical decision-making.

Toward more realistic career path prediction: evaluation and methods.

Front Big Data

August 2025

MaiNLP, Center for Information and Language Processing, LMU Munich, Munich, Germany.

Elena Senger, Yuri Campbell, Rob van der Goot, Barbara Plank

Predicting career trajectories is a complex yet impactful task, offering significant benefits for personalized career counseling, recruitment optimization, and workforce planning. However, effective career path prediction (CPP) modeling faces challenges including highly variable career trajectories, free-text resume data, and limited publicly available benchmark datasets. In this study, we present a comprehensive comparative evaluation of CPP models (linear projection, multilayer perceptron [MLP], LSTM, and large language models [LLMs]) across multiple input settings and two recently introduced public datasets.
