Evaluating prompt and data perturbation sensitivity in large language models for radiology reports classification.

Vera Sorin, Jeremy D Collins, Alex K Bratt, Joanna E Kusmirek, Vamshi K Mugu, Timothy L Kline, Crystal L Butler, Nadia G Wood, Cole J Cook, Panagiotis Korfiatis

JAMIA Open

Department of Radiology, Mayo Clinic College of Medicine and Science, Mayo Clinic, Rochester, MN 55905, United States.

Published: August 2025

Objectives: Large language models (LLMs) offer potential in natural language processing tasks in healthcare. Due to the need for high accuracy, understanding their limitations is essential. The purpose of this study was to evaluate the performance of LLMs in classifying radiology reports for the presence of pulmonary embolism (PE) under various conditions, including different prompt designs and data perturbations.

Materials And Methods: In this retrospective, institutional review board-approved study, we evaluated 3 Google LLMs (Gemini-1.5-Pro, Gemini-1.5-Flash-001, and Gemini-1.5-Flash-002) in classifying 11 999 pulmonary CT angiography radiology reports for PE. Ground truth labels were determined by concordance between a computer vision-based PE detection (CVPED) algorithm and multiple LLM runs under various configurations. Discrepancies between the algorithms' classifications were aggregated and manually reviewed. We evaluated the effects of prompt design, data perturbations, and repeated analyses across geographic cloud regions. Performance metrics were calculated.
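
As an illustration of the metric calculation mentioned above, the following is a minimal sketch, not the authors' code. It assumes labels are encoded as 1 (PE-positive) and 0 (PE-negative). The abstract reports accuracy and recall; precision and F1 are included here only as an assumption about what "performance metrics" might cover, and the names `BinaryMetrics` and `binary_metrics` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BinaryMetrics:
    accuracy: float
    recall: float
    precision: float  # assumption: not named in the abstract
    f1: float         # assumption: not named in the abstract


def binary_metrics(y_true, y_pred):
    """Compute metrics for PE-positive (1) vs PE-negative (0) labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one labeled report is required")

    # Tally the four cells of the confusion matrix.
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Guard against empty denominators (e.g. no positive reports).
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return BinaryMetrics(accuracy, recall, precision, f1)
```

Under these assumptions, calling `binary_metrics(reference_labels, model_labels)` once per model, prompt, or perturbation setting would give directly comparable numbers across configurations.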

Results: Of 11 999 reports, 1296 (10.8%) were PE-positive. Accuracy across LLMs ranged between 0.953 and 0.996. The highest recall was achieved with a prompt modified after a review of the misclassified cases (up to 0.997). Few-shot prompting improved recall (up to 0.99), while chain-of-thought generally degraded performance. Gemini-1.5-Flash-002 demonstrated the highest robustness against data perturbations. Geographic cloud region variability was minimal for Gemini-1.5-Pro, while the Flash models showed stable performance.
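
To put the reported figures in context, the short sketch below checks the stated prevalence and computes the accuracy of a hypothetical classifier that labels every report PE-negative. This baseline is an illustration added here, not an analysis from the study; only the two counts come from the abstract.

```python
TOTAL_REPORTS = 11_999  # from the abstract
PE_POSITIVE = 1_296     # from the abstract


def prevalence(positives, total):
    """Fraction of reports that are PE-positive."""
    return positives / total


def all_negative_accuracy(positives, total):
    """Accuracy of a hypothetical classifier that always predicts PE-negative."""
    return (total - positives) / total


if __name__ == "__main__":
    print(f"Prevalence: {prevalence(PE_POSITIVE, TOTAL_REPORTS):.1%}")
    print(
        "All-negative baseline accuracy: "
        f"{all_negative_accuracy(PE_POSITIVE, TOTAL_REPORTS):.3f}"
    )
```

With roughly 1 in 10 reports positive, such a baseline would already reach an accuracy of about 0.892 while having a recall of 0, which is one reason recall is worth reading alongside accuracy for this task.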

Discussion And Conclusion: LLMs demonstrated high performance in classifying radiology reports, though results varied with prompt design and data quality. These findings underscore the need for systematic evaluation and validation of LLMs for clinical applications, particularly in high-stakes scenarios.

Full text (PMC): http://www.ncbi.nlm.nih.gov/pmc/articles/PMC12343119
DOI: http://dx.doi.org/10.1093/jamiaopen/ooaf073
