Comparison of CT referral justification using clinical decision support and large language models in a large European cohort.

Mor Saban, Yaniv Alon, Osnat Luxenburg, Clara Singer, Monika Hierath, Alexandra Karoussou Schreiner, Boris Brkljačić, Jacob Sosna

Eur Radiol

Department of Radiology, Hadassah Medical Center, Faculty of Medicine, Hebrew University of Jerusalem, Jerusalem, Israel.

Published: October 2025

Background: Ensuring appropriate use of CT scans is critical for patient safety and resource optimization. Decision support tools and artificial intelligence (AI), such as large language models (LLMs), have the potential to improve CT referral justification, yet require rigorous evaluation against established standards and expert assessments.

Aim: To evaluate the performance of two LLMs, Generative Pre-trained Transformer 4 (GPT-4) and Claude-3 Haiku, and of independent experts in justifying CT referrals, using the ESR iGuide clinical decision support system as the reference standard.

Methods: CT referral data from 6356 patients were retrospectively analyzed. Recommendations were generated by the ESR iGuide, LLMs, and independent experts, and evaluated for accuracy, precision, recall, F1 score, and Cohen's kappa across medical test, organ, and contrast predictions. Statistical analysis included demographic stratification, confidence intervals, and p-values to ensure robust comparisons.
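The agreement metrics named in the Methods can be illustrated with a minimal sketch. This is not the authors' analysis code: the function name `agreement_metrics`, the label values, and the toy data are hypothetical, and the sketch treats one label as the positive class for precision, recall, and F1.

```python
from collections import Counter


def agreement_metrics(reference, predicted, positive):
    """Compare predicted labels with reference labels.

    `reference` plays the role of the reference standard (here, hypothetically,
    the decision-support recommendation); `predicted` is the rater being
    evaluated. Precision, recall, and F1 are computed for the `positive` label.
    """
    if len(reference) != len(predicted) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")

    n = len(reference)
    pairs = list(zip(reference, predicted))

    tp = sum(r == positive and p == positive for r, p in pairs)
    fp = sum(r != positive and p == positive for r, p in pairs)
    fn = sum(r == positive and p != positive for r, p in pairs)

    # Observed agreement: fraction of cases where both labels match.
    accuracy = sum(r == p for r, p in pairs) / n
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)

    # Cohen's kappa: observed agreement corrected for the agreement expected
    # by chance, given each rater's marginal label frequencies.
    ref_counts = Counter(reference)
    pred_counts = Counter(predicted)
    expected = sum(ref_counts[c] * pred_counts[c] for c in ref_counts) / (n * n)
    kappa = (accuracy - expected) / (1 - expected) if expected != 1 else 0.0

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "kappa": kappa,
    }


# Hypothetical toy data, not taken from the study.
reference = ["CT", "CT", "CT", "CT", "other", "other", "other", "other"]
predicted = ["CT", "CT", "CT", "other", "CT", "other", "other", "other"]
metrics = agreement_metrics(reference, predicted, positive="CT")
print(metrics)
```

Because kappa subtracts chance agreement, it can sit well below accuracy when one label dominates, which is why the two are usually reported together.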

Results: Independent experts achieved the highest accuracy (92.4%) for medical test justification, surpassing GPT-4 (88.8%) and Claude-3 Haiku (85.2%). For organ predictions, LLMs performed comparably to experts, achieving accuracies of 75.3-77.8% versus 82.6% for experts. For contrast predictions, GPT-4 showed the highest accuracy (57.4%) among models, while Claude-3 Haiku demonstrated poor agreement with guidelines (kappa = 0.006).

Conclusion: Independent experts remain the most reliable, but LLMs show potential for optimization, particularly in organ prediction. A hybrid human-AI approach could enhance CT referral appropriateness and utilization. Further research should focus on improving LLM performance and exploring their integration into clinical workflows.

Key Points:

Question: Can GPT-4 and Claude-3 Haiku justify CT referrals as accurately as independent experts, using the ESR iGuide as the gold standard?

Findings: Independent experts outperformed large language models in test justification. GPT-4 and Claude-3 showed comparable organ prediction but struggled with contrast selection, limiting full automation.

Clinical relevance: While independent experts remain most reliable, integrating AI with expert oversight may improve CT referral appropriateness, optimizing resource allocation and enhancing clinical decision-making.

DOI: http://dx.doi.org/10.1007/s00330-025-11608-y
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC12417242
