Objective: To evaluate the accuracy of large language models (LLMs) in extracting Coronary Artery Disease-Reporting and Data System (CAD-RADS) 2.0 components from coronary CT angiography (CCTA) reports, and assess the impact of prompting strategies.
Materials and Methods: In this multi-institutional study, we collected 319 synthetic, semi-structured CCTA reports from six institutions to protect patient privacy while maintaining clinical relevance. The dataset included 150 reports from a primary institution (100 for instruction development and 50 for internal testing) and 169 reports from five external institutions for external testing. Board-certified radiologists established reference standards following the CAD-RADS 2.0 guidelines for all three components: stenosis severity, plaque burden, and modifiers. Six LLMs (GPT-4, GPT-4o, Claude-3.5-Sonnet, o1-mini, Gemini-1.5-Pro, and DeepSeek-R1-Distill-Qwen-14B) were evaluated using an optimized instruction with zero-shot and few-shot prompting strategies, each with and without chain-of-thought (CoT) prompting. Accuracy was assessed and compared using McNemar's test.
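The accuracy comparison named above uses McNemar's test, which compares two paired classifiers on the same reports by looking only at the discordant cases (those one model got right and the other got wrong). The following is a minimal, self-contained sketch of the exact (binomial) form of the test; it is illustrative only and not the study's actual analysis code, which is not given in the abstract.

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Exact McNemar's test p-value from the two discordant counts:
    b = reports model A classified correctly and model B incorrectly,
    c = reports model B classified correctly and model A incorrectly.
    Computes a two-sided exact binomial test on the discordant pairs
    under the null hypothesis that each discordant case is equally
    likely to favor either model (p = 0.5)."""
    n = b + c
    k = min(b, c)
    # Two-sided tail probability under Binomial(n, 0.5),
    # doubled and capped at 1.
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(p, 1.0)
```

For example, `mcnemar_exact(1, 9)` evaluates the case where, of ten discordant reports, nine favor one model, yielding a small p-value; `mcnemar_exact(5, 5)` yields 1.0, since evenly split disagreements give no evidence of a difference.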
Results: LLMs demonstrated robust accuracy across all CAD-RADS 2.0 components. Peak stenosis severity accuracies reached 0.980 (48/49, Claude-3.5-Sonnet and o1-mini) in internal testing and 0.946 (158/167, GPT-4o and o1-mini) in external testing. Plaque burden extraction showed exceptional accuracy, with multiple models achieving perfect accuracy (43/43) in internal testing and 0.993 (137/138, GPT-4o and o1-mini) in external testing. Modifier detection demonstrated consistently high accuracy (≥0.990) across most models. One open-source model, DeepSeek-R1-Distill-Qwen-14B, showed a relatively low accuracy for stenosis severity: 0.898 (44/49, internal) and 0.820 (137/167, external). CoT prompting significantly enhanced the accuracy of several models, with GPT-4 showing the most substantial improvements: stenosis severity accuracy increased by 0.192 (P < 0.001) and plaque burden accuracy by 0.152 (P < 0.001) in external testing.
Conclusion: LLMs demonstrated high accuracy in automated extraction of CAD-RADS 2.0 components from semi-structured CCTA reports, particularly when used with CoT prompting.
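The zero-shot versus chain-of-thought contrast central to the study's findings can be illustrated with a prompt-construction sketch. The template wording, the JSON keys, and the reasoning steps below are hypothetical placeholders, not the optimized instruction actually used in the study.

```python
# Illustrative prompt templates for CAD-RADS 2.0 extraction.
# All wording and field names here are assumptions for demonstration.

ZERO_SHOT = """You are a cardiac imaging assistant. From the CCTA report
below, extract the CAD-RADS 2.0 components as JSON with the keys
"stenosis_severity", "plaque_burden", and "modifiers".

Report:
{report}
"""

COT = ZERO_SHOT + """
Reason step by step before answering: (1) identify the maximal stenosis
per vessel, (2) map it to a CAD-RADS stenosis category, (3) determine
the plaque burden category, then (4) list any applicable modifiers.
Output the JSON on the final line only.
"""

def build_prompt(report: str, chain_of_thought: bool = False) -> str:
    """Fill the chosen template with the report text."""
    template = COT if chain_of_thought else ZERO_SHOT
    return template.format(report=report)
```

The design point is that CoT prompting appends explicit intermediate reasoning steps before the structured answer, which the abstract reports yielded the largest accuracy gains for GPT-4 on external testing.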
Download full-text PDF:
- PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC12394816
- DOI: http://dx.doi.org/10.3348/kjr.2025.0293