Evaluating large language models in echocardiography reporting: opportunities and challenges.

Chieh-Ju Chao, Imon Banerjee, Reza Arsanjani, Chadi Ayoub, Andrew Tseng, Jean-Benoit Delbrouck, Garvan C Kane, Francisco Lopez-Jimenez, Zachi Attia, Jae K Oh, Bradley Erickson, Li Fei-Fei, Ehsan Adeli, Curtis Langlotz

Eur Heart J Digit Health

Center for Artificial Intelligence in Medicine and Imaging (AIMI), Stanford University, 1701 Page Mill Rd, Palo Alto, CA 94304, USA.

Published: May 2025

Aims: The increasing need for diagnostic echocardiography tests presents challenges in preserving the quality and promptness of reports. While Large Language Models (LLMs) have proven effective in summarizing clinical texts, their application in echo remains underexplored.

Methods and Results: Adult echocardiography studies, conducted at the Mayo Clinic from 1 January 2017 to 31 December 2017, were categorized into two groups: development (all Mayo locations except Arizona) and Arizona validation sets. We adapted open-source LLMs (Llama-2, MedAlpaca, Zephyr, and Flan-T5) using In-Context Learning and Quantized Low-Rank Adaptation fine-tuning (FT) for echo report summarization from 'Findings' to 'Impressions.' Against cardiologist-generated Impressions, the models' performance was assessed both quantitatively with automatic metrics and qualitatively by cardiologists. The development dataset included 97 506 reports from 71 717 unique patients, predominantly male (55.4%), with an average age of 64.3 ± 15.8 years. EchoGPT, a fine-tuned Llama-2 model, outperformed other models with win rates ranging from 87% to 99% in various automatic metrics, and produced reports comparable to cardiologists in qualitative review (significantly preferred in conciseness (P < 0.001), with no significant preference in completeness, correctness, and clinical utility). Correlations between automatic and human metrics were fair to modest, with the best being RadGraph F1 scores vs. clinical utility (r = 0.42), and automatic metrics showed insensitivity (0-5% drop) to changes in measurement numbers.
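The two quantitative ideas above, a pairwise win rate and the drop in a metric when measurement numbers are altered, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not the authors' evaluation code: the abstract does not define how win rate was computed (this sketch counts strict wins over paired reports), `token_f1` is a simple word-overlap stand-in for the automatic metrics, and `perturb_numbers` is a hypothetical perturbation that shifts every number by a fixed amount.

```python
import re
from collections import Counter
from typing import Callable, Sequence


def win_rate(scores_a: Sequence[float], scores_b: Sequence[float]) -> float:
    """Fraction of paired reports where model A scores strictly higher than model B.

    Assumption: ties count as non-wins; the paper may define this differently.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and of equal length")
    wins = sum(a > b for a, b in zip(scores_a, scores_b))
    return wins / len(scores_a)


def token_f1(candidate: str, reference: str) -> float:
    """Word-overlap F1, a toy stand-in for an automatic text metric."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def perturb_numbers(text: str, delta: float = 20.0) -> str:
    """Shift every number in the text by `delta` (hypothetical perturbation)."""
    return re.sub(
        r"\d+(?:\.\d+)?",
        lambda m: f"{float(m.group()) + delta:g}",
        text,
    )


def relative_drop(
    metric: Callable[[str, str], float],
    candidate: str,
    reference: str,
    perturb: Callable[[str], str] = perturb_numbers,
) -> float:
    """Relative decrease in the metric after perturbing the candidate text."""
    base = metric(candidate, reference)
    if base == 0:
        return 0.0
    return (base - metric(perturb(candidate), reference)) / base


# Toy example: the ejection fraction changes from 60 to 80, which matters
# clinically but alters only one token out of nine.
reference = "Normal left ventricular size with ejection fraction 60 percent"
drop = relative_drop(token_f1, reference, reference)
```

In this toy case only one of nine tokens changes, so a word-overlap score falls by a small fraction even though the reported measurement is wrong. In longer reports the altered numbers would make up an even smaller share of the tokens.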

Conclusion: EchoGPT can generate draft reports for human review and approval, helping to streamline the workflow. However, scalable evaluation approaches dedicated to echo reports remain necessary.

PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC12088711
DOI: http://dx.doi.org/10.1093/ehjdh/ztae086
