Multilingual feasibility of GPT-4o for automated Voice-to-Text CT and MRI report transcription.

Felix Busch, Philipp Prucker, Alexander Komenda, Sebastian Ziegelmayer, Marcus R Makowski, Keno K Bressem, Lisa C Adams

Eur J Radiol

School of Medicine and Health, Department of Diagnostic and Interventional Radiology, Klinikum rechts der Isar, TUM University Hospital, Technical University of Munich, Ismaninger Str. 22, 81675 Munich, Germany.

Published: January 2025

Purpose: Large language models (LLMs) promise to streamline radiology reporting. With the release of OpenAI's GPT-4o (Generative Pre-trained Transformer 4 omni), which processes speech as well as text, multimodal LLMs might now serve as medical speech recognition software for radiology reporting in multiple languages. This proof-of-concept study investigates the feasibility of using GPT-4o for automated voice-to-text transcription of radiology reports in English and German.

Methods: Three readers with varying levels of experience each dictated 100 synthetic radiology reports in both languages using GPT-4o via the ChatGPT iOS mobile application. Reports included CT and MRI scans of various anatomical regions. Evaluation metrics included error type, severity, and correction time. BERTScore and ROUGE metrics were calculated to assess semantic similarity and n-gram overlap between dictated and original reports.
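As an illustration of the n-gram overlap idea behind the ROUGE metrics mentioned above, the following is a minimal sketch of ROUGE-N precision, recall, and F1 between an original report and its dictated transcription. It is not the study's implementation: the function names, the whitespace tokenization, and the example sentences are assumptions made for this sketch, and established ROUGE packages add steps such as stemming. BERTScore is not shown because it requires a pretrained language model.

```python
from collections import Counter


def tokenize(text):
    # Assumption: simple lowercase whitespace tokenization (no stemming).
    return text.lower().split()


def ngram_counts(tokens, n):
    # Count every contiguous n-gram in the token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference, candidate, n=1):
    """Return (precision, recall, f1) of n-gram overlap."""
    ref = ngram_counts(tokenize(reference), n)
    cand = ngram_counts(tokenize(candidate), n)
    # Clipped overlap: each n-gram counts at most as often as it appears in both.
    overlap = sum((ref & cand).values())
    ref_total = sum(ref.values())
    cand_total = sum(cand.values())
    recall = overlap / ref_total if ref_total else 0.0
    precision = overlap / cand_total if cand_total else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical example: the transcription splits "acute" into "a cute".
original = "no evidence of acute intracranial hemorrhage"
dictated = "no evidence of a cute intracranial hemorrhage"
precision, recall, f1 = rouge_n(original, dictated, n=1)
print(f"ROUGE-1 precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this hypothetical pair, five of the six original words survive in the seven-word transcription, so recall is 5/6 and precision is 5/7. A single split word lowers both scores even though a human reader would rate it a minor error.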

Results: No significant differences in correction time between languages were found, but differences were observed between readers based on experience. Error rates were similar for both languages. Most errors were minor in severity (92.68%, n = 114/123 German; 94.74%, n = 90/95 English), and the predominant error types were technical (27.04%, n = 43/159 German; 35.65%, n = 41/115 English) and typographical (23.90%, n = 38/159 German; 27.83%, n = 32/115 English). BERTScore metrics were significantly higher for German, while ROUGE metrics showed no significant differences between languages.
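The reported percentages follow directly from the stated counts. The short sketch below recomputes them, using only the numerators and denominators given in the Results; the dictionary layout and the helper name `pct` are assumptions for illustration.

```python
def pct(numerator, denominator):
    # Percentage rounded to two decimals, as reported in the abstract.
    return round(100 * numerator / denominator, 2)


# (language, category) -> (count, total), copied from the Results paragraph.
reported_counts = {
    ("German", "minor"): (114, 123),
    ("English", "minor"): (90, 95),
    ("German", "technical"): (43, 159),
    ("English", "technical"): (41, 115),
    ("German", "typographical"): (38, 159),
    ("English", "typographical"): (32, 115),
}

for (language, category), (count, total) in reported_counts.items():
    print(f"{language:8s} {category:14s} {count}/{total} = {pct(count, total):.2f}%")
```

The severity figures and the error-type figures use different denominators (123 versus 159 for German, 95 versus 115 for English), so the two sets of percentages are not directly comparable.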

Conclusion: This study demonstrates the potential of GPT-4o for multilingual transcription of radiology reports, effectively handling both English and German with minimal errors and high semantic understanding. Future research should compare GPT-4o with current radiology dictation tools, assessing performance, cost-effectiveness, and multilingual capabilities across diverse speaker populations.

DOI: http://dx.doi.org/10.1016/j.ejrad.2024.111827

Publication Analysis

Top Keywords

radiology reports

feasibility gpt-4o

gpt-4o automated

automated voice-to-text

radiology reporting

transcription radiology

correction time

rouge metrics

gpt-4o

radiology

Similar Publications

Biparametric vs Multiparametric MRI for Prostate Cancer Diagnosis: The PRIME Diagnostic Clinical Trial.

JAMA

September 2025

Division of Surgery and Interventional Science, UCL, London, United Kingdom.

Alexander B C D Ng, Aqua Asif, Ridhi Agarwal, Valeria Panebianco, Rossano Girometti

Importance: Multiparametric magnetic resonance imaging (MRI), with or without prostate biopsy, has become the standard of care for diagnosing clinically significant prostate cancer. Resource capacity limits widespread adoption. Biparametric MRI, which omits the gadolinium contrast sequence, is a shorter and cheaper alternative offering time-saving capacity gains for health systems globally.

Thirty years of SPM-BrainMap synergy: making and mining coordinate-based literature.

Cereb Cortex

August 2025

Research Imaging Institute, University of Texas Health Science Center at San Antonio, 8403 Floyd Curl Drive, San Antonio, TX 78229, United States.

Peter T Fox

Statistical Parametric Mapping (SPM) adheres to rigorous methodological standards, including: spatial normalization, inter-subject averaging, voxel-wise contrasts, and coordinate reporting. This rigor ensures that a thematically diverse literature is amenable to meta-analysis. BrainMap is a community database (www.

Atypical "opened-bottle" proximal tibial fractures in young male patients with growth hormone and aromatase inhibitor treatment: case series.

Skeletal Radiol

September 2025

Department of Radiology, Hospital do Coração (HCor), Rua Desembargador Eliseu Guilherme, 53, 7th floor. CEP, São Paulo, SP, 04004-03, Brazil.

Ivan Rodrigues Barros Godoy, Tatiane Cantarelli Rodrigues, Andre Fukunishi Yamada, Abdalla Skaf

Atypical proximal tibial fractures in adolescents are rare, particularly when linked to hormonal therapy for short stature. This case series reports the clinical and imaging features of atypical proximal tibial and distal femoral physeal fractures in male adolescents undergoing combined growth hormone (GH) and aromatase inhibitor (AI) therapy for idiopathic short stature. We report three cases of skeletally immature male adolescents (ages 12-16) treated with GH and anastrozole who presented with acute leg pain following low-energy trauma during soccer.

Mycoplasma pneumoniae Infection Mimicking Pediatric Tuberculosis: A Case Report.

Pediatr Infect Dis J

September 2025

Department of Pediatric Infectious Diseases, University of Health Sciences Dr. Behçet Uz Children's Hospital, İzmir, Turkey.

Damla Sel Coban, Hincal Ozbakir, Berna Avci Yavuz, Berna Kahraman Cetin, Ozgul Gulaslan Erdogan

Out-of-the-Box Large Language Models for Detecting and Classifying Critical Findings in Radiology Reports Using Various Prompt Strategies.

AJR Am J Roentgenol

September 2025

Department of Radiology, Stanford University, Stanford, CA, USA.

Ish A Talati, Juan M Zambrano Chaves, Avisha Das, Imon Banerjee, Daniel L Rubin

The increasing complexity and volume of radiology reports present challenges for timely communication of critical findings. This study aimed to evaluate the performance of two out-of-the-box LLMs in detecting and classifying critical findings in radiology reports using various prompt strategies. The analysis included 252 radiology reports of varying modalities and anatomic regions extracted from the MIMIC-III database, divided into a prompt engineering tuning set of 50 reports, a holdout test set of 125 reports, and a pool of 77 remaining reports used as examples for few-shot prompting.
