Benchmarking the diagnostic performance of open source LLMs in 1933 Eurorad case reports.

Su Hwan Kim , Severin Schramm , Lisa C Adams , Rickmer Braren , Keno K Bressem , Matthias Keicher , Paul-Sören Platzek , Karolin Johanna Paprottka , Claus Zimmer , Dennis M Hedderich , Benedikt Wiestler

NPJ Digit Med

Department of Diagnostic and Interventional Neuroradiology, Klinikum rechts der Isar, School of Medicine and Health, Technical University of Munich, Munich, Germany.

Published: February 2025

Category Ranking

98%

Total Visits

921

Avg Visit Duration

2 minutes

Citations

Recent advancements in large language models (LLMs) have created new ways to support radiological diagnostics. While both open-source and proprietary LLMs can address privacy concerns through local or cloud deployment, open-source models provide advantages in continuity of access, and potentially lower costs. This study evaluated the diagnostic performance of fifteen open-source LLMs and one closed-source LLM (GPT-4o) in 1,933 cases from the Eurorad library. LLMs provided differential diagnoses based on clinical history and imaging findings. Responses were considered correct if the true diagnosis appeared in the top three suggestions. Models were further tested on 60 non-public brain MRI cases from a tertiary hospital to assess generalizability. In both datasets, GPT-4o demonstrated superior performance, closely followed by Llama-3-70B, revealing how open-source LLMs are rapidly closing the gap to proprietary models. Our findings highlight the potential of open-source LLMs as decision support tools for radiological differential diagnosis in challenging, real-world cases.

Download full-text PDF	Source
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11814077	PMC
http://dx.doi.org/10.1038/s41746-025-01488-3	DOI Listing

Publication Analysis

Top Keywords

open-source llms

diagnostic performance

llms

open-source

benchmarking diagnostic

performance open

open source

source llms

llms 1933

1933 eurorad

Similar Publications

Performance of vision language models for optic disc swelling identification on fundus photographs.

Front Digit Health

August 2025

Department of Ophthalmology, Stanford University, Palo Alto, CA, United States.

Kelvin Zhenghao Li , Tuyet Thao Nguyen , Heather E Moss

Introduction: Vision language models (VLMs) combine image analysis capabilities with large language models (LLMs). Because of their multimodal capabilities, VLMs offer a clinical advantage over image classification models for the diagnosis of optic disc swelling by allowing a consideration of clinical context. In this study, we compare the performance of non-specialty-trained VLMs with different prompts in the classification of optic disc swelling on fundus photographs.

View Article and Find Full Text PDF

Similar Publications

Applications of Federated Large Language Model for Adverse Drug Reactions Prediction: Scoping Review.

J Med Internet Res

September 2025

Department of Information Systems and Cybersecurity, The University of Texas at San Antonio, 1 UTSA Circle, San Antonio, TX, 78249, United States, 1 (210) 458-6300.

David Guo , Kim-Kwang Raymond Choo

Background: Adverse drug reactions (ADR) present significant challenges in health care, where early prevention is vital for effective treatment and patient safety. Traditional supervised learning methods struggle to address heterogeneous health care data due to their unstructured nature, regulatory constraints, and restricted access to sensitive personal identifiable information.

Objective: This review aims to explore the potential of federated learning (FL) combined with natural language processing and large language models (LLMs) to enhance ADR prediction.

View Article and Find Full Text PDF

Similar Publications

Generative artificial intelligence for automated data extraction from unstructured medical text.

JAMIA Open

October 2025

Division of Pulmonary and Critical Care, Brigham and Women's Hospital, Boston, MA, United States.

Nam Dao , Luisa Quesada , Syed Moin Hassan , Monica Iturrioz Campo , Shelsey Johnson

Objectives: Unstructured data, such as procedure notes, contain valuable medical information that is frequently underutilized due to the labor-intensive nature of data extraction. This study aims to develop a generative artificial intelligence (GenAI) pipeline using an open-source Large Language Model (LLM) with built-in guardrails and a retry mechanism to extract data from unstructured right heart catheterization (RHC) notes while minimizing errors, including hallucinations.

Materials And Methods: A total of 220 RHC notes were randomly selected for pipeline development and 200 for validation from the Pulmonary Vascular Disease Registry.

View Article and Find Full Text PDF

Similar Publications

Development and evaluation of a lightweight large language model chatbot for medication enquiry.

PLOS Digit Health

September 2025

Singapore Health Services, Artificial Intelligence Office, Singapore.

Kabilan Elangovan , Jasmine Chiat Ling Ong , Liyuan Jin , Benjamin Jun Jie Seng , Yu Heng Kwan

Large Language Models (LLMs) show promise in augmenting digital health applications. However, development and scaling of large models face computational constraints, data security concerns and limitations of internet accessibility in some regions. We developed and tested Med-Pal, a medical domain-specific LLM-chatbot fine-tuned with a fine-grained, expert curated medication-enquiry dataset consisting of 1,100 question and answer pairs.

View Article and Find Full Text PDF

Similar Publications

Evaluating Medium Scale, Open-Source Large Language Models: Towards Decision Support in a Precision Oncology Care Delivery Context.

Stud Health Technol Inform

September 2025

MOLIT Institute, Heilbronn, Germany.

Kevin Kaufmes , Georg Mathes , Dilyana Vladimirova , Stephanie Berger , Christian Fegeler

Introduction: In the context of precision oncology, patients often have complex conditions that require treatment based on specific and up-to-date knowledge of guidelines and research. This entails considerable effort when preparing such cases for molecular tumor boards (MTBs). Large language models (LLMs) could help to lower this burden if they could provide such information quickly and precisely on demand.

View Article and Find Full Text PDF

Similar Publications