Category Ranking: 98%
Total Visits: 921
Avg Visit Duration: 2 minutes
Citations: 20

Article Abstract

Purpose: The purpose of this study was to compare the ability of two multimodal models (GPT-4o and Gemini 1.5 Pro) with that of radiologists to generate differential diagnoses from textual context alone, key images alone, or a combination of both using complex neuroradiology cases.

Materials and Methods: This retrospective study included neuroradiology cases from the "Diagnosis Please" series published in the Radiology journal between January 2008 and September 2024. The two multimodal models were asked to provide three differential diagnoses from textual context alone, key images alone, or the complete case. Six board-certified neuroradiologists solved the cases in the same setting, randomly assigned to two groups: context alone first and images alone first. Three radiologists solved the cases without, and then with, the assistance of Gemini 1.5 Pro. An independent radiologist evaluated the quality of the image descriptions provided by GPT-4o and Gemini for each case. Differences in correct answers between multimodal models and radiologists were analyzed using the McNemar test.
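
As a minimal illustration of the paired analysis described above (not the authors' code), the sketch below compares per-case correctness between a model and a reader with the McNemar test. The per-case outcomes are simulated, and the use of statsmodels is an assumption about tooling rather than something reported in the study.

```python
# Minimal sketch: paired comparison of per-case correctness (model vs. reader)
# using the McNemar test. Outcomes are simulated for illustration only.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
model_correct = rng.random(53) < 0.35    # hypothetical per-case outcomes (53 cases)
reader_correct = rng.random(53) < 0.45

# 2x2 table of paired outcomes; only the discordant cells drive the test.
table = np.array([
    [np.sum(model_correct & reader_correct),  np.sum(model_correct & ~reader_correct)],
    [np.sum(~model_correct & reader_correct), np.sum(~model_correct & ~reader_correct)],
])

result = mcnemar(table, exact=True)  # exact binomial test on the discordant pairs
print(f"McNemar statistic = {result.statistic:.0f}, p = {result.pvalue:.3f}")
```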

Results: GPT-4o and Gemini 1.5 Pro outperformed radiologists using clinical context alone (mean accuracy, 34.0 % [18/53] and 44.7 % [23.7/53] vs. 16.4 % [8.7/53]; both P < 0.01). Radiologists outperformed GPT-4o and Gemini 1.5 Pro using images alone (mean accuracy, 42.0 % [22.3/53] vs. 3.8 % [2/53] and 7.5 % [4/53]; both P < 0.01) and the complete cases (48.0 % [25.6/53] vs. 34.0 % [18/53] and 38.7 % [20.3/53]; both P < 0.001). While radiologists improved their accuracy when combining multimodal information (from 42.1 % [22.3/53] for images alone to 50.3 % [26.7/53] for complete cases; P < 0.01), GPT-4o and Gemini 1.5 Pro did not benefit from the multimodal context (from 34.0 % [18/53] for text alone to 35.2 % [18.7/53] for complete cases for GPT-4o, P = 0.48; and from 44.7 % [23.7/53] to 42.8 % [22.7/53] for Gemini 1.5 Pro, P = 0.54). Radiologists benefited significantly from the suggestions of Gemini 1.5 Pro, increasing their accuracy from 47.2 % [25/53] to 56.0 % [27/53] (P < 0.01). Both GPT-4o and Gemini 1.5 Pro correctly identified the imaging modality in 53/53 (100 %) and 51/53 (96.2 %) cases, respectively, but frequently failed to identify key imaging findings (incorrect identification in 43/53 cases [81.1 %] for GPT-4o and 50/53 cases [94.3 %] for Gemini 1.5 Pro).
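
The bracketed figures above are mean numbers of correctly solved cases out of 53, averaged over readers (or over model runs), which is why values such as 22.3/53 are not integer counts. A minimal sketch of that arithmetic, using hypothetical per-reader counts:

```python
# Hypothetical per-reader counts of correctly solved cases (out of 53),
# illustrating how a bracketed mean such as "42.0 % [22.3/53]" is obtained.
correct_counts = [21, 22, 24]                              # assumed counts for three readers
mean_correct = sum(correct_counts) / len(correct_counts)   # 22.3
accuracy = mean_correct / 53                               # ~0.42
print(f"{mean_correct:.1f}/53 = {accuracy:.1%}")
```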

Conclusion: Radiologists show a specific ability to benefit from the integration of textual and visual information, whereas multimodal models mostly rely on the clinical context to suggest diagnoses.

Source: http://dx.doi.org/10.1016/j.diii.2025.04.006

Publication Analysis

Top Keywords

gemini pro: 32
gpt-4o gemini: 24
multimodal models: 16
340 [18/53]: 12
complete cases: 12
gemini: 10
radiologists: 9
cases: 9
models radiologists: 8
neuroradiology cases: 8

Similar Publications

Purpose: Large language models (LLMs) can assist patients who seek medical knowledge online to guide their own glaucoma care. Understanding the differences in LLM performance on glaucoma-related questions can inform patients about the best resources to obtain relevant information.

Methods: This cross-sectional study evaluated the accuracy, comprehensiveness, quality, and readability of LLM-generated responses to glaucoma inquiries.

Background: The effective implementation of personalized pharmacogenomics (PGx) requires the integration of released clinical guidelines into decision support systems to facilitate clinical applications. Large language models (LLMs) can be valuable tools for automating information extraction and updates.

Objective: This study aimed to assess the effectiveness of repeated cross-comparisons and an agreement-threshold strategy in 2 advanced LLMs as supportive tools for updating information.

Clinical decision-making for uveal melanoma radiotherapy: comparative performance of experienced radiation oncologists and leading generative AI models.

Front Oncol

August 2025

Department of Ophthalmology, The First Affiliated Hospital of Chongqing Medical University, Chongqing Key Laboratory for the Prevention and Treatment of Major Blinding Eye Diseases, Chongqing, China.

Background: Uveal melanoma is the most common primary intraocular malignancy in adults, yet radiotherapy decision-making for this disease often remains complex and variable. Although emerging generative AI models have shown promise in synthesizing vast clinical information, few studies have systematically compared their performance against experienced radiation oncologists in this specialized domain. This study examined the comparative accuracy of three leading generative AI models and experienced radiation oncologists in guideline-based clinical decision-making for uveal melanoma.

Introduction: Large language models (LLMs) offer a promising approach to infer personality traits unobtrusively from digital footprints. However, the reliability and validity of these inferences remain underexplored.

Method: Gemini 1.

Background: Timely and accurate triage is crucial for emergency department (ED) care. Recently, there has been growing interest in applying large language models (LLMs) to support triage decision-making. However, most existing studies have evaluated these models using simulated scenarios rather than real-world clinical cases.
