In recent years, numerous artificial intelligence applications, especially generative large language models, have emerged in the medical field. This study conducted a structured comparative analysis of four leading generative large language models (LLMs) — ChatGPT-4o (OpenAI), Grok-3 (xAI), Gemini-2.0 Flash (Google), and DeepSeek-V3 (DeepSeek) — to evaluate their diagnostic performance in clinical case scenarios. We assessed medical knowledge recall and clinical reasoning capabilities through staged, progressively complex cases, with responses graded by expert raters on a 0-5 scale. All models performed better on knowledge-based questions than on reasoning tasks, highlighting ongoing limitations in contextual diagnostic synthesis. Overall, DeepSeek outperformed the other models, achieving significantly higher scores across all evaluation dimensions (p < 0.05), particularly on medical reasoning tasks. While these findings support the feasibility of using LLMs for medical training and decision support, the study emphasizes the need for improved interpretability, prompt optimization, and rigorous benchmarking to ensure clinical reliability. This structured, comparative approach contributes to ongoing efforts to establish standardized evaluation frameworks for integrating LLMs into diagnostic workflows.
Download full-text PDF | Source
---|---
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC12248924 | PMC
http://dx.doi.org/10.3390/diagnostics15131657 | DOI Listing
Drug Saf
September 2025
The MITRE Corporation, 202 Burlington Rd, Bedford, MA, 01730, USA.
Acta Neurochir (Wien)
September 2025
Department of Neurosurgery, Istinye University, Istanbul, Turkey.
Background: Recent studies suggest that large language models (LLMs) such as ChatGPT are useful tools for medical students and residents preparing for examinations. These studies, especially those conducted with multiple-choice questions, emphasize that the level of knowledge and response consistency of LLMs is generally acceptable; however, further optimization is needed in areas such as case discussion, interpretation, and language proficiency. Therefore, this study aimed to evaluate the performance of six distinct LLMs on Turkish and English neurosurgery multiple-choice questions and assess their accuracy and consistency in a specialized medical context.
Arch Gynecol Obstet
September 2025
Department of Obstetrics and Gynaecology, IRCCS San Raffaele Scientific Institute, 20132, Milan, Italy.
Objectives: Recommendations regarding the use of third-trimester ultrasound lack universal consensus. Yet there is evidence supporting its value in assessing fetal growth, fetal well-being, and a number of pregnancy-related complications. This literature review evaluates the available scientific evidence regarding its applications, usefulness, and the timing of the third-trimester scan in a low-risk population.
J Glaucoma
September 2025
Harvard Medical School, Boston, MA.
Purpose: Large language models (LLMs) can assist patients who seek medical knowledge online to guide their own glaucoma care. Understanding the differences in LLM performance on glaucoma-related questions can inform patients about the best resources to obtain relevant information.
Methods: This cross-sectional study evaluated the accuracy, comprehensiveness, quality, and readability of LLM-generated responses to glaucoma inquiries.
Disabil Rehabil Assist Technol
September 2025
International Communication College, Jilin International Studies University, Changchun, Jilin, China.
Background: Conventional automated writing evaluation systems typically provide insufficient support for students with special needs, especially in tonal language acquisition such as Chinese, primarily because of rigid feedback mechanisms and limited customisation.
Objective: This research develops a Context-aware Hierarchical AI Tutor for Writing Enhancement (CHATWELL), an intelligent tutoring platform that incorporates optimised large language models to deliver instantaneous, customised, and multi-dimensional writing assistance for Chinese language learners, with special consideration for those with cognitive learning barriers.
Methods: CHATWELL employs a hierarchical AI framework with a four-tier feedback mechanism designed to accommodate diverse learning needs.