Diagnostic performance of newly developed large language models in critical illness cases: A comparative study.

Int J Med Inform

Department of Intensive Care Medicine, Affiliated Hospital of Southwest Jiaotong University, The Third People's Hospital of Chengdu, Chengdu, Sichuan, China.

Published: December 2025


Article Abstract

Background: Large language models (LLMs) are increasingly used in clinical decision support, and newly developed models have demonstrated promising potential, yet their diagnostic performance for critically ill patients in intensive care unit (ICU) settings remains underexplored. This study evaluated the diagnostic accuracy, differential diagnosis quality, and response quality of four newly developed LLMs in critical illness cases.

Methods: In this cross-sectional comparative study, four newly developed LLMs (ChatGPT-4o, ChatGPT-o3, DeepSeek-V3, and DeepSeek-R1) were evaluated using 50 critical illness cases in ICU settings from published literature. Diagnostic accuracy and response quality were compared across models.

Results: A total of 50 critical illness cases were included. ChatGPT-o3 achieved the highest diagnostic accuracy at 72% (36/50; 95% CI 0.600-0.840), followed by DeepSeek-R1 at 68% (34/50; 95% CI 0.540-0.800), ChatGPT-4o at 64% (32/50; 95% CI 0.500-0.760), and DeepSeek-V3 at 32% (16/50; 95% CI 0.200-0.460). ChatGPT-o3, DeepSeek-R1, and ChatGPT-4o all significantly outperformed DeepSeek-V3, with no significant differences among the three. The median differential quality score was 5.0 for ChatGPT-o3 (IQR 5.0-5.0; 95% CI 5.0-5.0), DeepSeek-R1 (IQR 5.0-5.0; 95% CI 5.0-5.0), and ChatGPT-4o (IQR 4.0-5.0; 95% CI 4.5-5.0), and 4.0 for DeepSeek-V3 (IQR 3.0-5.0; 95% CI 4.0-5.0). ChatGPT-o3 and DeepSeek-R1 scored significantly higher than DeepSeek-V3; ChatGPT-4o showed a non-significant trend toward better performance. All models received high Likert ratings for response completeness, clarity, and usefulness. ChatGPT-o3, DeepSeek-R1, and ChatGPT-4o each showed a trend toward better response quality compared to DeepSeek-V3, although no significant differences were observed among the models.
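
The accuracy figures above follow directly from the reported counts (correct diagnoses out of 50 cases). The sketch below recomputes them in Python and attaches a 95% interval to each. The abstract does not state how its confidence intervals were computed, so the Wilson score interval used here is an assumption chosen for illustration, and its bounds are not expected to match the reported ones exactly. The names `CORRECT`, `N_CASES`, `wilson_interval`, and `summarize` are hypothetical and not taken from the study.

```python
from math import sqrt

# Correct-diagnosis counts out of 50 cases, as reported in the Results.
CORRECT = {
    "ChatGPT-o3": 36,
    "DeepSeek-R1": 34,
    "ChatGPT-4o": 32,
    "DeepSeek-V3": 16,
}
N_CASES = 50


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    Assumption: the abstract does not name its CI method; Wilson is
    used here only as a common choice for small samples.
    """
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


def summarize(counts, n):
    """Return {model: (accuracy, lower, upper)} for each model."""
    out = {}
    for model, k in counts.items():
        lower, upper = wilson_interval(k, n)
        out[model] = (k / n, lower, upper)
    return out


if __name__ == "__main__":
    for model, (acc, lower, upper) in summarize(CORRECT, N_CASES).items():
        print(f"{model}: {acc:.0%} (Wilson 95% CI {lower:.3f}-{upper:.3f})")
```

With only 50 cases the intervals are wide and overlap for the three best-performing models, which is consistent with the reported absence of significant differences among them. A formal comparison would need the per-case paired outcomes, which the abstract does not provide.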

Conclusions: The newly developed models, especially the reasoning models, demonstrated strong potential in supporting diagnosis in critical illness cases in ICU settings. Domain-specific fine-tuning could further enhance their diagnostic accuracy. Notably, the open-source reasoning model DeepSeek-R1 performed competitively, suggesting strong potential for scalable deployment in resource-limited settings.

Source
http://dx.doi.org/10.1016/j.ijmedinf.2025.106088

Similar Publications

Objective: Pain hypersensitivity and hypersensitivity to other sensory modalities (visual, auditory, olfactory, and tactile) are considered defining features in nociplastic pain states. A self-report measure of sensory sensitivity may help to characterize sensory profiles across pain populations. This study aimed to evaluate the psychometric properties of a newly developed Danish nine-item Sensory Sensitivity Profile (SSP) questionnaire in patients with fibromyalgia.

Background: Invasive central nervous system (CNS) aspergillosis is rare among human immunodeficiency virus (HIV)-positive patients due to preserved neutrophil function, despite significant CD4+ T-cell depletion. Diagnosis typically requires histopathologic confirmation, but polymerase chain reaction (PCR) testing has introduced new challenges due to its high sensitivity but limited specificity.

Case Presentation: We describe a newly diagnosed 43-year-old HIV-positive male with concurrent Hodgkin lymphoma who presented with progressive neurological decline and a ring-enhancing brain lesion.

Objective: Anoikis is an anchorage-dependent programmed cell death implicated in multiple pathological processes of cancers; however, the prognostic value of anoikis-related genes (ANRGs) in hepatocellular carcinoma (HCC) remains unclear. Our study aims to develop an ANRGs-based prediction model to improve prognostic assessment in HCC patients.

Methods: RNA-seq profiling was performed to estimate the expression of ANRGs in HCC patients.

Gene mutation estimations via mutual information and Ewens sampling based CNN & machine learning algorithms.

J Appl Stat

February 2025

Department of Mathematics and State Key Laboratory of Novel Software Technology, Nanjing University, Nanjing, People's Republic of China.

We estimate gene mutation rates by developing convolutional neural network (CNN) and machine learning algorithms based on mutual information and Ewens sampling. More precisely, we develop a systematic methodology by constructing a CNN. We also develop two machine learning algorithms to study protein production with target gene sequences and protein structures.

Background: Inflammatory bowel disease (IBD) is a chronic condition characterized by the need for highly individualized treatment plans, requiring patients to make numerous complex medical decisions. Shared decision-making (SDM) has proven effective in improving treatment outcomes, patient satisfaction, and adherence in IBD management; however, its clinical implementation remains challenging. In China, formal SDM nurse roles have not yet been established.