Diagnostic performance of newly developed large language models in critical illness cases: A comparative study.

Int J Med Inform

Department of Intensive Care Medicine, Affiliated Hospital of Southwest Jiaotong University, The Third People's Hospital of Chengdu, Chengdu, Sichuan, China.

Published: December 2025

Background: Large language models (LLMs) are increasingly used in clinical decision support, and newly developed models have shown promise, yet their diagnostic performance for critically ill patients in intensive care unit (ICU) settings remains underexplored. This study evaluated the diagnostic accuracy, differential diagnosis quality, and response quality of four newly developed LLMs in critical illness cases.

Methods: In this cross-sectional comparative study, four newly developed LLMs (ChatGPT-4o, ChatGPT-o3, DeepSeek-V3, and DeepSeek-R1) were evaluated using 50 critical illness cases in ICU settings from published literature. Diagnostic accuracy and response quality were compared across models.

Results: A total of 50 critical illness cases were included. ChatGPT-o3 achieved the highest diagnostic accuracy at 72 % (36/50; 95 % CI 0.600-0.840), followed by DeepSeek-R1 at 68 % (34/50; 95 % CI 0.540-0.800), ChatGPT-4o at 64 % (32/50; 95 % CI 0.500-0.760), and DeepSeek-V3 at 32 % (16/50; 95 % CI 0.200-0.460). ChatGPT-o3, DeepSeek-R1, and ChatGPT-4o all significantly outperformed DeepSeek-V3, with no significant differences among these three models. The median differential quality score was 5.0 for ChatGPT-o3 (IQR 5.0-5.0; 95 % CI 5.0-5.0), DeepSeek-R1 (IQR 5.0-5.0; 95 % CI 5.0-5.0), and ChatGPT-4o (IQR 4.0-5.0; 95 % CI 4.5-5.0), and 4.0 for DeepSeek-V3 (IQR 3.0-5.0; 95 % CI 4.0-5.0). ChatGPT-o3 and DeepSeek-R1 scored significantly higher than DeepSeek-V3; ChatGPT-4o showed a non-significant trend toward better performance. All models received high Likert ratings for response completeness, clarity, and usefulness. ChatGPT-o3, DeepSeek-R1, and ChatGPT-4o each showed a trend toward better response quality compared to DeepSeek-V3, although no significant differences were observed among the models.
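
The accuracy figures above are simple proportions out of 50 cases. As an illustrative sketch only (not the authors' analysis), the Python below recomputes each proportion from the reported counts and attaches a Wilson score interval. The abstract does not state how its confidence intervals were obtained, so the Wilson method, the function name `wilson_interval`, and the constant names are assumptions made here; the resulting bounds should be expected to differ somewhat from the reported ones.

```python
import math
from typing import Dict, Tuple

# Correct-diagnosis counts out of 50 cases, taken from the Results above.
CORRECT: Dict[str, int] = {
    "ChatGPT-o3": 36,
    "DeepSeek-R1": 34,
    "ChatGPT-4o": 32,
    "DeepSeek-V3": 16,
}
N_CASES = 50


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    Assumption: the source does not name its CI method; Wilson is used
    here only as a common, deterministic choice. z = 1.959964 gives ~95 %.
    """
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


for model, k in CORRECT.items():
    low, high = wilson_interval(k, N_CASES)
    print(f"{model}: {k}/{N_CASES} = {k / N_CASES:.2f} "
          f"(Wilson 95% CI {low:.3f}-{high:.3f})")
```

Because all four models were scored on the same 50 cases, a pairwise significance test would need the per-case results (which cases each model got right), and those are not given in the abstract.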

Conclusions: The newly developed models, especially the reasoning models, demonstrated strong potential for supporting diagnosis in critical illness cases in ICU settings. Further domain-specific fine-tuning could improve their diagnostic accuracy. Notably, the open-source reasoning model DeepSeek-R1 performed competitively, suggesting strong potential for scalable deployment in resource-limited settings.

DOI: http://dx.doi.org/10.1016/j.ijmedinf.2025.106088

Similar Publications

Development and validation of a measure of sensory sensitivity - the Sensory Sensitivity Profile: a Rasch analysis.

Scand J Rheumatol

September 2025

The Parker Institute, Copenhagen University Hospital, Bispebjerg and Frederiksberg, Frederiksberg, Denmark.

K Amris, M U Rasmussen, T Alkjær, S K Magnúsdóttir, E E Wæhrens

Objective: Pain hypersensitivity and hypersensitivity to other sensory modalities (visual, auditory, olfactory, and tactile) are considered defining features in nociplastic pain states. A self-report measure of sensory sensitivity may help to characterize sensory profiles across pain populations. This study aimed to evaluate the psychometric properties of a newly developed Danish nine-item Sensory Sensitivity Profile (SSP) questionnaire in patients with fibromyalgia.

Diagnostic and Therapeutic Paradoxes in PCR-Positive, Histopathology-Negative CNS Aspergillosis in A Patient with HIV and Hodgkin Lymphoma.

Eur J Case Rep Intern Med

August 2025

Department of Internal Medicine, Wayne State University School of Medicine, Trinity Health Oakland Hospital, Pontiac, USA.

Nikolas Kenaya, Joshua Hermiz, Joshua Hailo, Zahra Chehab, Emelia Johnson

Background: Invasive central nervous system (CNS) aspergillosis is rare among human immunodeficiency virus (HIV)-positive patients due to preserved neutrophil function, despite significant CD4+ T-cell depletion. Diagnosis typically requires histopathologic confirmation, but polymerase chain reaction (PCR) testing has introduced new challenges due to its high sensitivity but limited specificity.

Case Presentation: We describe a newly diagnosed 43-year-old HIV-positive male with concurrent Hodgkin lymphoma who presented with progressive neurological decline and a ring-enhancing brain lesion.

Newly Established Anoikis-Associated Genes Predict the Prognosis of Hepatocellular Carcinoma.

J Hepatocell Carcinoma

September 2025

Department of Liver Disease, Shuguang Hospital Affiliated to Shanghai University of Traditional Chinese Medicine, Shanghai, 201203, People's Republic of China.

Yuyao Li, Er Li, Wenlan Zheng, Jia Shi, Shihan Yu

Objective: Anoikis is an anchorage-dependent programmed cell death implicated in multiple pathological processes of cancers; however, the prognostic value of anoikis-related genes (ANRGs) in hepatocellular carcinoma (HCC) remains unclear. Our study aims to develop an ANRGs-based prediction model to improve prognostic assessment in HCC patients.

Methods: RNA-seq profiling was performed to estimate the expression of ANRGs in HCC patients.

Gene mutation estimations via mutual information and Ewens sampling based CNN & machine learning algorithms.

J Appl Stat

February 2025

Department of Mathematics and State Key Laboratory of Novel Software Technology, Nanjing University, Nanjing, People's Republic of China.

Wanyang Dai

We conduct gene mutation rate estimations via developing mutual information and Ewens sampling based convolutional neural network (CNN) and machine learning algorithms. More precisely, we develop a systematic methodology through constructing a CNN. Meanwhile, we develop two machine learning algorithms to study protein production with target gene sequences and protein structures.

Pioneering the IBD specialist nurse role in China: a shared decision-making model for clinical practice innovation.

Front Med (Lausanne)

August 2025

Department of Nursing, The Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China.

Shuyan Li, Zijun Ni, Yan Ma, Yan Chen, Hongling Sun

Background: Inflammatory bowel disease (IBD) is a chronic condition characterized by the need for highly individualized treatment plans, requiring patients to make numerous complex medical decisions. Shared decision-making (SDM) has proven effective in improving treatment outcomes, patient satisfaction, and adherence in IBD management; however, its clinical implementation remains challenging. In China, formal SDM nurse roles have not yet been established.
