Comparative evaluation and performance of large language models on expert level critical care questions: a benchmark study.

Jessica D Workum, Bas W S Volkers, Davy van de Sande, Sumesh Arora, Marco Goeijenbier, Diederik Gommers, Michel E van Genderen

Crit Care

Department of Adult Intensive Care, Erasmus MC University Medical Center, Rotterdam, The Netherlands.

Published: February 2025

Background: Large language models (LLMs) show increasing potential for use in healthcare, both for administrative support and for clinical decision making. However, reports on their performance in critical care medicine are lacking.

Methods: This study evaluated five LLMs (GPT-4o, GPT-4o-mini, GPT-3.5-turbo, Mistral Large 2407 and Llama 3.1 70B) on 1181 multiple choice questions (MCQs) from the gotheextramile.com database, a comprehensive database of critical care questions at European Diploma in Intensive Care examination level. Their performance was compared to random guessing and 350 human physicians on a 77-MCQ practice test. Metrics included accuracy, consistency, and domain-specific performance. Costs, as a proxy for energy consumption, were also analyzed.

Results: GPT-4o achieved the highest accuracy at 93.3%; the other models scored 87.5% (Llama 3.1 70B), 87.9% (Mistral Large 2407), 83.0% (GPT-4o-mini), and 72.7% (GPT-3.5-turbo). Random guessing yielded 41.5% (p < 0.001). On the practice test, all models scored higher than the human physicians, at 89.0%, 80.9%, 84.4%, 80.3%, and 66.5%, respectively, compared to 42.7% for random guessing (p < 0.001) and 61.9% for the human physicians. However, in contrast to the other evaluated LLMs (p < 0.001), GPT-3.5-turbo did not significantly outperform the physicians (p = 0.196). Despite high overall consistency, every model answered some questions consistently incorrectly. The most expensive model was GPT-4o, costing over 25 times more than the least expensive model, GPT-4o-mini.

Conclusions: LLMs exhibit exceptional accuracy and consistency, with four of the five models significantly outperforming human physicians on a European-level practice exam. GPT-4o led in performance but raised concerns about energy consumption. Despite their potential in critical care, all models answered some questions consistently incorrectly, highlighting the need for more thorough and ongoing evaluations to guide responsible implementation in clinical settings.
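
The abstract names accuracy and consistency as metrics but does not define them. The following is a minimal sketch of one plausible reading, not the authors' method: it assumes each model answers every question in several repeated runs, takes accuracy as the share of correct answers over all runs, and calls a question "consistent" when every run gives the same answer. The function name `evaluate`, the data layout, and the toy answers are all hypothetical.

```python
def evaluate(runs, answer_key):
    """Summarise repeated MCQ runs for one model.

    runs: list of dicts mapping question id -> answer given in that run
    answer_key: dict mapping question id -> correct answer
    Definitions here are assumptions, not taken from the study.
    """
    n_questions = len(answer_key)
    n_runs = len(runs)

    # Accuracy: correct answers pooled over every run and question.
    n_correct = sum(
        run[q] == correct for run in runs for q, correct in answer_key.items()
    )
    accuracy = n_correct / (n_questions * n_runs)

    # Consistent: all runs gave the same answer to the question.
    consistent = [q for q in answer_key if len({run[q] for run in runs}) == 1]

    # Consistently incorrect: same answer every time, and that answer is wrong.
    consistently_incorrect = [q for q in consistent if runs[0][q] != answer_key[q]]

    return {
        "accuracy": accuracy,
        "consistency": len(consistent) / n_questions,
        "consistently_incorrect": consistently_incorrect,
    }


# Toy data (hypothetical): four questions, three runs.
answer_key = {"q1": "A", "q2": "C", "q3": "B", "q4": "D"}
runs = [
    {"q1": "A", "q2": "C", "q3": "A", "q4": "D"},
    {"q1": "A", "q2": "C", "q3": "A", "q4": "B"},
    {"q1": "A", "q2": "C", "q3": "A", "q4": "D"},
]
result = evaluate(runs, answer_key)
print(result)
```

In this toy case q3 is answered identically in every run yet is wrong each time, which is the kind of stable error the conclusions flag: high consistency alone does not imply correctness.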

PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11809097
DOI: http://dx.doi.org/10.1186/s13054-025-05302-0
