Evaluating Large Language Model-Assisted Emergency Triage: A Comparison of Acuity Assessments by GPT-4 and Medical Experts.

Gal Ben Haim, Mor Saban, Yiftach Barash, David Cirulnik, Amit Shaham, Ben Zion Eisenman, Livnat Burshtein, Orly Mymon, Eyal Klang

J Clin Nurs

The Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, USA.

Published: November 2024

Aim: To evaluate the accuracy of the Emergency Severity Index (ESI) assignments by GPT-4, a large language model (LLM), compared to senior emergency department (ED) nurses and physicians.

Method: An observational study of 100 consecutive adult ED patients was conducted. ESI scores assigned by GPT-4, by triage nurses, and by a senior clinician were compared. Both the model and the human experts were provided the same patient data.

Results: GPT-4 assigned a lower median ESI score (2.0) compared to human evaluators (median 3.0; p < 0.001), suggesting a potential overestimation of patient severity by the LLM. The results showed differences in the triage assessment approaches between GPT-4 and the human evaluators, including variations in how patient age and vital signs were considered in the ESI assignments.
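The direction of this finding depends on the ESI scale, where 1 is the most urgent level and 5 the least urgent, so a lower median from the model means it rated patients as sicker. The minimal sketch below illustrates that reading. The scores are invented for illustration and are not study data, the names `model_esi`, `human_esi`, and `triage_direction` are hypothetical, and no significance test is computed because the abstract does not state which test produced the reported p value.

```python
from statistics import median

# Hypothetical ESI scores (1 = most urgent, 5 = least urgent) for the same
# six patients. These values are invented for illustration only.
model_esi = [2, 2, 1, 3, 2, 2]
human_esi = [3, 3, 2, 3, 4, 2]


def triage_direction(model_scores, reference_scores):
    """Count paired cases where the model rates a patient as more urgent
    (lower ESI), less urgent (higher ESI), or the same as the reference."""
    if len(model_scores) != len(reference_scores):
        raise ValueError("score lists must describe the same patients")
    counts = {"over_triage": 0, "under_triage": 0, "agree": 0}
    for m, r in zip(model_scores, reference_scores):
        if m < r:
            counts["over_triage"] += 1   # model assigned higher acuity
        elif m > r:
            counts["under_triage"] += 1  # model assigned lower acuity
        else:
            counts["agree"] += 1
    return counts


model_median = median(model_esi)
human_median = median(human_esi)
summary = triage_direction(model_esi, human_esi)

print("model median ESI:", model_median)
print("human median ESI:", human_median)
print("paired comparison:", summary)
```

Counting paired disagreements alongside the medians shows whether a lower median comes from consistent over-triage or from a few large discrepancies.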

Conclusion: While GPT-4 offers a novel methodology for patient triage, its propensity to overestimate patient severity highlights the necessity for further development and calibration of LLM tools in clinical environments. The findings underscore the potential and limitations of LLMs in clinical decision-making, advocating for cautious integration of LLMs in healthcare settings.

Reporting Method: This study adhered to relevant EQUATOR guidelines for reporting observational studies.

DOI: http://dx.doi.org/10.1111/jocn.17490

Similar Publications

Commentary on "DeepSeek-R1 and GPT-4 are comparable in a complex diagnostic challenge: a historical control study".

Int J Surg

September 2025

The Third Affiliated Hospital of Zhejiang Chinese Medical University, Hangzhou, Zhejiang, China.

Hanzhe Lv, Longhao Chen, Zhizhen Lv, Lijiang Lv

Guideline adherence in surgical decisions for T1 colorectal cancer after endoscopic resection: large language models vs clinicians.

Int J Surg

September 2025

Digestive Endoscopy Center, Shanghai Tenth People's Hospital, Tongji University School of Medicine, Shanghai, China.

Liangtang Zeng, Cao Qinxing, Junyuan Deng, Junnan Hu, Minghui Pang

Background: Patients with T1 colorectal cancer (CRC) often show poor adherence to guideline-recommended treatment strategies after endoscopic resection. To address this challenge and improve clinical decision-making, this study aims to compare the accuracy of surgical management recommendations between large language models (LLMs) and clinicians.

Methods: This retrospective study enrolled 202 patients with T1 CRC who underwent endoscopic resection at three hospitals.

Leveraging Language Model, Crystal Structure Prediction and First-Principles Calculation for Material Design.

J Chem Inf Model

September 2025

Songshan Lake Materials Laboratory, Dongguan 523808, PR China.

Lei Zhang, Ben Ni, Kaiyang Xu, Yiru Huang, Qingfang Li

Large language models (LLMs) have demonstrated transformative potential for materials discovery in condensed matter systems, but their full utility requires both broader application scenarios and integration with ab initio crystal structure prediction (CSP), density functional theory (DFT) methods, and domain knowledge to benefit future inverse material design. Here, we develop an integrated computational framework combining language model-guided materials screening with genetic algorithm (GA) and graph neural network (GNN)-based CSP methods to predict new photovoltaic materials. This LLM + CSP + DFT approach successfully identifies a previously overlooked oxide material with unexpected photovoltaic potential.

Out-of-the-Box Large Language Models for Detecting and Classifying Critical Findings in Radiology Reports Using Various Prompt Strategies.

AJR Am J Roentgenol

September 2025

Department of Radiology, Stanford University, Stanford, CA, USA.

Ish A Talati, Juan M Zambrano Chaves, Avisha Das, Imon Banerjee, Daniel L Rubin

The increasing complexity and volume of radiology reports present challenges for timely communication of critical findings. This study evaluated the performance of two out-of-the-box LLMs in detecting and classifying critical findings in radiology reports using various prompt strategies. The analysis included 252 radiology reports of varying modalities and anatomic regions extracted from the MIMIC-III database, divided into a prompt engineering tuning set of 50 reports, a holdout test set of 125 reports, and a pool of 77 remaining reports used as examples for few-shot prompting.

Diagnosing Actinic Keratosis and Squamous Cell Carcinoma With Large Language Models From Clinical Images.

Int J Dermatol

July 2025

Department of Dermatology, Venereology and Dermatooncology, Semmelweis University, Budapest, Hungary.

Mehdi Boostani, Giovanni Pellacani, Mohamad Goldust, Nóra Nádudvari, Dóra Rátky
