Background: The use of large language models (LLMs) has recently gained popularity in diverse areas, including answering questions posted by patients as well as by medical professionals.
Objective: To evaluate the performance and limitations of LLMs in providing the correct diagnosis for a complex clinical case.
Design: Seventy-five consecutive clinical cases were selected from the Massachusetts General Hospital Case Records, and differential diagnoses were generated by OpenAI's GPT3.5 and 4 models.
Results: The mean number of diagnoses provided by the Massachusetts General Hospital case discussants was 16.77, by GPT3.5 30, and by GPT4 15.45 (p < 0.0001). GPT4 was more frequently able to list the correct diagnosis first (22% versus 20% with GPT3.5, p = 0.86) and to provide the correct diagnosis among the top three generated diagnoses (42% versus 24%, p = 0.075). GPT4 was also better at providing the correct diagnosis when the differential diagnoses were classified into groups according to medical specialty, and at including the correct diagnosis at any point in the differential list (68% versus 48%, p = 0.0063). GPT4 provided a differential list that was more similar to the list provided by the case discussants than GPT3.5 did (Jaccard Similarity Index 0.22 versus 0.12, p = 0.001). Inclusion of the correct diagnosis in the generated differential correlated with the number of PubMed articles matching the diagnosis (OR 1.40, 95% CI 1.25-1.56 for GPT3.5; OR 1.25, 95% CI 1.13-1.40 for GPT4), but not with disease incidence.
Conclusions and Relevance: The GPT4 model was able to generate a differential diagnosis list that included the correct diagnosis in approximately two-thirds of cases, but the most likely diagnosis was often incorrect for both models. In its current state, this tool can at most be used as an aid to expand the potential diagnostic considerations for a case, and future LLMs should be trained to account for the discrepancy between disease incidence and its representation in the literature.
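To make the reported Jaccard Similarity Index concrete, here is a minimal Python sketch of how the overlap between a model-generated differential and the discussants' list can be quantified. The diagnosis names and the simple lower-case string matching are illustrative assumptions only, not the study's actual normalization and matching procedure.

```python
# Minimal sketch: Jaccard Similarity Index between two differential
# diagnosis lists. Diagnosis names and the normalization step are
# illustrative assumptions, not the study's actual matching method.

def jaccard_similarity(list_a, list_b):
    """Size of the intersection divided by the size of the union
    of normalized diagnosis names."""
    a = {d.strip().lower() for d in list_a}
    b = {d.strip().lower() for d in list_b}
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

discussant_dx = ["Sarcoidosis", "Tuberculosis", "Lymphoma"]
llm_dx = ["Lymphoma", "Sarcoidosis", "Histoplasmosis", "Lung cancer"]

# Two shared diagnoses out of five distinct ones -> 2 / 5 = 0.4
print(round(jaccard_similarity(discussant_dx, llm_dx), 2))
```

An index of 0 means the two lists share no diagnoses and 1 means they are identical, so the reported means of 0.22 (GPT4) and 0.12 (GPT3.5) indicate only modest overlap with the discussants' differentials.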
Download full-text PDF | Source
---|---
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11222590 | PMC
http://dx.doi.org/10.3389/fmed.2024.1380148 | DOI
Immunotherapy
September 2025
Guangzhou Institute of Respiratory Health, State Key Laboratory of Respiratory Disease, National Clinical Research Center for Respiratory Disease, National Center for Respiratory Medicine, The First Affiliated Hospital of Guangzhou Medical University, Guangzhou, China.
J Midwifery Womens Health
September 2025
General Education Department Chair, Midwives College of Utah, Salt Lake City, Utah.
Applications driven by large language models (LLMs) are reshaping higher education by offering innovative tools that enhance learning, streamline administrative tasks, and support scholarly work. However, their integration into education institutions raises ethical concerns related to bias, misinformation, and academic integrity, necessitating thoughtful institutional responses. This article explores the evolving role of LLMs in midwifery higher education, providing historical context, key capabilities, and ethical considerations.
J Child Lang
September 2025
Department of Psychology, University of Toronto Mississauga, Mississauga, Ontario, Canada.
A growing literature explores the representational detail of infants' early lexical representations, but no study has investigated how exposure to real-life acoustic-phonetic variation impacts these representations. Indeed, previous experimental work with young infants has largely ignored the impact of accent exposure on lexical development. We ask how routine exposure to accent variation affects 6-month-olds' ability to detect mispronunciations.
Objectives: The primary aim of this study was to compare resource utilization between lower- and higher-risk brief resolved unexplained events (BRUE) in general (GED) and pediatric (PED) emergency departments.
Methods: We conducted a retrospective chart review of BRUE cases from a large health system over 6-and-a-half years. Our primary outcome was the count of diagnostic tests per encounter.
J Imaging Inform Med
September 2025
Department of Diagnostic, Interventional and Pediatric Radiology (DIPR), Inselspital, Bern University Hospital and University of Bern, Bern, Switzerland.
Large language models (LLMs) have been successfully used for data extraction from free-text radiology reports. Most current studies were conducted with LLMs accessed via an application programming interface (API). We evaluated the feasibility of using open-source LLMs deployed on limited local hardware resources for data extraction from free-text mammography reports, using a common data element (CDE)-based structure.
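As a rough illustration of the kind of local, open-source setup described above, the sketch below sends one free-text mammography report to a locally hosted model and asks for a small set of CDE-style fields as JSON. It assumes an Ollama server running on localhost:11434 with a model named "llama3"; the CDE field names and the report text are hypothetical and are not taken from the study.

```python
# Sketch: CDE-style extraction from a free-text mammography report using a
# locally hosted open-source LLM. Assumes a running Ollama server; the model
# name, CDE fields, and report text below are hypothetical examples.
import json
import requests

REPORT = ("Bilateral screening mammogram. Scattered fibroglandular densities. "
          "No suspicious mass or calcifications. BI-RADS 2.")
CDE_FIELDS = ["breast_density", "suspicious_finding_present", "birads_category"]

prompt = (
    "Extract the following fields from the mammography report and reply only "
    f"with a JSON object using exactly these keys: {CDE_FIELDS}.\n\n"
    f"Report:\n{REPORT}"
)

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": prompt, "stream": False},
    timeout=120,
)
resp.raise_for_status()

# Ollama returns the model's completion in the "response" field; parse the
# JSON object the model was instructed to produce.
extracted = json.loads(resp.json()["response"])
print(extracted)
```

In practice a model may wrap the JSON in extra text, so a real pipeline would validate and repair the output against the CDE schema rather than calling json.loads directly.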