Background And Objectives: General large language models (LLMs), such as ChatGPT (GPT-3.5), have demonstrated the capability to pass multiple-choice medical board examinations. However, comparative accuracy of different LLMs and LLM performance on assessments of predominantly higher-order management questions is poorly understood. We aimed to assess the performance of 3 LLMs (GPT-3.5, GPT-4, and Google Bard) on a question bank designed specifically for neurosurgery oral boards examination preparation.
Methods: The 149-question Self-Assessment Neurosurgery Examination Indications Examination was used to query LLM accuracy. Questions were entered in a single best answer, multiple-choice format. χ², Fisher exact, and univariable logistic regression tests assessed differences in performance by question characteristics.
Results: On a question bank with predominantly higher-order questions (85.2%), ChatGPT (GPT-3.5) and GPT-4 answered 62.4% (95% CI: 54.1%-70.1%) and 82.6% (95% CI: 75.2%-88.1%) of questions correctly, respectively. By contrast, Bard scored 44.2% (66/149, 95% CI: 36.2%-52.6%). GPT-3.5 and GPT-4 demonstrated significantly higher scores than Bard (both P < .01), and GPT-4 outperformed GPT-3.5 (P = .023). Among 6 subspecialties, GPT-4 had significantly higher accuracy in the Spine category relative to GPT-3.5 and in 4 categories relative to Bard (all P < .01). Incorporation of higher-order problem solving was associated with lower question accuracy for GPT-3.5 (odds ratio [OR] = 0.80, P = .042) and Bard (OR = 0.76, P = .014), but not GPT-4 (OR = 0.86, P = .085). GPT-4's performance on imaging-related questions surpassed GPT-3.5's (68.6% vs 47.1%, P = .044) and was comparable with Bard's (68.6% vs 66.7%, P = 1.000). However, GPT-4 demonstrated significantly lower rates of "hallucination" on imaging-related questions than both GPT-3.5 (2.3% vs 57.1%, P < .001) and Bard (2.3% vs 27.3%, P = .002). Lack of a text description in the question predicted significantly higher odds of hallucination for GPT-3.5 (OR = 1.45, P = .012) and Bard (OR = 2.09, P < .001).
Conclusion: On a question bank of predominantly higher-order management case scenarios for neurosurgery oral boards preparation, GPT-4 achieved a score of 82.6%, outperforming ChatGPT and Google Bard.
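For readers checking the arithmetic behind the reported scores, the binomial confidence intervals and pairwise comparisons above can be reproduced from the raw counts. The sketch below uses SciPy; the per-model correct-answer counts are reconstructed from the reported percentages (93/149 for GPT-3.5 and 123/149 for GPT-4 are assumptions, only 66/149 for Bard is stated explicitly), and the exact (Clopper-Pearson) interval is one plausible choice of method, not necessarily the one the authors used.

```python
from scipy.stats import binomtest, chi2_contingency

N = 149  # questions in the examination
# Correct-answer counts; GPT-3.5 and GPT-4 values reconstructed
# from the reported 62.4% and 82.6% (assumption), Bard's 66 as reported.
correct = {"GPT-3.5": 93, "GPT-4": 123, "Bard": 66}

# Exact (Clopper-Pearson) 95% CI for GPT-3.5's accuracy
ci = binomtest(correct["GPT-3.5"], N).proportion_ci(method="exact")
print(f"GPT-3.5: {correct['GPT-3.5']/N:.1%} "
      f"(95% CI {ci.low:.1%}-{ci.high:.1%})")

# Pairwise comparison of GPT-4 vs Bard on a 2x2 table
# of correct vs incorrect answers
table = [[correct["GPT-4"], N - correct["GPT-4"]],
         [correct["Bard"], N - correct["Bard"]]]
chi2, p, dof, _ = chi2_contingency(table)
print(f"GPT-4 vs Bard: chi2 = {chi2:.1f}, p = {p:.2g}")
```

With these counts the GPT-3.5 interval comes out close to the reported 54.1%-70.1%, and the GPT-4 vs Bard difference is significant at P < .01, consistent with the abstract.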
DOI: http://dx.doi.org/10.1227/neu.0000000000002551
Int J Surg
September 2025
Department of Urology and Andrology Laboratory, West China Hospital, Sichuan University, Chengdu, Sichuan Province, China.
Objective: The performance of large language models (LLMs) in complex clinical reasoning tasks is not well established. This study compares ChatGPT (GPT-3.5, GPT-4) and DeepSeek (DeepSeek-V3, DeepSeek-R1) in the Chinese anesthesiology attending physician examination (CAAPE), aiming to set AI benchmarks in medical assessments and enhance AI-driven medical education.
J Multidiscip Healthc
August 2025
Department of Tuberculosis, The Fourth People's Hospital of Nanning, Nanning, Guangxi, People's Republic of China.
Purpose: This study endeavors to conduct a comprehensive assessment on the performance of large language models (LLMs) in health consultation for individuals living with HIV, delve into their applicability across a diverse array of dimensions, and provide evidence-based support for clinical deployment.
Patients And Methods: A 23-question multi-dimensional HIV-specific question bank was developed, covering fundamental knowledge, diagnosis, treatment, prognosis, and case analysis. Four advanced LLMs (ChatGPT-4o, Copilot, Gemini, and Claude) were tested using a multi-dimensional evaluation system assessing medical accuracy, comprehensiveness, understandability, reliability, and humanistic care (which encompasses elements such as individual needs attention, emotional support, and ethical considerations).
Nutrients
August 2025
Neonatal Care Unit, AULSS 8 Berica, 36100 Vicenza, Italy.
Background: Scientific literature confirms the benefits of mother's own milk (MOM) for both term and preterm infants. The feeding of pathological newborns, in particular the very low birth weight infants (VLBWIs), is dependent on human milk. When MOM is not available, pasteurized donor human milk obtained from a recognized Human Milk Bank (HMB) is the best alternative.
Cell Tissue Bank
August 2025
Federal University of Paraíba (UFPB), João Pessoa, Paraíba, Brazil.
Sinus lift surgery is essential after pneumatization caused by loss of posterior teeth. Leukocyte-Platelet-Rich Fibrin (L-PRF) accelerates bone healing by releasing growth factors that promote angiogenesis, cell differentiation, and inflammatory modulation. This study aimed to evaluate the efficacy of L-PRF in bone healing and repair in sinus lift surgeries and to investigate its role in angiogenesis and inflammatory modulation.
BMC Med Educ
August 2025
Department of Gastrointestinal Surgery, Afyonkarahisar State Hospital, Afyonkarahisar, Türkiye.
Background: Artificial intelligence (AI) has become a transformative tool in medical education and assessment. Despite advancements, AI models such as GPT-4o demonstrate variable performance on high-stakes examinations. This study compared the performance of four AI models (Llama-3, Gemini, GPT-4o, and Copilot) with specialists and residents on European General Surgery Board test questions, focusing on accuracy across question formats, lengths, and difficulty levels.