Background And Objectives: General large language models (LLMs), such as ChatGPT (GPT-3.5), have demonstrated the capability to pass multiple-choice medical board examinations. However, comparative accuracy of different LLMs and LLM performance on assessments of predominantly higher-order management questions is poorly understood. We aimed to assess the performance of 3 LLMs (GPT-3.5, GPT-4, and Google Bard) on a question bank designed specifically for neurosurgery oral boards examination preparation.
Methods: The 149-question Self-Assessment Neurosurgery Examination Indications Examination was used to query LLM accuracy. Questions were entered in a single best answer, multiple-choice format. χ², Fisher exact, and univariable logistic regression tests assessed differences in performance by question characteristics.
Results: On a question bank with predominantly higher-order questions (85.2%), ChatGPT (GPT-3.5) and GPT-4 answered 62.4% (95% CI: 54.1%-70.1%) and 82.6% (95% CI: 75.2%-88.1%) of questions correctly, respectively. By contrast, Bard scored 44.2% (66/149, 95% CI: 36.2%-52.6%). GPT-3.5 and GPT-4 demonstrated significantly higher scores than Bard (both P < .01), and GPT-4 outperformed GPT-3.5 (P = .023). Among 6 subspecialties, GPT-4 had significantly higher accuracy in the Spine category relative to GPT-3.5 and in 4 categories relative to Bard (all P < .01). Incorporation of higher-order problem solving was associated with lower question accuracy for GPT-3.5 (odds ratio [OR] = 0.80, P = .042) and Bard (OR = 0.76, P = .014), but not GPT-4 (OR = 0.86, P = .085). GPT-4's performance on imaging-related questions surpassed GPT-3.5's (68.6% vs 47.1%, P = .044) and was comparable with Bard's (68.6% vs 66.7%, P = 1.000). However, GPT-4 demonstrated significantly lower rates of "hallucination" on imaging-related questions than both GPT-3.5 (2.3% vs 57.1%, P < .001) and Bard (2.3% vs 27.3%, P = .002). Lack of a text description in the question predicted significantly higher odds of hallucination for GPT-3.5 (OR = 1.45, P = .012) and Bard (OR = 2.09, P < .001).
Conclusion: On a question bank of predominantly higher-order management case scenarios for neurosurgery oral boards preparation, GPT-4 achieved a score of 82.6%, outperforming ChatGPT and Google Bard.
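For readers checking the arithmetic behind the reported scores, the binomial confidence intervals and pairwise comparisons above can be reproduced from the raw counts. The sketch below uses SciPy; the per-model correct-answer counts are reconstructed from the reported percentages (93/149 for GPT-3.5 and 123/149 for GPT-4 are assumptions, only 66/149 for Bard is stated explicitly), and the exact (Clopper-Pearson) interval is one plausible choice of method, not necessarily the one the authors used.

```python
from scipy.stats import binomtest, chi2_contingency

N = 149  # questions in the examination
# Correct-answer counts; GPT-3.5 and GPT-4 values reconstructed
# from the reported 62.4% and 82.6% (assumption), Bard's 66 as reported.
correct = {"GPT-3.5": 93, "GPT-4": 123, "Bard": 66}

# Exact (Clopper-Pearson) 95% CI for GPT-3.5's accuracy
ci = binomtest(correct["GPT-3.5"], N).proportion_ci(method="exact")
print(f"GPT-3.5: {correct['GPT-3.5']/N:.1%} "
      f"(95% CI {ci.low:.1%}-{ci.high:.1%})")

# Pairwise comparison of GPT-4 vs Bard on a 2x2 table
# of correct vs incorrect answers
table = [[correct["GPT-4"], N - correct["GPT-4"]],
         [correct["Bard"], N - correct["Bard"]]]
chi2, p, dof, _ = chi2_contingency(table)
print(f"GPT-4 vs Bard: chi2 = {chi2:.1f}, p = {p:.2g}")
```

With these counts the GPT-3.5 interval comes out close to the reported 54.1%-70.1%, and the GPT-4 vs Bard difference is significant at P < .01, consistent with the abstract.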
DOI: http://dx.doi.org/10.1227/neu.0000000000002551
Int J Surg
September 2025
Department of Urology and Andrology Laboratory, West China Hospital, Sichuan University, Chengdu, Sichuan Province, China.
Objective: The performance of large language models (LLMs) in complex clinical reasoning tasks is not well established. This study compares ChatGPT (GPT-3.5, GPT-4) and DeepSeek (DeepSeek-V3, DeepSeek-R1) in the Chinese anesthesiology attending physician examination (CAAPE), aiming to set AI benchmarks in medical assessments and enhance AI-driven medical education.
J Multidiscip Healthc
August 2025
Department of Tuberculosis, The Fourth People's Hospital of Nanning, Nanning, Guangxi, People's Republic of China.
Purpose: This study endeavors to conduct a comprehensive assessment on the performance of large language models (LLMs) in health consultation for individuals living with HIV, delve into their applicability across a diverse array of dimensions, and provide evidence-based support for clinical deployment.
Patients And Methods: A 23-question multi-dimensional HIV-specific question bank was developed, covering fundamental knowledge, diagnosis, treatment, prognosis, and case analysis. Four advanced LLMs (ChatGPT-4o, Copilot, Gemini, and Claude) were tested using a multi-dimensional evaluation system assessing medical accuracy, comprehensiveness, understandability, reliability, and humanistic care (which encompasses elements such as individual needs attention, emotional support, and ethical considerations).
Nutrients
August 2025
Neonatal Care Unit, AULSS 8 Berica, 36100 Vicenza, Italy.
Background: Scientific literature confirms the benefits of mother's own milk (MOM) for both term and preterm infants. The feeding of pathological newborns, in particular the very low birth weight infants (VLBWIs), is dependent on human milk. When MOM is not available, pasteurized donor human milk obtained from a recognized Human Milk Bank (HMB) is the best alternative.
Cell Tissue Bank
August 2025
Federal University of Paraíba (UFPB), João Pessoa, Paraíba, Brazil.
Sinus lift surgery is essential after pneumatization caused by loss of posterior teeth. Leukocyte-Platelet-Rich Fibrin (L-PRF) accelerates bone healing by releasing growth factors that promote angiogenesis, cell differentiation, and inflammatory modulation. This study aimed to evaluate the efficacy of L-PRF in bone healing and repair in sinus lift surgeries and to investigate its role in angiogenesis and inflammatory modulation.
BMC Med Educ
August 2025
Department of Gastrointestinal Surgery, Afyonkarahisar State Hospital, Afyonkarahisar, Türkiye.
Background: Artificial intelligence (AI) has become a transformative tool in medical education and assessment. Despite advancements, AI models such as GPT-4o demonstrate variable performance on high-stakes examinations. This study compared the performance of four AI models (Llama-3, Gemini, GPT-4o, and Copilot) with specialists and residents on European General Surgery Board test questions, focusing on accuracy across question formats, lengths, and difficulty levels.