Evaluation of the safety, accuracy, and helpfulness of the GPT-4.0 Large Language Model in neurosurgery.

J Clin Neurosci

Brigham and Women's Hospital, Department of Neurosurgery, 60 Fenwood Road, Hale Building, 4th Floor, Boston, MA 02115, United States; Harvard Medical School, Department of Neurosurgery, 25 Shattuck Street, Boston, MA 02115, United States.

Published: May 2024


Article Abstract

Background: Although prior work demonstrated the surprising accuracy of Large Language Models (LLMs) on neurosurgery board-style questions, their use in day-to-day clinical situations warrants further investigation. This study assessed GPT-4.0's responses to common clinical questions across various subspecialties of neurosurgery.

Methods: A panel of attending neurosurgeons formulated 35 general neurosurgical questions spanning neuro-oncology, spine, vascular, functional, pediatrics, and trauma. All questions were input into GPT-4.0 with a prespecified, standard prompt. Responses were evaluated by two attending neurosurgeons, each on a standardized scale for accuracy, safety, and helpfulness. Citations were indexed and evaluated against identifiable database references.

Results: GPT-4.0 responses were consistent with current medical guidelines 92.8% of the time and accounted for recent advances in the field 78.6% of the time. Neurosurgeons judged that GPT-4.0 responses provided unrealistic information 14.3% of the time and potentially risky information 7.1% of the time. Assessed on 5-point scales, responses were rated clinically useful (4.0 ± 0.6), relevant (4.7 ± 0.3), and coherent (4.9 ± 0.2). The depth of clinical responses varied (3.7 ± 0.6), and "red flag" symptoms were missed 7.1% of the time. Moreover, GPT-4.0 cited 86 references (2.46 citations per answer), of which only 50% were deemed valid, and 77.1% of responses contained at least one inappropriate citation.

Conclusion: Current general LLM technology can offer generally accurate, safe, and helpful neurosurgical information, but may not fully evaluate medical literature or recent field advances. Citation generation and usage remain unreliable. As this technology becomes more widespread, clinicians will need to exercise caution when using it in practice.
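The citation figures in the Results can be cross-checked with simple arithmetic. The following is a minimal sketch, not part of the study: it assumes one response per question (35 responses in total) and treats the reported percentages as rounded values. All variable names are illustrative.

```python
# Figures taken from the abstract
n_questions = 35                 # questions formulated by the panel
n_citations = 86                 # references cited by GPT-4.0
valid_fraction = 0.50            # share of citations deemed valid
frac_with_bad_citation = 0.771   # share of responses with >= 1 inappropriate citation

# Assumption (not stated in the abstract): one response per question
n_responses = n_questions

citations_per_answer = n_citations / n_responses
valid_citations = round(n_citations * valid_fraction)
responses_with_bad_citation = round(n_responses * frac_with_bad_citation)

print(f"Citations per answer: {citations_per_answer:.2f}")
print(f"Implied valid citations: {valid_citations}")
print(f"Implied responses with an inappropriate citation: {responses_with_bad_citation}")
```

Under these assumptions, the reported 2.46 citations per answer matches 86 / 35, and the percentages imply roughly 43 valid citations and 27 responses containing at least one inappropriate citation.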

Source
http://dx.doi.org/10.1016/j.jocn.2024.03.021
