Large language models in neurosurgery: a systematic review and meta-analysis.

Advait Patil, Paul Serrato, Nathan Chisvo, Omar Arnaout, Pokmeng Alfred See, Kevin T Huang

Acta Neurochir (Wien)

Harvard Medical School, Harvard University, Boston, MA, 02115, USA.

Published: November 2024

Background: Large Language Models (LLMs) have garnered increasing attention in neurosurgery and possess significant potential to improve the field. However, the breadth and performance of LLMs across diverse neurosurgical tasks have not been systematically examined, and LLMs come with their own challenges and unique terminology. We seek to identify key models, establish reporting guidelines for replicability, and highlight progress in key application areas of LLM use in the neurosurgical literature.

Methods: We searched PubMed and Google Scholar using the following terms related to LLMs and neurosurgery: ("large language model" OR "LLM" OR "ChatGPT" OR "GPT-3" OR "GPT3" OR "GPT-3.5" OR "GPT3.5" OR "GPT-4" OR "GPT4" OR "LLAMA" OR "MISTRAL" OR "BARD") AND "neurosurgery". The final set of articles was reviewed for publication year, application area, specific LLM(s) used, control/comparison groups used to evaluate LLM performance, whether the article reported specific LLM prompts, prompting strategy types used, whether the LLM query could be reproduced in its entirety (including both the prompt used and any adjoining data), measures of hallucination, and reported performance measures.
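The boolean search string quoted in the Methods can be assembled programmatically, which makes it easier to rerun or extend. The following is a minimal sketch, not the authors' tooling: the function name `build_query` and the variable names are hypothetical, and the code only builds the string; it does not contact PubMed or Google Scholar.

```python
# Search terms copied from the Methods paragraph.
LLM_TERMS = [
    "large language model", "LLM", "ChatGPT", "GPT-3", "GPT3",
    "GPT-3.5", "GPT3.5", "GPT-4", "GPT4", "LLAMA", "MISTRAL", "BARD",
]
TOPIC_TERM = "neurosurgery"


def build_query(llm_terms, topic):
    """Quote each LLM term, join them with OR, then AND the group with the topic.

    Hypothetical helper for illustration only.
    """
    or_group = " OR ".join(f'"{term}"' for term in llm_terms)
    return f'({or_group}) AND "{topic}"'


QUERY = build_query(LLM_TERMS, TOPIC_TERM)
print(QUERY)
```

Keeping the terms in a list means a new model name can be added in one place, and the exact query string can be reported alongside the results.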

Results: Fifty-one articles met inclusion criteria and were categorized into six application areas, the most common being Generation of Text for Direct Clinical Use (n = 14, 27.5%), Answering Standardized Exam Questions (n = 12, 23.5%), and Clinical Judgement and Decision-Making Support (n = 11, 21.6%). The most frequently used LLMs were GPT-3.5 (n = 30, 58.8%), GPT-4 (n = 20, 39.2%), Bard (n = 9, 17.6%), and Bing (n = 6, 11.8%). Most studies (n = 43, 84.3%) used LLMs directly out-of-the-box, while 8 studies (15.7%) conducted advanced pre-training or fine-tuning.
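The percentages in the Results all follow from dividing each count by the 51 included articles. The short check below recomputes them; the dictionary labels are shortened for readability, and the helper name `pct` is hypothetical.

```python
TOTAL_ARTICLES = 51  # articles meeting inclusion criteria

# label -> (count n, percentage as reported in the Results)
REPORTED = {
    "Text for direct clinical use": (14, 27.5),
    "Standardized exam questions": (12, 23.5),
    "Clinical judgement support": (11, 21.6),
    "GPT-3.5": (30, 58.8),
    "GPT-4": (20, 39.2),
    "Bard": (9, 17.6),
    "Bing": (6, 11.8),
    "Out-of-the-box use": (43, 84.3),
    "Pre-training or fine-tuning": (8, 15.7),
}


def pct(n, total=TOTAL_ARTICLES):
    """Share of `total` as a percentage, rounded to one decimal place."""
    return round(100 * n / total, 1)


# Recompute every percentage from its count.
COMPUTED = {label: pct(n) for label, (n, _) in REPORTED.items()}

for label, (n, reported) in REPORTED.items():
    print(f"{label}: n={n}, reported={reported}%, computed={COMPUTED[label]}%")
```

The out-of-the-box and fine-tuning groups partition the 51 studies (43 + 8). The per-model counts sum to more than 51, presumably because some studies evaluated more than one model.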

Conclusions: Large language models show advanced capabilities in complex tasks and hold potential to transform neurosurgery. However, current research typically addresses basic applications, overlooks methods for enhancing LLM performance, and faces reproducibility issues. Standardizing detailed reporting, considering LLM stochasticity, and using advanced methods beyond basic validation are essential for progress.

DOI: http://dx.doi.org/10.1007/s00701-024-06372-9

Similar Publications

Evaluating AI performance in pediatric surgery: temporal bias and multimodal limitations in large language model assessment.

Pediatr Surg Int

September 2025

School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, Zhejiang Province, 310018, People's Republic of China.

Enjian Liu, Zekai Yu

Medical SAM-Clip Grafting for brain tumor segmentation.

Comput Biol Med

August 2025

The First People Hospital of Foshan, Foshan City CN, China.

Xinjun Yu, Zhoushan Feng, Xiaohong Wu, Jianqiu Chen, Weidong Chen

Brain Tumor Segmentation (BTS) is crucial for accurate diagnosis and treatment planning, but existing CNN- and Transformer-based methods often struggle with feature fusion and limited training data. While recent large-scale vision models such as the Segment Anything Model (SAM) and CLIP offer potential, SAM is trained on natural images and lacks medical domain knowledge, and its decoder struggles with accurate tumor segmentation. To address these challenges, we propose the Medical SAM-Clip Grafting Network (MSCG), which introduces a novel SC-grafting module.

Living with risk, then and now: A dual review of Cam Grey's Living with Risk in the Late Roman World and of current AI-assisted book reviewing.

Risk Anal

September 2025

Edward J. Bloustein School, Rutgers University, New Brunswick, New Jersey, USA.

Louis Anthony Cox, Michael R Greenberg

This AI-assisted review article offers a dual review: a book review of Living with Risk in the Late Roman World by Cam Grey, and a critical review of the current potential of large language models (LLMs), specifically ChatGPT's DeepResearch mode, to assist in thoughtful and scholarly book reviewing within risk science. Grey's book presents an innovative reconstruction of how communities in the late Roman Empire perceived and adapted to chronic environmental and societal risks, emphasizing spatial variability, cultural interpretation, and the normalization of uncertainty. Drawing on commentary from a human reviewer and a parallel AI-assisted analysis, we compare the distinct strengths and limitations of each approach.

Assessing the diagnostic and treatment accuracy of Large Language Models (LLMs) in Peri-Implant Diseases: a clinical experimental study.

J Dent

September 2025

Dental Clinic Post-Graduate Program, University Center of State of Pará, Belém, Pará, Brazil.

Igor Amador Barbosa, Mauro Sergio Almeida Alves, Paloma Rayse Zagalo de Almeida, Patricia de Almeida Rodrigues, Roberta Pimentel de Oliveira

Objective: This study evaluated the coherence, consistency, and diagnostic accuracy of eight AI-based chatbots in clinical scenarios related to dental implants.

Methods: A double-blind clinical experimental study was carried out between February and March 2025 to evaluate eight AI-based chatbots using six fictional cases simulating peri-implant mucositis and peri-implantitis. Each chatbot answered five standardized clinical questions across three independent runs per case, generating 720 binary outputs.

Substitute economics and the threat of artificial intelligence providing pharmaceutical care.

Am J Pharm Educ

September 2025

Department of Pharmacotherapy, University of Utah College of Pharmacy, 30 South 2000 East, Salt Lake City, Utah 84112.

T Joseph Mattingly

The accelerating adoption of artificial intelligence (AI), particularly large language models (LLMs) such as ChatGPT, has raised critical questions about the role of pharmacists and the potential for AI to substitute for human expertise in pharmaceutical care. Grounded in Porter's Five Forces framework (specifically the threat of substitutes), this commentary explores whether AI can adequately fulfill the complex and relational functions of pharmacists in delivering care to patients. Drawing from foundational definitions of pharmaceutical care and economic theories of substitution, the paper examines both historical and emerging competitors to pharmacist-provided services, including physicians, nurses, and now AI-powered tools.
