Background Context: Clinical guidelines, developed in concordance with the literature, are often used to guide surgeons' clinical decision-making. Recent advancements in large language models (LLMs) and artificial intelligence (AI) in the medical field come with exciting potential. OpenAI's generative AI model, ChatGPT, can quickly synthesize information and generate responses grounded in the medical literature, making it a potentially useful tool in clinical decision-making for spine care. The current literature has yet to investigate the ability of ChatGPT to assist clinical decision-making for degenerative spondylolisthesis.
Purpose: This study aimed to assess ChatGPT's concordance with the recommendations set forth by the North American Spine Society (NASS) Clinical Guideline for the Diagnosis and Treatment of Degenerative Spondylolisthesis and to evaluate ChatGPT's accuracy within the context of the most recent literature.
Methods: ChatGPT-3.5 and ChatGPT-4.0 were prompted with questions from the NASS Clinical Guideline for the Diagnosis and Treatment of Degenerative Spondylolisthesis, and their recommendations were graded as "concordant" or "nonconcordant" relative to those put forth by NASS. A response was considered "concordant" when ChatGPT generated a recommendation that accurately reproduced all major points of the NASS recommendation. Responses graded "nonconcordant" were further stratified into two subcategories, "insufficient" or "over-conclusive," to provide further insight into the grading rationale. Responses from GPT-3.5 and GPT-4.0 were compared using Chi-squared tests.
Results: ChatGPT-3.5 answered 13 of NASS's 28 clinical questions (46.4%) in concordance with the NASS guidelines. The categorical breakdown was as follows: Definitions and Natural History (1/1, 100%), Diagnosis and Imaging (1/4, 25%), Outcome Measures for Medical Intervention and Surgical Treatment (0/1, 0%), Medical and Interventional Treatment (4/6, 66.7%), Surgical Treatment (7/14, 50%), and Value of Spine Care (0/2, 0%). When NASS indicated there was sufficient evidence to offer a clear recommendation, ChatGPT-3.5 generated a concordant response 66.7% of the time (6/9). However, ChatGPT-3.5's concordance dropped to 36.8% (7/19) for clinical questions on which NASS did not provide a clear recommendation. A further breakdown of ChatGPT-3.5's nonconcordant responses revealed that the vast majority were "over-conclusive" (12/15, 80%) rather than "insufficient" (3/15, 20%). ChatGPT-4.0 answered 19 of the 28 questions (67.9%) in concordance with the NASS guidelines (P = 0.177). When NASS indicated there was sufficient evidence to offer a clear recommendation, ChatGPT-4.0 generated a concordant response 66.7% of the time (6/9). ChatGPT-4.0's concordance remained at 68.4% (13/19) for clinical questions on which NASS did not provide a clear recommendation (P = 0.104).
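The Chi-squared comparisons reported above can be reproduced from the stated concordance counts. Below is a minimal sketch assuming 2×2 contingency tables (concordant vs. nonconcordant for each model) with Yates' continuity correction, which is scipy's default for 2×2 tables; this is an illustration, not the authors' analysis code.

```python
# Reproducing the reported Chi-squared comparisons from the concordance counts
# in the abstract (illustrative sketch, not the authors' analysis code).
from scipy.stats import chi2_contingency

# Overall concordance: GPT-3.5 13/28 vs. GPT-4.0 19/28, as [concordant, nonconcordant].
overall = [[13, 15], [19, 9]]
chi2, p, dof, expected = chi2_contingency(overall)  # Yates correction is the default for 2x2 tables
print(f"Overall concordance: p = {p:.3f}")  # ~0.177, matching the reported value

# Questions without a clear NASS recommendation: GPT-3.5 7/19 vs. GPT-4.0 13/19.
no_clear_rec = [[7, 12], [13, 6]]
chi2, p, dof, expected = chi2_contingency(no_clear_rec)
print(f"No clear recommendation subset: p = {p:.3f}")  # ~0.104, matching the reported value
```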
Conclusions: This study sheds light on the duality of LLM applications in clinical settings: accuracy and utility in some contexts versus inaccuracy and risk in others. ChatGPT was concordant for most clinical questions on which NASS offered a clear recommendation. However, for questions on which NASS did not offer best practices, ChatGPT generated answers that were either too general or inconsistent with the literature, and at times fabricated data and citations. Clinicians should therefore exercise extreme caution when consulting ChatGPT for clinical recommendations and take care to verify its output against the recent literature.
DOI: http://dx.doi.org/10.1007/s00586-024-08198-6
Purpose: Degenerative spinal diseases often require complex, patient-specific treatment, presenting a compelling challenge for artificial intelligence (AI) integration into clinical practice. While the existing literature has focused on ChatGPT-4o's performance on individual spine conditions, this study compares ChatGPT-4o, a traditional large language model (LLM), against NotebookLM, a novel retrieval-augmented generation model (RAG-LLM) supplemented with North American Spine Society (NASS) guidelines, for concordance with all five published NASS guidelines for degenerative spinal diseases.
Methods: A total of 118 questions from NASS guidelines regarding five degenerative spinal conditions were presented to ChatGPT-4o and NotebookLM.
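The comparison hinges on whether grounding the model in guideline text changes its answers. The sketch below illustrates the difference between the two prompting strategies (a plain query versus a retrieval-augmented one); the passages, question, retrieval heuristic, and query_llm stub are illustrative assumptions and do not reflect the study's pipeline or NotebookLM's internals.

```python
# Minimal sketch contrasting a plain LLM query with a retrieval-augmented query
# grounded in guideline text. All content below is illustrative placeholder data.

def retrieve(question: str, passages: list[str], k: int = 2) -> list[str]:
    """Rank guideline passages by naive keyword overlap with the question."""
    q_terms = set(question.lower().split())
    ranked = sorted(passages, key=lambda p: -len(q_terms & set(p.lower().split())))
    return ranked[:k]

def query_llm(prompt: str) -> str:
    """Placeholder for a call to an LLM API (hypothetical stub, not a real client)."""
    return f"<model response to a {len(prompt)}-character prompt>"

# Illustrative stand-ins for guideline passages (not actual NASS guideline text).
guideline_passages = [
    "Passage on imaging recommendations for suspected degenerative spondylolisthesis.",
    "Passage on medical and interventional treatment options.",
    "Passage on surgical treatment and fusion considerations.",
]
question = "What imaging is recommended for degenerative spondylolisthesis?"

# Traditional LLM: the question is sent on its own.
plain_prompt = question

# RAG-LLM: top-ranked guideline passages are prepended as grounding context.
context = "\n".join(retrieve(question, guideline_passages))
rag_prompt = (
    "Using only the guideline excerpts below, answer the question.\n\n"
    f"{context}\n\nQuestion: {question}"
)

print(query_llm(plain_prompt))
print(query_llm(rag_prompt))
```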
N Am Spine Soc J
June 2025
Department of Orthopaedic Surgery, Chang Gung Memorial Hospital, Linkou Branch, Taoyuan, Taiwan.
Background: Isthmic spondylolisthesis is a prevalent condition often diagnosed in adults, especially those with low back pain. The main objective of this study was to evaluate the clinical viability of ChatGPT 3.5 and 4.
Spine J
July 2025
Orthopaedic Associates of Wisconsin, Pewaukee, WI, USA.
Background Context: The North American Spine Society's (NASS) Evidence-Based Clinical Guideline for the Diagnosis and Treatment of Adults with Neoplastic Vertebral Fractures features evidence-based recommendations for diagnosing and treating adult patients with neoplastic vertebral fractures. The guideline is intended to reflect contemporary treatment concepts for neoplastic vertebral fractures as reflected in the highest quality clinical literature available on this subject as of October 2020.
Purpose: The purpose of the guideline is to provide an evidence-based educational tool to assist spine specialists when making clinical decisions for adult patients with neoplastic vertebral fractures.
Antibiotics (Basel)
March 2025
Department of Pathobiology, College of Veterinary Medicine, University of Illinois Urbana-Champaign, Urbana, IL 61802, USA.
Understanding beef cattle farmers' knowledge, attitudes, and practices on infectious disease prevention, antimicrobial use, and antimicrobial resistance (AMR) is important to developing stewardship programs. A cross-sectional stratified mail or phone survey of beef cattle producers in Illinois was conducted between June and August 2022. Ordinal logistic regression models assessed the impact of having a biosecurity plan on beef cattle farmers' familiarity with cattle diseases.
Spine J
August 2025
Department of Spine Surgery, Hospital for Special Surgery, New York, NY, USA. Electronic address:
Background Context: Generative artificial intelligence (AI), of which ChatGPT is the most popular example, has been extensively assessed for its ability to respond to medical questions, such as queries about spine treatment approaches or technological advances. However, it often lacks a scientific foundation and can fabricate inauthentic references, a phenomenon known as AI hallucination.
Purpose: To develop an understanding of the scientific basis of generative AI tools by studying the authenticity of their references and their reliability, as measured by the alignment of their responses with evidence-based guidelines.