ChatGPT-4 Knows Its A B C D E but Cannot Cite Its Source.

JB JS Open Access

Department of Orthopaedic Surgery, The Johns Hopkins Hospital, Baltimore, Maryland.

Published: September 2024


Category Ranking

98%

Total Visits

921

Avg Visit Duration

2 minutes

Citations

20

Article Abstract

Introduction: The artificial intelligence language model Chat Generative Pretrained Transformer (ChatGPT) has shown potential as a reliable and accessible educational resource in orthopaedic surgery. Yet, the accuracy of the references behind the provided information remains elusive, which poses a concern for maintaining the integrity of medical content. This study aims to examine the accuracy of the references provided by ChatGPT-4 concerning the Airway, Breathing, Circulation, Disability, Exposure (ABCDE) approach in trauma surgery.

Methods: Two independent reviewers critically assessed 30 ChatGPT-4-generated references supporting the well-established ABCDE approach to trauma protocol, grading them as 0 (nonexistent), 1 (inaccurate), or 2 (accurate). All discrepancies between the ChatGPT-4 and PubMed references were carefully reviewed and bolded. Cohen's Kappa coefficient was used to examine the agreement of the accuracy scores of the ChatGPT-4-generated references between reviewers. Descriptive statistics were used to summarize the mean reference accuracy scores. To compare the variance of the means across the 5 categories, one-way analysis of variance was used.

Results: ChatGPT-4 had an average reference accuracy score of 66.7%. Of the 30 references, only 43.3% were accurate and deemed "true" while 56.7% were categorized as "false" (43.3% inaccurate and 13.3% nonexistent). The accuracy was consistent across the 5 trauma protocol categories, with no significant statistical difference (p = 0.437).

Discussion: With 57% of references being inaccurate or nonexistent, ChatGPT-4 has fallen short in providing reliable and reproducible references-a concerning finding for the safety of using ChatGPT-4 for professional medical decision making without thorough verification. Only if used cautiously, with cross-referencing, can this language model act as an adjunct learning tool that can enhance comprehensiveness as well as knowledge rehearsal and manipulation.

Download full-text PDF

Source
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11368215PMC
http://dx.doi.org/10.2106/JBJS.OA.24.00099DOI Listing

Publication Analysis

Top Keywords

language model
8
accuracy references
8
references provided
8
abcde approach
8
approach trauma
8
chatgpt-4-generated references
8
trauma protocol
8
accuracy scores
8
reference accuracy
8
references
7

Similar Publications

BackgroundTherapeutic plasma exchange (TPE) with albumin replacement has emerged as a potential treatment for Alzheimer's disease (AD). The AMBAR trial showed that TPE could slow cognitive and functional decline, along with changes in core and inflammatory biomarkers in cerebrospinal fluid.ObjectiveTo evaluate the safety and effectiveness of TPE in a real-world setting in Argentina.

View Article and Find Full Text PDF

Semantic composition allows us to construct complex meanings (e.g., "dog house", "house dog") from simpler constituents ("dog", "house").

View Article and Find Full Text PDF

Background: Patients with T1 colorectal cancer (CRC) often show poor adherence to guideline-recommended treatment strategies after endoscopic resection. To address this challenge and improve clinical decision-making, this study aims to compare the accuracy of surgical management recommendations between large language models (LLMs) and clinicians.

Methods: This retrospective study enrolled 202 patients with T1 CRC who underwent endoscopic resection at three hospitals.

View Article and Find Full Text PDF

Large language models (LLMs) have demonstrated transformative potential for materials discovery in condensed matter systems, but their full utility requires both broader application scenarios and integration with ab initio crystal structure prediction (CSP), density functional theory (DFT) methods and domain knowledge to benefit future inverse material design. Here, we develop an integrated computational framework combining language model-guided materials screening with genetic algorithm (GA) and graph neural network (GNN)-based CSP methods to predict new photovoltaic material. This LLM + CSP + DFT approach successfully identifies a previously overlooked oxide material with unexpected photovoltaic potential.

View Article and Find Full Text PDF