ChatGPT-4o is Not a Reliable Study Source for Orthopaedic Surgery Residents.

Neil Jain , Caleb Gottlich , John Fisher , Travis Winston , Kristofer Matullo , Dustin Greenhill

JB JS Open Access

Department of Orthopaedic Surgery, St. Luke's University Health Network, Bethlehem, Pennsylvania.

Published: September 2025

Background: The use of artificial intelligence platforms by medical residents as an educational resource is increasing. Within orthopaedic surgery, older Chat Generative Pre-trained Transformer (ChatGPT) models performed worse than resident physicians on practice examinations and rarely answered questions with images correctly. The newer ChatGPT-4o was designed to address these deficiencies but has not been evaluated. This study analyzed (1) ChatGPT-4o's ability to correctly answer Orthopaedic In-Training Examination (OITE) questions and (2) the educational quality of the answer explanations that it presents to our orthopaedic surgery trainees.

Methods: The 2020 to 2022 OITEs were uploaded into ChatGPT-4o. Annual score reports were used to compare the chatbot's raw score with that of residents in ACGME-accredited orthopaedic programs. ChatGPT-4o's answer explanations were then compared with those provided by the American Academy of Orthopaedic Surgeons (AAOS), and each response was categorized based on (1) the chatbot's answer (correct/incorrect) and (2) the chatbot's answer explanation when compared with the explanation provided by AAOS subject-matter experts (classified as consistent, disparate, or nonexistent). Overall ChatGPT-4o response quality was then simplified into 3 groups. An "ideal" response combined a correct answer with a consistent explanation. "Inadequate" responses provided a correct answer but no explanation. "Unacceptable" responses provided an incorrect answer or a disparate explanation.
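The three-group grading rule described in the Methods can be written as a small decision function. The following is only an illustrative sketch, not the authors' code: the names `Explanation` and `response_quality` are hypothetical, and it assumes each question has already been graded by hand for answer correctness and explanation category.

```python
from enum import Enum


class Explanation(Enum):
    """Explanation category relative to the AAOS explanation (hypothetical names)."""
    CONSISTENT = "consistent"
    DISPARATE = "disparate"
    NONEXISTENT = "nonexistent"


def response_quality(answer_correct: bool, explanation: Explanation) -> str:
    """Map one graded question to the abstract's three quality groups."""
    # An incorrect answer or a disparate explanation is unacceptable,
    # regardless of the other attribute.
    if not answer_correct or explanation is Explanation.DISPARATE:
        return "unacceptable"
    # Correct answer with no explanation at all.
    if explanation is Explanation.NONEXISTENT:
        return "inadequate"
    # Correct answer with an explanation consistent with the AAOS one.
    return "ideal"
```

Checking the unacceptable case first means that every combination of answer and explanation category falls into exactly one group, which matches the abstract's wording that an incorrect answer or a disparate explanation alone is enough to make a response unacceptable.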

Results: ChatGPT-4o scored 68.8%, 63.4%, and 70.1% on the 2020, 2021, and 2022 OITEs, respectively. These raw scores corresponded with those of postgraduate year-5 (PGY-5), PGY-2 to PGY-3, and PGY-4 resident physicians in ACGME-accredited programs, respectively. Pediatrics and Spine were the only subspecialties in which ChatGPT-4o consistently performed better than a junior resident (≥PGY-3). The quality of responses provided by ChatGPT-4o was ideal, inadequate, or unacceptable in 58.7%, 6.9%, and 34.4% of questions, respectively. ChatGPT-4o scored significantly lower on media-related questions than on nonmedia questions (60.0% versus 73.1%, p < 0.001).

Conclusions: ChatGPT-4o performed inconsistently on the OITE. Moreover, the responses it provided trainees were not always ideal. Its limited performance on media-based orthopaedic surgery questions also persisted. The use of ChatGPT by resident physicians while studying orthopaedic surgery concepts remains unvalidated.

Level Of Evidence: Level IV. See Instructions for Authors for a complete description of levels of evidence.

Full text (PMC): http://www.ncbi.nlm.nih.gov/pmc/articles/PMC12417002
DOI: http://dx.doi.org/10.2106/JBJS.OA.25.00112

Publication Analysis

Top Keywords

orthopaedic surgery

responses provided

resident physicians

chatgpt-4o

orthopaedic

answer

answer explanations

2022 oites

chatbot's answer

answer explanation

Similar Publications

Influence of Age on Fracture Healing in Young and Middle-Aged Mice in a Proximal Femur Fracture Model.

J Orthop Res

September 2025

Institute of Orthopaedic Research and Biomechanics, University Medical Center Ulm, Ulm, Germany.

Tabea Schmid , Anna Kanewska , Charles Lam , Miriam Kalbitz , Sandra Dieterich

Osteoporotic hip fractures are a considerable cause of pain and disability particularly among the elderly. Osteoporosis causes loss of bone stability, which in turn leads to an increased risk of fractures especially in metaphyseal bone. Moreover, the body's capacity for healing is diminished, resulting in prolonged recovery times following these fractures.

Kirschner wires combined with elastic tape for multilayer tension-reducing repair of a large stage 4 pressure injury of the greater trochanter: a case report.

Wounds

August 2025

Shenzhen Hospital (Futian) of Guangzhou University of Chinese Medicine, Shenzhen, China.

Xu Qiliang , Liang XiaoHua , Chen Haoxiong , Zhao Liang , Tan Jingchao

Background: Pressure injuries are common, difficult to manage, and carry a high economic burden. They are challenging to physicians and a burden to society.

Case Report: An 89-year-old male, who had previously undergone internal fixation with screws and rods for a right intertrochanteric fracture, developed a deep circular open ulcer measuring 11 cm × 7.

The trial design of the concurrent optical and magnetic stimulation (COMS) therapy study for refractory diabetic foot ulcers (MAVERICKS): a multicenter, randomized, sham-controlled, double-blind investigational device exemption clinical study.

Wounds

August 2025

Department of Nursing, Federal University of Ceará, Ceará, Brazil.

Robert D Galiano , Rena A Li , John C Lantis , Alisha Oropallo , Jesus Ulloa

Background: Diabetic foot ulcers (DFUs) are a major clinical challenge, particularly among patients with refractory ulcers, that often lead to severe complications such as infection, amputation, and high mortality. Innovations supported by strong clinical evidence have the potential to improve healing outcomes, enhance quality of life, and reduce the economic burden on individuals and health care systems.

Objective: To describe the design of the concurrent optical and magnetic stimulation (COMS) therapy Investigational Device Exemption (IDE) study for refractory DFUs (MAVERICKS) trial.

Monitoring Gait Recovery After Total Knee Arthroplasty Using Wearable Sensors: Responsiveness of Gait Accelerations.

J Orthop Res

September 2025

Interdisciplinary Orthopedics, Department of Orthopedics Surgery, Aalborg University Hospital, Aalborg, Denmark.

Arash Ghaffari , Pernille Damborg Clasen , Andreas Kappel , John Rasmussen , Reed D Gurchiek

Functional recovery after total knee arthroplasty (TKA) varies widely among individuals, and traditional assessments often fail to detect subtle changes in real-world walking ability. Wearable sensors offer continuous and objective tracking of gait outside of clinical settings. In this prospective, longitudinal study, thirty-one patients undergoing unilateral TKA wore thigh-mounted accelerometers continuously from 2 weeks before surgery through 90 days postoperatively.

Tissue bridges offer valuable insights into bladder and bowel outcomes in chronic cervical spinal cord injury.

Eur Spine J

September 2025

Department of Spine Surgery, Third Affiliated Hospital of Sun Yat-sen University, Guangzhou, China.

Haiyang Yu , Na Li , Jun Kang , Bin Liu , Mao Pang

Purpose: This study aimed to investigate the relationship between tissue bridges and bladder and bowel outcomes in chronic cervical spinal cord injury (SCI).

Methods: Between July 2020 and January 2024, 44 patients with chronic cervical SCI were retrospectively included in this cross-sectional study at a specialized SCI center. Lesion severity was assessed by tissue bridges, lesion length, lesion width, and lesion area.
