Evidence suggests custom chatbots are superior to commercial generative artificial intelligence (GenAI) systems for text-based anatomy content inquiries. This study evaluates ChatGPT-4o's and Claude 3.5 Sonnet's capabilities to interpret unlabeled anatomical images. Secondarily, ChatGPT o1-preview was evaluated as an AI rater to grade AI-generated outputs using a rubric and was compared against human raters. Anatomical images (five musculoskeletal, five thoracic) representing diverse image-based media (e.g., illustrations, photographs, MRI) were annotated with identification markers (e.g., arrows, circles) and uploaded to each GenAI system for interpretation. Forty-five prompts (i.e., 15 first-order, 15 second-order, and 15 third-order questions) with associated images were submitted to both GenAI systems across two timepoints. Responses were graded by anatomy experts for factual accuracy and superfluity (the presence of excessive wording) on a three-point Likert scale. ChatGPT o1-preview was tested for agreement against human anatomy experts to determine its usefulness as an AI rater. Statistical analyses included inter-rater agreement, hierarchical linear modeling, and test-retest reliability. ChatGPT-4o's factual accuracy score across 45 outputs was 68.0% compared to Claude 3.5 Sonnet's score of 61.5% (p = 0.319). As an AI rater, ChatGPT o1-preview showed moderate to substantial agreement with human raters (Cohen's kappa = 0.545-0.755) for evaluating factual accuracy according to a rubric of textbook answers. Further improvements and evaluations are needed before commercial GenAI systems can be used as credible student resources in anatomy education. Similarly, ChatGPT o1-preview demonstrates promise as an AI assistant for educational research, though further investigation is warranted.
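The agreement statistic reported above (Cohen's kappa = 0.545-0.755) measures how much two raters agree beyond chance. A minimal pure-Python sketch of the computation, using hypothetical 3-point Likert ratings rather than the study's data:

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters scoring the same items."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of items both raters scored identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: expected overlap from each rater's marginal frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical 1-3 Likert scores for ten outputs (illustrative only).
human = [1, 2, 3, 3, 2, 1, 2, 3, 1, 2]
ai    = [1, 2, 3, 2, 2, 1, 2, 3, 1, 3]
print(round(cohen_kappa(human, ai), 3))  # → 0.697
```

On the conventional Landis-Koch scale, values of 0.41-0.60 indicate moderate agreement and 0.61-0.80 substantial agreement, which is how the study characterizes the AI rater's performance.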
DOI: http://dx.doi.org/10.1002/ase.70074
Sci Rep
August 2025
Department of Neurosurgery, The First Affiliated Hospital, Jinan University, Guangzhou, China.
Artificial intelligence (AI) chatbots have emerged as promising tools for enhancing medical communication, yet their efficacy in interpreting complex radiological reports remains underexplored. This study evaluates the performance of AI chatbots in translating magnetic resonance imaging (MRI) reports into patient-friendly language and providing clinical recommendations. A cross-sectional analysis was conducted on 6174 MRI reports from tumor patients across three hospitals.
Ann Med Surg (Lond)
August 2025
Department of Orthopedics, Beijing Jishuitan Hospital, Capital Medical University, Beijing, PR China.
Objective: This study aimed to evaluate and compare the performance of three large language models (LLMs)-ChatGPT o1-preview, Claude 3.5 Sonnet, and Gemini 1.5 Pro-in providing information on endoscopic lumbar surgery based on 10 frequently asked patient questions.
Chest
July 2025
Department of Critical Care Medicine, National Institutes of Health, Bethesda, MD.
Background: Although mechanical ventilation (MV) is a critical competency in critical care training, standardized methods for assessing MV-related knowledge are lacking. Traditional multiple-choice question (MCQ) development is resource intensive, and prior studies have suggested that generative AI tools could streamline question creation. However, the quality of AI-generated MCQs remains unclear.
Anat Sci Educ
July 2025
Department of Anatomy and Cell Biology, Rush Medical College, Rush University, Chicago, Illinois, USA.
Background: The introduction of o1-preview (OpenAI) has stirred discussions surrounding its potential applications for diagnosing complex patient cases. The authors gauged changes in o1-preview's capacity to diagnose complex cases compared with its predecessors ChatGPT-3.5 (OpenAI) and ChatGPT-4 (legacy) (OpenAI).