AI's ability to interpret unlabeled anatomy images and supplement educational research as an AI rater.

Lord J Hyeamang, Tejas C Sekhar, Emily Rush, Amy C Beresheim, Colleen M Cheverko, William S Brooks, Abbey C M Breckling, M Nazmul Karim, Christopher Ferrigno, Adam B Wilson

Anat Sci Educ

Department of Anatomy and Cell Biology, Rush Medical College, Rush University, Chicago, Illinois, USA.

Published: July 2025

Evidence suggests custom chatbots are superior to commercial generative artificial intelligence (GenAI) systems for text-based anatomy content inquiries. This study evaluates ChatGPT-4o's and Claude 3.5 Sonnet's capabilities to interpret unlabeled anatomical images. Secondarily, ChatGPT o1-preview was evaluated as an AI rater to grade AI-generated outputs using a rubric and was compared against human raters. Anatomical images (five musculoskeletal, five thoracic) representing diverse image-based media (e.g., illustrations, photographs, MRI) were annotated with identification markers (e.g., arrows, circles) and uploaded to each GenAI system for interpretation. Forty-five prompts (i.e., 15 first-order, 15 second-order, and 15 third-order questions) with associated images were submitted to both GenAI systems across two timepoints. Responses were graded by anatomy experts for factual accuracy and superfluity (the presence of excessive wording) on a three-point Likert scale. ChatGPT o1-preview was tested for agreement against human anatomy experts to determine its usefulness as an AI rater. Statistical analyses included inter-rater agreement, hierarchical linear modeling, and test-retest reliability. ChatGPT-4o's factual accuracy score across 45 outputs was 68.0% compared to Claude 3.5 Sonnet's score of 61.5% (p = 0.319). As an AI rater, ChatGPT o1-preview showed moderate to substantial agreement with human raters (Cohen's kappa = 0.545-0.755) for evaluating factual accuracy according to a rubric of textbook answers. Further improvements and evaluations are needed before commercial GenAI systems can be used as credible student resources in anatomy education. Similarly, ChatGPT o1-preview demonstrates promise as an AI assistant for educational research, though further investigation is warranted.
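
The agreement statistic reported above, Cohen's kappa, corrects the raw proportion of matching ratings for the agreement expected by chance. A minimal sketch follows, assuming unweighted kappa on a three-point scale (the abstract does not say whether a weighted variant was used). The function name `cohens_kappa` and the rating lists are hypothetical illustrations, not data from the study.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must score the same non-empty set of items.")

    n = len(rater_a)

    # Observed agreement: share of items given the same score by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: for each category, multiply the two raters'
    # marginal proportions, then sum over categories.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    categories = set(counts_a) | set(counts_b)
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical category; kappa is undefined,
        # so this sketch reports perfect agreement by convention.
        return 1.0

    return (observed - expected) / (1 - expected)


# Hypothetical three-point accuracy ratings for ten outputs (not study data).
human_scores = [1, 1, 2, 2, 3, 3, 1, 2, 3, 3]
ai_scores = [1, 1, 2, 3, 3, 3, 1, 2, 2, 3]

kappa = cohens_kappa(human_scores, ai_scores)
print(f"Cohen's kappa: {kappa:.3f}")
```

Under the commonly used Landis and Koch bands, values from 0.41 to 0.60 are read as moderate agreement and values from 0.61 to 0.80 as substantial, which matches the wording used for the reported range of 0.545-0.755.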

DOI: http://dx.doi.org/10.1002/ase.70074

Similar Publications

Application of artificial intelligence chatbots in interpreting magnetic resonance imaging reports: a comparative study.

Sci Rep

August 2025

Neurosurgery of The First Affiliated Hospital, Jinan University, Guangzhou, China.

Xuexue Bai, Ming Feng, Wenbin Ma, Yonghong Liao

Artificial intelligence (AI) chatbots have emerged as promising tools for enhancing medical communication, yet their efficacy in interpreting complex radiological reports remains underexplored. This study evaluates the performance of AI chatbots in translating magnetic resonance imaging (MRI) reports into patient-friendly language and providing clinical recommendations. A cross-sectional analysis was conducted on 6174 MRI reports from tumor patients across three hospitals.

Evaluation of the performance of large language models in endoscopic lumbar surgery: a comparative analysis.

Ann Med Surg (Lond)

August 2025

Department of Orthopedics, Beijing Jishuitan Hospital, Capital Medical University, Beijing, PR China.

Hao Li, Cheng Zeng, Lei Miao, Ye Wang, Jiyuan Xia

Objective: This study aimed to evaluate and compare the performance of three large language models (LLMs), namely ChatGPT o1-preview, Claude 3.5 Sonnet, and Gemini 1.5 Pro, in providing information on endoscopic lumbar surgery based on 10 frequently asked patient questions.

Quality of Human Expert vs Large Language Model-Generated Multiple-Choice Questions in the Field of Mechanical Ventilation.

Chest

July 2025

Department of Critical Care Medicine, National Institutes of Health, Bethesda, MD.

Sami Safadi, Roxana Amirahmadi, Abdulhakim Tlimat, Randal Rovinski, Junfeng Sun

Background: Although mechanical ventilation (MV) is a critical competency in critical care training, standardized methods for assessing MV-related knowledge are lacking. Traditional multiple-choice question (MCQ) development is resource intensive, and prior studies have suggested that generative AI tools could streamline question creation. However, the quality of AI-generated MCQs remains unclear.

Advancing dental diagnostics with OpenAI's o1-preview: A follow-up evaluation of ChatGPT's performance on diagnostic challenges.

J Am Dent Assoc

July 2025

Arman Danesh, Arsalan Danesh, Farzad Danesh

Background: The introduction of o1-preview (OpenAI) has stirred discussions surrounding its potential applications for diagnosing complex patient cases. The authors gauged changes in o1-preview's capacity to diagnose complex cases compared with its predecessors ChatGPT-3.5 (OpenAI) and ChatGPT-4 (legacy) (OpenAI).
