Our objective was to compare the accuracy of two large language models, GPT-4o and o3-Mini, against medical-student performance on otolaryngology-focused, USMLE-style multiple-choice questions. With permission from AMBOSS, we extracted 146 Step 2 CK questions tagged "Otolaryngology" and stratified them by AMBOSS difficulty (levels 1-5). Each item was presented verbatim to GPT-4o and o3-Mini through their official APIs, and outputs were scored correct/incorrect. Historical, de-identified student responses to the same items served as the comparator. Accuracy (%) was calculated per difficulty tier, and group differences were assessed with one-way ANOVA followed by independent-samples t tests (α=0.05). Mean accuracy across all items was 93.35% for o3-Mini and 90.45% for GPT-4o (P=0.465); both models outperformed students (55.44%; P=0.008 and P=0.012, respectively). GPT-4o and o3-Mini maintained ≥86% accuracy across all five difficulty levels, whereas student accuracy declined from 85.6% (level 1) to 26.7% (level 5); at the hardest tier, o3-Mini achieved 100% accuracy. GPT-4o and o3-Mini markedly exceed average medical-student performance on ENT-specific USMLE-style questions, maintaining high accuracy even at the greatest difficulty. These findings support the integration of advanced language models as adjunctive learning tools in otolaryngology.
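A minimal sketch of the evaluation loop this abstract describes, assuming the OpenAI Python SDK (openai>=1.x) and scipy; the prompt wording, answer-extraction rule, and item layout are illustrative assumptions, not details from the paper.

```python
# Sketch: score MCQ items with a model via the official API, then compare groups.
from openai import OpenAI
from scipy import stats

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(model: str, stem: str, choices: dict[str, str]) -> str:
    """Present one multiple-choice item verbatim; return the model's letter."""
    prompt = (
        stem
        + "\n"
        + "\n".join(f"{k}. {v}" for k, v in sorted(choices.items()))
        + "\nAnswer with a single letter only."
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()[0].upper()

def score(model: str, items: list[dict]) -> list[int]:
    """Per-item 0/1 scores: 1 if the returned letter matches the answer key."""
    return [int(ask(model, it["stem"], it["choices"]) == it["answer"]) for it in items]

# items = [{"stem": ..., "choices": {"A": ..., "B": ...}, "answer": "B"}, ...]
# gpt4o = score("gpt-4o", items); o3mini = score("o3-mini", items)
# students = historical per-item percent-correct for the same 146 questions
# f, p_anova = stats.f_oneway(gpt4o, o3mini, students)
# t, p_pair = stats.ttest_ind(o3mini, students)  # repeat per pair; alpha = 0.05
```

Constraining the response to a single letter keeps scoring a deterministic string comparison rather than free-text grading.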
DOI: http://dx.doi.org/10.1097/SCS.0000000000011831
J Craniofac Surg
September 2025
University of Miami Miller School of Medicine, Miami, FL.
J Dent
September 2025
Dental Clinic Post-Graduate Program, University Center of State of Pará, Belém, Pará, Brazil.
Objective: This study evaluated the coherence, consistency, and diagnostic accuracy of eight AI-based chatbots in clinical scenarios related to dental implants.
Methods: A double-blind clinical experimental study was carried out between February and March 2025 to evaluate eight AI-based chatbots using six fictional cases simulating peri-implant mucositis and peri-implantitis. Each chatbot answered five standardized clinical questions across three independent runs per case, generating 720 binary outputs.
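The 720 outputs follow directly from the design: 8 chatbots × 6 cases × 5 questions × 3 runs = 720. A small sketch of how run-to-run consistency could be tallied from such outputs; the nested data layout and the all-runs-agree rule are our assumptions for illustration, not the study's scoring protocol.

```python
# Sketch: tally run-to-run consistency over binary chatbot outputs.
# Assumes results[b][c][q] holds the three binary answers (0/1) given by
# chatbot b on question q of case c -- an illustrative layout.
from itertools import product

CHATBOTS, CASES, QUESTIONS, RUNS = 8, 6, 5, 3
assert CHATBOTS * CASES * QUESTIONS * RUNS == 720  # matches the abstract

def consistency_rate(results: list[list[list[list[int]]]]) -> float:
    """Fraction of chatbot/case/question cells where all three runs agree."""
    cells = [
        results[b][c][q]
        for b, c, q in product(range(CHATBOTS), range(CASES), range(QUESTIONS))
    ]
    return sum(len(set(runs)) == 1 for runs in cells) / len(cells)
```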
Eur Radiol
September 2025
Institute of Diagnostic and Interventional Neuroradiology, TUM University Hospital, School of Medicine and Health, Technical University of Munich, Munich, Germany.
Objectives: To evaluate the potential of LLMs to generate sequence-level brain MRI protocols.
Materials And Methods: This retrospective study employed a dataset of 150 brain MRI cases derived from local imaging request forms. Reference protocols were established by two neuroradiologists.
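One plausible way to score a generated sequence-level protocol against a neuroradiologist reference is set overlap. The sketch below treats protocols as unordered sets of sequence names and reports per-case F1; the metric choice and the example sequence names are our assumptions, not the study's stated method.

```python
# Sketch: compare an LLM-proposed brain MRI protocol to a reference
# protocol as unordered sets of sequence names (illustrative metric).
def protocol_f1(proposed: set[str], reference: set[str]) -> float:
    """Per-case F1 over sequence names; 0.0 when there is no overlap."""
    tp = len(proposed & reference)
    if tp == 0:
        return 0.0
    precision = tp / len(proposed)
    recall = tp / len(reference)
    return 2 * precision * recall / (precision + recall)

# One extra sequence lowers precision despite perfect recall:
print(protocol_f1({"T1 MPRAGE", "T2 FLAIR", "DWI", "SWI"},
                  {"T1 MPRAGE", "T2 FLAIR", "DWI"}))  # ~0.857
```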
AJR Am J Roentgenol
July 2025
Department of Diagnostic Radiology, Queen's University, Kingston Health Sciences Centre, Kingston, ON.