Category Ranking: 98%
Total Visits: 921
Avg Visit Duration: 2 minutes
Citations: 20

Article Abstract

The objective was to compare the accuracy of two large language models, GPT-4o and o3-Mini, against medical student performance on otolaryngology-focused, USMLE-style multiple-choice questions. With permission from AMBOSS, we extracted 146 Step 2 CK questions tagged "Otolaryngology" and stratified them by AMBOSS difficulty (levels 1-5). Each item was presented verbatim to GPT-4o and o3-Mini through their official APIs; outputs were scored correct/incorrect. Historical, de-identified student responses to the same items served as the comparator. Accuracy (%) was calculated per difficulty tier. Group differences were assessed with one-way ANOVA followed by independent-samples t tests (α=0.05). Mean accuracy across all items was 93.35% for o3-Mini and 90.45% for GPT-4o (P=0.465). Both models outperformed students (55.44%; P=0.008 and P=0.012, respectively). Performance of GPT-4o and o3-Mini remained ≥86% across all five difficulty levels, whereas student accuracy declined from 85.6% (level 1) to 26.7% (level 5). At the hardest tier, o3-Mini achieved 100% accuracy. GPT-4o and o3-Mini markedly exceed average medical student performance on ENT-specific USMLE-style questions, maintaining high accuracy even at the greatest difficulty. These findings support the integration of advanced language models as adjunctive learning tools in otolaryngology.
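The abstract describes its pipeline only at a high level: each item is sent verbatim through the official APIs, the reply is scored correct/incorrect, and group differences are tested with one-way ANOVA plus pairwise t tests. The minimal Python sketch below illustrates that workflow; the item fields (stem, choices, answer), the prompt wording, and the answer-letter parsing are assumptions for illustration, not the authors' code.

```python
# Minimal sketch of the workflow the abstract describes; field names,
# prompt wording, and answer parsing are assumptions, not the study code.
import re

from openai import OpenAI                     # pip install openai
from scipy.stats import f_oneway, ttest_ind   # pip install scipy

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask_model(model: str, item: dict) -> str:
    """Present one multiple-choice item and return the model's answer letter."""
    options = "\n".join(f"{k}. {v}" for k, v in item["choices"].items())
    prompt = (f"{item['stem']}\n\n{options}\n\n"
              "Answer with the single letter of the best option.")
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    match = re.search(r"\b([A-E])\b", reply.choices[0].message.content)
    return match.group(1) if match else ""


def score(model: str, items: list[dict]) -> list[int]:
    """1 if the model chose the keyed answer, else 0, for each item."""
    return [int(ask_model(model, it) == it["answer"]) for it in items]


def compare(items: list[dict], student_scores: list[float]) -> None:
    """Omnibus ANOVA across the three groups, then pairwise t tests (alpha=0.05).

    student_scores holds, per item, the historical fraction of students
    answering correctly (0-1), i.e., the comparator the abstract describes.
    """
    gpt4o = score("gpt-4o", items)
    o3mini = score("o3-mini", items)
    print("ANOVA:", f_oneway(gpt4o, o3mini, student_scores))
    print("GPT-4o vs students:", ttest_ind(gpt4o, student_scores))
    print("o3-Mini vs students:", ttest_ind(o3mini, student_scores))
```

The per-tier accuracies reported in the abstract would follow by grouping the per-item scores on a difficulty field before averaging.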

Source
http://dx.doi.org/10.1097/SCS.0000000000011831

Publication Analysis

Top Keywords

gpt-4o o3-mini (16)
usmle-style questions (8)
medical-student performance (8)
difficulty levels (8)
o3-mini (6)
accuracy (6)
comparison gpt-4o (4)
o3-mini otolaryngology (4)
otolaryngology usmle-style (4)
questions (4)

Similar Publications

Objective: This study evaluated the coherence, consistency, and diagnostic accuracy of eight AI-based chatbots in clinical scenarios related to dental implants.

Methods: A double-blind clinical experimental study was carried out between February and March 2025 to evaluate eight AI-based chatbots using six fictional cases simulating peri-implant mucositis and peri-implantitis. Each chatbot answered five standardized clinical questions across three independent runs per case, generating 720 binary outputs (8 chatbots × 6 cases × 5 questions × 3 runs; see the sketch below).
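As a purely illustrative aside, those 720 outputs form a 4-dimensional tally. The sketch below uses placeholder random data to show one way to reduce such an array to per-chatbot accuracy and run-to-run consistency; the array layout and both metrics are assumptions, not the study's analysis.

```python
# Hypothetical reduction of the 720 binary outputs described above
# (8 chatbots x 6 cases x 5 questions x 3 runs); placeholder data only.
import numpy as np

rng = np.random.default_rng(0)
# outputs[c, k, q, r] = 1 if chatbot c answered question q of case k
# correctly on run r.
outputs = rng.integers(0, 2, size=(8, 6, 5, 3))

accuracy = outputs.mean(axis=(1, 2, 3))             # per-chatbot accuracy
agree = outputs.min(axis=3) == outputs.max(axis=3)  # all 3 runs identical?
consistency = agree.mean(axis=(1, 2))               # per-chatbot consistency

for c, (acc, con) in enumerate(zip(accuracy, consistency)):
    print(f"chatbot {c}: accuracy={acc:.2f}, run-consistency={con:.2f}")
```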

Evaluating large language model-generated brain MRI protocols: performance of GPT4o, o3-mini, DeepSeek-R1 and Qwen2.5-72B.

Eur Radiol

September 2025

Institute of Diagnostic and Interventional Neuroradiology, TUM University Hospital, School of Medicine and Health, Technical University of Munich, Munich, Germany.

Objectives: To evaluate the potential of LLMs to generate sequence-level brain MRI protocols.

Materials And Methods: This retrospective study employed a dataset of 150 brain MRI cases derived from local imaging request forms. Reference protocols were established by two neuroradiologists.
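The snippet does not state how generated protocols were scored against the neuroradiologist references. One plausible, purely illustrative metric is set overlap (Jaccard similarity) over sequence names; the function and the example sequences below are hypothetical.

```python
# Illustrative scoring of a generated sequence-level protocol against a
# reference protocol via Jaccard similarity; sequence names are made up.
def jaccard(generated: set[str], reference: set[str]) -> float:
    """Intersection-over-union of two sets of sequence names."""
    if not generated and not reference:
        return 1.0
    return len(generated & reference) / len(generated | reference)

reference = {"T1 MPRAGE", "T2 FLAIR", "DWI", "SWI", "T1 contrast-enhanced"}
generated = {"T1 MPRAGE", "T2 FLAIR", "DWI", "GRE"}
print(f"overlap: {jaccard(generated, reference):.2f}")  # prints 0.50
```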
