Examining the Performance of ChatGPT 3.5 and Microsoft Copilot in Otolaryngology: A Comparative Study with Otolaryngologists' Evaluation.

Miguel Mayo-Yáñez, Jerome R Lechien, Alberto Maria-Saibene, Luigi A Vaira, Antonino Maniaci, Carlos M Chiesa-Estomba

Indian J Otolaryngol Head Neck Surg

Young-Otolaryngologists of the International Federation of Oto-Rhino-Laryngological Societies (YO-IFOS) Study Group, 75000 Paris, France.

Published: August 2024

This study evaluated how ChatGPT 3.5 and an internet-connected GPT-4 engine (Microsoft Copilot) performed on a public healthcare system otolaryngology job competition examination, using the real scores of otolaryngology specialists as the control group. In September 2023, 135 questions divided into theoretical and practical parts were input into ChatGPT 3.5 and Copilot. The accuracy of the AI responses was compared with the official results of the otolaryngologists who took the exam, and statistical analysis was conducted using Stata 14.2. Copilot (GPT-4) outperformed ChatGPT 3.5, scoring 88.5 points against ChatGPT's 60 points. The two AIs showed discrepancies in their incorrect answers. Copilot's score ranked second among the 108 otolaryngologists who took the exam, while ChatGPT placed 83rd. A chatbot powered by GPT-4 with internet access (Copilot) demonstrates superior performance in responding to multiple-choice medical questions compared to ChatGPT 3.5.

PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11306834
DOI: http://dx.doi.org/10.1007/s12070-024-04729-1

Publication Analysis

Top Keywords

microsoft copilot

chatgpt internet-connected

internet-connected gpt-4

otolaryngologists exam

superior performance

chatgpt

copilot

examining performance

performance chatgpt

chatgpt microsoft

Similar Publications

Assessing the Potential of Generative Artificial Intelligence Models to Assist Experts in the Development of Pharmacokinetic Models.

Adv Pharm Bull

July 2025

Department of Telecommunications & Systems Engineering, Universitat Autònoma de Barcelona, Sabadell, 08202, Spain.

Sergio Sánchez-Herrero, Laura Calvet Liñan

Purpose: This study explores the potential of generative AI models to aid experts in developing scripts for pharmacokinetic (PK) models, with a focus on constructing a two-compartment population PK model using data from Hosseini et al.

Methods: Generative AI tools ChatGPT v3.5, Gemini v2.

Multiple large language models versus clinical guidelines for postmenopausal osteoporosis: a comparative study of ChatGPT-3.5, ChatGPT-4.0, ChatGPT-4o, Google Gemini, Google Gemini Advanced, and Microsoft Copilot.

Arch Osteoporos

September 2025

Department of Family Medicine, Chang-Gung Memorial Hospital, Linkou Branch, Taoyuan City, Taiwan.

Chun-Ru Lin, Yi-Jun Chen, Po-An Tsai, Wen-Yuan Hsieh, Sung Huang Laurent Tsai

The study assesses the performance of AI models in evaluating postmenopausal osteoporosis. We found that ChatGPT-4o produced the most appropriate responses, highlighting the potential of AI to enhance clinical decision-making and improve patient care in osteoporosis management.

Purpose: The rise of artificial intelligence (AI) offers the potential for assisting clinical decisions.

Frankenstein, thematic analysis and generative artificial intelligence: Quality appraisal methods and considerations for qualitative research.

PLoS One

September 2025

Faculty of Health Sciences and Medicine, Bond University, Gold Coast, Australia.

Tanisha Jowsey, Peta Stapleton, Shawna Campbell, Alexandra Davidson, Cher McGillivray

Objective: To determine accuracy and efficiency of using generative artificial intelligence (GenAI) to undertake thematic analysis.

Introduction: With the increasing use of GenAI in data analysis, testing the reliability and suitability of using GenAI to conduct qualitative data analysis is needed. We propose a method for researchers to assess reliability of GenAI outputs using deidentified qualitative datasets.

Potential cost savings of a wound bed-conforming silicone foam dressing with 3DFit Technology compared with standard of care.

J Wound Care

September 2025

Coloplast A/S, Holtedam 1-3, Humlebaek 3050, Denmark.

Caroline Dowsett, Julie Beck Christoffersen, Mette Irene Agerkvist Hansen, Paddy Markey

Objective: The increasing prevalence of chronic wounds, particularly venous leg ulcers (VLUs) and diabetic foot ulcers (DFUs), presents significant clinical and economic challenges within the National Health Service in the UK. This study was designed to evaluate the financial impact of replacing the standard of care (SoC), two-dressing regimen with a single silicone foam dressing with 3DFit Technology, Biatain Silicone (Coloplast, UK), in the treatment of these wounds in the community setting in the UK.

Method: A budget impact model was developed to estimate the potential cost savings of a progressive transition from SoC to the single silicone foam dressing with 3DFit Technology over a five-year horizon.

Performance of large language models in fluoride-related dental knowledge: a comparative evaluation study of ChatGPT-4, Claude 3.5 Sonnet, Copilot, and Grok 3.

J Yeungnam Med Sci

September 2025

Department of Dentistry, Malda Medical College and Hospital, Malda, India.

Raju Biswas, Atanu Mukhopadhyay, Santanu Mukhopadhyay

Background: Large language models (LLMs) have rapidly emerged as valuable tools in medical and dental education that support clinical reasoning, patient communication, and academic instruction. However, their effectiveness in conveying specialized content, such as fluoride-related dental knowledge, requires a thorough evaluation. This study assesses the performance of four advanced LLMs: ChatGPT-4 (OpenAI), Claude 3.
