Examining the Performance of ChatGPT 3.5 and Microsoft Copilot in Otolaryngology: A Comparative Study with Otolaryngologists' Evaluation.

Indian J Otolaryngol Head Neck Surg

Young-Otolaryngologists of the International Federation of Oto-Rhino-Laryngological Societies (YO-IFOS) Study Group, 75000 Paris, France.

Published: August 2024


Citations

20

Article Abstract

This study evaluated the response capabilities of ChatGPT 3.5 and an internet-connected GPT-4 engine (Microsoft Copilot) on an otolaryngology job competition examination for a public healthcare system, using the real scores of otolaryngology specialists as the control group. In September 2023, 135 questions divided into theoretical and practical parts were input into ChatGPT 3.5 and the internet-connected GPT-4. The accuracy of the AI responses was compared with the official results of the otolaryngologists who took the exam, and statistical analysis was conducted using Stata 14.2. Copilot (GPT-4) outperformed ChatGPT 3.5, scoring 88.5 points against ChatGPT's 60. The two AIs showed discrepancies in their incorrect answers. Despite ChatGPT's proficiency, Copilot displayed superior performance: its score ranked second among the 108 otolaryngologists who took the exam, while ChatGPT placed 83rd. A chatbot powered by GPT-4 with internet access (Copilot) outperforms ChatGPT 3.5 in answering multiple-choice medical questions.
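
To put the two reported ranks on a common scale, the following minimal sketch converts a rank into the share of the field it sits ahead of. It assumes each rank is a position within a field of 108 candidates and that there are no tied scores; neither assumption is confirmed by the abstract. The function name `share_outperformed` is hypothetical and is not part of the study's Stata analysis.

```python
def share_outperformed(rank: int, n_candidates: int) -> float:
    """Fraction of the field ranked strictly below `rank`.

    Assumption (not stated in the abstract): ranks are positions within
    a field of `n_candidates`, with no tied scores.
    """
    if not 1 <= rank <= n_candidates:
        raise ValueError("rank must lie between 1 and n_candidates")
    return (n_candidates - rank) / n_candidates


# Figures taken from the abstract: 108 examinees, ranks 2 and 83.
N_CANDIDATES = 108
RANKS = {"Copilot (GPT-4)": 2, "ChatGPT 3.5": 83}

for model, rank in RANKS.items():
    share = share_outperformed(rank, N_CANDIDATES)
    print(f"{model}: rank {rank}/{N_CANDIDATES}, ahead of {share:.1%} of the field")
```

Under these assumptions, rank 2 sits ahead of 106 of 108 candidates (about 98.1%), and rank 83 sits ahead of 25 of 108 (about 23.1%).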

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11306834
DOI: http://dx.doi.org/10.1007/s12070-024-04729-1

Publication Analysis

Top Keywords

microsoft copilot: 8
chatgpt internet-connected: 8
internet-connected gpt-4: 8
otolaryngologists exam: 8
superior performance: 8
chatgpt: 7
copilot: 6
examining performance: 4
performance chatgpt: 4
chatgpt microsoft: 4

Similar Publications

Purpose: This study explores the potential of generative AI models to aid experts in developing scripts for pharmacokinetic (PK) models, with a focus on constructing a two-compartment population PK model using data from Hosseini et al.

Methods: Generative AI tools ChatGPT v3.5, Gemini v2.

The study assesses the performance of AI models in evaluating postmenopausal osteoporosis. We found that ChatGPT-4o produced the most appropriate responses, highlighting the potential of AI to enhance clinical decision-making and improve patient care in osteoporosis management.

Purpose: The rise of artificial intelligence (AI) offers the potential for assisting clinical decisions.

Objective: To determine accuracy and efficiency of using generative artificial intelligence (GenAI) to undertake thematic analysis.

Introduction: With the increasing use of GenAI in data analysis, testing the reliability and suitability of using GenAI to conduct qualitative data analysis is needed. We propose a method for researchers to assess reliability of GenAI outputs using deidentified qualitative datasets.

Objective: The increasing prevalence of chronic wounds, particularly venous leg ulcers (VLUs) and diabetic foot ulcers (DFUs), presents significant clinical and economic challenges within the National Health Service in the UK. This study was designed to evaluate the financial impact of replacing the standard of care (SoC), two-dressing regimen with a single silicone foam dressing with 3DFit Technology, Biatain Silicone (Coloplast, UK), in the treatment of these wounds in the community setting in the UK.

Method: A budget impact model was developed to estimate the potential cost savings of a progressive transition from SoC to the single silicone foam dressing with 3DFit Technology over a five-year horizon.

Background: Large language models (LLMs) have rapidly emerged as valuable tools in medical and dental education that support clinical reasoning, patient communication, and academic instruction. However, their effectiveness in conveying specialized content, such as fluoride-related dental knowledge, requires a thorough evaluation. This study assesses the performance of four advanced LLMs: ChatGPT-4 (OpenAI), Claude 3.
