Comparison of physician and large language model chatbot responses to online ear, nose, and throat inquiries.

Masaomi Motegi, Masato Shino, Mikio Kuwabara, Hideyuki Takahashi, Toshiyuki Matsuyama, Hiroe Tada, Hiroyuki Hagiwara, Kazuaki Chikamatsu

Sci Rep

Department of Otolaryngology-Head and Neck Surgery, Gunma University Graduate School of Medicine, 3-39-15 Showamachi, Maebashi, Gunma, 371-8511, Japan.

Published: July 2025

Large language models (LLMs) can potentially enhance the accessibility and quality of medical information. This study evaluates the reliability and quality of responses generated by ChatGPT-4, an LLM-driven chatbot, compared to those written by physicians, focusing on otorhinolaryngological advice in real-world, text-based workflows. Responses from a public social media forum were anonymized, and ChatGPT-4 generated corresponding replies. A panel of seven board-certified otorhinolaryngologists assessed both sets of responses using six criteria: overall quality, empathy, alignment with medical consensus, information accuracy, inquiry comprehension, and harm potential. Ordinal logistic regression analysis identified factors influencing response quality. ChatGPT-4 responses were preferred in 70.7% of cases and were significantly longer (median: 162 words) than physician responses (median: 67 words; P < .0001). The chatbot's responses received higher ratings across all criteria, with key predictors of this higher quality being greater empathy, stronger alignment with medical consensus, lower potential for harm, and fewer inaccuracies. ChatGPT-4 consistently outperformed physicians in generating responses that adhered to medical consensus, demonstrated accuracy, and conveyed empathy. These findings suggest that integrating AI tools into text-based healthcare consultations could help physicians better address complex, nuanced inquiries and provide high-quality, comprehensive medical advice.
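
The descriptive statistics reported above (the share of cases in which the chatbot reply was preferred, and the median word count per group) can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the function names are hypothetical, the sample votes and replies are synthetic placeholders rather than study data, and splitting on whitespace is an assumed definition of a "word".

```python
from statistics import median


def word_count(text: str) -> int:
    """Count whitespace-separated tokens (an assumed definition of a word)."""
    return len(text.split())


def preference_share(votes: list[str], label: str = "chatbot") -> float:
    """Return the percentage of votes that equal `label`."""
    if not votes:
        raise ValueError("votes must not be empty")
    return 100.0 * sum(v == label for v in votes) / len(votes)


def median_word_count(responses: list[str]) -> float:
    """Return the median word count across a list of response texts."""
    return median(word_count(r) for r in responses)


if __name__ == "__main__":
    # Synthetic placeholder data, NOT taken from the study.
    votes = ["chatbot", "chatbot", "physician", "chatbot"]
    physician_replies = [
        "Please see an ENT specialist.",
        "This is likely a cold. Rest and drink fluids.",
    ]
    chatbot_replies = [
        "I am sorry you are dealing with this. Several causes are possible, "
        "so an in-person examination by an ENT specialist is advisable.",
        "That sounds uncomfortable. Symptoms like these often follow a cold, "
        "but please seek care promptly if they worsen or persist.",
    ]

    print(f"Chatbot preferred: {preference_share(votes):.1f}%")
    print(f"Median words (physician): {median_word_count(physician_replies)}")
    print(f"Median words (chatbot): {median_word_count(chatbot_replies)}")
```

The sketch covers only the descriptive summaries. The significance test on response length and the ordinal logistic regression mentioned in the abstract would need a statistics package and the per-rater scores, neither of which is specified here.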

Full text (PMC): http://www.ncbi.nlm.nih.gov/pmc/articles/PMC12215459
DOI: http://dx.doi.org/10.1038/s41598-025-06769-1

Similar Publications

Implementing a Resource-Light and Low-Code Large Language Model System for Information Extraction from Mammography Reports: A Pilot Study.

J Imaging Inform Med

September 2025

Department of Diagnostic, Interventional and Pediatric Radiology (DIPR), Inselspital, Bern University Hospital and University of Bern, Bern, Switzerland.

Fabio Dennstädt, Simon Fauser, Nikola Cihoric, Max Schmerder, Paolo Lombardo

Large language models (LLMs) have been successfully used for data extraction from free-text radiology reports. Most current studies were conducted with LLMs accessed via an application programming interface (API). We evaluated the feasibility of using open-source LLMs, deployed on limited local hardware resources for data extraction from free-text mammography reports, using a common data element (CDE)-based structure.

Inflammatory gene expression profile of oral plasmablastic lymphoma.

Virchows Arch

September 2025

Department of Oral Surgery and Pathology, School of Dentistry, Universidade Federal de Minas Gerais, Minas Gerais, Av. Antônio Carlos, Pampulha, Belo Horizonte, 31270-901, Brazil.

Roberta Rayra Martins-Chaves, Marina Gonçalves Diniz, Fernanda Faria Rocha, Cinthia Veronica Bardález López de Cáceres, Pablo Agustin Vargas

Plasmablastic lymphoma (PBL) is a rare and aggressive non-Hodgkin lymphoma with a poor prognosis and short survival rates. It is classified as a large B-cell lymphoma subtype, but carries a plasmacytic immunophenotype. Therefore, PBL has pathogenetic overlaps with diffuse large B-cell lymphoma not otherwise specified (DLBCL NOS) and plasma cell neoplasms (PCNs).

Decompensation of degenerative lumbar stenosis: do patients need immediate surgery?

Eur Spine J

September 2025

Centre Hospitalier Universitaire de Tours, Tours, France.

Marie Duigou, Louis-Marie Terrier, Alexia Planty-Bonjour, Christophe Destrieux, Ilyess Zemmoura

Purpose: Degenerative lumbar spinal stenosis (DLSS) represents an increasing challenge due to the aging population. The natural course of untreated DLSS is largely unknown. For the acute DLSS decompensations, the main concern remains the opportunity and timing of surgery, i.

Active use of latent tree-structured sentence representation in humans and large language models.

Nat Hum Behav

September 2025

Key Laboratory for Biomedical Engineering of Ministry of Education, College of Biomedical Engineering and Instrument Sciences, Zhejiang University, Hangzhou, China.

Wei Liu, Ming Xiang, Nai Ding

Understanding how sentences are represented in the human brain, as well as in large language models (LLMs), poses a substantial challenge for cognitive science. Here we develop a one-shot learning task to investigate whether humans and LLMs encode tree-structured constituents within sentences. Participants (total N = 372, native Chinese or English speakers, and bilingual in Chinese and English) and LLMs (for example, ChatGPT) were asked to infer which words should be deleted from a sentence.

Performance of GPT-4o combined with retrieval-augmented generation on nutritionist licensing exam questions.

Endocr J

September 2025

Institute of Liberal Arts and Science, Kanazawa University, Kanazawa, Japan.

Yu Ishikawa, Akitaka Higashi, Nozomu Arai, Daisuke Ozo, Wataru Hasegawa

GPT-4o, a general-purpose large language model, has a Retrieval-Augmented Variant (GPT-4o-RAG) that can assist in dietary counseling. However, research on its application in this field remains lacking. To bridge this gap, we used the Japanese National Examination for Registered Dietitians as a standardized benchmark for evaluation.
