A Comparative Study of Large Language Models, Human Experts, and Expert-Edited Large Language Models to Neuro-Ophthalmology Questions.

Prashant D Tailor, Lauren A Dalvin, Matthew R Starr, Deena A Tajfirouz, Kevin D Chodnicki, Michael C Brodsky, Sasha A Mansukhani, Heather E Moss, Kevin E Lai, Melissa W Ko, Devin D Mackay, Marie A Di Nome, Oana M Dumitrascu, Misha L Pless, Eric R Eggenberger, John J Chen

J Neuroophthalmol

Department of Ophthalmology (PDT, LAD, MRS, DAT, KDC, MCB, SAM, JJC), Mayo Clinic, Rochester, Minnesota; Departments of Ophthalmology (HEM) and Neurology & Neurological Sciences (HEM), Stanford University, Palo Alto, California; Department of Ophthalmology (KEL, MWK, DDM), Glick Eye Institute, India

Published: March 2025

Background: While large language models (LLMs) are increasingly used in medicine, their effectiveness compared with human experts remains unclear. This study evaluates the quality and empathy of Expert + AI, human experts, and LLM responses in neuro-ophthalmology.

Methods: This randomized, masked, multicenter cross-sectional study was conducted from June to July 2023. We randomly assigned 21 neuro-ophthalmology questions to 13 experts. Each expert provided an answer and then edited a ChatGPT-4-generated response, timing both tasks. In addition, 5 LLMs (ChatGPT-3.5, ChatGPT-4, Claude 2, Bing, Bard) generated responses. Anonymized and randomized responses from Expert + AI, human experts, and LLMs were evaluated by the remaining 12 experts. The main outcome was the mean score for quality and empathy, rated on a 1-5 scale.

Results: Significant differences existed between response types for both quality and empathy (P < 0.0001, P < 0.0001). For quality, Expert + AI (4.16 ± 0.81) performed the best, followed by GPT-4 (4.04 ± 0.92), GPT-3.5 (3.99 ± 0.87), Claude (3.6 ± 1.09), Expert (3.56 ± 1.01), Bard (3.5 ± 1.15), and Bing (3.04 ± 1.12). For empathy, Expert + AI (3.63 ± 0.87) had the highest score, followed by GPT-4 (3.6 ± 0.88), Bard (3.54 ± 0.89), GPT-3.5 (3.5 ± 0.83), Bing (3.27 ± 1.03), Expert (3.26 ± 1.08), and Claude (3.11 ± 0.78). For quality (P < 0.0001) and empathy (P = 0.002), Expert + AI performed better than Expert. Time taken for expert-created and expert-edited LLM responses was similar (P = 0.75).
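The reported rankings can be checked with a few lines of arithmetic. The sketch below is illustrative only: it tabulates the means and standard deviations quoted in the Results, re-derives the ordering, and computes the Expert + AI versus Expert gap in mean score. The names `quality`, `empathy`, `rank_by_mean`, and `mean_gap` are hypothetical, and the code does not reproduce the authors' significance tests, which would require the underlying ratings.

```python
# Means and standard deviations (1-5 scale) copied from the Results paragraph.
# Dictionary and function names are hypothetical, chosen for illustration.
quality = {
    "Expert + AI": (4.16, 0.81),
    "GPT-4": (4.04, 0.92),
    "GPT-3.5": (3.99, 0.87),
    "Claude": (3.6, 1.09),
    "Expert": (3.56, 1.01),
    "Bard": (3.5, 1.15),
    "Bing": (3.04, 1.12),
}
empathy = {
    "Expert + AI": (3.63, 0.87),
    "GPT-4": (3.6, 0.88),
    "Bard": (3.54, 0.89),
    "GPT-3.5": (3.5, 0.83),
    "Bing": (3.27, 1.03),
    "Expert": (3.26, 1.08),
    "Claude": (3.11, 0.78),
}


def rank_by_mean(scores):
    """Return response types ordered from highest to lowest mean score."""
    return sorted(scores, key=lambda name: scores[name][0], reverse=True)


def mean_gap(scores, a, b):
    """Difference in mean score between response types a and b, to 2 decimals."""
    return round(scores[a][0] - scores[b][0], 2)


quality_ranking = rank_by_mean(quality)
empathy_ranking = rank_by_mean(empathy)
quality_gap = mean_gap(quality, "Expert + AI", "Expert")
empathy_gap = mean_gap(empathy, "Expert + AI", "Expert")

print("Quality ranking:", quality_ranking)
print("Empathy ranking:", empathy_ranking)
print("Expert + AI minus Expert (quality):", quality_gap)
print("Expert + AI minus Expert (empathy):", empathy_gap)
```

The gaps are raw differences in means on the 1-5 scale, not effect sizes. Given standard deviations near 1, the distributions of individual ratings overlap considerably.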

Conclusions: Expert-edited LLM responses had the highest expert-determined ratings of quality and empathy, warranting further exploration of their potential benefits in clinical settings.

PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11445389
DOI: http://dx.doi.org/10.1097/WNO.0000000000002145

Similar Publications

Understanding and Addressing Challenges With Electronic Health Record Use in Gynecological Oncology: Cross-Sectional Survey of Multidisciplinary Professionals in the United Kingdom and Co-Design of an Integrated Informatics Platform to Support Clinical Decision-Making.

JMIR Cancer

September 2025

iCARE Secure Data Environment & Digital Collaboration Space, NIHR Imperial Biomedical Research Centre, London, United Kingdom.

Laura Tookman, Rachael Lear, Yusuf S Abdullahi, Amit Samani, Phoebe Averill

Background: Electronic health records (EHRs) are a cornerstone of modern health care delivery, but their current configuration often fragments information across systems, impeding timely and effective clinical decision-making. In gynecological oncology, where care involves complex, multidisciplinary coordination, these limitations can significantly impact the quality and efficiency of patient management. Few studies have examined how EHR systems support clinical decision-making from the perspective of end users.

Methods for assessing submarining occurrence in PMHS frontal sled tests: Exploring potential indicators.

Traffic Inj Prev

September 2025

Department of Biomedical Engineering, Medical College of Wisconsin, Milwaukee, Wisconsin.

Karthik Somasundaram, Klaus Driesslein, Anjishnu Banerjee, Frank A Pintar

Objective: Assessment of submarining occurrence in PMHS (Post-Mortem Human Subject) testing can be challenging, particularly for obese PMHS. This study investigates varied kinetic and kinematic response parameters as potential indicators of submarining. Data from 36 whole-body PMHS frontal sled tests conducted under varying boundary conditions were analyzed, incorporating three spring-controlled seat configurations, two extreme anthropometric profiles, two crash pulses, and two seatback angles.

INTEROBSERVER AGREEMENT OF INTRAPAPILLARY CAPILLARY LOOPS CLASSIFICATION FOR SUPERFICIAL ESOPHAGEAL SQUAMOUS CELL CARCINOMA IN A WESTERN CENTER.

Arq Gastroenterol

September 2025

Faculdade de Medicina da Universidade de São Paulo, Departamento de Gastroenterologia, São Paulo, SP, Brasil.

Fauze Maluf-Filho, Ossamu Okazaki, Beanie Conceição Medeiros Nunes, Adriana Vaz Safatle-Ribeiro, Luciano Lenz

Background: Accurate evaluation of the invasion depth of superficial esophageal squamous cell carcinoma (SESCC) is crucial for optimal treatment. While magnifying endoscopy (ME) using the Japanese Esophageal Society (JES) classification is reported as the most accurate method to predict invasion depth, its efficacy has not been tested in the Western world. This study aims to evaluate the interobserver agreement of the JES classification for SESCC and its accuracy in estimating invasion depth in a Brazilian tertiary hospital.

Deep feature engineering for accurate sperm morphology classification using CBAM-enhanced ResNet50.

PLoS One

September 2025

School of Computer Science, CHART Laboratory, University of Nottingham, Nottingham, United Kingdom.

Şafak Kılıç

Background And Objective: Male fertility assessment through sperm morphology analysis remains a critical component of reproductive health evaluation, as abnormal sperm morphology is strongly correlated with reduced fertility rates and poor assisted reproductive technology outcomes. Traditional manual analysis performed by embryologists is time-intensive, subjective, and prone to significant inter-observer variability, with studies reporting up to 40% disagreement between expert evaluators. This research presents a novel deep learning framework combining Convolutional Block Attention Module (CBAM) with ResNet50 architecture and advanced deep feature engineering (DFE) techniques for automated, objective sperm morphology classification.
