A Comparative Study of Large Language Models, Human Experts, and Expert-Edited Large Language Models to Neuro-Ophthalmology Questions.

J Neuroophthalmol

Department of Ophthalmology (PDT, LAD, MRS, DAT, KDC, MCB, SAM, JJC), Mayo Clinic, Rochester, Minnesota; Departments of Ophthalmology (HEM) and Neurology & Neurological Sciences (HEM), Stanford University, Palo Alto, California; Department of Ophthalmology (KEL, MWK, DDM), Glick Eye Institute, India

Published: March 2025



Article Abstract

Background: While large language models (LLMs) are increasingly used in medicine, their effectiveness compared with human experts remains unclear. This study evaluates the quality and empathy of Expert + AI, human experts, and LLM responses in neuro-ophthalmology.

Methods: This randomized, masked, multicenter cross-sectional study was conducted from June to July 2023. We randomly assigned 21 neuro-ophthalmology questions to 13 experts. Each expert provided an answer and then edited a ChatGPT-4-generated response, timing both tasks. In addition, 5 LLMs (ChatGPT-3.5, ChatGPT-4, Claude 2, Bing, Bard) generated responses. Anonymized and randomized responses from Expert + AI, human experts, and LLMs were evaluated by the remaining 12 experts. The main outcome was the mean score for quality and empathy, rated on a 1-5 scale.

Results: Significant differences existed between response types for both quality and empathy (both P < 0.0001). For quality, Expert + AI (4.16 ± 0.81) performed best, followed by GPT-4 (4.04 ± 0.92), GPT-3.5 (3.99 ± 0.87), Claude (3.6 ± 1.09), Expert (3.56 ± 1.01), Bard (3.5 ± 1.15), and Bing (3.04 ± 1.12). For empathy, Expert + AI (3.63 ± 0.87) had the highest score, followed by GPT-4 (3.6 ± 0.88), Bard (3.54 ± 0.89), GPT-3.5 (3.5 ± 0.83), Bing (3.27 ± 1.03), Expert (3.26 ± 1.08), and Claude (3.11 ± 0.78). Expert + AI outperformed Expert on both quality (P < 0.0001) and empathy (P = 0.002). Time taken for expert-created and expert-edited LLM responses was similar (P = 0.75).

Conclusions: Expert-edited LLM responses received the highest expert-determined ratings of quality and empathy, warranting further exploration of their potential benefits in clinical settings.
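The scoring scheme described in the Methods and Results (masked 1-5 expert ratings, response types ranked by mean ± SD) can be sketched as follows. The ratings below are made-up placeholder values for illustration, not the study's data, and the group names are abbreviated:

```python
# Sketch of the study's scoring scheme: each response type receives 1-5
# ratings from masked expert graders, and response types are ranked by
# mean score. Ratings here are illustrative placeholders only.
from statistics import mean, stdev

ratings = {
    "Expert + AI": [5, 4, 4, 5, 3, 4],
    "GPT-4":       [4, 4, 5, 3, 4, 4],
    "Expert":      [3, 4, 3, 4, 3, 4],
}

# Mean +/- SD per response type, as reported in the Results.
summary = {
    name: (round(mean(scores), 2), round(stdev(scores), 2))
    for name, scores in ratings.items()
}

# Rank response types by mean rating, highest first.
ranked = sorted(summary, key=lambda name: summary[name][0], reverse=True)
print(ranked[0])  # prints "Expert + AI" for this toy data
```

On these placeholder ratings the expert-edited responses come out on top, mirroring the ordering the study reports; the actual analysis also tested the group differences for statistical significance.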


Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11445389
DOI: http://dx.doi.org/10.1097/WNO.0000000000002145

Publication Analysis

Top Keywords

human experts (16), quality empathy (16), large language (12), language models (12), llm responses (12), expert (9), neuro-ophthalmology questions (8), empathy expert (8), expert human (8), expert-edited llm (8)
