Assessing the proficiency of large language models on funduscopic disease knowledge.

Int J Ophthalmol

Department of Ophthalmology, Shanghai General Hospital, Shanghai Jiao Tong University School of Medicine, National Clinical Research Center for Eye Diseases, Shanghai 200080, China.

Published: July 2025


Category Ranking: 98%
Total Visits: 921
Avg Visit Duration: 2 minutes
Citations: 20

Article Abstract

Aim: To assess the performance of five distinct large language models (LLMs; ChatGPT-3.5, ChatGPT-4, PaLM2, Claude 2, and SenseNova) in comparison to two human cohorts (a group of funduscopic disease experts and a group of ophthalmologists) on the specialized subject of funduscopic disease.

Methods: The five LLMs and the two human groups each independently completed a 100-item funduscopic disease test. Performance was assessed by comparing average scores, response stability, and answer confidence.
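
As a rough illustration (not the authors' actual code, data, or formulas), the Python sketch below shows one way the metrics named in the Methods could be computed, assuming each participant repeats the 100-item test several times. The model names, answer key, responses, and the simple agreement measure (a stand-in for the inter-model correlation mentioned in the Results) are all hypothetical placeholders.

# Hypothetical sketch of the evaluation metrics: average score, response
# stability, and inter-model agreement. Answer confidence is omitted here
# because it was elicited from the participants rather than computed.
from itertools import combinations
from statistics import mean

# Placeholder answer key and responses; the real test had 100 items.
# responses[model] is a list of runs; each run is a list of answers.
answer_key = ["A", "C", "B", "D"]
responses = {
    "ChatGPT-4": [["A", "C", "B", "D"], ["A", "C", "B", "A"]],
    "PaLM2":     [["A", "B", "B", "D"], ["A", "C", "B", "D"]],
}

def run_score(run):
    # Fraction of items answered correctly in a single run.
    return mean(a == k for a, k in zip(run, answer_key))

def average_score(runs):
    # Mean score across repeated runs of the same test.
    return mean(run_score(run) for run in runs)

def stability(runs):
    # Fraction of items answered identically in every run -- one plausible
    # way to operationalize "response stability".
    return mean(len(set(answers)) == 1 for answers in zip(*runs))

def agreement(runs_a, runs_b):
    # Item-level agreement between two models' first runs, a simple
    # stand-in for the inter-model correlation reported in the Results.
    return mean(a == b for a, b in zip(runs_a[0], runs_b[0]))

for model, runs in responses.items():
    print(f"{model}: avg score {average_score(runs):.2f}, "
          f"stability {stability(runs):.2f}")

for (m1, r1), (m2, r2) in combinations(responses.items(), 2):
    print(f"{m1} vs {m2}: agreement {agreement(r1, r2):.2f}")

Per-item consistency across repeated runs is only one way to define stability; the abstract does not specify the exact formula the authors used.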

Results: Among the LLMs, ChatGPT-4 and PaLM2 showed the strongest average correlation with each other. ChatGPT-4 also achieved the highest average score and expressed the greatest confidence during the exam. Relative to the human cohorts, ChatGPT-4 performed on par with the ophthalmologists but fell short of the funduscopic disease specialists.

Conclusion: The study shows that ChatGPT-4 performs exceptionally well in the domain of funduscopic disease. With continued enhancement, validated LLMs have the potential to deliver substantial benefits in healthcare for both patients and physicians.

Download full-text PDF

Source:
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC12207300 (PMC)
http://dx.doi.org/10.18240/ijo.2025.07.03 (DOI Listing)

Publication Analysis

Top Keywords

funduscopic disease: 20
large language: 8
language models: 8
chatgpt-4 palm2: 8
comparison human: 8
human cohorts: 8
funduscopic: 6
disease: 5
chatgpt-4: 5
assessing proficiency: 4
