Assessing the proficiency of large language models on funduscopic disease knowledge.

Int J Ophthalmol

Department of Ophthalmology, Shanghai General Hospital, Shanghai Jiao Tong University School of Medicine, National Clinical Research Center for Eye Diseases, Shanghai 200080, China.

Published: July 2025


Category Ranking: 98%
Total Visits: 921
Avg Visit Duration: 2 minutes
Citations: 20

Article Abstract

Aim: To assess the performance of five distinct large language models (LLMs; ChatGPT-3.5, ChatGPT-4, PaLM2, Claude 2, and SenseNova) in comparison to two human cohorts (a group of funduscopic disease experts and a group of ophthalmologists) on the specialized subject of funduscopic disease.

Methods: The five LLMs and the two human groups each independently completed a 100-item funduscopic disease test. Performance was assessed by comparing average scores, response stability, and answer confidence.
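
As a rough illustration (not the authors' actual code, data, or formulas), the Python sketch below shows one way the metrics named in the Methods could be computed, assuming each participant repeats the 100-item test several times. The model names, answer key, responses, and the simple agreement measure (a stand-in for the inter-model correlation mentioned in the Results) are all hypothetical placeholders.

# Hypothetical sketch of the evaluation metrics: average score, response
# stability, and inter-model agreement. Answer confidence is omitted here
# because it was elicited from the participants rather than computed.
from itertools import combinations
from statistics import mean

# Placeholder answer key and responses; the real test had 100 items.
# responses[model] is a list of runs; each run is a list of answers.
answer_key = ["A", "C", "B", "D"]
responses = {
    "ChatGPT-4": [["A", "C", "B", "D"], ["A", "C", "B", "A"]],
    "PaLM2":     [["A", "B", "B", "D"], ["A", "C", "B", "D"]],
}

def run_score(run):
    # Fraction of items answered correctly in a single run.
    return mean(a == k for a, k in zip(run, answer_key))

def average_score(runs):
    # Mean score across repeated runs of the same test.
    return mean(run_score(run) for run in runs)

def stability(runs):
    # Fraction of items answered identically in every run -- one plausible
    # way to operationalize "response stability".
    return mean(len(set(answers)) == 1 for answers in zip(*runs))

def agreement(runs_a, runs_b):
    # Item-level agreement between two models' first runs, a simple
    # stand-in for the inter-model correlation reported in the Results.
    return mean(a == b for a, b in zip(runs_a[0], runs_b[0]))

for model, runs in responses.items():
    print(f"{model}: avg score {average_score(runs):.2f}, "
          f"stability {stability(runs):.2f}")

for (m1, r1), (m2, r2) in combinations(responses.items(), 2):
    print(f"{m1} vs {m2}: agreement {agreement(r1, r2):.2f}")

Per-item consistency across repeated runs is only one way to define stability; the abstract does not specify the exact formula the authors used.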

Results: Among the LLMs, ChatGPT-4 and PaLM2 showed the strongest average correlation with each other. ChatGPT-4 also achieved the highest average score and expressed the greatest confidence during the exam. Relative to the human cohorts, ChatGPT-4 performed on par with the ophthalmologists but fell short of the funduscopic disease specialists.

Conclusion: The study shows that ChatGPT-4 performs exceptionally well in the domain of funduscopic disease. With continued enhancement, validated LLMs have the potential to deliver substantial benefits in healthcare for both patients and physicians.

Download full-text PDF

Source:
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC12207300 (PMC)
http://dx.doi.org/10.18240/ijo.2025.07.03 (DOI Listing)

Publication Analysis

Top Keywords

funduscopic disease: 20
large language: 8
language models: 8
chatgpt-4 palm2: 8
comparison human: 8
human cohorts: 8
funduscopic: 6
disease: 5
chatgpt-4: 5
assessing proficiency: 4
