Evaluating the reliability of the responses of large language models to keratoconus-related questions.

Mustafa Kayabaşı, Seher Köksaldı, Ceren Durmaz Engin

Clin Exp Optom

Department of Ophthalmology, Izmir Democracy University Buca Seyfi Demirsoy Education and Research Hospital, Izmir, Turkey.

Published: September 2025

Clinical Relevance: Artificial intelligence has evolved rapidly, and large language models (LLMs) have become promising tools for healthcare because they can provide human-like responses to questions. Their ability to answer questions related to keratoconus (KCN) has not previously been explored.

Background: This study evaluated the responses of three LLMs (ChatGPT-4, Copilot, and Gemini) to common patient questions regarding KCN.

Methods: Fifty real-life patient inquiries regarding general information, aetiology, symptoms and diagnosis, progression, and treatment of KCN were presented to the LLMs. Three ophthalmologists rated the answers on a 5-point Likert scale ranging from 'strongly disagree' to 'strongly agree'. The reliability of the responses was evaluated using the DISCERN and the Ensuring Quality Information for Patients (EQIP) scales. Readability metrics (Flesch Reading Ease Score, Flesch-Kincaid Grade Level, and Coleman-Liau Index) were calculated to evaluate the complexity of the responses.
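The three readability metrics named in the Methods have standard published formulas, and the following is a minimal Python sketch of how they could be computed for a single response. It is not the authors' tooling, which the abstract does not describe. The function names `count_syllables` and `readability` are hypothetical, and the syllable counter is a rough vowel-group heuristic (an assumption), so scores will differ slightly from those of dedicated readability software.

```python
import re


def count_syllables(word: str) -> int:
    """Approximate syllables as vowel groups (heuristic, an assumption)."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing 'e' as silent, except in '-le' endings such as 'table'.
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def readability(text: str) -> dict:
    """Return the Flesch Reading Ease Score, Flesch-Kincaid Grade Level,
    and Coleman-Liau Index for a piece of English text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:[-'][A-Za-z]+)*", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")

    n_sentences = len(sentences)
    n_words = len(words)
    n_syllables = sum(count_syllables(w) for w in words)
    n_letters = sum(ch.isalpha() for w in words for ch in w)

    words_per_sentence = n_words / n_sentences
    syllables_per_word = n_syllables / n_words
    letters_per_100_words = n_letters / n_words * 100
    sentences_per_100_words = n_sentences / n_words * 100

    return {
        # Higher score means easier text.
        "flesch_reading_ease": 206.835
        - 1.015 * words_per_sentence
        - 84.6 * syllables_per_word,
        # Approximate US school grade needed to understand the text.
        "flesch_kincaid_grade": 0.39 * words_per_sentence
        + 11.8 * syllables_per_word
        - 15.59,
        # Grade estimate based on letters rather than syllables.
        "coleman_liau_index": 0.0588 * letters_per_100_words
        - 0.296 * sentences_per_100_words
        - 15.8,
    }


if __name__ == "__main__":
    sample = (
        "Keratoconus is a progressive corneal ectasia "
        "characterised by stromal thinning."
    )
    for name, value in readability(sample).items():
        print(f"{name}: {value:.2f}")
```

A lower Flesch Reading Ease Score and higher grade-level values indicate text that is harder to read, which is the sense in which the Results describe lower readability and higher complexity.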

Results: ChatGPT-4 scored 3 points or higher for all (100%) of its responses, whereas Copilot had five (10%) and Gemini had two (4%) responses scoring 2 points or below. ChatGPT-4 achieved a 'strongly agree' rate of 74% across all questions, markedly higher than Copilot at 34% and Gemini at 42% (p < 0.001), and recorded the highest 'strongly agree' rates in the general information and symptoms and diagnosis categories (90% for both). The median Likert scores differed among the LLMs (p < 0.001), with ChatGPT-4 scoring highest and Copilot scoring lowest. Although ChatGPT-4 was more reliable according to the DISCERN scale, its responses had lower readability and higher complexity. All LLMs provided responses categorised as 'extremely difficult to read', but the responses provided by Copilot showed higher readability.

Conclusions: Although its responses had lower readability and greater complexity, ChatGPT-4 emerged as the most proficient LLM in answering KCN-related questions.

DOI: http://dx.doi.org/10.1080/08164622.2024.2419524
