Large language models' performances regarding common patient questions about osteoarthritis: A comparative analysis of ChatGPT-3.5, ChatGPT-4.0, and Perplexity.

J Sport Health Sci

Department of Orthopaedics and Traumatology, Faculty of Medicine, The Chinese University of Hong Kong, Hong Kong 999077, China; Center for Neuromusculoskeletal Restorative Medicine (CNRM), The Chinese University of Hong Kong, Hong Kong 999077, China.

Published: November 2024


Category Ranking: 98%
Total Visits: 921
Avg Visit Duration: 2 minutes
Citations: 20

Article Abstract

Background: Large language models (LLMs) have gained considerable attention and have, in part, replaced conventional search engines as a popular channel for obtaining information, owing to their contextually relevant responses. Osteoarthritis (OA) is a common musculoskeletal disorder, and patients often seek information about it online. Our study evaluated the ability of 3 LLMs (ChatGPT-3.5, ChatGPT-4.0, and Perplexity) to accurately answer common OA-related queries.

Methods: We defined 6 themes (pathogenesis, risk factors, clinical presentation, diagnosis, treatment and prevention, and prognosis) based on a generalization of 25 frequently asked questions about OA. Three consultant-level orthopedic specialists independently rated the LLMs' replies on a 4-point accuracy scale. The final ratings for each response were determined using a majority consensus approach. Responses classified as "satisfactory" were evaluated for comprehensiveness on a 5-point scale.
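As a rough illustration of the majority-consensus step, here is a minimal sketch in Python; the function name, rating labels, and tie-handling rule are assumptions for illustration, not the authors' protocol:

    from collections import Counter

    def consensus_rating(ratings):
        # Return the rating chosen by at least 2 of the 3 raters.
        # Assumption: a three-way split is flagged for discussion
        # (returned as None) rather than scored automatically.
        rating, votes = Counter(ratings).most_common(1)[0]
        return rating if votes >= 2 else None

    # Two raters say "excellent", one says "satisfactory" -> "excellent"
    print(consensus_rating(["excellent", "excellent", "satisfactory"]))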

Results: ChatGPT-4.0 demonstrated superior accuracy, with 64% of responses rated as "excellent", compared to 40% for ChatGPT-3.5 and 28% for Perplexity (Pearson's χ² test with Fisher's exact test, all p < 0.001). All 3 LLM-chatbots had high mean comprehensiveness ratings (Perplexity = 3.88; ChatGPT-4.0 = 4.56; ChatGPT-3.5 = 3.96, out of a maximum score of 5). The LLM-chatbots performed reliably across domains, except for "treatment and prevention". However, ChatGPT-4.0 still outperformed ChatGPT-3.5 and Perplexity in that domain, garnering 53.8% "excellent" ratings (Pearson's χ² test with Fisher's exact test, all p < 0.001).
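For readers who want to reproduce the flavor of these comparisons, a minimal sketch using scipy follows; the 2×2 counts are back-calculated from the reported "excellent" rates over 25 questions and are illustrative only (the paper's p-values presumably derive from the full 4-level rating distributions, not this collapsed view):

    from scipy.stats import chi2_contingency, fisher_exact

    # "Excellent" vs. not-excellent counts out of 25 questions,
    # reconstructed from the reported rates (64% vs. 40%).
    gpt4  = [16, 9]   # ChatGPT-4.0
    gpt35 = [10, 15]  # ChatGPT-3.5

    table = [gpt4, gpt35]
    chi2, p_chi2, dof, _ = chi2_contingency(table)
    _, p_fisher = fisher_exact(table)  # exact test for small cell counts
    print(f"chi-square p = {p_chi2:.3f}; Fisher exact p = {p_fisher:.3f}")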

Conclusion: Our findings underscore the potential of LLMs, specifically ChatGPT-4.0 and Perplexity, to deliver accurate and thorough responses to OA-related queries. Targeted correction of specific misconceptions to improve the accuracy of LLMs remains crucial.


Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC12268069
DOI: http://dx.doi.org/10.1016/j.jshs.2024.101016

Publication Analysis

Top Keywords

chatgpt-4.0 perplexity: 12
large language: 8
chatgpt-3.5 chatgpt-4.0: 8
pearson's test: 8
test fisher's: 8
fisher's exact: 8
exact test: 8
chatgpt-4.0: 5
perplexity: 5
language models': 4
