Large language models' performances regarding common patient questions about osteoarthritis: A comparative analysis of ChatGPT-3.5, ChatGPT-4.0, and Perplexity.

J Sport Health Sci

Department of Orthopaedics and Traumatology, Faculty of Medicine, The Chinese University of Hong Kong, Hong Kong 999077, China; Center for Neuromusculoskeletal Restorative Medicine (CNRM), The Chinese University of Hong Kong, Hong Kong 999077, China.

Published: November 2024


Category Ranking: 98%
Total Visits: 921
Avg Visit Duration: 2 minutes
Citations: 20

Article Abstract

Background: Large language models (LLMs) have gained considerable attention and have, in part, replaced conventional search engines as a popular channel for obtaining information, owing to their contextually relevant responses. Osteoarthritis (OA) is a common musculoskeletal disorder, and patients often seek information about it online. Our study evaluated the ability of 3 LLMs (ChatGPT-3.5, ChatGPT-4.0, and Perplexity) to accurately answer common OA-related queries.

Methods: We defined 6 themes (pathogenesis, risk factors, clinical presentation, diagnosis, treatment and prevention, and prognosis) based on a generalization of 25 frequently asked questions about OA. Three consultant-level orthopedic specialists independently rated the LLMs' replies on a 4-point accuracy scale. The final ratings for each response were determined using a majority consensus approach. Responses classified as "satisfactory" were evaluated for comprehensiveness on a 5-point scale.
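As a rough illustration of the majority-consensus step, here is a minimal sketch in Python; the function name, rating labels, and tie-handling rule are assumptions for illustration, not the authors' protocol:

    from collections import Counter

    def consensus_rating(ratings):
        # Return the rating chosen by at least 2 of the 3 raters.
        # Assumption: a three-way split is flagged for discussion
        # (returned as None) rather than scored automatically.
        rating, votes = Counter(ratings).most_common(1)[0]
        return rating if votes >= 2 else None

    # Two raters say "excellent", one says "satisfactory" -> "excellent"
    print(consensus_rating(["excellent", "excellent", "satisfactory"]))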

Results: ChatGPT-4.0 demonstrated superior accuracy, with 64% of responses rated as "excellent", compared to 40% for ChatGPT-3.5 and 28% for Perplexity (Pearson's χ² test with Fisher's exact test, all p < 0.001). All 3 LLM-chatbots had high mean comprehensiveness ratings (Perplexity = 3.88; ChatGPT-4.0 = 4.56; ChatGPT-3.5 = 3.96, out of a maximum score of 5). The LLM-chatbots performed reliably across domains, except for "treatment and prevention". However, ChatGPT-4.0 still outperformed ChatGPT-3.5 and Perplexity in that domain, garnering 53.8% "excellent" ratings (Pearson's χ² test with Fisher's exact test, all p < 0.001).
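For readers who want to reproduce the flavor of these comparisons, a minimal sketch using scipy follows; the 2×2 counts are back-calculated from the reported "excellent" rates over 25 questions and are illustrative only (the paper's p-values presumably derive from the full 4-level rating distributions, not this collapsed view):

    from scipy.stats import chi2_contingency, fisher_exact

    # "Excellent" vs. not-excellent counts out of 25 questions,
    # reconstructed from the reported rates (64% vs. 40%).
    gpt4  = [16, 9]   # ChatGPT-4.0
    gpt35 = [10, 15]  # ChatGPT-3.5

    table = [gpt4, gpt35]
    chi2, p_chi2, dof, _ = chi2_contingency(table)
    _, p_fisher = fisher_exact(table)  # exact test for small cell counts
    print(f"chi-square p = {p_chi2:.3f}; Fisher exact p = {p_fisher:.3f}")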

Conclusion: Our findings underscore the potential of LLMs, specifically ChatGPT-4.0 and Perplexity, to deliver accurate and thorough responses to OA-related queries. Targeted correction of specific misconceptions to improve the accuracy of LLMs remains crucial.


Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC12268069
DOI: http://dx.doi.org/10.1016/j.jshs.2024.101016

Publication Analysis

Top Keywords

chatgpt-4.0 perplexity: 12
large language: 8
chatgpt-3.5 chatgpt-4.0: 8
pearson's test: 8
test fisher's: 8
fisher's exact: 8
exact test: 8
chatgpt-4.0: 5
perplexity: 5
language models': 4
