Objective: Whether large language models (LLMs) can effectively support knowledge acquisition in Chinese medicine (CM) remains uncertain. This study assesses the adherence of LLMs to Clinical Practice Guidelines (CPGs) in CM.
Methods: This cross-sectional study randomly selected ten CPGs in CM and constructed 150 questions across three categories: medication based on differential diagnosis (MDD), specific prescription consultation (SPC), and CM theory analysis (CTA). Eight LLMs (GPT-4o, Claude-3.5 Sonnet, Moonshot-v1, ChatGLM-4, DeepSeek-v3, DeepSeek-r1, Claude-4 sonnet, and Claude-4 sonnet thinking) were evaluated using both English and Chinese queries. The main evaluation metrics included accuracy, readability, and use of safety disclaimers.
Results: Overall, DeepSeek-v3 and DeepSeek-r1 performed best in both English (median 5.00, interquartile range (IQR) 4.00-5.00 vs. median 5.00, IQR 3.70-5.00) and Chinese (both median 5.00, IQR 4.30-5.00), significantly outperforming all other models. All models were significantly more accurate in Chinese than in English (all p < 0.05). Accuracy varied significantly across question categories, with MDD and SPC questions proving more challenging than CTA questions. English responses were less readable (mean Flesch Reading Ease score 32.7) than Chinese responses. Moonshot-v1 provided safety disclaimers at the highest rate (98.7% in English, 100% in Chinese).
Conclusion: LLMs showed varying potential as tools for acquiring CM knowledge. DeepSeek-v3 and DeepSeek-r1 performed satisfactorily. Optimizing LLMs to disseminate CM information effectively is an important direction for future development.
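For readers unfamiliar with the readability metric reported in the Results, the following is a minimal sketch of the standard Flesch Reading Ease formula for English text. It is not taken from the study, which does not state what tool was used; the function name and the example counts are hypothetical. Under the conventional interpretation of this scale, scores between 30 and 50 are described as difficult to read, which is where the reported mean of 32.7 falls.

```python
def flesch_reading_ease(total_words: int, total_sentences: int, total_syllables: int) -> float:
    """Standard Flesch Reading Ease score computed from raw text counts.

    Higher scores mean easier text. The formula applies to English only.
    """
    if total_words <= 0 or total_sentences <= 0:
        raise ValueError("word and sentence counts must be positive")
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# Hypothetical counts for a single response (not data from the study).
score = flesch_reading_ease(total_words=100, total_sentences=5, total_syllables=150)
print(round(score, 3))
```

The sketch takes precomputed counts on purpose: syllable counting is heuristic and tool-dependent, so scores from different tools may differ slightly for the same text.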
Download full-text PDF |
Source |
---|---|
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC12331602 | PMC |
http://dx.doi.org/10.3389/fphar.2025.1649041 | DOI Listing |
Int J Surg
September 2025
Department of Urology and Andrology Laboratory, West China Hospital, Sichuan University, Chengdu, Sichuan Province, China.
Objective: The performance of large language models (LLMs) in complex clinical reasoning tasks is not well established. This study compares ChatGPT (GPT-3.5, GPT-4) and DeepSeek (DeepSeek-V3, DeepSeek-R1) in the Chinese anesthesiology attending physician examination (CAAPE), aiming to set AI benchmarks in medical assessments and enhance AI-driven medical education.
Int J Med Inform
December 2025
Department of Intensive Care Medicine, Affiliated Hospital of Southwest Jiaotong University, The Third People's Hospital of Chengdu, Chengdu, Sichuan, China.
Background: Large language models (LLMs) are increasingly used in clinical decision support, and newly developed models have shown promising potential, yet their diagnostic performance for critically ill patients in intensive care unit (ICU) settings remains underexplored. This study evaluated the diagnostic accuracy, differential diagnosis quality, and response quality of four newly developed LLMs in critical illness cases.
Methods: In this cross-sectional comparative study, four newly developed LLMs (ChatGPT-4o, ChatGPT-o3, DeepSeek-V3, and DeepSeek-R1) were evaluated using 50 critical illness cases in ICU settings from published literature.
Sci Rep
August 2025
Department of Surgical Oncology, Taizhou Campus of Zhejiang Cancer Hospital (Taizhou Cancer Hospital), No. 50, Zhenxin Road, Xinhe Town, Wenling, 317502, Taizhou, China.
This study aims to investigate and compare the diagnostic performance, disease interpretation reliability, and treatment recommendation capabilities of multiple advanced large language models (GPT-4o, DeepSeek-R1, and DeepSeek-V3) in breast tumor cases. It retrospectively collected comprehensive clinical records of patients with breast tumors treated at Taizhou Cancer Hospital between January and April 2024. The study evaluated the accuracy of tumor classification (benign vs.
Front Pharmacol
July 2025
Department of Health Policy and Management, School of Public Health, Lanzhou University, Lanzhou, China.