Background: As advancements in artificial intelligence (AI) continue, large language models (LLMs) have emerged as promising tools for generating medical information. Their rapid adaptation and potential benefits in health care require rigorous assessment in terms of the quality, accuracy, and safety of the generated information across diverse medical specialties.
Objective: This study aimed to evaluate the performance of 4 prominent LLMs, namely, Claude-instant-v1.0, GPT-3.5-Turbo, Command-xlarge-nightly, and Bloomz, in generating medical content spanning the clinical specialties of ophthalmology, orthopedics, and dermatology.
Methods: Three domain-specific physicians evaluated the AI-generated therapeutic recommendations for a diverse set of 60 diseases. The evaluation criteria involved the mDISCERN score, correctness, and potential harmfulness of the recommendations. ANOVA and pairwise t tests were used to explore discrepancies in content quality and safety across models and specialties. Additionally, using the capabilities of OpenAI's most advanced model, GPT-4, an automated evaluation of each model's responses to the diseases was performed using the same criteria and compared to the physicians' assessments through Pearson correlation analysis.
Results: Claude-instant-v1.0 emerged with the highest mean mDISCERN score (3.35, 95% CI 3.23-3.46). In contrast, Bloomz lagged with the lowest score (1.07, 95% CI 1.03-1.10). Our analysis revealed significant differences among the models in terms of quality (P<.001). Evaluating their reliability, the models displayed strong contrasts in their falseness ratings, with variations both across models (P<.001) and specialties (P<.001). Distinct error patterns emerged, such as confusing diagnoses; providing vague, ambiguous advice; or omitting critical treatments, such as antibiotics for infectious diseases. Regarding potential harm, GPT-3.5-Turbo was found to be the safest, with the lowest harmfulness rating. All models lagged in detailing the risks associated with treatment procedures, explaining the effects of therapies on quality of life, and offering additional sources of information. Pearson correlation analysis underscored a substantial alignment between physician assessments and GPT-4's evaluations across all established criteria (P<.01).
Conclusions: This study, while comprehensive, was limited by the involvement of a select number of specialties and physician evaluators. The straightforward prompting strategy ("How to treat…") and the assessment benchmarks, initially conceptualized for human-authored content, might have potential gaps in capturing the nuances of AI-driven information. The LLMs evaluated showed a notable capability in generating valuable medical content; however, evident lapses in content quality and potential harm signal the need for further refinements. Given the dynamic landscape of LLMs, this study's findings emphasize the need for regular and methodical assessments, oversight, and fine-tuning of these AI tools to ensure they produce consistently trustworthy and clinically safe medical advice. Notably, the introduction of an auto-evaluation mechanism using GPT-4, as detailed in this study, provides a scalable, transferable method for domain-agnostic evaluations, extending beyond therapy recommendation assessments.
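The statistical workflow described in the Methods (comparing automated GPT-4 ratings against physician ratings via Pearson correlation, and testing score differences across models with one-way ANOVA) can be sketched as follows. This is a minimal illustration on synthetic ratings, not the study's data or code; all values and sample sizes are invented for demonstration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical mDISCERN-style ratings (1-5) for 60 diseases from one model,
# scored once by a physician and once by GPT-4 (data are synthetic).
physician = rng.integers(1, 6, size=60).astype(float)
gpt4 = np.clip(physician + rng.normal(0, 0.5, size=60), 1, 5)

# Pearson correlation between the two raters, as used to compare
# GPT-4's auto-evaluation with the physicians' assessments.
r, p = stats.pearsonr(physician, gpt4)
print(f"Pearson r = {r:.2f}, p = {p:.3g}")

# One-way ANOVA across hypothetical per-model score distributions
# (means loosely echoing the reported range, 1.07 to 3.35).
scores_by_model = [rng.normal(mu, 0.4, size=60) for mu in (3.35, 2.8, 2.1, 1.07)]
f_stat, p_anova = stats.f_oneway(*scores_by_model)
print(f"ANOVA F = {f_stat:.1f}, p = {p_anova:.3g}")
```

With a significant omnibus ANOVA, the study's pairwise t tests (e.g., `stats.ttest_ind`) would then locate which model pairs differ.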
| Download full-text PDF | Source |
|---|---|
| http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10644179 | PMC |
| http://dx.doi.org/10.2196/49324 | DOI Listing |
Surg Endosc
September 2025
Department of Thoracic Surgery, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, 100021, China.
Background: Surgical resection is the cornerstone for early-stage non-small cell lung cancer (NSCLC), with lobectomy historically standard. Evolving techniques have spurred debate comparing lobectomy and segmentectomy. This study analyzed early postoperative patient-reported symptoms and functional status in patients with early NSCLC undergoing either procedure.
J Cancer Res Clin Oncol
September 2025
Department of Surgery, Mannheim School of Medicine, Medical Faculty Mannheim, Heidelberg University, Mannheim, Germany.
Purpose: The study aims to compare the treatment recommendations generated by four leading large language models (LLMs) with those from 21 sarcoma centers' multidisciplinary tumor boards (MTBs) of the sarcoma ring trial in managing complex soft tissue sarcoma (STS) cases.
Methods: We simulated STS-MTBs using four LLMs: Llama 3.2-vision:90b, Claude 3.
J Med Internet Res
September 2025
Washington University in St. Louis, 660 South Euclid Avenue, Campus Box 8054, St Louis, MO, United States, 1 3142737801.
Background: Clinical communication is central to the delivery of effective, timely, and safe patient care. The use of text-based tools for clinician-to-clinician communication-commonly referred to as secure messaging-has increased exponentially over the past decade. The use of secure messaging has a potential impact on clinician work behaviors, workload, and cognitive burden.
J Allergy Clin Immunol
September 2025
University of Groningen, University Medical Center Groningen, Beatrix Children's Hospital, Department of Pediatric Pulmonology and Pediatric Allergology, Groningen, the Netherlands; University of Groningen, University Medical Center Groningen, Groningen Research Institute for Asthma and COPD (GRIAC)
Artificial intelligence (AI) is increasingly recognized for its capacity to transform medicine. While publications applying AI in allergy and immunology have increased, clinical implementation substantially lags behind other specialties. By mid-2024, over 1,000 FDA-approved AI-enabled medical devices existed, but none specifically addressed allergy and immunology.
Dtsch Med Wochenschr
September 2025
Corporate Member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Charité Universitätsmedizin Berlin, Berlin, Deutschland.
Since 2022, an estimated 150,000 to 200,000 patients with heart failure (HF) in Germany have met the inclusion criteria for HF telemonitoring in accordance with the Federal Joint Committee's (G-BA) decision. Currently, only a few artificial intelligence (AI) applications are used in standard cardiovascular telemedicine care. However, AI applications could improve the predictive accuracy of existing telemedical sensor technology by recognizing patterns across multiple data sources.