Comparing large language models for antibiotic prescribing in different clinical scenarios: which performs better?

Andrea De Vito, Nicholas Geremia, Davide Fiore Bavaro, Susan K Seo, Justin Laracy, Maria Mazzitelli, Andrea Marino, Alberto Enrico Maraolo, Antonio Russo, Agnese Colpani, Michele Bartoletti, Anna Maria Cattelan, Cristina Mussini, Saverio Giuseppe Parisi, Luigi Angelo Vaira, Giuseppe Nunnari, Giordano Madeddu

Clin Microbiol Infect

Unit of Infectious Diseases, Department of Medicine, Surgery and Pharmacy, Sassari, Italy.

Published: August 2025

Objectives: Large language models (LLMs) show promise in clinical decision-making, but comparative evaluations of their antibiotic prescribing accuracy are limited. This study assesses the performance of various LLMs in recommending antibiotic treatments across diverse clinical scenarios.

Methods: Fourteen LLMs, including standard and premium versions of ChatGPT, Claude, Copilot, Gemini, Le Chat, Grok, Perplexity, and Pi.ai, were evaluated using 60 clinical cases with antibiograms covering 10 infection types. A standardized prompt was used for antibiotic recommendations focusing on drug choice, dosage, and treatment duration. Responses were anonymized and reviewed by a blinded expert panel assessing antibiotic appropriateness, dosage correctness, and duration adequacy.

Results: A total of 840 responses were collected and analysed. ChatGPT-o1 demonstrated the highest accuracy in antibiotic prescriptions, with 71.7% (43/60) of its recommendations classified as correct and only one (1.7%) classified as incorrect. Gemini and Claude 3 Opus had the lowest accuracy. Dosage correctness was highest for ChatGPT-o1 (96.7%, 58/60), followed by Claude 3.5 Sonnet (91.7%, 55/60) and Perplexity Pro (90.0%, 54/60). For treatment duration, Gemini provided the most appropriate recommendations (75.0%, 45/60), whereas Claude 3.5 Sonnet tended to recommend excessively long durations. Performance declined with increasing case complexity, particularly for difficult-to-treat microorganisms.
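The counts and percentages reported above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch, not part of the study: it only recomputes the figures quoted in the abstract, and the names `reported`, `pct`, `N_MODELS`, and `N_CASES` are illustrative assumptions.

```python
# Recompute the percentages quoted in the Results from their raw counts.
# All numbers come from the abstract; the variable names are hypothetical.

N_MODELS = 14   # LLMs evaluated (Methods)
N_CASES = 60    # clinical cases per model (Methods)

# label -> (count, denominator, percentage as reported)
reported = {
    "ChatGPT-o1, antibiotic correct": (43, 60, 71.7),
    "ChatGPT-o1, antibiotic incorrect": (1, 60, 1.7),
    "ChatGPT-o1, dosage correct": (58, 60, 96.7),
    "Claude 3.5 Sonnet, dosage correct": (55, 60, 91.7),
    "Perplexity Pro, dosage correct": (54, 60, 90.0),
    "Gemini, duration appropriate": (45, 60, 75.0),
}


def pct(count: int, total: int) -> float:
    """Return count/total as a percentage rounded to one decimal place."""
    return round(100 * count / total, 1)


total_responses = N_MODELS * N_CASES

# True for each label whose recomputed percentage matches the reported one.
matches = {
    label: pct(count, total) == stated
    for label, (count, total, stated) in reported.items()
}

if __name__ == "__main__":
    print(f"Total responses: {total_responses}")
    for label, (count, total, stated) in reported.items():
        print(f"{label}: {count}/{total} = {pct(count, total)}% (reported {stated}%)")
```

Each model answered every case once, so the total number of responses is the product of the two study dimensions, and every reported percentage uses the 60 cases as its denominator.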

Discussion: LLMs vary substantially in their ability to prescribe appropriate antibiotics, dosages, and treatment durations. ChatGPT-o1 outperformed the other models, indicating the potential of advanced LLMs as decision-support tools in antibiotic prescribing. However, decreased accuracy in complex cases and inconsistencies among models highlight the need for careful validation before clinical use.

DOI: http://dx.doi.org/10.1016/j.cmi.2025.03.002

