Extracting Clinical Guideline Information Using Two Large Language Models: Evaluation Study.

Hsing-Yu Hsu , Lu-Wen Chen , Wan-Tseng Hsu , Yow-Wen Hsieh , Shih-Sheng Chang

J Med Internet Res

Artificial Intelligence Center, China Medical University Hospital, 2, Yude Road, Taichung, 404327, Taiwan, 886 4-22052121.

Published: September 2025

Category Ranking

98%

Total Visits

921

Avg Visit Duration

2 minutes

Citations

Background: The effective implementation of personalized pharmacogenomics (PGx) requires the integration of released clinical guidelines into decision support systems to facilitate clinical applications. Large language models (LLMs) can be valuable tools for automating information extraction and updates.

Objective: This study aimed to assess the effectiveness of repeated cross-comparisons and an agreement-threshold strategy in 2 advanced LLMs as supportive tools for updating information.

Methods: The study evaluated the performance of 2 LLMs, GPT-4o and Gemini-1.5-Pro, in extracting PGx clinical guidelines and comparing their outputs with expert-annotated evaluations. The 2 LLMs classified 385 PGx clinical guidelines, with each recommendation tested 20 times per model. Accuracy was assessed by comparing the results with manually labeled data. Two prospectively defined strategies were used to identify inconsistent predictions. The first involved repeated cross-comparison, flagging discrepancies between the most frequent classifications from each model. The second used a consistency threshold strategy, which designated predictions appearing in less than 60% of the 40 combined outputs as unstable. Cases flagged by either strategy were subjected to manual review. This study also estimated the overall cost of model use and was conducted between October 1 and November 30, 2024.

Results: GPT-4o and Gemini-1.5-Pro yielded reproducibility rates of 97.8% (7534/7700) and 98.9% (7612/7700), respectively, based on the most frequent classification for each query. Compared with expert labels, GPT-4o achieved 93.5% accuracy (Cohen κ=0.90; P<.001) and Gemini-1.5-Pro 92.7% accuracy (Cohen κ=0.89; P<.001). Both models demonstrated high overall performance, with comparable weighted average F1-scores (GPT-4o: 0.929; Gemini: 0.935). The models generated consistent predictions for 341 of 385 guideline items, reducing the need for manual review by 88.6%. Among these agreed-upon cases, only one (0.3%) diverged from expert labels. Applying a predefined agreement-threshold strategy further reduced the number of priority manual review cases to 2.9% (11/385), although the error rate slightly increased to 0.5% (2/374). The inconsistencies identified through these methods prompted the prioritization of manual review to minimize errors and enhance clinical applicability. The total combined cost of using both LLMs was only US $0.76.

Conclusions: These findings suggest that using 2 LLMs can effectively streamline PGx guideline integration into clinical decision support systems while maintaining high performance and minimal cost. Although selective manual review remains necessary, this approach offers a practical and scalable solution for PGx guideline classification in clinical workflows.

Download full-text PDF	Source
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC12413144	PMC
http://dx.doi.org/10.2196/73486	DOI Listing

Publication Analysis

Top Keywords

clinical guidelines

large language

language models

gpt-4o gemini-15-pro

pgx clinical

extracting clinical

clinical guideline

guideline large

models evaluation

study

A PHP Error was encountered