Electronic Health Records (EHRs) store vast amounts of clinical information, making it difficult for healthcare providers to summarize and synthesize the details relevant to their practice. To reduce this cognitive load, generative AI systems built on large language models (LLMs) have emerged to automatically summarize patient records into clear, actionable insights. However, LLM summaries must be precise and free from errors, making evaluation of summary quality necessary. While human experts are the gold standard for evaluation, their involvement is time-consuming and costly. We therefore introduce and validate an automated method for evaluating real-world EHR multi-document summaries that uses an LLM as the evaluator, referred to as LLM-as-a-Judge. Benchmarked against the validated Provider Documentation Summarization Quality Instrument (PDSQI-9) for human evaluation, our LLM-as-a-Judge framework demonstrated strong inter-rater reliability with human evaluators. GPT-o3-mini achieved the highest intraclass correlation coefficient, 0.818 (95% CI 0.772-0.854), with a median score difference of 0 from human evaluators, and completed evaluations in just 22 seconds. Overall, reasoning models excelled in inter-rater reliability, particularly on evaluations requiring advanced reasoning and domain expertise, outperforming non-reasoning models, models trained on the task, and multi-agent workflows. Cross-task validation on the Problem Summarization task similarly confirmed high reliability. By automating high-quality evaluations, a medical LLM-as-a-Judge offers a scalable, efficient solution for rapidly identifying accurate and safe AI-generated summaries in healthcare settings.
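The abstract does not include code; the sketch below is only a rough illustration of how an LLM-as-a-Judge evaluation of this kind might be wired up: a judge model is prompted to score a summary against a PDSQI-9-style rubric, and agreement with human raters is then checked with an intraclass correlation coefficient. The OpenAI client, the model name, the rubric wording, and the `pingouin` agreement check are all assumptions for illustration, not the authors' actual pipeline.

```python
# Hypothetical sketch of an LLM-as-a-Judge scoring loop (not the paper's code).
# Assumes an OpenAI-compatible API and the `pingouin` package for ICC.
import json

import pandas as pd
import pingouin as pg
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
JUDGE_MODEL = "o3-mini"  # placeholder; the paper reports results for GPT-o3-mini

RUBRIC = (
    "You are evaluating an AI-generated summary of a patient's records against "
    'a PDSQI-9-style quality rubric. Return JSON: {"score": <integer 1-5>}.'
)

def judge_score(source_notes: str, summary: str) -> int:
    """Ask the judge model for a single 1-5 quality score."""
    resp = client.chat.completions.create(
        model=JUDGE_MODEL,
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"SOURCE NOTES:\n{source_notes}\n\nSUMMARY:\n{summary}"},
        ],
        response_format={"type": "json_object"},
    )
    return int(json.loads(resp.choices[0].message.content)["score"])

def icc_vs_humans(scores: pd.DataFrame) -> pd.DataFrame:
    """Agreement between the judge and human raters.

    `scores` is long-format with columns: summary_id, rater ('human1', ...,
    'llm'), score. Returns the ICC table, analogous to the 0.818 reported
    in the abstract for GPT-o3-mini.
    """
    return pg.intraclass_corr(
        data=scores, targets="summary_id", raters="rater", ratings="score"
    )
```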
Download full-text PDF | Source
---|---
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC12045442 | PMC
http://dx.doi.org/10.1101/2025.04.22.25326219 | DOI Listing
Immunotherapy
September 2025
Guangzhou Institute of Respiratory Health, State Key Laboratory of Respiratory Disease, National Clinical Research Center for Respiratory Disease, National Center for Respiratory Medicine, The First Affiliated Hospital of Guangzhou Medical University, Guangzhou, China.
J Midwifery Womens Health
September 2025
General Education Department Chair, Midwives College of Utah, Salt Lake City, Utah.
Applications driven by large language models (LLMs) are reshaping higher education by offering innovative tools that enhance learning, streamline administrative tasks, and support scholarly work. However, their integration into educational institutions raises ethical concerns related to bias, misinformation, and academic integrity, necessitating thoughtful institutional responses. This article explores the evolving role of LLMs in midwifery higher education, providing historical context, key capabilities, and ethical considerations.
J Child Lang
September 2025
Department of Psychology, University of Toronto Mississauga, Mississauga, Ontario, Canada.
A growing literature explores the representational detail of infants' early lexical representations, but no study has investigated how exposure to real-life acoustic-phonetic variation impacts these representations. Indeed, previous experimental work with young infants has largely ignored the impact of accent exposure on lexical development. We ask how routine exposure to accent variation affects 6-month-olds' ability to detect mispronunciations.
Objectives: The primary aim of this study was to compare resource utilization between lower- and higher-risk brief resolved unexplained events (BRUE) in the general (GED) and pediatric (PED) emergency departments.
Methods: We conducted a retrospective chart review of BRUE cases from a large health system over a 6.5-year period. Our primary outcome was the count of diagnostic tests per encounter.
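The abstract does not state the statistical model used. One conventional way to compare a count outcome such as tests per encounter across risk tier and department is a Poisson (or, if counts are overdispersed, negative binomial) regression; the sketch below, with hypothetical column names and toy data, is illustrative only.

```python
# Illustrative sketch only: a Poisson regression for a count outcome
# (diagnostic tests per encounter), with hypothetical column names.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical encounter-level data: one row per BRUE encounter.
df = pd.DataFrame({
    "tests": [0, 2, 1, 5, 3, 0, 4, 2],                   # tests ordered
    "risk": ["lower", "lower", "higher", "higher"] * 2,  # BRUE risk tier
    "dept": ["PED"] * 4 + ["GED"] * 4,                   # ED type
})

# Poisson model with a risk-by-department interaction; smf.negativebinomial
# would be preferred if the counts are overdispersed.
model = smf.poisson("tests ~ risk * dept", data=df).fit()
print(model.summary())
print(np.exp(model.params).round(2))  # rate ratios for tests per encounter
```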
J Imaging Inform Med
September 2025
Department of Diagnostic, Interventional and Pediatric Radiology (DIPR), Inselspital, Bern University Hospital and University of Bern, Bern, Switzerland.
Large language models (LLMs) have been successfully used for data extraction from free-text radiology reports. Most studies to date have used LLMs accessed via an application programming interface (API). We evaluated the feasibility of using open-source LLMs, deployed on limited local hardware, for data extraction from free-text mammography reports, using a common data element (CDE)-based structure.
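The abstract does not describe the local deployment stack. As a minimal sketch, the snippet below assumes a model served locally through Ollama's HTTP API and asks it to fill a small, hypothetical set of mammography CDEs as JSON; the endpoint, model name, and CDE fields are all assumptions, not the study's actual setup.

```python
# Minimal sketch: CDE-style extraction from a free-text mammography report
# using a locally served open-source LLM (assumes an Ollama server on the
# default port; the model name and CDE fields are illustrative).
import json

import requests

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "llama3"  # placeholder open-source model

CDE_TEMPLATE = {
    "breast_density": None,   # e.g., BI-RADS density a-d
    "birads_category": None,  # e.g., 0-6
    "mass_present": None,     # true / false
}

def extract_cdes(report_text: str) -> dict:
    """Prompt the local model to return the CDE template filled in as JSON."""
    prompt = (
        "Extract the following common data elements from the mammography "
        f"report as JSON matching this template: {json.dumps(CDE_TEMPLATE)}\n\n"
        f"REPORT:\n{report_text}\nReturn only JSON."
    )
    resp = requests.post(
        OLLAMA_URL,
        json={"model": MODEL, "prompt": prompt, "stream": False, "format": "json"},
        timeout=120,
    )
    resp.raise_for_status()
    return json.loads(resp.json()["response"])

print(extract_cdes("Scattered fibroglandular densities. No suspicious mass. BI-RADS 1."))
```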