Article Abstract

Background: The advancement of large language models (LLMs) offers significant opportunities for health care, particularly in the generation of medical documentation. However, challenges related to ensuring the accuracy and reliability of LLM outputs, coupled with the absence of established quality standards, have raised concerns about their clinical application.

Objective: This study aimed to develop and validate an evaluation framework for assessing the accuracy and clinical applicability of LLM-generated emergency department (ED) records, with the goal of enhancing artificial intelligence integration in health care documentation.

Methods: We organized the Healthcare Prompt-a-thon, a competitive event designed to explore the capabilities of LLMs in generating accurate medical records. The event involved 52 participants who generated 33 initial ED records using HyperCLOVA X, a Korean-specialized LLM. We applied a dual evaluation approach. First, clinical evaluation: 4 medical professionals evaluated the records using a 5-point Likert scale across 5 criteria: appropriateness, accuracy, structure/format, conciseness, and clinical validity. Second, quantitative evaluation: we developed a framework to categorize and count errors in the LLM outputs, identifying 7 key error types. Statistical methods, including Pearson correlation and intraclass correlation coefficients (ICC), were used to assess consistency and agreement among evaluators.
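As an illustration of the interrater agreement statistic named in the Methods, the following is a minimal sketch of one common ICC variant. The abstract does not state which ICC form the authors used, so the two-way random-effects, absolute-agreement, single-rater form (often written ICC(2,1)) is an assumption here, and the function name `icc_2_1` and the ratings matrix are hypothetical.

```python
def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is a list of rows, one row per rated record and one column
    per rater. This ICC form is an assumption; the abstract does not
    specify which variant was computed.
    """
    if not ratings or len(ratings) < 2:
        raise ValueError("need at least two rated records")
    n = len(ratings)
    k = len(ratings[0])
    if k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a rectangular matrix with at least two raters")

    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares: records (rows), raters (columns), residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    denom = ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    if denom == 0:
        raise ValueError("ICC is undefined when all ratings are identical")
    return (ms_rows - ms_error) / denom


# Hypothetical 5-point Likert scores: 5 records (rows) x 4 raters (columns).
example_ratings = [
    [5, 4, 5, 4],
    [3, 3, 2, 3],
    [4, 4, 4, 5],
    [2, 1, 2, 2],
    [4, 3, 4, 4],
]
print(round(icc_2_1(example_ratings), 3))
```

Values near 1 indicate that raters assign nearly the same score to each record; a systematic offset between raters lowers this absolute-agreement form even when their rankings match.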

Results: The clinical evaluation demonstrated strong interrater reliability, with ICC values ranging from 0.653 to 0.887 (P<.001), and a test-retest reliability Pearson correlation coefficient of 0.776 (P<.001). Quantitative analysis revealed that invalid generation errors were the most common, constituting 35.38% of total errors, while structural malformation errors had the most significant negative impact on the clinical evaluation score (Pearson r=-0.654; P<.001). A strong negative correlation was found between the number of quantitative errors and clinical evaluation scores (Pearson r=-0.633; P<.001), indicating that higher error rates corresponded to lower clinical acceptability.
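To make the reported relationship between error counts and clinical scores concrete, here is a minimal sketch of the Pearson correlation coefficient. The data are hypothetical and do not come from the study; the function name `pearson_r` is also illustrative.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical values: errors counted per record and the mean clinical
# score (5-point Likert scale) that the same record received.
error_counts = [0, 1, 2, 3, 5, 8]
clinical_scores = [4.8, 4.5, 4.1, 3.6, 3.0, 2.2]

r = pearson_r(error_counts, clinical_scores)
print(f"Pearson r = {r:.3f}")  # negative: more errors, lower scores
```

A negative coefficient, as in the study's reported r=-0.633, means records with more counted errors tended to receive lower clinical evaluation scores.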

Conclusions: Our research provides robust support for the reliability and clinical acceptability of the proposed evaluation framework. It underscores the framework's potential to mitigate clinical burdens and foster the responsible integration of artificial intelligence technologies in health care, suggesting a promising direction for future research and practical applications in the field.

Source
PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11618017
DOI: http://dx.doi.org/10.2196/58329

Similar Publications

Background: In pediatric intensive care units, pain, sedation, delirium, and iatrogenic withdrawal syndrome (IWS) must be managed as interrelated conditions. Although clinical practice guidelines (CPGs) exist, new evidence needs to be incorporated, gaps in recommendations addressed, and recommendations adapted to the European context.

Objective: This protocol describes the development of the first patient- and family-informed European guideline for managing pain, sedation, delirium, and IWS by the European Society of Paediatric and Neonatal Intensive Care.

Purpose: Many mealtime interventions have been developed over the past ten years. The effective implementation of such interventions into clinical practice is crucial to improve the swallowing safety and/or mealtime-related quality of life for people living with dysphagia or at risk of malnutrition. This systematic review summarises and critically appraises the literature on implementation of mealtime interventions in inpatient and aged care settings.

Fetal standard plane detection is essential in prenatal care, enabling accurate assessment of fetal development and early identification of potential anomalies. Despite significant advancements in machine learning (ML) in this domain, its integration into clinical workflows remains limited, primarily due to the lack of standardized, end-to-end operational frameworks. To address this gap, we introduce FetalMLOps, the first comprehensive MLOps framework specifically designed for fetal ultrasound imaging.

A comprehensive evaluation framework for climate effect on plant viewing activities.

Int J Biometeorol

September 2025

Key Laboratory of Land Surface Pattern and Simulation, Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing, 100101, China.

Plant viewing activities, which encompass the enjoyment of seasonal plant phenomena such as flowering and autumn leaf coloration, have become popular worldwide. Plant viewing activities are increasingly challenged by climate change, as key components like plant phenology and climate comfort are highly sensitive to global warming. However, few studies have explored the impact of climate change on viewing activities, particularly from an integrated, multi-factor perspective.

Purpose Of The Article: Snail mucin (SM) has garnered significant attention in dermatology, particularly for its potential in scar therapy and wound healing, due to its bioactive compounds, like allantoin, glycolic acid, and hyaluronic acid. These compounds are known to promote tissue regeneration, enhance skin hydration, and reduce scarring.

Materials And Methods: However, despite growing interest, significant gaps remain in the clinical understanding of SM's therapeutic potential, including a lack of standardised formulations and limited clinical trials.
