Citations: 20

Article Abstract

Introduction: Recent advances in large language models have demonstrated strong performance on natural language processing tasks, opening up possibilities for use in critical domains such as telehealth. We conducted a pilot study on the opportunities for using large language models, specifically GPT-3.5, GPT-4, and LLaMA 2, for zero-shot summarization of a doctor-patient conversation during a palliative care teleconsultation.

Methods: We created a bespoke doctor-patient conversation to evaluate the quality of medical conversation summarization. Summary quality was assessed with established automatic metrics (BLEU, ROUGE-L, METEOR, and BERTScore), and readability was assessed with the Flesch-Kincaid Grade Level, to understand the efficacy and suitability of these models in the medical domain.
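As an illustration of one of the automatic metrics named above, the following is a minimal sketch of ROUGE-L computed from the longest common subsequence (LCS) of reference and candidate tokens. It assumes lowercased whitespace tokenization and the balanced F1 variant; the function names `lcs_length` and `rouge_l` are hypothetical, and the abstract does not state which implementation or settings the study used.

```python
from typing import List, Tuple


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference: str, candidate: str) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for ROUGE-L.

    Assumption: lowercased whitespace tokens; real toolkits may stem
    or tokenize differently, which changes the scores.
    """
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0, 0.0, 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Made-up sentences for illustration only, not data from the study.
    p, r, f = rouge_l(
        "the patient reports mild pain at night",
        "the patient has mild pain",
    )
    print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Because the score rewards in-order matches rather than contiguous n-grams, a summary can reorder little and still score well even when it drops words between matches.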

Results: For automatic metrics, LLaMA 2 7B scored the highest in BLEU, indicating strong n-gram precision, while GPT-4 excelled in both ROUGE-L and METEOR, demonstrating its ability to capture longer matching sequences and preserve semantic accuracy. GPT-4 also led in BERTScore, suggesting better token-level semantic similarity than the other models. For readability, LLaMA 2 7B and LLaMA 2 13B produced summaries with Flesch-Kincaid grade levels of 11.9 and 12.6, respectively, which are somewhat more complex than the reference value of 10.6. LLaMA 2 70B generated summaries closest to the reference in simplicity, with a grade level of 10.7. GPT-3.5's summaries were the most complex at a grade level of 15.2, while GPT-4's summaries had a grade level of 13.1, making them moderately accessible.
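The grade levels reported above come from the Flesch-Kincaid Grade Level formula, 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59. The sketch below shows how such a score can be computed. The syllable counter is a rough vowel-group heuristic and the sentence splitter is a simple punctuation split; both are assumptions made for illustration, so results will differ somewhat from dictionary-based tools, and the function names are hypothetical rather than taken from the study.

```python
import re


def count_syllables(word: str) -> int:
    """Approximate syllables as runs of vowels (heuristic assumption)."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing 'e' as silent unless the word ends in 'le'.
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid Grade Level of a text.

    grade = 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (
        0.39 * len(words) / len(sentences)
        + 11.8 * syllables / len(words)
        - 15.59
    )


if __name__ == "__main__":
    # Made-up sentence for illustration only, not text from the study.
    sample = "The patient reports mild pain at night. Sleep has improved."
    print(round(flesch_kincaid_grade(sample), 1))
```

Longer sentences and more syllables per word both raise the grade, which is why a summary dense with clinical terminology can score well above the reference even when it is short.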

Conclusion: Our findings indicate that all the models perform similarly on the palliative care consultation, with GPT-4 being slightly better at balancing content understanding with structural similarity to the source, which makes it a potentially better choice for creating patient-friendly medical summaries. Threats and limitations of such approaches are also addressed in our analysis.

Source
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11577459 (PMC)
http://dx.doi.org/10.1177/20552076241293932 (DOI)

Publication Analysis

Top Keywords

large language: 12
language models: 12
palliative care: 12
grade level: 12
doctor-patient conversation: 8
automatic metrics: 8
rouge-l meteor: 8
flesch-kincaid grade: 8
models: 5
summaries: 5

Similar Publications

Leveraging GPT-4o for Automated Extraction and Categorization of CAD-RADS Features From Free-Text Coronary CT Angiography Reports: Diagnostic Study.

JMIR Med Inform

September 2025

Departments of Radiology, The Third Affiliated Hospital, Sun Yat-Sen University, 600 Tianhe Road, Guangzhou, Guangdong, 510630, China, 86 18922109279, 86 20852523108.

Background: Despite the Coronary Artery Reporting and Data System (CAD-RADS) providing a standardized approach, radiologists continue to favor free-text reports. This preference creates significant challenges for data extraction and analysis in longitudinal studies, potentially limiting large-scale research and quality assessment initiatives.

Objective: To evaluate the ability of the generative pre-trained transformer (GPT)-4o model to convert real-world coronary computed tomography angiography (CCTA) free-text reports into structured data and automatically identify CAD-RADS categories and P categories.

Background: With the development of artificial intelligence, obtaining patient-centered medical information through large language models (LLMs) is crucial for patient education. However, existing digital resources in online health care have heterogeneous quality, and the reliability and readability of content generated by various AI models need to be evaluated to meet the needs of patients with different levels of cultural literacy.

Objective: This study aims to compare the accuracy and readability of different LLMs in providing medical information related to gynecomastia, and explore the most promising science education tools in practical clinical applications.

This study aimed to assess the ability of two off-the-shelf large language models, ChatGPT and Gemini, to support the design of pharmacoepidemiological studies. We assessed 48 study protocols of pharmacoepidemiological studies published between 2018 and 2024, covering various study types, including disease epidemiology, drug utilization, safety, and effectiveness. The coherence (i.

Artificial intelligence (AI), particularly large language models (LLMs), offers the potential to augment clinical decision-making, including in palliative care pharmacy, where personalized treatment and assessments are important. Despite the growing interest in AI, its role in clinical reasoning within specialized fields such as palliative care remains uncertain. This study examines the performance of four commercial-grade LLMs on a Script Concordance Test (SCT) designed for pharmacy students in a pain and palliative care elective, comparing AI outputs with human learners' performance at baseline.

Purpose: Prior studies of vocal auditory-motor control in people with hyperfunctional voice disorders (HVDs) have found evidence of unusually large responses to auditory feedback perturbations of fundamental frequency (f0) and more variable voice onset times in unperturbed speech. However, it is unknown whether people with HVDs perform similarly to people with typical voices when asked to make small changes in vocal parameters in volitional tasks. The purpose of this study was to compare performance on minimal movement tasks for f0 and intensity in people with and without HVDs.