Exploring the opportunities of large language models for summarizing palliative care consultations: A pilot comparative study.

Xiao Chen, Wei Zhou, Rashina Hoda, Andy Li, Chris Bain, Peter Poon

Digit Health

Faculty of Medicine, Nursing and Health Sciences, Monash University, Clayton, VIC, Australia.

Published: November 2024

Introduction: Recent developments in large language models have demonstrated strong performance on natural language processing tasks, opening up possibilities for use in critical domains such as telehealth. We conducted a pilot study on the opportunities of using large language models, specifically GPT-3.5, GPT-4, and LLaMA 2, for zero-shot summarization of a doctor-patient conversation during a palliative care teleconsultation.

Methods: We created a bespoke doctor-patient conversation to evaluate the quality of medical conversation summarization. Summary quality was assessed with established automatic metrics (BLEU, ROUGE-L, METEOR, and BERTScore), and readability with the Flesch-Kincaid Grade Level, to understand the efficacy and suitability of these models in the medical domain.
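To illustrate one of the metrics named above, the following is a minimal sketch of ROUGE-L, which scores a candidate summary by the longest common subsequence (LCS) it shares with a reference. It assumes simple lowercase whitespace tokenization and the balanced F1 form; the function names and the example sentences are hypothetical, and this is not the evaluation code used in the study.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                # Extend the subsequence that ended before both tokens.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise carry over the best length seen so far.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, candidate):
    """ROUGE-L precision, recall, and F1 (assumed: whitespace tokens, balanced F1)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    lcs = lcs_length(ref, cand)
    precision = lcs / len(cand)  # share of candidate tokens on the LCS
    recall = lcs / len(ref)      # share of reference tokens on the LCS
    f1 = 0.0 if lcs == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical example sentences, not taken from the study's conversation.
scores = rouge_l(
    "the patient reports mild pain at night",
    "the patient has mild pain",
)
print(scores)
```

Because the LCS keeps word order but allows gaps, ROUGE-L rewards summaries that preserve the sequence of the reference wording even when words are inserted or dropped in between.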

Results: For automatic metrics, LLaMA 2 7B scored the highest in BLEU, indicating strong n-gram precision, while GPT-4 excelled in both ROUGE-L and METEOR, demonstrating its capability to capture longer sequences and semantic accuracy. GPT-4 also led in BERTScore, suggesting better token-level semantic similarity than the other models. For readability, LLaMA 2 7B and LLaMA 2 13B produced summaries with Flesch-Kincaid grade levels of 11.9 and 12.6, respectively, which are somewhat more complex than the reference value of 10.6. LLaMA 2 70B generated summaries closest to the reference in simplicity, with a grade level of 10.7. GPT-3.5's summaries were the most complex at a grade level of 15.2, while GPT-4's summaries had a grade level of 13.1, making them moderately accessible.
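For context on the readability figures above, the Flesch-Kincaid Grade Level is computed as 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The sketch below applies that standard formula; the syllable counter is a rough vowel-group heuristic introduced here as an assumption, so its grades will differ somewhat from those of dedicated readability tools and from the values reported in the study.

```python
import re


def fk_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level from raw counts (standard formula)."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def count_syllables(word):
    """Rough heuristic (an assumption): count vowel groups, drop a silent final 'e'."""
    word = word.lower()
    groups = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and groups > 1:
        groups -= 1
    return max(groups, 1)


def text_grade(text):
    """Estimate the grade level of a text using the heuristic syllable counter."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word and one sentence")
    syllables = sum(count_syllables(w) for w in words)
    return fk_grade(len(words), len(sentences), syllables)


# Hypothetical summary text, not taken from the study.
print(round(text_grade("The patient has mild pain at night. Sleep is poor."), 1))
```

Longer sentences and longer words both raise the grade, which is why a summary can be accurate yet score several grade levels above its reference.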

Conclusion: Our findings indicate that all the models perform similarly on the palliative care consultation, with GPT-4 being slightly better at balancing content understanding and structural similarity to the source, which makes it a potentially better choice for creating patient-friendly medical summaries. Threats and limitations of such approaches are also addressed in our analysis.

Full text (PMC): http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11577459
DOI: http://dx.doi.org/10.1177/20552076241293932

Similar Publications

Leveraging GPT-4o for Automated Extraction and Categorization of CAD-RADS Features From Free-Text Coronary CT Angiography Reports: Diagnostic Study.

JMIR Med Inform

September 2025

Departments of Radiology, The Third Affiliated Hospital, Sun Yat-Sen University, 600 Tianhe Road, Guangzhou, Guangdong, 510630, China.

Youmei Chen, Mengshi Dong, Jie Sun, Zhanao Meng, Yiqing Yang

Background: Despite the Coronary Artery Reporting and Data System (CAD-RADS) providing a standardized approach, radiologists continue to favor free-text reports. This preference creates significant challenges for data extraction and analysis in longitudinal studies, potentially limiting large-scale research and quality assessment initiatives.

Objective: To evaluate the ability of the generative pre-trained transformer (GPT)-4o model to convert real-world coronary computed tomography angiography (CCTA) free-text reports into structured data and automatically identify CAD-RADS categories and P categories.

Multicriteria Assessment of Text Quality in Large Language Model-Generated Gynecomastia Materials: DeepSeek Versus OpenAI Versus Claude.

J Craniofac Surg

September 2025

Department of Breast Plastic Surgery, Plastic Surgery Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Shijingshan, Beijing, China.

Tianying Zang, Jiaojiao Li, Lisha Wei, Yijin Wang

Background: With the development of artificial intelligence, obtaining patient-centered medical information through large language models (LLMs) is crucial for patient education. However, existing digital resources in online health care have heterogeneous quality, and the reliability and readability of content generated by various AI models need to be evaluated to meet the needs of patients with different levels of cultural literacy.

Objective: This study aims to compare the accuracy and readability of different LLMs in providing medical information related to gynecomastia, and explore the most promising science education tools in practical clinical applications.

Off-the-Shelf Large Language Models for Guiding Pharmacoepidemiological Study Design.

Clin Pharmacol Ther

September 2025

Department of Drug Design and Pharmacology, University of Copenhagen, Copenhagen, Denmark.

Gerard Ompad, Keele Wurst, Darmendra Ramcharran, Anders Hviid, Andrew Bate

This study aimed to assess the ability of two off-the-shelf large language models, ChatGPT and Gemini, to support the design of pharmacoepidemiological studies. We assessed 48 study protocols of pharmacoepidemiological studies published between 2018 and 2024, covering various study types, including disease epidemiology, drug utilization, safety, and effectiveness. The coherence (i.

Evaluating the Clinical Reasoning of Generative AI in Palliative Care: A Comparison with Five Years of Pharmacy Learners.

J Palliat Med

September 2025

Skaggs School of Pharmacy & Pharmaceutical Sciences, UC San Diego Health Sciences, San Diego, California, USA.

Mikaila T Lane, Toluwalase A Ajayi, Kyle P Edmonds, Rabia S Atayee

Artificial intelligence (AI), particularly large language models (LLMs), offers the potential to augment clinical decision-making, including in palliative care pharmacy, where personalized treatment and assessments are important. Despite the growing interest in AI, its role in clinical reasoning within specialized fields such as palliative care remains uncertain. This study examines the performance of four commercial-grade LLMs on a Script Concordance Test (SCT) designed for pharmacy students in a pain and palliative care elective, comparing AI outputs with human learners' performance at baseline.

Volitional Control of Frequency and Intensity in Speakers With and Without Hyperfunctional Voice Disorders.

J Speech Lang Hear Res

September 2025

Department of Speech, Language, and Hearing Sciences, Boston University, MA.

Mara R Kapsner-Smith, Juli Rosenzweig, Haley Wilcox, Neel Bhatt, J P Giliberto

Purpose: Prior studies of vocal auditory-motor control in people with hyperfunctional voice disorders (HVDs) have found evidence of unusually large responses to auditory feedback perturbations of fundamental frequency (fo) and more variable voice onset times in unperturbed speech. However, it is unknown whether people with HVDs perform similarly to people with typical voices when asked to make small changes in vocal parameters in volitional tasks. The purpose of this study was to compare performance on minimal movement tasks for fo and intensity in people with and without HVDs.
