Introduction: Recent developments in large language models have shown impressive performance on natural language processing tasks, opening up possibilities for use in critical domains like telehealth. We conducted a pilot study on the opportunities for using large language models, specifically GPT-3.5, GPT-4, and LLaMA 2, for zero-shot summarization of a doctor-patient conversation during a palliative care teleconsultation.
Methods: We created a bespoke doctor-patient conversation to evaluate the quality of medical conversation summarization. Summary quality was assessed with established automatic metrics (BLEU, ROUGE-L, METEOR, and BERTScore), and readability was assessed with the Flesch-Kincaid Grade Level, to understand the efficacy and suitability of these models in the medical domain.
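As an illustration of one of the automatic metrics named above, the following is a minimal sketch of ROUGE-L, which scores a candidate summary by the longest common subsequence (LCS) it shares with a reference summary. The function names, the lowercased whitespace tokenization, and the default `beta=1.0` are assumptions made for this example; the abstract does not say which implementation or settings the study used.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, candidate, beta=1.0):
    """ROUGE-L F-measure between a reference and a candidate summary.

    Assumption: tokens are lowercased, whitespace-separated words.
    """
    ref = reference.lower().split()
    cand = candidate.lower().split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    recall = lcs / len(ref)      # share of the reference covered by the LCS
    precision = lcs / len(cand)  # share of the candidate covered by the LCS
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)
```

Because the LCS only requires tokens to appear in the same order, not contiguously, ROUGE-L rewards summaries that preserve the reference's sentence-level structure even when individual words are dropped or added.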
Results: For automatic metrics, LLaMA 2 7B scored the highest in BLEU, indicating strong n-gram precision, while GPT-4 excelled in both ROUGE-L and METEOR, demonstrating its capability to capture longer sequences and semantic accuracy. GPT-4 also led in BERTScore, suggesting better semantic similarity at the token level compared to the other models. For readability, LLaMA 2 7B and LLaMA 2 13B produced summaries with Flesch-Kincaid grade levels of 11.9 and 12.6, respectively, which are somewhat more complex than the reference value of 10.6. LLaMA 2 70B generated summaries closest to the reference in simplicity, with a grade level of 10.7. GPT-3.5's summaries were the most complex at a grade level of 15.2, while GPT-4's summaries had a grade level of 13.1, making them moderately accessible.
Conclusion: Our findings indicate that all the models performed similarly on the palliative care consultation, with GPT-4 slightly better at balancing content understanding with structural similarity to the source, which makes it a potentially better choice for creating patient-friendly medical summaries. Threats and limitations of such approaches are also discussed in our analysis.
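The readability figures reported above are Flesch-Kincaid grade levels, computed with the standard formula 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The sketch below shows how such a score could be computed. The syllable counter is a rough vowel-group heuristic and the sentence splitter is a simple punctuation rule; both are assumptions made for this example, so its output will not exactly match the tool the authors used (which the abstract does not name).

```python
import re


def count_syllables(word):
    """Rough syllable estimate: count vowel groups, drop a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text):
    """Flesch-Kincaid Grade Level of a text (higher means harder to read)."""
    # Assumption: sentences end at '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)
```

Longer sentences and longer words both raise the score, which is why a summary at grade 15.2 is harder for patients to read than the reference summary at grade 10.6.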
Download full-text PDF |
Source |
---|---|
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC11577459 | PMC |
http://dx.doi.org/10.1177/20552076241293932 | DOI Listing |
JMIR Med Inform
September 2025
Departments of Radiology, The Third Affiliated Hospital, Sun Yat-Sen University, 600 Tianhe Road, Guangzhou, Guangdong, 510630, China, 86 18922109279, 86 20852523108.
Background: Despite the Coronary Artery Reporting and Data System (CAD-RADS) providing a standardized approach, radiologists continue to favor free-text reports. This preference creates significant challenges for data extraction and analysis in longitudinal studies, potentially limiting large-scale research and quality assessment initiatives.
Objective: To evaluate the ability of the generative pre-trained transformer (GPT)-4o model to convert real-world coronary computed tomography angiography (CCTA) free-text reports into structured data and automatically identify CAD-RADS categories and P categories.
J Craniofac Surg
September 2025
Department of Breast Plastic Surgery, Plastic Surgery Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Shijingshan, Beijing, China.
Background: With the development of artificial intelligence, obtaining patient-centered medical information through large language models (LLMs) is crucial for patient education. However, existing digital resources in online health care have heterogeneous quality, and the reliability and readability of content generated by various AI models need to be evaluated to meet the needs of patients with different levels of cultural literacy.
Objective: This study aims to compare the accuracy and readability of different LLMs in providing medical information related to gynecomastia, and explore the most promising science education tools in practical clinical applications.
Clin Pharmacol Ther
September 2025
Department of Drug Design and Pharmacology, University of Copenhagen, Copenhagen, Denmark.
This study aimed to assess the ability of two off-the-shelf large language models, ChatGPT and Gemini, to support the design of pharmacoepidemiological studies. We assessed 48 study protocols of pharmacoepidemiological studies published between 2018 and 2024, covering various study types, including disease epidemiology, drug utilization, safety, and effectiveness. The coherence (i.
J Palliat Med
September 2025
Skaggs School of Pharmacy & Pharmaceutical Sciences, UC San Diego Health Sciences, San Diego, California, USA.
Artificial intelligence (AI), particularly large language models (LLMs), offers the potential to augment clinical decision-making, including in palliative care pharmacy, where personalized treatment and assessments are important. Despite the growing interest in AI, its role in clinical reasoning within specialized fields such as palliative care remains uncertain. This study examines the performance of four commercial-grade LLMs on a Script Concordance Test (SCT) designed for pharmacy students in a pain and palliative care elective, comparing AI outputs with human learners' performance at baseline.
J Speech Lang Hear Res
September 2025
Department of Speech, Language, and Hearing Sciences, Boston University, MA.
Purpose: Prior studies of vocal auditory-motor control in people with hyperfunctional voice disorders (HVDs) have found evidence of unusually large responses to auditory feedback perturbations of fundamental frequency (f0) and more variable voice onset times in unperturbed speech. However, it is unknown whether people with HVDs perform similarly to people with typical voices when asked to make small changes in vocal parameters in volitional tasks. The purpose of this study was to compare performance on minimal movement tasks for f0 and intensity in people with and without HVDs.