Evaluating the performance of large language models in haematopoietic stem cell transplantation decision-making.

Ivan Civettini, Arianna Zappaterra, Bianca Maria Granelli, Giovanni Rindone, Andrea Aroldi, Stefano Bonfanti, Federica Colombo, Marilena Fedele, Giovanni Grillo, Matteo Parma, Paola Perfetti, Elisabetta Terruzzi, Carlo Gambacorti-Passerini, Daniele Ramazzotti, Fabrizio Cavalca

Br J Haematol

Department of Haematology and Bone Marrow Transplantation Unit, Fondazione IRCCS San Gerardo dei Tintori, Monza, Italy.

Published: April 2024

In a first-of-its-kind study, we assessed the capabilities of large language models (LLMs) in making complex decisions in haematopoietic stem cell transplantation. We evaluated not only Generative Pre-trained Transformer 4 (GPT-4) but also two other artificial intelligence models, PaLM 2 and Llama-2. Using detailed haematological histories that included clinical, molecular and donor data, we conducted a triple-blind survey to compare LLMs with haematology residents. We found that residents significantly outperformed LLMs (p = 0.02), particularly in transplant eligibility assessment (p = 0.01). Our triple-blind methodology aimed to mitigate potential biases in evaluating LLMs and revealed both their promise and their limitations in deciphering complex haematological clinical scenarios.

DOI: http://dx.doi.org/10.1111/bjh.19200

Similar Publications

Leveraging GPT-4o for Automated Extraction and Categorization of CAD-RADS Features From Free-Text Coronary CT Angiography Reports: Diagnostic Study.

JMIR Med Inform

September 2025

Departments of Radiology, The Third Affiliated Hospital, Sun Yat-Sen University, 600 Tianhe Road, Guangzhou, Guangdong, 510630, China.

Youmei Chen, Mengshi Dong, Jie Sun, Zhanao Meng, Yiqing Yang

Background: Despite the Coronary Artery Reporting and Data System (CAD-RADS) providing a standardized approach, radiologists continue to favor free-text reports. This preference creates significant challenges for data extraction and analysis in longitudinal studies, potentially limiting large-scale research and quality assessment initiatives.

Objective: To evaluate the ability of the generative pre-trained transformer (GPT)-4o model to convert real-world coronary computed tomography angiography (CCTA) free-text reports into structured data and automatically identify CAD-RADS categories and P categories.

Multicriteria Assessment of Text Quality in Large Language Model-Generated Gynecomastia Materials: DeepSeek Versus OpenAI Versus Claude.

J Craniofac Surg

September 2025

Department of Breast Plastic Surgery, Plastic Surgery Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Shijingshan, Beijing, China.

Tianying Zang, Jiaojiao Li, Lisha Wei, Yijin Wang

Background: With the development of artificial intelligence, obtaining patient-centered medical information through large language models (LLMs) is crucial for patient education. However, existing digital resources in online health care have heterogeneous quality, and the reliability and readability of content generated by various AI models need to be evaluated to meet the needs of patients with different levels of cultural literacy.

Objective: This study aims to compare the accuracy and readability of different LLMs in providing medical information related to gynecomastia, and explore the most promising science education tools in practical clinical applications.

Off-the-Shelf Large Language Models for Guiding Pharmacoepidemiological Study Design.

Clin Pharmacol Ther

September 2025

Department of Drug Design and Pharmacology, University of Copenhagen, Copenhagen, Denmark.

Gerard Ompad, Keele Wurst, Darmendra Ramcharran, Anders Hviid, Andrew Bate

This study aimed to assess the ability of two off-the-shelf large language models, ChatGPT and Gemini, to support the design of pharmacoepidemiological studies. We assessed 48 study protocols of pharmacoepidemiological studies published between 2018 and 2024, covering various study types, including disease epidemiology, drug utilization, safety, and effectiveness. The coherence (i.

Evaluating the Clinical Reasoning of Generative AI in Palliative Care: A Comparison with Five Years of Pharmacy Learners.

J Palliat Med

September 2025

Skaggs School of Pharmacy & Pharmaceutical Sciences, UC San Diego Health Sciences, San Diego, California, USA.

Mikaila T Lane, Toluwalase A Ajayi, Kyle P Edmonds, Rabia S Atayee

Artificial intelligence (AI), particularly large language models (LLMs), offers the potential to augment clinical decision-making, including in palliative care pharmacy, where personalized treatment and assessments are important. Despite the growing interest in AI, its role in clinical reasoning within specialized fields such as palliative care remains uncertain. This study examines the performance of four commercial-grade LLMs on a Script Concordance Test (SCT) designed for pharmacy students in a pain and palliative care elective, comparing AI outputs with human learners' performance at baseline.

Volitional Control of Frequency and Intensity in Speakers With and Without Hyperfunctional Voice Disorders.

J Speech Lang Hear Res

September 2025

Department of Speech, Language, and Hearing Sciences, Boston University, MA.

Mara R Kapsner-Smith, Juli Rosenzweig, Haley Wilcox, Neel Bhatt, J P Giliberto

Purpose: Prior studies of vocal auditory-motor control in people with hyperfunctional voice disorders (HVDs) have found evidence of unusually large responses to auditory feedback perturbations of fundamental frequency (f0) and more variable voice onset times in unperturbed speech. However, it is unknown whether people with HVDs perform similarly to people with typical voices when asked to make small changes in vocal parameters in volitional tasks. The purpose of this study was to compare performance on minimal movement tasks for f0 and intensity in people with and without HVDs.
