Utilizing Large language models to select literature for meta-analysis shows workload reduction while maintaining a similar recall level as manual curation.

Xiangming Cai, Yuanming Geng, Yiming Du, Bart Westerman, Duolao Wang, Chiyuan Ma, Juan J Garcia Vallejo

BMC Med Res Methodol

Department of Molecular Cell Biology & Immunology, Amsterdam Infection & Immunity Institute and Cancer Center Amsterdam, Amsterdam UMC, Vrije Universiteit Amsterdam, Amsterdam, The Netherlands.

Published: April 2025

Background: Large language models (LLMs) such as ChatGPT have shown great potential in aiding medical research. Evidence-based medicine, and meta-analysis in particular, involves a heavy workload of filtering records. However, few studies have tried to use LLMs to help screen records for meta-analysis.

Objective: We aimed to explore whether multiple LLMs can be incorporated to facilitate the title- and abstract-based screening step of a meta-analysis.

Methods: Several LLMs were evaluated, including GPT-3.5, GPT-4, Deepseek-R1-Distill, Qwen-2.5, Phi-4, Llama-3.1, Gemma-2 and Claude-2. To assess our strategy, we selected three meta-analyses from the literature, together with a glioma meta-analysis embedded in the study as additional validation. For the automatic selection of records from curated meta-analyses, a four-step strategy called LARS-GPT was developed, consisting of (1) criteria selection and single-prompt (prompt with one criterion) creation, (2) best combination identification, (3) combined-prompt (prompt with one or more criteria) creation, and (4) request sending and answer summary. Recall, workload reduction, precision, and F1 score were calculated to assess the performance of LARS-GPT.
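The four metrics named above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function name `screening_metrics`, the `ScreeningMetrics` container, and the input format (one boolean per record, `True` meaning "include") are hypothetical. The abstract does not define workload reduction, so the sketch assumes it is the share of records the LLM excludes, which a human would then not need to screen.

```python
from dataclasses import dataclass


@dataclass
class ScreeningMetrics:
    recall: float
    precision: float
    f1: float
    workload_reduction: float


def screening_metrics(llm_include, human_include):
    """Compare LLM screening decisions against manual curation.

    Both arguments are equal-length sequences of booleans, one per record,
    where True means the record is kept for the meta-analysis.
    """
    if len(llm_include) != len(human_include):
        raise ValueError("decision lists must have the same length")

    pairs = list(zip(llm_include, human_include))
    tp = sum(1 for llm, human in pairs if llm and human)
    fp = sum(1 for llm, human in pairs if llm and not human)
    fn = sum(1 for llm, human in pairs if not llm and human)
    tn = sum(1 for llm, human in pairs if not llm and not human)

    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # Assumption: workload reduction = fraction of records excluded by the LLM.
    workload_reduction = (tn + fn) / len(pairs) if pairs else 0.0

    return ScreeningMetrics(recall, precision, f1, workload_reduction)


if __name__ == "__main__":
    llm = [True, True, False, False, True]
    human = [True, False, False, False, True]
    print(screening_metrics(llm, human))
```

In this toy example the LLM keeps every relevant record while excluding two of five records, so recall stays at 1.0 and the assumed workload reduction is 0.4.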

Results: Performance varied across single-prompts, with a mean recall of 0.800. Based on these single-prompts, we were able to find combinations whose performance exceeded the pre-set threshold. Finally, with the best combination of criteria identified, LARS-GPT showed a 40.1% workload reduction on average with a recall greater than 0.9.
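One way to picture the search for a best combination is a brute-force scan over criterion subsets, sketched below in Python. This is an illustration under stated assumptions, not the LARS-GPT code. The function name `best_combination`, the input layout, and the rule that a record is kept only if it passes every criterion in the combination are all hypothetical. The default recall threshold of 0.9 is illustrative, and workload reduction is again assumed to be the share of records excluded.

```python
from itertools import combinations


def best_combination(single_answers, human_include, min_recall=0.9):
    """Pick the criterion subset with the largest workload reduction
    among those whose recall meets the threshold.

    single_answers maps a criterion name to a list of booleans, one per
    record, where True means the record passed that single-prompt.
    Returns (criteria, recall, workload_reduction), or None if no subset
    reaches min_recall.
    """
    n = len(human_include)
    n_relevant = sum(human_include)
    names = sorted(single_answers)
    best = None

    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            # Assumption: a record is kept only if every criterion passes.
            kept = [all(single_answers[c][i] for c in combo) for i in range(n)]
            tp = sum(1 for k, h in zip(kept, human_include) if k and h)
            recall = tp / n_relevant if n_relevant else 0.0
            reduction = (n - sum(kept)) / n if n else 0.0
            if recall >= min_recall and (best is None or reduction > best[2]):
                best = (combo, recall, reduction)
    return best


if __name__ == "__main__":
    answers = {
        "population": [True, True, False, True],
        "design": [True, False, True, True],
    }
    human = [True, False, False, True]
    print(best_combination(answers, human))
```

Exhaustive enumeration grows exponentially with the number of criteria, so it is only practical for the handful of criteria a typical meta-analysis uses.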

Conclusions: We show here the groundbreaking finding that automatic selection of literature for meta-analysis is possible with LLMs. We provide this approach as a pipeline, LARS-GPT, which substantially reduced workload while maintaining a pre-set recall.

PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC12036192
DOI: http://dx.doi.org/10.1186/s12874-025-02569-3
