Validation of automated paper screening for esophagectomy systematic review using large language models.

Rashi Ramchandani, Eddie Guo, Esra Rakab, Jharna Rathod, Jamie Strain, William Klement, Risa Shorr, Erin Williams, Daniel Jones, Sebastien Gilbert

PeerJ Comput Sci

Division of General Surgery, Department of Surgery, The Ottawa Hospital, Ottawa, Ontario, Canada.

Published: April 2025

Background: Large language models (LLMs) offer a potential solution to the labor-intensive nature of systematic reviews. This study evaluated the ability of the GPT model to identify articles that discuss perioperative risk factors for esophagectomy complications. To further test the model's performance, we also evaluated GPT-4 on a narrower inclusion criterion, assessing its ability to discriminate relevant articles that identified solely preoperative risk factors for esophagectomy.

Methods: A literature search was run by a trained librarian to identify studies (n = 1,967) discussing risk factors for esophagectomy complications. The articles underwent title and abstract screening by three independent human reviewers and GPT-4. The Python script used for the analysis made Application Programming Interface (API) calls to GPT-4 with the screening criteria expressed in natural language. GPT-4's inclusion and exclusion decisions were compared with those of the human reviewers.
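As a rough illustration of the screening step described in the Methods, the following minimal Python sketch shows how a natural-language criterion could be combined with each title and abstract and how the reply could be parsed into a decision. It is not the authors' script: the criterion wording, the prompt text, the record fields (`id`, `title`, `abstract`) and the `ask_model` callable are all hypothetical, and the actual API call is left to the caller rather than shown.

```python
from typing import Callable, Dict, List

# Hypothetical wording; the study's actual screening criteria are not given here.
CRITERIA = (
    "Include studies that report perioperative risk factors "
    "for complications after esophagectomy."
)


def build_prompt(criteria: str, title: str, abstract: str) -> str:
    """Combine the screening criteria and one record into a single prompt."""
    return (
        "You are screening titles and abstracts for a systematic review.\n"
        f"Inclusion criteria: {criteria}\n\n"
        f"Title: {title}\n"
        f"Abstract: {abstract}\n\n"
        "Answer with INCLUDE or EXCLUDE on the first line, "
        "then one sentence of justification."
    )


def parse_decision(reply: str) -> Dict[str, str]:
    """Read the decision from the first line and keep the rest as justification."""
    lines = reply.strip().splitlines()
    first = lines[0].strip().upper() if lines else ""
    if first.startswith("INCLUDE"):
        decision = "include"
    elif first.startswith("EXCLUDE"):
        decision = "exclude"
    else:
        decision = "unclear"  # flag for human review instead of guessing
    justification = " ".join(line.strip() for line in lines[1:]).strip()
    return {"decision": decision, "justification": justification}


def screen(
    records: List[Dict[str, str]],
    ask_model: Callable[[str], str],
    criteria: str = CRITERIA,
) -> List[Dict[str, str]]:
    """Screen every record; ask_model is any function mapping a prompt to reply text."""
    results = []
    for rec in records:
        reply = ask_model(build_prompt(criteria, rec["title"], rec["abstract"]))
        results.append({"id": rec["id"], **parse_decision(reply)})
    return results
```

Passing the model call in as a function keeps the sketch independent of any particular client library and lets the parsing logic be checked with a stub before any real API calls are made.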

Results: The agreement between the GPT model and human decisions was 85.58% for perioperative factors and 78.75% for preoperative factors. The AUC values were 0.87 and 0.75 for the perioperative and preoperative risk factor queries, respectively. In the evaluation of perioperative risk factors, the GPT model demonstrated a high recall for included studies at 89%, a positive predictive value of 74%, and a negative predictive value of 84%, with a low false positive rate of 6% and a macro-F1 score of 0.81. For preoperative risk factors, the model showed a recall of 67% for included studies, a positive predictive value of 65%, and a negative predictive value of 85%, with a false positive rate of 15% and a macro-F1 score of 0.66. The interobserver reliability was substantial, with a kappa score of 0.69 for perioperative factors and 0.61 for preoperative factors. Despite lower accuracy under more stringent criteria, the GPT model proved valuable in streamlining the systematic review workflow. In a preliminary evaluation, study screeners reported that the inclusion and exclusion justifications provided by the GPT model were useful, especially in resolving discrepancies during title and abstract screening.
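The metrics reported above all derive from a 2x2 table of model decisions against human decisions. The sketch below shows one common way to compute them, assuming "include" is the positive class and every denominator is non-zero. The function name and the counts in the usage lines are invented for illustration and are not the study's data; the abstract does not say exactly how macro-F1 or kappa were calculated, so the standard definitions are assumed.

```python
from typing import Dict


def screening_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Agreement metrics for model vs. human screening decisions.

    tp: both include, fp: model includes but humans exclude,
    fn: model excludes but humans include, tn: both exclude.
    Assumes every denominator below is non-zero.
    """
    n = tp + fp + fn + tn
    agreement = (tp + tn) / n        # raw percent agreement
    recall = tp / (tp + fn)          # sensitivity for included studies
    ppv = tp / (tp + fp)             # positive predictive value
    npv = tn / (tn + fn)             # negative predictive value
    fpr = fp / (fp + tn)             # false positive rate
    specificity = tn / (tn + fp)     # recall of the "exclude" class

    # Macro-F1: unweighted mean of the F1 scores of the two classes.
    f1_include = 2 * ppv * recall / (ppv + recall)
    f1_exclude = 2 * npv * specificity / (npv + specificity)
    macro_f1 = (f1_include + f1_exclude) / 2

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_both_include = ((tp + fp) / n) * ((tp + fn) / n)
    p_both_exclude = ((fn + tn) / n) * ((fp + tn) / n)
    p_chance = p_both_include + p_both_exclude
    kappa = (agreement - p_chance) / (1 - p_chance)

    return {
        "agreement": agreement,
        "recall": recall,
        "ppv": ppv,
        "npv": npv,
        "fpr": fpr,
        "macro_f1": macro_f1,
        "kappa": kappa,
    }


# Illustrative counts only (not taken from the study).
metrics = screening_metrics(tp=40, fp=10, fn=10, tn=140)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because included studies are usually the minority class in screening, raw agreement can look high even when recall is modest, which is why kappa and macro-F1 are worth reporting alongside it.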

Conclusion: This study demonstrates a promising use of LLMs to streamline the workflow of systematic reviews. The integration of LLMs in systematic reviews could lead to significant time and cost savings; however, caution must be taken for reviews involving narrower, more stringent inclusion and exclusion criteria. Future research should explore integrating LLMs into other steps of the systematic review, such as full-text screening or data extraction, and compare different LLMs for their effectiveness in various types of systematic reviews.

Full text (PMC): http://www.ncbi.nlm.nih.gov/pmc/articles/PMC12190591
DOI: http://dx.doi.org/10.7717/peerj-cs.2822

Publication Analysis

Top Keywords

risk factors

gpt model

systematic reviews

systematic review

preoperative risk

factors

large language

language models

perioperative risk

factors esophagectomy

Similar Publications

Dietary inflammatory index and the risk of colorectal adenomas and cancer: a systematic review and dose-response meta-analysis.

Nutr J

September 2025

Department of Gastroenterology and Hepatology, Hangzhou Red Cross Hospital, 208 Huancheng Dong Road, Hangzhou, 310003, Zhejiang Province, China.

Yi-Jun Wu, Wen-Hua Wang, Yu-Ping Wang, Hong Xu

Background: The potential association between dietary inflammatory index (DII) and colorectal cancer (CRC) risk, as well as colorectal adenomas (CRA) risk, has been extensively studied, but the findings remain inconclusive. We conducted this systematic review and dose-response meta-analysis to investigate the relationship between the DII and CRC and CRA.

Methods: We comprehensively searched the PubMed, Embase, Cochrane Library, and Web of Science databases for cohort and case-control studies reporting the relationship between DII and CRA, or between DII and CRC, as of 15 July 2025.

Development and validation of a gastric cancer prognostic model utilizing lymphatic endothelial cell-related genes.

Diagn Pathol

September 2025

Department of Gastrointestinal Medical Oncology, Fudan University Shanghai Cancer Center, Shanghai, 200032, China.

Sijie Sun, Jieyun Zhang, Weijian Guo

Background: Gastric cancer is one of the most common cancers worldwide, with its prognosis influenced by factors such as tumor clinical stage, histological type, and the patient's overall health. Recent studies highlight the critical role of lymphatic endothelial cells (LECs) in the tumor microenvironment. Perturbations in LEC function in gastric cancer, marked by aberrant activation or damage, disrupt lymphatic fluid dynamics and impede immune cell infiltration, thereby modulating tumor progression and patient prognosis.

One-plate versus two-plate fixation in the treatment of mandibular angle fractures: a retrospective two-centre comparative study.

Head Face Med

September 2025

Department of Oral and Maxillofacial Surgery, University Hospital Tübingen, Tübingen, Germany.

Andreas Sakkas, Mario Scheurer, Robin Kasper, Marcel Ebeling, Alexander Schramm

Background: The treatment of mandibular angle fractures remains controversial, particularly regarding the method of fixation. The primary aim of this study was to compare surgical outcomes following treatment with 1-plate versus 2-plate fixation across two oral and maxillofacial surgery clinics. The secondary aim was to evaluate associations between patient-, trauma-, and procedure-specific factors with postoperative complications and to identify high-risk patients for secondary osteosynthesis.

Haptic-visual evaluation of the J-Sign revealed a significant correlation with patella height, tibial tubercle - trochlear groove distance, trochlear bump height and the total number of risk factors for lateral patellar instability.

J Orthop Surg Res

September 2025

Arcus Sportklinik, Pforzheim, Germany.

Felix Zimmermann, Eric Mandelka, Jula Gierse, Sven Y Vetter, Paul Alfred Grützner

Epidemiology, resistance profiles, and risk factors of multidrug- and carbapenem-resistant Serratia marcescens infections: a retrospective study of 242 cases.

BMC Infect Dis

September 2025

Department of Laboratory Medicine, Affiliated Hospital of Medical School, Nanjing Drum Tower Hospital, Nanjing University, Nanjing, China.

Hong Zhu, Fengyan Li, Xiaoli Cao, Yan Zhang, Chang Liu

Background: Serratia marcescens is an opportunistic pathogen increasingly associated with healthcare-associated infections and rising antimicrobial resistance. The emergence of multidrug-resistant (MDR) and carbapenem-resistant S. marcescens (CRSM) presents significant therapeutic challenges.
