Assessing the feasibility and external validity of natural language processing-extracted data for advanced lung cancer patients.

Yuchen Li, Jennifer Law, Lisa W Le, Janice J N Li, Christopher Pettengell, Patricia Demarco, Michael Duong, David Merritt, Sean Davidson, Mike Sung, Qixuan Li, Sally Cm Lau, Sajda Zahir, Ryan Chu, Malcom Ryan, Khizar Karim, Josh Morganstein, Adrian Sacher, Lawson Eng, Frances A Shepherd

Lung Cancer

Dept. of Medical Oncology, Princess Margaret Cancer Center, Toronto, ON, Canada.

Published: January 2025

Background: Manual extraction of real-world clinical data for research can be time-consuming and prone to error. We assessed the feasibility of using natural language processing (NLP), an AI technique, to automate data extraction for patients with advanced lung cancer (aLC). We also assessed the external validity of the NLP-extracted data by comparing our findings with those reported in the literature.

Methods: Patients diagnosed with stage IIIB or IV lung cancer between January 2015 and December 2017 at Princess Margaret Cancer Centre who received at least one dose of systemic therapy were included. Their electronic health records were provided to Pentavere's NLP platform, DARWEN, in March 2019. Descriptive statistics summarized baseline patient and cancer characteristics, molecular biomarkers, and first-line systemic therapies. Multivariate Cox models were used to evaluate prognostic factors in the advanced non-small cell lung cancer (NSCLC) and small-cell lung cancer (SCLC) cohorts.

Results: NLP extracted clinical information for 333 patients in a total of 8 hours, with few missing values: smoking status (n = 2) and Eastern Cooperative Oncology Group (ECOG) status (n = 5). Baseline patient and cancer characteristics summarized from NLP-extracted data were comparable to those in previous studies and population reports. For NSCLC patients, being male (HR 1.44, 95% CI [1.04, 2.00]), having worse ECOG status (1.48 [1.22, 1.81]), and having liver (2.24 [1.45, 3.46]), bone (2.09 [1.48, 2.96]), or lung metastases (2.54 [1.05, 2.26]) were associated with worse survival. For SCLC patients, older age (HR 1.70 per 10 years, 95% CI [1.10, 2.63]) and liver metastases (3.81 [1.61, 9.01]) were associated with worse survival.
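As a reading aid for the hazard ratios above, the following is a minimal sketch, not the authors' analysis code. It assumes the intervals are standard Wald intervals from a Cox model, i.e. exp(beta ± 1.96·SE), in which case the point estimate should sit at the geometric mean of the two bounds. The function names, row labels, and the 5% tolerance are illustrative choices, not taken from the paper.

```python
import math

# (label, reported HR, CI lower, CI upper), copied from the Results paragraph.
# The labels are shorthand introduced here for illustration.
reported = [
    ("NSCLC male", 1.44, 1.04, 2.00),
    ("NSCLC worse ECOG", 1.48, 1.22, 1.81),
    ("NSCLC liver mets", 2.24, 1.45, 3.46),
    ("NSCLC bone mets", 2.09, 1.48, 2.96),
    ("NSCLC lung mets", 2.54, 1.05, 2.26),
    ("SCLC age per 10 y", 1.70, 1.10, 2.63),
    ("SCLC liver mets", 3.81, 1.61, 9.01),
]


def implied_hr(lower, upper):
    """Point estimate implied by a Wald interval: geometric mean of the bounds."""
    return math.sqrt(lower * upper)


def standard_error(lower, upper, z=1.96):
    """Standard error of log(HR) implied by a Wald interval at the given z."""
    return (math.log(upper) - math.log(lower)) / (2 * z)


def check(rows, tol=0.05):
    """Return labels whose reported HR differs from the implied HR by more than tol (relative)."""
    flagged = []
    for label, hr, lower, upper in rows:
        if abs(implied_hr(lower, upper) - hr) / hr > tol:
            flagged.append(label)
    return flagged


if __name__ == "__main__":
    for label, hr, lower, upper in reported:
        print(f"{label}: reported {hr:.2f}, implied {implied_hr(lower, upper):.2f}, "
              f"SE(log HR) {standard_error(lower, upper):.3f}")
    print("Inconsistent rows:", check(reported))
```

Under that assumption, six of the seven estimates match their intervals to within rounding. The lung metastases entry does not, since 2.54 lies outside [1.05, 2.26]. This may reflect a transcription error in the abstract, but the abstract alone does not show which figure is affected.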

Conclusion: Our study demonstrated that automated data extraction using NLP is feasible and time efficient. Additionally, the NLP-extracted data can be used to identify valid and useful clinical endpoints for research. NLP holds significant potential to accelerate the extraction of real-world data for future observational studies.

DOI: http://dx.doi.org/10.1016/j.lungcan.2025.108080
