Automated Extraction of Stroke Severity from Unstructured Electronic Health Records using Natural Language Processing.

Marta Fernandes, M Brandon Westover, Aneesh B Singhal, Sahar F Zafar

medRxiv

Department of Neurology, Massachusetts General Hospital (MGH), Boston, Massachusetts, United States.

Published: March 2024

Background: Multi-center electronic health records (EHR) can support quality improvement initiatives and comparative effectiveness research in stroke care. However, limitations of EHR-based research include challenges in abstracting key clinical variables from non-structured data at scale. This is further compounded by missing data. Here we develop a natural language processing (NLP) model that automatically reads EHR notes to determine the NIH stroke scale (NIHSS) score of patients with acute stroke.

Methods: The study included notes from acute stroke patients (>= 18 years) admitted to the Massachusetts General Hospital (MGH) between 2015 and 2022. The MGH data were divided into training (70%) and hold-out test (30%) sets. A two-stage model was developed to predict the admission NIHSS. A linear model with the least absolute shrinkage and selection operator (LASSO) was trained on the training set. For notes in the test set where the NIHSS was documented, the scores were extracted using regular expressions (stage 1); for notes where the NIHSS was not documented, LASSO was used for prediction (stage 2). The reference standard for NIHSS was obtained from the Get With The Guidelines Stroke Registry. The two-stage model was tested on the hold-out test set and validated in the MIMIC-III dataset (Medical Information Mart for Intensive Care III, v1.4, 2001-2012), using root mean squared error (RMSE) and Spearman correlation (SC).
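The two-stage routing described in the Methods can be illustrated with a minimal Python sketch. It is not the authors' implementation: the regular expression, the 0-42 range check, and the names `extract_nihss`, `two_stage_nihss`, and `fallback` are assumptions made for illustration, and the stage 2 model is passed in as a generic callable rather than a trained LASSO regressor.

```python
import re
from typing import Callable, Optional

# Hypothetical pattern: the abstract does not give the authors' regular expressions.
# It looks for "NIHSS" or "NIH stroke scale" followed, within 20 non-digit
# characters, by a one- or two-digit number.
NIHSS_PATTERN = re.compile(
    r"\b(?:NIHSS|NIH\s+stroke\s+scale)\b[^0-9\n]{0,20}?(\d{1,2})\b",
    re.IGNORECASE,
)


def extract_nihss(note: str) -> Optional[int]:
    """Stage 1: return a documented NIHSS score, or None if none is found."""
    for match in NIHSS_PATTERN.finditer(note):
        score = int(match.group(1))
        # The NIHSS ranges from 0 to 42; skip numbers outside that range.
        if 0 <= score <= 42:
            return score
    return None


def two_stage_nihss(note: str, fallback: Callable[[str], float]) -> float:
    """Use the documented score when present; otherwise call the stage 2 model."""
    documented = extract_nihss(note)
    if documented is not None:
        return float(documented)
    # Stage 2: `fallback` stands in for a trained regressor (LASSO in the paper).
    return fallback(note)


if __name__ == "__main__":
    # Toy stand-in for stage 2: always predicts the same score.
    def toy_model(note: str) -> float:
        return 5.0

    print(two_stage_nihss("NIHSS: 12 on admission", toy_model))
    print(two_stage_nihss("Left-sided weakness, no score recorded.", toy_model))
```

Passing the stage 2 model as a callable keeps the routing logic separate from the choice of regressor, so any text-based predictor can be swapped in.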

Results: We included 4,163 patients (MGH = 3,876; MIMIC = 287), with an average age of 69 [SD 15] years; 53% were male and 72% were white. Of these patients, 90% had ischemic stroke and 10% had hemorrhagic stroke. The two-stage model achieved an RMSE [95% CI] of 3.13 [2.86-3.41] (SC = 0.90 [0.88-0.91]) in the MGH hold-out test set and 2.01 [1.58-2.38] (SC = 0.96 [0.94-0.97]) in the MIMIC validation set.
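For readers unfamiliar with the two reported metrics, the sketch below shows one way to compute RMSE and Spearman correlation between reference and predicted scores using only the Python standard library. The function names are illustrative assumptions, and the sketch does not reproduce the confidence intervals, since the abstract does not state how they were derived.

```python
import math
from typing import List, Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error between reference and predicted scores."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the average ranks."""
    if len(y_true) != len(y_pred) or len(y_true) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(y_true), average_ranks(y_pred)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)
```

RMSE is expressed in NIHSS points and penalizes large errors more heavily, while Spearman correlation measures only whether patients are ordered by severity in the same way as the reference standard.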

Conclusions: The automatic NLP-based model can enable large-scale stroke severity phenotyping from EHR and therefore support real-world quality improvement and comparative effectiveness studies in stroke.

PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10980121
DOI: http://dx.doi.org/10.1101/2024.03.08.24304011

Similar Publications

Non-invasive maturity assessment of iPSC-CMs based on optical maturity characteristics using interpretable AI.

Comput Struct Biotechnol J

August 2025

Institute of Biomedical Engineering, TU Dresden, Fetscherstr. 29, Dresden 01307, Germany.

Fabian Scheurer, Alexander Hammer, Mario Schubert, Robert-Patrick Steiner, Oliver Gamm

Human induced pluripotent stem cell-derived cardiomyocytes (iPSC-CMs) are an important resource for identifying novel therapeutic targets and cardioprotective drugs. However, a key limitation of iPSC-CMs is their immature, fetal-like phenotype. Cultivation of iPSC-CMs in lipid-supplemented maturation media (MM) enhances the structural, metabolic and electrophysiological properties of iPSC-CMs.

A Machine Learning Approach for Identifying People With Neuroinfectious Diseases in Electronic Health Records: Algorithm Development and Validation.

JMIR Med Inform

August 2025

Department of Neurology, Massachusetts General Hospital, 55 Fruit St, Wang ACC 835, Boston, MA, 02114, United States.

Arjun Singh, Shadi Sartipi, Haoqi Sun, Rebecca Milde, Niels Turley

Background: Identifying neuroinfectious disease (NID) cases using International Classification of Diseases billing codes is often imprecise, while manual chart reviews are labor-intensive. Machine learning models can leverage unstructured electronic health records to detect subtle NID indicators, process large data volumes efficiently, and reduce misclassification. While accurate NID classification is needed for research and clinical decision support, using unstructured notes for this purpose remains underexplored.

Machine Learning for Predicting Recurrent Course in Uveitis Using Baseline Clinical Characteristics.

Invest Ophthalmol Vis Sci

August 2025

Programme for Ocular Inflammation & Infection Translational Research, Department of Ophthalmology, National Healthcare Group Eye Institute, Tan Tock Seng Hospital, Singapore, Singapore.

William Rojas-Carabali, Carlos Cifuentes-González, Anna Utami, Manisha Agarwal, John H Kempen

Purpose: We developed and evaluated machine learning models for predicting the risk of recurrent uveitis using baseline clinical characteristics, to inform clinical decision-making and risk stratification.

Methods: A retrospective analysis was conducted using the Ocular Autoimmune Systemic Inflammatory Infectious Study registry, including 966 patients (1432 eyes) with uveitis. Three machine learning classifiers (random forest, eXtreme Gradient Boosting, and radial basis function support vector classifier) were trained on preprocessed baseline demographic and clinical data.

Population-Specific Radiomics From Biparametric Magnetic Resonance Imaging Improves Prostate Cancer Risk Stratification in African American Men.

JU Open Plus

July 2025

Indiana University, Indianapolis, Indiana.

Abhishek Midya, Sreeharsha Tirumani, Leonardo Kayat Bittencourt, Sena Azamat, Siddharth Balakrishnan

Purpose: To quantify population-specific differences in prostate cancer (PCa) presentation between African American (AA) and White (W) men on MRI using radiomics.

Materials And Methods: We identified N = 149 men with PCa who underwent 3T MRI, a confirmatory biopsy and for whom self-reported race was available. Patient studies were partitioned into training (D) and hold-out test set (D).

Prediction model for intrapartum cesarean delivery among women with gestational diabetes mellitus.

Arch Gynecol Obstet

August 2025

Department of Obstetrics and Gynecology, Lis Hospital for Women's Health, Tel Aviv Sourasky Medical Center, 6 Weizmann St, 6423906, Tel Aviv, Israel.

Itamar Gilboa, Daniel Gabbai, Emmanuel Attali, Liran Hiersch, Anat Lavie

Purpose: To identify risk factors and to develop a predictive model for cesarean delivery (CD) in women with gestational diabetes mellitus (GDM).

Study Design: A retrospective cohort study, in a single university-affiliated tertiary medical center, was performed. All women with GDM and a singleton pregnancy who had a trial of labor between 2011 and 2023 were included.
