Enhancing risk prediction based on health administrative data using high-dimensional prediction model.

Md Belal Hossain, Mohsen Sadatsafavi, Hubert Wong, Victoria J Cook, James C Johnston, Mohammad Ehsanul Karim

J Clin Epidemiol

School of Population and Public Health, University of British Columbia, Vancouver, British Columbia, Canada; Centre for Advancing Health Outcomes, St. Paul's Hospital, Vancouver, British Columbia, Canada.

Published: August 2025

Objectives: Health administrative datasets often do not contain important clinical variables for predicting the risk of medical outcomes. However, they often contain a wide range of health-care variables that can be used to develop a high-dimensional prediction model (hdPM) that compensates for the lack of clinical predictors. We aimed to compare the predictive performance of an hdPM with a conventional model that relies only on investigator-specified clinical predictors.

Study Design and Setting: Using data on 2923 individuals diagnosed with tuberculosis (TB), we simulated a time-to-event outcome from a Cox proportional hazards model via plasmode simulation. We considered two scenarios, in which either strong or weak predictors were unavailable in the development sample. Conventional models and hdPMs were fitted with and without least absolute shrinkage and selection operator (LASSO) shrinkage and were compared in terms of internally validated time-dependent c-statistic and calibration.
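As a rough illustration of what a discrimination measure for time-to-event data computes, the sketch below implements Harrell's concordance index in plain Python. This is a simpler, time-independent relative of the time-dependent c-statistic named above, not the study's own metric; the function name `harrell_c` and the toy data are hypothetical and do not come from the study.

```python
def harrell_c(times, events, risk_scores):
    """Harrell's concordance index for right-censored data.

    times:       observed follow-up times
    events:      1 if the event was observed, 0 if censored
    risk_scores: model output, higher means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a comparable pair needs an observed event at the earlier time
        for j in range(n):
            if times[j] > times[i]:  # subject j outlived subject i
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0  # earlier event had the higher predicted risk
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (illustrative only): four subjects, the third is censored.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
well_ranked = [0.9, 0.7, 0.4, 0.1]
poorly_ranked = [0.9, 0.1, 0.4, 0.7]

print(harrell_c(times, events, well_ranked))
print(harrell_c(times, events, poorly_ranked))
```

A value of 1 means every comparable pair is ranked correctly and 0.5 corresponds to chance-level ranking, which is the scale on which c-statistics such as those reported in the Results are read.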

Results: The hdPMs achieved a higher time-dependent c-statistic than the conventional model, both in predicting TB mortality and in our simulations. Compared to a c-statistic of 0.78 for the conventional model with a strong unobserved predictor, LASSO-based hdPMs had a c-statistic of 0.90. While non-penalized hdPMs exhibited overfitting, LASSO-based hdPMs demonstrated superior cross-validated discrimination and calibration. Results were consistent in sensitivity analyses with varying numbers of additional health-care variables and different outcome types.

Conclusion: Health administrative data can compensate for the lack of known and important clinical variables by drawing on the many health-care variables available in linked databases. hdPMs, especially those with LASSO regularization, substantially enhance predictive accuracy and offer a robust approach for risk stratification and assessment in epidemiological research.

Plain Language Summary: Researchers develop prediction models with only clinical variables. But health administrative data often do not contain some clinical variables. For example, smoking, weight, height, physical activity, and diet data are unavailable. They do have codes such as International Classification of Diseases (ICD)-9/10 diagnostic codes. We transformed these codes into binary and count variables. We created models to predict tuberculosis mortality. The models were not very accurate when using only clinical variables. Accuracy improved when we added the codes. We can use this kind of model in policy and research. For example, we can identify people at high mortality risk. We can then design interventions for the high-risk group.
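The step of turning diagnostic codes into binary and count variables can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function name `code_features`, the `_any` and `_count` suffixes, the patient identifiers, and the three example codes are all hypothetical.

```python
from collections import Counter


def code_features(records, vocabulary):
    """Build one binary and one count feature per diagnostic code.

    records:    dict mapping a patient id to the list of codes on file
    vocabulary: codes to turn into features (an assumed, fixed list)
    """
    features = {}
    for patient_id, codes in records.items():
        counts = Counter(codes)  # how often each code appears for this patient
        row = {}
        for code in vocabulary:
            row[f"{code}_any"] = 1 if counts[code] > 0 else 0  # ever recorded
            row[f"{code}_count"] = counts[code]  # number of recordings
        features[patient_id] = row
    return features


# Toy records (illustrative only, not study data).
records = {
    "p1": ["E11", "I10", "E11"],
    "p2": ["J45"],
    "p3": [],
}
vocabulary = ["E11", "I10", "J45"]

for patient_id, row in code_features(records, vocabulary).items():
    print(patient_id, row)
```

With many codes in the vocabulary, a table like this grows wide quickly, which is the setting where the LASSO shrinkage described in the abstract is used to limit overfitting.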

DOI: http://dx.doi.org/10.1016/j.jclinepi.2025.111857
