Research on New Methods of Topic Mining and Topic Prediction for Medical Preprints on Emerging Infectious Diseases.

Zongjing Liang, Yun Kuang, Gongcheng Liang, Zhijie Li, Mingfeng Jiang

Cureus

Institute of Library and Information Studies, Guangxi Normal University, Guilin, CHN.

Published: June 2025


Background and purpose

To cope with the continuing risk of emerging infectious diseases and to monitor research trends in real time, this paper proposes a prediction framework that combines public attention indicators with topic analysis of medical preprints. Because traditional topic prediction methods lag behind events, the framework incorporates Google Trends data to improve the timeliness of prediction.

Methods

A web crawler was used to collect 18,060 COVID-19-related preprint abstracts from the medRxiv platform. Text preprocessing included word segmentation, stop word removal, lemmatization, and synonym standardization. Latent Dirichlet Allocation (LDA), an unsupervised probabilistic model, was used to extract the latent topic structure of the text. To analyze the dynamic relationship between research topic intensity and public attention, the study used the Autoregressive Distributed Lag (ARDL) model, which can handle I(0) and I(1) time series simultaneously. Time series data were aggregated by week and log-transformed, the Augmented Dickey-Fuller (ADF) unit root test was used to assess stationarity, and non-stationary variables were differenced. The LDA and ARDL models were implemented in Python and EViews 10, respectively.

Results

LDA modeling identified seven major research topics. ARDL analysis confirmed a significant dynamic relationship between public search trends and topic intensity, and the model showed good predictive performance.

Conclusion

This study combined LDA with ARDL models to construct a real-time prediction method for tracking the evolution of medical preprint topics. The method has theoretical and practical significance for public health informatics and offers feasible predictive support for the monitoring and prevention of future infectious diseases.
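The time series preparation described in the Methods (weekly aggregation, log transformation, differencing) can be illustrated with a minimal standard-library sketch. It is not the authors' code: the function names, the sample counts, the Monday-based week boundary, and the `offset` added before taking logs are all assumptions made for illustration. The ADF test is omitted here, so the sketch differences the series unconditionally, whereas the study differenced only the variables found to be non-stationary.

```python
import math
from collections import defaultdict
from datetime import date, timedelta


def week_start(d: date) -> date:
    """Return the Monday of the week containing d (week boundary is an assumption)."""
    return d - timedelta(days=d.weekday())


def aggregate_weekly(daily):
    """Sum (date, value) observations into weekly totals, sorted by week."""
    totals = defaultdict(float)
    for d, value in daily:
        totals[week_start(d)] += value
    return sorted(totals.items())


def log_transform(values, offset=1.0):
    """Natural log of each value; the offset keeps zero counts defined (an assumption)."""
    return [math.log(v + offset) for v in values]


def first_difference(values):
    """Return the series of successive changes x[t] - x[t-1]."""
    return [b - a for a, b in zip(values, values[1:])]


# Hypothetical daily counts of preprints assigned to one topic.
daily_counts = [
    (date(2020, 3, 2), 3), (date(2020, 3, 4), 5),    # week starting 2020-03-02
    (date(2020, 3, 9), 10), (date(2020, 3, 13), 6),  # week starting 2020-03-09
    (date(2020, 3, 16), 31),                         # week starting 2020-03-16
]

weekly = aggregate_weekly(daily_counts)
levels = log_transform([total for _, total in weekly])
diffs = first_difference(levels)

for (week, total), level in zip(weekly, levels):
    print(week.isoformat(), total, round(level, 4))
print("first differences:", [round(x, 4) for x in diffs])
```

In a fuller pipeline, a unit root test would be run on `levels` first, and the differenced series would be passed to the ARDL step only when the test fails to reject non-stationarity.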

PMC: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC12248262
DOI: http://dx.doi.org/10.7759/cureus.85773

Top Keywords

infectious diseases

topic prediction

public attention

medical preprint

prediction methods

dynamic relationship

topic intensity

time series

topic

prediction
