Background: Accurately assigning ICD-10 (International Statistical Classification of Diseases, Tenth Revision) codes is critical for clinical documentation, reimbursement processes, epidemiological studies, and health care planning. Manual coding is time-consuming, labor-intensive, and prone to errors, underscoring the need for automated solutions within the Norwegian health care system. Recent advances in natural language processing (NLP) and transformer-based language models have shown promising results in automating ICD (International Classification of Diseases) coding in several languages. However, prior work has focused primarily on English and other high-resource languages, leaving a gap in Norwegian-specific clinical NLP research.
Objective: This study introduces 2 versions of NorDeClin-BERT (NorDeClin Bidirectional Encoder Representations from Transformers), domain-specific Norwegian BERT-based models pretrained on a large corpus of Norwegian clinical text to enhance their understanding of medical language. Both models were subsequently fine-tuned to predict ICD-10 diagnosis codes. We aimed to evaluate the impact of domain-specific pretraining and model size on classification performance and to compare NorDeClin-BERT with general-purpose and cross-lingual BERT models in the context of Norwegian ICD-10 coding.
Methods: Two versions of NorDeClin-BERT were pretrained on the ClinCode Gastro Corpus, a large-scale dataset comprising 8.8 million deidentified Norwegian clinical notes, to enhance domain-specific language modeling. The base model builds upon NorBERT3-base and was pretrained on a large, relevant subset of the corpus, while the large model builds upon NorBERT3-large and was trained on the full dataset. Both models were benchmarked against SweDeClin-BERT, ScandiBERT, NorBERT3-base, and NorBERT3-large, using standard evaluation metrics: accuracy, precision, recall, and F1-score.
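The macro-style averaging behind the metrics listed above can be sketched in plain Python. This is an illustrative toy example, not the paper's evaluation code: the ICD-10 codes and predictions below are invented gastroenterology examples, and per-class averaging (macro) is shown because it weights rare codes equally with prevalent ones, which matters for the long-tailed ICD-10 label distribution the study describes.

```python
from collections import Counter

# Toy gold labels and model predictions for a single-label ICD-10 task.
# These codes are illustrative, not taken from the paper's dataset.
y_true = ["K21.9", "K29.7", "K21.9", "K80.2", "K29.7", "K21.9"]
y_pred = ["K21.9", "K29.7", "K80.2", "K80.2", "K21.9", "K21.9"]

labels = sorted(set(y_true) | set(y_pred))
tp = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # true positives per code
pred_counts = Counter(y_pred)   # how often each code was predicted
true_counts = Counter(y_true)   # how often each code actually occurs

accuracy = sum(tp.values()) / len(y_true)

# Macro averaging: compute precision/recall/F1 per ICD-10 code,
# then take the unweighted mean over codes.
precisions, recalls, f1s = [], [], []
for label in labels:
    p = tp[label] / pred_counts[label] if pred_counts[label] else 0.0
    r = tp[label] / true_counts[label] if true_counts[label] else 0.0
    f = 2 * p * r / (p + r) if (p + r) else 0.0
    precisions.append(p)
    recalls.append(r)
    f1s.append(f)

macro_precision = sum(precisions) / len(labels)
macro_recall = sum(recalls) / len(labels)
macro_f1 = sum(f1s) / len(labels)

print(f"accuracy={accuracy:.3f} precision={macro_precision:.3f} "
      f"recall={macro_recall:.3f} f1={macro_f1:.3f}")
```

With micro averaging instead (pooling all decisions before dividing), prevalent codes would dominate the score; reporting both views is common practice when benchmarking ICD coding models.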
Results: The results show that both versions of NorDeClin-BERT outperformed general-purpose Norwegian BERT models and Swedish clinical BERT models in classifying both prevalent and less common ICD-10 codes. Notably, NorDeClin-BERT-large achieved the highest overall performance across evaluation metrics, demonstrating the impact of domain-specific clinical pretraining in Norwegian. These results highlight that domain-specific pretraining on Norwegian clinical text, combined with model capacity, improves ICD-10 classification accuracy compared with general-domain Norwegian models and Swedish models pretrained on clinical text. Furthermore, while Swedish clinical models demonstrated some transferability to Norwegian, their performance remained suboptimal, emphasizing the necessity of Norwegian-specific clinical pretraining.
Conclusions: This study highlights the potential of NorDeClin-BERT to improve ICD-10 code classification for the gastroenterology domain in Norway, ultimately streamlining clinical documentation and reporting processes, reducing administrative burden, and enhancing coding accuracy in Norwegian health care institutions. The benchmarking evaluation establishes NorDeClin-BERT as a state-of-the-art model for processing Norwegian clinical text and predicting ICD-10 codes, setting a new baseline for future research in Norwegian medical NLP. Future work may explore further domain adaptation techniques, external knowledge integration, and cross-hospital generalizability to enhance ICD coding performance across broader clinical settings.
Download full-text PDF | Source
---|---
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC12377785 | PMC
http://dx.doi.org/10.2196/66153 | DOI Listing