Performance of multi-vendor auto-segmentation models for thoracic organs at risk trained on a single dataset.

Sevgi Emin, Elia Rossi, Mattias Hedman, Marcela Giovenco, Fernanda Villegas, Eva Onjukka

Phys Med

Department of Nuclear Medicine and Medical Physics, Karolinska University Hospital, 171 76 Stockholm, Sweden; Department of Oncology-Pathology, Karolinska Institutet, 171 77 Stockholm, Sweden.

Published: August 2025

Introduction: This study evaluates the delineation quality of artificial intelligence (AI)-based auto-segmentation models trained on the same dataset, because the intrinsic performance of commercial solutions cannot be compared when their training datasets differ. A diverse set of challenging thoracic organs at risk (OAR) was chosen to reveal potential limitations of AI-based tools that are relevant to their clinical adoption.

Materials & Methods: A structure set with 16 OAR was delineated and reviewed by radiation oncology experts for 250 patients with lung tumours (200 for training, 50 for testing). Three participating vendors had access to the training dataset for a limited time, each developing a model in a way that mimicked its commercial model development strategy. The authors then tested the models on the blind test dataset. A quantitative analysis was performed using the Dice Similarity Coefficient (DSC), surface DSC (sDSC), the 95th percentile of the Hausdorff Distance (HD95) and the average symmetric surface distance (ASSD). Inter-observer variability in manual segmentation was estimated from three independent expert delineations for a subset of five test patients.
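The four metrics named above can be illustrated with a minimal sketch. It is not the study's implementation, which the abstract does not describe. All function names are hypothetical, masks are assumed to be sets of integer voxel coordinates, sDSC is approximated by counting surface voxels rather than measuring surface area, HD95 is taken as the larger of the two directed 95th percentiles using a nearest-rank percentile, and the tolerance value is an arbitrary choice for the example.

```python
import math
from itertools import product


def dice(a, b):
    """Dice Similarity Coefficient between two sets of voxel coordinates."""
    if not a and not b:
        return 1.0
    return 2.0 * len(a & b) / (len(a) + len(b))


def surface(mask):
    """Voxels of `mask` with at least one face-neighbour outside the mask."""
    out = set()
    for v in mask:
        for axis in range(len(v)):
            for step in (-1, 1):
                neighbour = v[:axis] + (v[axis] + step,) + v[axis + 1:]
                if neighbour not in mask:
                    out.add(v)
    return out


def directed_distances(src, dst, spacing):
    """Distance in mm from each voxel in `src` to its nearest voxel in `dst`."""
    def to_mm(v):
        return [c * s for c, s in zip(v, spacing)]
    dst_mm = [to_mm(v) for v in dst]
    return [min(math.dist(to_mm(p), q) for q in dst_mm) for p in src]


def percentile(values, q):
    """Nearest-rank percentile (an assumption; other conventions interpolate)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100.0 * len(ordered)))
    return ordered[rank - 1]


def surface_metrics(a, b, spacing, tolerance_mm):
    """HD95, ASSD and a voxel-count approximation of sDSC.

    Both masks must be non-empty; brute-force search, so only suitable
    for small examples.
    """
    sa, sb = surface(a), surface(b)
    d_ab = directed_distances(sa, sb, spacing)
    d_ba = directed_distances(sb, sa, spacing)
    pooled = d_ab + d_ba
    return {
        "HD95": max(percentile(d_ab, 95), percentile(d_ba, 95)),
        "ASSD": sum(pooled) / len(pooled),
        "sDSC": sum(d <= tolerance_mm for d in pooled) / len(pooled),
    }


# Toy 2D example: a 4x4 reference square and the same square shifted by one voxel.
reference = set(product(range(0, 4), range(0, 4)))
predicted = set(product(range(1, 5), range(0, 4)))

scores = surface_metrics(reference, predicted, spacing=(1.0, 1.0), tolerance_mm=0.0)
scores["DSC"] = dice(reference, predicted)
```

In this toy case a one-voxel shift leaves the volume overlap fairly high while only half of the surface voxels coincide exactly, which shows why overlap and surface metrics can rank the same contour differently.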

Results: 13 OAR had DSC > 0.8, 9 had sDSC > 0.8, 10 had ASSD < 0.5 mm and 5 had HD95 < 1 mm. The most challenging structures to auto-segment were the brachial plexus, pulmonary vein, and vena cava inferior. Overall, the results of all models exceeded the inter-observer variability for all metrics.

Conclusion: While the evaluated AI models perform very well for some OAR, they appear less successful at modelling organs with branching structures and poor image contrast, even when trained on a large homogeneous dataset.

DOI: http://dx.doi.org/10.1016/j.ejmp.2025.105089
