Magnitude and Impact of Hallucinations in Tabular Synthetic Health Data on Prognostic Machine Learning Models: Validation Study.

Lisa Pilgram, Samer El Kababji, Dan Liu, Khaled El Emam

J Med Internet Res

School of Epidemiology and Public Health, Faculty of Medicine, University of Ottawa, Ottawa, ON, Canada.

Published: August 2025

Background: Generative artificial intelligence (AI) for tabular synthetic data generation (SDG) has significant potential to accelerate health care research and innovation. A critical limitation of generative AI, however, is hallucination. Although hallucinations have been commonly observed in text-generating models, they may also occur in tabular SDG.

Objective: This study aims to investigate the magnitude of hallucinations in tabular synthetic data, whether their frequency increases with training data complexity, and the extent to which they impact the utility of synthetic data for downstream prognostic machine learning (ML) modeling tasks.

Methods: Starting from 12 large, high-dimensional real-world health care datasets, 6354 training datasets of varying complexity were created by varying the subset of variables included in each dataset. Synthetic data were generated using 7 different SDG models. Hallucinations were defined as synthetic records that did not exist in the population, and the hallucination rate (HR) was the proportion of hallucinations in a synthetic dataset. The downstream prognostic modeling task was classification, conducted with an ML approach (light gradient boosting machine) and an artificial neural network (multilayer perceptron). Mixed-effects models were fitted to examine two relationships: between training data complexity and the HR, and between the HR and the predictive performance of AI and ML models trained on the synthetic data.
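The HR definition above can be illustrated with a minimal Python sketch. It assumes that a synthetic record "exists in the population" when it matches a population record exactly on every included variable; the abstract does not state the matching procedure, so this is an illustrative assumption rather than the authors' implementation. The function names, the helper `project`, and the toy records are all hypothetical.

```python
from typing import Hashable, Iterable, Sequence

Record = tuple[Hashable, ...]


def hallucination_rate(synthetic: Sequence[Record], population: Iterable[Record]) -> float:
    """Proportion of synthetic records with no exact match in the population.

    Assumption: exact matching on all variables (not confirmed by the abstract).
    """
    if not synthetic:
        raise ValueError("synthetic dataset is empty")
    population_set = set(population)
    n_hallucinated = sum(1 for record in synthetic if record not in population_set)
    return n_hallucinated / len(synthetic)


def project(records: Iterable[Record], columns: Sequence[int]) -> list[Record]:
    """Keep only the given column positions, mimicking a smaller variable subset."""
    return [tuple(record[i] for i in columns) for record in records]


# Hypothetical toy data; columns are (age_group, sex, diagnosis_code).
population = [
    ("60-69", "F", "I21"),
    ("60-69", "M", "I21"),
    ("70-79", "F", "I50"),
]
synthetic = [
    ("60-69", "F", "I21"),  # exists in the population
    ("70-79", "M", "I50"),  # hallucination
    ("70-79", "F", "I50"),  # exists in the population
    ("60-69", "M", "I50"),  # hallucination
]

# HR with all three variables included.
hr_full = hallucination_rate(synthetic, population)

# HR when only the first two variables are included (a less complex dataset).
hr_reduced = hallucination_rate(project(synthetic, [0, 1]), project(population, [0, 1]))

print(f"HR (3 variables) = {hr_full:.1%}")     # HR (3 variables) = 50.0%
print(f"HR (2 variables) = {hr_reduced:.1%}")  # HR (2 variables) = 25.0%
```

In this toy example the HR falls from 50% to 25% when one variable is dropped, because fewer variables leave fewer distinct value combinations for a synthetic record to miss. This only illustrates the direction of the complexity relationship examined in the study; it says nothing about its size.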

Results: The HR ranged from 0.3% to 100% (median 99.1%, IQR 98.5%-100.0%) and increased with training data complexity. However, in most SDG models, the HR did not affect AI and ML prognostic model performance. In the SDG models in which a significant association was detected, the estimated effect was very small: the largest estimated change in the area under the receiver operating characteristic curve was -0.0002 (95% CI -0.0003 to -0.0002, P<.001) for the light gradient boosting machine and -0.0001 (95% CI -0.0002 to -0.0001, P=.002) for the multilayer perceptron.

Conclusions: These findings suggest that while hallucinations may be very common in synthetic tabular health data, they do not necessarily impair its utility for prognostic modeling.

Full text (PMC): http://www.ncbi.nlm.nih.gov/pmc/articles/PMC12402739
DOI: http://dx.doi.org/10.2196/77893

