Background: Generative artificial intelligence (AI) for tabular synthetic data generation (SDG) has significant potential to accelerate health care research and innovation. A critical limitation of generative AI, however, is hallucinations. Although hallucinations have been commonly observed in text-generating models, they may also occur in tabular SDG.
Objective: This study aims to investigate the magnitude of hallucinations in tabular synthetic data, whether their frequency increases with training data complexity, and the extent to which they impact the utility of synthetic data for downstream prognostic machine learning (ML) modeling tasks.
Methods: On the basis of 12 large and high-dimensional real-world health care datasets, 6354 training datasets of different complexity were created by varying the subset of variables included in each dataset. Synthetic data were generated using 7 different SDG models. Hallucinations were defined as synthetic records that did not exist in the population, and the hallucination rate (HR) was the proportion of hallucinations in a synthetic dataset. Classification was the downstream prognostic modeling task, conducted via an ML approach (light gradient boosting machine) and an artificial neural network (multilayer perceptron). Mixed-effects models were fitted to examine two relationships: between training data complexity and the HR, and between the HR and the predictive performance of AI and ML models trained on the synthetic data.
Results: The HR ranged from 0.3% to 100% (median 99.1%, IQR 98.5%-100.0%) and increased with training data complexity. However, in most SDG models, the HR did not affect AI and ML prognostic model performance. In the SDG models in which a significant association was detected, the estimated effect was very small, with a maximum change in the area under the receiver operating characteristic curve of -0.0002 (95% CI -0.0003 to -0.0002, P<.001) for the light gradient boosting machine and -0.0001 (95% CI -0.0002 to -0.0001, P=.002) for the multilayer perceptron.
Conclusions: These findings suggest that while hallucinations may be very common in synthetic tabular health data, they do not necessarily impair its utility for prognostic modeling.
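The HR definition in the Methods can be illustrated with a minimal sketch. It assumes that "did not exist in the population" means no exact match across all variables, which the abstract does not spell out; the function name `hallucination_rate` and the toy records are hypothetical and are not taken from the study.

```python
from typing import Hashable, Iterable, Sequence

Record = Sequence[Hashable]


def hallucination_rate(synthetic: Iterable[Record],
                       population: Iterable[Record]) -> float:
    """Proportion of synthetic records with no exact match in the population.

    Assumption: a record counts as a hallucination when its full tuple of
    values does not appear anywhere in the population data.
    """
    population_set = {tuple(r) for r in population}
    synthetic_rows = [tuple(r) for r in synthetic]
    if not synthetic_rows:
        raise ValueError("synthetic dataset is empty")
    hallucinated = sum(1 for r in synthetic_rows if r not in population_set)
    return hallucinated / len(synthetic_rows)


# Toy data (hypothetical): sex, age band, comorbidity.
population = [
    ("F", "40-49", "diabetes"),
    ("M", "50-59", "none"),
    ("F", "60-69", "none"),
]
synthetic = [
    ("F", "40-49", "diabetes"),  # exists in the population
    ("M", "40-49", "diabetes"),  # hallucination
    ("F", "60-69", "none"),      # exists in the population
    ("M", "60-69", "diabetes"),  # hallucination
]

hr = hallucination_rate(synthetic, population)
print(f"HR = {hr:.1%}")
```

Under this exact-match assumption, each added variable multiplies the number of possible value combinations, so a synthetic record is less likely to coincide with a real one. That is consistent with the reported increase in HR with training data complexity.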
| Download full-text PDF | Source |
|---|---|
| http://www.ncbi.nlm.nih.gov/pmc/articles/PMC12402739 | PMC |
| http://dx.doi.org/10.2196/77893 | DOI Listing |
Rheumatol Int
September 2025
Department of Physical Medicine and Rehabilitation, Ankara Bilkent City Hospital, Faculty of Medicine, Yıldırım Beyazıt University, Ankara, Türkiye.
The Impact of Obesity and Overweight on Rheumatoid Arthritis Patients: Real-World Insights from a Biologic and Targeted Synthetic DMARDs Registry. The management of rheumatoid arthritis (RA) has advanced with biological and targeted synthetic disease-modifying anti-rheumatic drugs (b/tsDMARDs). However, obesity, a common comorbidity, impacts treatment efficacy and disease progression.
Front Med (Lausanne)
August 2025
OTEHM, Manchester Metropolitan University, Manchester, United Kingdom.
Introduction: Brain tumor classification remains one of the most challenging tasks in medical image analysis, with diagnostic errors potentially leading to severe consequences. Existing methods often fail to fully exploit all relevant features, focusing on a limited set of deep features that may miss the complexity of the task.
Methods: In this paper, we propose a novel deep learning model combining a Swin Transformer and AE-cGAN augmentation to overcome challenges such as data imbalance and feature extraction.
Biomed Eng Lett
September 2025
Computer Science and Engineering, Pohang University of Science and Technology, 77 Cheongam-Ro. Nam-Gu, Pohang, Gyeongbuk 37673 Korea.
Generative models have become innovative tools across various domains, including neuroscience, where they enable the synthesis of realistic brain imaging data that captures complex anatomical and functional patterns. These models, such as Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), and diffusion models, leverage deep learning to generate high-quality brain images while maintaining biological and clinical relevance. These models address critical challenges in brain imaging, e.
Biomed Eng Lett
September 2025
Department of Radiology, Guizhou International Science and Technology Cooperation Base of Precision Imaging for Diagnosis and Treatment, Guizhou Provincial People's Hospital, Guiyang, Guizhou China.
The generated lung nodule data plays an indispensable role in the development of intelligent assisted diagnosis of lung cancer. Existing generative models, primarily based on Generative Adversarial Networks (GANs) and Denoising Diffusion Probabilistic Models (DDPM), have demonstrated effectiveness but also come with certain limitations: GANs often produce artifacts and unnatural boundaries, and due to dataset limitations, they struggle with irregular nodules. While DDPMs are capable of generating a diverse range of nodules, their inherent randomness and lack of control limit their applicability in tasks such as segmentation.
Front Biosci (Landmark Ed)
August 2025
Institute of Statistics, National University of Kaohsiung, 811 Kaohsiung, Taiwan.
Background: Obesity is a chronic condition linked to health issues such as diabetes, heart disease, and increased cancer risk. High body mass index (BMI) is associated with cancers such as breast and colorectal cancer due to hormone imbalances and inflammation from excess fat, whereas a low BMI can raise cancer risk by weakening the immune system. Maintaining a normal BMI improves cancer treatment outcomes, but in some cases, higher BMI might offer protective effects-a phenomenon known as the "obesity paradox".