Citations

20

Article Abstract

Differentially private (DP) synthetic datasets are a solution for sharing data while preserving the privacy of individual data providers. Understanding the effects of utilizing DP synthetic data in end-to-end machine learning pipelines impacts areas such as health care and humanitarian action, where data is scarce and regulated by restrictive privacy laws. In this work, we investigate the extent to which synthetic data can replace real, tabular data in machine learning pipelines and identify the most effective synthetic data generation techniques for training and evaluating machine learning models. We systematically investigate the impacts of differentially private synthetic data on downstream classification tasks from the point of view of utility as well as fairness. Our analysis is comprehensive and includes representatives of the two main types of synthetic data generation algorithms: marginal-based and GAN-based. To the best of our knowledge, our work is the first that: (i) proposes a training and evaluation framework that does not assume that real data is available for testing the utility and fairness of machine learning models trained on synthetic data; (ii) presents the most extensive analysis of synthetic dataset generation algorithms in terms of utility and fairness when used for training machine learning models; and (iii) encompasses several different definitions of fairness. Our findings demonstrate that marginal-based synthetic data generators surpass GAN-based ones regarding model training utility for tabular data. Indeed, we show that models trained using data generated by marginal-based algorithms can exhibit similar utility to models trained using real data. Our analysis also reveals that the marginal-based synthetic data generated using AIM and MWEM PGM algorithms can train models that simultaneously achieve utility and fairness characteristics close to those obtained by models trained with real data.
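The utility and fairness comparison described in the abstract can be illustrated with a small sketch. The abstract does not say which fairness definitions or utility metrics the authors use, so the choices here (accuracy for utility, demographic parity difference and equal opportunity difference for fairness) are assumptions, and the function names and toy data are hypothetical. In a pipeline like the one described, `y_pred` would come from a model trained on synthetic data.

```python
from typing import Iterable, Sequence


def _mean(values: Iterable[int]) -> float:
    """Mean of a sequence of 0/1 values; 0.0 if the sequence is empty."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Utility metric (assumed): fraction of correct predictions."""
    return _mean(int(t == p) for t, p in zip(y_true, y_pred))


def demographic_parity_difference(y_pred: Sequence[int],
                                  group: Sequence[int]) -> float:
    """Gap between the highest and lowest positive-prediction rate per group."""
    rates = [
        _mean(p for p, a in zip(y_pred, group) if a == g)
        for g in sorted(set(group))
    ]
    return max(rates) - min(rates)


def equal_opportunity_difference(y_true: Sequence[int],
                                 y_pred: Sequence[int],
                                 group: Sequence[int]) -> float:
    """Gap between the highest and lowest true-positive rate per group."""
    tprs = [
        _mean(p for t, p, a in zip(y_true, y_pred, group) if a == g and t == 1)
        for g in sorted(set(group))
    ]
    return max(tprs) - min(tprs)


# Hypothetical toy data: labels, predictions, and a binary protected attribute.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]
group = [0, 0, 0, 0, 1, 1, 1, 1]

print("accuracy:", accuracy(y_true, y_pred))
print("demographic parity difference:",
      demographic_parity_difference(y_pred, group))
print("equal opportunity difference:",
      equal_opportunity_difference(y_true, y_pred, group))
```

Running the same three functions on predictions from a model trained on real data and from one trained on synthetic data gives a like-for-like comparison; values near zero for the two difference metrics indicate similar treatment across groups under these particular definitions.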

Source
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC10843030 (PMC)
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0297271 (PLOS)

