Density ratios to evaluate and improve the utility of synthetic data

Thom Benjamin Volker
t.b.volker@uu.nl

Imagine you have access to all the data in the world…

If real data is no option,

maybe synthetic data is!

Synthetic data

Fake data, generated data, simulated data, digital twins

Potential use-cases of synthetic data


  • Advancing access to private data for research (e.g., in statistical institutes)

  • Advancing open science workflows

  • Educational materials

  • Software / model testing

Synthetic data generation cycle

  1. Create synthetic data with simple models

  2. Evaluate the quality of the synthetic data

  3. If necessary, add complexity (alter models, transformations, interactions)

  4. Iterate between (2.) and (3.) until synthetic data is good enough

How do we know whether the synthetic data is good enough?

Intuitively

Can we use the synthetic data for the same purposes as we wanted to use the real data?

Do the observed and synthetic data have similar distributions?


Practically

Do the observed and synthetic data produce similar results under the same analysis?

Can we distinguish between the observed and synthetic data?

Density ratios for utility



r(x) = \frac{p(\boldsymbol{X}_{\text{syn}})}{p(\boldsymbol{X}_{\text{obs}})}
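
A small worked illustration of the ratio, using an assumed toy case that is not taken from the slides: write p_{\text{syn}} and p_{\text{obs}} for the synthetic and observed densities evaluated at the same point x, and suppose the observed data follow N(0, 1) while the synthetic data follow N(0, \sigma^2). Then

r(x) = \frac{p_{\text{syn}}(x)}{p_{\text{obs}}(x)} = \frac{1}{\sigma} \exp\left( \frac{x^2}{2} \left( 1 - \frac{1}{\sigma^2} \right) \right)

With \sigma = 1 the ratio equals 1 everywhere, so the two data sets are indistinguishable. With \sigma > 1 the ratio is 1/\sigma < 1 at x = 0 and grows in the tails, which flags that the synthetic data overrepresent extreme values.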



Density ratios for utility evaluation

Density ratios in practice

  1. Estimate the density ratio directly and non-parametrically

  2. Calculate a discrepancy measure for the synthetic data
       • Kullback-Leibler divergence; Pearson divergence

  3. Compare discrepancy measures for different data sets

  4. Optionally: Test the null hypothesis p(\boldsymbol{X}_{\text{syn}}) = p(\boldsymbol{X}_{\text{obs}})
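
The first three steps can be sketched in a few lines of Python. This is a minimal illustration, not the implementation behind the slides: it assumes unconstrained least-squares importance fitting (uLSIF) with a Gaussian kernel as the direct estimator, fixes the bandwidth `sigma` and ridge penalty `lam` by hand instead of cross-validating them, and uses simulated normal data in place of real observed and synthetic sets. All function names are hypothetical.

```python
import numpy as np


def gaussian_kernel(x, centers, sigma):
    """Gaussian kernel between every row of x and every center."""
    sq_dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


def ulsif(x_nu, x_de, sigma=1.0, lam=0.1, n_centers=100, seed=0):
    """Estimate r(x) = p_nu(x) / p_de(x) directly, without estimating
    either density. Returns a function that evaluates the fitted ratio."""
    rng = np.random.default_rng(seed)
    n_centers = min(n_centers, len(x_nu))
    # Kernel centers are a random subset of the numerator sample.
    centers = x_nu[rng.choice(len(x_nu), size=n_centers, replace=False)]
    k_nu = gaussian_kernel(x_nu, centers, sigma)
    k_de = gaussian_kernel(x_de, centers, sigma)
    H = k_de.T @ k_de / len(x_de)   # second moment under the denominator
    h = k_nu.mean(axis=0)           # first moment under the numerator
    theta = np.linalg.solve(H + lam * np.eye(n_centers), h)
    theta = np.maximum(theta, 0.0)  # keep the estimated ratio non-negative

    def ratio(x):
        return gaussian_kernel(x, centers, sigma) @ theta

    return ratio


def pearson_divergence(ratio, x_nu, x_de):
    """Plug-in estimate of the Pearson divergence from a fitted ratio."""
    return ratio(x_nu).mean() - 0.5 * (ratio(x_de) ** 2).mean() - 0.5


# Toy data (assumption): one well-specified and one shifted synthetic set.
rng = np.random.default_rng(123)
x_obs = rng.standard_normal((1000, 2))
x_syn_good = rng.standard_normal((1000, 2))
x_syn_bad = rng.standard_normal((1000, 2)) + 1.0

# Step 1: synthetic data in the numerator, observed data in the denominator.
ratio_good = ulsif(x_syn_good, x_obs)
ratio_bad = ulsif(x_syn_bad, x_obs)

# Steps 2 and 3: one discrepancy per synthetic set, then compare them.
pe_good = pearson_divergence(ratio_good, x_syn_good, x_obs)
pe_bad = pearson_divergence(ratio_bad, x_syn_bad, x_obs)
print(f"Pearson divergence, good synthetic set: {pe_good:.3f}")
print(f"Pearson divergence, bad synthetic set:  {pe_bad:.3f}")
```

The synthetic set with the smaller divergence is the one closer to the observed data. For the optional fourth step, one possibility is to recompute the divergence after repeatedly permuting the observed/synthetic labels and to compare the original value against that reference distribution.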

Density ratios for synthetic data (multivariate examples)

U.S. Current Population Survey (n = 5000)

  • Four continuous variables (age, income, social security payments, household taxes)
  • Four categorical variables (sex, race, marital status, educational attainment)

Synthetic data models

(Multinomial) logistic regression for categorical variables

  1. Linear regression
  2. Linear regression with transformations (cubic root)
  3. Linear regression with transformations and semi-continuous modelling

Utility of the synthetic data

Reweighting synthetic data: regression coefficients


      Observed     Synthetic    Reweighted
b0    -0.0179771   0.0203401    -0.0189604
b1     0.5152371   0.0385202     0.4779748
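
The idea behind the reweighted column can be sketched with a toy example. This is an illustration under stated assumptions, not the analysis that produced the numbers above: the observed data are simulated as bivariate normal with correlation 0.5, the synthetic data ignore that relationship, and the weights p_obs / p_syn (the inverse of r as defined earlier) are computed analytically from the known densities. In practice the weights would come from an estimated density ratio. All names are hypothetical.

```python
import numpy as np


def weighted_ols(x, y, w):
    """Intercept and slope from weighted least squares of y on x."""
    X = np.column_stack([np.ones_like(x), x])
    Xw = X * w[:, None]
    # Solve the weighted normal equations (X' W X) b = X' W y.
    return np.linalg.solve(X.T @ Xw, Xw.T @ y)


rng = np.random.default_rng(1)
n, rho = 20000, 0.5
cov = np.array([[1.0, rho], [rho, 1.0]])

# Toy data (assumption): the synthesis model misses the x-y relationship.
obs = rng.multivariate_normal(np.zeros(2), cov, size=n)
syn = rng.standard_normal((n, 2))


def log_density_obs(z):
    """Log density of the bivariate normal used for the observed data."""
    prec = np.linalg.inv(cov)
    quad = np.einsum("ij,jk,ik->i", z, prec, z)
    return -0.5 * quad - 0.5 * np.log(np.linalg.det(cov)) - np.log(2 * np.pi)


def log_density_syn(z):
    """Log density of two independent standard normals."""
    return -0.5 * (z ** 2).sum(axis=1) - np.log(2 * np.pi)


# Weights are the observed-over-synthetic density ratio at the synthetic points.
w = np.exp(log_density_obs(syn) - log_density_syn(syn))

b_obs = weighted_ols(obs[:, 0], obs[:, 1], np.ones(n))
b_syn = weighted_ols(syn[:, 0], syn[:, 1], np.ones(n))
b_rw = weighted_ols(syn[:, 0], syn[:, 1], w)

print("observed   b0, b1:", b_obs)
print("synthetic  b0, b1:", b_syn)
print("reweighted b0, b1:", b_rw)
```

In this toy setting the unweighted synthetic slope sits near zero, while the reweighted slope moves back toward the observed one, mirroring the pattern in the table.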

Other advantages of density ratios for utility

Use density ratios to discard synthetic outliers

High-dimensional extensions

  • Find an m-dimensional subspace (m < p) in which the synthetic and observed data are maximally different

  • Estimate the density ratio in this subspace

Automatic cross-validation for hyperparameter selection

Thanks for your attention!

Even if it was simulated…


Questions?

t.b.volker@uu.nl