Density ratios to evaluate and improve the utility of synthetic data

Thom Benjamin Volker
t.b.volker@uu.nl

Imagine you have access to all the data in the world…

If real data is no option,

maybe synthetic data is!

Synthetic data

Fake data, generated data, simulated data, digital twins

Potential use-cases of synthetic data

Advancing access to private data for research (e.g., in statistical institutes)
Advancing open science workflows
Educational materials
Software / model testing

Synthetic data generation cycle

1. Create synthetic data with simple models
2. Evaluate the quality of the synthetic data
3. If necessary, add complexity (alter models, transformations, interactions)
4. Iterate between (2.) and (3.) until the synthetic data is good enough
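
The cycle above can be sketched as a loop over candidate models of increasing complexity. This is only an illustration under stated assumptions: the "observed" data are simulated from an exponential distribution, the two candidate synthesizers (`normal_model`, `cube_root_normal_model`) are hypothetical, and a two-sample Kolmogorov-Smirnov statistic with an arbitrary threshold of 0.1 stands in for the utility measure.

```python
import random
from statistics import mean, stdev

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic (stand-in utility measure)."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            i += 1
        else:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

def normal_model(obs, rng):
    """Step 1: simplest synthesizer, a normal fit on the raw scale."""
    m, s = mean(obs), stdev(obs)
    return [rng.gauss(m, s) for _ in obs]

def cube_root_normal_model(obs, rng):
    """Step 3: added complexity, a normal fit after a cube-root transformation."""
    z = [v ** (1 / 3) for v in obs]  # assumes non-negative observed values
    m, s = mean(z), stdev(z)
    return [rng.gauss(m, s) ** 3 for _ in obs]

rng = random.Random(7)
observed = [rng.expovariate(1.0) for _ in range(2000)]  # simulated, skewed data

candidates = [
    ("normal", normal_model),
    ("normal on cube-root scale", cube_root_normal_model),
]
threshold = 0.1  # arbitrary cut-off for "good enough" in this sketch
scores = {}
chosen = None
for name, synthesize in candidates:
    synthetic = synthesize(observed, rng)               # step 1 / step 3
    scores[name] = ks_statistic(observed, synthetic)    # step 2
    if scores[name] < threshold:                        # step 4: stop when good enough
        chosen = name
        break

print(scores, chosen)
```

The loop stops at the first model whose discrepancy falls below the threshold; any other utility measure can be plugged in at the evaluation step.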

How do we know whether the synthetic data is good enough?

Intuitively

Can we use the synthetic data for the same purposes as we wanted to use the real data?

Do the observed and synthetic data have similar distributions?

Practically

Do the observed and synthetic data produce similar results under the same analysis?

Can we distinguish between the observed and synthetic data?

Density ratios for utility¹

r(x) = \frac{p(\boldsymbol{X}_{\text{syn}})}{p(\boldsymbol{X}_{\text{obs}})}

We propose density ratios as a natural tool for this task. Density ratio estimation is a set of techniques from machine learning for estimating the ratio of two probability density functions. The density ratio serves a broad range of tasks, including change-point detection, prediction and two-sample testing, and it is equally useful for evaluating the utility of synthetic data. If the observed and synthetic data have similar densities over the entire multivariate space, analyses of the two data sets should yield similar results. So if the density ratio is close to one at every point in the multivariate space, the synthetic data is good enough; if it is far from one in some subspace, the synthetic data needs improvement in that subspace. Importantly, the density ratio is estimated directly, which tends to be more accurate than estimating the two probability density functions separately and then taking their ratio.
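
To make the "close to one everywhere" criterion concrete, here is a small sketch with known univariate densities instead of estimated ones. The three normal distributions are made up for illustration: one stands in for the observed data, one for a synthetic model that matches it, and one for a synthetic model with too little variance.

```python
from statistics import NormalDist

def density_ratio(x, syn, obs):
    """r(x) = p_syn(x) / p_obs(x) for two known densities."""
    return syn.pdf(x) / obs.pdf(x)

obs = NormalDist(mu=0.0, sigma=1.0)         # stand-in for the observed-data density
syn_good = NormalDist(mu=0.0, sigma=1.0)    # synthetic model that matches
syn_narrow = NormalDist(mu=0.0, sigma=0.5)  # synthetic model with too little variance

for x in (-2.0, 0.0, 2.0):
    print(x, density_ratio(x, syn_good, obs), density_ratio(x, syn_narrow, obs))
```

For the matching model the ratio is one at every point. For the narrow model it is above one near the centre and close to zero in the tails, which flags the region where the synthetic data needs improvement.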

Density ratios for utility evaluation

Density ratios in practice

Estimate the density ratio directly and non-parametrically

Implemented in the R package densityratio

Calculate a discrepancy measure for the synthetic data

Kullback-Leibler divergence; Pearson divergence

Compare discrepancy measures for different data sets
Optionally: Test the null hypothesis p(\boldsymbol{X}_{\text{syn}}) = p(\boldsymbol{X}_{\text{obs}})
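
The steps above can be sketched in plain Python for univariate data. This is not the densityratio package's implementation: it is a minimal version of one direct estimator (unconstrained least-squares importance fitting with Gaussian kernels), and the fixed bandwidth `sigma`, fixed ridge penalty `lam`, the choice of centers, and the simulated data are all assumptions made for brevity.

```python
import math
import random

def kernel_row(x, centers, sigma):
    """Gaussian kernel features of x with respect to each center."""
    return [math.exp(-(x - c) ** 2 / (2 * sigma ** 2)) for c in centers]

def solve(A, b):
    """Solve A @ theta = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    theta = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][k] * theta[k] for k in range(i + 1, n))
        theta[i] = (M[i][n] - s) / M[i][i]
    return theta

def pearson_divergence(x_syn, x_obs, sigma=1.0, lam=0.1, n_centers=10):
    """Estimate r = p_syn / p_obs directly, then return the Pearson divergence."""
    centers = x_syn[:n_centers]
    phi_syn = [kernel_row(x, centers, sigma) for x in x_syn]
    phi_obs = [kernel_row(x, centers, sigma) for x in x_obs]
    b = len(centers)
    # H: second moments of the features under the denominator (observed) sample.
    H = [[sum(p[i] * p[j] for p in phi_obs) / len(x_obs) + (lam if i == j else 0.0)
          for j in range(b)] for i in range(b)]
    # h: feature means under the numerator (synthetic) sample.
    h = [sum(p[i] for p in phi_syn) / len(x_syn) for i in range(b)]
    theta = solve(H, h)
    r = lambda phi: max(0.0, sum(t * p for t, p in zip(theta, phi)))
    r_syn = [r(p) for p in phi_syn]
    r_obs = [r(p) for p in phi_obs]
    return (sum(r_syn) / len(r_syn)
            - 0.5 * sum(v * v for v in r_obs) / len(r_obs) - 0.5)

def permutation_p_value(x_syn, x_obs, n_perm=50, seed=1):
    """Permutation test of the null hypothesis p_syn = p_obs."""
    rng = random.Random(seed)
    observed = pearson_divergence(x_syn, x_obs)
    pooled = x_syn + x_obs
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        perm = pearson_divergence(pooled[:len(x_syn)], pooled[len(x_syn):])
        exceed += perm >= observed
    return (exceed + 1) / (n_perm + 1)

rng = random.Random(42)
x_obs = [rng.gauss(0, 1) for _ in range(100)]     # simulated "observed" data
syn_good = [rng.gauss(0, 1) for _ in range(100)]  # synthetic data from the right model
syn_bad = [rng.gauss(2, 1) for _ in range(100)]   # synthetic data with a shifted mean

pe_good = pearson_divergence(syn_good, x_obs)
pe_bad = pearson_divergence(syn_bad, x_obs)
p_bad = permutation_p_value(syn_bad, x_obs)
print(pe_good, pe_bad, p_bad)
```

A smaller divergence indicates higher utility, so the two synthetic sets can be ranked by comparing `pe_good` and `pe_bad`. In practice the bandwidth and penalty would be tuned, for example by cross-validation, rather than fixed.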

Density ratios for synthetic data (multivariate examples)

U.S. Current Population Survey (n = 5000)¹

Four continuous variables (age, income, social security payments, household taxes)
Four categorical variables (sex, race, marital status, educational attainment)

Synthetic data models

(Multinomial) logistic regression for categorical variables

Linear regression
Linear regression with transformations (cube root)
Linear regression with transformations and semi-continuous modelling

Utility of the synthetic data

Reweighting synthetic data: regression coefficients

Coefficient	Observed	Synthetic	Reweighted
b0	-0.0179771	0.0203401	-0.0189604
b1	0.5152371	0.0385202	0.4779748
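
The reweighting idea can be sketched as a weighted regression on the synthetic data, assuming weights proportional to p_obs / p_syn (the inverse of r as written earlier in this talk). To stay self-contained, the sketch uses known Gaussian densities for both data sets, whereas in practice the weights would come from an estimated density ratio. The settings (a true slope of 0.5, a synthesizer that draws x and y independently) are hypothetical and are not those behind the table above.

```python
import random
from statistics import NormalDist

def weighted_ols(x, y, w):
    """Intercept and slope of a weighted simple linear regression of y on x."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    b1 = sxy / sxx
    return my - b1 * mx, b1

# Assumed observed-data density: x ~ N(0, 1), y | x ~ N(0.5 * x, sd = sqrt(0.75)).
px_obs = NormalDist(0.0, 1.0)
def p_obs(x, y):
    return px_obs.pdf(x) * NormalDist(0.5 * x, 0.75 ** 0.5).pdf(y)

# Hypothetical poor synthesizer: x and y drawn independently from N(0, sd = 1.5).
marginal_syn = NormalDist(0.0, 1.5)
def p_syn(x, y):
    return marginal_syn.pdf(x) * marginal_syn.pdf(y)

rng = random.Random(2024)
n = 2000
x_syn = [rng.gauss(0.0, 1.5) for _ in range(n)]
y_syn = [rng.gauss(0.0, 1.5) for _ in range(n)]

# Importance weights that pull the synthetic sample toward the observed density.
weights = [p_obs(x, y) / p_syn(x, y) for x, y in zip(x_syn, y_syn)]

b0_raw, b1_raw = weighted_ols(x_syn, y_syn, [1.0] * n)
b0_rw, b1_rw = weighted_ols(x_syn, y_syn, weights)
print("unweighted:", b0_raw, b1_raw)
print("reweighted:", b0_rw, b1_rw)
```

The unweighted slope stays near zero because the synthesizer ignored the relationship, while the reweighted slope moves toward the slope of the assumed observed-data model.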

Other advantages of density ratios for utility

Use density ratios to discard synthetic outliers

High-dimensional extensions

Find an m-dimensional subspace (m < p) in which the synthetic and observed data are maximally different
Estimate the density ratio in this subspace

Automatic cross-validation for hyperparameter selection

Thanks for your attention!

Even if it was simulated…

Questions?

t.b.volker@uu.nl
