Evaluating the quality of synthetic data

A density ratio approach

Thom Benjamin Volker

Imagine …

That would be a privacy disaster!

Who am I?

Thom Volker (t.b.volker@uu.nl)

MSc in Methodology and Statistics & Sociology
PhD-candidate at Utrecht University and Statistics Netherlands
- Project: Contributing to the development of safe and high-quality synthetic data

Open materials

This presentation can be found online at

https://thomvolker.github.io/synth-utility

All source code and data can be found at

https://github.com/thomvolker/synth-utility

Synthetic data

Fake data, generated data, simulated data, digital twins

Creating synthetic data

Synthetic data is created with a generative model

\[p(\boldsymbol{X} | \theta)\]

A model \(p\) for the data \(\boldsymbol{X}\);
With parameters \(\theta\);
Estimated on real data

Definition

Generative models learn the distribution of the data \(\boldsymbol{X}\) given the parameters \(\theta\).

Examples of generative models

A normal distribution with parameters \(\theta = \{\mu, \sigma\}\).

In R: rnorm(n = 100, mean = 1, sd = 2)

A histogram with bins and proportions.

Sequential prediction models for a multivariate distribution.

A neural network with thousands of parameters.
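As a minimal sketch of the first example, the parameters of a normal distribution can be estimated on real data and then used to draw synthetic values. The choice of `mtcars$mpg` as the "real" variable and the names `theta_hat` and `synthetic` are illustrative assumptions, not part of the original slides.

```r
# Minimal sketch: a normal distribution as a generative model.
# Assumption: mtcars$mpg plays the role of the real data.
set.seed(123)

real <- mtcars$mpg

# Estimate the parameters theta = {mu, sigma} on the real data
theta_hat <- c(mu = mean(real), sigma = sd(real))

# Draw synthetic values from the fitted model
synthetic <- rnorm(
  n    = length(real),
  mean = theta_hat[["mu"]],
  sd   = theta_hat[["sigma"]]
)

head(synthetic)
```

The synthetic values share the mean and spread of the real variable only approximately, and a normal model cannot reproduce features such as skewness or bounds.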

Generating synthetic data with `mice`

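library(mice)

# Append an all-missing copy of the data; these rows will hold the synthetic values
emptydat <- mtcars
emptydat[1:nrow(mtcars), 1:ncol(mtcars)] <- NA
dat <- rbind(mtcars, emptydat)

fit <- mice(
  dat, 
  m = 1, 
  maxit = 1, 
  # lower-triangular predictor matrix: each variable is modelled from the variables before it
  predictorMatrix = lower.tri(diag(ncol(dat))),
  # fit the models on the observed rows only; the all-missing rows are only imputed
  ignore = rep(c(FALSE, TRUE), each = nrow(mtcars)),
  seed = 123
)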

Generating synthetic data with mice

Observed data

	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
Mazda RX4	21.0	6	160	110	3.90	2.620	16.46	0	1	4	4
Mazda RX4 Wag	21.0	6	160	110	3.90	2.875	17.02	0	1	4	4
Datsun 710	22.8	4	108	93	3.85	2.320	18.61	1	1	4	1

Synthetic data

	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
Mazda RX41	15.2	8	275.8	175	3.21	3.780	18.52	0	0	3	2
Mazda RX4 Wag1	27.3	4	120.3	95	4.43	1.513	17.40	1	1	4	1
Datsun 7101	15.2	8	351.0	150	3.07	3.840	18.52	0	0	3	2

Generating synthetic data is easy

But generating high-quality synthetic data is hard!

The synthetic data generation cycle

1. Create synthetic data with simple models
2. Evaluate the quality of the synthetic data
3. Add complexities where necessary (transformations, interactions, non-linear relations)
4. Iterate between (2.) and (3.) until the synthetic data is of sufficient quality

Evaluating the utility of synthetic data

Intuitively

Can we use the synthetic data for the same purposes as the real data?
Does the synthetic data have the same properties as the real data?

Practically

Do analyses on the synthetic data yield similar results as on the real data?
Can we distinguish the synthetic data from the real data?

The utility of synthetic data depends on what it’s used for

But we rarely know what the data will be used for…

Global utility measures

If the synthetic and observed data have similar distributions, they should yield similar results.

Existing global utility measures: \(pMSE\)

Bind synthetic and observed data together
Predict for each observation the probability \(\pi_i\) that it is synthetic
Calculate \(pMSE\) as \(\sum^N_{i=1} (\pi_i - c)^2/N\), with \(c = n_{\text{syn}} / (n_{\text{syn}} + n_{\text{obs}})\)
Compare \(pMSE\) with the expected value under a correct generative model
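The first three steps above can be sketched in base R. This is an illustration under stated assumptions, not the presenter's implementation: the toy data, the logistic regression with an interaction as the propensity model, and all object names (`obs_toy`, `syn_toy`, `pmse`) are hypothetical.

```r
# Minimal pMSE sketch with a logistic regression as propensity model.
# Assumption: toy data in which the synthetic data ignore a correlation.
set.seed(1)
n <- 200

# "Observed" data: x2 depends on x1
obs_toy <- data.frame(x1 = rnorm(n))
obs_toy$x2 <- 0.5 * obs_toy$x1 + rnorm(n)

# "Synthetic" data: correct marginal means and sds, but no correlation
syn_toy <- data.frame(
  x1 = rnorm(n, mean(obs_toy$x1), sd(obs_toy$x1)),
  x2 = rnorm(n, mean(obs_toy$x2), sd(obs_toy$x2))
)

# Step 1: bind synthetic and observed data together, with an indicator
combined <- rbind(obs_toy, syn_toy)
combined$is_syn <- rep(c(0, 1), times = c(nrow(obs_toy), nrow(syn_toy)))

# Step 2: predict the probability that each record is synthetic
prop_fit <- glm(is_syn ~ x1 * x2, family = binomial, data = combined)
pi_hat <- fitted(prop_fit)

# Step 3: mean squared deviation from the share of synthetic records
c_prop <- nrow(syn_toy) / nrow(combined)
pmse <- mean((pi_hat - c_prop)^2)
pmse
```

The result depends on the propensity model: a model without the interaction term would be less able to pick up the missing correlation, which illustrates why model specification matters for this measure.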

\(pMSE\)

Intuitive and flexible, easy to calculate

Sometimes too simple

Model specification can be difficult

A density ratio framework

Density ratios¹ as a utility measure

\[r(x) = \frac{p(\boldsymbol{X}_{\text{obs}})}{p(\boldsymbol{X}_{\text{syn}})}\]

Let’s go back to the observation that synthetic data is of high quality if its distribution is similar to the distribution of the observed data, that is, if we cannot distinguish the two distributions. We can express this as a ratio of the two densities. If the ratio is large, there are too few synthetic data points in a region with many observed data points; if the ratio is small, there are too many synthetic observations in a region with relatively few observed cases. The ratio can be evaluated at the univariate level, variable by variable, but it can also be estimated for the multivariate distributions of the observed and synthetic data.

The density ratio can be estimated by estimating the densities of the observed and synthetic data separately and then taking their ratio. However, estimation errors are then made for both densities, and taking the ratio of the estimated densities magnifies these errors. Research in this field has shown that better estimates are obtained by estimating the density ratio directly, without estimating the two densities separately.
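To make the interpretation of the ratio concrete, here is a small sketch in which both densities are known, so no estimation is involved. The two normal distributions are an illustrative assumption: the "synthetic" distribution has the right mean but is too spread out.

```r
# Minimal sketch: a density ratio with known densities.
# Assumption: observed data ~ N(0, 1), synthetic data ~ N(0, 2).
x <- seq(-4, 4, by = 0.5)

p_obs <- dnorm(x, mean = 0, sd = 1)
p_syn <- dnorm(x, mean = 0, sd = 2)

# r(x) > 1: too few synthetic points; r(x) < 1: too many synthetic points
r <- p_obs / p_syn

round(data.frame(x = x, r = r), 3)
```

In the centre the ratio exceeds one, because the synthetic distribution places too little mass there; in the tails it falls below one, because the synthetic distribution places too much mass there.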

Density ratios

Estimating density ratios

Estimate the density ratio using a non-parametric method

Implemented in the R-package densityratio.

Calculate a discrepancy measure for the synthetic data (Kullback-Leibler divergence, Pearson divergence)
Compare the divergence measure for different data sets
Optionally: Test the null hypothesis \(p(\boldsymbol{X}_{\text{syn}}) = p(\boldsymbol{X}_{\text{obs}})\) using a permutation test.
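For reference, the two discrepancy measures named above can be written in terms of the density ratio \(r(x)\). These are the standard textbook forms, given here as an aid to the reader; the exact estimators used by the package may differ in detail. \(\mathbb{E}_{\text{obs}}\) and \(\mathbb{E}_{\text{syn}}\) denote expectations over the observed and synthetic data distributions, respectively.

\[\text{PE} = \frac{1}{2} \mathbb{E}_{\text{syn}}\left[(r(x) - 1)^2\right], \qquad \text{KL} = \mathbb{E}_{\text{obs}}\left[\log r(x)\right]\]

Both measures equal zero when \(r(x) = 1\) everywhere, that is, when the two distributions coincide, and grow as the distributions move apart.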

Estimating density ratios in `R`

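library(densityratio)
# `syn`: the synthetic rows generated with mice (assumed to be extracted beforehand)
# numerator = observed data, denominator = synthetic data
dr <- ulsif(mtcars, syn)
# test = TRUE adds a permutation test of equal distributions
summary(dr, test = TRUE)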


Call:
ulsif(df_numerator = mtcars, df_denominator = syn)

Kernel Information:
  Kernel type: Gaussian with L2 norm distances
  Number of kernels: 32

Optimal sigma: 1.200419
Optimal lambda: 0.1623777
Optimal kernel weights (loocv): num [1:33] 0.831 0.185 0.149 -0.14 0.324 ...
 
Pearson divergence between P(nu) and P(de): 0.2304
Pr(P(nu)=P(de)) =  0.38
Bonferroni-corrected for testing with r(x) = P(nu)/P(de) AND r*(x) = P(de)/P(nu).

Estimating density ratios in `R`

Density ratios for synthetic data (univariate)

Density ratios for synthetic data (multivariate)

Observed data distribution

\[
\begin{aligned}
X_{1:4} &\sim \mathcal{MVN}(\boldsymbol{\mu}, \boldsymbol{\Sigma}), \quad X_5 \sim \mathcal{N}(X_1^2, V[X_1^2]) \\
X_{1:20} &\sim \mathcal{MVN}(\boldsymbol{\mu}, \boldsymbol{\Sigma}), \quad X_{20+i} \sim \mathcal{N}(X_i^{(i+1)}, V[X_i^{(i+1)}]) \quad \text{for } i \in \{1, \dots, 5\} \\
&\boldsymbol{\mu} = \begin{bmatrix} 0 \\ \vdots \\ 0 \end{bmatrix}, \quad
\boldsymbol{\Sigma} = \begin{bmatrix} 1 & & & \\ 0.5 & 1 & & \\ \vdots & \ddots & 1 & \\ 0.5 & \cdots & 0.5 & 1 \end{bmatrix}
\end{aligned}
\]

Synthetic data distributions: (1.) Uncorrelated multivariate normal, (2.) correlated multivariate normal, (3.) correct distribution.

Density ratios for synthetic data (multivariate)

Proportion of simulations in which the estimated density ratios and \(pMSE\) values rank the synthetic data sets correctly (in terms of quality).

N	P	PE	pMSE
500	5	0.80	0.74
500	25	0.97	0.75
2500	5	1.00	0.97
2500	25	1.00	0.75

U.S. Current Population Survey (n = 5000)¹

Four continuous variables (age, income, social security payments, household property taxes)
Four categorical variables (gender, race, marital status, level of education)

Synthetic data models

Categorical variables: (multinomial) logistic regression

Continuous variables:

Linear regression
Linear regression with transformations (cube root)
Linear regression with transformations and semi-continuous modeling

U.S. Current Population Survey

Utility of the synthetic data

Additional advantages of density ratios

Utility scores for individual data points

Reweighting synthetic data analyses

These “utility scores” can be used for reweighting synthetic data analyses.

Method	Intercept	Slope
Observed	0.0033124	0.4566900
Synthetic	0.0100967	-0.0035725
Reweighted	-0.0230975	0.4169210
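The idea behind reweighting can be sketched with a toy example in which the density ratio is known analytically. This is an assumption made for illustration: in practice the weights would be estimated, and the setup below (correlation `rho`, sample size, object names) is hypothetical rather than the simulation behind the table above.

```r
# Minimal sketch: reweighting a synthetic-data regression with density ratio weights.
# Assumptions: observed data are bivariate normal with correlation rho;
# the synthetic data ignore the correlation; the true density ratio is used.
set.seed(42)
n <- 5000
rho <- 0.5

# Synthetic data: independent standard normals
x <- rnorm(n)
y <- rnorm(n)

# Log densities of the observed and synthetic distributions at the synthetic points
log_p_obs <- -log(2 * pi) - 0.5 * log(1 - rho^2) -
  (x^2 - 2 * rho * x * y + y^2) / (2 * (1 - rho^2))
log_p_syn <- dnorm(x, log = TRUE) + dnorm(y, log = TRUE)

# Density ratio weights r(x, y) = p_obs / p_syn
w <- exp(log_p_obs - log_p_syn)

# Unweighted versus reweighted regression slope
slope_syn <- coef(lm(y ~ x))[["x"]]
slope_rw  <- coef(lm(y ~ x, weights = w))[["x"]]

c(synthetic = slope_syn, reweighted = slope_rw, target = rho)
```

Synthetic points in regions that are underrepresented relative to the observed distribution receive larger weights, which pulls the weighted slope towards the slope in the observed distribution.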

Automatic hyperparameter selection

Automatic cross-validation implemented in the package
No model specification required

Extensions for high-dimensional data

Dimension reduction: estimate the density ratio in a lower-dimensional subspace.
Supervised dimension reduction: estimate the density ratio in a subspace where the observed and synthetic data are most different.

Thanks for your attention!

Questions?

In case of further questions, please reach out!

t.b.volker@uu.nl
