Evaluating the quality of synthetic data

A density ratio approach

Thom Benjamin Volker

Imagine …

That would be a privacy disaster!

Who am I?


Thom Volker (t.b.volker@uu.nl)

  • MSc in Methodology and Statistics & Sociology

  • PhD candidate at Utrecht University and Statistics Netherlands

    • Project: Contributing to the development of safe and high-quality synthetic data

Open materials

This presentation can be found online at

https://thomvolker.github.io/synth-utility

All source code and data can be found at

https://github.com/thomvolker/synth-utility

Synthetic data

Fake data, generated data, simulated data, digital twins

Creating synthetic data

Synthetic data is created with a generative model

\[p(\boldsymbol{X} | \theta)\]

  • A model \(p\) for the data \(\boldsymbol{X}\);

  • With parameters \(\theta\);

  • Estimated on real data

Definition

Generative models learn the distribution of the data \(\boldsymbol{X}\) given the parameters \(\theta\).

Examples of generative models

A normal distribution with parameters \(\theta = \{\mu, \sigma\}\).

  • In R: rnorm(n = 100, mean = 1, sd = 2)


A histogram with bins and proportions.


Sequential prediction models for a multivariate distribution.


A neural network with thousands of parameters.
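The first two examples can be sketched in a few lines of base R. This is only an illustration: `x_obs` is a hypothetical stand-in for a real variable, and the histogram sampler (pick a bin, then draw uniformly within it) is one simple way to turn bins and proportions into a generative model.

```r
set.seed(1)

# Hypothetical "real" data, standing in for an observed variable
x_obs <- rnorm(n = 100, mean = 1, sd = 2)

# Normal model: estimate theta = {mu, sigma} on the real data, then sample
mu_hat    <- mean(x_obs)
sigma_hat <- sd(x_obs)
x_syn_normal <- rnorm(n = length(x_obs), mean = mu_hat, sd = sigma_hat)

# Histogram model: estimate the bin proportions, sample a bin for each
# synthetic record, then draw uniformly within the sampled bin
h    <- hist(x_obs, plot = FALSE)
prop <- h$counts / sum(h$counts)
bin  <- sample(seq_along(prop), size = length(x_obs), replace = TRUE, prob = prop)
x_syn_hist <- runif(length(x_obs), min = h$breaks[bin], max = h$breaks[bin + 1])
```

In both cases the parameters (mean and standard deviation, or bin proportions) are estimated on the real data, and the synthetic values are fresh draws from the fitted model.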

Generating synthetic data with mice

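library(mice)

# Stack an all-missing copy of mtcars under the observed data;
# mice fills in these rows, which then serve as the synthetic data
emptydat <- mtcars
emptydat[1:nrow(mtcars), 1:ncol(mtcars)] <- NA
dat <- rbind(mtcars, emptydat)

fit <- mice(
  dat,
  m = 1,      # one synthetic data set
  maxit = 1,  # one iteration (the observed rows are complete)
  # each variable is predicted from the variables that precede it
  predictorMatrix = lower.tri(diag(ncol(dat))),
  # estimate the models on the observed rows only
  ignore = rep(c(FALSE, TRUE), each = nrow(mtcars)),
  seed = 123
)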

Generating synthetic data with mice

Observed data

               mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1


Synthetic data

                mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX41     15.2   8 275.8 175 3.21 3.780 18.52  0  0    3    2
Mazda RX4 Wag1 27.3   4 120.3  95 4.43 1.513 17.40  1  1    4    1
Datsun 7101    15.2   8 351.0 150 3.07 3.840 18.52  0  0    3    2
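The slides do not show how the synthetic rows are pulled out of the fitted object. One plausible way, assuming the `mice` package is installed, is to call `complete()` and keep the second half of the stacked data. The object name `syn` is an assumption, and `printFlag = FALSE` is added here only to silence the iteration log.

```r
library(mice)

# Same setup as on the slide: mtcars stacked on an all-missing copy
emptydat <- mtcars
emptydat[1:nrow(mtcars), 1:ncol(mtcars)] <- NA
dat <- rbind(mtcars, emptydat)

fit <- mice(
  dat,
  m = 1,
  maxit = 1,
  predictorMatrix = lower.tri(diag(ncol(dat))),
  ignore = rep(c(FALSE, TRUE), each = nrow(mtcars)),
  seed = 123,
  printFlag = FALSE
)

# complete() returns the stacked data with the missing cells filled in;
# dropping the first nrow(mtcars) rows leaves only the synthetic records
syn <- complete(fit)[-seq_len(nrow(mtcars)), ]
head(syn, 3)
```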

Generating synthetic data is easy

But generating high-quality synthetic data is hard!

The synthetic data generation cycle

  1. Create synthetic data with simple models

  2. Evaluate the quality of the synthetic data

  3. Add complexities where necessary (transformations, interactions, non-linear relations)

  4. Iterate between (2.) and (3.) until the synthetic data is of sufficient quality

Evaluating the utility of synthetic data

Intuitively

  • Can we use the synthetic data for the same purposes as the real data?

  • Does the synthetic data have the same properties as the real data?

Practically

  • Do analyses on the synthetic data yield similar results as on the real data?

  • Can we distinguish the synthetic data from the real data?

The utility of synthetic data depends on what it’s used for

But we rarely know what the data will be used for…

Global utility measures

If the synthetic and observed data have similar distributions, analyses on either should yield similar results.

Existing global utility measures: \(pMSE\)

  1. Bind synthetic and observed data together

  2. Predict for each observation the probability \(\pi_i\) that it is synthetic

  3. Calculate \(pMSE\) as \(\sum^N_{i=1} (\pi_i - c)^2/N\), with \(N = n_{\text{syn}} + n_{\text{obs}}\) and \(c = n_{\text{syn}} / N\)

  4. Compare \(pMSE\) with the expected value under a correct generative model
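The four steps above can be sketched in base R as follows. The data are hypothetical (a synthetic set that ignores the correlation between two variables), and the logistic propensity model with an interaction term is just one possible choice. The null expectation in step 4 is the expression commonly given in the \(pMSE\) literature for a logistic model with \(k\) coefficients; it does not appear on these slides and is included here as an assumption.

```r
set.seed(123)

# Hypothetical observed data and a deliberately poor synthetic copy that
# ignores the relation between x1 and x2 (illustration only)
n   <- 500
obs <- data.frame(x1 = rnorm(n))
obs$x2 <- 0.5 * obs$x1 + rnorm(n, sd = sqrt(0.75))
syn <- data.frame(x1 = rnorm(n), x2 = rnorm(n))

# 1. Bind synthetic and observed data, with an indicator for "synthetic"
stacked <- rbind(cbind(obs, synthetic = 0), cbind(syn, synthetic = 1))

# 2. Predict for each record the probability that it is synthetic
prop_model <- glm(synthetic ~ x1 * x2, family = binomial, data = stacked)
pi_hat     <- fitted(prop_model)

# 3. pMSE: mean squared deviation of pi_i from the synthetic share c
N     <- nrow(stacked)
c_syn <- nrow(syn) / N
pmse  <- sum((pi_hat - c_syn)^2) / N

# 4. Compare with the expected value under a correct generative model
#    (assumed formula for a logistic model with k coefficients)
k         <- length(coef(prop_model))
pmse_null <- (k - 1) * (1 - c_syn)^2 * c_syn / N
pmse / pmse_null
```

A ratio well above 1 suggests that the propensity model can tell the synthetic records from the observed ones. Note that the result depends on the propensity model: without the interaction term, this particular misfit would go largely undetected.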

\(pMSE\)

Intuitive and flexible, easy to calculate

Sometimes too simple

Model specification can be difficult

A density ratio framework

Density ratios as a utility measure


\[r(x) = \frac{p(\boldsymbol{X}_{\text{obs}})}{p(\boldsymbol{X}_{\text{syn}})}\]
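To build some intuition, the ratio can be evaluated directly when both densities are known. The sketch below assumes two hypothetical univariate normal densities; in practice neither density is known and the ratio has to be estimated.

```r
# Hypothetical densities: observed data N(0, 1), synthetic data N(0, 1.5^2)
p_obs <- function(x) dnorm(x, mean = 0, sd = 1)
p_syn <- function(x) dnorm(x, mean = 0, sd = 1.5)

# Density ratio r(x) = p_obs(x) / p_syn(x)
r <- function(x) p_obs(x) / p_syn(x)

r(c(-3, 0, 3))
```

Here the ratio exceeds 1 around zero, where the synthetic data are too sparse, and drops below 1 in the tails, where the synthetic data are too spread out. A ratio of 1 everywhere would mean the two distributions coincide.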

Density ratios

Estimating density ratios

  1. Estimate the density ratio using a non-parametric method

  2. Calculate a discrepancy measure for the synthetic data (Kullback-Leibler divergence, Pearson divergence)

  3. Compare the divergence measure for different data sets

  4. Optionally: Test the null hypothesis \(p(\boldsymbol{X}_{\text{syn}}) = p(\boldsymbol{X}_{\text{obs}})\) using a permutation test.
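The discrepancy measures named in the steps above have standard definitions in terms of the density ratio. Writing \(p_{\text{obs}}\) and \(p_{\text{syn}}\) for the observed and synthetic data densities, and \(r(x) = p_{\text{obs}}(x) / p_{\text{syn}}(x)\), the textbook forms are given below; the estimators used by any particular package may differ in detail.

\[ \begin{aligned} \text{KL}(p_{\text{obs}} \,\|\, p_{\text{syn}}) &= \int p_{\text{obs}}(x) \log r(x) \, dx, \\ \text{PE}(p_{\text{obs}} \,\|\, p_{\text{syn}}) &= \frac{1}{2} \int \big(r(x) - 1\big)^2 \, p_{\text{syn}}(x) \, dx. \end{aligned} \]

Both are zero when \(r(x) = 1\) everywhere and grow as the two distributions move apart, which is why smaller values indicate higher utility.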

Estimating density ratios in R

library(densityratio)
dr <- ulsif(mtcars, syn)
summary(dr, test = TRUE)

Call:
ulsif(df_numerator = mtcars, df_denominator = syn)

Kernel Information:
  Kernel type: Gaussian with L2 norm distances
  Number of kernels: 32

Optimal sigma: 1.200419
Optimal lambda: 0.1623777
Optimal kernel weights (loocv): num [1:33] 0.831 0.185 0.149 -0.14 0.324 ...
 
Pearson divergence between P(nu) and P(de): 0.2304
Pr(P(nu)=P(de)) =  0.38
Bonferroni-corrected for testing with r(x) = P(nu)/P(de) AND r*(x) = P(de)/P(nu).

Estimating density ratios in R

Density ratios for synthetic data (univariate)

Density ratios for synthetic data (multivariate)

Observed data distribution

\[ \begin{aligned} X_{1:4} &\sim \mathcal{MVN}(\mathbf{\mu}, \mathbf{\Sigma}), ~~ X_5 \sim \mathcal{N}(X_1^2, V[X_1^2])\\ X_{1:20} &\sim \mathcal{MVN}(\mathbf{\mu}, \mathbf{\Sigma}), ~~ X_{20+i} \sim \mathcal{N}(X_i^{(i+1)}, V[X_i^{(i+1)}]) ~~~~~~~~ \text{for } i \in 1, \dots, 5 \\ &\mathbf{\mu} = \begin{bmatrix} 0 \\ \vdots \\ 0 \end{bmatrix}, \mathbf{\Sigma} = \begin{bmatrix} 1 & & & \\ 0.5 & 1 & & \\ \vdots & \ddots & 1 & \\ 0.5 & \cdots & 0.5 & 1 \end{bmatrix} \end{aligned} \]

Synthetic data distributions: (1.) Uncorrelated multivariate normal, (2.) correlated multivariate normal, (3.) correct distribution.
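A rough sketch of the five-variable case in base R is given below. Several choices are assumptions rather than details from the slides: the second argument of \(\mathcal{N}(\cdot, \cdot)\) is read as a variance, with \(V[X_1^2] = 2\) for a standard normal \(X_1\); the two normal synthesizers are fitted to the observed means and covariances; and `rmvn()` is a hypothetical helper.

```r
set.seed(42)

# Hypothetical helper: draw n rows from a multivariate normal using the
# Cholesky factor of Sigma
rmvn <- function(n, mu, Sigma) {
  z <- matrix(rnorm(n * length(mu)), nrow = n)
  sweep(z %*% chol(Sigma), 2, mu, `+`)
}

n     <- 500
Sigma <- matrix(0.5, nrow = 4, ncol = 4)
diag(Sigma) <- 1

# Observed data: X1..X4 correlated normal, X5 centred at X1^2 with
# variance 2 (the variance of a squared standard normal)
obs <- rmvn(n, mu = rep(0, 4), Sigma = Sigma)
obs <- cbind(obs, rnorm(n, mean = obs[, 1]^2, sd = sqrt(2)))

# Synthetic data 1 and 2: normal models fitted to the observed data,
# without and with the estimated covariances (assumed setup)
m <- colMeans(obs)
S <- cov(obs)
syn_uncorrelated <- rmvn(n, mu = m, Sigma = diag(diag(S)))
syn_correlated   <- rmvn(n, mu = m, Sigma = S)

# Synthetic data 3: a fresh draw from the true data-generating process
syn_correct <- rmvn(n, mu = rep(0, 4), Sigma = Sigma)
syn_correct <- cbind(syn_correct,
                     rnorm(n, mean = syn_correct[, 1]^2, sd = sqrt(2)))
```

A good utility measure should rank these three synthetic sets from worst to best in the order listed.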

Density ratios for synthetic data (multivariate)

Proportion of simulations in which the density ratio-based Pearson divergence (PE) and the \(pMSE\) rank the synthetic data sets correctly (in terms of quality), by sample size N and number of variables P.

N     P   PE    pMSE
500   5   0.80  0.74
500   25  0.97  0.75
2500  5   1.00  0.97
2500  25  1.00  0.75

U.S. Current Population Survey (n = 5000)

  • Four continuous variables (age, income, social security payments, household property taxes)
  • Four categorical variables (gender, race, marital status, level of education)

Synthetic data models

Categorical variables: (multinomial) logistic regression

Continuous variables:

  1. Linear regression
  2. Linear regression with transformations (cube root)
  3. Linear regression with transformations and semi-continuous modeling

U.S. Current Population Survey

Utility of the synthetic data

Additional advantages of density ratios

Utility scores for individual data points

Reweighting synthetic data analyses

  • These “utility scores” can be used for reweighting synthetic data analyses.

Method      Intercept    Slope
Observed     0.0033124    0.4566900
Synthetic    0.0100967   -0.0035725
Reweighted  -0.0230975    0.4169210
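The idea behind reweighting can be illustrated with a small base R sketch. It is not the analysis behind the table above: the data are hypothetical, and the density ratio is computed from known densities (bivariate standard normal with correlation `rho` for the observed data, independent standard normals for the synthetic data), whereas in practice the ratio would be estimated.

```r
set.seed(2024)
n   <- 10000
rho <- 0.5

# Hypothetical synthetic data from a model that ignores the x-y relation
syn <- data.frame(x = rnorm(n), y = rnorm(n))

# Known density ratio p_obs / p_syn at each synthetic record
# (assumption: the observed data are bivariate normal with correlation rho)
r <- exp(-(syn$x^2 - 2 * rho * syn$x * syn$y + syn$y^2) / (2 * (1 - rho^2)) +
           (syn$x^2 + syn$y^2) / 2) / sqrt(1 - rho^2)

# Unweighted versus density-ratio-weighted regression on the synthetic data
fit_syn <- lm(y ~ x, data = syn)
fit_rw  <- lm(y ~ x, data = syn, weights = r)

rbind(Synthetic = coef(fit_syn), Reweighted = coef(fit_rw))
```

The unweighted slope should be close to zero, while the weighted slope should move towards `rho`, the slope in the observed-data distribution. Records that look more like the observed data receive more weight.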

Automatic hyperparameter selection

  • Automatic cross-validation implemented in the package

  • No model specification required

Extensions for high-dimensional data

  • Dimension reduction: estimate the density ratio in a lower-dimensional subspace.

  • Supervised dimension reduction: estimate the density ratio in a subspace where the observed and synthetic data are most different.

Thanks for your attention!

Questions?



In case of further questions, please reach out!

t.b.volker@uu.nl