maybe synthetic data is!
Fake data, generated data, simulated data, digital twins
Advancing access to private data for research (e.g., in statistical institutes)
Advancing open science workflows
Educational materials
Software / model testing
1. Create synthetic data with simple models
2. Evaluate the quality of the synthetic data
3. If necessary, add complexity (alter models, transformations, interactions)
4. Iterate between (2.) and (3.) until the synthetic data is good enough
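As a rough sketch of this loop, the Python example below uses a log-normal sample as a stand-in for the observed data (an assumption, not data from this talk). It starts with a plain normal model and moves on to a normal model on the log scale when the two-sample Kolmogorov-Smirnov statistic stays above a cut-off. The KS statistic is only one possible quality measure, and the names `fit_normal`, `fit_lognormal` and `threshold` are hypothetical.

```python
import math
import random
import statistics


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the ECDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap


def fit_normal(data):
    """Step 1: the simplest model, a normal distribution on the raw scale."""
    mu, sigma = statistics.fmean(data), statistics.stdev(data)
    return lambda rng: rng.gauss(mu, sigma)


def fit_lognormal(data):
    """Step 3: added complexity, a normal distribution after a log transform."""
    logs = [math.log(v) for v in data]
    mu, sigma = statistics.fmean(logs), statistics.stdev(logs)
    return lambda rng: math.exp(rng.gauss(mu, sigma))


rng = random.Random(1)
# Stand-in for the observed data (assumption): a skewed, strictly positive variable.
observed = [math.exp(rng.gauss(0.0, 0.75)) for _ in range(2000)]

candidates = [("normal", fit_normal), ("normal on log scale", fit_lognormal)]
threshold = 0.05  # hypothetical "good enough" cut-off for the KS statistic
history = []
for name, fit in candidates:
    draw = fit(observed)                              # step 1 / step 3
    synthetic = [draw(rng) for _ in range(2000)]
    ks = ks_statistic(observed, synthetic)            # step 2
    history.append((name, ks))
    if ks < threshold:                                # step 4: stop when good enough
        break

for name, ks in history:
    print(f"{name}: KS = {ks:.3f}")
```

The list of candidate models plays the role of "adding complexity": each entry is tried only if the previous, simpler one was not good enough.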
Can we use the synthetic data for the same purposes as we wanted to use the real data?
Do the observed and synthetic data have similar distributions?
Do the observed and synthetic data produce similar results under the same analysis?
Can we distinguish between the observed and synthetic data?
$$r(x) = \frac{p(\boldsymbol{X}_{\text{syn}})}{p(\boldsymbol{X}_{\text{obs}})}$$
R package `densityratio`
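To make the density ratio concrete, here is a minimal Python sketch in which both densities are known normal distributions. That is an assumption made purely for illustration: in practice the ratio has to be estimated from the two samples, which is what density ratio estimation software is for. The parameter values are hypothetical.

```python
from statistics import NormalDist

# Assumed, known densities (in practice these are unknown).
p_obs = NormalDist(mu=0.0, sigma=1.0)  # observed data
p_syn = NormalDist(mu=0.5, sigma=1.0)  # synthetic data, shifted to the right


def density_ratio(x):
    """r(x) = p_syn(x) / p_obs(x), following the definition above."""
    return p_syn.pdf(x) / p_obs.pdf(x)


for x in (-1.0, 0.25, 2.0):
    print(f"x = {x:5.2f}  r(x) = {density_ratio(x):.3f}")
```

With these two densities the ratio equals 1 at x = 0.25, where the curves cross. It is below 1 where the synthetic data under-represents a region and above 1 where it over-represents it. A ratio close to 1 everywhere would indicate synthetic data that closely mimics the observed data.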
Compare discrepancy measures for different data sets
Optionally: Test the null hypothesis $p(\boldsymbol{X}_{\text{syn}}) = p(\boldsymbol{X}_{\text{obs}})$
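One generic way to test such a null hypothesis is a permutation test: pool both samples, reshuffle the labels many times, and check how often the reshuffled discrepancy is at least as large as the one actually observed. The Python sketch below illustrates the idea under assumptions. It uses the Kolmogorov-Smirnov statistic as a simple stand-in discrepancy rather than a density-ratio-based divergence, and the data sets are simulated.

```python
import random


def ks_statistic(a, b):
    """Stand-in discrepancy: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap


def permutation_p_value(obs, syn, discrepancy, n_perm=500, seed=0):
    """P-value for H0: both samples come from the same distribution."""
    rng = random.Random(seed)
    observed_stat = discrepancy(obs, syn)
    pooled = list(obs) + list(syn)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # under H0 the group labels are exchangeable
        stat = discrepancy(pooled[:len(obs)], pooled[len(obs):])
        if stat >= observed_stat:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)


rng = random.Random(7)
obs = [rng.gauss(0.0, 1.0) for _ in range(200)]       # simulated "observed" data
syn_good = [rng.gauss(0.0, 1.0) for _ in range(200)]  # same distribution
syn_bad = [rng.gauss(0.8, 1.0) for _ in range(200)]   # shifted distribution

p_good = permutation_p_value(obs, syn_good, ks_statistic)
p_bad = permutation_p_value(obs, syn_bad, ks_statistic)
print(f"good synthetic data: p = {p_good:.3f}")
print(f"bad synthetic data:  p = {p_bad:.3f}")
```

Because `discrepancy` is passed in as a function, the same routine could be reused to compare several synthetic data sets against the observed data with whatever discrepancy measure is preferred.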
(Multinomial) logistic regression for categorical variables
| Coefficient | Observed | Synthetic | Reweighted |
|---|---|---|---|
| b0 | -0.0179771 | 0.0203401 | -0.0189604 |
| b1 | 0.5152371 | 0.0385202 | 0.4779748 |
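The pattern in this table (a synthetic slope near zero that moves back toward the observed slope after reweighting) can be reproduced with importance weights. The Python sketch below is an illustration under assumptions, not the analysis behind the table. Both data sets are simulated, and the weights use the true densities as p_obs / p_syn, the inverse of the ratio defined earlier. In practice the weights would be estimated.

```python
import random
from statistics import NormalDist


def wls(x, y, w):
    """Weighted least-squares intercept and slope for y = b0 + b1 * x."""
    sw = sum(w)
    xm = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ym = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - xm) * (yi - ym) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xm) ** 2 for wi, xi in zip(w, x))
    b1 = sxy / sxx
    return ym - b1 * xm, b1


rng = random.Random(42)
n = 5000
slope, resid_sd = 0.5, 0.75 ** 0.5  # chosen so that y has unit variance

# Simulated "observed" data: y depends on x.
x_obs = [rng.gauss(0, 1) for _ in range(n)]
y_obs = [slope * xi + rng.gauss(0, resid_sd) for xi in x_obs]

# Simulated "synthetic" data: correct marginals, but x and y independent.
x_syn = [rng.gauss(0, 1) for _ in range(n)]
y_syn = [rng.gauss(0, 1) for _ in range(n)]

std_normal = NormalDist()


def weight(xi, yi):
    """p_obs(x, y) / p_syn(x, y); the shared density of x cancels."""
    return NormalDist(slope * xi, resid_sd).pdf(yi) / std_normal.pdf(yi)


w = [weight(xi, yi) for xi, yi in zip(x_syn, y_syn)]
ones = [1.0] * n

b_obs = wls(x_obs, y_obs, ones)  # analysis on the observed data
b_syn = wls(x_syn, y_syn, ones)  # same analysis on the synthetic data
b_rw = wls(x_syn, y_syn, w)      # synthetic data, reweighted
for label, (b0, b1) in (("observed", b_obs), ("synthetic", b_syn), ("reweighted", b_rw)):
    print(f"{label:>10}: b0 = {b0: .3f}, b1 = {b1: .3f}")
```

Reweighting can only recover structure in regions where the synthetic data has some mass, so it complements a better synthesis model rather than replacing it.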
Use density ratios to discard synthetic outliers
High-dimensional extensions
Find an m-dimensional subspace (m < p) in which the synthetic and observed data are maximally different
Estimate the density ratio in this subspace
Automatic cross-validation for hyperparameter selection
Even if it was simulated…
Questions?