Practical 2: Evaluating utility and privacy of synthetic data

Fake it ’till you make it: Generating synthetic data with high utility in R

Author

Thom Volker & Erik-Jan van Kesteren


Note. This practical builds on Practical 1, and assumes you have completed all of its exercises.


Synthetic data utility

The quality of synthetic data sets can be assessed on multiple levels and in multiple different ways (e.g., quantitatively, but also visually). Starting on a univariate level, the distributions of the synthetic data sets can be compared with the distribution of the observed data. For categorical variables, the observed counts in each category can be compared between the real and synthetic data. For continuous variables, the density of the real and synthetic data can be compared. Later on, we also look at the utility of the synthetic data on a multivariate level.
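These univariate comparisons can be sketched with a few lines of base R. The snippet below uses simulated stand-in data frames (`observed` and `synthetic`), not the objects from Practical 1, purely to illustrate the two comparison types:

```r
# Univariate comparison sketch: counts per category for a categorical variable,
# overlaid densities for a continuous one (toy data, not the practical's objects).
set.seed(1)
observed  <- data.frame(sex = sample(c("Female", "Male"), 299, replace = TRUE),
                        age = rnorm(299, mean = 60, sd = 12))
synthetic <- data.frame(sex = sample(c("Female", "Male"), 299, replace = TRUE),
                        age = rnorm(299, mean = 60, sd = 12))

# Categorical: compare observed counts in each category
counts <- rbind(Observed  = table(observed$sex),
                Synthetic = table(synthetic$sex))
counts

# Continuous: compare densities
plot(density(observed$age), main = "age: observed (solid) vs. synthetic (dashed)")
lines(density(synthetic$age), lty = 2)
```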


Univariate data utility

1. To get an idea of whether the synthesis went as intended, compare the first 10 rows of the original data with the first 10 rows of the synthetic data sets (inspect both the parametric and the non-parametric set). Do you notice any differences?


Hint: You can extract the synthetic data from the synthetic data object by calling $syn on that object.

heart_failure |> head(10)
syn_param$syn |> head(10)
syn_nonparam$syn |> head(10)
age anaemia creatinine_phosphokinase diabetes ejection_fraction platelets serum_creatinine serum_sodium sex smoking hypertension deceased follow_up
75 No 582 No 20 265000 1.9 130 Male No Yes Yes 4
55 No 7861 No 38 263358 1.1 136 Male No No Yes 6
65 No 146 No 20 162000 1.3 129 Male Yes No Yes 7
50 Yes 111 No 20 210000 1.9 137 Male No No Yes 7
65 Yes 160 Yes 20 327000 2.7 116 Female No No Yes 8
90 Yes 47 No 40 204000 2.1 132 Male Yes Yes Yes 8
75 Yes 246 No 15 127000 1.2 137 Male No No Yes 10
60 Yes 315 Yes 60 454000 1.1 131 Male Yes No Yes 10
65 No 157 No 65 263358 1.5 138 Female No No Yes 10
80 Yes 123 No 35 388000 9.4 133 Male Yes Yes Yes 10
age anaemia creatinine_phosphokinase diabetes ejection_fraction platelets serum_creatinine serum_sodium sex smoking hypertension deceased follow_up
53 Yes 178 Yes 19 367732.63 1.4222665 134 Male Yes No Yes 222
61 Yes -991 Yes 45 142856.18 1.5209304 147 Male Yes No Yes 113
50 Yes 2066 No 45 77936.18 2.2470844 140 Male No No No 167
40 Yes 1858 Yes 37 355969.52 0.8467097 145 Male Yes No No 152
50 No 117 No 14 209264.97 1.1433037 138 Male Yes No Yes 50
59 Yes 324 No 65 319616.23 -0.5504862 133 Male No No No 134
70 No -991 No 37 229258.91 1.9875611 141 Male No No No 166
65 No 1260 Yes 21 436055.22 4.2620808 135 Male Yes Yes Yes 48
75 Yes 1841 No 53 376515.15 1.9276125 134 Male Yes No No 123
78 Yes -1279 No 37 305902.66 0.1441727 140 Male No No No 181
age anaemia creatinine_phosphokinase diabetes ejection_fraction platelets serum_creatinine serum_sodium sex smoking hypertension deceased follow_up
53 No 207 No 60 263358 0.70 133 Male Yes Yes No 74
61 No 231 No 40 279000 0.90 136 Male Yes No No 76
50 No 250 Yes 35 302000 1.00 132 Male No No No 186
40 Yes 358 No 45 237000 1.18 134 Female No No No 186
50 No 582 No 60 194000 2.30 131 Female No No Yes 6
59 Yes 60 No 38 162000 1.00 136 Male Yes No No 71
70 No 81 Yes 35 427000 1.83 134 Female No Yes Yes 115
65 Yes 135 No 50 362000 0.80 142 Male No Yes No 146
75 Yes 855 No 38 149000 1.30 127 Male Yes No No 71
78 No 211 Yes 25 149000 1.10 142 Male No No No 180

You might notice that, with parametric synthesis models, some of the continuous variables are not rounded as in the original data. Additionally, there are negative values in the synthetic version of the variable creatinine_phosphokinase, while the original variable is strictly positive.

Neither of these issues occurs when using CART, because CART draws values from the observed data (and thus cannot create values that do not appear in the observed data).
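The mechanism behind the negative values can be illustrated with a toy example (simulated data, not the practical's objects): fitting a normal model to a strictly positive, right-skewed variable places part of the normal's left tail below zero.

```r
# Toy illustration: a normal synthesis model for a skewed, strictly positive
# variable can generate impossible (negative) values.
set.seed(123)
skewed <- rlnorm(299, meanlog = 5.5, sdlog = 1)  # strictly positive, right-skewed

# "Parametric synthesis" with a normal model on the raw scale:
syn_normal <- rnorm(299, mean = mean(skewed), sd = sd(skewed))

min(skewed)          # positive by construction
sum(syn_normal < 0)  # typically > 0: the normal left tail dips below zero
```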


Apart from inspecting the data itself, we can assess distributional similarity between the observed and synthetic data.


2. Compare the descriptive statistics from the synthetic data sets with the descriptive statistics from the observed data. What do you see?


Hint: Use the function describe() from the psych package to do this.

heart_failure |> 
  describe()

syn_param$syn |>
  describe()

syn_nonparam$syn |>
  describe()
Observed data
vars n mean sd median trimmed mad min max range skew kurtosis se
age 1 299 60.83 11.89 60.0 60.22 14.83 40.0 95.0 55.0 0.42 -0.22 0.69
anaemia* 2 299 1.43 0.50 1.0 1.41 0.00 1.0 2.0 1.0 0.28 -1.93 0.03
creatinine_phosphokinase 3 299 581.84 970.29 250.0 365.49 269.83 23.0 7861.0 7838.0 4.42 24.53 56.11
diabetes* 4 299 1.42 0.49 1.0 1.40 0.00 1.0 2.0 1.0 0.33 -1.90 0.03
ejection_fraction 5 299 38.08 11.83 38.0 37.43 11.86 14.0 80.0 66.0 0.55 0.00 0.68
platelets 6 299 263358.03 97804.24 262000.0 256730.09 65234.40 25100.0 850000.0 824900.0 1.45 6.03 5656.17
serum_creatinine 7 299 1.39 1.03 1.1 1.19 0.30 0.5 9.4 8.9 4.41 25.19 0.06
serum_sodium 8 299 136.63 4.41 137.0 136.82 4.45 113.0 148.0 35.0 -1.04 3.98 0.26
sex* 9 299 1.65 0.48 2.0 1.68 0.00 1.0 2.0 1.0 -0.62 -1.62 0.03
smoking* 10 299 1.32 0.47 1.0 1.28 0.00 1.0 2.0 1.0 0.76 -1.42 0.03
hypertension* 11 299 1.35 0.48 1.0 1.32 0.00 1.0 2.0 1.0 0.62 -1.62 0.03
deceased* 12 299 1.32 0.47 1.0 1.28 0.00 1.0 2.0 1.0 0.76 -1.42 0.03
follow_up 13 299 130.26 77.61 115.0 129.28 105.26 4.0 285.0 281.0 0.13 -1.22 4.49
Synthetic data (parametric)
vars n mean sd median trimmed mad min max range skew kurtosis se
age 1 299 60.33 12.57 60.00 59.47 14.83 40.00 95.00 55.00 0.53 -0.13 0.73
anaemia* 2 299 1.39 0.49 1.00 1.37 0.00 1.00 2.00 1.00 0.43 -1.82 0.03
creatinine_phosphokinase 3 299 493.29 1036.92 496.00 489.15 1009.65 -2172.00 4382.00 6554.00 0.14 0.21 59.97
diabetes* 4 299 1.42 0.49 1.00 1.40 0.00 1.00 2.00 1.00 0.33 -1.90 0.03
ejection_fraction 5 299 36.86 12.50 37.00 36.72 13.34 -1.00 69.00 70.00 0.07 -0.13 0.72
platelets 6 299 260279.71 97198.21 258297.06 259569.64 98523.49 -14121.19 503313.43 517434.62 0.05 -0.26 5621.12
serum_creatinine 7 299 1.41 1.09 1.47 1.44 1.09 -1.37 4.26 5.63 -0.17 -0.37 0.06
serum_sodium 8 299 136.32 4.79 136.00 136.27 4.45 123.00 149.00 26.00 0.07 -0.25 0.28
sex* 9 299 1.65 0.48 2.00 1.68 0.00 1.00 2.00 1.00 -0.62 -1.62 0.03
smoking* 10 299 1.35 0.48 1.00 1.31 0.00 1.00 2.00 1.00 0.64 -1.60 0.03
hypertension* 11 299 1.31 0.47 1.00 1.27 0.00 1.00 2.00 1.00 0.80 -1.37 0.03
deceased* 12 299 1.29 0.46 1.00 1.24 0.00 1.00 2.00 1.00 0.90 -1.20 0.03
follow_up 13 299 140.02 74.84 141.00 141.93 75.61 -85.00 359.00 444.00 -0.26 0.22 4.33
Synthetic data (non-parametric)
vars n mean sd median trimmed mad min max range skew kurtosis se
age 1 299 60.33 12.57 60.0 59.47 14.83 40.0 95 55.0 0.53 -0.13 0.73
anaemia* 2 299 1.41 0.49 1.0 1.39 0.00 1.0 2 1.0 0.34 -1.89 0.03
creatinine_phosphokinase 3 299 550.49 906.65 250.0 362.28 252.04 47.0 7702 7655.0 4.92 30.51 52.43
diabetes* 4 299 1.44 0.50 1.0 1.43 0.00 1.0 2 1.0 0.22 -1.96 0.03
ejection_fraction 5 299 38.02 11.53 38.0 37.11 10.38 14.0 80 66.0 0.74 0.72 0.67
platelets 6 299 273733.81 100656.21 263358.0 265781.36 81012.18 25100.0 742000 716900.0 1.11 2.80 5821.10
serum_creatinine 7 299 1.40 0.95 1.1 1.21 0.30 0.5 9 8.5 3.49 17.58 0.06
serum_sodium 8 299 136.06 4.98 136.0 136.41 4.45 113.0 145 32.0 -1.30 4.02 0.29
sex* 9 299 1.66 0.47 2.0 1.70 0.00 1.0 2 1.0 -0.67 -1.56 0.03
smoking* 10 299 1.32 0.47 1.0 1.28 0.00 1.0 2 1.0 0.75 -1.45 0.03
hypertension* 11 299 1.38 0.49 1.0 1.35 0.00 1.0 2 1.0 0.49 -1.77 0.03
deceased* 12 299 1.33 0.47 1.0 1.29 0.00 1.0 2 1.0 0.71 -1.49 0.03
follow_up 13 299 127.09 77.44 113.0 125.99 108.23 4.0 280 276.0 0.11 -1.25 4.48



The descriptive statistics are not identical, but come rather close in terms of means and standard deviations. Looking at the higher-order moments and the minimum and maximum, we see some noticeable differences for the parametrically synthesized data, but much less so for the non-parametrically synthesized data. We pay more attention to these issues when we visually inspect the synthetic data.


We will now visually compare the distributions of the observed and synthetic data, as this typically provides a more thorough understanding of the quality of the synthetic data.


3. Use compare() from the synthpop package to compare the distributions of the observed data and the parametric synthetic data set, setting the argument utility.stats = NULL. What do you see?


For now, ignore the table below the figures; we will come back to it later.

compare(syn_param, heart_failure)

Comparing percentages observed with synthetic


Selected utility measures:
                             pMSE    S_pMSE df
age                      0.000441  0.527833  4
anaemia                  0.000349  1.669216  1
creatinine_phosphokinase 0.067826 81.119499  4
diabetes                 0.000000  0.000000  1
ejection_fraction        0.013486 16.129066  4
platelets                0.006824  8.161064  4
serum_creatinine         0.054388 65.048276  4
serum_sodium             0.005093  6.090879  4
sex                      0.000000  0.000000  1
smoking                  0.000201  0.961608  1
hypertension             0.000381  1.822597  1
deceased                 0.000210  1.004831  1
follow_up                0.012051 14.413308  4

You might notice substantial differences between the distributions of some of the continuous variables. Especially for creatinine_phosphokinase, serum_creatinine and follow_up, the synthetic data does not capture the distribution of the observed data well. For the other variables, too, there are some discrepancies between the marginal distributions of the observed and synthetic data.

Of course, this could have been expected: some of the variables are highly skewed, while the current set of parametric models imposes a normal distribution on each variable. It is quite likely that we could have done a better job with more elaborate data manipulation (e.g., transforming variables so that their distributions correspond more closely to a normal distribution, and back-transforming afterwards).
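The transform/back-transform idea can be sketched as follows (toy data, not the practical's objects; with synthpop you would transform before calling syn() and back-transform afterwards):

```r
# Sketch: synthesize a skewed, positive variable on the log scale,
# then back-transform, so the synthetic values are strictly positive again.
set.seed(7)
x <- rlnorm(299, meanlog = 5.5, sdlog = 1)       # skewed original, strictly positive
z <- log(x)                                      # roughly normal on the log scale
z_syn <- rnorm(299, mean = mean(z), sd = sd(z))  # normal model fits well here
x_syn <- exp(z_syn)                              # back-transform
min(x_syn) > 0                                   # always TRUE: exp() is positive
```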

For the categorical variables, we seem to be doing a decent job on the marginal levels, as there are only small differences between the observed and synthetic frequencies in each level.


4. Use compare() from the synthpop package to compare the distributions of the observed data and the non-parametric synthetic data set, setting the argument utility.stats = NULL. What do you see?


Again, ignore the table below the figures; we will come back to it later.

compare(syn_nonparam, heart_failure)

Comparing percentages observed with synthetic


Selected utility measures:
                             pMSE   S_pMSE df
age                      0.000441 0.527833  4
anaemia                  0.000072 0.342556  1
creatinine_phosphokinase 0.000863 1.032628  4
diabetes                 0.000182 0.872595  1
ejection_fraction        0.000329 0.393494  4
platelets                0.001892 2.262856  4
serum_creatinine         0.001142 1.366083  4
serum_sodium             0.001286 1.538074  4
sex                      0.000028 0.132992  1
smoking                  0.000003 0.015301  1
hypertension             0.000244 1.167167  1
deceased                 0.000029 0.136973  1
follow_up                0.000246 0.294417  4

Using non-parametric synthesis models (i.e., CART), we do a much better job in recreating the shape of the original data. In fact, the marginal distributions are close to identical, including all irregularities in the original data.


There are also other, more formal, ways to assess the utility of synthetic data, although these methods have drawn some criticism (see, e.g., Drechsler 2022). Here, we discuss one such measure, the \(pMSE\), but there are others (utility measures tend to correlate strongly in general). The intuition behind the \(pMSE\) is to try to predict whether each record is actually observed or synthetic. If this prediction succeeds, the observed and synthetic data differ on at least one dimension, which makes it possible to distinguish between the two.

Formally, the \(pMSE\) is defined as \[ pMSE = \frac{1}{n_{obs} + n_{syn}} \sum^{n_{obs} + n_{syn}}_{i=1} \Big(\hat{\pi}_i - \frac{n_{syn}}{n_{obs} + n_{syn}}\Big)^2, \] which, in our case (where \(n_{obs} = n_{syn} = 299\)), simplifies to \[ pMSE = \frac{1}{598} \sum^{598}_{i=1} \Big(\hat{\pi}_i - 0.5\Big)^2, \] where \(n_{obs}\) and \(n_{syn}\) are the sample sizes of the observed and synthetic data, and \(\hat{\pi}_i\) is the predicted probability that record \(i\) belongs to the synthetic data.
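Computed by hand, the steps look as follows. This is a self-contained sketch with simulated stand-in data (not the practical's objects): stack the observed and synthetic records, fit a logistic regression predicting synthetic-data membership, and average the squared deviations of the predicted probabilities from 0.5:

```r
# Hand-rolled pMSE sketch (n_obs = n_syn, so the constant is 0.5).
set.seed(42)
observed  <- data.frame(x = rlnorm(299))                     # stand-in observed variable
synthetic <- data.frame(x = rnorm(299, mean = 1.6, sd = 2))  # stand-in synthetic version

stacked <- rbind(cbind(observed,  synth = 0),
                 cbind(synthetic, synth = 1))
fit    <- glm(synth ~ x, family = binomial, data = stacked)
pi_hat <- predict(fit, type = "response")

pMSE <- mean((pi_hat - 0.5)^2)
pMSE  # 0 means indistinguishable; 0.25 is the theoretical maximum
```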


5. Calculate the \(pMSE\) for the variable creatinine_phosphokinase for both synthetic sets and compare the values between both synthesis methods. Use a logistic regression model to create the probabilities \(\pi\). What do you see?


Hint: You can use the function utility.gen() and set the arguments method = "logit" (this denotes the model used to predict the probabilities), vars = "creatinine_phosphokinase" and maxorder = 0 (which denotes that we don’t want to specify interactions, as we only have a single variable here).

utility.gen(syn_param, 
            heart_failure, 
            method = "logit", 
            vars = "creatinine_phosphokinase",
            maxorder = 0)

utility.gen(syn_nonparam, 
            heart_failure, 
            method = "logit", 
            vars = "creatinine_phosphokinase",
            maxorder = 0)

Utility score calculated by method: logit

Call:
utility.gen.synds(object = syn_param, data = heart_failure, method = "logit", 
    maxorder = 0, vars = "creatinine_phosphokinase")

Selected utility measures
    pMSE   S_pMSE 
0.000487 2.328462 

Utility score calculated by method: logit

Call:
utility.gen.synds(object = syn_nonparam, data = heart_failure, 
    method = "logit", maxorder = 0, vars = "creatinine_phosphokinase")

Selected utility measures
    pMSE   S_pMSE 
0.000070 0.334188 

The \(pMSE\) is about seven times higher for the parametrically synthesized data set.

It can be hard to interpret the values of the \(pMSE\), because they say little about how useful the synthetic data is in general. To get a more insightful measure, we can take the ratio of the calculated \(pMSE\) over the expected \(pMSE\) under the null distribution of a correct synthesis model (i.e., a model in line with the data-generating model). The \(pMSE\) ratio is given by \[ \begin{aligned} pMSE \text{ ratio } &= \frac{pMSE} {(k-1)(\frac{n_{\text{obs}}}{n_{\text{syn}} + n_{\text{obs}}})^2(\frac{n_{\text{syn}}}{n_{\text{syn}} + n_{\text{obs}}}) / (n_{\text{obs}} + n_{\text{syn}})} \\ &= \frac{pMSE}{(k-1)(\frac{1}{2})^3/(n_{obs} + n_{syn})}, \end{aligned} \] where \(k\) denotes the number of predictors in the propensity score model, including the intercept. Note that this formulation only holds for a \(pMSE\) obtained through logistic regression. When different methods are used to calculate the probabilities, the null distribution can be obtained with a permutation test.
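We can check this formula against the logistic-regression results reported above (a small sketch; here \(k = 2\), because the propensity model contains an intercept and a single predictor):

```r
# Expected pMSE under the null for a logistic propensity model,
# and the resulting pMSE ratios (pMSE values taken from the output above).
null_pmse <- function(k, n_obs, n_syn) {
  (k - 1) * (1 / 2)^3 / (n_obs + n_syn)
}

expected <- null_pmse(k = 2, n_obs = 299, n_syn = 299)
expected               # about 0.000209

0.000487 / expected    # about 2.33: matches the reported S_pMSE for syn_param
0.000070 / expected    # about 0.33: matches the reported S_pMSE for syn_nonparam
```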

Ideally, the \(pMSE\) ratio equals \(1\), but according to the synthpop authors, values below \(3\) indicate high-quality synthetic data, while values below \(10\) are deemed acceptable (Raab, Nowok, and Dibben 2021). This would suggest that both synthesis models are very good models for synthesizing the variable creatinine_phosphokinase. However, our logistic regression model only evaluates whether the means of the two variables are similar, and might thus not be the best model for evaluating the quality of the synthetic data in this case.


6. Recalculate the \(pMSE\) for the variable creatinine_phosphokinase for both synthetic sets, but this time using a CART model to estimate the probabilities \(\pi\). What do you see?


Hint: You can again use the function utility.gen() and set the arguments method = "cart" and vars = "creatinine_phosphokinase".

utility.gen(syn_param, 
            heart_failure, 
            method = "cart", 
            vars = "creatinine_phosphokinase")

utility.gen(syn_nonparam, 
            heart_failure, 
            method = "cart", 
            vars = "creatinine_phosphokinase")
Running 50 permutations to get NULL utilities and printing every 10th.
synthesis 10 20 30 40 50 

Utility score calculated by method: cart

Call:
utility.gen.synds(object = syn_param, data = heart_failure, method = "cart", 
    vars = "creatinine_phosphokinase")

Null utilities simulated from a permutation test with 50 replications.

Selected utility measures
    pMSE   S_pMSE 
0.137999 4.822987 
Running 50 permutations to get NULL utilities and printing every 10th.
synthesis 10 20 30 40 50 

Utility score calculated by method: cart

Call:
utility.gen.synds(object = syn_nonparam, data = heart_failure, 
    method = "cart", vars = "creatinine_phosphokinase")

Null utilities simulated from a permutation test with 50 replications.

Selected utility measures
    pMSE   S_pMSE 
0.021460 1.120306 

The \(pMSE\)-ratio is about four times higher for the parametrically synthesized data set when a CART model is used to estimate the probabilities \(\hat{\pi}\). This indicates that the non-parametric synthesis method is better at reproducing the variable creatinine_phosphokinase. However, both \(pMSE\)-ratio values are still well below \(10\), which would indicate reasonable synthetic data quality, even though we would argue that the parametric synthetic version of creatinine_phosphokinase is a poor representation of the original data.


Multivariate data utility

Being able to reproduce the original univariate distributions is a good first step, but generally the goal of synthetic data reaches beyond that. Specifically, we often want to reproduce the relationships between the variables in the data. In the previous section, we saw that an evaluation of utility is often best carried out through visualizations. However, creating visualizations becomes cumbersome for multivariate relationships: going beyond bivariate relationships is often not feasible, and even displaying all bivariate relationships already results in \(p(p-1)/2\) different figures.
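As a quick check of that count for our data (with \(p = 13\) variables):

```r
# Number of unique variable pairs: p(p - 1) / 2
p <- 13
p * (p - 1) / 2   # 78
choose(p, 2)      # equivalently, via the binomial coefficient: 78
```

This matches the 78 pairs that the two-way utility assessment below reports.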

In the synthetic data literature, a distinction is often made between general and specific utility measures. General utility measures assess to what extent the relationships between combinations of variables (and potential interactions between them) are preserved in the synthetic data set. These measures are often for pairs of variables, or for all combinations of variables. Specific utility measures focus, as the name already suggests, on a specific analysis. This analysis is performed on the observed data and the synthetic data, and the similarity between inferences on these data sets is quantified.


General utility measures


Continuing with our \(pMSE\) approach, we can inspect which variables can predict whether observations are “true” or “synthetic” using the \(pMSE\)-ratio, similarly to what we just did using individual variables. We first try to predict the class of all observations by using all variables simultaneously, and hereafter we look at the results for all unique pairs of variables in the data.


7. Use the function utility.gen() from the synthpop package to calculate the \(pMSE\)-ratio using all variables for both synthetic sets. What do you see?


utility.gen(syn_param, heart_failure)
utility.gen(syn_nonparam, heart_failure)

Utility score calculated by method: cart

Call:
utility.gen.synds(object = syn_param, data = heart_failure, print.flag = F)

Null utilities simulated from a permutation test with 50 replications.

Selected utility measures
    pMSE   S_pMSE 
0.173414 3.237469 

Utility score calculated by method: cart

Call:
utility.gen.synds(object = syn_nonparam, data = heart_failure, 
    print.flag = F)

Null utilities simulated from a permutation test with 50 replications.

Selected utility measures
    pMSE   S_pMSE 
0.100042 1.953030 

The non-parametric (CART) synthesis model again performed somewhat better, but the difference is relatively small. To get more insight into which variables and bivariate relationships were synthesized well, and which can be improved, we can use utility.tables().


8. Use the function utility.tables() from the synthpop package to calculate the \(pMSE\)-ratio for each pair of variables for both synthetic sets. What do you see?


Hint: To use the same color scale for both synthetic data sets, you can set the arguments min.scale = 0 and max.scale = 45.

utility.tables(syn_param, heart_failure, min.scale = 0, max.scale = 45)
utility.tables(syn_nonparam, heart_failure, min.scale = 0, max.scale = 45)

Two-way utility: S_pMSE value plotted for 78 pairs of variables.

Variable combinations with worst 5 utility scores (S_pMSE):
     02.anaemia:03.creatinine_phosphokinase 
                                    39.1081 
    03.creatinine_phosphokinase:04.diabetes 
                                    36.7613 
    03.creatinine_phosphokinase:12.deceased 
                                    36.5158 
     03.creatinine_phosphokinase:10.smoking 
                                    36.4113 
03.creatinine_phosphokinase:11.hypertension 
                                    36.3940 


Medians and maxima of selected utility measures for all tables compared
       Medians  Maxima
pMSE    0.0141  0.1110
S_pMSE  5.1911 39.1081
df      9.0000 24.0000

For more details of all scores use print.tabs = TRUE.

Two-way utility: S_pMSE value plotted for 78 pairs of variables.

Variable combinations with worst 5 utility scores (S_pMSE):
                          02.anaemia:09.sex 
                                     4.1218 
                        06.platelets:09.sex 
                                     3.5183 
                      02.anaemia:10.smoking 
                                     3.2361 
03.creatinine_phosphokinase:08.serum_sodium 
                                     2.6997 
          05.ejection_fraction:13.follow_up 
                                     2.6599 


Medians and maxima of selected utility measures for all tables compared
       Medians  Maxima
pMSE    0.0027  0.0135
S_pMSE  1.2914  4.1218
df      9.0000 24.0000

For more details of all scores use print.tabs = TRUE.

Here, we finally see that our parametric synthesis model is severely flawed. Quite a few of the \(pMSE\) ratios are larger than 20, which means that we did a poor job of synthesizing these variables or their relationships with other variables. Note that we partly knew this already from our visualizations. Our non-parametric synthesis model is doing very well: the highest \(pMSE\)-ratio values are (much) smaller than \(10\), which indicates that our synthetic data are of high quality.


Specific utility measures

Specific utility measures assess whether the same analysis on the observed and the synthetic data gives similar results. Say that we are interested in, for instance, the relationship between whether a person survives, the age of this person, whether this person has diabetes and whether or not this person smokes, including the follow-up time as a control variable in the model.


9. Fit this model as a logistic regression model using glm.synds() with family = binomial and data = synthetic_data_object. Compare the results obtained with both synthetic data sets with the results obtained on the original data. What do you see?

Hint: You can also use compare.fit.synds() to compare the results of the models fitted on the synthetic data sets with the model fitted on the observed data.


fit_param <- glm.synds(deceased ~ age + diabetes + smoking + follow_up,
                       family = binomial, 
                       data = syn_param)

fit_nonparam <- glm.synds(deceased ~ age + diabetes + smoking + follow_up,
                          family = binomial, 
                          data = syn_nonparam)

fit_obs <- glm(deceased ~ age + diabetes + smoking + follow_up,
               family = binomial, 
               data = heart_failure)
summary(fit_param)
Fit to synthetic data set with a single synthesis. Inference to coefficients
and standard errors that would be obtained from the original data.

Call:
glm.synds(formula = deceased ~ age + diabetes + smoking + follow_up, 
    family = binomial, data = syn_param)

Combined estimates:
            xpct(Beta) xpct(se.Beta) xpct(z) Pr(>|xpct(z)|)    
(Intercept)   0.203118      0.948495  0.2141        0.83043    
age           0.027391      0.012734  2.1511        0.03147 *  
diabetesYes   0.107183      0.339593  0.3156        0.75229    
smokingYes    0.330859      0.343451  0.9633        0.33538    
follow_up    -0.024043      0.003265 -7.3638      1.788e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(fit_nonparam)
Fit to synthetic data set with a single synthesis. Inference to coefficients
and standard errors that would be obtained from the original data.

Call:
glm.synds(formula = deceased ~ age + diabetes + smoking + follow_up, 
    family = binomial, data = syn_nonparam)

Combined estimates:
            xpct(Beta) xpct(se.Beta) xpct(z) Pr(>|xpct(z)|)    
(Intercept)  0.2566964     0.9447743  0.2717        0.78585    
age          0.0304290     0.0142923  2.1290        0.03325 *  
diabetesYes  0.4642362     0.3400045  1.3654        0.17213    
smokingYes  -0.3846838     0.3597232 -1.0694        0.28489    
follow_up   -0.0285308     0.0034808 -8.1966      2.474e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(fit_obs)

Call:
glm(formula = deceased ~ age + diabetes + smoking + follow_up, 
    family = binomial, data = heart_failure)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.84667    0.90336  -0.937  0.34863    
age          0.03651    0.01332   2.740  0.00614 ** 
diabetesYes  0.11021    0.31027   0.355  0.72242    
smokingYes  -0.20590    0.32636  -0.631  0.52811    
follow_up   -0.01932    0.00258  -7.486 7.08e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 375.35  on 298  degrees of freedom
Residual deviance: 270.87  on 294  degrees of freedom
AIC: 280.87

Number of Fisher Scoring iterations: 5
compare.fit.synds(fit_param, heart_failure)

Call used to fit models to the data:
glm.synds(formula = deceased ~ age + diabetes + smoking + follow_up, 
    family = binomial, data = syn_param)

Differences between results based on synthetic and observed data:
              Synthetic    Observed         Diff Std. coef diff CI overlap
(Intercept)  0.20311831 -0.84666625  1.049784561    1.162090890  0.7035428
age          0.02739155  0.03650504 -0.009113495   -0.684058013  0.8254922
diabetesYes  0.10718298  0.11021309 -0.003030108   -0.009766139  0.9975086
smokingYes   0.33085915 -0.20589862  0.536757766    1.644690904  0.5804283
follow_up   -0.02404289 -0.01931552 -0.004727370   -1.832233929  0.5325848

Measures for one synthesis and 5 coefficients
Mean confidence interval overlap:  0.7279113
Mean absolute std. coef diff:  1.066568

Mahalanobis distance ratio for lack-of-fit (target 1.0): 1.72
Lack-of-fit test: 8.576733; p-value 0.1272 for test that synthesis model is
compatible with a chi-squared test with 5 degrees of freedom.

Confidence interval plot:

compare.fit.synds(fit_nonparam, heart_failure)

Call used to fit models to the data:
glm.synds(formula = deceased ~ age + diabetes + smoking + follow_up, 
    family = binomial, data = syn_nonparam)

Differences between results based on synthetic and observed data:
              Synthetic    Observed         Diff Std. coef diff CI overlap
(Intercept)  0.25669642 -0.84666625  1.103362666      1.2214008 0.68841244
age          0.03042904  0.03650504 -0.006076006     -0.4560644 0.88365490
diabetesYes  0.46423621  0.11021309  0.354023115      1.1410284 0.70891597
smokingYes  -0.38468383 -0.20589862 -0.178785211     -0.5478196 0.86024754
follow_up   -0.02853082 -0.01931552 -0.009215308     -3.5716685 0.08884333

Measures for one synthesis and 5 coefficients
Mean confidence interval overlap:  0.6460148
Mean absolute std. coef diff:  1.387596

Mahalanobis distance ratio for lack-of-fit (target 1.0): 3.01
Lack-of-fit test: 15.05894; p-value 0.0101 for test that synthesis model is
compatible with a chi-squared test with 5 degrees of freedom.

Confidence interval plot:

The results obtained for both synthetic data sets are quite similar, but the parametrically synthesized data are somewhat closer to the results from the analysis on the real data than the non-parametrically synthesized data. This is quite paradoxical, as we saw before that the non-parametric synthesis model yielded much more realistic data than the parametric synthesis model. This shows an important mismatch between general and specific utility. That is, to obtain high specific utility, it is not necessary to have high general utility. Moreover, high general utility does not guarantee high specific utility. Additionally, these results show that synthetic data with lower general utility can still be very useful if the goal is to perform specific analyses.


Statistical disclosure control


Synthetic data can provide a relatively safe framework for sharing data. However, some risks remain, and it is important to evaluate them. For example, the synthesis models may have been so complex that the synthetic records are very similar, or even identical, to the original records, which can lead to privacy breaches.

Privacy of synthetic data

Synthetic data by itself does not provide any formal privacy guarantees. Such guarantees can be incorporated, for example, by using differentially private synthesis methods; however, these methods are not yet widely available in R. If privacy is not built in by design, it remains important to inspect the synthetic data for potential risks. Especially when you are not entirely sure, it is better to stay on the safe side: use relatively simple, parametric models, check for outliers, and potentially add additional noise to the synthetic data. See also Chapter 4 of the book Synthetic Data for Official Statistics.


10. Call the function replicated.uniques() on the synthetic data. This function checks whether there are duplicates of observations that were unique in the original data.


replicated.uniques(syn_param, heart_failure)
replicated.uniques(syn_nonparam, heart_failure)
$replications
  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
  ... (all 299 values are FALSE)

$no.uniques
[1] 299

$no.replications
[1] 0

$per.replications
[1] 0
$replications
  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [97] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[109] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[121] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[145] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[157] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[169] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[181] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[193] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[205] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[217] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[229] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[241] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[253] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[265] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[277] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[289] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

$no.uniques
[1] 299

$no.replications
[1] 0

$per.replications
[1] 0

None of the observations occurs repeatedly, so we have not accidentally copied any of the “true” observations into the synthetic sets. This provides some safeguard against accidentally releasing sensitive information. However, if the data contain truly sensitive information, this alone may not be enough, and one could, for example, check whether the synthetic data differ from the observed data along multiple dimensions (i.e., variables) simultaneously. Which additional checks are appropriate depends on the problem at hand. One might also take further measures against accidentally disclosing information about observations, for example by drawing some of the variables from a parametric distribution. Even before distributing synthetic data, think carefully about whether any disclosure risks remain in the data that will be released.
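One way to check for synthetic records that closely resemble real ones along multiple variables at once is a simple nearest-neighbor check. The sketch below (not part of synthpop; the variable selection is our own) computes, for each synthetic record, the Euclidean distance to its closest observed record after standardizing the numeric variables; very small minimum distances flag potentially risky records.

```r
# Minimal sketch of a near-duplicate check on the numeric variables
num_vars <- c("age", "creatinine_phosphokinase", "ejection_fraction",
              "platelets", "serum_creatinine", "serum_sodium", "follow_up")

obs_scaled <- scale(heart_failure[, num_vars])
syn_scaled <- scale(syn_param$syn[, num_vars],
                    center = attr(obs_scaled, "scaled:center"),
                    scale  = attr(obs_scaled, "scaled:scale"))

# Distance from each synthetic record to its nearest observed record
min_dist <- apply(syn_scaled, 1, function(x) {
  min(sqrt(colSums((t(obs_scaled) - x)^2)))
})
summary(min_dist)  # values close to zero may indicate disclosure risk
```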


If you find the synthetic data too risky to release as is, you can apply additional statistical disclosure limitation techniques, which further reduce the information in the synthetic data. For example, you can add noise to the synthetic data through smoothing, or impose top/bottom coding so that extreme values cannot appear in the synthetic data. This is easily done using the function sdc() as implemented in synthpop. Which statistical disclosure techniques to apply typically depends on the problem at hand, the sensitivity of the data, and the synthesis strategy used. Our non-parametric synthesis strategy, for instance, re-uses observed values, which might pose an unacceptable disclosure risk; in that case, smoothing the synthetic values can reduce this risk.
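As an illustration, the call below applies sdc() to the non-parametric synthesis, removing replicated uniques and smoothing a few continuous variables. The argument names follow synthpop's sdc() as we know it; check ?sdc in your installed version, and treat the choice of variables here as an example only.

```r
# Sketch: statistical disclosure control on the non-parametric synthesis.
# rm.replicated.uniques drops synthetic records that replicate unique
# observed records; smooth.vars applies kernel smoothing to the listed
# continuous variables so observed values are not released verbatim.
syn_nonparam_sdc <- sdc(syn_nonparam, heart_failure,
                        rm.replicated.uniques = TRUE,
                        smooth.vars = c("age", "platelets", "serum_creatinine"))
```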

Inferences from synthetic data

Lastly, when you have obtained a synthetic data set and want to draw inferences from it, you have to be careful: generating synthetic data adds variance on top of the sampling variance you already take into account when evaluating hypotheses. Specifically, if you only want to make inferences with respect to the sample of original observations, you can use unaltered analysis techniques and the corresponding conventional standard errors.

However, if you want to make inferences with respect to the population the sample was drawn from, you will have to adjust the standard errors to account for the additional variance introduced by the synthesis procedure. How much variance is added depends on the number of synthetic data sets that are generated. Intuitively, when generating multiple synthetic data sets, the random noise induced by the synthesis averages out, making the parameter estimates more stable.

There are two ways to obtain statistically valid results from synthetic data. The first requires multiple synthetic data sets and uses the variance of the obtained estimates across these sets. The corresponding pooling rules are presented in Reiter (2003). For scalar \(Q\), with \(q^{(i)}\) and \(u^{(i)}\) the point estimate and the corresponding variance estimate in synthetic data set \(D^{(i)}\) for \(i = 1, \dots, m\), the following quantities are needed for inferences:

\[ \begin{aligned} \bar{q}_m &= \sum_{i=1}^m \frac{q^{(i)}}{m}, \\ b_m &= \sum_{i=1}^m \frac{(q^{(i)} - \bar{q}_m)^2}{m-1}, \\ \bar{u}_m &= \sum_{i=1}^m \frac{u^{(i)}}{m}. \end{aligned} \]

The analyst can use \(\bar{q}_m\) to estimate \(Q\) and \[ T_p = \frac{b_m}{m} + \bar{u}_m \] to estimate the variance of \(\bar{q}_m\). Then, \(\frac{b_m}{m}\) is the correction factor for the additional variance due to using a finite number of imputations.
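The pooling rules above can be written in a few lines of base R. In the sketch below, `q` and `u` are assumed to be vectors holding the \(m\) point estimates and their variance estimates, one pair per synthetic data set; the function name is our own.

```r
# Pool m estimates from synthetic data following Reiter (2003)
pool_synthetic <- function(q, u) {
  m    <- length(q)
  qbar <- mean(q)                      # pooled point estimate of Q
  b    <- sum((q - qbar)^2) / (m - 1)  # between-synthesis variance
  ubar <- mean(u)                      # average within-synthesis variance
  Tp   <- b / m + ubar                 # total variance of qbar
  c(estimate = qbar, variance = Tp)
}
```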

The second way to obtain statistically valid results from synthetic data allows for multiple synthetic data sets, but does not require them (Raab, Nowok, and Dibben 2016). In this case, the between-imputation variance is estimated from the standard error(s) of the estimates, which simplifies the total variance of each estimate to \[ T_s = \frac{\bar{u}_m}{m} + \bar{u}_m. \] When you have \(m = 1\) synthetic data set, this gives \(T_s = 2u\), where \(u\) is the variance estimate obtained in that synthetic set.
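In practice, you rarely need to implement these rules by hand: synthpop provides analysis functions that apply the appropriate pooling for you. A sketch, assuming you want population-level inferences for a logistic regression on the synthetic data (the model formula here is just an illustration; see the synthpop documentation for the exact interface in your version):

```r
# Fit a GLM on the synthetic data; summary() with population.inference = TRUE
# requests standard errors adjusted for the variance added by synthesis.
fit_syn <- glm.synds(deceased ~ age + serum_creatinine + ejection_fraction,
                     family = "binomial", data = syn_param)
summary(fit_syn, population.inference = TRUE)
```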

References

Drechsler, Jörg. 2022. “Challenges in Measuring Utility for Fully Synthetic Data.” In Privacy in Statistical Databases, edited by Josep Domingo-Ferrer and Maryline Laurent, 220–33. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-031-13945-1_16.
Raab, Gillian M, Beata Nowok, and Chris Dibben. 2016. “Practical Data Synthesis for Large Samples.” Journal of Privacy and Confidentiality 7 (3): 67–97.
———. 2021. “Assessing, Visualizing and Improving the Utility of Synthetic Data.” arXiv Preprint arXiv:2109.12717.
Reiter, Jerome P. 2003. “Inference for Partially Synthetic, Public Use Microdata Sets.” Survey Methodology 29 (2): 181–88.