Generating synthetic data with high utility using synthpop

In this workshop, you will learn how to create and evaluate synthetic data in R. In the practical, we will work with the R package synthpop (Nowok, Raab, and Dibben 2016), which is one of the most advanced R packages dedicated to creating synthetic data. Other alternatives for creating synthetic data are, for example, the R package mice (van Buuren and Groothuis-Oudshoorn 2011; see Volker and Vink 2021), or the stand-alone software IVEware (“IVEware: Imputation and Variance Estimation Software,” n.d.).
If you have R and RStudio installed on your device, you can follow all the steps from this practical using your local version of RStudio. In case you do not have an installation of R and RStudio, you can quickly create an account on RStudio Cloud. Note that you have the opportunity to work with your own data (you can also use data provided by us). If you are going to work via RStudio Cloud, you may not want to upload your own data to this server; in that case, you can still decide to work with the data provided by us. Theoretically, you could also install R and RStudio on the spot, but since we only have 45 minutes, we advise using RStudio Cloud if you do not already have access to R and RStudio on your device.
In this workshop, you will (at least) use the following packages. Make sure to load them (in case you haven’t installed them already, install them first, using install.packages("package.name")).
library(synthpop)
library(magrittr)
library(psych)
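If you prefer not to check by hand whether each package is installed, the following sketch (plain base R, nothing workshop-specific) installs whatever is missing and then loads all three packages.

```r
# Install any required packages that are not yet available, then load them all.
pkgs <- c("synthpop", "magrittr", "psych")
missing <- pkgs[!pkgs %in% rownames(installed.packages())]
if (length(missing) > 0) install.packages(missing)
invisible(lapply(pkgs, library, character.only = TRUE))
```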
For this workshop, we have prepared all exercises with the Heart failure clinical records data set. However, you may also choose to work with a data set of your own liking. All steps, exercises, and solutions that we outline here should be applicable to another data set as well, but some data processing might be required before our example code works as it should. In the worst case, you might run into errors that we could not foresee, but we are more than happy to think along and help you solve these issues.
The Heart failure clinical records data set is a medical data set from the UCI Machine Learning Repository (click here for the source), originally collected by Ahmad (2017) from the Government College University, Faisalabad, Pakistan, and adapted and uploaded to the UCI MLR by Chicco and Jurman (2020). This data set contains medical information of 299 individuals on 13 variables, and is typically used to predict whether or not a patient will survive during the follow-up period, using several biomedical predictors.
If you decide to work with the Heart failure clinical records data and work in RStudio Cloud, you can access the environment related to this workshop here, including a script osf_synthetic.R that gets you started on installing and loading the required packages, and that imports the data for you. You can continue to work in this script. Make sure to save the project on your account, so that your changes are not deleted if you, for some reason, have to refresh the browser.
If you have RStudio installed on your own machine, you can download the cleaned version of the Heart failure clinical records data set from my GitHub and load it as heart_failure, by running the following line of code.
heart_failure <- readRDS(url("https://thomvolker.github.io/osf_synthetic/data/heart_failure.RDS"))
The Heart failure clinical records data consists of the following variables:

- age: Age in years
- anaemia: Whether the patient has a decrease of red blood cells (No/Yes)
- creatinine_phosphokinase: Level of the creatinine phosphokinase enzyme in the blood (mcg/L)
- diabetes: Whether the patient has diabetes (No/Yes)
- ejection_fraction: Percentage of blood leaving the heart at each contraction
- platelets: Platelets in the blood (kiloplatelets/mL)
- serum_creatinine: Level of serum creatinine in the blood (mg/dL)
- serum_sodium: Level of serum sodium in the blood (mEq/L)
- sex: Sex (Female/Male)
- smoking: Whether the patient smokes (No/Yes)
- hypertension: Whether the patient has high blood pressure (No/Yes)
- deceased: Whether the patient died during the follow-up period (No/Yes)
- follow_up: Follow-up period (days)

After loading the data, it is always wise to first inspect the data, so that you have an idea what to expect.
head(heart_failure)
age | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | platelets | serum_creatinine | serum_sodium | sex | smoking | hypertension | deceased | follow_up |
---|---|---|---|---|---|---|---|---|---|---|---|---|
75 | No | 582 | No | 20 | 265000 | 1.9 | 130 | Male | No | Yes | Yes | 4 |
55 | No | 7861 | No | 38 | 263358 | 1.1 | 136 | Male | No | No | Yes | 6 |
65 | No | 146 | No | 20 | 162000 | 1.3 | 129 | Male | Yes | No | Yes | 7 |
50 | Yes | 111 | No | 20 | 210000 | 1.9 | 137 | Male | No | No | Yes | 7 |
65 | Yes | 160 | Yes | 20 | 327000 | 2.7 | 116 | Female | No | No | Yes | 8 |
90 | Yes | 47 | No | 40 | 204000 | 2.1 | 132 | Male | Yes | Yes | Yes | 8 |
Additionally, we can ask for a summary of all variables, or use describe() from the psych package to provide descriptive statistics of all variables.

Note. Make sure to install psych if you haven’t done so in the past.
summary(heart_failure)
## age anaemia creatinine_phosphokinase diabetes ejection_fraction
## Min. :40.00 No :170 Min. : 23.0 No :174 Min. :14.00
## 1st Qu.:51.00 Yes:129 1st Qu.: 116.5 Yes:125 1st Qu.:30.00
## Median :60.00 Median : 250.0 Median :38.00
## Mean :60.83 Mean : 581.8 Mean :38.08
## 3rd Qu.:70.00 3rd Qu.: 582.0 3rd Qu.:45.00
## Max. :95.00 Max. :7861.0 Max. :80.00
## platelets serum_creatinine serum_sodium sex smoking
## Min. : 25100 Min. :0.500 Min. :113.0 Female:105 No :203
## 1st Qu.:212500 1st Qu.:0.900 1st Qu.:134.0 Male :194 Yes: 96
## Median :262000 Median :1.100 Median :137.0
## Mean :263358 Mean :1.394 Mean :136.6
## 3rd Qu.:303500 3rd Qu.:1.400 3rd Qu.:140.0
## Max. :850000 Max. :9.400 Max. :148.0
## hypertension deceased follow_up
## No :194 No :203 Min. : 4.0
## Yes:105 Yes: 96 1st Qu.: 73.0
## Median :115.0
## Mean :130.3
## 3rd Qu.:203.0
## Max. :285.0
This gives a good impression about the measurement levels of all variables, as well as the range of the possible values each variable can have.
describe(heart_failure)
vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
age | 1 | 299 | 6.083389e+01 | 1.189481e+01 | 60.0 | 6.021715e+01 | 14.82600 | 40.0 | 95.0 | 55.0 | 0.4188266 | -0.2204793 | 0.6878946 |
anaemia* | 2 | 299 | 1.431438e+00 | 4.961073e-01 | 1.0 | 1.414938e+00 | 0.00000 | 1.0 | 2.0 | 1.0 | 0.2754750 | -1.9305367 | 0.0286906 |
creatinine_phosphokinase | 3 | 299 | 5.818395e+02 | 9.702879e+02 | 250.0 | 3.654938e+02 | 269.83320 | 23.0 | 7861.0 | 7838.0 | 4.4184296 | 24.5254138 | 56.1131970 |
diabetes* | 4 | 299 | 1.418060e+00 | 4.940671e-01 | 1.0 | 1.398340e+00 | 0.00000 | 1.0 | 2.0 | 1.0 | 0.3305857 | -1.8970241 | 0.0285726 |
ejection_fraction | 5 | 299 | 3.808361e+01 | 1.183484e+01 | 38.0 | 3.742739e+01 | 11.86080 | 14.0 | 80.0 | 66.0 | 0.5498228 | 0.0005484 | 0.6844265 |
platelets | 6 | 299 | 2.633580e+05 | 9.780424e+04 | 262000.0 | 2.567301e+05 | 65234.40000 | 25100.0 | 850000.0 | 824900.0 | 1.4476814 | 6.0252322 | 5656.1650591 |
serum_creatinine | 7 | 299 | 1.393880e+00 | 1.034510e+00 | 1.1 | 1.189295e+00 | 0.29652 | 0.5 | 9.4 | 8.9 | 4.4113866 | 25.1888415 | 0.0598273 |
serum_sodium | 8 | 299 | 1.366254e+02 | 4.412477e+00 | 137.0 | 1.368216e+02 | 4.44780 | 113.0 | 148.0 | 35.0 | -1.0376430 | 3.9841899 | 0.2551802 |
sex* | 9 | 299 | 1.648829e+00 | 4.781364e-01 | 2.0 | 1.684647e+00 | 0.00000 | 1.0 | 2.0 | 1.0 | -0.6204576 | -1.6204183 | 0.0276513 |
smoking* | 10 | 299 | 1.321070e+00 | 4.676704e-01 | 1.0 | 1.278008e+00 | 0.00000 | 1.0 | 2.0 | 1.0 | 0.7626368 | -1.4231112 | 0.0270461 |
hypertension* | 11 | 299 | 1.351171e+00 | 4.781364e-01 | 1.0 | 1.315353e+00 | 0.00000 | 1.0 | 2.0 | 1.0 | 0.6204576 | -1.6204183 | 0.0276513 |
deceased* | 12 | 299 | 1.321070e+00 | 4.676704e-01 | 1.0 | 1.278008e+00 | 0.00000 | 1.0 | 2.0 | 1.0 | 0.7626368 | -1.4231112 | 0.0270461 |
follow_up | 13 | 299 | 1.302609e+02 | 7.761421e+01 | 115.0 | 1.292780e+02 | 105.26460 | 4.0 | 285.0 | 281.0 | 0.1265232 | -1.2238150 | 4.4885455 |
The describe() function gives more distributional information about all variables.
In case you brought your own data, you can load it into R using a function that matches your data format. Below, you can find several functions that might be helpful if you want to load your data into R. You can use these functions both locally and on RStudio Cloud, but make sure to install the required package first.
Programme | Format | Command |
---|---|---|
Excel | .xlsx | readxl::read_xlsx("path_to_file/data_name.xlsx") |
Excel | .csv | readr::read_csv("path_to_file/data_name.csv") |
SPSS | .sav | haven::read_sav("path_to_file/data_name.sav") |
Stata | .dta | haven::read_dta("path_to_file/data_name.dta") |
After loading in your own data, make sure that the variables in your data are coded accordingly (this can go wrong when transferring between data types). That is, make sure that your categorical variables are coded as factors and your numeric variables as numeric variables. To do so, you can make use of the following code. Note, however, that this is not a workshop on data wrangling: if importing your data into R creates a mess, it might be better to use the Heart failure clinical records data, so that you can spend your valuable time on creating synthetic data.
data_name$variable1 <- as.numeric(data_name$variable1)
data_name$variable2 <- factor(data_name$variable2,
                              levels = values,       # values of the data
                              labels = value_labels) # labels of these values
If your data has the correct format, we can proceed to the next steps. Given that you are using your own data, we assume that you have (at least some) knowledge about the variables in your data. We will therefore skip the steps to obtain some descriptive information of the variables in your data, and continue to creating and evaluating synthetic data.
In what follows, we will outline how to create and evaluate synthetic data using the Heart failure clinical records data, but most of these steps should be directly applicable to your own data. In case something gives an error, do not hesitate to ask how the problem can be solved!
In the previous part, you created synthetic data using univariate imputation methods. Relationships between variables were completely neglected. In this part, you will create synthetic data that preserves the relationships between the variables in the data.
Using the default settings, synthpop proceeds as follows. The first variable is randomly sampled from the observed data, such that a random subset of the original values is used, without taking relations with the other variables into account. Subsequently, the second variable is generated using a classification and regression tree (CART). A CART model is trained on the observed data, using the first variable as a predictor and the second variable as the outcome. The first synthetic variable is then used to predict the second synthetic variable (by sampling predicted values from the final nodes of the CART model trained on the observed data). The next step is to apply this procedure to the third variable, using variables 1 and 2 as predictors. This procedure is repeated, sequentially adding predictors to the generative model, until all but the last variable are used as predictors for the last variable. By incrementally incorporating variables, the relations in the original data are generally preserved in the synthetic data.
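To make these defaults tangible, the sketch below spells out what syn() does implicitly, by passing the method vector and visit sequence it would use anyway (assuming the heart_failure data from above is loaded). Once written out like this, it is easy to deviate from the defaults, for example by changing the method for a single variable.

```r
# Make synthpop's defaults explicit: the first variable is drawn with
# "sample", every later variable is generated with CART, in column order.
default_methods <- c("sample", rep("cart", ncol(heart_failure) - 1))
syn_explicit <- syn(heart_failure,
                    method = default_methods,
                    visit.sequence = 1:ncol(heart_failure))
```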
1. Use the syn() function from the synthpop package to create a synthetic version of your data, using the default settings.
OPTIONAL. Fix the random seed before generating the synthetic data, so that you can compare your output with our output.
set.seed(123)
synthetic <- syn(heart_failure)
## Synthesis
## -----------
## age anaemia creatinine_phosphokinase diabetes ejection_fraction platelets serum_creatinine serum_sodium sex smoking
## hypertension deceased follow_up
You have now created the object synthetic, which is of class synds and contains several pieces of information related to your synthetic data. We will quickly go over the most important output.
2. Inspect the output object, and pay attention to the number of synthetic data sets, the synthesis methods, the visit sequence and the predictor matrix.
synthetic
## Call:
## ($call) syn(data = heart_failure)
##
## Number of synthesised data sets:
## ($m) 1
##
## First rows of synthesised data set:
## ($syn)
## age anaemia creatinine_phosphokinase diabetes ejection_fraction platelets
## 1 65 No 118 Yes 30 267000
## 2 85 Yes 582 No 38 140000
## 3 42 No 582 Yes 38 263358
## 4 73 No 675 No 40 243000
## 5 63 No 936 No 38 297000
## 6 50 No 482 No 45 371000
## serum_creatinine serum_sodium sex smoking hypertension deceased follow_up
## 1 1.4 135 Female No No No 212
## 2 1.2 136 Male No No Yes 109
## 3 1.1 134 Male Yes No No 145
## 4 1.1 137 Female No Yes No 107
## 5 1.0 137 Male Yes Yes No 187
## 6 1.0 137 Female No No No 140
## ...
##
## Synthesising methods:
## ($method)
## age anaemia creatinine_phosphokinase
## "sample" "cart" "cart"
## diabetes ejection_fraction platelets
## "cart" "cart" "cart"
## serum_creatinine serum_sodium sex
## "cart" "cart" "cart"
## smoking hypertension deceased
## "cart" "cart" "cart"
## follow_up
## "cart"
##
## Order of synthesis:
## ($visit.sequence)
## age anaemia creatinine_phosphokinase
## 1 2 3
## diabetes ejection_fraction platelets
## 4 5 6
## serum_creatinine serum_sodium sex
## 7 8 9
## smoking hypertension deceased
## 10 11 12
## follow_up
## 13
##
## Matrix of predictors:
## ($predictor.matrix)
## age anaemia creatinine_phosphokinase diabetes
## age 0 0 0 0
## anaemia 1 0 0 0
## creatinine_phosphokinase 1 1 0 0
## diabetes 1 1 1 0
## ejection_fraction 1 1 1 1
## platelets 1 1 1 1
## serum_creatinine 1 1 1 1
## serum_sodium 1 1 1 1
## sex 1 1 1 1
## smoking 1 1 1 1
## hypertension 1 1 1 1
## deceased 1 1 1 1
## follow_up 1 1 1 1
## ejection_fraction platelets serum_creatinine
## age 0 0 0
## anaemia 0 0 0
## creatinine_phosphokinase 0 0 0
## diabetes 0 0 0
## ejection_fraction 0 0 0
## platelets 1 0 0
## serum_creatinine 1 1 0
## serum_sodium 1 1 1
## sex 1 1 1
## smoking 1 1 1
## hypertension 1 1 1
## deceased 1 1 1
## follow_up 1 1 1
## serum_sodium sex smoking hypertension deceased
## age 0 0 0 0 0
## anaemia 0 0 0 0 0
## creatinine_phosphokinase 0 0 0 0 0
## diabetes 0 0 0 0 0
## ejection_fraction 0 0 0 0 0
## platelets 0 0 0 0 0
## serum_creatinine 0 0 0 0 0
## serum_sodium 0 0 0 0 0
## sex 1 0 0 0 0
## smoking 1 1 0 0 0
## hypertension 1 1 1 0 0
## deceased 1 1 1 1 0
## follow_up 1 1 1 1 1
## follow_up
## age 0
## anaemia 0
## creatinine_phosphokinase 0
## diabetes 0
## ejection_fraction 0
## platelets 0
## serum_creatinine 0
## serum_sodium 0
## sex 0
## smoking 0
## hypertension 0
## deceased 0
## follow_up 0
If all went well, you should have a single synthetic data set (increasing \(m\), the number of synthetic data sets, can decrease the variance of inferences on synthetic data). Also, the first variable should have synthesis method "sample", whereas all other variables should have method "cart". The visit sequence should increase sequentially from \(1\) to the number of variables in your data set (\(13\) if you are using the Heart failure clinical records data). Lastly, the predictor matrix shows, for the variable in each row, which variables (in the columns) are used as predictors; it should be a lower triangular matrix of ones, with zeros on and above the diagonal.
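As an aside on \(m\): if you want more than one synthetic data set, you can request this directly in the call to syn(). The sketch below uses m = 5, an arbitrary choice on our part.

```r
# Generate five synthetic data sets at once; synthetic_m$syn is then a
# list of five data frames instead of a single data frame.
synthetic_m <- syn(heart_failure, m = 5, seed = 123)
synthetic_m$m  # should equal 5
```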
The output object synthetic contains additional information that you can access by specifically asking for it, using, for example, synthetic$smoothing, which shows for which of the variables smoothing was applied. In general, you will not need most of this information, but sometimes it can be helpful to check whether more sophisticated arguments were passed successfully.
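To see which components the synds object stores in the first place, you can list its names with plain base R; any of these can then be extracted with the $ operator.

```r
# List all components of the synds object; access them as, for example,
# synthetic$method or synthetic$predictor.matrix.
names(synthetic)
```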
The quality of synthetic data sets can be assessed on multiple levels and in multiple different ways. Starting on a univariate level, we can assess whether similar values seem to appear in the synthetic data as in the observed data. Additionally, the distributions of the synthetic data sets can be compared with the distribution of the observed data.
3. Inspect the synthetic data by asking for head(synthetic$syn) and tail(synthetic$syn), to check whether the synthetic data looks similar to the observed data (which we inspected previously already).
head(synthetic$syn)
tail(synthetic$syn)
age | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | platelets | serum_creatinine | serum_sodium | sex | smoking | hypertension | deceased | follow_up |
---|---|---|---|---|---|---|---|---|---|---|---|---|
65 | No | 118 | Yes | 30 | 267000 | 1.4 | 135 | Female | No | No | No | 212 |
85 | Yes | 582 | No | 38 | 140000 | 1.2 | 136 | Male | No | No | Yes | 109 |
42 | No | 582 | Yes | 38 | 263358 | 1.1 | 134 | Male | Yes | No | No | 145 |
73 | No | 675 | No | 40 | 243000 | 1.1 | 137 | Female | No | Yes | No | 107 |
63 | No | 936 | No | 38 | 297000 | 1.0 | 137 | Male | Yes | Yes | No | 187 |
50 | No | 482 | No | 45 | 371000 | 1.0 | 137 | Female | No | No | No | 140 |
age | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | platelets | serum_creatinine | serum_sodium | sex | smoking | hypertension | deceased | follow_up | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
294 | 50 | Yes | 168 | Yes | 45 | 454000 | 1.4 | 137 | Female | No | Yes | No | 216 |
295 | 64 | Yes | 52 | No | 60 | 226000 | 1.8 | 131 | Male | Yes | No | No | 90 |
296 | 50 | No | 115 | No | 35 | 254000 | 0.7 | 138 | Male | Yes | Yes | No | 145 |
297 | 59 | Yes | 129 | Yes | 35 | 260000 | 1.0 | 136 | Male | Yes | Yes | No | 95 |
298 | 45 | No | 7702 | No | 25 | 226000 | 0.8 | 127 | Male | Yes | No | No | 112 |
299 | 44 | No | 582 | No | 50 | 374000 | 0.7 | 140 | Female | No | Yes | Yes | 95 |
If all went well, you should at least observe that the variables in the synthetic data have the same measurement level as the variables in the observed data.
4. Check the distributional similarity by comparing the summary of the observed and synthetic data. Additionally, you may inspect the similarity by running describe() from the psych package on both the observed and the synthetic data.
summary(heart_failure)
## age anaemia creatinine_phosphokinase diabetes ejection_fraction
## Min. :40.00 No :170 Min. : 23.0 No :174 Min. :14.00
## 1st Qu.:51.00 Yes:129 1st Qu.: 116.5 Yes:125 1st Qu.:30.00
## Median :60.00 Median : 250.0 Median :38.00
## Mean :60.83 Mean : 581.8 Mean :38.08
## 3rd Qu.:70.00 3rd Qu.: 582.0 3rd Qu.:45.00
## Max. :95.00 Max. :7861.0 Max. :80.00
## platelets serum_creatinine serum_sodium sex smoking
## Min. : 25100 Min. :0.500 Min. :113.0 Female:105 No :203
## 1st Qu.:212500 1st Qu.:0.900 1st Qu.:134.0 Male :194 Yes: 96
## Median :262000 Median :1.100 Median :137.0
## Mean :263358 Mean :1.394 Mean :136.6
## 3rd Qu.:303500 3rd Qu.:1.400 3rd Qu.:140.0
## Max. :850000 Max. :9.400 Max. :148.0
## hypertension deceased follow_up
## No :194 No :203 Min. : 4.0
## Yes:105 Yes: 96 1st Qu.: 73.0
## Median :115.0
## Mean :130.3
## 3rd Qu.:203.0
## Max. :285.0
summary(synthetic$syn)
## age anaemia creatinine_phosphokinase diabetes ejection_fraction
## Min. :40.00 No :166 Min. : 23.0 No :179 Min. :15.00
## 1st Qu.:53.00 Yes:133 1st Qu.: 118.0 Yes:120 1st Qu.:30.00
## Median :60.00 Median : 213.0 Median :38.00
## Mean :62.02 Mean : 568.8 Mean :39.41
## 3rd Qu.:70.00 3rd Qu.: 582.0 3rd Qu.:47.50
## Max. :95.00 Max. :7702.0 Max. :70.00
## platelets serum_creatinine serum_sodium sex smoking
## Min. : 25100 Min. :0.500 Min. :116.0 Female:110 No :207
## 1st Qu.:211500 1st Qu.:0.900 1st Qu.:134.0 Male :189 Yes: 92
## Median :255000 Median :1.100 Median :137.0
## Mean :261243 Mean :1.426 Mean :136.9
## 3rd Qu.:297500 3rd Qu.:1.650 3rd Qu.:140.0
## Max. :742000 Max. :6.800 Max. :148.0
## hypertension deceased follow_up
## No :189 No :197 Min. : 6.0
## Yes:110 Yes:102 1st Qu.: 70.0
## Median :109.0
## Mean :125.5
## 3rd Qu.:188.0
## Max. :285.0
describe(heart_failure)
describe(synthetic$syn)
vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
age | 1 | 299 | 6.083389e+01 | 1.189481e+01 | 60.0 | 6.021715e+01 | 14.82600 | 40.0 | 95.0 | 55.0 | 0.4188266 | -0.2204793 | 0.6878946 |
anaemia* | 2 | 299 | 1.431438e+00 | 4.961073e-01 | 1.0 | 1.414938e+00 | 0.00000 | 1.0 | 2.0 | 1.0 | 0.2754750 | -1.9305367 | 0.0286906 |
creatinine_phosphokinase | 3 | 299 | 5.818395e+02 | 9.702879e+02 | 250.0 | 3.654938e+02 | 269.83320 | 23.0 | 7861.0 | 7838.0 | 4.4184296 | 24.5254138 | 56.1131970 |
diabetes* | 4 | 299 | 1.418060e+00 | 4.940671e-01 | 1.0 | 1.398340e+00 | 0.00000 | 1.0 | 2.0 | 1.0 | 0.3305857 | -1.8970241 | 0.0285726 |
ejection_fraction | 5 | 299 | 3.808361e+01 | 1.183484e+01 | 38.0 | 3.742739e+01 | 11.86080 | 14.0 | 80.0 | 66.0 | 0.5498228 | 0.0005484 | 0.6844265 |
platelets | 6 | 299 | 2.633580e+05 | 9.780424e+04 | 262000.0 | 2.567301e+05 | 65234.40000 | 25100.0 | 850000.0 | 824900.0 | 1.4476814 | 6.0252322 | 5656.1650591 |
serum_creatinine | 7 | 299 | 1.393880e+00 | 1.034510e+00 | 1.1 | 1.189295e+00 | 0.29652 | 0.5 | 9.4 | 8.9 | 4.4113866 | 25.1888415 | 0.0598273 |
serum_sodium | 8 | 299 | 1.366254e+02 | 4.412477e+00 | 137.0 | 1.368216e+02 | 4.44780 | 113.0 | 148.0 | 35.0 | -1.0376430 | 3.9841899 | 0.2551802 |
sex* | 9 | 299 | 1.648829e+00 | 4.781364e-01 | 2.0 | 1.684647e+00 | 0.00000 | 1.0 | 2.0 | 1.0 | -0.6204576 | -1.6204183 | 0.0276513 |
smoking* | 10 | 299 | 1.321070e+00 | 4.676704e-01 | 1.0 | 1.278008e+00 | 0.00000 | 1.0 | 2.0 | 1.0 | 0.7626368 | -1.4231112 | 0.0270461 |
hypertension* | 11 | 299 | 1.351171e+00 | 4.781364e-01 | 1.0 | 1.315353e+00 | 0.00000 | 1.0 | 2.0 | 1.0 | 0.6204576 | -1.6204183 | 0.0276513 |
deceased* | 12 | 299 | 1.321070e+00 | 4.676704e-01 | 1.0 | 1.278008e+00 | 0.00000 | 1.0 | 2.0 | 1.0 | 0.7626368 | -1.4231112 | 0.0270461 |
follow_up | 13 | 299 | 1.302609e+02 | 7.761421e+01 | 115.0 | 1.292780e+02 | 105.26460 | 4.0 | 285.0 | 281.0 | 0.1265232 | -1.2238150 | 4.4885455 |
vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
age | 1 | 299 | 6.201561e+01 | 1.227605e+01 | 60.0 | 6.150070e+01 | 13.34340 | 40.0 | 95.0 | 55.0 | 0.3992243 | -0.3082979 | 0.7099424 |
anaemia* | 2 | 299 | 1.444816e+00 | 4.977785e-01 | 1.0 | 1.431535e+00 | 0.00000 | 1.0 | 2.0 | 1.0 | 0.2209793 | -1.9576820 | 0.0287873 |
creatinine_phosphokinase | 3 | 299 | 5.688227e+02 | 9.977052e+02 | 213.0 | 3.412656e+02 | 217.94220 | 23.0 | 7702.0 | 7679.0 | 4.3466736 | 22.8739358 | 57.6987835 |
diabetes* | 4 | 299 | 1.401338e+00 | 4.909909e-01 | 1.0 | 1.377593e+00 | 0.00000 | 1.0 | 2.0 | 1.0 | 0.4005461 | -1.8457022 | 0.0283947 |
ejection_fraction | 5 | 299 | 3.940803e+01 | 1.230952e+01 | 38.0 | 3.912033e+01 | 11.86080 | 15.0 | 70.0 | 55.0 | 0.3016707 | -0.6143742 | 0.7118778 |
platelets | 6 | 299 | 2.612429e+05 | 9.314287e+04 | 255000.0 | 2.554046e+05 | 63751.80000 | 25100.0 | 742000.0 | 716900.0 | 1.3503730 | 5.1309274 | 5386.5911051 |
serum_creatinine | 7 | 299 | 1.425819e+00 | 9.438396e-01 | 1.1 | 1.228299e+00 | 0.29652 | 0.5 | 6.8 | 6.3 | 2.9535346 | 10.6961787 | 0.0545837 |
serum_sodium | 8 | 299 | 1.368930e+02 | 4.405067e+00 | 137.0 | 1.369876e+02 | 4.44780 | 116.0 | 148.0 | 32.0 | -0.5800515 | 2.1367155 | 0.2547516 |
sex* | 9 | 299 | 1.632107e+00 | 4.830405e-01 | 2.0 | 1.663900e+00 | 0.00000 | 1.0 | 2.0 | 1.0 | -0.5451518 | -1.7084900 | 0.0279350 |
smoking* | 10 | 299 | 1.307692e+00 | 4.623122e-01 | 1.0 | 1.261411e+00 | 0.00000 | 1.0 | 2.0 | 1.0 | 0.8291562 | -1.3168707 | 0.0267362 |
hypertension* | 11 | 299 | 1.367893e+00 | 4.830405e-01 | 1.0 | 1.336100e+00 | 0.00000 | 1.0 | 2.0 | 1.0 | 0.5451518 | -1.7084900 | 0.0279350 |
deceased* | 12 | 299 | 1.341137e+00 | 4.748861e-01 | 1.0 | 1.302905e+00 | 0.00000 | 1.0 | 2.0 | 1.0 | 0.6668191 | -1.5605380 | 0.0274634 |
follow_up | 13 | 299 | 1.254749e+02 | 7.556495e+01 | 109.0 | 1.234232e+02 | 96.36900 | 6.0 | 285.0 | 279.0 | 0.2344404 | -1.1107920 | 4.3700341 |
In principle, you should observe that on a univariate level, the information in the observed and synthetic data is quite similar. That is, all variables are similarly distributed, with only small differences between the means and variances in the observed and synthetic data. Higher-order moments, such as the skewness and kurtosis, are also reasonably close.
5. Inspect the univariate distributional similarity of the observed and synthetic data in more detail, using the function compare() from the synthpop package.
compare(synthetic, heart_failure)
##
## Comparing percentages observed with synthetic
## Press return for next variable(s):
## Press return for next variable(s):
## Press return for next variable(s):
##
## Selected utility measures:
## pMSE S_pMSE df
## age 0.000772 0.923311 4
## anaemia 0.000045 0.217375 1
## creatinine_phosphokinase 0.001149 1.374546 4
## diabetes 0.000072 0.345725 1
## ejection_fraction 0.001424 1.702747 4
## platelets 0.000386 0.461658 4
## serum_creatinine 0.001276 1.526175 4
## serum_sodium 0.001066 1.275218 4
## sex 0.000076 0.363106 1
## smoking 0.000052 0.248262 1
## hypertension 0.000076 0.363106 1
## deceased 0.000114 0.543636 1
## follow_up 0.000478 0.572108 4
In each of these figures, the distribution of the observed and synthetic data should be similar. That is, the bars of the histograms corresponding to the same categories should have approximately the same height.
A more formal quantification of the similarity between the synthetic and the observed data is provided by the \(pMSE\) values. There are many other utility measures available (see, for example, this paper), but for now we will focus on the \(pMSE\) (these utility measures tend to correlate highly anyway). These values can quantify univariate or multivariate similarity between the observed and synthetic data, but in the output of compare(), they solely concern univariate similarity.
Technically, the \(pMSE\) is defined by \[ pMSE = \frac{1}{n_{\text{obs}} + n_{\text{syn}}} \left[ \sum^{n_{\text{obs}}}_{i=1}\Big(\hat{\pi}_i - \frac{n_{\text{syn}}}{n_{\text{obs}} + n_{\text{syn}}}\Big)^2 + \sum^{n_{\text{syn}}}_{i=1}\Big(\hat{\pi}_i - \frac{n_{\text{syn}}}{n_{\text{obs}} + n_{\text{syn}}}\Big)^2 \right], \] which, in our case, simplifies to \[ pMSE = \frac{1}{n_{\text{obs}} + n_{\text{syn}}} \left[ \sum^{n_{\text{obs}}}_{i=1} (\hat{\pi}_i - 0.5)^2 + \sum^{n_{\text{syn}}}_{i=1} (\hat{\pi}_i - 0.5)^2 \right], \] where \(n_{\text{obs}}\) is the sample size of the observed data, \(n_{\text{syn}}\) is the number of synthetic records, and \(\hat{\pi}_i\) is each observation’s predicted probability of being a synthetic record. Intuitively, this method quantifies to what extent it is possible to predict whether an observation is a “true” record or a synthetic record. When it is possible to accurately predict whether an observation comes from the real data, there are important differences between the observed and synthetic data on some variables.
OPTIONAL. Calculate the \(pMSE\) for the variables diabetes and age by hand, and check whether the values correspond to the values reported by synthpop’s compare() function.

Hint. For continuous variables, synthpop creates a categorical variable with \(5\) categories using 5 quantiles of equal size, using

cut(x, breaks = quantile(x, 0:5/5), right = FALSE, include.lowest = TRUE)
# Create an indicator for whether an observation belongs to the true or synthetic
# data
ind <- factor(c(rep("Real", nrow(heart_failure)),
rep("Syn", nrow(synthetic$syn))))
# Combine the true and synthetic diabetes scores
obs_syn_diabetes <- c(heart_failure$diabetes, synthetic$syn$diabetes)
# Calculate predicted probabilities of whether an observation is "real" or
# "synthetic" based on its diabetes value.
pi_diabetes <- glm(ind ~ obs_syn_diabetes, family = binomial) %>%
predict(type = "response")
pMSE_diabetes <- mean((pi_diabetes - 0.5)^2)
pMSE_diabetes
## [1] 7.226687e-05
# Combine observed and synthetic age values into a single vector
obs_syn_age <- c(heart_failure$age, synthetic$syn$age)
# Create a categorical version of this vector with 5 categories
obs_syn_age_cat <- cut(obs_syn_age,
breaks = quantile(obs_syn_age, probs = 0:5/5),
right = FALSE,
include.lowest = TRUE)
pi_age <- glm(ind ~ obs_syn_age_cat, family = binomial) %>%
predict(type = "response")
pMSE_age <- mean((pi_age - 0.5)^2)
pMSE_age
## [1] 0.0007719991
To correct for the complexity of the model (that is, the number of parameters used to estimate the predicted probabilities of being a true or synthetic record), the \(pMSE\) is often standardized. Given that we used the logistic regression model, the standardized \(pMSE\) is defined by \[ S pMSE = \frac{pMSE} {(k-1)(\frac{n_{\text{obs}}}{n_{\text{syn}} + n_{\text{obs}}})^2(\frac{n_{\text{syn}}}{n_{\text{syn}} + n_{\text{obs}}}) / (n_{\text{obs}} + n_{\text{syn}})}. \] Given that we used an equal number of observed and synthetic records, this formula reduces to \[ S pMSE = \frac{pMSE} {(k-1)\Big(\frac{1}{2}\Big)^3 / (n_{\text{obs}} + n_{\text{syn}})}. \]
For the previously calculated \(pMSE\) values of diabetes and age, we can calculate the standardized \(pMSE\) values as follows (note that \(k\), the number of categories, equals \(2\) for diabetes and \(5\) for the categorized version of age, so \(k-1\) equals \(1\) and \(4\), respectively).
SpMSE_diabetes <- pMSE_diabetes / (1 * (0.5^3 / length(ind)))
SpMSE_diabetes
## [1] 0.3457247
SpMSE_age <- pMSE_age / (4 * (0.5^3 / length(ind)))
SpMSE_age
## [1] 0.9233109
In this case, we have been able to generate synthetic data that is quite similar to the observed data on a univariate level.
However, given the multivariate synthesizing procedure, univariate similarity between the observed and synthetic data is not sufficient. We also want the relationships between the variables to be preserved; that is, we want the synthetic and observed data to be similar on a multivariate level as well.
On a univariate level, visual inspection of similarity between observed and synthetic data generally gives the most detailed overview. However, visualizing multivariate relationships quickly becomes an onerous task, especially as the dimensionality of the data increases. Hence, we have to rely on different methods to inspect how well the synthetic data preserves relationships between variables.
In the synthetic data literature, a distinction is often made between general and specific utility measures. General utility measures assess to what extent the relationships between combinations of variables (and potential interactions between them) are preserved in the synthetic data set. These measures are often calculated for pairs of variables, or for all combinations of variables. Specific utility measures focus, as the name already suggests, on a specific analysis. This analysis is performed on the observed data and the synthetic data, and the similarity between inferences on these data sets is quantified.
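Although we focus on general utility below, synthpop also supports specific utility assessments directly: glm.synds() fits a generalized linear model on the synthetic data, and compare() then contrasts its estimates with those from the observed data. The sketch below uses an analysis model of our own choosing (predicting deceased from three biomedical variables); it is not prescribed by the workshop.

```r
# Specific utility: fit an analysis model on the synthetic data and
# compare its coefficients with the same model fitted on the observed data.
fit_syn <- glm.synds(deceased ~ age + ejection_fraction + serum_creatinine,
                     data = synthetic, family = binomial)
compare(fit_syn, heart_failure)
```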
First, we can inspect which interactions of variables can predict whether observations are “true” or “synthetic” using the standardized \(pMSE\) measure, similarly to what we just did using individual variables. Hence, we predict whether observations can be classified based on the interaction of two variables. Again, continuous variables are by default recoded as categorical variables with \(5\) categories.
Ideally, the standardized \(pMSE\) equals \(1\), but according to the synthpop authors, values below \(10\) are indicative of high utility.
6. Use the function utility.tables() to inspect whether any interaction of two variables can predict whether observations are “real” or “synthetic”.
utility.tables(synthetic, heart_failure)
##
## Two-way utility: S_pMSE value plotted for 78 pairs of variables.
##
## Variable combinations with worst 5 utility scores (S_pMSE):
## 08.serum_sodium:09.sex
## 2.5541
## 10.smoking:11.hypertension
## 2.3784
## 03.creatinine_phosphokinase:12.deceased
## 2.2979
## 03.creatinine_phosphokinase:11.hypertension
## 2.2898
## 09.sex:11.hypertension
## 2.1884
##
## Medians and maxima of selected utility measures for all tables compared
## Medians Maxima
## pMSE 0.0021 0.0101
## S_pMSE 1.1167 2.5541
## df 9.0000 24.0000
##
## For more details of all scores use print.tabs = TRUE.
The maximum standardized \(pMSE\) value equals 2.55, which is well below the threshold value of \(10\) given by the synthpop authors. This shows that none of the interactions between any pair of variables is capable of distinguishing observed records from synthetic records.
As a next step, we can assess whether a model containing all variables is capable of predicting which observations are real and which are synthetic.
7. Use the function utility.gen() with parameter method = "logit" to calculate the standardized \(pMSE\) based on a logistic prediction model with all variables included as predictors (the default).
utility.gen(synthetic, heart_failure, method = "logit")
## You will be fitting a large model with 92 parameters and only 299 records
## that may take a long time and fail to converge.
## Have you selected variables with vars?
##
## Utility score calculated by method: logit
##
## Call:
## utility.gen.synds(object = synthetic, data = heart_failure, method = "logit")
##
## Selected utility measures
## pMSE S_pMSE
## 0.026218 1.378318
The standardized \(pMSE\) is very close to \(1\), indicating that even the combination of all variables cannot predict whether the observations stem from the observed or the synthetic data. However, a logistic model without interaction terms may miss important interactions that were present in the observed data but were not reproduced in the synthetic data. CART models are better able to capture such interactions.
8. Use the function utility.gen() with parameter method = "cart" to calculate the standardized \(pMSE\).
Note. CART does not have an analytical expression for the standardized \(pMSE\), which is therefore calculated using a permutation test.
utility.gen(synthetic, heart_failure, method = "cart")
## Running 50 permutations to get NULL utilities and printing every 10th.
## synthesis 1 10 20 30 40 50
##
## Utility score calculated by method: cart
##
## Call:
## utility.gen.synds(object = synthetic, data = heart_failure, method = "cart")
##
## Null utilities simulated from a permutation test with 50 replications.
##
## Selected utility measures
## pMSE S_pMSE
## 0.086162 1.688813
Again, the standardized \(pMSE\) is well below the threshold of \(10\). Hence, up to now, there are no indications that we have created data with poor utility.
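The permutation test behind this standardized \(pMSE\) can be sketched in base R. For self-containment the sketch below uses a logistic propensity model on toy data rather than synthpop's CART fit, but the permutation logic is the same: recompute the \(pMSE\) after repeatedly shuffling the real/synthetic indicator, and divide the observed \(pMSE\) by the mean of these null values.

```r
set.seed(42)

# Toy observed and synthetic data
obs <- data.frame(x = rnorm(80))
syn <- data.frame(x = rnorm(80))
stacked <- rbind(cbind(obs, t = 0), cbind(syn, t = 1))
c_prop  <- mean(stacked$t)

# pMSE for a given labelling of the rows
pmse_fun <- function(labels) {
  fit <- glm(labels ~ x, data = stacked, family = binomial)
  mean((fitted(fit) - c_prop)^2)
}

pmse_obs  <- pmse_fun(stacked$t)                       # observed labels
null_pmse <- replicate(50, pmse_fun(sample(stacked$t)))# permuted labels

# Standardized pMSE: observed value relative to the permutation null
s_pmse <- pmse_obs / mean(null_pmse)
```

Values of s_pmse near 1 indicate that the real/synthetic labels are no easier to predict than random labels.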
If you observe indications of poor utility (either due to one or two variables, or due to a larger set of variables), it might be necessary to tweak the synthesis model. The synthpop authors walk you through the first steps that you could take here.
Specific utility measures assess whether the same analysis on the observed and the synthetic data gives similar results. Say that we are interested in the relationship between whether a person survives and that person's age, diabetes status and smoking behaviour, including the follow-up time as a control variable in the model.
9. Fit this model as a logistic regression model using glm.synds() and inspect the output using summary(), first solely focusing on the output on the synthetic data.
syn_fit <- glm.synds(deceased ~ age + diabetes + smoking + follow_up,
family = binomial,
data = synthetic)
summary(syn_fit)
## Fit to synthetic data set with a single synthesis. Inference to coefficients
## and standard errors that would be obtained from the original data.
##
## Call:
## glm.synds(formula = deceased ~ age + diabetes + smoking + follow_up,
## family = binomial, data = synthetic)
##
## Combined estimates:
## xpct(Beta) xpct(se.Beta) xpct(z) Pr(>|xpct(z)|)
## (Intercept) 0.1527608 0.9462009 0.1614 0.87174
## age 0.0317922 0.0131817 2.4118 0.01587 *
## diabetesYes -0.3101992 0.3304237 -0.9388 0.34784
## smokingYes -0.0302040 0.3585668 -0.0842 0.93287
## follow_up -0.0255069 0.0032734 -7.7922 6.583e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Age and the follow-up period are related to a person’s survival chances, but whether a person has diabetes or smokes is not, at least if we can trust the synthetic data.
10. Compare the output of this model on the synthetic data with the output that would have been obtained when running the model on the observed data, using compare.fit.synds() on the fitted model object.
compare.fit.synds(syn_fit, heart_failure)
##
## Call used to fit models to the data:
## glm.synds(formula = deceased ~ age + diabetes + smoking + follow_up,
## family = binomial, data = synthetic)
##
## Differences between results based on synthetic and observed data:
## Synthetic Observed Diff Std. coef diff CI overlap
## (Intercept) 0.15276084 -0.84666625 0.999427094 1.1063462 0.7177636
## age 0.03179219 0.03650504 -0.004712856 -0.3537465 0.9097569
## diabetesYes -0.31019920 0.11021309 -0.420412293 -1.3550030 0.6543296
## smokingYes -0.03020402 -0.20589862 0.175694594 0.5383495 0.8626634
## follow_up -0.02550695 -0.01931552 -0.006191430 -2.3996741 0.3878270
##
## Measures for one synthesis and 5 coefficients
## Mean confidence interval overlap: 0.7064681
## Mean absolute std. coef diff: 1.150624
##
## Mahalanobis distance ratio for lack-of-fit (target 1.0): 1.64
## Lack-of-fit test: 8.220822; p-value 0.1445 for test that synthesis model is compatible
## with a chi-squared test with 5 degrees of freedom.
##
## Confidence interval plot:
The figure shows that the confidence intervals for each of the regression coefficients overlap to a relatively large degree. On average, the intervals overlap by \(71\%\), and the lack-of-fit test indicates no significant lack of fit. Moreover, these results show that we would draw the same conclusions from the synthetic data as from the observed data.
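The confidence interval overlap reported above can be computed by hand for a single coefficient. A common definition in the synthetic data literature (and, to our understanding, the one synthpop follows) averages the fraction of each interval covered by their intersection; the interval limits below are toy numbers, not the workshop output.

```r
# CI overlap for one coefficient: average proportion of the observed and
# synthetic intervals that is covered by their intersection
ci_overlap <- function(l_obs, u_obs, l_syn, u_syn) {
  inter <- min(u_obs, u_syn) - max(l_obs, l_syn)  # length of the intersection
  0.5 * (inter / (u_obs - l_obs) + inter / (u_syn - l_syn))
}

ci_overlap(0.01, 0.06, 0.00, 0.07)  # partly overlapping intervals: ~0.86
ci_overlap(0.01, 0.06, 0.01, 0.06)  # identical intervals: 1
```

An overlap of 1 means the intervals coincide; values near 0 (or below, for disjoint intervals) signal that the synthetic data would lead to different inferences.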
Altogether, all our utility checks indicated that we were able to generate synthetic data with high utility, as we aimed for! However, it is generally the case that the greater the utility, the higher the loss of privacy. Unfortunately, it is often complicated to evaluate how much privacy loss occurred due to synthesizing our data, not least because hardly any formal measures of privacy loss for synthetic data exist.
In synthpop, there is one function to assess statistical disclosure control: sdc() shows whether any unique observations in the observed data are reproduced in the synthetic data, as such records bear a relatively high risk of having their privacy disclosed. Additionally, the sdc() function offers functionality that can be useful if you deem the remaining disclosure risks of the synthetic data too high. Options are to top/bottom code variables or to add noise to the continuous variables using smoothing.
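The core of the replicated-uniques check can be reproduced in base R: find rows that occur exactly once in the observed data and also appear in the synthetic data. The toy data below are made up for illustration; sdc() performs this check (and the optional removal) for you.

```r
# Toy observed and synthetic data
observed  <- data.frame(age = c(40, 40, 51, 63), sex = c("m", "m", "f", "m"))
synthetic <- data.frame(age = c(40, 51, 58, 70), sex = c("m", "f", "m", "f"))

# Paste all columns together so rows can be compared as single keys
key_obs <- do.call(paste, observed)
key_syn <- do.call(paste, synthetic)

# Rows occurring exactly once in the observed data...
unique_obs <- key_obs[!(duplicated(key_obs) | duplicated(key_obs, fromLast = TRUE))]

# ...that also occur in the synthetic data: candidates for removal
replicated <- key_syn %in% unique_obs
sum(replicated)  # number of replicated uniques: 1 ("51 f")
```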
11. Remove the unique observations in the observed data that are reproduced in the synthetic data, using sdc() with argument rm.replicated.uniques = TRUE.
sdc_out <- sdc(synthetic, heart_failure, rm.replicated.uniques = TRUE)
## no. of replicated uniques: 0
No unique observations from the observed data were reproduced in the synthetic data, so no cases are removed.
Lastly, when you have obtained a synthetic data set and want to make inferences from it, you have to be careful, because generating synthetic data adds variance on top of the sampling variance that you already take into account when evaluating hypotheses. Specifically, if you want to make inferences with respect to the sample of original observations, you can use unaltered analysis techniques and the corresponding, conventional standard errors.
However, if you want to make inferences with respect to the population the sample was taken from, you will have to adjust the standard errors to account for the fact that the synthesis procedure adds additional variance. The amount of variance that is added depends on the number of synthetic data sets that are generated. Intuitively, when generating multiple synthetic data sets, the additional random noise that is induced by the synthesis cancels out, making the parameter estimates more stable. So far, we have generated a single synthetic data set, but it is also possible to generate \(2\), \(5\), \(10\) or even \(1000000\) synthetic data sets (be warned that more synthetic data sets means more information, and thus more disclosure risk).
When generating multiple (\(m\)) synthetic data sets, the point estimate of the parameter of interest \(Q\) is obtained by pooling the synthetic data estimates \[ \bar{q}_m = \frac{1}{m} \sum^m_{i=1} \hat{q}_i, \] where \(\hat{q}_i\) is the parameter estimate in each sample. Additionally, the total variance over synthetic data sets \(T_m\) of \(Q\) is defined as \[ T_m = \Bigg(1 + \frac{1}{m}\Bigg) \Bigg(\frac{1}{m}\Bigg) \sum^m_{i=1} v_i, \] where \(v_i\) is defined as the estimated variance of the quantity in each sample.
Given that we use only a single synthetic data set, the parameter estimate is defined as \(\bar{q}_m = \hat{q}\) with corresponding variance \(2v\).
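These pooling rules translate directly into code. The sketch below uses made-up estimates and standard errors from \(m = 5\) hypothetical syntheses, purely to show the arithmetic:

```r
m     <- 5
q_hat <- c(0.031, 0.029, 0.035, 0.030, 0.033)    # estimate per synthetic set
v_hat <- c(0.013, 0.011, 0.014, 0.012, 0.012)^2  # variance (se^2) per set

q_bar <- mean(q_hat)              # pooled point estimate: (1/m) * sum(q_hat)
T_m   <- (1 + 1/m) * mean(v_hat)  # total variance: (1 + 1/m) * (1/m) * sum(v_hat)
se_m  <- sqrt(T_m)                # pooled standard error

# Sanity check: with m = 1, the total variance reduces to 2 * v,
# matching the single-synthesis case discussed above
(1 + 1/1) * v_hat[1] == 2 * v_hat[1]
```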
12. For the analysis of interest (predicting survival by age, diabetes, smoking and the follow-up period), make inferences with respect to the population by calling summary() on the previously created syn_fit object, with parameter population.inference = TRUE.
summary(syn_fit, population.inference = TRUE)
## Fit to synthetic data set with a single synthesis. Inference to population
## coefficients when all variables in the model are synthesised.
##
## Call:
## glm.synds(formula = deceased ~ age + diabetes + smoking + follow_up,
## family = binomial, data = synthetic)
##
## Combined estimates:
## Beta.syn se.Beta.syn z.syn Pr(>|z.syn|)
## (Intercept) 0.1527608 1.3381301 0.1142 0.90911
## age 0.0317922 0.0186417 1.7054 0.08811 .
## diabetesYes -0.3101992 0.4672897 -0.6638 0.50680
## smokingYes -0.0302040 0.5070900 -0.0596 0.95250
## follow_up -0.0255069 0.0046293 -5.5099 3.589e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
To assess whether the standard errors are adjusted accordingly, run
sum_syn_fit <- summary(syn_fit)
sqrt(sum_syn_fit$coefficients[,2]^2 * 2)
## (Intercept) age diabetesYes smokingYes follow_up
## 1.338130146 0.018641714 0.467289685 0.507089972 0.004629256
These standard errors are equal to the standard errors given by the summary function with argument population.inference = TRUE.