Generating synthetic data with high utility using synthpop

In this workshop, you will learn how to create and evaluate synthetic data in R. In the practical, we will work with the R package synthpop (Nowok, Raab, and Dibben 2016), which is one of the most advanced R packages dedicated to creating synthetic data. Other alternatives for creating synthetic data are, for example, the R package mice (van Buuren and Groothuis-Oudshoorn 2011; see Volker and Vink 2021), or the stand-alone software IVEware (“IVEware: Imputation and Variance Estimation Software,” n.d.).
If you have R and RStudio installed on your device, you can follow all the steps from this practical using your local version of RStudio. In case you do not have an installation of R and RStudio, you can quickly create an account on RStudio Cloud. Note that you have the opportunity to work with your own data (you can also use data provided by us). If you are going to work via RStudio Cloud, you may not want to upload your own data to this server; in that case, you can still decide to work with the data provided by us. Theoretically, you could also install R and RStudio on the spot, but since we only have 45 minutes, we advise using RStudio Cloud if you do not already have access to R and RStudio on your device.
In this workshop, you will (at least) use the following packages. Make sure to load them (in case you haven’t installed them already, install them first, using install.packages("package.name")).
library(synthpop)
library(magrittr)
library(psych)
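If you prefer not to check by hand whether each package is installed, the following sketch (plain base R, nothing workshop-specific) installs whatever is missing and then loads all three packages.

```r
# Install any required packages that are not yet available, then load them all.
pkgs <- c("synthpop", "magrittr", "psych")
missing <- pkgs[!pkgs %in% rownames(installed.packages())]
if (length(missing) > 0) install.packages(missing)
invisible(lapply(pkgs, library, character.only = TRUE))
```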
For this workshop, we have prepared all exercises with the Heart failure clinical records data set. However, you may also choose to work with a data set of your own liking. All steps, exercises, and solutions that we outline here should be applicable to another data set as well, but some data processing might be required before our example code works as it should. In the worst case, you might run into errors that we could not foresee, but we are more than happy to think along and help you solve these issues.
The Heart failure clinical records data set is a medical data set from the UCI Machine Learning Repository (click here for the source), originally collected by Ahmad (2017) from the Government College University, Faisalabad, Pakistan, and adapted and uploaded to the UCI MLR by Chicco and Jurman (2020). This data set contains medical information of 299 individuals on 13 variables, and is typically used to predict whether or not a patient will survive during the follow-up period, using several biomedical predictors.
If you decide to work with the Heart failure clinical records data and work in RStudio Cloud, you can access the environment related to this workshop here, including a script osf_synthetic.R that gets you started on installing and loading the required packages, and that imports the data for you. You can continue to work in this script. Make sure to save the project on your account, so that your changes are not deleted if you, for some reason, have to refresh the browser.
If you have RStudio installed on your own machine, you can download the cleaned version of the Heart failure clinical records data set from my GitHub and load it as heart_failure, by running the following line of code.
heart_failure <- readRDS(url("https://thomvolker.github.io/osf_synthetic/data/heart_failure.RDS"))
The Heart failure clinical records data consists of the following variables:

- age: Age in years
- anaemia: Whether the patient has a decrease of red blood cells (No/Yes)
- creatinine_phosphokinase: Level of the creatinine phosphokinase enzyme in the blood (mcg/L)
- diabetes: Whether the patient has diabetes (No/Yes)
- ejection_fraction: Percentage of blood leaving the heart at each contraction
- platelets: Platelets in the blood (kiloplatelets/mL)
- serum_creatinine: Level of serum creatinine in the blood (mg/dL)
- serum_sodium: Level of serum sodium in the blood (mEq/L)
- sex: Sex (Female/Male)
- smoking: Whether the patient smokes (No/Yes)
- hypertension: Whether the patient has high blood pressure (No/Yes)
- deceased: Whether the patient died during the follow-up period (No/Yes)
- follow_up: Follow-up period (days)

After loading the data, it is always wise to first inspect the data, so that you have an idea what to expect.
head(heart_failure)
age | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | platelets | serum_creatinine | serum_sodium | sex | smoking | hypertension | deceased | follow_up |
---|---|---|---|---|---|---|---|---|---|---|---|---|
75 | No | 582 | No | 20 | 265000 | 1.9 | 130 | Male | No | Yes | Yes | 4 |
55 | No | 7861 | No | 38 | 263358 | 1.1 | 136 | Male | No | No | Yes | 6 |
65 | No | 146 | No | 20 | 162000 | 1.3 | 129 | Male | Yes | No | Yes | 7 |
50 | Yes | 111 | No | 20 | 210000 | 1.9 | 137 | Male | No | No | Yes | 7 |
65 | Yes | 160 | Yes | 20 | 327000 | 2.7 | 116 | Female | No | No | Yes | 8 |
90 | Yes | 47 | No | 40 | 204000 | 2.1 | 132 | Male | Yes | Yes | Yes | 8 |
Additionally, we can ask for a summary of all variables, or use describe() from the psych package to provide descriptive statistics of all variables.

Note. Make sure to install psych if you haven’t done so in the past.
summary(heart_failure)
## age anaemia creatinine_phosphokinase diabetes ejection_fraction
## Min. :40.00 No :170 Min. : 23.0 No :174 Min. :14.00
## 1st Qu.:51.00 Yes:129 1st Qu.: 116.5 Yes:125 1st Qu.:30.00
## Median :60.00 Median : 250.0 Median :38.00
## Mean :60.83 Mean : 581.8 Mean :38.08
## 3rd Qu.:70.00 3rd Qu.: 582.0 3rd Qu.:45.00
## Max. :95.00 Max. :7861.0 Max. :80.00
## platelets serum_creatinine serum_sodium sex smoking
## Min. : 25100 Min. :0.500 Min. :113.0 Female:105 No :203
## 1st Qu.:212500 1st Qu.:0.900 1st Qu.:134.0 Male :194 Yes: 96
## Median :262000 Median :1.100 Median :137.0
## Mean :263358 Mean :1.394 Mean :136.6
## 3rd Qu.:303500 3rd Qu.:1.400 3rd Qu.:140.0
## Max. :850000 Max. :9.400 Max. :148.0
## hypertension deceased follow_up
## No :194 No :203 Min. : 4.0
## Yes:105 Yes: 96 1st Qu.: 73.0
## Median :115.0
## Mean :130.3
## 3rd Qu.:203.0
## Max. :285.0
This gives a good impression about the measurement levels of all variables, as well as the range of the possible values each variable can have.
describe(heart_failure)
vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
age | 1 | 299 | 6.083389e+01 | 1.189481e+01 | 60.0 | 6.021715e+01 | 14.82600 | 40.0 | 95.0 | 55.0 | 0.4188266 | -0.2204793 | 0.6878946 |
anaemia* | 2 | 299 | 1.431438e+00 | 4.961073e-01 | 1.0 | 1.414938e+00 | 0.00000 | 1.0 | 2.0 | 1.0 | 0.2754750 | -1.9305367 | 0.0286906 |
creatinine_phosphokinase | 3 | 299 | 5.818395e+02 | 9.702879e+02 | 250.0 | 3.654938e+02 | 269.83320 | 23.0 | 7861.0 | 7838.0 | 4.4184296 | 24.5254138 | 56.1131970 |
diabetes* | 4 | 299 | 1.418060e+00 | 4.940671e-01 | 1.0 | 1.398340e+00 | 0.00000 | 1.0 | 2.0 | 1.0 | 0.3305857 | -1.8970241 | 0.0285726 |
ejection_fraction | 5 | 299 | 3.808361e+01 | 1.183484e+01 | 38.0 | 3.742739e+01 | 11.86080 | 14.0 | 80.0 | 66.0 | 0.5498228 | 0.0005484 | 0.6844265 |
platelets | 6 | 299 | 2.633580e+05 | 9.780424e+04 | 262000.0 | 2.567301e+05 | 65234.40000 | 25100.0 | 850000.0 | 824900.0 | 1.4476814 | 6.0252322 | 5656.1650591 |
serum_creatinine | 7 | 299 | 1.393880e+00 | 1.034510e+00 | 1.1 | 1.189295e+00 | 0.29652 | 0.5 | 9.4 | 8.9 | 4.4113866 | 25.1888415 | 0.0598273 |
serum_sodium | 8 | 299 | 1.366254e+02 | 4.412477e+00 | 137.0 | 1.368216e+02 | 4.44780 | 113.0 | 148.0 | 35.0 | -1.0376430 | 3.9841899 | 0.2551802 |
sex* | 9 | 299 | 1.648829e+00 | 4.781364e-01 | 2.0 | 1.684647e+00 | 0.00000 | 1.0 | 2.0 | 1.0 | -0.6204576 | -1.6204183 | 0.0276513 |
smoking* | 10 | 299 | 1.321070e+00 | 4.676704e-01 | 1.0 | 1.278008e+00 | 0.00000 | 1.0 | 2.0 | 1.0 | 0.7626368 | -1.4231112 | 0.0270461 |
hypertension* | 11 | 299 | 1.351171e+00 | 4.781364e-01 | 1.0 | 1.315353e+00 | 0.00000 | 1.0 | 2.0 | 1.0 | 0.6204576 | -1.6204183 | 0.0276513 |
deceased* | 12 | 299 | 1.321070e+00 | 4.676704e-01 | 1.0 | 1.278008e+00 | 0.00000 | 1.0 | 2.0 | 1.0 | 0.7626368 | -1.4231112 | 0.0270461 |
follow_up | 13 | 299 | 1.302609e+02 | 7.761421e+01 | 115.0 | 1.292780e+02 | 105.26460 | 4.0 | 285.0 | 281.0 | 0.1265232 | -1.2238150 | 4.4885455 |
The describe() function gives more distributional information about all variables.
In case you brought your own data, you can load it into R using a function that matches your data format. Below, you can find several functions that might be helpful if you want to load your data into R. You can use these functions both locally and on RStudio Cloud, but make sure to install the required package first.
Programme | Format | Command |
---|---|---|
Excel | .xlsx | readxl::read_xlsx("path_to_file/data_name.xlsx") |
Excel | .csv | readr::read_csv("path_to_file/data_name.csv") |
SPSS | .sav | haven::read_sav("path_to_file/data_name.sav") |
Stata | .dta | haven::read_dta("path_to_file/data_name.dta") |
After loading in your own data, make sure that the variables in your data are coded accordingly (this can go wrong when transferring between data types). That is, make sure that your categorical variables are coded as factors and your numeric variables as numeric variables. To do so, you can make use of the following code. Note, however, that this is not a workshop on data wrangling: if importing your data into R creates a mess, it might be better to use the Heart failure clinical records data, so that you can spend your valuable time on creating synthetic data.
data_name$variable1 <- as.numeric(data_name$variable1)
data_name$variable2 <- factor(data_name$variable2,
                              levels = values,       # values of the data
                              labels = value_labels) # labels of these values
If your data has the correct format, we can proceed to the next steps. Given that you are using your own data, we assume that you have (at least some) knowledge about the variables in your data. We will therefore skip the steps to obtain some descriptive information of the variables in your data, and continue to creating and evaluating synthetic data.
In what follows, we will outline how to create and evaluate synthetic data using the Heart failure clinical records data, but most of these steps should be directly applicable to your own data. In case something gives an error, do not hesitate to ask how the problem can be solved!
In the previous part, you created synthetic data using univariate imputation methods. Relationships between variables were completely neglected. In this part, you will create synthetic data that preserves the relationships between the variables in the data.
Using the default settings, synthpop proceeds as follows. The first variable is randomly sampled from the observed data, such that a random subset of the original values is used, without taking relations with the other variables into account. Subsequently, the second variable is generated using a classification and regression tree (CART). A CART model is trained on the observed data, using the first variable as a predictor and the second variable as the outcome. The first synthetic variable is then used to predict the second synthetic variable (by sampling predicted values from the final nodes of the CART model trained on the observed data). The next step is to apply this procedure to the third variable, using variables 1 and 2 as predictors. This procedure is repeated, sequentially adding predictors to the generative model, until all but the last variable are used as predictors for the last variable. By incrementally incorporating variables, the relations in the original data are generally preserved in the synthetic data.
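To make these defaults tangible, the sketch below spells out what syn() does implicitly, by passing the method vector and visit sequence it would use anyway (assuming the heart_failure data from above is loaded). Once written out like this, it is easy to deviate from the defaults, for example by changing the method for a single variable.

```r
# Make synthpop's defaults explicit: the first variable is drawn with
# "sample", every later variable is generated with CART, in column order.
default_methods <- c("sample", rep("cart", ncol(heart_failure) - 1))
syn_explicit <- syn(heart_failure,
                    method = default_methods,
                    visit.sequence = 1:ncol(heart_failure))
```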
1. Use the syn() function from the synthpop package to create a synthetic version of your data, using the default settings.
OPTIONAL. Fix the random seed before generating the synthetic data, so that you can compare your output with our output.
set.seed(123)
synthetic <- syn(heart_failure)
## Synthesis
## -----------
## age anaemia creatinine_phosphokinase diabetes ejection_fraction platelets serum_creatinine serum_sodium sex smoking
## hypertension deceased follow_up
You have now created the object synthetic, which is of class synds and contains several pieces of information related to your synthetic data. We will quickly go over the most important output.
2. Inspect the output object, and pay attention to the number of synthetic data sets, the synthesis methods, the visit sequence and the predictor matrix.
synthetic
## Call:
## ($call) syn(data = heart_failure)
##
## Number of synthesised data sets:
## ($m) 1
##
## First rows of synthesised data set:
## ($syn)
## age anaemia creatinine_phosphokinase diabetes ejection_fraction platelets
## 1 65 No 118 Yes 30 267000
## 2 85 Yes 582 No 38 140000
## 3 42 No 582 Yes 38 263358
## 4 73 No 675 No 40 243000
## 5 63 No 936 No 38 297000
## 6 50 No 482 No 45 371000
## serum_creatinine serum_sodium sex smoking hypertension deceased follow_up
## 1 1.4 135 Female No No No 212
## 2 1.2 136 Male No No Yes 109
## 3 1.1 134 Male Yes No No 145
## 4 1.1 137 Female No Yes No 107
## 5 1.0 137 Male Yes Yes No 187
## 6 1.0 137 Female No No No 140
## ...
##
## Synthesising methods:
## ($method)
## age anaemia creatinine_phosphokinase
## "sample" "cart" "cart"
## diabetes ejection_fraction platelets
## "cart" "cart" "cart"
## serum_creatinine serum_sodium sex
## "cart" "cart" "cart"
## smoking hypertension deceased
## "cart" "cart" "cart"
## follow_up
## "cart"
##
## Order of synthesis:
## ($visit.sequence)
## age anaemia creatinine_phosphokinase
## 1 2 3
## diabetes ejection_fraction platelets
## 4 5 6
## serum_creatinine serum_sodium sex
## 7 8 9
## smoking hypertension deceased
## 10 11 12
## follow_up
## 13
##
## Matrix of predictors:
## ($predictor.matrix)
## age anaemia creatinine_phosphokinase diabetes
## age 0 0 0 0
## anaemia 1 0 0 0
## creatinine_phosphokinase 1 1 0 0
## diabetes 1 1 1 0
## ejection_fraction 1 1 1 1
## platelets 1 1 1 1
## serum_creatinine 1 1 1 1
## serum_sodium 1 1 1 1
## sex 1 1 1 1
## smoking 1 1 1 1
## hypertension 1 1 1 1
## deceased 1 1 1 1
## follow_up 1 1 1 1
## ejection_fraction platelets serum_creatinine
## age 0 0 0
## anaemia 0 0 0
## creatinine_phosphokinase 0 0 0
## diabetes 0 0 0
## ejection_fraction 0 0 0
## platelets 1 0 0
## serum_creatinine 1 1 0
## serum_sodium 1 1 1
## sex 1 1 1
## smoking 1 1 1
## hypertension 1 1 1
## deceased 1 1 1
## follow_up 1 1 1
## serum_sodium sex smoking hypertension deceased
## age 0 0 0 0 0
## anaemia 0 0 0 0 0
## creatinine_phosphokinase 0 0 0 0 0
## diabetes 0 0 0 0 0
## ejection_fraction 0 0 0 0 0
## platelets 0 0 0 0 0
## serum_creatinine 0 0 0 0 0
## serum_sodium 0 0 0 0 0
## sex 1 0 0 0 0
## smoking 1 1 0 0 0
## hypertension 1 1 1 0 0
## deceased 1 1 1 1 0
## follow_up 1 1 1 1 1
## follow_up
## age 0
## anaemia 0
## creatinine_phosphokinase 0
## diabetes 0
## ejection_fraction 0
## platelets 0
## serum_creatinine 0
## serum_sodium 0
## sex 0
## smoking 0
## hypertension 0
## deceased 0
## follow_up 0
If all went well, you should have a single synthetic data set (increasing \(m\), the number of synthetic data sets, can decrease the variance of inferences on synthetic data). Also, the first variable should have synthesis method "sample", whereas all other variables should have method "cart". The visit sequence should increase sequentially from \(1\) to the number of variables in your data set (\(13\) if you are using the Heart failure clinical records data). Lastly, the predictor matrix shows, for the variable in each row, which variables (in the columns) are used as predictors; it should be a lower triangular matrix of ones, with zeros on and above the diagonal.
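As an aside on \(m\): if you want more than one synthetic data set, you can request this directly in the call to syn(). The sketch below uses m = 5, an arbitrary choice on our part.

```r
# Generate five synthetic data sets at once; synthetic_m$syn is then a
# list of five data frames instead of a single data frame.
synthetic_m <- syn(heart_failure, m = 5, seed = 123)
synthetic_m$m  # should equal 5
```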
The output object synthetic contains additional information that you can access by specifically asking for it, using, for example, synthetic$smoothing, which shows for which of the variables smoothing was applied. In general, you will not need most of this information, but sometimes it can be helpful to check whether more sophisticated arguments were passed successfully.
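To see which components the synds object stores in the first place, you can list its names with plain base R; any of these can then be extracted with the $ operator.

```r
# List all components of the synds object; access them as, for example,
# synthetic$method or synthetic$predictor.matrix.
names(synthetic)
```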
The quality of synthetic data sets can be assessed on multiple levels and in multiple different ways. Starting on a univariate level, we can assess whether similar values seem to appear in the synthetic data as in the observed data. Additionally, the distributions of the synthetic data sets can be compared with the distribution of the observed data.
3. Inspect the synthetic data by asking for head(synthetic$syn) and tail(synthetic$syn), to check whether the synthetic data looks similar to the observed data (which we inspected previously already).
head(synthetic$syn)
tail(synthetic$syn)
age | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | platelets | serum_creatinine | serum_sodium | sex | smoking | hypertension | deceased | follow_up |
---|---|---|---|---|---|---|---|---|---|---|---|---|
65 | No | 118 | Yes | 30 | 267000 | 1.4 | 135 | Female | No | No | No | 212 |
85 | Yes | 582 | No | 38 | 140000 | 1.2 | 136 | Male | No | No | Yes | 109 |
42 | No | 582 | Yes | 38 | 263358 | 1.1 | 134 | Male | Yes | No | No | 145 |
73 | No | 675 | No | 40 | 243000 | 1.1 | 137 | Female | No | Yes | No | 107 |
63 | No | 936 | No | 38 | 297000 | 1.0 | 137 | Male | Yes | Yes | No | 187 |
50 | No | 482 | No | 45 | 371000 | 1.0 | 137 | Female | No | No | No | 140 |
age | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | platelets | serum_creatinine | serum_sodium | sex | smoking | hypertension | deceased | follow_up | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
294 | 50 | Yes | 168 | Yes | 45 | 454000 | 1.4 | 137 | Female | No | Yes | No | 216 |
295 | 64 | Yes | 52 | No | 60 | 226000 | 1.8 | 131 | Male | Yes | No | No | 90 |
296 | 50 | No | 115 | No | 35 | 254000 | 0.7 | 138 | Male | Yes | Yes | No | 145 |
297 | 59 | Yes | 129 | Yes | 35 | 260000 | 1.0 | 136 | Male | Yes | Yes | No | 95 |
298 | 45 | No | 7702 | No | 25 | 226000 | 0.8 | 127 | Male | Yes | No | No | 112 |
299 | 44 | No | 582 | No | 50 | 374000 | 0.7 | 140 | Female | No | Yes | Yes | 95 |
If all went well, you should at least observe that the variables in the synthetic data have the same measurement level as the variables in the observed data.
4. Check the distributional similarity by comparing the summary of the observed and synthetic data. Additionally, you may inspect the similarity by running describe() from the psych package on both the observed and the synthetic data.
summary(heart_failure)
## age anaemia creatinine_phosphokinase diabetes ejection_fraction
## Min. :40.00 No :170 Min. : 23.0 No :174 Min. :14.00
## 1st Qu.:51.00 Yes:129 1st Qu.: 116.5 Yes:125 1st Qu.:30.00
## Median :60.00 Median : 250.0 Median :38.00
## Mean :60.83 Mean : 581.8 Mean :38.08
## 3rd Qu.:70.00 3rd Qu.: 582.0 3rd Qu.:45.00
## Max. :95.00 Max. :7861.0 Max. :80.00
## platelets serum_creatinine serum_sodium sex smoking
## Min. : 25100 Min. :0.500 Min. :113.0 Female:105 No :203
## 1st Qu.:212500 1st Qu.:0.900 1st Qu.:134.0 Male :194 Yes: 96
## Median :262000 Median :1.100 Median :137.0
## Mean :263358 Mean :1.394 Mean :136.6
## 3rd Qu.:303500 3rd Qu.:1.400 3rd Qu.:140.0
## Max. :850000 Max. :9.400 Max. :148.0
## hypertension deceased follow_up
## No :194 No :203 Min. : 4.0
## Yes:105 Yes: 96 1st Qu.: 73.0
## Median :115.0
## Mean :130.3
## 3rd Qu.:203.0
## Max. :285.0
summary(synthetic$syn)
## age anaemia creatinine_phosphokinase diabetes ejection_fraction
## Min. :40.00 No :166 Min. : 23.0 No :179 Min. :15.00
## 1st Qu.:53.00 Yes:133 1st Qu.: 118.0 Yes:120 1st Qu.:30.00
## Median :60.00 Median : 213.0 Median :38.00
## Mean :62.02 Mean : 568.8 Mean :39.41
## 3rd Qu.:70.00 3rd Qu.: 582.0 3rd Qu.:47.50
## Max. :95.00 Max. :7702.0 Max. :70.00
## platelets serum_creatinine serum_sodium sex smoking
## Min. : 25100 Min. :0.500 Min. :116.0 Female:110 No :207
## 1st Qu.:211500 1st Qu.:0.900 1st Qu.:134.0 Male :189 Yes: 92
## Median :255000 Median :1.100 Median :137.0
## Mean :261243 Mean :1.426 Mean :136.9
## 3rd Qu.:297500 3rd Qu.:1.650 3rd Qu.:140.0
## Max. :742000 Max. :6.800 Max. :148.0
## hypertension deceased follow_up
## No :189 No :197 Min. : 6.0
## Yes:110 Yes:102 1st Qu.: 70.0
## Median :109.0
## Mean :125.5
## 3rd Qu.:188.0
## Max. :285.0
describe(heart_failure)
describe(synthetic$syn)
vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
age | 1 | 299 | 6.083389e+01 | 1.189481e+01 | 60.0 | 6.021715e+01 | 14.82600 | 40.0 | 95.0 | 55.0 | 0.4188266 | -0.2204793 | 0.6878946 |
anaemia* | 2 | 299 | 1.431438e+00 | 4.961073e-01 | 1.0 | 1.414938e+00 | 0.00000 | 1.0 | 2.0 | 1.0 | 0.2754750 | -1.9305367 | 0.0286906 |
creatinine_phosphokinase | 3 | 299 | 5.818395e+02 | 9.702879e+02 | 250.0 | 3.654938e+02 | 269.83320 | 23.0 | 7861.0 | 7838.0 | 4.4184296 | 24.5254138 | 56.1131970 |
diabetes* | 4 | 299 | 1.418060e+00 | 4.940671e-01 | 1.0 | 1.398340e+00 | 0.00000 | 1.0 | 2.0 | 1.0 | 0.3305857 | -1.8970241 | 0.0285726 |
ejection_fraction | 5 | 299 | 3.808361e+01 | 1.183484e+01 | 38.0 | 3.742739e+01 | 11.86080 | 14.0 | 80.0 | 66.0 | 0.5498228 | 0.0005484 | 0.6844265 |
platelets | 6 | 299 | 2.633580e+05 | 9.780424e+04 | 262000.0 | 2.567301e+05 | 65234.40000 | 25100.0 | 850000.0 | 824900.0 | 1.4476814 | 6.0252322 | 5656.1650591 |
serum_creatinine | 7 | 299 | 1.393880e+00 | 1.034510e+00 | 1.1 | 1.189295e+00 | 0.29652 | 0.5 | 9.4 | 8.9 | 4.4113866 | 25.1888415 | 0.0598273 |
serum_sodium | 8 | 299 | 1.366254e+02 | 4.412477e+00 | 137.0 | 1.368216e+02 | 4.44780 | 113.0 | 148.0 | 35.0 | -1.0376430 | 3.9841899 | 0.2551802 |
sex* | 9 | 299 | 1.648829e+00 | 4.781364e-01 | 2.0 | 1.684647e+00 | 0.00000 | 1.0 | 2.0 | 1.0 | -0.6204576 | -1.6204183 | 0.0276513 |
smoking* | 10 | 299 | 1.321070e+00 | 4.676704e-01 | 1.0 | 1.278008e+00 | 0.00000 | 1.0 | 2.0 | 1.0 | 0.7626368 | -1.4231112 | 0.0270461 |
hypertension* | 11 | 299 | 1.351171e+00 | 4.781364e-01 | 1.0 | 1.315353e+00 | 0.00000 | 1.0 | 2.0 | 1.0 | 0.6204576 | -1.6204183 | 0.0276513 |
deceased* | 12 | 299 | 1.321070e+00 | 4.676704e-01 | 1.0 | 1.278008e+00 | 0.00000 | 1.0 | 2.0 | 1.0 | 0.7626368 | -1.4231112 | 0.0270461 |
follow_up | 13 | 299 | 1.302609e+02 | 7.761421e+01 | 115.0 | 1.292780e+02 | 105.26460 | 4.0 | 285.0 | 281.0 | 0.1265232 | -1.2238150 | 4.4885455 |
vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
age | 1 | 299 | 6.201561e+01 | 1.227605e+01 | 60.0 | 6.150070e+01 | 13.34340 | 40.0 | 95.0 | 55.0 | 0.3992243 | -0.3082979 | 0.7099424 |
anaemia* | 2 | 299 | 1.444816e+00 | 4.977785e-01 | 1.0 | 1.431535e+00 | 0.00000 | 1.0 | 2.0 | 1.0 | 0.2209793 | -1.9576820 | 0.0287873 |
creatinine_phosphokinase | 3 | 299 | 5.688227e+02 | 9.977052e+02 | 213.0 | 3.412656e+02 | 217.94220 | 23.0 | 7702.0 | 7679.0 | 4.3466736 | 22.8739358 | 57.6987835 |
diabetes* | 4 | 299 | 1.401338e+00 | 4.909909e-01 | 1.0 | 1.377593e+00 | 0.00000 | 1.0 | 2.0 | 1.0 | 0.4005461 | -1.8457022 | 0.0283947 |
ejection_fraction | 5 | 299 | 3.940803e+01 | 1.230952e+01 | 38.0 | 3.912033e+01 | 11.86080 | 15.0 | 70.0 | 55.0 | 0.3016707 | -0.6143742 | 0.7118778 |
platelets | 6 | 299 | 2.612429e+05 | 9.314287e+04 | 255000.0 | 2.554046e+05 | 63751.80000 | 25100.0 | 742000.0 | 716900.0 | 1.3503730 | 5.1309274 | 5386.5911051 |
serum_creatinine | 7 | 299 | 1.425819e+00 | 9.438396e-01 | 1.1 | 1.228299e+00 | 0.29652 | 0.5 | 6.8 | 6.3 | 2.9535346 | 10.6961787 | 0.0545837 |
serum_sodium | 8 | 299 | 1.368930e+02 | 4.405067e+00 | 137.0 | 1.369876e+02 | 4.44780 | 116.0 | 148.0 | 32.0 | -0.5800515 | 2.1367155 | 0.2547516 |
sex* | 9 | 299 | 1.632107e+00 | 4.830405e-01 | 2.0 | 1.663900e+00 | 0.00000 | 1.0 | 2.0 | 1.0 | -0.5451518 | -1.7084900 | 0.0279350 |
smoking* | 10 | 299 | 1.307692e+00 | 4.623122e-01 | 1.0 | 1.261411e+00 | 0.00000 | 1.0 | 2.0 | 1.0 | 0.8291562 | -1.3168707 | 0.0267362 |
hypertension* | 11 | 299 | 1.367893e+00 | 4.830405e-01 | 1.0 | 1.336100e+00 | 0.00000 | 1.0 | 2.0 | 1.0 | 0.5451518 | -1.7084900 | 0.0279350 |
deceased* | 12 | 299 | 1.341137e+00 | 4.748861e-01 | 1.0 | 1.302905e+00 | 0.00000 | 1.0 | 2.0 | 1.0 | 0.6668191 | -1.5605380 | 0.0274634 |
follow_up | 13 | 299 | 1.254749e+02 | 7.556495e+01 | 109.0 | 1.234232e+02 | 96.36900 | 6.0 | 285.0 | 279.0 | 0.2344404 | -1.1107920 | 4.3700341 |
In principle, you should observe that on a univariate level, the information in the observed and synthetic data is quite similar. That is, all variables are similarly distributed, with only small differences between the means and variances in the observed and synthetic data. Higher-order moments, such as the skewness and kurtosis, are also reasonably close.
5. Inspect the univariate distributional similarity of the observed and synthetic data in more detail, using the function compare() from the synthpop package.
compare(synthetic, heart_failure)
##
## Comparing percentages observed with synthetic
## Press return for next variable(s):
## Press return for next variable(s):
## Press return for next variable(s):
##
## Selected utility measures:
## pMSE S_pMSE df
## age 0.000772 0.923311 4
## anaemia 0.000045 0.217375 1
## creatinine_phosphokinase 0.001149 1.374546 4
## diabetes 0.000072 0.345725 1
## ejection_fraction 0.001424 1.702747 4
## platelets 0.000386 0.461658 4
## serum_creatinine 0.001276 1.526175 4
## serum_sodium 0.001066 1.275218 4
## sex 0.000076 0.363106 1
## smoking 0.000052 0.248262 1
## hypertension 0.000076 0.363106 1
## deceased 0.000114 0.543636 1
## follow_up 0.000478 0.572108 4
In each of these figures, the distribution of the observed and synthetic data should be similar. That is, the bars of the histograms corresponding to the same categories should have approximately the same height.
A more formal quantification of the similarity between the synthetic and the observed data is provided by the \(pMSE\) values. There are many other utility measures available (see, for example, this paper), but for now we will focus on the \(pMSE\) (these utility measures tend to correlate highly anyway). These values can quantify univariate or multivariate similarity between the observed and synthetic data, but in the output of compare(), they solely concern univariate similarity.
Technically, the \(pMSE\) is defined by \[ pMSE = \frac{1}{n_{\text{obs}} + n_{\text{syn}}} \left[ \sum^{n_{\text{obs}}}_{i=1}\Big(\hat{\pi}_i - \frac{n_{\text{syn}}}{n_{\text{obs}} + n_{\text{syn}}}\Big)^2 + \sum^{n_{\text{syn}}}_{i=1}\Big(\hat{\pi}_i - \frac{n_{\text{syn}}}{n_{\text{obs}} + n_{\text{syn}}}\Big)^2 \right], \] which, in our case, simplifies to \[ pMSE = \frac{1}{n_{\text{obs}} + n_{\text{syn}}} \left[ \sum^{n_{\text{obs}}}_{i=1} (\hat{\pi}_i - 0.5)^2 + \sum^{n_{\text{syn}}}_{i=1} (\hat{\pi}_i - 0.5)^2 \right], \] where \(n_{\text{obs}}\) is the sample size of the observed data, \(n_{\text{syn}}\) is the number of synthetic records, and \(\hat{\pi}_i\) is each observation’s predicted probability of being a synthetic record. Intuitively, this method quantifies to what extent it is possible to predict whether an observation is a “true” record or a synthetic record. When it is possible to accurately predict whether an observation comes from the real data, there are important differences between the observed and synthetic data on some variables.
OPTIONAL. Calculate the \(pMSE\) for the variables diabetes and age by hand, and check whether the values correspond to the values reported by synthpop’s compare() function.

Hint. For continuous variables, synthpop creates a categorical variable with \(5\) categories using 5 quantiles of equal size, using

cut(x, breaks = quantile(x, 0:5/5), right = FALSE, include.lowest = TRUE)
# Create an indicator for whether an observation belongs to the true or synthetic
# data
ind <- factor(c(rep("Real", nrow(heart_failure)),
rep("Syn", nrow(synthetic$syn))))
# Combine the true and synthetic diabetes scores
obs_syn_diabetes <- c(heart_failure$diabetes, synthetic$syn$diabetes)
# Calculate predicted probabilities of whether an observation is "real" or
# "synthetic" based on its diabetes value.
pi_diabetes <- glm(ind ~ obs_syn_diabetes, family = binomial) %>%
predict(type = "response")
pMSE_diabetes <- mean((pi_diabetes - 0.5)^2)
pMSE_diabetes
## [1] 7.226687e-05
# Combine observed and synthetic age values into a single vector
obs_syn_age <- c(heart_failure$age, synthetic$syn$age)
# Create a categorical version of this vector with 5 categories
obs_syn_age_cat <- cut(obs_syn_age,
breaks = quantile(obs_syn_age, probs = 0:5/5),
right = FALSE,
include.lowest = TRUE)
pi_age <- glm(ind ~ obs_syn_age_cat, family = binomial) %>%
predict(type = "response")
pMSE_age <- mean((pi_age - 0.5)^2)
pMSE_age
## [1] 0.0007719991
To correct for the complexity of the model (that is, the number of parameters used to estimate the predicted probabilities of being a true or synthetic record), the \(pMSE\) is often standardized. Given that we used the logistic regression model, the standardized \(pMSE\) is defined by \[ S pMSE = \frac{pMSE} {(k-1)(\frac{n_{\text{obs}}}{n_{\text{syn}} + n_{\text{obs}}})^2(\frac{n_{\text{syn}}}{n_{\text{syn}} + n_{\text{obs}}}) / (n_{\text{obs}} + n_{\text{syn}})}. \] Given that we used an equal number of observed and synthetic records, this formula reduces to \[ S pMSE = \frac{pMSE} {(k-1)\Big(\frac{1}{2}\Big)^3 / (n_{\text{obs}} + n_{\text{syn}})}. \]
For the previously calculated \(pMSE\) values of diabetes and age, we can calculate the standardized \(pMSE\) values as follows (note that \(k\), the number of categories, equals \(2\) for diabetes and \(5\) for the categorized version of age, so \(k-1\) equals \(1\) and \(4\), respectively).
SpMSE_diabetes <- pMSE_diabetes / (1 * (0.5^3 / length(ind)))
SpMSE_diabetes
## [1] 0.3457247
SpMSE_age <- pMSE_age / (4 * (0.5^3 / length(ind)))
SpMSE_age
## [1] 0.9233109
In this case, we have been able to generate synthetic data that is quite similar to the observed data on a univariate level.
However, given the multivariate synthesizing procedure, univariate similarity between the observed and synthetic data is not sufficient. We also want the relationships between the variables to be preserved; that is, we want the synthetic and observed data to be similar on a multivariate level as well.
On a univariate level, visual inspection of similarity between observed and synthetic data generally gives the most detailed overview. However, visualizing multivariate relationships quickly becomes an onerous task, especially as the dimensionality of the data increases. Hence, we have to rely on different methods to inspect how well the synthetic data preserves relationships between variables.
In the synthetic data literature, a distinction is often made between general and specific utility measures. General utility measures assess to what extent the relationships between combinations of variables (and potential interactions between them) are preserved in the synthetic data set. These measures are often calculated for pairs of variables, or for all combinations of variables. Specific utility measures focus, as the name already suggests, on a specific analysis. This analysis is performed on the observed data and the synthetic data, and the similarity between inferences on these data sets is quantified.
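Although we focus on general utility below, synthpop also supports specific utility assessments directly: glm.synds() fits a generalized linear model on the synthetic data, and compare() then contrasts its estimates with those from the observed data. The sketch below uses an analysis model of our own choosing (predicting deceased from three biomedical variables); it is not prescribed by the workshop.

```r
# Specific utility: fit an analysis model on the synthetic data and
# compare its coefficients with the same model fitted on the observed data.
fit_syn <- glm.synds(deceased ~ age + ejection_fraction + serum_creatinine,
                     data = synthetic, family = binomial)
compare(fit_syn, heart_failure)
```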
First, we can inspect which interactions of variables can predict whether observations are “true” or “synthetic” using the standardized \(pMSE\) measure, similarly to what we just did using individual variables. Hence, we predict whether observations can be classified based on the interaction of two variables. Again, continuous variables are by default recoded as categorical variables with \(5\) categories.
Ideally, the standardized \(pMSE\) equals \(1\), but according to the synthpop authors, values below \(10\) are indicative of high utility.
6. Use the function utility.tables() to inspect whether any interaction of two variables can predict whether observations are “real” or “synthetic”.
utility.tables(synthetic, heart_failure)
##
## Two-way utility: S_pMSE value plotted for 78 pairs of variables.
##
## Variable combinations with worst 5 utility scores (S_pMSE):
## 08.serum_sodium:09.sex
## 2.5541
## 10.smoking:11.hypertension
## 2.3784
## 03.creatinine_phosphokinase:12.deceased
## 2.2979
## 03.creatinine_phosphokinase:11.hypertension
## 2.2898
## 09.sex:11.hypertension
## 2.1884
##
## Medians and maxima of selected utility measures for all tables compared
## Medians Maxima
## pMSE 0.0021 0.0101
## S_pMSE 1.1167 2.5541
## df 9.0000 24.0000
##
## For more details of all scores use print.tabs = TRUE.
The maximum standardized \(pMSE\) value equals 2.55, which is well below the threshold value of \(10\) given by the synthpop authors. This shows that none of the interactions between any pair of variables is capable of distinguishing observed records from synthetic records.
As a next step, we can assess whether a model containing all variables is capable of predicting which observations are real and which are synthetic.
7. Use the function utility.gen() with parameter method = "logit" to calculate the standardized \(pMSE\) based on a logistic prediction model with all variables included as predictors (the default).
utility.gen(synthetic, heart_failure, method = "logit")
## You will be fitting a large model with 92 parameters and only 299 records
## that may take a long time and fail to converge.
## Have you selected variables with vars?
##
## Utility score calculated by method: logit
##
## Call:
## utility.gen.synds(object = synthetic, data = heart_failure, method = "logit")
##
## Selected utility measures
## pMSE S_pMSE
## 0.026218 1.378318
The standardized \(pMSE\) is very close to \(1\), indicating that even the combination of all variables cannot predict whether the observations stem from the observed or the synthetic data. However, a logistic model without interaction terms may miss important interactions that were present in the observed data but were not reproduced in the synthetic data. CART models are better able to capture such interactions.
8. Use the function utility.gen() with parameter method = "cart" to calculate the standardized \(pMSE\).
Note. CART does not have an analytical expression for the standardized \(pMSE\), which is therefore calculated using a permutation test.
utility.gen(synthetic, heart_failure, method = "cart")
## Running 50 permutations to get NULL utilities and printing every 10th.
## synthesis 1 10 20 30 40 50
##
## Utility score calculated by method: cart
##
## Call:
## utility.gen.synds(object = synthetic, data = heart_failure, method = "cart")
##
## Null utilities simulated from a permutation test with 50 replications.
##
## Selected utility measures
## pMSE S_pMSE
## 0.086162 1.688813
Again, the standardized \(pMSE\) is well below the threshold of \(10\). Hence, up to now, there are no indications that we have created data with poor utility.
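The permutation test behind this standardized \(pMSE\) can be sketched in base R. For self-containment the sketch below uses a logistic propensity model on toy data rather than synthpop's CART fit, but the permutation logic is the same: recompute the \(pMSE\) after repeatedly shuffling the real/synthetic indicator, and divide the observed \(pMSE\) by the mean of these null values.

```r
set.seed(42)

# Toy observed and synthetic data
obs <- data.frame(x = rnorm(80))
syn <- data.frame(x = rnorm(80))
stacked <- rbind(cbind(obs, t = 0), cbind(syn, t = 1))
c_prop  <- mean(stacked$t)

# pMSE for a given labelling of the rows
pmse_fun <- function(labels) {
  fit <- glm(labels ~ x, data = stacked, family = binomial)
  mean((fitted(fit) - c_prop)^2)
}

pmse_obs  <- pmse_fun(stacked$t)                       # observed labels
null_pmse <- replicate(50, pmse_fun(sample(stacked$t)))# permuted labels

# Standardized pMSE: observed value relative to the permutation null
s_pmse <- pmse_obs / mean(null_pmse)
```

Values of s_pmse near 1 indicate that the real/synthetic labels are no easier to predict than random labels.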
If you observe indications of poor utility (either due to one or two variables, or due to a larger set of variables), it might be necessary to tweak the synthesis model. The synthpop authors walk you through the first steps that you could take here.
Specific utility measures assess whether the same analysis on the observed and the synthetic data gives similar results. Say that we are interested in the relationship between whether a person survives and that person's age, diabetes status and smoking behaviour, including the follow-up time as a control variable in the model.
9. Fit this model as a logistic regression model using glm.synds() and inspect the output using summary(), first solely focusing on the output on the synthetic data.
syn_fit <- glm.synds(deceased ~ age + diabetes + smoking + follow_up,
family = binomial,
data = synthetic)
summary(syn_fit)
## Fit to synthetic data set with a single synthesis. Inference to coefficients
## and standard errors that would be obtained from the original data.
##
## Call:
## glm.synds(formula = deceased ~ age + diabetes + smoking + follow_up,
## family = binomial, data = synthetic)
##
## Combined estimates:
## xpct(Beta) xpct(se.Beta) xpct(z) Pr(>|xpct(z)|)
## (Intercept) 0.1527608 0.9462009 0.1614 0.87174
## age 0.0317922 0.0131817 2.4118 0.01587 *
## diabetesYes -0.3101992 0.3304237 -0.9388 0.34784
## smokingYes -0.0302040 0.3585668 -0.0842 0.93287
## follow_up -0.0255069 0.0032734 -7.7922 6.583e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Age and the follow-up period are related to a person’s survival chances, but whether a person has diabetes or smokes is not, at least if we can trust the synthetic data.
10. Compare the output of this model on the synthetic data with the output that would have been obtained when running the model on the observed data, using compare.fit.synds() on the fitted model object.
compare.fit.synds(syn_fit, heart_failure)
##
## Call used to fit models to the data:
## glm.synds(formula = deceased ~ age + diabetes + smoking + follow_up,
## family = binomial, data = synthetic)
##
## Differences between results based on synthetic and observed data:
## Synthetic Observed Diff Std. coef diff CI overlap
## (Intercept) 0.15276084 -0.84666625 0.999427094 1.1063462 0.7177636
## age 0.03179219 0.03650504 -0.004712856 -0.3537465 0.9097569
## diabetesYes -0.31019920 0.11021309 -0.420412293 -1.3550030 0.6543296
## smokingYes -0.03020402 -0.20589862 0.175694594 0.5383495 0.8626634
## follow_up -0.02550695 -0.01931552 -0.006191430 -2.3996741 0.3878270
##
## Measures for one synthesis and 5 coefficients
## Mean confidence interval overlap: 0.7064681
## Mean absolute std. coef diff: 1.150624
##
## Mahalanobis distance ratio for lack-of-fit (target 1.0): 1.64
## Lack-of-fit test: 8.220822; p-value 0.1445 for test that synthesis model is compatible
## with a chi-squared test with 5 degrees of freedom.
##
## Confidence interval plot:
The figure shows that the confidence intervals for each of the regression coefficients overlap to a relatively large degree. On average, the intervals overlap by \(71\%\), and the lack-of-fit test indicates no significant lack of fit. Moreover, these results show that we would draw the same conclusions from the synthetic data as from the observed data.
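The confidence interval overlap reported above can be computed by hand for a single coefficient. A common definition in the synthetic data literature (and, to our understanding, the one synthpop follows) averages the fraction of each interval covered by their intersection; the interval limits below are toy numbers, not the workshop output.

```r
# CI overlap for one coefficient: average proportion of the observed and
# synthetic intervals that is covered by their intersection
ci_overlap <- function(l_obs, u_obs, l_syn, u_syn) {
  inter <- min(u_obs, u_syn) - max(l_obs, l_syn)  # length of the intersection
  0.5 * (inter / (u_obs - l_obs) + inter / (u_syn - l_syn))
}

ci_overlap(0.01, 0.06, 0.00, 0.07)  # partly overlapping intervals: ~0.86
ci_overlap(0.01, 0.06, 0.01, 0.06)  # identical intervals: 1
```

An overlap of 1 means the intervals coincide; values near 0 (or below, for disjoint intervals) signal that the synthetic data would lead to different inferences.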
Altogether, all our utility checks indicated that we were able to generate synthetic data with high utility, as we aimed for! However, it is generally the case that the greater the utility, the higher the loss of privacy. Unfortunately, it is often complicated to evaluate how much privacy loss occurred due to synthesizing our data, not least because hardly any formal measures of privacy loss for synthetic data exist.
In synthpop, there is one function to assess statistical disclosure control: sdc() shows whether any unique observations in the observed data are reproduced in the synthetic data, as such records bear a relatively high risk of having their privacy disclosed. Additionally, the sdc() function offers functionality that can be useful if you deem the remaining disclosure risks of the synthetic data too high. Options are to top/bottom code variables or to add noise to the continuous variables using smoothing.
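The core of the replicated-uniques check can be reproduced in base R: find rows that occur exactly once in the observed data and also appear in the synthetic data. The toy data below are made up for illustration; sdc() performs this check (and the optional removal) for you.

```r
# Toy observed and synthetic data
observed  <- data.frame(age = c(40, 40, 51, 63), sex = c("m", "m", "f", "m"))
synthetic <- data.frame(age = c(40, 51, 58, 70), sex = c("m", "f", "m", "f"))

# Paste all columns together so rows can be compared as single keys
key_obs <- do.call(paste, observed)
key_syn <- do.call(paste, synthetic)

# Rows occurring exactly once in the observed data...
unique_obs <- key_obs[!(duplicated(key_obs) | duplicated(key_obs, fromLast = TRUE))]

# ...that also occur in the synthetic data: candidates for removal
replicated <- key_syn %in% unique_obs
sum(replicated)  # number of replicated uniques: 1 ("51 f")
```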
11. Remove the unique observations in the observed data that are reproduced in the synthetic data, using sdc() with argument rm.replicated.uniques = TRUE.
sdc_out <- sdc(synthetic, heart_failure, rm.replicated.uniques = TRUE)
## no. of replicated uniques: 0
No unique observations from the observed data were reproduced in the synthetic data, so no cases are removed.
Lastly, when you have obtained a synthetic data set and want to make inferences from it, you have to be careful, because generating synthetic data adds variance on top of the sampling variance that you already take into account when evaluating hypotheses. Specifically, if you want to make inferences with respect to the sample of original observations, you can use unaltered analysis techniques and the corresponding, conventional standard errors.
However, if you want to make inferences with respect to the population the sample was taken from, you will have to adjust the standard errors to account for the fact that the synthesis procedure adds additional variance. The amount of variance that is added depends on the number of synthetic data sets that are generated. Intuitively, when generating multiple synthetic data sets, the additional random noise that is induced by the synthesis cancels out, making the parameter estimates more stable. So far, we have generated a single synthetic data set, but it is also possible to generate \(2\), \(5\), \(10\) or even \(1000000\) synthetic data sets (be warned that more synthetic data sets means more information, and thus more disclosure risk).
When generating multiple (\(m\)) synthetic data sets, the point estimate of the parameter of interest \(Q\) is obtained by pooling the synthetic data estimates \[ \bar{q}_m = \frac{1}{m} \sum^m_{i=1} \hat{q}_i, \] where \(\hat{q}_i\) is the parameter estimate in each sample. Additionally, the total variance over synthetic data sets \(T_m\) of \(Q\) is defined as \[ T_m = \Bigg(1 + \frac{1}{m}\Bigg) \Bigg(\frac{1}{m}\Bigg) \sum^m_{i=1} v_i, \] where \(v_i\) is defined as the estimated variance of the quantity in each sample.
Given that we use only a single synthetic data set, the parameter estimate is defined as \(\bar{q}_m = \hat{q}\) with corresponding variance \(2v\).
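These pooling rules translate directly into code. The sketch below uses made-up estimates and standard errors from \(m = 5\) hypothetical syntheses, purely to show the arithmetic:

```r
m     <- 5
q_hat <- c(0.031, 0.029, 0.035, 0.030, 0.033)    # estimate per synthetic set
v_hat <- c(0.013, 0.011, 0.014, 0.012, 0.012)^2  # variance (se^2) per set

q_bar <- mean(q_hat)              # pooled point estimate: (1/m) * sum(q_hat)
T_m   <- (1 + 1/m) * mean(v_hat)  # total variance: (1 + 1/m) * (1/m) * sum(v_hat)
se_m  <- sqrt(T_m)                # pooled standard error

# Sanity check: with m = 1, the total variance reduces to 2 * v,
# matching the single-synthesis case discussed above
(1 + 1/1) * v_hat[1] == 2 * v_hat[1]
```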
12. For the analysis of interest (predicting survival by age, diabetes, smoking and the follow-up period), make inferences with respect to the population by calling summary() on the previously created syn_fit object, with parameter population.inference = TRUE.
summary(syn_fit, population.inference = TRUE)
## Fit to synthetic data set with a single synthesis. Inference to population
## coefficients when all variables in the model are synthesised.
##
## Call:
## glm.synds(formula = deceased ~ age + diabetes + smoking + follow_up,
## family = binomial, data = synthetic)
##
## Combined estimates:
## Beta.syn se.Beta.syn z.syn Pr(>|z.syn|)
## (Intercept) 0.1527608 1.3381301 0.1142 0.90911
## age 0.0317922 0.0186417 1.7054 0.08811 .
## diabetesYes -0.3101992 0.4672897 -0.6638 0.50680
## smokingYes -0.0302040 0.5070900 -0.0596 0.95250
## follow_up -0.0255069 0.0046293 -5.5099 3.589e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
To assess whether the standard errors are adjusted accordingly, run
sum_syn_fit <- summary(syn_fit)
sqrt(sum_syn_fit$coefficients[,2]^2 * 2)
## (Intercept) age diabetesYes smokingYes follow_up
## 1.338130146 0.018641714 0.467289685 0.507089972 0.004629256
These standard errors are equal to the standard errors given by the summary function with argument population.inference = TRUE.