heart_failure <- readRDS(url("https://thomvolker.github.io/OSWS_Synthetic/data/heart_failure.RDS"))
Practical 1: Creating synthetic data
Fake it ’till you make it: Generating synthetic data with high utility in R
Introduction
In this workshop, you will learn how to create and evaluate synthetic data in R. In the practical, we will work with the R package synthpop (Nowok, Raab, and Dibben 2016). The synthpop package is a powerful tool explicitly designed to generate synthetic data. Alternatives for creating synthetic data include the R package mice (van Buuren and Groothuis-Oudshoorn 2011) and the stand-alone software IVEware (“IVEware: Imputation and Variance Estimation Software,” n.d.).
If you have R and R Studio installed on your device, you can follow all the steps from this practical using your local version of R Studio. In case you do not have an installation of R and R Studio, you can quickly create an account on R Studio Cloud, and work with a project that is set up for this workshop (the link will follow). Note that you have the opportunity to work with your own data (you can also use data provided by us). If you are going to work via R Studio Cloud, you may not want to upload your own data to this server. In this case, you can still decide to work with the data provided by us. You could also install R and R Studio on the spot, but since we do not have infinite time, we advise using R Studio Cloud if you do not already have R and R Studio on your device.
Data
For this workshop, we have prepared all exercises with the Heart failure clinical records data set. However, you may also choose to work with a data set of your own liking. All steps, exercises and solutions that we outline here should be applicable to another data set as well, but some data processing might be required before our example code works as it should. In the worst case, you might run into errors that we could not foresee, but we are more than happy to think along and help you solve these issues.
Heart failure clinical records
The Heart failure clinical records data set is a medical data set from the UCI Machine Learning Repository (click here for the source), originally collected by Ahmad et al. (2017) from the Government College University, Faisalabad, Pakistan, and adapted and uploaded to the UCI MLR by Chicco and Jurman (2020). This data set contains medical information of \(299\) individuals on \(13\) variables, and is typically used to predict whether or not a patient will survive during the follow-up period, using several biomedical predictors.
If you decide to work with the Heart failure clinical records data and work in R Studio Cloud, you can access the environment related to this workshop here, including the scripts P1.R and P2.R that get you started on importing the data, and installing and loading the required packages. You can continue working in these scripts. Make sure to save the project on your account, so that your changes are not deleted if you, for some reason, have to refresh the browser.
If you have R Studio installed on your own machine, you can download the cleaned version of the Heart failure clinical records data set from my GitHub and load it as heart_failure, by running the line of code shown at the top of this practical.
The Heart failure clinical records data consists of the following variables:
- age: Age in years
- anaemia: Whether the patient has a decrease of red blood cells (No/Yes)
- hypertension: Whether the patient has high blood pressure (No/Yes)
- creatinine_phosphokinase: Level of the creatinine phosphokinase enzyme in the blood (mcg/L)
- diabetes: Whether the patient has diabetes (No/Yes)
- ejection_fraction: Percentage of blood leaving the heart at each contraction
- platelets: Platelets in the blood (kiloplatelets/mL)
- sex: Sex (Female/Male)
- serum_creatinine: Level of serum creatinine in the blood (mg/dL)
- serum_sodium: Level of serum sodium in the blood (mEq/L)
- smoking: Whether the patient smokes (No/Yes)
- follow_up: Follow-up period (days)
- deceased: Whether the patient deceased during the follow-up period
Loading your own data
In case you brought your own data, you can load it into R using a function that matches your data format. Below, you can find several functions that might be helpful if you want to load your data into R. You can use these functions both locally and on R Studio Cloud, but make sure to install the required package first.
Programme | Format | Command |
---|---|---|
Excel | .xlsx | readxl::read_xlsx("path_to_file/data_name.xlsx") |
Excel | .csv | readr::read_csv("path_to_file/data_name.csv") |
SPSS | .sav | haven::read_sav("path_to_file/data_name.sav") |
Stata | .dta | haven::read_dta("path_to_file/data_name.dta") |
After loading in your own data, make sure that the variables in your data are coded correctly (this can go wrong when transferring between data types). That is, make sure that your categorical variables are coded as factors and your numeric variables as numeric variables. To do so, you can make use of the following code. Note, however, that this is not a workshop on data wrangling: if importing your data into R creates a mess, it might be better to use the Heart failure clinical records data, so that you can spend your valuable time on creating synthetic data.
data_name$variable <- as.numeric(data_name$variable)
data_name$variable2 <- factor(data_name$variable2,
                              levels = values,       # values of the data
                              labels = value_labels) # labels of these values
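As a minimal sketch of these conversions, assume a hypothetical data frame data_name in which a numeric variable was imported as text and a categorical variable as 0/1 codes. The values and labels below are made up for illustration only.

```r
# Hypothetical imported data: numbers stored as text, categories as 0/1 codes
data_name <- data.frame(
  variable  = c("63.5", "71", "58.2"),
  variable2 = c(0, 1, 1),
  stringsAsFactors = FALSE
)

# Text to numeric
data_name$variable <- as.numeric(data_name$variable)

# Codes to a labelled factor
data_name$variable2 <- factor(data_name$variable2,
                              levels = c(0, 1),        # values in the data
                              labels = c("No", "Yes")) # labels of these values

str(data_name) # check the resulting variable types
```

Running str() after the conversion is a quick way to confirm that every variable has the type you expect before synthesizing.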
If your data has the correct format, we can proceed to the next steps. Given that you are using your own data, we assume that you have (at least some) knowledge about the variables in your data. We will therefore skip the steps to obtain some descriptive information of the variables in your data, and continue to creating and evaluating synthetic data.
In the sequel, we will outline how to create and evaluate synthetic data using the Heart failure clinical records data, but most of these steps should be directly applicable to your own data. In case something gives an error, do not hesitate to ask how the problem can be solved!
Loading required packages
In this workshop, you will (at least) use the packages synthpop (Nowok, Raab, and Dibben 2016), ggplot2 (Wickham 2016), patchwork (Pedersen 2022), psych (Revelle 2022) and purrr (Henry and Wickham 2022). Make sure to load them (in case you haven’t installed them already, install them first, using install.packages("package.name")).
1. Install the R packages synthpop, ggplot2, patchwork, psych, tidyverse and purrr from CRAN, and load these packages using library().
install.packages("synthpop")
install.packages("ggplot2")
install.packages("patchwork")
install.packages("psych")
install.packages("purrr")
install.packages("tidyverse")
library(synthpop) # to create synthetic data and assess its utility
library(ggplot2) # to visualize the data
library(patchwork) # to stitch multiple figures together
library(psych) # to obtain descriptive statistics
library(purrr) # to work with multiply imputed synthetic datasets
library(tidyverse) # to do some data wrangling
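If some of these packages are already installed, reinstalling all of them is unnecessary. The sketch below uses a hypothetical helper, install_if_missing() (not part of any package), that installs only the packages R cannot find. It assumes a CRAN mirror is configured, which is the case in a default R Studio session.

```r
# Hypothetical helper: install only those packages that are not yet available
install_if_missing <- function(pkgs) {
  # requireNamespace() returns FALSE when a package cannot be loaded
  available  <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  to_install <- pkgs[!available]
  if (length(to_install) > 0) {
    install.packages(to_install) # requires an internet connection
  }
  invisible(to_install)          # return what was missing, without printing
}
```

Calling install_if_missing(c("synthpop", "ggplot2", "patchwork", "psych", "purrr", "tidyverse")) once before the library() calls would then leave already installed packages untouched.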
Getting to know the data
Before starting to work with any data, you must always get a basic understanding of what the data looks like. If you know what types your variables are and what the relationships between variables are supposed to be, it’s (somewhat) easier to spot any errors you make in coding.
2. Inspect the first few rows of the data using head().
head(heart_failure)
3. Use the summary() or describe() function to get a higher-level overview of the data.
summary()
summary(heart_failure)
describe()
describe(heart_failure)
The summary() function gives a basic description of the variables, whereas describe() also gives some information on the standard deviation, skewness and kurtosis.
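To illustrate what such additional statistics convey, the sketch below computes the standard deviation and a moment-based skewness by hand on made-up data. The skewness() helper is hypothetical, and the formula is one common definition that may differ slightly from the one psych uses.

```r
set.seed(123)

# Toy data (made up): one roughly symmetric and one right-skewed variable
toy <- data.frame(
  age       = rnorm(100, mean = 60, sd = 12),
  platelets = rexp(100, rate = 1 / 250)
)

# Moment-based skewness: third central moment over the cubed standard deviation
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^(3 / 2)
}

summary(toy)          # quartiles and means only
sapply(toy, sd)       # spread of each variable
sapply(toy, skewness) # asymmetry; expected to be positive for platelets
```

A strongly skewed variable is worth noting at this stage, because a normal linear model may reproduce its distribution poorly.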
Creating synthetic data
We will focus on two ways of creating synthetic data:
- Parametric methods
- Non-parametric methods.
Broadly speaking, two methods for creating synthetic data can be distinguished. The first one is based on parametric imputation models, which assume that the structure of the data is fixed, and draw synthetic values from a pre-specified probability distribution. That is, after estimating a statistical model, the synthetic data are generated from a probability distribution, without making any further use of the observed data. In general, this procedure is less likely to result in an accidental release of disclosive information. However, these parametric methods are often less capable of capturing the complex nature of real-world data sets.
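The parametric idea can be sketched in base R as follows. This is a simplified illustration on made-up data, not the implementation used by synthpop: the variable names, coefficients and the choice to keep the estimated parameters fixed are all assumptions.

```r
set.seed(1)

# Toy "observed" data (made up for illustration)
n <- 300
observed <- data.frame(age = rnorm(n, mean = 60, sd = 10))
observed$sodium   <- 140 - 0.05 * observed$age + rnorm(n, sd = 4)
observed$deceased <- rbinom(n, size = 1, prob = plogis(-4 + 0.06 * observed$age))

# Step 1: estimate the models on the observed data
fit_norm  <- lm(sodium ~ age, data = observed)
fit_logit <- glm(deceased ~ age, family = binomial, data = observed)

# Step 2: draw synthetic values from the fitted distributions only
synthetic <- data.frame(age = rnorm(n, mean(observed$age), sd(observed$age)))
synthetic$sodium <- rnorm(n,
                          mean = predict(fit_norm, newdata = synthetic),
                          sd   = sigma(fit_norm))
synthetic$deceased <- rbinom(n, size = 1,
                             prob = predict(fit_logit, newdata = synthetic,
                                            type = "response"))

# No observed sodium value is copied into the synthetic data
any(synthetic$sodium %in% observed$sodium)
```

Because every synthetic value is a fresh draw from a fitted distribution, no observed value is reused, which is the reason this approach tends to carry less disclosure risk.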
The subtleties of real-world data are often better reproduced with non-parametric imputation models. Using this approach, a non-parametric model is estimated, resulting in a donor pool out of which a single value is drawn for each observation and each variable. These models thus reuse the observed data to serve as synthetic data. Accordingly, many of the values that were in the observed data end up in the synthetic data. However, because these observed values are generally combined in unique ways, it is generally not possible to link this information to the original respondents. The non-parametric procedures often yield better inferences, while still being able to limit disclosure risk (although more research into measures to quantify the remaining risks is required). Therefore, this practical will showcase how to generate synthetic data using one such non-parametric method: classification and regression trees (CART; Breiman et al. 1984).
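The donor-pool idea can be sketched with the rpart package, which ships with standard R installations. This is a simplified illustration on made-up data and not synthpop's own code: each leaf of the fitted tree acts as a donor pool, and every synthetic record receives one observed value drawn from the leaf it falls into.

```r
library(rpart)
set.seed(1)

# Toy observed data (made up): y depends on x in a non-linear way
n <- 200
observed   <- data.frame(x = rnorm(n))
observed$y <- ifelse(observed$x > 0, 2, -2) + rnorm(n, sd = 0.5)

# Fit a regression tree of y on x
fit <- rpart(y ~ x, data = observed, method = "anova")

# Node number of the leaf that each observed record falls into
obs_leaf <- as.numeric(row.names(fit$frame))[fit$where]

# Synthetic predictor values (here simply resampled from the observed x)
synthetic <- data.frame(x = sample(observed$x, n, replace = TRUE))

# Let predict() return node numbers instead of leaf means
fit_id <- fit
fit_id$frame$yval <- as.numeric(row.names(fit_id$frame))
syn_leaf <- unname(predict(fit_id, newdata = synthetic))

# Draw one donor value per synthetic record from the matching leaf
synthetic$y <- vapply(syn_leaf, function(leaf) {
  donors <- observed$y[obs_leaf == leaf]
  donors[sample.int(length(donors), 1)]
}, numeric(1))

all(synthetic$y %in% observed$y) # every synthetic y is a reused observed value
```

The final check makes the trade-off concrete: the synthetic values follow the observed distribution closely precisely because they are observed values.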
We will use both approaches to generate synthetic data. For the parametric methods, this implies that all variables are synthesized through linear and logistic conditional models. For the non-parametric methods, we will synthesize all data using classification and regression trees (CART; Breiman et al. 1984).
The synthpop algorithm is based on strategies that are used for imputing missing data, and is inspired by the mice algorithm. In general, synthpop proceeds as follows: from the first to the last column in your data set, the given variable is synthesized based on all previously synthesized variables in the data. Specifically, a model is trained on the observed data, and new values for variable \(X_j\) are imputed on the basis of the variables \(X_{1:(j-1)}\). This procedure is repeated sequentially, until all variables are synthesized. In this way, the relationships between the variables are generally preserved.
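The sequential procedure described above can be sketched in base R. This is a minimal illustration on made-up numeric data, assuming linear models for every step and resampling for the first column; it is not synthpop's implementation.

```r
set.seed(1)

# Toy observed data with three related numeric columns (made up)
n  <- 250
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)
x3 <- 0.5 * x1 - 0.4 * x2 + rnorm(n, sd = 0.5)
observed <- data.frame(x1, x2, x3)

# Empty container for the synthetic data
synthetic <- data.frame(matrix(NA_real_, nrow = n, ncol = ncol(observed)))
names(synthetic) <- names(observed)

for (j in seq_along(observed)) {
  target <- names(observed)[j]
  if (j == 1) {
    # No predictors yet: resample the first column from its observed values
    synthetic[[target]] <- sample(observed[[target]], n, replace = TRUE)
  } else {
    # Train the model for X_j on X_1, ..., X_(j-1) in the observed data
    predictors <- names(observed)[seq_len(j - 1)]
    fit <- lm(reformulate(predictors, response = target), data = observed)
    # Generate X_j from the already synthesized X_1, ..., X_(j-1)
    synthetic[[target]] <- rnorm(n,
                                 mean = predict(fit, newdata = synthetic),
                                 sd   = sigma(fit))
  }
}

round(cor(observed), 2)  # correlations in the observed data
round(cor(synthetic), 2) # should be similar in the synthetic data
```

Note that the models are always fitted on the observed data, while the predictions use the synthetic predictors; this is what carries the relationships between variables over to the synthetic data.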
We will use synthpop to generate synthetic data using a parametric and a non-parametric synthesis strategy.
4. Use syn() to create a synthetic data set in an object called syn_param using method = "parametric", and set the argument default.method = c("norm", "logreg", "polyreg", "polr").
Hint: Use seed = 1 if you want to reproduce our results.
syn_param <- syn(heart_failure,
                 method = "parametric",
                 default.method = c("norm", "logreg", "polyreg", "polr"),
                 seed = 1)
5. Inspect the syn_param object. What do you see?
syn_param
Calling syn_param shows you some important features of the synthesis procedure. First, it shows the number of synthetic data sets that were generated (syn_param$m). Also, it shows for every variable the method that was used to synthesize the data (syn_param$method). If you want to know more about a specific synthesis method, for example, logreg, you can call ?syn.logreg to get more information.
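The same elements can be extracted from any object returned by syn(). The sketch below uses a small made-up data frame in place of heart_failure so that it runs on its own; the variable names and values are assumptions for illustration.

```r
library(synthpop)

set.seed(1)
# Toy data (made up); stands in for heart_failure
toy <- data.frame(
  age     = round(rnorm(150, mean = 60, sd = 10)),
  smoking = factor(sample(c("No", "Yes"), 150, replace = TRUE)),
  sodium  = rnorm(150, mean = 137, sd = 4)
)

# print.flag = FALSE suppresses the progress messages
syn_toy <- syn(toy, method = "parametric", seed = 1, print.flag = FALSE)

syn_toy$m         # number of synthetic data sets
syn_toy$method    # synthesis method used for every variable
head(syn_toy$syn) # the synthetic data itself (a data frame when m = 1)
```

The synthetic data set is stored in the syn element, so this is the part you would pass on to further analyses.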
6. Use syn() to create a synthetic data set in an object called syn_nonparam using method = "cart".
The CART method samples observations from the original data and uses these to construct the synthetic observations. In practice, this might yield an unacceptably large privacy risk (also depending on the sensitivity of your data). This risk can be reduced by smoothing, for example, through the smoothing argument of the syn() function. For the sake of exposition, we don’t use smoothing in this practical, but in practical situations it is recommended to use smoothing.
Hint: Use seed = 1 if you want to reproduce our results.
syn_nonparam <- syn(heart_failure, method = "cart", seed = 1)
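To see why smoothing reduces the risk mentioned above, consider the following base R sketch. It illustrates the general idea only and is not synthpop's implementation: the data are made up, and perturbing each sampled value with Gaussian noise whose standard deviation equals the bw.nrd0() bandwidth is an assumption chosen for illustration.

```r
set.seed(1)

# Toy observed variable (made up), for example an enzyme level
observed_values <- rexp(200, rate = 1 / 50)

# Without smoothing: synthetic values are sampled observed values
donors <- sample(observed_values, 200, replace = TRUE)

# With smoothing: perturb every sampled value with kernel-type noise
bw <- bw.nrd0(observed_values) # rule-of-thumb kernel bandwidth
smoothed <- donors + rnorm(length(donors), mean = 0, sd = bw)

mean(donors %in% observed_values)   # share of reused observed values, unsmoothed
mean(smoothed %in% observed_values) # share of reused observed values, smoothed
```

After smoothing, the synthetic values no longer coincide with observed values, at the cost of a slightly wider distribution. Unbounded noise can also produce impossible values such as negative levels, so any real application needs to handle the variable's valid range.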
Creating the synthetic data is a piece of cake. However, after creating the synthetic data, we must assess its quality in terms of data utility and disclosure risk. This is what we will do in Practical 2.
END OF PRACTICAL 1
Session Info
sessionInfo()
R version 4.4.1 (2024-06-14 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 10 x64 (build 19045)
Matrix products: default
locale:
[1] LC_COLLATE=Dutch_Netherlands.utf8 LC_CTYPE=Dutch_Netherlands.utf8
[3] LC_MONETARY=Dutch_Netherlands.utf8 LC_NUMERIC=C
[5] LC_TIME=Dutch_Netherlands.utf8
time zone: Europe/Amsterdam
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4
[5] readr_2.1.5 tidyr_1.3.0 tibble_3.2.1 tidyverse_2.0.0
[9] purrr_1.0.2 psych_2.3.9 patchwork_1.2.0 ggplot2_3.5.1
[13] synthpop_1.8-0
loaded via a namespace (and not attached):
[1] tidyselect_1.2.1 viridisLite_0.4.2 libcoin_1.0-10
[4] fastmap_1.2.0 TH.data_1.1-2 digest_0.6.33
[7] rpart_4.1.23 timechange_0.2.0 lifecycle_1.0.4
[10] survival_3.6-4 Rsolnp_1.16 magrittr_2.0.3
[13] compiler_4.4.1 rlang_1.1.2 tools_4.4.1
[16] utf8_1.2.4 yaml_2.3.10 knitr_1.45
[19] htmlwidgets_1.6.4 classInt_0.4-10 mnormt_2.1.1
[22] plyr_1.8.9 xml2_1.3.6 multcomp_1.4-25
[25] KernSmooth_2.23-24 polspline_1.1.24 party_1.3-14
[28] foreign_0.8-86 withr_3.0.1 numDeriv_2016.8-1.1
[31] nnet_7.3-19 grid_4.4.1 mipfp_3.2.1
[34] stats4_4.4.1 fansi_1.0.6 broman_0.80
[37] e1071_1.7-14 colorspace_2.1-1 scales_1.3.0
[40] MASS_7.3-60.2 cli_3.6.2 mvtnorm_1.2-4
[43] rmarkdown_2.25 generics_0.1.3 rstudioapi_0.16.0
[46] tzdb_0.4.0 proxy_0.4-27 modeltools_0.2-23
[49] splines_4.4.1 parallel_4.4.1 matrixStats_1.2.0
[52] vctrs_0.6.5 Matrix_1.7-0 sandwich_3.1-0
[55] jsonlite_1.8.9 hms_1.1.3 rmutil_1.1.10
[58] cmm_1.0 systemfonts_1.0.5 proto_1.0.0
[61] glue_1.6.2 codetools_0.2-20 stringi_1.8.3
[64] strucchange_1.5-3 gtable_0.3.5 munsell_0.5.1
[67] pillar_1.9.0 htmltools_0.5.7 randomForest_4.7-1.1
[70] truncnorm_1.0-9 R6_2.5.1 evaluate_0.23
[73] kableExtra_1.4.0 lattice_0.22-6 highr_0.10
[76] class_7.3-22 Rcpp_1.0.12 nlme_3.1-164
[79] svglite_2.1.3 ranger_0.16.0 coin_1.4-3
[82] xfun_0.41 zoo_1.8-12 pkgconfig_2.0.3