Practical 1: Creating synthetic data

Fake it ’till you make it: Generating synthetic data with high utility in R

Author

Thom Volker & Erik-Jan van Kesteren

Introduction


In this workshop, you will learn how to create and evaluate synthetic data in R. In the practical, we will work with the R package synthpop (Nowok, Raab, and Dibben 2016). The synthpop package is a powerful tool explicitly designed to generate synthetic data. Other alternatives for creating synthetic data are, for example, the R package mice (van Buuren and Groothuis-Oudshoorn 2011) or the stand-alone software IVEware (“IVEware: Imputation and Variance Estimation Software,” n.d.).

If you have R and RStudio installed on your device, you can follow all the steps in this practical using your local version of RStudio. If you do not have R and RStudio installed, you can quickly create an account on RStudio Cloud and work with a project that is set up for this workshop (the link will follow). Note that you have the opportunity to work with your own data (you can also use the data provided by us). If you are going to work via RStudio Cloud, you may not want to upload your own data to this server; in that case, you can still decide to work with the data provided by us. You could also install R and RStudio on the spot, but since our time is limited, we advise using RStudio Cloud if you do not already have access to R and RStudio on your device.


Data


For this workshop, we have prepared all exercises with the Heart failure clinical records data set. However, you may also choose to work with a data set of your own choosing. All steps, exercises, and solutions that we outline here should be applicable to another data set as well, but some data processing might be required before our example code works as it should. In the worst case, you might run into errors that we could not foresee, but we are more than happy to think along and help you solve these issues.


Heart failure clinical records

The Heart failure clinical records data set is a medical data set from the UCI Machine Learning Repository (click here for the source), originally collected by Ahmad et al. (2017) from the Government College University, Faisalabad, Pakistan, and adapted and uploaded to the UCI MLR by Chicco and Jurman (2020). This data set contains medical information of \(299\) individuals on \(13\) variables, and is typically used to predict whether or not a patient will survive during the follow-up period, using several biomedical predictors.

If you decide to work with the Heart failure clinical records data and work in RStudio Cloud, you can access the environment related to this workshop here, including the scripts P1.R and P2.R that get you started on importing the data and installing and loading the required packages. You can continue working in this script. Make sure to save the project on your account, so that your changes are not deleted if you, for some reason, have to refresh the browser.

If you have RStudio installed on your own machine, you can download the cleaned version of the Heart failure clinical records data set from my GitHub and load it as heart_failure by running the following line of code.

heart_failure <- readRDS(url("https://thomvolker.github.io/OSWS_Synthetic/data/heart_failure.RDS"))

The Heart failure clinical records data consists of the following variables:

  • age: Age in years
  • anaemia: Whether the patient has a decrease of red blood cells (No/Yes)
  • hypertension: Whether the patient has high blood pressure (No/Yes)
  • creatinine_phosphokinase: Level of the creatinine phosphokinase enzyme in the blood (mcg/L)
  • diabetes: Whether the patient has diabetes (No/Yes)
  • ejection_fraction: Percentage of blood leaving the heart at each contraction
  • platelets: Platelets in the blood (kiloplatelets/mL)
  • sex: Sex (Female/Male)
  • serum_creatinine: Level of serum creatinine in the blood (mg/dL)
  • serum_sodium: Level of serum sodium in the blood (mEq/L)
  • smoking: Whether the patient smokes (No/Yes)
  • follow_up: Follow-up period (days)
  • deceased: Whether the patient died during the follow-up period (No/Yes)
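
If you work with the Heart failure clinical records data, you can quickly verify that the categorical variables are stored as factors and the remaining variables as numeric by inspecting the structure of the data with the base R function str():

str(heart_failure)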

Loading your own data

In case you brought your own data, you can load it into R using a function that matches your data format. Below, you can find several functions that might be helpful if you want to load your data into R. You can use these functions both locally and on RStudio Cloud, but make sure to install the required package first.

Programme   Format   Command
Excel       .xlsx    readxl::read_xlsx("path_to_file/data_name.xlsx")
Excel       .csv     readr::read_csv("path_to_file/data_name.csv")
SPSS        .sav     haven::read_sav("path_to_file/data_name.sav")
Stata       .dta     haven::read_dta("path_to_file/data_name.dta")

After loading your own data, make sure that the variables are coded appropriately (this can go wrong when converting between file formats). That is, make sure that your categorical variables are coded as factors and your numeric variables as numeric variables. To do so, you can make use of the following code. Note, however, that this is not a workshop on data wrangling: if importing your data into R creates a mess, it might be better to use the Heart failure clinical records data, so that you can spend your valuable time on creating synthetic data.

data_name$variable  <- as.numeric(data_name$variable)
data_name$variable2 <- factor(data_name$variable2,
                              levels = values,       # values of the data
                              labels = value_labels) # labels of these values
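
If several variables in your data need the same treatment, you can also recode them in one go. The snippet below is a sketch using base R, in which the variable names, values, and labels are placeholders to be replaced by those of your own data:

# recode several 0/1 variables to No/Yes factors at once (placeholder names)
yes_no_vars <- c("variable2", "variable3")
data_name[yes_no_vars] <- lapply(data_name[yes_no_vars],
                                 factor,
                                 levels = c(0, 1),        # values in the data
                                 labels = c("No", "Yes")) # labels for these values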

If your data has the correct format, we can proceed to the next steps. Given that you are using your own data, we assume that you have (at least some) knowledge about the variables in your data. We will therefore skip the steps to obtain some descriptive information of the variables in your data, and continue to creating and evaluating synthetic data.

In what follows, we will outline how to create and evaluate synthetic data using the Heart failure clinical records data, but most of these steps are directly applicable to your own data as well. In case something gives an error, do not hesitate to ask how the problem can be solved!


Loading required packages


In this workshop, you will (at least) use the packages synthpop (Nowok, Raab, and Dibben 2016), ggplot2 (Wickham 2016), patchwork (Pedersen 2022), psych (Revelle 2022) and purrr (Henry and Wickham 2022). Make sure to load them (in case you haven’t installed them already, install them first, using install.packages("package.name")).


1. Install the R packages synthpop, ggplot2, patchwork, psych, tidyverse and purrr from CRAN, and load these packages using library().


install.packages("synthpop")
install.packages("ggplot2")
install.packages("patchwork")
install.packages("psych")
install.packages("purrr")
install.packages("tidyverse")
library(synthpop)  # to create synthetic data and assess its utility
library(ggplot2)   # for data visualization
library(patchwork) # to stitch multiple figures together
library(psych)     # to obtain descriptive statistics
library(purrr)     # to work with multiply imputed synthetic datasets
library(tidyverse) # to do some data wrangling

Getting to know the data


Before starting to work with any data, you must always get a basic understanding of what the data looks like. If you know what types your variables are and what the relationships between variables are supposed to be, it’s (somewhat) easier to spot any errors you make in coding.


2. Inspect the first few rows of the data using head().


head(heart_failure)
age anaemia creatinine_phosphokinase diabetes ejection_fraction platelets serum_creatinine serum_sodium sex smoking hypertension deceased follow_up
75 No 582 No 20 265000 1.9 130 Male No Yes Yes 4
55 No 7861 No 38 263358 1.1 136 Male No No Yes 6
65 No 146 No 20 162000 1.3 129 Male Yes No Yes 7
50 Yes 111 No 20 210000 1.9 137 Male No No Yes 7
65 Yes 160 Yes 20 327000 2.7 116 Female No No Yes 8
90 Yes 47 No 40 204000 2.1 132 Male Yes Yes Yes 8

3. Use the summary() or describe() function to get a higher-level overview of the data.


summary(heart_failure)
      age        anaemia   creatinine_phosphokinase diabetes  ejection_fraction
 Min.   :40.00   No :170   Min.   :  23.0           No :174   Min.   :14.00    
 1st Qu.:51.00   Yes:129   1st Qu.: 116.5           Yes:125   1st Qu.:30.00    
 Median :60.00             Median : 250.0                     Median :38.00    
 Mean   :60.83             Mean   : 581.8                     Mean   :38.08    
 3rd Qu.:70.00             3rd Qu.: 582.0                     3rd Qu.:45.00    
 Max.   :95.00             Max.   :7861.0                     Max.   :80.00    
   platelets      serum_creatinine  serum_sodium       sex      smoking  
 Min.   : 25100   Min.   :0.500    Min.   :113.0   Female:105   No :203  
 1st Qu.:212500   1st Qu.:0.900    1st Qu.:134.0   Male  :194   Yes: 96  
 Median :262000   Median :1.100    Median :137.0                         
 Mean   :263358   Mean   :1.394    Mean   :136.6                         
 3rd Qu.:303500   3rd Qu.:1.400    3rd Qu.:140.0                         
 Max.   :850000   Max.   :9.400    Max.   :148.0                         
 hypertension deceased    follow_up    
 No :194      No :203   Min.   :  4.0  
 Yes:105      Yes: 96   1st Qu.: 73.0  
                        Median :115.0  
                        Mean   :130.3  
                        3rd Qu.:203.0  
                        Max.   :285.0  
describe(heart_failure)
vars n mean sd median trimmed mad min max range skew kurtosis se
age 1 299 6.083389e+01 1.189481e+01 60.0 6.021715e+01 14.82600 40.0 95.0 55.0 0.4188266 -0.2204793 0.6878946
anaemia* 2 299 1.431438e+00 4.961073e-01 1.0 1.414938e+00 0.00000 1.0 2.0 1.0 0.2754750 -1.9305367 0.0286906
creatinine_phosphokinase 3 299 5.818395e+02 9.702879e+02 250.0 3.654938e+02 269.83320 23.0 7861.0 7838.0 4.4184296 24.5254138 56.1131970
diabetes* 4 299 1.418060e+00 4.940671e-01 1.0 1.398340e+00 0.00000 1.0 2.0 1.0 0.3305857 -1.8970241 0.0285726
ejection_fraction 5 299 3.808361e+01 1.183484e+01 38.0 3.742739e+01 11.86080 14.0 80.0 66.0 0.5498228 0.0005484 0.6844265
platelets 6 299 2.633580e+05 9.780424e+04 262000.0 2.567301e+05 65234.40000 25100.0 850000.0 824900.0 1.4476814 6.0252322 5656.1650591
serum_creatinine 7 299 1.393880e+00 1.034510e+00 1.1 1.189295e+00 0.29652 0.5 9.4 8.9 4.4113866 25.1888415 0.0598273
serum_sodium 8 299 1.366254e+02 4.412477e+00 137.0 1.368216e+02 4.44780 113.0 148.0 35.0 -1.0376430 3.9841899 0.2551802
sex* 9 299 1.648829e+00 4.781364e-01 2.0 1.684647e+00 0.00000 1.0 2.0 1.0 -0.6204576 -1.6204183 0.0276513
smoking* 10 299 1.321070e+00 4.676704e-01 1.0 1.278008e+00 0.00000 1.0 2.0 1.0 0.7626368 -1.4231112 0.0270461
hypertension* 11 299 1.351171e+00 4.781364e-01 1.0 1.315353e+00 0.00000 1.0 2.0 1.0 0.6204576 -1.6204183 0.0276513
deceased* 12 299 1.321070e+00 4.676704e-01 1.0 1.278008e+00 0.00000 1.0 2.0 1.0 0.7626368 -1.4231112 0.0270461
follow_up 13 299 1.302609e+02 7.761421e+01 115.0 1.292780e+02 105.26460 4.0 285.0 281.0 0.1265232 -1.2238150 4.4885455


The summary() function gives a basic description of the variables, whereas describe() also gives some information on the standard deviation, skewness and kurtosis.


Creating synthetic data

We will focus on two ways of creating synthetic data:

  • Parametric methods
  • Non-parametric methods

Broadly speaking, two methods for creating synthetic data can be distinguished. The first is based on parametric imputation models, which assume that the structure of the data is fixed and draw synthetic values from a pre-specified probability distribution. That is, after estimating a statistical model, the synthetic data are generated from a probability distribution, without making any further use of the observed data. In general, this procedure is less likely to result in an accidental release of disclosive information. However, these parametric methods are often less capable of capturing the complex nature of real-world data sets.
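
To give an idea of what such a parametric synthesis step looks like, the sketch below (our own simplification, not synthpop's actual code, and ignoring parameter uncertainty) fits a linear model for serum_creatinine on the observed data and then draws synthetic values from the implied normal distribution, without reusing the observed serum_creatinine values.

# simplified parametric synthesis of a single variable (not synthpop's code)
fit <- lm(serum_creatinine ~ age + sex + serum_sodium, data = heart_failure)

syn_serum_creatinine <- rnorm(n    = nrow(heart_failure),
                              mean = predict(fit),  # fitted conditional means
                              sd   = sigma(fit))    # residual standard deviation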

The subtleties of real-world data are often better reproduced with non-parametric imputation models. In this approach, a non-parametric model is estimated, resulting in a donor pool from which a single value is drawn for every observation and every variable. These models thus reuse the observed data to serve as synthetic data. Accordingly, many of the values that were in the observed data end up in the synthetic data. However, because these observed values are generally combined in unique ways, it is generally not possible to link this information to the original respondents. The non-parametric procedures often yield better inferences, while still being able to limit disclosure risk (although more research into measures to quantify the remaining risks is required). Therefore, this practical will showcase how to generate synthetic data using one such non-parametric method: classification and regression trees [CART; Breiman et al. (1984)].
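
The sketch below illustrates this donor-pool idea for a single variable, again in deliberately simplified form (synthpop fits the trees within its sequential algorithm and can additionally apply smoothing): every observation lands in a terminal node of the fitted tree, and a synthetic value is drawn from the observed values in that node.

# simplified CART-based synthesis of a single variable (not synthpop's code)
library(rpart)

fit  <- rpart(serum_creatinine ~ age + sex + serum_sodium,
              data = heart_failure, method = "anova")
leaf <- fit$where  # terminal node (the "donor pool") of every observed row

syn_serum_creatinine <- sapply(leaf, function(node) {
  donors <- heart_failure$serum_creatinine[leaf == node]
  donors[sample.int(length(donors), 1)]  # draw one donor value from that node
})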


We will use both approaches to generate synthetic data. For the parametric approach, this implies that all variables are synthesized through linear and logistic conditional models. For the non-parametric approach, we will synthesize all data using classification and regression trees (CART; Breiman et al. 1984).

The synthpop algorithm is based on strategies that are used for imputing missing data, and is inspired by the mice algorithm. In general, synthpop proceeds as follows: from the first to the last column in your data set, each variable is synthesized conditional on all previously synthesized variables. Specifically, a model for variable \(X_j\) is trained on the observed data, and new values for \(X_j\) are then generated on the basis of the (already synthesized) variables \(X_{1:(j-1)}\). This procedure is repeated sequentially until all variables are synthesized. In this way, the relationships between the variables are generally preserved.
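
As a minimal illustration of this sequential scheme (again a simplification of what synthpop actually does), the sketch below synthesizes the first two columns of the heart failure data: the first variable has no synthesized predecessors and is simply sampled from its observed values, while the second is generated from a model that is fitted on the observed data but evaluated at the already-synthesized values of the first.

# sequential synthesis of the first two variables, heavily simplified
n   <- nrow(heart_failure)
syn <- data.frame(row.names = seq_len(n))

# step 1: age has no synthesized predecessors, so sample from its observed values
syn$age <- sample(heart_failure$age, size = n, replace = TRUE)

# step 2: model anaemia given age on the observed data ...
fit <- glm(anaemia ~ age, data = heart_failure, family = binomial)

# ... and generate synthetic anaemia values from the synthetic ages
p <- predict(fit, newdata = syn, type = "response")
syn$anaemia <- factor(ifelse(rbinom(n, size = 1, prob = p) == 1, "Yes", "No"),
                      levels = levels(heart_failure$anaemia))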


We will use synthpop to generate synthetic data using a parametric and a non-parametric synthesis strategy.


4. Use syn() from the synthpop package to create a synthetic data set in an object called syn_param using method = "parametric", and set the argument default.method = c("norm", "logreg", "polyreg", "polr").


Hint: Use seed = 1 if you want to reproduce our results.

syn_param <- syn(heart_failure, 
                 method = "parametric", 
                 default.method = c("norm", "logreg", "polyreg", "polr"),
                 seed = 1)

5. Inspect the syn_param object. What do you see?


syn_param

Calling syn_param shows you some important features of the synthesis procedure. First, it shows the number of synthetic data sets that were generated (syn_param$m). Also, it shows for every variable the method that was used to synthesize the data (syn_param$method). If you want to know more about a specific synthesis method, for example, logreg, you can call ?syn.logreg to get more information.
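
For example, you can inspect these elements directly. The synthetic data itself is stored in the syn element of the object (when m = 1, this is a single data frame; see ?syn for the full list of components):

syn_param$m          # number of synthetic data sets
syn_param$method     # synthesis method used for each variable
head(syn_param$syn)  # first rows of the synthetic data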


6. Use syn() to create a synthetic data set in an object called syn_nonparam using method = "cart".

Important

The CART method samples observations from the original data and uses these to construct the synthetic observations. In practice, this might yield an unacceptably high privacy risk (depending also on the sensitivity of your data). This risk can be reduced by applying smoothing, for example via the smoothing argument in the syn() function. For the sake of exposition, we do not use smoothing in this practical, but in practical applications it is recommended (see the example after the solution below).


Hint: Use seed = 1 if you want to reproduce our results.

syn_nonparam <- syn(heart_failure, method = "cart", seed = 1)
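
If you do want to apply the smoothing mentioned in the note above, it can be requested through the smoothing argument of syn(). The call below is one possible specification that applies kernel density smoothing to two continuous variables; check ?syn for the exact options available in your version of synthpop.

syn_smooth <- syn(heart_failure,
                  method = "cart",
                  smoothing = list(serum_creatinine = "density",
                                   platelets        = "density"),
                  seed = 1)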

Creating the synthetic data is a piece of cake. However, after creating it, we must assess its quality in terms of data utility and disclosure risk. This is what we will do in Practical 2.


END OF PRACTICAL 1


Session Info

sessionInfo()
R version 4.4.1 (2024-06-14 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 10 x64 (build 19045)

Matrix products: default


locale:
[1] LC_COLLATE=Dutch_Netherlands.utf8  LC_CTYPE=Dutch_Netherlands.utf8   
[3] LC_MONETARY=Dutch_Netherlands.utf8 LC_NUMERIC=C                      
[5] LC_TIME=Dutch_Netherlands.utf8    

time zone: Europe/Amsterdam
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] lubridate_1.9.3 forcats_1.0.0   stringr_1.5.1   dplyr_1.1.4    
 [5] readr_2.1.5     tidyr_1.3.0     tibble_3.2.1    tidyverse_2.0.0
 [9] purrr_1.0.2     psych_2.3.9     patchwork_1.2.0 ggplot2_3.5.1  
[13] synthpop_1.8-0 

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.1     viridisLite_0.4.2    libcoin_1.0-10      
 [4] fastmap_1.2.0        TH.data_1.1-2        digest_0.6.33       
 [7] rpart_4.1.23         timechange_0.2.0     lifecycle_1.0.4     
[10] survival_3.6-4       Rsolnp_1.16          magrittr_2.0.3      
[13] compiler_4.4.1       rlang_1.1.2          tools_4.4.1         
[16] utf8_1.2.4           yaml_2.3.10          knitr_1.45          
[19] htmlwidgets_1.6.4    classInt_0.4-10      mnormt_2.1.1        
[22] plyr_1.8.9           xml2_1.3.6           multcomp_1.4-25     
[25] KernSmooth_2.23-24   polspline_1.1.24     party_1.3-14        
[28] foreign_0.8-86       withr_3.0.1          numDeriv_2016.8-1.1 
[31] nnet_7.3-19          grid_4.4.1           mipfp_3.2.1         
[34] stats4_4.4.1         fansi_1.0.6          broman_0.80         
[37] e1071_1.7-14         colorspace_2.1-1     scales_1.3.0        
[40] MASS_7.3-60.2        cli_3.6.2            mvtnorm_1.2-4       
[43] rmarkdown_2.25       generics_0.1.3       rstudioapi_0.16.0   
[46] tzdb_0.4.0           proxy_0.4-27         modeltools_0.2-23   
[49] splines_4.4.1        parallel_4.4.1       matrixStats_1.2.0   
[52] vctrs_0.6.5          Matrix_1.7-0         sandwich_3.1-0      
[55] jsonlite_1.8.9       hms_1.1.3            rmutil_1.1.10       
[58] cmm_1.0              systemfonts_1.0.5    proto_1.0.0         
[61] glue_1.6.2           codetools_0.2-20     stringi_1.8.3       
[64] strucchange_1.5-3    gtable_0.3.5         munsell_0.5.1       
[67] pillar_1.9.0         htmltools_0.5.7      randomForest_4.7-1.1
[70] truncnorm_1.0-9      R6_2.5.1             evaluate_0.23       
[73] kableExtra_1.4.0     lattice_0.22-6       highr_0.10          
[76] class_7.3-22         Rcpp_1.0.12          nlme_3.1-164        
[79] svglite_2.1.3        ranger_0.16.0        coin_1.4-3          
[82] xfun_0.41            zoo_1.8-12           pkgconfig_2.0.3     

References

Ahmad, Tanvir, Assia Munir, Sajjad Haider Bhatti, Muhammad Aftab, and Muhammad Ali Raza. 2017. “Survival Analysis of Heart Failure Patients: A Case Study.” PLOS ONE 12 (7): 1–8. https://doi.org/10.1371/journal.pone.0181001.
Breiman, Leo, Jerome Friedman, Charles J Stone, and Richard A Olshen. 1984. Classification and Regression Trees. New York: CRC press. https://doi.org/10.1201/9781315139470.
Chicco, Davide, and Giuseppe Jurman. 2020. “Machine Learning Can Predict Survival of Patients with Heart Failure from Serum Creatinine and Ejection Fraction Alone.” BMC Medical Informatics and Decision Making 20 (1): 16. https://doi.org/10.1186/s12911-020-1023-5.
Henry, Lionel, and Hadley Wickham. 2022. purrr: Functional Programming Tools. https://CRAN.R-project.org/package=purrr.
“IVEware: Imputation and Variance Estimation Software.” n.d. https://www.src.isr.umich.edu/wp-content/uploads/iveware-manual-Version-0.3.pdf.
Nowok, Beata, Gillian M. Raab, and Chris Dibben. 2016. “synthpop: Bespoke Creation of Synthetic Data in R.” Journal of Statistical Software 74 (11): 1–26. https://doi.org/10.18637/jss.v074.i11.
Pedersen, Thomas Lin. 2022. patchwork: The Composer of Plots. https://CRAN.R-project.org/package=patchwork.
Revelle, William. 2022. psych: Procedures for Psychological, Psychometric, and Personality Research. Evanston, Illinois: Northwestern University. https://CRAN.R-project.org/package=psych.
van Buuren, Stef, and Karin Groothuis-Oudshoorn. 2011. “mice: Multivariate Imputation by Chained Equations in R.” Journal of Statistical Software 45 (3): 1–67. https://doi.org/10.18637/jss.v045.i03.
Wickham, Hadley. 2016. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org.