Statistical disclosure control for medical output

Utrecht University & Statistics Netherlands

Data privacy matters

About me

Thom Benjamin Volker

  • Utrecht University & Statistics Netherlands
  • PhD candidate in Methodology and Statistics


Research interests: methods to enhance data privacy, synthetic data and multiple imputation of missing data.

Okay, sure, data privacy matters

But are the risks large enough to justify our data protection efforts?

Disclosure limitation gone wrong

  • Sweeney (1997): Linked anonymized medical discharge data with voter registration records
    • Anonymized \(\neq\) unidentifiable
  • Narayanan & Shmatikov (2007): How to break anonymity of the Netflix Prize dataset
    • Linking ratings and timestamps to IMDb

Anonymized data can be identifying!

Even aggregated data can be revealing

  • Published tables (e.g., census counts by age, sex, ethnicity) may seem harmless

  • Reconstruction attacks can use these tables to infer plausible individual-level data (Dick et al., 2022).

    • Potentially recovering individual records exactly (see the sketch below)

But: it is not clear how severe this disclosure is.
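
A minimal illustration of the idea (not the method of Dick et al., 2022; the numbers and the Python snippet below are invented for this sketch): enumerate every person-level dataset that is consistent with a few published aggregates.

from itertools import combinations_with_replacement, product

# Toy reconstruction: published aggregates for a block of 3 people report
# 2 males, 1 female, and a mean age of exactly 30 (all values illustrative).
# Enumerate every person-level dataset that is consistent with these tables.
ages = range(20, 41)

consistent = [
    records
    for records in combinations_with_replacement(product("MF", ages), 3)
    if sum(sex == "M" for sex, _ in records) == 2
    and sum(age for _, age in records) == 3 * 30
]

print(len(consistent), "candidate datasets; the fewer, the more revealing")
print(consistent[0])

With tight enough margins, only a handful of candidate datasets remain, sometimes exactly one.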

Two types of disclosure

  • Identity disclosure
    • Based on published statistics, we can identify individuals in the data
  • Attribute disclosure
    • Without necessarily identifying individuals, we can learn something about them from the published data that we could not have learned otherwise

Disclosure limitation strategies

Suppression

Original:
         A   B   C
Male     4  19   3
Female  12   0   1

Suppressed:
         A   B   C
Male     4  19   3
Female  12  NA  NA

\(n_F = 13 \implies n_{F,B} + n_{F,C} = 1\)

Suppose disease \(B\) is prostate cancer \(\implies n_{F,B} = 0 \implies n_{F,C} = 1\)

“Optimal” cell suppression can be performed with \(\tau\)-ARGUS
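
A minimal sketch of threshold-based suppression, assuming a frequency threshold of 3 and a rule that also hides zero cells (both choices are illustrative; protecting hidden cells against recovery from the margins is exactly the secondary-suppression problem that \(\tau\)-ARGUS optimizes):

import numpy as np

def suppress_small_cells(table, threshold=3):
    """Hide every cell with a count below `threshold` (primary suppression).

    In practice, secondary suppression is also needed so that hidden cells
    cannot be recovered from row and column totals.
    """
    table = np.asarray(table, dtype=float)
    protected = table.copy()
    protected[table < threshold] = np.nan   # NaN marks a suppressed cell
    return protected

# Counts from the slide: rows Male/Female, columns diseases A/B/C
counts = [[4, 19, 3],
          [12, 0, 1]]
print(suppress_small_cells(counts))        # [[ 4. 19.  3.] [12. nan nan]]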

Rounding

Original:
         A   B   C
Male     4  19   3
Female  12   0   1

Rounded to base 5:
         A   B   C
Male     5  20   5
Female  10   0   0

  • Adjust all cells in the table to a multiple of a specified base (here, base 5).

  • Coarsens information

  • Typically not recommended: other adjustment methods may provide better disclosure protection
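
A minimal sketch of deterministic rounding to base 5 (the base matches the example above; random rounding is a common variant):

def round_to_base(table, base=5):
    """Round every cell to the nearest multiple of `base`.

    Note: Python's round() uses banker's rounding for exact ties
    (e.g., 2.5 rounds to 2); random rounding is a common alternative.
    """
    return [[base * round(cell / base) for cell in row] for row in table]

counts = [[4, 19, 3],
          [12, 0, 1]]
print(round_to_base(counts))  # [[5, 20, 5], [10, 0, 0]]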

Noise addition - cell key method

Original:
         A   B   C
Male     4  19   3
Female  12   0   1

Perturbed:
         A   B   C
Male     3  19   2
Female  12   1   2

  • Random perturbation is added to every cell in the table

  • Additivity is broken: cells no longer sum to the published margins

  • Also implemented in \(\tau\)-ARGUS
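
A simplified sketch of the cell key idea (the uniform noise rule, the seed, and the helper below are illustrative; production implementations draw the perturbation from a calibrated probability table): every record carries a persistent random key, and the noise for a cell is derived deterministically from the keys of the records it contains, so the same cell is perturbed the same way in every table it appears in.

import numpy as np

rng = np.random.default_rng(1)

# Every record in the microdata gets a persistent random 'record key'
# (39 records in total, as in the slide's table)
record_keys = rng.random(39)

def perturb_cell(record_ids, max_noise=2):
    """Simplified cell key method: the 'cell key' is the fractional part of
    the sum of the record keys of the records in the cell, and the noise is
    derived deterministically from it. Real implementations (e.g., in
    tau-ARGUS) use a calibrated probability table instead of this uniform rule.
    """
    count = len(record_ids)
    cell_key = record_keys[record_ids].sum() % 1              # in [0, 1)
    noise = int(cell_key * (2 * max_noise + 1)) - max_noise   # in {-2, ..., +2}
    return max(count + noise, 0)                              # keep counts >= 0

# e.g., the 'Male, disease A' cell contains records 0-3
print(perturb_cell(np.array([0, 1, 2, 3])))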

Differential privacy

Ex ante guarantee: whatever can be learned about any individual in this data is bounded by \(\epsilon\)

  • Neighboring datasets: \(X\) and \(X'\) differ in a single row.

  • Sensitivity: \(\Delta_f = \max_{X, X'} |f(X) - f(X')|\)

Differential privacy: for all neighboring \(X, X'\) and all possible outputs \(a\) of the randomized mechanism \(\tilde f\)

\(P[\tilde f(X) = a] \leq \exp\{\epsilon\} P[\tilde f(X') = a]\)

Differential privacy - an example

Counting the number of males

  • One additional observation: at most one additional male
    • Sensitivity: \(\Delta_f = 1\)
    • \(\epsilon = 1\) (user-defined)
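
One standard way to obtain this guarantee for counts is the Laplace mechanism (not named on the slide, but the canonical choice here): release the true count plus Laplace noise with scale \(\Delta_f / \epsilon\). A minimal sketch:

import numpy as np

rng = np.random.default_rng()

def dp_count(data, predicate, epsilon=1.0, sensitivity=1.0):
    """epsilon-differentially private count via the Laplace mechanism:
    the true count plus Laplace noise with scale sensitivity / epsilon."""
    true_count = sum(predicate(x) for x in data)
    return true_count + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Counting the number of males with sensitivity 1 and epsilon = 1
sample = ["M", "F", "M", "M", "F", "F", "M"]
print(dp_count(sample, lambda x: x == "M", epsilon=1.0))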

Privacy-utility trade-off

Thanks for your attention

Questions/remarks?

t.b.volker@uu.nl

thomvolker.github.io/sdc4vac4eu