Hands-on
12 September 2019
Thomas Van den Bossche
Introduction
Re-identification by linking
Name   Date of birth  Gender  Zip Code  Disease        |  Name    Date of birth  Gender  Zip code
Andre  01/02/1992     Male    53715     Heart Disease  |  Ronald  01/12/1997     Male    53715
Carol  19/09/1986     Female  53715     Hepatitis      |  Dan     29/03/1946     Male    53715
Dan    17/08/1977     Female  53715     Bronchitis     |  Carol   19/03/1986     Female  53715
Eve    10/03/1994     Female  53715     Broken arm     |  Eve     01/02/1992     Female  53715
Ellen  12/01/1994     Male    53715     Flu            |  Ellen   03/08/1991     Female  53715
…      …              …       …         …              |  …       …              …       …
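A minimal sketch of re-identification by linking, assuming a released table without names and a public register that shares the columns date of birth, gender and zip code. The records and the names `medical`, `public` and `link` are hypothetical illustrations, not the slide data.

```python
# Hypothetical records for illustration only (not the values from the slide).
medical = [  # released table: names removed, quasi-identifiers kept
    {"dob": "19/09/1986", "gender": "Female", "zip": "53715", "disease": "Hepatitis"},
    {"dob": "01/02/1992", "gender": "Male", "zip": "53715", "disease": "Heart Disease"},
]
public = [  # public register that still carries names
    {"name": "Carol", "dob": "19/09/1986", "gender": "Female", "zip": "53715"},
    {"name": "Ronald", "dob": "01/12/1997", "gender": "Male", "zip": "53715"},
]

QUASI_IDENTIFIERS = ("dob", "gender", "zip")


def link(released, register, keys=QUASI_IDENTIFIERS):
    """Return (name, disease) pairs for rows that match exactly one person."""
    index = {}
    for person in register:
        index.setdefault(tuple(person[k] for k in keys), []).append(person["name"])
    matches = []
    for row in released:
        names = index.get(tuple(row[k] for k in keys), [])
        if len(names) == 1:  # a unique match re-identifies the row
            matches.append((names[0], row["disease"]))
    return matches


print(link(medical, public))
```

Only the row whose three quasi-identifier values all match a single register entry is linked; the second row finds no match and stays unlinked.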
How can we de-identify our dataset?
Suppression, t-closeness, …
Born: Russia → *
Annotation of attributes
Data Utility
Used privacy model: k-anonymity
k-Anonymity
Generalization = make less precise (e.g. 28 → <30)
Suppression (for outliers) (e.g. Russia → *)
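The two operations can be sketched as follows. This is an illustration only: the function names and the ">=30" label for older ages are assumptions, since the slide only shows 28 becoming <30 and Russia becoming *.

```python
def generalize_age(age):
    """Generalization: replace an exact age with a coarser range."""
    # The ">=30" label is an assumption; the slide only shows the "<30" case.
    return "<30" if age < 30 else ">=30"


def suppress(_value):
    """Suppression: replace the value with a placeholder."""
    return "*"


record = {"age": 28, "nationality": "Russia"}
anonymized = {
    "age": generalize_age(record["age"]),
    "nationality": suppress(record["nationality"]),
}
print(anonymized)  # {'age': '<30', 'nationality': '*'}
```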
Example of k-Anonymity
Original dataset
Identifier  Quasi-identifier            Sensitive
id          Zip Code  Age  Nationality  Disease
1           13053     28   Russia       Cardiac
3-anonymous dataset
Identifier  Quasi-identifier            Sensitive
id          Zip Code  Age  Nationality  Disease
1           130**     <30  *            Cardiac
Equivalence class: rows with identical quasi-identifier values
Probability that a row corresponds to an individual = 1/k
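A minimal sketch of checking k-anonymity by counting equivalence classes. The first row is taken from the example above; the other two rows and all function names are hypothetical.

```python
from collections import Counter


def equivalence_classes(rows, quasi_identifiers):
    """Count how many rows share each combination of quasi-identifier values."""
    return Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)


def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every equivalence class contains at least k rows."""
    return all(n >= k for n in equivalence_classes(rows, quasi_identifiers).values())


def highest_risk(rows, quasi_identifiers):
    """Re-identification probability in the smallest class: 1 / class size."""
    return 1 / min(equivalence_classes(rows, quasi_identifiers).values())


QIS = ("zip", "age", "nationality")
rows = [
    {"zip": "130**", "age": "<30", "nationality": "*", "disease": "Cardiac"},
    # The two rows below are hypothetical fillers for the same class.
    {"zip": "130**", "age": "<30", "nationality": "*", "disease": "Flu"},
    {"zip": "130**", "age": "<30", "nationality": "*", "disease": "Hepatitis"},
]

print(is_k_anonymous(rows, QIS, 3), highest_risk(rows, QIS))
```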
What is ARX?
ARX Data Anonymization Tool
› Open source
ARX Use cases
ARX
› Public API
ARX
Privacy models
k-Anonymity
k-map
Differential Privacy
l-diversity
t-closeness
δ-Presence
De-identification use case
Use case
Adult data set
› Used to predict whether a person makes over 50K per year
› Widely used in machine learning and statistical analysis
› > 48k records
› 9 attributes
Hands-on
Which steps do we take?
1. Annotate dataset
2. Create generalization rules
3. Select privacy model
Optional step:
› Utility loss
› Privacy risk
Import data
Import Configure Anonymize Explore Analyze
2 Import dataset
File: adults.csv
Using a lab computer?
Configure
1. Annotate dataset
- Annotate dataset
- Create generalization rules
Input dataset
3 Choose intervals of 5
Minimum value: 0 (snap from 0)
Maximum value: 100 (snap from 90)
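An interval-based rule like this one can be sketched in a few lines. This is an illustration, not how ARX builds the hierarchy internally: the function names and the "[lower, upper[" label format are assumptions, the upper bound is treated as exclusive, and the snap settings are not modelled.

```python
def age_interval(age, width=5, minimum=0, maximum=100):
    """Map an age to the half-open interval of the given width that contains it."""
    if not minimum <= age < maximum:
        raise ValueError(f"age {age} is outside [{minimum}, {maximum}[")
    lower = minimum + ((age - minimum) // width) * width
    return f"[{lower}, {lower + width}["


def age_hierarchy(age, widths=(5,)):
    """Generalization levels for one value: exact age, intervals, then '*'."""
    return [str(age)] + [age_interval(age, w) for w in widths] + ["*"]


print(age_hierarchy(28))  # ['28', '[25, 30[', '*']
```

Passing more widths, for example `widths=(5, 10)`, would add coarser levels to the same hierarchy.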
8 Result
Spouse present
5 Result
Using a lab computer?
1 Open project “workshop.deid” with already defined hierarchies
Path: ‘Desktop/Workshop GDPR/workshop.deid’
1 Privacy model: k-anonymity
2 Choose k = 5
Utility measure = objective function
Lower score:
→ less loss of information
→ higher data quality
Attribute weights
Anonymize
1 Anonymize
1 Click “Anonymize”
2 Select “Optimal”
Explore and select
Solution representation
› Each transformation is identified by generalization levels of the quasi-identifiers
Attributes: sex, age, race, marital-status, education, native-country, workclass, occupation, salary-class
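The idea that a transformation is just one generalization level per quasi-identifier can be sketched by enumerating all combinations. The hierarchy heights in `max_levels` are hypothetical, and so is the choice of age, sex and race as the quasi-identifiers.

```python
from itertools import product

# Hypothetical maximum generalization level per quasi-identifier
# (level 0 means the original, ungeneralized value).
max_levels = {"age": 2, "sex": 1, "race": 1}


def transformations(levels):
    """List every combination of generalization levels as a dict."""
    names = list(levels)
    ranges = (range(height + 1) for height in levels.values())
    return [dict(zip(names, combo)) for combo in product(*ranges)]


candidates = transformations(max_levels)
print(len(candidates))  # 3 * 2 * 2 = 12
print(candidates[0], candidates[-1])
```

The first entry leaves the data unchanged and the last one generalizes every attribute fully; the remaining candidates lie between these two extremes.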
› Select generalization levels
› Organize transformations
› View basic information about the
Analyze utility loss
› Utility loss
› Privacy risk
Summary statistics
First select an attribute (e.g. Age)
Empirical distribution
Contingency
Select two attributes (e.g. Age and Sex)
Most people are male and 51 years old
Most people are male and between 51 and 60 years old
Properties
Input
Analyze utility loss
› Utility loss
› Privacy risk
Re-identification risks
Detect quasi-identifiers
Detect attributes which must be modified (HIPAA)
Estimates of population uniques
Re-identification risks
Attacker models used to compile global risks:
› Risk is determined by the smallest equivalence class in the public dataset that maps to the anonymized dataset
› In practice:
Health data: minimal equivalence class size of 5
Threshold probability = 0.2
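A class size of 5 corresponds to a risk of 1/5 = 0.2 per record. A minimal sketch of measuring which share of records exceeds that threshold, using hypothetical rows and a hypothetical function name:

```python
from collections import Counter


def share_above_threshold(rows, quasi_identifiers, threshold=0.2):
    """Fraction of records whose risk (1 / class size) exceeds the threshold."""
    keys = [tuple(row[q] for q in quasi_identifiers) for row in rows]
    sizes = Counter(keys)
    at_risk = sum(1 for key in keys if 1 / sizes[key] > threshold)
    return at_risk / len(keys)


# Hypothetical data: one class of five records and one unique record.
rows = [{"age": "<30", "sex": "Male"}] * 5 + [{"age": "30-39", "sex": "Female"}]
print(share_above_threshold(rows, ("age", "sex")))
```

The five records in the larger class sit exactly at 0.2 and are not counted; only the unique record, with a risk of 1, exceeds the threshold.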
Input dataset
Records affected [%]: exponential growth
14.1% of the records have a risk between 33% and 50% (interval ]33, 50])
Output dataset
Records affected [%] and cumulative risk: exponential decay
78.1% of the records have a risk under 5% (intervals ]0.1, 1] to ]4, 5])
High distinction and separation → indicators that quasi-identifiers are more likely to re-identify
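A sketch of the two indicators, under the assumption that distinction is the share of distinct value combinations among all records and separation is the share of record pairs that differ on the selected attributes. These definitions are stated here as an assumption and the rows are hypothetical.

```python
from itertools import combinations


def distinction(rows, attributes):
    """Distinct value combinations divided by the number of records."""
    values = [tuple(row[a] for a in attributes) for row in rows]
    return len(set(values)) / len(values)


def separation(rows, attributes):
    """Share of record pairs that differ on the selected attributes."""
    values = [tuple(row[a] for a in attributes) for row in rows]
    pairs = list(combinations(values, 2))
    return sum(1 for a, b in pairs if a != b) / len(pairs)


# Hypothetical rows for illustration.
rows = [
    {"age": 28, "sex": "Male"},
    {"age": 28, "sex": "Male"},
    {"age": 35, "sex": "Female"},
    {"age": 41, "sex": "Male"},
]
print(distinction(rows, ("age",)), separation(rows, ("age",)))
print(distinction(rows, ("sex",)), separation(rows, ("sex",)))
```

In this toy data age scores higher than sex on both indicators, so age would be the stronger candidate for a quasi-identifier.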
Prosecutor
Journalist
Marketer
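The three attacker models can be sketched with the commonly used definitions, stated here as assumptions: the prosecutor risk uses class sizes in the released sample, the journalist risk uses class sizes in a population table, and the marketer risk is the expected share of records re-identified. The data is hypothetical and every sample class is assumed to occur in the population.

```python
from collections import Counter


def class_sizes(rows, quasi_identifiers):
    """Size of each equivalence class, keyed by quasi-identifier values."""
    return Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)


def prosecutor_risk(sample, qis):
    """Highest risk when the attacker knows the target is in the sample."""
    return 1 / min(class_sizes(sample, qis).values())


def journalist_risk(sample, population, qis):
    """Highest risk based on the matching classes in the population table."""
    pop = class_sizes(population, qis)
    return max(1 / pop[key] for key in class_sizes(sample, qis))


def marketer_risk(sample, population, qis):
    """Expected share of sample records re-identified correctly."""
    pop = class_sizes(population, qis)
    smp = class_sizes(sample, qis)
    return sum(size / pop[key] for key, size in smp.items()) / sum(smp.values())


QIS = ("age", "sex")
young, older = {"age": "<30", "sex": "Male"}, {"age": "30-39", "sex": "Female"}
population = [young] * 4 + [older] * 2  # hypothetical population table
sample = [young] * 2 + [older]          # hypothetical released sample

print(prosecutor_risk(sample, QIS))
print(journalist_risk(sample, population, QIS))
print(marketer_risk(sample, population, QIS))
```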
Input dataset
Output dataset
HIPAA Identifiers
Conclusion
1. Annotate dataset
2. Create generalization rules
3. Select privacy model
› Utility loss
› Privacy risk