
ARX Data Anonymization Tool

Hands-on

12 September 2019
Thomas Van den Bossche
Introduction

Data custodians want to share data in a controlled manner with:

› Government
› Research community
› Private sector
Re-identification by linking

Microdata

Name   Date of birth  Gender  Zip code  Disease
Andre  01/02/1992     Male    53715     Heart Disease
Carol  19/09/1986     Female  53715     Hepatitis
Dan    17/08/1977     Female  53715     Bronchitis
Eve    10/03/1994     Female  53715     Broken arm
Ellen  12/01/1994     Male    53715     Flu
…      …              …       …         …

Voter registration data

Name    Date of birth  Gender  Zip code
Ronald  01/12/1997     Male    53715
Dan     29/03/1946     Male    53715
Carol   19/03/1986     Female  53715
Eve     01/02/1992     Female  53715
Ellen   03/08/1991     Female  53715
…       …              …       …
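Linking on quasi-identifiers can be sketched in a few lines of Python. This is only an illustration: the rows are hypothetical (chosen so that two records link), and `link` and the field names are made up for this sketch rather than taken from any tool.

```python
# Hypothetical rows for illustration only; they are not the slide's data.
microdata = [
    {"dob": "01/02/1992", "gender": "Female", "zip": "53715", "disease": "Flu"},
    {"dob": "19/03/1986", "gender": "Female", "zip": "53715", "disease": "Hepatitis"},
    {"dob": "17/08/1977", "gender": "Male", "zip": "53715", "disease": "Bronchitis"},
]
voter_list = [
    {"name": "Eve", "dob": "01/02/1992", "gender": "Female", "zip": "53715"},
    {"name": "Carol", "dob": "19/03/1986", "gender": "Female", "zip": "53715"},
    {"name": "Ronald", "dob": "01/12/1997", "gender": "Male", "zip": "53715"},
]

QUASI_IDENTIFIERS = ("dob", "gender", "zip")


def link(released, public, keys=QUASI_IDENTIFIERS):
    """Return (name, disease) pairs whose key combination matches exactly one public record."""
    index = {}
    for person in public:
        index.setdefault(tuple(person[k] for k in keys), []).append(person["name"])
    matches = []
    for row in released:
        names = index.get(tuple(row[k] for k in keys), [])
        if len(names) == 1:  # a unique match re-identifies the record
            matches.append((names[0], row["disease"]))
    return matches


reidentified = link(microdata, voter_list)
```

In this sketch the released data contains no names, yet two of the three records are tied back to a named person through date of birth, gender and zip code alone.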
How can we de-identify our dataset?

Transform (methods)
› Generalization: Age 28 → Age <30
› Suppression: Born Russia → Born *
› Microaggregation
› …

Enforce (privacy models)
› k-Anonymity
› t-closeness
› Differential Privacy
› …

Measure utility (utility models)
› Information loss
Annotation of attributes

Name   Date of birth  Gender  Zip Code  Disease
Andre  01/12/1997     Male    53715     Heart Disease
Carol  01/12/1986     Female  53715     Hepatitis
Dan    01/12/1976     Male    53715     Bronchitis
Eve    01/12/1994     Female  53715     Broken arm
Ellen  01/12/1992     Female  53715     Flu
Eric   01/12/1991     Male    53715     Flu

› Identifier (Name): always removed before release
› Quasi-identifiers (Date of birth, Gender, Zip Code): can be used for linking the anonymized dataset with other datasets; the combination identifies 87% of the US population
› Sensitive (Disease): used for research purposes
Privacy protection ⟷ Data utility
[Figure; axes: Privacy Protection, Data Utility]
Used privacy model: k-anonymity
k-Anonymity

› A person's record cannot be distinguished from those of at least k - 1 other individuals


› How?

› Generalization = make less precise: Age 28 → Age <30
› Suppression (for outliers): Nationality Russia → Nationality *
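The two transformations can be sketched as small Python functions. The function names and the fixed bound of 30 are assumptions made for this illustration; the zip-code masking follows the `130**` style used in this deck's k-anonymity example.

```python
def generalize_age(age, bound=30):
    """Generalization: keep only which side of `bound` the age falls on."""
    return f"<{bound}" if age < bound else f">={bound}"


def mask_zip(zip_code, hidden_digits):
    """Generalization of a zip code: replace the last digits with '*'."""
    if hidden_digits == 0:
        return zip_code
    return zip_code[:-hidden_digits] + "*" * hidden_digits


def suppress(_value):
    """Suppression: drop the value entirely."""
    return "*"


age_out = generalize_age(28)        # "<30"
zip_out = mask_zip("13053", 2)      # "130**"
country_out = suppress("Russia")    # "*"
```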
Example of k-Anonymity

Attribute roles: id = identifier; Zip Code, Age, Nationality = quasi-identifiers; Disease = sensitive.

Original dataset

id  Zip Code  Age  Nationality  Disease
1   13053     28   Russia       Cardiac
2   13066     29   US           Cardiac
3   13068     21   Japan        Infectious
4   14853     41   US           Infectious
5   14853     50   India        Cancer
6   14853     47   US           Cardiac
7   13053     37   India        Cardiac
8   13068     39   US           Cancer
9   13068     31   US           Infectious

3-anonymous dataset (each block of three rows is one equivalence class)

id  Zip Code  Age  Nationality  Disease
1   130**     <30  *            Cardiac
2   130**     <30  *            Cardiac
3   130**     <30  *            Infectious

4   1485*     >40  *            Infectious
5   1485*     >40  *            Cancer
6   1485*     >40  *            Cardiac

7   130**     3*   *            Cardiac
8   130**     3*   *            Cancer
9   130**     3*   *            Infectious

Probability that a row corresponds to a given individual = 1/3
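The 3-anonymous table above can be checked mechanically. The sketch below groups the quasi-identifier columns into equivalence classes and tests the class sizes against k; the helper names are made up for this illustration and this is not how ARX itself is implemented.

```python
from collections import Counter

# Quasi-identifier columns (zip code, age, nationality) of the 3-anonymous table.
rows = [
    ("130**", "<30", "*"), ("130**", "<30", "*"), ("130**", "<30", "*"),
    ("1485*", ">40", "*"), ("1485*", ">40", "*"), ("1485*", ">40", "*"),
    ("130**", "3*", "*"), ("130**", "3*", "*"), ("130**", "3*", "*"),
]


def equivalence_classes(records):
    """Map each distinct quasi-identifier combination to its number of records."""
    return Counter(records)


def is_k_anonymous(records, k):
    """True if every equivalence class contains at least k records."""
    return all(size >= k for size in equivalence_classes(records).values())


classes = equivalence_classes(rows)
```

Here `classes` holds three classes of three records each, so the table satisfies 3-anonymity but not 4-anonymity.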
What is ARX?
ARX Data Anonymization Tool

› Transforming structured personal data

› Open source

› Transform: methods
› Enforce: privacy models
› Measure utility: utility models
ARX Use cases

› Big data analytics
› Research projects
› Clinical trial data sharing
ARX

› Public API

ARX

Privacy models

k-Anonymity

k-map

Differential Privacy

l-diversity

t-closeness

δ-Presence

De-identification use case
Adult data set
› Used to predict whether a person makes over 50K per year
› Widely used in machine learning and statistical analysis
› > 48k records
› 9 attributes

Our aim: minimize re-identification risk, while maintaining utility

Hands-on
Which steps do we take?

Import data → Configure → Anonymize → Analyze → Export data

Configure:
1. Annotate dataset
2. Create generalization rules
3. Select privacy model
4. Select utility measure

Analyze:
› Utility loss
› Privacy risk
Which steps do we take?

Import data → Configure → Anonymize → Explore and select (optional step) → Analyze → Export data

Configure:
1. Annotate dataset
2. Create generalization rules
3. Select privacy model
4. Select utility measure

Analyze:
› Utility loss
› Privacy risk
Import data
Import Configure Anonymize Explore Analyze

1 Generate new project


2 Import dataset

File: adults.csv

Using a lab computer? Path: ‘Desktop/Workshop GDPR/adults.csv’

Configure


1. Annotate dataset

2. Create generalization rules

3. Select privacy model

4. Select utility measure


[Screenshot of the Configure perspective. Labeled areas: Input dataset; Annotate dataset and Create generalization rules; Select privacy model; Extract research sample; Configure utility measure]


1 Annotate dataset

1 Click an attribute column

2 Choose appropriate attribute type


3 Set attribute metadata

The variable we want to explain


2 Create generalization hierarchies


Age attribute

1 Select attribute age and Create Hierarchy…

2 Select Use intervals


3 Choose intervals of 5

Range settings:
Minimum value: 0
Bottom coding from: 0
Snap from (lower bound): 0
Snap from (upper bound): 90
Top coding from: 90
Maximum value: 100

4 Top coding when age > 90


5 Add new level

6 Choose size 2 (new level will span 2 intervals of previous level)


7 Repeat until one interval at the highest generalization level spans 80 years
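The interval hierarchy built in steps 3 to 7 can be sketched as follows: start with intervals of width 5 and let each new level span two intervals of the previous one (5, 10, 20, 40, 80). This is a simplified illustration, not ARX's hierarchy builder; the function name and the `[lower, upper[` label format are assumptions, and top and bottom coding are left out.

```python
def age_hierarchy(age, base_width=5, top_width=80, minimum=0):
    """Return the value of `age` at every level, from the original value to the widest interval."""
    levels = [str(age)]  # level 0: original value
    width = base_width
    while width <= top_width:
        lower = minimum + ((age - minimum) // width) * width
        levels.append(f"[{lower}, {lower + width}[")
        width *= 2  # each new level spans 2 intervals of the previous level
    return levels


example = age_hierarchy(28)
```

For age 28 this yields the original value followed by five increasingly coarse intervals, ending with the interval of width 80.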


8 Result

# groups per hierarchy level Single record


2 Create generalization hierarchies


Marital-status attribute

1 Select Marital-status, Create Hierarchy… and choose ‘Use ordering’

2 Rearrange the values as follows:

Spouse present

Spouse not present


3 Make a group containing the first two values (= spouse present)


4 Add a new group for the other values (right-click)

5 Enter the following settings (= spouse not present)

6 Result

Level-0: No generalization (original values)
Level-1: Generalization
Level-2: Suppression
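The resulting three-level hierarchy can be written down as a lookup. In this sketch the value names are assumed from the public Adult data set, and treating the two "Married-civ-spouse" and "Married-AF-spouse" values as "spouse present" is likewise an assumption; the slides only state that the first two values of the rearranged list form that group.

```python
# Assumed grouping: which original values count as "spouse present".
SPOUSE_PRESENT = {"Married-civ-spouse", "Married-AF-spouse"}


def marital_hierarchy(value):
    """Return [level 0, level 1, level 2] for one marital-status value."""
    level1 = "Spouse present" if value in SPOUSE_PRESENT else "Spouse not present"
    return [value, level1, "*"]  # level 2 is full suppression


divorced = marital_hierarchy("Divorced")
married = marital_hierarchy("Married-civ-spouse")
```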


2 Create generalization hierarchies


Other attributes

1 Open project “workshop.deid” with already defined hierarchies

Using a lab computer? Path: ‘Desktop/Workshop GDPR/workshop.deid’


Select privacy model


1 Privacy model: k-anonymity

1 Add privacy model

2 Choose k = 5


Configure utility measure

= objective function

1 Configuration of utility measures

General settings
› Max # records that can be suppressed

Utility measure
› Lower score → less loss of information → higher data quality

Attribute weights
› ARX will try to reduce loss of information for attributes with higher weights

Anonymize


1 Anonymize
1 Click “Anonymize”

2 Select “Optimal”

› Use a stopping criterion in case of scalability issues
› Allow different generalizations for each record

Explore and select


Subset of solution space

Filters Clipboard Properties


Solution representation

› Each transformation is identified by the generalization levels of the quasi-identifiers
› Attributes shown: marital-status, native-country, age, race, education, workclass, sex, occupation, salary-class
› Legend: Privacy-preserving; Currently selected; Optimal (privacy-preserving)
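Since each transformation is just one generalization level per quasi-identifier, the solution space can be enumerated as all level combinations. The sketch below uses three of the attributes; the hierarchy heights are hypothetical and `solution_space` is a name made up for this illustration.

```python
from itertools import product

# Hypothetical highest generalization level per attribute (level 0 = original values).
max_levels = {"age": 5, "marital-status": 2, "sex": 1}


def solution_space(heights):
    """List every transformation as a dict mapping attribute name to generalization level."""
    names = list(heights)
    level_ranges = (range(heights[name] + 1) for name in names)
    return [dict(zip(names, levels)) for levels in product(*level_ranges)]


space = solution_space(max_levels)
```

With these heights the space contains 6 x 3 x 2 = 36 transformations, from "no generalization anywhere" to "highest level everywhere". The space grows multiplicatively with every added attribute, which is why filtering it matters.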


Filtering
› Select generalization levels
› Select bounds for information loss

Clipboard
› Organize transformations

Properties
› View basic information about the selected transformation
› Score = information loss (how much information was removed to satisfy 5-anonymity)
› How far is this solution from the optimal? (%)


Select and apply the solution with generalization level 1 for attribute Age

Analyze utility loss


› Utility loss
› Privacy risk


Input dataset Transformed dataset

Statistical info (input) Statistical info (output)


Summary statistics
First select an attribute (e.g. Age)

› Some records are suppressed to satisfy the privacy model
› Only 4 distinct values remain, since we have intervals now

Empirical distribution


Contingency
Select two attributes (e.g. Age and Sex)
› Input: most people are male and 51 years old
› Output: most people are male and between 51 and 60 years old

Equivalence classes
› Input: 4573 classes, average class size of 1.2
› Output: 175 classes, average class size of 32.5

Properties (shown for Input and Output)

› Score = information loss (how much information was removed to satisfy 5-anonymity)
› How far is this solution from the optimal? (%)
Analyze utility loss


› Utility loss
› Privacy risk


› Re-identification risks
› Detect quasi-identifiers
› Estimates of population uniques
› Detect attributes which must be modified (HIPAA)

Re-identification risks
Attacker models used to compile global risks:

Prosecutor model
• One specific individual
• Example: neighbor

Journalist model
• Background info about many patients
• Targets any individual
• External database is required

Marketer model
• As many individuals as possible
• Example: mailing campaign

Evaluate prosecutor risk

› Risk = probability to identify a single record
› Not known a priori who the intruder will re-identify
› Worst-case assumption: the person to re-identify is in the smallest equivalence class

2-anonymity DB (id = identifier; Zip Code, Age, Nationality = quasi-identifiers; Disease = sensitive)

id  Zip Code  Age  Nationality  Disease
1   130**     <30  *            Cardiac
2   130**     <30  *            Cardiac
3   130**     <30  *            Infectious
4   1485*     >40  *            Infectious
5   1485*     >40  *            Cancer
6   1485*     >40  *            Cardiac
7   1485*     >40  *            Infectious
8   130**     3*   *            Cancer
9   130**     3*   *            Infectious

Overall probability of re-identification = 1/2
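Under the worst-case assumption, the prosecutor risk of the table above is one over the size of the smallest equivalence class. A minimal sketch, with `prosecutor_risk` as a name made up for this illustration:

```python
from collections import Counter
from fractions import Fraction

# Quasi-identifier columns (zip code, age, nationality) of the 2-anonymity DB.
records = (
    [("130**", "<30", "*")] * 3
    + [("1485*", ">40", "*")] * 4
    + [("130**", "3*", "*")] * 2
)


def prosecutor_risk(rows):
    """Worst-case re-identification probability: 1 / size of the smallest equivalence class."""
    class_sizes = Counter(rows)
    return Fraction(1, min(class_sizes.values()))


risk = prosecutor_risk(records)
```

The class sizes are 3, 4 and 2, so the smallest class has two records and the risk is 1/2.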

Evaluate journalist risk

› Anonymized data is a subset of a larger public database
› Risk is determined by the smallest equivalence class in the public dataset that maps to the anonymized dataset

Equivalence class     Anonymized dataset   Public dataset
Gender  Born          Count  Records       Count  Records
Male    1950-1959     3      1,4,12        4      1,4,12,27
Male    1960-1969     2      2,14          5      2,14,15,22,26
Male    1970-1979     2      9,10          5      9,10,16,20,23
Female  1960-1969     2      7,11          5      7,11,18,19,21
Female  1970-1979     2      6,13          5      6,13,17,24,25

Overall probability of re-identification = 1/4
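The journalist risk for the table above can be computed from the public counts alone. In this sketch the dictionaries restate the two Count columns, and `journalist_risk` is a name made up for this illustration.

```python
from fractions import Fraction

# Class sizes from the table: (gender, born) -> number of records.
anonymized_counts = {
    ("Male", "1950-1959"): 3,
    ("Male", "1960-1969"): 2,
    ("Male", "1970-1979"): 2,
    ("Female", "1960-1969"): 2,
    ("Female", "1970-1979"): 2,
}
public_counts = {
    ("Male", "1950-1959"): 4,
    ("Male", "1960-1969"): 5,
    ("Male", "1970-1979"): 5,
    ("Female", "1960-1969"): 5,
    ("Female", "1970-1979"): 5,
}


def journalist_risk(anonymized, public):
    """1 / size of the smallest public class that maps to a class in the anonymized data."""
    return Fraction(1, min(public[key] for key in anonymized))


risk = journalist_risk(anonymized_counts, public_counts)
```

The result, 1/4, is lower than the 1/2 that the anonymized counts alone would suggest, because the attacker cannot tell which members of the larger public class are in the released sample.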

Determining threshold value

› Threshold helps to interpret re-identification probability
› Actual risk > threshold → unacceptable!
› In practice, for health data: minimal equivalence class size of 5, i.e. threshold probability = 0.2
› Often: large fraction of records have re-identification risk << threshold
› Metric: cumulative distribution of risk values

Input dataset Transformed dataset


Input dataset

[Figure: histogram of records affected [%] against prosecutor re-identification risk [%]; exponential growth]
› 69.3% of the records have a risk between 50% and 100%
› 14.1% of the records have a risk between 33% and 50%

Output dataset

[Figure: histogram and cumulative risk of records affected [%] against prosecutor re-identification risk [%]; exponential decay]
› 78.1% of the records have a risk under 5%
› 33.1% of the records have a risk between 0.1% and 1%
› All records have a risk under the threshold

Analyze combinations of attributes regarding associated risks of re-identification

1 Select attributes as a Target variable

High distinction and separation → indicators that the selected quasi-identifiers are more likely to enable re-identification
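One common way to formalize these two indicators, assumed here rather than taken from the slides, is: distinction = number of distinct value combinations divided by the number of records, and separation = share of record pairs that differ on the selected attributes. A minimal sketch with hypothetical values:

```python
from fractions import Fraction
from itertools import combinations


def distinction(values):
    """Share of distinct values (or value combinations) among all records."""
    return Fraction(len(set(values)), len(values))


def separation(values):
    """Share of record pairs whose values differ."""
    pairs = list(combinations(values, 2))
    return Fraction(sum(a != b for a, b in pairs), len(pairs))


sex = ["Male", "Female", "Male", "Female"]  # hypothetical column
sex_distinction = distinction(sex)
sex_separation = separation(sex)
```

A single attribute such as sex scores low on distinction (1/2 here); combining it with further attributes such as age and zip code pushes both values towards 1, which is the situation the slide warns about.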


Prosecutor

Journalist

Marketer


› Records at risk = records with risks above threshold
› Highest risk = highest risk of a single record
› Success rate = average # records that can be re-identified

Reported for both the input dataset and the output dataset.
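The three figures can be derived from the equivalence class sizes alone. The sketch below assumes that each record's risk is 1 / (size of its class) and reads "success rate" as the average of these risks, which equals the number of classes divided by the number of records; `risk_summary` and the example class sizes are made up for this illustration.

```python
from fractions import Fraction


def risk_summary(class_sizes, threshold):
    """Summarize re-identification risk from a list of equivalence class sizes."""
    total = sum(class_sizes)
    at_risk = sum(size for size in class_sizes if Fraction(1, size) > threshold)
    return {
        "records_at_risk": Fraction(at_risk, total),         # share of records above threshold
        "highest_risk": Fraction(1, min(class_sizes)),        # risk in the smallest class
        "success_rate": Fraction(len(class_sizes), total),    # average risk over all records
    }


threshold = Fraction(1, 5)  # 0.2, i.e. a minimal class size of 5
before = risk_summary([3, 4, 2], threshold)
after = risk_summary([5, 10, 25], threshold)
```

With class sizes 3, 4 and 2 every record exceeds the threshold; with 5, 10 and 25 none does, and the highest risk equals the threshold exactly.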

HIPAA Identifiers

› Health Insurance Portability and Accountability Act (1996)

› Provides security provisions for safeguarding medical information

› Specifies 18 identifiers that must be modified or removed

Conclusion

Import data → Configure → Anonymize → Analyze → Export data

Configure:
1. Annotate dataset
2. Create generalization rules
3. Select privacy model
4. Select utility measure

Analyze:
› Utility loss
› Privacy risk
