ARX Usage Presentation

ARX Data Anonymization Tool
Hands-on
12 September 2019
Thomas Van den Bossche
Introduction
Data custodians want to share data in a controlled manner with:
› Government
› Research community
› Private sector
Re-identification by linking

Microdata
Name   Date of birth  Gender  Zip Code  Disease
Andre  01/02/1992     Male    53715     Heart Disease
Carol  19/09/1986     Female  53715     Hepatitis
Dan    17/08/1977     Female  53715     Bronchitis
Eve    10/03/1994     Female  53715     Broken arm
Ellen  12/01/1994     Male    53715     Flu
…      …              …       …         …

Voter registration data
Name    Date of birth  Gender  Zip code
Ronald  01/12/1997     Male    53715
Dan     29/03/1946     Male    53715
Carol   19/03/1986     Female  53715
Eve     01/02/1992     Female  53715
Ellen   03/08/1991     Female  53715
…       …              …       …
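A linking attack can be sketched in a few lines. The rows below are invented for illustration (they are not the slide's rows), and the field names dob, gender and zip are assumptions. The sketch joins a released table to a public table on the shared attributes and reports every released row that matches exactly one named person.

```python
from collections import defaultdict

QUASI_IDENTIFIERS = ("dob", "gender", "zip")  # assumed field names

def link(released, public, keys=QUASI_IDENTIFIERS):
    """Return (name, released row) pairs where the shared attributes match exactly one public row."""
    index = defaultdict(list)
    for row in public:
        index[tuple(row[k] for k in keys)].append(row)
    matches = []
    for row in released:
        candidates = index.get(tuple(row[k] for k in keys), [])
        if len(candidates) == 1:  # a unique match re-identifies the record
            matches.append((candidates[0]["name"], row))
    return matches

# Hypothetical data: names were removed from the released table, but not the other attributes.
released = [
    {"dob": "01/02/1992", "gender": "Female", "zip": "53715", "disease": "Flu"},
    {"dob": "05/05/1980", "gender": "Male", "zip": "53715", "disease": "Hepatitis"},
]
public = [
    {"name": "Eve", "dob": "01/02/1992", "gender": "Female", "zip": "53715"},
    {"name": "Dan", "dob": "29/03/1946", "gender": "Male", "zip": "53715"},
]

linked = link(released, public)
print(linked)
```

Removing the name column alone does not help here: the first released row is still tied to one person through date of birth, gender and zip code.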
How can we de-identify our dataset?
Transform (methods)
› Generalization (e.g. Age: 28 → <30)
› Suppression (e.g. Born: Russia → *)
› Microaggregation
› …
Enforce (privacy models)
› k-Anonymity
› t-closeness
› Differential Privacy
› …
Measure utility (utility models)
› Information loss
Annotation of attributes

Identifier  Quasi-identifiers                Sensitive
Name        Date of birth  Gender  Zip Code  Disease
Andre       01/12/1997     Male    53715     Heart Disease
Carol       01/12/1986     Female  53715     Hepatitis
Dan         01/12/1976     Male    53715     Bronchitis
Eve         01/12/1994     Female  53715     Broken arm
Ellen       01/12/1992     Female  53715     Flu
Eric        01/12/1991     Male    53715     Flu

› Identifier: always removed before release
› Quasi-identifiers: can be used for linking the anonymized dataset with other datasets; their combination identifies 87% of the US population
› Sensitive: used for research purposes
Privacy protection ⟷ Data utility
[Figure: trade-off between privacy protection and data utility; axis labels: Privacy Protection, Data Utility]
Used privacy model: k-anonymity
k-Anonymity
› Each person's record cannot be distinguished from the records of at least k - 1 other individuals
› How?
Generalization = make values less precise (e.g. Age: 28 → <30)
Suppression = remove values, used for outliers (e.g. Nationality: Russia → *)
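The two transformations can be sketched as small functions. This is only an illustration: the function names are made up, and the age boundaries (in particular the label for ages of 40 and above) are assumptions, since the slide only shows the labels <30, 3* and >40.

```python
def generalize_age(age):
    """Map an exact age to a coarse label (boundaries are assumed, not taken from ARX)."""
    if age < 30:
        return "<30"
    if age < 40:
        return "3*"
    return ">=40"

def mask_zip(zip_code, keep):
    """Keep the first `keep` characters of a zip code and mask the rest with '*'."""
    return zip_code[:keep] + "*" * (len(zip_code) - keep)

def suppress(_value):
    """Replace a value entirely, as in Nationality: Russia -> *."""
    return "*"

record = {"zip": "13053", "age": 28, "nationality": "Russia"}
generalized = {
    "zip": mask_zip(record["zip"], keep=3),
    "age": generalize_age(record["age"]),
    "nationality": suppress(record["nationality"]),
}
print(generalized)
```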
Example of k-Anonymity

Original dataset
Identifier  Quasi-identifier            Sensitive
id          Zip Code  Age  Nationality  Disease
1           13053     28   Russia       Cardiac
2           13066     29   US           Cardiac
3           13068     21   Japan        Infectious
4           14853     41   US           Infectious
5           14853     50   India        Cancer
6           14853     47   US           Cardiac
7           13053     37   India        Cardiac
8           13068     39   US           Cancer
9           13068     31   US           Infectious

3-anonymous dataset (blank lines separate the equivalence classes)
Identifier  Quasi-identifier            Sensitive
id          Zip Code  Age  Nationality  Disease
1           130**     <30  *            Cardiac
2           130**     <30  *            Cardiac
3           130**     <30  *            Infectious

4           1485*     >40  *            Infectious
5           1485*     >40  *            Cancer
6           1485*     >40  *            Cardiac

7           130**     3*   *            Cardiac
8           130**     3*   *            Cancer
9           130**     3*   *            Infectious

Probability that a row corresponds to a given individual = 1/k (here 1/3)
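Whether a table is k-anonymous can be checked by grouping rows on their quasi-identifier values and looking at the smallest group. The sketch below uses the quasi-identifier columns of the 3-anonymous example; the helper names are illustrative and this is not how ARX itself is implemented.

```python
from collections import Counter

def equivalence_classes(rows):
    """Count how many rows share each combination of quasi-identifier values."""
    return Counter(rows)

def is_k_anonymous(rows, k):
    """True if every equivalence class contains at least k rows."""
    return all(size >= k for size in equivalence_classes(rows).values())

# (zip code, age, nationality) for rows 1-9 of the 3-anonymous dataset
anonymized = (
    [("130**", "<30", "*")] * 3
    + [("1485*", ">40", "*")] * 3
    + [("130**", "3*", "*")] * 3
)

classes = equivalence_classes(anonymized)
smallest = min(classes.values())
print(len(classes), "classes, smallest has", smallest, "rows")
print("3-anonymous:", is_k_anonymous(anonymized, 3))
print("4-anonymous:", is_k_anonymous(anonymized, 4))
```

The smallest class size is also what bounds the re-identification probability: with a smallest class of 3 rows, no row can be tied to one individual with probability above 1/3.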
What is ARX?
ARX Data Anonymization Tool
› Transforming structured personal data
› Open source
› Transform (methods), Enforce (privacy models), Measure utility (utility models)
ARX Use cases
› Big data analytics
› Research projects
› Clinical trial data sharing
ARX
› Public API
ARX
Privacy models
› k-Anonymity
› k-map
› Differential Privacy
› l-diversity
› t-closeness
› δ-Presence
De-identification use case
Adult data set
› Used to predict whether a person makes over 50K per year
› Widely used in machine learning and statistical analysis
› More than 48k records
› 9 attributes
Our aim: minimize re-identification risk while maintaining utility
Hands-on
Which steps do we take?
Import data → Configure → Anonymize → Analyze → Export data
Configure:
1. Annotate dataset
2. Create generalization rules
3. Select privacy model
4. Select utility measure
Analyze:
› Utility loss
› Privacy risk
Which steps do we take?
Optional step: Explore and select (between Anonymize and Analyze)
Import data
1 Generate new project
2 Import dataset
File: adults.csv
Using a lab computer? Path: ‘Desktop/Workshop GDPR/adults.csv’
Configure
[Screenshot of the Configure view, with labelled areas: Input dataset; Annotate dataset and Create generalization rules; Select privacy model; Extract research sample; Configure utility measure]
1 Annotate dataset
1 Click an attribute column
2 Choose the appropriate attribute type
3 Set attribute metadata (slide annotation: the variable we want to explain)
2 Create generalization hierarchies

Age attribute
1 Select attribute age and Create Hierarchy…
2 Select Use intervals
3 Choose intervals of 5
4 Top coding when age > 90
  Settings shown: Minimum value 0, Bottom coding from 0, Snap from 0; Snap from 90, Top coding from 90, Maximum value 100
5 Add new level
6 Choose size 2 (new level will span 2 intervals of previous level)
7 Repeat until we have a generalization level where one interval = 80
8 Result
  [Screenshot: resulting hierarchy, annotated with the number of groups per hierarchy level and a single record]
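The interval hierarchy built in these steps can be sketched as follows. The sketch assumes base intervals of width 5, each new level spanning 2 intervals of the previous one, top coding for ages of 90 and above, and a final suppression level. The label format and the function name are illustrative and do not reproduce ARX's own output.

```python
def age_hierarchy(age, base=5, top=90, widest=80):
    """Return the labels of one age value, from level 0 (exact) up to suppression."""
    labels = [str(age)]
    width = base
    while width <= widest:          # widths 5, 10, 20, 40, 80
        if age >= top:
            labels.append(f">={top}")  # top coding (assumed to start at 90)
        else:
            low = (age // width) * width
            high = min(low + width, top)
            labels.append(f"[{low}, {high}[")
        width *= 2                  # next level spans 2 intervals of this one
    labels.append("*")              # highest level: value fully suppressed
    return labels

print(age_hierarchy(28))
print(age_hierarchy(93))
```

Each record therefore has one label per level, and an anonymization run picks a single level for the whole attribute (or per record, if different generalizations are allowed).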

Marital-status attribute
1 Select Marital-status, Create Hierarchy… and choose ‘Use ordering’
2 Rearrange the values so that the ‘spouse present’ values come first, followed by the ‘spouse not present’ values
3 Make a group containing the first two values (= spouse present)
4 Add a new group for the other values (right-click)
5 Enter the settings for this group (= spouse not present)
6 Result
  Level-0: no generalization (original values)
  Level-1: generalization
  Level-2: suppression
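The resulting three-level hierarchy can be written down as a small lookup. The value names below are the marital-status categories of the public Adult data set. Which two of them the slide groups as "spouse present" is an assumption here (the two married-with-spouse categories), so treat the grouping as illustrative.

```python
# Assumption: these are the two values grouped as "spouse present".
SPOUSE_PRESENT = {"Married-civ-spouse", "Married-AF-spouse"}

def marital_status_hierarchy(value):
    """Return [level-0, level-1, level-2] for one marital-status value."""
    group = "Spouse present" if value in SPOUSE_PRESENT else "Spouse not present"
    return [value, group, "*"]

for value in ("Married-civ-spouse", "Divorced", "Widowed"):
    print(marital_status_hierarchy(value))
```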

Other attributes
1 Open project “workshop.deid” with already defined hierarchies
Using a lab computer? Path: ‘Desktop/Workshop GDPR/workshop.deid’
Select privacy model
k-anonymity
1 Add privacy model
2 Choose k = 5
Configure utility measure (= objective function)
1 Configuration of utility measures
General settings
› Max # records that can be suppressed
Utility measure
› Lower score → less loss of information → higher data quality
Attribute weights
› ARX will try to reduce loss of information for attributes with higher weights
Anonymize
1 Click “Anonymize”
2 Select “Optimal”
› Use a stopping criterion in case of scalability issues
› Allow different generalizations for each record
Explore and select
[Screenshot: subset of the solution space, with panels Filters, Clipboard and Properties]
Solution representation
› Each transformation is identified by the generalization levels of the quasi-identifiers
› Attributes shown: age, marital-status, native-country, race, education, workclass, sex, occupation, salary-class
Legend: Privacy-preserving; Currently selected; Optimal (privacy-preserving)
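Since a transformation is just one generalization level per quasi-identifier, the solution space can be enumerated as a product of level ranges. The maximum levels below are hypothetical and cover only three attributes; the real numbers depend on the hierarchies defined in the project.

```python
from itertools import product

# Hypothetical maximum generalization level per quasi-identifier.
max_levels = {"age": 5, "sex": 1, "marital-status": 2}

def transformations(max_levels):
    """List every combination of generalization levels, one level per attribute."""
    names = list(max_levels)
    ranges = (range(max_levels[name] + 1) for name in names)
    return [dict(zip(names, levels)) for levels in product(*ranges)]

space = transformations(max_levels)
print(len(space), "candidate transformations")
print(space[0])    # no generalization at all
print(space[-1])   # every attribute at its highest level
```

The space grows multiplicatively with every added attribute, which is one reason a stopping criterion can be needed on larger projects.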
Filtering
› Select generalization levels
› Select bounds for information loss
Clipboard
› Organize transformations
Properties
› View basic information about the selected transformation
› Score = information loss (how much information was removed to satisfy 5-anonymity)
› How far is this solution from the optimal? (%)
Select and apply the solution with Generalization Level 1 for attribute Age
Analyze utility loss
[Screenshot: input dataset and transformed dataset side by side, each with its statistical info (input / output)]
Summary statistics
› First select an attribute (e.g. Age)
› Some records are suppressed to satisfy the privacy model
› Only 4 distinct values, since we have intervals now
Empirical distribution
Contingency
› Select two attributes (e.g. Age and Sex)
› Input: most people are male and 51 years old
› Output: most people are male and between 51 and 60 years old
Equivalence classes
› Input: 4573 classes, average class size of 1.2
› Output: 175 classes, average class size of 32.5
Properties (input and output)
› Score = information loss (how much information was removed to satisfy 5-anonymity)
› How far is this solution from the optimal? (%)
Analyze privacy risk
Analyse risk
› Re-identification risks
› Detect quasi-identifiers
› Estimates of population uniques
› Detect attributes which must be modified (HIPAA)
Re-identification risks
Attacker models used to compile global risks:
Prosecutor model
• One specific individual
• Example: neighbor
Journalist model
• Background info about many patients
• Targets any individual
• External database is required
Marketer model
• As many individuals as possible
• Example: mailing campaign
Evaluate prosecutor risk
› Risk = probability of identifying a single record
› It is not known a priori whom the intruder will re-identify
› Worst-case assumption: the person to re-identify is in the smallest equivalence class

2-anonymity DB
Identifier  Quasi-identifier            Sensitive
id          Zip Code  Age  Nationality  Disease
1           130**     <30  *            Cardiac
2           130**     <30  *            Cardiac
3           130**     <30  *            Infectious
4           1485*     >40  *            Infectious
5           1485*     >40  *            Cancer
6           1485*     >40  *            Cardiac
7           1485*     >40  *            Infectious
7           130**     3*   *            Cancer
8           130**     3*   *            Infectious

Overall probability of re-identification = 1/2
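The worst-case prosecutor risk is one over the size of the smallest equivalence class. A minimal sketch over the quasi-identifier columns of the 2-anonymity table (the function name is illustrative):

```python
from collections import Counter
from fractions import Fraction

def prosecutor_risk(rows):
    """Worst-case risk: 1 / size of the smallest equivalence class."""
    sizes = Counter(rows)
    return Fraction(1, min(sizes.values()))

# (zip code, age, nationality) of the nine rows in the 2-anonymity table
rows = (
    [("130**", "<30", "*")] * 3
    + [("1485*", ">40", "*")] * 4
    + [("130**", "3*", "*")] * 2
)

risk = prosecutor_risk(rows)
print(risk)
```

The classes have 3, 4 and 2 rows, so the smallest class (2 rows) determines the overall risk.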
Evaluate journalist risk
› Anonymized data is a subset of a larger public database
› Risk is determined by the smallest equivalence class in the public dataset that maps to the anonymized dataset

Equivalence class   Anonymized dataset  Public dataset
Gender  Born        Count  Records      Count  Records
Male    1950-1959   3      1,4,12       4      1,4,12,27
Male    1960-1969   2      2,14         5      2,14,15,22,26
Male    1970-1979   2      9,10         5      9,10,16,20,23
Female  1960-1969   2      7,11         5      7,11,18,19,21
Female  1970-1979   2      6,13         5      6,13,17,24,25

Overall probability of re-identification = 1/4
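Under the journalist model the class sizes are taken from the public dataset rather than from the released one. The sketch below uses the counts from the table. The function name is illustrative, and the second figure is only there for comparison with the prosecutor view of the same release.

```python
from fractions import Fraction

# Equivalence class -> (count in anonymized dataset, count in public dataset)
counts = {
    ("Male", "1950-1959"): (3, 4),
    ("Male", "1960-1969"): (2, 5),
    ("Male", "1970-1979"): (2, 5),
    ("Female", "1960-1969"): (2, 5),
    ("Female", "1970-1979"): (2, 5),
}

def journalist_risk(counts):
    """1 / size of the smallest public class that maps to the anonymized data."""
    return Fraction(1, min(public for _, public in counts.values()))

def prosecutor_risk(counts):
    """For comparison: 1 / size of the smallest class in the anonymized data."""
    return Fraction(1, min(sample for sample, _ in counts.values()))

j_risk = journalist_risk(counts)
p_risk = prosecutor_risk(counts)
print("journalist:", j_risk, "prosecutor:", p_risk)
```

The journalist risk is lower because the attacker cannot tell which members of a public class are actually in the released sample.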
Determining threshold value
› Threshold helps to interpret re-identification probability
› Actual risk > threshold → unacceptable!
› In practice, for health data: minimal equivalence class size of 5, i.e. threshold probability = 1/5 = 0.2
› Often, a large fraction of records have a re-identification risk << threshold
› Metric: cumulative distribution of risk values
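The cumulative distribution of risk values can be computed directly from the equivalence class sizes. The class sizes below are hypothetical, and the risk of a record is taken as one over the size of its class (the prosecutor view).

```python
from fractions import Fraction

def cumulative_share(risks, limits):
    """For each limit, the fraction of records whose risk is at or below it."""
    n = len(risks)
    return {limit: Fraction(sum(r <= limit for r in risks), n) for limit in limits}

k = 5
threshold = Fraction(1, k)              # minimal class size 5 -> 0.2

class_sizes = [5, 10, 50]               # hypothetical equivalence class sizes
risks = [Fraction(1, size) for size in class_sizes for _ in range(size)]

shares = cumulative_share(risks, [Fraction(1, 50), Fraction(1, 10), threshold])
for limit, share in shares.items():
    print(f"risk <= {float(limit):.2f}: {float(share):.1%} of records")
```

In this toy case every record is at or below the threshold, but most records sit far below it, which is the pattern the cumulative plot is meant to reveal.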
[Screenshot: risk analysis of the input dataset and the transformed dataset side by side]
Input dataset
[Plot: records affected [%] against prosecutor re-identification risk [%]; exponential growth]
› 14.1% of the records have a risk between 33% and 50% (bin ]33, 50])
› 69.3% of the records have a risk between 50% and 100% (bin ]50, 100])
Output dataset
[Plot: records affected [%] and cumulative risk against prosecutor re-identification risk [%]; exponential decay]
› 33.1% of the records have a risk between 0.1% and 1% (bin ]0.1, 1])
› 78.1% of the records have a risk under 5% (cumulative, up to bin ]4, 5])
› All records have a risk under the threshold

Analyze combinations of attributes regarding associated risks of re-identification
1 Select attributes as a target variable
High distinction and separation → indicators that the quasi-identifiers are more likely to allow re-identification
Risk measures, shown for each attacker model (Prosecutor, Journalist, Marketer) and for both the input and the output dataset:
› Records at risk = records with risks above the threshold
› Highest risk = highest risk of a single record
› Success rate = average # records that can be re-identified
HIPAA Identifiers
› Health Insurance Portability and Accountability Act (1996)
› Provides security provisions for safeguarding medical information
› Specifies 18 identifiers that must be modified or removed
Conclusion
[Figure: workflow overview, partially recovered: 1. Annotate dataset, 2. Create generalization rules, …; analysis of utility loss and privacy risk]
