
CS 6290

Data Anonymization and


Differential Privacy
Department of Computer Science
City University of Hong Kong

Slide credits in part to Graham Cormode, Vitaly Shmatikov, Ashwin Machanavajjhala, Michael Hay, Xi He,
Tianhao Wang, Zhan Qin, Elaine Shi, and others.

CS6290 Privacy-enhancing Technologies 1


Aggregated Personal Data is Invaluable

CS6290 Privacy-enhancing Technologies 2


Aggregated Personal Data
• … is made publicly available in many forms.

CS6290 Privacy-enhancing Technologies 3


Personal Data is … Well … Personal!

CS6290 Privacy-enhancing Technologies 4


Data Release Horror Stories

We need to solve this data release problem ...

CS6290 Privacy-enhancing Technologies 5


But Why Release Data?

• Urban mobility modeling at scale


• Smart traffic design is critical; congestion is getting worse

CS6290 Privacy-enhancing Technologies 6


Big Data is Expected to Solve Big Problems
• The big data concept has taken off
• Many organizations have adopted it
• Entered the public vocabulary

• But this data is mostly about individuals


• Individuals want privacy for their data

• How can researchers work on sensitive data?


• The easy answer: anonymize it and share
• The problem: we don’t know how to do this

CS6290 Privacy-enhancing Technologies 7


Mobility tracking for COVID-19
MTR Octopus

https://www.scmp.com/news/hong-kong/health-environment/article/3078191/coronavirus-octopus-provide-data-city-residents
2 April 2020
CS6290 Privacy-enhancing Technologies 8
Mobility tracking for COVID-19
Google’s Mobility Report

https://www.google.com/covid19/mobility/
2 April 2020
CS6290 Privacy-enhancing Technologies 11
CS6290 Privacy-enhancing Technologies 12
Then, how does differential privacy protect
personal privacy with deliberately added noise?

CS6290 Privacy-enhancing Technologies 13


Differential Privacy?

CS6290 Privacy-enhancing Technologies 14


Outline: Record Level Anonymization

• Pre-1997: hash identifiers
• 1997: linking attack [S07] (multiple QIs ≈ identifier)
• 1998: k-anonymity [SS98]
• 2006+: syntactic attacks
• 2006: differential privacy [DM+06] (a semantic notion of privacy)
• 2009: de Finetti's theorem attack [K09] (a machine-learning attack)
CS6290 Privacy-enhancing Technologies 15
Basic Application Setting

Data “Anonymization”
allows the data user (e.g., an analyst) to query the database for aggregated
results while preventing the identification of any particular person X.

CS6290 Privacy-enhancing Technologies 16


Data “Anonymization”
• How?

• Remove “personally identifying information” (PII)


• Name, ID card number, phone number, bank account, email,
address… what else?

• Problem: PII has no technical meaning


• In privacy breaches, any information can be personally
identifying
• Examples: AOL search log dataset (search habits), Netflix prize
dataset (movie preferences), …

CS6290 Privacy-enhancing Technologies 17


The Massachusetts Governor
Privacy Breach
[Sweeney, 2010]

Name | SSN | Visit date | Diag. | Zip | Birth | Sex | Others
David | 10201 | 08/08/2022 | Heart | 99988 | 08/08/1968 | M | …

CS6290 Privacy-enhancing Technologies 18


The Massachusetts Governor
Privacy Breach
[Sweeney, 2010]

Released medical data (identifiers removed):
Visit date | Diag. | Zip | Birth | Sex | Others
08/08/2022 | Heart | 99988 | 08/08/1968 | M | …

Public voter registration list:
Zip | Birth | Sex | Name | Address | Others
99988 | 08/08/1968 | M | David | UK | …

CS6290 Privacy-enhancing Technologies 19


Linkage Attack
[Sweeney, 2010]

CS6290 Privacy-enhancing Technologies 20


NYC Taxi Dataset: Anonymization with
Hashing
NYC Taxi Dataset
Obtained by Freedom of Information request
Contained 173 million individual trips with:
• Locations and times for pickup and drop-off
• Driver (hack) license numbers "anonymized" using a hash function
• Medallion numbers also hashed
• Other metadata

medallion, hack license, vendor id, pickup datetime, dropoff datetime, fees, trip distance, …
6B111958A39B24140C973B262EA9FEA5, D3B035A03C8A34DA17488129DA581EE7, VTS, 2013-12-03 15:46:00,
2013-12-03 16:47:00, 120, 3360, ...

CS6290 Privacy-enhancing Technologies [https://tech.vijayp.ca/of-taxis-and-rainbows-f6bc289679a1]


21
NYC Taxi Dataset: Anonymization with
Hashing

medallion, hack license, vendor id, pickup datetime, dropoff datetime, fees, trip distance, …
CFCD208495D565EF66E7DFF9F98764DA, D3B035A03C8A34DA17488129DA581EE7, VTS, 2013-12-03 15:46:00,
2013-12-03 16:47:00, 120, 3360, ...

Discover that someone’s medallion number is ‘0’ (the hash above is the MD5 of the string "0").


He earns $120 today…
His trace? Working habits? …

CS6290 Privacy-enhancing Technologies [https://tech.vijayp.ca/of-taxis-and-rainbows-f6bc289679a1]


22
NYC Taxi Dataset: Not quite anonymization
[Hal14]
NYC Taxi Dataset
Hashing doesn’t work when we have knowledge of the format of the data
• Medallion and Taxi license numbers are 6-digit
• Map-reduce to crunch away at all hashes for an hour
• Discover the numbers that produce the hashes
• Map taxi license numbers to the driver's name using external dataset (license number,
name)
• Can figure out addresses, work schedules, gross salary, locations!

Lessons Learned
Hashing might not be a good idea for anonymization.
Use cryptographically strong encryption (e.g., AES) and/or random pseudonyms instead.

CS6290 Privacy-enhancing Technologies 23


The Netflix Prize
The Netflix Prize Dataset [Narayanan06]
2006: Netflix publishes movie rating data of 480K users
• Meant to be used for recommendation system research

• Idea: Cross-reference with IMDB


• Results: Knowing 8 ratings (w/dates) identifies 90% of users
• People rated movies on Netflix that they did not rate on
IMDB

CS6290 Privacy-enhancing Technologies 24


Observation #1: Dataset Joins
• Attacker learns sensitive data by joining two datasets on common attributes
• Anonymized dataset with sensitive attributes
• Example: age, race, symptoms
• “Harmless” dataset with individual identifiers
• Example: name, address, age, race

• Demographic attributes (age, ZIP code, race, etc.) are common in datasets
with information about individuals

CS6290 Privacy-enhancing Technologies 25


Observation #2: Quasi-Identifiers
• Sweeney’s observation: (birthdate, ZIP code, gender) uniquely
identifies 87% of US population
• Side note: actually, only 63% [Golle, WPES ‘06]

• Publishing a record with a quasi-identifier is as bad as publishing it


with an explicit identity

• Eliminating quasi-identifiers is not desirable


• For example, users of the dataset may want to study distribution of diseases
by age and ZIP code

CS6290 Privacy-enhancing Technologies 26


k-Anonymity
• Proposed by Samarati and Sweeney (1998)

• Hundreds of papers since then


• Extremely popular in the database and data mining communities
(SIGMOD, ICDE, KDD, VLDB)

• Most based on generalization and suppression

CS6290 Privacy-enhancing Technologies 27


Anonymization in a Nutshell
• Dataset is a relational table
• Attributes (columns) are divided into quasi-identifiers
and sensitive attributes

Quasi-identifiers: Race, Age
Sensitive attributes: Symptoms, Blood type, Medical history

Race | Age | Symptoms | Blood type | Medical history
…    | …   | …        | …          | …

• Generalize/suppress quasi-identifiers, don’t touch


sensitive attributes (keep them “truthful”)

CS6290 Privacy-enhancing Technologies 28


k-Anonymity: Definition
• Each (transformed) quasi-identifier group must appear in at
least k records in the anonymized dataset
• k is chosen by the data owner (how?)
• Example: any age-race combination from original DB must appear at
least 10 times in anonymized DB

• Guarantees that any join on quasi-identifiers with the


anonymized dataset will contain at least k records for each
quasi-identifier
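
As an illustration, a minimal sketch of checking this definition on a toy table (function and field names are illustrative, not from the slides):

    from collections import Counter

    def is_k_anonymous(records, quasi_identifiers, k):
        # Every combination of quasi-identifier values must occur in >= k records.
        groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
        return all(count >= k for count in groups.values())

    table = [
        {"race": "Caucas", "zip": "787XX", "diagnosis": "Flu"},
        {"race": "Caucas", "zip": "787XX", "diagnosis": "Flu"},
        {"race": "Caucas", "zip": "787XX", "diagnosis": "Flu"},
    ]
    print(is_k_anonymous(table, quasi_identifiers=["race", "zip"], k=3))   # True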

CS6290 Privacy-enhancing Technologies 29


Achieving k-Anonymity
• Generalization
• Individual values of attributes replaced by broader category
• Prefix (area code) instead of full phone number: 3442 8765 → 3442 XXXX
• Value “23” of the age attribute replaced by the range 20 < Age ≤ 30

• Suppression
• Replace certain values of the attributes by an asterisk '*’.
• Example: replace all the values in the 'Name' attribute with a ‘*’.

• Lots of algorithms in the literature


• Aim to produce “useful” anonymizations, though … usually without any
clear notion of utility
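
A minimal sketch of the generalization and suppression operations listed above (the helper names and the ZIP/age rules are illustrative):

    def generalize_zip(zipcode, keep_digits=3):
        # Generalize a ZIP code by masking trailing digits: 78712 -> 787XX
        return zipcode[:keep_digits] + "X" * (len(zipcode) - keep_digits)

    def generalize_age(age, bucket=10):
        # Replace an exact age with a range: 23 -> '20-30'
        low = (age // bucket) * bucket
        return f"{low}-{low + bucket}"

    def suppress(_value):
        # Suppression: replace the value with '*'
        return "*"

    record = {"name": "David", "zip": "78712", "age": 23}
    anonymized = {
        "name": suppress(record["name"]),        # suppress a direct identifier
        "zip": generalize_zip(record["zip"]),    # generalize a quasi-identifier
        "age": generalize_age(record["age"]),
    }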
CS6290 Privacy-enhancing Technologies 30
Example: 3-Anonymity

Original:                    3-anonymous version:
Caucas  78712  Flu           Caucas       787XX  Flu
Asian   78705  Shingles      Asian/AfrAm  78705  Shingles
Caucas  78754  Flu           Caucas       787XX  Flu
Asian   78705  Acne          Asian/AfrAm  78705  Acne
AfrAm   78705  Acne          Asian/AfrAm  78705  Acne
Caucas  78705  Flu           Caucas       787XX  Flu

This is 3-anonymous, right?


CS6290 Privacy-enhancing Technologies slide 31
Problem of k-anonymity
When joining with an external database, the adversary learns that Rusty Shackleford has Flu:
his quasi-identifier group is 3-anonymous, but all three records in it share the same sensitive value (a homogeneity attack)

CS6290 Privacy-enhancing Technologies 32


Other Attempts
• L-diversity
• Entropy of the sensitive attribute within each quasi-identifier group must be at
least log(L)
• t-closeness
• Distribution of sensitive attributes within each quasi-identifier group should
be “close” to their distribution in the entire original database

CS6290 Privacy-enhancing Technologies 33


Issues with Syntactic Definitions
• What adversary do they apply to?
• Do not model adversaries with side information
• Do not reason about the probability of a successful inference
• Do not consider the adversary’s algorithms for making decisions (inference)

• Any attribute is a potential quasi-identifier


• External/auxiliary/background information about people is very easy to
obtain

CS6290 Privacy-enhancing Technologies 34


Differential Privacy
• Statistical outcome is indistinguishable regardless of whether
a particular user record is in the data or not
• “Whatever is learned would be learned regardless of whether or
not you participate”

CS6290 Privacy-enhancing Technologies 35


Indistinguishability
(Figure) A database DB = (x1, x2, x3, …, xn-1, xn) is given to a sanitizer San, which uses random coins to answer query 1, …, query T, producing a transcript S (answer 1, …, answer T).
A neighboring database DB’ = (x1, x2, y3, …, xn-1, xn), differing in one row, produces a transcript S’.
Requirement: the distance between the distributions of S and S’ is at most ε.
CS6290 Privacy-enhancing Technologies slide 36
An Example: Statistical Data Release

Clear
distance
between
distributions

CS6290 Privacy-enhancing Technologies 37


An Example: Statistical Data Release

Indistinguishable
Distributions
after adding Alice

CS6290 Privacy-enhancing Technologies 38


Formalizing Indistinguishability
[Dwork, ICALP’06]
For every pair of neighboring databases D, D′ that differ in only one record, and for every output S:
Pr[A(D) = S] ≤ e^ε · Pr[A(D′) = S]

CS6290 Privacy-enhancing Technologies 39


1# Why Every Pair of Neighboring Databases?

For every pair of neighboring databases that differ in only one record, and for every output

Definition: Two datasets 𝐷, 𝐷′ ⊂ 𝑋 are neighbors


if they differ in the data of a single individual.
CS6290 Privacy-enhancing Technologies 40
2# Why for Every Output?

For every pair of neighboring databases that differ in only one record, and for every output

Should not be able to distinguish


whether the input was D1 or D2 no
matter what the output is

CS6290 Privacy-enhancing Technologies 41


Privacy Parameter ε
For every pair of neighboring databases that differ in only one record, and for every output

CS6290 Privacy-enhancing Technologies 42


Privacy Budget
• ε is also called the privacy budget
Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S]

The analyst learns approximately the same information if any single data
record is replaced by another one: the ratio of the two output probabilities is bounded by e^ε

CS6290 Privacy-enhancing Technologies 43


Differential Privacy Example

Smaller ε means better privacy, but more noise is needed and utility is reduced.

CS6290 Privacy-enhancing Technologies 44


ε in Practice
• Perspective taken by theory: Pick 𝜖, prove (or evaluate) accuracy.
• Realities of practice: Hard accuracy requirements.
• Find the smallest level of 𝜖 consistent with accuracy targets.
• How to do this?
• Search incurs privacy overhead…
• Privacy parameter is now a data-dependent quantity. What is the semantics?
• Constant factors can be meaningful…

CS6290 Privacy-enhancing Technologies 45


Basic Algorithm
introducing properly calibrated noise
with the Laplace mechanism…

CS6290 Privacy-enhancing Technologies 46


Output Randomization

CS6290 Privacy-enhancing Technologies 47


Laplace Mechanism

Lap(b)
Note that Lap(b) has mean μ = 0 and variance σ² = 2b²
Parameter b controls the scale of the noise η

CS6290 Privacy-enhancing Technologies 48


The probability density function of the Laplace distribution:
f(x | μ, b) = (1 / (2b)) · exp(−|x − μ| / b)

η ~ Lap(μ, b) is written as η ~ Lap(b) when μ = 0

• μ: position (location) parameter
• b: scale parameter, b > 0
– smaller b gives a sharper distribution, i.e., a higher chance of sampling values near 0

When sampling noise η from the Laplace distribution:
• Mean of noise η: μ = 0
• Variance of noise η: σ² = 2b²
• Values of η near 0 are most likely; values farther away (η = ±1, ±2, ±3, …) occur with decreasing probability
CS6290 Privacy-enhancing Technologies 49
https://en.wikipedia.org/wiki/Laplace_distribution
How much noise is enough for privacy?

i.e., sensitivity is the maximum change in the output caused by adding or
removing any one record in the database.

Laplace mechanism: add noise η ~ Lap(b) with scale b = GS(q)/ε

CS6290 Privacy-enhancing Technologies 50


Global Sensitivity

For a query q, the global sensitivity GS(q) is the largest possible change in q’s answer between any two neighboring databases:
GS(q) = max over neighboring D, D′ of ‖q(D) − q(D′)‖₁
CS6290 Privacy-enhancing Technologies 51


GS example : COUNT query
• Number of people having the disease
• Global sensitivity: s = 1
• b = s/ε = 1/ε
• ε ∈ (0, 1)
Smaller ε → larger b → higher chance of sampling values far from 0 (larger noise) → stronger privacy

• Query q(D) → q(D) + η = q(D) + Lap(b)


• Sampling noise η from Lap(b) is enough for ε-DP.
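
A minimal sketch of this noisy COUNT release (assuming NumPy; the patient records and predicate are illustrative):

    import numpy as np

    def laplace_count(records, predicate, epsilon):
        # epsilon-DP count: GS of a COUNT query is 1, since adding or removing
        # one record changes the count by at most 1, so b = 1/epsilon suffices.
        true_count = sum(1 for r in records if predicate(r))
        noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
        return true_count + noise

    # Illustrative usage: how many patients have the disease?
    patients = [{"disease": True}, {"disease": False}, {"disease": True}]
    noisy_answer = laplace_count(patients, lambda r: r["disease"], epsilon=0.5)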

CS6290 Privacy-enhancing Technologies 52


Examples of Low Global Sensitivity

• Average: SUM/COUNT
• SUM of age, GS upper bound is around 120, why?
• Histograms and contingency tables (see the sketch after this list)
• Covariance matrix
• Many data-mining algorithms can be implemented through a sequence of
low-sensitivity queries
• Perceptron (foundational algorithm in machine learning), some Expectation-
Maximization (EM) algorithms, Statistical Query (SQ) learning algorithms
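
For instance, a minimal sketch of the ε-DP histogram mentioned above, assuming add/remove neighboring databases so that one record changes exactly one bin count by at most 1 (names are illustrative):

    import numpy as np

    def dp_histogram(values, bins, epsilon):
        # Each record falls into exactly one bin, so adding/removing a record
        # changes a single count by 1 (L1 sensitivity 1): Lap(1/epsilon) per bin
        # suffices for epsilon-DP.
        counts, edges = np.histogram(values, bins=bins)
        noisy = counts + np.random.laplace(scale=1.0 / epsilon, size=counts.shape)
        return noisy, edges

    ages = np.random.randint(0, 100, size=1000)          # illustrative data
    noisy_counts, bin_edges = dp_histogram(ages, bins=10, epsilon=0.5)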

CS6290 Privacy-enhancing Technologies 53


Examples of High Global Sensitivity

• Order statistics
• Clustering

CS6290 Privacy-enhancing Technologies 54


Composition Theorems
enriching the DP mechanisms…

CS6290 Privacy-enhancing Technologies 55


Why composition?

Determine the privacy guarantee for each algorithm step and deduce
the overall privacy guarantee.
CS6290 Privacy-enhancing Technologies 56
Privacy Budget Allocation
• Allocation between users/computation providers (auction?)
• Allocation between tasks
• In-task allocation
– Between iterations
– Between multiple statistics
– Optimization problem: no satisfactory solution yet!

CS6290 Privacy-enhancing Technologies 57


Privacy Budget Allocation Example: K-Means

CS6290 Privacy-enhancing Technologies 58


When the Budget Is Exhausted

CS6290 Privacy-enhancing Technologies 59


A Bound on the Number of Queries

The privacy guarantee declines under repeated queries:
because the noise has zero mean, averaging many repeated answers cancels the noise out…
CS6290 Privacy-enhancing Technologies 60
Sequential Composition

All the algorithms run on the same database, one after another: if algorithm Mi satisfies εi-DP, the combined release satisfies (Σi εi)-DP.

CS6290 Privacy-enhancing Technologies 61


Parallel Composition

All the algorithms run on disjoint sections of the same


database.
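(If each algorithm Mi satisfies εi-DP on its own disjoint partition, the combined release satisfies (maxi εi)-DP.)

As a minimal sketch of how these two rules are used in practice, the accountant below tracks the sequential (additive) budget and refuses queries once it is spent; the class and its simple additive rule are illustrative assumptions, not a standard library API:

    import numpy as np

    class BudgetAccountant:
        # Sequential composition: epsilons of queries on the same data add up.
        def __init__(self, total_epsilon):
            self.total = total_epsilon
            self.spent = 0.0

        def laplace_query(self, true_answer, sensitivity, epsilon):
            if self.spent + epsilon > self.total:
                raise RuntimeError("privacy budget exhausted: refuse further queries")
            self.spent += epsilon
            return true_answer + np.random.laplace(scale=sensitivity / epsilon)

    acct = BudgetAccountant(total_epsilon=1.0)
    a1 = acct.laplace_query(true_answer=42, sensitivity=1, epsilon=0.4)   # ok
    a2 = acct.laplace_query(true_answer=42, sensitivity=1, epsilon=0.4)   # ok
    # a third query with epsilon=0.4 would exceed the budget of 1.0 and be refused
    # (queries on disjoint partitions would, by parallel composition, only cost max epsilon)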
CS6290 Privacy-enhancing Technologies 62
Postprocessing

e.g.,
M1(D) = q(D) + η
M2(x) = x +2
M2(M1(D)) = q(D) + 2 + η

Subsequent computations on a DP output do not change the original DP guarantee:
if M1 is ε-DP, then M2 ∘ M1 is still ε-DP.

CS6290 Privacy-enhancing Technologies 64


Summary on DP
• Differential privacy ensures that an attacker can’t infer the presence or
absence of a single record in the input based on any output

• Basic algorithm with random perturbation


• Laplace mechanism

• Composition rules help build complex algorithms using building blocks


• e.g., each building block is a Laplace mechanism with different ε

CS6290 Privacy-enhancing Technologies 65


Approximate Differential Privacy
relaxing the ε-DP mechanisms (pure DP) for
better composition utility…

CS6290 Privacy-enhancing Technologies 66


Review: ε-Differential Privacy (Pure DP)

A randomized algorithm A satisfies ε-differential privacy if:


Given two neighboring databases that differ in one record, D
and D’, then for any output S and ε > 0, we have:
Pr[A(D) = S] ≤ e^ε · Pr[A(D′) = S]

• Can achieve ε-DP for counts by adding a random noise value


e.g., the Laplace mechanism
• Uncertainty due to noise → plausible deniability
• Hides whether someone is present in the data
CS6290 Privacy-enhancing Technologies 67
(ε, δ)-Differential Privacy (Approximate DP)

A randomized algorithm A satisfies (ε, δ)-differential privacy if:


Given two neighboring databases that differ in one record, D
and D’, then for any output S, ε > 0, and δ ∈ (0, 1), we have:
Pr[A(D) = S] ≤ e^ε · Pr[A(D′) = S] + δ

• Introduce a new privacy parameter, δ, which allows a “failure


probability” for the ε-DP definition.
δ = 0, go back to ε-DP;
With probability 1-δ, we will get the same guarantee as ε-DP;
With probability δ, we get no guarantee. δ should be very small.
CS6290 Privacy-enhancing Technologies 68
(ε, δ)-Differential Privacy (Approximate DP)

A randomized algorithm A satisfies (ε, δ)-differential privacy if:


Given two neighboring databases that differ in one record, D
and D’, then for any output S, ε > 0, and δ ∈ (0, 1), we have:
Pr[A(D) = S] ≤ e^ε · Pr[A(D′) = S] + δ

Relaxation

CS6290 Privacy-enhancing Technologies 69


(ε, δ)-Differential Privacy (Approximate DP)
A randomized algorithm A satisfies (ε, δ)-differential
privacy if: Pr[A(D) = S] ≤ e^ε · Pr[A(D′) = S] + δ

• Introducing δ allows a “failure probability” for the ε-DP definition and


meanwhile reduces the privacy burden on ε
• To achieve the same privacy level ε, the Gaussian mechanism used for (ε, δ)-DP adds
more noise and is less accurate than the Laplace mechanism used for ε-DP.
Blue: Laplacian noise for ε-DP
Orange: Gaussian noise for (ε, δ)-DP
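
As a minimal sketch of the Gaussian mechanism behind the orange curve, using the classical calibration σ = √(2 ln(1.25/δ)) · Δ₂ / ε (valid for ε < 1); function and variable names are illustrative:

    import numpy as np

    def gaussian_mechanism(true_answer, l2_sensitivity, epsilon, delta):
        # (epsilon, delta)-DP via Gaussian noise, classical calibration for epsilon < 1
        sigma = np.sqrt(2.0 * np.log(1.25 / delta)) * l2_sensitivity / epsilon
        return true_answer + np.random.normal(loc=0.0, scale=sigma)

    # e.g., a count (L2 sensitivity 1) released with (0.5, 1e-5)-DP
    noisy_count = gaussian_mechanism(1234, l2_sensitivity=1.0, epsilon=0.5, delta=1e-5)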

If so, why bother with (ε, δ)-DP?


CS6290 Privacy-enhancing Technologies 70
(ε, δ)-DP (Excellent composition performance)
If M1, M2, M3, … are (ε, δ)-DP algorithms that access a private database D, then the
composition of k of them satisfies (ε′, δ′)-DP where, by the advanced composition theorem,
ε′ = √(2k ln(1/δ″)) · ε + k · ε · (e^ε − 1) and δ′ = kδ + δ″
i.e., ε grows only sublinearly (roughly ∝ √k · ε) in the number of compositions k.

• In ε-DP, sequential composition gives ε′ = kε: linear growth of ε with k.

The composed privacy budget of (ε, δ)-DP therefore increases much more slowly than that of
ε-DP during composition (recall: ε ↑ means the privacy guarantee ↓).
(ε, δ)-DP is thus more friendly to stacking building blocks.
CS6290 Privacy-enhancing Technologies 71
Gaps between Theory and Practice
• The academic literature has focused largely on the “centralized”
model of differential privacy:

• Industry applications have favored the “local model”:

CS6290 Privacy-enhancing Technologies 72


Trying to Reduce Trust
• Most work on differential privacy relies on a central model
• Data aggregator (e.g., organizations) that sees the true, raw data
• Can compute exact query answers, then add noise for privacy

• A reasonable question: can we reduce the centralized trust?


• Can we remove the central aggregator from the equation?
• Users produce locally private query results and send them directly to the data
user, who aggregates the results

CS6290 Privacy-enhancing Technologies 73


DP mechanisms designed for decentralized setting

Statistical queries, data mining

Local Differential Privacy (LDP)

CS6290 Privacy-enhancing Technologies 74


Local Differential Privacy
• What about having each user run a DP algorithm on their data?
• Then combine all the results to get a final answer

• At first glance, this idea seems crazy


• Each user adds noise to mask their own input
• Do we have enough noise to protect the signal?
• Will the signal be overwhelmed by noise?

• Notice that noise can cancel out or be subtracted out


• The sampled noise has zero mean
• …, -2, -1, 0, 1, 2, …
CS6290 Privacy-enhancing Technologies 75
Frequency Estimation Protocols

• Direct Encoding (Generalized Random Response) [Warner’65]


• Generalize binary attribute to arbitrary domain
• Unary Encoding (Basic One-time RAPPOR) [Erlingsson et al. ’14]
• Encode into a bit-vector and perturb each bit
• Binary Local Hash [Bassily and Smith’15]
• Encode by hashing and then perturb

CS6290 Privacy-enhancing Technologies 76


LDP with a coin toss: Randomized Response
[Warner 65]
• Each user has a single bit (0 or 1) of private information
• Encoding e.g. political/sexual/religious preference, illness, etc.

• Randomized Response (RR): toss an unbiased coin


• If Heads (prob. = ½ ), report the true answer
• Else, toss unbiased coin again
• If Heads (prob. = ½ ), report true answer
• Else, report the opposite

CS6290 Privacy-enhancing Technologies 77


LDP with a coin toss: Randomized Response
[Warner 65]
With N respondents, let p be the fraction whose true answer is ‘Yes’ and X the number of ‘Yes’ responses observed:
• First coin heads (prob. 1/2): respondents answer truthfully, contributing Np/2 expected ‘Yes’ and N(1−p)/2 expected ‘No’
• First coin tails (prob. 1/2): the second coin makes the report independent of the truth, contributing N/4 expected ‘Yes’ and N/4 expected ‘No’
• So E[X] = Np/2 + N/4, from which p can be estimated as p̂ = 2X/N − 1/2

• Collect responses from N users, cancel the noise, and estimate the truth
• The estimation error is proportional to 1/√N: more participants, more accurate
• RR satisfies ε-DP with ε = ln((3/4)/(1/4)) = ln 3
• Generalization: allow two biased coins with different probs (p, q ≠ ½)
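
A minimal sketch of the two-unbiased-coin protocol above and its debiasing step (assuming NumPy; names are illustrative):

    import numpy as np

    def randomized_response(true_bit):
        # Report the true bit with prob. 3/4 and the opposite with prob. 1/4.
        if np.random.rand() < 0.5:                     # first coin: heads -> truth
            return true_bit
        return true_bit if np.random.rand() < 0.5 else 1 - true_bit   # second coin

    def estimate_p(reports):
        # E[X] = N*p/2 + N/4, so p_hat = 2*X/N - 1/2
        N, X = len(reports), sum(reports)
        return 2.0 * X / N - 0.5

    # Illustrative usage: 10,000 users, 30% of whom truly answer 'Yes'
    truth = np.random.rand(10_000) < 0.3
    reports = [randomized_response(int(t)) for t in truth]
    p_hat = estimate_p(reports)          # close to 0.3; error shrinks like 1/sqrt(N)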
CS6290 Privacy-enhancing Technologies 78
Generalized Randomized Response
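
The figure on this slide is not reproduced here; as a hedged sketch, the standard d-ary generalization of randomized response keeps the true value with probability p = e^ε / (e^ε + d − 1) and otherwise reports one of the other d − 1 values uniformly at random:

    import numpy as np

    def generalized_rr(true_value, domain, epsilon):
        # d-ary randomized response: an epsilon-LDP report of one value from `domain`
        d = len(domain)
        p_keep = np.exp(epsilon) / (np.exp(epsilon) + d - 1)
        if np.random.rand() < p_keep:
            return true_value
        others = [v for v in domain if v != true_value]
        return others[np.random.randint(len(others))]   # any other value, uniformly

    domain = ["<18", "18-30", "31-50", ">50"]            # illustrative categorical domain
    report = generalized_rr("18-30", domain, epsilon=1.0)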

CS6290 Privacy-enhancing Technologies 79


LDP in practice
• Differential privacy based on RR is widely deployed!
• In Google Chrome browser, to collect browsing statistics
• In Apple iOS and MacOS, to collect typing statistics
• In Microsoft Windows, to collect telemetry data over time
• In Snap, to perform modeling of user preference
• Each yields deployments of over 100 million users
• All deployments are based on RR with customized designs
• To handle the diverse values a user might have
• Local Differential Privacy is state of the art in 2020+
• Randomized response invented in 1965: five decades ago!

CS6290 Privacy-enhancing Technologies 80


Google’s RAPPOR
• Each user has one value out of a very
large set of possibilities
• E.g. their favourite URL, www.nytimes.com

• Attempt 1: run RR for all values: google.com? Bing.com? …


• User has sparse binary vector with one 1: run RR on every bit
• Slow: sends 1 bit for every possible website in the vector
• Limited: can’t handle new possibilities being added
• Try to do better by reducing domain size through hashing

CS6290 Privacy-enhancing Technologies 81


RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response
Hashing with Bloom Filters
item

0 1 0 0 0 1 0 0 1 0
m-bit

• Bloom filters compactly encode set membership


• Use k different hash functions to hash an item into k values between 1 and m
• Update: set all k hashed positions in the m-bit vector to 1 to indicate the presence of the item
• Query: can look up items; stores n items in O(n) bits
• Collision: choose k and m to obtain a small false-positive probability
• Vectors (of the same size) can be merged by OR-ing them
• Duplicate insertions do not change Bloom filters

CS6290 Privacy-enhancing Technologies 82


Counting Bloom Filters & Randomized Response
item

0/1 1/0 0/1 0/1 0/1 1/0 0/1 0/1 1/0 0/1

• Idea: apply Randomized Response to the bits in a Bloom filter


• Use fewer bits in the filter compared to representing each website with one bit
• Each user hashes their input into k bits of 1 in the filter
• Sensitivity = 2k, since removing one item and adding another changes at most 2k bits
(each item is hashed to 1s at up to k different filter positions)
• Combine all user reports and observe how often each bit is set
• Since bloom vectors can be merged with OR-ing operation
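
A simplified RAPPOR-like sketch of these two steps (hash the item into a Bloom filter, then perturb each bit); the hash construction and parameters are illustrative assumptions, not Google’s exact implementation:

    import hashlib
    import numpy as np

    M, K = 128, 2            # filter size and number of hash functions (illustrative)

    def bloom_bits(item, m=M, k=K):
        # Hash `item` into k positions of an m-bit Bloom filter.
        bits = np.zeros(m, dtype=int)
        for i in range(k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            bits[int(h, 16) % m] = 1
        return bits

    def perturb(bits, epsilon):
        # Flip each bit independently, keeping it with prob. e^eps / (e^eps + 1);
        # by the sensitivity-2k argument above, the report's total budget scales with 2k.
        p_keep = np.exp(epsilon) / (np.exp(epsilon) + 1)
        flips = np.random.rand(len(bits)) >= p_keep
        return np.where(flips, 1 - bits, bits)

    report = perturb(bloom_bits("www.nytimes.com"), epsilon=1.0)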

CS6290 Privacy-enhancing Technologies 83


Decoding Noisy Counting Bloom Filters
• Each data owner creates their own Bloom filter, perturbs each bit with
randomized response, and then sends it to the data user.

• The data user will estimate the frequency of a particular item:


• Look up the item’s bit locations in the collected bloom filter
• Compute an unbiased estimate of the item’s presence bit
• Take the minimum of these estimates as the frequency
• Try advanced methods to decode all items at once

CS6290 Privacy-enhancing Technologies 84


RAPPOR in practice

• The RAPPOR approach is implemented in the Chrome browser


• Collects data from opt-in users, tens of millions per day
• Open source implementation available
• Tracks settings in the browser, e.g. home page, search engine
• Many users unexpectedly change home page → possible malware
• Typical configuration:
• 128 bit Bloom filter, 2 hash functions, privacy parameter ~0.5
• Needs about 10K reports to identify a value with confidence

CS6290 Privacy-enhancing Technologies 85


Apple’s Differential Privacy in Practice

• Apple uses their system to collect data from iOS and OS X users
• Popular emojis: (heart) (laugh) (smile) (crying) (sadface)
• “New” words: bruh, hun, bae, tryna, despacito, mayweather
• Which websites to mute, which to autoplay audio on!
• Deployment settings: w = 1000, d = 1000, ε = 2 to 8 (some criticism)
CS6290 Privacy-enhancing Technologies 86
Microsoft telemetry data collection

• Microsoft wants to collect data on app usage


• How much time was spent on a particular app today?
• Allows finding patterns over time
• Makes use of multiple subroutines:
• 1BitMean to collect numeric data
• dBitFlip to collect (sparse) histogram data
• Memoization and output perturbation to allow repeated probing
• Has been implemented in Windows since 2017

CS6290 Privacy-enhancing Technologies 87


MS Telemetry Collection in Practice
• Deployed in Windows 10 Fall Creators Update
(October 2017)
• Collects number of seconds users spend in different apps
• Parameters: ε =1
• Collection period: every 6 hours
• Collects data on all app usage, not just one at a time
• Can analyze based on the fact that total time spent is
limited
• Gives overall guarantee of ε = 1.672 for a round of
collection

CS6290 Privacy-enhancing Technologies 88


Reflecting on LDP
• Local Differential Privacy is a big success for privacy research
• Adopted by Google, Apple, Microsoft and more for deployment
• Deployments affecting (hundreds of) millions of users
• In contrast, centralized DP has seen smaller-scale success (but see the US Census)
• However, there are reasons to pause and reflect:
• LDP only works when you can rely on millions of active participants
• Companies using LDP also gather vast amounts of raw data directly!
• Privacy settings are not very tight: deployed ε ranges from 0.5 to 8+

CS6290 Privacy-enhancing Technologies 89


Case study: DP + others
applying differential privacy mechanisms
to enhance …

CS6290 Privacy-enhancing Technologies 90


Deep Learning and Privacy
• Some examples of DP models are promising, e.g. DP-SGD
• Making the training algorithm differentially private provides an
indistinguishability guarantee for the training data

Debate:
Accuracy – Privacy tradeoffs?
Noise, delicate model
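
A hedged sketch of the idea behind one DP-SGD step (clip each per-example gradient, average, add Gaussian noise); this is framework-agnostic NumPy with an assumed grad_fn, not the TensorFlow Privacy implementation:

    import numpy as np

    def dp_sgd_step(params, examples, grad_fn, lr=0.1, clip_norm=1.0, noise_multiplier=1.1):
        # One DP-SGD update: clip each per-example gradient to L2 norm <= clip_norm,
        # average, then add Gaussian noise scaled to the clipping norm.
        clipped = []
        for x in examples:
            g = grad_fn(params, x)
            norm = np.linalg.norm(g)
            clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
        avg = np.mean(clipped, axis=0)
        sigma = noise_multiplier * clip_norm / len(examples)
        return params - lr * (avg + np.random.normal(0.0, sigma, size=avg.shape))

    grad_fn = lambda w, x: 2 * (w - x)                  # toy squared-loss gradient
    w = dp_sgd_step(np.zeros(3), [np.ones(3)] * 32, grad_fn)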

CS6290 Privacy-enhancing Technologies 91


Deep Learning and Privacy
• TensorFlow DP-SGD API
• δ - A rule of thumb is to set it to be less than the inverse of the size of the training
dataset. By default, it is set to 10^-5 for MNIST dataset with 60,000 training points.
• Tensorflow Privacy provides a tool, compute_dp_sgd_privacy, to compute the value
of ε given a fixed value of hyperparameters
compute_dp_sgd_privacy.compute_dp_sgd_privacy(n=train_data.shape[0],
batch_size=batch_size,
noise_multiplier=noise_multiplier,
epochs=epochs,
delta=1e-5)

https://github.com/tensorflow/privacy/blob/master/g3doc/tutorials/classification_privacy.ipynb

CS6290 Privacy-enhancing Technologies 92


Hybrid Privacy Models
• We can imagine ‘hybrid’ models of privacy
• Users run LDP with different (personalized) values of ε
• Some users send their raw data to the aggregator with trust
• Blender [Avent et al., USENIX17] fleshes out this model
• Combines data obtained from the ‘opt-in’ group O with the ‘regular’ LDP clients C
• E.g. use opt-in data to generate list of possible query answers
• Use combined data from both groups to rank the list
• Several natural questions about theory and applications:
• Quantify benefits of hybrid privacy as a function of |O| and |C|
• What is the ‘killer app’ for this model of privacy?
• Other hybrids: each user chooses what to reveal non-privately
CS6290 Privacy-enhancing Technologies 93
Opening up Private Data
• Deployments of LDP are still tightly controlled by aggregator
• Google/MSR/Apple controls the code on your device
• User reports typically sent over secure channel to aggregator
• Could there be a more ‘open’ implementation of LDP?
• Users could publish their LDP outputs
• Multiple aggregators could perform independent analysis
• Would require a big change in architecture and adoption!

LDP data
CS6290 Privacy-enhancing Technologies 94
Conclusions
• Private data release is a challenging and exciting problem!
• The model of Local Differential Privacy has been widely adopted
• Initially for gathering counting statistics
• LDP comes at a cost: need many more users than centralized model
• Lots of research excitement developing new applications
• Practical impact of these is still to be seen
• Lots of opportunity for new work:
• Designing optimal mechanisms for local differential privacy
• Extend far beyond simple counts and marginals
• Unbalanced data, outliers, data biasing
• Structured data: graphs, movement patterns
CS6290 Privacy-enhancing Technologies 95
References
• Pierangela Samarati, Latanya Sweeney: Protecting privacy when disclosing information: k-
anonymity and its enforcement through generalization and suppression, Harvard Data Privacy
Lab, 1998.
• Latanya Sweeney: Achieving k-Anonymity Privacy Protection Using Generalization and
Suppression. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 10(5):
571-588 (2002)
• Ashwin Machanavajjhala, Johannes Gehrke, Daniel Kifer, Muthuramakrishnan
Venkitasubramaniam: l-Diversity: Privacy Beyond k-Anonymity. ICDE 2006: 24
• Ninghui Li, Tiancheng Li, Suresh Venkatasubramanian: t-Closeness: Privacy Beyond k-Anonymity
and l-Diversity. ICDE 2007: 106-115
• Cynthia Dwork: Differential Privacy. ICALP (2) 2006: 1-12
• Cynthia Dwork, Frank McSherry, Kobbi Nissim, Adam D. Smith: Calibrating Noise to Sensitivity in
Private Data Analysis. TCC 2006: 265-284

CS6290 Privacy-enhancing Technologies 96
