
CS 6290

Data Anonymization and


Differential Privacy
Department of Computer Science
City University of Hong Kong

Slide credits in part to Graham Cormode, Vitaly Shmatikov, Ashwin Machanavajjhala, Michael Hay, Xi He,
Tianhao Wang, Zhan Qin, Elaine Shi, and others.

CS6290 Privacy-enhancing Technologies 1


Aggregated Personal Data is Invaluable

CS6290 Privacy-enhancing Technologies 2


Aggregated Personal Data
• … is made publicly available in many forms.

CS6290 Privacy-enhancing Technologies 3


Personal Data is … Well … Personal!

CS6290 Privacy-enhancing Technologies 4


Data Release Horror Stories

We need to solve this data release problem ...

CS6290 Privacy-enhancing Technologies 5


But Why Release Data?

• Urban mobility modeling at scale


• Smart traffic design is critical; congestion is getting worse

CS6290 Privacy-enhancing Technologies 6


Big Data is Expected to Solve Big Problems
• The big data concept has taken off
• Many organizations have adopted it
• Entered the public vocabulary

• But this data is mostly about individuals


• Individuals want privacy for their data

• How can researchers work on sensitive data?


• The easy answer: anonymize it and share
• The problem: we don’t know how to do this

CS6290 Privacy-enhancing Technologies 7


Mobility tracking for COVID-19
MTR Octopus

https://www.scmp.com/news/hong-kong/health-environment/article/3078191/coronavirus-octopus-provide-data-city-residents
2 April 2020
CS6290 Privacy-enhancing Technologies 8
Mobility tracking for COVID-19
Google’s Mobility Report

https://www.google.com/covid19/mobility/
2 April 2020
CS6290 Privacy-enhancing Technologies 11
CS6290 Privacy-enhancing Technologies 12
Then, how does differential privacy protect
personal privacy with deliberately added noise?

CS6290 Privacy-enhancing Technologies 13


Differential Privacy?

CS6290 Privacy-enhancing Technologies 14


Outline: Record Level Anonymization

• Pre-1997: hash identifiers
• 1997: linking attack [S07] (multiple QIs ≈ identifier)
• 1998: k-anonymity [SS98]
• 2006+: syntactic attacks
• 2006: differential privacy [DM+06] (a semantic notion of privacy)
• 2009: de Finetti's theorem attack [K09] (a machine-learning attack)
CS6290 Privacy-enhancing Technologies 15
Basic Application Setting

Data “Anonymization”
allows the data user (e.g., an analyst) to query the database for aggregated
results while preventing the identification of any particular person X.

CS6290 Privacy-enhancing Technologies 16


Data “Anonymization”
• How?

• Remove “personally identifying information” (PII)


• Name, ID card number, phone number, bank account, email,
address… what else?

• Problem: PII has no technical meaning


• In privacy breaches, any information can be personally
identifying
• Examples: AOL search log dataset (search habits), Netflix prize
dataset (movie preferences), …

CS6290 Privacy-enhancing Technologies 17


The Massachusetts Governor
Privacy Breach
[Sweeney, 2010]

Name | SSN | Visit date | Diag. | Zip | Birth | Sex | Others
David | 10201 | 08/08/2022 | Heart | 99988 | 08/08/1968 | M | …

CS6290 Privacy-enhancing Technologies 18


The Massachusetts Governor
Privacy Breach
[Sweeney, 2010]

Released medical data (identifiers removed):
Visit date | Diag. | Zip | Birth | Sex | Others
08/08/2022 | Heart | 99988 | 08/08/1968 | M | …

Public voter registration list:
Zip | Birth | Sex | Name | Address | Others
99988 | 08/08/1968 | M | David | UK | …

CS6290 Privacy-enhancing Technologies 19


Linkage Attack
[Sweeney, 2010]

CS6290 Privacy-enhancing Technologies 20


NYC Taxi Dataset: Anonymization with
Hashing
NYC Taxi Dataset
Obtained by Freedom of Information request
Contained 173 million individual trips with:
• Locations and times for pickup and drop-off
• Driver (hack) license numbers "anonymized" using a hash function
• Medallion numbers also hashed
• Other metadata

medallion, hack license, vendor id, pickup datetime, dropoff datetime, fees, trip distance, …
6B111958A39B24140C973B262EA9FEA5, D3B035A03C8A34DA17488129DA581EE7, VTS, 2013-12-03 15:46:00,
2013-12-03 16:47:00, 120, 3360, ...

CS6290 Privacy-enhancing Technologies [https://tech.vijayp.ca/of-taxis-and-rainbows-f6bc289679a1]


21
NYC Taxi Dataset: Anonymization with
Hashing

medallion, hack license, vendor id, pickup datetime, dropoff datetime, fees, trip distance, …
CFCD208495D565EF66E7DFF9F98764DA, D3B035A03C8A34DA17488129DA581EE7, VTS, 2013-12-03 15:46:00,
2013-12-03 16:47:00, 120, 3360, ...

Discover that someone’s medallion number is ‘0’ (the hash above is the MD5 of the string "0").


He earns $120 today…
His trace? Working habits? …

CS6290 Privacy-enhancing Technologies [https://tech.vijayp.ca/of-taxis-and-rainbows-f6bc289679a1]


22
NYC Taxi Dataset: Not quite anonymization
[Hal14]
NYC Taxi Dataset
Hashing doesn’t work when we have knowledge of the format of the data
• Medallion and Taxi license numbers are 6-digit
• Map-reduce to crunch away at all hashes for an hour
• Discover the numbers that produce the hashes
• Map taxi license numbers to the driver's name using external dataset (license number,
name)
• Can figure out addresses, work schedules, gross salary, locations!

Lessons Learned
Hashing might not be a good idea for anonymization.
Use cryptographically strong encryption (e.g., AES) and/or random pseudonyms instead.

CS6290 Privacy-enhancing Technologies 23


The Netflix Prize
The Netflix Prize Dataset [Narayanan06]
2006: Netflix publishes movie rating data of 480K users
• Meant to be used for recommendation system research

• Idea: Cross-reference with IMDB


• Results: Knowing 8 ratings (w/dates) identifies 90% of users
• People rated movies on Netflix that they did not rate on
IMDB

CS6290 Privacy-enhancing Technologies 24


Observation #1: Dataset Joins
• Attacker learns sensitive data by joining two datasets on common attributes
• Anonymized dataset with sensitive attributes
• Example: age, race, symptoms
• “Harmless” dataset with individual identifiers
• Example: name, address, age, race

• Demographic attributes (age, ZIP code, race, etc.) are common in datasets
with information about individuals

CS6290 Privacy-enhancing Technologies 25


Observation #2: Quasi-Identifiers
• Sweeney’s observation: (birthdate, ZIP code, gender) uniquely
identifies 87% of US population
• Side note: actually, only 63% [Golle, WPES ‘06]

• Publishing a record with a quasi-identifier is as bad as publishing it


with an explicit identity

• Eliminating quasi-identifiers is not desirable


• For example, users of the dataset may want to study distribution of diseases
by age and ZIP code

CS6290 Privacy-enhancing Technologies 26


k-Anonymity
• Proposed by Samarati and Sweeney (1998)

• Hundreds of papers since then


• Extremely popular in the database and data mining communities
(SIGMOD, ICDE, KDD, VLDB)

• Most based on generalization and suppression

CS6290 Privacy-enhancing Technologies 27


Anonymization in a Nutshell
• Dataset is a relational table
• Attributes (columns) are divided into quasi-identifiers
and sensitive attributes

Quasi-identifiers: Race, Age
Sensitive attributes: Symptoms, Blood type, Medical history

Race | Age | Symptoms | Blood type | Medical history
…    | …   | …        | …          | …

• Generalize/suppress quasi-identifiers, don’t touch


sensitive attributes (keep them “truthful”)

CS6290 Privacy-enhancing Technologies 28


k-Anonymity: Definition
• Each (transformed) quasi-identifier group must appear in at
least k records in the anonymized dataset
• k is chosen by the data owner (how?)
• Example: any age-race combination from original DB must appear at
least 10 times in anonymized DB

• Guarantees that any join on quasi-identifiers with the


anonymized dataset will contain at least k records for each
quasi-identifier
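
As an illustration, a minimal sketch of checking this definition on a toy table (function and field names are illustrative, not from the slides):

    from collections import Counter

    def is_k_anonymous(records, quasi_identifiers, k):
        # Every combination of quasi-identifier values must occur in >= k records.
        groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
        return all(count >= k for count in groups.values())

    table = [
        {"race": "Caucas", "zip": "787XX", "diagnosis": "Flu"},
        {"race": "Caucas", "zip": "787XX", "diagnosis": "Flu"},
        {"race": "Caucas", "zip": "787XX", "diagnosis": "Flu"},
    ]
    print(is_k_anonymous(table, quasi_identifiers=["race", "zip"], k=3))   # True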

CS6290 Privacy-enhancing Technologies 29


Achieving k-Anonymity
• Generalization
• Individual values of attributes replaced by broader category
• Prefix (area code) instead of full phone number: 3442 8765 → 3442 XXXX
• Value “23” of the age attribute replaced by the range 20 < Age ≤ 30

• Suppression
• Replace certain values of the attributes by an asterisk '*’.
• Example: replace all the values in the 'Name' attribute with a ‘*’.

• Lots of algorithms in the literature


• Aim to produce “useful” anonymizations, though … usually without any
clear notion of utility
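
A minimal sketch of the generalization and suppression operations listed above (the helper names and the ZIP/age rules are illustrative):

    def generalize_zip(zipcode, keep_digits=3):
        # Generalize a ZIP code by masking trailing digits: 78712 -> 787XX
        return zipcode[:keep_digits] + "X" * (len(zipcode) - keep_digits)

    def generalize_age(age, bucket=10):
        # Replace an exact age with a range: 23 -> '20-30'
        low = (age // bucket) * bucket
        return f"{low}-{low + bucket}"

    def suppress(_value):
        # Suppression: replace the value with '*'
        return "*"

    record = {"name": "David", "zip": "78712", "age": 23}
    anonymized = {
        "name": suppress(record["name"]),        # suppress a direct identifier
        "zip": generalize_zip(record["zip"]),    # generalize a quasi-identifier
        "age": generalize_age(record["age"]),
    }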
CS6290 Privacy-enhancing Technologies 30
Example: 3-Anonymity

Original:                    3-anonymous version:
Caucas  78712  Flu           Caucas       787XX  Flu
Asian   78705  Shingles      Asian/AfrAm  78705  Shingles
Caucas  78754  Flu           Caucas       787XX  Flu
Asian   78705  Acne          Asian/AfrAm  78705  Acne
AfrAm   78705  Acne          Asian/AfrAm  78705  Acne
Caucas  78705  Flu           Caucas       787XX  Flu

This is 3-anonymous, right?


CS6290 Privacy-enhancing Technologies slide 31
Problem of k-anonymity
When joining with an external database, the adversary learns that Rusty Shackleford has Flu:
his quasi-identifier group is 3-anonymous, but all three records in it share the same sensitive value (a homogeneity attack)

CS6290 Privacy-enhancing Technologies 32


Other Attempts
• L-diversity
• Entropy of the sensitive attribute within each quasi-identifier group must be at
least log(L)
• t-closeness
• Distribution of sensitive attributes within each quasi-identifier group should
be “close” to their distribution in the entire original database

CS6290 Privacy-enhancing Technologies 33


Issues with Syntactic Definitions
• What adversary do they apply to?
• Do not model adversaries with side information
• Do not reason about the probability of a successful inference
• Do not consider the adversary’s algorithms for making decisions (inference)

• Any attribute is a potential quasi-identifier


• External/auxiliary/background information about people is very easy to
obtain

CS6290 Privacy-enhancing Technologies 34


Differential Privacy
• Statistical outcome is indistinguishable regardless of whether
a particular user record is in the data or not
• “Whatever is learned would be learned regardless of whether or
not you participate”

CS6290 Privacy-enhancing Technologies 35


Indistinguishability
(Figure) A database DB = (x1, x2, x3, …, xn-1, xn) is given to a sanitizer San, which uses random coins to answer query 1, …, query T, producing a transcript S (answer 1, …, answer T).
A neighboring database DB’ = (x1, x2, y3, …, xn-1, xn), differing in one row, produces a transcript S’.
Requirement: the distance between the distributions of S and S’ is at most ε.
CS6290 Privacy-enhancing Technologies slide 36
An Example: Statistical Data Release

Clear
distance
between
distributions

CS6290 Privacy-enhancing Technologies 37


An Example: Statistical Data Release

Indistinguishable
Distributions
after adding Alice

CS6290 Privacy-enhancing Technologies 38


Formalizing Indistinguishability
[Dwork, ICALP’06]
For every pair of neighboring databases D, D′ that differ in only one record, and for every output S:
Pr[A(D) = S] ≤ e^ε · Pr[A(D′) = S]

CS6290 Privacy-enhancing Technologies 39


1# Why Every Pair of Neighboring Databases?

For every pair of neighboring databases that differ in only one record, and for every output

Definition: Two datasets 𝐷, 𝐷′ ⊂ 𝑋 are neighbors


if they differ in the data of a single individual.
CS6290 Privacy-enhancing Technologies 40
2# Why for Every Output?

For every pair of neighboring databases that differ in only one record, and for every output

Should not be able to distinguish


whether the input was D1 or D2 no
matter what the output is

CS6290 Privacy-enhancing Technologies 41


Privacy Parameter ε
For every pair of neighboring databases that differ in only one record, and for every output

CS6290 Privacy-enhancing Technologies 42


Privacy Budget
• ε is also called the privacy budget
Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S]

The analyst learns approximately the same information if any single data
record is replaced by another one: the ratio of the two output probabilities is bounded by e^ε

CS6290 Privacy-enhancing Technologies 43


Differential Privacy Example

Smaller ε means better privacy, but more noise is needed and utility is reduced.

CS6290 Privacy-enhancing Technologies 44


ε in Practice
• Perspective taken by theory: Pick 𝜖, prove (or evaluate) accuracy.
• Realities of practice: Hard accuracy requirements.
• Find the smallest level of 𝜖 consistent with accuracy targets.
• How to do this?
• Search incurs privacy overhead…
• Privacy parameter is now a data-dependent quantity. What is the semantics?
• Constant factors can be meaningful…

CS6290 Privacy-enhancing Technologies 45


Basic Algorithm
introducing properly calibrated noise
with the Laplace mechanism…

CS6290 Privacy-enhancing Technologies 46


Output Randomization

CS6290 Privacy-enhancing Technologies 47


Laplace Mechanism

Lap(b)
Note that Lap(b) has mean μ = 0 and variance σ² = 2b²
Parameter b controls the scale of the noise η

CS6290 Privacy-enhancing Technologies 48


The probability density function of the Laplace distribution:
f(x | μ, b) = (1 / (2b)) · exp(−|x − μ| / b)

η ~ Lap(μ, b) is written as η ~ Lap(b) when μ = 0

• μ: position (location) parameter
• b: scale parameter, b > 0
– smaller b gives a sharper distribution, i.e., a higher chance of sampling values near 0

When sampling noise η from the Laplace distribution:
• Mean of noise η: μ = 0
• Variance of noise η: σ² = 2b²
• Values of η near 0 are most likely; values farther away (η = ±1, ±2, ±3, …) occur with decreasing probability
CS6290 Privacy-enhancing Technologies 49
https://en.wikipedia.org/wiki/Laplace_distribution
How much noise is enough for privacy?

i.e., sensitivity is the maximum change in the output caused by adding or
removing any one record in the database.

Laplace mechanism: add noise η ~ Lap(b) with scale b = GS(q)/ε

CS6290 Privacy-enhancing Technologies 50


Global Sensitivity

For a query q, the global sensitivity GS(q) is the largest possible change in q’s answer between any two neighboring databases:
GS(q) = max over neighboring D, D′ of ‖q(D) − q(D′)‖₁
CS6290 Privacy-enhancing Technologies 51


GS example : COUNT query
• Number of people having the disease
• Global sensitivity: s = 1
• b = s/ε = 1/ε
• ε ∈ (0, 1)
Smaller ε → larger b → higher chance of sampling values far from 0 (larger noise) → stronger privacy

• Query q(D) → q(D) + η = q(D) + Lap(b)


• Sampling noise η from Lap(b) is enough for ε-DP.
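
A minimal sketch of this noisy COUNT release (assuming NumPy; the patient records and predicate are illustrative):

    import numpy as np

    def laplace_count(records, predicate, epsilon):
        # epsilon-DP count: GS of a COUNT query is 1, since adding or removing
        # one record changes the count by at most 1, so b = 1/epsilon suffices.
        true_count = sum(1 for r in records if predicate(r))
        noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
        return true_count + noise

    # Illustrative usage: how many patients have the disease?
    patients = [{"disease": True}, {"disease": False}, {"disease": True}]
    noisy_answer = laplace_count(patients, lambda r: r["disease"], epsilon=0.5)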

CS6290 Privacy-enhancing Technologies 52


Examples of Low Global Sensitivity

• Average: SUM/COUNT
• SUM of age, GS upper bound is around 120, why?
• Histograms and contingency tables (see the sketch after this list)
• Covariance matrix
• Many data-mining algorithms can be implemented through a sequence of
low-sensitivity queries
• Perceptron (foundational algorithm in machine learning), some Expectation-
Maximization (EM) algorithms, Statistical Query (SQ) learning algorithms
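
For instance, a minimal sketch of the ε-DP histogram mentioned above, assuming add/remove neighboring databases so that one record changes exactly one bin count by at most 1 (names are illustrative):

    import numpy as np

    def dp_histogram(values, bins, epsilon):
        # Each record falls into exactly one bin, so adding/removing a record
        # changes a single count by 1 (L1 sensitivity 1): Lap(1/epsilon) per bin
        # suffices for epsilon-DP.
        counts, edges = np.histogram(values, bins=bins)
        noisy = counts + np.random.laplace(scale=1.0 / epsilon, size=counts.shape)
        return noisy, edges

    ages = np.random.randint(0, 100, size=1000)          # illustrative data
    noisy_counts, bin_edges = dp_histogram(ages, bins=10, epsilon=0.5)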

CS6290 Privacy-enhancing Technologies 53


Examples of High Global Sensitivity

• Order statistics
• Clustering

CS6290 Privacy-enhancing Technologies 54


Composition Theorems
enriching the DP mechanisms…

CS6290 Privacy-enhancing Technologies 55


Why composition?

Determine the privacy guarantee for each algorithm step and deduce
the overall privacy guarantee.
CS6290 Privacy-enhancing Technologies 56
Privacy Budget Allocation
• Allocation between users/computation providers (auction?)
• Allocation between tasks
• In-task allocation
– Between iterations
– Between multiple statistics
– Optimization problem: no satisfactory solution yet!

CS6290 Privacy-enhancing Technologies 57


Privacy Budget Allocation Example: K-Means

CS6290 Privacy-enhancing Technologies 58


When the Budget Is Exhausted

CS6290 Privacy-enhancing Technologies 59


A Bound on the Number of Queries

The privacy guarantee declines under repeated queries:
because the noise has zero mean, averaging many repeated answers cancels the noise out…
CS6290 Privacy-enhancing Technologies 60
Sequential Composition

All the algorithms run on the same database, one after another: if algorithm Mi satisfies εi-DP, the combined release satisfies (Σi εi)-DP.

CS6290 Privacy-enhancing Technologies 61


Parallel Composition

All the algorithms run on disjoint sections of the same


database.
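(If each algorithm Mi satisfies εi-DP on its own disjoint partition, the combined release satisfies (maxi εi)-DP.)

As a minimal sketch of how these two rules are used in practice, the accountant below tracks the sequential (additive) budget and refuses queries once it is spent; the class and its simple additive rule are illustrative assumptions, not a standard library API:

    import numpy as np

    class BudgetAccountant:
        # Sequential composition: epsilons of queries on the same data add up.
        def __init__(self, total_epsilon):
            self.total = total_epsilon
            self.spent = 0.0

        def laplace_query(self, true_answer, sensitivity, epsilon):
            if self.spent + epsilon > self.total:
                raise RuntimeError("privacy budget exhausted: refuse further queries")
            self.spent += epsilon
            return true_answer + np.random.laplace(scale=sensitivity / epsilon)

    acct = BudgetAccountant(total_epsilon=1.0)
    a1 = acct.laplace_query(true_answer=42, sensitivity=1, epsilon=0.4)   # ok
    a2 = acct.laplace_query(true_answer=42, sensitivity=1, epsilon=0.4)   # ok
    # a third query with epsilon=0.4 would exceed the budget of 1.0 and be refused
    # (queries on disjoint partitions would, by parallel composition, only cost max epsilon)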
CS6290 Privacy-enhancing Technologies 62
Postprocessing

e.g.,
M1(D) = q(D) + η
M2(x) = x +2
M2(M1(D)) = q(D) + 2 + η

Subsequent computations on a DP output do not change the original DP guarantee:
if M1 is ε-DP, then M2 ∘ M1 is still ε-DP.

CS6290 Privacy-enhancing Technologies 64


Summary on DP
• Differential privacy ensures that an attacker can’t infer the presence or
absence of a single record in the input based on any output

• Basic algorithm with random perturbation


• Laplace mechanism

• Composition rules help build complex algorithms using building blocks


• e.g., each building block is a Laplace mechanism with different ε

CS6290 Privacy-enhancing Technologies 65


Approximate Differential Privacy
relaxing the ε-DP mechanisms (pure DP) for
better composition utility…

CS6290 Privacy-enhancing Technologies 66


Review: ε-Differential Privacy (Pure DP)

A randomized algorithm A satisfies ε-differential privacy if:


Given two neighboring databases that differ in one record, D
and D’, then for any output S and ε > 0, we have:
Pr[A(D) = S] ≤ e^ε · Pr[A(D′) = S]

• Can achieve ε-DP for counts by adding a random noise value


e.g., the Laplace mechanism
• Uncertainty due to noise → plausible deniability
• Hides whether someone is present in the data
CS6290 Privacy-enhancing Technologies 67
(ε, δ)-Differential Privacy (Approximate DP)

A randomized algorithm A satisfies (ε, δ)-differential privacy if:


Given two neighboring databases that differ in one record, D
and D’, then for any output S, ε > 0, and δ ∈ (0, 1), we have:
Pr[A(D) = S] ≤ e^ε · Pr[A(D′) = S] + δ

• Introduce a new privacy parameter, δ, which allows a “failure


probability” for the ε-DP definition.
δ = 0, go back to ε-DP;
With probability 1-δ, we will get the same guarantee as ε-DP;
With probability δ, we get no guarantee. δ should be very small.
CS6290 Privacy-enhancing Technologies 68
(ε, δ)-Differential Privacy (Approximate DP)

A randomized algorithm A satisfies (ε, δ)-differential privacy if:


Given two neighboring databases that differ in one record, D
and D’, then for any output S, ε > 0, and δ ∈ (0, 1), we have:
Pr[A(D) = S] ≤ e^ε · Pr[A(D′) = S] + δ

Relaxation

CS6290 Privacy-enhancing Technologies 69


(ε, δ)-Differential Privacy (Approximate DP)
A randomized algorithm A satisfies (ε, δ)-differential
privacy if: Pr[A(D) = S] ≤ e^ε · Pr[A(D′) = S] + δ

• Introducing δ allows a “failure probability” for the ε-DP definition and


meanwhile reduces the privacy burden on ε
• To achieve the same privacy level ε, the Gaussian mechanism used for (ε, δ)-DP adds
more noise and is less accurate than the Laplace mechanism used for ε-DP.
Blue: Laplacian noise for ε-DP
Orange: Gaussian noise for (ε, δ)-DP
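
As a minimal sketch of the Gaussian mechanism behind the orange curve, using the classical calibration σ = √(2 ln(1.25/δ)) · Δ₂ / ε (valid for ε < 1); function and variable names are illustrative:

    import numpy as np

    def gaussian_mechanism(true_answer, l2_sensitivity, epsilon, delta):
        # (epsilon, delta)-DP via Gaussian noise, classical calibration for epsilon < 1
        sigma = np.sqrt(2.0 * np.log(1.25 / delta)) * l2_sensitivity / epsilon
        return true_answer + np.random.normal(loc=0.0, scale=sigma)

    # e.g., a count (L2 sensitivity 1) released with (0.5, 1e-5)-DP
    noisy_count = gaussian_mechanism(1234, l2_sensitivity=1.0, epsilon=0.5, delta=1e-5)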

If so, why bother with (ε, δ)-DP?


CS6290 Privacy-enhancing Technologies 70
(ε, δ)-DP (Excellent composition performance)
If M1, M2, M3, … are (ε, δ)-DP algorithms that access a private database D, then the
composition of k of them satisfies (ε′, δ′)-DP where, by the advanced composition theorem,
ε′ = √(2k ln(1/δ″)) · ε + k · ε · (e^ε − 1) and δ′ = kδ + δ″
i.e., ε grows only sublinearly (roughly ∝ √k · ε) in the number of compositions k.

• In ε-DP, sequential composition gives ε′ = kε: linear growth of ε with k.

The composed privacy budget of (ε, δ)-DP therefore increases much more slowly than that of
ε-DP during composition (recall: ε ↑ means the privacy guarantee ↓).
(ε, δ)-DP is thus more friendly to stacking building blocks.
CS6290 Privacy-enhancing Technologies 71
Gaps between Theory and Practice
• The academic literature has focused largely on the “centralized”
model of differential privacy:

• Industry applications have favored the “local model”:

CS6290 Privacy-enhancing Technologies 72


Trying to Reduce Trust
• Most work on differential privacy relies on a central model
• Data aggregator (e.g., organizations) that sees the true, raw data
• Can compute exact query answers, then add noise for privacy

• A reasonable question: can we reduce the centralized trust?


• Can we remove the central aggregator from the equation?
• Users produce locally private query results and send them directly to the data
user, who aggregates the results

CS6290 Privacy-enhancing Technologies 73


DP mechanisms designed for decentralized setting

Statistical queries, data mining

Local Differential Privacy (LDP)

CS6290 Privacy-enhancing Technologies 74


Local Differential Privacy
• What about having each user run a DP algorithm on their data?
• Then combine all the results to get a final answer

• At first glance, this idea seems crazy


• Each user adds noise to mask their own input
• Do we have enough noise to protect the signal?
• Will the signal be overwhelmed by noise?

• Notice that noise can cancel out or be subtracted out


• The sampled noise has zero mean
• …, -2, -1, 0, 1, 2, …
CS6290 Privacy-enhancing Technologies 75
Frequency Estimation Protocols

• Direct Encoding (Generalized Random Response) [Warner’65]


• Generalize binary attribute to arbitrary domain
• Unary Encoding (Basic One-time RAPPOR) [Erlingsson et al. ’14]
• Encode into a bit-vector and perturb each bit
• Binary Local Hash [Bassily and Smith’15]
• Encode by hashing and then perturb

CS6290 Privacy-enhancing Technologies 76


LDP with a coin toss: Randomized Response
[Warner 65]
• Each user has a single bit (0 or 1) of private information
• Encoding e.g. political/sexual/religious preference, illness, etc.

• Randomized Response (RR): toss an unbiased coin


• If Heads (prob. = ½ ), report the true answer
• Else, toss unbiased coin again
• If Heads (prob. = ½ ), report true answer
• Else, report the opposite

CS6290 Privacy-enhancing Technologies 77


LDP with a coin toss: Randomized Response
[Warner 65]
With N respondents, let p be the fraction whose true answer is ‘Yes’ and X the number of ‘Yes’ responses observed:
• First coin heads (prob. 1/2): respondents answer truthfully, contributing Np/2 expected ‘Yes’ and N(1−p)/2 expected ‘No’
• First coin tails (prob. 1/2): the second coin makes the report independent of the truth, contributing N/4 expected ‘Yes’ and N/4 expected ‘No’
• So E[X] = Np/2 + N/4, from which p can be estimated as p̂ = 2X/N − 1/2

• Collect responses from N users, cancel the noise, and estimate the truth
• The estimation error is proportional to 1/√N: more participants, more accurate
• RR satisfies ε-DP with ε = ln((3/4)/(1/4)) = ln 3
• Generalization: allow two biased coins with different probs (p, q ≠ ½)
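
A minimal sketch of the two-unbiased-coin protocol above and its debiasing step (assuming NumPy; names are illustrative):

    import numpy as np

    def randomized_response(true_bit):
        # Report the true bit with prob. 3/4 and the opposite with prob. 1/4.
        if np.random.rand() < 0.5:                     # first coin: heads -> truth
            return true_bit
        return true_bit if np.random.rand() < 0.5 else 1 - true_bit   # second coin

    def estimate_p(reports):
        # E[X] = N*p/2 + N/4, so p_hat = 2*X/N - 1/2
        N, X = len(reports), sum(reports)
        return 2.0 * X / N - 0.5

    # Illustrative usage: 10,000 users, 30% of whom truly answer 'Yes'
    truth = np.random.rand(10_000) < 0.3
    reports = [randomized_response(int(t)) for t in truth]
    p_hat = estimate_p(reports)          # close to 0.3; error shrinks like 1/sqrt(N)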
CS6290 Privacy-enhancing Technologies 78
Generalized Randomized Response
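
The figure on this slide is not reproduced here; as a hedged sketch, the standard d-ary generalization of randomized response keeps the true value with probability p = e^ε / (e^ε + d − 1) and otherwise reports one of the other d − 1 values uniformly at random:

    import numpy as np

    def generalized_rr(true_value, domain, epsilon):
        # d-ary randomized response: an epsilon-LDP report of one value from `domain`
        d = len(domain)
        p_keep = np.exp(epsilon) / (np.exp(epsilon) + d - 1)
        if np.random.rand() < p_keep:
            return true_value
        others = [v for v in domain if v != true_value]
        return others[np.random.randint(len(others))]   # any other value, uniformly

    domain = ["<18", "18-30", "31-50", ">50"]            # illustrative categorical domain
    report = generalized_rr("18-30", domain, epsilon=1.0)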

CS6290 Privacy-enhancing Technologies 79


LDP in practice
• Differential privacy based on RR is widely deployed!
• In Google Chrome browser, to collect browsing statistics
• In Apple iOS and MacOS, to collect typing statistics
• In Microsoft Windows, to collect telemetry data over time
• In Snap, to perform modeling of user preference
• Each yields deployments of over 100 million users
• All deployments are based on RR with customized designs
• To handle the diverse values a user might have
• Local Differential Privacy is state of the art in 2020+
• Randomized response invented in 1965: five decades ago!

CS6290 Privacy-enhancing Technologies 80


Google’s RAPPOR
• Each user has one value out of a very
large set of possibilities
• E.g. their favourite URL, www.nytimes.com

• Attempt 1: run RR for all values: google.com? Bing.com? …


• User has sparse binary vector with one 1: run RR on every bit
• Slow: sends 1 bit for every possible website in the vector
• Limited: can’t handle new possibilities being added
• Try to do better by reducing domain size through hashing

CS6290 Privacy-enhancing Technologies 81


RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response
Hashing with Bloom Filters
item

0 1 0 0 0 1 0 0 1 0
m-bit

• Bloom filters compactly encode set membership


• Use k different hash functions to hash an item into k values between 1 and m
• Update: set all k hashed positions in the m-bit vector to 1 to indicate the presence of the item
• Query: can look up items; stores n items in O(n) bits
• Collision: choose k and m to obtain a small false-positive probability
• Vectors (of the same size) can be merged by OR-ing them
• Duplicate insertions do not change Bloom filters

CS6290 Privacy-enhancing Technologies 82


Counting Bloom Filters & Randomized Response
item

0/1 1/0 0/1 0/1 0/1 1/0 0/1 0/1 1/0 0/1

• Idea: apply Randomized Response to the bits in a Bloom filter


• Use fewer bits in the filter compared to representing each website with one bit
• Each user hashes their input into k bits of 1 in the filter
• Sensitivity = 2k, since removing one item and adding another changes at most 2k bits
(each item is hashed to 1s at up to k different filter positions)
• Combine all user reports and observe how often each bit is set
• Since bloom vectors can be merged with OR-ing operation
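
A simplified RAPPOR-like sketch of these two steps (hash the item into a Bloom filter, then perturb each bit); the hash construction and parameters are illustrative assumptions, not Google’s exact implementation:

    import hashlib
    import numpy as np

    M, K = 128, 2            # filter size and number of hash functions (illustrative)

    def bloom_bits(item, m=M, k=K):
        # Hash `item` into k positions of an m-bit Bloom filter.
        bits = np.zeros(m, dtype=int)
        for i in range(k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            bits[int(h, 16) % m] = 1
        return bits

    def perturb(bits, epsilon):
        # Flip each bit independently, keeping it with prob. e^eps / (e^eps + 1);
        # by the sensitivity-2k argument above, the report's total budget scales with 2k.
        p_keep = np.exp(epsilon) / (np.exp(epsilon) + 1)
        flips = np.random.rand(len(bits)) >= p_keep
        return np.where(flips, 1 - bits, bits)

    report = perturb(bloom_bits("www.nytimes.com"), epsilon=1.0)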

CS6290 Privacy-enhancing Technologies 83


Decoding Noisy Counting Bloom Filters
• Each data owner creates their own Bloom filter, perturbs each bit with
randomized response, and then sends it to the data user.

• The data user will estimate the frequency of a particular item:


• Look up the item’s bit locations in the collected bloom filter
• Compute an unbiased estimate of the item’s presence bit
• Take the minimum of these estimates as the frequency
• Try advanced methods to decode all items at once

CS6290 Privacy-enhancing Technologies 84


RAPPOR in practice

• The RAPPOR approach is implemented in the Chrome browser


• Collects data from opt-in users, tens of millions per day
• Open source implementation available
• Tracks settings in the browser, e.g. home page, search engine
• Many users unexpectedly change home page → possible malware
• Typical configuration:
• 128 bit Bloom filter, 2 hash functions, privacy parameter ~0.5
• Needs about 10K reports to identify a value with confidence

CS6290 Privacy-enhancing Technologies 85


Apple’s Differential Privacy in Practice

• Apple uses their system to collect data from iOS and OS X users
• Popular emojis: (heart) (laugh) (smile) (crying) (sadface)
• “New” words: bruh, hun, bae, tryna, despacito, mayweather
• Which websites to mute, which to autoplay audio on!
• Deployment settings: w = 1000, d = 1000, ε = 2 to 8 (some criticism)
CS6290 Privacy-enhancing Technologies 86
Microsoft telemetry data collection

• Microsoft wants to collect data on app usage


• How much time was spent on a particular app today?
• Allows finding patterns over time
• Makes use of multiple subroutines:
• 1BitMean to collect numeric data
• dBitFlip to collect (sparse) histogram data
• Memoization and output perturbation to allow repeated probing
• Has been implemented in Windows since 2017

CS6290 Privacy-enhancing Technologies 87


MS Telemetry Collection in Practice
• Deployed in Windows 10 Fall Creators Update
(October 2017)
• Collects number of seconds users spend in different apps
• Parameters: ε =1
• Collection period: every 6 hours
• Collects data on all app usage, not just one at a time
• Can analyze based on the fact that total time spent is
limited
• Gives overall guarantee of ε = 1.672 for a round of
collection

CS6290 Privacy-enhancing Technologies 88


Reflecting on LDP
• Local Differential Privacy is a big success for privacy research
• Adopted by Google, Apple, Microsoft and more for deployment
• Deployments affecting (hundreds of) millions of users
• In contrast, centralized DP has seen smaller-scale success (but see the US Census)
• However, there are reasons to pause and reflect:
• LDP only works when you can rely on millions of active participants
• Companies using LDP also gather vast amounts of raw data directly!
• Privacy settings are not very tight: deployed ε ranges from 0.5 to 8+

CS6290 Privacy-enhancing Technologies 89


Case study: DP + others
applying differential privacy mechanisms
to enhance …

CS6290 Privacy-enhancing Technologies 90


Deep Learning and Privacy
• Some examples of DP models are promising, e.g. DP-SGD
• Making the training algorithm differentially private provides an
indistinguishability guarantee for the training data

Debate:
Accuracy – Privacy tradeoffs?
Noise, delicate model
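
A hedged sketch of the idea behind one DP-SGD step (clip each per-example gradient, average, add Gaussian noise); this is framework-agnostic NumPy with an assumed grad_fn, not the TensorFlow Privacy implementation:

    import numpy as np

    def dp_sgd_step(params, examples, grad_fn, lr=0.1, clip_norm=1.0, noise_multiplier=1.1):
        # One DP-SGD update: clip each per-example gradient to L2 norm <= clip_norm,
        # average, then add Gaussian noise scaled to the clipping norm.
        clipped = []
        for x in examples:
            g = grad_fn(params, x)
            norm = np.linalg.norm(g)
            clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
        avg = np.mean(clipped, axis=0)
        sigma = noise_multiplier * clip_norm / len(examples)
        return params - lr * (avg + np.random.normal(0.0, sigma, size=avg.shape))

    grad_fn = lambda w, x: 2 * (w - x)                  # toy squared-loss gradient
    w = dp_sgd_step(np.zeros(3), [np.ones(3)] * 32, grad_fn)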

CS6290 Privacy-enhancing Technologies 91


Deep Learning and Privacy
• TensorFlow DP-SGD API
• δ - A rule of thumb is to set it to be less than the inverse of the size of the training
dataset. By default, it is set to 10^-5 for MNIST dataset with 60,000 training points.
• Tensorflow Privacy provides a tool, compute_dp_sgd_privacy, to compute the value
of ε given a fixed value of hyperparameters
compute_dp_sgd_privacy.compute_dp_sgd_privacy(n=train_data.shape[0],
batch_size=batch_size,
noise_multiplier=noise_multiplier,
epochs=epochs,
delta=1e-5)

https://github.com/tensorflow/privacy/blob/master/g3doc/tutorials/classification_privacy.ipynb

CS6290 Privacy-enhancing Technologies 92


Hybrid Privacy Models
• We can imagine ‘hybrid’ models of privacy
• Users run LDP with different (personalized) values of ε
• Some users send their raw data to the aggregator with trust
• Blender [Avent et al., USENIX17] fleshes out this model
• Combines data obtained from the ‘opt-in’ group O with the ‘regular’ LDP clients C
• E.g. use opt-in data to generate list of possible query answers
• Use combined data from both groups to rank the list
• Several natural questions about theory and applications:
• Quantify benefits of hybrid privacy as a function of |O| and |C|
• What is the ‘killer app’ for this model of privacy?
• Other hybrids: each user chooses what to reveal non-privately
CS6290 Privacy-enhancing Technologies 93
Opening up Private Data
• Deployments of LDP are still tightly controlled by aggregator
• Google/MSR/Apple controls the code on your device
• User reports typically sent over secure channel to aggregator
• Could there be a more ‘open’ implementation of LDP?
• Users could publish their LDP outputs
• Multiple aggregators could perform independent analysis
• Would require a big change in architecture and adoption!

LDP data
CS6290 Privacy-enhancing Technologies 94
Conclusions
• Private data release is a challenging and exciting problem!
• The model of Local Differential Privacy has been widely adopted
• Initially for gathering counting statistics
• LDP comes at a cost: need many more users than centralized model
• Lots of research excitement developing new applications
• Practical impact of these is still to be seen
• Lots of opportunity for new work:
• Designing optimal mechanisms for local differential privacy
• Extend far beyond simple counts and marginals
• Unbalanced data, outliers, data biasing
• Structured data: graphs, movement patterns
CS6290 Privacy-enhancing Technologies 95
References
• Pierangela Samarati, Latanya Sweeney: Protecting privacy when disclosing information: k-
anonymity and its enforcement through generalization and suppression, Harvard Data Privacy
Lab, 1998.
• Latanya Sweeney: Achieving k-Anonymity Privacy Protection Using Generalization and
Suppression. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 10(5):
571-588 (2002)
• Ashwin Machanavajjhala, Johannes Gehrke, Daniel Kifer, Muthuramakrishnan
Venkitasubramaniam: l-Diversity: Privacy Beyond k-Anonymity. ICDE 2006: 24
• Ninghui Li, Tiancheng Li, Suresh Venkatasubramanian: t-Closeness: Privacy Beyond k-Anonymity
and l-Diversity. ICDE 2007: 106-115
• Cynthia Dwork: Differential Privacy. ICALP (2) 2006: 1-12
• Cynthia Dwork, Frank McSherry, Kobbi Nissim, Adam D. Smith: Calibrating Noise to Sensitivity in
Private Data Analysis. TCC 2006: 265-284

CS6290 Privacy-enhancing Technologies 96
