Slide credits in part: Graham Cormode, Vitaly Shmatikov, Ashwin Machanavajjhala, Michael Hay, Xi He, Tianhao Wang, Zhan Qin, Elaine Shi, and others.
Mobility tracking for COVID-19
MTR Octopus
https://www.scmp.com/news/hong-kong/health-environment/article/3078191/coronavirus-octopus-provide-data-city-residents
2 April 2020
CS6290 Privacy-enhancing Technologies 8
Mobility tracking for COVID-19
Google’s Mobility Report
https://www.google.com/covid19/mobility/
How, then, does differential privacy protect personal privacy using deliberately added noise?
Data “Anonymization”
The goal is to allow a data user (e.g., an analyst) to query the database for aggregated results while preventing the concrete identification of any particular person X.
medallion, hack license, vendor id, pickup datetime, dropoff datetime, fees, trip distance, …
6B111958A39B24140C973B262EA9FEA5, D3B035A03C8A34DA17488129DA581EE7, VTS, 2013-12-03 15:46:00, 2013-12-03 16:47:00, 120, 3360, ...

medallion, hack license, vendor id, pickup datetime, dropoff datetime, fees, trip distance, …
CFCD208495D565EF66E7DFF9F98764DA, D3B035A03C8A34DA17488129DA581EE7, VTS, 2013-12-03 15:46:00, 2013-12-03 16:47:00, 120, 3360, ...
Lessons Learned
• Hashing is not a good idea for anonymization: an unsalted hash of a value drawn from a small space can be inverted by brute force
• Use cryptographically strong encryption with a secret key (e.g., AES), or replace identifiers with random numbers
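To see why unsalted hashing fails, here is a minimal sketch in Python. The target digest is the "anonymized" medallion from the record above; the search over short numeric IDs is an illustrative assumption about the small input space.

```python
import hashlib

# The "anonymized" medallion from the record above is an unsalted MD5 hash.
# Because the space of plausible inputs is tiny, it can be inverted by
# exhaustive search.
target = "cfcd208495d565ef66e7dff9f98764da"

recovered = None
for candidate in range(1_000_000):        # try every short numeric ID
    s = str(candidate)
    if hashlib.md5(s.encode()).hexdigest() == target:
        recovered = s
        break

print(recovered)  # prints 0 -- the digest is simply MD5("0")
```

Any attacker who knows (or can guess) the identifier format can run the same search over the full medallion space in minutes, which is exactly what happened with the released NYC taxi data.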
• Demographic attributes (age, ZIP code, race, etc.) are common in datasets with information about individuals
• Suppression
• Replace certain attribute values with an asterisk '*'
• Example: replace all values in the 'Name' attribute with '*'
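Suppression can be sketched in a few lines; the toy records and column names below are made up for illustration.

```python
# Toy table: each row is one individual's record.
records = [
    {"Name": "Alice", "Age": 34, "ZIP": "10001"},
    {"Name": "Bob",   "Age": 41, "ZIP": "10002"},
]

def suppress(rows, attribute):
    """Replace every value of `attribute` with '*' (suppression)."""
    return [{**row, attribute: "*"} for row in rows]

anonymized = suppress(records, "Name")
print(anonymized[0])  # prints {'Name': '*', 'Age': 34, 'ZIP': '10001'}
```

Note that suppression alone does not prevent re-identification: the remaining quasi-identifiers (Age, ZIP) may still single out an individual, which is what motivates k-anonymity.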
[Figure: output distributions of a mechanism on neighboring databases. Without differential privacy there is a clear distance between the distributions; with differential privacy, the distributions before and after adding Alice are indistinguishable.]
Smaller ε means better privacy, but more noise is needed, so utility is reduced.
Lap(b)
Note that Lap(b) has variance σ² = 2b² and mean μ = 0
Parameter b controls the scale of the noise η
• μ: position (location) parameter
• b: scale parameter, b > 0
A smaller b gives a sharper distribution, i.e., a higher chance of sampling values near 0
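The Laplace mechanism can be sketched with only the standard library; the function names are made up, and a count query is assumed as the running example.

```python
import math
import random

def laplace_noise(b, rng=random):
    """Sample eta ~ Lap(b) (mean 0, variance 2*b**2) via the inverse CDF."""
    u = rng.random() - 0.5                    # uniform on [-0.5, 0.5)
    return -b * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def laplace_mechanism(true_answer, sensitivity, epsilon, rng=random):
    """Release an epsilon-DP answer by adding Lap(sensitivity/epsilon) noise."""
    return true_answer + laplace_noise(sensitivity / epsilon, rng)

random.seed(0)
# A COUNT query has sensitivity 1: adding or removing one person
# changes the count by at most 1.
noisy_count = laplace_mechanism(true_answer=1000, sensitivity=1, epsilon=0.5)
```

Note how the noise scale b = sensitivity/ε grows as ε shrinks: stronger privacy directly costs accuracy, matching the ε tradeoff above.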
• Average: SUM/COUNT
• SUM of age: the global sensitivity (GS) upper bound is around 120. Why? One person's age can change the sum by at most a maximum human lifespan.
• Histograms and contingency tables
• Covariance matrix
• Many data-mining algorithms can be implemented through a sequence of low-sensitivity queries
• Perceptron (a foundational algorithm in machine learning), some Expectation-Maximization (EM) algorithms, Statistical Query (SQ) learning algorithms
• Order statistics
• Clustering
Determine the privacy guarantee for each algorithm step and deduce
the overall privacy guarantee.
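The composition rule above can be sketched as a simple budget accountant; the class and method names are made up for illustration, and basic sequential composition (budgets add up) is assumed.

```python
class PrivacyAccountant:
    """Track the total epsilon consumed by a sequence of DP steps
    (basic sequential composition: per-step epsilons add up)."""

    def __init__(self, total_budget):
        self.total_budget = total_budget
        self.spent = 0.0

    def spend(self, epsilon):
        """Charge one algorithm step; refuse if the budget would be exceeded."""
        if self.spent + epsilon > self.total_budget:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon

acct = PrivacyAccountant(total_budget=1.0)
acct.spend(0.4)   # step 1: e.g., a noisy count
acct.spend(0.4)   # step 2: e.g., a noisy sum
# A third acct.spend(0.4) would raise: 0.4 + 0.4 + 0.4 > 1.0
```

Real systems use tighter accounting (advanced composition, Rényi DP), but the bookkeeping pattern is the same: every step must declare its ε before it runs.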
Privacy Budget Allocation
• Allocation between users/computation providers (an auction?)
• Allocation between tasks
• In-task allocation
– Between iterations
– Between multiple statistics
– An optimization problem with no satisfactory solution yet!
All the algorithms run on the same database, one after another.
e.g.,
M1(D) = q(D) + η
M2(x) = x +2
M2(M1(D)) = q(D) + 2 + η
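The example above runs directly as code: M2 is pure post-processing of the released value, so it cannot weaken the ε-DP guarantee of M1. The function names follow the slide; the concrete numbers are hypothetical.

```python
def M1(q_of_D, eta):
    """epsilon-DP release: the true query answer plus Laplace noise eta."""
    return q_of_D + eta

def M2(x):
    """Post-processing: touches only the released value, never D again,
    so it consumes no additional privacy budget."""
    return x + 2

# With q(D) = 100 and a hypothetical noise draw eta = -1.3:
released = M2(M1(100, -1.3))   # = 100 - 1.3 + 2, about 100.7
```

This is the post-processing property of DP: any data-independent function of an ε-DP output is itself ε-DP.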
• Collect responses from N users, debias the noise, and estimate the truth
• The estimation error is proportional to 1/√N: more participants, more accuracy
• RR satisfies ε-DP with ε = ln((3/4)/(1/4)) = ln 3
• Generalization: allow two biased coins with different probabilities (p, q ≠ ½)
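The classic coin-flip protocol and its debiasing step can be simulated in a few lines; the true fraction below is a made-up value for illustration.

```python
import random

def randomized_response(truth, rng):
    """Classic RR: with prob 1/2 answer truthfully, otherwise answer a
    uniformly random bit. P(report = truth) = 3/4, so epsilon = ln 3."""
    if rng.random() < 0.5:
        return truth
    return rng.random() < 0.5

rng = random.Random(0)
N = 100_000
true_fraction = 0.3   # hypothetical fraction of users with the property

reports = [randomized_response(rng.random() < true_fraction, rng)
           for _ in range(N)]

# E[mean(reports)] = true_fraction/2 + 1/4, so invert that to debias:
estimate = 2 * (sum(reports) / N) - 0.5
```

With N = 100,000 users the estimate lands within about a percentage point of 0.3, and the error shrinks like 1/√N, as stated above.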
Generalized Randomized Response
Each user holds an m-bit vector, e.g.:
0 1 0 0 0 1 0 0 1 0
Each bit is independently either reported truthfully or flipped:
0/1 1/0 0/1 0/1 0/1 1/0 0/1 0/1 1/0 0/1
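One common way to instantiate the per-bit flipping above is symmetric unary-encoding perturbation; this is a sketch, assuming one-hot input vectors, and the flip probability shown is one standard calibration.

```python
import math
import random

def perturb_bits(bits, epsilon, rng):
    """Flip each bit independently with prob 1/(1 + e^(eps/2)).
    Two one-hot inputs differ in exactly two positions, so the per-bit
    likelihood ratio e^(eps/2) composes to e^eps overall (eps-LDP)."""
    p_keep = math.exp(epsilon / 2) / (1 + math.exp(epsilon / 2))
    return [b if rng.random() < p_keep else 1 - b for b in bits]

rng = random.Random(1)
report = perturb_bits([0, 1, 0, 0, 0, 1, 0, 0, 1, 0], epsilon=2.0, rng=rng)
```

The server aggregates the perturbed vectors from all users and debiases each bit position, recovering an item-frequency histogram, just as in the single-bit RR case.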
• Apple uses their system to collect data from iOS and OS X users
• Popular emojis: (heart) (laugh) (smile) (crying) (sadface)
• “New” words: bruh, hun, bae, tryna, despacito, mayweather
• Which websites to mute, which to autoplay audio on!
• Deployment settings: w = 1000, d = 1000, ε between 2 and 8 (some criticism)
Microsoft telemetry data collection
Debate:
Accuracy vs. privacy tradeoffs?
Noise can degrade a delicate model
https://github.com/tensorflow/privacy/blob/master/g3doc/tutorials/classification_privacy.ipynb
Conclusions
• Private data release is a challenging and exciting problem!
• The model of Local Differential Privacy has been widely adopted
• Initially for gathering counting statistics
• LDP comes at a cost: it needs many more users than the centralized model
• Lots of research excitement developing new applications
• Practical impact of these is still to be seen
• Lots of opportunity for new work:
• Designing optimal mechanisms for local differential privacy
• Extending far beyond simple counts and marginals
• Unbalanced data, outliers, data biasing
• Structured data: graphs, movement patterns
References
• Pierangela Samarati, Latanya Sweeney: Protecting Privacy when Disclosing Information: k-Anonymity and Its Enforcement through Generalization and Suppression. Harvard Data Privacy Lab, 1998.
• Latanya Sweeney: Achieving k-Anonymity Privacy Protection Using Generalization and Suppression. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 10(5): 571-588 (2002)
• Ashwin Machanavajjhala, Johannes Gehrke, Daniel Kifer, Muthuramakrishnan Venkitasubramaniam: l-Diversity: Privacy Beyond k-Anonymity. ICDE 2006: 24
• Ninghui Li, Tiancheng Li, Suresh Venkatasubramanian: t-Closeness: Privacy Beyond k-Anonymity and l-Diversity. ICDE 2007: 106-115
• Cynthia Dwork: Differential Privacy. ICALP (2) 2006: 1-12
• Cynthia Dwork, Frank McSherry, Kobbi Nissim, Adam D. Smith: Calibrating Noise to Sensitivity in Private Data Analysis. TCC 2006: 265-284