
Differential Privacy

(Based on notes of Kamalika Chaudhuri (UCB) and G. Kamath (Waterloo))


• Security: Protection against threats or danger.

• Privacy: Right to control how your information is viewed and used.
• Why is privacy needed?
• Many applications involve sensitive data.
  Ex: AI for healthcare data analysis, user behavior analysis (financial transactions, the searches a person makes).
• So, how do we preserve privacy when training a machine learning algorithm?
Example showing that anonymization does not guarantee privacy
Why does anonymization not work?
• Suppose a university in the US wants to publish an anonymized table of professors' details, anonymized by removing names while keeping other attributes (gender, region, etc.).
• Suppose there is exactly one female professor from the South East Asian region (say, India).
• One can easily find out who this professor is by visiting the department page, so her "anonymized" record is linked back to her.
Differential Privacy
• Considered the standard in privacy-preserving data analysis.
• Main idea: say we want to learn something about the whole population, e.g., 'smoking causes cancer', while hiding the fact that a specific person who participated in the survey smokes.
• We also want the privacy definition (method) to be robust to any side information that the party interested in breaking it (the adversary) might have.
• So the goal is to collect information about the population while hiding each individual's information.
• In this setting, users trust the curator (the party that collects the data) and provide their data, but the curator releases only a sanitized version to outsiders for analysis.
• So the adversary sees only the sanitized version.
• As an example, consider the census (count, age, and job of each person in a country).

• The public sends raw data to the curator (the government) by filling in census forms.
• But the census authorities convert this into sanitized tables to preserve privacy.
• Main idea behind differential privacy: the participation of a single person in a survey does not (noticeably) change the probability of getting any particular outcome.
• Whether or not you fill in the census form, it makes (almost) no difference to the chances of your privacy being compromised.
• This holds even if the adversary has side information that could help break your privacy.
• It holds whether or not you participate in the survey.
• In differential privacy, privacy comes from randomness (a randomized algorithm).
• The randomness may come from adding noise to the outcome (output).
• If algorithm A (in the block diagram earlier it was called M) is differentially private, the output of A must be random, drawn from some distribution.
• The goal of DP: if D is a data set containing a person X, and X is replaced by some other person Y to get D', then the output probabilities of A(D) and A(D') are almost the same.
• The output distribution does not change much whether X or Y is present in the data set.
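To make "almost the same" precise, here is the standard definition (my addition; the notes above describe it only in words). A randomized algorithm A is ε-differentially private if, for every pair of neighboring data sets D and D' (differing in one person's record) and every set of outputs S,

  \Pr[A(D) \in S] \le e^{\varepsilon} \, \Pr[A(D') \in S].

The approximate, (ε, δ) version relaxes this to

  \Pr[A(D) \in S] \le e^{\varepsilon} \, \Pr[A(D') \in S] + \delta.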
• Consider an adversary who is a health insurance provider and has some prior knowledge about a client, Alice: she smokes.
• He would like to set her premium based on whether or not Alice has cancer.
• Say Alice has participated in a study that collected data from healthy people and from cancer patients.
• Say the adversary (the insurance provider) finds out, by reverse engineering the study's output, that she falls in the cancer group.
• This is a breach (a break of the privacy agreement) according to DP, and Alice's privacy is violated.
• If instead the adversary acts on the conclusion of the study itself, say 'smoking causes cancer' (combined with already knowing that Alice smokes), it is not a breach, because the same conclusion would have been reached whether or not Alice participated.
• ε: a small positive value, say 0.1, gives a strong privacy guarantee.
• δ: should be very small, smaller than 1/(number of points in the data set).
• Properties of DP (all of these hold if we use DP):
• Resistance to side information available to the adversary.
• Post-processing invariance: the privacy risk does not increase if we post-process the output of a DP algorithm.
• Graceful composition: the privacy risk does not increase drastically even if multiple releases are made from the same sensitive data.
• Group privacy: if k persons participate instead of one, i.e., if the private data in D and D' differ in k records instead of one, then the guarantee degrades gracefully, with the privacy loss growing by a factor of k (see the bounds below).
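The last two properties can be stated quantitatively; these are the standard bounds (my addition, not spelled out in the notes). If mechanisms A_1, ..., A_k are ε_1, ..., ε_k-DP and are all run on the same data, the combined release is (\sum_{i=1}^{k} \varepsilon_i)-DP. And if A is ε-DP for data sets differing in one record, then for D and D' differing in k records,

  \Pr[A(D) \in S] \le e^{k\varepsilon} \, \Pr[A(D') \in S].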
• How do we build a mechanism that satisfies differential privacy?
• Two ways:
• Global Sensitivity Mechanism
• Exponential Mechanism
• Global Sensitivity Mechanism: here we add noise Z to the output of the function f that we want to make DP.
• The scale of Z is decided by the sensitivity of f (how much f can change when one person's record is replaced).
• This gives ε-DP.
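A minimal Python sketch of the global sensitivity mechanism, assuming the classical Laplace-noise instantiation (the function name laplace_mechanism and the smoking-count example below are illustrative, not from the notes):

import numpy as np

def laplace_mechanism(f_of_D, l1_sensitivity, epsilon, rng=None):
    """Release f(D) + Z, where Z is Laplace noise with scale l1_sensitivity / epsilon.

    f_of_D         : the true query answer f(D) (scalar or numpy array)
    l1_sensitivity : max L1 change in f(D) when one person's record is replaced
    epsilon        : privacy parameter (smaller = stronger privacy)
    """
    rng = np.random.default_rng() if rng is None else rng
    scale = l1_sensitivity / epsilon
    noise = rng.laplace(loc=0.0, scale=scale, size=np.shape(f_of_D))
    return f_of_D + noise

# Example: privately release the number of smokers in a survey of 0/1 flags.
# Replacing one person changes the count by at most 1, so the sensitivity is 1.
smokes = np.array([1, 0, 1, 1, 0, 0, 1])
private_count = laplace_mechanism(smokes.sum(), l1_sensitivity=1.0, epsilon=0.1)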


• Gaussian Mechanism: this gives approximate DP, i.e., (ε, δ)-DP.
• Note that A(D) = f(D) + Z.
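For reference, the standard noise calibration for the Gaussian mechanism (my addition; the specific constants follow the classical analysis and are not stated in the notes): let \Delta_2 f = \max_{D \sim D'} \lVert f(D) - f(D') \rVert_2 be the L2 sensitivity over neighboring data sets. Adding Z \sim \mathcal{N}(0, \sigma^2 I) with

  \sigma \ge \frac{\sqrt{2 \ln(1.25/\delta)} \, \Delta_2 f}{\varepsilon}

gives (ε, δ)-DP for ε ∈ (0, 1).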
Differentially Private SGD
• How does Stochastic Gradient Descent work?
• How does DP-SGD work?

• But some of the gradient information is lost (because of the noise added for privacy).
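A minimal NumPy sketch of one DP-SGD update, following the usual recipe of per-example gradient clipping plus Gaussian noise (the function name dp_sgd_step and its parameter names are illustrative assumptions, not from the notes):

import numpy as np

def dp_sgd_step(params, per_example_grads, lr, clip_norm, noise_multiplier, rng=None):
    """One DP-SGD update: clip each example's gradient, sum, add noise, average, step.

    per_example_grads : array of shape (batch_size, num_params), one gradient per example
    clip_norm         : L2 clipping threshold C, which bounds each example's influence
    noise_multiplier  : sigma; the noise std is sigma * C (together with the sampling
                        rate and number of steps it determines the privacy budget spent)
    """
    rng = np.random.default_rng() if rng is None else rng
    batch_size = per_example_grads.shape[0]

    # 1. Clip each per-example gradient to L2 norm at most clip_norm.
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))

    # 2. Sum the clipped gradients and add Gaussian noise calibrated to the clip bound.
    noisy_sum = clipped.sum(axis=0) + rng.normal(0.0, noise_multiplier * clip_norm,
                                                 size=params.shape)

    # 3. Average and take an ordinary gradient step on the noisy gradient.
    return params - lr * noisy_sum / batch_size

The clipping and the added noise distort the averaged gradient, which is the sense in which "gradient information is lost" above.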
