
A Technical Seminar Presentation and
A QUICK INFORMATION GUIDE

Achieving
Data Privacy Using
Machine Un-Learning
Presented by – K. Priyanshu Manoj, 2020BCS063, A-30
Under Guidance of – Mrs. P. G. Kolapwar, Asst. Professor, SGGSIE&T
Agenda
KEY TOPICS DISCUSSED IN THIS PRESENTATION ARE -
• Why is Data Privacy important?
• How machine learning potentially endangers the data and privacy of users across the world
• What is the unlearning process?
• What is Machine Un-Learning?
• Methods used to achieve Machine Unlearning
• Impact of Machine Unlearning technology on Data Privacy
• New trends to watch out for…
Data Privacy?
Why is it necessary?
• In 2020, the amount of data on the internet reached 64 zettabytes (where a zettabyte is a trillion gigabytes).
• Moreover, there are more than 40 billion images on Instagram, 340 million tweets a day, countless posts on Facebook, and so on.
• Users have realized that this data is collected and is both used and sold.
• Scandals like Cambridge Analytica have increased the perception of the value of the data we share online.
Data protection and privacy
have been discussed endlessly…

WHY so?

• We share data with the countless apps and websites we regularly visit.
• Products that we have talked about with friends, or things we have searched on Google, appear as advertisements in our social media feeds.
We not only share a lot of data, we also leave a lot of tracks while browsing the internet.
[Illustration with a speech bubble: “Hmm… I thought…”]
• However, we change, our thoughts change, and the world changes, but the data stays on the internet forever.
• The fact that algorithms are able to profile us so well leads us to wonder to whom this data is being sold.
How machine learning potentially endangers the data and privacy of users across the world

• It is difficult to revoke things that have already been shared online, or to properly delete such data.
• For example, Facebook:
1. recently launched an “Off-Facebook Activity” tool.
2. The company says it enables users to delete data that third-party apps and websites have shared with Facebook.
3. But as the MIT Technology Review notes, “it’s a bit misleading — Facebook isn’t deleting any data from third-parties, it’s de-linking it from its own data on you.”
• Data is the fuel driving ML applications, which thus include collecting and analyzing information such as personal emails or even medical records.
• So, once fed into an ML model, such data can be retained forever, putting users at risk of all sorts of privacy breaches.
Well, sometimes I found myself questioning…
But as it turns out…
We have a whole team on our side…!!
RULES AND LAWS INTRODUCED:
• GDPR
• CCPA
• PIPEDA
Recent government initiatives such as the EU’s General Data Protection Regulation (GDPR) are designed to protect individuals’ data privacy, with a core concept being…
THE RIGHT TO BE FORGOTTEN…!
The right to be forgotten is defined as “the right to have private information about a person be removed from Internet searches and other directories under some circumstances.”
• Several institutions and governments are moving to discuss and propose regulations (Argentina, European Union, Philippines).

Ex.
In 2014, the Spanish court ruled in a favor of a man who asked
that certain information be removed from Google searches.

James Gunn was fired from "Guardians of the Galaxy 3" by


Disney after his offensive tweets resurface. He was fired in
2018, for tweets that were written between 2008 and 2011.
So, should we stop learning/training the machines?
• And live our lives as primitives again…

WELL…
OF COURSE NOT!!!!
We come here with a solution…

MACHINE UN-LEARNING…!!
WHAT?
• What is Machine Unlearning? Machine unlearning is a nascent field of artificial intelligence.
• What are the goals? To remove the selected data points from the model (selective amnesia) without affecting its performance, and to achieve “the right to be forgotten.”
• What are the applications? Granting the right to be forgotten, guarding against data poisoning and adversarial attacks, and avoiding AI models that could leak sensitive information.
Methods used to Achieve Machine Un-learning

HOW?

METHODS:
• SISA Training Method (Sharded – Isolated – Sliced – Aggregated)
• Zero-Shot Machine Unlearning Method
• Error Minimization Approach
• Gradient-Based Method
Is it DIFFICULT to UNLEARN the machines?

In general, it is very difficult to provably unlearn a data point.

What is the reason?
• To erase a data point’s contributions, we have to calculate its impact on the model.
• In most of machine learning, the learning process is incremental, meaning that we start with some model and incrementally improve or update it.
• The stochasticity, combined with the “unpredictable” data-order dependence involved in models, makes it difficult to attribute specific model parameter contributions to a given data point.
(A minimal sketch of the only exact but expensive alternative, retraining from scratch, follows below.)
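As a point of contrast before SISA, here is a minimal sketch (Python with scikit-learn; all names are illustrative and not taken from the slides or the cited paper) of the naive but exact baseline: retrain the whole model from scratch on everything except the point to be forgotten.

```python
# Naive but exact unlearning baseline: retrain the whole model from scratch on
# the dataset with the requested point(s) removed. Provably forgets the point,
# but pays the full training cost on every request. Names are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

def make_model():
    # Any learner could be used; logistic regression keeps the sketch small.
    return LogisticRegression(max_iter=1000)

def retrain_without(X, y, forget_idx):
    """Train as if the rows in forget_idx had never been part of the dataset."""
    keep = np.ones(len(X), dtype=bool)
    keep[np.asarray(forget_idx)] = False
    model = make_model()
    model.fit(X[keep], y[keep])          # full retraining cost every single time
    return model
```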
Sharded – Isolated – Sliced – Aggregated (SISA method)

FEATURES OF SISA TRAINING:
• Paper from 2019 (arXiv:1912.03817), published by researchers from the University of Toronto, the Vector Institute, and the University of Wisconsin-Madison.
• Enables deterministic unlearning.
• Useful for a broad class of incrementally/adaptively learned ML models.
• Helps models “unlearn” information by reducing the number of updates that need to be computed when data points are removed.
• Helps retrain quickly due to the reduced complexity of the calculations.
(A minimal sketch of how an unlearning request is served follows below.)
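To make the feature list concrete, here is a minimal sketch (Python; all names are illustrative and this is not the authors’ reference implementation, which lives in the cleverhans-lab repository cited in the references) of serving an unlearning request under SISA: only the constituent model whose shard contains the requested point is retrained. The shards and models are assumed to come from the sharding step described on the next slides.

```python
# Sketch of serving an unlearning request under SISA: only the constituent model
# whose shard contains the requested point is retrained, which is why far fewer
# updates need to be recomputed than with full retraining.
import numpy as np

def unlearn(models, shards, X, y, idx_to_forget, model_factory):
    """Drop one training index and retrain only the affected constituent."""
    for k, shard_idx in enumerate(shards):
        if idx_to_forget in shard_idx:
            shards[k] = shard_idx[shard_idx != idx_to_forget]  # remove the point
            models[k] = model_factory()
            models[k].fit(X[shards[k]], y[shards[k]])  # cost of one shard only
            return models, shards
    raise KeyError("requested index not found in any shard")
```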
Sharding
1. We divide the data into disjoint fragments and train a constituent model on each smaller data fragment.
2. Thus, we are able to distribute the training cost.
3. While this means our approach naturally benefits from parallelism across shards, we do not take this into account in our analysis and experiments, out of fairness to the baseline of retraining a model from scratch (which could also be accelerated by distributing the computation across multiple machines).
(A minimal code sketch of this sharding step follows below.)
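A minimal sketch of the sharding step (Python; names such as model_factory and make_shards are chosen here for illustration, not taken from the authors’ reference implementation):

```python
# Sketch of SISA-style sharding: split the data into disjoint shards and train
# one constituent model per shard, so an unlearning request only touches one shard.
import numpy as np

def make_shards(n_samples, n_shards, seed=0):
    """Randomly assign each sample index to exactly one shard (disjoint split)."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), n_shards)

def train_constituents(X, y, shards, model_factory):
    """Train one constituent model per shard; shards can be trained in parallel."""
    models = []
    for shard_idx in shards:
        model = model_factory()
        model.fit(X[shard_idx], y[shard_idx])
        models.append(model)
    return models
```

For example, model_factory could simply be lambda: LogisticRegression(max_iter=1000) from scikit-learn, as in the earlier sketch.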
ISOLATION
1. Observe that based on the proposal
detailed earlier, the training of each
shard occurs in isolation.
2. By not performing a joint update, we potentially degrade the generalization ability of the overall model (comprising all constituents).
3. However, we demonstrate that for
appropriate choices of the number of
shards, this does not occur in practice
for certain types of learning tasks.
4. Isolation is a subtle, yet powerful
construction that enables us to give
concrete, provable, and intuitive
guarantees with respect to unlearning.
Slicing
• By further dividing the data dedicated to each model (i.e., each shard) and incrementally tuning (and storing) the parameter state of the model, we obtain additional time savings.
• If slicing is not possible, then the process of retraining has to be carried out instead.
(A minimal code sketch of slicing follows below.)
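A minimal sketch of slicing within a single shard (Python; names are illustrative, assuming an estimator that can simply be refit on a growing subset of the shard):

```python
# Train incrementally, slice by slice, and store the parameter state after each
# slice. To unlearn a point that lives in slice k, restore the checkpoint taken
# before slice k and redo only the remaining training. copy.deepcopy stands in
# for writing checkpoints to disk.
import copy
import numpy as np

def train_sliced(model, X_shard, y_shard, n_slices):
    """Return the trained model plus per-slice checkpoints and slice index sets."""
    slices = np.array_split(np.arange(len(X_shard)), n_slices)
    checkpoints = []                 # checkpoints[i] = model state BEFORE slice i
    seen = np.array([], dtype=int)
    for sl in slices:
        checkpoints.append(copy.deepcopy(model))
        seen = np.concatenate([seen, sl])
        model.fit(X_shard[seen], y_shard[seen])   # cumulative data up to slice i
    return model, checkpoints, slices
```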
Aggregation
• At inference time, predictions from the various constituent models can be combined to provide an overall prediction. The choice of aggregation strategy in SISA training is influenced by two key factors:
1. It is intimately linked to how data is partitioned to form shards: the goal of aggregation is to maximize the joint predictive performance of constituent models.
2. The aggregation strategy should not involve the training data (otherwise the aggregation mechanism itself would have to be unlearned in some cases).
• In the absence of knowledge of which points will be the subject of unlearning requests, there is no better strategy than to partition data uniformly and to opt for a voting strategy where each constituent contributes equally to the final outcome through a simple label-based majority vote. This naturally satisfies both requirements above.
(A minimal code sketch of this voting step follows below.)
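A minimal sketch of the label-based majority vote (Python; assumes integer class labels, and the names are illustrative rather than the authors’ reference code):

```python
# SISA-style aggregation: a simple label-based majority vote in which every
# constituent model contributes equally. The vote never touches the training
# data, so the aggregation step itself never needs to be unlearned.
import numpy as np

def aggregate_predict(models, X, n_classes):
    """Majority vote over the constituents' predicted labels for each sample."""
    votes = np.stack([m.predict(X) for m in models])        # (n_models, n_samples)
    counts = np.apply_along_axis(
        np.bincount, 0, votes.astype(int), minlength=n_classes
    )                                                        # (n_classes, n_samples)
    return counts.argmax(axis=0)                             # winning label per sample
```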
New trends to watch out for…
SISA is just one of the many research efforts being conducted on this topic.
There are many other interesting ways to achieve Machine Unlearning:
• Gradient-Based Method
• Zero-Shot Machine Unlearning Method
• Error Minimization Approach
Thus,
THE TIME HAS COME…
when we can scream aloud!!!
Do you have any questions?
I hope you have all learned something new.
REFERENCES
• https://www.cse.iitd.ac.in/index.php/2011-12-29-23-14-40/cse-seminar-talks
• Speaker:
• Date:
• https://medium.com/syncedreview/machine-unlearning-fighting-for-the-right-to-be-forgotten-c381f8a4acf5
• https://arxiv.org/pdf/1912.03817.pdf
• https://www.youtube.com/watch?v=xUnMkCB0Gns
• https://www.wired.com/story/machines-can-learn-can-they-unlearn/
• https://towardsdatascience.com/machine-unlearning-the-duty-of-forgetting-3666e5b9f6e5
• https://github.com/cleverhans-lab/machine-unlearning
• https://en.wikipedia.org/wiki/Facebook%E2%80%93Cambridge_Analytica_data_scandal
• And various online resources for gathering the statistics
Impact of Machine Unlearning technology on Data Privacy
In unlearning research, we aim to develop an algorithm that can take as its input a trained machine learning
model and output a new one such that any requested training data (i.e., any data originally used to create the
machine learning model) has now been removed. A naive strategy is to retrain the model from scratch without
the training data that needs to be unlearned. However, this comes at a high computational cost; unlearning
research seeks to make this process more efficient.
Let’s describe the problem a bit more formally so we can see how it differs from other definitions of privacy. We
denote the user data that is requested to be unlearned as d_u. We need to develop an unlearning algorithm that
outputs the same distribution of models as retraining without d_u (the naive solution), which is our
(strict) deterministic definition that we explore to better align with the goals of new privacy legislation; in this
setting, we certainly unlearn the entirety of d_u’s contributions. If these distributions do not match, then there is
necessarily some influence from d_u that has led to this difference. Settings where an unlearning algorithm only
approximately matches the retraining distribution can be viewed as a (relaxed) probabilistic setting, where we
unlearn most (but not all) of d_u’s contributions.
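In symbols (notation chosen here for illustration, not taken verbatim from the slides or the cited papers): with A the training algorithm, D the training set, and d_u the data requested for removal, the strict and relaxed settings described above can be written as follows.

```latex
% Exact ("strict") unlearning: the unlearning algorithm U, applied to the trained
% model A(D), must produce the same distribution over models as retraining from
% scratch without d_u.
\[
  U\bigl(A(D),\, D,\, d_u\bigr) \;\stackrel{d}{=}\; A\bigl(D \setminus d_u\bigr)
\]
% Approximate ("relaxed") unlearning only asks the two model distributions to be
% close under some divergence, up to a tolerance \varepsilon:
\[
  \mathrm{dist}\!\left( U\bigl(A(D),\, D,\, d_u\bigr),\; A\bigl(D \setminus d_u\bigr) \right) \;\le\; \varepsilon
\]
```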
An example of such a probabilistic definition of privacy is already found in the seminal work on
differential privacy, which addresses a different but related definition of privacy than what unlearning seeks to
achieve. For readers familiar with the definition of differential privacy, one can think of satisfying the strict
privacy definition behind unlearning through differential privacy as requiring that we learn a model with an
algorithm that satisfies ε=0. Of course, this would prevent any learning and destroy the utility of the learned
model. Research in this relaxed setting of unlearning may be able to further reduce computational load, but the
guarantee is difficult for non-experts to grasp and may not comply with all regulatory needs.
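For reference, a standard statement of differential privacy (again, notation chosen here for illustration): a randomized training algorithm A is epsilon-differentially private if, for any two datasets D and D' differing in a single record and any set of models S,

```latex
% \varepsilon-differential privacy for a randomized learner A:
\[
  \Pr\bigl[A(D) \in S\bigr] \;\le\; e^{\varepsilon}\, \Pr\bigl[A(D') \in S\bigr].
\]
% Setting \varepsilon = 0 makes the inequality hold in both directions, so the
% output distribution cannot depend on any single record at all: every record is
% trivially "unlearned", but nothing can be learned either, which is the utility
% problem mentioned above.
```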
LEARNING V/S UNLEARNING
Machine learning is perceived to exacerbate the problem by collecting and analyzing all this data (from
emails to medical data) and holding the information forever. Furthermore, using this information in
insurance, medical, and loan application models can lead to obvious harm and amplify bias.
Switching to a researcher’s perspective, a concern is
that if and when a data point is actually removed from
an ML training set, that may make it necessary to
retrain downstream models from scratch.