Is It True? Verify
Big data analytics, by putting raw datasets to use, has become one of today's most important technologies, capable of creating high added value for organizations. On the other hand, the use of the personal data contained in these datasets beyond its original collection purpose can lead to violations of data protection law.
This era of big data analytics promises many things. In particular, it offers opportunities to extract
hidden value from unstructured raw datasets through novel reuse. The reuse of personal data is,
however, a key concern for data protection law as it involves processing for purposes beyond those
that justified its original collection, at odds with the principle of purpose limitation.
For this reason, solutions that protect individuals' interests without hampering the development of big data technology are of great importance. Data anonymisation serves this purpose well and is being studied intensively in both legal and technical terms. Once data is anonymised, it begins to fall outside the scope of EU data protection law and of privacy law in many countries' legal systems [IS THIS TRUE? VERIFY]. The standardisation of anonymisation methods and of the related definitions is still a process under discussion and development.
The issue becomes one of balancing the private interests of individuals and realizing the promise of
big data. One way to resolve this issue is to transform personal data that will be shared for further
processing into anonymous information, to use an EU legal term. Anonymous information is
outside the scope of EU data protection laws, and is also carved out from privacy laws in many other
jurisdictions worldwide.
Careful management of the anonymisation process is essential if the utility of the dataset is to be preserved as far as possible. While there is broad consensus on the necessity of data anonymisation, especially in EU-law countries, the big open question is how to anonymise effectively while doing the least damage to the structure of the dataset from which analytical tools will create value.
One proposed explanation for the difficulty of anonymising in compliance with EU data protection laws is the ambiguity that arises when anonymisation techniques are interpreted through the terms used in the legislation. [DEFEND, ILLUSTRATE]
Yet, the texts of both the existing EU Data Protection Directive (DPD) and the new EU General Data Protection Regulation (GDPR) are ambiguous.
The foregoing solution works well in theory, but only as long as the output potential from the data
still retains utility, which is not necessarily the case in practice. This leaves those in charge of
processing the data with a problem: how to ensure that anonymisation is conducted effectively on
the data in their possession, while retaining its utility for potential future disclosure to, and further
processing by, third parties?
Despite broad consensus around the need for effective anonymisation techniques, the debate as to
when data can be said to be legally anonymized to satisfy EU data protection laws is long-standing.
Part of the complexity in reaching consensus derives from confusion around terminology, in
particular the meaning of the concept of anonymisation in this context, and how strictly delineated
that concept should be. This can be explained, in turn, by a lack of consensus on the doctrinal theory
that should underpin its traditional conceptualization as a privacy-protecting mechanism.
Conceptually, anonymisation opens a path for the protection of personal data, but the goal of producing a complete roadmap for it is not very realistic. The biggest reason is that, mathematically, anonymisation is […] NP-hard. It is therefore more realistic to approach anonymisation not as the ideal of running a dataset through an anonymisation machine and then using the outputs in compliance with data protection laws, but as a dynamic audit process that depends on the dataset, on the infrastructure in which it is stored, and on those who will use it, and that continues even after the dataset has been anonymised and shared. (THIS PART MIGHT WORK BETTER IN THE CONCLUSION)
Less clear is whether the first data controller could be seen as bearing an ongoing duty to monitor
the data environment of anonymised datasets. If we assume that determining whether a dataset is anonymised is a contextual question, and that context evolves over time, it only makes sense to subject data controllers to ongoing monitoring duties even when the dataset is considered anonymised, since by definition the initial data controllers remain data controllers. To be clear,
the finding of such a duty does not necessarily contradict the GDPR.
The next question is, then, whether contractual obligations between initial data controllers and
dataset recipients are also crucial to fully control data environments and ensure re-identification
risks remain sufficiently remote. It seems that they do indeed become crucial in cases in which it is
essential for recipients of datasets to put in place security measures.
A dynamic approach to anonymisation therefore means assessing the data environment in context
and over time and implies duties and obligations for both data controllers releasing datasets and
dataset recipients.
This paper suggests that, although the concept of anonymisation is crucial to demarcate the scope of
data protection laws at least from a descriptive standpoint, recent attempts to clarify the terms of
the dichotomy between anonymous information and personal data (in particular, by EU data
protection regulators) have partly failed. Although this failure could be attributed to the very use of a
terminology that creates the illusion of a definitive and permanent contour that clearly delineates
the scope of data protection laws, the reasons are slightly more complex. Essentially, failure can be
explained by the implicit adoption of a static approach, which tends to assume that once the data is
anonymized, not only can the initial data controller forget about it, but also that recipients of the
transformed dataset are thereafter free from any obligations or duties because it always lies outside
the scope of data protection laws. By contrast, the state of anonymized data has to be
comprehended in context, which includes an assessment of the data, the infrastructure, and the
agents.
Moreover, the state of anonymized data should be comprehended dynamically: anonymized data can become personal data again, depending upon the purpose of the further processing.
Going back to identifiability, interestingly, Advocate General Campos Sánchez-Bordona in the Breyer case seems to consider that, indeed, context is crucial for identifying personal data, and in particular for characterising IP addresses as personal data. And the CJEU, in its 2016 judgment, expressly refers to paragraph 68 of the opinion and thereby also excludes identifiability if the identification of the data subject "was prohibited by law or practically impossible on account of the fact that it requires a disproportionate effort in terms of time, cost and man-power, so that the risk of identification appears in reality to be insignificant."
In as much as the category of non-personal data is context-dependent, we argue the same should be
true for the anonymised data concept. Such a fluid line between the categories of personal data and
anonymised data should be seen as a way to mitigate the risk created by the exclusion of
anonymised data from the scope of data protection law. Consequently, the exclusion should never
be considered definitive but should always depend upon context. Ultimately, a key deterrent against
re-identification risk is the potential re-application of data protection laws themselves.
First, it will delete personal identifiers like names and social security numbers. Second, it will modify
other categories of information that act like identifiers in the particular context--the hospital will
delete the names of next of kin, the school will excise student ID numbers, and the bank will obscure
account numbers.
What will remain is a best-of-both-worlds compromise: Analysts will still find the data useful, but
unscrupulous marketers and malevolent identity thieves will find it impossible to identify the people
tracked. Anonymization will calm regulators and keep critics at bay. Society will be able to turn its
collective attention to other problems because technology will have solved this one. Anonymization
ensures privacy.
Clever adversaries can often reidentify or deanonymize the people hidden in an anonymized
database.
Reidentification science disrupts the privacy policy landscape by undermining the faith we have
placed in anonymization. This is no small faith, for technologists rely on it to justify sharing data
indiscriminately
and storing data perpetually, while promising users (and the world) that they are protecting
--
How many other people in the United States share your specific combination of ZIP code, birth date (including year), and sex? According to a landmark study, for 87 percent of the American population the answer is zero; these three pieces of information uniquely identify each of them.
Hardly anyone would have classified ZIP code, birth date, sex, or movie ratings as PII.
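The arithmetic behind quasi-identifier uniqueness can be sketched on a toy population; the records below are invented purely for illustration.

```python
from collections import Counter

# Toy population: (zip_code, birth_date, sex) acts as a quasi-identifier.
# Values are synthetic, for illustration only.
people = [
    ("02138", "1965-07-22", "F"),
    ("02138", "1965-07-22", "F"),  # shares its combination with the row above
    ("02139", "1971-03-04", "M"),
    ("02142", "1980-12-30", "F"),
    ("02144", "1958-01-15", "M"),
]

counts = Counter(people)
unique = [combo for combo, n in counts.items() if n == 1]

# 3 of the 4 distinct combinations occur exactly once: anyone holding an
# external dataset with these three fields can single those people out.
print(len(unique))  # 3
```

At population scale the same count over (ZIP, birth date, sex) is what yields the 87 percent figure.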
--
--
--
(placeholder)
AOL
Netflix
--
Notice that with the two joined tables, the sum of the information is greater than the parts.
--
It would also be a mistake to conclude that the three stories demonstrate only the peril of public
release of anonymized data. Some might argue that had the State of Massachusetts, AOL and Netflix
kept their anonymized data to themselves, or at least shared the data much less widely, we would
not have had to worry about data privacy.
--
Finally, some might object that the fact that reidentification is possible
--
At the very least, we must abandon the pervasively held idea that we
can protect privacy by simply removing personally identifiable information
almost nobody would have categorized movie ratings and search queries as PII, and as a result, no law or regulation did either. Today, four years after
I can argue that every piece of data is potentially PII, but to different degrees: some data is independently PII, some is jointly PII, and some is dependently PII.
Google argued that without the last chunk of the IP address, users are anonymized; but a user's work and home IP addresses can be used jointly, and the probability that anyone else matches both is much lower.
--
Latanya Sweeney has similarly argued against using forms of the word "anonymous" when they are not literally true. Dr. Sweeney instead uses "deidentify" in her research. As she defines it, "[i]n deidentified data, all explicit identifiers, such as SSN, name, address, and telephone number, are removed,"
--
databases together, he can add the newly linked data to his collection
Success breeds further success. Narayanan and Shmatikov explain that once any piece of data has been linked to a person's real identity, any association between this data and a virtual identity breaks the anonymity of the latter. This is why we should worry even about reidentification events that seem to
Utility and privacy are, at bottom, two goals at war with one another.
--
translate directly into a prescription. It does not lead, for example, to the conclusion that all anonymization techniques are fatally flawed, but instead, as […] calls her preferred goal "differential privacy" and ties it to so-called "interactive" techniques
--
In 1977, statistician Tore Dalenius proposed a strict definition of data privacy: the attacker should learn nothing about an individual that they didn't know before using the sensitive dataset. Although this guarantee failed (and we will see why), it is important in understanding why differential privacy is constructed the way it is.
Dalenius's definition failed because, in 2006, computer scientist Cynthia Dwork proved that this guarantee was impossible to give; in other words, any access to sensitive data would violate this definition of privacy. The problem she found was that certain types of background information can always lead to a new conclusion about an individual. Her proof is illustrated in the following anecdote: I know that Alice is two inches taller than the average Lithuanian woman. Then I interact with a dataset of Lithuanian women and compute the average height, which I didn't know before. I now know Alice's height exactly, even though she was not in the dataset. It is impossible to account for all types of background information that might lead to a new conclusion about an individual from use of a dataset.
Differential privacy guarantees the following: the attacker can learn virtually nothing more about an individual than they would learn if that person's record were absent from the dataset. While weaker than Dalenius's definition of privacy, the guarantee is strong enough because it aligns with real-world incentives: individuals have no incentive not to participate in a dataset, because the analysts of that dataset will draw the same conclusions about that individual whether the individual includes himself in the dataset or not. As their sensitive personal information is almost irrelevant to the outputs of the system, users can be assured that the organization handling their data is not violating their privacy.
--
k-anon
One way to achieve this is to have the released records adhere to k-anonymity, which means each released record has at least (k-1) other records in the release whose values are indistinct over those fields that appear in external data. So, k-anonymity provides privacy protection by guaranteeing that each released record will relate to at least k individuals even if the records are directly linked to external information.
A release of data is said to adhere to k-anonymity if each released record has at least (k-1) other records also visible in the release whose values are indistinct over a special set of fields called the quasi-identifier [4]. The quasi-identifier contains those fields that are likely to appear in external data to which the released records can be directly linked (or matched). Generalization involves replacing a value with a less specific but semantically consistent value. Suppression involves not releasing a value at all. While there are numerous techniques available […]
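A minimal sketch of the k-anonymity check just defined; the release below is an invented toy table (attribute names and values are illustrative only, not from the cited works):

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifier, k):
    """k-anonymity check: every combination of quasi-identifier values
    must be shared by at least k records in the release."""
    counts = Counter(tuple(r[a] for a in quasi_identifier) for r in records)
    return all(n >= k for n in counts.values())

# Toy release; Zip and Age are already generalized, Disease is sensitive.
release = [
    {"Zip": "0213*", "Age": "[30-35)", "Sex": "F", "Disease": "HIV"},
    {"Zip": "0213*", "Age": "[30-35)", "Sex": "F", "Disease": "Flu"},
    {"Zip": "0214*", "Age": "[35-40)", "Sex": "M", "Disease": "Flu"},
    {"Zip": "0214*", "Age": "[35-40)", "Sex": "M", "Disease": "Fever"},
]

qid = ("Zip", "Age", "Sex")
print(is_k_anonymous(release, qid, 2))  # True: each QID group has 2 records
print(is_k_anonymous(release, qid, 3))  # False: no group reaches size 3
```

The check only inspects group sizes over the quasi-identifier; as the later attribute-linkage discussion shows, passing it does not by itself protect the sensitive column.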
--
[…] a less specific, more general value that is faithful to the original. In Figure 2 the original ZIP codes {02138, 02139} can be generalized to 0213*, thereby stripping the rightmost digit. […] Suppression can be modeled by imposing on each value generalization hierarchy a new maximal element, atop the old maximal element; the new maximal element is the attribute's suppressed value. […] Domain Z0 represents ZIP codes for Cambridge, MA, and E0 represents race. From now on, all references to generalization include the new maximal element.
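The ZIP-code hierarchy in the excerpt can be sketched as a small function, treating full suppression as the new maximal element atop the generalization hierarchy. The function name and the level scheme are ours, not Sweeney's:

```python
def generalize_zip(zip_code, level):
    """Climb a simple value generalization hierarchy for ZIP codes:
    level 0 keeps the value, each further level strips one more digit,
    and the maximal element suppresses the value entirely."""
    if level >= len(zip_code):
        return "*****"  # suppressed: the new maximal element of the hierarchy
    return zip_code[: len(zip_code) - level] + "*" * level

print(generalize_zip("02138", 0))  # 02138
print(generalize_zip("02139", 1))  # 0213*  (as in the {02138, 02139} example)
print(generalize_zip("02138", 5))  # *****
```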
A table has the form T(Explicit Identifier, Quasi-Identifier, Sensitive Attributes, Non-Sensitive Attributes), where Explicit Identifier is a set of attributes, such as name and social security number, containing information that explicitly identifies record owners, […] and Non-Sensitive Attributes contains all attributes that do not fall into the previous three categories [40]. Most works assume that the four sets of attributes are disjoint, and that each record in the table represents a distinct record owner.
[…] two pieces of prior knowledge: the victim's record in the released data and the victim's QID. For example, the adversary noticed that his boss was hospitalized, and therefore knew that his boss's medical record would appear in the released patient database. Also, it is not difficult for an adversary to obtain his boss's zip code, date of birth, and sex, which could serve as the quasi-identifier in linking attacks.
[…] chosen privacy model and to retain as much data utility as possible. An information […] The Non-Sensitive Attributes are published if they are important to the data mining task.
The first category considers that a privacy threat occurs when an adversary is able to link a record owner to a record in a published data table, to a sensitive attribute in a published data table, or to the published data table itself. We call these record linkage, attribute linkage, and table linkage, respectively. In all three types of linkages, we assume that the adversary knows the QID of the victim. In record and attribute linkages, we further assume that the adversary knows that the victim's record is in the released table, and seeks to identify the victim's record and/or sensitive information from the table. In table linkage, the attack seeks to determine the presence or absence of the victim's record in the released table. The published table should provide the adversary with little additional information beyond the background knowledge. If the adversary has a large variation between the prior and posterior beliefs, we call it the probabilistic attack.
--
In the attack of record linkage, some value qid on QID identifies a small number of records in the released table T, called a group. If the victim's QID matches the value qid, the victim is vulnerable to being linked to the small number of records in the group. In this case, the adversary faces only a small number of possibilities for the victim's record, and with the help of additional knowledge, there is a chance that the adversary could uniquely identify the victim's record from the group.
The k-anonymity model assumes that QID is known to the data holder. Most works consider a single QID containing all attributes that can be potentially used in linking. […] The larger the group size k is, the more protection k-anonymity would provide. On the other hand, this also […]
[Example: a 3-anonymous table generalized from Table 2.2 using the taxonomy trees in Figure 2.1; it has two distinct QID groups, one of them (Artist, Female, [30-35)).]
To prevent record linkage through QID, Samarati and Sweeney [201, 202, 203, 217] propose the notion of k-anonymity: if one record in the table has some value qid, at least k - 1 other records also have the value qid. In other words, the minimum group size on QID is at least k.
--
In the attack of attribute linkage, the adversary may not precisely identify the record of the target victim, but could infer his/her sensitive values from the published data T, based on the set of sensitive values associated to the group that the victim belongs to. Suppose the adversary knows that the target victim Emily is a female dancer at age 30 and owns a record in the table. The adversary may infer that Emily has HIV with 75% confidence because 3 out of the 4 female artists with age [30-35) have HIV. Regardless […]
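The 75% inference can be reproduced on a toy group; the records are invented for illustration:

```python
from collections import Counter

# Toy QID group: the adversary knows Emily falls in the
# (Artist, F, [30-35)) group but cannot pick out her exact record.
group = [
    ("Artist", "F", "[30-35)", "HIV"),
    ("Artist", "F", "[30-35)", "HIV"),
    ("Artist", "F", "[30-35)", "HIV"),
    ("Artist", "F", "[30-35)", "Flu"),
]

def linkage_confidence(records, sensitive_value):
    """Adversary's confidence that a member of the group has the value:
    the fraction of records in the group carrying that sensitive value."""
    values = Counter(rec[-1] for rec in records)
    return values[sensitive_value] / len(records)

print(linkage_confidence(group, "HIV"))  # 0.75: inferred with 75% confidence
```

Note the group is 4-anonymous over its QID, yet the sensitive attribute still leaks; this is exactly why l-diversity and t-closeness were proposed.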
--
l-diversity
A qid group is entropy l-diverse if its entropy is at least log(l), where entropy(qid) = -Σ_{s in S} P(qid, s) log P(qid, s), and P(qid, s) is the fraction of records in the qid group in which the sensitive value s occurs (S being the set of sensitive values). For example, a group whose two sensitive values are distributed as (2/3, 1/3) has entropy
-(2/3) log(2/3) - (1/3) log(1/3) = log(1.9),
and a group distributed as (3/4, 1/4) has entropy
-(3/4) log(3/4) - (1/4) log(1/4) = log(1.8).
The check need only be applied to the most specific groups, since the entropy of a qid group is always greater than or equal to the minimum entropy of its subgroups {qid1, ..., qidn}, where qid = qid1 ∪ ... ∪ qidn.
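A quick way to check these entropy figures is to compute exp(entropy), the effective l of a group under entropy l-diversity. The helper below is our sketch, not code from the cited works:

```python
import math
from collections import Counter

def entropy_l(group_sensitive_values):
    """Effective l of a qid group under entropy l-diversity: the group is
    entropy l-diverse iff entropy >= log(l), i.e. iff l <= exp(entropy)."""
    counts = Counter(group_sensitive_values)
    total = sum(counts.values())
    entropy = -sum((n / total) * math.log(n / total) for n in counts.values())
    return math.exp(entropy)

# Distribution (2/3, 1/3): entropy = log(1.9), approximately.
print(round(entropy_l(["HIV", "HIV", "Flu"]), 2))         # 1.89
# Distribution (3/4, 1/4): entropy = log(1.8), approximately.
print(round(entropy_l(["HIV", "HIV", "HIV", "Flu"]), 2))  # 1.75
```

So the first group is entropy l-diverse for any l up to about 1.9, the second only up to about 1.8, matching the figures above.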
l-diversity has the limitation of implicitly assuming that each sensitive attribute […]; when sensitive values are not similar, achieving l-diversity may cause a large data utility loss. Consider a data table containing data of 1000 patients on some QID attributes and a single sensitive attribute Disease with two possible values, HIV or Flu. Assume that there are only 5 patients with HIV in the table. To achieve 2-diversity, at least one patient with HIV is needed in each qid group; therefore, at most 5 groups can be formed [66], resulting in high utility loss from the heavy generalization this forces.
--
t-Closeness
Li et al. [153] observe that when the overall distribution of a sensitive attribute is skewed, l-diversity does not prevent attribute linkage attacks. Consider a patient table where 95% of records have Flu and 5% of records have HIV. Suppose that a qid group has 50% of Flu and 50% of HIV and, therefore, satisfies 2-diversity. However, this group presents a serious privacy threat because any record owner in the group could be inferred as having HIV with 50% confidence, compared to 5% in the overall table.
t-closeness uses the Earth Mover Distance (EMD) function to measure the closeness between two distributions of sensitive values, and requires the closeness between the distribution in each qid group and the distribution in the whole table to be within t. […] values. Second, the EMD function is not suitable for preventing attribute linkage on numerical sensitive attributes. […] would greatly degrade the data utility because it requires the distribution of sensitive values to be the same in all qid groups. This would significantly […]
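A simplified sketch of the t-closeness check on the Flu/HIV example above. For a categorical attribute with unit ground distance between values, EMD reduces to total variation distance; that simplification, and the function name, are ours (the cited work also handles ordered and numerical attributes):

```python
def t_closeness_distance(group_dist, overall_dist):
    """Distance between a qid group's sensitive-value distribution and the
    overall table's. With unit ground distance between categories, the
    Earth Mover's Distance equals the total variation distance."""
    keys = set(group_dist) | set(overall_dist)
    return 0.5 * sum(abs(group_dist.get(v, 0.0) - overall_dist.get(v, 0.0))
                     for v in keys)

overall = {"Flu": 0.95, "HIV": 0.05}
group   = {"Flu": 0.50, "HIV": 0.50}  # 2-diverse, yet far from the overall

d = t_closeness_distance(group, overall)
print(round(d, 2))  # 0.45: the group violates t-closeness for any t < 0.45
```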
--
ε-Differential Privacy
Dwork [74] proposes an insightful privacy notion: the risk to the record owner's privacy should not substantially increase as a result of participation in a statistical database. Instead of comparing the prior probability and the posterior probability before and after accessing the published data, Dwork [74] proposes to compare the risk with and without the record owner's data in the published data. ε-differential privacy ensures that the removal or addition of a single database record does not significantly affect the outcome of any analysis. Although […], record owners may be assured that they may submit their personal information to the database securely in the knowledge that nothing, or almost nothing, can be discovered from the database with their information that could not have been discovered without it. This strong guarantee is achieved by comparison with and without the record owner's data in the published data. Dwork [75] proves that if the number of queries is sublinear in n, the magnitude of the added noise can be bounded by o(√n), where n is the number of records in the database. Dwork [76] further discusses the interactive and non-interactive query models, discussed in Chapters 1.2 and 17.1. Refer […]
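A minimal sketch of the Laplace mechanism, the standard way to achieve ε-differential privacy for a counting query (whose sensitivity is 1). This is an illustrative implementation, not code from the cited works; it uses the fact that the difference of two Exp(ε) draws is Laplace(0, 1/ε):

```python
import random

def dp_count(true_count, epsilon):
    """Answer a counting query under epsilon-differential privacy.
    A count has sensitivity 1 (adding or removing one record changes it
    by at most 1), so Laplace noise with scale 1/epsilon suffices."""
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise

random.seed(7)
# "How many patients have HIV?" True answer 5; the analyst sees a noisy
# value, and the noise distribution is the same whether or not any one
# patient's record is in the database -- that is the guarantee.
print(dp_count(5, epsilon=1.0))
```

Smaller ε means wider noise and stronger privacy; the noisy answers remain unbiased, so aggregate conclusions survive while any single record's influence is masked.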
--
Motivated by learning theory, Blum et al. [33] present a privacy model called distributional privacy for a non-interactive query model. The key idea is that when a data table is drawn from a distribution, the table should reveal only information about the underlying distribution, and nothing else. […] It is strictly stronger than differential privacy, and can answer all queries over a discretized domain in a concept class with polynomial VC dimension. However, the algorithm has high computational cost. Blum et al. [33] present an efficient […] remain open.
--
The raw data table usually does not satisfy a specified privacy requirement, and the table must be modified before being published. The modification is done by applying a sequence of anonymization operations to the table. […] Anatomization de-associates the correlation between QID and sensitive attributes by grouping and shuffling the sensitive values within each qid group. A generalization replaces a value with a parent value in the taxonomy of the attribute: the parent node Professional, for example, is more general than the child nodes Engineer and Lawyer, and the root node, ANY Job, represents the most general value in Job. For a numerical attribute, exact values can be replaced with an interval that covers the exact values. If a taxonomy tree is not available […] on privacy protection, data utility, and search space. But they all […]
A suppression replaces some values with a special value, indicating that the replaced values are not disclosed. The reverse operation of suppression is called disclosure.