
NOTE: ALSO DESCRIBE THE RELEVANT REGULATORY LAWS ETC. IN A SINGLE PARAGRAPH

Big data analytics has become one of today's most important technologies, capable of creating high added value for organisations through the use of raw datasets. On the other hand, using the personal data contained in these datasets for purposes other than those for which it was originally collected can lead to violations of data protection law.

This era of big data analytics promises many things. In particular, it offers opportunities to extract
hidden value from unstructured raw datasets through novel reuse. The reuse of personal data is,
however, a key concern for data protection law as it involves processing for purposes beyond those
that justified its original collection, at odds with the principle of purpose limitation.

For this reason, solutions that protect the interests of individuals without hampering the development of big data technology are of great importance. Data anonymisation, as one of the methods that serves this purpose well, is being studied extensively in both legal and technical terms. Once data has been anonymised, it begins to be treated as falling outside the scope of EU data protection law and of the privacy laws of many other jurisdictions (IS THIS CORRECT? VERIFY). The standardisation of anonymisation methods and of the related definitions is still a process under discussion and active work.

The issue becomes one of balancing the private interests of individuals and realizing the promise of
big data. One way to resolve this issue is to transform personal data that will be shared for further
processing into 'anonymous information', to use an EU legal term. Anonymous information is
outside the scope of EU data protection laws, and is also carved out from privacy laws in many other
jurisdictions worldwide.

Careful management of the anonymisation process is essential if the utility of the dataset is to be preserved to the greatest possible extent. While there is broad consensus on the necessity of data anonymisation, especially in countries within the EU legal system, the big open question is how anonymisation can be carried out effectively while doing the least possible damage to the structure of the dataset from which analytical tools will create value.

It can be argued that one of the difficulties of anonymisation compliant with EU data protection laws stems from the ambiguities that arise when the anonymisation techniques proposed as solutions are interpreted through the terms used in the legislation. DEFEND, ILLUSTRATE.

Yet, the texts of both the existing EU Data Protection Directive1 (DPD) and the new EU General Data
Protection Regulation2 (GDPR) are ambiguous.

The foregoing solution works well in theory, but only as long as the output potential from the data
still retains utility, which is not necessarily the case in practice. This leaves those in charge of
processing the data with a problem: how to ensure that anonymisation is conducted effectively on
the data in their possession, while retaining its utility for potential future disclosure to, and further
processing by, third parties?

Despite broad consensus around the need for effective anonymisation techniques, the debate as to
when data can be said to be legally anonymized to satisfy EU data protection laws is long-standing.
Part of the complexity in reaching consensus derives from confusion around terminology, in
particular the meaning of the concept of anonymisation in this context, and how strictly delineated
that concept should be. This can be explained, in turn, by a lack of consensus on the doctrinal theory
that should underpin its traditional conceptualization as a privacy-protecting mechanism.
Although anonymisation as a concept opens a path towards protecting personal data, the goal of producing a complete roadmap for it is not very realistic.

The main reason is that, mathematically, optimal anonymisation is an NP-hard problem.

See Ohm's article on this point.

For this reason, rather than the ideal of running a dataset through an 'anonymisation machine' and then using the output as compliant with data protection laws, it is more realistic to approach anonymisation as a dynamic audit process: one that depends on the dataset, on the infrastructure in which the dataset is stored, and on those who will use it, and that continues even after the dataset has been anonymised and shared. (THIS PASSAGE COULD BE PHRASED MORE ELEGANTLY IN THE CONCLUSION.)

Less clear is whether the first data controller could be seen as bearing an ongoing duty to monitor
the data environment of anonymised datasets. If we accept that whether a dataset is anonymised can only be determined contextually, and that context evolves over time, then it makes sense to subject data controllers to ongoing monitoring duties even once the dataset is considered anonymised, since by definition the initial data controllers remain data controllers. To be clear,
the finding of such a duty does not necessarily contradict the GDPR.

The next question is, then, whether contractual obligations between initial data controllers and
dataset recipients are also crucial to fully control data environments and ensure re-identification
risks remain sufficiently remote. It seems that they do indeed become crucial in cases in which it is
essential for recipients of datasets to put in place security measures.

A dynamic approach to anonymisation therefore means assessing the data environment in context
and over time and implies duties and obligations for both data controllers releasing datasets and
dataset recipients.

This paper suggests that, although the concept of anonymisation is crucial to demarcate the scope of
data protection laws at least from a descriptive standpoint, recent attempts to clarify the terms of
the dichotomy between anonymous information and personal data (in particular, by EU data
protection regulators) have partly failed. Although this failure could be attributed to the very use of a
terminology that creates the illusion of a definitive and permanent contour that clearly delineates
the scope of data protection laws, the reasons are slightly more complex. Essentially, failure can be
explained by the implicit adoption of a static approach, which tends to assume that once the data is
anonymized, not only can the initial data controller forget about it, but also that recipients of the
transformed dataset are thereafter free from any obligations or duties because it always lies outside
the scope of data protection laws. By contrast, the state of anonymized data has to be
comprehended in context, which includes an assessment of the data, the infrastructure, and the
agents.

Moreover, the state of anonymized data should be comprehended dynamically: anonymized data can become personal data again, depending upon the purpose of the further processing.

The Anonymisation Decision-Making Framework (paper).

Broken Promises of Privacy: Responding to the Surprising Failure of Anonymisation (good).

This article is also relevant to the points listed above.


Article 2(a) of the DPD defines personal data as 'any information relating to an identified or identifiable natural person ("data subject")',24 specifying that an identifiable person is one who 'can be identified, directly or indirectly, in particular by reference to an identification number or to one or more factors specific to his physical, physiological, mental, economic, cultural or social identity'.25

Art. 29 WP breaks down the concept of personal data into four components (any information; relating to; an identified or identifiable; natural person) and puts forward a three-prong test to determine whether relevant data relates to a natural person: '[i]n order to consider that the data relate to an individual, a "content" element OR a "purpose" element OR a "result" element should be present.'29

Going back to identifiability, interestingly, Advocate General Campos Sánchez-Bordona in the Breyer case33 seems to consider that, indeed, context is crucial for identifying personal data, and in particular characterising IP addresses as personal data. And the CJEU in its recent judgment of 2016 expressly refers to paragraph 68 of the opinion and thereby also excludes identifiability if the identification of the data subject 'was prohibited by law or practically impossible on account of the fact that it requires a disproportionate effort in terms of time, cost and man-power, so that the risk of identification appears in reality to be insignificant'.

In as much as the category of non-personal data is context-dependent, we argue the same should be
true for the anonymised data concept. Such a fluid line between the categories of personal data and
anonymised data should be seen as a way to mitigate the risk created by the exclusion of
anonymised data from the scope of data protection law. Consequently, the exclusion should never
be considered definitive but should always depend upon context. Ultimately, a key deterrent against
re-identification risk is the potential re-application of data protection laws themselves.

First, it will delete personal identifiers like names and social security numbers. Second, it will modify
other categories of information that act like identifiers in the particular context--the hospital will
delete the names of next of kin, the school will excise student ID numbers, and the bank will obscure
account numbers.

What will remain is a best-of-both-worlds compromise: Analysts will still find the data useful, but
unscrupulous marketers and malevolent identity thieves will find it impossible to identify the people
tracked. Anonymization will calm regulators and keep critics at bay. Society will be able to turn its
collective attention to other problems because technology will have solved this one. Anonymization
ensures privacy.

Clever adversaries can often reidentify or deanonymize the people hidden in an anonymized
database.

Reidentification science disrupts the privacy policy landscape by undermining the faith we have placed in anonymization. This is no small faith, for technologists rely on it to justify sharing data indiscriminately and storing data perpetually, while promising users (and the world) that they are protecting privacy. Advances in reidentification expose these promises as too often illusory.

--
How many other people in the United States share your specific combination of ZIP code, birth date (including year), and sex? According to a landmark study, for 87 percent of the American population, the answer is zero; these three pieces of information uniquely identify each of them.

Latanya Sweeney, Uniqueness of Simple Demographics in the U.S. Population

Philippe Golle, Revisiting the Uniqueness of Simple Demographics in the US Population

Prior to these studies, nobody would have classified ZIP code, birth date, sex, or movie ratings as PII.

--

No useful database can ever be perfectly anonymous, and as the utility of data increases, the privacy decreases.

--

If examples are needed to fill this section out, reuse the ones from Ohm's article.

--

Examples to fill this out: the AOL and Netflix cases.

--

Notice that with the two joined tables, the sum of the information is greater than the parts.

--

It would also be a mistake to conclude that the three stories demonstrate only the peril of public
release of anonymized data. Some might argue that had the State of Massachusetts, AOL and Netflix
kept their anonymized data to themselves, or at least shared the data much less widely, we would
not have had to worry about data privacy.

--

Finally, some might object that the fact that reidentification is possible does not necessarily make it likely to happen. In particular, if there are no motivated, skilled adversaries, then there is no threat.

--

At the very least, we must abandon the pervasively held idea that we can protect privacy by simply removing personally identifiable information (PII). This is now a discredited approach. Even if we continue to follow it in marginal, special cases, we must chart a new course in general.

The trouble is that PII is an ever-expanding category. Ten years ago, almost nobody would have categorized movie ratings and search queries as PII, and as a result, no law or regulation did either.210 Today, four years after computer scientists exposed the power of these categories of data to identify, no law or regulation yet treats them as PII.

I would argue that every piece of data is potentially PII, but to different degrees: some data are PII independently, some are PII jointly with other data, and some are PII only in dependence on other data.

Google argued that, with the last chunk of the IP address removed, users are anonymized; but some users appear from two IP addresses (work and home), and taken jointly these make the probability of a coincidental match much lower.

--

Latanya Sweeney has similarly argued against using forms of the word anonymous when they are not literally true.224 Dr. Sweeney instead uses deidentify in her research. As she defines it, '[i]n deidentified data, all explicit identifiers, such as SSN, name, address, and telephone number, are removed, generalized, or replaced with a made-up alternative.'

--

Once an adversary has linked two anonymized databases together, he can add the newly linked data to his collection of outside information and use it to help unlock other anonymized databases. Success breeds further success. Narayanan and Shmatikov explain that once any piece of data has been linked to a person's real identity, any association between this data and a virtual identity breaks the anonymity of the latter. This is why we should worry even about reidentification events that seem to expose only nonsensitive information, because they increase the linkability of data, and thereby expose people to potential future harm.

Utility and privacy are, at bottom, two goals at war with one another.253

--

In order to be useful, anonymized data must be imperfectly anonymous. '[P]erfect privacy can be achieved by publishing nothing at all, but this has no utility; perfect utility can be obtained by publishing the data exactly as received from the respondents, but this offers no privacy.'

Although the impossibility result should inform regulation, it does not translate directly into a prescription. It does not lead, for example, to the conclusion that all anonymization techniques are fatally flawed, but instead, as Cynthia Dwork puts it, to 'a new approach to formulating privacy's goals.'262 She calls her preferred goal differential privacy and ties it to so-called interactive techniques.

--

In 1977, statistician Tore Dalenius proposed a strict definition of data privacy: that the attacker should learn nothing about an individual that they didn't know before using the sensitive dataset. Although this guarantee failed (and we will see why), it is important in understanding why differential privacy is constructed the way it is.

Dalenius's definition failed because, in 2006, computer scientist Cynthia Dwork proved that this guarantee was impossible to give; in other words, any access to sensitive data would violate this definition of privacy. The problem she found was that certain types of background information could always lead to a new conclusion about an individual. Her proof is illustrated in the following anecdote: I know that Alice is two inches taller than the average Lithuanian woman. Then I interact with a dataset of Lithuanian women and compute the average height, which I didn't know before. I now know Alice's height exactly, even though she was not in the dataset. It is impossible to account for all types of background information that might lead to a new conclusion about an individual from use of a dataset.

Differential privacy guarantees the following: that the attacker can learn virtually nothing more about an individual than they would learn if that person's record were absent from the dataset. While weaker than Dalenius's definition of privacy, the guarantee is strong enough because it aligns with real-world incentives: individuals have no incentive not to participate in a dataset, because the analysts of that dataset will draw the same conclusions about that individual whether the individual includes himself in the dataset or not. As their sensitive personal information is almost irrelevant in the outputs of the system, users can be assured that the organization handling their data is not violating their privacy.
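
As an illustration of how this guarantee is typically realised, the sketch below adds Laplace noise to a counting query. It is a minimal sketch: the dataset, the query, and the epsilon value are invented for the example and are not taken from the excerpt above.

```python
import numpy as np

def dp_count(values, predicate, epsilon):
    """Differentially private count: the true count plus Laplace noise.

    A counting query has sensitivity 1 (adding or removing one record changes
    the count by at most 1), so Laplace noise with scale 1/epsilon satisfies
    epsilon-differential privacy for this single query.
    """
    true_count = sum(1 for v in values if predicate(v))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Hypothetical example: heights (cm) of Lithuanian women, invented data.
heights = [162, 171, 168, 159, 175, 166, 170, 164]
noisy = dp_count(heights, lambda h: h > 165, epsilon=0.5)
print(f"Noisy count of heights above 165 cm: {noisy:.1f}")
```

Whether any one person's record is present or absent, the distribution of the noisy answer barely changes, which is exactly the guarantee described above.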

--

k-anon

One way to achieve this is to have the released records adhere to k-anonymity, which means each released record has at least (k-1) other records in the release whose values are indistinct over those fields that appear in external data. So, k-anonymity provides privacy protection by guaranteeing that each released record will relate to at least k individuals even if the records are directly linked to external information.

A release of data is said to adhere to k-anonymity if each released record has at least (k-1) other records also visible in the release whose values are indistinct over a special set of fields called the quasi-identifier [4]. The quasi-identifier contains those fields that are likely to appear in other known data sets. Therefore, k-anonymity provides privacy protection by guaranteeing that each record relates to at least k individuals even if the released records are directly linked (or matched) to external information.

This paper provides a formal presentation of achieving k-anonymity using generalization and suppression. Generalization involves replacing (or recoding) a value with a less specific but semantically consistent value. Suppression involves not releasing a value at all. While there are numerous techniques available,2 combining these two offers several advantages.
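
To make the definition concrete, here is a minimal sketch in Python that checks whether a release satisfies k-anonymity by counting how many records share each quasi-identifier combination. The toy table and the choice of ZIP, birth year, and sex as the quasi-identifier are assumptions for illustration only.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifier, k):
    """True if every quasi-identifier combination appears in at least k records."""
    counts = Counter(tuple(r[a] for a in quasi_identifier) for r in records)
    return all(c >= k for c in counts.values())

# Invented toy release: ZIP already generalized to four digits.
release = [
    {"zip": "0213*", "birth_year": 1975, "sex": "F", "diagnosis": "flu"},
    {"zip": "0213*", "birth_year": 1975, "sex": "F", "diagnosis": "HIV"},
    {"zip": "0214*", "birth_year": 1980, "sex": "M", "diagnosis": "flu"},
    {"zip": "0214*", "birth_year": 1980, "sex": "M", "diagnosis": "flu"},
]

print(is_k_anonymous(release, ("zip", "birth_year", "sex"), k=2))  # True
print(is_k_anonymous(release, ("zip", "birth_year", "sex"), k=3))  # False
```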

--

Generalization including suppression

The idea of generalizing an attribute is a simple concept. A value is replaced by a less specific, more general value that is faithful to the original. In Figure 2 the original ZIP codes {02138, 02139} can be generalized to 0213*, thereby stripping the rightmost digit and semantically indicating a larger geographical area. Such a relationship implies the existence of a value generalization hierarchy VGH_A for attribute A.

I expand my representation of generalization to include suppression by imposing on each value generalization hierarchy a new maximal element, atop the old maximal element. The new maximal element is the attribute's suppressed value. The height of each value generalization hierarchy is thereby incremented by one. No other changes are necessary to incorporate suppression. Figure 2 and Figure 3 provide examples of domain and value generalization hierarchies expanded to include the suppressed maximal element (*****). In this example, domain Z0 represents ZIP codes for Cambridge, MA, and E0 represents race. From now on, all references to generalization include the new maximal element; and, hierarchy refers to domain generalization hierarchies unless otherwise noted.
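
A minimal sketch of such a value generalization hierarchy for ZIP codes, with the suppressed value added as the new maximal element on top. The level numbering and helper name are invented for illustration.

```python
def generalize_zip(zip_code, level):
    """Value generalization hierarchy for 5-digit ZIP codes.

    Level 0 is the original value; each higher level strips one more rightmost
    digit; the new maximal element (level 5) is the suppressed value.
    """
    if not 0 <= level <= 5:
        raise ValueError("level must be between 0 and 5")
    if level == 5:
        return "*****"  # suppressed maximal element
    return zip_code[: 5 - level] + "*" * level

print(generalize_zip("02138", 1))  # 0213*
print(generalize_zip("02139", 1))  # 0213*  (indistinct from 02138 at this level)
print(generalize_zip("02138", 5))  # *****
```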


--

In the most basic form of privacy-preserving data publishing (PPDP), the data holder has a table of the form

D(Explicit Identifier, Quasi Identifier, Sensitive Attributes, Non-Sensitive Attributes),

where Explicit Identifier is a set of attributes, such as name and social security number (SSN), containing information that explicitly identifies record owners; Quasi Identifier is a set of attributes that could potentially identify record owners; Sensitive Attributes consist of sensitive person-specific information such as disease, salary, and disability status; and Non-Sensitive Attributes contains all attributes that do not fall into the previous three categories [40]. Most works assume that the four sets of attributes are disjoint. Most works assume that each record in the table represents a distinct record owner.

In the above example, the owner of a record is re-identified by linking his quasi-identifier. To perform such linking attacks, the adversary needs two pieces of prior knowledge: the victim's record in the released data and the quasi-identifier of the victim. Such knowledge can be obtained by observations. For example, the adversary noticed that his boss was hospitalized, and therefore knew that his boss's medical record would appear in the released patient database. Also, it is not difficult for an adversary to obtain his boss's zip code, date of birth, and sex, which could serve as the quasi-identifier in linking attacks.
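
A minimal sketch of such a linking attack, joining an invented released patient table to equally invented external knowledge on the quasi-identifier (zip, date of birth, sex):

```python
# Invented released table: explicit identifiers removed, QID left intact.
released = [
    {"zip": "02138", "dob": "1962-07-31", "sex": "M", "diagnosis": "heart disease"},
    {"zip": "02139", "dob": "1975-02-14", "sex": "F", "diagnosis": "flu"},
]

# Invented external knowledge the adversary has gathered about his boss.
external = {"name": "Bob", "zip": "02138", "dob": "1962-07-31", "sex": "M"}

qid = ("zip", "dob", "sex")
matches = [r for r in released if all(r[a] == external[a] for a in qid)]

if len(matches) == 1:
    # A unique match re-identifies the record and discloses the sensitive value.
    print(f"{external['name']} is linked to diagnosis: {matches[0]['diagnosis']}")
```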

To prevent linking attacks, the data holder publishes an anonymous table

T(QID′, Sensitive Attributes, Non-Sensitive Attributes),

where QID′ is an anonymous version of the original QID obtained by applying anonymization operations to the attributes in QID in the original table D. Anonymization operations hide some detailed information so that multiple records become indistinguishable with respect to QID′. Consequently, if a person is linked to a record through QID′, the person is also linked to all other records that have the same value for QID′, making the linking ambiguous.

Alternatively, anonymization operations could generate a synthetic data table T based on the statistical properties of the original table D, or add noise to the original table D. The anonymization problem is to produce an anonymous T that satisfies a given privacy requirement determined by the chosen privacy model and to retain as much data utility as possible. An information metric is used to measure the utility of an anonymous table. Note, the Non-Sensitive Attributes are published if they are important to the data mining task.

We can broadly classify privacy models into two categories based on their attack principles.

The first category considers that a privacy threat occurs when an adversary is able to link a record owner to a record in a published data table, to a sensitive attribute in a published data table, or to the published data table itself. We call these record linkage, attribute linkage, and table linkage, respectively. In all three types of linkages, we assume that the adversary knows the QID of the victim. In record and attribute linkages, we further assume that the adversary knows that the victim's record is in the released table, and seeks to identify the victim's record and/or sensitive information from the table. In table linkage, the attack seeks to determine the presence or absence of the victim's record in the released table.

The second category aims at achieving the uninformative principle [160]: the published table should provide the adversary with little additional information beyond the background knowledge. If the adversary has a large variation between the prior and posterior beliefs, we call it the probabilistic attack.

--

In the attack of record linkage, some value qid on QID identifies a small number of records in the released table T, called a group. If the victim's QID matches the value qid, the victim is vulnerable to being linked to the small number of records in the group. In this case, the adversary faces only a small number of possibilities for the victim's record, and with the help of additional knowledge, there is a chance that the adversary could uniquely identify the victim's record from the group.

The k-anonymity model assumes that QID is known to the data holder. Most works consider a single QID containing all attributes that can be potentially used in the quasi-identifier. The more attributes included in QID, the more protection k-anonymity would provide. On the other hand, this also implies more distortion is needed to achieve k-anonymity because the records in a group have to agree on more attributes.

Table 2.4 shows a 3-anonymous table by generalizing QID = {Job, Sex, Age} from Table 2.2 using the taxonomy trees in Figure 2.1. It has two distinct groups on QID, namely ⟨Professional, Male, [35-40)⟩ and ⟨Artist, Female, [30-35)⟩. Since each group contains at least 3 records, the table is 3-anonymous.

To prevent record linkage through QID, Samarati and Sweeney [201, 202, 203, 217] propose the notion of k-anonymity: if one record in the table has some value qid, at least k − 1 other records also have the value qid. In other words, the minimum equivalence group size on QID is at least k. A table satisfying this requirement is called k-anonymous. In a k-anonymous table, each record is indistinguishable from at least k − 1 other records with respect to QID. Consequently, the probability of linking a victim to a specific record through QID is at most 1/k.
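
A minimal sketch of this generalization step on an invented toy table (the job taxonomy and the five-year age intervals are assumptions for illustration, not taken from Table 2.2), followed by the resulting group sizes and the 1/k linking bound:

```python
from collections import Counter

# Assumed taxonomy: specific jobs generalize to their parent node.
JOB_PARENT = {"Engineer": "Professional", "Lawyer": "Professional",
              "Dancer": "Artist", "Writer": "Artist"}

def generalize(record):
    """Generalize Job via the taxonomy and Age into a 5-year interval."""
    lo = (record["age"] // 5) * 5
    return (JOB_PARENT[record["job"]], record["sex"], f"[{lo}-{lo + 5})")

# Invented raw records.
raw = [
    {"job": "Engineer", "sex": "Male", "age": 35},
    {"job": "Engineer", "sex": "Male", "age": 38},
    {"job": "Lawyer",   "sex": "Male", "age": 38},
    {"job": "Dancer",   "sex": "Female", "age": 30},
    {"job": "Dancer",   "sex": "Female", "age": 30},
    {"job": "Writer",   "sex": "Female", "age": 31},
]

groups = Counter(generalize(r) for r in raw)
k = min(groups.values())
print(groups)  # two groups of size 3, so this release is 3-anonymous
print(f"Probability of linking a victim to a specific record: at most 1/{k}")
```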

--

In the attack of attribute linkage, the adversary may not precisely identify the record of the target victim, but could infer his/her sensitive values from the published data T, based on the set of sensitive values associated to the group that the victim belongs to.

Consider the 3-anonymous data in Table 2.4. Suppose the adversary knows that the target victim Emily is a female dancer at age 30 and owns a record in the table. The adversary may infer that Emily has HIV with 75% confidence because 3 out of the 4 female artists with age [30-35) have HIV. Regardless of the correctness of the inference, Emily's privacy has been compromised.

--

ℓ-diversity

Consider Table 2.4. For the first group ⟨Professional, Male, [35-40)⟩ the entropy is

−(2/3) log(2/3) − (1/3) log(1/3) = log(1.9),

and for the second group ⟨Artist, Female, [30-35)⟩ it is

−(3/4) log(3/4) − (1/4) log(1/4) = log(1.8).

So the table satisfies entropy ℓ-diversity for ℓ ≤ 1.8.

To achieve entropy ℓ-diversity, the entropy of the table as a whole must be at least log(ℓ), since the entropy of a qid group is always greater than or equal to the minimum entropy of its subgroups {qid1, ..., qidn}, where qid = qid1 ∪ ... ∪ qidn, that is,

entropy(qid) ≥ min(entropy(qid1), ..., entropy(qidn)).

This requirement is hard to achieve, especially if some sensitive value frequently occurs in S.

ℓ-diversity has the limitation of implicitly assuming that each sensitive attribute takes values uniformly over its domain. In case the frequencies of sensitive values are not similar, achieving ℓ-diversity may cause a large data utility loss. Consider a data table containing data of 1000 patients on some QID attributes and a single sensitive attribute Disease with two possible values, HIV or Flu. Assume that there are only 5 patients with HIV in the table. To achieve 2-diversity, at least one patient with HIV is needed in each qid group; therefore, at most 5 groups can be formed [66], resulting in high information loss in this case.
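
A short sketch reproducing the entropy ℓ-diversity numbers above, assuming the per-group sensitive-value counts of {2, 1} and {3, 1} implied by the calculation:

```python
import math

def entropy_l(counts):
    """Entropy l-value of a qid group: exp of the Shannon entropy of its sensitive values."""
    total = sum(counts)
    entropy = -sum((c / total) * math.log(c / total) for c in counts)
    return math.exp(entropy)

print(round(entropy_l([2, 1]), 2))  # ~1.89, the first group's entropy of log(1.9)
print(round(entropy_l([3, 1]), 2))  # ~1.75, the second group's entropy of log(1.8)

# The table satisfies entropy l-diversity for l up to the minimum over groups (about 1.8).
```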

--

t-Closeness

In a spirit similar to the uninformative principle discussed earlier, Li et al. [153] observe that when the overall distribution of a sensitive attribute is skewed, ℓ-diversity does not prevent attribute linkage attacks. Consider a patient table where 95% of records have Flu and 5% of records have HIV. Suppose that a qid group has 50% Flu and 50% HIV and, therefore, satisfies 2-diversity. However, this group presents a serious privacy threat, because any record owner in the group could be inferred as having HIV with 50% confidence, compared to 5% in the overall table.

To prevent this skewness attack, Li et al. [153] propose a privacy model, called t-closeness, which requires the distribution of a sensitive attribute in any group on QID to be close to the distribution of the attribute in the overall table. t-closeness uses the Earth Mover Distance (EMD) function to measure the closeness between two distributions of sensitive values, and requires the closeness to be within t. t-closeness has several limitations and weaknesses. First, it lacks the flexibility of specifying different protection levels for different sensitive values. Second, the EMD function is not suitable for preventing attribute linkage on numerical sensitive attributes [152]. Third, enforcing t-closeness would greatly degrade the data utility, because it requires the distribution of sensitive values to be the same in all qid groups. This would significantly damage the correlation between QID and sensitive attributes.
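
A minimal sketch of the closeness check for the Flu/HIV example above, using total variation distance as a simple stand-in for EMD over a categorical sensitive attribute (an assumption for illustration, not the exact metric of the paper):

```python
def distance(group_dist, table_dist):
    """Total variation distance between two categorical distributions."""
    values = set(group_dist) | set(table_dist)
    return 0.5 * sum(abs(group_dist.get(v, 0) - table_dist.get(v, 0)) for v in values)

overall = {"Flu": 0.95, "HIV": 0.05}  # distribution in the whole table
group = {"Flu": 0.50, "HIV": 0.50}    # the skewed, 2-diverse qid group

d = distance(group, overall)
print(d)         # 0.45
print(d <= 0.2)  # False: fails closeness for an illustrative threshold t = 0.2
```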

--

ε-Differential Privacy

Dwork [74] proposes an insightful privacy notion: the risk to the record owner's privacy should not substantially increase as a result of participating in a statistical database. Instead of comparing the prior probability and the posterior probability before and after accessing the published data, Dwork [74] proposes to compare the risk with and without the record owner's data in the published data. Consequently, the privacy model called ε-differential privacy ensures that the removal or addition of a single database record does not significantly affect the outcome of any analysis.

Although ε-differential privacy does not prevent record and attribute linkages studied in earlier chapters, it assures record owners that they may submit their personal information to the database securely in the knowledge that nothing, or almost nothing, can be discovered from the database with their information that could not have been discovered without their information. Dwork [74] formally proves that ε-differential privacy can provide a guarantee against adversaries with arbitrary background knowledge. This strong guarantee is achieved by comparison with and without the record owner's data in the published data. Dwork [75] proves that if the number of queries is sub-linear in n, the noise to achieve differential privacy is bounded by o(√n), where n is the number of records in the database. Dwork [76] further shows that the notion of differential privacy is applicable to both interactive and non-interactive query models, discussed in Chapters 1.2 and 17.1. Refer to [76] for a survey on differential privacy.
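
For reference, the usual formal statement of this guarantee (the standard definition, not a quotation from the excerpt above): a randomized mechanism M is ε-differentially private if, for every pair of databases D and D' differing in a single record and every set S of possible outputs,

```latex
\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[M(D') \in S].
```

A smaller ε forces the two output distributions closer together, so less can be inferred about any single record's presence or content.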

--

Motivated by the learning theory, Blum et al. [33] present a privacy model called distributional privacy for a non-interactive query model. The key idea is that when a data table is drawn from a distribution, the table should reveal only information about the underlying distribution, and nothing else. Distributional privacy is a strictly stronger privacy notion than differential privacy, and can answer all queries over a discretized domain in a concept class of polynomial VC-dimension, where Vapnik-Chervonenkis (VC) dimension is a measure of the capacity of a statistical classification algorithm. Yet, the algorithm has high computational cost. Blum et al. [33] present an efficient algorithm specifically for simple interval queries with limited constraints. The problems of developing efficient algorithms for more complicated queries remain open.

--

The raw data table usually does not satisfy a specified privacy requirement and the table must be modified before being published. The modification is done by applying a sequence of anonymization operations to the table. An anonymization operation comes in several flavors: generalization, suppression, anatomization, permutation, and perturbation. Generalization and suppression replace values of specific description, typically the QID attributes, with less specific description. Anatomization and permutation de-associate the correlation between QID and sensitive attributes by grouping and shuffling sensitive values in a qid group. Perturbation distorts the data by adding noise, aggregating values, swapping values, or generating synthetic data based on some statistical properties of the original data.
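
As an illustration of the anatomization flavor mentioned above, a minimal sketch (with invented toy records) that de-associates the QID from the sensitive attribute by publishing two tables linked only by a group id:

```python
# Invented toy data: (qid_group_id, qid_values, sensitive_value) per record.
records = [
    (1, ("Professional", "Male", "[35-40)"), "flu"),
    (1, ("Professional", "Male", "[35-40)"), "HIV"),
    (2, ("Artist", "Female", "[30-35)"), "flu"),
    (2, ("Artist", "Female", "[30-35)"), "flu"),
]

# Anatomization publishes the QID table and the sensitive table separately,
# linked only by the group id, so a sensitive value can no longer be tied to
# a specific QID row within its group.
qid_table = [(gid, qid) for gid, qid, _ in records]
sensitive_table = [(gid, s) for gid, _, s in records]

print(qid_table)
print(sensitive_table)
```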


Each generalization or suppression operation hides some details in QID. For a categorical attribute, a specific value can be replaced with a general value according to a given taxonomy. In Figure 3.1, the parent node Professional is more general than the child nodes Engineer and Lawyer. The root node, ANY Job, represents the most general value in Job. For a numerical attribute, exact values can be replaced with an interval that covers exact values. If a taxonomy of intervals is given, the situation is similar to categorical attributes. More often, however, no pre-determined taxonomy is given for a numerical attribute. Different classes of anonymization operations have different implications on privacy protection, data utility, and search space. But they all result in a less precise but consistent representation of original data.

A generalization replaces some values with a parent value in the taxonomy of an attribute. The reverse operation of generalization is called specialization. A suppression replaces some values with a special value, indicating that the replaced values are not disclosed. The reverse operation of suppression is called disclosure. Below, we summarize five generalization schemes.
