- Queue
- D_Optimal
- EHSV M1286
- Support Vector Machine
- class12-PAC3
- 100 Ovr Calculator
- HW2sol
- HmkProbSols
- half2
- Control Systems Jan 2014
- NetworkSimTestCases-2015
- Market Equilibrium With Transaction Costs
- תכן לוגי מתקדם- הרצאה 3 | הוספת שקופיות ב-2011
- ES 204 MP2
- Root Locus Technique
- Solid Commerce NetSuite Integration
- RP Notes
- 2010 IWYMIC Individual
- Cambridge Text - Answers
- a
- blendkit wk 3 creating blended assignment instructions
- Full Text
- 0BzldnLHvRAq_SUFTUG5YbnpSbVk
- guymonmarqueztechnologyrolloutrubric
- HW Set3 Gentry_graded (2)
- Points
- 9 Key to Numerical Differentiation
- 4
- Project 1
- Miscellaneous Mathematical Constants Etc
- The Unwinding: An Inner History of the New America
- Yes Please
- Sapiens: A Brief History of Humankind
- Dispatches from Pluto: Lost and Found in the Mississippi Delta
- Elon Musk: Tesla, SpaceX, and the Quest for a Fantastic Future
- The Innovators: How a Group of Hackers, Geniuses, and Geeks Created the Digital Revolution
- Devil in the Grove: Thurgood Marshall, the Groveland Boys, and the Dawn of a New America
- John Adams
- The Emperor of All Maladies: A Biography of Cancer
- A Heartbreaking Work Of Staggering Genius: A Memoir Based on a True Story
- Grand Pursuit: The Story of Economic Genius
- This Changes Everything: Capitalism vs. The Climate
- The Prize: The Epic Quest for Oil, Money & Power
- The New Confessions of an Economic Hit Man
- Team of Rivals: The Political Genius of Abraham Lincoln
- The World Is Flat 3.0: A Brief History of the Twenty-first Century
- Smart People Should Build Things: How to Restore Our Culture of Achievement, Build a Path for Entrepreneurs, and Create New Jobs in America
- The Hard Thing About Hard Things: Building a Business When There Are No Easy Answers
- Rise of ISIS: A Threat We Can't Ignore
- Angela's Ashes: A Memoir
- Steve Jobs
- How To Win Friends and Influence People
- Bad Feminist: Essays
- The Light Between Oceans: A Novel
- The Sympathizer: A Novel (Pulitzer Prize for Fiction)
- Extremely Loud and Incredibly Close: A Novel
- Leaving Berlin: A Novel
- The Silver Linings Playbook: A Novel
- You Too Can Have a Body Like Mine: A Novel
- The Incarnations: A Novel
- The Love Affairs of Nathaniel P.: A Novel
- Life of Pi
- Bel Canto
- The Master
- A Man Called Ove: A Novel
- Brooklyn: A Novel
- The Flamethrowers: A Novel
- The First Bad Man: A Novel
- We Are Not Ourselves: A Novel
- The Blazing World: A Novel
- The Rosie Project: A Novel
- The Wallcreeper
- Wolf Hall: A Novel
- The Art of Racing in the Rain: A Novel
- Beautiful Ruins: A Novel
- The Kitchen House: A Novel
- Interpreter of Maladies
- The Perks of Being a Wallflower
- Lovers at the Chameleon Club, Paris 1932: A Novel
- The Bonfire of the Vanities: A Novel
- The Cider House Rules
- A Prayer for Owen Meany: A Novel
- Mislaid: A Novel

**Ji-Won Byun, Tiancheng Li, Elisa Bertino, Ninghui Li
**

Department of Computer Science

Purdue University, West Lafayette, IN 47907, USA

¦byunj,li83,bertino,ninghui¦@cs.purdue.edu

Yonglak Sohn

Department of Computer Engineering

Seokyeong University, Seoul, Korea

syl@skuniv.ac.kr

Abstract

Although the k-anonymity and -diversity models have led to a number of valuable privacy-protecting

techniques and algorithms, the existing solutions are currently limited to static data release. That is, it is

assumed that a complete dataset is available at the time of data release. This assumption implies a sig-

niﬁcant shortcoming, as in many applications data collection is rather a continual process. Moreover, the

assumption entails “one-time” data dissemination; thus, it does not adequately address today’s strong de-

mand for immediate and up-to-date information. In this paper, we consider incremental data dissemination,

where a dataset is continuously incremented with new data. The key issue here is that the same data may

be anonymized and published multiple times, each of the time in a different form. Thus, static anonymiza-

tion (i.e., anonymization which does not consider previously released data) may enable various types of

inference. In this paper, we identify such inference issues and discuss some prevention methods.

1 Introduction

When person-speciﬁc data is published, protecting individual respondents’ privacy is a top priority. Among

various approaches addressing this issue, the k-anonymity model [22, 19] and the -diversity model [16] have

recently drawn signiﬁcant attention in the research community. In the k-anonymity model, privacy protection

is achieved by ensuring that every record in a released dataset is indistinguishable from at least (k −1) other

records within the dataset. Thus, every respondent included in the dataset correspond to at least k records in

a k-anonymous dataset, and the risk of record identiﬁcation (i.e., the probability of associating a particular

individual with a released record) is guaranteed to be at most 1/k. While the k-anonymity model primarily

focuses on the problem of record identiﬁcation, the -diversity model, which is built upon the k-anonymity

model, addresses the risk of attribute disclosure (i.e., the probability of associating a particular individual with

a sensitive attribute value). As an attribute disclosure may occur without records being identiﬁed (e.g., due

to lack of diversity in a sensitive attribute), the -diversity model, in its simplest form

1

, additionally requires

that every group of indistinguishable records contain at least distinct sensitive attribute values; thereby the

risk of attribute disclosure is bound to at most 1/.

1

We discuss more robust -diversity requirements in Section 2.

1

Although these models have yielded a number of valuable privacy-protecting techniques [3, 8, 9, 11,

12, 21], existing approaches only deal with static data release. That is, all these approaches assume that a

complete dataset is available at the time of data release. This assumption implies a signiﬁcant shortcoming,

as in many applications data collection is rather a continuous process. Moreover, the assumption entails

“one-time” data dissemination. Obviously, this does not address today’s strong demand for immediate and

up-to-date information, as the data cannot be released before the data collection is considered complete.

As a simple example, suppose that a hospital is required to share its patient records with a disease control

agency. In order to protect patients’ privacy, the hospital anonymizes all the records prior to sharing. At

ﬁrst glance, the task seems reasonably straightforward, as existing anonymization techniques can efﬁciently

anonymize the records. The challenge is, however, that newrecords are continuously collected by the hospital

(e.g., whenever new patients are admitted), and it is critical for the agency to receive up-to-date data in timely

manner.

One possible approach is to provide the agency with datasets containing only the new records, which are

independently anonymized, on a regular basis. Then the agency can either study each dataset independently or

merge multiple datasets together for more comprehensive analysis. Although straightforward, this approach

may suffer from severely low data quality. The key problem is that relatively small sets of records are

anonymized independently so that the records may have to be modiﬁed much more than when they are

anonymized together with previous records [5]. Moreover, a recoding scheme applied to each dataset may

make the datasets inconsistent with each other; thus, collective analysis on multiple datasets may require

additional data modiﬁcation. Therefore, in terms of data quality, this approach is highly undesirable. One

may believe that data quality can be assured by waiting for new data to be accumulated sufﬁciently large.

However, this approach may not be acceptable in many applications as new data cannot be released in a

timely manner.

A better approach is to anonymize and provide the entire dataset whenever it is augmented with new

records (possibly along with another dataset containing only new records). In this way, the agency can be

provided with up-to-date, quality-preserving and “more complete” datasets each time. Although this ap-

proach can also be easily implemented by using existing techniques (i.e., anonymizing the entire dataset

every time), it has a signiﬁcant drawback. That is, even though each released dataset, when observed inde-

pendently, is guaranteed to be anonymous, the combination of several released datasets may be vulnerable to

various inferences. We illustrate these inferences through some examples in Section 3.1. As such inferences

are typically made by comparing or linking records across different tables (or versions), we refer to them as

cross-version inferences to differentiate them from inferences that may occur within a single table.

Our goal in this paper is to identify and prevent cross-version inferences so that an increasing dataset can

be incrementally disseminated without compromising the imposed privacy requirement. In order to achieve

this, we ﬁrst deﬁne the privacy requirement for incremental data dissemination. We then discuss three types

of cross-version inference that an attacker may exploit by observing multiple anonymized datasets. We also

present our anonymization method where the degree of generalization is determined based on the previ-

ously released datasets to prevent any cross-version inference. The basic idea is to obscure linking between

records across different datasets. We develop our technique in two different types of recoding approaches;

namely, full-domain generalization [11] and multidimensional anonymization [12]. One of the key differ-

ences between these two approaches is that the former generalizes a given dataset according to pre-deﬁned

generalization hierarchies, while the latter does not. Based on our experimental result, we compare these two

approaches with respect to data quality and vulnerability to cross-table inference. Another issue we address is

that as a dataset is released multiple times, one may need to keep the history of previously released datasets.

We thus discuss how to maintain such history in a compact form to reduce unnecessary overheads.

2

The remainder of this paper is organized as follows. In Section 2, we review the basic concepts of the k-

anonymity and -diversity models and provide an overview of related techniques. In Section 3, we formulate

the privacy requirement for incremental data dissemination. Then in Section 4, we describe three types of

inference attacks based on our assumption of potential attackers. We present our approach to preventing these

inferences in Section 5 and evaluate our technique in Section 6. We review some related work in Section 7

and conclude our discussion in Section 8.

2 Preliminaries

In this section, we discuss a number of anonymization models and brieﬂy review existing anonymization

techniques.

2.1 Anonymity Models

The k-anonymity model assumes that data are stored in a table (or a relation) of columns (or attributes) and

rows (or records). It also assumes that the target table contains person-speciﬁc information and that each

record in the table corresponds to a unique real-world individual. The process of anonymizing such a table

starts with removing all the explicit identiﬁers, such as name and SSN, from the table. However, even though

a table is free of explicit identiﬁers, some of the remaining attributes in combination could be speciﬁc enough

to identify individuals. For example, it has been shown that 87% of individuals in the United States can be

uniquely identiﬁed by a set of attributes such as ¦ZIP, gender, date of birth¦ [22]. This implies that each

attribute alone may not be speciﬁc enough to identify individuals, but a particular group of attributes together

may identify a particular individuals [19, 22].

The main objective of the k-anonymity model is thus to transform a table so that no one can make high-

probability associations between records in the table and the corresponding individuals by using such group

of attributes, called quasi-identiﬁer. In order to achieve this goal, the k-anonymity model requires that any

record in a table be indistinguishable from at least (k − 1) other records with respect to the quasi-identiﬁer.

A set of records that are indistinguishable from each other is often referred to as an equivalence class. Thus,

a k-anonymous table can be viewed as a set of equivalence classes, each of which contains at least k records.

The enforcement of k-anonymity guarantees that even though an adversary knows the quasi-identiﬁer value

of an individual and is sure that a k-anonymous table T contains the record of the individual, he cannot

determine which record in T corresponds to the individual with a probability greater than 1/k.

Although the k-anonymity model does not consider sensitive attributes, a private dataset typically contains

some sensitive attributes that are not part of the quasi-identiﬁer. For instance, in patient table, Diagnosis is

considered a sensitive attribute. For such datasets, the key consideration of anonymization is the protection

of individuals’ sensitive attributes. However, the k-anonymity model does not provide sufﬁcient protection

in this setting, as it is possible to infer certain individuals’ sensitive attribute values without precisely re-

identifying their records. For instance, consider a k-anonymized table where all records in an equivalence

class have the same sensitive attribute value. Although none of these records can be uniquely matched with

the corresponding individuals, their sensitive attribute value can be inferred with probability 1. Recently,

Machanavajjhala et al. [16] pointed out such inference issues in the k-anonymity model and proposed the

notion of -diversity. Several formulations of -diversity are introduced in [16]. In its simplest form, the

-diversity model requires that records in each equivalence class have at least distinct sensitive attribute

values. As this requirement ensures that every equivalence class contains at least distinct sensitive attribute

values, the risk of attribute disclosure is kept under 1/. Note that in this case, the -diversity requirement

3

also ensures -anonymity, as the size of every equivalence class must be greater than or equal to . Although

simple and intuitive, modiﬁed datasets based on this requirement could still be vulnerable to probabilistic

inferences. For example, consider that among the distinct values in an equivalence class, one particular

value appears much more frequently than the others. In such a case, an adversary may conclude that the

individuals contained in the equivalence class are very likely to have that speciﬁc value. A more robust

diversity is achieved by enforcing entropy -diversity [16], which requires every equivalence class to satisfy

the following condition.

−

s∈S

p(e, s) log p(e, s) > log

where S is the domain of the sensitive attribute and p(e, s) represents the fraction of records in e that have

sensitive value s. Although entropy -diversity does provide stronger privacy, the requirement may sometimes

be too restrictive. For instance, as pointed out in [16], in order for entropy -diversity to be achievable, the

entropy of the entire table must also be greater than or equal to log .

Recently, a number of anonymization models have been proposed. In [26], Xiao and Tao observed that -

diversity cannot prevent attribute disclosure, when multiple records in the table corresponds to one individual.

They proposed to have each individual specify privacy policies about his or her own attributes. In [13], Li

et al. observed that -diversity is neither sufﬁcient nor necessary for protecting against attribute disclosure.

They proposed t-closeness as a stronger anonymization model, which requires the distribution of sensitive

values in each equivalence class to be analogous to the distribution of the entire dataset. In [15], Li and

Li considered the adversary’s background knowledge in deﬁning privacy. They proposed an approach for

modeling the adversary’s background knowledge by using data mining techniques on the data to be released.

2.2 Anonymization Techniques

The k-anonymity (and -diversity) requirement is typically enforced through generalization, where real val-

ues are replaced with “less speciﬁc but semantically consistent values” [21]. Given a domain, there are

various ways to generalize the values in the domain. Intuitively, numeric values can be generalized into in-

tervals (e.g., [11 − 20]), and categorical values can be generalized into a set of possible values (e.g., ¦USA,

Canada, Mexico¦) or a single value that represents such a set (e.g., North-America). As generalization makes

data uncertain, the utility of the data is inevitably downgraded. The key challenge of anonymization is thus

to minimize the amount of ambiguity introduced by generalization while enforcing anonymity requirement.

Various generalization strategies have been developed. In the hierarchy-based generalization schemes,

a non-overlapping generalization-hierarchy is ﬁrst deﬁned for each attribute in the quasi-identiﬁer. Then an

algorithm in this category tries to ﬁnd an optimal (or good) solution which is allowed by such generalization

hierarchies. Here an optimal solution is a solution that satisﬁes the privacy requirement and at the same time

minimizes a desired cost metric. Based on the use of generalization hierarchies, the algorithms in this category

can be further classiﬁed into two subclasses. In the single-level generalization schemes [11, 19, 21], all the

values in a domain are generalized into a single level in the corresponding hierarchy. This restriction could

be a signiﬁcant drawback in that it may lead to relatively high data distortion due to excessive generalization.

The multi-level generalization [8, 9] schemes, on the other hand, allows values in a domain to be generalized

into different levels in the hierarchy. Although this leads to much more ﬂexible generalization, possible

generalizations are still limited by the imposed generalization hierarchies.

Another class of generalization schemes is the hierarchy-free generalization class [3, 2, 12, 4]. Hierarchy-

free generalization schemes do not rely on the notion of pre-deﬁned generalization hierarchies. In [3], Ba-

4

yardo et al. propose an algorithm based on a powerset search problem, where the space of anonymizations

(formulated as the powerset of totally ordered values in a dataset) is explored using a tree-search strategies.

In [2], Aggrawal et al. propose a clustering approach to achieve k-anonymity, which clusters records into

groups based on some distance metric. In [12], LeFevre et al. transform the k-anonymity problem into a

partitioning problem and propose a greedy approach that recursively splits a partition at the median value

until no more split is allowed with respect to the k-anonymity requirement. Byun et al.[4], on the other hand,

introduce a ﬂexible k-anonymization approach which uses the idea of clustering to minimize information loss

and thus ensures good data quality.

On the theoretic side, optimal k-anonymity has been proved to be NP-hard for k ≥ 3 [17]. Furthermore,

the curse of dimensionality also calls for more effective anonymization techniques, as shown in [1] that,

when the number of quasi-identiﬁer attributes is high, enforcing k-anonymity necessarily results in severe

information loss, even for k = 2.

Recently, Xiao and Tao [25] propose Anatomy, a data anonymization approach that divides one table into

two for release; one table includes original quasi-identiﬁer and a group id, and the other includes the associa-

tions between the group id and the sensitive attribute values. Koudas et al. [10] explore the permutation-based

anonymization approach and examine the anonymization problem from the perspective of answering down-

stream aggregate queries.

3 Problem Formulation

In this section, we start with an example to illustrate the problem of inference. We then describe our notion

of incremental dissemination and formally deﬁne a privacy requirement for it.

3.1 Motivating Examples

Let us revisit our previous scenario where a hospital is required to provide the anonymized version of its

patient records to a disease control agency. As previously discussed, to assure data quality, the hospital

anonymizes the patient table whenever it is augmented with new records. To make our example more con-

crete, suppose that the hospital relies on a model where both the k-anonymity and -diversity are considered;

therefore, a ‘(k, )-anonymous’ dataset is a dataset that satisﬁes both the k-anonymity and -diversity require-

ments. The hospital initially has a table like the one in Figure 1 and reports to the agency its (2, 2)-anonymous

table shown in Figure 2. As shown, the probability of identity disclosure (i.e., the association between indi-

vidual and record) and attribute disclosure (i.e., the association between individual and diagnosis) are kept

under 1/2 in the dataset, respectively. For example, even if an attacker knows that the record of Tom, who

is a 21-year-old male, is in the released table, he cannot learn about Tom’s disease with a probability greater

than 1/2 (although he learns that Tom has either asthma or ﬂu). At a later time, three more patient records

(shown in Italic) are inserted into the dataset, resulting the table in Figure 3. The hospital then releases a new

(2, 2)-anonymous table as depicted in Figure 4. Observe that Tom’s privacy is still protected in the newly

released dataset. However, not every patient’s privacy is protected from the attacker.

Example 1 “Alice has cancer!” Suppose the attacker knows that Alice, who is in her late twenties, has

recently been admitted to the hospital. Thus, he knows that Alice’s record is not in the old dataset in Figure 2,

but in the new dataset in Figure 4. From the new dataset, he learns only that Alice has one of ¦Asthma, Flu,

Cancer¦. However, by consulting the previous dataset, he can easily deduce that Alice has neither asthma nor

ﬂu (as they must belong to patients other than Alice). He now infers that Alice has cancer.

5

Example 2 “Bob has alzheimer!” The attacker knows that Bob is 52 years old and has long been treated in

the hospital. Thus, he is sure that Bob’s record is in both datasets in Figures 2 and 4. First, by studying the

old dataset, he learns that Bob suffers from either alzheimer or diabetes. Now the attacker checks the new

dataset and learns that Bob has either alzheimer or heart disease. He can thus conclude that Bob suffers from

alzheimer. Note that three other records in the new dataset are also vulnerable to similar inferences.

As shown in the examples above, anonymizing a dataset without considering previously released infor-

mation may enable various inferences.

3.2 Incremental data dissemination and privacy requirement

Let T be a private table with a set of quasi-identiﬁer attributes Q and a sensitive attribute S. We assume

that T consists of person-speciﬁc records, each of which corresponds to a unique real-world individual. We

also assume that T continuously grows with new records and denote the state of T at time i as T

i

. For the

privacy of individuals, each T

i

must be “properly” anonymized before being released to public. Our goal is

to address both identity disclosure and attribute disclosure, and we adopt an anonymity model

2

that combines

the requirements of k-anonymity and -diversity as follows.

Deﬁnition 1 ((k, c)-Anonymity) Let table T be with a set of quasi-identiﬁer attributes Q and a sensitive

attribute S. With respect to Q, T consists of a set of non-empty equivalence classes, where ∀ e ∈ T, record

r ∈ e ⇒ r[Q] = e[Q]. We say that T is (k, c)-anonymous with respect to Q if the following conditions are

satisﬁed.

1. ∀ e ∈ T, [e[ ≥ k, where k > 0.

2. ∀ e ∈ T, ∀ s ∈ S,

|{r|r∈e∧r[S]=s}|

|e|

≤ c, where 0 < c ≤ 1.

The ﬁrst condition ensures the k-anonymity requirement, and the second condition enforces the diversity

requirement in the sensitive attribute. In its essence, the second condition dictates that the maximum conﬁ-

dence of association between any quasi-identiﬁer value and a particular sensitive attribute value in T must

not exceed a threshold c.

At a given time i, only an (k, c)-anonymous version of T

i

, denoted as

T

i

, is released to public. Thus,

users, including potential attackers, may have access to a series of (k, c)-anonymous tables,

T

1

,

T

2

, . . ., where

[

T

i

[ ≤ [

T

j

[ for i < j. As every released table is (k, c)-anonymous, by observing each table independently, one

cannot associate a record with a particular individual with probability higher than 1/k or infer any individual’s

sensitive attribute with conﬁdence higher than c. However, as shown in Section 3.1, it is possible that one

can increase the conﬁdence of such undesirable inferences by observing the difference between the released

tables. For instance, if an observer can be sure that two (anonymized) records in two different versions indeed

correspond to the same individual, then he may be able to use this knowledge to infer more information than

what is allowed by the (k, c)-anonymity protection. If such a case occurs, we say that there is an inference

channel between the two versions.

Deﬁnition 2 (Cross-version inference channel) Let Θ = ¦

T

1

, . . . ,

T

n

¦ be the set of all released tables for

private table T, where

T

i

is an (k, c)-anonymous version released at time i, 1 ≤ i ≤ n. Let θ ⊆ Θ and

T

i

∈ Θ.

We say that there exists cross-version inference channel from θ to

T

i

, denoted as θ

T

i

, if observing tables

in θ and

T

i

collectively increases the risk of either identity disclosure or attribute disclosure in

T

i

higher than

1/k or c, respectively.

2

A similar model is also introduced in [24].

6

When data are disseminated incrementally, it is critical to ensure that there is no cross-version inference

channel among the released tables. In other words, the data provider must make sure that not only each

released table is free of undesirable inferences, but also no released table creates cross-version inference

channels with respect to the previously released tables. We formally deﬁne this requirement as follows.

Deﬁnition 3 (Privacy-preserving incremental data dissemination) Let Θ = ¦

T

0

, . . . ,

T

n

¦ be the set of all

released tables of private table T, where

T

i

is an (k, c)-anonymous version of T released at time i, 0 ≤ i ≤ n.

Θ is said to be privacy-preserving if and only if (θ,

T

i

) such that θ ⊆ Θ,

T

i

∈ Θ, and θ

T

i

.

4 Cross-version Inferences

We ﬁrst describe potential attackers and their knowledge that we assume in this paper. Then based on the

attack scenario, we identify three types of cross-version inference attacks in this section.

4.1 Attack scenario

We assume that the attacker has been keeping track of all the released tables; he thus possesses a set of

released tables ¦

T

0

, . . . ,

T

n

¦. We also assume that the attacker has the knowledge of who is and who is not

contained in each table; that is, for each anonymized table

T

i

, the attacker also possesses a population table

U

i

which contains the explicit identiﬁers and the quasi-identiﬁers of the individuals in

T

i

. This may seem to

be too farfetched at ﬁrst glance; however, we assume the worst case, as we cannot rely on attacker’s lack of

knowledge. Also, such knowledge is not always difﬁcult to acquire for a dedicated attacker. For instance,

consider medical records released by a hospital. Although the attacker may not be aware of all the patients,

he may know when target individuals in whom he is interested (e.g., local celebrities) are admitted to the

hospital. Based on this knowledge, the attacker can easily deduce which tables may include such individuals

and which tables may not. Another, perhaps the worst, possibility is that the attacker may collude with an

insider who has access to detailed information about the patients; e.g., the attacker could obtains a list of

patients from a registration staff

3

. Thus, it is reasonable to assume that the attacker’s knowledge includes

the list of individuals contained in each table as well as their quasi-identiﬁer values. However, as all the

released tables are (k, c)-anonymous, the attacker cannot infer the individuals’ sensitive attribute values with

a signiﬁcant probability, even utilizing such knowledge. Therefore, the goal of the attacker is to increase

his/her conﬁdence of attribute disclosure (i.e., above c) by comparing the released tables all together. In the

remainder of this section, we describe three types of cross-version inferences that the attacker may exploit in

order to achieve this goal.

4.2 Notations

We ﬁrst introduce some notations we use in our discussion. Let T be a table with a set of quasi-identifer

attributes Q and a sensitive attribute S. Let A be a set of attributes, where A ⊆ (Q∪ S). Then T[A] denotes

the duplicate-eliminating projection of T onto the attributes A. Let e

i

= ¦r

0

, . . . , r

m

¦ be an equivalence

class in T, where m > 0. By deﬁnition, the records in e

i

all share the same quasi-identiﬁer value, and e

i

[Q]

represents the common quasi-identiﬁer value of e

i

. We also use similar notations for individual records; that

is, for record r ∈ T, r[Q] represents the quasi-identiﬁer value of r and r[S] the sensitive attribute value of r.

3

Nowadays, about 80% of privacy infringement is committed by insiders.

7

In addition, we use T¸A) to denote the duplicate-preserving projection of T. For instance, e

i

¸S) represents

the multiset of all the sensitive attribute values in e

i

. We also use [^[ to denote the cardinalities of set ^.

Regardless of recoding schemes, we consider a generalized value as a set of possible values. Suppose

that v is a value from domain T and v a generalized value of v. Then we denote this relation as v _ v, and

interpret v as a set of values where (v ∈ v) ∧ (∀v

i

∈ v, v

i

∈ T). Overloading this notation, we say that r is

a generalized version of record r, denoted as r _ r, if (∀q

i

∈ Q, r[q

i

] _ r[q

i

]) ∧ (r[S] = r[S]). Moreover,

we say that two generalized values v

1

and v

2

are compatible, denoted as v

1

v

2

, if v

1

∩ v

2

,= ∅. Similarly,

two generalized records r

1

and r

2

are compatible (i.e., r

1

r

2

) if ∀q

i

∈ Q, r

i

[q

i

] ∩ r

j

[q

i

] ,= ∅. We also say

that two equivalence classes e

1

and e

2

are compatible if ∀q

i

∈ Q, e

1

[q

i

] ∩ e

2

[q

i

] ,= ∅

4.3 Difference attack

Let

T

i

= ¦e

0,1

, . . . , e

0,n

¦ and

T

j

= ¦e

1,1

, . . . , e

1,m

¦ be two (k, c)-anonymous tables that are released at

time i and j (i ,= j), respectively. As previously discussed in Section 4.1, we assume that an attacker knows

who is and who is not in each released table. Also knowing the quasi-identiﬁer values of the individuals in

T

i

and

T

j

, for any equivalence class e in either

T

i

or

T

j

, the attacker knows the individuals whose records

are contained in e. Let I(e) represent the set of individuals in e. With this information, the attacker can now

perform difference attacks as follows. Let E

i

and E

j

be two sets of equivalence classes, where E

i

⊆

T

i

and

E

j

⊆

T

j

. If ∪

e∈Ei

I(e) ⊆ ∪

e∈Ej

I(e), then set D = ∪

e∈Ej

I(e) −∪

e∈Ei

I(e) represents the set of individuals

whose records are in E

j

, but not in E

i

. Furthermore, set S

D

= ∪

e∈Ej

e¸S) −∪

e∈Ei

e¸S) indicates the set of

sensitive attribute values that belong to those individuals in D. Therefore, if D contains less than k records,

or if the most frequent sensitive value in S

D

appears with a probability larger than c, the (k, c)-anonymity

requirement is violated.

The DirectedCheck procedure in Figure 5 checks if the ﬁrst table of the input is vulnerable to difference

attack with respect to the second table. Two tables

T

i

and

T

j

are vulnerable to difference attack if at least one

of DirectedCheck(

T

i

,

T

j

) and DirectedCheck(

T

j

,

T

i

) returns true.

The DirectedCheck procedure enumerates all subsets of

T

i

, and for each set E, the procedure calls Get-

MinSets procedure in Figure 6 to get the minimum set E

**of equivalence classes in
**

T

j

that contains all the

records in E. We call such E

**the minimum covering set of E. The procedure then checks whether there is
**

vulnerability between the two equivalence classes E and E

**. As the algorithm checks all the subsets of
**

T

i

,

the time complexity is exponential (i.e, it is O(2

n

), where n is the number of equivalence classes in

T

i

). As

such, for large tables with many equivalence classes, this brute-force check is clearly unacceptable. In what

follows, we discuss a few observations that result in effective heuristics to reduce the space of the problem in

most cases.

Observation 1 Let E

1

and E

2

be two sets of equivalence classes in

T

i

, and E

1

and E

2

be their minimal

covering sets in

T

j

, respectively. If E

1

and E

2

are not vulnerable to difference attack, and E

1

∩E

2

= ∅, then

we do not need to consider any subset of

T

i

which contains E

1

∪ E

2

.

The fact that E

1

and E

2

are not vulnerable to difference attacks means that sets (E

1

−E

1

) and (E

2

−E

2

)

are both (k, c)-anonymous. As the minimum covering set of E

1

∪E

2

is E

1

∪E

2

, and E

1

and E

2

are disjoint,

(E

1

∪ E

2

) −(E

1

∪ E

2

) is also (k, c)-anonymous. This also implies that if each E

1

and E

2

is not vulnerable

to difference attack, then neither is any set containing E

1

∪E

2

. Based on this observation, we can modify the

method in Figure 5 as follows. In each time we union one more element to a subset to create larger subsets,

we check if their minimum covering sets are disjoint. If they are, we do not insert the unioned subset to the

queue. Note that this also prevents all the sets containing the unioned set from being generated.

8

Observation 2 Let E

1

and E

2

be two sets of equivalence classes in

T

i

, and E

1

and E

2

be their minimal

covering sets in

T

j

, respectively. If E

1

= E

2

, then we only need to check if E

1

∪ E

2

is vulnerable to

difference attack.

In other words, we can skip checking if each of E

1

and E

2

is vulnerable to difference attack. This is

because unless E

1

∪ E

2

is vulnerable to difference attack, E

1

and E

2

are not vulnerable. Thus, we can save

our computational effort as follows. When we insert a new subset into the queue, we check if there exists

another set with the same minimum covering set. If such a set is found, we simply merge the new subset with

the found set.

Observation 3 Consider the method in Figure 5. Suppose that

T

i

was released after

T

j

; that is,

T

i

contains

some records that are not in

T

j

. If equivalence class e ∈

T

j

contains all such records, then we do not need to

consider that equivalence class for difference attack.

It is easy to see that if e ∈

T

i

contains some record(s) that

T

j

do not, the minimum covering set of e

is an empty-set. Since e itself must be (k, c)-anonymous, e is safe from difference attack. Based on this

observation, we can purge all such equivalence classes from the initial problem set.

4.4 Intersection attack

The key idea of k-anonymity is to introduce sufﬁcient ambiguity into the association between quasi-identiﬁer

values and sensitive attribute values. However, this ambiguity may be reduced to an undesirable level if the

structure of equivalence classes are varied in different releases. For instance, suppose that the attacker wants

to know the sensitive attribute of Alice, whose quasi-identiﬁer value is q

A

. Then the attacker can select a

set of tables, θ

+

A

, that all contain Alice’s record. As the attacker knows the quasi-identiﬁer of Alice, he does

not need to examine all the records; he just needs to consider the records that may possibly correspond to

Alice. That is, in each

T

i

∈ θ

+

A

, the attacker only need to consider an equivalence class e

i

⊆

T

i

, where

q

A

_ e

i

[Q]. Let E

A

= ¦e

0

, . . . , e

n

¦ be the set of all equivalence classes identiﬁed from θ

+

A

such that

q

A

_ e

i

[Q], 0 ≤ i ≤ n. As every e

i

is (k, c)-anonymous, the attacker cannot infer Alice’s sensitive attribute

value with conﬁdence higher than c by examining each e

i

independently. However, as every equivalence class

in E

A

contains Alice’s record, the attacker knows that Alice’s sensitive attribute value, s

A

, must be present

in every equivalence class in E

A

; i.e., ∀e

i

∈ E

A

, s

A

∈ e

i

¸S). This implies that s

A

must be found in set

SI

A

=

ei∈EA

e

i

¸S). Therefore, if the most frequent value in SI

A

appears with a probability greater than

c, then the sensitive attribute value of Alice can be inferred with conﬁdence greater than c.

The pseudo-code in Figure 7 provides an algorithm for checking the vulnerability to the intersection

attack for given two (k, c)-anonymous tables,

T

0

and

T

1

. The basic idea is to check every pair of equivalence

classes e

i

∈

T

0

and e

j

∈

T

1

that contain the same record(s).

4.5 Record-tracing attack

Unlike the previous attacks, the attacker may be interested in knowing who may be associated with a par-

ticular attribute value. In other words, instead of wanting to know what sensitive attribute value a particular

individual has, the attacker now wants to know which individuals possess a speciﬁc sensitive attribute value;

e.g., the individuals who suffer from ‘HIV+’. Let s

p

be the sensitive attribute value in which the attacker is

interested and

T

i

∈ Θ be the table in which (at least) one record with sensitive value s

p

appears. Although

T

9

may contain more than one record with s

p

, suppose, for simplicity, that the attacker is interested in a partic-

ular record r

p

such that (r

p

[S] = s

p

) ∧ (r

p

∈ e

i

). As

T is (k, c)-anonymous, when the attacker queries the

population table U

i

with r

p

[Q], he obtains at least k individuals who may correspond to r

p

. Let I

p

be the set

of such individuals. Suppose that the attacker also possesses a subsequently released table

T

j

(i < j) which

includes r

p

. Note that in each of these tables the quasi-identiﬁer of r

p

may be generalized differently. This

means that if the attacker can identify from

T

j

the record corresponding to r

p

, then he may be able to learn

additional information about the quasi-identiﬁer of the individual corresponding to r

p

and possibly reduce

the size of I

p

. There are many cases where the attacker can identify r

p

in

T

j

. However, in order to illustrate

our point clearly, we show some simple cases in the following example.

Example 3 The attacker knows that r

p

must be contained in the equivalence class of

T

j

that is compatible

with r

p

[Q]. Suppose that there is only one compatible equivalence class, e

i+1

in

T

j

(see Figure 8 (i)). Then

the attacker can conﬁdently combine his knowledge on the quasi-identiﬁer of r

p

; i.e., r

p

[Q] ← r

p

[Q] ∩

e

i+1

[Q]. Suppose now that there are more than one compatible equivalence classes in

T

i+1

, say e

i+1

and

e

i+1

. If s

p

∈ e

i+1

[S] and s

p

/ ∈ e

i+1

[S], then the attacker can be sure that r

p

∈ e

i+1

and updates his

knowledge of r

p

[Q] as r

p

[Q] ∩ e

i+1

[Q]. However, if s

p

∈ e

i+1

[S] and s

p

∈ e

i+1

[S], then r

p

could be in

either e

i+1

and e

i+1

(see Figure 8 (ii)). Although the attacker may or may not determine which equivalence

class contains r

p

, he is sure that r

p

∈ e

i+1

∪ e

i+1

; therefore, r

p

[Q] ← r

p

[Q] ∩ (e

i+1

[Q] ∪ e

i+1

[Q]).

After updating r

p

[Q] with

T

j

, the attacker can reexamine I

p

and eliminate individuals whose quasi-

identiﬁers are no longer compatible with the updated r

p

[Q]. When the size of I

p

becomes less than k, the

attacker can infer the association between the individuals in I

p

and r

p

with a probability higher than 1/k.

In the above example, when there are more than one compatible equivalence classes ¦e

i+1,1

, ..., e

i+1,r

¦ in

T

i+1

, we say that the attacker updates r

p

[Q] as r

p

[Q] ∩(∪

1≤j≤r

e

i+1,j

). While correct, this is not a sufﬁcient

description of what the attacker can do, as there are cases where the attacker can notice that some equivalence

classes in

T

i+1

cannot contain r

p

. For example, let r

1

∈ e

i,1

and r

2

∈ e

i,2

be two records in

T

i

, both

taking s

r

as the sensitive value (see Figure 9(i)). Suppose that

T

i+1

contains a single equivalence class e

i+1,1

that is compatible to r

1

and two compatible equivalence classes e

i+1,1

and e

i+1,2

that are compatible to r

2

.

Although r

2

has two compatible equivalence classes, the attacker can be sure that r

2

is included in e

i+1,2

, as

the record with s

r

in e

i+1,1

must correspond to r

1

. Figure 9(ii) illustrates another case of which the attacker

can take advantage. As shown, there are two records in e

i,1

that take s

r

as the sensitive value. Although the

attacker cannot be sure that each of these records is contained in e

i+1,1

or e

i+1,2

, he is sure that one record is

in e

i+1,1

and the other in e

i+1,2

. Thus, he can make an arbitrary choice and update his knowledge about the

quasi-identiﬁers of the two records accordingly. Using such techniques, the attacker can make more precise

inference by eliminating equivalence classes in

T

i+1

that are impossible to contain r

r

.

We nowdescribe a more thorough algorithmthat checks two (k, c)-anonymous tables for the vulnerability

to the record-tracing attack. First, we construct a bipartite graph G = (V, E), where V = V

1

∪ V

2

and each

vertex in V

1

represents a record in

T

i

, and each vertex in V

2

represents a record in

T

i+1

that is compatible

with at least one record in

T

i

. We deﬁne E as the set of edges from vertices in V

1

to vertices in V

2

, which

represents possible matching relationships. That is, if there is an edge from r

i

∈ V

1

to r

j

∈ V

2

, this means

that records r

i

and r

j

may both correspond to the same record although they are generalized into different

forms. We create such edges between V

1

and V

2

as follows. For each vertex r ∈ V

1

, we ﬁnd from V

2

the set

of records R where ∀r

i

∈ R, (r[Q] r

i

[Q]) ∧ (r[S] = r

i

[S]). If [R[ = 1 and r

∈ E, then we create an

edge from r to r

and mark it with ¸d), which indicates that r deﬁnitely corresponds to r

. If [R[ > 1, then

we create an edge from r and every r

i

∈ R and mark it with ¸p) to indicate that r plausibly corresponds to

10

r

i

. Now given the constructed bipartite graph, the pseudo-code in Figure 10 removes plausible edges that are

not feasible and discovers more deﬁnite edges by scanning through the edges.

Note that the algorithm above does not handle the case illustrated in Figure 9(ii). In order to address

such cases, we also performs the following. For each equivalence class e

1,i

∈

T

i

, we ﬁnd from

T

j

the set of

equivalence classes E where ∀e

2,j

∈ E, e

1,i

[Q] e

2,j

[Q]. If the same number of records with any sensitive

value s appear in both e

1,i

and E, we remove unnecessary plausible edges such that each of such records in

e

1,i

has a deﬁnite edge to a distinct record in E.

After all infeasible edges are removed, each record r

1,i

∈ V

1

is associated with a set of possibly matching

records ¦r

2,j

, . . . , r

2,m

¦ (j ≤ m) in V

2

. Now we can follow the edges and compute for each record r

1,i

∈

T

i

the inferrable quasi-identiﬁer r

1,i

[Q] = r

1,i

[Q] ∩ (

=j,...,m

r

2,

[Q]). If any inferred quasi-identifer maps

to less than k individuals in the population table U

i

, then table

T

i

is vulnerable to the record-tracing attack

with respect to

T

j

.

It is worth noting that the key problem enabling the record-tracing attack arises from the fact that the

sensitive attribute value of a record, together with its generalized quasi-identiﬁer, may uniquely identify the

record in different anonymous tables. This issue can be especially critical for records with rare sensitive

attribute values (e.g., rare diseases) or tables where every individual has a unique sensitive attribute value

(e.g., DNA sequence).

5 Inference Prevention

In this section, we describe our incremental data anonymization which incorporates the inference detection

techniques in the previous section. We ﬁrst describe our data/history management strategy which aims to

reduce the computational overheads. Then, we describe the properties of our checking algorithms which

make them suitable for existing data anonymization techniques such as full-domain generalization [11] and

multidimensional anonymization [12].

5.1 Data/history management

Consider a newly anonymized table,

T

i

, which is about to be released. In order to check whether

T

i

is vul-

nerable to cross-version inferences, it is essential to maintain some form of history about previously released

datasets, Θ = ¦

T

0

, . . . ,

T

i−1

¦. However, checking the vulnerability in

T

i

against each table in Θ can be

computationally expensive. To avoid such inefﬁciency, we maintain a history table, H

i

at time i, which has

the following attributes.

• RID : is a unique record ID (or the explicit identiﬁer of the corresponding individual). Assuming that

each

T

i

also contains RID (which is projected out before being released), RID is used to join H

i

and

T

i

.

• TS (Time Stamp) : represents the time (or the version number) when the record is ﬁrst released.

• IS (Inferable Sensitive values) : stores the set of sensitive attribute values with which the record can be

associated. For instance, if record r is released in equivalence class e

i

of

T

i

, then r[IS]

i

← (r[IS]

i−1

∩

e

i

¸S)). This ﬁeld is used for checking vulnerability to intersection attack.

• IQ(Inferable Quasi-identiﬁer) : keeps track of the quasi-identiﬁers into which the record has previously

been generalized. For instance, for record r ∈

T

i

, r[IQ]

i

← r[IQ]

i−1

∩ r[Q]. This ﬁeld is used for

checking vulnerability to record-tracing attack.

11

The main idea of H

i

is to keep track of the attacker’s accumulated knowledge on each released record.

For instance, value r[IS] of record r ∈ H

i−1

indicates the set of sensitive attribute values that the attacker

may be able to associate with r prior to the release of

T

i

. This is indeed the worst case as we are assuming

that the attacker possesses every released table, i.e., Θ. However, as discussed in Section 4.1, we need to be

conservative when estimating what the attacker can do. Using H, the cost of checking

T

i

for vulnerability

can be signiﬁcantly reduced; for intersection and record-tracing attacks, we check

T

i

against H

i−1

, instead

of every

T

j

∈ Θ

4

.

5.2 Incorporating inference detection into data anonymization

We now discuss how to incorporate the inference detection algorithms into secure anonymization algorithms.

We ﬁrst consider the full-domain anonymization, where all values of an attribute are consistently generalized

to the same level in the predeﬁned generalization hierarchy. In [11], LeFevre et al. propose an algorithm that

ﬁnds minimal generalizations for a given table. In its essence, the proposed algorithm is a bottom-up search

approach in that it starts with un-generalized data and tries to ﬁnd minimal generalizations by increasingly

generalizing the target data in each step. The key property on which the algorithm relies is the generalization

property: given a table T and two generalization strategies G

1

, G

2

(G

1

_ G

2

), if G

1

(T) is k-anonymous,

then G

2

(T) is also k-anonymous

5

. Although intuitive, this property is critical as it guarantees the optimality

to the discovered solutions; i.e., once the search ﬁnds a generalization level that satisﬁes the k-anonymity

requirement, we do not need to search further.

Observation 4 Given a table T and two generalization strategies G

1

, G

2

(G

1

_ G

2

), if G

1

(T) is not

vulnerable to any inference attack, then neither is G

2

(T).

The proof is simple. As each equivalence class in G

2

(T) is the union of one or more equivalence classes

in G

1

(T), the information about each record in G

2

(T) is more vague than that in G

1

(T); thus, G

2

does not

create more inference attacks than G

1

. Based on this observation, we modify the algorithmin [11] as follows.

In each step of generalization, in addition to checking the (k, c)-anonymity requirement, we also check for

the vulnerability to inference. If either check fails, then we need to further generalize the data.

Next, we consider the multidimensional k-anonymity algorithm proposed in [12]. Speciﬁcally, the algo-

rithm consists of the following two steps. The ﬁrst step is to ﬁnd a partitioning scheme of the d-dimensional

space, where d is the number of attributes in the quasi-identiﬁer, such that each partition contains more than

k records. In order to ﬁnd such a partitioning, the algorithm recursively splits a partition at the median value

(of a selected dimension) until no more split is allowed with respect to the k-anonymity requirement. Note

that unlike the previous algorithm, this algorithm is a top-down search approach, and the quality of the search

relies on the following property

6

: given a partition p, if p does not satisfy the k-anonymity requirement, then

any sub-partition of p does not satisfy the requirement.

Observation 5 Given a partition p of records, if p is vulnerable to any inference attack, then so is any sub-

partition of p.

Suppose that we have a partition p

1

of the dataset, in which some records are vulnerable to inference

attacks. Then, any further cut of p

1

will lead to a dataset that is also vulnerable to inference attacks. This

4

In our current implementation, difference attack is still checked against every previously released table.

5

This property is also used in [16] for -diversity and is thus applicable for (k, c)-anonymity.

6

It is easy to see that the property also holds for any diversity requirement.

12

is based on the fact that any further cut on p

1

leads to de-generalization of the dataset; thus, it reveals more

information about each record than p

1

. Based on this observation, we modify the algorithm in [12] as follow.

In each step of partition, in addition to checking the (k, c)-anonymity requirement, we also check for the

vulnerability to inference. If either check fails, then we do not need to further partition the data.

6 Experiments

The main goal of our experiments is to show that our approach effectively prevents the previously discussed

inference attacks when data is incrementally disseminated. We also show that our approach produces datasets

with good data quality. We ﬁrst describe our experimental settings and then report our experimental results.

6.1 Experimental Setup

6.1.1 Experimental Environment

The experiments were performed on a 2.66 GHz Intel IV processor machine with 1 GB of RAM. The oper-

ating system on the machine was Microsoft Windows XP Professional Edition, and the implementation was

built and run in Java 2 Platform, Standard Edition 5.0. For our experiments, we used the Adult dataset from

the UC Irvine Machine Learning Repository [18], which is considered a de facto benchmark for evaluating

the performance of anonymization algorithms. Before the experiments, the Adult data set was prepared as

described in [3, 9, 12]. We removed records with missing values and retained only nine of the original at-

tributes. In our experiments, we considered ¦age, work class, education, marital status, race, gender, native

country, salary¦ as the quasi-identiﬁer, and occupation attribute as the sensitive attribute.

6.1.2 Data quality metrics

The quality of generalized data has been measured by various metric. In our experiment, we measure the

data quality mainly based on Average Information Loss (AIL, for short) metric proposed in [4]. The basic

idea of AIL metric is that the amount of generalization is equivalent to the expansion of each equivalence

class (i.e., the geometrical size of each partition). Note that as all the records in an equivalence class are

modiﬁed to share the same quasi-identifer, each region indeed represents the generalized quasi-identiﬁer of

the records contained in it. Thus, data distortion can be measured naturally by the size of the region covered

by each equivalence class. Following this idea, Information Loss (IL for short) measures the amount of data

distortion in an equivalence class as follows.

Deﬁnition 4 (Information loss) [4] Let e=¦r

1

, . . . , r

n

¦ be an equivalence class where Q=¦a

1

, . . . , a

m

¦.

Then the amount of data distortion occurred by generalizing e, denoted by AIL(e), is deﬁned as:

AIL(e) = [e[

j=1,...,m

|Gj|

|Dj|

where [e[ is the number of records in e, and [D

j

[ the domain size of attribute a

j

. [G

j

[ represents the amount of

generalization in attribute a

j

(e.g., the length of the shortest interval which contains all the a

j

values existing

in e).

Based on IL, the AIL of a given table

T is computed as: AIL(

T) = (

e∈

T

IL(e)) / [T[. The key

advantage of AIL metric is that it precisely measures the amount of generalization (or vagueness of data),

while being independent from the underlying generalization scheme (e.g, anonymization technique used or

generalization hierarchies assumed). For the same reason, we also use the Discernibility Metric (DM) [3] as

13

another quality measure in our experiment. Intuitively, DM measures the quality of anonymized data based

on the size of the equivalence classes, which indicates how much records are indistinguishable from each

other.

6.2 Experimental Results

We ﬁrst measured how many records were vulnerable in statically anonymized datasets with respect to the

inference attacks we discussed. For this, we modiﬁed two k-anonymization algorithms, Incognito [11] and

Mondrian [12], and used them as our static (k, c)-anonymization algorithms. Using these algorithms, we ﬁrst

anonymized 5K records and obtained the ﬁrst “published” datasets. We then generated ﬁve more subsequent

datasets by adding 5K more records each time. Then we used our vulnerability detection algorithms to

count the number of records among these datasets that are vulnerable to each of inference attack. Figure 11

shows the result. As shown, much more records were found to be vulnerable in the datasets anonymized by

Mondrian. This is indeed unsurprising, as Mondrian, taking a multidimensional approach, produces datasets

with much less generalization. In fact, for Incognito, even the initial dataset was highly generalized. This

clearly illustrates the unfortunate reality; that is, the more precise data are, the more vulnerable they are to

undesirable inferences.

The next step was to investigate how effectively our approach would work with a real dataset. The main

focus was its effect on the data quality. As previously discussed, in order to prevent undesirable inferences,

one needs to hide more information. In our case, it means that the given data must be generalized until there

is no vulnerability to any type of inference attack. We modiﬁed the static (k, c)-anonymization algorithms as

discussed in Section 5 and obtained our inf-checked (k, c)-anonymization algorithms. Note that although we

implemented the full-featured algorithms for difference and intersection attacks, we took a simple approach

for record-tracing attack. That is, we considered all the edges without removing infeasible/unnecessary edges

as discussed in Section 4.5. We also implemented a merge approach where we anonymize each dataset

independently and merge it with the previously released dataset. Although this approach is secure from any

type of inference attacks, we expected that the data quality would be the worst, as merging would inevitably

have a signiﬁcant effect on generalization (recoding) scheme.

With these algorithms as well as the static anonymization algorithms, we repeated our experiment. As

before, we started with 5K records and increased the dataset by 5K each time. We then checked the vulnera-

bility and measured the data quality of such datasets. We measured the data quality both with AIL and DM,

and the results are illustrated in Figures 12 and 13, respectively. It is clear that in terms of data quality the

inf-checked algorithm is much superior than the merge algorithm. Although the static algorithms produced

the best quality datasets, these data are vulnerable to inference attacks as previously shown. The datasets

generated by our inf checked algorithm and the merge algorithm were not vulnerable to any type of inference

attack.

We also note that the quality of datasets generated by the inf-checked algorithm is not optimal. This was

mainly due to the complexity of checking for difference attack. Even though our heuristics to reduce the

size of subsets (see Section 4.3) were highly effective in most cases, there were some cases where the size of

subsets grew explosively. As such cases not only caused lengthy execution times, they caused memory blow-

ups. In order to avoid such cases, we set an upper limit threshold for the size of subsets in this experiment.

For example, while our modiﬁed algorithm of Incognito is processing a node in the generalization lattice, if

the size of subsets needed to be checked exceeds the threshold, we stop the iteration and consider the node as

a vulnerable node. Similarly, when we encounter such a case while considering a split in Mondrian, we stop

the check and do not consider the split. Note that this approach does not affect the security of data, although

14

it may negatively affect the overall data quality. Even if the optimality cannot be guaranteed, we believe that

the data quality seems to be still acceptable, considering the results shown in Figures 12 and 13.

Another important comparison was the computational efﬁciency of these algorithms. Figure 14 shows

our experimental result for each algorithm. The merge algorithm is highly efﬁcient with respect to execution

time (although it was very inefﬁcient with respect to data quality). As the merge algorithm anonymizes the

same sized dataset each time and merging datasets can be done very quickly, the execution time is closely

constant. While equipped the heuristics and the data structure discussed in Sections 4.3 and 5.1, the inf-

checked algorithm is still slow. However, considering the previously discussed results, we believe that this is

the price you have to pay for better data quality and reliable privacy. Also, when compared to our previous

implementation without any heuristics, this is a very promising result.

7 Related Work

The problem of information disclosure [6] has been studied extensively in the framework of statistical

databases. A number of information disclosure limitation techniques [7] have been designed for data pub-

lishing, including Sampling, Cell Suppression, Rounding, and Data Swapping and Perturbation. These tech-

niques, however, compromised data integrity of the tables. Samarati and Sweeney [20, 22, 21] introduced the

k-anonymity approach and used generalization and suppression techniques to preserve information truthful-

ness. A number of static anonymization algorithms [3, 8, 9, 11, 12, 14, 21] have been proposed to achieve

k-anonymity requirement. Optimal k-anonymity has been proved to be NP-hard for k ≥ 3 [17].

While static anonymization has been extensively investigated in the past few years, only a few approaches

address the problem of anonymization in dynamic environments. In [22], Sweeney identiﬁed possible infer-

ences when new records are inserted and suggested two simple solutions. The ﬁrst solution is that once

records in a dataset are anonymized and released, in any subsequent release of the dataset, the records must

be the same or more generalized. As previously mentioned, this approach may suffer from unnecessarily low

data quality. Also, this approach cannot protect newly inserted records from difference attack, as discussed

in Section 4. The other solution suggested is that once a dataset is released, all released attributes (includ-

ing sensitive attributes) must be treated as the quasi-identiﬁer in subsequent releases. This approach seems

reasonable as it may effectively prevent linking between records. However, this approach has a signiﬁcant

drawback in that every equivalence class will inevitable have a homogeneous sensitive attribute value; thus,

this approach cannot adequately control the risk of attribute disclosure.

Yao et al. [28] addressed the inference issue when a single table is released in the form of multiple views.

They proposed several methods to check whether or not a given set of views violates the k-anonymity require-

ment collectively. However, they did not address how to deal with such violations. Wang and Fung [23] fur-

ther investigated this issue and proposed a top-down specialization approach to prevent record-linking across

multiple anonymous tables. However, their work focuses on the horizontal growth of databases (i.e., addition

of new attributes), and does not address vertically-growing databases where records are inserted. Recently,

Xiao and Tao [27] proposed a new generalization principle m-invariance for dynamic dataset publishing.

The m-invariance principle requires that each equivalence class in every release contains distinct sensitive

attribute values and for each tuple, all equivalence classes containing that tuple have exactly the same set

of sensitive attribute values. They also introduced the counterfeit generalization technique to achieve the

m-invariance requirement.

In [5], we presented a preliminary limited investigation concerning the inference problem of dynamic

anonymization with respect to incremental datasets. We identiﬁed some inferences and also proposed an

approach where new records are directly inserted to the previously anonymized dataset for computational

15

efﬁciency. However, compared to this current work, our previous work has several limitations.The key differ-

ences of this work with respect to [5] are as follows. In [5], we focused only on the inference enabling sets that

may exist between two tables, while in this work we consider more robust and systematic inference attacks in

a collection of released tables. The inference attacks discussed in this work subsume attacks using inference

enabling sets and address more sophisticated inferences. For instance, our study of the record-tracing attack is

a new contribution in this work. We also provide a detailed descriptions of attacks and algorithms for detect-

ing them. Our previous approach was also limited to the multidimensional generalization. By contrast, our

current approach considers and is applicable to both the full-domain and multidimensional approaches; there-

fore it can be combined with a large variety of anonymization algorithms. In this paper we also address the

issue of computational costs in detecting possible inferences. We discuss various heuristics to signiﬁcantly

reduce the search space, and also suggest a scheme to store the history (of previously released tables).

8 Conclusions

In this paper, we discussed inference attacks against the anonymization of incremental data. In particular,

we discussed three basic types of cross-version inference attacks and presented algorithms for detecting each

attack. We also presented some heuristics to address the efﬁciency of our algorithms. Based on these ideas,

we developed secure anonymization algorithms for incremental datasets using two existing anonymization

algorithms. We also empirically evaluated our approach by comparing to other approaches. Our experimental

result showed that our approach outperformed other approaches in terms of privacy and data quality.

For the future work, we are working on essential properties (e.g, correctness) of our methods and analysis.

Another interesting direction for the further work is to see if there are other types of inferences. For instance,

one can devise an attack where more than one type of inference are jointly utilized. We also plan to investigate

inference issues in more dynamic environments where deletions and updates of records are allowed.

References

[1] C. Aggarwal. On k-anonymity and the curse of dimensionality. In Proceedings of the International Con-

ference on Very Large Data Bases (VLDB), 2005.

[2] G. Aggrawal, T. Feder, K. Kenthapadi, S. Khuller, R. Panigrahy, D. Thomas, and A. Zhu. Achieving

anonymity via clustering. In Proceedings of the ACM Symposium on Principles of Database Systems

(PODS), 2006.

[3] R. J. Bayardo and R. Agrawal. Data privacy through optimal k-anonymization. In Proceedings of the

International Conference on Data Engineering (ICDE), 2005.

[4] J.-W. Byun, A. Kamra, E. Bertino, and N. Li. Efﬁcient k-anonymization using clustering techniques. In

Proceedings of the Intenational Conference on Database Systems for Advanced Applications(DASFAA),

2007.

[5] J.-W. Byun, Y. Sohn, E. Bertino, and N. Li. Secure anonymization for incremental datasets. In Secure

Data Management, 2006.

[6] D. Denning. Cryptography and Data Security. Addison-Wesley Longman Publishing Co., Inc., 1982.

16

[7] G. T. Duncan, S. E. Fienberg, R. Krishnan, R. Padman, and S. F. Roehrig. Disclosure limitation methods

and information loss for tabular data. In Conﬁdentiality, Disclosure and Data Access: Theory and Practical

Applications for Statistical Agencies, pages 135166. Elsevier, 2001.

[8] B. C. M. Fung, K. Wang, and P. S. Yu. Top-down specialization for information and privacy preservation.

In Proceedings of the International Conference on Data Engineering (ICDE), 2005.

[9] V. S. Iyengar. Transforming data to satisfy privacy constraints. In Proceedings of the ACM SIGKDD

International Conference on Knowledge Discovery and Data Mining (KDD), 2002.

[10] N. Koudas, D. Srivastava, T. Yu, and Q. Zhang. Aggregate query answering on anonymized tables. In

Proceedings of the International Conference on Data Engineering (ICDE), 2007.

[11] K. LeFevre, D. DeWitt, and R. Ramakrishnan. Incognito: Efﬁcient full-domain k-anonymity. In Pro-

ceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), 2005.

[12] K. LeFevre, D. DeWitt, and R. Ramakrishnan. Mondrian multidimensional k-anonymity. In Proceed-

ings of the International Conference on Data Engineering (ICDE), 2006.

[13] N. Li, T. Li, and S. Venkatasubramanian. t-closeness: Privacy beyond k-anonymity and -diversity. In

Proceedings of the International Conference on Data Engineering (ICDE), 2007.

[14] T. Li and N. Li. Towards optimal k-anonymization. Data & Knowledge Engineering, 2008.

[15] T. Li and N. Li. Injector: Mining Background Knowledge for Data Anonymization. In Proceedings of

the International Conference on Data Engineering (ICDE), 2008.

[16] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam. ‘-diversity: Privacy beyond

k-anonymity. In Proceedings of the International Conference on Data Engineering (ICDE), 2005.

[17] A. Meyerson and R. Williams. On the complexity of optimal k-anonymity. In Proceedings of the ACM

Symposium on Principles of Database Systems (PODS), 2004.

[18] C. B. S. Hettich and C. Merz. UCI repository of machine learning databases, 1998.

[19] P. Samarati. Protecting respondents privacy in microdata release. IEEE Transactions on Knowledge and

Data Engineering, 13, 2001.

[20] P. Samarati and L. Sweeney. Protecting privacy when disclosing information: k-anonymity and its en-

forcement through generalization and suppression. Technical Report SRI-CSL-98-04, 1998.

[21] L. Sweeney. Achieving k-anonymity privacy protection using generalization and suppression. Interna-

tional Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 2002.

[22] L. Sweeney. K-anonymity: A model for protecting privacy. International Journal on Uncertainty,

Fuzziness and Knowledge-based Systems, 2002.

[23] K. Wang and B. C. M. Fung. Anonymizing sequential releases. In Proceedings of the ACM SIGKDD

International Conference on Knowledge Discovery and Data Mining (KDD), pages 414423, 2006.

17

[24] R. C.-W. Wong, J. Li, A. W.-C. Fu, and K. Wang. (α,k)-anonymity: an enhanced k-anonymity model

for privacy preserving data publishing. In Proceedings of the ACM SIGKDD International Conference on

Knowledge Discovery and Data Mining (KDD), pages 754759, 2006.

[25] X. Xiao and Y. Tao. Anatomy: simple and effective privacy preservation. In Proceedings of the Interna-

tional Conference on Very Large Data Bases (VLDB), pages 139150, 2006.

[26] X. Xiao and Y. Tao. Personalized privacy preservation. In Proceedings of the ACM SIGMOD Interna-

tional Conference on Management of Data (SIGMOD), pages 229240, 2006.

[27] X. Xiao and Y. Tao. M-invariance: towards privacy preserving re-publication of dynamic datasets. In

Proceedings of the ACMSIGMOD International Conference onManagement of Data (SIGMOD), pages

689700, 2007.

[28] C. Yao, X. S. Wang, and S. Jajodia. Checking for k-anonymity violation by views. In Proceedings of

the International Conference on Very Large Data Bases (VLDB), 2005.

18

NAME AGE Gender Diagnosis

Tom 21 Male Asthma

Mike 23 Male Flu

Bob 52 Male Alzheimer

Eve 57 Female Diabetes

Figure 1: Patient table

AGE Gender Diagnosis

[21 −25] Male Asthma

[21 −25] Male Flu

[50 −60] Person Alzheimer

[50 −60] Person Diabetes

Figure 2: Anonymous patient table

NAME AGE Gender Diagnosis

Tom 21 Male Asthma

Mike 23 Male Flu

Bob 52 Male Alzheimer

Eve 57 Female Diabetes

Alice 27 Female Cancer

Hank 53 Male Hepatitis

Sal 59 Female Flu

Figure 3: Updated patient table

AGE Gender Diagnosis

[21 −30] Person Asthma

[21 −30] Person Flu

[21 −30] Person Cancer

[51 −55] Male Alzheimer

[51 −55] Male Hepatitis

[56 −60] Female Flu

[56 −60] Female Diabetes

Figure 4: Updated anonymous pa-

tient table

19

DirectedCheck

Input: Two (k, c)-anonymous tables

T

i

and

T

j

, where

T

i

= ¦e

i,1

, . . . , e

i,m

¦ and

T

j

=

¦e

j,1

, . . . , e

j,n

¦

Output: true if

T

i

is vulnerable to difference attack with respect to

T

j

and false otherwise

Q = ¦(∅, 0)¦

while Q ,= ∅

Remove the ﬁrst element p = (E, index) from Q

if E ,= ∅

E

← GetMinSet(E,

T

j

)

if [E

**−E[ < k, return true
**

else if (E

**¸S) −E¸S)) does not satisfy c-diversity, return true
**

for each ∈ ¦index + 1, . . . , m¦ // generates subsets with size [E[ + 1

insert (E ∪ e

i,

, ) into Q

return false

Figure 5: Algorithm for checking if one table is vulnerable to difference attack with respect to another table

20

GetMinSet

Input: An equivalence class set E and a table

T

Output: An equivalence class set E

⊆

T that is minimal and contains all records in E

E

= ∅

while E ,⊆ E

choose a tuple t in E that is not in E

**ﬁnd the equivalence class e ⊆
**

T that contains t

E

= E

∪ ¦e¦

return E

Figure 6: Algorithm for obtaining a minimum equivalence class set that contains all the records in the given

equivalence class set

21

Check Intersection Attack

Input: Two (k, c)-anonymous tables

T

0

and

T

1

Output: true if the two tables are vulnerable to intersection attack and false otherwise

for each equivalence class e

0,i

in

T

0

for each e

1,j

∈

T

1

that contains any record in e

0,i

if (e

0,i

¸S) ∩ e

1,j

¸S)) does not satisfy c-diversity, return true

return false

Figure 7: Algorithm for checking if given two tables are vulnerable to intersection attack

22

Ti Ti+1 Tn

ei,1

e i+1,2

ei+1,1

e i+1,3

e i+1,4

ei,2

ei,3

en,1

en,2

en,3

en,5

en,6 sp

sp sp

en,4

sp

sp sp

sp sp

e i+1,5

(i)

(ii)

. . .

Figure 8: Record-tracing attacks

e

i,1

e

i+1,2

e

i+1,1

e

i,2

s

r

s

r

(i)

s

r

s

r

e

i,1

e

i+1,2

e

i+1,1

s

r

s

r

(ii)

s

r

s

r

Figure 9: More inference in record-tracing

23

Remove Infeasible Edges

Input: A bipartite graph G = (V, E) where V = V

1

∪ V

2

and E is a set of edges representing

possible matching relationships.

Output: A bipartite graph G

= (V, E

) where E

**⊂ E with infeasible edges removed
**

E

= E

while true

change1 ←false, change2 ←false

for each r

j

∈ V

2

e ←all the incoming edges of r

j

if e contains both deﬁnite and plausible edge(s)

remove all plausible edges in e from E

, change1 ←true

if change1 = true

for each r

i

∈ V

1

e ←all the outgoing edges of r

i

if e contains only a single plausible edge

mark the edge in e as deﬁnite, change2 ←true

if change2 = false, break

return (V, E

)

Figure 10: Algorithm for removing infeasible edges from a bipartite graph

24

0

10

20

30

40

50

0 5 10 15 20 25 30 35

N

u

m

b

e

r

o

f

V

u

l

n

e

r

a

b

l

e

R

e

c

o

r

d

s

Table Size (unit = 1,000)

Static Incognito (k=5, c=0.7)

Difference

Intersection

R-Tracing

0

1000

2000

3000

4000

5000

0 5 10 15 20 25 30 35

N

u

m

b

e

r

o

f

V

u

l

n

e

r

a

b

l

e

R

e

c

o

r

d

s

Table Size (unit = 1,000)

Static Mondrian (k=5, c=0.7)

Difference

Intersection

R-Tracing

Figure 11: Vulnerability to Inference Attacks

25

3

4

5

6

7

8

9

0 5 10 15 20 25 30 35

A

v

e

r

a

g

e

I

n

f

o

r

m

a

t

i

o

n

L

o

s

s

Table Size (unit = 1,000)

Incognito (k=5, c=0.7)

Static

Merge

Inf_Checked

0

1

2

3

4

5

6

7

0 5 10 15 20 25 30 35

A

v

e

r

a

g

e

I

n

f

o

r

m

a

t

i

o

n

L

o

s

s

Table Size (unit = 1,000)

Mondrian (k=5, c=0.7)

Static

Merge

Inf_Checked

0

1

2

3

4

5

6

7

0 5 10 15 20 25 30 35

A

v

e

r

a

g

e

I

n

f

o

r

m

a

t

i

o

n

L

o

s

s

Table Size (unit = 1,000)

Mondrian (k=5, c=0.7)

Static

Merge

Inf_Checked

Figure 12: Data Quality: Average Information Loss

26

0

20

40

60

80

100

120

0 5 10 15 20 25 30 35

D

i

s

c

e

r

n

a

b

i

l

i

t

y

M

e

t

r

i

c

(

u

n

i

t

=

1

M

)

Table Size (unit = 1,000)

Incognito (k=5, c=0.7)

Static

Merge

Inf_Checked

0

20

40

60

80

100

0 5 10 15 20 25 30 35

D

i

s

c

e

r

n

a

b

i

l

i

t

y

M

e

t

r

i

c

(

u

n

i

t

=

1

M

)

Table Size (unit = 1,000)

Mondrian (k=5, c=0.7)

Static

Merge

Inf_Checked

0

20

40

60

80

100

0 5 10 15 20 25 30 35

D

i

s

c

e

r

n

a

b

i

l

i

t

y

M

e

t

r

i

c

(

u

n

i

t

=

1

M

)

Table Size (unit = 1,000)

Mondrian (k=5, c=0.7)

Static

Merge

Inf_Checked

Figure 13: Data Quality: Discernibility Metric

27

0

100

200

300

400

500

0 5 10 15 20 25 30 35

E

x

e

c

u

t

i

o

n

T

i

m

e

(

u

n

i

t

=

s

e

c

)

Table Size (unit = 1,000)

Incognito (k=5, c=0.7)

Static

Merge

Inf_Checked

0

500

1000

1500

2000

0 5 10 15 20 25 30 35

E

x

e

c

u

t

i

o

n

T

i

m

e

(

u

n

i

t

=

s

e

c

)

Table Size (unit = 1,000)

Mondrian (k=5, c=0.7)

Static

Merge

Inf_Checked

Figure 14: Execution Time

28

Although these models have yielded a number of valuable privacy-protecting techniques [3, 8, 9, 11, 12, 21], existing approaches only deal with static data release. That is, all these approaches assume that a complete dataset is available at the time of data release. This assumption implies a signiﬁcant shortcoming, as in many applications data collection is rather a continuous process. Moreover, the assumption entails “one-time” data dissemination. Obviously, this does not address today’s strong demand for immediate and up-to-date information, as the data cannot be released before the data collection is considered complete. As a simple example, suppose that a hospital is required to share its patient records with a disease control agency. In order to protect patients’ privacy, the hospital anonymizes all the records prior to sharing. At ﬁrst glance, the task seems reasonably straightforward, as existing anonymization techniques can efﬁciently anonymize the records. The challenge is, however, that new records are continuously collected by the hospital (e.g., whenever new patients are admitted), and it is critical for the agency to receive up-to-date data in timely manner. One possible approach is to provide the agency with datasets containing only the new records, which are independently anonymized, on a regular basis. Then the agency can either study each dataset independently or merge multiple datasets together for more comprehensive analysis. Although straightforward, this approach may suffer from severely low data quality. The key problem is that relatively small sets of records are anonymized independently so that the records may have to be modiﬁed much more than when they are anonymized together with previous records [5]. Moreover, a recoding scheme applied to each dataset may make the datasets inconsistent with each other; thus, collective analysis on multiple datasets may require additional data modiﬁcation. Therefore, in terms of data quality, this approach is highly undesirable. One may believe that data quality can be assured by waiting for new data to be accumulated sufﬁciently large. However, this approach may not be acceptable in many applications as new data cannot be released in a timely manner. A better approach is to anonymize and provide the entire dataset whenever it is augmented with new records (possibly along with another dataset containing only new records). In this way, the agency can be provided with up-to-date, quality-preserving and “more complete” datasets each time. Although this approach can also be easily implemented by using existing techniques (i.e., anonymizing the entire dataset every time), it has a signiﬁcant drawback. That is, even though each released dataset, when observed independently, is guaranteed to be anonymous, the combination of several released datasets may be vulnerable to various inferences. We illustrate these inferences through some examples in Section 3.1. As such inferences are typically made by comparing or linking records across different tables (or versions), we refer to them as cross-version inferences to differentiate them from inferences that may occur within a single table. Our goal in this paper is to identify and prevent cross-version inferences so that an increasing dataset can be incrementally disseminated without compromising the imposed privacy requirement. In order to achieve this, we ﬁrst deﬁne the privacy requirement for incremental data dissemination. We then discuss three types of cross-version inference that an attacker may exploit by observing multiple anonymized datasets. We also present our anonymization method where the degree of generalization is determined based on the previously released datasets to prevent any cross-version inference. The basic idea is to obscure linking between records across different datasets. We develop our technique in two different types of recoding approaches; namely, full-domain generalization [11] and multidimensional anonymization [12]. One of the key differences between these two approaches is that the former generalizes a given dataset according to pre-deﬁned generalization hierarchies, while the latter does not. Based on our experimental result, we compare these two approaches with respect to data quality and vulnerability to cross-table inference. Another issue we address is that as a dataset is released multiple times, one may need to keep the history of previously released datasets. We thus discuss how to maintain such history in a compact form to reduce unnecessary overheads.

2

each of which contains at least k records. Thus. the risk of attribute disclosure is kept under 1/ . we formulate the privacy requirement for incremental data dissemination. date of birth} [22]. A set of records that are indistinguishable from each other is often referred to as an equivalence class. For instance. The main objective of the k-anonymity model is thus to transform a table so that no one can make highprobability associations between records in the table and the corresponding individuals by using such group of attributes. For such datasets. from the table. it has been shown that 87% of individuals in the United States can be uniquely identiﬁed by a set of attributes such as {ZIP. but a particular group of attributes together may identify a particular individuals [19. Recently. Although none of these records can be uniquely matched with the corresponding individuals. consider a k-anonymized table where all records in an equivalence class have the same sensitive attribute value. It also assumes that the target table contains person-speciﬁc information and that each record in the table corresponds to a unique real-world individual. we discuss a number of anonymization models and brieﬂy review existing anonymization techniques. the -diversity requirement 3 . In Section 3. The enforcement of k-anonymity guarantees that even though an adversary knows the quasi-identiﬁer value of an individual and is sure that a k-anonymous table T contains the record of the individual. a k-anonymous table can be viewed as a set of equivalence classes. [16] pointed out such inference issues in the k-anonymity model and proposed the notion of -diversity. As this requirement ensures that every equivalence class contains at least distinct sensitive attribute values. We present our approach to preventing these inferences in Section 5 and evaluate our technique in Section 6. 2. Diagnosis is considered a sensitive attribute. Note that in this case. Then in Section 4. even though a table is free of explicit identiﬁers. as it is possible to infer certain individuals’ sensitive attribute values without precisely reidentifying their records. we review the basic concepts of the kanonymity and -diversity models and provide an overview of related techniques. For instance. In order to achieve this goal. For example. However. he cannot determine which record in T corresponds to the individual with a probability greater than 1/k. their sensitive attribute value can be inferred with probability 1. in patient table. the key consideration of anonymization is the protection of individuals’ sensitive attributes. However. gender. the -diversity model requires that records in each equivalence class have at least distinct sensitive attribute values. In its simplest form. a private dataset typically contains some sensitive attributes that are not part of the quasi-identiﬁer. the k-anonymity model does not provide sufﬁcient protection in this setting.1 Anonymity Models The k-anonymity model assumes that data are stored in a table (or a relation) of columns (or attributes) and rows (or records). some of the remaining attributes in combination could be speciﬁc enough to identify individuals. Machanavajjhala et al. In Section 2. we describe three types of inference attacks based on our assumption of potential attackers. 22]. Although the k-anonymity model does not consider sensitive attributes. called quasi-identiﬁer. The process of anonymizing such a table starts with removing all the explicit identiﬁers. 2 Preliminaries In this section.The remainder of this paper is organized as follows. We review some related work in Section 7 and conclude our discussion in Section 8. This implies that each attribute alone may not be speciﬁc enough to identify individuals. Several formulations of -diversity are introduced in [16]. such as name and SSN. the k-anonymity model requires that any record in a table be indistinguishable from at least (k − 1) other records with respect to the quasi-identiﬁer.

Another class of generalization schemes is the hierarchy-free generalization class [3. − s∈S p(e. Ba- 4 . possible generalizations are still limited by the imposed generalization hierarchies. modiﬁed datasets based on this requirement could still be vulnerable to probabilistic inferences. 21]. s) represents the fraction of records in e that have sensitive value s. In such a case. Li and Li considered the adversary’s background knowledge in deﬁning privacy. numeric values can be generalized into intervals (e. The multi-level generalization [8. s) > log where S is the domain of the sensitive attribute and p(e. A more robust diversity is achieved by enforcing entropy -diversity [16]. In [13]. Various generalization strategies have been developed. In [26]. the requirement may sometimes be too restrictive. 2. Here an optimal solution is a solution that satisﬁes the privacy requirement and at the same time minimizes a desired cost metric.. the entropy of the entire table must also be greater than or equal to log . Canada. a non-overlapping generalization-hierarchy is ﬁrst deﬁned for each attribute in the quasi-identiﬁer.g. In the single-level generalization schemes [11.2 Anonymization Techniques The k-anonymity (and -diversity) requirement is typically enforced through generalization. {USA. Mexico}) or a single value that represents such a set (e. Although simple and intuitive. 12. which requires every equivalence class to satisfy the following condition.. In [3]. when multiple records in the table corresponds to one individual. They proposed t-closeness as a stronger anonymization model. where real values are replaced with “less speciﬁc but semantically consistent values” [21]. there are various ways to generalize the values in the domain. As generalization makes data uncertain. Intuitively. Based on the use of generalization hierarchies. Recently. on the other hand. an adversary may conclude that the individuals contained in the equivalence class are very likely to have that speciﬁc value. [11 − 20]). Hierarchyfree generalization schemes do not rely on the notion of pre-deﬁned generalization hierarchies. 9] schemes. observed that -diversity is neither sufﬁcient nor necessary for protecting against attribute disclosure. 2. This restriction could be a signiﬁcant drawback in that it may lead to relatively high data distortion due to excessive generalization. North-America). In the hierarchy-based generalization schemes.also ensures -anonymity.g. Then an algorithm in this category tries to ﬁnd an optimal (or good) solution which is allowed by such generalization hierarchies. allows values in a domain to be generalized into different levels in the hierarchy. the utility of the data is inevitably downgraded. In [15]. Given a domain. Although entropy -diversity does provide stronger privacy. They proposed an approach for modeling the adversary’s background knowledge by using data mining techniques on the data to be released. Li et al.g. The key challenge of anonymization is thus to minimize the amount of ambiguity introduced by generalization while enforcing anonymity requirement. consider that among the distinct values in an equivalence class. Although this leads to much more ﬂexible generalization. the algorithms in this category can be further classiﬁed into two subclasses. which requires the distribution of sensitive values in each equivalence class to be analogous to the distribution of the entire dataset. They proposed to have each individual specify privacy policies about his or her own attributes. all the values in a domain are generalized into a single level in the corresponding hierarchy. and categorical values can be generalized into a set of possible values (e. one particular value appears much more frequently than the others. in order for entropy -diversity to be achievable. s) log p(e. as the size of every equivalence class must be greater than or equal to . as pointed out in [16]. For example.. For instance. 4]. a number of anonymization models have been proposed. Xiao and Tao observed that diversity cannot prevent attribute disclosure. 19.

he knows that Alice’s record is not in the old dataset in Figure 2. 3 Problem Formulation In this section. a data anonymization approach that divides one table into two for release. 2)-anonymous table as depicted in Figure 4.[4].. 3. suppose that the hospital relies on a model where both the k-anonymity and -diversity are considered. where the space of anonymizations (formulated as the powerset of totally ordered values in a dataset) is explored using a tree-search strategies. when the number of quasi-identiﬁer attributes is high. one table includes original quasi-identiﬁer and a group id. At a later time. Thus. As shown. Byun et al. The hospital initially has a table like the one in Figure 1 and reports to the agency its (2. Cancer}. to assure data quality. Xiao and Tao [25] propose Anatomy. the association between individual and diagnosis) are kept under 1/2 in the dataset. even for k = 2. On the theoretic side. As previously discussed. he learns only that Alice has one of {Asthma. who is a 21-year-old male. propose a clustering approach to achieve k-anonymity. a ‘(k. transform the k-anonymity problem into a partitioning problem and propose a greedy approach that recursively splits a partition at the median value until no more split is allowed with respect to the k-anonymity requirement.1 Motivating Examples Let us revisit our previous scenario where a hospital is required to provide the anonymized version of its patient records to a disease control agency. We then describe our notion of incremental dissemination and formally deﬁne a privacy requirement for it. Observe that Tom’s privacy is still protected in the newly released dataset. on the other hand. Aggrawal et al. not every patient’s privacy is protected from the attacker. LeFevre et al. propose an algorithm based on a powerset search problem. we start with an example to illustrate the problem of inference. introduce a ﬂexible k-anonymization approach which uses the idea of clustering to minimize information loss and thus ensures good data quality. the curse of dimensionality also calls for more effective anonymization techniques. resulting the table in Figure 3. and the other includes the associations between the group id and the sensitive attribute values. In [12]. the hospital anonymizes the patient table whenever it is augmented with new records. From the new dataset. Koudas et al. In [2]. who is in her late twenties. )-anonymous’ dataset is a dataset that satisﬁes both the k-anonymity and -diversity requirements. but in the new dataset in Figure 4. therefore. For example. Flu.yardo et al. enforcing k-anonymity necessarily results in severe information loss. the association between individual and record) and attribute disclosure (i. by consulting the previous dataset. To make our example more concrete. 5 . he cannot learn about Tom’s disease with a probability greater than 1/2 (although he learns that Tom has either asthma or ﬂu). he can easily deduce that Alice has neither asthma nor ﬂu (as they must belong to patients other than Alice). 2)-anonymous table shown in Figure 2. which clusters records into groups based on some distance metric. Furthermore. the probability of identity disclosure (i. optimal k-anonymity has been proved to be NP-hard for k ≥ 3 [17]. Recently. as shown in [1] that. However.. Example 1 “Alice has cancer!” Suppose the attacker knows that Alice. respectively. However. is in the released table. He now infers that Alice has cancer. even if an attacker knows that the record of Tom.e. three more patient records (shown in Italic) are inserted into the dataset.e. The hospital then releases a new (2. [10] explore the permutation-based anonymization approach and examine the anonymization problem from the perspective of answering downstream aggregate queries. has recently been admitted to the hospital.

. Thus. denoted as θ Ti . 3. With respect to Q. the second condition dictates that the maximum conﬁdence of association between any quasi-identiﬁer value and a particular sensitive attribute value in T must not exceed a threshold c. However. he is sure that Bob’s record is in both datasets in Figures 2 and 4. At a given time i. . Note that three other records in the new dataset are also vulnerable to similar inferences. c)-anonymous. 1 ≤ i ≤ n. 2. ∀ s ∈ S. We assume that T consists of person-speciﬁc records. including potential attackers. if observing tables in θ and Ti collectively increases the risk of either identity disclosure or attribute disclosure in Ti higher than 1/k or c. . we say that there is an inference channel between the two versions. T2 . For the privacy of individuals. by studying the old dataset. . He can thus conclude that Bob suffers from alzheimer. each of which corresponds to a unique real-world individual. Now the attacker checks the new dataset and learns that Bob has either alzheimer or heart disease. T1 . Tn } be the set of all released tables for private table T . if an observer can be sure that two (anonymized) records in two different versions indeed correspond to the same individual. one cannot associate a record with a particular individual with probability higher than 1/k or infer any individual’s sensitive attribute with conﬁdence higher than c. For instance. as shown in Section 3. First. anonymizing a dataset without considering previously released information may enable various inferences. c)-anonymous version released at time i. As every released table is (k..1. Deﬁnition 2 (Cross-version inference channel) Let Θ = {T1 . where |Ti | ≤ |Tj | for i < j. c)-anonymity protection. where ∀ e ∈ T . ∀ e ∈ T. may have access to a series of (k. Thus. . then he may be able to use this knowledge to infer more information than what is allowed by the (k. is released to public. c)-anonymous version of Ti . where Ti is an (k. T consists of a set of non-empty equivalence classes. and we adopt an anonymity model2 that combines the requirements of k-anonymity and -diversity as follows. c)-Anonymity) Let table T be with a set of quasi-identiﬁer attributes Q and a sensitive attribute S. |e| The ﬁrst condition ensures the k-anonymity requirement. where k > 0. Deﬁnition 1 ((k. each Ti must be “properly” anonymized before being released to public. where 0 < c ≤ 1. We also assume that T continuously grows with new records and denote the state of T at time i as Ti . ∀ e ∈ T. |{r|r∈e∧r[S]=s}| ≤ c. by observing each table independently.2 Incremental data dissemination and privacy requirement Let T be a private table with a set of quasi-identiﬁer attributes Q and a sensitive attribute S. it is possible that one can increase the conﬁdence of such undesirable inferences by observing the difference between the released tables. As shown in the examples above. denoted as Ti . record r ∈ e ⇒ r[Q] = e[Q]. If such a case occurs. |e| ≥ k. 1. c)-anonymous with respect to Q if the following conditions are satisﬁed. he learns that Bob suffers from either alzheimer or diabetes. In its essence. users.Example 2 “Bob has alzheimer!” The attacker knows that Bob is 52 years old and has long been treated in the hospital. . c)-anonymous tables. Our goal is to address both identity disclosure and attribute disclosure. respectively. 2A similar model is also introduced in [24]. 6 . We say that T is (k. We say that there exists cross-version inference channel from θ to Ti . only an (k. and the second condition enforces the diversity requirement in the sensitive attribute. Let θ ⊆ Θ and Ti ∈ Θ. .

3 Nowadays. . .. the records in ei all share the same quasi-identiﬁer value. where m > 0. as we cannot rely on attacker’s lack of knowledge. it is critical to ensure that there is no cross-version inference channel among the released tables. for each anonymized table Ti . local celebrities) are admitted to the hospital. We also assume that the attacker has the knowledge of who is and who is not contained in each table. such knowledge is not always difﬁcult to acquire for a dedicated attacker. consider medical records released by a hospital. . Also. .1 Attack scenario We assume that the attacker has been keeping track of all the released tables. Then based on the attack scenario. . above c) by comparing the released tables all together. rm } be an equivalence class in T . that is. as all the released tables are (k. Let A be a set of attributes. In the remainder of this section. however. Deﬁnition 3 (Privacy-preserving incremental data dissemination) Let Θ = {T0 . Based on this knowledge. . where Ti is an (k.e. However. Θ is said to be privacy-preserving if and only if (θ. that is.. he may know when target individuals in whom he is interested (e. . 0 ≤ i ≤ n. Ti ) such that θ ⊆ Θ. but also no released table creates cross-version inference channels with respect to the previously released tables. Let T be a table with a set of quasi-identifer attributes Q and a sensitive attribute S. the attacker cannot infer the individuals’ sensitive attribute values with a signiﬁcant probability.g. e. Although the attacker may not be aware of all the patients. 4. 7 . the attacker could obtains a list of patients from a registration staff3 . and θ Ti . and ei [Q] represents the common quasi-identiﬁer value of ei . perhaps the worst. In other words. For instance. 4 Cross-version Inferences We ﬁrst describe potential attackers and their knowledge that we assume in this paper. Thus. 4. c)-anonymous. the attacker can easily deduce which tables may include such individuals and which tables may not. We also use similar notations for individual records. even utilizing such knowledge. Tn } be the set of all released tables of private table T . Let ei = {r0 . we describe three types of cross-version inferences that the attacker may exploit in order to achieve this goal. .When data are disseminated incrementally. This may seem to be too farfetched at ﬁrst glance. . for record r ∈ T . . . it is reasonable to assume that the attacker’s knowledge includes the list of individuals contained in each table as well as their quasi-identiﬁer values. By deﬁnition. Then T [A] denotes the duplicate-eliminating projection of T onto the attributes A. the data provider must make sure that not only each released table is free of undesirable inferences. . the goal of the attacker is to increase his/her conﬁdence of attribute disclosure (i.g. Another. he thus possesses a set of released tables {T0 . We formally deﬁne this requirement as follows. we assume the worst case..2 Notations We ﬁrst introduce some notations we use in our discussion. r[Q] represents the quasi-identiﬁer value of r and r[S] the sensitive attribute value of r. c)-anonymous version of T released at time i. Ti ∈ Θ. we identify three types of cross-version inference attacks in this section. the attacker also possesses a population table Ui which contains the explicit identiﬁers and the quasi-identiﬁers of the individuals in Ti . about 80% of privacy infringement is committed by insiders. Therefore. possibility is that the attacker may collude with an insider who has access to detailed information about the patients. Tn }. where A ⊆ (Q ∪ S).

3 Difference attack Let Ti = {e0. if v1 ∩ v2 = ∅. the procedure calls GetMinSets procedure in Figure 6 to get the minimum set E of equivalence classes in Tj that contains all the records in E. if D contains less than k records. With this information. We call such E the minimum covering set of E. . . we use T A to denote the duplicate-preserving projection of T . In each time we union one more element to a subset to create larger subsets. Observation 1 Let E1 and E2 be two sets of equivalence classes in Ti . Suppose that v is a value from domain D and v a generalized value of v.In addition. the time complexity is exponential (i. then neither is any set containing E1 ∪ E2 . If they are. Note that this also prevents all the sets containing the unioned set from being generated. . we discuss a few observations that result in effective heuristics to reduce the space of the problem in most cases. 8 . e0. we do not insert the unioned subset to the queue. if (∀qi ∈ Q. the attacker knows the individuals whose records are contained in e. where n is the number of equivalence classes in Ti ). Furthermore. but not in Ei . we say that r is a generalized version of record r.1. for large tables with many equivalence classes. Therefore. The DirectedCheck procedure in Figure 5 checks if the ﬁrst table of the input is vulnerable to difference attack with respect to the second table. . Similarly. . respectively. Also knowing the quasi-identiﬁer values of the individuals in Ti and Tj . c)-anonymous. This also implies that if each E1 and E2 is not vulnerable to difference attack. As the algorithm checks all the subsets of Ti . .n } and Tj = {e1. respectively. In what follows. r1 r2 ) if ∀qi ∈ Q. We also use |N | to denote the cardinalities of set N . Moreover. then we do not need to consider any subset of Ti which contains E1 ∪ E2 . denoted as r r. The DirectedCheck procedure enumerates all subsets of Ti . c)-anonymous tables that are released at time i and j (i = j). then set D = ∪e∈Ej I(e) − ∪e∈Ei I(e) represents the set of individuals whose records are in Ej . and E1 ∩ E2 = ∅. c)-anonymous. and E1 and E2 be their minimal covering sets in Tj . Then we denote this relation as v v. . For instance. If ∪e∈Ei I(e) ⊆ ∪e∈Ej I(e). Two tables Ti and Tj are vulnerable to difference attack if at least one of DirectedCheck(Ti . set SD = ∪e∈Ej e S − ∪e∈Ei e S indicates the set of sensitive attribute values that belong to those individuals in D. ri [qi ] ∩ rj [qi ] = ∅. and E1 and E2 are disjoint. the (k.m } be two (k. e1. we check if their minimum covering sets are disjoint. As previously discussed in Section 4. As such. for any equivalence class e in either Ti or Tj . Let I(e) represent the set of individuals in e. this brute-force check is clearly unacceptable. where Ei ⊆ Ti and Ej ⊆ Tj . e1 [qi ] ∩ e2 [qi ] = ∅ 4.e.1 . it is O(2n ). Let Ei and Ej be two sets of equivalence classes. we assume that an attacker knows who is and who is not in each released table. we consider a generalized value as a set of possible values. Tj ) and DirectedCheck(Tj . The fact that E1 and E2 are not vulnerable to difference attacks means that sets (E1 − E1 ) and (E2 − E2 ) are both (k. denoted as v1 v2 .. two generalized records r1 and r2 are compatible (i.1 . we can modify the method in Figure 5 as follows. Overloading this notation. If E1 and E2 are not vulnerable to difference attack. Based on this observation. and interpret v as a set of values where (v ∈ v) ∧ (∀vi ∈ v. or if the most frequent sensitive value in SD appears with a probability larger than c. ei S represents the multiset of all the sensitive attribute values in ei . The procedure then checks whether there is vulnerability between the two equivalence classes E and E . and for each set E. r[qi ] r[qi ]) ∧ (r[S] = r[S]). We also say that two equivalence classes e1 and e2 are compatible if ∀qi ∈ Q. the attacker can now perform difference attacks as follows. Ti ) returns true. we say that two generalized values v1 and v2 are compatible. As the minimum covering set of E1 ∪ E2 is E1 ∪ E2 . Regardless of recoding schemes. c)-anonymity requirement is violated.e. vi ∈ D). . (E1 ∪ E2 ) − (E1 ∪ E2 ) is also (k.

For instance. the attacker cannot infer Alice’s sensitive attribute value with conﬁdence higher than c by examining each ei independently. E1 and E2 are not vulnerable. . c)-anonymous. . whose quasi-identiﬁer value is qA . in each Ti ∈ θA .. that is. the attacker now wants to know which individuals possess a speciﬁc sensitive attribute value. If E1 = E2 . he just needs to consider the records that may possibly correspond to + Alice. c)-anonymous tables. Therefore.g. if the most frequent value in SIA appears with a probability greater than c. sA . Let sp be the sensitive attribute value in which the attacker is interested and Ti ∈ Θ be the table in which (at least) one record with sensitive value sp appears. the individuals who suffer from ‘HIV+’. . must be present in every equivalence class in EA . The basic idea is to check every pair of equivalence classes ei ∈ T0 and ej ∈ T1 that contain the same record(s). we check if there exists another set with the same minimum covering set. In other words. we can skip checking if each of E1 and E2 is vulnerable to difference attack. then we only need to check if E1 ∪ E2 is vulnerable to difference attack. The pseudo-code in Figure 7 provides an algorithm for checking the vulnerability to the intersection attack for given two (k. this ambiguity may be reduced to an undesirable level if the structure of equivalence classes are varied in different releases. instead of wanting to know what sensitive attribute value a particular individual has. θA . It is easy to see that if e ∈ Ti contains some record(s) that Tj do not. where + qA ei [Q]. we simply merge the new subset with the found set. suppose that the attacker wants to know the sensitive attribute of Alice. T0 and T1 . If equivalence class e ∈ Tj contains all such records. as every equivalence class in EA contains Alice’s record. Ti contains some records that are not in Tj . However. This implies that sA must be found in set SIA = ei ∈EA ei S . Based on this observation. en } be the set of all equivalence classes identiﬁed from θA such that qA ei [Q]. ∀ei ∈ EA .5 Record-tracing attack Unlike the previous attacks. If such a set is found. Let EA = {e0 . and E1 and E2 be their minimal covering sets in Tj . Then the attacker can select a + set of tables. Although T 9 . c)-anonymous.. the attacker only need to consider an equivalence class ei ⊆ Ti . e. sA ∈ ei S . This is because unless E1 ∪ E2 is vulnerable to difference attack. However. he does not need to examine all the records.Observation 2 Let E1 and E2 be two sets of equivalence classes in Ti . In other words. we can purge all such equivalence classes from the initial problem set. e is safe from difference attack. we can save our computational effort as follows. As the attacker knows the quasi-identiﬁer of Alice. Thus. Suppose that Ti was released after Tj .e. Observation 3 Consider the method in Figure 5. i. the attacker may be interested in knowing who may be associated with a particular attribute value. That is. Since e itself must be (k. that all contain Alice’s record. respectively. . When we insert a new subset into the queue. 4. 4. the attacker knows that Alice’s sensitive attribute value. then the sensitive attribute value of Alice can be inferred with conﬁdence greater than c. 0 ≤ i ≤ n.4 Intersection attack The key idea of k-anonymity is to introduce sufﬁcient ambiguity into the association between quasi-identiﬁer values and sensitive attribute values. As every ei is (k. then we do not need to consider that equivalence class for difference attack. the minimum covering set of e is an empty-set.

As shown. While correct. We deﬁne E as the set of edges from vertices in V1 to vertices in V2 . where V = V1 ∪ V2 and each vertex in V1 represents a record in Ti . Suppose now that there are more than one compatible equivalence classes in Ti+1 . in order to illustrate our point clearly. the attacker can make more precise inference by eliminating equivalence classes in Ti+1 that are impossible to contain rr . then rp could be in either ei+1 and ei+1 (see Figure 8 (ii)). he can make an arbitrary choice and update his knowledge about the quasi-identiﬁers of the two records accordingly.2 . We create such edges between V1 and V2 as follows. we show some simple cases in the following example. then we create an edge from r to r and mark it with d . he is sure that one record is in ei+1. This means that if the attacker can identify from Tj the record corresponding to rp . we say that the attacker updates rp [Q] as rp [Q] ∩ (∪1≤j≤r ei+1. if there is an edge from ri ∈ V1 to rj ∈ V2 . In the above example. ei+1. If |R| = 1 and r ∈ E. rp [Q] ← rp [Q] ∩ ei+1 [Q]. However. which indicates that r deﬁnitely corresponds to r .1 that take sr as the sensitive value. c)-anonymous.1 that is compatible to r1 and two compatible equivalence classes ei+1. Although the attacker may or may not determine which equivalence class contains rp .1 or ei+1. as the record with sr in ei+1. Then the attacker can conﬁdently combine his knowledge on the quasi-identiﬁer of rp . Example 3 The attacker knows that rp must be contained in the equivalence class of Tj that is compatible with rp [Q].1 and r2 ∈ ei. Although r2 has two compatible equivalence classes.. and each vertex in V2 represents a record in Ti+1 that is compatible with at least one record in Ti . the attacker can infer the association between the individuals in Ip and rp with a probability higher than 1/k.1 and the other in ei+1. Suppose that Ti+1 contains a single equivalence class ei+1. the attacker can reexamine Ip and eliminate individuals whose quasiidentiﬁers are no longer compatible with the updated rp [Q]. As T is (k.r } in Ti+1 . that the attacker is interested in a particular record rp such that (rp [S] = sp ) ∧ (rp ∈ ei ). this is not a sufﬁcient description of what the attacker can do. (r[Q] ri [Q]) ∧ (r[S] = ri [S]). If |R| > 1.2 .may contain more than one record with sp .2 that are compatible to r2 .2 . After updating rp [Q] with Tj . we construct a bipartite graph G = (V. then he may be able to learn additional information about the quasi-identiﬁer of the individual corresponding to rp and possibly reduce the size of Ip .1 . For example. Using such techniques. he is sure that rp ∈ ei+1 ∪ ei+1 . E). let r1 ∈ ei. for simplicity. That is.e.1 and ei+1... as there are cases where the attacker can notice that some equivalence classes in Ti+1 cannot contain rp . Note that in each of these tables the quasi-identiﬁer of rp may be generalized differently. if sp ∈ ei+1 [S] and sp ∈ ei+1 [S]. the attacker can be sure that r2 is included in ei+1. When the size of Ip becomes less than k. i. Figure 9(ii) illustrates another case of which the attacker can take advantage.1 must correspond to r1 . then we create an edge from r and every ri ∈ R and mark it with p to indicate that r plausibly corresponds to 10 . For each vertex r ∈ V1 . therefore. Suppose that there is only one compatible equivalence class. If sp ∈ ei+1 [S] and sp ∈ ei+1 [S]. suppose. which represents possible matching relationships. . Suppose that the attacker also possesses a subsequently released table Tj (i < j) which includes rp . rp [Q] ← rp [Q] ∩ (ei+1 [Q] ∪ ei+1 [Q]). both taking sr as the sensitive value (see Figure 9(i)). Let Ip be the set of such individuals.2 be two records in Ti . this means that records ri and rj may both correspond to the same record although they are generalized into different forms. However. when there are more than one compatible equivalence classes {ei+1. then the attacker can be sure that rp ∈ ei+1 and updates his / knowledge of rp [Q] as rp [Q] ∩ ei+1 [Q]. c)-anonymous tables for the vulnerability to the record-tracing attack. Although the attacker cannot be sure that each of these records is contained in ei+1. First.. there are two records in ei. Thus. We now describe a more thorough algorithm that checks two (k. There are many cases where the attacker can identify rp in Tj . when the attacker queries the population table Ui with rp [Q]. we ﬁnd from V2 the set of records R where ∀ri ∈ R. say ei+1 and ei+1 .j ). ei+1 in Tj (see Figure 8 (i)). he obtains at least k individuals who may correspond to rp .

i ∈ V1 is associated with a set of possibly matching records {r2. we describe the properties of our checking algorithms which make them suitable for existing data anonymization techniques such as full-domain generalization [11] and multidimensional anonymization [12]. This ﬁeld is used for checking vulnerability to intersection attack. . we describe our incremental data anonymization which incorporates the inference detection techniques in the previous section. Note that the algorithm above does not handle the case illustrated in Figure 9(ii). • RID : is a unique record ID (or the explicit identiﬁer of the corresponding individual). . . In order to check whether Ti is vulnerable to cross-version inferences. This ﬁeld is used for checking vulnerability to record-tracing attack. r[IQ]i ← r[IQ]i−1 ∩ r[Q]. It is worth noting that the key problem enabling the record-tracing attack arises from the fact that the sensitive attribute value of a record. .m r2. • IQ (Inferable Quasi-identiﬁer) : keeps track of the quasi-identiﬁers into which the record has previously been generalized. 11 . Assuming that each Ti also contains RID (which is projected out before being released). ..1 Data/history management Consider a newly anonymized table. rare diseases) or tables where every individual has a unique sensitive attribute value (e.m } (j ≤ m) in V2 . If any inferred quasi-identifer maps to less than k individuals in the population table Ui . Θ = {T0 . [Q]). the pseudo-code in Figure 10 removes plausible edges that are not feasible and discovers more deﬁnite edges by scanning through the edges. Hi at time i. we remove unnecessary plausible edges such that each of such records in e1.. checking the vulnerability in Ti against each table in Θ can be computationally expensive. r2. For instance. Now given the constructed bipartite graph. We ﬁrst describe our data/history management strategy which aims to reduce the computational overheads. it is essential to maintain some form of history about previously released datasets. together with its generalized quasi-identiﬁer.. • IS (Inferable Sensitive values) : stores the set of sensitive attribute values with which the record can be associated. we maintain a history table.j . To avoid such inefﬁciency. However. Then. may uniquely identify the record in different anonymous tables. . After all infeasible edges are removed. Ti . which has the following attributes.i [Q] e2. RID is used to join Hi and Ti .g. . This issue can be especially critical for records with rare sensitive attribute values (e.i ∈ Ti . For each equivalence class e1. For instance. we also performs the following. .i [Q] = r1.g.j [Q]. then table Ti is vulnerable to the record-tracing attack with respect to Tj . e1. 5 Inference Prevention In this section. if record r is released in equivalence class ei of Ti . • TS (Time Stamp) : represents the time (or the version number) when the record is ﬁrst released. we ﬁnd from Tj the set of equivalence classes E where ∀e2.j ∈ E. Ti−1 }..i [Q] ∩ ( =j. In order to address such cases.. Now we can follow the edges and compute for each record r1. each record r1. If the same number of records with any sensitive value s appear in both e1. which is about to be released. then r[IS]i ← (r[IS]i−1 ∩ ei S ).i and E.i ∈ Ti the inferrable quasi-identiﬁer r1.i has a deﬁnite edge to a distinct record in E.. 5. for record r ∈ Ti .ri . DNA sequence).

Observation 5 Given a partition p of records. Θ. once the search ﬁnds a generalization level that satisﬁes the k-anonymity requirement. 6 It is easy to see that the property also holds for any diversity requirement. If either check fails. In its essence. for intersection and record-tracing attacks. property is also used in [16] for -diversity and is thus applicable for (k. then neither is G2 (T ). we do not need to search further. the algorithm recursively splits a partition at the median value (of a selected dimension) until no more split is allowed with respect to the k-anonymity requirement. then any sub-partition of p does not satisfy the requirement. where all values of an attribute are consistently generalized to the same level in the predeﬁned generalization hierarchy. Note that unlike the previous algorithm.e. Speciﬁcally. Using H. we consider the multidimensional k-anonymity algorithm proposed in [12].. Observation 4 Given a table T and two generalization strategies G1 . In each step of generalization. we check Ti against Hi−1 . Then.The main idea of Hi is to keep track of the attacker’s accumulated knowledge on each released record. this property is critical as it guarantees the optimality to the discovered solutions. then so is any subpartition of p. the information about each record in G2 (T ) is more vague than that in G1 (T ). this algorithm is a top-down search approach. propose an algorithm that ﬁnds minimal generalizations for a given table. This our current implementation. Based on this observation. In order to ﬁnd such a partitioning. if p is vulnerable to any inference attack. then G2 (T ) is also k-anonymous5. in addition to checking the (k. in which some records are vulnerable to inference attacks. G2 (G1 vulnerable to any inference attack.2 Incorporating inference detection into data anonymization We now discuss how to incorporate the inference detection algorithms into secure anonymization algorithms. LeFevre et al. where d is the number of attributes in the quasi-identiﬁer. However. This is indeed the worst case as we are assuming that the attacker possesses every released table.. c)-anonymity requirement. We ﬁrst consider the full-domain anonymization. any further cut of p1 will lead to a dataset that is also vulnerable to inference attacks.1. c)-anonymity. In [11]. instead of every Tj ∈ Θ4 . The key property on which the algorithm relies is the generalization property: given a table T and two generalization strategies G1 . For instance. i. if G1 (T ) is not The proof is simple.e. we need to be conservative when estimating what the attacker can do. as discussed in Section 4. Suppose that we have a partition p1 of the dataset. difference attack is still checked against every previously released table. The ﬁrst step is to ﬁnd a partitioning scheme of the d-dimensional space. the cost of checking Ti for vulnerability can be signiﬁcantly reduced. Although intuitive. As each equivalence class in G2 (T ) is the union of one or more equivalence classes in G1 (T ). thus. value r[IS] of record r ∈ Hi−1 indicates the set of sensitive attribute values that the attacker may be able to associate with r prior to the release of Ti . Next. the algorithm consists of the following two steps. G2 ). and the quality of the search relies on the following property6: given a partition p. if p does not satisfy the k-anonymity requirement. we also check for the vulnerability to inference. if G1 (T ) is k-anonymous. i. the proposed algorithm is a bottom-up search approach in that it starts with un-generalized data and tries to ﬁnd minimal generalizations by increasingly generalizing the target data in each step. G2 (G1 G2 ). G2 does not create more inference attacks than G1 . such that each partition contains more than k records. we modify the algorithm in [11] as follows. 5 This 4 In 12 . 5. then we need to further generalize the data.

g. For our experiments. 6. . denoted by AIL(e). salary} as the quasi-identiﬁer. is deﬁned as: |Gj | AIL(e) = |e| × j=1. Deﬁnition 4 (Information loss) [4] Let e={r1 .1. Standard Edition 5.g.. while being independent from the underlying generalization scheme (e. In our experiments.2 Data quality metrics The quality of generalized data has been measured by various metric. and the implementation was built and run in Java 2 Platform. If either check fails. we measure the data quality mainly based on Average Information Loss (AIL.1 Experimental Setup 6. Based on this observation.. Thus. the AIL of a given table T is computed as: AIL(T ) = ( e∈T IL(e)) / |T |.0. am }. 6 Experiments The main goal of our experiments is to show that our approach effectively prevents the previously discussed inference attacks when data is incrementally disseminated. the Adult data set was prepared as described in [3. Based on IL. the geometrical size of each partition). marital status. 6. . each region indeed represents the generalized quasi-identiﬁer of the records contained in it.. and |Dj | the domain size of attribute aj . rn } be an equivalence class where Q={a1 . . in addition to checking the (k. The key advantage of AIL metric is that it precisely measures the amount of generalization (or vagueness of data).1. . c)-anonymity requirement.1 Experimental Environment The experiments were performed on a 2. we considered {age. we also use the Discernibility Metric (DM ) [3] as 13 . We also show that our approach produces datasets with good data quality. gender. We removed records with missing values and retained only nine of the original attributes. and occupation attribute as the sensitive attribute. we used the Adult dataset from the UC Irvine Machine Learning Repository [18].e. it reveals more information about each record than p1 . Information Loss (IL for short) measures the amount of data distortion in an equivalence class as follows.m |Dj | where |e| is the number of records in e. thus. . 9.is based on the fact that any further cut on p1 leads to de-generalization of the dataset. The operating system on the machine was Microsoft Windows XP Professional Edition. Before the experiments. native country. We ﬁrst describe our experimental settings and then report our experimental results. for short) metric proposed in [4]. Note that as all the records in an equivalence class are modiﬁed to share the same quasi-identifer. we also check for the vulnerability to inference. education. then we do not need to further partition the data. work class. we modify the algorithm in [12] as follow. .66 GHz Intel IV processor machine with 1 GB of RAM. |Gj | represents the amount of generalization in attribute aj (e. data distortion can be measured naturally by the size of the region covered by each equivalence class. anonymization technique used or generalization hierarchies assumed).. Then the amount of data distortion occurred by generalizing e. which is considered a de facto benchmark for evaluating the performance of anonymization algorithms.. the length of the shortest interval which contains all the aj values existing in e). . . Following this idea. In our experiment. In each step of partition. race. The basic idea of AIL metric is that the amount of generalization is equivalent to the expansion of each equivalence class (i. For the same reason. 12]..

we stop the check and do not consider the split. For this. As previously discussed. It is clear that in terms of data quality the inf-checked algorithm is much superior than the merge algorithm. they caused memory blowups. which indicates how much records are indistinguishable from each other. Intuitively. Figure 11 shows the result. We also implemented a merge approach where we anonymize each dataset independently and merge it with the previously released dataset. while our modiﬁed algorithm of Incognito is processing a node in the generalization lattice. even the initial dataset was highly generalized. In fact. we started with 5K records and increased the dataset by 5K each time.5. we ﬁrst anonymized 5K records and obtained the ﬁrst “published” datasets. This is indeed unsurprising.3) were highly effective in most cases. The main focus was its effect on the data quality. That is. the more precise data are. We also note that the quality of datasets generated by the inf-checked algorithm is not optimal. and the results are illustrated in Figures 12 and 13. these data are vulnerable to inference attacks as previously shown. This clearly illustrates the unfortunate reality. In our case. In order to avoid such cases. there were some cases where the size of subsets grew explosively. in order to prevent undesirable inferences. if the size of subsets needed to be checked exceeds the threshold. Note that although we implemented the full-featured algorithms for difference and intersection attacks. c)-anonymization algorithms. c)-anonymization algorithms. Although the static algorithms produced the best quality datasets. The next step was to investigate how effectively our approach would work with a real dataset. that is. c)-anonymization algorithms as discussed in Section 5 and obtained our inf-checked (k. As such cases not only caused lengthy execution times. respectively. This was mainly due to the complexity of checking for difference attack. it means that the given data must be generalized until there is no vulnerability to any type of inference attack. taking a multidimensional approach. for Incognito. when we encounter such a case while considering a split in Mondrian. As shown. we modiﬁed two k-anonymization algorithms. Note that this approach does not affect the security of data. Although this approach is secure from any type of inference attacks. The datasets generated by our inf checked algorithm and the merge algorithm were not vulnerable to any type of inference attack.another quality measure in our experiment. much more records were found to be vulnerable in the datasets anonymized by Mondrian. produces datasets with much less generalization. we repeated our experiment.2 Experimental Results We ﬁrst measured how many records were vulnerable in statically anonymized datasets with respect to the inference attacks we discussed. We modiﬁed the static (k. With these algorithms as well as the static anonymization algorithms. Incognito [11] and Mondrian [12]. we considered all the edges without removing infeasible/unnecessary edges as discussed in Section 4. we expected that the data quality would be the worst. one needs to hide more information. We then generated ﬁve more subsequent datasets by adding 5K more records each time. we took a simple approach for record-tracing attack. For example. 6. we set an upper limit threshold for the size of subsets in this experiment. the more vulnerable they are to undesirable inferences. we stop the iteration and consider the node as a vulnerable node. as merging would inevitably have a signiﬁcant effect on generalization (recoding) scheme. Then we used our vulnerability detection algorithms to count the number of records among these datasets that are vulnerable to each of inference attack. Similarly. DM measures the quality of anonymized data based on the size of the equivalence classes. We then checked the vulnerability and measured the data quality of such datasets. As before. Even though our heuristics to reduce the size of subsets (see Section 4. although 14 . and used them as our static (k. Using these algorithms. as Mondrian. We measured the data quality both with AIL and DM .

14. 7 Related Work The problem of information disclosure [6] has been studied extensively in the framework of statistical databases. They proposed several methods to check whether or not a given set of views violates the k-anonymity requirement collectively. Figure 14 shows our experimental result for each algorithm. Sweeney identiﬁed possible inferences when new records are inserted and suggested two simple solutions. we believe that the data quality seems to be still acceptable.. Samarati and Sweeney [20. addition of new attributes). we presented a preliminary limited investigation concerning the inference problem of dynamic anonymization with respect to incremental datasets. While static anonymization has been extensively investigated in the past few years. 21] have been proposed to achieve k-anonymity requirement. Wang and Fung [23] further investigated this issue and proposed a top-down specialization approach to prevent record-linking across multiple anonymous tables. we believe that this is the price you have to pay for better data quality and reliable privacy. 12. While equipped the heuristics and the data structure discussed in Sections 4. as discussed in Section 4. As previously mentioned. 22. 8. [28] addressed the inference issue when a single table is released in the form of multiple views. Rounding. all equivalence classes containing that tuple have exactly the same set of sensitive attribute values. In [5].1. Also. The m-invariance principle requires that each equivalence class in every release contains distinct sensitive attribute values and for each tuple. 9. all released attributes (including sensitive attributes) must be treated as the quasi-identiﬁer in subsequent releases. However. Also. compromised data integrity of the tables. Recently. Yao et al. A number of static anonymization algorithms [3. Even if the optimality cannot be guaranteed. They also introduced the counterfeit generalization technique to achieve the m-invariance requirement. However.it may negatively affect the overall data quality. Cell Suppression. when compared to our previous implementation without any heuristics. The merge algorithm is highly efﬁcient with respect to execution time (although it was very inefﬁcient with respect to data quality). this approach cannot adequately control the risk of attribute disclosure. this approach may suffer from unnecessarily low data quality. Xiao and Tao [27] proposed a new generalization principle m-invariance for dynamic dataset publishing. Another important comparison was the computational efﬁciency of these algorithms. in any subsequent release of the dataset. 21] introduced the k-anonymity approach and used generalization and suppression techniques to preserve information truthfulness. A number of information disclosure limitation techniques [7] have been designed for data publishing. Optimal k-anonymity has been proved to be NP-hard for k ≥ 3 [17]. this approach has a signiﬁcant drawback in that every equivalence class will inevitable have a homogeneous sensitive attribute value. and Data Swapping and Perturbation. including Sampling. The ﬁrst solution is that once records in a dataset are anonymized and released. As the merge algorithm anonymizes the same sized dataset each time and merging datasets can be done very quickly. the infchecked algorithm is still slow. However. this approach cannot protect newly inserted records from difference attack. the execution time is closely constant. their work focuses on the horizontal growth of databases (i. These techniques. however. In [22]. The other solution suggested is that once a dataset is released. the records must be the same or more generalized.3 and 5. 11. this is a very promising result. they did not address how to deal with such violations. and does not address vertically-growing databases where records are inserted. thus. considering the results shown in Figures 12 and 13. This approach seems reasonable as it may effectively prevent linking between records. We identiﬁed some inferences and also proposed an approach where new records are directly inserted to the previously anonymized dataset for computational 15 . However.e. considering the previously discussed results. only a few approaches address the problem of anonymization in dynamic environments.

Aggrawal. In Proceedings of the Intenational Conference on Database Systems for Advanced Applications(DASFAA). we are working on essential properties (e. we discussed three basic types of cross-version inference attacks and presented algorithms for detecting each attack. 2006. In Secure Data Management. References [1] C. and also suggest a scheme to store the history (of previously released tables). For the future work. For instance. while in this work we consider more robust and systematic inference attacks in a collection of released tables. The inference attacks discussed in this work subsume attacks using inference enabling sets and address more sophisticated inferences. Agrawal. Li. Cryptography and Data Security. S.efﬁciency. Bayardo and R. In this paper we also address the issue of computational costs in detecting possible inferences. Based on these ideas. 2005. Byun. In Proceedings of the ACM Symposium on Principles of Database Systems (PODS). Bertino. J. Another interesting direction for the further work is to see if there are other types of inferences. Data privacy through optimal k-anonymization. Byun. Our previous approach was also limited to the multidimensional generalization. E. our study of the record-tracing attack is a new contribution in this work. R. correctness) of our methods and analysis. 2007. 2005. T. Thomas.-W. K. Y. and N. We also empirically evaluated our approach by comparing to other approaches. Khuller. Secure anonymization for incremental datasets. Our experimental result showed that our approach outperformed other approaches in terms of privacy and data quality. 2006. one can devise an attack where more than one type of inference are jointly utilized. and N. [6] D. In particular. compared to this current work. Inc. 1982. 8 Conclusions In this paper.g.. However. Li. Aggarwal. A. In Proceedings of the International Conference on Very Large Data Bases (VLDB). Denning.The key differences of this work with respect to [5] are as follows. We discuss various heuristics to signiﬁcantly reduce the search space. [4] J. On k-anonymity and the curse of dimensionality. [2] G. Bertino. Efﬁcient k-anonymization using clustering techniques. Feder. E. we discussed inference attacks against the anonymization of incremental data. D. Zhu. In [5].. therefore it can be combined with a large variety of anonymization algorithms. Kamra.-W. Achieving anonymity via clustering. Kenthapadi. our current approach considers and is applicable to both the full-domain and multidimensional approaches. we focused only on the inference enabling sets that may exist between two tables. By contrast. we developed secure anonymization algorithms for incremental datasets using two existing anonymization algorithms. We also plan to investigate inference issues in more dynamic environments where deletions and updates of records are allowed. For instance. In Proceedings of the International Conference on Data Engineering (ICDE). We also presented some heuristics to address the efﬁciency of our algorithms. our previous work has several limitations. 16 . [5] J. Panigrahy. Sohn. [3] R. Addison-Wesley Longman Publishing Co. We also provide a detailed descriptions of attacks and algorithms for detecting them. and A.

Fung. In Conﬁdentiality. In Proceedings of the ACM Symposium on Principles of Database Systems (PODS). [14] T. and M. LeFevre. Anonymizing sequential releases. Merz. Venkitasubramaniam. [16] A. 2002. Fung. In Proceedings of the International Conference on Data Engineering (ICDE). 1998. Injector: Mining Background Knowledge for Data Anonymization. Wang and B. Technical Report SRI-CSL-98-04. International Journal on Uncertainty. Yu. F. 2006. D. R. UCI repository of machine learning databases. [21] L. Disclosure limitation methods and information loss for tabular data. [12] K. K. Sweeney. D. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). 2004. Zhang. Fuzziness and Knowledge-based Systems. S. Padman. Yu. Wang. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). Li. [9] V. [23] K. and Q. Sweeney. Mondrian multidimensional k-anonymity. International Journal on Uncertainty. In Proceedings of the International Conference on Data Engineering (ICDE). Fienberg. [8] B. Sweeney. Fuzziness and Knowledge-based Systems. Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. T. Samarati. Elsevier. Top-down specialization for information and privacy preservation. T. 1998. Li. and P. 2001. DeWitt. On the complexity of optimal k-anonymity. [19] P. [22] L. J. [18] C. Krishnan. In Proceedings of the International Conference on Data Engineering (ICDE). 2008. M. Gehrke. K-anonymity: A model for protecting privacy. T. 17 . Protecting respondents privacy in microdata release. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD). [10] N. and R. ‘-diversity: Privacy beyond k-anonymity. 13. 2005. E. B. and R. C. t-closeness: Privacy beyond k-anonymity and -diversity. Ramakrishnan. Li and N. [17] A. Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies. Machanavajjhala. Li and N. Li. Hettich and C. Meyerson and R. Venkatasubramanian. D. M. Towards optimal k-anonymization. and S. Koudas. [11] K. 2002. 2001. Roehrig. 2002. Incognito: Efﬁcient full-domain k-anonymity.[7] G. C. Ramakrishnan. [15] T. 2006. S. DeWitt. pages 414423. Williams. Li. Duncan. In Proceedings of the International Conference on Data Engineering (ICDE). S. Srivastava. Data & Knowledge Engineering. R. LeFevre. IEEE Transactions on Knowledge and Data Engineering. Samarati and L. 2008. Kifer. pages 135166. Aggregate query answering on anonymized tables. In Proceedings of the International Conference on Data Engineering (ICDE). Achieving k-anonymity privacy protection using generalization and suppression. Iyengar. 2005. D. [13] N. S. and S. 2007. [20] P. In Proceedings of the International Conference on Data Engineering (ICDE). Transforming data to satisfy privacy constraints. 2007. 2005.

C. pages 689700. S. In Proceedings of the International Conference on Very Large Data Bases (VLDB).k)-anonymity: an enhanced k-anonymity model for privacy preserving data publishing. Jajodia. [27] X. 2006. 18 . [25] X. 2007. pages 139150. X. Wong. and S. 2006. Tao. Yao. W. In Proceedings of the International Conference on Very Large Data Bases (VLDB). Anatomy: simple and effective privacy preservation. Fu. Xiao and Y. J. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD). Xiao and Y. In Proceedings of the ACMSIGMOD International Conference onManagement of Data (SIGMOD). Wang.-W. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). Tao. 2006. Checking for k-anonymity violation by views. Xiao and Y. Li. [28] C. and K. [26] X. pages 229240. A. (α.-C. M-invariance: towards privacy preserving re-publication of dynamic datasets. Wang.[24] R. 2005. Tao. pages 754759. Personalized privacy preservation.

NAME Tom Mike Bob Eve AGE 21 23 52 57 Gender Male Male Male Female Diagnosis Asthma Flu Alzheimer Diabetes AGE [21 − 25] [21 − 25] [50 − 60] [50 − 60] Gender Male Male Person Person Diagnosis Asthma Flu Alzheimer Diabetes Figure 1: Patient table Figure 2: Anonymous patient table AGE [21 − 30] [21 − 30] [21 − 30] [51 − 55] [51 − 55] [56 − 60] [56 − 60] Gender Person Person Person Male Male Female Female Diagnosis Asthma Flu Cancer Alzheimer Hepatitis Flu Diabetes NAME Tom Mike Bob Eve Alice Hank Sal AGE 21 23 52 57 27 53 59 Gender Male Male Male Female Female Male Female Diagnosis Asthma Flu Alzheimer Diabetes Cancer Hepatitis Flu Figure 3: Updated patient table Figure 4: Updated anonymous patient table 19 .

ej.n } Output: true if Ti is vulnerable to difference attack with respect to Tj and false otherwise Q = {(∅. . . . Tj ) if |E − E| < k.DirectedCheck Input: Two (k. . . . 0)} while Q = ∅ Remove the ﬁrst element p = (E. . m} // generates subsets with size |E| + 1 insert (E ∪ ei. return true for each ∈ {index + 1. c)-anonymous tables Ti and Tj . where Ti = {ei. ei. index) from Q if E = ∅ E ← GetM inSet(E.1 .m } and Tj = {ej. .1 . . . ) into Q return false Figure 5: Algorithm for checking if one table is vulnerable to difference attack with respect to another table 20 . . . return true else if (E S − E S ) does not satisfy c-diversity. .

GetMinSet Input: An equivalence class set E and a table T Output: An equivalence class set E ⊆ T that is minimal and contains all records in E E =∅ while E ⊆ E choose a tuple t in E that is not in E ﬁnd the equivalence class e ⊆ T that contains t E = E ∪ {e} return E Figure 6: Algorithm for obtaining a minimum equivalence class set that contains all the records in the given equivalence class set 21 .

Check Intersection Attack Input: Two (k.j S ) does not satisfy c-diversity.i in T0 for each e1.j ∈ T1 that contains any record in e0. c)-anonymous tables T0 and T1 Output: true if the two tables are vulnerable to intersection attack and false otherwise for each equivalence class e0.i if (e0. return true return false Figure 7: Algorithm for checking if given two tables are vulnerable to intersection attack 22 .i S ∩ e1.

2 en...1 sp e i+1.2 en.5 sp Ti+1 .(i) ei.1 sr ei.2 e i+1.3 e i+1.2 sr (i) e i+1.1 sp ei+1.6 sp Tn (ii) ei.5 sp en.3 sp e i+1.1 sp en.2 sr ei+1. en.2 sr (ii) Ti Figure 8: Record-tracing attacks Figure 9: More inference in record-tracing 23 .4 en.1 sr ei.1 sr sr ei+1.3 ei.1 sr ei.4 sp e i+1.

break return (V. change2 ← false for each rj ∈ V2 e ← all the incoming edges of rj if e contains both deﬁnite and plausible edge(s) remove all plausible edges in e from E . Output: A bipartite graph G = (V. E) where V = V1 ∪ V2 and E is a set of edges representing possible matching relationships. change1 ← true if change1 = true for each ri ∈ V1 e ← all the outgoing edges of ri if e contains only a single plausible edge mark the edge in e as deﬁnite. E ) Figure 10: Algorithm for removing infeasible edges from a bipartite graph 24 . E ) where E ⊂ E with infeasible edges removed E =E while true change1 ← false.Remove Infeasible Edges Input: A bipartite graph G = (V. change2 ← true if change2 = false.

7) 5000 Number of Vulnerable Records Number of Vulnerable Records 50 40 30 20 10 0 0 5 10 15 20 25 Table Size (unit = 1.Static Incognito (k=5. c=0.7) Difference Intersection R-Tracing 10 15 20 25 Table Size (unit = 1. c=0.000) 30 35 Figure 11: Vulnerability to Inference Attacks 25 .000) 30 35 Difference Intersection R-Tracing 4000 3000 2000 1000 0 0 5 Static Mondrian (k=5.

000) 30 35 Figure 12: Data Quality: Average Information Loss 26 . c=0.7) 9 Average Information Loss 8 7 6 5 4 3 0 5 10 15 20 25 Table Size (unit = 1.000) 30 35 Average Information Loss Static Merge Inf_Checked 7 6 5 4 3 2 1 0 0 5 Static Merge Inf_Checked Mondrian (k=5.Incognito (k=5. c=0.7) 10 15 20 25 Table Size (unit = 1.

c=0. c=0.Incognito (k=5.000) 30 35 Figure 13: Data Quality: Discernibility Metric 27 .7) 10 15 20 25 Table Size (unit = 1.7) 120 Discernability Metric (unit = 1M) 100 80 60 40 20 0 0 5 10 15 20 25 Table Size (unit = 1.000) 30 35 Static Merge Inf_Checked Discernability Metric (unit = 1M) 100 80 60 40 20 0 0 5 Static Merge Inf_Checked Mondrian (k=5.

Incognito (k=5.7) 1500 1000 500 0 0 5 10 15 20 25 Table Size (unit = 1. c=0.000) 30 35 Execution Time (unit = sec) Static Merge Inf_Checked Static Merge Inf_Checked Mondrian (k=5.7) 2000 500 Execution Time (unit = sec) 400 300 200 100 0 0 5 10 15 20 25 Table Size (unit = 1.000) 30 35 Figure 14: Execution Time 28 . c=0.

- QueueUploaded byMansoor Ali
- D_OptimalUploaded bypandaprasad
- EHSV M1286Uploaded byYashaswy Govada
- Support Vector MachineUploaded by1balamanian
- class12-PAC3Uploaded byscribd2313
- 100 Ovr CalculatorUploaded byMark Alty
- HW2solUploaded byCliff Tan
- HmkProbSolsUploaded bySalmanNadeem
- half2Uploaded byCrystal
- Control Systems Jan 2014Uploaded byPrasad C M
- NetworkSimTestCases-2015Uploaded byThinh Tran Van
- Market Equilibrium With Transaction CostsUploaded byjpwang
- תכן לוגי מתקדם- הרצאה 3 | הוספת שקופיות ב-2011Uploaded byRon
- ES 204 MP2Uploaded byWilliam Provido
- Root Locus TechniqueUploaded byTammanurRavi
- Solid Commerce NetSuite IntegrationUploaded byashok_subbu
- RP NotesUploaded byPrayag Gowgi
- 2010 IWYMIC IndividualUploaded bykwong
- Cambridge Text - AnswersUploaded bylorenzo
- aUploaded byJundel Sabellina Sistona
- blendkit wk 3 creating blended assignment instructionsUploaded byapi-354332387
- Full TextUploaded byGaurav Deshmukh
- 0BzldnLHvRAq_SUFTUG5YbnpSbVkUploaded byAnkit
- guymonmarqueztechnologyrolloutrubricUploaded byapi-246015400
- HW Set3 Gentry_graded (2)Uploaded bySakshi Singhania
- PointsUploaded bymsvinu
- 9 Key to Numerical DifferentiationUploaded byuchiha_rhenzaki
- 4Uploaded bynikhil
- Project 1Uploaded byzabi Ahmed
- Miscellaneous Mathematical Constants EtcUploaded byhascribd