Privacy-Preserving Incremental Data Dissemination

Ji-Won Byun, Tiancheng Li, Elisa Bertino, Ninghui Li
Department of Computer Science
Purdue University, West Lafayette, IN 47907, USA
{byunj, li83, bertino, ninghui}@cs.purdue.edu
Yonglak Sohn
Department of Computer Engineering
Seokyeong University, Seoul, Korea
syl@skuniv.ac.kr
Abstract
Although the k-anonymity and ℓ-diversity models have led to a number of valuable privacy-protecting
techniques and algorithms, the existing solutions are currently limited to static data release. That is, it is
assumed that a complete dataset is available at the time of data release. This assumption implies a sig-
nificant shortcoming, as in many applications data collection is rather a continual process. Moreover, the
assumption entails "one-time" data dissemination; thus, it does not adequately address today's strong de-
mand for immediate and up-to-date information. In this paper, we consider incremental data dissemination,
where a dataset is continuously incremented with new data. The key issue here is that the same data may
be anonymized and published multiple times, each time in a possibly different form. Thus, static anonymiza-
tion (i.e., anonymization which does not consider previously released data) may enable various types of
inference. In this paper, we identify such inference issues and discuss some prevention methods.
1 Introduction
When person-specific data is published, protecting individual respondents' privacy is a top priority. Among
various approaches addressing this issue, the k-anonymity model [22, 19] and the ℓ-diversity model [16] have
recently drawn significant attention in the research community. In the k-anonymity model, privacy protection
is achieved by ensuring that every record in a released dataset is indistinguishable from at least (k − 1) other
records within the dataset. Thus, every respondent included in the dataset corresponds to at least k records in
a k-anonymous dataset, and the risk of record identification (i.e., the probability of associating a particular
individual with a released record) is guaranteed to be at most 1/k. While the k-anonymity model primarily
focuses on the problem of record identification, the ℓ-diversity model, which is built upon the k-anonymity
model, addresses the risk of attribute disclosure (i.e., the probability of associating a particular individual with
a sensitive attribute value). As attribute disclosure may occur without records being identified (e.g., due
to lack of diversity in a sensitive attribute), the ℓ-diversity model, in its simplest form (we discuss more robust
ℓ-diversity requirements in Section 2), additionally requires that every group of indistinguishable records
contain at least ℓ distinct sensitive attribute values; thereby the risk of attribute disclosure is bounded by 1/ℓ.
Although these models have yielded a number of valuable privacy-protecting techniques [3, 8, 9, 11,
12, 21], existing approaches only deal with static data release. That is, all these approaches assume that a
complete dataset is available at the time of data release. This assumption implies a significant shortcoming,
as in many applications data collection is rather a continuous process. Moreover, the assumption entails
“one-time” data dissemination. Obviously, this does not address today’s strong demand for immediate and
up-to-date information, as the data cannot be released before the data collection is considered complete.
As a simple example, suppose that a hospital is required to share its patient records with a disease control
agency. In order to protect patients’ privacy, the hospital anonymizes all the records prior to sharing. At
first glance, the task seems reasonably straightforward, as existing anonymization techniques can efficiently
anonymize the records. The challenge is, however, that new records are continuously collected by the hospital
(e.g., whenever new patients are admitted), and it is critical for the agency to receive up-to-date data in a timely
manner.
One possible approach is to provide the agency with datasets containing only the new records, which are
independently anonymized, on a regular basis. Then the agency can either study each dataset independently or
merge multiple datasets together for more comprehensive analysis. Although straightforward, this approach
may suffer from severely low data quality. The key problem is that relatively small sets of records are
anonymized independently so that the records may have to be modified much more than when they are
anonymized together with previous records [5]. Moreover, a recoding scheme applied to each dataset may
make the datasets inconsistent with each other; thus, collective analysis on multiple datasets may require
additional data modification. Therefore, in terms of data quality, this approach is highly undesirable. One
may believe that data quality can be assured by waiting until a sufficiently large number of new records has
been accumulated. However, this approach may not be acceptable in many applications, as new data cannot
be released in a timely manner.
A better approach is to anonymize and provide the entire dataset whenever it is augmented with new
records (possibly along with another dataset containing only new records). In this way, the agency can be
provided with up-to-date, quality-preserving and “more complete” datasets each time. Although this ap-
proach can also be easily implemented by using existing techniques (i.e., anonymizing the entire dataset
every time), it has a significant drawback. That is, even though each released dataset, when observed inde-
pendently, is guaranteed to be anonymous, the combination of several released datasets may be vulnerable to
various inferences. We illustrate these inferences through some examples in Section 3.1. As such inferences
are typically made by comparing or linking records across different tables (or versions), we refer to them as
cross-version inferences to differentiate them from inferences that may occur within a single table.
Our goal in this paper is to identify and prevent cross-version inferences so that an increasing dataset can
be incrementally disseminated without compromising the imposed privacy requirement. In order to achieve
this, we first define the privacy requirement for incremental data dissemination. We then discuss three types
of cross-version inference that an attacker may exploit by observing multiple anonymized datasets. We also
present our anonymization method where the degree of generalization is determined based on the previ-
ously released datasets to prevent any cross-version inference. The basic idea is to obscure linking between
records across different datasets. We develop our technique in two different types of recoding approaches;
namely, full-domain generalization [11] and multidimensional anonymization [12]. One of the key differ-
ences between these two approaches is that the former generalizes a given dataset according to pre-defined
generalization hierarchies, while the latter does not. Based on our experimental results, we compare these two
approaches with respect to data quality and vulnerability to cross-version inference. Another issue we address is
that as a dataset is released multiple times, one may need to keep the history of previously released datasets.
We thus discuss how to maintain such history in a compact form to reduce unnecessary overheads.
The remainder of this paper is organized as follows. In Section 2, we review the basic concepts of the k-
anonymity and ℓ-diversity models and provide an overview of related techniques. In Section 3, we formulate
the privacy requirement for incremental data dissemination. Then in Section 4, we describe three types of
inference attacks based on our assumption of potential attackers. We present our approach to preventing these
inferences in Section 5 and evaluate our technique in Section 6. We review some related work in Section 7
and conclude our discussion in Section 8.
2 Preliminaries
In this section, we discuss a number of anonymization models and briefly review existing anonymization
techniques.
2.1 Anonymity Models
The k-anonymity model assumes that data are stored in a table (or a relation) of columns (or attributes) and
rows (or records). It also assumes that the target table contains person-specific information and that each
record in the table corresponds to a unique real-world individual. The process of anonymizing such a table
starts with removing all the explicit identifiers, such as name and SSN, from the table. However, even though
a table is free of explicit identifiers, some of the remaining attributes in combination could be specific enough
to identify individuals. For example, it has been shown that 87% of individuals in the United States can be
uniquely identified by a set of attributes such as {ZIP, gender, date of birth} [22]. This implies that each
attribute alone may not be specific enough to identify individuals, but a particular group of attributes together
may identify particular individuals [19, 22].
The main objective of the k-anonymity model is thus to transform a table so that no one can make high-
probability associations between records in the table and the corresponding individuals by using such a group
of attributes, called a quasi-identifier. In order to achieve this goal, the k-anonymity model requires that any
record in a table be indistinguishable from at least (k − 1) other records with respect to the quasi-identifier.
A set of records that are indistinguishable from each other is often referred to as an equivalence class. Thus,
a k-anonymous table can be viewed as a set of equivalence classes, each of which contains at least k records.
The enforcement of k-anonymity guarantees that even though an adversary knows the quasi-identifier value
of an individual and is sure that a k-anonymous table T contains the record of the individual, he cannot
determine which record in T corresponds to the individual with a probability greater than 1/k.
Although the k-anonymity model does not consider sensitive attributes, a private dataset typically contains
some sensitive attributes that are not part of the quasi-identifier. For instance, in a patient table, Diagnosis is
considered a sensitive attribute. For such datasets, the key consideration of anonymization is the protection
of individuals’ sensitive attributes. However, the k-anonymity model does not provide sufficient protection
in this setting, as it is possible to infer certain individuals’ sensitive attribute values without precisely re-
identifying their records. For instance, consider a k-anonymized table where all records in an equivalence
class have the same sensitive attribute value. Although none of these records can be uniquely matched with
the corresponding individuals, their sensitive attribute value can be inferred with probability 1. Recently,
Machanavajjhala et al. [16] pointed out such inference issues in the k-anonymity model and proposed the
notion of ℓ-diversity. Several formulations of ℓ-diversity are introduced in [16]. In its simplest form, the
ℓ-diversity model requires that records in each equivalence class have at least ℓ distinct sensitive attribute
values. As this requirement ensures that every equivalence class contains at least ℓ distinct sensitive attribute
values, the risk of attribute disclosure is kept under 1/ℓ. Note that in this case, the ℓ-diversity requirement
also ensures ℓ-anonymity, as the size of every equivalence class must be greater than or equal to ℓ. Although
simple and intuitive, modified datasets based on this requirement could still be vulnerable to probabilistic
inferences. For example, consider that among the ℓ distinct values in an equivalence class, one particular
value appears much more frequently than the others. In such a case, an adversary may conclude that the
individuals contained in the equivalence class are very likely to have that specific value. A more robust
diversity is achieved by enforcing entropy ℓ-diversity [16], which requires every equivalence class e to satisfy
the following condition:

−∑_{s∈S} p(e, s) log p(e, s) ≥ log ℓ

where S is the domain of the sensitive attribute and p(e, s) represents the fraction of records in e that have
sensitive value s. Although entropy ℓ-diversity does provide stronger privacy, the requirement may sometimes
be too restrictive. For instance, as pointed out in [16], in order for entropy ℓ-diversity to be achievable, the
entropy of the entire table must also be greater than or equal to log ℓ.
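For concreteness, the entropy condition can be checked per equivalence class with a few lines of Python; the sketch below assumes, purely for illustration, that an equivalence class is given as a plain list of its sensitive values.

```python
import math
from collections import Counter

def satisfies_entropy_l_diversity(sensitive_values, l):
    """Check -sum_s p(e,s) * log p(e,s) >= log l for one equivalence class e."""
    counts = Counter(sensitive_values)
    n = len(sensitive_values)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy >= math.log(l)

# Example: {flu, flu, cancer, asthma} has entropy ~1.04, so it is entropy 2-diverse
# (log 2 ~ 0.69) but not entropy 3-diverse (log 3 ~ 1.10).
```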
Recently, a number of anonymization models have been proposed. In [26], Xiao and Tao observed that ℓ-
diversity cannot prevent attribute disclosure when multiple records in the table correspond to one individual.
They proposed to have each individual specify privacy policies about his or her own attributes. In [13], Li
et al. observed that ℓ-diversity is neither sufficient nor necessary for protecting against attribute disclosure.
They proposed t-closeness as a stronger anonymization model, which requires the distribution of sensitive
values in each equivalence class to be analogous to the distribution of the entire dataset. In [15], Li and
Li considered the adversary’s background knowledge in defining privacy. They proposed an approach for
modeling the adversary’s background knowledge by using data mining techniques on the data to be released.
2.2 Anonymization Techniques
The k-anonymity (and ℓ-diversity) requirement is typically enforced through generalization, where real val-
ues are replaced with “less specific but semantically consistent values” [21]. Given a domain, there are
various ways to generalize the values in the domain. Intuitively, numeric values can be generalized into in-
tervals (e.g., [11−20]), and categorical values can be generalized into a set of possible values (e.g., {USA,
Canada, Mexico}) or a single value that represents such a set (e.g., North-America). As generalization makes
data uncertain, the utility of the data is inevitably downgraded. The key challenge of anonymization is thus
to minimize the amount of ambiguity introduced by generalization while enforcing the anonymity requirement.
Various generalization strategies have been developed. In the hierarchy-based generalization schemes,
a non-overlapping generalization-hierarchy is first defined for each attribute in the quasi-identifier. Then an
algorithm in this category tries to find an optimal (or good) solution which is allowed by such generalization
hierarchies. Here an optimal solution is a solution that satisfies the privacy requirement and at the same time
minimizes a desired cost metric. Based on the use of generalization hierarchies, the algorithms in this category
can be further classified into two subclasses. In the single-level generalization schemes [11, 19, 21], all the
values in a domain are generalized into a single level in the corresponding hierarchy. This restriction could
be a significant drawback in that it may lead to relatively high data distortion due to excessive generalization.
The multi-level generalization schemes [8, 9], on the other hand, allow values in a domain to be generalized
into different levels in the hierarchy. Although this leads to much more flexible generalization, possible
generalizations are still limited by the imposed generalization hierarchies.
Another class of generalization schemes is the hierarchy-free generalization class [3, 2, 12, 4]. Hierarchy-
free generalization schemes do not rely on the notion of pre-defined generalization hierarchies. In [3], Bayardo
et al. propose an algorithm based on a powerset search problem, where the space of anonymizations
(formulated as the powerset of totally ordered values in a dataset) is explored using a tree-search strategy.
In [2], Aggarwal et al. propose a clustering approach to achieve k-anonymity, which clusters records into
groups based on some distance metric. In [12], LeFevre et al. transform the k-anonymity problem into a
partitioning problem and propose a greedy approach that recursively splits a partition at the median value
until no more splits are allowed with respect to the k-anonymity requirement. Byun et al. [4], on the other hand,
introduce a flexible k-anonymization approach which uses the idea of clustering to minimize information loss
and thus ensures good data quality.
On the theoretical side, optimal k-anonymity has been proved to be NP-hard for k ≥ 3 [17]. Furthermore,
the curse of dimensionality also calls for more effective anonymization techniques, as it is shown in [1] that,
when the number of quasi-identifier attributes is high, enforcing k-anonymity necessarily results in severe
information loss, even for k = 2.
Recently, Xiao and Tao [25] propose Anatomy, a data anonymization approach that divides one table into
two for release: one table includes the original quasi-identifiers and a group ID, and the other includes the
associations between the group IDs and the sensitive attribute values. Koudas et al. [10] explore the permutation-based
anonymization approach and examine the anonymization problem from the perspective of answering down-
stream aggregate queries.
3 Problem Formulation
In this section, we start with an example to illustrate the problem of inference. We then describe our notion
of incremental dissemination and formally define a privacy requirement for it.
3.1 Motivating Examples
Let us revisit our previous scenario where a hospital is required to provide the anonymized version of its
patient records to a disease control agency. As previously discussed, to assure data quality, the hospital
anonymizes the patient table whenever it is augmented with new records. To make our example more con-
crete, suppose that the hospital relies on a model where both k-anonymity and ℓ-diversity are considered;
therefore, a '(k, ℓ)-anonymous' dataset is a dataset that satisfies both the k-anonymity and ℓ-diversity require-
ments. The hospital initially has a table like the one in Figure 1 and reports to the agency its (2, 2)-anonymous
table shown in Figure 2. As shown, the probabilities of identity disclosure (i.e., the association between indi-
vidual and record) and attribute disclosure (i.e., the association between individual and diagnosis) are both kept
under 1/2 in the dataset. For example, even if an attacker knows that the record of Tom, who
is a 21-year-old male, is in the released table, he cannot learn about Tom's disease with a probability greater
than 1/2 (although he learns that Tom has either asthma or flu). At a later time, three more patient records
(shown in italics) are inserted into the dataset, resulting in the table in Figure 3. The hospital then releases a new
(2, 2)-anonymous table as depicted in Figure 4. Observe that Tom's privacy is still protected in the newly
released dataset. However, not every patient’s privacy is protected from the attacker.
Example 1 “Alice has cancer!” Suppose the attacker knows that Alice, who is in her late twenties, has
recently been admitted to the hospital. Thus, he knows that Alice’s record is not in the old dataset in Figure 2,
but in the new dataset in Figure 4. From the new dataset, he learns only that Alice has one of {Asthma, Flu,
Cancer}. However, by consulting the previous dataset, he can easily deduce that Alice has neither asthma nor
flu (as those records must belong to patients other than Alice). He now infers that Alice has cancer.
Example 2 "Bob has Alzheimer's!" The attacker knows that Bob is 52 years old and has long been treated in
the hospital. Thus, he is sure that Bob's record is in both datasets in Figures 2 and 4. First, by studying the
old dataset, he learns that Bob suffers from either Alzheimer's disease or diabetes. Now the attacker checks the new
dataset and learns that Bob has either Alzheimer's disease or heart disease. He can thus conclude that Bob suffers from
Alzheimer's disease. Note that three other records in the new dataset are also vulnerable to similar inferences.
As shown in the examples above, anonymizing a dataset without considering previously released infor-
mation may enable various inferences.
3.2 Incremental data dissemination and privacy requirement
Let T be a private table with a set of quasi-identifier attributes Q and a sensitive attribute S. We assume
that T consists of person-specific records, each of which corresponds to a unique real-world individual. We
also assume that T continuously grows with new records and denote the state of T at time i as T_i. For the
privacy of individuals, each T_i must be "properly" anonymized before being released to the public. Our goal is
to address both identity disclosure and attribute disclosure, and we adopt an anonymity model (a similar model
is also introduced in [24]) that combines the requirements of k-anonymity and ℓ-diversity as follows.
Definition 1 ((k, c)-Anonymity) Let T be a table with a set of quasi-identifier attributes Q and a sensitive
attribute S. With respect to Q, T consists of a set of non-empty equivalence classes, where ∀ e ∈ T, record
r ∈ e ⇒ r[Q] = e[Q]. We say that T is (k, c)-anonymous with respect to Q if the following conditions are
satisfied.

1. ∀ e ∈ T, |e| ≥ k, where k > 0.

2. ∀ e ∈ T, ∀ s ∈ S, |{r | r ∈ e ∧ r[S] = s}| / |e| ≤ c, where 0 < c ≤ 1.

The first condition ensures the k-anonymity requirement, and the second condition enforces the diversity
requirement on the sensitive attribute. In its essence, the second condition dictates that the maximum confi-
dence of association between any quasi-identifier value and a particular sensitive attribute value in T must
not exceed a threshold c.
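As an illustration, the two conditions of Definition 1 can be verified directly; the following sketch assumes a released table is represented as a list of equivalence classes, each a list of (quasi-identifier, sensitive value) pairs — a layout we choose purely for illustration.

```python
from collections import Counter

def is_kc_anonymous(equivalence_classes, k, c):
    """Check both conditions of (k, c)-anonymity over a partitioned table."""
    for eq_class in equivalence_classes:
        if len(eq_class) < k:                               # condition 1: |e| >= k
            return False
        sens_counts = Counter(s for _, s in eq_class)
        if max(sens_counts.values()) / len(eq_class) > c:   # condition 2
            return False
    return True

# e.g., is_kc_anonymous([[("Q1", "flu"), ("Q1", "cancer")]], k=2, c=0.5) -> True
```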
At a given time i, only a (k, c)-anonymous version of T_i, denoted as T̂_i, is released to the public. Thus,
users, including potential attackers, may have access to a series of (k, c)-anonymous tables, T̂_1, T̂_2, . . ., where
|T̂_i| ≤ |T̂_j| for i < j. As every released table is (k, c)-anonymous, by observing each table independently, one
cannot associate a record with a particular individual with probability higher than 1/k or infer any individual's
sensitive attribute with confidence higher than c. However, as shown in Section 3.1, it is possible that one
can increase the confidence of such undesirable inferences by observing the differences between the released
tables. For instance, if an observer can be sure that two (anonymized) records in two different versions indeed
correspond to the same individual, then he may be able to use this knowledge to infer more information than
what is allowed by the (k, c)-anonymity protection. If such a case occurs, we say that there is an inference
channel between the two versions.
Definition 2 (Cross-version inference channel) Let Θ = {T̂_1, . . . , T̂_n} be the set of all released tables for
private table T, where T̂_i is a (k, c)-anonymous version released at time i, 1 ≤ i ≤ n. Let θ ⊆ Θ and T̂_i ∈ Θ.
We say that there exists a cross-version inference channel from θ to T̂_i, denoted as θ ↝ T̂_i, if observing the
tables in θ and T̂_i collectively increases the risk of either identity disclosure or attribute disclosure in T̂_i above
1/k or c, respectively.
When data are disseminated incrementally, it is critical to ensure that there is no cross-version inference
channel among the released tables. In other words, the data provider must make sure that not only each
released table is free of undesirable inferences, but also no released table creates cross-version inference
channels with respect to the previously released tables. We formally define this requirement as follows.
Definition 3 (Privacy-preserving incremental data dissemination) Let Θ = {T̂_0, . . . , T̂_n} be the set of all
released tables of private table T, where T̂_i is a (k, c)-anonymous version of T released at time i, 0 ≤ i ≤ n.
Θ is said to be privacy-preserving if and only if there exists no pair (θ, T̂_i) such that θ ⊆ Θ, T̂_i ∈ Θ, and θ ↝ T̂_i.
4 Cross-version Inferences
We first describe potential attackers and their knowledge that we assume in this paper. Then based on the
attack scenario, we identify three types of cross-version inference attacks in this section.
4.1 Attack scenario
We assume that the attacker has been keeping track of all the released tables; he thus possesses a set of
released tables {T̂_0, . . . , T̂_n}. We also assume that the attacker has the knowledge of who is and who is not
contained in each table; that is, for each anonymized table T̂_i, the attacker also possesses a population table
U_i which contains the explicit identifiers and the quasi-identifiers of the individuals in T̂_i. This may seem
too far-fetched at first glance; however, we assume the worst case, as we cannot rely on the attacker's lack of
knowledge. Also, such knowledge is not always difficult to acquire for a dedicated attacker. For instance,
consider medical records released by a hospital. Although the attacker may not be aware of all the patients,
he may know when target individuals in whom he is interested (e.g., local celebrities) are admitted to the
hospital. Based on this knowledge, the attacker can easily deduce which tables may include such individuals
and which tables may not. Another, perhaps the worst, possibility is that the attacker may collude with an
insider who has access to detailed information about the patients; e.g., the attacker could obtain a list of
patients from a registration staff member (nowadays, about 80% of privacy infringement is committed by
insiders). Thus, it is reasonable to assume that the attacker's knowledge includes the list of individuals
contained in each table as well as their quasi-identifier values. However, as all the released tables are
(k, c)-anonymous, the attacker cannot infer the individuals' sensitive attribute values with a significant
probability, even utilizing such knowledge. Therefore, the goal of the attacker is to increase his/her confidence
of attribute disclosure (i.e., above c) by comparing the released tables all together. In the remainder of this
section, we describe three types of cross-version inferences that the attacker may exploit in order to achieve
this goal.
4.2 Notations
We first introduce some notations used in our discussion. Let T be a table with a set of quasi-identifier
attributes Q and a sensitive attribute S. Let A be a set of attributes, where A ⊆ (Q ∪ S). Then T[A] denotes
the duplicate-eliminating projection of T onto the attributes A. Let e_i = {r_0, . . . , r_m} be an equivalence
class in T, where m > 0. By definition, the records in e_i all share the same quasi-identifier value, and e_i[Q]
represents the common quasi-identifier value of e_i. We also use similar notations for individual records; that
is, for record r ∈ T, r[Q] represents the quasi-identifier value of r and r[S] the sensitive attribute value of r.
In addition, we use T⟨A⟩ to denote the duplicate-preserving projection of T. For instance, e_i⟨S⟩ represents
the multiset of all the sensitive attribute values in e_i. We also use |X| to denote the cardinality of a set X.

Regardless of the recoding scheme, we consider a generalized value as a set of possible values. Suppose
that v is a value from a domain D and v̄ a generalized value of v. Then we denote this relation as v ⪯ v̄, and
interpret v̄ as a set of values where (v ∈ v̄) ∧ (∀ v_i ∈ v̄, v_i ∈ D). Overloading this notation, we say that r̄ is
a generalized version of record r, denoted as r ⪯ r̄, if (∀ q_i ∈ Q, r[q_i] ⪯ r̄[q_i]) ∧ (r[S] = r̄[S]). Moreover,
we say that two generalized values v̄_1 and v̄_2 are compatible, denoted as v̄_1 ∼ v̄_2, if v̄_1 ∩ v̄_2 ≠ ∅. Similarly,
two generalized records r̄_1 and r̄_2 are compatible (i.e., r̄_1 ∼ r̄_2) if ∀ q_i ∈ Q, r̄_1[q_i] ∩ r̄_2[q_i] ≠ ∅. We also say
that two equivalence classes e_1 and e_2 are compatible if ∀ q_i ∈ Q, e_1[q_i] ∩ e_2[q_i] ≠ ∅.
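Under these notations, compatibility is simply a non-empty intersection test. The sketch below assumes, for illustration only, that every generalized attribute value is materialized as a Python set of the base values it covers (an interval would be expanded to the set it represents).

```python
def compatible_values(v1, v2):
    """Generalized values are compatible if their value sets intersect."""
    return len(v1 & v2) > 0

def compatible_records(r1, r2, quasi_identifier):
    """Two generalized records are compatible if every QI attribute overlaps."""
    return all(compatible_values(r1[q], r2[q]) for q in quasi_identifier)

# Example: age generalized to {21,...,25} and to {24,...,30} overlap on {24, 25},
# so the two generalized records may correspond to the same individual.
```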
4.3 Difference attack
Let T̂_i = {e_{0,1}, . . . , e_{0,n}} and T̂_j = {e_{1,1}, . . . , e_{1,m}} be two (k, c)-anonymous tables that are released at
time i and j (i ≠ j), respectively. As previously discussed in Section 4.1, we assume that an attacker knows
who is and who is not in each released table. Also knowing the quasi-identifier values of the individuals in
T̂_i and T̂_j, for any equivalence class e in either T̂_i or T̂_j, the attacker knows the individuals whose records
are contained in e. Let I(e) represent the set of individuals in e. With this information, the attacker can now
perform difference attacks as follows. Let E_i and E_j be two sets of equivalence classes, where E_i ⊆ T̂_i and
E_j ⊆ T̂_j. If ∪_{e∈E_i} I(e) ⊆ ∪_{e∈E_j} I(e), then the set D = ∪_{e∈E_j} I(e) − ∪_{e∈E_i} I(e) represents the set of individuals
whose records are in E_j, but not in E_i. Furthermore, the set S_D = ∪_{e∈E_j} e⟨S⟩ − ∪_{e∈E_i} e⟨S⟩ indicates the set of
sensitive attribute values that belong to those individuals in D. Therefore, if D contains fewer than k records,
or if the most frequent sensitive value in S_D appears with a probability larger than c, the (k, c)-anonymity
requirement is violated.
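This test on a candidate pair (E_i, E_j) can be rendered directly as follows; the record layout and helper names are our own assumptions for illustration.

```python
from collections import Counter

def difference_violation(E_i, E_j, k, c):
    """E_i, E_j: lists of (individuals, sensitive_values) pairs, one per equivalence
    class, where individuals is the set I(e) and sensitive_values is the list e<S>.
    Returns True if D = I(E_j) - I(E_i) violates the (k, c) requirement."""
    ind_i = set().union(*[ids for ids, _ in E_i])
    ind_j = set().union(*[ids for ids, _ in E_j])
    if not ind_i <= ind_j:          # the difference attack requires containment
        return False
    D = ind_j - ind_i
    if not D:
        return False
    S_D = Counter()                 # multiset difference of the sensitive values
    for _, svals in E_j:
        S_D.update(svals)
    for _, svals in E_i:
        S_D.subtract(svals)
    S_D = +S_D                      # keep only positive multiplicities
    if len(D) < k:
        return True
    total = sum(S_D.values())
    return total > 0 and max(S_D.values()) / total > c
```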
The DirectedCheck procedure in Figure 5 checks if the first table of the input is vulnerable to difference
attack with respect to the second table. Two tables T̂_i and T̂_j are vulnerable to difference attack if at least one
of DirectedCheck(T̂_i, T̂_j) and DirectedCheck(T̂_j, T̂_i) returns true.

The DirectedCheck procedure enumerates all subsets of T̂_i, and for each set E, the procedure calls the Get-
MinSets procedure in Figure 6 to get the minimum set E′ of equivalence classes in T̂_j that contains all the
records in E. We call such E′ the minimum covering set of E. The procedure then checks whether there is
vulnerability between the two sets of equivalence classes E and E′. As the algorithm checks all the subsets of T̂_i,
the time complexity is exponential (i.e., it is O(2^n), where n is the number of equivalence classes in T̂_i). As
such, for large tables with many equivalence classes, this brute-force check is clearly unacceptable. In what
follows, we discuss a few observations that result in effective heuristics to reduce the space of the problem in
most cases.
Observation 1 Let E_1 and E_2 be two sets of equivalence classes in T̂_i, and E′_1 and E′_2 be their minimal
covering sets in T̂_j, respectively. If E_1 and E_2 are not vulnerable to difference attack, and E′_1 ∩ E′_2 = ∅, then
we do not need to consider any subset of T̂_i which contains E_1 ∪ E_2.

The fact that E_1 and E_2 are not vulnerable to difference attacks means that the sets (E′_1 − E_1) and (E′_2 − E_2)
are both (k, c)-anonymous. As the minimum covering set of E_1 ∪ E_2 is E′_1 ∪ E′_2, and E′_1 and E′_2 are disjoint,
(E′_1 ∪ E′_2) − (E_1 ∪ E_2) is also (k, c)-anonymous. This also implies that if each of E_1 and E_2 is not vulnerable
to difference attack, then neither is any set containing E_1 ∪ E_2. Based on this observation, we can modify the
method in Figure 5 as follows. Each time we union one more element with a subset to create a larger subset,
we check whether their minimum covering sets are disjoint. If they are, we do not insert the unioned subset into the
queue. Note that this also prevents all the sets containing the unioned set from being generated.
Observation 2 Let E_1 and E_2 be two sets of equivalence classes in T̂_i, and E′_1 and E′_2 be their minimal
covering sets in T̂_j, respectively. If E′_1 = E′_2, then we only need to check whether E_1 ∪ E_2 is vulnerable to
difference attack.

In other words, we can skip checking whether each of E_1 and E_2 is vulnerable to difference attack. This is
because unless E_1 ∪ E_2 is vulnerable to difference attack, neither E_1 nor E_2 is vulnerable. Thus, we can save
computational effort as follows. When we insert a new subset into the queue, we check whether there exists
another set with the same minimum covering set. If such a set is found, we simply merge the new subset with
the found set.
Observation 3 Consider the method in Figure 5. Suppose that T̂_i was released after T̂_j; that is, T̂_i contains
some records that are not in T̂_j. If an equivalence class e ∈ T̂_i contains such records, then we do not need to
consider that equivalence class for difference attack.

It is easy to see that if e ∈ T̂_i contains some record(s) that T̂_j does not, the minimum covering set of e
is an empty set. Since e itself must be (k, c)-anonymous, e is safe from difference attack. Based on this
observation, we can purge all such equivalence classes from the initial problem set.
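Putting these pruning rules together, a sketch of the directed check could look as follows. The class layout and the helpers min_cover and violates are assumptions made for illustration (violates being the pairwise test sketched earlier); the actual procedures are those of Figures 5 and 6.

```python
from collections import deque
from itertools import chain

def directed_check(T_i, T_j, violates, min_cover):
    """Directed difference-attack check with the pruning of Observations 1 and 3.
    Each equivalence class is a dict holding a set under 'individuals';
    min_cover(E, T_j) returns the indices (a frozenset) of the minimum covering
    set of E in T_j; violates(E, cover_classes) is the pairwise (k, c) test."""
    all_j = set(chain.from_iterable(e["individuals"] for e in T_j))
    # Observation 3: classes holding records absent from T_j have no covering set.
    cand = [e for e in T_i if e["individuals"] <= all_j]
    queue = deque((frozenset([i]), min_cover([cand[i]], T_j)) for i in range(len(cand)))
    while queue:
        idx, cover = queue.popleft()
        if violates([cand[i] for i in idx], [T_j[b] for b in cover]):
            return True
        for i in range(max(idx) + 1, len(cand)):
            single = min_cover([cand[i]], T_j)
            # Observation 1: if the covering sets are disjoint (and both parts are
            # individually safe), no superset of the union needs to be examined.
            if cover and single and not (cover & single):
                continue
            queue.append((idx | {i}, cover | single))
    return False
```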
4.4 Intersection attack
The key idea of k-anonymity is to introduce sufficient ambiguity into the association between quasi-identifier
values and sensitive attribute values. However, this ambiguity may be reduced to an undesirable level if the
structure of the equivalence classes varies across different releases. For instance, suppose that the attacker wants
to know the sensitive attribute of Alice, whose quasi-identifier value is q_A. Then the attacker can select a
set of tables, θ_A^+, that all contain Alice's record. As the attacker knows the quasi-identifier of Alice, he does
not need to examine all the records; he just needs to consider the records that may possibly correspond to
Alice. That is, in each T̂_i ∈ θ_A^+, the attacker only needs to consider an equivalence class e_i ∈ T̂_i, where
q_A ⪯ e_i[Q]. Let E_A = {e_0, . . . , e_n} be the set of all equivalence classes identified from θ_A^+ such that
q_A ⪯ e_i[Q], 0 ≤ i ≤ n. As every e_i is (k, c)-anonymous, the attacker cannot infer Alice's sensitive attribute
value with confidence higher than c by examining each e_i independently. However, as every equivalence class
in E_A contains Alice's record, the attacker knows that Alice's sensitive attribute value, s_A, must be present
in every equivalence class in E_A; i.e., ∀ e_i ∈ E_A, s_A ∈ e_i⟨S⟩. This implies that s_A must be found in the set
SI_A = ∩_{e_i∈E_A} e_i⟨S⟩. Therefore, if the most frequent value in SI_A appears with a probability greater than
c, then the sensitive attribute value of Alice can be inferred with confidence greater than c.
The pseudo-code in Figure 7 provides an algorithm for checking the vulnerability of two given (k, c)-anonymous
tables, T̂_0 and T̂_1, to the intersection attack. The basic idea is to check every pair of equivalence
classes e_i ∈ T̂_0 and e_j ∈ T̂_1 that contain the same record(s).
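For illustration, the core of the intersection check for a single target individual can be written as follows, assuming we have already collected, for each release, the sensitive-value multiset of the equivalence class containing the target's record (e.g., by consulting the population tables); the helper name is ours.

```python
from collections import Counter
from functools import reduce

def intersection_attack_confidence(classes_with_target):
    """classes_with_target: for each release, the list of sensitive values e_i<S>
    of the equivalence class holding the target's record.
    Returns the attacker's confidence for the most likely sensitive value."""
    # Multiset intersection SI_A of the sensitive-value multisets (Counter &).
    si = reduce(lambda a, b: a & b, (Counter(svals) for svals in classes_with_target))
    total = sum(si.values())
    return max(si.values()) / total if total else 0.0

# If the returned value exceeds c, the releases form an intersection inference channel.
```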
4.5 Record-tracing attack
Unlike the previous attacks, the attacker may be interested in knowing who may be associated with a par-
ticular attribute value. In other words, instead of wanting to know what sensitive attribute value a particular
individual has, the attacker now wants to know which individuals possess a specific sensitive attribute value;
e.g., the individuals who suffer from ‘HIV+’. Let s
p
be the sensitive attribute value in which the attacker is
interested and

T
i
∈ Θ be the table in which (at least) one record with sensitive value s
p
appears. Although

T
9
may contain more than one record with s
p
, suppose, for simplicity, that the attacker is interested in a partic-
ular record r
p
such that (r
p
[S] = s
p
) ∧ (r
p
∈ e
i
). As

T is (k, c)-anonymous, when the attacker queries the
population table U
i
with r
p
[Q], he obtains at least k individuals who may correspond to r
p
. Let I
p
be the set
of such individuals. Suppose that the attacker also possesses a subsequently released table

T
j
(i < j) which
includes r
p
. Note that in each of these tables the quasi-identifier of r
p
may be generalized differently. This
means that if the attacker can identify from

T
j
the record corresponding to r
p
, then he may be able to learn
additional information about the quasi-identifier of the individual corresponding to r
p
and possibly reduce
the size of I
p
. There are many cases where the attacker can identify r
p
in

T
j
. However, in order to illustrate
our point clearly, we show some simple cases in the following example.
Example 3 For simplicity, let j = i + 1. The attacker knows that r_p must be contained in the equivalence class
of T̂_{i+1} that is compatible with r_p[Q]. Suppose that there is only one compatible equivalence class, e_{i+1}, in
T̂_{i+1} (see Figure 8 (i)). Then the attacker can confidently combine his knowledge of the quasi-identifier of
r_p; i.e., r_p[Q] ← r_p[Q] ∩ e_{i+1}[Q]. Suppose now that there is more than one compatible equivalence class in
T̂_{i+1}, say e_{i+1} and e′_{i+1}. If s_p ∈ e_{i+1}[S] and s_p ∉ e′_{i+1}[S], then the attacker can be sure that r_p ∈ e_{i+1} and
updates his knowledge of r_p[Q] as r_p[Q] ∩ e_{i+1}[Q]. However, if s_p ∈ e_{i+1}[S] and s_p ∈ e′_{i+1}[S], then r_p could
be in either e_{i+1} or e′_{i+1} (see Figure 8 (ii)). Although the attacker may or may not determine which equivalence
class contains r_p, he is sure that r_p ∈ e_{i+1} ∪ e′_{i+1}; therefore, r_p[Q] ← r_p[Q] ∩ (e_{i+1}[Q] ∪ e′_{i+1}[Q]).
After updating r_p[Q] with T̂_{i+1}, the attacker can reexamine I_p and eliminate individuals whose quasi-
identifiers are no longer compatible with the updated r_p[Q]. When the size of I_p becomes less than k, the
attacker can infer the association between the individuals in I_p and r_p with a probability higher than 1/k.

In the above example, when there is more than one compatible equivalence class {e_{i+1,1}, . . . , e_{i+1,r}} in
T̂_{i+1}, we say that the attacker updates r_p[Q] as r_p[Q] ∩ (∪_{1≤ℓ≤r} e_{i+1,ℓ}[Q]). While correct, this is not a sufficient
description of what the attacker can do, as there are cases where the attacker can notice that some equivalence
classes in T̂_{i+1} cannot contain r_p. For example, let r_1 ∈ e_{i,1} and r_2 ∈ e_{i,2} be two records in T̂_i, both
taking s_r as the sensitive value (see Figure 9 (i)). Suppose that T̂_{i+1} contains a single equivalence class, e_{i+1,1},
that is compatible with r_1, and two equivalence classes, e_{i+1,1} and e_{i+1,2}, that are compatible with r_2.
Although r_2 has two compatible equivalence classes, the attacker can be sure that r_2 is included in e_{i+1,2}, as
the record with s_r in e_{i+1,1} must correspond to r_1. Figure 9 (ii) illustrates another case of which the attacker
can take advantage. As shown, there are two records in e_{i,1} that take s_r as the sensitive value. Although the
attacker cannot be sure whether each of these records is contained in e_{i+1,1} or e_{i+1,2}, he is sure that one record is
in e_{i+1,1} and the other in e_{i+1,2}. Thus, he can make an arbitrary choice and update his knowledge about the
quasi-identifiers of the two records accordingly. Using such techniques, the attacker can make more precise
inferences by eliminating equivalence classes in T̂_{i+1} that cannot possibly contain the record being traced.
We now describe a more thorough algorithm that checks two (k, c)-anonymous tables, T̂_i and T̂_j, for vulnerability
to the record-tracing attack. First, we construct a bipartite graph G = (V, E), where V = V_1 ∪ V_2, each
vertex in V_1 represents a record in T̂_i, and each vertex in V_2 represents a record in T̂_j that is compatible
with at least one record in T̂_i. We define E as the set of edges from vertices in V_1 to vertices in V_2, which
represent possible matching relationships. That is, if there is an edge from r_i ∈ V_1 to r_j ∈ V_2, this means
that records r_i and r_j may both correspond to the same record although they are generalized into different
forms. We create such edges between V_1 and V_2 as follows. For each vertex r ∈ V_1, we find from V_2 the set
of records R where ∀ r_i ∈ R, (r[Q] ∼ r_i[Q]) ∧ (r[S] = r_i[S]). If |R| = 1, then we create an
edge from r to the single record r′ ∈ R and mark it with ⟨d⟩, which indicates that r definitely corresponds to r′. If |R| > 1, then
we create an edge from r to every r′_i ∈ R and mark it with ⟨p⟩ to indicate that r plausibly corresponds to
r′_i. Now, given the constructed bipartite graph, the pseudo-code in Figure 10 removes plausible edges that are
not feasible and discovers more definite edges by scanning through the edges.
Note that the algorithm above does not handle the case illustrated in Figure 9 (ii). In order to address
such cases, we also perform the following. For each equivalence class e_{1,i} ∈ T̂_i, we find from T̂_j the set of
equivalence classes E where ∀ e_{2,j} ∈ E, e_{1,i}[Q] ∼ e_{2,j}[Q]. If, for some sensitive value s, the same number of
records with s appear in both e_{1,i} and E, we remove unnecessary plausible edges so that each of these records in
e_{1,i} has a definite edge to a distinct record in E.
After all infeasible edges are removed, each record r_{1,i} ∈ V_1 is associated with a set of possibly matching
records {r_{2,j}, . . . , r_{2,m}} (j ≤ m) in V_2. Now we can follow the edges and compute for each record r_{1,i} ∈ T̂_i
the inferable quasi-identifier r′_{1,i}[Q] = r_{1,i}[Q] ∩ (∪_{ℓ=j,...,m} r_{2,ℓ}[Q]). If any inferred quasi-identifier maps
to fewer than k individuals in the population table U_i, then table T̂_i is vulnerable to the record-tracing attack
with respect to T̂_j.
It is worth noting that the key problem enabling the record-tracing attack arises from the fact that the
sensitive attribute value of a record, together with its generalized quasi-identifier, may uniquely identify the
record in different anonymous tables. This issue can be especially critical for records with rare sensitive
attribute values (e.g., rare diseases) or tables where every individual has a unique sensitive attribute value
(e.g., DNA sequence).
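A simplified sketch of the edge-construction step and of the attacker's quasi-identifier refinement is given below; the pruning of infeasible edges (Figure 10) and the special case of Figure 9 (ii) are omitted, and the record layout and helper names are assumptions made for illustration.

```python
def build_tracing_edges(records_i, records_j, quasi_identifier):
    """records_*: lists of dicts mapping each QI attribute to the set of possible
    values (the generalized value) plus an 'S' entry for the sensitive value.
    Returns a list of (index_in_i, index_in_j, label) edges."""
    edges = []
    for a, r in enumerate(records_i):
        matches = [b for b, r2 in enumerate(records_j)
                   if r["S"] == r2["S"]
                   and all(r[q] & r2[q] for q in quasi_identifier)]
        label = "d" if len(matches) == 1 else "p"   # definite vs. plausible
        edges.extend((a, b, label) for b in matches)
    return edges

def traced_quasi_identifier(r, matched_records, quasi_identifier):
    """Attacker's refined knowledge: r[Q] intersected with the union of the
    generalized QIs of all records r may correspond to."""
    return {q: r[q] & set().union(*(m[q] for m in matched_records))
            for q in quasi_identifier}
```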
5 Inference Prevention
In this section, we describe our incremental data anonymization which incorporates the inference detection
techniques in the previous section. We first describe our data/history management strategy which aims to
reduce the computational overheads. Then, we describe the properties of our checking algorithms which
make them suitable for existing data anonymization techniques such as full-domain generalization [11] and
multidimensional anonymization [12].
5.1 Data/history management
Consider a newly anonymized table, T̂_i, which is about to be released. In order to check whether T̂_i is vul-
nerable to cross-version inferences, it is essential to maintain some form of history about the previously released
datasets, Θ = {T̂_0, . . . , T̂_{i−1}}. However, checking the vulnerability of T̂_i against each table in Θ can be
computationally expensive. To avoid such inefficiency, we maintain a history table H_i at time i, which has
the following attributes.

• RID : a unique record ID (or the explicit identifier of the corresponding individual). Assuming that
each T̂_i also contains RID (which is projected out before being released), RID is used to join H_i and T̂_i.

• TS (Time Stamp) : the time (or version number) at which the record was first released.

• IS (Inferable Sensitive values) : the set of sensitive attribute values with which the record can be
associated. For instance, if record r is released in equivalence class e_i of T̂_i, then r[IS]_i ← (r[IS]_{i−1} ∩
e_i⟨S⟩). This field is used for checking vulnerability to intersection attack.

• IQ (Inferable Quasi-identifier) : keeps track of the quasi-identifiers into which the record has previously
been generalized. For instance, for record r ∈ T̂_i, r[IQ]_i ← r[IQ]_{i−1} ∩ r[Q]. This field is used for
checking vulnerability to record-tracing attack.
The main idea of H_i is to keep track of the attacker's accumulated knowledge on each released record.
For instance, the value r[IS] of record r ∈ H_{i−1} indicates the set of sensitive attribute values that the attacker
may be able to associate with r prior to the release of T̂_i. This is indeed the worst case, as we are assuming
that the attacker possesses every released table, i.e., Θ. However, as discussed in Section 4.1, we need to be
conservative when estimating what the attacker can do. Using H, the cost of checking T̂_i for vulnerability
can be significantly reduced; for intersection and record-tracing attacks, we check T̂_i against H_{i−1}, instead
of every T̂_j ∈ Θ (in our current implementation, difference attack is still checked against every previously
released table).
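A minimal sketch of how H_i can be updated at each release is shown below, assuming records are keyed by RID and that, for each record in the new release, the sensitive-value set of its equivalence class and its generalized quasi-identifier are available; the field names follow the list above, everything else is our assumption.

```python
def update_history(history, new_release, version):
    """history: dict RID -> {"TS": int, "IS": set, "IQ": dict of QI attr -> set}.
    new_release: dict RID -> {"class_sensitive_values": set, "gen_qi": dict}."""
    for rid, info in new_release.items():
        if rid not in history:
            history[rid] = {"TS": version,
                            "IS": set(info["class_sensitive_values"]),
                            "IQ": dict(info["gen_qi"])}
        else:
            entry = history[rid]
            # IS: sensitive values still compatible with every release so far
            entry["IS"] &= info["class_sensitive_values"]
            # IQ: quasi-identifier region still compatible with every release
            entry["IQ"] = {q: entry["IQ"][q] & info["gen_qi"][q]
                           for q in entry["IQ"]}
    return history
```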
5.2 Incorporating inference detection into data anonymization
We now discuss how to incorporate the inference detection algorithms into secure anonymization algorithms.
We first consider full-domain anonymization, where all values of an attribute are consistently generalized
to the same level in the predefined generalization hierarchy. In [11], LeFevre et al. propose an algorithm that
finds minimal generalizations for a given table. In its essence, the proposed algorithm is a bottom-up search
approach in that it starts with un-generalized data and tries to find minimal generalizations by increasingly
generalizing the target data in each step. The key property on which the algorithm relies is the generalization
property: given a table T and two generalization strategies G_1, G_2 (G_1 ⪯ G_2), if G_1(T) is k-anonymous,
then G_2(T) is also k-anonymous (this property is also used in [16] for ℓ-diversity and is thus applicable to
(k, c)-anonymity). Although intuitive, this property is critical as it guarantees the optimality
of the discovered solutions; i.e., once the search finds a generalization level that satisfies the k-anonymity
requirement, we do not need to search further.
Observation 4 Given a table T and two generalization strategies G_1, G_2 (G_1 ⪯ G_2), if G_1(T) is not
vulnerable to any inference attack, then neither is G_2(T).

The proof is simple. As each equivalence class in G_2(T) is the union of one or more equivalence classes
in G_1(T), the information about each record in G_2(T) is more vague than that in G_1(T); thus, G_2 does not
create more inference attacks than G_1. Based on this observation, we modify the algorithm in [11] as follows.
In each step of generalization, in addition to checking the (k, c)-anonymity requirement, we also check for
the vulnerability to inference. If either check fails, then we need to further generalize the data.
Next, we consider the multidimensional k-anonymity algorithm proposed in [12]. Specifically, the algo-
rithm consists of the following two steps. The first step is to find a partitioning scheme of the d-dimensional
space, where d is the number of attributes in the quasi-identifier, such that each partition contains more than
k records. In order to find such a partitioning, the algorithm recursively splits a partition at the median value
(of a selected dimension) until no more splits are allowed with respect to the k-anonymity requirement. Note
that unlike the previous algorithm, this algorithm is a top-down search approach, and the quality of the search
relies on the following property (it is easy to see that the property also holds for any diversity requirement):
given a partition p, if p does not satisfy the k-anonymity requirement, then neither does any sub-partition of p.

Observation 5 Given a partition p of records, if p is vulnerable to any inference attack, then so is any sub-
partition of p.

Suppose that we have a partition p_1 of the dataset, in which some records are vulnerable to inference
attacks. Then, any further cut of p_1 will lead to a dataset that is also vulnerable to inference attacks. This
is based on the fact that any further cut on p_1 leads to de-generalization of the dataset; thus, it reveals more
information about each record than p_1. Based on this observation, we modify the algorithm in [12] as follows.
In each step of partitioning, in addition to checking the (k, c)-anonymity requirement, we also check for
vulnerability to inference. If either check fails, then we do not further partition the data.
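The resulting top-down procedure can be summarized by the following sketch, where find_best_split stands for the median-based partitioning heuristic of [12], is_kc_anonymous for the check of Definition 1, and vulnerable_to_inference for the detection algorithms of Section 4; all three helpers are assumed here, not implemented.

```python
def anonymize_partition(partition, k, c, history,
                        find_best_split, is_kc_anonymous, vulnerable_to_inference):
    """Recursively split a partition; stop when a split would break (k, c)-anonymity
    or open a cross-version inference channel (Observation 5)."""
    split = find_best_split(partition)           # e.g., median cut on one dimension
    if split is None:
        return [partition]
    left, right = split
    for side in (left, right):
        if not is_kc_anonymous([side], k, c) or vulnerable_to_inference(side, history):
            return [partition]                   # keep the coarser, safe partition
    return (anonymize_partition(left, k, c, history, find_best_split,
                                is_kc_anonymous, vulnerable_to_inference)
            + anonymize_partition(right, k, c, history, find_best_split,
                                  is_kc_anonymous, vulnerable_to_inference))
```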
6 Experiments
The main goal of our experiments is to show that our approach effectively prevents the previously discussed
inference attacks when data is incrementally disseminated. We also show that our approach produces datasets
with good data quality. We first describe our experimental settings and then report our experimental results.
6.1 Experimental Setup
6.1.1 Experimental Environment
The experiments were performed on a 2.66 GHz Intel IV processor machine with 1 GB of RAM. The oper-
ating system on the machine was Microsoft Windows XP Professional Edition, and the implementation was
built and run in Java 2 Platform, Standard Edition 5.0. For our experiments, we used the Adult dataset from
the UC Irvine Machine Learning Repository [18], which is considered a de facto benchmark for evaluating
the performance of anonymization algorithms. Before the experiments, the Adult data set was prepared as
described in [3, 9, 12]. We removed records with missing values and retained only nine of the original at-
tributes. In our experiments, we considered {age, work class, education, marital status, race, gender, native
country, salary} as the quasi-identifier, and the occupation attribute as the sensitive attribute.
6.1.2 Data quality metrics
The quality of generalized data has been measured by various metrics. In our experiment, we measure the
data quality mainly based on Average Information Loss (AIL, for short) metric proposed in [4]. The basic
idea of AIL metric is that the amount of generalization is equivalent to the expansion of each equivalence
class (i.e., the geometrical size of each partition). Note that as all the records in an equivalence class are
modified to share the same quasi-identifier, each region indeed represents the generalized quasi-identifier of
the records contained in it. Thus, data distortion can be measured naturally by the size of the region covered
by each equivalence class. Following this idea, Information Loss (IL for short) measures the amount of data
distortion in an equivalence class as follows.
Definition 4 (Information loss) [4] Let e = {r_1, . . . , r_n} be an equivalence class where Q = {a_1, . . . , a_m}.
Then the amount of data distortion caused by generalizing e, denoted by IL(e), is defined as:

IL(e) = |e| · ∑_{j=1,...,m} |G_j| / |D_j|

where |e| is the number of records in e, and |D_j| the domain size of attribute a_j. |G_j| represents the amount of
generalization in attribute a_j (e.g., the length of the shortest interval which contains all the a_j values existing
in e).
Based on IL, the AIL of a given table T̂ is computed as: AIL(T̂) = (∑_{e∈T̂} IL(e)) / |T̂|. The key
advantage of the AIL metric is that it precisely measures the amount of generalization (or the vagueness of data),
while being independent of the underlying generalization scheme (e.g., the anonymization technique used or the
generalization hierarchies assumed). For the same reason, we also use the Discernibility Metric (DM) [3] as
another quality measure in our experiment. Intuitively, DM measures the quality of anonymized data based
on the size of the equivalence classes, which indicates how many records are indistinguishable from each
other.
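For numeric attributes, IL and AIL can be computed directly from the records of each equivalence class; the sketch below assumes the attribute domains are numeric and their sizes are known, which is a simplification of the general definition (a categorical attribute would use the size of its generalized value set instead).

```python
def information_loss(eq_class_records, domain_sizes):
    """eq_class_records: list of dicts mapping attribute -> numeric value.
    domain_sizes: dict attribute -> |D_j|. Computes IL(e) = |e| * sum_j |G_j| / |D_j|."""
    n = len(eq_class_records)
    loss_per_record = 0.0
    for attr, dom in domain_sizes.items():
        values = [r[attr] for r in eq_class_records]
        g = max(values) - min(values)            # length of the covering interval
        loss_per_record += g / dom
    return n * loss_per_record

def average_information_loss(equivalence_classes, domain_sizes):
    total_records = sum(len(e) for e in equivalence_classes)
    return sum(information_loss(e, domain_sizes)
               for e in equivalence_classes) / total_records
```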
6.2 Experimental Results
We first measured how many records were vulnerable in statically anonymized datasets with respect to the
inference attacks we discussed. For this, we modified two k-anonymization algorithms, Incognito [11] and
Mondrian [12], and used them as our static (k, c)-anonymization algorithms. Using these algorithms, we first
anonymized 5K records and obtained the first “published” datasets. We then generated five more subsequent
datasets by adding 5K more records each time. Then we used our vulnerability detection algorithms to
count the number of records among these datasets that are vulnerable to each type of inference attack. Figure 11
shows the result. As shown, many more records were found to be vulnerable in the datasets anonymized by
Mondrian. This is indeed unsurprising, as Mondrian, taking a multidimensional approach, produces datasets
with much less generalization. In fact, for Incognito, even the initial dataset was highly generalized. This
clearly illustrates an unfortunate reality: the more precise the data are, the more vulnerable they are to
undesirable inferences.
The next step was to investigate how effectively our approach would work with a real dataset. The main
focus was its effect on the data quality. As previously discussed, in order to prevent undesirable inferences,
one needs to hide more information. In our case, it means that the given data must be generalized until there
is no vulnerability to any type of inference attack. We modified the static (k, c)-anonymization algorithms as
discussed in Section 5 and obtained our inf-checked (k, c)-anonymization algorithms. Note that although we
implemented the full-featured algorithms for difference and intersection attacks, we took a simple approach
for record-tracing attack. That is, we considered all the edges without removing infeasible/unnecessary edges
as discussed in Section 4.5. We also implemented a merge approach where we anonymize each dataset
independently and merge it with the previously released dataset. Although this approach is secure from any
type of inference attacks, we expected that the data quality would be the worst, as merging would inevitably
have a significant effect on the generalization (recoding) scheme.
With these algorithms as well as the static anonymization algorithms, we repeated our experiment. As
before, we started with 5K records and increased the dataset by 5K each time. We then checked the vulnera-
bility and measured the data quality of such datasets. We measured the data quality both with AIL and DM,
and the results are illustrated in Figures 12 and 13, respectively. It is clear that in terms of data quality the
inf-checked algorithm is far superior to the merge algorithm. Although the static algorithms produced
the best quality datasets, these data are vulnerable to inference attacks as previously shown. The datasets
generated by our inf-checked algorithm and the merge algorithm were not vulnerable to any type of inference
attack.
We also note that the quality of datasets generated by the inf-checked algorithm is not optimal. This was
mainly due to the complexity of checking for difference attack. Even though our heuristics to reduce the
size of subsets (see Section 4.3) were highly effective in most cases, there were some cases where the size of
subsets grew explosively. Such cases not only caused lengthy execution times, but also memory blowups.
In order to avoid such cases, we set an upper limit threshold for the size of subsets in this experiment.
For example, while our modified algorithm of Incognito is processing a node in the generalization lattice, if
the size of the subsets to be checked exceeds the threshold, we stop the iteration and consider the node as
a vulnerable node. Similarly, when we encounter such a case while considering a split in Mondrian, we stop
the check and do not consider the split. Note that this approach does not affect the security of the data, although
it may negatively affect the overall data quality. Even if optimality cannot be guaranteed, we believe that
the data quality is still acceptable, considering the results shown in Figures 12 and 13.
Another important comparison was the computational efficiency of these algorithms. Figure 14 shows
our experimental result for each algorithm. The merge algorithm is highly efficient with respect to execution
time (although it was very inefficient with respect to data quality). As the merge algorithm anonymizes the
same sized dataset each time and merging datasets can be done very quickly, its execution time is nearly
constant. Even when equipped with the heuristics and the data structure discussed in Sections 4.3 and 5.1, the inf-
checked algorithm is still slow. However, considering the previously discussed results, we believe that this is
the price to pay for better data quality and reliable privacy. Also, when compared to our previous
implementation without any heuristics, this is a very promising result.
7 Related Work
The problem of information disclosure [6] has been studied extensively in the framework of statistical
databases. A number of information disclosure limitation techniques [7] have been designed for data pub-
lishing, including Sampling, Cell Suppression, Rounding, and Data Swapping and Perturbation. These tech-
niques, however, compromised the data integrity of the tables. Samarati and Sweeney [20, 22, 21] introduced the
k-anonymity approach and used generalization and suppression techniques to preserve information truthful-
ness. A number of static anonymization algorithms [3, 8, 9, 11, 12, 14, 21] have been proposed to achieve
the k-anonymity requirement. Optimal k-anonymity has been proved to be NP-hard for k ≥ 3 [17].
While static anonymization has been extensively investigated in the past few years, only a few approaches
address the problem of anonymization in dynamic environments. In [22], Sweeney identified possible infer-
ences when new records are inserted and suggested two simple solutions. The first solution is that once
records in a dataset are anonymized and released, in any subsequent release of the dataset, the records must
be the same or more generalized. As previously mentioned, this approach may suffer from unnecessarily low
data quality. Also, this approach cannot protect newly inserted records from difference attack, as discussed
in Section 4. The other solution suggested is that once a dataset is released, all released attributes (includ-
ing sensitive attributes) must be treated as the quasi-identifier in subsequent releases. This approach seems
reasonable as it may effectively prevent linking between records. However, this approach has a significant
drawback in that every equivalence class will inevitably have a homogeneous sensitive attribute value; thus,
this approach cannot adequately control the risk of attribute disclosure.
Yao et al. [28] addressed the inference issue when a single table is released in the form of multiple views.
They proposed several methods to check whether or not a given set of views violates the k-anonymity require-
ment collectively. However, they did not address how to deal with such violations. Wang and Fung [23] fur-
ther investigated this issue and proposed a top-down specialization approach to prevent record-linking across
multiple anonymous tables. However, their work focuses on the horizontal growth of databases (i.e., addition
of new attributes), and does not address vertically-growing databases where records are inserted. Recently,
Xiao and Tao [27] proposed a new generalization principle m-invariance for dynamic dataset publishing.
The m-invariance principle requires that each equivalence class in every release contain at least m distinct
sensitive attribute values and that, for each tuple, all equivalence classes containing that tuple have exactly the same set
of sensitive attribute values. They also introduced the counterfeit generalization technique to achieve the
m-invariance requirement.
In [5], we presented a preliminary, limited investigation of the inference problem of dynamic
anonymization with respect to incremental datasets. We identified some inferences and also proposed an
approach where new records are directly inserted into the previously anonymized dataset for computational
efficiency. However, compared to the current work, our previous work has several limitations. The key differ-
ences of this work with respect to [5] are as follows. In [5], we focused only on the inference-enabling sets that
may exist between two tables, while in this work we consider more robust and systematic inference attacks on
a collection of released tables. The inference attacks discussed in this work subsume attacks using inference-
enabling sets and address more sophisticated inferences. For instance, our study of the record-tracing attack is
a new contribution of this work. We also provide detailed descriptions of the attacks and algorithms for detect-
ing them. Our previous approach was also limited to multidimensional generalization. By contrast, our
current approach considers and is applicable to both the full-domain and multidimensional approaches; there-
fore, it can be combined with a large variety of anonymization algorithms. In this paper we also address the
issue of the computational costs of detecting possible inferences. We discuss various heuristics that significantly
reduce the search space, and also suggest a scheme for storing the history of previously released tables.
8 Conclusions
In this paper, we discussed inference attacks against the anonymization of incremental data. In particular,
we discussed three basic types of cross-version inference attacks and presented algorithms for detecting each
attack. We also presented some heuristics to improve the efficiency of our algorithms. Based on these ideas,
we developed secure anonymization algorithms for incremental datasets on top of two existing anonymization
algorithms. We also empirically evaluated our approach by comparing it to other approaches. Our experimental
results showed that our approach outperforms the other approaches in terms of privacy and data quality.
As future work, we are studying essential properties (e.g., correctness) of our methods and analysis.
Another interesting direction is to investigate whether there are other types of inferences; for instance,
one could devise an attack in which more than one type of inference is jointly exploited. We also plan to investigate
inference issues in more dynamic environments where deletions and updates of records are allowed.
References
[1] C. Aggarwal. On k-anonymity and the curse of dimensionality. In Proceedings of the International Con-
ference on Very Large Data Bases (VLDB), 2005.
[2] G. Aggrawal, T. Feder, K. Kenthapadi, S. Khuller, R. Panigrahy, D. Thomas, and A. Zhu. Achieving
anonymity via clustering. In Proceedings of the ACM Symposium on Principles of Database Systems
(PODS), 2006.
[3] R. J. Bayardo and R. Agrawal. Data privacy through optimal k-anonymization. In Proceedings of the
International Conference on Data Engineering (ICDE), 2005.
[4] J.-W. Byun, A. Kamra, E. Bertino, and N. Li. Efficient k-anonymization using clustering techniques. In
Proceedings of the International Conference on Database Systems for Advanced Applications (DASFAA),
2007.
[5] J.-W. Byun, Y. Sohn, E. Bertino, and N. Li. Secure anonymization for incremental datasets. In Secure
Data Management, 2006.
[6] D. Denning. Cryptography and Data Security. Addison-Wesley Longman Publishing Co., Inc., 1982.
[7] G. T. Duncan, S. E. Fienberg, R. Krishnan, R. Padman, and S. F. Roehrig. Disclosure limitation methods
and information loss for tabular data. In Confidentiality, Disclosure and Data Access: Theory and Practical
Applications for Statistical Agencies, pages 135-166. Elsevier, 2001.
[8] B. C. M. Fung, K. Wang, and P. S. Yu. Top-down specialization for information and privacy preservation.
In Proceedings of the International Conference on Data Engineering (ICDE), 2005.
[9] V. S. Iyengar. Transforming data to satisfy privacy constraints. In Proceedings of the ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining (KDD), 2002.
[10] N. Koudas, D. Srivastava, T. Yu, and Q. Zhang. Aggregate query answering on anonymized tables. In
Proceedings of the International Conference on Data Engineering (ICDE), 2007.
[11] K. LeFevre, D. DeWitt, and R. Ramakrishnan. Incognito: Efficient full-domain k-anonymity. In Pro-
ceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), 2005.
[12] K. LeFevre, D. DeWitt, and R. Ramakrishnan. Mondrian multidimensional k-anonymity. In Proceed-
ings of the International Conference on Data Engineering (ICDE), 2006.
[13] N. Li, T. Li, and S. Venkatasubramanian. t-closeness: Privacy beyond k-anonymity and ℓ-diversity. In
Proceedings of the International Conference on Data Engineering (ICDE), 2007.
[14] T. Li and N. Li. Towards optimal k-anonymization. Data & Knowledge Engineering, 2008.
[15] T. Li and N. Li. Injector: Mining Background Knowledge for Data Anonymization. In Proceedings of
the International Conference on Data Engineering (ICDE), 2008.
[16] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam. ℓ-diversity: Privacy beyond
k-anonymity. In Proceedings of the International Conference on Data Engineering (ICDE), 2005.
[17] A. Meyerson and R. Williams. On the complexity of optimal k-anonymity. In Proceedings of the ACM
Symposium on Principles of Database Systems (PODS), 2004.
[18] C. Blake, S. Hettich, and C. Merz. UCI repository of machine learning databases, 1998.
[19] P. Samarati. Protecting respondents' identities in microdata release. IEEE Transactions on Knowledge and
Data Engineering, 13, 2001.
[20] P. Samarati and L. Sweeney. Protecting privacy when disclosing information: k-anonymity and its en-
forcement through generalization and suppression. Technical Report SRI-CSL-98-04, 1998.
[21] L. Sweeney. Achieving k-anonymity privacy protection using generalization and suppression. Interna-
tional Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 2002.
[22] L. Sweeney. k-anonymity: A model for protecting privacy. International Journal on Uncertainty,
Fuzziness and Knowledge-based Systems, 2002.
[23] K. Wang and B. C. M. Fung. Anonymizing sequential releases. In Proceedings of the ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining (KDD), pages 414-423, 2006.
[24] R. C.-W. Wong, J. Li, A. W.-C. Fu, and K. Wang. (α,k)-anonymity: an enhanced k-anonymity model
for privacy preserving data publishing. In Proceedings of the ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining (KDD), pages 754-759, 2006.
[25] X. Xiao and Y. Tao. Anatomy: simple and effective privacy preservation. In Proceedings of the Interna-
tional Conference on Very Large Data Bases (VLDB), pages 139-150, 2006.
[26] X. Xiao and Y. Tao. Personalized privacy preservation. In Proceedings of the ACM SIGMOD Interna-
tional Conference on Management of Data (SIGMOD), pages 229-240, 2006.
[27] X. Xiao and Y. Tao. m-invariance: Towards privacy preserving re-publication of dynamic datasets. In
Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), pages
689-700, 2007.
[28] C. Yao, X. S. Wang, and S. Jajodia. Checking for k-anonymity violation by views. In Proceedings of
the International Conference on Very Large Data Bases (VLDB), 2005.
NAME AGE Gender Diagnosis
Tom 21 Male Asthma
Mike 23 Male Flu
Bob 52 Male Alzheimer
Eve 57 Female Diabetes
Figure 1: Patient table
AGE Gender Diagnosis
[21−25] Male Asthma
[21−25] Male Flu
[50−60] Person Alzheimer
[50−60] Person Diabetes
Figure 2: Anonymous patient table
NAME AGE Gender Diagnosis
Tom 21 Male Asthma
Mike 23 Male Flu
Bob 52 Male Alzheimer
Eve 57 Female Diabetes
Alice 27 Female Cancer
Hank 53 Male Hepatitis
Sal 59 Female Flu
Figure 3: Updated patient table
AGE Gender Diagnosis
[21−30] Person Asthma
[21−30] Person Flu
[21−30] Person Cancer
[51−55] Male Alzheimer
[51−55] Male Hepatitis
[56−60] Female Flu
[56−60] Female Diabetes
Figure 4: Updated anonymous patient table
DirectedCheck
Input: Two (k, c)-anonymous tables Ti and Tj, where Ti = {ei,1, . . . , ei,m} and Tj = {ej,1, . . . , ej,n}
Output: true if Ti is vulnerable to difference attack with respect to Tj, and false otherwise

Q = {(∅, 0)}
while Q ≠ ∅
    Remove the first element p = (E, index) from Q
    if E ≠ ∅
        E′ ← GetMinSet(E, Tj)
        if |E′ − E| < k, return true
        else if (E′⟨S⟩ − E⟨S⟩) does not satisfy c-diversity, return true
    for each ℓ ∈ {index + 1, . . . , m}    // generates subsets with size |E| + 1
        insert (E ∪ {ei,ℓ}, ℓ) into Q
return false

Figure 5: Algorithm for checking if one table is vulnerable to difference attack with respect to another table
GetMinSet
Input: An equivalence class set E and a table T
Output: An equivalence class set E′ ⊆ T that is minimal and contains all records in E

E′ = ∅
while E ⊄ E′
    choose a tuple t in E that is not in E′
    find the equivalence class e ⊆ T that contains t
    E′ = E′ ∪ {e}
return E′

Figure 6: Algorithm for obtaining a minimum equivalence class set that contains all the records in the given
equivalence class set
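The following is a hedged Python sketch of DirectedCheck (Figure 5) together with a minimal GetMinSet (Figure 6); it is an illustration, not our implementation. Purely for the sketch, an equivalence class is modeled as a dictionary mapping a record identifier to its sensitive value, and every record of the first table is assumed to also appear in the second.

from collections import deque

def get_min_set(E, table_j):
    """Minimum set of equivalence classes in table_j covering all records in E."""
    covered, result = set(), []
    for rid in {rid for e in E for rid in e}:
        if rid in covered:
            continue
        for e in table_j:
            if rid in e:
                result.append(e)
                covered.update(e.keys())
                break
    return result

def satisfies_c_diversity(values, c):
    """True if no sensitive value occurs with relative frequency greater than c."""
    if not values:
        return False
    return max(values.count(s) for s in set(values)) / len(values) <= c

def directed_check(table_i, table_j, k, c):
    """True if table_i is vulnerable to difference attack w.r.t. table_j."""
    queue = deque([((), 0)])                       # (subset of class indices, next index)
    while queue:
        subset, index = queue.popleft()
        if subset:
            E = [table_i[i] for i in subset]
            E_records = {rid for e in E for rid in e}
            E_min = get_min_set(E, table_j)
            diff = [s for e in E_min for rid, s in e.items() if rid not in E_records]
            if len(diff) < k or not satisfies_c_diversity(diff, c):
                return True
        for i in range(index, len(table_i)):       # extend the subset by one class
            queue.append((subset + (i,), i + 1))
    return False

# Toy example: the single new record (id 3) is isolated by differencing the two
# releases, so the check reports a vulnerability for k = 2.
t_i = [{1: "flu", 2: "asthma"}]
t_j = [{1: "flu", 2: "asthma", 3: "cancer"}]
print(directed_check(t_i, t_j, k=2, c=0.7))   # prints True

The breadth-first loop mirrors Figure 5: each dequeued subset of equivalence classes is compared against its minimum covering set in the other table, and subsets are extended by one class at a time.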
Check Intersection Attack
Input: Two (k, c)-anonymous tables T0 and T1
Output: true if the two tables are vulnerable to intersection attack and false otherwise

for each equivalence class e0,i in T0
    for each e1,j in T1 that contains any record in e0,i
        if (e0,i⟨S⟩ ∩ e1,j⟨S⟩) does not satisfy c-diversity, return true
return false

Figure 7: Algorithm for checking if given two tables are vulnerable to intersection attack
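A hedged Python sketch of the intersection-attack check (Figure 7), using the same illustrative data model (an equivalence class as a dictionary from record identifier to sensitive value); the names are assumptions for the sketch, not those of our implementation.

def check_intersection_attack(table_0, table_1, c):
    """True if the two (k, c)-anonymous tables are vulnerable to intersection attack."""
    for e0 in table_0:
        for e1 in table_1:
            common = set(e0) & set(e1)          # records that appear in both classes
            if not common:
                continue
            values = [e0[rid] for rid in common]
            if max(values.count(s) for s in set(values)) / len(values) > c:
                return True                     # c-diversity violated on the intersection
    return False

# Toy example: record 3 is the only record common to the two last classes,
# so its diagnosis is exposed by intersecting the releases.
t0 = [{1: "flu", 2: "asthma", 3: "cancer"}]
t1 = [{1: "flu", 2: "asthma"}, {3: "cancer", 4: "flu"}]
print(check_intersection_attack(t0, t1, c=0.7))   # prints True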
[Diagram: equivalence classes across releases Ti, Ti+1, ..., Tn with records carrying sensitive value sp; cases (i) and (ii)]
Figure 8: Record-tracing attacks
[Diagram: equivalence classes ei,1, ei,2 and ei+1,1, ei+1,2 with records carrying sensitive value sr; cases (i) and (ii)]
Figure 9: More inference in record-tracing
Remove Infeasible Edges
Input: A bipartite graph G = (V, E) where V = V1 ∪ V2 and E is a set of edges representing
possible matching relationships.
Output: A bipartite graph G′ = (V, E′) where E′ ⊂ E with infeasible edges removed

E′ = E
while true
    change1 ← false, change2 ← false
    for each rj ∈ V2
        e ← all the incoming edges of rj
        if e contains both definite and plausible edge(s)
            remove all plausible edges in e from E′, change1 ← true
    if change1 = true
        for each ri ∈ V1
            e ← all the outgoing edges of ri
            if e contains only a single plausible edge
                mark the edge in e as definite, change2 ← true
    if change2 = false, break
return (V, E′)

Figure 10: Algorithm for removing infeasible edges from a bipartite graph
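A hedged Python sketch of the edge-pruning loop (Figure 10). Purely for illustration, the edge set is modeled as a dictionary mapping a pair (index of a record in V1, index of a record in V2) to the label "definite" or "plausible"; this representation and the function name are assumptions, not our implementation.

def remove_infeasible_edges(edges):
    """Iteratively drop plausible edges that conflict with definite ones."""
    E = dict(edges)
    while True:
        change1 = change2 = False
        # A V2 record that already has a definite incoming edge cannot match anything else.
        for j in {j for _, j in E}:
            incoming = [edge for edge in E if edge[1] == j]
            if any(E[edge] == "definite" for edge in incoming):
                for edge in incoming:
                    if E[edge] == "plausible":
                        del E[edge]
                        change1 = True
        if change1:
            # A V1 record left with a single plausible outgoing edge now matches definitely.
            for i in {i for i, _ in E}:
                outgoing = [edge for edge in E if edge[0] == i]
                if len(outgoing) == 1 and E[outgoing[0]] == "plausible":
                    E[outgoing[0]] = "definite"
                    change2 = True
        if not change2:
            break
    return E

# Toy example: r0 definitely maps to r0', so r1's plausible edge to r0' is removed
# and its remaining edge to r1' is promoted to definite.
g = {(0, 0): "definite", (1, 0): "plausible", (1, 1): "plausible"}
print(remove_infeasible_edges(g))   # {(0, 0): 'definite', (1, 1): 'definite'}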
[Plots: number of vulnerable records vs. table size (unit = 1,000) for Static Incognito (k=5, c=0.7) and Static Mondrian (k=5, c=0.7); series: Difference, Intersection, R-Tracing]
Figure 11: Vulnerability to Inference Attacks
[Plots: average information loss vs. table size (unit = 1,000) for Incognito (k=5, c=0.7) and Mondrian (k=5, c=0.7); series: Static, Merge, Inf_Checked]
Figure 12: Data Quality: Average Information Loss
[Plots: discernibility metric (unit = 1M) vs. table size (unit = 1,000) for Incognito (k=5, c=0.7) and Mondrian (k=5, c=0.7); series: Static, Merge, Inf_Checked]
Figure 13: Data Quality: Discernibility Metric
[Plots: execution time (unit = sec) vs. table size (unit = 1,000) for Incognito (k=5, c=0.7) and Mondrian (k=5, c=0.7); series: Static, Merge, Inf_Checked]
Figure 14: Execution Time
Although these models have yielded a number of valuable privacy-protecting techniques [3, 8, 9, 11, 12, 21], existing approaches only deal with static data release. That is, all these approaches assume that a complete dataset is available at the time of data release. This assumption implies a significant shortcoming, as in many applications data collection is rather a continuous process. Moreover, the assumption entails “one-time” data dissemination. Obviously, this does not address today’s strong demand for immediate and up-to-date information, as the data cannot be released before the data collection is considered complete. As a simple example, suppose that a hospital is required to share its patient records with a disease control agency. In order to protect patients’ privacy, the hospital anonymizes all the records prior to sharing. At first glance, the task seems reasonably straightforward, as existing anonymization techniques can efficiently anonymize the records. The challenge is, however, that new records are continuously collected by the hospital (e.g., whenever new patients are admitted), and it is critical for the agency to receive up-to-date data in timely manner. One possible approach is to provide the agency with datasets containing only the new records, which are independently anonymized, on a regular basis. Then the agency can either study each dataset independently or merge multiple datasets together for more comprehensive analysis. Although straightforward, this approach may suffer from severely low data quality. The key problem is that relatively small sets of records are anonymized independently so that the records may have to be modified much more than when they are anonymized together with previous records [5]. Moreover, a recoding scheme applied to each dataset may make the datasets inconsistent with each other; thus, collective analysis on multiple datasets may require additional data modification. Therefore, in terms of data quality, this approach is highly undesirable. One may believe that data quality can be assured by waiting for new data to be accumulated sufficiently large. However, this approach may not be acceptable in many applications as new data cannot be released in a timely manner. A better approach is to anonymize and provide the entire dataset whenever it is augmented with new records (possibly along with another dataset containing only new records). In this way, the agency can be provided with up-to-date, quality-preserving and “more complete” datasets each time. Although this approach can also be easily implemented by using existing techniques (i.e., anonymizing the entire dataset every time), it has a significant drawback. That is, even though each released dataset, when observed independently, is guaranteed to be anonymous, the combination of several released datasets may be vulnerable to various inferences. We illustrate these inferences through some examples in Section 3.1. As such inferences are typically made by comparing or linking records across different tables (or versions), we refer to them as cross-version inferences to differentiate them from inferences that may occur within a single table. Our goal in this paper is to identify and prevent cross-version inferences so that an increasing dataset can be incrementally disseminated without compromising the imposed privacy requirement. In order to achieve this, we first define the privacy requirement for incremental data dissemination. 
We then discuss three types of cross-version inference that an attacker may exploit by observing multiple anonymized datasets. We also present our anonymization method where the degree of generalization is determined based on the previously released datasets to prevent any cross-version inference. The basic idea is to obscure linking between records across different datasets. We develop our technique in two different types of recoding approaches; namely, full-domain generalization [11] and multidimensional anonymization [12]. One of the key differences between these two approaches is that the former generalizes a given dataset according to pre-defined generalization hierarchies, while the latter does not. Based on our experimental result, we compare these two approaches with respect to data quality and vulnerability to cross-table inference. Another issue we address is that as a dataset is released multiple times, one may need to keep the history of previously released datasets. We thus discuss how to maintain such history in a compact form to reduce unnecessary overheads.

2

each of which contains at least k records. Thus. the risk of attribute disclosure is kept under 1/ . we formulate the privacy requirement for incremental data dissemination. date of birth} [22]. A set of records that are indistinguishable from each other is often referred to as an equivalence class. For instance. The main objective of the k-anonymity model is thus to transform a table so that no one can make highprobability associations between records in the table and the corresponding individuals by using such group of attributes. For such datasets. from the table. it has been shown that 87% of individuals in the United States can be uniquely identified by a set of attributes such as {ZIP. but a particular group of attributes together may identify a particular individuals [19. Recently. Although none of these records can be uniquely matched with the corresponding individuals. consider a k-anonymized table where all records in an equivalence class have the same sensitive attribute value. It also assumes that the target table contains person-specific information and that each record in the table corresponds to a unique real-world individual. we discuss a number of anonymization models and briefly review existing anonymization techniques. the -diversity requirement 3 . In Section 3. The enforcement of k-anonymity guarantees that even though an adversary knows the quasi-identifier value of an individual and is sure that a k-anonymous table T contains the record of the individual. a k-anonymous table can be viewed as a set of equivalence classes. [16] pointed out such inference issues in the k-anonymity model and proposed the notion of -diversity. As this requirement ensures that every equivalence class contains at least distinct sensitive attribute values. We present our approach to preventing these inferences in Section 5 and evaluate our technique in Section 6. 2. Diagnosis is considered a sensitive attribute. Note that in this case. Then in Section 4. even though a table is free of explicit identifiers. as it is possible to infer certain individuals’ sensitive attribute values without precisely reidentifying their records. we review the basic concepts of the kanonymity and -diversity models and provide an overview of related techniques. For instance. In order to achieve this goal. For example. However. he cannot determine which record in T corresponds to the individual with a probability greater than 1/k. their sensitive attribute value can be inferred with probability 1. in patient table. the key consideration of anonymization is the protection of individuals’ sensitive attributes. However. gender. the -diversity model requires that records in each equivalence class have at least distinct sensitive attribute values. In its simplest form. a private dataset typically contains some sensitive attributes that are not part of the quasi-identifier. the k-anonymity model does not provide sufficient protection in this setting.1 Anonymity Models The k-anonymity model assumes that data are stored in a table (or a relation) of columns (or attributes) and rows (or records). some of the remaining attributes in combination could be specific enough to identify individuals. Machanavajjhala et al. In Section 2. we describe three types of inference attacks based on our assumption of potential attackers. 22]. Although the k-anonymity model does not consider sensitive attributes. called quasi-identifier. The process of anonymizing such a table starts with removing all the explicit identifiers. 
2 Preliminaries In this section.The remainder of this paper is organized as follows. We review some related work in Section 7 and conclude our discussion in Section 8. This implies that each attribute alone may not be specific enough to identify individuals. Several formulations of -diversity are introduced in [16]. such as name and SSN. the k-anonymity model requires that any record in a table be indistinguishable from at least (k − 1) other records with respect to the quasi-identifier.

Another class of generalization schemes is the hierarchy-free generalization class [3. − s∈S p(e. Ba- 4 . possible generalizations are still limited by the imposed generalization hierarchies. modified datasets based on this requirement could still be vulnerable to probabilistic inferences. 21]. s) represents the fraction of records in e that have sensitive value s. In such a case. Li and Li considered the adversary’s background knowledge in defining privacy. numeric values can be generalized into intervals (e. The multi-level generalization [8. s) > log where S is the domain of the sensitive attribute and p(e. A more robust diversity is achieved by enforcing entropy -diversity [16]. In [13]. Various generalization strategies have been developed. In [26]. the requirement may sometimes be too restrictive. 2. Here an optimal solution is a solution that satisfies the privacy requirement and at the same time minimizes a desired cost metric.. the entropy of the entire table must also be greater than or equal to log . Canada. a non-overlapping generalization-hierarchy is first defined for each attribute in the quasi-identifier.g. In the single-level generalization schemes [11.2 Anonymization Techniques The k-anonymity (and -diversity) requirement is typically enforced through generalization. {USA. Mexico}) or a single value that represents such a set (e. Although simple and intuitive. 12. which requires every equivalence class to satisfy the following condition.. In [3]. when multiple records in the table corresponds to one individual. They proposed t-closeness as a stronger anonymization model. where real values are replaced with “less specific but semantically consistent values” [21]. there are various ways to generalize the values in the domain. As generalization makes data uncertain. Intuitively. Based on the use of generalization hierarchies. Recently. on the other hand. an adversary may conclude that the individuals contained in the equivalence class are very likely to have that specific value. [11 − 20]). Hierarchyfree generalization schemes do not rely on the notion of pre-defined generalization hierarchies. 9] schemes. observed that -diversity is neither sufficient nor necessary for protecting against attribute disclosure. 2. This restriction could be a significant drawback in that it may lead to relatively high data distortion due to excessive generalization. North-America). In the hierarchy-based generalization schemes.also ensures -anonymity.g. Then an algorithm in this category tries to find an optimal (or good) solution which is allowed by such generalization hierarchies. allows values in a domain to be generalized into different levels in the hierarchy. the utility of the data is inevitably downgraded. In [15]. Given a domain. Although entropy -diversity does provide stronger privacy. They proposed an approach for modeling the adversary’s background knowledge by using data mining techniques on the data to be released. Li et al.g. The key challenge of anonymization is thus to minimize the amount of ambiguity introduced by generalization while enforcing anonymity requirement. consider that among the distinct values in an equivalence class. Although this leads to much more flexible generalization. the algorithms in this category can be further classified into two subclasses. which requires the distribution of sensitive values in each equivalence class to be analogous to the distribution of the entire dataset. 
They proposed to have each individual specify privacy policies about his or her own attributes. all the values in a domain are generalized into a single level in the corresponding hierarchy. and categorical values can be generalized into a set of possible values (e. one particular value appears much more frequently than the others. in order for entropy -diversity to be achievable. s) log p(e. as the size of every equivalence class must be greater than or equal to . as pointed out in [16]. For example.. For instance. 4]. a number of anonymization models have been proposed. Xiao and Tao observed that diversity cannot prevent attribute disclosure. 19.

he knows that Alice’s record is not in the old dataset in Figure 2. 3 Problem Formulation In this section. a data anonymization approach that divides one table into two for release. 2)-anonymous table as depicted in Figure 4.[4].. 3. suppose that the hospital relies on a model where both the k-anonymity and -diversity are considered. where the space of anonymizations (formulated as the powerset of totally ordered values in a dataset) is explored using a tree-search strategies. when the number of quasi-identifier attributes is high. one table includes original quasi-identifier and a group id. At a later time. Thus. As shown. Byun et al. The hospital initially has a table like the one in Figure 1 and reports to the agency its (2. Cancer}. to assure data quality. Xiao and Tao [25] propose Anatomy. the association between individual and diagnosis) are kept under 1/2 in the dataset. even for k = 2. On the theoretic side. As previously discussed. he learns only that Alice has one of {Asthma. who is a 21-year-old male. propose a clustering approach to achieve k-anonymity. a ‘(k. transform the k-anonymity problem into a partitioning problem and propose a greedy approach that recursively splits a partition at the median value until no more split is allowed with respect to the k-anonymity requirement.1 Motivating Examples Let us revisit our previous scenario where a hospital is required to provide the anonymized version of its patient records to a disease control agency. We then describe our notion of incremental dissemination and formally define a privacy requirement for it. Observe that Tom’s privacy is still protected in the newly released dataset. on the other hand. Aggrawal et al. not every patient’s privacy is protected from the attacker. LeFevre et al. propose an algorithm based on a powerset search problem. we start with an example to illustrate the problem of inference. introduce a flexible k-anonymization approach which uses the idea of clustering to minimize information loss and thus ensures good data quality. the curse of dimensionality also calls for more effective anonymization techniques. resulting the table in Figure 3. and the other includes the associations between the group id and the sensitive attribute values. In [12]. the hospital anonymizes the patient table whenever it is augmented with new records. From the new dataset. Koudas et al. In [2]. who is in her late twenties. )-anonymous’ dataset is a dataset that satisfies both the k-anonymity and -diversity requirements. but in the new dataset in Figure 4. therefore. For example. Flu.yardo et al. enforcing k-anonymity necessarily results in severe information loss. the association between individual and record) and attribute disclosure (i. by consulting the previous dataset. To make our example more concrete. 5 . he cannot learn about Tom’s disease with a probability greater than 1/2 (although he learns that Tom has either asthma or flu). he can easily deduce that Alice has neither asthma nor flu (as they must belong to patients other than Alice). 2)-anonymous table shown in Figure 2. which clusters records into groups based on some distance metric. Furthermore. the probability of identity disclosure (i. optimal k-anonymity has been proved to be NP-hard for k ≥ 3 [17]. Recently. as shown in [1] that. However.. Example 1 “Alice has cancer!” Suppose the attacker knows that Alice. respectively. However. is in the released table. He now infers that Alice has cancer. even if an attacker knows that the record of Tom.e. 
three more patient records (shown in Italic) are inserted into the dataset.e. The hospital then releases a new (2. [10] explore the permutation-based anonymization approach and examine the anonymization problem from the perspective of answering downstream aggregate queries. has recently been admitted to the hospital.

. Thus. denoted as θ Ti . 3. With respect to Q. the second condition dictates that the maximum confidence of association between any quasi-identifier value and a particular sensitive attribute value in T must not exceed a threshold c. However. he is sure that Bob’s record is in both datasets in Figures 2 and 4. At a given time i. . Note that three other records in the new dataset are also vulnerable to similar inferences. c)-anonymous. 1 ≤ i ≤ n. 2. ∀ s ∈ S. We assume that T consists of person-specific records. including potential attackers. if observing tables in θ and Ti collectively increases the risk of either identity disclosure or attribute disclosure in Ti higher than 1/k or c. . we say that there is an inference channel between the two versions. T2 . For the privacy of individuals. by studying the old dataset. . He can thus conclude that Bob suffers from alzheimer. each of which corresponds to a unique real-world individual. Now the attacker checks the new dataset and learns that Bob has either alzheimer or heart disease. T1 . Tn } be the set of all released tables for private table T . if an observer can be sure that two (anonymized) records in two different versions indeed correspond to the same individual. one cannot associate a record with a particular individual with probability higher than 1/k or infer any individual’s sensitive attribute with confidence higher than c. For instance. as shown in Section 3. First. anonymizing a dataset without considering previously released information may enable various inferences. c)-anonymous version released at time i. As every released table is (k..1. Definition 2 (Cross-version inference channel) Let Θ = {T1 . where |Ti | ≤ |Tj | for i < j. c)-anonymity protection. where ∀ e ∈ T . ∀ e ∈ T. may have access to a series of (k. Thus. . then he may be able to use this knowledge to infer more information than what is allowed by the (k. is released to public. c)-anonymous version of Ti . where Ti is an (k. T consists of a set of non-empty equivalence classes. and we adopt an anonymity model2 that combines the requirements of k-anonymity and -diversity as follows. c)-Anonymity) Let table T be with a set of quasi-identifier attributes Q and a sensitive attribute S. |e| The first condition ensures the k-anonymity requirement. where k > 0. Definition 1 ((k. each Ti must be “properly” anonymized before being released to public. where 0 < c ≤ 1. We also assume that T continuously grows with new records and denote the state of T at time i as Ti . ∀ e ∈ T. |{r|r∈e∧r[S]=s}| ≤ c. by observing each table independently.2 Incremental data dissemination and privacy requirement Let T be a private table with a set of quasi-identifier attributes Q and a sensitive attribute S. it is possible that one can increase the confidence of such undesirable inferences by observing the difference between the released tables. As shown in the examples above. denoted as Ti . record r ∈ e ⇒ r[Q] = e[Q]. If such a case occurs. |e| ≥ k. 1. c)-anonymous with respect to Q if the following conditions are satisfied. he learns that Bob suffers from either alzheimer or diabetes. In its essence. users.Example 2 “Bob has alzheimer!” The attacker knows that Bob is 52 years old and has long been treated in the hospital. . c)-anonymous tables. Our goal is to address both identity disclosure and attribute disclosure. respectively. 2A similar model is also introduced in [24]. 6 . We say that T is (k. We say that there exists cross-version inference channel from θ to Ti . only an (k. 
and the second condition enforces the diversity requirement in the sensitive attribute. Let θ ⊆ Θ and Ti ∈ Θ. .

3 Nowadays. . .. the records in ei all share the same quasi-identifier value. where m > 0. as we cannot rely on attacker’s lack of knowledge. it is critical to ensure that there is no cross-version inference channel among the released tables. for each anonymized table Ti . local celebrities) are admitted to the hospital. We also assume that the attacker has the knowledge of who is and who is not contained in each table. such knowledge is not always difficult to acquire for a dedicated attacker. consider medical records released by a hospital. . Also. .1 Attack scenario We assume that the attacker has been keeping track of all the released tables. Then based on the attack scenario. . above c) by comparing the released tables all together. rm } be an equivalence class in T . that is. as all the released tables are (k. Let A be a set of attributes. In the remainder of this section. however. Definition 3 (Privacy-preserving incremental data dissemination) Let Θ = {T0 . Based on this knowledge. . where Ti is an (k.e. However. Θ is said to be privacy-preserving if and only if (θ. that is.. he may know when target individuals in whom he is interested (e. . 0 ≤ i ≤ n. Ti ) such that θ ⊆ Θ. but also no released table creates cross-version inference channels with respect to the previously released tables. Let T be a table with a set of quasi-identifer attributes Q and a sensitive attribute S. the attacker cannot infer the individuals’ sensitive attribute values with a significant probability.g. e. Although the attacker may not be aware of all the patients. 4. 7 . the attacker could obtains a list of patients from a registration staff3 . and θ Ti . and ei [Q] represents the common quasi-identifier value of ei . perhaps the worst. In other words. For instance. 4 Cross-version Inferences We first describe potential attackers and their knowledge that we assume in this paper. Thus. 4. c)-anonymous. the attacker can easily deduce which tables may include such individuals and which tables may not. We also use similar notations for individual records. even utilizing such knowledge. Tn } be the set of all released tables of private table T . Let ei = {r0 . we describe three types of cross-version inferences that the attacker may exploit in order to achieve this goal. .When data are disseminated incrementally. This may seem to be too farfetched at first glance. . for record r ∈ T . . . it is reasonable to assume that the attacker’s knowledge includes the list of individuals contained in each table as well as their quasi-identifier values. By definition. Then T [A] denotes the duplicate-eliminating projection of T onto the attributes A. the data provider must make sure that not only each released table is free of undesirable inferences. . the goal of the attacker is to increase his/her confidence of attribute disclosure (i.g. Another. he thus possesses a set of released tables {T0 . We formally define this requirement as follows. we assume the worst case..2 Notations We first introduce some notations we use in our discussion. r[Q] represents the quasi-identifier value of r and r[S] the sensitive attribute value of r. c)-anonymous version of T released at time i. Ti ∈ Θ. we identify three types of cross-version inference attacks in this section. the attacker also possesses a population table Ui which contains the explicit identifiers and the quasi-identifiers of the individuals in Ti . about 80% of privacy infringement is committed by insiders. Therefore. 
possibility is that the attacker may collude with an insider who has access to detailed information about the patients. Tn }. where A ⊆ (Q ∪ S).

3 Difference attack Let Ti = {e0. if v1 ∩ v2 = ∅. the procedure calls GetMinSets procedure in Figure 6 to get the minimum set E of equivalence classes in Tj that contains all the records in E. if D contains less than k records. With this information. We call such E the minimum covering set of E. . . we use T A to denote the duplicate-preserving projection of T . In each time we union one more element to a subset to create larger subsets. Observation 1 Let E1 and E2 be two sets of equivalence classes in Ti . Suppose that v is a value from domain D and v a generalized value of v.In addition. the time complexity is exponential (i. then neither is any set containing E1 ∪ E2 . If they are. Note that this also prevents all the sets containing the unioned set from being generated. . we discuss a few observations that result in effective heuristics to reduce the space of the problem in most cases. 8 . e0. we do not insert the unioned subset to the queue. if (∀qi ∈ Q. the attacker knows the individuals whose records are contained in e. where n is the number of equivalence classes in Ti ). Furthermore. but not in Ei . we say that r is a generalized version of record r.1. for large tables with many equivalence classes. Therefore. The DirectedCheck procedure in Figure 5 checks if the first table of the input is vulnerable to difference attack with respect to the second table. . Similarly. . respectively. Also knowing the quasi-identifier values of the individuals in Ti and Tj . c)-anonymous. This also implies that if each E1 and E2 is not vulnerable to difference attack. As the algorithm checks all the subsets of Ti . .n } and Tj = {e1. respectively. In what follows. r1 r2 ) if ∀qi ∈ Q. We also use |N | to denote the cardinalities of set N . Moreover. then we do not need to consider any subset of Ti which contains E1 ∪ E2 . denoted as r r. The DirectedCheck procedure enumerates all subsets of Ti . c)-anonymous tables that are released at time i and j (i = j). then set D = ∪e∈Ej I(e) − ∪e∈Ei I(e) represents the set of individuals whose records are in Ej . and E1 ∩ E2 = ∅. c)-anonymous. and E1 and E2 be their minimal covering sets in Tj . Then we denote this relation as v v. . For instance. If ∪e∈Ei I(e) ⊆ ∪e∈Ej I(e). Two tables Ti and Tj are vulnerable to difference attack if at least one of DirectedCheck(Ti . set SD = ∪e∈Ej e S − ∪e∈Ei e S indicates the set of sensitive attribute values that belong to those individuals in D. ri [qi ] ∩ rj [qi ] = ∅. and E1 and E2 are disjoint. the (k.m } be two (k. e1. we check if their minimum covering sets are disjoint. As previously discussed in Section 4. As such. for any equivalence class e in either Ti or Tj . Let I(e) represent the set of individuals in e. this brute-force check is clearly unacceptable. where Ei ⊆ Ti and Ej ⊆ Tj . e1 [qi ] ∩ e2 [qi ] = ∅ 4.e.1 . it is O(2n ). Let Ei and Ej be two sets of equivalence classes. we assume that an attacker knows who is and who is not in each released table. we consider a generalized value as a set of possible values. Tj ) and DirectedCheck(Tj . The fact that E1 and E2 are not vulnerable to difference attacks means that sets (E1 − E1 ) and (E2 − E2 ) are both (k. denoted as v1 v2 .. two generalized records r1 and r2 are compatible (i.1 . we can modify the method in Figure 5 as follows. Overloading this notation. If E1 and E2 are not vulnerable to difference attack. Based on this observation. and interpret v as a set of values where (v ∈ v) ∧ (∀vi ∈ v. 
or if the most frequent sensitive value in SD appears with a probability larger than c. ei S represents the multiset of all the sensitive attribute values in ei . The procedure then checks whether there is vulnerability between the two equivalence classes E and E . and for each set E. r[qi ] r[qi ]) ∧ (r[S] = r[S]). We also say that two equivalence classes e1 and e2 are compatible if ∀qi ∈ Q. the attacker can now perform difference attacks as follows. Ti ) returns true. we say that two generalized values v1 and v2 are compatible. As the minimum covering set of E1 ∪ E2 is E1 ∪ E2 . Regardless of recoding schemes. c)-anonymity requirement is violated.e. vi ∈ D). . (E1 ∪ E2 ) − (E1 ∪ E2 ) is also (k.

For instance. the attacker cannot infer Alice’s sensitive attribute value with confidence higher than c by examining each ei independently. E1 and E2 are not vulnerable. . c)-anonymous. . whose quasi-identifier value is qA . in each Ti ∈ θA .. that is. the attacker now wants to know which individuals possess a specific sensitive attribute value. If E1 = E2 . he just needs to consider the records that may possibly correspond to + Alice. c)-anonymous tables. Therefore.g. if the most frequent value in SIA appears with a probability greater than c. sA . Let sp be the sensitive attribute value in which the attacker is interested and Ti ∈ Θ be the table in which (at least) one record with sensitive value sp appears. the individuals who suffer from ‘HIV+’. . must be present in every equivalence class in EA . The basic idea is to check every pair of equivalence classes ei ∈ T0 and ej ∈ T1 that contain the same record(s). we check if there exists another set with the same minimum covering set. In other words. we can skip checking if each of E1 and E2 is vulnerable to difference attack. then we only need to check if E1 ∪ E2 is vulnerable to difference attack. The pseudo-code in Figure 7 provides an algorithm for checking the vulnerability to the intersection attack for given two (k. this ambiguity may be reduced to an undesirable level if the structure of equivalence classes are varied in different releases. instead of wanting to know what sensitive attribute value a particular individual has. θA . It is easy to see that if e ∈ Ti contains some record(s) that Tj do not. where + qA ei [Q]. we simply merge the new subset with the found set. suppose that the attacker wants to know the sensitive attribute of Alice. T0 and T1 . If equivalence class e ∈ Tj contains all such records. as every equivalence class in EA contains Alice’s record. Ti contains some records that are not in Tj . However. This implies that sA must be found in set SIA = ei ∈EA ei S . Based on this observation. en } be the set of all equivalence classes identified from θA such that qA ei [Q]. ∀ei ∈ EA .5 Record-tracing attack Unlike the previous attacks. If such a set is found. Let EA = {e0 . and E1 and E2 be their minimal covering sets in Tj . Then the attacker can select a + set of tables. Although T 9 . c)-anonymous.. the attacker only need to consider an equivalence class ei ⊆ Ti . e. sA ∈ ei S . This is because unless E1 ∪ E2 is vulnerable to difference attack. However. he does not need to examine all the records.Observation 2 Let E1 and E2 be two sets of equivalence classes in Ti . In other words. we can purge all such equivalence classes from the initial problem set. e is safe from difference attack. we can save our computational effort as follows. As the attacker knows the quasi-identifier of Alice. Thus. Suppose that Ti was released after Tj .e. Observation 3 Consider the method in Figure 5. i. the attacker may be interested in knowing who may be associated with a particular attribute value. That is. Since e itself must be (k. that all contain Alice’s record. respectively. . When we insert a new subset into the queue. 4. 4. the attacker knows that Alice’s sensitive attribute value. then the sensitive attribute value of Alice can be inferred with confidence greater than c. 0 ≤ i ≤ n.4 Intersection attack The key idea of k-anonymity is to introduce sufficient ambiguity into the association between quasi-identifier values and sensitive attribute values. As every ei is (k. 
then we do not need to consider that equivalence class for difference attack. the minimum covering set of e is an empty-set.

As shown. While correct. We define E as the set of edges from vertices in V1 to vertices in V2 . where V = V1 ∪ V2 and each vertex in V1 represents a record in Ti . Suppose now that there are more than one compatible equivalence classes in Ti+1 . in order to illustrate our point clearly. the attacker can make more precise inference by eliminating equivalence classes in Ti+1 that are impossible to contain rr . then rp could be in either ei+1 and ei+1 (see Figure 8 (ii)). he can make an arbitrary choice and update his knowledge about the quasi-identifiers of the two records accordingly.2 . We create such edges between V1 and V2 as follows. we show some simple cases in the following example. then we create an edge from r to r and mark it with d . he is sure that one record is in ei+1. This means that if the attacker can identify from Tj the record corresponding to rp . we say that the attacker updates rp [Q] as rp [Q] ∩ (∪1≤j≤r ei+1. if there is an edge from ri ∈ V1 to rj ∈ V2 . In the above example. ei+1. If |R| = 1 and r ∈ E. rp [Q] ← rp [Q] ∩ ei+1 [Q]. However. which indicates that r definitely corresponds to r .1 that take sr as the sensitive value. c)-anonymous.1 that is compatible to r1 and two compatible equivalence classes ei+1. Although the attacker may or may not determine which equivalence class contains rp .1 or ei+1. as the record with sr in ei+1. Then the attacker can confidently combine his knowledge on the quasi-identifier of rp . Example 3 The attacker knows that rp must be contained in the equivalence class of Tj that is compatible with rp [Q].1 and r2 ∈ ei. Although r2 has two compatible equivalence classes.. and each vertex in V2 represents a record in Ti+1 that is compatible with at least one record in Ti . the attacker can infer the association between the individuals in Ip and rp with a probability higher than 1/k.1 and the other in ei+1. Suppose that Ti+1 contains a single equivalence class ei+1. the attacker can reexamine Ip and eliminate individuals whose quasiidentifiers are no longer compatible with the updated rp [Q]. As T is (k.r } in Ti+1 . that the attacker is interested in a particular record rp such that (rp [S] = sp ) ∧ (rp ∈ ei ). this is not a sufficient description of what the attacker can do. (r[Q] ri [Q]) ∧ (r[S] = ri [S]). If |R| > 1.2 .may contain more than one record with sp .2 that are compatible to r2 .2 . After updating rp [Q] with Tj . we construct a bipartite graph G = (V. then he may be able to learn additional information about the quasi-identifier of the individual corresponding to rp and possibly reduce the size of Ip .1 . For example. Using such techniques. he is sure that rp ∈ ei+1 ∪ ei+1 . E). let r1 ∈ ei. for simplicity. That is.e.1 and ei+1... as there are cases where the attacker can notice that some equivalence classes in Ti+1 cannot contain rp . Note that in each of these tables the quasi-identifier of rp may be generalized differently. if sp ∈ ei+1 [S] and sp ∈ ei+1 [S]. the attacker can be sure that r2 is included in ei+1. When the size of Ip becomes less than k. i. Figure 9(ii) illustrates another case of which the attacker can take advantage.1 must correspond to r1 . then we create an edge from r and every ri ∈ R and mark it with p to indicate that r plausibly corresponds to 10 . For each vertex r ∈ V1 . therefore. Suppose that there is only one compatible equivalence class. If sp ∈ ei+1 [S] and sp ∈ ei+1 [S]. suppose. which represents possible matching relationships. . 
Suppose that the attacker also possesses a subsequently released table Tj (i < j) which includes rp . rp [Q] ← rp [Q] ∩ (ei+1 [Q] ∪ ei+1 [Q]). both taking sr as the sensitive value (see Figure 9(i)). Let Ip be the set of such individuals.2 be two records in Ti . this means that records ri and rj may both correspond to the same record although they are generalized into different forms. However. when there are more than one compatible equivalence classes {ei+1. then the attacker can be sure that rp ∈ ei+1 and updates his / knowledge of rp [Q] as rp [Q] ∩ ei+1 [Q]. c)-anonymous tables for the vulnerability to the record-tracing attack. Although the attacker cannot be sure that each of these records is contained in ei+1. First.. there are two records in ei. Thus. We now describe a more thorough algorithm that checks two (k. There are many cases where the attacker can identify rp in Tj . when the attacker queries the population table Ui with rp [Q]. we find from V2 the set of records R where ∀ri ∈ R. say ei+1 and ei+1 .j ). ei+1 in Tj (see Figure 8 (i)). he obtains at least k individuals who may correspond to rp .

i ∈ V1 is associated with a set of possibly matching records {r2. we describe the properties of our checking algorithms which make them suitable for existing data anonymization techniques such as full-domain generalization [11] and multidimensional anonymization [12]. This field is used for checking vulnerability to intersection attack. . we describe our incremental data anonymization which incorporates the inference detection techniques in the previous section. Note that the algorithm above does not handle the case illustrated in Figure 9(ii). • RID : is a unique record ID (or the explicit identifier of the corresponding individual). . . In order to check whether Ti is vulnerable to cross-version inferences. This field is used for checking vulnerability to record-tracing attack. r[IQ]i ← r[IQ]i−1 ∩ r[Q]. It is worth noting that the key problem enabling the record-tracing attack arises from the fact that the sensitive attribute value of a record. .m r2. • IQ (Inferable Quasi-identifier) : keeps track of the quasi-identifiers into which the record has previously been generalized. 11 . Assuming that each Ti also contains RID (which is projected out before being released). ..1 Data/history management Consider a newly anonymized table. rare diseases) or tables where every individual has a unique sensitive attribute value (e.m } (j ≤ m) in V2 . If any inferred quasi-identifer maps to less than k individuals in the population table Ui . Θ = {T0 . [Q]). the pseudo-code in Figure 10 removes plausible edges that are not feasible and discovers more definite edges by scanning through the edges. Hi at time i. we remove unnecessary plausible edges such that each of such records in e1.. checking the vulnerability in Ti against each table in Θ can be computationally expensive. r2. For instance. Now given the constructed bipartite graph. We first describe our data/history management strategy which aims to reduce the computational overheads. it is essential to maintain some form of history about previously released datasets. together with its generalized quasi-identifier.. • IS (Inferable Sensitive values) : stores the set of sensitive attribute values with which the record can be associated. we maintain a history table.j . To avoid such inefficiency. However. Then. may uniquely identify the record in different anonymous tables. . After all infeasible edges are removed. Ti . which has the following attributes.i [Q] e2. RID is used to join Hi and Ti .g. . This issue can be especially critical for records with rare sensitive attribute values (e.i ∈ Ti . For each equivalence class e1. For instance. we also performs the following. .i [Q] = r1.g.j [Q]. then table Ti is vulnerable to the record-tracing attack with respect to Tj . e1. 5 Inference Prevention In this section. if record r is released in equivalence class ei of Ti . • TS (Time Stamp) : represents the time (or the version number) when the record is first released. we find from Tj the set of equivalence classes E where ∀e2.j ∈ E. Ti−1 }..i [Q] ∩ ( =j. In order to address such cases.. Now we can follow the edges and compute for each record r1. each record r1. If the same number of records with any sensitive value s appear in both e1. which is about to be released. then r[IS]i ← (r[IS]i−1 ∩ ei S ).i and E.i ∈ Ti the inferrable quasi-identifier r1.i has a definite edge to a distinct record in E.. 5. for record r ∈ Ti .ri . DNA sequence).

The main idea of Hi is to keep track of the attacker's accumulated knowledge on each released record. For instance, value r[IS] of record r ∈ Hi−1 indicates the set of sensitive attribute values that the attacker may be able to associate with r prior to the release of Ti. This is indeed the worst case, as we are assuming that the attacker possesses every released table; however, as discussed in Section 4, we need to be conservative when estimating what the attacker can do. Using Hi−1, the cost of checking Ti for vulnerability can be significantly reduced. Specifically, for intersection and record-tracing attacks, we check Ti against Hi−1, instead of every Tj ∈ Θ (in our current implementation, difference attack is still checked against every previously released table).

5.2 Incorporating inference detection into data anonymization

We now discuss how to incorporate the inference detection algorithms into secure anonymization algorithms.

We first consider full-domain anonymization, where all values of an attribute are consistently generalized to the same level in the predefined generalization hierarchy. In [11], LeFevre et al. propose an algorithm that finds minimal generalizations for a given table. In its essence, the proposed algorithm is a bottom-up search approach in that it starts with un-generalized data and tries to find minimal generalizations by increasingly generalizing the target data in each step. The key property on which the algorithm relies is the generalization property: given a table T and two generalization strategies G1, G2 (G1 ⪯ G2), if G1(T) is k-anonymous, then G2(T) is also k-anonymous (this property is also used in [16] for ℓ-diversity and is thus applicable to (k, c)-anonymity). Although intuitive, this property is critical as it guarantees the optimality of the discovered solutions; that is, once the search finds a generalization level that satisfies the k-anonymity requirement, we do not need to search further.

Observation 4 Given a table T and two generalization strategies G1, G2 (G1 ⪯ G2), if G1(T) is not vulnerable to any inference attack, then neither is G2(T).

The proof is simple. As each equivalence class in G2(T) is the union of one or more equivalence classes in G1(T), the information about each record in G2(T) is more vague than that in G1(T); thus, G2 does not create more inference attacks than G1. Based on this observation, we modify the algorithm in [11] as follows. In each step of generalization, in addition to checking the (k, c)-anonymity requirement, we also check for the vulnerability to inference. If either check fails, then we need to further generalize the data.

Next, we consider the multidimensional k-anonymity algorithm proposed in [12]. In its essence, this algorithm is a top-down search approach consisting of two steps. The first step is to find a partitioning scheme of the d-dimensional space, where d is the number of attributes in the quasi-identifier, such that each partition contains more than k records; each partition is then generalized so that all of its records share the same quasi-identifier. In order to find such a partitioning, the algorithm recursively splits a partition at the median value (of a selected dimension) until no more split is allowed with respect to the k-anonymity requirement. Note that, unlike the previous algorithm, the quality of the search relies on the following property (it is easy to see that the property also holds for any diversity requirement): given a partition p, if p does not satisfy the k-anonymity requirement, then neither does any sub-partition of p.

Observation 5 Given a partition p of records, if p is vulnerable to any inference attack, then so is any sub-partition of p.

To see this, suppose that we have a partition p1 of the dataset in which some records are vulnerable to inference attacks. Any further cut of p1 leads to de-generalization of the dataset; that is, it reveals more information about each record than p1 and thus leads to a dataset that is also vulnerable to inference attacks. Based on this observation, we modify the algorithm in [12] as follows. In each step of partitioning, in addition to checking the (k, c)-anonymity requirement, we also check for the vulnerability to inference. If either check fails, then we do not need to further partition the data.
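As a concrete illustration of where the extra check is inserted, consider the following Python sketch. It is ours, not the authors' code: is_kc_anonymous stands for the (k, c)-anonymity test, is_vulnerable for the inference checks of Section 4 (run against the history table of Section 5.1), and choose_split for the median-cut heuristic of [12]; all three are assumed to be supplied by the caller.

def anonymize(partition, is_kc_anonymous, is_vulnerable, choose_split):
    """Top-down multidimensional partitioning with the extra inference check.
    A split is accepted only if both resulting partitions remain (k, c)-anonymous
    and are not vulnerable to any inference attack (Observations 4 and 5)."""
    split = choose_split(partition)              # e.g., median cut on a chosen dimension
    if split is None:                            # no candidate split
        return [partition]
    left, right = split
    for half in (left, right):
        if not is_kc_anonymous(half) or is_vulnerable(half):
            return [partition]                   # keep the coarser, safe partition
    return (anonymize(left, is_kc_anonymous, is_vulnerable, choose_split)
            + anonymize(right, is_kc_anonymous, is_vulnerable, choose_split))

The full-domain case is analogous: in each generalization step, a node in the lattice is accepted only if it passes both checks; otherwise the data is generalized further, and by Observation 4 anything more general than an accepted node is also safe.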

g. For our experiments. 6. . denoted by AIL(e). salary} as the quasi-identifier. is defined as: |Gj | AIL(e) = |e| × j=1. Definition 4 (Information loss) [4] Let e={r1 .1. Standard Edition 5.g.. while being independent from the underlying generalization scheme (e. In our experiments.2 Data quality metrics The quality of generalized data has been measured by various metric. and the implementation was built and run in Java 2 Platform. If either check fails. we measure the data quality mainly based on Average Information Loss (AIL.1 Experimental Setup 6. Based on this observation.. Thus. the AIL of a given table T is computed as: AIL(T ) = ( e∈T IL(e)) / |T |.0. am }. 6 Experiments The main goal of our experiments is to show that our approach effectively prevents the previously discussed inference attacks when data is incrementally disseminated. the Adult data set was prepared as described in [3. Based on IL. the geometrical size of each partition). marital status. 6. . each region indeed represents the generalized quasi-identifier of the records contained in it.. and |Dj | the domain size of attribute aj . rn } be an equivalence class where Q={a1 . . in addition to checking the (k. The key advantage of AIL metric is that it precisely measures the amount of generalization (or vagueness of data).1. . c)-anonymity requirement.1 Experimental Environment The experiments were performed on a 2. we considered {age. we also use the Discernibility Metric (DM ) [3] as 13 . We also show that our approach produces datasets with good data quality. gender. We removed records with missing values and retained only nine of the original attributes. and occupation attribute as the sensitive attribute. we used the Adult dataset from the UC Irvine Machine Learning Repository [18].e. it reveals more information about each record than p1 . Information Loss (IL for short) measures the amount of data distortion in an equivalence class as follows.m |Dj | where |e| is the number of records in e. thus. . 9.is based on the fact that any further cut on p1 leads to de-generalization of the dataset. The operating system on the machine was Microsoft Windows XP Professional Edition. Before the experiments. native country. We first describe our experimental settings and then report our experimental results. for short) metric proposed in [4]. Note that as all the records in an equivalence class are modified to share the same quasi-identifer. we also check for the vulnerability to inference. education. then we do not need to further partition the data. work class. we modify the algorithm in [12] as follow. .66 GHz Intel IV processor machine with 1 GB of RAM. |Gj | represents the amount of generalization in attribute aj (e. data distortion can be measured naturally by the size of the region covered by each equivalence class. anonymization technique used or generalization hierarchies assumed).. Then the amount of data distortion occurred by generalizing e. which is considered a de facto benchmark for evaluating the performance of anonymization algorithms.. the length of the shortest interval which contains all the aj values existing in e). . . Following this idea. In our experiment. In each step of partition. race. The basic idea of AIL metric is that the amount of generalization is equivalent to the expansion of each equivalence class (i. For the same reason. 12]..

6.2 Experimental Results

We first measured how many records were vulnerable in statically anonymized datasets with respect to the inference attacks we discussed. For this, we modified two k-anonymization algorithms, Incognito [11] and Mondrian [12], into (k, c)-anonymization algorithms and used them as our static (k, c)-anonymization algorithms. Using these algorithms, we first anonymized 5K records and obtained the first "published" datasets. We then generated five more subsequent datasets by adding 5K more records each time. Then we used our vulnerability detection algorithms to count the number of records among these datasets that are vulnerable to each type of inference attack. Figure 11 shows the result. As shown, many more records were found to be vulnerable in the datasets anonymized by Mondrian. This is indeed unsurprising, as Mondrian, taking a multidimensional approach, produces datasets with much less generalization; that is, the more precise the data are, the more vulnerable they are to undesirable inferences. This clearly illustrates the unfortunate reality: in order to prevent undesirable inferences, one needs to hide more information.

The next step was to investigate how effectively our approach would work with a real dataset; the main focus was its effect on the data quality. As previously discussed, our approach means that the given data must be generalized until there is no vulnerability to any type of inference attack. We modified the static (k, c)-anonymization algorithms as discussed in Section 5 and obtained our inf-checked (k, c)-anonymization algorithms. We also implemented a merge approach where we anonymize each dataset independently and merge it with the previously released dataset. Although this approach is secure from any type of inference attack, we expected that its data quality would be the worst, as merging would inevitably have a significant effect on the generalization (recoding) scheme.

Note that although we implemented the full-featured algorithms for difference and intersection attacks, we took a simple approach for the record-tracing attack: we considered all the edges without removing infeasible/unnecessary edges as discussed in Section 4. Also, even though our heuristics to reduce the size of subsets (see Section 4.3) were highly effective in most cases, there were some cases where the size of the subsets grew explosively; this was mainly due to the complexity of checking for difference attack. As such cases not only caused lengthy execution times but also memory blowups, we set an upper limit threshold for the size of subsets in this experiment. That is, when the size of the subsets to be checked exceeds the threshold, we handle the case conservatively: if we encounter such a case while considering a split in Mondrian, we stop the check and do not consider the split; similarly, if we encounter such a case while our modified algorithm of Incognito is processing a node in the generalization lattice, we stop the iteration and consider the node as a vulnerable node. Note that this approach does not affect the security of the data, although it may negatively affect the overall data quality.

With these algorithms as well as the static anonymization algorithms, we repeated our experiment. As before, we started with 5K records and increased the dataset by 5K each time. We then checked the vulnerability and measured the data quality of the resulting datasets; the results are illustrated in Figures 12 and 13. We measured the data quality both with AIL and DM. The datasets generated by our inf-checked algorithm and the merge algorithm were not vulnerable to any type of inference attack. Although the static algorithms produced the best quality datasets, these data are vulnerable to inference attacks, as previously shown. It is clear that in terms of data quality the inf-checked algorithm is much superior to the merge algorithm; in the case of the merge algorithm, even the initial dataset was highly generalized. We also note that the quality of the datasets generated by the inf-checked algorithm is not optimal. Even if the optimality cannot be guaranteed, however, we believe that the data quality is still acceptable, considering the results shown in Figures 12 and 13.

Another important comparison was the computational efficiency of these algorithms. Figure 14 shows our experimental result for each algorithm. The merge algorithm is highly efficient with respect to execution time (although it was very inefficient with respect to data quality): as the merge algorithm anonymizes the same-sized dataset each time and merging datasets can be done very quickly, its execution time is nearly constant. Even when equipped with the heuristics and the data structure discussed in Sections 4.3 and 5.1, the inf-checked algorithm is still slow. However, when compared to our previous implementation without any heuristics, this is a very promising result. Also, considering the previously discussed results, we believe that this is the price to pay for better data quality and reliable privacy.

7 Related Work

The problem of information disclosure [6] has been studied extensively in the framework of statistical databases. A number of information disclosure limitation techniques [7] have been designed for data publishing, including Sampling, Cell Suppression, Rounding, and Data Swapping and Perturbation. These techniques, however, compromised the data integrity of the tables. Samarati and Sweeney [20, 21, 22] introduced the k-anonymity approach and used generalization and suppression techniques to preserve information truthfulness. A number of static anonymization algorithms [3, 8, 9, 11, 12, 21] have been proposed to achieve the k-anonymity requirement. Optimal k-anonymity has been proved to be NP-hard for k ≥ 3 [17].

While static anonymization has been extensively investigated in the past few years, only a few approaches address the problem of anonymization in dynamic environments. In [22], Sweeney identified possible inferences when new records are inserted and suggested two simple solutions. The first solution is that once records in a dataset are anonymized and released, in any subsequent release of the dataset the records must be the same or more generalized. This approach seems reasonable as it may effectively prevent linking between records. However, this approach may suffer from unnecessarily low data quality; also, as discussed in Section 4.1, it cannot protect newly inserted records from difference attack. The other suggested solution is that once a dataset is released, all released attributes (including sensitive attributes) must be treated as the quasi-identifier in subsequent releases. However, this approach has a significant drawback in that every equivalence class will inevitably have a homogeneous sensitive attribute value; thus, this approach cannot adequately control the risk of attribute disclosure. Wang and Fung [23] further investigated this issue and proposed a top-down specialization approach to prevent record-linking across multiple anonymous tables. However, their work focuses on the horizontal growth of databases (i.e., the addition of new attributes) and does not address vertically-growing databases where records are inserted. Recently, Xiao and Tao [27] proposed a new generalization principle, m-invariance, for dynamic dataset publishing. The m-invariance principle requires that each equivalence class in every release contains at least m distinct sensitive attribute values and that, for each tuple, all equivalence classes containing that tuple have exactly the same set of sensitive attribute values. They also introduced the counterfeit generalization technique to achieve the m-invariance requirement. Yao et al. [28] addressed the inference issue when a single table is released in the form of multiple views. They proposed several methods to check whether or not a given set of views violates the k-anonymity requirement collectively. However, they did not address how to deal with such violations.

In [5], we presented a preliminary limited investigation concerning the inference problem of dynamic anonymization with respect to incremental datasets. We identified some inferences and also proposed an approach where new records are directly inserted into the previously anonymized dataset for computational efficiency.

However, compared to this current work, our previous work has several limitations. The key differences of this work with respect to [5] are as follows. In [5], we focused only on the inference enabling sets that may exist between two tables, while in this work we consider more robust and systematic inference attacks in a collection of released tables; the inference attacks discussed in this work subsume attacks using inference enabling sets and address more sophisticated inferences. In particular, our study of the record-tracing attack is a new contribution in this work. We also provide detailed descriptions of the attacks and of the algorithms for detecting them, and we suggest a scheme to store the history of previously released tables. Moreover, our previous approach was limited to multidimensional generalization; by contrast, our current approach considers and is applicable to both the full-domain and multidimensional approaches, and therefore it can be combined with a large variety of anonymization algorithms. In this paper we also address the issue of computational costs in detecting possible inferences, and we discuss various heuristics to significantly reduce the search space.

8 Conclusions

In this paper, we discussed inference attacks against the anonymization of incremental data. In particular, we discussed three basic types of cross-version inference attacks and presented algorithms for detecting each attack. We also presented some heuristics to address the efficiency of our algorithms. Based on these ideas, we developed secure anonymization algorithms for incremental datasets using two existing anonymization algorithms. We also empirically evaluated our approach by comparing it to other approaches. Our experimental results showed that our approach outperformed other approaches in terms of privacy and data quality.

For future work, we are working on essential properties (e.g., correctness) of our methods and analysis. Another interesting direction is to see if there are other types of inferences; for instance, one can devise an attack where more than one type of inference is jointly utilized. We also plan to investigate inference issues in more dynamic environments where deletions and updates of records are allowed.

References

[1] C. Aggarwal. On k-anonymity and the curse of dimensionality. In Proceedings of the International Conference on Very Large Data Bases (VLDB), 2005.

[2] G. Aggarwal, T. Feder, K. Kenthapadi, S. Khuller, R. Panigrahy, D. Thomas, and A. Zhu. Achieving anonymity via clustering. In Proceedings of the ACM Symposium on Principles of Database Systems (PODS), 2006.

[3] R. Bayardo and R. Agrawal. Data privacy through optimal k-anonymization. In Proceedings of the International Conference on Data Engineering (ICDE), 2005.

[4] J.-W. Byun, A. Kamra, E. Bertino, and N. Li. Efficient k-anonymization using clustering techniques. In Proceedings of the International Conference on Database Systems for Advanced Applications (DASFAA), 2007.

[5] J.-W. Byun, Y. Sohn, E. Bertino, and N. Li. Secure anonymization for incremental datasets. In Secure Data Management, 2006.

[6] D. Denning. Cryptography and Data Security. Addison-Wesley Longman Publishing Co., Inc., 1982.

[7] G. Duncan, S. Fienberg, R. Krishnan, R. Padman, and S. Roehrig. Disclosure limitation methods and information loss for tabular data. In Confidentiality, Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies, pages 135-166. Elsevier, 2001.

[8] B. Fung, K. Wang, and P. Yu. Top-down specialization for information and privacy preservation. In Proceedings of the International Conference on Data Engineering (ICDE), 2005.

[9] V. Iyengar. Transforming data to satisfy privacy constraints. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2002.

[10] N. Koudas, D. Srivastava, T. Yu, and Q. Zhang. Aggregate query answering on anonymized tables. In Proceedings of the International Conference on Data Engineering (ICDE), 2007.

[11] K. LeFevre, D. DeWitt, and R. Ramakrishnan. Incognito: Efficient full-domain k-anonymity. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), 2005.

[12] K. LeFevre, D. DeWitt, and R. Ramakrishnan. Mondrian multidimensional k-anonymity. In Proceedings of the International Conference on Data Engineering (ICDE), 2006.

[13] N. Li, T. Li, and S. Venkatasubramanian. t-closeness: Privacy beyond k-anonymity and l-diversity. In Proceedings of the International Conference on Data Engineering (ICDE), 2007.

[14] T. Li and N. Li. Injector: Mining background knowledge for data anonymization. In Proceedings of the International Conference on Data Engineering (ICDE), 2008.

[15] T. Li and N. Li. Towards optimal k-anonymization. Data & Knowledge Engineering, 2008.

[16] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam. l-diversity: Privacy beyond k-anonymity. In Proceedings of the International Conference on Data Engineering (ICDE), 2006.

[17] A. Meyerson and R. Williams. On the complexity of optimal k-anonymity. In Proceedings of the ACM Symposium on Principles of Database Systems (PODS), 2004.

[18] C. Hettich and C. Merz. UCI repository of machine learning databases, 1998.

[19] P. Samarati. Protecting respondents' privacy in microdata release. IEEE Transactions on Knowledge and Data Engineering, 2001.

[20] P. Samarati and L. Sweeney. Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. Technical Report SRI-CSL-98-04, 1998.

[21] L. Sweeney. Achieving k-anonymity privacy protection using generalization and suppression. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 2002.

[22] L. Sweeney. k-anonymity: A model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 2002.

[23] K. Wang and B. Fung. Anonymizing sequential releases. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2006.

[24] R. C.-W. Wong, J. Li, A. W.-C. Fu, and K. Wang. (α, k)-anonymity: an enhanced k-anonymity model for privacy preserving data publishing. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 754-759, 2006.

[25] X. Xiao and Y. Tao. Anatomy: simple and effective privacy preservation. In Proceedings of the International Conference on Very Large Data Bases (VLDB), pages 139-150, 2006.

[26] X. Xiao and Y. Tao. Personalized privacy preservation. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), pages 229-240, 2006.

[27] X. Xiao and Y. Tao. M-invariance: towards privacy preserving re-publication of dynamic datasets. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), pages 689-700, 2007.

[28] C. Yao, X. S. Wang, and S. Jajodia. Checking for k-anonymity violation by views. In Proceedings of the International Conference on Very Large Data Bases (VLDB), 2005.

Figure 1: Patient table
NAME   AGE  Gender  Diagnosis
Tom    21   Male    Asthma
Mike   23   Male    Flu
Bob    52   Male    Alzheimer
Eve    57   Female  Diabetes

Figure 2: Anonymous patient table
AGE       Gender  Diagnosis
[21-25]   Male    Asthma
[21-25]   Male    Flu
[50-60]   Person  Alzheimer
[50-60]   Person  Diabetes

Figure 3: Updated patient table
NAME   AGE  Gender  Diagnosis
Tom    21   Male    Asthma
Mike   23   Male    Flu
Bob    52   Male    Alzheimer
Eve    57   Female  Diabetes
Alice  27   Female  Cancer
Hank   53   Male    Hepatitis
Sal    59   Female  Flu

Figure 4: Updated anonymous patient table
AGE       Gender  Diagnosis
[21-30]   Person  Asthma
[21-30]   Person  Flu
[21-30]   Person  Cancer
[51-55]   Male    Alzheimer
[51-55]   Male    Hepatitis
[56-60]   Female  Flu
[56-60]   Female  Diabetes

DirectedCheck
Input: two (k, c)-anonymous tables Ti and Tj, where Ti = {e_i,1, ..., e_i,m} and Tj = {e_j,1, ..., e_j,n}
Output: true if Ti is vulnerable to difference attack with respect to Tj, false otherwise

Q ← {(∅, 0)}
while Q ≠ ∅
    remove the first element p = (E, index) from Q
    if E ≠ ∅
        E' ← GetMinSet(E, Tj)
        if |E' − E| < k, return true
        else if (E'[S] − E[S]) does not satisfy c-diversity, return true
    for each l ∈ {index + 1, ..., m}    // generates subsets of size |E| + 1
        insert (E ∪ {e_i,l}, l) into Q
return false

Figure 5: Algorithm for checking if one table is vulnerable to difference attack with respect to another table

GetMinSet
Input: an equivalence class set E and a table T
Output: an equivalence class set E' ⊆ T that is minimal and contains all the records in E

E' ← ∅
while E is not contained in E'
    choose a tuple t in E that is not in E'
    find the equivalence class e ∈ T that contains t
    E' ← E' ∪ {e}
return E'

Figure 6: Algorithm for obtaining a minimum equivalence class set that contains all the records in the given equivalence class set
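The two procedures above can be prototyped directly. The following Python sketch is ours, not the authors' implementation: equivalence classes are modeled as pairs of a record-id set and a list of sensitive values, both tables are assumed to cover the same underlying records, the c-diversity test is supplied by the caller, and the subset enumeration is left unpruned (so it is exponential in the worst case; the heuristics of Section 4.3 are omitted).

from collections import deque

def get_min_set(records, table):
    """Minimal set of equivalence classes of `table` covering all ids in `records` (Figure 6).
    `table` is a list of (record_id_set, sensitive_value_list) pairs with disjoint id sets."""
    chosen, covered = [], set()
    for ec_ids, ec_sens in table:
        if ec_ids & (records - covered):          # class contains an uncovered record of E
            chosen.append((ec_ids, ec_sens))
            covered |= ec_ids
    return chosen

def directed_check(ti, tj, k, satisfies_c_diversity):
    """True if `ti` is vulnerable to difference attack w.r.t. `tj` (Figure 5)."""
    queue = deque([(frozenset(), [], 0)])         # (record ids in E, sensitive values of E, index)
    while queue:
        e_ids, e_sens, index = queue.popleft()
        if e_ids:
            cover = get_min_set(e_ids, tj)        # E' = GetMinSet(E, Tj)
            cover_ids = set().union(*(ids for ids, _ in cover))
            if len(cover_ids - e_ids) < k:        # |E' - E| < k
                return True
            remaining = [s for _, sens in cover for s in sens]
            for s in e_sens:                      # multiset difference E'[S] - E[S]
                remaining.remove(s)
            if not satisfies_c_diversity(remaining):
                return True
        for l in range(index, len(ti)):           # grow E by one equivalence class of Ti
            ec_ids, ec_sens = ti[l]
            queue.append((e_ids | ec_ids, e_sens + list(ec_sens), l + 1))
    return False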

Check_Intersection_Attack
Input: two (k, c)-anonymous tables T0 and T1
Output: true if the two tables are vulnerable to intersection attack, false otherwise

for each equivalence class e_0,i in T0
    for each equivalence class e_1,j in T1 that contains any record in e_0,i
        if (e_0,i[S] ∩ e_1,j[S]) does not satisfy c-diversity, return true
return false

Figure 7: Algorithm for checking if given two tables are vulnerable to intersection attack
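A direct Python rendering of this check (ours; tables are again lists of (record_id_set, sensitive_value_set) pairs, and the diversity predicate in the example is a simple distinct-values check standing in for the paper's c-diversity requirement) looks as follows:

def check_intersection_attack(t0, t1, satisfies_c_diversity):
    """True if releasing t0 and t1 together is vulnerable to intersection attack (Figure 7)."""
    for ids0, sens0 in t0:
        for ids1, sens1 in t1:
            if ids0 & ids1:                       # e_{1,j} contains a record of e_{0,i}
                if not satisfies_c_diversity(sens0 & sens1):
                    return True
    return False

# Example: the same two individuals end up with only one possible sensitive value
t0 = [({1, 2}, {"Flu", "Asthma"})]
t1 = [({1, 2, 3}, {"Flu", "Cancer", "Diabetes"})]
print(check_intersection_attack(t0, t1, lambda s: len(s) >= 2))   # True: {"Flu"} is not diverse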

Figure 8: Record-tracing attacks (diagram of possible matchings of records with sensitive values sr and sp across releases Ti, Ti+1, ..., Tn; omitted)
Figure 9: More inference in record-tracing (diagram; omitted)

Remove_Infeasible_Edges
Input: a bipartite graph G = (V, E), where V = V1 ∪ V2 and E is a set of edges representing possible matching relationships
Output: a bipartite graph G' = (V, E'), where E' ⊆ E with infeasible edges removed

E' ← E
while true
    change1 ← false, change2 ← false
    for each rj ∈ V2
        e ← all the incoming edges of rj
        if e contains both definite and plausible edge(s)
            remove all plausible edges in e from E', change1 ← true
    if change1 = true
        for each ri ∈ V1
            e ← all the outgoing edges of ri
            if e contains only a single plausible edge
                mark the edge in e as definite, change2 ← true
    if change2 = false, break
return (V, E')

Figure 10: Algorithm for removing infeasible edges from a bipartite graph
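As a rough Python sketch of this pruning loop (ours; the adjacency representation is a dict from V1 nodes to {V2 node: "definite" | "plausible"}, and we read "only a single plausible edge" as "exactly one remaining outgoing edge, and it is plausible"):

def remove_infeasible_edges(edges):
    """edges: dict u -> {v: "definite" | "plausible"}; returns a pruned copy."""
    out = {u: dict(nbrs) for u, nbrs in edges.items()}
    while True:
        changed1 = changed2 = False
        # gather, for every V2 node, the labels of its incoming edges
        incoming = {}
        for u, nbrs in out.items():
            for v, label in nbrs.items():
                incoming.setdefault(v, []).append((u, label))
        # a V2 node with a definite incoming edge is already matched:
        # its plausible incoming edges are infeasible and are removed
        for v, inc in incoming.items():
            if any(lbl == "definite" for _, lbl in inc):
                for u, lbl in inc:
                    if lbl == "plausible":
                        del out[u][v]
                        changed1 = True
        # a V1 node whose only remaining edge is plausible must match that record
        if changed1:
            for u, nbrs in out.items():
                if len(nbrs) == 1:
                    (v, lbl), = nbrs.items()
                    if lbl == "plausible":
                        out[u][v] = "definite"
                        changed2 = True
        if not changed2:
            break
    return out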

Figure 11: Vulnerability to Inference Attacks (number of vulnerable records vs. table size (unit = 1,000) for Static Incognito and Static Mondrian, k = 5, c = 0.7; curves for difference, intersection, and record-tracing attacks)

Figure 12: Data Quality: Average Information Loss (average information loss vs. table size (unit = 1,000) for the Static, Merge, and Inf_Checked variants of Incognito and Mondrian, k = 5, c = 0.7)

c=0. c=0.Incognito (k=5.000) 30 35 Figure 13: Data Quality: Discernibility Metric 27 .7) 10 15 20 25 Table Size (unit = 1.7) 120 Discernability Metric (unit = 1M) 100 80 60 40 20 0 0 5 10 15 20 25 Table Size (unit = 1.000) 30 35 Static Merge Inf_Checked Discernability Metric (unit = 1M) 100 80 60 40 20 0 0 5 Static Merge Inf_Checked Mondrian (k=5.

Figure 14: Execution Time (execution time (unit = sec) vs. table size (unit = 1,000) for the Static, Merge, and Inf_Checked variants of Incognito and Mondrian, k = 5, c = 0.7)