The Effect of Homogeneity on the Complexity

of k-Anonymity
Robert Bredereck
1,
, André Nichterlein
1
, Rolf Niedermeier
1
,
Geevarghese Philip
2
1
Institut für Softwaretechnik und Theoretische Informatik, TU Berlin, Germany
2
The Institute of Mathematical Sciences, Chennai, India
{robert.bredereck,andre.nichterlein,rolf.niedermeier}@tu-berlin.de
gphilip@imsc.res.in
Abstract. The NP-hard k-Anonymity problem asks, given an n ×m-
matrix M over a fixed alphabet and an integer s > 0, whether M can
be made k-anonymous by suppressing (blanking out) at most s entries.
A matrix M is said to be k-anonymous if for each row r in M there
are at least k − 1 other rows in M which are identical to r. Comple-
menting previous work, we introduce two new “data-driven” parameter-
izations for k-Anonymity—the number tin of different input rows and
the number tout of different output rows—both modeling aspects of data
homogeneity. We show that k-Anonymity is fixed-parameter tractable
for the parameter tin, and it is NP-hard even for tout = 2 and alphabet
size four. Notably, our fixed-parameter tractability result implies that
k-Anonymity can be solved in linear time when tin is a constant. Our
results also extend to some interesting generalizations of k-Anonymity.
1 Introduction
Assume that data about individuals are represented by equal-length vectors
consisting of attribute values. If all vectors are identical, then we have full ho-
mogeneity and thus full anonymity of all individuals. Relaxing full anonymity
to k-anonymity, in this work we investigate how the degree of (in)homogeneity
influences the computational complexity of the NP-hard problem of making sets
of individuals k-anonymous.
Sweeney [24] devised the notion of k-anonymity to better quantify the degree
of anonymity in sanitized data. This notion formalizes the intuition that entities
who have identical sets of attributes cannot be distinguished from one another.
For a positive integer k we say that a matrix M is k-anonymous if, for each
row r in M, there are at least k − 1 other rows in M which are identical to r.
Thus k-anonymity provides a concrete optimization goal while sanitizing data:
choose a value of k which would satisfy the relevant privacy requirements, and
then try to modify—“at minimum cost”—the matrix in such a way that it be-
comes k-anonymous. The corresponding decision problem k-Anonymity asks,

Supported by the DFG, research project PAWS, NI 369/10.
To appear in the Proceedings of the 18th International Symposium on Funda-
mentals of Computer Theory (FCT), Oslo, Norway, August 2011, Lecture Notes
in Computer Science. c ( Springer
additionally given an upper bound s for the number of suppressions allowed,
whether a matrix can be made k-anonymous by suppressing (blanking out) at
most s entries. While k-Anonymity is our central problem, our results also
extend to several more general problems.
We focus on a better understanding of the computational complexity and on
tractable special cases of these problems; see Machanavajjhala et al. [18] and
Sweeney [24] for discussions on the pros and cons of these models in terms of
privacy vs preservation of meaningful data. In particular, note that in the data
privacy community “differential privacy” (cleverly adding some random noise) is
now the most popular method [9,14]. However, k-Anonymity is a very natural
combinatorial problem (with potential applications beyond data privacy), and—
for instance—in the case of “one-time anonymization”, may still be valuable for
the sake of providing a simple model that does not introduce noise.
k-Anonymity and many related problems are NP-hard [19], even when the
input matrix is highly restricted. For instance, it is APX-hard when k = 3,
even when the alphabet size is just two [4]; NP-hard when k = 4, even when the
number of columns in the input dataset is 8 [4]; MAX SNP-hard when k = 7, even
when the number of columns in the input dataset is just three [7]; and MAX SNP-
hard when k = 3, even when the number of columns in the input dataset is 27 [3].
Confronted with this computational hardness, we study the parameterized
complexity of k-Anonymity as initiated by Evans et al. [10]. The central ques-
tion here is how naturally occurring parameters influence the complexity of k-
Anonymity. For example, is k-Anonymity polynomial-time solvable for con-
stant values of k? The general answer is “no” since already 3-Anonymity is
NP-hard [19], even on binary data sets [4]. Thus, k alone does not give a promis-
ing parameterization.
1
k-Anonymity has a number of meaningful parameterizations beyond k, in-
cluding the number of rows n, the alphabet size [Σ[, the number of columns m,
and, in the spirit of multivariate algorithmics [21], various combinations of single
parameters. Here the arity of [Σ[ may range from binary (such as gender) to un-
bounded. For instance, answering an open question of Evans et al. [10], Bonizzoni
et al. [5] recently showed that k-Anonymity is fixed-parameter tractable with
respect to the combined parameter (m, [Σ[), whereas there is no hope for fixed-
parameter tractability with respect to the single parameters m and [Σ[ [10]. We
emphasize that Bonizzoni et al. [5] made use of the fact that the value [Σ[
m
is an
upper bound on the number of different input rows, thus implicitly exploiting a
very rough upper bound on input homogeneity. Clearly, [Σ[
m
denotes the max-
imum possible number of different input rows. In this work, we refine this view
by asking how the “degree of homogeneity” of the input matrix influences the
complexity of k-Anonymity. In other words, is k-Anonymity fixed-parameter
tractable for the parameter “number of different input rows”? In a similar vein,
we also study the effect of the degree of homogeneity of the output matrix on
the complexity of k-Anonymity. Table 1, which extends similar tables due to
Evans et al. [10] and Bonizzoni et al. [5], summarizes known and new results.
1
However, it has been shown that 2-Anonymity is polynomial-time solvable [3].
2
Table 1. The parameterized complexity of k-Anonymity. Results proved in this paper
are in bold. The column and row entries represent parameters. For instance, the entry
in row “–” and column “s” refers to the (parameterized) complexity for the single
parameter s whereas the entry in row “m” and column “s” refers to the (parameterized)
complexity for the combined parameter (s, m).
– k s k, s
– NP-hard [19] NP-hard [19] W[1]-hard [5] W[1]-hard [5]
|Σ| NP-hard [2] NP-hard [2] ? ?
m NP-hard [4] NP-hard [4] FPT [10] FPT [10]
n FPT [10] FPT [10] FPT [10] FPT [10]
|Σ|, m FPT [5] FPT [10] FPT [10] FPT [10]
|Σ|, n FPT [10] FPT [10] FPT [10] FPT [10]
tin FPT FPT FPT FPT
tout NP-hard ? FPT FPT
Our contributions. We introduce the “homogeneity parameters” t
in
, the num-
ber of different input rows, and t
out
, the number of different output rows, for
studying the computational complexity of k-Anonymity and related problems.
Typically, we expect t
in
< n and t
in
< [Σ[
m
. Indeed, t
in
is a “data-driven param-
eterization” in the sense that one can efficiently measure in advance the instance-
specific value of t
in
whereas [Σ[
m
is a trivial upper bound for homogeneity.
First, we show that there is always an optimal solution (minimizing the num-
ber of suppressions) with t
out
≤ t
in
. Then, we derive an algorithm that solves
k-Anonymity in O(nm+2
tintout
t
in
(t
out
m+t
2
in
log(t
in
))) time, which compares
favorably with Bonizzoni et al.’s [5] algorithm running in O(2
(|Σ|+1)
m
kmn
2
)
time. Since t
out
≤ t
in
, this shows that k-Anonymity is fixed-parameter tractable
when parameterized by t
in
. In particular, when t
in
is a constant, our algorithm
solves k-Anonymity in time linear in the size of the input. In contrast, when
only t
out
is fixed, then we show that the problem remains NP-hard. More pre-
cisely, opposing the trivial case t
out
= 1, we show that k-Anonymity is already
NP-hard when t
out
= 2, even when [Σ[ = 4. We remark that t
out
is an interesting
parameter since it is “stronger” than t
in
and since, interpreting k-Anonymity as
a (meta-)clustering problem (of Aggarwal et al. [1]), t
out
may also be interpreted
as the number of output “clusters”.
Finally, we mention that all our results extend to more general problems,
including -Diversity [18] and “k-Anonymity with domain generalization hi-
erarchies” [23]; we defer the corresponding details to a full version of the paper.
Preliminaries and basic observations. Our inputs are datasets in the form
of nm-matrices, where the n rows refer to the individuals and the m columns
correspond to attributes with entries drawn from an alphabet Σ. Suppressing
an entry M[i, j] of an n m-matrix M over alphabet Σ with 1 ≤ i ≤ n
and 1 ≤ j ≤ m means to simply replace M[i, j] ∈ Σ by the new symbol “”
ending up with a matrix over the alphabet Σ¬¦¦. A row type is a string
from (Σ¬¦¦)
m
. We say that a row in a matrix has a certain row type if it
3
coincides in all its entries with the row type. In what follows, synonymously,
we sometimes also speak of a row “lying in a row type”, and that the row type
“contains” this row. We say that a matrix is k-anonymous if every row type
contains none or at least k rows in the matrix. A natural objective when trying
to achieve k-anonymity is to minimize the number of suppressed matrix entries.
For a row y (with some entries suppressed) in the output matrix, we call a row x
in the input matrix the preimage of y if y is obtained from x by suppressing in x
the -entry positions of y. The central problem of this work reads as follows.
k-Anonymity
Input: An n m-matrix M and nonnegative integers k, s.
Question: Can at most s elements of M be suppressed to obtain a k-
anonymous matrix M

?
Our algorithmic results mostly rely on concepts of parameterized algorith-
mics [8,12,20]. The fundamental idea herein is, given a computationally hard
problem X, to identify a parameter p (typically a positive integer or a tuple of
positive integers) for X and to determine whether a size-n input instance of X
can be solved in f(p) n
O(1)
time, where f is an arbitrary computable func-
tion. If this is the case, then one says that X is fixed-parameter tractable for
the parameter p. The corresponding complexity class is called FPT. If X could
only be solved in polynomial running time where the degree of the polynomial
depends on p (such as n
O(p)
), then, for parameter p, problem X only lies in the
parameterized complexity class XP.
We study two new parameters t
in
and t
out
, where t
in
denotes the number of
input row types in the given matrix and t
out
denotes the number of row types
in the output k-anonymous matrix. Note that using sorting t
in
can be efficiently
determined for a given matrix. To keep the treatment simple, we assume that t
out
is a user-specified number which bounds the maximum number of output types.
Clearly, one could have variations on this: for instance, one could ask to mini-
mize t
out
with at most a given number s of suppressions. We refrain from further
exploration here and treat t
out
as part of the input specifying an upper bound
on the number of output types.
2
The following lemma says that without loss of
generality one may assume t
out
≤ t
in
.
Lemma 1. Let (M, k, s) be a YES-instance of k-Anonymity. If M has t
in
row types, then there exists a k-anonymous matrix M

with at most t
in
row
types which can be obtained from M by suppressing at most s elements.
2 Parameter t
in
In this section we show that k-Anonymity is fixed-parameter tractable with
respect to the parameter number t
in
of input row types. Since t
in
≤ [Σ[
m
and
2
Indeed, interpreting k-Anonymity as a clustering problem with a guarantee k for
the minimum cluster size (as in the work by Aggarwal et al. [1]), tout can also be
seen as the number of “clusters” (that is, output row types) that are built.
4
Algorithm 1 Pseudo-code for solving k-Anonymity. The function
solveRowAssignment solves Row Assignment in polynomial time, see
Lemma 2.
1: procedure solveKAnonymity(M, k, s, tout)
2: Determine the row types R1, . . . , Rt
in
Phase 1, Step 1
3: for each possible A : [0, 1]
t
in
×t
out
do Phase 1, Step 2
4: for j ←1 to tout do Phase 1, Step 3
5: if A[1, j] = A[2, j] = . . . = A[tin, j] = 0 then
6: delete empty output row type R

j
7: decrease tout by one
8: else
9: Determine all entries of R

j
10: if solveRowAssignment then Phase 2
11: return ‘YES’
12: return ‘NO’
t
in
≤ n, k-Anonymity is also fixed-parameter tractable with respect to the
combined parameter (m, [Σ[) and the single parameter n. Both these latter re-
sults were already proved quite recently; Bonizzoni et al. [5] demonstrated fixed-
parameter tractability for the parameter (m, [Σ[), and Evans et al. [10] showed
the same for the parameter n. Besides achieving a fixed-parameter tractabil-
ity result for a more general and typically smaller parameter, we improve their
results by giving a simpler algorithm with a (usually) better running time.
Let (M, k, s) be an instance of k-Anonymity, and let M

be the (un-
known) k-anonymous matrix which we seek to obtain from M by suppressing
at most s elements. Our fixed-parameter algorithm works in two phases. In the
first phase, the algorithm guesses the entries of each row type R

j
in M

. In the
second phase, the algorithm computes an assignment of the rows of M to the
row types R

j
in M

—see Algorithm 1 for an outline.
We now explain the two phases in detail, beginning with Phase 1. To deter-
mine the row types R
i
of M (line 2 in Algorithm 1), the algorithm constructs a
trie [13] on the rows of M. The leaves of the trie correspond to the row types
of M. For later use, the algorithm also keeps track of the numbers n
1
, n
2
, . . . , n
tin
of each type of row that is present in M; this can clearly be done by keeping a
counter at each leaf of the trie and incrementing it by one whenever a new row
matches the path to a leaf. All of this can be done in a single pass over M.
For implementing the guess in Step 2 of Phase 1, the algorithm goes over all
binary matrices of dimension t
in
t
out
; such a matrix A is interpreted as follows:
A row of type R
i
is mapped
3
to a row of type R

j
if and only if A[i, j] = 1 (see
line 3). Note that we allow in our guessing step an output type to contain no row
of any input row type. These “empty” output row types are deleted. Hence, with
our guessing in Step 2, we guess not only output matrices M

with exactly t
out
types, but also matrices M

with at most t
out
types.
3
Note that not all rows of an input type need to be mapped to the same output type.
5
Now the algorithm computes the entries of each row type R

j
, 1 ≤ j ≤ t
out
,
of M

(Step 3 of Phase 1). Assume for ease of notation that R
1
, . . . , R

are the
row types of M which contribute (according to the guessing) at least one row to
the (as yet unknown) output row type R

j
. Now, for each 1 ≤ i ≤ m, if R
1
[i] =
R
2
[i] = ... = R

[i], then set R

j
[i] := R
1
[i]; otherwise, set R

j
[i] := . This yields
the entries of the output type R

j
, and the number ω
j
of suppressions required to
convert any input row (if possible) to the type R

j
is the number of -entries in R

j
.
The guessing in Step 2 of Phase 1 takes time exponential in the parameter t
in
,
but Phase 2 can be done in polynomial time. To show this, we prove that Row
Assignment is polynomial-time solvable. We do this in the next lemma, after
formally defining the Row Assignment problem. To this end, we use the two
sets T
in
= ¦1, . . . , t
in
¦ and T
out
= ¦1, . . . , t
out
¦.
Row Assignment
Input: Nonnegative integers k, s, ω
1
, . . . , ω
tout
and n
1
, . . . , n
tin
with

tin
i=1
n
i
= n, and a function a : T
in
T
out
→ ¦0, 1¦.
Question: Is there a function g : T
in
T
out
→ ¦0, . . . , n¦ such that
a(i, j) n ≥ g(i, j) ∀i ∈ T
in
∀j ∈ T
out
(1)
tin

i=1
g(i, j) ≥ k ∀j ∈ T
out
(2)
tout

j=1
g(i, j) = n
i
∀i ∈ T
in
(3)
tin

i=1
tout

j=1
g(i, j) ω
j
≤ s (4)
Row Assignment formally defines the remaining problem in Phase 2: At
this stage of the algorithm the input row types R
1
, . . . , R
tin
and the number
of rows n
1
, . . . , n
tin
in these input row types are known. The algorithm has
also computed the output row types R

1
, . . . , R

tout
and the number of suppres-
sions ω
1
, . . . , ω
tout
in these output row types. Now, the algorithm computes an
assignment of the rows of the input row types to output row types such that:
– The assignment of the rows respects the guessing in Step 2 of Phase 1. This
is secured by Inequality (1).
– M

is k-anonymous, that is, each output row type contains at least k rows.
This is secured by Inequality (2).
– All rows of each input row type are assigned. This is secured by Equation (3).
– The total cost of the assignment is at most s. This is secured by Inequal-
ity (4).
Note that in the definition of Row Assignment no row type occurs and,
hence, the problem is independent of the specific entries of the input or output
row types.
6
v1
−n1
v2
−n2
v3
−n3
v4
−n4
v5
−n5
u1
k
u2
k
u3
k
u4
k
t

t
in

i=1
ni

−k · tout
ω1
ω3
ω1
ω2
ω3
ω2
ω4
ω4
0
0
0
0
Fig. 1. Example of the constructed network with tin = 5 and tout = 4. The number on
each arc denotes its cost. The number next to each node denotes its demand.
Lemma 2. Row Assignment can be solved in O(t
3
in
log(t
in
)) time.
Proof (Sketch). We reduce Row Assignment to the Uncapacitated Mini-
mum Cost Flow problem, which is defined as follows [22]:
Uncapacitated Minimum Cost Flow
Input: A network (directed graph) D = (V, A) with demands b : V → Z
on the nodes and costs c : V V → N.
Task: Find a function f which minimizes

(u,v)∈A
c(u, v) f(u, v) and
satisfies:

{v|(u,v)∈A}
f(u, v) −

{v|(v,u)∈A}
f(v, u) = b(u) ∀u ∈ V
f(u, v) ≥ 0 ∀(u, v) ∈ A
We first describe the construction of the network with demands and costs.
For each n
i
, 1 ≤ i ≤ t
in
, add a node v
i
with demand −n
i
(that is, a supply
of n
i
) and for each ω
j
add a node u
j
with demand k. If a(i, j) = 1, then add an
arc (v
i
, u
j
) with cost ω
j
. Finally, add a sink t with demand (

n
i
) −k t
out
and
the arcs (u
i
, t) with cost zero. See Figure 1 for an example of the construction.
Note that, although the arc capacities are unbounded, the maximum flow over
one arc is implicitly bounded by n because the sum of all supplies is

tin
i=1
n
i
= n.
The Uncapacitated Minimum Cost Flow problem is solvable in O([V [
log([V [)([A[ +[V [ log([V [))) time in a network (directed graph) D = (V, A) [22].
Since our constructed network has t
in
+t
out
nodes and t
in
t
out
arcs, we can solve
our Uncapacitated Minimum Cost Flow-instance in O((t
in
+t
out
) log(t
in
+
t
out
)(t
in
t
out
+ (t
in
+ t
out
) log(t
in
+ t
out
))) time. Since, by Lemma 1, t
in
≥ t
out
,
the running time is O(t
3
in
log(t
in
)). ¯ .
Putting all these together, we arrive at the following theorem:
Theorem 1. k-Anonymity can be solved in O(nm + 2
tintout
t
in
(t
out
m + t
2
in

log(t
in
))) time, and so, in O(nm + 2
t
2
in
t
2
in
(m + t
in
log(t
in
))) time.
7
We remark that the described algorithm can be modified to use domain
generalization hierarchies (DGH) [23] or to solve -Diversity [18]. We defer
the details to a full version of the paper.
We end this section by comparing our algorithmic results to the closely re-
lated ones by Bonizzoni et al. [5]. They presented an algorithm for k-Anonymity
with a running time of O(2
(|Σ|+1)
m
kmn
2
) which works—similarly to our algo-
rithm—in two phases: First, their algorithm guesses all possible output row types
together with their entries in O(2
(|Σ|+1)
m
) time. In Phase 1 our algorithm guesses
the output row types producible from M within O(2
tintout
t
in
t
out
m + mn) time
using a different approach. Note that, in general, t
in
is much smaller than the
number [Σ[
m
of all possible different input types. Hence, in general the guessing
step of our algorithm is faster. For instances where [Σ[
m
≤ t
in
t
out
, one can
guess the output types like Bonizzoni et al. in O(2
(|Σ|+1)
m
) time.
Next, we compare Phase 2 of our algorithm to the second step of Bonizzoni
et al.’s algorithm. In both algorithms, the same problem Row Assignment
is solved. Bonizzoni et al. did this using maximum matching on a bipartite
graph with O(n) nodes, while we do it using a flow network with O(t
in
) nodes.
Consequently, the running time of our approach depends only on t
in
, and its
proof of correctness is—arguably—simpler.
As we mentioned above, our algorithm can easily be modified to solve k-
Anonymity using DGHs with no significant increase in the running time. Since
in the DGH setting the alphabet size [Σ[ increases, it is not immediately clear
how the algorithm due to Bonizzoni et al. can be modified to solve this problem
without an exponential increase in the running time. Finally, when the param-
eters t
in
and ([Σ[, m) are constants, then Bonizzoni et al.’s algorithm runs in
O(kmn
2
) time while our algorithm runs in linear time O(mn).
3 Parameter t
out
There is a close relationship between k-Anonymity and clustering problems
where one is also interested in grouping together similar objects. Such a rela-
tionship has already been observed in related work [1,6,11]. The clustering view
on k-Anonymity makes the number t
out
of output types (corresponding to the
number of clusters) of a k-anonymous matrix an interesting parameter.
There is also a more algorithmic motivation for investigating the (param-
eterized) complexity of k-Anonymity for the parameter t
out
. As we saw in
Theorem 1, k-Anonymity is fixed-parameter tractable for the parameter t
in
.
Due to Lemma 1, we know that t
out
is a stronger parameter than t
in
, in the sense
that t
out
≤ t
in
. Hence, it is a natural question whether k-Anonymity is already
fixed-parameter tractable with respect to the number of output types. Answering
this in the negative, we show that k-Anonymity is NP-hard even when there
are only two output types and the alphabet has size four, destroying any hope
for fixed-parameter tractability already for the combined parameter “number
of output types and alphabet size”. The hardness proof uses a polynomial-time
many-one reduction from Balanced Complete Bipartite Subgraph:
8
Balanced Complete Bipartite Subgraph (BCBS)
Input: A bipartite graph G = (V, E) and an integer k ≥ 1.
Question: Is there a complete bipartite subgraph of G whose partition classes
are of size at least k each?
BCBS is NP-complete [15]. We provide a polynomial-time many-one reduc-
tion from a special case of BCBS with a balanced input graph, that is, a bipar-
tite graph that can be partitioned into two independent sets of the same size,
and k = [V [/4. This special case clearly is also NP-complete by a simple reduc-
tion from the general BCBS problem: Let V := A¬ B with A and B being the
two vertex partition classes of the input graph. When the input graph is not
balanced, that is, [A[ −[B[ = 0, add [[A[ −[B[[ isolated vertices to the partition
class of smaller size. If k < [V [/4, then repeat the following until k = [V [/4:
Add one vertex to A and make it adjacent to each vertex from B and add one
vertex to B and make it adjacent to each vertex from A. If k > [V [/4, then
add k −[V [/4 isolated vertices to each of A and B.
We devise a reduction from BCBS with balanced input graph and k = [V [/4
to show the NP-hardness of k-Anonymity.
Theorem 2. k-Anonymity is NP-complete for two output row types and al-
phabet size four.
Proof (Sketch). We only have to prove NP-hardness, since containment in NP is
clear. Let (G, n/4) be a BCBS-instance with G = (V, E) being a balanced bipar-
tite graph and n := [V [. Let A = ¦a
1
, a
2
, . . . , a
n/2
¦ and B = ¦b
1
, b
2
, . . . , b
n/2
¦ be
the two vertex partition classes of G. We construct an (n/4 + 1)-Anonymity-
instance that is a YES-instance if and only if (G, n/4) ∈ BCBS. To this end, the
main idea is to use a matrix expressing the adjacencies between A and B as the
input matrix. Partition class A corresponds to the rows and partition class B
corresponds to the columns. The salient points of our reduction are as follows.
1. By making a matrix with 2x rows x-anonymous we ensure that there are
at most two output types. One of the types, the solution type, corresponds
to the solution set of the original instance: Solution set vertices from A are
represented by rows that are preimages of rows in the solution type and
solution set vertices from B are represented by columns in the solution type
that are not suppressed.
2. We add one row that contains the -symbol in each entry. Since the -
symbol is not used in any other row, this enforces the other output type to
be fully suppressed, that is, each column is suppressed.
3. Since the rows in the solution type have to agree on the columns that are
not suppressed, we have to ensure that they agree on adjacencies to model
BCBS. This is done by using two different types of 0-symbols representing
non-adjacency. The 1-symbol represents adjacency.
The matrix D is described in the following and illustrated in Figure 2. There is
one row for each vertex in A and one column for each vertex in B. The value in
the i
th
column of the j
th
row is 1 if a
j
is adjacent to b
i
and, otherwise, 0
1
if
4
j ≤
4
We assume without loss of generality that n is divisible by four.
9
b1 b2 . . . b
n/2−1
b
n/2
a1 1 1 01 1 1
a2 01 1 01 1 1
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
a
n/4
1 1 01 01 1
a
n/4+1
1 1 02 02 1
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
a
n/2
02 1 02 1 1

1 1 1 1 1 1
Fig. 2. Typical structure of the matrix D of the k-Anonymity instance obtained by
the reduction from Balanced Complete Bipartite Subgraph.
n/4 and 0
2
if j > n/4. Additionally, there are two further rows, one containing
only 1s and one containing only -symbols. The number of allowed suppressions
is s := (n/4 + 1) n/2 + (n/4 + 1) n/4. This completes the construction.
It remains to show that (G, n/4) is a YES-instance of BCBS if and only if
the constructed matrix can be transformed into an (n/4 +1)-anonymous matrix
by suppressing at most s elements; this part of the proof is deferred to a full
version of the paper. ¯ .
Deconstructing intractability. In the remainder of this section we briefly
discuss the NP-hardness proof for k-Anonymity in the spirit of “deconstruct-
ing intractability” [16,21]. In our reduction the alphabet size [Σ[ and the num-
ber t
out
of output row types are constants whereas the number n of rows, the
number m of attributes, number s of suppressions, and the anonymity quality k
are unbounded. This suggests a study of the computational complexity of those
cases where at least one of these quantities is bounded. Some of the correspond-
ing parameterizations have already been investigated, see Table 1 in Section 1.
While for parameters ([Σ[, m), ([Σ[, n), and n k-Anonymity is fixed-parameter
tractable, it is open whether combining t
out
with m, k, or s helps to obtain
fixed-parameter tractability. In particular, the parameterized complexity for the
combined parameter ([Σ[, s, k) is still open. In contrast, k-Anonymity is W[1]-
hard for (s, k) [5], that is, it is presumably fixed-parameter intractable for this
combined parameter.
Whereas k-Anonymity is NP-hard for constant m and unbounded t
out
, one
can easily construct an XP-algorithm with respect to the combined parame-
ter (t
out
, m): In O(2
m·tout
m t
out
) time guess the suppressed columns for all
output row types. Then, guess in n
O(tout)
time one prototype for each output
row type, that is, one input row that is a preimage of a row from the output row
type. Now, knowing the entries for each output row, one can simply apply the
Row Assignment algorithm from Section 2:
Proposition 1. k-Anonymity parameterized by (t
out
, m) is in XP.
10
Next, we prove fixed-parameter tractability for k-Anonymity with respect to
the combined parameter (t
out
, s) by showing that the number t
in
of input types
is at most (t
out
+ s). To this end, consider a feasible solution for an arbitrary
k-Anonymity instance. We distinguish between input row types that have rows
which have at least one suppressed entry in the solution (suppressed input row
types in the following) and input row types that do only have rows that re-
main unchanged in the solution (unsuppressed input row types in the following).
Clearly, every unsuppressed input row type needs at least one unsuppressed
output row type. Thus, the number of unsuppressed input row type cannot ex-
ceed t
out
. Furthermore, the number of rows that have at least one suppressed
entry is at most s. Hence the number of suppressed input row types is also at
most s. It follows that t
in
≤ t
out
+ s. Now, fixed-parameter tractability follows
from Theorem 1:
Proposition 2. k-Anonymity is fixed-parameter tractable with respect to the
combined parameter (t
out
, s).
However, to achieve a better running time one might want to develop a di-
rect fixed-parameter algorithm for (t
out
, s). Finally, we conjecture that an XP-
algorithm for k-Anonymity can be achieved with respect to the combined pa-
rameter (t
out
, k).
4 Conclusion
This paper adopts a data-driven approach towards the design of (exact) algo-
rithms for k-Anonymity and related problems. More specifically, the parame-
ter t
in
measures an easy-to-determine input property. Our central message here is
that if t
in
is small or even constant, then k-Anonymity and its related problems
are efficiently solvable—for constant t
in
even in linear time. On the contrary, al-
ready for two output types k-Anonymity becomes computationally intractable.
We contributed to a refined analysis of the computational complexity of k-
Anonymity. The state of the art including several research challenges con-
cerning natural parameterizations for k-Anonymity is surveyed in Table 1 in
the introductory section. Besides running time improvements in general and
open questions indicated in Table 1 such as the parameterized complexity of
k-Anonymity for the combined parameter “number of suppressions plus alpha-
bet size”, it would also be interesting to determine whether our fixed-parameter
tractability result for k-Anonymity with respect to the parameter t
in
not only
extends to the already mentioned generalizations of k-Anonymity but also to
the (in terms of privacy) more restrictive t-Closeness problem [17].
References
1. G. Aggarwal, T. Feder, K. Kenthapadi, S. Khuller, R. Panigrahy, D. Thomas, and
A. Zhu. Achieving anonymity via clustering. ACM Trans. Algorithms, 6(3):1–19,
2010.
11
2. G. Aggarwal, T. Feder, K. Kenthapadi, R. Motwani, R. Panigrahy, D. Thomas,
and A. Zhu. Anonymizing tables. In Proc. 10th ICDT, volume 3363 of LNCS,
pages 246–258. Springer, 2005.
3. J. Blocki and R. Williams. Resolving the complexity of some data privacy problems.
In Proc. 37th ICALP, volume 6199 of LNCS, pages 393–404. Springer, 2010.
4. P. Bonizzoni, G. DellaÂăVedova, and R. Dondi. Anonymizing binary and small
tables is hard to approximate. J. Comb. Optim., 22:97–119, 2011.
5. P. Bonizzoni, G. D. Vedova, R. Dondi, and Y. Pirola. Parameterized complexity
of k-anonymity: Hardness and tractability. In Proc. 21st IWOCA, volume 6460 of
LNCS, pages 242–255. Springer, 2010.
6. R. Bredereck, A. Nichterlein, R. Niedermeier, and G. Philip. Pattern-guided data
anonymization and clustering. In Proc. 36th MFCS, LNCS. Springer, 2011. To
appear.
7. V. T. Chakaravarthy, V. Pandit, and Y. Sabharwal. On the complexity of the
k-anonymization problem. CoRR, abs/1004.4729, 2010.
8. R. G. Downey and M. R. Fellows. Parameterized Complexity. Springer, 1999.
9. C. Dwork. A firm foundation for private data analysis. Commun. ACM, 54:86–95,
2011.
10. P. A. Evans, T. Wareham, and R. Chaytor. Fixed-parameter tractability of
anonymizing data by suppressing entries. J. Comb. Optim., 18(4):362–375, 2009.
11. A. M. Fard and K. Wang. An effective clustering approach to web query log
anonymization. In Proc. SECRYPT, pages 109–119. SciTePress, 2010.
12. J. Flum and M. Grohe. Parameterized Complexity Theory. Springer, 2006.
13. E. Fredkin. Trie memory. Commun. ACM, 3(9):490–499, 1960.
14. B. C. M. Fung, K. Wang, R. Chen, and P. S. Yu. Privacy-preserving data pub-
lishing: A survey of recent developments. ACM Comput. Surv., 42(4):14:1–14:53,
2010.
15. D. S. Johnson. The NP-completeness column: An ongoing guide. J. Algorithms,
8:438–448, 1987.
16. C. Komusiewicz, R. Niedermeier, and J. Uhlmann. Deconstructing intractability–
A multivariate complexity analysis of interval constrained coloring. J. Discrete
Algorithms, 9:137–151, 2011.
17. N. Li, T. Li, and S. Venkatasubramanian. t-closeness: Privacy beyond k-anonymity
and l-diversity. In Proc. 23rd ICDE, pages 106–115. IEEE, 2007.
18. A. Machanavajjhala, D. Kifer, J. Gehrke, and M. Venkitasubramaniam. -diversity:
Privacy beyond k-anonymity. ACM Trans. Knowl. Discov. Data, 1, 2007. 52 pages.
19. A. Meyerson and R. Williams. On the complexity of optimal k-anonymity. In Proc.
23rd PODS, pages 223–228. ACM, 2004.
20. R. Niedermeier. Invitation to Fixed-Parameter Algorithms. Oxford University
Press, 2006.
21. R. Niedermeier. Reflections on multivariate algorithmics and problem parameter-
ization. In Proc. 27th STACS, volume 5 of LIPIcs, pages 17–32. IBFI Dagstuhl,
2010.
22. J. Orlin. A faster strongly polynomial minimum cost flow algorithm. In Proc. 20th
STOC, pages 377–387. ACM, 1988.
23. L. Sweeney. Achieving k-anonymity privacy protection using generalization and
suppression. IJUFKS, 10(5):571–588, 2002.
24. L. Sweeney. k-anonymity: A model for protecting privacy. IJUFKS, 10(5):557–570,
2002.
12

is k-Anonymity polynomial-time solvable for constant values of k? The general answer is “no” since already 3-Anonymity is NP-hard [19]. In this work. even when the number of columns in the input dataset is just three [7]. including the number of rows n.1 k-Anonymity has a number of meaningful parameterizations beyond k. MAX SNP-hard when k = 7. even when the number of columns in the input dataset is 8 [4]. |Σ|). In particular. Here the arity of |Σ| may range from binary (such as gender) to unbounded. whereas there is no hope for fixedparameter tractability with respect to the single parameters m and |Σ| [10]. answering an open question of Evans et al.14]. it is APX-hard when k = 3. we refine this view by asking how the “degree of homogeneity” of the input matrix influences the complexity of k-Anonymity. Clearly. |Σ|m denotes the maximum possible number of different input rows. NP-hard when k = 4. see Machanavajjhala et al. thus implicitly exploiting a very rough upper bound on input homogeneity. In other words. and MAX SNPhard when k = 3. we study the parameterized complexity of k-Anonymity as initiated by Evans et al. [5] made use of the fact that the value |Σ|m is an upper bound on the number of different input rows. [10]. even when the number of columns in the input dataset is 27 [3]. The central question here is how naturally occurring parameters influence the complexity of kAnonymity. it has been shown that 2-Anonymity is polynomial-time solvable [3]. [10]. in the spirit of multivariate algorithmics [21]. even when the alphabet size is just two [4]. However. 2 . Table 1. note that in the data privacy community “differential privacy” (cleverly adding some random noise) is now the most popular method [9. We emphasize that Bonizzoni et al. [18] and Sweeney [24] for discussions on the pros and cons of these models in terms of privacy vs preservation of meaningful data. For example. [5]. k-Anonymity and many related problems are NP-hard [19]. We focus on a better understanding of the computational complexity and on tractable special cases of these problems. may still be valuable for the sake of providing a simple model that does not introduce noise. even on binary data sets [4]. [10] and Bonizzoni et al. While k-Anonymity is our central problem. Bonizzoni et al.additionally given an upper bound s for the number of suppressions allowed. k alone does not give a promising parameterization. the number of columns m. our results also extend to several more general problems. summarizes known and new results. and— for instance—in the case of “one-time anonymization”. Confronted with this computational hardness. 1 However. and. k-Anonymity is a very natural combinatorial problem (with potential applications beyond data privacy). is k-Anonymity fixed-parameter tractable for the parameter “number of different input rows”? In a similar vein. which extends similar tables due to Evans et al. [5] recently showed that k-Anonymity is fixed-parameter tractable with respect to the combined parameter (m. we also study the effect of the degree of homogeneity of the output matrix on the complexity of k-Anonymity. For instance. whether a matrix can be made k-anonymous by suppressing (blanking out) at most s entries. Thus. various combinations of single parameters. the alphabet size |Σ|. even when the input matrix is highly restricted. For instance.

j] ∈ Σ by the new symbol “ ” ending up with a matrix over the alphabet Σ { }. Results proved in this paper are in bold. Our inputs are datasets in the form of n × m-matrices. m). we mention that all our results extend to more general problems. Indeed. Typically. we expect tin n and tin |Σ|m . In contrast. when only tout is fixed. opposing the trivial case tout = 1. m |Σ|. We say that a row in a matrix has a certain row type if it 3 . the number of different input rows. j] of an n × m-matrix M over alphabet Σ with 1 ≤ i ≤ n and 1 ≤ j ≤ m means to simply replace M [i. [1]). including -Diversity [18] and “k-Anonymity with domain generalization hierarchies” [23]. First. for studying the computational complexity of k-Anonymity and related problems. our algorithm solves k-Anonymity in time linear in the size of the input. Preliminaries and basic observations. when tin is a constant. Since tout ≤ tin . In particular. where the n rows refer to the individuals and the m columns correspond to attributes with entries drawn from an alphabet Σ. The parameterized complexity of k-Anonymity. For instance. Finally. we show that k-Anonymity is already NP-hard when tout = 2. The column and row entries represent parameters. the number of different output rows. Suppressing an entry M [i. n tin tout NP-hard [19] NP-hard [2] NP-hard [4] FPT [10] FPT [5] FPT [10] FPT NP-hard k NP-hard [19] NP-hard [2] NP-hard [4] FPT [10] FPT [10] FPT [10] FPT ? s k. we derive an algorithm that solves k-Anonymity in O(nm + 2tin tout tin (tout m + t2 · log(tin ))) time. which compares in m favorably with Bonizzoni et al. Then. tout may also be interpreted as the number of output “clusters”.’s [5] algorithm running in O(2(|Σ|+1) kmn2 ) time. More precisely. the entry in row “–” and column “s” refers to the (parameterized) complexity for the single parameter s whereas the entry in row “m” and column “s” refers to the (parameterized) complexity for the combined parameter (s.Table 1. tin is a “data-driven parameterization” in the sense that one can efficiently measure in advance the instancespecific value of tin whereas |Σ|m is a trivial upper bound for homogeneity. interpreting k-Anonymity as a (meta-)clustering problem (of Aggarwal et al. We remark that tout is an interesting parameter since it is “stronger” than tin and since. s W[1]-hard [5] W[1]-hard [5] ? ? FPT [10] FPT [10] FPT [10] FPT [10] FPT [10] FPT [10] FPT [10] FPT [10] FPT FPT FPT FPT Our contributions. We introduce the “homogeneity parameters” tin . then we show that the problem remains NP-hard. A row type is a string from (Σ { })m . – – |Σ| m n |Σ|. we show that there is always an optimal solution (minimizing the number of suppressions) with tout ≤ tin . we defer the corresponding details to a full version of the paper. and tout . this shows that k-Anonymity is fixed-parameter tractable when parameterized by tin . even when |Σ| = 4.

to identify a parameter p (typically a positive integer or a tuple of positive integers) for X and to determine whether a size-n input instance of X can be solved in f (p) · nO(1) time. where f is an arbitrary computable function. To keep the treatment simple. we call a row x in the input matrix the preimage of y if y is obtained from x by suppressing in x the -entry positions of y. we assume that tout is a user-specified number which bounds the maximum number of output types. We refrain from further exploration here and treat tout as part of the input specifying an upper bound on the number of output types. problem X only lies in the parameterized complexity class XP. k. Lemma 1. Question: Can at most s elements of M be suppressed to obtain a kanonymous matrix M ? Our algorithmic results mostly rely on concepts of parameterized algorithmics [8. If X could only be solved in polynomial running time where the degree of the polynomial depends on p (such as nO(p) ). If M has tin row types. given a computationally hard problem X. then there exists a k-anonymous matrix M with at most tin row types which can be obtained from M by suppressing at most s elements. Let (M. k-Anonymity Input: An n × m-matrix M and nonnegative integers k. synonymously. interpreting k-Anonymity as a clustering problem with a guarantee k for the minimum cluster size (as in the work by Aggarwal et al.coincides in all its entries with the row type. s) be a YES-instance of k-Anonymity. we sometimes also speak of a row “lying in a row type”. For a row y (with some entries suppressed) in the output matrix. A natural objective when trying to achieve k-anonymity is to minimize the number of suppressed matrix entries. If this is the case. where tin denotes the number of input row types in the given matrix and tout denotes the number of row types in the output k-anonymous matrix. The corresponding complexity class is called FPT. Clearly. We study two new parameters tin and tout .12. We say that a matrix is k-anonymous if every row type contains none or at least k rows in the matrix. then one says that X is fixed-parameter tractable for the parameter p. s. and that the row type “contains” this row. The central problem of this work reads as follows. In what follows.20]. output row types) that are built. 2 Parameter tin In this section we show that k-Anonymity is fixed-parameter tractable with respect to the parameter number tin of input row types. for parameter p. Note that using sorting tin can be efficiently determined for a given matrix.2 The following lemma says that without loss of generality one may assume tout ≤ tin . The fundamental idea herein is. tout can also be seen as the number of “clusters” (that is. then. [1]). one could ask to minimize tout with at most a given number s of suppressions. Since tin ≤ |Σ|m and 2 Indeed. one could have variations on this: for instance. 4 .

we guess not only output matrices M with exactly tout types. j] = A[2. For implementing the guess in Step 2 of Phase 1. the algorithm computes an assignment of the rows of M to the row types Rj in M —see Algorithm 1 for an outline. this can clearly be done by keeping a counter at each leaf of the trie and incrementing it by one whenever a new row matches the path to a leaf. . . . s. Our fixed-parameter algorithm works in two phases. 5 . Step 1 Phase 1. see Lemma 2. . the algorithm guesses the entries of each row type Rj in M . .Algorithm 1 Pseudo-code for solving k-Anonymity. . The leaves of the trie correspond to the row types of M . n2 . [10] showed the same for the parameter n. Besides achieving a fixed-parameter tractability result for a more general and typically smaller parameter. These “empty” output row types are deleted. Hence. tout ) 2: Determine the row types R1 . . j] = 0 then 6: delete empty output row type Rj 7: decrease tout by one 8: else 9: Determine all entries of Rj 10: 11: 12: if solveRowAssignment then return ‘YES’ return ‘NO’ Phase 1. s) be an instance of k-Anonymity. For later use. . beginning with Phase 1. To determine the row types Ri of M (line 2 in Algorithm 1). with our guessing in Step 2. Note that we allow in our guessing step an output type to contain no row of any input row type. We now explain the two phases in detail. the algorithm goes over all binary matrices of dimension tin × tout . 1: procedure solveKAnonymity(M . The function solveRowAssignment solves Row Assignment in polynomial time. k. In the first phase. and let M be the (unknown) k-anonymous matrix which we seek to obtain from M by suppressing at most s elements. Let (M. 3 Note that not all rows of an input type need to be mapped to the same output type. Bonizzoni et al. Rtin 3: for each possible A : [0. |Σ|). All of this can be done in a single pass over M . Step 3 Phase 2 tin ≤ n. j] = . we improve their results by giving a simpler algorithm with a (usually) better running time. the algorithm constructs a trie [13] on the rows of M . but also matrices M with at most tout types. j] = 1 (see line 3). Both these latter results were already proved quite recently. . 1]tin ×tout do 4: for j ← 1 to tout do 5: if A[1. ntin of each type of row that is present in M . and Evans et al. Step 2 Phase 1. k. |Σ|) and the single parameter n. In the second phase. [5] demonstrated fixedparameter tractability for the parameter (m. such a matrix A is interpreted as follows: A row of type Ri is mapped3 to a row of type Rj if and only if A[i. k-Anonymity is also fixed-parameter tractable with respect to the combined parameter (m. = A[tin . . the algorithm also keeps track of the numbers n1 .

each output row type contains at least k rows. set Rj [i] := . if R1 [i] = R2 [i] = . This is secured by Inequality (4). This yields the entries of the output type Rj . s. . . Question: Is there a function g : Tin × Tout → {0. ωtout in these output row types. then set Rj [i] := R1 [i]. Rtout and the number of suppressions ω1 . . – All rows of each input row type are assigned. Now. j) ≥ k i=1 tout g(i. and a function a : Tin × Tout → {0. Note that in the definition of Row Assignment no row type occurs and. j) · n ≥ g(i. . that is. for each 1 ≤ i ≤ m. the algorithm computes an assignment of the rows of the input row types to output row types such that: – The assignment of the rows respects the guessing in Step 2 of Phase 1. . of M (Step 3 of Phase 1). . . . Row Assignment Input: Nonnegative integers k. ntin with tin i=1 ni = n. . The guessing in Step 2 of Phase 1 takes time exponential in the parameter tin . j) · ωj ≤ s i=1 j=1 (4) Row Assignment formally defines the remaining problem in Phase 2: At this stage of the algorithm the input row types R1 .. . j) tin ∀i ∈ Tin ∀j ∈ Tout ∀j ∈ Tout ∀i ∈ Tin (1) (2) g(i. . This is secured by Equation (3). .. 1}. This is secured by Inequality (2). . 1 ≤ j ≤ tout . . . .Now the algorithm computes the entries of each row type Rj . . . we prove that Row Assignment is polynomial-time solvable. 6 . . To show this. n} such that a(i. . – M is k-anonymous. . R are the row types of M which contribute (according to the guessing) at least one row to the (as yet unknown) output row type Rj . Assume for ease of notation that R1 . . To this end. j) = ni j=1 tin tout (3) g(i. we use the two sets Tin = {1. – The total cost of the assignment is at most s. ωtout and n1 . = R [i]. . . . . . . This is secured by Inequality (1). . and the number ωj of suppressions required to convert any input row (if possible) to the type Rj is the number of -entries in Rj . . The algorithm has also computed the output row types R1 . after formally defining the Row Assignment problem. the problem is independent of the specific entries of the input or output row types. . . ω1 . . . tin } and Tout = {1. . . . otherwise. tout }. . ntin in these input row types are known. hence. . We do this in the next lemma. but Phase 2 can be done in polynomial time. Now. Rtin and the number of rows n1 . .

Task: Find a function f which minimizes (u.v)∈A c(u. Row Assignment can be solved in O(t3 · log(tin )) time. A) [22]. v) and satisfies: f (u. v) ≥ 0 ∀u ∈ V ∀(u. we can solve our Uncapacitated Minimum Cost Flow-instance in O((tin + tout ) · log(tin + tout )(tin · tout + (tin + tout ) log(tin + tout ))) time. the running time is O(t3 · log(tin )). in O(nm + 2tin t2 (m + tin log(tin ))) time. in 7 . Lemma 2. j) = 1. a supply of ni ) and for each ωj add a node uj with demand k. add a sink t with demand ( ni ) − k · tout and the arcs (ui . by Lemma 1. in Proof (Sketch). For each ni .−n1 −n2 −n3 −n4 −n5 v1 v2 v3 v4 v5 ω1 ω1 ω3 ω2 ω2 ω3 ω4 ω4 u4 u3 u2 u1 k k k k 0 0 0 0 t tin ni i=1 − k · tout Fig. The Uncapacitated Minimum Cost Flow problem is solvable in O(|V | · log(|V |)(|A| + |V | · log(|V |))) time in a network (directed graph) D = (V. See Figure 1 for an example of the construction. k-Anonymity can be solved in O(nm + 2tin tout tin (tout m + t2 · in 2 log(tin ))) time. If a(i. Since. then add an arc (vi .v)∈A} f (v. t) with cost zero. Note that. 1. we arrive at the following theorem: Theorem 1. Finally. A) with demands b : V → Z on the nodes and costs c : V × V → N. in Putting all these together. u) = b(u) f (u. Since our constructed network has tin + tout nodes and tin · tout arcs. v) ∈ A {v|(v. which is defined as follows [22]: Uncapacitated Minimum Cost Flow Input: A network (directed graph) D = (V. The number on each arc denotes its cost. and so. the maximum flow over tin one arc is implicitly bounded by n because the sum of all supplies is i=1 ni = n. We reduce Row Assignment to the Uncapacitated Minimum Cost Flow problem. v) · f (u. v) − {v|(u. although the arc capacities are unbounded. The number next to each node denotes its demand. 1 ≤ i ≤ tin .u)∈A} We first describe the construction of the network with demands and costs. add a node vi with demand −ni (that is. Example of the constructed network with tin = 5 and tout = 4. uj ) with cost ωj . tin ≥ tout .

Hence. the running time of our approach depends only on tin . Consequently. then Bonizzoni et al. we compare Phase 2 of our algorithm to the second step of Bonizzoni et al. m) are constants. Finally. the same problem Row Assignment is solved. Due to Lemma 1. k-Anonymity is fixed-parameter tractable for the parameter tin . As we saw in Theorem 1. tin is much smaller than the number |Σ|m of all possible different input types.’s algorithm. destroying any hope for fixed-parameter tractability already for the combined parameter “number of output types and alphabet size”. their algorithm guesses all possible output row types m together with their entries in O(2(|Σ|+1) ) time. in O(2(|Σ|+1) ) time. our algorithm can easily be modified to solve kAnonymity using DGHs with no significant increase in the running time. Such a relationship has already been observed in related work [1. For instances where |Σ|m ≤ tin · tout . it is a natural question whether k-Anonymity is already fixed-parameter tractable with respect to the number of output types. [5].We remark that the described algorithm can be modified to use domain generalization hierarchies (DGH) [23] or to solve -Diversity [18]. In both algorithms. We end this section by comparing our algorithmic results to the closely related ones by Bonizzoni et al.’s algorithm runs in O(kmn2 ) time while our algorithm runs in linear time O(mn). did this using maximum matching on a bipartite graph with O(n) nodes. We defer the details to a full version of the paper. Bonizzoni et al. Hence. Since in the DGH setting the alphabet size |Σ| increases. and its proof of correctness is—arguably—simpler. we show that k-Anonymity is NP-hard even when there are only two output types and the alphabet has size four.6. The clustering view on k-Anonymity makes the number tout of output types (corresponding to the number of clusters) of a k-anonymous matrix an interesting parameter. Note that. 3 Parameter tout There is a close relationship between k-Anonymity and clustering problems where one is also interested in grouping together similar objects. In Phase 1 our algorithm guesses the output row types producible from M within O(2tin tout tin tout m + mn) time using a different approach. it is not immediately clear how the algorithm due to Bonizzoni et al. we know that tout is a stronger parameter than tin . in the sense that tout ≤ tin . There is also a more algorithmic motivation for investigating the (parameterized) complexity of k-Anonymity for the parameter tout . As we mentioned above. can be modified to solve this problem without an exponential increase in the running time. in general the guessing step of our algorithm is faster. Next. in general. while we do it using a flow network with O(tin ) nodes. one can m guess the output types like Bonizzoni et al. Answering this in the negative. when the parameters tin and (|Σ|. The hardness proof uses a polynomial-time many-one reduction from Balanced Complete Bipartite Subgraph: 8 .11]. They presented an algorithm for k-Anonymity m with a running time of O(2(|Σ|+1) kmn2 ) which works—similarly to our algorithm—in two phases: First.

corresponds to the solution set of the original instance: Solution set vertices from A are represented by rows that are preimages of rows in the solution type and solution set vertices from B are represented by columns in the solution type that are not suppressed. bn/2 } be the two vertex partition classes of G. 1. We provide a polynomial-time many-one reduction from a special case of BCBS with a balanced input graph. each column is suppressed. 01 if4 j ≤ 4 We assume without loss of generality that n is divisible by four. since containment in NP is clear. 3. that is. n/4) ∈ BCBS. . the solution type. 2. The value in the ith column of the j th row is 1 if aj is adjacent to bi and. This is done by using two different types of 0-symbols representing non-adjacency. By making a matrix with 2x rows x-anonymous we ensure that there are at most two output types. this enforces the other output type to be fully suppressed. We only have to prove NP-hardness. . Question: Is there a complete bipartite subgraph of G whose partition classes are of size at least k each? BCBS is NP-complete [15]. When the input graph is not balanced. . We construct an (n/4 + 1)-Anonymityinstance that is a YES-instance if and only if (G. . |A| − |B| = 0. The salient points of our reduction are as follows. a2 . a bipartite graph that can be partitioned into two independent sets of the same size. We add one row that contains the -symbol in each entry. If k > |V |/4. There is one row for each vertex in A and one column for each vertex in B. Since the rows in the solution type have to agree on the columns that are not suppressed. then add k − |V |/4 isolated vertices to each of A and B. add ||A| − |B|| isolated vertices to the partition class of smaller size. One of the types. The 1-symbol represents adjacency. that is. otherwise. k-Anonymity is NP-complete for two output row types and alphabet size four. We devise a reduction from BCBS with balanced input graph and k = |V |/4 to show the NP-hardness of k-Anonymity. 9 . the main idea is to use a matrix expressing the adjacencies between A and B as the input matrix. and k = |V |/4. The matrix D is described in the following and illustrated in Figure 2. E) being a balanced bipartite graph and n := |V |. . n/4) be a BCBS-instance with G = (V. . If k < |V |/4. To this end.Balanced Complete Bipartite Subgraph (BCBS) Input: A bipartite graph G = (V. E) and an integer k ≥ 1. . Proof (Sketch). an/2 } and B = {b1 . b2 . Partition class A corresponds to the rows and partition class B corresponds to the columns. we have to ensure that they agree on adjacencies to model BCBS. Since the symbol is not used in any other row. . Theorem 2. Let A = {a1 . Let (G. that is. This special case clearly is also NP-complete by a simple reduction from the general BCBS problem: Let V := A B with A and B being the two vertex partition classes of the input graph. then repeat the following until k = |V |/4: Add one vertex to A and make it adjacent to each vertex from B and add one vertex to B and make it adjacent to each vertex from A.

. k) [5]. . . 1 1 . that is. 2. and n k-Anonymity is fixed-parameter tractable. b1 1 01 . k. this part of the proof is deferred to a full version of the paper. the parameterized complexity for the combined parameter (|Σ|. k-Anonymity is W[1]hard for (s. This completes the construction. k) is still open. In our reduction the alphabet size |Σ| and the number tout of output row types are constants whereas the number n of rows. n). . 1 1 . . s. .. Then. k-Anonymity parameterized by (tout . . . . one input row that is a preimage of a row from the output row type. Deconstructing intractability.. Typical structure of the matrix D of the k-Anonymity instance obtained by the reduction from Balanced Complete Bipartite Subgraph. Whereas k-Anonymity is NP-hard for constant m and unbounded tout . In contrast. bn/2−1 bn/2 1 1 1 1 . . This suggests a study of the computational complexity of those cases where at least one of these quantities is bounded. In the remainder of this section we briefly discuss the NP-hardness proof for k-Anonymity in the spirit of “deconstructing intractability” [16. . m) is in XP. that is. In particular. 01 01 . . m). and the anonymity quality k are unbounded.21]. . It remains to show that (G. . see Table 1 in Section 1. an/4 an/4+1 . Some of the corresponding parameterizations have already been investigated. an/2 1 02 1 1 1 02 1 1 1 1 1 Fig. . number s of suppressions. . m): In O(2m·tout · m · tout ) time guess the suppressed columns for all output row types. one containing only 1s and one containing only -symbols. Additionally. 01 02 . . 01 02 . . While for parameters (|Σ|. one can easily construct an XP-algorithm with respect to the combined parameter (tout . . it is open whether combining tout with m. it is presumably fixed-parameter intractable for this combined parameter. n/4) is a YES-instance of BCBS if and only if the constructed matrix can be transformed into an (n/4 + 1)-anonymous matrix by suppressing at most s elements. . (|Σ|. The number of allowed suppressions is s := (n/4 + 1) · n/2 + (n/4 + 1) · n/4.a1 a2 . . one can simply apply the Row Assignment algorithm from Section 2: Proposition 1. . guess in nO(tout ) time one prototype for each output row type. b2 1 1 . there are two further rows. n/4 and 02 if j > n/4. the number m of attributes. . . Now. 10 . . 1 1 . . or s helps to obtain fixed-parameter tractability. knowing the entries for each output row.

the number of rows that have at least one suppressed entry is at most s. T. More specifically. s). k). it would also be interesting to determine whether our fixed-parameter tractability result for k-Anonymity with respect to the parameter tin not only extends to the already mentioned generalizations of k-Anonymity but also to the (in terms of privacy) more restrictive t-Closeness problem [17]. already for two output types k-Anonymity becomes computationally intractable. Now. 4 Conclusion This paper adopts a data-driven approach towards the design of (exact) algorithms for k-Anonymity and related problems. then k-Anonymity and its related problems are efficiently solvable—for constant tin even in linear time. s) by showing that the number tin of input types is at most (tout + s). the number of unsuppressed input row type cannot exceed tout . Besides running time improvements in general and open questions indicated in Table 1 such as the parameterized complexity of k-Anonymity for the combined parameter “number of suppressions plus alphabet size”. Panigrahy.Next. the parameter tin measures an easy-to-determine input property. to achieve a better running time one might want to develop a direct fixed-parameter algorithm for (tout . Our central message here is that if tin is small or even constant. 11 . consider a feasible solution for an arbitrary k-Anonymity instance. and A. every unsuppressed input row type needs at least one unsuppressed output row type. 6(3):1–19. k-Anonymity is fixed-parameter tractable with respect to the combined parameter (tout . fixed-parameter tractability follows from Theorem 1: Proposition 2. Furthermore. Clearly. D. 2010. On the contrary. ACM Trans. G. Finally. Hence the number of suppressed input row types is also at most s. Kenthapadi. Achieving anonymity via clustering. Feder. K. To this end. Algorithms. S. Thomas. we conjecture that an XPalgorithm for k-Anonymity can be achieved with respect to the combined parameter (tout . It follows that tin ≤ tout + s. References 1. R. However. We distinguish between input row types that have rows which have at least one suppressed entry in the solution (suppressed input row types in the following) and input row types that do only have rows that remain unchanged in the solution (unsuppressed input row types in the following). We contributed to a refined analysis of the computational complexity of kAnonymity. Zhu. Aggarwal. Khuller. we prove fixed-parameter tractability for k-Anonymity with respect to the combined parameter (tout . s). The state of the art including several research challenges concerning natural parameterizations for k-Anonymity is surveyed in Table 1 in the introductory section. Thus.

V. Grohe. L. 2010. 52 pages. SECRYPT. L. Fard and K. 15. Springer. Feder. Comb. LNCS. Philip. and R. Springer. Resolving the complexity of some data privacy problems. Panigrahy. abs/1004. 2005. 1988. and S. 12. R. Achieving k-anonymity privacy protection using generalization and suppression. Komusiewicz. A. Reflections on multivariate algorithmics and problem parameterization. To appear. 23. Dwork. G. 20. J. 23rd ICDE.. pages 393–404. Blocki and R. 9. Fellows. and R. pages 17–32. 2010. and Y. K. Algorithms. Orlin. Downey and M. Trie memory. S. pages 106–115. SciTePress. Wang. G. Zhu. 2010. R. volume 6199 of LNCS. 13. B. Springer. Discrete Algorithms. T. 27th STACS. 2011. 2009.2. volume 3363 of LNCS. Sabharwal. D. Chakaravarthy. In Proc. Springer. Chaytor. and P. Evans. The NP-completeness column: An ongoing guide. IEEE. pages 223–228. On the complexity of the k-anonymization problem. 7. 22. Motwani. R. and G. 2010. G. Anonymizing binary and small tables is hard to approximate. 8. and M. Venkitasubramaniam. J. 10(5):557–570. Invitation to Fixed-Parameter Algorithms. Comb. In Proc. Bredereck. 17. A. 21st IWOCA. R. DellaÂăVedova. R. An effective clustering approach to web query log anonymization. A. Optim. Wang. Bonizzoni. In Proc. Niedermeier. IJUFKS. Sweeney. pages 242–255. Aggarwal. Gehrke. Privacy-preserving data publishing: A survey of recent developments. -diversity: Privacy beyond k-anonymity. 2006. C. ACM Trans. 23rd PODS. IJUFKS. Williams. 37th ICALP. Wareham. D. Pirola. Springer. pages 109–119. 24. Dondi. R. 22:97–119. 3. 2010. Machanavajjhala. 2007. 54:86–95. and A. 18. T. 2010. and Y. Anonymizing tables. M. 2011. J. IBFI Dagstuhl. Uhlmann. Knowl. R. J. A faster strongly polynomial minimum cost flow algorithm. 1. R. Niedermeier. M. ACM. On the complexity of optimal k-anonymity. Yu. Parameterized complexity of k-anonymity: Hardness and tractability. In Proc. 2007. pages 246–258. Johnson. Data. Fixed-parameter tractability of anonymizing data by suppressing entries. Kifer. Bonizzoni. 2011. 36th MFCS. 10(5):571–588. 14. V. 2006. G. J. Li. 42(4):14:1–14:53. volume 6460 of LNCS. Fung. CoRR. 8:438–448. Flum and M. Dondi. P. 12 . Oxford University Press. A. In Proc. P. Meyerson and R. Niedermeier. Commun. D. Chen. R. 9:137–151. 1999. 21.. ACM. In Proc. R. 3(9):490–499.. R. 2002. 19. 18(4):362–375. 10. J. Deconstructing intractability– A multivariate complexity analysis of interval constrained coloring. Pattern-guided data anonymization and clustering. T. k-anonymity: A model for protecting privacy. K. 1960. 6. Li. A firm foundation for private data analysis. Commun. 2004. ACM Comput. 11. Thomas. D. Sweeney. 5. In Proc. t-closeness: Privacy beyond k-anonymity and l-diversity. Kenthapadi. 1987. Nichterlein. Parameterized Complexity Theory.4729. J. E. Discov. In Proc. Parameterized Complexity. pages 377–387. 2011. 2002. Surv. Williams. Fredkin. In Proc. Optim. Niedermeier. Pandit. 20th STOC. ACM. and J. T. 4. P. 10th ICDT. A. J. C. Vedova. ACM. volume 5 of LIPIcs. Springer. C. S. N. 16. Venkatasubramanian.

Sign up to vote on this title
UsefulNot useful