# The Effect of Homogeneity on the Complexity of k-Anonymity

Robert Bredereck¹, André Nichterlein¹, Rolf Niedermeier¹, Geevarghese Philip²

¹ Institut für Softwaretechnik und Theoretische Informatik, TU Berlin, Germany
{robert.bredereck,andre.nichterlein,rolf.niedermeier}@tu-berlin.de

² The Institute of Mathematical Sciences, Chennai, India
gphilip@imsc.res.in

Abstract. The NP-hard k-Anonymity problem asks, given an n × m-matrix M over a fixed alphabet and an integer s > 0, whether M can be made k-anonymous by suppressing (blanking out) at most s entries. A matrix M is said to be k-anonymous if for each row r in M there are at least k − 1 other rows in M which are identical to r. Complementing previous work, we introduce two new “data-driven” parameterizations for k-Anonymity, the number $t_{in}$ of different input rows and the number $t_{out}$ of different output rows, both modeling aspects of data homogeneity. We show that k-Anonymity is fixed-parameter tractable for the parameter $t_{in}$, and it is NP-hard even for $t_{out} = 2$ and alphabet size four. Notably, our fixed-parameter tractability result implies that k-Anonymity can be solved in linear time when $t_{in}$ is a constant. Our results also extend to some interesting generalizations of k-Anonymity.

1 Introduction

Assume that data about individuals are represented by equal-length vectors consisting of attribute values. If all vectors are identical, then we have full homogeneity and thus full anonymity of all individuals. Relaxing full anonymity to k-anonymity, in this work we investigate how the degree of (in)homogeneity influences the computational complexity of the NP-hard problem of making sets of individuals k-anonymous.

Sweeney [24] devised the notion of k-anonymity to better quantify the degree of anonymity in sanitized data. This notion formalizes the intuition that entities who have identical sets of attributes cannot be distinguished from one another. For a positive integer k we say that a matrix M is k-anonymous if, for each row r in M, there are at least k − 1 other rows in M which are identical to r. Thus k-anonymity provides a concrete optimization goal while sanitizing data: choose a value of k which would satisfy the relevant privacy requirements, and then try to modify the matrix, “at minimum cost”, in such a way that it becomes k-anonymous. The corresponding decision problem k-Anonymity asks, additionally given an upper bound s for the number of suppressions allowed, whether a matrix can be made k-anonymous by suppressing (blanking out) at most s entries. While k-Anonymity is our central problem, our results also extend to several more general problems.

Supported by the DFG, research project PAWS, NI 369/10.

To appear in the Proceedings of the 18th International Symposium on Fundamentals of Computation Theory (FCT), Oslo, Norway, August 2011, Lecture Notes in Computer Science. © Springer

We focus on a better understanding of the computational complexity and on tractable special cases of these problems; see Machanavajjhala et al. [18] and Sweeney [24] for discussions on the pros and cons of these models in terms of privacy versus preservation of meaningful data. In particular, note that in the data privacy community “differential privacy” (cleverly adding some random noise) is now the most popular method [9,14]. However, k-Anonymity is a very natural combinatorial problem (with potential applications beyond data privacy), and, for instance in the case of “one-time anonymization”, may still be valuable for the sake of providing a simple model that does not introduce noise.

k-Anonymity and many related problems are NP-hard [19], even when the input matrix is highly restricted. For instance, it is APX-hard when k = 3, even when the alphabet size is just two [4]; NP-hard when k = 4, even when the number of columns in the input dataset is 8 [4]; MAX SNP-hard when k = 7, even when the number of columns in the input dataset is just three [7]; and MAX SNP-hard when k = 3, even when the number of columns in the input dataset is 27 [3].

Confronted with this computational hardness, we study the parameterized complexity of k-Anonymity as initiated by Evans et al. [10]. The central question here is how naturally occurring parameters influence the complexity of k-Anonymity. For example, is k-Anonymity polynomial-time solvable for constant values of k? The general answer is “no” since already 3-Anonymity is NP-hard [19], even on binary data sets [4]. Thus, k alone does not give a promising parameterization.¹

k-Anonymity has a number of meaningful parameterizations beyond k, including the number of rows n, the alphabet size $|\Sigma|$, the number of columns m, and, in the spirit of multivariate algorithmics [21], various combinations of single parameters. Here the arity of $|\Sigma|$ may range from binary (such as gender) to unbounded. For instance, answering an open question of Evans et al. [10], Bonizzoni et al. [5] recently showed that k-Anonymity is fixed-parameter tractable with respect to the combined parameter $(m, |\Sigma|)$, whereas there is no hope for fixed-parameter tractability with respect to the single parameters m and $|\Sigma|$ [10]. We emphasize that Bonizzoni et al. [5] made use of the fact that the value $|\Sigma|^m$ is an upper bound on the number of different input rows, thus implicitly exploiting a very rough upper bound on input homogeneity. Clearly, $|\Sigma|^m$ denotes the maximum possible number of different input rows. In this work, we refine this view by asking how the “degree of homogeneity” of the input matrix influences the complexity of k-Anonymity. In other words, is k-Anonymity fixed-parameter tractable for the parameter “number of different input rows”? In a similar vein, we also study the effect of the degree of homogeneity of the output matrix on the complexity of k-Anonymity. Table 1, which extends similar tables due to Evans et al. [10] and Bonizzoni et al. [5], summarizes known and new results.

¹ However, it has been shown that 2-Anonymity is polynomial-time solvable [3].

Table 1. The parameterized complexity of k-Anonymity. Results proved in this paper are in bold. The column and row entries represent parameters. For instance, the entry in row “–” and column “s” refers to the (parameterized) complexity for the single parameter s whereas the entry in row “m” and column “s” refers to the (parameterized) complexity for the combined parameter (s, m).

| | – | k | s | k, s |
|---|---|---|---|---|
| – | NP-hard [19] | NP-hard [19] | W[1]-hard [5] | W[1]-hard [5] |
| \|Σ\| | NP-hard [2] | NP-hard [2] | ? | ? |
| m | NP-hard [4] | NP-hard [4] | FPT [10] | FPT [10] |
| n | FPT [10] | FPT [10] | FPT [10] | FPT [10] |
| \|Σ\|, m | FPT [5] | FPT [10] | FPT [10] | FPT [10] |
| \|Σ\|, n | FPT [10] | FPT [10] | FPT [10] | FPT [10] |
| $t_{in}$ | **FPT** | **FPT** | **FPT** | **FPT** |
| $t_{out}$ | **NP-hard** | ? | **FPT** | **FPT** |

Our contributions. We introduce the “homogeneity parameters” $t_{in}$, the number of different input rows, and $t_{out}$, the number of different output rows, for studying the computational complexity of k-Anonymity and related problems. Typically, we expect $t_{in} < n$ and $t_{in} < |\Sigma|^m$. Indeed, $t_{in}$ is a “data-driven parameterization” in the sense that one can efficiently measure in advance the instance-specific value of $t_{in}$, whereas $|\Sigma|^m$ is a trivial upper bound for homogeneity.

First, we show that there is always an optimal solution (minimizing the number of suppressions) with $t_{out} \le t_{in}$. Then, we derive an algorithm that solves k-Anonymity in $O(nm + 2^{t_{in} t_{out}} \, t_{in} (t_{out} m + t_{in}^2 \log(t_{in})))$ time, which compares favorably with Bonizzoni et al.’s [5] algorithm running in $O(2^{(|\Sigma|+1)^m} k m n^2)$ time. Since $t_{out} \le t_{in}$, this shows that k-Anonymity is fixed-parameter tractable when parameterized by $t_{in}$. In particular, when $t_{in}$ is a constant, our algorithm solves k-Anonymity in time linear in the size of the input. In contrast, when only $t_{out}$ is fixed, we show that the problem remains NP-hard. More precisely, opposing the trivial case $t_{out} = 1$, we show that k-Anonymity is already NP-hard when $t_{out} = 2$, even when $|\Sigma| = 4$. We remark that $t_{out}$ is an interesting parameter since it is “stronger” than $t_{in}$ and since, interpreting k-Anonymity as a (meta-)clustering problem (of Aggarwal et al. [1]), $t_{out}$ may also be interpreted as the number of output “clusters”.

Finally, we mention that all our results extend to more general problems, including ℓ-Diversity [18] and “k-Anonymity with domain generalization hierarchies” [23]; we defer the corresponding details to a full version of the paper.

Preliminaries and basic observations. Our inputs are datasets in the form of n × m-matrices, where the n rows refer to the individuals and the m columns correspond to attributes with entries drawn from an alphabet Σ. Suppressing an entry M[i, j] of an n × m-matrix M over alphabet Σ with 1 ≤ i ≤ n and 1 ≤ j ≤ m means to simply replace M[i, j] ∈ Σ by the new symbol “$\star$”, ending up with a matrix over the alphabet $\Sigma \cup \{\star\}$. A row type is a string from $(\Sigma \cup \{\star\})^m$. We say that a row in a matrix has a certain row type if it coincides in all its entries with the row type. In what follows, synonymously, we sometimes also speak of a row “lying in a row type”, and that the row type “contains” this row. We say that a matrix is k-anonymous if every row type contains none or at least k rows in the matrix. A natural objective when trying to achieve k-anonymity is to minimize the number of suppressed matrix entries. For a row y (with some entries suppressed) in the output matrix, we call a row x in the input matrix the preimage of y if y is obtained from x by suppressing in x the $\star$-entry positions of y. The central problem of this work reads as follows.

k-Anonymity

Input: An n × m-matrix M and nonnegative integers k, s.

Question: Can at most s elements of M be suppressed to obtain a k-anonymous matrix $M'$?
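The definitions above can be illustrated with a minimal Python sketch. It is not part of the paper: the function names, the tuple-based row representation, and the use of `"*"` as a stand-in for the suppression symbol are assumptions made for illustration only.

```python
from collections import Counter

STAR = "*"  # hypothetical stand-in for the suppression symbol


def is_k_anonymous(matrix, k):
    """True if every row type that occurs in the matrix contains at least k rows."""
    counts = Counter(tuple(row) for row in matrix)
    return all(c >= k for c in counts.values())


def count_suppressions(original, anonymized):
    """Count entries replaced by STAR; any other change is rejected."""
    total = 0
    for row_in, row_out in zip(original, anonymized):
        for x, y in zip(row_in, row_out):
            if x != y:
                if y != STAR:
                    raise ValueError("entries may only be replaced by the suppression symbol")
                total += 1
    return total


# A 4 x 2 instance with k = 2: suppressing two entries suffices.
M = [("a", "x"), ("a", "y"), ("b", "z"), ("b", "z")]
M_out = [("a", STAR), ("a", STAR), ("b", "z"), ("b", "z")]
print(is_k_anonymous(M, 2), is_k_anonymous(M_out, 2), count_suppressions(M, M_out))
```

A pair (M, M_out) certifies a YES-instance for given k and s exactly when `is_k_anonymous(M_out, k)` holds and `count_suppressions(M, M_out)` is at most s.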

Our algorithmic results mostly rely on concepts of parameterized algorithmics [8,12,20]. The fundamental idea herein is, given a computationally hard problem X, to identify a parameter p (typically a positive integer or a tuple of positive integers) for X and to determine whether a size-n input instance of X can be solved in $f(p) \cdot n^{O(1)}$ time, where f is an arbitrary computable function. If this is the case, then one says that X is fixed-parameter tractable for the parameter p. The corresponding complexity class is called FPT. If X could only be solved in polynomial running time where the degree of the polynomial depends on p (such as $n^{O(p)}$), then, for parameter p, problem X only lies in the parameterized complexity class XP.

We study two new parameters $t_{in}$ and $t_{out}$, where $t_{in}$ denotes the number of input row types in the given matrix and $t_{out}$ denotes the number of row types in the output k-anonymous matrix. Note that, using sorting, $t_{in}$ can be efficiently determined for a given matrix. To keep the treatment simple, we assume that $t_{out}$ is a user-specified number which bounds the maximum number of output types. Clearly, one could have variations on this: for instance, one could ask to minimize $t_{out}$ with at most a given number s of suppressions. We refrain from further exploration here and treat $t_{out}$ as part of the input specifying an upper bound on the number of output types.² The following lemma says that without loss of generality one may assume $t_{out} \le t_{in}$.

Lemma 1. Let (M, k, s) be a YES-instance of k-Anonymity. If M has $t_{in}$ row types, then there exists a k-anonymous matrix $M'$ with at most $t_{in}$ row types which can be obtained from M by suppressing at most s elements.

2 Parameter $t_{in}$

In this section we show that k-Anonymity is fixed-parameter tractable with respect to the parameter number $t_{in}$ of input row types. Since $t_{in} \le |\Sigma|^m$ and $t_{in} \le n$, k-Anonymity is also fixed-parameter tractable with respect to the combined parameter $(m, |\Sigma|)$ and the single parameter n. Both these latter results were already proved quite recently; Bonizzoni et al. [5] demonstrated fixed-parameter tractability for the parameter $(m, |\Sigma|)$, and Evans et al. [10] showed the same for the parameter n. Besides achieving a fixed-parameter tractability result for a more general and typically smaller parameter, we improve their results by giving a simpler algorithm with a (usually) better running time.

² Indeed, interpreting k-Anonymity as a clustering problem with a guarantee k for the minimum cluster size (as in the work by Aggarwal et al. [1]), $t_{out}$ can also be seen as the number of “clusters” (that is, output row types) that are built.

Algorithm 1 Pseudo-code for solving k-Anonymity. The function solveRowAssignment solves Row Assignment in polynomial time, see Lemma 2.

     1: procedure solveKAnonymity(M, k, s, t_out)
     2:     Determine the row types R_1, ..., R_{t_in}                (Phase 1, Step 1)
     3:     for each possible A ∈ {0, 1}^{t_in × t_out} do            (Phase 1, Step 2)
     4:         for j ← 1 to t_out do                                 (Phase 1, Step 3)
     5:             if A[1, j] = A[2, j] = ... = A[t_in, j] = 0 then
     6:                 delete empty output row type R'_j
     7:                 decrease t_out by one
     8:             else
     9:                 Determine all entries of R'_j
    10:         if solveRowAssignment then                            (Phase 2)
    11:             return ‘YES’
    12:     return ‘NO’

Let (M, k, s) be an instance of k-Anonymity, and let $M'$ be the (unknown) k-anonymous matrix which we seek to obtain from M by suppressing at most s elements. Our fixed-parameter algorithm works in two phases. In the first phase, the algorithm guesses the entries of each row type $R'_j$ in $M'$. In the second phase, the algorithm computes an assignment of the rows of M to the row types $R'_j$ in $M'$; see Algorithm 1 for an outline.

We now explain the two phases in detail, beginning with Phase 1. To determine the row types $R_i$ of M (line 2 in Algorithm 1), the algorithm constructs a trie [13] on the rows of M. The leaves of the trie correspond to the row types of M. For later use, the algorithm also keeps track of the numbers $n_1, n_2, \ldots, n_{t_{in}}$ of each type of row that is present in M; this can clearly be done by keeping a counter at each leaf of the trie and incrementing it by one whenever a new row matches the path to a leaf. All of this can be done in a single pass over M.
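Step 1 of Phase 1 can be sketched in a few lines of Python. As a simplification that is not the paper's data structure, the sketch swaps the trie for a hash map keyed by the row tuple; the function name `row_types` is hypothetical.

```python
def row_types(matrix):
    """Map each distinct row (as a tuple) to its multiplicity in one pass over the matrix."""
    counts = {}
    for row in matrix:
        key = tuple(row)
        counts[key] = counts.get(key, 0) + 1
    return counts


M = [["a", "x"], ["a", "y"], ["b", "z"], ["b", "z"]]
counts = row_types(M)
t_in = len(counts)              # number of input row types
n_counts = list(counts.values())  # the numbers n_1, ..., n_{t_in}
print(t_in, n_counts)
```

With hashing the single pass takes expected time proportional to the matrix size, whereas the trie used in the text gives a worst-case guarantee.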

For implementing the guess in Step 2 of Phase 1, the algorithm goes over all binary matrices of dimension $t_{in} \times t_{out}$; such a matrix A is interpreted as follows: A row of type $R_i$ is mapped³ to a row of type $R'_j$ if and only if A[i, j] = 1 (see line 3). Note that we allow in our guessing step an output type to contain no row of any input row type. These “empty” output row types are deleted. Hence, with our guessing in Step 2, we guess not only output matrices $M'$ with exactly $t_{out}$ types, but also matrices $M'$ with at most $t_{out}$ types.

³ Note that not all rows of an input type need to be mapped to the same output type.
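The guessing step can be sketched as a plain enumeration. This is an illustrative Python fragment under stated assumptions: the names `guess_matrices` and `nonempty_columns` are hypothetical, indices are 0-based, and both dimensions are assumed to be at least one.

```python
from itertools import product


def guess_matrices(t_in, t_out):
    """Yield every binary t_in x t_out matrix A as a tuple of row tuples."""
    for bits in product((0, 1), repeat=t_in * t_out):
        yield tuple(bits[i * t_out:(i + 1) * t_out] for i in range(t_in))


def nonempty_columns(A):
    """Indices j of output types that receive rows from at least one input type."""
    t_out = len(A[0])
    return [j for j in range(t_out) if any(row[j] for row in A)]


# There are 2^(t_in * t_out) candidate matrices, here 2^4.
print(sum(1 for _ in guess_matrices(2, 2)))
```

Dropping the columns not returned by `nonempty_columns` corresponds to deleting the “empty” output row types described above.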

Now the algorithm computes the entries of each row type $R'_j$, $1 \le j \le t_{out}$, of $M'$ (Step 3 of Phase 1). Assume for ease of notation that $R_1, \ldots, R_\ell$ are the row types of M which contribute (according to the guessing) at least one row to the (as yet unknown) output row type $R'_j$. Now, for each $1 \le i \le m$, if $R_1[i] = R_2[i] = \ldots = R_\ell[i]$, then set $R'_j[i] := R_1[i]$; otherwise, set $R'_j[i] := \star$. This yields the entries of the output type $R'_j$, and the number $\omega_j$ of suppressions required to convert any input row (if possible) to the type $R'_j$ is the number of $\star$-entries in $R'_j$.
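The column-wise rule of Step 3 can be sketched as follows. The Python below is an illustration only: `output_row_type` is a hypothetical name, `"*"` stands in for the suppression symbol, and the input row types are assumed to contain no suppressed entries.

```python
STAR = "*"  # hypothetical stand-in for the suppression symbol


def output_row_type(contributing_types):
    """Keep a column's symbol if all contributing input types agree on it, else suppress it.

    Returns the output row type and its number of suppressed entries (omega_j).
    """
    first = contributing_types[0]
    out = tuple(
        first[i] if all(r[i] == first[i] for r in contributing_types) else STAR
        for i in range(len(first))
    )
    return out, out.count(STAR)


print(output_row_type([("a", "b", "c"), ("a", "x", "c")]))
```

Each call inspects every column of every contributing type once, which matches the per-guess cost of order $t_{in} \cdot t_{out} \cdot m$ for all output types together.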

The guessing in Step 2 of Phase 1 takes time exponential in the parameter $t_{in}$, but Phase 2 can be done in polynomial time. To show this, we prove that Row Assignment is polynomial-time solvable. We do this in the next lemma, after formally defining the Row Assignment problem. To this end, we use the two sets $T_{in} = \{1, \ldots, t_{in}\}$ and $T_{out} = \{1, \ldots, t_{out}\}$.

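Row Assignment

Input: Nonnegative integers $k, s, \omega_1, \ldots, \omega_{t_{out}}$ and $n_1, \ldots, n_{t_{in}}$ with $\sum_{i=1}^{t_{in}} n_i = n$, and a function $a : T_{in} \times T_{out} \to \{0, 1\}$.

Question: Is there a function $g : T_{in} \times T_{out} \to \{0, \ldots, n\}$ such that

$$a(i, j) \cdot n \ge g(i, j) \qquad \forall i \in T_{in} \ \forall j \in T_{out} \tag{1}$$

$$\sum_{i=1}^{t_{in}} g(i, j) \ge k \qquad \forall j \in T_{out} \tag{2}$$

$$\sum_{j=1}^{t_{out}} g(i, j) = n_i \qquad \forall i \in T_{in} \tag{3}$$

$$\sum_{i=1}^{t_{in}} \sum_{j=1}^{t_{out}} g(i, j) \cdot \omega_j \le s \tag{4}$$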

Row Assignment formally defines the remaining problem in Phase 2: At this stage of the algorithm the input row types $R_1, \ldots, R_{t_{in}}$ and the number of rows $n_1, \ldots, n_{t_{in}}$ in these input row types are known. The algorithm has also computed the output row types $R'_1, \ldots, R'_{t_{out}}$ and the number of suppressions $\omega_1, \ldots, \omega_{t_{out}}$ in these output row types. Now, the algorithm computes an assignment of the rows of the input row types to output row types such that:

– The assignment of the rows respects the guessing in Step 2 of Phase 1. This is secured by Inequality (1).

– $M'$ is k-anonymous, that is, each output row type contains at least k rows. This is secured by Inequality (2).

– All rows of each input row type are assigned. This is secured by Equation (3).

– The total cost of the assignment is at most s. This is secured by Inequality (4).

Note that in the definition of Row Assignment no row type occurs and, hence, the problem is independent of the specific entries of the input or output row types.
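The four conditions can be checked directly for a candidate assignment. The following Python sketch is illustrative and only verifies a given function g rather than finding one; the name `is_valid_assignment`, the nested-list representation of a and g, and the 0-based indices are assumptions.

```python
def is_valid_assignment(g, a, n_counts, omega, k, s):
    """Check conditions (1)-(4) of Row Assignment for a candidate g (nested lists)."""
    t_in, t_out = len(n_counts), len(omega)
    n = sum(n_counts)
    # g must map into {0, ..., n}.
    if any(not 0 <= g[i][j] <= n for i in range(t_in) for j in range(t_out)):
        return False
    # (1) rows may only be assigned along guessed pairs (a[i][j] == 1).
    if any(g[i][j] > a[i][j] * n for i in range(t_in) for j in range(t_out)):
        return False
    # (2) every output row type receives at least k rows.
    if any(sum(g[i][j] for i in range(t_in)) < k for j in range(t_out)):
        return False
    # (3) all rows of every input row type are assigned.
    if any(sum(g[i]) != n_counts[i] for i in range(t_in)):
        return False
    # (4) the total number of suppressions is at most s.
    return sum(g[i][j] * omega[j] for i in range(t_in) for j in range(t_out)) <= s


# Three input types with 1, 1, 2 rows; two output types with omega = 1 and 0.
a = [[1, 0], [1, 0], [0, 1]]
g = [[1, 0], [1, 0], [0, 2]]
print(is_valid_assignment(g, a, [1, 1, 2], [1, 0], k=2, s=2))
```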

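[Figure: nodes $v_1, \ldots, v_5$ with demands $-n_1, \ldots, -n_5$, nodes $u_1, \ldots, u_4$ with demand k each, and a sink t with demand $\sum_{i=1}^{t_{in}} n_i - k \cdot t_{out}$; arcs into $u_j$ carry cost $\omega_j$ and arcs into t carry cost 0.]

Fig. 1. Example of the constructed network with $t_{in} = 5$ and $t_{out} = 4$. The number on each arc denotes its cost. The number next to each node denotes its demand.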

Lemma 2. Row Assignment can be solved in $O(t_{in}^3 \log(t_{in}))$ time.

Proof (Sketch). We reduce Row Assignment to the Uncapacitated Minimum Cost Flow problem, which is defined as follows [22]:

Uncapacitated Minimum Cost Flow

Input: A network (directed graph) D = (V, A) with demands $b : V \to \mathbb{Z}$ on the nodes and costs $c : V \times V \to \mathbb{N}$.

Task: Find a function f which minimizes $\sum_{(u,v) \in A} c(u, v) \cdot f(u, v)$ and satisfies:

$$\sum_{\{v \mid (u,v) \in A\}} f(u, v) - \sum_{\{v \mid (v,u) \in A\}} f(v, u) = b(u) \qquad \forall u \in V$$

$$f(u, v) \ge 0 \qquad \forall (u, v) \in A$$

We first describe the construction of the network with demands and costs. For each $n_i$, $1 \le i \le t_{in}$, add a node $v_i$ with demand $-n_i$ (that is, a supply of $n_i$) and for each $\omega_j$ add a node $u_j$ with demand k. If a(i, j) = 1, then add an arc $(v_i, u_j)$ with cost $\omega_j$. Finally, add a sink t with demand $(\sum_{i=1}^{t_{in}} n_i) - k \cdot t_{out}$ and, for each $1 \le j \le t_{out}$, the arc $(u_j, t)$ with cost zero. See Figure 1 for an example of the construction. Note that, although the arc capacities are unbounded, the maximum flow over one arc is implicitly bounded by n because the sum of all supplies is $\sum_{i=1}^{t_{in}} n_i = n$.
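The construction can be written down as plain data. The Python sketch below only builds the network and does not solve the flow problem; the node labels `("v", i)`, `("u", j)`, and `"t"`, the dictionary layout, and the 0-based indices are illustrative assumptions.

```python
def build_network(a, n_counts, omega, k):
    """Return (demand, cost): node demands and arc costs of the flow network."""
    t_in, t_out = len(n_counts), len(omega)
    demand, cost = {}, {}
    for i in range(t_in):
        demand[("v", i)] = -n_counts[i]          # supply of n_i rows
    for j in range(t_out):
        demand[("u", j)] = k                     # each output type keeps k rows
        cost[(("u", j), "t")] = 0                # surplus rows continue to the sink
    for i in range(t_in):
        for j in range(t_out):
            if a[i][j] == 1:
                cost[(("v", i), ("u", j))] = omega[j]  # omega_j suppressions per row
    demand["t"] = sum(n_counts) - k * t_out      # the sink absorbs what is left
    return demand, cost


a = [[1, 0], [1, 0], [0, 1]]
demand, cost = build_network(a, [1, 1, 2], [1, 0], k=2)
print(sum(demand.values()), len(cost))
```

The demands always sum to zero, which is necessary for a feasible flow to exist; the instance is a YES-instance of Row Assignment precisely when a feasible flow of cost at most s exists.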

The Uncapacitated Minimum Cost Flow problem is solvable in $O(|V| \log(|V|)(|A| + |V| \log(|V|)))$ time in a network (directed graph) D = (V, A) [22]. Since our constructed network has $t_{in} + t_{out} + 1$ nodes and at most $t_{in} t_{out} + t_{out}$ arcs, we can solve our Uncapacitated Minimum Cost Flow-instance in $O((t_{in} + t_{out}) \log(t_{in} + t_{out})(t_{in} t_{out} + (t_{in} + t_{out}) \log(t_{in} + t_{out})))$ time. Since, by Lemma 1, $t_{in} \ge t_{out}$, the running time is $O(t_{in}^3 \log(t_{in}))$. ∎

Putting all these together, we arrive at the following theorem:

Theorem 1. k-Anonymity can be solved in $O(nm + 2^{t_{in} t_{out}} \, t_{in} (t_{out} m + t_{in}^2 \log(t_{in})))$ time, and so, in $O(nm + 2^{t_{in}^2} \, t_{in}^2 (m + t_{in} \log(t_{in})))$ time.
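The second bound of Theorem 1 follows from the first by substituting $t_{out} \le t_{in}$ (Lemma 1). The short derivation below spells this step out; it is added for illustration and uses only the quantities from the theorem.

```latex
\begin{align*}
2^{t_{in} t_{out}} \, t_{in} \left( t_{out}\, m + t_{in}^2 \log t_{in} \right)
  &\le 2^{t_{in}^2} \, t_{in} \left( t_{in}\, m + t_{in}^2 \log t_{in} \right) \\
  &= 2^{t_{in}^2} \, t_{in}^2 \left( m + t_{in} \log t_{in} \right).
\end{align*}
```

The additive term nm accounts for the single pass over M in Step 1 of Phase 1 and is unaffected by the substitution.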

We remark that the described algorithm can be modified to use domain generalization hierarchies (DGH) [23] or to solve ℓ-Diversity [18]. We defer the details to a full version of the paper.

We end this section by comparing our algorithmic results to the closely related ones by Bonizzoni et al. [5]. They presented an algorithm for k-Anonymity with a running time of $O(2^{(|\Sigma|+1)^m} k m n^2)$ which works, similarly to our algorithm, in two phases: First, their algorithm guesses all possible output row types together with their entries in $O(2^{(|\Sigma|+1)^m})$ time. In Phase 1 our algorithm guesses the output row types producible from M within $O(2^{t_{in} t_{out}} \, t_{in} t_{out} m + mn)$ time using a different approach. Note that, in general, $t_{in}$ is much smaller than the number $|\Sigma|^m$ of all possible different input types. Hence, in general the guessing step of our algorithm is faster. For instances where $|\Sigma|^m \le t_{in} t_{out}$, one can guess the output types like Bonizzoni et al. in $O(2^{(|\Sigma|+1)^m})$ time.

Next, we compare Phase 2 of our algorithm to the second step of Bonizzoni et al.’s algorithm. In both algorithms, the same problem Row Assignment is solved. Bonizzoni et al. did this using maximum matching on a bipartite graph with O(n) nodes, while we do it using a flow network with $O(t_{in})$ nodes. Consequently, the running time of our approach depends only on $t_{in}$, and its proof of correctness is arguably simpler.

As we mentioned above, our algorithm can easily be modified to solve k-Anonymity using DGHs with no significant increase in the running time. Since in the DGH setting the alphabet size $|\Sigma|$ increases, it is not immediately clear how the algorithm due to Bonizzoni et al. can be modified to solve this problem without an exponential increase in the running time. Finally, when the parameters $t_{in}$ and $(|\Sigma|, m)$ are constants, then Bonizzoni et al.’s algorithm runs in $O(k m n^2)$ time while our algorithm runs in linear time O(mn).

3 Parameter $t_{out}$

There is a close relationship between k-Anonymity and clustering problems where one is also interested in grouping together similar objects. Such a relationship has already been observed in related work [1,6,11]. The clustering view on k-Anonymity makes the number $t_{out}$ of output types (corresponding to the number of clusters) of a k-anonymous matrix an interesting parameter.

There is also a more algorithmic motivation for investigating the (parameterized) complexity of k-Anonymity for the parameter $t_{out}$. As we saw in Theorem 1, k-Anonymity is fixed-parameter tractable for the parameter $t_{in}$. Due to Lemma 1, we know that $t_{out}$ is a stronger parameter than $t_{in}$, in the sense that $t_{out} \le t_{in}$. Hence, it is a natural question whether k-Anonymity is already fixed-parameter tractable with respect to the number of output types. Answering this in the negative, we show that k-Anonymity is NP-hard even when there are only two output types and the alphabet has size four, destroying any hope for fixed-parameter tractability already for the combined parameter “number of output types and alphabet size”. The hardness proof uses a polynomial-time many-one reduction from Balanced Complete Bipartite Subgraph:

Balanced Complete Bipartite Subgraph (BCBS)

Input: A bipartite graph G = (V, E) and an integer k ≥ 1.

Question: Is there a complete bipartite subgraph of G whose partition classes

are of size at least k each?

BCBS is NP-complete [15]. We provide a polynomial-time many-one reduction from a special case of BCBS with a balanced input graph, that is, a bipartite graph that can be partitioned into two independent sets of the same size, and k = |V|/4. This special case clearly is also NP-complete by a simple reduction from the general BCBS problem: Let V := A ∪ B with A and B being the two vertex partition classes of the input graph. When the input graph is not balanced, that is, |A| − |B| ≠ 0, add ||A| − |B|| isolated vertices to the partition class of smaller size. If k < |V|/4, then repeat the following until k = |V|/4: Add one vertex to A and make it adjacent to each vertex from B and add one vertex to B and make it adjacent to each vertex from A. If k > |V|/4, then add k − |V|/4 isolated vertices to each of A and B.

We devise a reduction from BCBS with balanced input graph and k = |V|/4 to show the NP-hardness of k-Anonymity.

Theorem 2. k-Anonymity is NP-complete for two output row types and alphabet size four.

Proof (Sketch). We only have to prove NP-hardness, since containment in NP is clear. Let (G, n/4) be a BCBS-instance with G = (V, E) being a balanced bipartite graph and n := |V|. Let $A = \{a_1, a_2, \ldots, a_{n/2}\}$ and $B = \{b_1, b_2, \ldots, b_{n/2}\}$ be the two vertex partition classes of G. We construct an (n/4 + 1)-Anonymity-instance that is a YES-instance if and only if (G, n/4) ∈ BCBS. To this end, the main idea is to use a matrix expressing the adjacencies between A and B as the input matrix. Partition class A corresponds to the rows and partition class B corresponds to the columns. The salient points of our reduction are as follows.

1. By making a matrix with 2x rows x-anonymous we ensure that there are at most two output types. One of the types, the solution type, corresponds to the solution set of the original instance: Solution set vertices from A are represented by rows that are preimages of rows in the solution type and solution set vertices from B are represented by columns in the solution type that are not suppressed.

2. We add one row that contains the □-symbol in each entry. Since the □-symbol is not used in any other row, this enforces the other output type to be fully suppressed, that is, each column is suppressed.

3. Since the rows in the solution type have to agree on the columns that are not suppressed, we have to ensure that they agree on adjacencies to model BCBS. This is done by using two different types of 0-symbols representing non-adjacency. The 1-symbol represents adjacency.

The matrix D is described in the following and illustrated in Figure 2. There is one row for each vertex in A and one column for each vertex in B. The value in the i-th column of the j-th row is 1 if $a_j$ is adjacent to $b_i$ and, otherwise, $0_1$ if $j \le n/4$ and $0_2$ if $j > n/4$.⁴ Additionally, there are two further rows, one containing only 1s and one containing only □-symbols. The number of allowed suppressions is $s := (n/4 + 1) \cdot n/2 + (n/4 + 1) \cdot n/4$. This completes the construction.

⁴ We assume without loss of generality that n is divisible by four.

[Figure: matrix with rows $a_1, \ldots, a_{n/2}$ and columns $b_1, \ldots, b_{n/2}$; non-adjacency is written $0_1$ in rows $a_1, \ldots, a_{n/4}$ and $0_2$ in rows $a_{n/4+1}, \ldots, a_{n/2}$; the additional rows follow below.]

Fig. 2. Typical structure of the matrix D of the k-Anonymity instance obtained by the reduction from Balanced Complete Bipartite Subgraph.

It remains to show that (G, n/4) is a YES-instance of BCBS if and only if the constructed matrix can be transformed into an (n/4 + 1)-anonymous matrix by suppressing at most s elements; this part of the proof is deferred to a full version of the paper. ∎
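The construction of the k-Anonymity instance from a balanced bipartite graph can be sketched as follows. This Python fragment is an illustration, not the authors' code: the function name `build_instance`, the Boolean adjacency-matrix input, the strings `"0_1"` and `"0_2"`, and the placeholder `"#"` for the fourth alphabet symbol are all assumptions.

```python
FRESH = "#"  # hypothetical stand-in for the fourth symbol, used only in the extra row


def build_instance(adjacent):
    """adjacent[j][i] is True iff a_{j+1} is adjacent to b_{i+1}; |A| = |B| = n/2.

    Returns (rows, k, s) of the constructed k-Anonymity instance.
    """
    half = len(adjacent)              # n/2
    n = 2 * half
    if n % 4 != 0:
        raise ValueError("n is assumed to be divisible by four")
    quarter = n // 4
    rows = []
    for j in range(half):
        # 0-based j: the first n/4 rows use 0_1 for non-adjacency, the rest use 0_2.
        zero = "0_1" if j < quarter else "0_2"
        rows.append(tuple("1" if adjacent[j][i] else zero for i in range(half)))
    rows.append(("1",) * half)        # the all-1 row
    rows.append((FRESH,) * half)      # the row over the fresh symbol
    k = quarter + 1
    s = (quarter + 1) * half + (quarter + 1) * quarter
    return rows, k, s


rows, k, s = build_instance([[True, False], [False, True]])
print(len(rows), k, s)
```

The instance has n/2 + 2 = 2k rows, so any k-anonymous output has at most two row types, as used in the first point of the reduction.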

Deconstructing intractability. In the remainder of this section we briefly discuss the NP-hardness proof for k-Anonymity in the spirit of “deconstructing intractability” [16,21]. In our reduction the alphabet size $|\Sigma|$ and the number $t_{out}$ of output row types are constants whereas the number n of rows, the number m of attributes, the number s of suppressions, and the anonymity quality k are unbounded. This suggests a study of the computational complexity of those cases where at least one of these quantities is bounded. Some of the corresponding parameterizations have already been investigated, see Table 1 in Section 1. While for parameters $(|\Sigma|, m)$, $(|\Sigma|, n)$, and n, k-Anonymity is fixed-parameter tractable, it is open whether combining $t_{out}$ with m, k, or s helps to obtain fixed-parameter tractability. In particular, the parameterized complexity for the combined parameter $(|\Sigma|, s, k)$ is still open. In contrast, k-Anonymity is W[1]-hard for (s, k) [5], that is, it is presumably fixed-parameter intractable for this combined parameter.

Whereas k-Anonymity is NP-hard for constant m and unbounded $t_{out}$, one can easily construct an XP-algorithm with respect to the combined parameter $(t_{out}, m)$: In $O(2^{m \cdot t_{out}} \cdot m \cdot t_{out})$ time guess the suppressed columns for all output row types. Then, guess in $n^{O(t_{out})}$ time one prototype for each output row type, that is, one input row that is a preimage of a row from the output row type. Now, knowing the entries for each output row, one can simply apply the Row Assignment algorithm from Section 2:

Proposition 1. k-Anonymity parameterized by $(t_{out}, m)$ is in XP.

Next, we prove fixed-parameter tractability for k-Anonymity with respect to the combined parameter $(t_{out}, s)$ by showing that the number $t_{in}$ of input types is at most $t_{out} + s$. To this end, consider a feasible solution for an arbitrary k-Anonymity instance. We distinguish between input row types that have rows with at least one suppressed entry in the solution (suppressed input row types in the following) and input row types all of whose rows remain unchanged in the solution (unsuppressed input row types in the following). Clearly, every unsuppressed input row type needs at least one unsuppressed output row type. Thus, the number of unsuppressed input row types cannot exceed $t_{out}$. Furthermore, the number of rows that have at least one suppressed entry is at most s. Hence the number of suppressed input row types is also at most s. It follows that $t_{in} \le t_{out} + s$. Now, fixed-parameter tractability follows from Theorem 1:

Proposition 2. k-Anonymity is fixed-parameter tractable with respect to the combined parameter $(t_{out}, s)$.

However, to achieve a better running time one might want to develop a direct fixed-parameter algorithm for $(t_{out}, s)$. Finally, we conjecture that an XP-algorithm for k-Anonymity can be achieved with respect to the combined parameter $(t_{out}, k)$.

4 Conclusion

This paper adopts a data-driven approach towards the design of (exact) algorithms for k-Anonymity and related problems. More specifically, the parameter $t_{in}$ measures an easy-to-determine input property. Our central message here is that if $t_{in}$ is small or even constant, then k-Anonymity and its related problems are efficiently solvable, for constant $t_{in}$ even in linear time. On the contrary, already for two output types k-Anonymity becomes computationally intractable.

We contributed to a refined analysis of the computational complexity of k-Anonymity. The state of the art including several research challenges concerning natural parameterizations for k-Anonymity is surveyed in Table 1 in the introductory section. Besides running time improvements in general and open questions indicated in Table 1 such as the parameterized complexity of k-Anonymity for the combined parameter “number of suppressions plus alphabet size”, it would also be interesting to determine whether our fixed-parameter tractability result for k-Anonymity with respect to the parameter $t_{in}$ not only extends to the already mentioned generalizations of k-Anonymity but also to the (in terms of privacy) more restrictive t-Closeness problem [17].

References

1. G. Aggarwal, T. Feder, K. Kenthapadi, S. Khuller, R. Panigrahy, D. Thomas, and A. Zhu. Achieving anonymity via clustering. ACM Trans. Algorithms, 6(3):1–19, 2010.
2. G. Aggarwal, T. Feder, K. Kenthapadi, R. Motwani, R. Panigrahy, D. Thomas, and A. Zhu. Anonymizing tables. In Proc. 10th ICDT, volume 3363 of LNCS, pages 246–258. Springer, 2005.
3. J. Blocki and R. Williams. Resolving the complexity of some data privacy problems. In Proc. 37th ICALP, volume 6199 of LNCS, pages 393–404. Springer, 2010.
4. P. Bonizzoni, G. Della Vedova, and R. Dondi. Anonymizing binary and small tables is hard to approximate. J. Comb. Optim., 22:97–119, 2011.
5. P. Bonizzoni, G. Della Vedova, R. Dondi, and Y. Pirola. Parameterized complexity of k-anonymity: Hardness and tractability. In Proc. 21st IWOCA, volume 6460 of LNCS, pages 242–255. Springer, 2010.
6. R. Bredereck, A. Nichterlein, R. Niedermeier, and G. Philip. Pattern-guided data anonymization and clustering. In Proc. 36th MFCS, LNCS. Springer, 2011. To appear.
7. V. T. Chakaravarthy, V. Pandit, and Y. Sabharwal. On the complexity of the k-anonymization problem. CoRR, abs/1004.4729, 2010.
8. R. G. Downey and M. R. Fellows. Parameterized Complexity. Springer, 1999.
9. C. Dwork. A firm foundation for private data analysis. Commun. ACM, 54:86–95, 2011.
10. P. A. Evans, T. Wareham, and R. Chaytor. Fixed-parameter tractability of anonymizing data by suppressing entries. J. Comb. Optim., 18(4):362–375, 2009.
11. A. M. Fard and K. Wang. An effective clustering approach to web query log anonymization. In Proc. SECRYPT, pages 109–119. SciTePress, 2010.
12. J. Flum and M. Grohe. Parameterized Complexity Theory. Springer, 2006.
13. E. Fredkin. Trie memory. Commun. ACM, 3(9):490–499, 1960.
14. B. C. M. Fung, K. Wang, R. Chen, and P. S. Yu. Privacy-preserving data publishing: A survey of recent developments. ACM Comput. Surv., 42(4):14:1–14:53, 2010.
15. D. S. Johnson. The NP-completeness column: An ongoing guide. J. Algorithms, 8:438–448, 1987.
16. C. Komusiewicz, R. Niedermeier, and J. Uhlmann. Deconstructing intractability – A multivariate complexity analysis of interval constrained coloring. J. Discrete Algorithms, 9:137–151, 2011.
17. N. Li, T. Li, and S. Venkatasubramanian. t-closeness: Privacy beyond k-anonymity and l-diversity. In Proc. 23rd ICDE, pages 106–115. IEEE, 2007.
18. A. Machanavajjhala, D. Kifer, J. Gehrke, and M. Venkitasubramaniam. ℓ-diversity: Privacy beyond k-anonymity. ACM Trans. Knowl. Discov. Data, 1, 2007. 52 pages.
19. A. Meyerson and R. Williams. On the complexity of optimal k-anonymity. In Proc. 23rd PODS, pages 223–228. ACM, 2004.
20. R. Niedermeier. Invitation to Fixed-Parameter Algorithms. Oxford University Press, 2006.
21. R. Niedermeier. Reflections on multivariate algorithmics and problem parameterization. In Proc. 27th STACS, volume 5 of LIPIcs, pages 17–32. IBFI Dagstuhl, 2010.
22. J. Orlin. A faster strongly polynomial minimum cost flow algorithm. In Proc. 20th STOC, pages 377–387. ACM, 1988.
23. L. Sweeney. Achieving k-anonymity privacy protection using generalization and suppression. IJUFKS, 10(5):571–588, 2002.
24. L. Sweeney. k-anonymity: A model for protecting privacy. IJUFKS, 10(5):557–570, 2002.
