DOI 10.1007/s10618-008-0119-9
FRAPP: a framework for high-accuracy
privacy-preserving mining
Shipra Agrawal · Jayant R. Haritsa ·
B. Aditya Prakash
Received: 16 July 2006 / Accepted: 26 September 2008 / Published online: 27 October 2008
Springer Science+Business Media, LLC 2008
Abstract To preserve client privacy in the data mining process, a variety of
techniques based on random perturbation of individual data records have been proposed
recently. In this paper, we present FRAPP, a generalized matrix-theoretic framework
of random perturbation, which facilitates a systematic approach to the design of
perturbation mechanisms for privacy-preserving mining. Specifically, FRAPP is used
to demonstrate that (a) the prior techniques differ only in their choices for the
perturbation matrix elements, and (b) a symmetric positive-definite perturbation matrix
with minimal condition number can be identified, substantially enhancing the accuracy
even under strict privacy requirements. We also propose a novel perturbation mechanism
wherein the matrix elements are themselves characterized as random variables,
and demonstrate that this feature provides significant improvements in privacy at
only a marginal reduction in accuracy. The quantitative utility of FRAPP, which is
a general-purpose random-perturbation-based privacy-preserving mining technique,
is evaluated specifically with regard to association and classification rule mining on
a variety of real datasets. Our experimental results indicate that, for a given privacy
requirement, either substantially lower modeling errors are incurred as compared to
the prior techniques, or the errors are comparable to those of direct mining on the true
database.
Keywords Privacy · Data mining
Responsible editor: Johannes Gehrke.
A partial and preliminary version of this paper appeared in the Proc. of the 21st IEEE Intl. Conf. on Data
Engineering (ICDE), Tokyo, Japan, 2005, pp. 193–204.
S. Agrawal · J. R. Haritsa (B)
Indian Institute of Science, Bangalore 560012, India
e-mail: haritsa@dsl.serc.iisc.ernet.in
Present Address:
S. Agrawal
Stanford University, Stanford, CA, USA
B. A. Prakash
Indian Institute of Technology, Mumbai 400076, India
1 Introduction
The knowledge models produced through data mining techniques are only as good
as the accuracy of their input data. One source of data inaccuracy is when users,
due to privacy concerns, deliberately provide wrong information. This is especially
common with regard to customers asked to provide personal information on Web forms
to e-commerce service providers. The standard approach to address this problem is
for the service providers to assure the users that the databases obtained from their
information would be anonymized through the variety of techniques proposed in the
statistical database literature (see Adam and Wortman 1989; Shoshani 1982), before
being supplied to the data miners. An example is the swapping of values between
different customer records, as proposed by Denning (1982). However, in today’s world,
most users are (perhaps justifiably) cynical about such assurances, and it is therefore
imperative to demonstrably provide privacy at the point of data collection itself, that
is, at the user site.
For the above “B2C (business-to-customer)” privacy environment (Zhang et al.
2004), a variety of privacy-preserving data mining techniques have been proposed in
the last few years (e.g. Aggarwal and Yu 2004; Agrawal and Srikant 2000; Evfimievski
et al. 2002; Rizvi and Haritsa 2002), in an effort to encourage users to submit correct
inputs. The goal of these techniques is to ensure the privacy of the raw local data but,
at the same time, support accurate reconstruction of the global data mining models.
Most of the techniques are based on a data perturbation approach, wherein the user
data is distorted in a probabilistic manner that is disclosed to the eventual miner.
For example, in the MASK technique (Rizvi and Haritsa 2002), intended for
privacy-preserving association-rule mining on sparse boolean databases, each bit in the
original (true) user transaction vector is independently flipped with a parametrized
probability.
1.1 The FRAPP framework
The trend in the prior literature has been to propose specific perturbation techniques,
which are then analyzed for their privacy and accuracy properties. In this paper, we
move on to proposing FRAPP¹ (FRamework for Accuracy in Privacy-Preserving
mining), a generalized matrix-theoretic framework that facilitates a systematic
approach to the design of random-perturbation schemes for privacy-preserving mining.
It supports “amplification”, a particularly strong notion of privacy proposed by
Evfimievski et al. (2003), which guarantees strict limits on privacy breaches of
individual user information, independent of the distribution of the original (true) data.
The distinguishing feature of FRAPP is its quantitative characterization of the sources of
error in the random data perturbation and model reconstruction processes.
¹ Also the name of a popular coffee-based beverage, where the ingredients are perturbed and hidden under
foam http://en.wikibooks.org/wiki/Cookbook:Frapp%C3%A9_Coffee.
We first demonstrate that the prior techniques differ only in their choices for the
elements in the FRAPP perturbation matrix. Next, and more importantly, we show that
through appropriate choices of matrix elements, new perturbation techniques can be
constructed that provide highly accurate mining results even under strict
amplification-based (Evfimievski et al. 2003) privacy guarantees. In fact, we identify a
perturbation matrix with provably minimal condition number,² substantially improving
the accuracy under the given constraints. An efficient implementation for this optimal
perturbation matrix is also presented.
FRAPP’s quantification of reconstruction error highlights that, apart from the choice
of perturbation matrix, the size of the dataset also has a significant impact on the
accuracy of the mining model. We explicitly characterize this relationship, thus helping
the miner decide the minimum amount of data to be collected in order to achieve,
with high probability, a desired level of accuracy in the mining results. Further, for
those environments where data collection possibilities are limited, we propose a novel
“multi-distortion” method that makes up for the lack of data by collecting multiple
distorted versions from each individual user without materially compromising on
privacy.
We then investigate, for the first time, the possibility of randomizing the
perturbation parameters themselves. The motivation is that it could result in increased privacy
levels since the actual parameter values used by a specific client will not be known to
the data miner. This approach has the obvious downside of perhaps reducing the model
reconstruction accuracy. However, our investigation shows that the trade-off is very
attractive in that the privacy increase is significant whereas the accuracy reduction is
only marginal. This opens up the possibility of using FRAPP in a two-step process:
first, given a user-desired level of privacy, identifying the deterministic values of the
FRAPP parameters that both guarantee this privacy and also maximize the accuracy;
and then, (optionally) randomizing these parameters to obtain even better privacy
guarantees at a minimal cost in accuracy.
1.2 Evaluation of FRAPP
The FRAPP model is valid for random-perturbation-based privacy-preserving mining
in general. Here, we focus on its applications to categorical databases, where attribute
domains are finite. Note that boolean data is a special case of this class, and further, that
continuous-valued attributes can be converted into categorical attributes by partitioning
the domain of the attribute into fixed-length intervals. To quantitatively assess FRAPP’s
utility, we specifically evaluate the performance of our new perturbation mechanisms
on popular mining tasks such as association rule mining and classification rule mining.
² In the class of symmetric positive-definite matrices (refer Sect. 4).
With regard to association rule mining, our experiments on a variety of real datasets
indicate that FRAPP is substantially more accurate than the prior privacy-preserving
techniques. Further, while their accuracy degrades with increasing itemset length,
FRAPP is almost impervious to this parameter, making it particularly well-suited
to datasets where the lengths of the maximal frequent itemsets are comparable to
the cardinality of the set of attributes requiring privacy. Similarly, with regard to
classification rule mining, our experiments show that FRAPP provides an accuracy
that is, in fact, comparable to direct classification on the true database.
Apart from mining accuracy, the running time and memory costs for perturbed
data mining, as compared to classical mining on the original data, are also important
considerations. In contrast to much of the earlier literature, FRAPP uses a generalized
dependent perturbation scheme, where the perturbation of an attribute value may
be affected by the perturbations of the other attributes in the same record. However,
we show that it is fully decomposable into the perturbation of individual attributes,
and hence has the same run-time complexity as any independent perturbation method.
Further, we present experimental evidence that FRAPP takes only a few minutes to
perturb datasets running to millions of records. Subsequently, due to its well-conditioned
and trivially invertible perturbation matrix, FRAPP incurs only negligible additional
overheads with respect to memory usage and mining execution time, as compared to
traditional mining. Overall, therefore, FRAPP does not pose any significant additional
computational burdens on the data mining process.
1.3 Contributions
In a nutshell, the work presented here provides mathematical and algorithmic
foundations for efficiently providing both strict privacy and enhanced accuracy in
privacy-conscious data mining applications. Specifically, our main contributions are as follows:
– FRAPP, a generalized matrix-theoretic framework for random perturbation and
mining model reconstruction;
– Using FRAPP to derive new perturbation mechanisms for minimizing the model
reconstruction error while ensuring strict privacy guarantees;
– Introducing the concept of randomization of perturbation parameters, and thereby
deriving enhanced privacy;
– Efficient implementations of the proposed perturbation mechanisms;
– Quantitatively demonstrating the utility of FRAPP in the context of association and
classification rule mining.
1.4 Organization
The remainder of this paper is organized as follows: Related work on
privacy-preserving mining is reviewed in Sect. 2. The FRAPP framework for data perturbation
and model reconstruction is presented in Sect. 3. Appropriate choices of the framework
parameters for simultaneously guaranteeing strict data privacy and improving
model accuracy are discussed in Sects. 4 and 5. The impact of randomizing the FRAPP
parameters is investigated in Sect. 6.
Efficient schemes for implementing the FRAPP approach are described in Sect. 7.
The application of these mechanisms to specific patterns is discussed in Sect. 8, and
their utility is quantitatively evaluated in Sect. 9. Finally, in Sect. 10, we summarize
the conclusions of our study and outline future research avenues.
2 Related work
The issue of maintaining privacy in data mining has attracted considerable attention
over the last few years. The literature closest to our approach includes that of Agrawal
and Aggarwal (2001), Agrawal and Srikant (2000), de Wolf et al. (1998), Evfimievski
et al. (2002, 2003), Kargupta et al. (2003), and Rizvi and Haritsa (2002). In the pioneering
work of Agrawal and Srikant (2000), privacy-preserving data classifiers based on
adding noise to the record values were proposed. This approach was extended by
Agrawal and Aggarwal (2001) and Kargupta et al. (2003) to address a variety of
subtle privacy loopholes.
New randomization operators for maintaining data privacy for boolean data were
presented and analyzed by Evfimievski et al. (2002) and Rizvi and Haritsa (2002). These
methods are applicable to categorical/boolean data and are based on probabilistic
mapping from the domain space to the range space, rather than on incorporating
additive noise into continuous-valued data. A theoretical formulation of privacy breaches
for such methods, and a methodology for limiting them, were given in the foundational
work of Evfimievski et al. (2003).
Techniques for data hiding using perturbation matrices have also been investigated
in the statistics literature. For example, in the early-1990s work of Duncan and Pearson
(1991), various disclosure-limitation methods for microdata are formulated as “matrix
masking” methods. Here, the data consumer is provided the masked data file M =
AXB + C instead of the true data X, with A, B and C being masking matrices. However,
no quantification of privacy guarantees or reconstruction errors was discussed in their
analysis.
The PRAM method (de Wolf et al. 1998; Gouweleeuw et al. 1998), also intended
for disclosure limitation in microdata files, considers the use of Markovian perturbation
matrices. However, the ideal choice of matrix is left as an open research issue,
and an iterative refinement process to produce acceptable matrices is proposed as
an alternative. They also discuss the possibility of developing perturbation matrices
such that data mining can be carried out directly on the perturbed database (that is,
as if it were the original database and therefore not requiring any matrix inversion),
and still produce accurate results. While this “invariant PRAM”, as they call it, is
certainly an attractive notion, the systematic identification of such matrices and the
conditions on their applicability is still an open research issue. Moreover, it appears
to be feasible only in a “B2B (business-to-business)” environment, as opposed to the
B2C environment considered here.
The work recently presented by Agrawal et al. (2005) for ensuring privacy in the
OLAP environment also models data perturbation and reconstruction as
matrix-theoretic operations. A transition matrix is used for perturbation, and
reconstruction is executed using matrix inversion. They also suggest that the condition
number of the perturbation matrix is a good indicator of the error in reconstruction. However,
the issue of choosing a perturbation matrix to minimize this error is not addressed.
Our work extends the above-mentioned methodologies for privacy-preserving
mining in a variety of ways. First, we combine the various approaches for random
perturbation on categorical data into a common theoretical framework, and explore
how well random perturbation methods can perform in the face of strict privacy
requirements. Second, through quantification of privacy and accuracy measures, we
present an ideal choice of perturbation matrix, thereby taking the PRAM approach to,
in a sense, its logical conclusion. Third, we propose the idea of randomizing the
perturbation matrix elements themselves, which has not been, to the best of our knowledge,
previously discussed in the literature.
Very recently, Rastogi et al. (2007) utilized and extended the FRAPP framework to
a B2B environment such as data publishing. That is, they assume that users provide correct
data to a central server and then this data is collectively anonymized. In contrast,
our schemes assume that the users trust no one but themselves, and therefore the
perturbation has to happen locally for each user. Formally, the transformation in their
algorithm is described as y = Ax + b, thereby effectively adding a noise vector b to Ax.
They also analyze the privacy and accuracy trade-off under bounded prior knowledge
assumptions.
The “sketching” methods that were very recently presented by Mishra and Sandler
(2006) are complementary to our approach. Their basic idea is that a $k$-bit attribute
with $2^k$ possible values can be represented using $2^k$ binary-valued attributes which
can then each be perturbed independently. However, a direct application of this idea
requires $(2^k - k)$ extra bits, and therefore, Mishra and Sandler (2006) propose a
summary sketching technique that requires an extra number of bits logarithmic in
the number of instances in the dataset. Due to the extra bits, the method provides
good estimation accuracy for single item counts. However, the multiple-attribute count
estimation accuracy is shown to depend on the condition number of the perturbation
matrix. Our results on optimally conditioned perturbation matrices can be combined
with the sketching methods to provide better estimation of joint distributions. Another
difference between the two works is that we provide experimental results in addition
to the theoretical formulations.
Recently, a new privacy-preserving scheme based on the interesting idea of
algebraic distortion, rather than statistical methods, has been proposed by Zhang et al.
(2004). Their work is based on the assumption that statistical methods cannot handle
long frequent itemsets. However, as shown in this paper, FRAPP successfully finds even
length-7 frequent itemsets. A second assumption is that each attribute is randomized
independently, thereby losing correlations; however, FRAPP supports dependent
attribute perturbation and can therefore preserve correlations quite effectively. Finally,
their work is restricted to handling only “upward privacy breaches” (Evfimievski et al.
2003), whereas FRAPP handles downward privacy breaches as well.
Another model of privacy-preserving data mining is the $k$-anonymity model
(Samarati and Sweeney 1998; Aggarwal and Yu 2004), where each record value is
replaced with a corresponding generalized value. Specifically, each perturbed record
cannot be distinguished from at least $k$ other records in the data. However, the
constraints of this model are less strict than ours since the intermediate
database-forming server can learn or recover precise records.
A different perspective is taken in Hippocratic databases, which are database systems
that take responsibility for the privacy of the data they manage, and are discussed by
Agrawal et al. (2002, 2004a,b) and LeFevre et al. (2004). They involve specification of
how the data is to be used in a privacy policy, and enforcing limited disclosure rules
for regulatory concerns prompted by legislation.
Finally, the problem addressed by Atallah et al. (1999), Dasseni et al. (2001), and Saygin
et al. (2001, 2002) is preventing sensitive models from being inferred by the data
miner; this work is complementary to ours since it addresses concerns about output
privacy, whereas our focus is on the privacy of the input data. Maintaining input data
privacy is considered by Kantarcioglu and Clifton (2002) and Vaidya and Clifton (2002,
2003, 2004) in the context of databases that are distributed across a number of sites,
with each site only willing to share data mining results, but not the source data.
3 The FRAPP framework
In this section, we describe the construction of the FRAPP framework, and its
quantification of privacy and accuracy measures.
Data model We assume that the original (true) database $U$ consists of $N$ records,
with each record having $M$ categorical attributes. The domain of attribute $j$ is denoted
by $S_U^j$, resulting in the domain $S_U$ of a record in $U$ being given by
$S_U = \prod_{j=1}^{M} S_U^j$. We map the domain $S_U$ to the index set
$I_U = \{1, \ldots, |S_U|\}$, thereby modeling the database as a set of $N$ values from
$I_U$. If we denote the $i$th record of $U$ as $U_i$, then
$U = \{U_i\}_{i=1}^{N}$, $U_i \in I_U$.
To make this concrete, consider a database $U$ with 3 categorical attributes Age, Sex
and Education having the following category values:

Age        Child, Adult, Senior
Sex        Male, Female
Education  Elementary, Graduate

For this schema, $M = 3$, $S_U^1$ = {Child, Adult, Senior}, $S_U^2$ = {Male, Female},
$S_U^3$ = {Elementary, Graduate}, $S_U = S_U^1 \times S_U^2 \times S_U^3$, and $|S_U| = 12$.
The domain $S_U$ is indexed by the index set $I_U = \{1, \ldots, 12\}$, and hence the set
of records

U (records)                     maps to   U (indices)
Child   Male    Elementary                1
Child   Male    Graduate                  2
Child   Female  Graduate                  4
Senior  Male    Elementary                9
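
The index mapping used in this example can be sketched in a few lines of Python. The paper does not spell out the ordering of $I_U$; the sketch assumes the attributes are listed as Age, Sex, Education and that the last attribute varies fastest, which is the ordering consistent with the indices 1, 2, 4 and 9 shown above. The names `DOMAINS`, `record_to_index` and `index_to_record` are illustrative, not from the paper.

```python
# Attribute domains of the example schema, in the order Age, Sex, Education.
DOMAINS = [
    ["Child", "Adult", "Senior"],
    ["Male", "Female"],
    ["Elementary", "Graduate"],
]


def record_to_index(record, domains=DOMAINS):
    """Map a record to its 1-based index in I_U (assumed: last attribute varies fastest)."""
    index = 0
    for value, domain in zip(record, domains):
        index = index * len(domain) + domain.index(value)
    return index + 1


def index_to_record(index, domains=DOMAINS):
    """Inverse of record_to_index."""
    index -= 1
    values = []
    for domain in reversed(domains):
        index, pos = divmod(index, len(domain))
        values.append(domain[pos])
    return tuple(reversed(values))


records = [
    ("Child", "Male", "Elementary"),
    ("Child", "Male", "Graduate"),
    ("Child", "Female", "Graduate"),
    ("Senior", "Male", "Elementary"),
]
indices = [record_to_index(r) for r in records]
print(indices)  # [1, 2, 4, 9]
```

Under the same assumed ordering, `index_to_record(12)` returns the record (Senior, Female, Graduate).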
Each record $U_i$ represents the private information of customer $i$. Further, we assume
that the $U_i$’s are independent and identically distributed according to a fixed
distribution $p_U$. This distribution $p_U$ is not private and the customers are aware that
the miner is expected to learn it; in fact, that is usually the goal of the data mining exercise.
However, the assumption of independence implies that once $p_U$ is known, possession
of the private information $U_j$ of any other customer $j$ provides no additional inferences
about customer $i$’s private information $U_i$ (Evfimievski et al. 2002).
Perturbation model As mentioned in Sect. 1, we consider the B2C privacy situation
wherein the customers trust no one except themselves, that is, they wish to perturb
their records at their client sites before the information is sent to the miner, or any
intermediate party. This means that perturbation is carried out at the granularity of
individual customer records $U_i$, without being influenced by the contents of the other
records in the database.
For this situation, there are two possibilities: (a) a simple independent attribute
perturbation, wherein the value of each attribute in the user record is perturbed
independently of the rest; or (b) a more generalized dependent attribute perturbation,
where the perturbation of each attribute may be affected by the perturbations of the
other attributes in the record. Most of the prior perturbation techniques, including
Evfimievski et al. (2002, 2003) and Rizvi and Haritsa (2002), fall into the independent
attribute perturbation category. The FRAPP framework, however, includes both kinds
of perturbation in its analysis.
Let the perturbed database be $V = \{V_1, \ldots, V_N\}$, with domain $S_V$, and
corresponding index set $I_V$. For example, given the sample database $U$ discussed above, and
assuming that each attribute is distorted to produce a value within its original domain,
the distortion may result in

V (indices)   which maps to   V (records)
5                             Adult   Male    Elementary
7                             Adult   Female  Elementary
2                             Child   Male    Graduate
12                            Senior  Female  Graduate
Let the probability of an original customer record $U_i = u$, $u \in I_U$ being perturbed
to a record $V_i = v$, $v \in I_V$ using randomization operator $R(u)$ be $p(u \to v)$, and let
$A$ denote the matrix of these transition probabilities, with $A_{vu} = p(u \to v)$. This
random process maps to a Markov process, and the perturbation matrix $A$ should
therefore satisfy the following properties (Strang 1988):

$$A_{vu} \ge 0 \quad \text{and} \quad \sum_{v \in I_V} A_{vu} = 1 \qquad \forall u \in I_U,\ v \in I_V \tag{1}$$
Due to the constraints imposed by Eq. 1, the domain of $A$ is a subset of
$\mathbb{R}^{|S_V| \times |S_U|}$.
This domain is further restricted by the choice of the randomization operator. For
example, for the MASK technique (Rizvi and Haritsa 2002) mentioned in Sect. 1, all
the entries of matrix $A$ are decided by the choice of a single parameter, namely, the
flipping probability.
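
As a minimal sketch of the perturbation model, the following Python fragment checks the two conditions of Eq. 1 for a candidate matrix and samples a perturbed value from the column of the original value. The 3×3 matrix, the 0-based indexing and the helper names `is_valid_perturbation_matrix` and `perturb` are assumptions made for illustration; they are not taken from the paper.

```python
import random


def is_valid_perturbation_matrix(A, tol=1e-9):
    """Check Eq. 1: non-negative entries and every column summing to 1.

    A is a list of rows, with A[v][u] = p(u -> v) and 0-based indices.
    """
    n_rows, n_cols = len(A), len(A[0])
    if any(x < 0 for row in A for x in row):
        return False
    return all(
        abs(sum(A[v][u] for v in range(n_rows)) - 1.0) <= tol
        for u in range(n_cols)
    )


def perturb(u, A, rng=random):
    """Sample a perturbed index v for original index u from column u of A."""
    column = [row[u] for row in A]
    return rng.choices(range(len(column)), weights=column, k=1)[0]


# Hypothetical matrix: keep the value with probability 0.6, otherwise
# move to each of the two other values with probability 0.2.
A = [
    [0.6, 0.2, 0.2],
    [0.2, 0.6, 0.2],
    [0.2, 0.2, 0.6],
]
print(is_valid_perturbation_matrix(A))
print(perturb(0, A, random.Random(1)))
```

Because each client only needs its own column of $A$, the sampling step can run entirely at the user site, which matches the B2C setting described above.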
In this paper, we explore the preferred choices of $A$ to simultaneously achieve data
privacy guarantees and high model accuracy, without restricting ourselves ab initio to
a particular perturbation method.
3.1 Privacy guarantees
The miner is provided the perturbed database $V$, and the perturbation matrix $A$.
Obviously, by receiving $V_i$ corresponding to customer $i$, the miner gains partial
information about $U_i$. However, as mentioned earlier in this section, due to the independence
assumption, all $V_j$ for $j \ne i$ disclose nothing about $U_i$. They certainly help the miner
to learn the distribution $p_U$, but this is already factored into our privacy analysis since we
assume the most conservative scenario wherein the miner has complete and precise
knowledge of $p_U$. In fact, extracting information about $p_U$ is typically the goal of
the data mining exercise and, therefore, our privacy technique must encourage, rather
than preclude, achieving this objective. The problem therefore reduces to analyzing
specifically how much can be disclosed by $V_i$ about the particular source record $U_i$.
We utilize the definition, given by Evfimievski et al. (2003), that a property $Q(u)$
of a data record $U_i = u$ is a function $Q: u \to \{\text{true}, \text{false}\}$. Further, a
property holds for a record $U_i = u$ if $Q(u) = \text{true}$. For example, consider the
following record from our example dataset $U$:

Age    Sex   Education
Child  Male  Elementary

Sample properties of this data record are
$Q_1(U_i) \equiv$ “Age = Child and Sex = Male”,
and $Q_2(U_i) \equiv$ “Age = Child or Adult”.
For this context, the prior probability of a property of a customer’s private information
is the likelihood of the property in the absence of any knowledge about the customer’s
private information. On the other hand, the posterior probability is the likelihood of the
property given the perturbed information from the customer and the knowledge of the
prior probabilities through reconstruction from the perturbed database. Specifically,
the prior probability of any property $Q(U_i)$ is given by

$$P[Q(U_i)] = \sum_{u:Q(u)} P[U_i = u] = \sum_{u:Q(u)} p_U(u)$$

The posterior probability of any such property can be computed using Bayes’ formula:

$$P[Q(U_i) \mid V_i = v] = \sum_{u:Q(u)} P[U_i = u \mid V_i = v]
= \sum_{u:Q(u)} \frac{P[U_i = u] \cdot p[u \to v]}{P[V_i = v]}$$
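
The two formulas above translate directly into code. The sketch below is illustrative only: the three-value domain, the prior distribution `p_U`, the matrix `A` and the property `is_first_value` are hypothetical, and indices are 0-based.

```python
def prior(prop, p_U):
    """P[Q(U_i)]: total prior mass of the values u for which Q(u) holds."""
    return sum(p for u, p in enumerate(p_U) if prop(u))


def posterior(prop, p_U, A, v):
    """P[Q(U_i) | V_i = v] via Bayes' formula, with A[v][u] = p[u -> v]."""
    domain = range(len(p_U))
    p_v = sum(p_U[u] * A[v][u] for u in domain)  # P[V_i = v]
    return sum(p_U[u] * A[v][u] for u in domain if prop(u)) / p_v


# Hypothetical prior distribution and perturbation matrix.
p_U = [0.5, 0.3, 0.2]
A = [
    [0.6, 0.2, 0.2],
    [0.2, 0.6, 0.2],
    [0.2, 0.2, 0.6],
]


def is_first_value(u):
    """Property Q: the original record has index 0."""
    return u == 0


print(prior(is_first_value, p_U))
print(posterior(is_first_value, p_U, A, 0))
```

With these assumed numbers, observing $v = 0$ raises the miner's belief in the property from 0.5 to about 0.75, since $P[V_i = 0] = 0.4$ and the numerator is $0.5 \times 0.6 = 0.3$.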
As discussed by Evfimievski et al. (2003), in order to preserve the privacy of some
property of a customer’s private information, the posterior probability of that
property should not be unduly different from the prior probability of the property for the
customer. This notion of privacy is quantified by Evfimievski et al. (2003) through
the following results, where $\rho_1$ and $\rho_2$ denote the prior and posterior probabilities,
respectively:
Privacy breach An upward $\rho_1$-to-$\rho_2$ privacy breach exists with respect to property
$Q$ if $\exists v \in S_V$ such that

$$P[Q(U_i)] \le \rho_1 \quad \text{and} \quad P[Q(U_i) \mid R(U_i) = v] \ge \rho_2.$$

Conversely, a downward $\rho_2$-to-$\rho_1$ privacy breach exists with respect to property $Q$ if
$\exists v \in S_V$ such that

$$P[Q(U_i)] \ge \rho_2 \quad \text{and} \quad P[Q(U_i) \mid R(U_i) = v] \le \rho_1.$$
Amplification A randomization operator $R(u)$ is at most $\gamma$-amplifying for $v \in S_V$ if

$$\forall u_1, u_2 \in S_U: \quad \frac{p[u_1 \to v]}{p[u_2 \to v]} \le \gamma$$

where $\gamma \ge 1$ and $\exists u: p[u \to v] > 0$. Operator $R(u)$ is at most $\gamma$-amplifying if it is
at most $\gamma$-amplifying for all qualifying $v \in S_V$.
Breach prevention Let $R$ be a randomization operator, $v \in S_V$ be a randomized
value such that $\exists u: p[u \to v] > 0$, and $\rho_1, \rho_2$ ($0 < \rho_1 < \rho_2 < 1$) be two
probabilities as per the above privacy breach definition. Then, if $R$ is at most
$\gamma$-amplifying for $v$, revealing “$R(u) = v$” will cause neither upward
($\rho_1$-to-$\rho_2$) nor downward ($\rho_2$-to-$\rho_1$)
privacy breaches with respect to any property if the following condition is satisfied:

$$\frac{\rho_2 (1 - \rho_1)}{\rho_1 (1 - \rho_2)} > \gamma$$

If this situation holds, $R$ is said to support $(\rho_1, \rho_2)$ privacy guarantees.
From the above results of Evfimievski et al. (2003), we can derive, for our
formulation, the following condition on the perturbation matrix $A$ in order to support
$(\rho_1, \rho_2)$ privacy:

$$\frac{A_{vu_1}}{A_{vu_2}} \le \gamma < \frac{\rho_2 (1 - \rho_1)}{\rho_1 (1 - \rho_2)}
\qquad \forall u_1, u_2 \in I_U,\ \forall v \in I_V \tag{2}$$

That is, the choice of perturbation matrix $A$ should follow the restriction that the ratio
of any two matrix entries (in a row) should not be more than $\gamma$.
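
The row-ratio condition of Eq. 2 can be checked mechanically. The following sketch is a hedged illustration: the function names, the example matrices and the $(\rho_1, \rho_2) = (5\%, 50\%)$ requirement are assumptions chosen for the example, and rows whose entries are all zero are skipped because such a $v$ is never produced and so does not qualify.

```python
def gamma_bound(rho1, rho2):
    """Strict upper bound on gamma implied by a (rho1, rho2) privacy requirement."""
    if not 0 < rho1 < rho2 < 1:
        raise ValueError("need 0 < rho1 < rho2 < 1")
    return (rho2 * (1 - rho1)) / (rho1 * (1 - rho2))


def amplification(A):
    """Largest ratio between two entries in the same row of A.

    Returns infinity if a row mixes zero and non-zero entries.
    """
    worst = 1.0
    for row in A:
        lo, hi = min(row), max(row)
        if hi == 0:
            continue  # v is never produced, so it does not qualify
        if lo == 0:
            return float("inf")
        worst = max(worst, hi / lo)
    return worst


def supports_privacy(A, rho1, rho2):
    """Check the condition of Eq. 2 over all rows of A."""
    return amplification(A) < gamma_bound(rho1, rho2)


# Hypothetical matrices: row ratios of 3 and 48, respectively.
mild = [[0.6, 0.2, 0.2], [0.2, 0.6, 0.2], [0.2, 0.2, 0.6]]
sharp = [[0.96, 0.02, 0.02], [0.02, 0.96, 0.02], [0.02, 0.02, 0.96]]

print(gamma_bound(0.05, 0.5))
print(supports_privacy(mild, 0.05, 0.5), supports_privacy(sharp, 0.05, 0.5))
```

For the assumed requirement the bound evaluates to 19 (up to floating-point rounding), so the first matrix satisfies Eq. 2 while the second, whose diagonal dominates too strongly, does not.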
Application environment At this juncture, we wish to clearly specify the
environments under which the above guarantees are applicable. Firstly, our quantification of
privacy breaches analyzes only the information leaked to the miner through observing
the perturbed data; it does not take into account any prior knowledge that the miner
may have about the original database. Secondly, we assume that the contents of each
client’s record are completely independent from those of other customers; that is,
there are no inter-transaction dependencies. Due to this independence assumption, all
the $R(U_j)$ for $j \ne i$ do not disclose anything about $U_i$ and can therefore be ignored in
privacy analysis; they certainly help the miner to learn the distribution of the original
data, but in our analysis we have already assumed that this distribution is fully known
by the miner. So the problem reduces to evaluating how much can be disclosed by
$R(U_i)$ about $U_i$ (Evfimievski et al. 2003). We also hasten to add that we do not make
any such restrictive assumptions about intra-transaction dependencies; in fact, the
objective of association-rule mining is precisely to establish such dependencies.
3.2 Reconstruction model
We now move on to analyzing how the distribution of the original database is
reconstructed from the perturbed database. As per the perturbation model, a client $C_i$ with
data record $U_i = u$, $u \in I_U$ generates record $V_i = v$, $v \in I_V$ with probability
$p[u \to v]$. The generation event can be viewed as a Bernoulli trial with success
probability $p[u \to v]$. If we denote the outcome of the $i$th Bernoulli trial by the random
variable $Y_v^i$, the total number of successes $Y_v$ in $N$ trials is given by the sum of the $N$
Bernoulli random variables:

$$Y_v = \sum_{i=1}^{N} Y_v^i \tag{3}$$

That is, the total number of records with value $v$ in the perturbed database is given by
$Y_v$.
Note that $Y_v$ is the sum of $N$ independent but non-identical Bernoulli trials. The
trials are non-identical because the probability of success varies from trial $i$ to trial $j$,
depending on the values of $U_i$ and $U_j$, respectively. The distribution of such a random
variable $Y_v$ is known as the Poisson-Binomial distribution (Wang 1993).
From Eq. 3, the expectation of $Y_v$ is given by

$$E(Y_v) = \sum_{i=1}^{N} E(Y_v^i) = \sum_{i=1}^{N} P(Y_v^i = 1) \tag{4}$$

Using $X_u$ to denote the number of records with value $u$ in the original database, and
noting that $P(Y_v^i = 1) = p[u \to v] = A_{vu}$ for $U_i = u$, we get

$$E(Y_v) = \sum_{u \in I_U} A_{vu} X_u \tag{5}$$

Let $X = [X_1\ X_2\ \ldots\ X_{|S_U|}]^T$ and $Y = [Y_1\ Y_2\ \ldots\ Y_{|S_V|}]^T$. Then, the
following expression is obtained from Eq. 5:

$$E(Y) = AX \tag{6}$$
At first glance, it may appear that $X$, the distribution of records in the original
database (and the objective of the reconstruction exercise), can be directly obtained
from the above equation. However, we run into the difficulty that the data miner does
not possess $E(Y)$, but only a specific instance of $Y$, with which he has to approximate
$E(Y)$.³ Therefore, we resort to the following approximation to Eq. 6:

$$Y = A\widehat{X} \tag{7}$$

where $X$ is estimated as $\widehat{X}$. This is a system of $|S_V|$ equations in $|S_U|$ unknowns, and
for the system to be uniquely solvable, a necessary condition is that the space of the
perturbed database is a superset of the original database (i.e. $|S_V| \ge |S_U|$). Further, if
the inverse of matrix $A$ exists, the solution of this system of equations is given by

$$\widehat{X} = A^{-1} Y \tag{8}$$

providing the desired estimate of the distribution of records in the original database.
Note that this estimation is unbiased because $E(\widehat{X}) = A^{-1} E(Y) = X$.
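
The reconstruction step can be illustrated with a small simulation in plain Python. Everything concrete in this sketch is assumed for illustration: the 3×3 matrix, the 10,000 synthetic records, the fixed random seed and the helper names `solve` and `counts`. The linear system of Eq. 7 is solved by Gauss-Jordan elimination rather than by forming $A^{-1}$ explicitly, which yields the same estimate as Eq. 8 when $A$ is invertible.

```python
import random


def solve(A, y):
    """Solve the square system A x = y by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(map(float, A[i])) + [float(y[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def counts(values, size):
    """Histogram of 0-based indices."""
    hist = [0] * size
    for v in values:
        hist[v] += 1
    return hist


# Hypothetical perturbation matrix and original database (0-based indices).
A = [[0.6, 0.2, 0.2], [0.2, 0.6, 0.2], [0.2, 0.2, 0.6]]
true_records = [0] * 5000 + [1] * 3000 + [2] * 2000

rng = random.Random(0)
perturbed = [
    rng.choices(range(3), weights=[row[u] for row in A])[0] for u in true_records
]

X = counts(true_records, 3)  # original counts, unknown to the miner
Y = counts(perturbed, 3)     # observed counts in the perturbed database
X_hat = solve(A, Y)          # estimate of X from Eq. 7
print(X, Y, [round(x) for x in X_hat])
```

Since every column of $A$ sums to one, the estimated counts always add up to $N$; the individual entries deviate from the true counts only through the deviation of $Y$ from $E(Y) = AX$.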
3.3 Estimation error
To analyze the error in the above estimation process, we employ the following
well-known theorem from linear algebra (Strang 1988):
Theorem 1 Given an equation of the form $Ax = b$ and that the measurement $\hat{b}$ of $b$
is inexact, the relative error in the solution $\hat{x} = A^{-1}\hat{b}$ satisfies

$$\frac{\|\hat{x} - x\|}{\|x\|} \le c\, \frac{\|\hat{b} - b\|}{\|b\|}$$

where $c$ is the condition number of matrix $A$.
For a positive-definite matrix, $c = \lambda_{max}/\lambda_{min}$, where $\lambda_{max}$ and
$\lambda_{min}$ are the maximum and minimum eigenvalues of matrix $A$, respectively. Informally,
the condition number is a measure of the sensitivity of a matrix to numerical operations. Matrices
with condition numbers near one are said to be well-conditioned, i.e. stable, whereas
those with condition numbers much greater than one (e.g. $10^5$ for a $5 \times 5$ Hilbert matrix;
Strang 1988) are said to be ill-conditioned, i.e. highly sensitive.
From Eqs. 6 and 8, coupled with Theorem 1, we have

$$\frac{\|\hat{X} - X\|}{\|X\|} \leq c\,\frac{\|Y - E(Y)\|}{\|E(Y)\|} \qquad (9)$$

which means that the error in estimation arises from two sources: first, the sensitivity of the problem, indicated by the condition number of matrix $A$; and second, the deviation of $Y$ from its mean, i.e. the deviation of perturbed database counts from their expected values, indicated by the variance of $Y$. In the following two sections, we mathematically determine how to reduce this error by: (a) appropriately choosing the perturbation matrix to minimize the condition number, and (b) identifying the minimum size of the database required to (probabilistically) bound the deviation within a desired threshold.

³ If multiple distorted versions of the database are provided, then $E(Y)$ is approximated by the observed average of these versions.
4 Perturbation matrix with minimum condition number
The perturbation techniques proposed in the literature primarily differ in their choices
for perturbation matrix A. For example:
(1) MASK: The MASK (Rizvi and Haritsa 2002) randomization scheme uses a matrix $A$ with

$$A_{vu} = p^k (1-p)^{M_b - k} \qquad (10)$$

where $M_b$ is the number of boolean attributes when each categorical attribute $j$ is converted into $|S_U^j|$ boolean attributes, $(1-p)$ is the bit-flipping probability for each boolean attribute, and $k$ is the number of attributes with matching bits between the perturbed value $v$ and the original value $u$.
(2) Cut-and-paste: The cut-and-paste (C&P) randomization operator (Evfimievski et al. 2002) employs a matrix $A$ with

$$A_{vu} = \sum_{z=0}^{M} p_M[z] \cdot \sum_{q=\max\{0,\, z+l_u-M,\, l_u+l_v-M_b\}}^{\min\{z,\, l_u,\, l_v\}} \frac{\binom{l_u}{q}\binom{M-l_u}{z-q}}{\binom{M}{z}} \cdot \binom{M_b-l_u}{l_v-q}\, \rho^{(l_v-q)} (1-\rho)^{(M_b-l_u-l_v+q)} \qquad (11)$$

where

$$p_M[z] = \sum_{w=0}^{\min\{K,z\}} \binom{M-w}{z-w} \rho^{(z-w)} (1-\rho)^{(M-z)} \cdot \begin{cases} 1 - M/(K+1) & \text{if } w = M \;\&\; w < K \\ 1/(K+1) & \text{o.w.} \end{cases}$$

Here $l_u$ and $l_v$ are the number of 1-bits in the original record $u$ and its corresponding perturbed record $v$, respectively, while $K$ and $\rho$ are operator parameters.
To enforce strict privacy guarantees, the parameter settings for the above methods are
bounded by the constraints, given in Eqs. 1 and 2, on the values of the elements of the
perturbation matrix $A$. It turns out that for practical values of privacy requirements, the resulting matrix $A$ for these previous schemes is extremely ill-conditioned; in fact, the condition numbers in our experiments were of the order of $10^5$ and $10^7$ for MASK and C&P, respectively.
Such ill-conditioned matrices make the reconstruction very sensitive to the variance in the distribution of the perturbed database. Thus, it is important to carefully choose the matrix $A$ such that it is well-conditioned (i.e. has a low condition number). If we
decide on a distortion method ab initio, as in the earlier techniques, then there is little
room for making speciﬁc choices of perturbation matrix A. Therefore, we take the
opposite approach of ﬁrst designing matrices of the required type, and then devising
perturbation methods that are compatible with these matrices.
To choose a suitable matrix, we start from the intuition that for $\gamma = \infty$, the obvious matrix choice is the identity matrix, which both satisfies the constraints on matrix $A$ (Eqs. 1 and 2), and has the lowest possible condition number, namely, 1. Hence, for a given $\gamma$, we can choose the following matrix:

$$A_{ij} = \begin{cases} \gamma x & \text{if } i = j \\ x & \text{o.w.} \end{cases} \qquad \text{where } x = \frac{1}{\gamma + (|S_U| - 1)} \qquad (12)$$

which is of the form

$$x \begin{bmatrix} \gamma & 1 & 1 & \cdots \\ 1 & \gamma & 1 & \cdots \\ 1 & 1 & \gamma & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{bmatrix}$$
It is easy to see that the above matrix, which incidentally is symmetric positive-definite and Toeplitz (Strang 1988), also satisfies the conditions given by Eqs. 1 and 2. Further, its condition number can be algebraically computed (as shown in the Appendix) to be $1 + \frac{|S_U|}{\gamma - 1}$. At an intuitive level, this matrix implies that the probability of a record $u$ remaining as $u$ after perturbation is $\gamma$ times the probability of its being distorted to some $v \neq u$. For ease of exposition, we will hereafter informally refer to this matrix as the “gamma-diagonal matrix”.
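The structure of this matrix can be checked numerically. The sketch below builds it for an assumed domain size of 8 and $\gamma = 19$ (both values, and the helper names, are illustrative only). It relies on the fact that a matrix of the form $x[(\gamma - 1)I + J]$, with $J$ the all-ones matrix, has the all-ones vector as an eigenvector with eigenvalue $x(\gamma + n - 1) = 1$, and every vector whose entries sum to zero as an eigenvector with eigenvalue $x(\gamma - 1)$.

```python
def gamma_diagonal(n, gamma):
    """Return the n x n gamma-diagonal matrix as a list of rows."""
    x = 1.0 / (gamma + n - 1)
    return [[gamma * x if i == j else x for j in range(n)] for i in range(n)]


def mat_vec(a, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in a]


n, gamma = 8, 19.0            # illustrative (assumed) domain size and gamma
A = gamma_diagonal(n, gamma)

# Every column sums to 1, so A is a valid Markov (transition) matrix.
column_sums = [sum(A[i][j] for i in range(n)) for j in range(n)]

# The all-ones vector is an eigenvector; its eigenvalue is the largest one.
lam_max = mat_vec(A, [1.0] * n)[0]
# e1 - e2 sums to zero, so it is an eigenvector for the smallest eigenvalue.
lam_min = mat_vec(A, [1.0, -1.0] + [0.0] * (n - 2))[0]

condition_number = lam_max / lam_min
closed_form = 1 + n / (gamma - 1)     # 1 + |S_U| / (gamma - 1)
```

For these assumed values both `condition_number` and `closed_form` evaluate to 26/18, i.e. about 1.44, far below the orders of magnitude reported above for MASK and C&P.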
At this point, an obvious question is whether it is possible to design matrices with even lower condition number than the gamma-diagonal matrix. We prove next that the gamma-diagonal matrix has the lowest possible condition number among the class of symmetric positive-definite perturbation matrices satisfying the constraints of the problem, that is, it is an optimal choice (albeit non-unique).
4.1 Proof of optimality
Theorem 2 Under the given privacy constraints, the gamma-diagonal matrix has the lowest condition number in the class of symmetric positive-definite perturbation matrices.
Proof To prove this proposition, we will first derive the expression for minimum condition number of symmetric positive-definite matrices. For such matrices, the condition number is given by $c = \lambda_{max}/\lambda_{min}$, where $\lambda_{max}$ and $\lambda_{min}$ are the maximum and minimum eigenvalues of the matrix, respectively. Further, since $A$ is a Markov matrix (refer Eq. 1), the following results for eigenvalues of a Markov matrix (Strang 1988) are applicable.
Theorem 3 For an $n \times n$ Markov matrix, one of the eigenvalues is 1, and the remaining $n - 1$ eigenvalues all satisfy $|\lambda_i| \leq 1$.

Theorem 4 The sum of the $n$ eigenvalues equals the sum of the $n$ diagonal entries, that is,

$$\lambda_1 + \cdots + \lambda_n = A_{11} + \cdots + A_{nn}$$
From Theorem 3, we obtain $\lambda_{max} = 1$, and from Theorem 4, that the sum of the rest of the eigenvalues is fixed. If we denote $\lambda_1 = \lambda_{max}$, it is straightforward to see that $\lambda_{min}$ is maximized when $\lambda_2 = \lambda_3 = \cdots = \lambda_n$, leading to $\lambda_{min} = \frac{1}{n-1}\sum_{i=2}^{n} \lambda_i$. Therefore,

$$\lambda_{min} \leq \frac{1}{n-1} \sum_{i=2}^{n} \lambda_i$$

Using Theorem 4, we directly get

$$\lambda_{min} \leq \frac{1}{n-1} \left( \sum_{i=1}^{n} A_{ii} - 1 \right)$$

resulting in the matrix condition number being lower-bounded by

$$c = \frac{1}{\lambda_{min}} \geq \frac{n-1}{\sum_{i=1}^{n} A_{ii} - 1} \qquad (13)$$
Due to the privacy constraints on $A$ given by Eq. 2,

$$A_{ii} \leq \gamma A_{ij} \quad \forall j \neq i$$

Summing the above equation over all values of $j$ except $j = i$, we get

$$(n-1)A_{ii} \leq \gamma \sum_{j \neq i} A_{ij} = \gamma(1 - A_{ii})$$

where the second step is due to the condition on $A$ given by Eq. 1 and the restriction to symmetric positive-definite matrices. Solving for $A_{ii}$ results in

$$A_{ii} \leq \frac{\gamma}{\gamma + n - 1} \qquad (14)$$
and using this inequality in Eq. 13, we finally obtain

$$c \geq \frac{n-1}{\frac{n\gamma}{\gamma+n-1} - 1} = \frac{\gamma+n-1}{\gamma-1} = 1 + \frac{n}{\gamma-1} \qquad (15)$$

Therefore, the minimum condition number for the symmetric positive-definite perturbation matrices under privacy constraints represented by $\gamma$ is $(1 + \frac{n}{\gamma-1})$. The condition number of our “gamma-diagonal” matrix of size $|S_U|$ can be computed as shown in the Appendix, and its value turns out to be $(1 + \frac{|S_U|}{\gamma-1})$. Thus, it is a minimum condition number perturbation matrix.
5 Database size and mining accuracy
In this section, we analyze the dependence of deviations of itemset counts in the perturbed database from their expected values, with respect to the size of the database. Then, we give bounds on the database sizes required for obtaining a desired accuracy.
As discussed earlier, $Y_v$ denotes the total number of records with value $v$ in the perturbed database, given by

$$Y_v = \sum_{i=1}^{N} Y_v^i$$

where $Y_v^i$ is the Bernoulli random variable for record $i$, and $N$ is the size of the database. To bound the deviation of $Y_v$ from its expected value $E(Y_v)$, we use Hoeffding's General Bound (Motwani and Raghavan 1995), which bounds the deviation of the sum of Bernoulli random variables from its mean. Using these bounds for $Y_v$, we get

$$P\left[ \frac{|Y_v - E(Y_v)|}{N} < \epsilon \right] \geq 1 - 2e^{-2\epsilon^2 N}$$

where $\epsilon$ $(0 < \epsilon < 1)$ represents the desired upper bound on the normalized deviation. For the above probability to be greater than a user-specified value $\delta$, the value of $N$ should satisfy the following:

$$1 - 2e^{-2\epsilon^2 N} \geq \delta \;\Rightarrow\; N \geq \frac{\ln(2/(1-\delta))}{2\epsilon^2} \qquad (16)$$

That is, to achieve the desired accuracy (given by $\epsilon$), with the desired probability (given by $\delta$), the miner must collect data from at least the number of customers given by the above bound. For example, with $\epsilon = 0.001$ and $\delta = 0.95$, this turns out to be $N \geq 2 \times 10^6$, which is well within the norm for typical e-commerce environments.
Moreover, note that these acceptable values were obtained with the Hoeffding Bound,
a comparatively loose bound, and that in practice, it is possible that even datasets that
do not fully meet this requirement may be capable of providing the desired accuracy.
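The bound of Eq. 16 is straightforward to evaluate. The following minimal sketch computes the smallest integer database size for a given accuracy threshold and confidence level; the function name `min_database_size` and the parameter names `epsilon` and `delta` are illustrative choices, not part of the paper.

```python
import math


def min_database_size(epsilon, delta):
    """Smallest integer N satisfying Eq. 16.

    epsilon : upper bound on the normalized deviation |Y_v - E(Y_v)| / N
    delta   : probability with which that bound should hold
    """
    if not (0 < epsilon < 1 and 0 < delta < 1):
        raise ValueError("epsilon and delta must lie strictly between 0 and 1")
    return math.ceil(math.log(2.0 / (1.0 - delta)) / (2.0 * epsilon ** 2))


n_required = min_database_size(epsilon=0.001, delta=0.95)
```

For the example values in the text, `n_required` is 1,844,440, which rounds to the quoted figure of $2 \times 10^6$. Because the requirement grows as the inverse square of the deviation threshold, relaxing the threshold tenfold reduces the required size a hundredfold.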
For completeness, we now consider the hopefully rare situation wherein the customers are so few that accuracy cannot be guaranteed as per Eq. 16. Here, one approach
that could be taken is to collect multiple independent perturbations of each customer’s
record, thereby achieving the desired target size. But, this has to be done carefully since
the multiple distorted copies can potentially lead to a privacy breach, as described next.
5.1 Multiple versions of perturbed database
Assume that each user perturbs his/her record m times independently, so that overall
the miner obtains m versions of the perturbed database. We hereafter refer to the set
of perturbed records that share a common source record as “siblings”.
Recall that a basic assumption made when defining privacy breaches in Sect. 3.1 was that the perturbed value $R(U_i)$ for the record $i$ does not reveal any information about a record $j \neq i$. This assumption continues to be true in the multiple-versions variant if the miner is not aware of which records in the perturbed data set are siblings. Consequently, the privacy analysis of Sect. 3 can be applied verbatim to prove $\gamma$-amplification privacy guarantees in this environment as well. Therefore, all that needs to be done is to choose $m$ such that the overall size of the database satisfies Eq. 16.
5.1.1 Multiple known siblings
The preceding analysis still leaves open the question of what happens in situations wherein the data miner is aware of the siblings in the perturbed data set. It appears to us that maintaining accuracy requirements under such extreme circumstances may require relaxing the privacy constraints, as per the following discussion: With the gamma-diagonal matrix, the probability of a data value remaining unchanged is more than the probability of its being altered to any other value. Therefore, to guess the original value, the miner will obviously look for the value that appears most often in the sibling records. For example, if 9 out of 10 versions of a given record have the identical perturbed value for an attribute, the miner knows with high probability the original value of that attribute. Clearly, in this case, one sibling reveals information about another sibling, violating the assumption required for $\gamma$-amplification privacy. At first glance, it might appear that this problem can be easily tackled by treating each group of siblings as a single multi-dimensional vector; but this strategy completely nullifies the original objective of having multiple versions to enhance accuracy. Therefore, in the remainder of this section, we quantitatively investigate the impact on privacy of having multiple known siblings in the database, with privacy now defined as the probability of correctly guessing the original value.
The first analysis technique that comes to mind is to carry out a hypothesis test (“the value seen the maximum number of times is indeed the true value”) using the $\chi^2$ statistic. However, this test is not practical in our environment because of the extreme skewness of the distribution and the large cardinalities of the value domain. Therefore, we pursue the following alternate line of analysis: Consider a particular record with original (true) value $u$, which is independently perturbed $m$ times, producing $m$ perturbed record values. Let $n_v$ be the number of times a perturbed value $v$ appears in these $m$ values, and let $R$ be the random variable representing the value which is present the maximum number of times, i.e., $R = i$ if $n_i > n_j$ for all $j \neq i$. Then, the probability of correctly guessing $R = u$ is

$$P(R = u) = P\left(\bigwedge_{v \neq u} (n_u > n_v)\right) \quad \text{with} \quad \sum_i n_i = m$$
Clearly, if $u$ appears less than or equal to $L = \lfloor m/|S_V| \rfloor$ times, it cannot be the most frequent occurrence, since there must be another value $v$ appearing at least $m/|S_V|$ times in the perturbed records. Hence, the probability of a correct guess satisfies the following inequality:

$$\begin{aligned} P(R = u) &= 1 - P(R \neq u) \\ &\leq 1 - P(n_u \leq L) \\ &= 1 - \sum_{k=0}^{L} P(n_u = k) \\ &= 1 - \sum_{k=0}^{L} {}^{m}C_k \cdot p^k \cdot (1-p)^{m-k} \end{aligned} \qquad (17)$$

where $p$ is the probability $p[u \to u]$. The last step follows from the fact that $n_u$ is a binomially distributed random variable.
Observe that $L = \lfloor m/|S_V| \rfloor \geq 0$, and hence the above inequality can be reduced to

$$P(R = u) \leq 1 - P(n_u = 0) = 1 - (1-p)^m$$

For the gamma-diagonal matrix, $p = p[u \to u] = \gamma x$, resulting in the probability of a correct guess being

$$P(R = u) \leq 1 - (1 - \gamma x)^m \qquad (18)$$

The record domain size $|S_V|$ can be reasonably expected to be (much) greater than $m$ in most database environments. This implies that the value of $p = \gamma x$ will usually be very small, leading to an acceptably low guessing probability.
A legitimate concern here is that the miner may try to guess the values of individual sensitive attributes (or a subset of such attributes) in a record, rather than its entire contents. To assess this possibility, let us assume that $u$ and $v$, which were used earlier to denote values of complete records, now refer to a single attribute. As derived later in Sect. 8, for an attribute of domain size $|S_V^1|$, the probability $p[u \to u]$ is given by:

$$p[u \to u] = \gamma x + \left( \frac{|S_V|}{|S_V^1|} - 1 \right) x$$

An upper bound for the single-attribute guessing probability is directly obtained by substituting the above value of $p$, and $L = \lfloor m/|S_V^1| \rfloor$, in the inequality of Eq. 17.

Fig. 1 $P(R = u)$ vs. $m$
A quantitative assessment of the number of versions that can be provided without jeopardizing user privacy is achieved by plotting the guessing probability upper bound against $m$, the number of versions. Sample plots are shown in Fig. 1 for a representative setup: $\gamma = 19$ with a record domain size $|S_V| = 2000$ and various single-attribute domain sizes ($|S_V^1| = 2, 4, 8$). The solid line corresponds to the full-record case whereas the dashed lines reflect the single-attribute cases.
Observe in Fig. 1 that the full-record guessing probability remains less than 0.1 even when the number of versions is as many as 50, and is limited to 0.25 for the extreme of 100 versions. Turning our attention to the single-attribute case, we see that for the lowest possible domain size, namely 2, the guessing probability levels off at around 0.5; note that this is no worse than the miner's ability to correctly guess the attribute value without having access to the data. Of course, for larger domain sizes such as 4 and 8, there is added information from the data. However, the key point again is that the guessing probabilities for these cases also level off around 0.5 in the practical range of $m$. In short, the miner's guess is at least as likely to be wrong as it is to be correct, which appears to be an acceptable privacy level in practice.
Moreover, for less stringent $\gamma$ values, the guessing probabilities will decrease even further. Overall, these results imply that a substantial number of perturbed versions can be provided by users to the miner before their (guessing-probability) privacy can be successfully breached.
The observations in this section also indicate that FRAPP is robust against a potential privacy breach scenario where (a) the information obtained from the users is gathered periodically, (b) the set of users is largely the same, and (c) the data inputs of the users are often the same or very similar to their previous values. Such a scenario can occur, for example, when there is a core user community that regularly updates its subscription to an Internet service, like those found in the health or insurance industries. We therefore opine that FRAPP can be successfully used even in these challenging situations.
6 Randomizing the perturbation matrix
The estimation models discussed thus far implicitly assumed the perturbation matrix
A to be deterministic. However, it appears intuitive that if the perturbation matrix
parameters were themselves randomized, so that each client uses a perturbation
matrix not speciﬁcally known to the miner, the privacy of the client will be further
increased. Of course, it may also happen that the reconstruction accuracy suffers in
this process. We explore this tradeoff, in this section, by replacing the deterministic matrix $A$ with a randomized matrix $\tilde{A}$, where each entry $\tilde{A}_{vu}$ is a random variable with $E(\tilde{A}_{vu}) = A_{vu}$. The values taken by the random variables for a client $C_i$ provide the specific parameter settings for her perturbation matrix.
6.1 Privacy guarantees
Let $Q(U_i)$ be a “property” (as explained in Sect. 3.1) of client $C_i$'s private information, and let record $U_i = u$ be perturbed to $V_i = v$. Denote the prior probability of $Q(U_i)$ by $P(Q(U_i))$. Then, on seeing the perturbed data, the posterior probability of the property is calculated to be:

$$P(Q(U_i) \mid V_i = v) = \sum_{u:\, Q(u)} P_{U_i \mid V_i}(u \mid v) = \sum_{u:\, Q(u)} \frac{P_{U_i}(u)\, P_{V_i \mid U_i}(v \mid u)}{P_{V_i}(v)}$$
When a deterministic perturbation matrix $A$ is used for all clients, then $\forall i\; P_{V_i \mid U_i}(v \mid u) = A_{vu}$, and hence

$$P(Q(U_i) \mid V_i = v) = \frac{\sum_{Q(u)} P_{U_i}(u) A_{vu}}{\sum_{Q(u)} P_{U_i}(u) A_{vu} + \sum_{\neg Q(u)} P_{U_i}(u) A_{vu}}$$
As discussed by Evfimievski et al. (2003), the data distribution $P_{U_i}$ can, in the worst case, be such that $P(U_i = u) > 0$ only if $\{u \in I_U \mid Q(u);\; A_{vu} = \max_{Q(u')} A_{vu'}\}$ or $\{u \in I_U \mid \neg Q(u);\; A_{vu} = \min_{\neg Q(u')} A_{vu'}\}$. For the deterministic gamma-diagonal matrix, $\max_{Q(u')} A_{vu'} = \gamma x$ and $\min_{\neg Q(u')} A_{vu'} = x$, resulting in

$$P(Q(U_i) \mid V_i = v) = \frac{P(Q(u)) \cdot \gamma x}{P(Q(u)) \cdot \gamma x + P(\neg Q(u))\, x}$$

Since the distribution $P_U$ is known through reconstruction, the above posterior probability can be determined by the miner. For example, if $P(Q(u)) = 5\%$, and $\gamma = 19$, the posterior probability works out to 50% for perturbation with the gamma-diagonal matrix.
But, in the randomized matrix case, where $P_{V_i \mid U_i}(v \mid u)$ is a realization of random variable $\tilde{A}$, only its distribution (and not the exact value for a given $i$) is known to the miner. This means that posterior probability computations like the one shown above cannot be made by the miner for a given record $U_i$. To make this concrete, consider a randomized matrix $\tilde{A}$ such that

$$\tilde{A}_{uv} = \begin{cases} \gamma x + r & \text{if } u = v \\ x - \frac{r}{|S_U| - 1} & \text{o.w.} \end{cases} \qquad (19)$$

where $x = \frac{1}{\gamma + |S_U| - 1}$ and $r$ is a random variable uniformly distributed over $[-\alpha, \alpha]$. Here, the worst-case posterior probability (and, hence, the privacy guarantee) for a record $U_i$ is a function of the value of $r$, and is given by

$$\rho_2(r) = P(Q(u) \mid v) = \frac{P(Q(u)) \cdot (\gamma x + r)}{P(Q(u)) \cdot (\gamma x + r) + P(\neg Q(u)) \left(x - \frac{r}{|S_U| - 1}\right)}$$

Therefore, only the posterior probability range, that is, $[\rho_2^-, \rho_2^+] = [\rho_2(-\alpha), \rho_2(+\alpha)]$, and the distribution over this range, can be determined by the miner. For example, for the scenario where $P(Q(u)) = 5\%$, $\gamma = 19$, and $\alpha = \gamma x/2$, the posterior probability lies in the range [33%, 60%], with its probability of being greater than 50% ($\rho_2$ at $r = 0$) equal to its probability of being less than 50%.
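The quoted figures can be reproduced with a short calculation. The sketch below evaluates the worst-case posterior for a given offset $r$; the function name `worst_case_posterior` is hypothetical, and the record domain size of 2000 is an assumption made here because the example in the text does not state one (the result is only weakly sensitive to it when the domain is large).

```python
def worst_case_posterior(prior, gamma, n, r=0.0):
    """Worst-case posterior rho_2(r) for the randomized gamma-diagonal matrix.

    prior : P(Q(u)), the prior probability of the property
    gamma : privacy parameter of the gamma-diagonal matrix
    n     : record domain size |S_U| (an assumed value in this sketch)
    r     : realized random offset; r = 0 gives the deterministic case
    """
    x = 1.0 / (gamma + n - 1)
    diag = gamma * x + r              # probability of keeping the value
    off_diag = x - r / (n - 1)        # probability of moving to another value
    return prior * diag / (prior * diag + (1.0 - prior) * off_diag)


prior, gamma, n = 0.05, 19.0, 2000
alpha = gamma / (gamma + n - 1) / 2   # alpha = gamma * x / 2

deterministic = worst_case_posterior(prior, gamma, n)       # r = 0
lower = worst_case_posterior(prior, gamma, n, r=-alpha)     # rho_2(-alpha)
upper = worst_case_posterior(prior, gamma, n, r=+alpha)     # rho_2(+alpha)
```

Under these assumptions `deterministic` is 0.5, while `lower` and `upper` are approximately 0.33 and 0.60, matching the range stated above.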
6.2 Reconstruction model
With minor modifications, the reconstruction model analysis for the randomized perturbation matrix $\tilde{A}$ can be carried out similar to that carried out earlier in Sect. 3.2 for the deterministic matrix $A$. Specifically, the probability of success for Bernoulli variable $Y_v^i$ is now modified to

$$P(Y_v^i = 1 \mid \tilde{A}_{vu}) = \tilde{A}_{vu}, \quad \text{for } U_i = u$$

and, from Eq. 4,

$$\begin{aligned} E(Y_v \mid \tilde{A}_{vu}) &= \sum_{i=1}^{N} P(Y_v^i = 1 \mid \tilde{A}_{vu}) \\ &= \sum_{u \in I_U} \sum_{\{i \mid U_i = u\}} \tilde{A}_{vu} \\ &= \sum_{u \in I_U} \tilde{A}_{vu} X_u \end{aligned}$$

$$\Rightarrow E(Y \mid \tilde{A}) = \tilde{A} X \qquad (20)$$
leading to

$$E(E(Y \mid \tilde{A})) = AX \qquad (21)$$

We estimate $X$ as $\hat{X}$ given by the solution of the following equation

$$Y = A\hat{X} \qquad (22)$$

which is an approximation to Eq. 21. From Theorem 1, the error in estimation is bounded by:

$$\frac{\|\hat{X} - X\|}{\|X\|} \leq c\,\frac{\|Y - E(E(Y \mid \tilde{A}))\|}{\|E(E(Y \mid \tilde{A}))\|} \qquad (23)$$

where $c$ is the condition number of perturbation matrix $A = E(\tilde{A})$.
We now compare these bounds with the corresponding bounds of the deterministic case. Firstly, note that, due to the use of the randomized matrix, there is a double expectation for $Y$ on the RHS of the inequality, as opposed to the single expectation in the deterministic case. Secondly, only the numerator is different between the two cases since we can easily show that $E(E(Y \mid \tilde{A})) = AX$. The numerator can be bounded by

$$\begin{aligned} \|Y - E(E(Y \mid \tilde{A}))\| &= \|(Y - E(Y \mid \tilde{A})) + (E(Y \mid \tilde{A}) - E(E(Y \mid \tilde{A})))\| \\ &\leq \|Y - E(Y \mid \tilde{A})\| + \|E(Y \mid \tilde{A}) - E(E(Y \mid \tilde{A}))\| \end{aligned}$$

Here, $\|Y - E(Y \mid \tilde{A})\|$ is taken to represent the empirical variance of random variable $Y_v$. Since $Y_v$ is, as discussed before, Poisson-Binomial distributed, its variance is given by (Wang 1993)

$$\text{Var}(Y_v \mid \tilde{A}) = N \bar{p}_v - \sum_i (p_v^i)^2 \qquad (24)$$

where $\bar{p}_v = \frac{1}{N} \sum_i p_v^i$ and $p_v^i = P(Y_v^i = 1 \mid \tilde{A})$.
It is easily seen (by elementary calculus or induction) that among all combinations $\{p_v^i\}$ such that $\sum_i p_v^i = N\bar{p}_v$, the sum $\sum_i (p_v^i)^2$ assumes its minimum value when all $p_v^i$ are equal. It follows that, if the average probability of success $\bar{p}_v$ is kept constant, $\text{Var}(Y_v)$ assumes its maximum value when $p_v^1 = \cdots = p_v^N$. In other words, the variability of $p_v^i$, or its lack of uniformity, decreases the magnitude of chance fluctuations (Feller 1988). By using random matrix $\tilde{A}$ instead of deterministic $A$, we increase the variability of $p_v^i$ (now $p_v^i$ assumes variable values for all $i$), hence decreasing the fluctuation of $Y_v$ from its expectation, as measured by its variance. In short, $\|Y - E(Y \mid \tilde{A})\|$ is likely to be decreased as compared to the deterministic case, thereby reducing the error bound.
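The effect of non-uniform success probabilities on the variance can be seen in a tiny example. The sketch below evaluates the Poisson-Binomial variance formula for two hypothetical sets of four probabilities that share the same mean of 0.3; the numbers and the function name are illustrative assumptions.

```python
def poisson_binomial_variance(probs):
    """Variance of a sum of independent Bernoulli(p_i) variables.

    Computed as N * mean(p) - sum(p_i^2), which equals sum(p_i * (1 - p_i)).
    """
    n = len(probs)
    mean_p = sum(probs) / n
    return n * mean_p - sum(p * p for p in probs)


# Two hypothetical sets of per-record success probabilities, same average 0.3.
uniform_probs = [0.3, 0.3, 0.3, 0.3]
varied_probs = [0.1, 0.2, 0.4, 0.5]

var_uniform = poisson_binomial_variance(uniform_probs)   # 1.2 - 0.36 = 0.84
var_varied = poisson_binomial_variance(varied_probs)     # 1.2 - 0.46 = 0.74
```

The uniform set attains the larger variance (0.84 against 0.74), in line with the argument that spreading the probabilities around a fixed mean reduces chance fluctuations.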
On the other hand, the value of the second term, $\|E(Y \mid \tilde{A}) - E(E(Y \mid \tilde{A}))\|$, which depends upon the variance of the random variables in $\tilde{A}$, is now positive whereas it was 0 in the deterministic case. Thus, the error bound is increased by this term.
Overall, we have a tradeoff situation here, and as shown later in our experiments of Sect. 9, the tradeoff turns out such that the two opposing terms almost cancel each other out, making the error only marginally worse than the deterministic case.
7 Implementation of perturbation algorithm
Having discussed the privacy and accuracy issues of the FRAPP approach, we now turn our attention to the implementation of the perturbation algorithm described in Sect. 3. For this, we effectively need to generate for each $U_i = u$, a discrete distribution with PMF $P(v) = A_{vu}$ and CDF $F(v) = \sum_{i \leq v} A_{iu}$, defined over $v = 1, \ldots, |S_V|$.
A straightforward algorithm for generating the perturbed record $v$ from the original record $u$ is the following:
(1) Generate $r \sim U(0, 1)$
(2) Repeat for $v = 1, \ldots, |S_V|$:
    if $F(v - 1) < r \leq F(v)$
        return $V_i = v$
where $U(0, 1)$ denotes the uniform continuous distribution over [0, 1].
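The two steps above amount to a linear scan of the CDF. A minimal Python sketch follows, using 0-based indices and a gamma-diagonal column as the example distribution; the function names, the domain size of 5, and $\gamma = 19$ are illustrative assumptions.

```python
import random


def perturb_by_cdf_scan(column, r):
    """Return the index v (0-based) with F(v-1) < r <= F(v).

    column : list of probabilities A[v][u] for the fixed original value u
    r      : a draw from U(0, 1)
    """
    cumulative = 0.0
    for v, prob in enumerate(column):
        cumulative += prob
        if r <= cumulative:
            return v
    return len(column) - 1    # guard against floating-point round-off


def gamma_diagonal_column(u, n, gamma):
    """Column u of the n x n gamma-diagonal matrix."""
    x = 1.0 / (gamma + n - 1)
    return [gamma * x if v == u else x for v in range(n)]


rng = random.Random(0)        # fixed seed, only to make the sketch repeatable
column = gamma_diagonal_column(u=2, n=5, gamma=19.0)
perturbed = perturb_by_cdf_scan(column, rng.random())
```

The scan touches, on average, about half of the entries of the column, which is why its cost grows with the size of the full record domain.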
This algorithm, whose complexity is proportional to the product of the cardinalities of the attribute domains, will require $|S_V|/2$ iterations on average, which can turn out to be very large. For example, with 31 attributes, each with two categories, this amounts to $2^{30}$ iterations per customer! We therefore present below an alternative algorithm whose complexity is proportional to the sum of the cardinalities of the attribute domains.
Specifically, to perturb record $U_i = u$, we can write

$$\begin{aligned} P(V_i; U_i = u) &= P(V_{i1}, \ldots, V_{iM}; u) \\ &= P(V_{i1}; u) \cdot P(V_{i2} \mid V_{i1}; u) \cdots P(V_{iM} \mid V_{i1}, \ldots, V_{i(M-1)}; u) \end{aligned}$$

where $V_{ij}$ denotes the $j$th attribute of record $V_i$. For the perturbation matrix $A$, this works out to

$$P(V_{i1} = a; u) = \sum_{\{v \mid v(1) = a\}} A_{vu}$$

$$P(V_{i2} = b \mid V_{i1} = a; u) = \frac{P(V_{i2} = b, V_{i1} = a; u)}{P(V_{i1} = a; u)} = \frac{\sum_{\{v \mid v(1) = a \text{ and } v(2) = b\}} A_{vu}}{P(V_{i1} = a; u)}$$

... and so on

where $v(i)$ denotes the value of the $i$th attribute for the record with value $v$.
When $A$ is chosen to be the gamma-diagonal matrix, and $n_j$ is used to represent $\prod_{k=1}^{j} |S_U^k|$, we get the following expressions for the above probabilities after some simple algebraic manipulations:

$$\begin{aligned} P(V_{i1} = b; U_{i1} = b) &= \left(\gamma + \frac{n_M}{n_1} - 1\right) x \\ P(V_{i1} = b; U_{i1} \neq b) &= \frac{n_M}{n_1}\, x \end{aligned} \qquad (25)$$

and for the $j$th attribute

$$\begin{aligned} P(V_{ij} = b \mid V_{i1}, \ldots, V_{i(j-1)}; U_{ij} = b) &= \begin{cases} \dfrac{\left(\gamma + \frac{n_M}{n_j} - 1\right) x}{\prod_{k=1}^{j-1} p_k} & \text{if } \forall k < j,\; V_{ik} = U_{ik} \\[2ex] \dfrac{\left(\frac{n_M}{n_j}\right) x}{\prod_{k=1}^{j-1} p_k} & \text{o.w.} \end{cases} \\ P(V_{ij} = b \mid V_{i1}, \ldots, V_{i(j-1)}; U_{ij} \neq b) &= \dfrac{\left(\frac{n_M}{n_j}\right) x}{\prod_{k=1}^{j-1} p_k} \end{aligned} \qquad (26)$$

where $p_k$ is the probability that $V_{ik}$ takes value $a$, given that $a$ is the outcome of the random process performed for the $k$th attribute, i.e. $p_k = P(V_{ik} = a \mid V_{i1}, \ldots, V_{i(k-1)}; U_i)$.
The above perturbation algorithm takes M steps, one for each attribute. For the
ﬁrst attribute, the probability distribution of the perturbed value depends only on the
original value for the attribute and is given by Eq. 25. For any subsequent column j , to
achieve the desired random perturbation, we use as input both its original value and the
perturbed values of the previous j −1 columns, and then generate the perturbed value
for j as per the discrete distribution given in Eq. 26. This is an example of dependent
column perturbation, in contrast to the independent column perturbations used in most
of the prior techniques.
Note that even though the perturbation of a column depends on the perturbed values
of previous columns, the columns can be perturbed in any order. Speciﬁcally, the
probability distribution for each column perturbation, as given by Eqs. 25 and 26, gets
modiﬁed accordingly so that the overall distribution for record perturbation remains
the same.
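The attribute-by-attribute procedure can be sketched in Python as follows, for the gamma-diagonal matrix only. This is an illustrative reading of Eqs. 25 and 26 rather than the authors' implementation: the function name `perturb_record`, its parameters, and the choice to return the probability of the generated record alongside it are all assumptions made for this example.

```python
import random


def perturb_record(u, sizes, gamma, rng):
    """Perturb record u one attribute at a time (gamma-diagonal matrix).

    u     : tuple of original attribute values, with u[j] in range(sizes[j])
    sizes : list of attribute domain sizes
    gamma : privacy parameter of the gamma-diagonal matrix
    rng   : a random.Random instance
    Returns (v, prob), where prob is the probability of producing v from u.
    """
    n_M = 1
    for s in sizes:
        n_M *= s
    x = 1.0 / (gamma + n_M - 1)

    v = []
    prefix_prob = 1.0      # product of p_k over the attributes drawn so far
    all_match = True       # True while every drawn attribute equals u's
    n_j = 1
    for j, s in enumerate(sizes):
        n_j *= s
        rest = n_M / n_j   # number of full records sharing a given prefix
        # Joint probability of the drawn prefix extended by each candidate b.
        joint = [((gamma + rest - 1) if (all_match and b == u[j]) else rest) * x
                 for b in range(s)]
        # Conditional distribution of attribute j given the drawn prefix.
        cond = [p / prefix_prob for p in joint]
        b = rng.choices(range(s), weights=cond)[0]
        v.append(b)
        prefix_prob = joint[b]
        all_match = all_match and (b == u[j])
    return tuple(v), prefix_prob


rng = random.Random(0)     # fixed seed, only to make the sketch repeatable
v, prob = perturb_record((1, 2, 0), [2, 3, 2], 19.0, rng)
```

Each step inspects only the domain of one attribute, so the work per record is proportional to the sum of the domain sizes. The returned probability equals $\gamma x$ when the record is unchanged and $x$ otherwise, so the overall record distribution matches the corresponding column of the gamma-diagonal matrix.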
Finally, to assess the complexity of the algorithm, it is easy to see that the maximum number of iterations for generating the $j$th discrete distribution is $|S_U^j|$, and hence the maximum number of iterations for generating a perturbed record is $\sum_j |S_U^j|$.
Remark The scheme presented above gives a general approach to ensure that the complexity is proportional to the sum of attribute cardinalities, for any choice of perturbation matrix. However, specifically for the gamma-diagonal matrix, a simpler algorithm could be used. Namely, with probability $x(\gamma - 1)$ return the original tuple, otherwise choose the value of each attribute in the perturbed tuple uniformly and independently.⁴ In this special case, the algorithm is a generalization of Warner's classical randomized response technique (Warner 1965).
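The simplified procedure of the Remark can be sketched as below. The function names and the example domain sizes are hypothetical; `keep_probability` is included only to confirm that the procedure reproduces the diagonal entry $\gamma x$ of the gamma-diagonal matrix.

```python
import random


def perturb_gamma_diagonal(u, sizes, gamma, rng):
    """Simplified perturbation for the gamma-diagonal matrix."""
    n = 1
    for s in sizes:
        n *= s
    x = 1.0 / (gamma + n - 1)
    if rng.random() < x * (gamma - 1):
        return tuple(u)                       # keep the original tuple
    # Otherwise draw every attribute uniformly and independently.
    return tuple(rng.randrange(s) for s in sizes)


def keep_probability(sizes, gamma):
    """Overall P(v = u): explicit keep, plus the uniform draw hitting u."""
    n = 1
    for s in sizes:
        n *= s
    x = 1.0 / (gamma + n - 1)
    keep = x * (gamma - 1)
    return keep + (1.0 - keep) / n


sample = perturb_gamma_diagonal((1, 2, 0), [2, 3, 2], 19.0, random.Random(0))
p_same = keep_probability([2, 3, 2], 19.0)    # equals gamma * x = 19 / 30
```

The uniform branch is taken with probability $1 - x(\gamma - 1) = nx$, where $n$ is the record domain size, so every specific record other than the original is produced with probability $x$, and the original with probability $x(\gamma - 1) + x = \gamma x$.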
8 Application to mining tasks
To illustrate the utility of the FRAPP framework, we demonstrate in this section how it can be integrated in two representative mining tasks, namely association rule mining, which identifies interesting correlations between database attributes (Agrawal and Srikant 1994), and classification rule mining, which produces class labeling rules for data records based on an initial training set (Mitchell 1997).
8.1 Association rule mining
The core computation in association rule mining is to identify “frequent itemsets”, that is, itemsets whose support (i.e. frequency) in the database is in excess of a user-specified threshold $sup_{min}$. Eq. 8 can be directly used to estimate the support of itemsets containing all $M$ categorical attributes. However, in order to incorporate the reconstruction procedure into bottom-up association rule mining algorithms such as Apriori (Agrawal and Srikant 1994), we need to also be able to estimate the supports of itemsets consisting of only a subset of attributes; this procedure is described next.
Let $C$ denote the set of all attributes in the database, and $C_s$ be a subset of these attributes. Each of the attributes $j \in C_s$ can assume one of the $|S_U^j|$ values. Thus, the number of itemsets over attributes in $C_s$ is given by $I_{C_s} = \prod_{j \in C_s} |S_U^j|$. Let $L$, $H$ denote generic itemsets over this subset of attributes.
A user record supports the itemset $L$ if the attributes in $C_s$ take the values represented by $L$. Let the support of $L$ in the original and distorted databases be denoted by $sup_L^U$ and $sup_L^V$, respectively. Then,

$$sup_L^V = \frac{1}{N} \sum_{v \text{ supports } L} Y_v$$

where $Y_v$ denotes the number of records in $V$ with value $v$ (refer Sect. 3.2). From Eq. 7, we know

$$Y_v = \sum_{u \in I_U} A_{vu} \hat{X}_u$$

and therefore, using the fact that $A$ is symmetric,

$$sup_L^V = \frac{1}{N} \sum_{v \text{ supports } L} \sum_u A_{vu} \hat{X}_u = \frac{1}{N} \sum_u \hat{X}_u \sum_{v \text{ supports } L} A_{vu}$$

Grouping the records $u$ by the itemsets $H$ that they support:

$$sup_L^V = \frac{1}{N} \sum_H \sum_{u \text{ supports } H} \hat{X}_u \sum_{v \text{ supports } L} A_{vu} \qquad (27)$$

⁴ Note that the notion of independence is with regard to the perturbation process, not the data distributions of the attributes.
Analyzing the term $\sum_{v \text{ supports } L} A_{vu}$ in the above equation, we see that it represents the sum of the entries of column $u$ in $A$ over rows $v$ that support itemset $L$. Now, consider the columns $u$ that support a given itemset $H$. Note that due to the structure of the gamma-diagonal matrix $A$, if $H = L$, then one diagonal entry is part of this sum, otherwise the summation involves only non-diagonal terms. Therefore, for all $u$ that support a given itemset $H$:

$$\sum_{v \text{ supports } L} A_{vu} = \begin{cases} \gamma x + \left(\frac{I_C}{I_{C_s}} - 1\right) x & \text{if } H = L \\ \frac{I_C}{I_{C_s}}\, x & \text{o.w.} \end{cases} \;:=\; A_{HL} \qquad (28)$$

i.e. the probability of an itemset remaining the same after perturbation is $\frac{\gamma + I_C/I_{C_s} - 1}{I_C/I_{C_s}}$ times the probability of it being distorted to any other itemset.
Substituting in Eq. 27:

$$sup_L^V = \frac{1}{N} \sum_H A_{HL} \sum_{u \text{ supports } H} \hat{X}_u = \sum_H A_{HL}\, sup_H^U$$

Thus, we can estimate the supports of itemsets over any subset $C_s$ of attributes using the matrix of entries $A_{HL}$, which is of much smaller dimension ($I_{C_s} \times I_{C_s}$) for small itemsets as compared to the original full matrix $A$.
A legitimate concern here might be that the matrix inversion could become time-consuming as we proceed to larger itemsets, making $I_{C_s}$ large. Fortunately, the inverse for this matrix has a simple closed-form expression:

Theorem 5 The inverse of $A$ is a matrix of order $n = I_{C_s}$ of the form $B = \{B_{ij} : 1 \leq i \leq n, 1 \leq j \leq n\}$, where

$$B_{ij} = \begin{cases} \delta y & \text{if } i = j \\ y & \text{o.w.} \end{cases} \qquad \text{with } \delta = -(\gamma + n - 2) \text{ and } y = -\frac{I_{C_s}}{I_C} \cdot \frac{1}{(\gamma - 1)}$$
A framework for highaccuracy privacypreserving mining 127
Proof As both A and B are square matrices of the same order, AB and BA are valid
products. Also it can be trivially seen (by actual multiplication) that AB = BA = I,
where I is the identity matrix of order I
C
s
.
The above closed-form inverse can be directly used in the reconstruction process, greatly reducing both space and time resources. Specifically, the reconstruction algorithm can now be very simply written as:

for each $L$ from 1 to $n$ do
    $sup_L^U = sup_L^V\, \delta y + (N - sup_L^V)\, y$    (29)

where $N$ is the database cardinality, $sup_L^V$ and $sup_L^U$ are the perturbed and reconstructed frequencies, respectively, and $n$ is the size of the index set, which is $I_{C_s}$ for a subset of attributes and $I_C$ for full-length itemsets.
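The reconstruction loop can be sketched as follows for the full-length case only, i.e. assuming $C_s = C$ so that the factor $I_{C_s}/I_C$ in $y$ equals 1 and $n = I_C$. The supports are handled as absolute counts that sum to $N$. The function names and the example counts are hypothetical.

```python
def reconstruct_counts(perturbed_counts, gamma):
    """Apply Eq. 29 to full-length itemset counts (assumes I_Cs = I_C)."""
    n = len(perturbed_counts)
    total = sum(perturbed_counts)          # database cardinality N
    delta = -(gamma + n - 2)
    y = -1.0 / (gamma - 1)                 # the factor I_Cs / I_C equals 1 here
    return [c * delta * y + (total - c) * y for c in perturbed_counts]


def expected_perturbed_counts(original_counts, gamma):
    """E(Y) = A X for the gamma-diagonal matrix, without building A."""
    n = len(original_counts)
    total = sum(original_counts)
    x = 1.0 / (gamma + n - 1)
    # Row v of A X: gamma * x * X_v + x * (N - X_v)
    return [gamma * x * c + x * (total - c) for c in original_counts]


original = [600.0, 250.0, 100.0, 50.0]     # hypothetical true counts
perturbed = expected_perturbed_counts(original, gamma=19.0)
recovered = reconstruct_counts(perturbed, gamma=19.0)
```

When the perturbed counts equal their expected values, `recovered` coincides with `original`, and each reconstructed count costs a constant number of arithmetic operations instead of a full matrix inversion.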
Thus we can efficiently reconstruct the counts of itemsets over any subset of attributes without needing to construct the counts of complete records, and our scheme can be implemented efficiently on bottom-up association rule mining algorithms such as Apriori (Agrawal and Srikant 1994). Further, it is trivially easy to incorporate FRAPP even in incremental association rule mining algorithms such as DELTA (Pudi and Haritsa 2000) which operate periodically on changing historical databases, and use the results of previous mining operations to minimize the amount of work carried out during each new mining operation.
8.2 Classiﬁcation rule mining
We now turn our attention to the task of classification rule mining. The primary input required for this process is the distribution of attribute values for each class in the training data. This input can be produced through the “ByClass” privacy-preserving algorithm enunciated by Agrawal and Srikant (2000), which partitions the training data by class label, and then separately distorts and reconstructs the distributions for the records corresponding to each class. After this reconstruction, an off-the-shelf classifier can be used to produce the actual classification rules. However, a complication that may arise in the privacy-preserving environment is that of negative reconstructed frequencies, described next.
8.2.1 Negative reconstructed frequencies
During the reconstruction process, the expressions given in Eq. 29 may sometimes
produce negative reconstructed frequencies. This is because, given
a large index set, it is possible that several indices may have little or no
representation at all (i.e. low $sup^V_L$), even after perturbation of the dataset. While this occurs
for association rule mining too, it is not a problem there because such itemsets are
automatically pruned due to the minimum support criterion. In the case of
classification, however, negative frequencies pose difficulties because (a) they lack meaningful
interpretation, and (b) classification techniques based on calculating logarithms of
the itemset frequencies, such as decision tree classifiers (Quinlan 1993), now become
infeasible.
To address this problem, we first set all negative reconstructed frequencies to zero
and then uniformly scale down the positive frequencies such that their sum remains
equal to the original dataset size. The rationale is that records corresponding to negative
frequencies are scarce in the original dataset (i.e. "outliers") and can therefore be
ignored without significant loss of accuracy. Further, the scaling down of the positive
frequencies is consistent with rule generation, since classification techniques are based
on relative frequencies or distributions, rather than absolute frequencies.
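The clip-and-rescale step can be sketched as follows. This is an illustration only: the
function name and the sample frequencies are hypothetical, and the sketch assumes that the
reconstructed frequencies sum to the dataset size before clipping.

```python
def clean_reconstructed(freqs, dataset_size):
    """Zero out negative frequencies, then rescale so the total is preserved."""
    clipped = [max(f, 0.0) for f in freqs]
    positive_total = sum(clipped)
    if positive_total == 0:
        raise ValueError("no positive reconstructed frequencies to rescale")
    # Clipping removes negative mass, so the positive part exceeds the
    # dataset size and the scale factor is at most 1.
    scale = dataset_size / positive_total
    return [f * scale for f in clipped]


# Hypothetical reconstructed frequencies for four index values (sum = 1000).
cleaned = clean_reconstructed([620.0, 410.0, -30.0, 0.0], dataset_size=1000)
```

Because every positive entry is multiplied by the same factor, the relative frequencies
among the surviving entries are unchanged, which is the property the classifier relies on.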
9 Performance evaluation
We move on, in this section, to quantitatively assessing the utility of the FRAPP
approach with respect to the privacy and accuracy levels that it can provide for
association rule mining and classification rule mining.
9.1 Association rule mining
9.1.1 Datasets
Two datasets, CENSUS and HEALTH, are used in our experiments, both of which are
derived from real-world repositories. Since it has been established in several sociological
studies (e.g. Cranor et al. 1999; Westin 1999) that users typically expect privacy on
only a few of the database fields (usually sensitive attributes such as health, income,
etc.), our datasets also project out a representative subset of the columns in the original
databases. The complete details of the datasets are given below:
CENSUS This dataset contains census information for about 50,000 adult American
citizens, and is available from the UCI repository
http://www.ics.uci.edu/~mlearn/mlsummary.html. We used three categorical
(native-country, sex, race) attributes and three continuous
(age, fnlwgt, hours-per-week) attributes
from the census database in our experiments, with the continuous attributes
partitioned into discrete intervals to convert them into categorical attributes. The specific
categories used for these six attributes are listed in Table 1.
Table 1 CENSUS dataset
Attribute         Categories
Race              White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black
Sex               Female, Male
Native-country    United-States, Other
Age               [15−35), [35−55), [55−75), ≥75
Fnlwgt            [0−1e5), [1e5−2e5), [2e5−3e5), [3e5−4e5), ≥4e5
Hours-per-week    [0−20), [20−40), [40−60), [60−80), ≥80
Table 2 HEALTH dataset
Attribute                                 Categories
INCFAM20 (family income)                  Less than $20,000; $20,000 or more
HEALTH (health status)                    Excellent; Very good; Good; Fair; Poor
SEX (sex)                                 Male; Female
PHONE (has telephone)                     Yes, phone number given; Yes, no phone number given; No
AGE (age)                                 [0−20), [20−40), [40−60), [60−80), ≥80
BDDAY12 (bed days in past 12 months)      [0−7), [7−15), [15−30), [30−60), ≥60
DV12 (doctor visits in past 12 months)    [0−7), [7−15), [15−30), [30−60), ≥60
Table 3 Frequent itemsets for $sup_{min}$ = 0.02
           Itemset length
             1     2     3     4     5     6     7
CENSUS      19   102   203   165    64    10     –
HEALTH      23   123   292   361   250    86    12
HEALTH This dataset captures health information for over 100,000 patients
collected by the US government (http://dataferrett.census.gov). We selected 4 categorical
and 3 continuous attributes from the dataset for our experiments. These attributes and
their categories are listed in Table 2.
The association rule mining accuracy of our schemes on these datasets was evaluated
for a user-specified minimum support of $sup_{min}$ = 2%. Table 3 gives the number of
frequent itemsets in the datasets for this support threshold, as a function of the itemset
length.
9.1.2 Multiple versions
In Sect. 5, Eq. 16 gave the number of data records required to keep the relative inaccuracy
below a specified bound with at least a specified probability. For a bound of 0.001 and a
probability of 0.95, this turned out to be $N \ge 2 \times 10^6$. Note that we need to
consider small values of the inaccuracy bound, since the error it allows will be further
amplified by the condition number, as indicated by Eq. 9 for the relative error in
reconstructed counts.
Since the datasets available to us were much smaller than the desired N, we resorted
to scaling each dataset by a factor of 50 to cross the size threshold, by providing multiple
distortions of each user record. As per the discussion in Sect. 5.1, such scaling does
not result in any additional privacy breach if the miner has no knowledge of the sibling
identities. Further, even when the miner does possess this knowledge, 50 versions was
shown to retain an acceptable privacy level under the modified (guessing-probability)
privacy definition. A useful side-effect of the dataset scaling is that it also ensures that
our results are applicable to large disk-resident databases.
9.1.3 Performance metrics
We measure the performance of the system with regard to the accuracy that can be
provided for a given privacy requirement speciﬁed by the user.
Privacy The $(\rho_1, \rho_2)$ strict privacy measure from Evfimievski et al. (2003) is used as
the privacy metric. We experimented with a variety of privacy settings; for example,
varying $\rho_2$ from 30% to 50% while keeping $\rho_1$ fixed at 5%, resulting in γ values
ranging from 9 to 19. The value of $\rho_1$ is representative of the fact that users typically
want to hide uncommon values which set them apart from the rest, while a $\rho_2$ value
of 50% indicates that the user can still plausibly deny any value attributed to him or
her since it is equivalent to a random coin-toss attribution.
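The γ values quoted in this section can be checked with a short calculation. The sketch
below assumes the amplification limit of Evfimievski et al. (2003), restated here as
$\gamma = \rho_2 (1 - \rho_1) / (\rho_1 (1 - \rho_2))$; the function name is illustrative.

```python
def gamma_bound(rho1, rho2):
    """Limiting gamma for a (rho1, rho2) privacy requirement (assumed formula)."""
    return rho2 * (1.0 - rho1) / (rho1 * (1.0 - rho2))


# The three (rho1, rho2) settings used in the experiments.
settings = [(0.05, 0.50), (0.05, 0.41), (0.05, 0.32)]
gammas = [gamma_bound(r1, r2) for r1, r2 in settings]
```

Under this assumption the three settings give values close to the γ = 19, 13.28 and 9
reported for Figs. 2 to 5, the small differences being due to rounding of $\rho_2$.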
Accuracy We evaluate two kinds of mining errors, Support Error and Identity Error,
in our experiments. The Support Error (µ) metric reflects the average relative error
(in percent) of the reconstructed support values for those itemsets that are correctly
identified to be frequent. Denoting the number of frequent itemsets by $|F|$, the
reconstructed support by $\widehat{sup}$ and the actual support by $sup$, the support error
is computed over all frequent itemsets as
$$\mu = \frac{1}{|F|} \sum_{f \in F} \frac{|\widehat{sup}_f - sup_f|}{sup_f} \times 100$$
The Identity Error (σ) metric, on the other hand, reflects the percentage error in
identifying frequent itemsets and has two components: $\sigma^+$, indicating the percentage
of false positives, and $\sigma^-$, indicating the percentage of false negatives. Denoting the
reconstructed set of frequent itemsets by R and the correct set of frequent itemsets
by F, these metrics are computed as
$$\sigma^+ = \frac{|R - F|}{|F|} \times 100 \qquad \sigma^- = \frac{|F - R|}{|F|} \times 100$$
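Both metrics can be computed directly from the two mining outputs. The sketch below is
illustrative: the function names, the dictionary layout (itemset mapped to support) and
the toy supports are assumptions, and the support error is taken literally over every
itemset in F.

```python
def support_error(actual_sup, recon_sup):
    """mu: mean relative support error (percent) over the itemsets in F."""
    errors = [abs(recon_sup[f] - s) / s for f, s in actual_sup.items()]
    return 100.0 * sum(errors) / len(errors)


def identity_errors(actual_frequent, recon_frequent):
    """(sigma_plus, sigma_minus): false positives and negatives, in percent."""
    F, R = set(actual_frequent), set(recon_frequent)
    sigma_plus = 100.0 * len(R - F) / len(F)
    sigma_minus = 100.0 * len(F - R) / len(F)
    return sigma_plus, sigma_minus


# Toy example: supports of the truly frequent itemsets and their estimates.
actual_sup = {("a",): 0.10, ("b",): 0.05, ("a", "b"): 0.04}
recon_sup = {("a",): 0.11, ("b",): 0.045, ("a", "b"): 0.03}
mu = support_error(actual_sup, recon_sup)

# The reconstruction misses ("a", "b") and wrongly reports ("c",).
sigma_plus, sigma_minus = identity_errors(actual_sup, [("a",), ("b",), ("c",)])
```

Note that both identity errors are normalized by $|F|$, so $\sigma^+$ can exceed 100% when
the reconstruction reports many spurious itemsets.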
9.1.4 Perturbation mechanisms
We present the results for FRAPP and representative prior techniques. For all the
perturbation mechanisms, the mining of the distorted database was done using the
Apriori (Agrawal and Srikant 1994) algorithm, with an additional support
reconstruction phase at the end of each pass to recover the original supports from the perturbed
database supports computed during the pass (Agrawal et al. 2004; Rizvi and Haritsa
2002).
Specifically, the perturbation mechanisms evaluated in our study are the following:
DET-GD This scheme uses the deterministic gamma-diagonal perturbation matrix
A (Sect. 4) for perturbation and reconstruction. The perturbation was implemented
using the techniques described in Sect. 7, and the equations of Sect. 8.1 were employed
to construct the perturbation matrix used in each pass of Apriori.
RAN-GD This scheme uses the randomized gamma-diagonal perturbation matrix
$\tilde{A}$ (Sect. 6) for perturbation and reconstruction. Though, in principle, any distribution
can be used for $\tilde{A}$, here we evaluate the performance of uniformly distributed $\tilde{A}$
(as given by Eq. 19) over the entire range of the α randomization parameter (0 to γx).
MASK This is the perturbation scheme proposed in Rizvi and Haritsa (2002),
intended for boolean databases and characterized by a single parameter 1 − p, which
determines the probability of an attribute value being flipped. In our scenario, the
categorical attributes are mapped to boolean attributes by making each value of the
category an attribute. Thus, the M categorical attributes map to
$M_b = \sum_j |S_U^j|$ boolean attributes.
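The categorical-to-boolean mapping can be illustrated with a small sketch. It uses three
of the CENSUS attributes from Table 1; the function name, the dictionary layout and the
sample record are hypothetical.

```python
# Domains of three CENSUS attributes (Table 1); dict order fixes the bit layout.
DOMAINS = {
    "race": ["White", "Asian-Pac-Islander", "Amer-Indian-Eskimo", "Other", "Black"],
    "sex": ["Female", "Male"],
    "native-country": ["United-States", "Other"],
}


def to_boolean(record, domains):
    """Map a categorical record to one boolean attribute per category value."""
    bits = []
    for attribute, values in domains.items():
        bits.extend(1 if record[attribute] == value else 0 for value in values)
    return bits


record = {"race": "Black", "sex": "Female", "native-country": "United-States"}
bits = to_boolean(record, DOMAINS)

num_boolean = sum(len(values) for values in DOMAINS.values())  # M_b
num_categorical = len(DOMAINS)                                 # M
```

Here M = 3 and $M_b$ = 5 + 2 + 2 = 9, and the boolean record always contains exactly M
ones, one per categorical attribute.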
The flipping probability 1 − p was chosen as the lowest value which could satisfy the
privacy constraints given by Eq. 2. The constraint
$\forall v: \forall u_1, u_2: \frac{A_{vu_1}}{A_{vu_2}} \le \gamma$ is satisfied
for MASK (Rizvi and Haritsa 2002) if $\frac{p^{M_b}}{(1-p)^{M_b}} \le \gamma$. But, for each
categorical attribute, one and only one of its associated boolean attributes takes value 1
in a particular record. Therefore, all the records contain exactly M 1s, and any two
records differ in at most 2M bit positions. Hence the ratio of
two entries in the matrix cannot be greater than $\frac{p^{2M}}{(1-p)^{2M}}$, and the
following condition is sufficient for the privacy constraints to be satisfied:
$$\frac{p^{2M}}{(1-p)^{2M}} \le \gamma$$
The above equation was used to determine the appropriate value of p. For γ = 19
(corresponding to $(\rho_1, \rho_2)$ = (5%, 50%)), this value turned out to be 0.439 and 0.448
for the CENSUS and HEALTH datasets, respectively.
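The sufficient condition can be solved in closed form. The sketch below assumes
p ≥ 0.5, so that p/(1 − p) ≥ 1, and returns the smallest flipping probability 1 − p that
still meets the condition; the function name is illustrative, and M is taken as 6 for
CENSUS and 7 for HEALTH (the attribute counts of Tables 1 and 2).

```python
def mask_flip_probability(gamma, num_attributes):
    """Smallest 1 - p with (p / (1 - p)) ** (2 * M) <= gamma, assuming p >= 0.5."""
    # Largest admissible odds ratio p / (1 - p).
    ratio = gamma ** (1.0 / (2 * num_attributes))
    p = ratio / (1.0 + ratio)
    return 1.0 - p


flip_census = mask_flip_probability(19.0, 6)  # about 0.439
flip_health = mask_flip_probability(19.0, 7)  # about 0.448
```

Under these assumptions the boundary flipping probabilities agree with the values 0.439
and 0.448 reported above.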
C&P This is the Cut-and-Paste perturbation scheme proposed by Evfimievski et al.
(2002), with algorithmic parameters K and ξ. To choose K, we varied K from 0
to M, and for each K, ξ was chosen such that the matrix (Eq. 11) satisfies the
privacy constraints (Eq. 2). The results reported here are for the (K, ξ) combination
giving the best mining accuracy, which for γ = 19, turned out to be K = 3 and
ξ = 0.494.
9.1.5 Experimental results
For the CENSUS dataset, the support (µ) and identity ($\sigma^-$, $\sigma^+$) errors of the four
perturbation mechanisms (DET-GD, RAN-GD, MASK, C&P) for γ = 19 are shown
in Fig. 2, as a function of the length of the frequent itemsets (the performance of
RAN-GD is shown for randomization parameter α = γx/2). The corresponding graphs for
the HEALTH dataset are shown in Fig. 3. Note that the support error (µ) graphs are
plotted on a log scale. The detailed results are presented here for a representative
privacy requirement of $(\rho_1, \rho_2)$ = (5%, 50%), which was also used by Evfimievski
et al. (2003), and results in γ = 19. Similar performance trends were observed for the
other practical values of γ, with the results for γ = 13.28 and γ = 9 on the CENSUS
dataset shown in Figs. 4 and 5, respectively.
In these figures, we first note that DET-GD performs, on an absolute scale, extremely
well, the error being of the order of 10% for the longer itemsets. Further, its
performance is visibly better than that of MASK and C&P. In fact, as the length of
the frequent itemset increases, the performance of both MASK and C&P degrades
drastically. Specifically, MASK is not able to find any itemsets of length above 4 for
the CENSUS dataset, and above 5 for the HEALTH dataset, while C&P could not
identify itemsets beyond length 3 in both datasets.
Fig. 2 CENSUS, γ = 19. a Support error µ, b false negatives $\sigma^-$, c false positives $\sigma^+$
Fig. 3 HEALTH, γ = 19. a Support error µ, b false negatives $\sigma^-$, c false positives $\sigma^+$
Fig. 4 Results for γ = 13.28, $(\rho_1, \rho_2)$ = (5%, 41%) on CENSUS. a Support error µ, b false negatives $\sigma^-$, c false positives $\sigma^+$
Fig. 5 Results for γ = 9, $(\rho_1, \rho_2)$ = (5%, 32%) on CENSUS. a Support error µ, b false negatives $\sigma^-$, c false positives $\sigma^+$
Fig. 6 Varying randomization of perturbation matrix (γ = 19). a Posterior probability, b support error µ (HEALTH), c support error µ (CENSUS)
The second point to note is that the accuracy of RAN-GD, although employing
a randomized matrix, is only marginally worse than that of DET-GD. In return, it
provides a substantial increase in privacy: its worst-case (determinable) privacy
breach is only 33% as compared to 50% with DET-GD. Figure 6a shows the
performance of RAN-GD over the entire range of α with respect to the posterior
probability range $[\rho_2^-, \rho_2^+]$. The mining support reconstruction errors for itemsets of length
4 are shown in Fig. 6b and c for the CENSUS and HEALTH datasets, respectively.
We observe that the performance of RAN-GD does not deviate much from the
deterministic case over the entire range, whereas very low determinable posterior probability
is obtained for higher values of α.
Role of condition numbers The primary reason for the good performance of DET-GD and
RAN-GD is the low condition numbers of their perturbation matrices. This is
quantitatively shown in Fig. 7, which plots these condition numbers on a log scale (the
condition numbers of DET-GD and RAN-GD are identical in this graph because
$E(\tilde{A}) = A$). Note that the condition numbers are not only low but also independent
of the frequent itemset length (algebraic computation of condition numbers is shown
in the Appendix).
In marked contrast, the condition numbers for MASK and C&P increase
exponentially with increasing itemset length, resulting in drastic degradation in accuracy.
Thus, our choice of a gamma-diagonal matrix indicates highly promising results for
discovery of long patterns.
Computational overheads Finally, with regard to actual mining response times,
FRAPP on the perturbed database takes about the same time for the complete mining
process as Apriori on the original database. This is because, as mentioned before,
the reconstruction component shows up only in between mining passes and involves
very simple computations (see Eq. 29). Further, the initial perturbation step took only
a very modest amount of time even on vanilla PC hardware. Specifically, on a P-IV
2.0 GHz PC with 1 GB RAM and 40 GB hard disk, perturbing 2.5 million records of
CENSUS took about a minute, while 5 million records of HEALTH were distorted in
a little over 2 min.
Fig. 7 Perturbation matrix condition numbers (γ = 19). a CENSUS, b HEALTH
[Plots not reproduced. x-axis: frequent itemset length (2 to 6 for CENSUS, 2 to 7 for
HEALTH); y-axis: condition number, log scale; curves: DET-GD and RAN-GD (coincident),
MASK, C&P.]
9.2 Classiﬁcation rule mining
We now turn our attention to assessing the performance of FRAPP in the context of
classiﬁcation rule mining.
9.2.1 Experimental setup
The US Census dataset mentioned earlier was used in our experiments, out of which
about 75% of the records were used for training and the remaining as test data. The
attributes used in the experiment are given in Table 4, among which salary was chosen
as the Class Label attribute. The classifier used is the highly popular public-domain
C4.5 decision tree classifier (Quinlan 1993), specifically the one available at
http://www.cs.waikato.ac.nz/ml/weka.
Table 4 US CENSUS dataset for classification
Attribute             Categories
Native-country        United-States, Other
Salary                Less than or equal to $50,000, Greater than $50,000
Age                   [15−35), [35−55), [55−75), ≥75
Type-of-employment    Private, Self Employment not Inc, Self Employment Inc, Federal Government,
                      Local Government, State Government, Without pay, Never worked
Hours-per-week        [0−20), [20−40), [40−60), [60−80), ≥80
Table 5 Classification accuracy
Mining technique    Correct labeling (%)    Incorrect labeling (%)
FRAPP               72.88                   27.12
DIRECT              75.34                   24.66
BOTH                71.34                   23.12
9.2.2 Experimental results
We choose a privacy level of γ = 19, corresponding to a maximum privacy breach
of 50%. With this privacy setting, the accuracy results for FRAPP-based
privacy-preserving classification are shown in Table 5, which also provides the corresponding
accuracies for direct classification on the original database, representing, in a sense, the
"best case". We see here that the FRAPP accuracies are quite comparable to DIRECT,
indicating that there is little cost associated with supporting the privacy functionality.
Finally, the last line (BOTH) in Table 5 shows the proportion of cases where FRAPP
and DIRECT concurred in their labeling, i.e. either both got it correct or both got it
wrong. As can be seen, the overlap between the two classifiers is very high, close
to 95% (71.34% + 23.12% = 94.46%).
10 Conclusions and future work
In this paper, we developed FRAPP, a generalized model for random-perturbation-based
methods operating on categorical data under strict privacy constraints. The
framework provides us with the ability to first make careful choices of the model
parameters and then build perturbation methods for these choices. This results in
order-of-magnitude improvements in model accuracy as compared to the conventional
approach of deciding on a perturbation method upfront, which implicitly freezes the
associated model parameters.
Using the framework, a "gamma-diagonal" perturbation matrix was identified as
the best conditioned among the class of symmetric positive-definite matrices, and
therefore expected to deliver the highest accuracy within this class. We also presented an
implementation method for gamma-diagonal-based perturbation whose complexity is
proportional to the sum of the domain cardinalities of the attributes in the database.
Empirical evaluation of our approach on the CENSUS and HEALTH datasets
demonstrated significant reductions in mining errors for association rule mining relative to
prior privacy-preserving techniques, and comparable accuracy to direct mining for
classification models.
The relationship between data size and model accuracy was also evaluated and it
was shown that it is often possible to construct a sufﬁciently large dataset to achieve
the desired accuracy by the simple expedient of generating multiple distorted versions
of each customer’s true data record, without materially compromising the data privacy.
Finally, we investigated the novel strategy of having the perturbation matrix
composed not of fixed values, but of random variables instead. Our analysis of this approach indicated
that at a marginal cost in accuracy, significant improvements in privacy levels could
be achieved.
In our future work, we plan to investigate whether it is possible, as discussed in
Sect. 2, to design distortion matrices such that the mining can be carried out directly
on the distorted database without any explicit reconstruction—that is, to develop an
“invariant FRAPP matrix”.
Appendix
Condition number of gamma-diagonal matrix
We provide here the formula for computing the condition number of the
gamma-diagonal distortion matrix. Specifically, consider the n × n matrix A of the form
$$A_{ij} = \begin{cases} \xi x & \text{if } i = j \\ x & \text{otherwise} \end{cases} \qquad \text{where } \xi x + (n - 1)x = 1 \qquad (30)$$
Since matrix A is symmetric, we can use the following well-known result (Strang
1988):
Theorem 6 A symmetric matrix has real eigenvalues.
Let X be an eigenvector of the matrix A corresponding to eigenvalue λ. Then, it must
satisfy:
$$AX = \lambda X$$
Using the structure of matrix A from Eq. 30, for any i = 1, ..., n,
$$\xi x X_i + \sum_{j \ne i} x X_j = \lambda X_i \iff (\xi x - x) X_i + \sum_j x X_j = \lambda X_i$$
$$\Rightarrow \text{either } X_i = \frac{x \sum_j X_j}{\lambda + x - \xi x} \qquad (31)$$
$$\text{or } \lambda + x - \xi x = 0 \qquad (32)$$
Eq. 31 implies that all $X_i$ are equal. Let this common value be g, leading to
$$g(\lambda + x - \xi x) = n g x \Rightarrow \lambda = \xi x + nx - x = 1 \qquad (33)$$
From Eq. 32,
$$\lambda = (\xi - 1)x = \frac{\xi - 1}{\xi + n - 1} < 1$$
Thus, only two distinct values are taken by the eigenvalues of matrix A: $\lambda_1 = 1$ and
$\lambda_2 = \lambda_3 = \cdots = \lambda_n = \frac{\xi - 1}{\xi + n - 1}$. For ξ > 1, the
eigenvalues of the matrix A are positive, hence A is a positive-definite matrix, and its
condition number is:
$$cond(A) = \frac{\lambda_{max}}{\lambda_{min}} = \frac{\xi + n - 1}{\xi - 1}$$
– For the matrix A given by Eq. 12, ξ = γ, $n = |S_U|$, γ > 1, so
$$cond(A) = \frac{\gamma + |S_U| - 1}{\gamma - 1} = 1 + \frac{|S_U|}{\gamma - 1}$$
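– For the matrix A for mining itemsets over a subset of attributes $C_s$, given by Eq. 28,
$$\xi = \frac{\gamma + \frac{I_C}{I_{C_s}} - 1}{\frac{I_C}{I_{C_s}}}, \qquad n = I_{C_s}.$$
Hence,
$$\xi + n - 1 = \frac{\gamma + I_C - 1}{\frac{I_C}{I_{C_s}}}, \qquad \xi - 1 = \frac{\gamma - 1}{\frac{I_C}{I_{C_s}}}$$
$$cond(A) = \frac{\xi + n - 1}{\xi - 1} = \frac{\gamma + I_C - 1}{\gamma - 1} = 1 + \frac{|S_U|}{\gamma - 1}$$
where the last step uses $I_C = |S_U|$ for full-length records.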
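The two eigenvalues and the resulting condition number can be checked numerically. The
sketch below is illustrative only: it builds a small gamma-diagonal matrix with ξ = γ
(the Eq. 12 case), uses plain Python lists rather than a linear algebra library, and all
function names are hypothetical.

```python
def gamma_diagonal(n, gamma):
    """n x n matrix with gamma*x on the diagonal, x elsewhere, x = 1/(gamma+n-1)."""
    x = 1.0 / (gamma + n - 1)
    return [[gamma * x if i == j else x for j in range(n)] for i in range(n)]


def matvec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(a * v for a, v in zip(row, vector)) for row in matrix]


def condition_number(n, gamma):
    """Closed form from this Appendix: 1 + n / (gamma - 1)."""
    return 1.0 + n / (gamma - 1.0)


n, gamma = 5, 19.0
A = gamma_diagonal(n, gamma)

ones = [1.0] * n                       # eigenvector for eigenvalue 1
diff = [1.0, -1.0] + [0.0] * (n - 2)   # eigenvector for (gamma-1)/(gamma+n-1)
lam_small = (gamma - 1.0) / (gamma + n - 1.0)

image_of_ones = matvec(A, ones)
image_of_diff = matvec(A, diff)
```

As a rough consistency check, taking $|S_U|$ as the product of the domain sizes in Table 1
(5 · 2 · 2 · 4 · 5 · 5 = 2000, an assumption about how $S_U$ is formed), the closed form
gives a condition number of about 112 for CENSUS at γ = 19, independent of itemset length.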
References
Adam N, Wortman J (1989) Security control methods for statistical databases. ACM Comput Surv 21(4):
515–556
Aggarwal C, Yu P (2004, March) A condensation approach to privacy preserving data mining. In:
Proceedings of the 9th international conference on extending database technology (EDBT), Heraklion, Crete,
Greece
Agrawal D, Aggarwal C (2001, May) On the design and quantiﬁcation of privacy preserving data mining
algorithms. In: Proceedings of the ACM symposium on principles of database systems (PODS), Santa
Barbara, California, USA
Agrawal R, Bayardo R, Faloutsos C, Kiernan J, Rantzau R, Srikant R (2004, August) Auditing compliance
with a hippocratic database. In: Proceedings of the 30th international conference on very large data
bases (VLDB), Toronto, Canada
Agrawal R, Kiernan J, Srikant R, Xu Y (2002, August) Hippocratic databases. In: Proceedings of the 28th
international conference on very large data bases (VLDB), Hong Kong, China
Agrawal R, Kini A, LeFevre K, Wang A, Xu Y, Zhou D (2004, June) Managing healthcare data
hippocratically. In: Proceedings of the ACM SIGMOD international conference on management of data, Paris,
France
Agrawal R, Srikant R (1994, September) Fast algorithms for mining association rules. In: Proceedings of
the 20th international conference on very large data bases (VLDB), Santiago de Chile, Chile
Agrawal R, Srikant R (2000, May) Privacy-preserving data mining. In: Proceedings of the ACM SIGMOD
international conference on management of data, Dallas, Texas, USA
Agrawal R, Srikant R, Thomas D (2005, June) Privacy-preserving OLAP. In: Proceedings of the ACM
SIGMOD international conference on management of data, Baltimore, Maryland, USA
Agrawal S, Krishnan V, Haritsa J (2004, March) On addressing efficiency concerns in privacy-preserving
mining. In: Proceedings of the 9th international conference on database systems for advanced
applications (DASFAA), Jeju Island, Korea
Atallah M, Bertino E, Elmagarmid A, Ibrahim M, Verykios V (1999, November) Disclosure limitation
of sensitive rules. In: Proceedings of the IEEE knowledge and data engineering exchange workshop
(KDEX), Chicago, Illinois, USA
Cranor L, Reagle J, Ackerman M (1999, April) Beyond concern: understanding net users’ attitudes about
online privacy, AT&T labs research technical report TR 99.4.3
Dasseni E, Verykios V, Elmagarmid A, Bertino E (2001, April) Hiding association rules by using confidence
and support. In: Proceedings of the 4th international information hiding workshop (IHW), Pittsburgh,
Pennsylvania, USA
de Wolf P, Gouweleeuw J, Kooiman P, Willenborg L (1998, March) Reﬂections on PRAM. In: Proceedings
of the statistical data protection conference, Lisbon, Portugal
Denning D (1982) Cryptography and data security. Addison-Wesley
Duncan G, Pearson R (1991) Enhancing access to microdata while protecting conﬁdentiality: prospects for
the future. Stat Sci 6(3):219–232
Evﬁmievski A, Gehrke J, Srikant R (2003, June) Limiting privacy breaches in privacy preserving data
mining. In: Proceedings of the ACM symposium on principles of database systems (PODS), San
Diego, California, USA
Evﬁmievski A, Srikant R, Agrawal R, Gehrke J (2002, July) Privacy preserving mining of association rules.
In: Proceedings of the 8th ACM SIGKDD international conference on knowledge discovery and data
mining (KDD), Edmonton, Alberta, Canada
Feller W (1988) An introduction to probability theory and its applications, vol I. Wiley
Gouweleeuw J, Kooiman P, Willenborg L, de Wolf P (1998) Post randomisation for statistical disclosure
control: Theory and implementation. J Off Stat 14(4):485–502
Kantarcioglu M, Clifton C (2002, June) Privacy-preserving distributed mining of association rules on
horizontally partitioned data. In: Proceedings of the ACM SIGMOD workshop on research issues in
data mining and knowledge discovery (DMKD), Madison, Wisconsin, USA
Kargupta H, Datta S, Wang Q, Sivakumar K (2003, December) On the privacy preserving properties of
random data perturbation techniques. In: Proceedings of the 3rd IEEE international conference on
data mining (ICDM), Melbourne, Florida, USA
LeFevre K, Agrawal R, Ercegovac V, Ramakrishnan R, Xu Y, DeWitt D (2004, August) Limiting disclosure
in hippocratic databases. In: Proceedings of the 30th international conference on very large data bases
(VLDB), Toronto, Canada
Mishra N, Sandler M (2006, June) Privacy via pseudorandom sketches. In: Proceedings of the ACM
symposium on principles of database systems (PODS), Chicago, Illinois, USA
Mitchell T (1997) Machine learning. McGraw Hill
Motwani R, Raghavan P (1995) Randomized algorithms. Cambridge University Press
Pudi V, Haritsa J (2000) Quantifying the utility of the past in mining large databases. Inf Sys 25(5):323–344
Quinlan JR (1993) C4.5: Programs for machine learning. Morgan Kaufmann
Rastogi V, Suciu D, Hong S (2007, September) The boundary between privacy and utility in data publishing.
In: Proceedings of the 33rd international conference on very large data bases (VLDB), Vienna, Austria
Rizvi S, Haritsa J (2002, August) Maintaining data privacy in association rule mining. In: Proceedings of
the 28th international conference on very large databases (VLDB), Hong Kong, China
Samarati P, Sweeney L (1998, June) Generalizing data to provide anonymity when disclosing
information. In: Proceedings of the ACM symposium on principles of database systems (PODS), Seattle,
Washington, USA
Saygin Y, Verykios V, Clifton C (2001) Using unknowns to prevent discovery of association rules. ACM
SIGMOD Rec 30(4):45–54
Saygin Y, Verykios V, Elmagarmid A (2002, February) Privacy preserving association rule mining. In:
Proceedings of the 12th international workshop on research issues in data engineering (RIDE), San
Jose, California, USA
Shoshani A (1982, September) Statistical databases: characteristics, problems and some solutions. In:
Proceedings of the 8th international conference on very large databases (VLDB), Mexico City, Mexico
Strang G (1988) Linear algebra and its applications. Thomson Learning Inc
Vaidya J, Clifton C (2002, July) Privacy preserving association rule mining in vertically partitioned data.
In: Proceedings of the 8th ACM SIGKDD international conference on knowledge discovery and data
mining (KDD), Edmonton, Alberta, Canada
Vaidya J, Clifton C (2003, August) Privacy-preserving k-means clustering over vertically partitioned data.
In: Proceedings of the 9th ACM SIGKDD international conference on knowledge discovery and data
mining (KDD), Washington, DC, USA
Vaidya J, Clifton C (2004, April) Privacy preserving naive bayes classiﬁer for vertically partitioned data.
In: Proceedings of the SIAM international conference on data mining (SDM), Toronto, Canada
Wang Y (1993) On the number of successes in independent trials. Statistica Sinica 3
Warner S (1965) Randomized response: a survey technique for eliminating evasive answer bias. J Am Stat
Assoc 60:63–69
Westin A (1999, July) Freebies and privacy: what net users think. Technical report, Opinion Research
Corporation
Zhang N, Wang S, Zhao W (2004, September) A new scheme on privacy-preserving association rule mining.
In: Proceedings of the 8th European conference on principles and practice of knowledge discovery in
databases (PKDD), Pisa, Italy
a variety of real datasets. Our experimental results indicate that, for a given privacy requirement, either substantially lower modeling errors are incurred as compared to the prior techniques, or the errors are comparable to those of direct mining on the true database. Keywords Privacy · Data mining
1 Introduction The knowledge models produced through data mining techniques are only as good as the accuracy of their input data. One source of data inaccuracy is when users, due to privacy concerns, deliberately provide wrong information. This is especially common with regard to customers asked to provide personal information on Web forms to Ecommerce service providers. The standard approach to address this problem is for the service providers to assure the users that the databases obtained from their information would be anonymized through the variety of techniques proposed in the statistical database literature (see Adam and Wortman 1989; Shoshani 1982), before being supplied to the data miners. For example, the swapping of values between different customer records, as proposed by Denning (1982). However, in today’s world, most users are (perhaps justiﬁably) cynical about such assurances, and it is therefore imperative to demonstrably provide privacy at the point of data collection itself, that is, at the user site. For the above “B2C (businesstocustomer)” privacy environment (Zhang et al. 2004), a variety of privacypreserving data mining techniques have been proposed in the last few years (e.g. Aggarwal and Yu 2004; Agrawal and Srikant 2000; Evﬁmievski et al. 2002; Rizvi and Haritsa 2002), in an effort to encourage users to submit correct inputs. The goal of these techniques is to ensure the privacy of the raw local data but, at the same time, support accurate reconstruction of the global data mining models. Most of the techniques are based on a data perturbation approach, wherein the user data is distorted in a probabilistic manner that is disclosed to the eventual miner. For example, in the MASK technique Rizvi and Haritsa (2002), intended for privacypreserving associationrule mining on sparse boolean databases, each bit in the original (true) user transaction vector is independently ﬂipped with a parametrized probability. 
1.1 The FRAPP framework The trend in the prior literature has been to propose speciﬁc perturbation techniques, which are then analyzed for their privacy and accuracy properties. We move on, in this paper, to proposing FRAPP1 (FRamework for Accuracy in PrivacyPreserving mining), a generalized matrixtheoretic framework that facilitates a systematic approach to the design of random perturbation schemes for privacypreserving mining. It supports “ampliﬁcation”, a particularly strong notion of privacy proposed by
1 Also the name of a popular coffeebased beverage, where the ingredients are perturbed and hidden under foam http://en.wikibooks.org/wiki/Cookbook:Frapp%C3%A9_Coffee.
123
A framework for highaccuracy privacypreserving mining
103
Evfimievski et al. (2003), which guarantees strict limits on privacy breaches of individual user information, independent of the distribution of the original (true) data. The distinguishing feature of FRAPP is its quantitative characterization of the sources of error in the random data perturbation and model reconstruction processes.

We first demonstrate that the prior techniques differ only in their choices for the elements in the FRAPP perturbation matrix. Next, and more importantly, we show that through appropriate choices of matrix elements, new perturbation techniques can be constructed that provide highly accurate mining results even under strict amplification-based (Evfimievski et al. 2003) privacy guarantees. In fact, we identify a perturbation matrix with provably minimal condition number (see footnote 2), substantially improving the accuracy under the given constraints. An efficient implementation for this optimal perturbation matrix is also presented.

FRAPP's quantification of reconstruction error highlights that, apart from the choice of perturbation matrix, the size of the dataset also has a significant impact on the accuracy of the mining model. We explicitly characterize this relationship, thus helping the miner decide the minimum amount of data to be collected in order to achieve, with high probability, a desired level of accuracy in the mining results. Further, for those environments where data collection possibilities are limited, we propose a novel "multi-distortion" method that makes up for the lack of data by collecting multiple distorted versions from each individual user without materially compromising on privacy.

We then investigate, for the first time, the possibility of randomizing the perturbation parameters themselves. The motivation is that it could result in increased privacy levels since the actual parameter values used by a specific client will not be known to the data miner. This approach has the obvious downside of perhaps reducing the model reconstruction accuracy. However, our investigation shows that the tradeoff is very attractive in that the privacy increase is significant whereas the accuracy reduction is only marginal. This opens up the possibility of using FRAPP in a two-step process: first, given a user-desired level of privacy, identifying the deterministic values of the FRAPP parameters that both guarantee this privacy and also maximize the accuracy; and then, (optionally) randomizing these parameters to obtain even better privacy guarantees at a minimal cost in accuracy.
1.2 Evaluation of FRAPP

The FRAPP model is valid for random-perturbation-based privacy-preserving mining in general. Here, we focus on its applications to categorical databases, where attribute domains are finite. Note that boolean data is a special case of this class, and further, that continuous-valued attributes can be converted into categorical attributes by partitioning the domain of the attribute into fixed-length intervals. To quantitatively assess FRAPP's utility, we specifically evaluate the performance of our new perturbation mechanisms on popular mining tasks such as association rule mining and classification rule mining.

2 In the class of symmetric positive-definite matrices (refer Sect. 4).
With regard to association rule mining, our experiments on a variety of real datasets indicate that FRAPP is substantially more accurate than the prior privacy-preserving techniques. Further, while their accuracy degrades with increasing itemset length, FRAPP is almost impervious to this parameter, making it particularly well-suited to datasets where the lengths of the maximal frequent itemsets are comparable to the cardinality of the set of attributes requiring privacy. Similarly, with regard to classification rule mining, our experiments show that FRAPP provides an accuracy that is, in fact, comparable to direct classification on the true database.

Apart from mining accuracy, the running time and memory costs for perturbed data mining, as compared to classical mining on the original data, are also important considerations. In contrast to much of the earlier literature, FRAPP uses a generalized dependent perturbation scheme, where the perturbation of an attribute value may be affected by the perturbations of the other attributes in the same record. However, we show that it is fully decomposable into the perturbation of individual attributes, and hence has the same run-time complexity as any independent perturbation method. Further, we present experimental evidence that FRAPP takes only a few minutes to perturb datasets running to millions of records. Subsequently, due to its well-conditioned and trivially invertible perturbation matrix, FRAPP incurs only negligible additional overheads with respect to memory usage and mining execution time, as compared to traditional mining. Overall, therefore, FRAPP does not pose any significant additional computational burdens on the data mining process.

1.3 Contributions

In a nutshell, the work presented here provides mathematical and algorithmic foundations for efficiently providing both strict privacy and enhanced accuracy in privacy-conscious data mining applications. Specifically, our main contributions are as follows:

– FRAPP, a generalized matrix-theoretic framework for random perturbation and mining model reconstruction.
– Using FRAPP to derive new perturbation mechanisms for minimizing the model reconstruction error while ensuring strict privacy guarantees.
– Introducing the concept of randomization of perturbation parameters, and thereby deriving enhanced privacy.
– Efficient implementations of the proposed perturbation mechanisms.
– Quantitatively demonstrating the utility of FRAPP in the context of association and classification rule mining.

1.4 Organization

The remainder of this paper is organized as follows: Related work on privacy-preserving mining is reviewed in Sect. 2. The FRAPP framework for data perturbation and model reconstruction is presented in Sect. 3. Appropriate choices of the framework parameters for simultaneously guaranteeing strict data privacy and improving model accuracy are discussed in Sects. 4 and 5.
The impact of randomizing the FRAPP parameters is investigated in Sect. 6. Efficient schemes for implementing the FRAPP approach are described in Sect. 7. The application of these mechanisms to specific patterns is discussed in Sect. 8, and their utility is quantitatively evaluated in Sect. 9. Finally, in Sect. 10, we summarize the conclusions of our study and outline future research avenues.

2 Related work

The issue of maintaining privacy in data mining has attracted considerable attention over the last few years. The literature closest to our approach includes that of Agrawal and Aggarwal (2001), Agrawal and Srikant (2000), de Wolf et al. (1998), Evfimievski et al. (2002, 2003), Kargupta et al. (2003), Rizvi and Haritsa (2002). In the pioneering work of Agrawal and Srikant (2000), privacy-preserving data classifiers based on adding noise to the record values were proposed. This approach was extended by Agrawal and Aggarwal (2001) and Kargupta et al. (2003) to address a variety of subtle privacy loopholes.

New randomization operators for maintaining data privacy for boolean data were presented and analyzed by Evfimievski et al. (2002), Rizvi and Haritsa (2002). These methods are applicable to categorical/boolean data and are based on probabilistic mapping from the domain space to the range space, rather than by incorporating additive noise to continuous-valued data. A theoretical formulation of privacy breaches for such methods, and a methodology for limiting them, were given in the foundational work of Evfimievski et al. (2003).

Techniques for data hiding using perturbation matrices have also been investigated in the statistics literature. For example, in the early 90s work of Duncan and Pearson (1991), various disclosure-limitation methods for microdata are formulated as "matrix masking" methods. Here, the data consumer is provided the masked data file $M = AXB + C$ instead of the true data $X$, with $A$, $B$ and $C$ being masking matrices. But, no quantification of privacy guarantees or reconstruction errors was discussed in their analysis.

The PRAM method (de Wolf et al. 1998; Gouweleeuw et al. 1998), also intended for disclosure limitation in microdata files, considers the use of Markovian perturbation matrices. However, the ideal choice of matrix is left as an open research issue, and an iterative refinement process to produce acceptable matrices is proposed as an alternative. They also discuss the possibility of developing perturbation matrices such that data mining can be carried out directly on the perturbed database (that is, as if it were the original database and therefore not requiring any matrix inversion), and still produce accurate results. While this "invariant PRAM", as they call it, is certainly an attractive notion, the systematic identification of such matrices and the conditions on their applicability is still an open research issue; moreover, it appears to be feasible only in a "B2B (business-to-business)" environment, as opposed to the B2C environment considered here.

The work recently presented by Agrawal et al. (2005) for ensuring privacy in the OLAP environment also models data perturbation and reconstruction as matrix-theoretic operations.
A transition matrix is used for perturbation, and reconstruction is executed using matrix inversion. They also suggest that the condition number of the perturbation matrix is a good indicator of the error in reconstruction. However, the issue of choosing a perturbation matrix to minimize this error is not addressed.

Our work extends the above-mentioned methodologies for privacy-preserving mining in a variety of ways. First, we combine the various approaches for random perturbation on categorical data into a common theoretical framework, and explore how well random perturbation methods can perform in the face of strict privacy requirements. Second, through quantification of privacy and accuracy measures, we present an ideal choice of perturbation matrix, thereby taking the PRAM approach to, in a sense, its logical conclusion. Third, we propose the idea of randomizing the perturbation matrix elements themselves, which has not been, to the best of our knowledge, previously discussed in the literature.

Very recently, Rastogi et al. (2007) utilize and extend the FRAPP framework to a B2B environment like publishing. That is, they assume that users provide correct data to a central server and then this data is collectively anonymized. In contrast, our schemes assume that the users trust no one but themselves, and therefore the perturbation has to happen locally for each user. Formally, the transformation in their algorithm is described as $y = Ax + b$, thereby effectively adding a noise vector $b$ to $Ax$. They also analyze the privacy and accuracy tradeoff under bounded prior knowledge assumptions.

The "sketching" methods that were very recently presented by Mishra and Sandler (2006) are complementary to our approach. Their basic idea is that a $k$-bit attribute with $2^k$ possible values can be represented using $2^k$ binary-valued attributes which can then each be perturbed independently. However, a direct application of this idea requires extra $(2^k - k)$ bits, and therefore, Mishra and Sandler (2006) propose a summary sketching technique that requires an extra number of bits logarithmic in the number of instances in the dataset. Due to the extra bits, the method provides good estimation accuracy for single item counts. However, the multiple-attribute count estimation accuracy is shown to depend on the condition number of the perturbation matrix. Our results on optimally conditioned perturbation matrices can be combined with the sketching methods to provide better estimation of joint distributions. Another difference between the two works is that we provide experimental results in addition to the theoretical formulations.

Recently, a new privacy-preserving scheme based on the interesting idea of algebraic distortion, rather than statistical methods, has been proposed by Zhang et al. (2004). Their work is based on the assumption that statistical methods cannot handle long frequent itemsets. But, as shown in this paper, FRAPP successfully finds even length-7 frequent itemsets. A second assumption is that each attribute is randomized independently, thereby losing correlations; however, FRAPP supports dependent attribute perturbation and can therefore preserve correlations quite effectively. Finally, their work is restricted to handling only "upward privacy breaches" (Evfimievski et al. 2003), whereas FRAPP handles downward privacy breaches as well.

Another model of privacy-preserving data mining is the k-anonymity model (Samarati and Sweeney 1998; Aggarwal and Yu 2004), where each record value is replaced with a corresponding generalized value. Specifically, each perturbed record cannot be distinguished from at least $k$ other records in the data.
However, the constraints of this model are less strict than ours since the intermediate database-forming-server can learn or recover precise records.

A different perspective is taken in Hippocratic databases, which are database systems that take responsibility for the privacy of the data they manage, and are discussed by Agrawal et al. (2002, 2004a, b), LeFevre et al. (2004). They involve specification of how the data is to be used in a privacy policy, and enforcing limited disclosure rules for regulatory concerns prompted by legislation.

Maintaining input data privacy is considered by Kantarcioglu and Clifton (2002), Vaidya and Clifton (2002, 2003, 2004) in the context of databases that are distributed across a number of sites with each site only willing to share data mining results, but not the source data.

Finally, the problem addressed by Atallah et al. (1999), Dasseni et al. (2001), Saygin et al. (2001, 2002) is preventing sensitive models from being inferred by the data miner. This work is complementary to ours since it addresses concerns about output privacy, whereas our focus is on the privacy of the input data.

3 The FRAPP framework

In this section, we describe the construction of the FRAPP framework, and its quantification of privacy and accuracy measures.

Data model. We assume that the original (true) database $U$ consists of $N$ records, with each record having $M$ categorical attributes. The domain of attribute $j$ is denoted by $S_U^j$, resulting in the domain $S_U$ of a record in $U$ being given by $S_U = \prod_{j=1}^{M} S_U^j$. We map the domain $S_U$ to the index set $I_U = \{1, \ldots, |S_U|\}$, thereby modeling the database as a set of $N$ values from $I_U$. If we denote the $i$th record of $U$ as $U_i$, then $U = \{U_i\}_{i=1}^{N}$, $U_i \in I_U$.

To make this concrete, consider a database $U$ with 3 categorical attributes Age, Sex and Education having the following category values:

Age         Child, Adult, Senior
Sex         Male, Female
Education   Elementary, Graduate

For this schema, $M = 3$, $S_U^1 = \{\text{Child}, \text{Adult}, \text{Senior}\}$, $S_U^2 = \{\text{Male}, \text{Female}\}$, $S_U^3 = \{\text{Elementary}, \text{Graduate}\}$, $S_U = S_U^1 \times S_U^2 \times S_U^3$, and $|S_U| = 12$. The domain $S_U$ is indexed by the index set $I_U = \{1, \ldots, 12\}$, and hence the set of records

U (records)                      U (indices)
Child    Male     Elementary     1
Child    Male     Graduate       2
Child    Female   Graduate       4
Senior   Male     Elementary     9

maps to the indices shown in the right-hand column.

Each record $U_i$ represents the private information of customer $i$. Further, we assume that the $U_i$'s are independent and identically distributed according to a fixed distribution $p_U$. This distribution $p_U$ is not private and the customers are aware that the miner is expected to learn it; in fact, that is usually the goal of the data mining exercise.
However, the assumption of independence implies that once $p_U$ is known, possession of the private information $U_j$ of any other customer $j$ provides no additional inferences about customer $i$'s private information $U_i$ (Evfimievski et al. 2002).

Perturbation model. As mentioned in Sect. 1, we consider the B2C privacy situation wherein the customers trust no one except themselves, that is, they wish to perturb their records at their client sites before the information is sent to the miner, or any intermediate party. This means that perturbation is carried out at the granularity of individual customer records $U_i$, without being influenced by the contents of the other records in the database.

For this situation, there are two possibilities: (a) a simple independent attribute perturbation, wherein the value of each attribute in the user record is perturbed independently of the rest; or (b) a more generalized dependent attribute perturbation, where the perturbation of each attribute may be affected by the perturbations of the other attributes in the record. Most of the prior perturbation techniques, including Evfimievski et al. (2002, 2003), Rizvi and Haritsa (2002), fall into the independent attribute perturbation category. The FRAPP framework, however, includes both kinds of perturbation in its analysis.

Let the perturbed database be $V = \{V_1, \ldots, V_N\}$, with domain $S_V$, and corresponding index set $I_V$. For example, given the sample database $U$ discussed above, and assuming that each attribute is distorted to produce a value within its original domain, the distortion may result in

V (indices)    V (records)
5              Adult    Male     Elementary
7              Adult    Female   Elementary
2              Child    Male     Graduate
12             Senior   Female   Graduate

Let the probability of an original customer record $U_i = u$, $u \in I_U$ being perturbed to a record $V_i = v$, $v \in I_V$ using randomization operator $R(u)$ be $p(u \to v)$, and let $A$ denote the matrix of these transition probabilities, with $A_{vu} = p(u \to v)$. This random process maps to a Markov process, and the perturbation matrix $A$ should therefore satisfy the following properties (Strang 1988):

\[ A_{vu} \ge 0 \quad \text{and} \quad \sum_{v \in I_V} A_{vu} = 1 \qquad \forall u \in I_U,\ v \in I_V \tag{1} \]

Due to the constraints imposed by Eq. 1, the domain of $A$ is a subset of $\mathbb{R}^{|S_V| \times |S_U|}$. This domain is further restricted by the choice of the randomization operator. For example, for the MASK technique (Rizvi and Haritsa 2002) mentioned in Sect. 1, all the entries of matrix $A$ are decided by the choice of a single parameter, namely, the flipping probability.

In this paper, we explore the preferred choices of $A$ to simultaneously achieve data privacy guarantees and high model accuracy, without restricting ourselves ab initio to a particular perturbation method.
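The mapping between records and the index sets $I_U$ and $I_V$ in the running example can be sketched in a few lines of Python. This is only an illustration: the enumeration order (last attribute varying fastest) is an assumption inferred from the example indices 1, 2, 4, 9 and 5, 7, 2, 12, and the names `DOMAINS`, `record_to_index` and `index_to_record` are hypothetical.

```python
from itertools import product

# Attribute domains of the running example, in the order Age, Sex, Education.
DOMAINS = [
    ["Child", "Adult", "Senior"],
    ["Male", "Female"],
    ["Elementary", "Graduate"],
]

# Assumed enumeration of S_U: cross-product with the last attribute varying
# fastest, so (Child, Male, Elementary) is 1 and (Senior, Female, Graduate) is 12.
RECORDS = list(product(*DOMAINS))


def record_to_index(record):
    """Map a tuple of category values to its 1-based index in I_U."""
    return RECORDS.index(tuple(record)) + 1


def index_to_record(index):
    """Map a 1-based index back to the tuple of category values."""
    return RECORDS[index - 1]


U = [
    ("Child", "Male", "Elementary"),
    ("Child", "Male", "Graduate"),
    ("Child", "Female", "Graduate"),
    ("Senior", "Male", "Elementary"),
]
U_indices = [record_to_index(r) for r in U]
V_records = [index_to_record(v) for v in [5, 7, 2, 12]]
```

Under this ordering the four sample records encode to the indices given in the data model, and the perturbed indices decode to the records listed for $V$.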
3.1 Privacy guarantees

The miner is provided the perturbed database $V$, and the perturbation matrix $A$. Obviously, by receiving $V_i$ corresponding to customer $i$, the miner gains partial information about $U_i$. However, as mentioned earlier in this section, due to the independence assumption, all $V_j$ for $j \ne i$ disclose nothing about $U_i$. They certainly help the miner to learn the distribution $p_U$, but this is already factored in our privacy analysis since we assume the most conservative scenario wherein the miner has complete and precise knowledge of $p_U$. In fact, extracting information about $p_U$ is typically the goal of the data mining exercise and, therefore, our privacy technique must encourage, rather than preclude, achieving this objective. The problem therefore reduces to analyzing specifically how much can be disclosed by $V_i$ about the particular source record $U_i$.

We utilize the definition, given by Evfimievski et al. (2003), that a property $Q(u)$ of a data record $U_i = u$ is a function $Q: u \to \{\text{true}, \text{false}\}$. Further, a property holds for a record $U_i = u$ if $Q(u) = \text{true}$. For example, consider the following record from our example dataset $U$:

Age      Sex     Education
Child    Male    Elementary

Sample properties of this data record are $Q_1(U_i) \equiv$ "Age = Child and Sex = Male", and $Q_2(U_i) \equiv$ "Age = Child or Adult".

For this context, the prior probability of a property of a customer's private information is the likelihood of the property in the absence of any knowledge about the customer's private information. On the other hand, the posterior probability is the likelihood of the property given the perturbed information from the customer and the knowledge of the prior probabilities through reconstruction from the perturbed database. Specifically, the prior probability of any property $Q(U_i)$ is given by

\[ P[Q(U_i)] = \sum_{u:Q(u)} P[U_i = u] = \sum_{u:Q(u)} p_U(u) \]

The posterior probability of any such property can be computed using Bayes formula:

\[ P[Q(U_i) \mid V_i = v] = \sum_{u:Q(u)} P[U_i = u \mid V_i = v] = \sum_{u:Q(u)} \frac{P[U_i = u] \cdot p[u \to v]}{P[V_i = v]} \]
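As a small numerical illustration of the prior and posterior formulas above, the following Python sketch evaluates both for a toy three-value domain. The distribution `p_u`, the matrix `A` and the property `is_first_value` are made-up values for illustration, not data from the paper, and indices are 0-based here for convenience.

```python
def prior(p_u, holds):
    """P[Q(U_i)]: total prior mass of the values u for which Q(u) is true."""
    return sum(p for u, p in enumerate(p_u) if holds(u))


def posterior(p_u, A, v, holds):
    """P[Q(U_i) | V_i = v] via Bayes' formula, with A[v][u] = p[u -> v]."""
    joint = [p_u[u] * A[v][u] for u in range(len(p_u))]  # P[U_i = u] * p[u -> v]
    p_v = sum(joint)                                     # P[V_i = v]
    return sum(j for u, j in enumerate(joint) if holds(u)) / p_v


# Hypothetical toy inputs (0-based indices).
p_u = [0.5, 0.3, 0.2]
A = [
    [0.6, 0.2, 0.2],
    [0.2, 0.6, 0.2],
    [0.2, 0.2, 0.6],
]


def is_first_value(u):
    """Example property Q: the record takes the first value of the domain."""
    return u == 0


prior_q = prior(p_u, is_first_value)
posterior_q = posterior(p_u, A, 0, is_first_value)
```

With these inputs the prior of 0.5 rises to a posterior of 0.75 after observing the perturbed value 0, which is the kind of shift that the breach definitions below are meant to bound.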
As discussed by Evfimievski et al. (2003), in order to preserve the privacy of some property of a customer's private information, the posterior probability of that property should not be unduly different from the prior probability of the property for the customer. This notion of privacy is quantified by Evfimievski et al. (2003) through the following results, where $\rho_1$ and $\rho_2$ denote the prior and posterior probabilities, respectively:

Privacy breach. An upward $\rho_1$-to-$\rho_2$ privacy breach exists with respect to property $Q$ if $\exists v \in S_V$ such that $P[Q(U_i)] \le \rho_1$ and $P[Q(U_i) \mid R(U_i) = v] \ge \rho_2$. Conversely, a downward $\rho_2$-to-$\rho_1$ privacy breach exists with respect to property $Q$ if $\exists v \in S_V$ such that $P[Q(U_i)] \ge \rho_2$ and $P[Q(U_i) \mid R(U_i) = v] \le \rho_1$.

Amplification. A randomization operator $R(u)$ is at most $\gamma$-amplifying for $v \in S_V$ if

\[ \forall u_1, u_2 \in S_U: \quad \frac{p[u_1 \to v]}{p[u_2 \to v]} \le \gamma \]

where $\gamma \ge 1$ and $\exists u: p[u \to v] > 0$. Operator $R(u)$ is at most $\gamma$-amplifying if it is at most $\gamma$-amplifying for all qualifying $v \in S_V$.

Breach prevention. Let $R$ be a randomization operator, $v \in S_V$ be a randomized value such that $\exists u: p[u \to v] > 0$, and $\rho_1, \rho_2$ $(0 < \rho_1 < \rho_2 < 1)$ be two probabilities as per the above privacy breach definition. Then, if $R$ is at most $\gamma$-amplifying for $v$, revealing "$R(u) = v$" will cause neither upward ($\rho_1$-to-$\rho_2$) nor downward ($\rho_2$-to-$\rho_1$) privacy breaches with respect to any property if the following condition is satisfied:

\[ \frac{\rho_2 (1 - \rho_1)}{\rho_1 (1 - \rho_2)} > \gamma \]

If this situation holds, $R$ is said to support $(\rho_1, \rho_2)$ privacy guarantees.

From the above results of Evfimievski et al. (2003), we can derive, for our formulation, the following condition on the perturbation matrix $A$ in order to support $(\rho_1, \rho_2)$ privacy:

\[ \frac{A_{vu_1}}{A_{vu_2}} \le \gamma < \frac{\rho_2 (1 - \rho_1)}{\rho_1 (1 - \rho_2)} \qquad \forall u_1, u_2 \in I_U,\ \forall v \in I_V \tag{2} \]

That is, the choice of perturbation matrix $A$ should follow the restriction that the ratio of any two matrix entries (in a row) should not be more than $\gamma$.

Application environment. At this juncture, we wish to clearly specify the environments under which the above guarantees are applicable. Firstly, our quantification of privacy breaches analyzes only the information leaked to the miner through observing the perturbed data; it does not take into account any prior knowledge that the miner may have about the original database. Secondly, we assume that the contents of each client's record are completely independent from those of other customers, that is, there are no inter-transaction dependencies.
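To make the condition of Eq. 2 concrete, the sketch below computes the strict upper limit on $\gamma$ implied by a $(\rho_1, \rho_2)$ requirement and checks whether a candidate matrix respects it. The function names and the sample values ($\rho_1 = 0.05$, $\rho_2 = 0.5$, and the 3 × 3 matrix) are illustrative assumptions, not settings taken from this section.

```python
def gamma_limit(rho1, rho2):
    """Strict upper limit on gamma for a (rho1, rho2) privacy requirement."""
    return (rho2 * (1.0 - rho1)) / (rho1 * (1.0 - rho2))


def amplification(A):
    """Largest ratio between two entries in the same row of A (A[v][u] = p[u -> v])."""
    worst = 1.0
    for row in A:
        if max(row) == 0:      # v is never produced, so it is not a qualifying value
            continue
        if min(row) == 0:      # some u can never yield v while another can
            return float("inf")
        worst = max(worst, max(row) / min(row))
    return worst


def supports_privacy(A, rho1, rho2):
    """True if the row ratios of A stay strictly below the limit of Eq. 2."""
    return amplification(A) < gamma_limit(rho1, rho2)


# Hypothetical example: prior at most 5%, posterior required to stay below 50%.
limit = gamma_limit(0.05, 0.5)
A = [
    [0.6, 0.2, 0.2],
    [0.2, 0.6, 0.2],
    [0.2, 0.2, 0.6],
]
identity = [[1.0, 0.0], [0.0, 1.0]]
ok = supports_privacy(A, 0.05, 0.5)
ok_identity = supports_privacy(identity, 0.05, 0.5)
```

For these sample values the limit evaluates to 19, so any matrix whose row entries differ by a factor below 19 qualifies, whereas the identity matrix (no perturbation at all) has unbounded amplification and fails.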
Due to this independence assumption, all the $R(U_j)$ for $j \ne i$ do not disclose anything about $U_i$ and can therefore be ignored in privacy analysis; they certainly help the miner to learn the distribution of the original data, but in our analysis we have already assumed that this distribution is fully known by the miner. So the problem reduces to evaluating how much can be disclosed by $R(U_i)$ about $U_i$ (Evfimievski et al. 2003). We also hasten to add that we do not make any such restrictive assumptions about intra-transaction dependencies; in fact, the objective of association-rule mining is precisely to establish such dependencies.

3.2 Reconstruction model

We now move on to analyzing how the distribution of the original database is reconstructed from the perturbed database. As per the perturbation model, a client $C_i$ with data record $U_i = u$, $u \in I_U$ generates record $V_i = v$, $v \in I_V$ with probability $p[u \to v]$. The generation event can be viewed as a Bernoulli trial with success probability $p[u \to v]$. If we denote the outcome of the $i$th Bernoulli trial by the random variable $Y_v^i$, the total number of successes $Y_v$ in $N$ trials is given by the sum of the $N$ Bernoulli random variables:

\[ Y_v = \sum_{i=1}^{N} Y_v^i \tag{3} \]

That is, the total number of records with value $v$ in the perturbed database is given by $Y_v$.

Note that $Y_v$ is the sum of $N$ independent but non-identical Bernoulli trials. The trials are non-identical because the probability of success varies from trial $i$ to trial $j$, depending on the values of $U_i$ and $U_j$, respectively. The distribution of such a random variable $Y_v$ is known as the Poisson-Binomial distribution (Wang 1993).

From Eq. 3, the expectation of $Y_v$ is given by

\[ E(Y_v) = \sum_{i=1}^{N} E(Y_v^i) = \sum_{i=1}^{N} P(Y_v^i = 1) \tag{4} \]

Using $X_u$ to denote the number of records with value $u$ in the original database, and noting that $P(Y_v^i = 1) = p[u \to v] = A_{vu}$ for $U_i = u$, we get

\[ E(Y_v) = \sum_{u \in I_U} A_{vu} X_u \tag{5} \]

Let $X = [X_1\ X_2\ \cdots\ X_{|S_U|}]^T$ and $Y = [Y_1\ Y_2\ \cdots\ Y_{|S_V|}]^T$. Then, the following expression is obtained from Eq. 5:

\[ E(Y) = AX \tag{6} \]
At first glance, it may appear that $X$, the distribution of records in the original database (and the objective of the reconstruction exercise), can be directly obtained from the above equation. However, we run into the difficulty that the data miner does not possess $E(Y)$, but only a specific instance of $Y$, with which he has to approximate $E(Y)$ (see footnote 3). Therefore, we resort to the following approximation to Eq. 6:

\[ Y = A\hat{X} \tag{7} \]

where $X$ is estimated as $\hat{X}$. This is a system of $|S_V|$ equations in $|S_U|$ unknowns, and for the system to be uniquely solvable, a necessary condition is that the space of the perturbed database is a superset of the original database (i.e. $|S_V| \ge |S_U|$). Further, if the inverse of matrix $A$ exists, the solution of this system of equations is given by

\[ \hat{X} = A^{-1} Y \tag{8} \]

providing the desired estimate of the distribution of records in the original database. Note that this estimation is unbiased because $E(\hat{X}) = A^{-1} E(Y) = X$.

3 If multiple distorted versions of the database are provided, then $E(Y)$ is approximated by the observed average of these versions.

3.3 Estimation error

To analyze the error in the above estimation process, we employ the following well-known theorem from linear algebra (Strang 1988):

Theorem 1 Given an equation of the form $Ax = b$ and that the measurement $\hat{b}$ of $b$ is inexact, the relative error in the solution $\hat{x} = A^{-1}\hat{b}$ satisfies

\[ \frac{\|\hat{x} - x\|}{\|x\|} \le c\, \frac{\|\hat{b} - b\|}{\|b\|} \]

where $c$ is the condition number of matrix $A$.

For a positive-definite matrix, $c = \lambda_{max}/\lambda_{min}$, where $\lambda_{max}$ and $\lambda_{min}$ are the maximum and minimum eigenvalues of matrix $A$, respectively. Informally, the condition number is a measure of the sensitivity of a matrix to numerical operations. Matrices with condition numbers near one are said to be well-conditioned, i.e. stable, whereas those with condition numbers much greater than one (e.g. $10^5$ for a $5 \times 5$ Hilbert matrix, Strang 1988) are said to be ill-conditioned, i.e. highly sensitive.

From Eqs. 6 and 8, coupled with Theorem 1, we have

\[ \frac{\|\hat{X} - X\|}{\|X\|} \le c\, \frac{\|Y - E(Y)\|}{\|E(Y)\|} \tag{9} \]

which means that the error in estimation arises from two sources: first, the sensitivity of the problem, indicated by the condition number of matrix $A$; and second, the deviation of $Y$ from its mean, i.e. the deviation of perturbed database counts from their expected values, indicated by the variance of $Y$.
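The reconstruction step of Eqs. 6 to 8 can be illustrated with a small pure-Python sketch. It is not the paper's implementation: the solver is plain Gaussian elimination, and the matrix `A`, the true counts `X` and the observed counts `Y_observed` are made-up values chosen so that the effect of a deviation in $Y$ is easy to follow.

```python
def mat_vec(A, x):
    """Matrix-vector product, used here to form E(Y) = A X (Eq. 6)."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def solve(A, y):
    """Solve A x = y by Gaussian elimination with partial pivoting (Eq. 8)."""
    n = len(A)
    M = [[float(a) for a in row] + [float(b)] for row, b in zip(A, y)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


# Hypothetical perturbation matrix and true counts.
A = [
    [0.6, 0.2, 0.2],
    [0.2, 0.6, 0.2],
    [0.2, 0.2, 0.6],
]
X = [500.0, 300.0, 200.0]

expected_Y = mat_vec(A, X)           # what the miner would see on average
X_from_mean = solve(A, expected_Y)   # recovers X exactly, up to rounding

Y_observed = [410.0, 315.0, 275.0]   # one possible noisy instance of Y
X_hat = solve(A, Y_observed)         # the estimate of Eq. 8
```

In this toy setting a deviation of 10 records in an observed count turns into a deviation of 25 in the estimate, which shows how the matrix amplifies sampling noise and why its conditioning matters for Eq. 9.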
In the following two sections, we mathematically determine how to reduce this error by: (a) appropriately choosing the perturbation matrix to minimize the condition number, and (b) identifying the minimum size of the database required to (probabilistically) bound the deviation within a desired threshold.

4 Perturbation matrix with minimum condition number

The perturbation techniques proposed in the literature primarily differ in their choices for perturbation matrix $A$. For example:

(1) MASK: The MASK (Rizvi and Haritsa 2002) randomization scheme uses a matrix $A$ with

\[ A_{vu} = p^{k} (1 - p)^{M_b - k} \tag{10} \]

where $M_b$ is the number of boolean attributes when each categorical attribute $j$ is converted into $|S_U^j|$ boolean attributes, $(1 - p)$ is the bit flipping probability for each boolean attribute, and $k$ is the number of attributes with matching bits between the perturbed value $v$ and the original value $u$.

(2) Cut-and-paste: The cut-and-paste (C&P) randomization operator (Evfimievski et al. 2002) employs a matrix $A$ with

\[ A_{vu} = \sum_{z=0}^{M} p_M[z] \cdot \sum_{q=\max\{0,\ z+l_u-M,\ l_u+l_v-M_b\}}^{\min\{z,\ l_u,\ l_v\}} \frac{\binom{l_u}{q}\binom{M-l_u}{z-q}}{\binom{M}{z}} \cdot \binom{M_b-l_u}{l_v-q}\, \rho^{(l_v-q)} (1-\rho)^{(M_b-l_u-l_v+q)} \tag{11} \]

where

\[ p_M[z] = \sum_{w=0}^{\min\{K,\,z\}} \binom{M-w}{z-w}\, \rho^{(z-w)} (1-\rho)^{(M-z)} \cdot \begin{cases} 1 - M/(K+1) & \text{if } w = M \text{ and } w < K \\ 1/(K+1) & \text{otherwise} \end{cases} \]

Here $l_u$ and $l_v$ are the number of 1 bits in the original record $u$ and its corresponding perturbed record $v$, respectively, while $K$ and $\rho$ are operator parameters.

To enforce strict privacy guarantees, the parameter settings for the above methods are bounded by the constraints, given in Eqs. 1 and 2, on the values of the elements of the perturbation matrix $A$. It turns out that for practical values of privacy requirements, the resulting matrix $A$ for these previous schemes is extremely ill-conditioned; in fact, the condition numbers in our experiments were of the order of $10^5$ and $10^7$ for MASK and C&P, respectively.
Such ill-conditioned matrices make the reconstruction very sensitive to the variance in the distribution of the perturbed database. Thus, it is important to carefully choose the matrix $A$ such that it is well-conditioned (i.e. has a low condition number). If we decide on a distortion method ab initio, as in the earlier techniques, then there is little room for making specific choices of perturbation matrix $A$. Therefore, we take the opposite approach of first designing matrices of the required type, and then devising perturbation methods that are compatible with these matrices.

To choose a suitable matrix, we start from the intuition that for $\gamma = \infty$, the obvious matrix choice is the unity (identity) matrix, which both satisfies the constraints on matrix $A$ (Eqs. 1 and 2), and has the lowest possible condition number, namely, 1. Hence, for a given $\gamma$, we can choose the following matrix:

\[ A_{ij} = \begin{cases} \gamma x & \text{if } i = j \\ x & \text{otherwise} \end{cases} \qquad \text{where } x = \frac{1}{\gamma + (|S_U| - 1)} \tag{12} \]

which is of the form

\[ x \begin{bmatrix} \gamma & 1 & 1 & \cdots \\ 1 & \gamma & 1 & \cdots \\ 1 & 1 & \gamma & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{bmatrix} \]

It is easy to see that the above matrix, which incidentally is symmetric positive-definite and Toeplitz (Strang 1988), also satisfies the conditions given by Eqs. 1 and 2. Further, its condition number can be algebraically computed (as shown in the Appendix) to be $1 + \frac{|S_U|}{\gamma - 1}$. At an intuitive level, this matrix implies that the probability of a record $u$ remaining as $u$ after perturbation is $\gamma$ times the probability of its being distorted to some $v \ne u$. For ease of exposition, we will hereafter informally refer to this matrix as the "Gamma-Diagonal matrix".

At this point, an obvious question is whether it is possible to design matrices with even lower condition number than the gamma-diagonal matrix. We prove next that the gamma-diagonal matrix has the lowest possible condition number among the class of symmetric positive-definite perturbation matrices satisfying the constraints of the problem, that is, it is an optimal choice (albeit non-unique).

4.1 Proof of optimality

Theorem 2 Under the given privacy constraints, the Gamma-Diagonal matrix has the lowest condition number in the class of symmetric positive-definite perturbation matrices.

Proof To prove this proposition, we will first derive the expression for the minimum condition number of symmetric positive-definite matrices.
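For such matrices, the condition number is given by $c = \lambda_{max}/\lambda_{min}$, where $\lambda_{max}$ and $\lambda_{min}$ are the maximum and minimum eigenvalues of the matrix, respectively. Further, since $A$ is a Markov matrix (refer Eq. 1), the following results for eigenvalues of a Markov matrix (Strang 1988) are applicable.

Theorem 3 For an $n \times n$ Markov matrix, one of the eigenvalues is 1, and the remaining $n - 1$ eigenvalues all satisfy $|\lambda_i| \le 1$.

Theorem 4 The sum of the $n$ eigenvalues equals the sum of the $n$ diagonal entries, that is,

\[ \lambda_1 + \cdots + \lambda_n = A_{11} + \cdots + A_{nn} \]

From Theorem 3, we obtain $\lambda_{max} = 1$, and from Theorem 4, that the sum of the rest of the eigenvalues is fixed. If we denote $\lambda_1 = \lambda_{max}$, it is straightforward to see that $\lambda_{min}$ is maximized when $\lambda_2 = \lambda_3 = \cdots = \lambda_n$, leading to $\lambda_{min} = \frac{1}{n-1} \sum_{i=2}^{n} \lambda_i$. Therefore,

\[ \lambda_{min} \le \frac{1}{n-1} \sum_{i=2}^{n} \lambda_i \]

Using Theorem 4, we directly get

\[ \lambda_{min} \le \frac{1}{n-1} \left( \sum_{i=1}^{n} A_{ii} - 1 \right) \]

resulting in the matrix condition number being lower-bounded by

\[ c = \frac{1}{\lambda_{min}} \ge \frac{n-1}{\sum_{i=1}^{n} A_{ii} - 1} \tag{13} \]

Due to the privacy constraints on $A$ given by Eq. 2,

\[ A_{ii} \le \gamma A_{ij} \qquad \forall j \ne i \]

Summing the above equation over all values of $j$ except $j = i$, we get

\[ (n-1) A_{ii} \le \gamma \sum_{j \ne i} A_{ij} = \gamma (1 - A_{ii}) \]

where the second step is due to the condition on $A$ given by Eq. 1 and the restriction to symmetric positive-definite matrices. Solving for $A_{ii}$ results in

\[ A_{ii} \le \frac{\gamma}{\gamma + n - 1} \tag{14} \]

and using this inequality in Eq. 13, we finally obtain

\[ c \ge \frac{n-1}{\frac{n\gamma}{\gamma + n - 1} - 1} = \frac{\gamma + n - 1}{\gamma - 1} = 1 + \frac{n}{\gamma - 1} \tag{15} \]

Therefore, the minimum condition number for the symmetric positive-definite perturbation matrices under privacy constraints represented by $\gamma$ is $\left(1 + \frac{n}{\gamma - 1}\right)$. The condition number of our "gamma-diagonal" matrix of size $|S_U|$ can be computed as shown in the Appendix, and its value turns out to be $\left(1 + \frac{|S_U|}{\gamma - 1}\right)$. Thus, it is a minimum condition number perturbation matrix.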
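The gamma-diagonal matrix of Eq. 12 is simple enough to build and check directly. The sketch below constructs it, evaluates the condition number formula $1 + n/(\gamma - 1)$, and reconstructs counts without a general matrix inversion. The closed-form `reconstruct` is a derivation added here for illustration, not a formula quoted from the paper: it follows from $A X = x\,((\gamma - 1) X + N \mathbf{1})$, where $N$ is the total record count. All function names and the sample values $n = 12$, $\gamma = 19$ are illustrative assumptions.

```python
def gamma_diagonal(n, gamma):
    """Build the n x n matrix of Eq. 12: gamma*x on the diagonal, x elsewhere."""
    x = 1.0 / (gamma + n - 1)
    return [[gamma * x if i == j else x for j in range(n)] for i in range(n)]


def condition_number(n, gamma):
    """Condition number 1 + n/(gamma - 1) of the gamma-diagonal matrix.

    The eigenvalues are 1 (once) and (gamma - 1) * x (n - 1 times), so the
    ratio of largest to smallest is (gamma + n - 1) / (gamma - 1).
    """
    return 1.0 + n / (gamma - 1.0)


def reconstruct(Y, gamma):
    """Estimate X from perturbed counts Y using the closed form derived above.

    Each Y_v equals x * ((gamma - 1) * X_v + N), with N the total record
    count, which can be solved for X_v one entry at a time.
    """
    n = len(Y)
    total = sum(Y)                 # N: perturbation preserves the record count
    x = 1.0 / (gamma + n - 1)
    return [(y / x - total) / (gamma - 1.0) for y in Y]


# Hypothetical setting: the 12-value example domain with gamma = 19.
n, gamma = 12, 19.0
A = gamma_diagonal(n, gamma)
c = condition_number(n, gamma)

X = [100.0 * (i + 1) for i in range(n)]
Y = [sum(A[v][u] * X[u] for u in range(n)) for v in range(n)]
X_hat = reconstruct(Y, gamma)
```

For this setting the condition number is about 1.67, several orders of magnitude below the values reported above for MASK and C&P, and the per-entry closed form makes the reconstruction cost linear in the domain size.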
and using this inequality in Eq. 13, we finally obtain

$$c \ge \frac{n-1}{\frac{n\gamma}{\gamma+n-1} - 1} = \frac{\gamma+n-1}{\gamma-1} = 1 + \frac{n}{\gamma-1} \tag{15}$$

Therefore, the minimum condition number for the symmetric positive-definite perturbation matrices under privacy constraints represented by $\gamma$ is $(1 + \frac{n}{\gamma-1})$. The condition number of our "gamma-diagonal" matrix of size $|S_U|$ can be computed as shown in the Appendix, and its value turns out to be $(1 + \frac{|S_U|}{\gamma-1})$. Thus, it is a minimum condition number perturbation matrix.

5 Database size and mining accuracy

In this section, we analyze the dependence of deviations of itemset counts in the perturbed database from their expected values, with respect to the size of the database. Then, we give bounds on the database sizes required for obtaining a desired accuracy.

As discussed earlier, $Y_v$ denotes the total number of records with value v in the perturbed database, given by

$$Y_v = \sum_{i=1}^{N} Y_v^i$$

where $Y_v^i$ is the Bernoulli random variable for record i, and N is the size of the database. To bound the deviation of $Y_v$ from its expected value $E(Y_v)$, we use Hoeffding's General Bound (Motwani and Raghavan 1995), which bounds the deviation of the sum of Bernoulli random variables from its mean. Using these bounds for $Y_v$, we get

$$P\left[\, |Y_v - E(Y_v)| < \epsilon N \,\right] \ge 1 - 2e^{-2\epsilon^2 N}$$

where $\epsilon$ ($0 < \epsilon < 1$) represents the desired upper bound on the normalized deviation. For the above probability to be greater than a user-specified value $\Delta$, the value of N should satisfy the following:

$$1 - 2e^{-2\epsilon^2 N} \ge \Delta \;\Rightarrow\; N \ge \frac{\ln(2/(1-\Delta))}{2\epsilon^2} \tag{16}$$

That is, to achieve the desired accuracy (given by $\epsilon$), with the desired probability (given by $\Delta$), the miner must collect data from at least the number of customers given by the above bound. For example, with $\epsilon = 0.001$ and $\Delta = 0.95$, this turns out to be $N \ge 2 \times 10^6$, which is well within the norm for typical e-commerce environments. Moreover, note that these acceptable values were obtained with the Hoeffding Bound, a comparatively loose bound, and that in practice, it is possible that even datasets that do not fully meet this requirement may be capable of providing the desired accuracy.
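As a quick numerical check of the database-size bound in Eq. 16, the following minimal sketch computes the smallest N for a given normalized-deviation bound and desired probability. The function name `min_database_size` and the parameter names `eps` and `confidence` are illustrative assumptions, not names used in the paper.

```python
import math


def min_database_size(eps: float, confidence: float) -> int:
    """Smallest integer N satisfying Eq. 16.

    eps        -- upper bound on the normalized deviation (0 < eps < 1)
    confidence -- desired probability that the deviation stays below eps
    (Both parameter names are hypothetical labels for the paper's symbols.)
    """
    if not (0.0 < eps < 1.0) or not (0.0 < confidence < 1.0):
        raise ValueError("eps and confidence must lie strictly between 0 and 1")
    # N >= ln(2 / (1 - confidence)) / (2 * eps^2)
    return math.ceil(math.log(2.0 / (1.0 - confidence)) / (2.0 * eps ** 2))


# The example quoted in the text: deviation bound 0.001, probability 0.95.
n_required = min_database_size(0.001, 0.95)
```

For the quoted example this evaluates to roughly 1.84 million records, which is consistent with the rounded requirement of 2 × 10^6 stated above.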
For completeness, we now consider the hopefully rare situation wherein the customers are so few that accuracy cannot be guaranteed as per Eq. 16. Here, one approach that could be taken is to collect multiple independent perturbations of each customer's record, thereby achieving the desired target size. But, this has to be done carefully since the multiple distorted copies can potentially lead to a privacy breach, as described next.

5.1 Multiple versions of perturbed database

Assume that each user perturbs his/her record m times independently, so that overall the miner obtains m versions of the perturbed database. We hereafter refer to the set of perturbed records that share a common source record as "siblings".

Recall that a basic assumption made when defining privacy breaches in Sect. 3.1 was that the perturbed value $R(U_i)$ for the record i does not reveal any information about a record $j \ne i$. This assumption continues to be true in the multiple-versions variant if the miner is not aware of which records in the perturbed data set are siblings. Consequently, the privacy analysis of Sect. 3 can be applied verbatim to prove $\gamma$-amplification privacy guarantees in this environment as well. Therefore, all that needs to be done is to choose m such that the overall size of the database satisfies Eq. 16.

5.1.1 Multiple known siblings

The preceding analysis still leaves open the question as to what happens in situations wherein the data miner is aware of the siblings in the perturbed data set? Clearly, in this case, one sibling reveals information about another sibling, violating the assumption required for $\gamma$-amplification privacy. At first glance, it might appear that this problem can be easily tackled by treating each group of siblings as a single multi-dimensional vector, but this strategy completely nullifies the original objective of having multiple versions to enhance accuracy. It appears to us that maintaining accuracy requirements under such extreme circumstances may require relaxing the privacy constraints, as per the following discussion: With the gamma-diagonal matrix, the probability of a data value remaining unchanged is more than the probability of its being altered to any other value. Therefore, to guess the original value, the miner will obviously look for the value that appears the most number of times in the sibling records. For example, if 9 out of 10 versions of a given record have the identical perturbed value for an attribute, the miner knows with high probability the original value of that attribute. Therefore, in the remainder of this section, we quantitatively investigate the impact on privacy of having multiple known siblings in the database, with privacy now defined as the probability of correctly guessing the original value.

The first analysis technique that comes to mind is to carry out a hypothesis test ("the value seen the maximum number of times is indeed the true value") using the $\chi^2$ statistic. However, this test is not practical in our environment because of the extreme skewness of the distribution and the large cardinalities of the value domain. Therefore, we pursue the following alternate line of analysis: Consider a particular record with original (true) value u, which is independently perturbed m times,
producing m perturbed record values. Let $n_v$ be the number of times a perturbed value v appears in these m values, and let R be the random variable representing the value which is present the maximum number of times, i.e., $R = i$ if $\forall j \ne i,\; n_i > n_j$. Then, the probability of correctly guessing R = u is

$$P(R = u) = P\left(\wedge_{v \ne u} (n_u > n_v)\right) \quad \text{with} \quad \sum_i n_i = m$$

Clearly, if u appears less than or equal to $L = \lfloor m/|S_V| \rfloor$ times, it cannot be the most frequent occurrence, since there must be another value v appearing at least $\lceil m/|S_V| \rceil$ times in the perturbed records. Hence, the probability of a correct guess satisfies the following inequality:

$$P(R = u) = 1 - P(R \ne u) \le 1 - P(n_u \le L) = 1 - \sum_{k=0}^{L} P(n_u = k) = 1 - \sum_{k=0}^{L} {}^{m}C_k \cdot p^k \cdot (1-p)^{m-k} \tag{17}$$

where p is the probability $p[u \to u]$. The last step follows from the fact that $n_u$ is a binomially distributed random variable. Observe that $L = \lfloor m/|S_V| \rfloor \ge 0$, and hence the above inequality can be reduced to

$$P(R = u) \le 1 - P(n_u = 0) = 1 - (1-p)^m$$

For the gamma-diagonal matrix, $p = p[u \to u] = \gamma x$, resulting in the probability of a correct guess being

$$P(R = u) \le 1 - (1 - \gamma x)^m \tag{18}$$

The record domain size $|S_V|$ can be reasonably expected to be (much) greater than m in most database environments. This implies that the value of $p = \gamma x$ will usually be very small, leading to an acceptably low guessing probability.

A legitimate concern here is that the miner may try to guess the values of individual sensitive attributes (or a subset of such attributes) in a record, rather than its entire contents. To assess this possibility, let us assume that u and v, which were used earlier to denote values of complete records, now refer to a single attribute. As derived later in Sect. 8, for an attribute of domain size $|S_V^1|$, the probability $p[u \to u]$ is given by:
$$p[u \to u] = \gamma x + \left( \frac{|S_V|}{|S_V^1|} - 1 \right) x$$

An upper bound for the single-attribute guessing probability is directly obtained by substituting the above value of p, and $L = \lfloor m/|S_V^1| \rfloor$, in the inequality of Eq. 17.

A quantitative assessment of the number of versions that can be provided without jeopardizing user privacy is achieved by plotting the guessing probability upper bound against m, the number of versions. Sample plots are shown in Fig. 1 for a representative setup: $\gamma = 19$ with a record domain size $|S_V| = 2000$ and various single-attribute domain sizes ($|S_V^1|$ = 2, 4, 8). The solid line corresponds to the full-record whereas the dashed lines reflect the single-attribute cases.

Observe in Fig. 1 that the full-record guessing probability remains less than 0.1 even when the number of versions is as many as 50, and is limited to 0.25 for the extreme of 100 versions. Turning our attention to the single-attribute case, we see that for the lowest possible domain size, namely 2, the guessing probability levels off at around 0.5; note that this is no worse than the miner's ability to correctly guess the attribute value without having access to the data. Of course, for larger domain sizes such as 4 and 8, there is added information from the data. However, the key point again is that the guessing probabilities for these cases also level off around 0.5 in the practical range of m. In short, the miner's guess is at least as likely to be wrong as it is to be correct, which appears to be an acceptable privacy level in practice. Moreover, for less stringent $\gamma$ values, the guessing probabilities will decrease even further.

Overall, these results imply that a substantial number of perturbed versions can be provided by users to the miner before their (guessing-probability) privacy can be successfully breached.

Fig. 1 P(R = u) vs. m

The observations in this section also indicate that FRAPP is robust against a potential privacy breach scenario where the information obtained from the users is (a) gathered periodically, (b) the set of users is largely the same, and (c) the data inputs of the users
are often the same or very similar to their previous values. Such a scenario can occur, for example, when there is a core user community that regularly updates its subscription to an Internet service, like those found in the health or insurance industries. We therefore opine that FRAPP can be successfully used even in these challenging situations.

6 Randomizing the perturbation matrix

The estimation models discussed thus far implicitly assumed the perturbation matrix A to be deterministic. However, it appears intuitive that if the perturbation matrix parameters were themselves randomized, so that each client uses a perturbation matrix not specifically known to the miner, the privacy of the client will be further increased. Of course, it may also happen that the reconstruction accuracy suffers in this process. We explore this tradeoff, in this section, by replacing the deterministic matrix A with a randomized matrix $\tilde{A}$, where each entry $\tilde{A}_{vu}$ is a random variable with $E(\tilde{A}_{vu}) = A_{vu}$. The values taken by the random variables for a client $C_i$ provide the specific parameter settings for her perturbation matrix.

6.1 Privacy guarantees

Let $Q(U_i)$ be a "property" (as explained in Sect. 3.1) of client $C_i$'s private information, and let record $U_i = u$ be perturbed to $V_i = v$. Denote the prior probability of $Q(U_i)$ by $P(Q(U_i))$. Then, on seeing the perturbed data, the posterior probability of the property is calculated to be:

$$P(Q(U_i) \mid V_i = v) = \sum_{u:\,Q(u)} P_{U_i|V_i}(u \mid v) = \sum_{u:\,Q(u)} \frac{P_{U_i}(u)\, P_{V_i|U_i}(v \mid u)}{P_{V_i}(v)}$$

When a deterministic perturbation matrix A is used for all clients, then $\forall i\; P_{V_i|U_i}(v \mid u) = A_{vu}$, and hence

$$P(Q(U_i) \mid V_i = v) = \frac{\sum_{Q(u)} P_{U_i}(u) A_{vu}}{\sum_{Q(u)} P_{U_i}(u) A_{vu} + \sum_{\neg Q(u)} P_{U_i}(u) A_{vu}}$$

As discussed by Evfimievski et al. (2003), the data distribution $P_{U_i}$ can, in the worst-case, be such that $P(U_i = u) > 0$ only if $\{u \in I_U \mid Q(u),\; A_{vu} = \max_{Q(u')} A_{vu'}\}$ or $\{u \in I_U \mid \neg Q(u),\; A_{vu} = \min_{\neg Q(u')} A_{vu'}\}$. For the deterministic gamma-diagonal matrix, $\max_{Q(u')} A_{vu'} = \gamma x$ and $\min_{\neg Q(u')} A_{vu'} = x$, resulting in

$$P(Q(U_i) \mid V_i = v) = \frac{P(Q(u)) \cdot \gamma x}{P(Q(u)) \cdot \gamma x + P(\neg Q(u))\, x}$$
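The worst-case posterior above depends only on the prior and on γ, since the common factor x cancels. The following minimal sketch evaluates it; the function name `worst_case_posterior` and its parameter names are illustrative assumptions rather than anything defined in the paper.

```python
def worst_case_posterior(prior: float, gamma: float) -> float:
    """Worst-case posterior P(Q | V = v) for the deterministic gamma-diagonal matrix.

    prior -- prior probability P(Q(u)) of the property (0 < prior < 1)
    gamma -- privacy amplification parameter of the matrix (gamma > 0)
    The factor x appears in every term of the ratio and therefore cancels.
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must lie strictly between 0 and 1")
    if gamma <= 0.0:
        raise ValueError("gamma must be positive")
    return prior * gamma / (prior * gamma + (1.0 - prior))


# A 5% prior with gamma = 19.
posterior = worst_case_posterior(0.05, 19.0)
```

With a 5% prior and γ = 19 the function returns 0.5 (up to floating-point rounding), i.e. a 50% worst-case posterior.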
Since the distribution $P_U$ is known through reconstruction, the above posterior probability can be determined by the miner. For example, if $P(Q(u)) = 5\%$, and $\gamma = 19$, the posterior probability works out to 50% for perturbation with the gamma-diagonal matrix.

But, in the randomized matrix case, where $P_{V_i|U_i}(v \mid u)$ is a realization of random variable $\tilde{A}$, only its distribution (and not the exact value for a given i) is known to the miner. This means that posterior probability computations like the one shown above cannot be made by the miner for a given record $U_i$. To make this concrete, consider a randomized matrix $\tilde{A}$ such that

$$\tilde{A}_{uv} = \begin{cases} \gamma x + r & \text{if } u = v \\ x - \frac{r}{|S_U| - 1} & \text{o.w.} \end{cases} \tag{19}$$

where $x = \frac{1}{\gamma + |S_U| - 1}$ and r is a random variable uniformly distributed between $[-\alpha, \alpha]$. Here, the worst-case posterior probability (and, hence, the privacy guarantee) for a record $U_i$ is a function of the value of r, and is given by

$$\rho_2(r) = P(Q(u) \mid v) = \frac{P(Q(u)) \cdot (\gamma x + r)}{P(Q(u)) \cdot (\gamma x + r) + P(\neg Q(u)) \left( x - \frac{r}{|S_U| - 1} \right)}$$

Therefore, only the posterior probability range, that is, $[\rho_2^-, \rho_2^+] = [\rho_2(-\alpha), \rho_2(+\alpha)]$, and the distribution over this range, can be determined by the miner. For example, for the scenario where $P(Q(u)) = 5\%$, $\gamma = 19$, and $\alpha = \gamma x/2$, the posterior probability lies in the range [33%, 60%], with its probability of being greater than 50% ($\rho_2$ at r = 0) equal to its probability of being less than 50%.

6.2 Reconstruction model

With minor modifications, the reconstruction model analysis for the randomized perturbation matrix $\tilde{A}$ can be carried out similar to that carried out earlier in Sect. 3.2 for the deterministic matrix A. Specifically, the probability of success for Bernoulli variable $Y_v^i$ is now modified to

$$P(Y_v^i = 1 \mid \tilde{A}_{vu}) = \tilde{A}_{vu}, \quad \text{for } U_i = u$$

and, from Eq. 4,

$$E(Y_v \mid \tilde{A}_{vu}) = \sum_{i=1}^{N} P(Y_v^i = 1 \mid \tilde{A}_{vu}) = \sum_{u \in I_U} \sum_{\{i \mid U_i = u\}} \tilde{A}_{vu}$$
$$= \sum_{u \in I_U} \tilde{A}_{vu} X_u \tag{20}$$

$$\Rightarrow E(Y \mid \tilde{A}) = \tilde{A} X$$

leading to

$$E(E(Y \mid \tilde{A})) = A X \tag{21}$$

We estimate X as $\hat{X}$ given by the solution of the following equation

$$Y = A \hat{X} \tag{22}$$

which is an approximation to Eq. 21. From Theorem 1, the error in estimation is bounded by:

$$\frac{\| \hat{X} - X \|}{\| X \|} \le c\, \frac{\| Y - E(E(Y \mid \tilde{A})) \|}{\| E(E(Y \mid \tilde{A})) \|} \tag{23}$$

where c is the condition number of perturbation matrix $A = E(\tilde{A})$.

We now compare these bounds with the corresponding bounds of the deterministic case. Firstly, note that, due to the use of the randomized matrix, there is a double expectation for Y on the RHS of the inequality, as opposed to the single expectation in the deterministic case. Secondly, only the numerator is different between the two cases since we can easily show that $E(E(Y \mid \tilde{A})) = AX$. The numerator can be bounded by

$$\| Y - E(E(Y \mid \tilde{A})) \| = \| (Y - E(Y \mid \tilde{A})) + (E(Y \mid \tilde{A}) - E(E(Y \mid \tilde{A}))) \| \le \| Y - E(Y \mid \tilde{A}) \| + \| E(Y \mid \tilde{A}) - E(E(Y \mid \tilde{A})) \|$$

Here, $\| Y - E(Y \mid \tilde{A}) \|$ is taken to represent the empirical variance of random variable $Y_v$. Since $Y_v$ is, as discussed before, Poisson-Binomial distributed, its variance is given by (Wang 1993)

$$Var(Y_v \mid \tilde{A}) = N \bar{p}_v - \sum_i (p_v^i)^2 \tag{24}$$

where $\bar{p}_v = \frac{1}{N} \sum_i p_v^i$ and $p_v^i = P(Y_v^i = 1 \mid \tilde{A})$.

It is easily seen (by elementary calculus or induction) that among all combinations $\{p_v^i\}$ such that $\sum_i p_v^i = N \bar{p}_v$, the sum $\sum_i (p_v^i)^2$ assumes its minimum value when all $p_v^i$ are equal. It follows that, if the average probability of success $\bar{p}_v$ is kept constant, $Var(Y_v)$ assumes its maximum value when $p_v^1 = \cdots = p_v^N$. In other words, the variability of $p_v^i$, or its lack of uniformity, decreases the magnitude of chance fluctuations, as measured by its variance (Feller 1988). By using random matrix $\tilde{A}$ instead of deterministic A, we increase the variability of $p_v^i$ (now $p_v^i$ assumes variable values for all i), hence decreasing the fluctuation of $Y_v$ from its expectation. In short,
$\| Y - E(Y \mid \tilde{A}) \|$ is likely to be decreased as compared to the deterministic case, thereby reducing the error bound.

On the other hand, the value of the second term: $\| E(Y \mid \tilde{A}) - E(E(Y \mid \tilde{A})) \|$, which depends upon the variance of the random variables in $\tilde{A}$, was 0 in the deterministic case, whereas it is now positive. Thus, the error bound is increased by this term.

Overall, we have a tradeoff situation here, and as shown later in our experiments of Sect. 9, the tradeoff turns out such that the two opposing terms almost cancel each other out, making the error only marginally worse than the deterministic case.

7 Implementation of perturbation algorithm

Having discussed the privacy and accuracy issues of the FRAPP approach, we now turn our attention to the implementation of the perturbation algorithm described in Sect. 3. For this, we effectively need to generate for each $U_i = u$, a discrete distribution with PMF $P(v) = A_{vu}$ and CDF $F(v) = \sum_{i \le v} A_{iu}$, defined over $v = 1, \ldots, |S_V|$.

A straightforward algorithm for generating the perturbed record v from the original record u is the following

(1) Generate $r \sim U(0, 1)$
(2) Repeat for $v = 1, \ldots, |S_V|$: if $F(v-1) < r \le F(v)$, return $V_i = v$

where U(0, 1) denotes uniform continuous distribution over [0, 1].

This algorithm, whose complexity is proportional to the product of the cardinalities of the attribute domains, will require $|S_V|/2$ iterations on average which can turn out to be very large. For example, with 31 attributes, each with two categories, this amounts to $2^{30}$ iterations per customer! We therefore present below an alternative algorithm whose complexity is proportional to the sum of the cardinalities of the attribute domains.

Specifically, to perturb record $U_i = u$, we can write

$$P(V_i; U_i = u) = P(V_{i1}, \ldots, V_{iM}; u) = P(V_{i1}; u) \cdot P(V_{i2} \mid V_{i1}; u) \cdots P(V_{iM} \mid V_{i1}, \ldots, V_{i(M-1)}; u)$$

where $V_{ij}$ denotes the jth attribute of record $V_i$. For the perturbation matrix A, this works out to

$$P(V_{i1} = a; u) = \sum_{\{v \mid v(1) = a\}} A_{vu}$$

$$P(V_{i2} = b \mid V_{i1} = a; u) = \frac{P(V_{i2} = b, V_{i1} = a; u)}{P(V_{i1} = a; u)} = \frac{\sum_{\{v \mid v(1) = a \text{ and } v(2) = b\}} A_{vu}}{P(V_{i1} = a; u)}$$

... and so on
where v(i) denotes the value of the ith attribute for the record with value v.

When A is chosen to be the gamma-diagonal matrix, and $n_j$ is used to represent $\prod_{k=1}^{j} |S_U^k|$, we get the following expressions for the above probabilities after some simple algebraic manipulations:

$$P(V_{i1} = b; U_{i1} = b) = \left( \gamma + \frac{n_M}{n_1} - 1 \right) x$$
$$P(V_{i1} = b; U_{i1} \ne b) = \frac{n_M}{n_1}\, x \tag{25}$$

and for the jth attribute

$$P(V_{ij} = b \mid V_{i1}, \ldots, V_{i(j-1)}; U_{ij} = b) = \begin{cases} \dfrac{(\gamma + \frac{n_M}{n_j} - 1)\, x}{\prod_{k=1}^{j-1} p_k} & \text{if } \forall k < j,\; V_{ik} = U_{ik} \\[2ex] \dfrac{(\frac{n_M}{n_j})\, x}{\prod_{k=1}^{j-1} p_k} & \text{o.w.} \end{cases}$$
$$P(V_{ij} = b \mid V_{i1}, \ldots, V_{i(j-1)}; U_{ij} \ne b) = \frac{(\frac{n_M}{n_j})\, x}{\prod_{k=1}^{j-1} p_k} \tag{26}$$

where $p_k$ is the probability that $V_{ik}$ takes value a, given that a is the outcome of the random process performed for the kth attribute, i.e., $p_k = P(V_{ik} = a \mid V_{i1}, \ldots, V_{i(k-1)}; U_i)$.

The above perturbation algorithm takes M steps, one for each attribute. For the first attribute, the probability distribution of the perturbed value depends only on the original value for the attribute and is given by Eq. 25. For any subsequent column j, to achieve the desired random perturbation, we use as input both its original value and the perturbed values of the previous j − 1 columns, and then generate the perturbed value for j as per the discrete distribution given in Eq. 26. This is an example of dependent column perturbation, in contrast to the independent column perturbations used in most of the prior techniques.

Note that even though the perturbation of a column depends on the perturbed values of previous columns, the columns can be perturbed in any order. Specifically, the probability distribution for each column perturbation, as given by Eqs. 25 and 26, gets modified accordingly so that the overall distribution for record perturbation remains the same.

Finally, to assess the complexity of the algorithm, it is easy to see that the maximum number of iterations for generating the jth discrete distribution is $|S_U^j|$, and hence the maximum number of iterations for generating a perturbed record is $\sum_j |S_U^j|$.

Remark The scheme presented above gives a general approach to ensure that the complexity is proportional to the sum of attribute cardinalities, for any choice of
perturbation matrix. However, specifically for the gamma-diagonal matrix, a simpler algorithm could be used. Namely, with probability $x(\gamma - 1)$ return the original tuple, otherwise choose the value of each attribute in the perturbed tuple uniformly and independently.4 In this special case, the algorithm is a generalization of Warner's classical randomized response technique (Warner 1965).

4 Note that the notion of independence is with regard to the perturbation process, not the data distributions of the attributes.

8 Application to mining tasks

To illustrate the utility of the FRAPP framework, we demonstrate in this section how it can be integrated in two representative mining tasks, namely association rule mining, which identifies interesting correlations between database attributes (Agrawal and Srikant 1994), and classification rule mining, which produces class labeling rules for data records based on an initial training set (Mitchell 1997).

8.1 Association rule mining

The core computation in association rule mining is to identify "frequent itemsets", that is, itemsets whose support (i.e. frequency) in the database is in excess of a user-specified threshold $sup_{min}$. Eq. 8 can be directly used to estimate the support of itemsets containing all M categorical attributes. However, in order to incorporate the reconstruction procedure into bottom-up association rule mining algorithms such as Apriori (Agrawal and Srikant 1994), we need to also be able to estimate the supports of itemsets consisting of only a subset of attributes. This procedure is described next.

Let C denote the set of all attributes in the database, and $C_s$ be a subset of these attributes. Each of the attributes $j \in C_s$ can assume one of the $|S_U^j|$ values. Thus, the number of itemsets over attributes in $C_s$ is given by $I_{C_s} = \prod_{j \in C_s} |S_U^j|$. Let L, H denote generic itemsets over this subset of attributes. A user record supports the itemset L if the attributes in $C_s$ take the values represented by L.

Let the support of L in the original and distorted databases be denoted by $sup_L^U$ and $sup_L^V$, respectively. Then,

$$sup_L^V = \frac{1}{N} \sum_{v \text{ supports } L} Y_v$$

where $Y_v$ denotes the number of records in V with value v (refer Sect. 3.2). From Eq. 8, we know

$$Y_v = \sum_{u \in I_U} A_{vu} X_u$$

and therefore, using the fact that A is symmetric,
$$sup_L^V = \frac{1}{N} \sum_{v \text{ supports } L} \sum_u A_{vu} X_u = \frac{1}{N} \sum_u X_u \sum_{v \text{ supports } L} A_{vu}$$

Grouping the records u by the itemsets H that they support:

$$sup_L^V = \frac{1}{N} \sum_H \sum_{u \text{ supports } H} X_u \sum_{v \text{ supports } L} A_{vu} \tag{27}$$

Analyzing the term $\sum_{v \text{ supports } L} A_{vu}$ in the above equation, we see that it represents the sum of the entries of column u in A over rows v that support itemset L. Now, consider the columns u that support a given itemset H. Note that due to the structure of the gamma-diagonal matrix A, if H = L, then one diagonal entry is part of this sum, otherwise the summation involves only non-diagonal terms. Therefore, for all u that support a given itemset H:

$$\sum_{v \text{ supports } L} A_{vu} = \begin{cases} \gamma x + \left( \frac{I_C}{I_{C_s}} - 1 \right) x & \text{if } H = L \\ \frac{I_C}{I_{C_s}}\, x & \text{o.w.} \end{cases} \;:=\; A_{HL} \tag{28}$$

i.e. the probability of an itemset remaining the same after perturbation is $\frac{\gamma + I_C/I_{C_s} - 1}{I_C/I_{C_s}}$ times the probability of it being distorted to any other itemset. Substituting in Eq. 27:

$$sup_L^V = \frac{1}{N} \sum_H A_{HL} \sum_{u \text{ supports } H} X_u = \sum_H A_{HL}\, sup_H^U$$

Thus, we can estimate the supports of itemsets over any subset $C_s$ of attributes using the matrix $A_{HL}$, which is of much smaller dimension ($I_{C_s} \times I_{C_s}$) for small itemsets as compared to the original full matrix A.

A legitimate concern here might be that the matrix inversion could become time-consuming as we proceed to larger itemsets making $I_{C_s}$ large. Fortunately, the inverse for this matrix has a simple closed-form expression:

Theorem 5 The inverse of A is a matrix of order $n = I_{C_s}$ of the form $B = \{B_{ij} : 1 \le i \le n,\; 1 \le j \le n\}$, where

$$B_{ij} = \begin{cases} \delta y & \text{if } i = j \\ y & \text{o.w.} \end{cases}$$

with $\delta = -(\gamma + n - 2)$ and $y = -\frac{I_C}{I_{C_s}} \cdot \frac{1}{\gamma - 1}$
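The closed form of Theorem 5 can be sanity-checked numerically. The sketch below does so only for full-length itemsets (assuming $C_s = C$, so that $I_C/I_{C_s} = 1$ and the matrix is the plain gamma-diagonal matrix); the helper names `gamma_diagonal`, `closed_form_inverse`, `matmul` and `is_identity`, and the choice n = 8, γ = 19, are illustrative assumptions.

```python
def gamma_diagonal(n: int, gamma: float) -> list:
    """n x n gamma-diagonal matrix: gamma*x on the diagonal, x elsewhere."""
    x = 1.0 / (gamma + n - 1)
    return [[gamma * x if i == j else x for j in range(n)] for i in range(n)]


def closed_form_inverse(n: int, gamma: float) -> list:
    """Closed-form inverse for the full-length case (I_C / I_Cs = 1 assumed)."""
    y = -1.0 / (gamma - 1.0)   # off-diagonal entry
    delta = -(gamma + n - 2)   # diagonal entry is delta * y
    return [[delta * y if i == j else y for j in range(n)] for i in range(n)]


def matmul(a: list, b: list) -> list:
    """Plain square-matrix product, adequate for a small check."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def is_identity(m: list, tol: float = 1e-9) -> bool:
    """True if m equals the identity matrix up to the tolerance tol."""
    n = len(m)
    return all(abs(m[i][j] - (1.0 if i == j else 0.0)) <= tol
               for i in range(n) for j in range(n))


A = gamma_diagonal(8, 19.0)
B = closed_form_inverse(8, 19.0)
products_are_identity = is_identity(matmul(A, B)) and is_identity(matmul(B, A))
```

Building B this way costs O(n^2) to materialize and needs no general-purpose matrix inversion, which is the practical benefit the theorem is meant to provide.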
Proof As both A and B are square matrices of the same order, AB and BA are valid products. Also it can be trivially seen (by actual multiplication) that AB = BA = I, where I is the identity matrix of order $I_{C_s}$.

The above closed-form inverse can be directly used in the reconstruction process, greatly reducing both space and time resources. Specifically, the reconstruction algorithm can now be very simply written as:

for each L from 1 to n do

$$sup_L^U = sup_L^V\, \delta y + (N - sup_L^V)\, y \tag{29}$$

where N is the database cardinality, $sup_L^V$ and $sup_L^U$ are the perturbed and reconstructed frequencies, respectively, and n is the size of the index set, which is $I_{C_s}$ for a subset of attributes and $I_C$ for full-length itemsets.

Thus we can efficiently reconstruct the counts of itemsets over any subset of attributes without needing to construct the counts of complete records, and our scheme can be implemented efficiently on bottom-up association rule mining algorithms such as Apriori (Agrawal and Srikant 1994). Further, it is trivially easy to incorporate FRAPP even in incremental association rule mining algorithms such as DELTA (Pudi and Haritsa 2000) which operate periodically on changing historical databases, and use the results of previous mining operations to minimize the amount of work carried out during each new mining operation.

8.2 Classification rule mining

We now turn our attention to the task of classification rule mining. The primary input required for this process is the distribution of attribute values for each class in the training data. This input can be produced through the "ByClass" privacy-preserving algorithm enunciated by Agrawal and Srikant (2000), which partitions the training data by class label, and then separately distorts and reconstructs the distributions for the records corresponding to each class. After this reconstruction, an off-the-shelf classifier can be used to produce the actual classification rules. However, a complication that may arise in the privacy-preserving environment is that of negative reconstructed frequencies, described next.

8.2.1 Negative reconstructed frequencies

During the reconstruction process, it is sometimes possible that using the expressions given in Eq. 29, negative reconstructed frequencies may arise. This is because, given a large index set, it is possible that several indices may have little or no representation at all (i.e. low $sup_L^V$), even after perturbation of the dataset. While this occurs for association rule mining too, it is not a problem there because such itemsets are automatically pruned due to the minimum support criterion. In the case of classification, however, negative frequencies pose difficulties because (a) they lack meaningful interpretation, and (b) classification techniques based on calculating logarithms of
the itemset frequencies, such as decision tree classifiers (Quinlan 1993), now become infeasible.

To address this problem, we first set all negative reconstructed frequencies to zero and then uniformly scale down the positive frequencies such that their sum remains equal to the original dataset size. The rationale is that records corresponding to negative frequencies are scarce in the original dataset (i.e. "outliers") and can therefore be ignored without significant loss of accuracy. Further the scaling down of the positive frequencies is consistent with rule generation since classification techniques are based on relative frequencies or distributions, rather than absolute frequencies.

9 Performance evaluation

We move on, in this section, to quantitatively assessing the utility of the FRAPP approach with respect to the privacy and accuracy levels that it can provide for association rule mining and classification rule mining.

9.1 Association rule mining

9.1.1 Datasets

Two datasets, CENSUS and HEALTH, which are both derived from real-world repositories, are used in our experiments. Since it has been established in several sociological studies (e.g. Cranor et al. 1999; Westin 1999) that users typically expect privacy on only a few of the database fields (usually sensitive attributes such as health, income, etc.), our datasets also project out a representative subset of the columns in the original databases. The complete details of the datasets are given below:

CENSUS This dataset contains census information for about 50,000 adult American citizens, and is available from the UCI repository http://www.ics.uci.edu/~mlearn/mlsummary.html. We used three categorical (native-country, sex, race) attributes and three continuous (age, fnlwgt, hours-per-week) attributes from the census database in our experiments, with the continuous attributes partitioned into discrete intervals to convert them into categorical attributes. The specific categories used for these six attributes are listed in Table 1.

Table 1 CENSUS dataset

Attribute        Categories
Race             White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black
Sex              Female, Male
Native-country   United-States, Other
Age              [15−35), [35−55), [55−75), ≥ 75
Fnlwgt           [0−1e5), [1e5−2e5), [2e5−3e5), [3e5−4e5), ≥ 4e5
Hours-per-week   [0−20), [20−40), [40−60), [60−80), ≥ 80
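The conversion of a continuous attribute into the categories of Table 1 can be sketched as a simple interval lookup. The snippet below is a minimal illustration for the Age attribute only; the names `AGE_EDGES`, `AGE_LABELS`, `age_category` and `CENSUS_DOMAIN_SIZES` are hypothetical, and rejecting ages below 15 is an assumption based on the first interval in the table.

```python
from bisect import bisect_right
from math import prod

# Interior boundaries and labels of the Age intervals listed in Table 1.
AGE_EDGES = [35, 55, 75]
AGE_LABELS = ["[15-35)", "[35-55)", "[55-75)", ">=75"]


def age_category(age: float) -> str:
    """Map a continuous age onto its Table 1 interval label."""
    if age < 15:
        # Assumption: the table has no category for ages below 15.
        raise ValueError("age below the first interval of Table 1")
    # bisect_right counts the boundaries <= age, so a boundary value
    # (e.g. 35) falls into the interval that starts at that boundary.
    return AGE_LABELS[bisect_right(AGE_EDGES, age)]


# Number of categories per attribute, counted from Table 1.
CENSUS_DOMAIN_SIZES = {
    "race": 5, "sex": 2, "native-country": 2,
    "age": 4, "fnlwgt": 5, "hours-per-week": 5,
}
record_domain_size = prod(CENSUS_DOMAIN_SIZES.values())
```

The product of the per-attribute category counts is 2000, which coincides with the record domain size quoted for the representative setup of Fig. 1.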
Table 2 HEALTH dataset

Attribute                                Categories
INCFAM20 (family income)                 Less than $20,000; $20,000 or more
HEALTH (health status)                   Excellent; Very good; Good; Fair; Poor
SEX (sex)                                Male; Female
PHONE (has telephone)                    Yes, phone number given; Yes, no phone number given; No
AGE (age)                                [0−20), [20−40), [40−60), [60−80), ≥ 80
BDDAY12 (bed days in past 12 months)     [0−7), [7−15), [15−30), [30−60), ≥ 60
DV12 (Doctor visits in past 12 months)   [0−7), [7−15), [15−30), [30−60), ≥ 60

Table 3 Frequent itemsets for $sup_{min}$ = 0.02

Itemset length   1    2     3     4     5     6    7
CENSUS           19   102   203   165   64    10   –
HEALTH           23   123   292   361   250   86   12

HEALTH This dataset captures health information for over 100,000 patients collected by the US government http://dataferrett.census.gov. We selected 4 categorical and 3 continuous attributes from the dataset for our experiments. These attributes and their categories are listed in Table 2.

The association rule mining accuracy of our schemes on these datasets was evaluated for a user-specified minimum support of $sup_{min}$ = 2%. Table 3 gives the number of frequent itemsets in the datasets for this support threshold, as a function of the itemset length.

9.1.2 Multiple versions

In Sect. 5, Eq. 16 gave the number of data records required to obtain relative inaccuracy of less than $\epsilon$ with a probability greater than $\Delta$. For $\epsilon = 0.001$, and $\Delta = 0.95$, this turned out to be $N \ge 2 \times 10^6$. Note that we need to consider small values of $\epsilon$, since the error given by $\epsilon$ will be further amplified by the condition number, as indicated by Eq. 9 for relative error in reconstructed counts. Since the datasets available to us were much smaller than the desired N, we resorted to scaling each dataset by a factor of 50 to cross the size threshold, by providing multiple distortions of each user record. As per the discussion in Sect. 5.1, such scaling does not result in any additional privacy breach if the miner has no knowledge of the sibling identities. Further, even when the miner does possess this knowledge, 50 versions was shown to retain an acceptable privacy level under the modified (guessing-probability) privacy definition.

A useful side-effect of the dataset scaling is that it also ensures that our results are applicable to large disk-resident databases.
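The multiple distortions of each user record mentioned above can be illustrated with the simpler gamma-diagonal procedure of Sect. 7 (keep the record with probability x(γ − 1), otherwise draw every attribute uniformly and independently). The following is a minimal sketch under stated assumptions: records are tuples of zero-based category indices, and the names `keep_probability`, `perturb_record` and `perturbed_versions` are hypothetical. It is not the authors' implementation.

```python
import random
from math import prod


def keep_probability(gamma: float, domain_sizes: list) -> float:
    """Probability x * (gamma - 1) of returning the original tuple unchanged."""
    n = prod(domain_sizes)            # record domain size
    x = 1.0 / (gamma + n - 1)
    return (gamma - 1.0) * x


def perturb_record(record: tuple, domain_sizes: list, gamma: float,
                   rng: random.Random) -> tuple:
    """One gamma-diagonal perturbation of a record of category indices."""
    if rng.random() < keep_probability(gamma, domain_sizes):
        return tuple(record)
    # Otherwise every attribute is drawn uniformly and independently.
    return tuple(rng.randrange(size) for size in domain_sizes)


def perturbed_versions(record: tuple, domain_sizes: list, gamma: float,
                       m: int, rng: random.Random) -> list:
    """m independent perturbations ("siblings") of the same record."""
    return [perturb_record(record, domain_sizes, gamma, rng) for _ in range(m)]


rng = random.Random(0)                # fixed seed only to make the sketch repeatable
sizes = [5, 2, 2, 4, 5, 5]            # category counts of the six CENSUS attributes
versions = perturbed_versions((0, 1, 0, 2, 3, 4), sizes, 19.0, 50, rng)
```

The uniform draw can land on the original record as well, so the overall probability of leaving a record unchanged is x(γ − 1) + x = γx, and each other value is reached with probability x, matching the gamma-diagonal entries.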
9.1.3 Performance metrics

We measure the performance of the system with regard to the accuracy that can be provided for a given privacy requirement specified by the user.

Privacy The $(\rho_1, \rho_2)$ strict privacy measure from Evfimievski et al. (2003) is used as the privacy metric. We experimented with a variety of privacy settings, for example, varying $\rho_2$ from 30% to 50% while keeping $\rho_1$ fixed at 5%, resulting in $\gamma$ values ranging from 9 to 19. The value of $\rho_1$ is representative of the fact that users typically want to hide uncommon values which set them apart from the rest, while a $\rho_2$ value of 50% indicates that the user can still plausibly deny any value attributed to him or her since it is equivalent to a random coin-toss attribution.

Accuracy We evaluate two kinds of mining errors, Support Error and Identity Error, in our experiments. The Support Error ($\mu$) metric reflects the average relative error (in percent) of the reconstructed support values for those itemsets that are correctly identified to be frequent. Denoting the number of frequent itemsets by $|F|$, the reconstructed support by $\widehat{sup}$ and the actual support by $sup$, the support error is computed over all frequent itemsets as

$$\mu = \frac{1}{|F|} \sum_{f \in F} \frac{|\widehat{sup}_f - sup_f|}{sup_f} * 100$$

The Identity Error ($\sigma$) metric, on the other hand, reflects the percentage error in identifying frequent itemsets and has two components: $\sigma^+$, indicating the percentage of false positives, and $\sigma^-$, indicating the percentage of false negatives. Denoting the reconstructed set of frequent itemsets with R and the correct set of frequent itemsets with F, these metrics are computed as

$$\sigma^+ = \frac{|R - F|}{|F|} * 100 \qquad \sigma^- = \frac{|F - R|}{|F|} * 100$$

9.1.4 Perturbation mechanisms

We present the results for FRAPP and representative prior techniques. For all the perturbation mechanisms, the mining of the distorted database was done using the Apriori (Agrawal and Srikant 1994) algorithm, with an additional support reconstruction phase at the end of each pass to recover the original supports from the perturbed database supports computed during the pass (Agrawal et al. 2004; Rizvi and Haritsa 2002).

Specifically, the perturbation mechanisms evaluated in our study are the following:

DET-GD This scheme uses the deterministic gamma-diagonal perturbation matrix A (Sect. 4) for perturbation and reconstruction. The perturbation was implemented using the techniques described in Sect. 7, and the equations of Sect. 8.1 were employed to construct the perturbation matrix used in each pass of Apriori.

RAN-GD This scheme uses the randomized gamma-diagonal perturbation matrix $\tilde{A}$ (Sect. 6) for perturbation and reconstruction. Though, in principle, any distribution
439 and 0. 2). 19) over the entire range of the α randomization parameter (0 to γ x). the M categorical attributes map to Mb = j  SU  boolean attributes. Further. The ﬂipping probability 1− p was chosen as the lowest value which could satisfy the Avu privacy constraints given by Eq. ξ ) combination giving the best mining accuracy. The constraint ∀v: ∀u 1 . ρ2 ) = (5%. ξ was chosen such that the matrix (Eq. extremely well. Note that the support error (µ) graphs are plotted on a logscale. here we evaluate the performance of uniformly distributed A (as given by Eq. 2. (2002). The corresponding graphs for the HEALTH dataset are shown in Fig. p Mb (1− p) Mb 2 9.494. 3. Hence the ratio of p 2M two entries in the matrix cannot be greater than (1− p)2M and the following condition is sufﬁcient for the privacy constraints to be satisﬁed: p 2M ≤γ (1 − p)2M The above equation was used to determine the appropriate value of p.448 for the CENSUS and HEALTH datasets. with the results for γ = 13. as a function of the length of the frequent itemsets (the performance of RANGD is shown for randomization parameter α = γ x/2). and results in γ = 19. the error being of the order of 10% for the longer itemsets. the support (µ) and identity (σ − . on an absolute scale. RANGD. intended for boolean databases and characterized by a single parameter 1 − p. its 123 . σ + ) errors of the four perturbation mechanisms (DETGD.5 Experimental results For the CENSUS dataset. u 2 : Avu 1 ≤ γ is satisﬁed ≤ γ . if one and only one of its associated boolean attributes takes value 1 in a particular record. Similar performance trends were observed for the other practical values of γ . this value turned out to be 0. we ﬁrst note that DETGD performs. for MASK (Rizvi and Haritsa 2002). (2003).A framework for highaccuracy privacypreserving mining 131 ˜ ˜ can be used for A. But. 11) satisﬁes the privacy constraints (Eq. In our scenario. C&P) for γ = 19 are shown in Fig. To choose K . Therefore. 
MASK This is the perturbation scheme proposed in Rizvi and Haritsa (2002). ρ2 ) = (5%.1. 50%). C&P This is the CutandPaste perturbation scheme proposed by Evﬁmievski et al. In these ﬁgures. For γ = 19 (corresponding to (ρ1 . The detailed results are presented here for a representative privacy requirement of (ρ1 . 4 and 5. we varied K from 0 to M. 50%)). The results reported here are for the (K . which was also used by Evﬁmievski et al. Thus. respectively. all the records contain exactly M number of 1s . for each categorical attribute. and for each K .28 and γ = 9 on CENSUS dataset shown in Figs. which determines the probability of an attribute value being ﬂipped. MASK. with algorithmic parameters K and ξ . which for γ = 19. turned out to be K = 3 and ξ = 0. 2. respectively. the categorical attributes are mapped to boolean attributes by making each value of the j category an attribute.
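The sufficient condition for MASK, $p^{2M}/(1-p)^{2M} \le \gamma$, can be solved in closed form at its boundary, which gives the smallest admissible flipping probability 1 − p. The following is a minimal sketch of that calculation, not the authors' code. The function name `mask_flip_probability` is hypothetical, and the attribute counts M = 6 and M = 7 are assumptions chosen for illustration, since this section does not state M for either dataset.

```python
def mask_flip_probability(gamma: float, num_attributes: int) -> float:
    """Smallest flipping probability 1 - p with (p / (1 - p))**(2M) <= gamma.

    At the boundary, p / (1 - p) = gamma**(1 / (2M)), so
    p = r / (1 + r) with r = gamma**(1 / (2M)).
    """
    if gamma <= 1 or num_attributes < 1:
        raise ValueError("expected gamma > 1 and at least one attribute")
    ratio = gamma ** (1.0 / (2 * num_attributes))  # largest allowed p / (1 - p)
    p = ratio / (1.0 + ratio)                      # retention probability
    return 1.0 - p                                 # flipping probability


# M = 6 and M = 7 are assumed attribute counts, used only for illustration.
for label, m in (("M = 6", 6), ("M = 7", 7)):
    print(f"{label}: 1 - p = {mask_flip_probability(19, m):.3f}")
```

Under these assumed values of M, the computed flipping probabilities round to 0.439 and 0.448, which appears to agree with the figures quoted for γ = 19. Because the bound only constrains the ratio p/(1 − p), any larger flipping probability up to 0.5 would also satisfy it, at the cost of more distortion.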
Fig. 2 CENSUS γ = 19. a Support error µ, b false negatives σ−, c false positives σ+
Fig. 3 HEALTH γ = 19. a Support error µ, b false negatives σ−, c false positives σ+
Fig. 4 Results for γ = 13.28, (ρ1, ρ2) = (5%, 41%), on CENSUS. a Support error µ, b false negatives σ−, c false positives σ+
Fig. 5 Results for γ = 9, (ρ1, ρ2) = (5%, 32%), on CENSUS. a Support error µ, b false negatives σ−, c false positives σ+
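The false positive (σ+) and false negative (σ−) identity errors plotted in these figures are set-difference ratios over the reconstructed itemsets R and the correct itemsets F. A minimal sketch of the computation follows; the function name `identity_errors` and the toy itemsets are hypothetical and are not taken from the paper's experiments.

```python
def identity_errors(reconstructed, correct):
    """Return (sigma_plus, sigma_minus) in percent.

    sigma_plus  = |R - F| / |F| * 100  (false positives)
    sigma_minus = |F - R| / |F| * 100  (false negatives)
    """
    r_set, f_set = set(reconstructed), set(correct)
    if not f_set:
        raise ValueError("the correct set of frequent itemsets is empty")
    sigma_plus = 100.0 * len(r_set - f_set) / len(f_set)
    sigma_minus = 100.0 * len(f_set - r_set) / len(f_set)
    return sigma_plus, sigma_minus


# Toy example: itemsets are frozensets of item labels.
F = {frozenset("a"), frozenset("b"), frozenset("c"), frozenset("ab")}
R = {frozenset("a"), frozenset("b"), frozenset("bc")}
print(identity_errors(R, F))  # (25.0, 50.0)
```

Both ratios are normalized by |F|, so σ+ can exceed 100% when the reconstruction reports more spurious itemsets than there are true ones.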
on a PIV 2. In return. and above 5 for the HEALTH dataset. b support error µ (HEALTH). while C&P could not identify itemsets beyond length 3 in both datasets. Role of condition numbers The primary reason for DETGD and RANGD’s good performance is the low condition numbers of their perturbation matrices. respectively.5 million records of 123 .0 GHz PC with 1 GB RAM and 40 GB hard disk. This is because. is only marginally worse than that of DETGD. The second point to note is that the accuracy of RANGD. In fact. with regard to actual mining response times also. as mentioned before. FRAPP takes about the same time as Apriori for the complete mining process on the original and perturbed databases. Computational overheads Finally. Speciﬁcally. 6b and c for the CENSUS and HEALTH datasets. Speciﬁcally. Figure 6a shows the performance of RANGD over the entire range of α with respect to the posterior probabi− + lity range [ρ2 . our choice of a gammadiagonal matrix indicates highly promising results for discovery of long patterns. c support error µ (CENSUS) performance is visibly better than that of MASK and C&P. We observe that the performance of RANGD does not deviate much from the deterministic case over the entire range. The mining support reconstruction errors for itemsets of length 4 are shown in Fig. In marked contrast. the condition numbers for MASK and C&P increase exponentially with increasing itemset length. resulting in drastic degradation in accuracy. Note that the condition numbers are not only low but also independent of the frequent itemset length (algebraic computation of condition numbers is shown in the Appendix). as the length of the frequent itemset increases. 7. the initial perturbation step took only a very modest amount of time even on vanilla PC hardware. although employing a randomized matrix. 6 Varying randomization of perturbation matrix (γ = 19). perturbing 2. Thus. Further. 
respectively.A framework for highaccuracy privacypreserving mining 133 (a) (b) (c) Fig. a Posterior probability. This is quantitatively shown in Fig. which plots these condition numbers on a logscale (the condition numbers of DETGD and RANGD are identical in this graph because ˜ E( A) = A). the reconstruction component shows up only in between mining passes and involves very simple computations (see Eq. ρ2 ]. whereas very low determinable posterior probability is obtained for higher values of α. the performance of both MASK and C&P degrade drastically. 29). it provides a substantial increase in the privacy—its worstcase (determinable) privacy breach is only 33% as compared to 50% with DETGD. MASK is not able to ﬁnd any itemsets of length above 4 for the CENSUS dataset.
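To make the role of condition numbers concrete, the sketch below builds a small gamma-diagonal matrix (diagonal entries γx, off-diagonal entries x, with γx + (n − 1)x = 1, as in the Appendix) and checks the two eigenvalues derived there, 1 and (γ − 1)/(γ + n − 1). It is an illustrative check under stated assumptions, not the authors' implementation. The helper names and the choice n = 5 are hypothetical, and the condition number is taken as λmax/λmin, which holds for symmetric positive-definite matrices.

```python
def gamma_diagonal(n: int, gamma: float):
    """n x n matrix with gamma*x on the diagonal and x elsewhere, rows summing to 1."""
    x = 1.0 / (gamma + n - 1)
    return [[gamma * x if i == j else x for j in range(n)] for i in range(n)]


def matvec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def gamma_diagonal_condition_number(n: int, gamma: float) -> float:
    """lambda_max / lambda_min = 1 / ((gamma - 1) / (gamma + n - 1))."""
    if gamma <= 1:
        raise ValueError("gamma must exceed 1 for a positive-definite matrix")
    return (gamma + n - 1) / (gamma - 1)


n, gamma = 5, 19.0  # n = 5 is an arbitrary illustrative domain size
A = gamma_diagonal(n, gamma)

# Eigenvector of all ones has eigenvalue 1.
ones = [1.0] * n
print(matvec(A, ones))

# Any vector summing to zero, e.g. e_0 - e_1, has eigenvalue (gamma - 1) / (gamma + n - 1).
v = [1.0, -1.0] + [0.0] * (n - 2)
lam2 = (gamma - 1) / (gamma + n - 1)
print(matvec(A, v), lam2)

print(gamma_diagonal_condition_number(n, gamma))
```

The closed form 1 + n/(γ − 1) grows only linearly in the domain size n, in contrast to the exponential growth with itemset length reported for MASK and C&P.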
Fig. 7 Perturbation matrix condition numbers (γ = 19). a CENSUS, b HEALTH (condition number on a log-scale versus frequent itemset length, for DET-GD, RAN-GD, MASK and C&P)

9.2 Classification rule mining

We now turn our attention to assessing the performance of FRAPP in the context of classification rule mining.

9.2.1 Experimental setup

The US Census dataset mentioned earlier was used in our experiments, out of which about 75% of the records were used for training and the remaining as test data. The attributes used in the experiment are given in Table 4, among which salary was chosen as the Class Label attribute. The classifier used is the highly popular public-domain C4.5 decision tree classifier (Quinlan 1993), specifically the one available at http://www.cs.waikato.ac.nz/ml/weka.

Table 4 US CENSUS dataset for classification

Attribute             Categories
Native-country        United-States, Other
Salary                Less or equal to $50,000, Greater than $50,000
Age                   [15−35), [35−55), [55−75), ≥ 75
Type-of-employment    Private, Self Employment not Inc, Self Employment Inc, Federal Government, Local Government, State Government, Without pay, Never worked
Hours-per-week        [0−20), [20−40), [40−60), [60−80), ≥ 80
Table 5 Classification accuracy

Mining technique    Correct labeling (%)    Incorrect labeling (%)
FRAPP               72.88                   27.12
DIRECT              75.34                   24.66
BOTH                71.34                   23.12

9.2.2 Experimental results

We choose a privacy level of γ = 19, corresponding to a maximum privacy breach of 50%. With this privacy setting, the accuracy results for FRAPP-based privacy-preserving classification are shown in Table 5, which also provides the corresponding accuracies for direct classification on the original database, representing, in a sense, the "best case". We see here that the FRAPP accuracies are quite comparable to DIRECT, indicating that there is little cost associated with supporting the privacy functionality. Finally, the last line (BOTH) in Table 5 shows the proportion of cases where FRAPP and DIRECT concurred in their labeling, i.e. either both got it correct or both got it wrong, and as can be seen, the overlap between the two classifiers is very high, close to 95% (71.34% + 23.12% = 94.46%).

10 Conclusions and future work

In this paper, we developed FRAPP, a generalized model for random-perturbation-based methods operating on categorical data under strict privacy constraints. The framework provides us with the ability to first make careful choices of the model parameters and then build perturbation methods for these choices. This results in order-of-magnitude improvements in model accuracy as compared to the conventional approach of deciding on a perturbation method upfront, which implicitly freezes the associated model parameters.

Using the framework, a "gamma-diagonal" perturbation matrix was identified as the best conditioned among the class of symmetric positive-definite matrices, and therefore expected to deliver the highest accuracy within this class. We also presented an implementation method for gamma-diagonal-based perturbation whose complexity is proportional to the sum of the domain cardinalities of the attributes in the database. Empirical evaluation of our approach on the CENSUS and HEALTH datasets demonstrated significant reductions in mining errors for association rule mining relative to prior privacy-preserving techniques, and comparable accuracy to direct mining for classification models.

The relationship between data size and model accuracy was also evaluated and it was shown that it is often possible to construct a sufficiently large dataset to achieve the desired accuracy by the simple expedient of generating multiple distorted versions of each customer's true data record, without materially compromising the data privacy.

Finally, we investigated the novel strategy of having the perturbation matrix composed not of fixed values but of random variables. Our analysis of this approach indicated that at a marginal cost in accuracy, significant improvements in privacy levels could be achieved.
In our future work, we plan to investigate whether it is possible, as discussed in Sect. 2, to design distortion matrices such that the mining can be carried out directly on the distorted database without any explicit reconstruction, that is, to develop an "invariant FRAPP matrix".

Appendix

Condition number of gamma-diagonal matrix

We provide here the formula for computing the condition number of the gamma-diagonal distortion matrix. Specifically, consider the n × n matrix A of form

$$A_{ij} = \begin{cases} \xi x & \text{if } i = j \\ x & \text{otherwise} \end{cases} \qquad \text{where } \xi x + (n-1)x = 1 \tag{30}$$

Since matrix A is symmetric, we can use the following well-known result (Strang 1988):

Theorem 6 A symmetric matrix has real eigenvalues.

Let X be an eigenvector of the matrix A corresponding to eigenvalue λ. Then, it must satisfy:

$$AX = \lambda X$$

Using the structure of matrix A from Eq. 30, for any i = 1, ..., n,

$$\xi x X_i + \sum_{j \ne i} x X_j = \lambda X_i \;\Leftrightarrow\; (\xi x - x) X_i + x \sum_{j} X_j = \lambda X_i$$

$$\Rightarrow \text{either}\quad X_i = \frac{x \sum_j X_j}{\lambda + x - \xi x} \tag{31}$$

$$\text{or}\quad \lambda + x - \xi x = 0 \tag{32}$$

Eq. 31 implies that all $X_i$ are equal. Let this common value be g. Then $\sum_j X_j = ng$, leading to

$$g(\lambda + x - \xi x) = ngx \;\Rightarrow\; \lambda = \xi x + nx - x = 1$$

From Eq. 32, using $x = 1/(\xi + n - 1)$ from Eq. 30,

$$\lambda = (\xi - 1)x = \frac{\xi - 1}{\xi + n - 1} < 1 \tag{33}$$

Thus, only two distinct values are taken by the eigenvalues of matrix A: $\lambda_1 = 1$ and $\lambda_2 = \lambda_3 = \cdots = \lambda_n = \frac{\xi - 1}{\xi + n - 1}$. For ξ > 1 (equivalently γ > 1 in the two cases below), the eigenvalues of the matrix A are positive, hence A is a positive-definite matrix, and its condition number is:

$$\mathrm{cond}(A) = \frac{\lambda_{\max}}{\lambda_{\min}} = \frac{\xi + n - 1}{\xi - 1}$$

– For the matrix A given by Eq. 12, ξ = γ and n = |S_U|, so

$$\mathrm{cond}(A) = \frac{\gamma + |S_U| - 1}{\gamma - 1} = 1 + \frac{|S_U|}{\gamma - 1}$$

– For matrix A for mining itemsets over a subset of attributes $C_s$, given by Eq. 28, $n = |I_{C_s}|$ and

$$\xi = \frac{\gamma + \frac{|I_C|}{|I_{C_s}|} - 1}{\frac{|I_C|}{|I_{C_s}|}}$$

Hence,

$$\xi + n - 1 = \frac{\gamma + |I_C| - 1}{\frac{|I_C|}{|I_{C_s}|}}, \qquad \xi - 1 = \frac{\gamma - 1}{\frac{|I_C|}{|I_{C_s}|}}$$

$$\mathrm{cond}(A) = \frac{\xi + n - 1}{\xi - 1} = \frac{\gamma + |I_C| - 1}{\gamma - 1} = 1 + \frac{|S_U|}{\gamma - 1}$$

where the last step takes $|I_C| = |S_U|$ for the full attribute set.

References

Adam N, Wortman J (1989) Security control methods for statistical databases. ACM Comput Surv 21(4):515–556
Aggarwal C, Yu P (2004, March) A condensation approach to privacy preserving data mining. In: Proceedings of the 9th international conference on extending database technology (EDBT), Heraklion, Crete, Greece
Agrawal D, Aggarwal C (2001, May) On the design and quantification of privacy preserving data mining algorithms. In: Proceedings of the ACM symposium on principles of database systems (PODS), Santa Barbara, California, USA
Agrawal R, Bayardo R, Faloutsos C, Kiernan J, Rantzau R, Srikant R (2004, August) Auditing compliance with a hippocratic database. In: Proceedings of the 30th international conference on very large data bases (VLDB), Toronto, Canada
Agrawal R, Kiernan J, Srikant R, Xu Y (2002, August) Hippocratic databases. In: Proceedings of the 28th international conference on very large data bases (VLDB), Hong Kong, China
Agrawal R, Kini A, LeFevre K, Wang A, Xu Y, Zhou D (2004, June) Managing healthcare data hippocratically. In: Proceedings of the ACM SIGMOD international conference on management of data, Paris, France
Agrawal R, Srikant R (1994, September) Fast algorithms for mining association rules. In: Proceedings of the 20th international conference on very large data bases (VLDB), Santiago de Chile, Chile
Agrawal R, Srikant R (2000, May) Privacy-preserving data mining. In: Proceedings of the ACM SIGMOD international conference on management of data, Dallas, Texas, USA
Agrawal R, Srikant R, Thomas D (2005, June) Privacy-preserving OLAP. In: Proceedings of the ACM SIGMOD international conference on management of data, Baltimore, Maryland, USA
Agrawal S, Krishnan V, Haritsa J (2004, March) On addressing efficiency concerns in privacy-preserving mining. In: Proceedings of the 9th international conference on database systems for advanced applications (DASFAA), Jeju Island, Korea
Atallah M, Bertino E, Elmagarmid A, Ibrahim M, Verykios V (1999, November) Disclosure limitation of sensitive rules. In: Proceedings of the IEEE knowledge and data engineering exchange workshop (KDEX), Chicago, Illinois, USA
Cranor L, Reagle J, Ackerman M (1999, April) Beyond concern: understanding net users' attitudes about online privacy. AT&T labs research technical report TR 99.4.3
Dasseni E, Verykios V, Elmagarmid A, Bertino E (2001, April) Hiding association rules by using confidence and support. In: Proceedings of the 4th international information hiding workshop (IHW), Pittsburgh, Pennsylvania, USA
de Wolf P, Gouweleeuw J, Kooiman P, Willenborg L (1998, March) Reflections on PRAM. In: Proceedings of the statistical data protection conference, Lisbon, Portugal
Denning D (1982) Cryptography and data security. Addison-Wesley
Duncan G, Pearson R (1991) Enhancing access to microdata while protecting confidentiality: prospects for the future. Stat Sci 6(3):219–232
Evfimievski A, Gehrke J, Srikant R (2003, June) Limiting privacy breaches in privacy preserving data mining. In: Proceedings of the ACM symposium on principles of database systems (PODS), San Diego, California, USA
Evfimievski A, Srikant R, Agrawal R, Gehrke J (2002, July) Privacy preserving mining of association rules. In: Proceedings of the 8th ACM SIGKDD international conference on knowledge discovery and data mining (KDD), Edmonton, Alberta, Canada
Feller W (1988) An introduction to probability theory and its applications, vol I. Wiley
Gouweleeuw J, Kooiman P, Willenborg L, de Wolf P (1998) Post randomisation for statistical disclosure control: theory and implementation. J Off Stat 14(4):485–502
Kantarcioglu M, Clifton C (2002, June) Privacy-preserving distributed mining of association rules on horizontally partitioned data. In: Proceedings of the ACM SIGMOD workshop on research issues in data mining and knowledge discovery (DMKD), Madison, Wisconsin, USA
Kargupta H, Datta S, Wang Q, Sivakumar K (2003, December) On the privacy preserving properties of random data perturbation techniques. In: Proceedings of the 3rd IEEE international conference on data mining (ICDM), Melbourne, Florida, USA
LeFevre K, Agrawal R, Ercegovac V, Ramakrishnan R, Xu Y, DeWitt D (2004, August) Limiting disclosure in hippocratic databases. In: Proceedings of the 30th international conference on very large data bases (VLDB), Toronto, Canada
Mishra N, Sandler M (2006, June) Privacy via pseudorandom sketches. In: Proceedings of the ACM symposium on principles of database systems (PODS), Chicago, Illinois, USA
Mitchell T (1997) Machine learning. McGraw Hill
Motwani R, Raghavan P (1995) Randomized algorithms. Cambridge University Press
Pudi V, Haritsa J (2000) Quantifying the utility of the past in mining large databases. Inf Sys 25(5):323–344
Quinlan JR (1993) C4.5: programs for machine learning. Morgan Kaufmann
Rastogi V, Suciu D, Hong S (2007, September) The boundary between privacy and utility in data publishing. In: Proceedings of the 33rd international conference on very large data bases (VLDB), Vienna, Austria
Rizvi S, Haritsa J (2002, August) Maintaining data privacy in association rule mining. In: Proceedings of the 28th international conference on very large databases (VLDB), Hong Kong, China
Samarati P, Sweeney L (1998, June) Generalizing data to provide anonymity when disclosing information. In: Proceedings of the ACM symposium on principles of database systems (PODS), Seattle, Washington, USA
Saygin Y, Verykios V, Clifton C (2001) Using unknowns to prevent discovery of association rules. ACM SIGMOD Rec 30(4):45–54
Saygin Y, Verykios V, Elmagarmid A (2002, February) Privacy preserving association rule mining. In: Proceedings of the 12th international workshop on research issues in data engineering (RIDE), San Jose, California, USA
Shoshani A (1982, September) Statistical databases: characteristics, problems and some solutions. In: Proceedings of the 8th international conference on very large databases (VLDB), Mexico City, Mexico
Strang G (1988) Linear algebra and its applications. Thomson Learning Inc
Vaidya J, Clifton C (2002, July) Privacy preserving association rule mining in vertically partitioned data. In: Proceedings of the 8th ACM SIGKDD international conference on knowledge discovery and data mining (KDD), Edmonton, Alberta, Canada
Vaidya J, Clifton C (2003, August) Privacy-preserving k-means clustering over vertically partitioned data. In: Proceedings of the 9th ACM SIGKDD international conference on knowledge discovery and data mining (KDD), Washington, DC, USA
Vaidya J, Clifton C (2004, April) Privacy preserving naive bayes classifier for vertically partitioned data. In: Proceedings of the SIAM international conference on data mining (SDM), Toronto, Canada
Wang Y (1993) On the number of successes in independent trials. Statistica Sinica 3
Warner S (1965) Randomized response: a survey technique for eliminating evasive answer bias. J Am Stat Assoc 60:63–69
Westin A (1999, July) Freebies and privacy: what net users think. Technical report, Opinion Research Corporation
Zhang N, Wang S, Zhao W (2004, September) A new scheme on privacy-preserving association rule mining. In: Proceedings of the 8th European conference on principles and practice of knowledge discovery in databases (PKDD), Pisa, Italy