
Delegated Secure Sum Service for
Distributed Data Mining in Multi-Cloud Settings
Dinh Tien Tuan Anh, Quach Vinh Thanh, Anwitaman Datta
Abstract—An increasing number of businesses are migrating
their IT operations to the cloud. Likewise, there is an increased
emphasis on data analytics based on multiple datasets and sources
to derive information not derivable when a dataset is mined
in isolation. While ensuring the security of data and computation
outsourced to a third-party cloud service provider is in itself
challenging, supporting mash-ups and analytics of data from
different parties hosted across different services is even more so.
In this paper we propose a cloud-based service allowing multiple
parties to perform secure multi-party sum computation
using their clouds as delegates. Our scheme provides data privacy
both from the delegates as well as from the other data owners
under a lazy-and-curious (semi-honest) adversary model. We
then describe how such a secure sum primitive may be used
in various collaborative, cloud-based distributed data mining
tasks (classification, association rule mining and clustering). We
implement a prototype and benchmark the service, both as a
stand-alone secure sum service and as a building block for more
complex analytics. The results suggest reasonable overhead and
demonstrate the practicality of carrying out privacy-preserved
distributed analytics despite migrating (encrypted) data to possibly
different and untrusted (semi-honest) cloud services.
I. INTRODUCTION
A. Multi-Party Computing Service.
An enormous amount of data is being generated every day
by a plethora of human activities and computing devices.
Traditionally, data is stored in a data owner's in-house infrastructure,
and access to outsiders is typically provided through
web services. Example services such as MedlinePlus [1],
Xignite [2], NOAA [3], ResMap [4], Yahoo! Traffic [5], etc.
offer a wide range of data: medical, financial market, meteorological,
satellite images, traffic, etc. Data from multiple
sources can be mashed up or jointly analyzed to create new
services and infer information that cannot be realized from
a stand-alone dataset. For instance, satellite images and data
from weather sensors are used together to improve forecasts,
financial data from multiple institutions to make better market
predictions [6], traffic and human mobility data to aid urban
planning [7], and medical and mobility data to yield more insights
on the spread of diseases [8].
However, it is not always desirable or feasible to expose the
data itself, and yet being able to carry out computations or
analytics over it may provide benefits without violating
privacy concerns. Multi-party computation [9], [10] is one
way to enable this, and it has recently been successfully
deployed in applications such as blind auctions, which require
bidding information from parties to determine prices [11].
Even as multi-party computation protocols have matured
to the point of being applied in practice, for them to be widely
adoptable it is desirable to provide them as a basic service. Our work
[Fig. 1: Multi-party computation (MPC): in traditional vs
multi-cloud settings. (a) Traditional MPC: each party keeps its
database and logic components in a local infrastructure. (b) Cloud
service delegated MPC: each data owner/party outsources its
database to a cloud/delegate, which runs the multi-party
computation service.]
is motivated by this observation, as well as by another
common recent trend, namely the move by many organizations
to cloud-based services in order to eliminate or downsize their
in-house IT infrastructure.
B. Our Work.
Recent developments in cloud computing have materialized
a concrete platform for rapid realization of the service-oriented
computing paradigm [12]. Cloud providers offer computing
as a service, from which software services can be built,
sold and integrated into complex applications. Companies
are leveraging the cloud for its cheap, elastic and scalable
resources. Migration of IT infrastructure is taking place, in
which most data, application logic and front-end services
are being moved to the cloud [13]. Many existing works
focus on what to migrate [13], [14], [15], assuming that the
cloud is trusted. Others investigate mechanisms for protecting
data privacy and for verifying computation correctness [16],
[17], [18], [19], [20]. The latter works consider single-party
(arXiv:1206.2038v1 [cs.CR] 10 Jun 2012)
settings, i.e., how one data owner can outsource its data in a
privacy-preserving manner and still carry out analytics on it.
Furthermore, these works mainly explore the theoretical
aspects of the problem.
Our work concerns the design space of multi-party computation
outsourcing, which to the best of our knowledge has so far
not been explored. In particular, we design a multi-party
computation service which is invokable by authorized
users when taking part in a multi-party protocol. The protocol
can be run over untrusted (curious and lazy) clouds that act
as delegates; it guarantees individual data owners' data
privacy (from the delegated clouds, as well as from other
parties) and lets users verify whether the computation has been
carried out correctly. Specifically, we consider secure multi-party
sum computation.
Figure 1 illustrates the setting of our work, contrasting it
with the traditional multi-party computation setting (Figure
1a). Traditionally, each party houses its databases and computing
components for handling business logic internally. To take
part in a multi-party computation, the parties need to initiate network
connections with one another. Not only is negotiating access
to internal networks a troublesome endeavor [13], but the
potential heterogeneity in network and computational
resources may also hinder the overall performance [21].
Our work focuses on scenarios in Figure 1b, in which parties
move their data to clouds (which is in any case happening for
many other reasons [12], [13]) and rely on the cloud-provided
service for the multi-party computation.
Specifically, we propose a protocol allowing the cloud
delegates to evaluate the sum of the parties' private inputs without
learning either the individual values or the resulting sum. The
parties are also able to check whether the sum is correct.
Not only does a cloud-based deployment provide the basic
computation as a service; more sophisticated analytics services
can be realized using it as a building block. We demonstrate this
by designing some representative data mining tasks such as
Naive Bayes (classification), Apriori (association rule mining)
and K-Means (clustering). The resulting protocols support
encrypted databases, hence the cloud delegates cannot learn
the private data. Naturally, there are overheads in using the
service, as a trade-off for the security guarantees. However, they
are amortized when the service is used within complex applications.
Contributions. Our contributions are as follows:
1) We present a multi-party computation (secure sum)
service which can be run on curious-and-lazy cloud
delegates while maintaining data privacy and correctness.
The service can work with encrypted databases,
and it ensures computation correctness with minimal
coordination among the parties outsourcing the task.
2) We demonstrate how this service can be used in complex
cloud-based data mining jobs.
3) We benchmark the secure sum service as a stand-alone
application in a multi-cloud-like environment. We also
experiment with various data mining applications using
our service. Compared to traditional multi-party
computation implementations without delegates (we refer
to them as the non-delegated versions), there are naturally
overheads due to the added security mechanisms.
However, the experiments show that such overheads
are within a reasonable range for applying our approach
in practice, and the cost amortization becomes more
prominent with increased analytics workloads.
Organization. The rest of this paper is structured as follows:
Section II details the delegated secure sum protocol
in an abstract manner. Section III shows how complex data
mining tasks can be built with this service; we consider three
classic data mining algorithms: classification, association rule
mining and clustering. Section IV follows with experimental
benchmarks of the protocol and the data mining tasks.
Related work is discussed in Section V, before we draw
our conclusions and outline future directions for this work
in Section VI.
II. DELEGATED SECURE SUM
We briefly explain two traditional approaches for multi-party
secure sum computation, one of which our service
builds upon. We suppose a number of parties P_0, P_1, ..
with private inputs x_0, x_1, .. wish to compute s = Σ_i x_i in
a privacy-preserving manner, so that P_i will not learn x_j for
j ≠ i.
In the ring-based approach [22], the parties form a ring and
messages are forwarded in a pre-defined direction. A master
party P_m is elected and starts the computation by sending
v = x_m + r (for a random r) to its immediate clockwise (or
counter-clockwise) neighbor, which adds its own input to v
and forwards the result along. Once it arrives back at the master
party, the final sum is obtained as s = v − r and broadcast to
the other parties.
Another approach is based on broadcast communication
[23], [24]. At the beginning, each party generates r_i such
that Σ_i r_i = 0. Next, P_i sends v_i = x_i + r_i to the other parties,
and independently computes the sum s = Σ_i v_i. Our work
is an extension of this broadcast-based protocol. We discuss
the trade-offs between broadcast and ring-based protocols in
Section II-D, which prompted us to make this specific choice.
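The broadcast-based masking idea can be illustrated with a small sketch (no encryption; the values, modulus and the direct construction of the zero-sum masks are illustrative simplifications, not the paper's full protocol):

```python
import random

# Toy illustration of the broadcast-based secure sum: each party
# masks its input with r_i, where the masks sum to zero mod M.
M = 2 ** 32                      # arithmetic is done modulo M
inputs = [13, 42, 7]             # private inputs x_i of three parties

# generate masks r_i with sum(r_i) == 0 (mod M); how such masks are
# agreed upon securely is covered later in the protocol
masks = [random.randrange(M) for _ in inputs[:-1]]
masks.append(-sum(masks) % M)

# each party broadcasts only its masked value v_i = x_i + r_i
broadcast = [(x + r) % M for x, r in zip(inputs, masks)]

# any party (or delegate) can now sum the masked values
s = sum(broadcast) % M
assert s == sum(inputs) % M      # masks cancel, revealing only the sum
```

No individual masked value v_i reveals x_i, yet their sum equals the true total.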
A. Model
Our model consists of a number of parties and delegates.
The party P_i sends its transformed input (x_i) to its delegate
D_i, and receives s at the end. The delegates D_0, .., D_{n−1}
are connected to each other. They receive inputs from the
parties, carry out a multi-party computation and finally send back
the results.
1) Adversary model: We assume that delegates are curious
and lazy. They will try to learn the parties' inputs
and/or the resulting sum. To that end, they may listen to the
communication channels, but they will not attempt more active attacks
that deliberately subvert the computation. For example,
we do not consider attacks in which an adversarial delegate
sends different values in a computation round in order to
partition the result. However, we assume that delegates may
have incentives to be lazy, i.e., to do as little as possible (and
still charge the parties). Specifically, they may skip some (or all)
of the computations, replay results from previous rounds, or
even replace party inputs with other values to save computation
time, which could consequently compromise the integrity of
the sum computation.^1
[Fig. 2: Packed Paillier scheme. c plaintext values are packed
into a single plaintext of B bits.]
We assume that every party/data-owner is honest-but-curious,
i.e., a data owner will not deviate from the protocol
but may try to learn other parties' inputs. This is a standard
assumption often used in designing multi-party computation
protocols, where the resulting approaches are simple yet
secure, even though they do not deal with attack scenarios in
which adversaries form a majority with respect to the honest
parties.
Dishonest delegates may collude with each other, and also
with the parties. However, the party-delegate collusion is weak,
in the sense that the party can ask for messages seen by the
delegate, but the party will not reveal its secret keys. Since
all the parties are legitimate recipients of the final answer,
if a party colludes to the extent of providing a delegate with the
decryption key or the final outcome, then naturally this (or
any other) protocol cannot safeguard against such a situation.
2) Crypto model: Each party has a public key pair
(PKp_i, PKs_i), and each delegate has a pair (DKp_i, DKs_i). We
assume that parties run a common, well-known random
number generator (RNG). Our protocols use an additively
homomorphic encryption scheme, whose encryption
Enc(eK, m) and decryption Dec(dK, c) operations satisfy:
Enc(eK, m_1) ⊕ Enc(eK, m_2) = Enc(eK, m_1 + m_2)
for an operation ⊕. ECC-ElGamal [25] and Paillier [26] are
two such schemes, both providing randomized encryption.
ECC-ElGamal uses the additive group of an elliptic curve,
whereas Paillier relies on composite residuosity classes over a
group. The former is more efficient, but requires homomorphic
transformations of plaintexts to and from an elliptic curve,
which is expensive. Our work uses the latter. It requires a larger
bit-length, but we can employ an optimization that allows us
to perform multiple encryptions at the same time. Specifically,
we pack multiple values in one plaintext so that they can be
encrypted at the same time [16]. Suppose plaintext values are
at most b bits and Paillier's plaintexts are B bits. Suppose
further that any sum value is smaller than 2^{t+b}; then we
can pack c values into a single plaintext, as demonstrated in
Figure 2, where c = ⌊B / (b+t)⌋. Denote
pm(v) = 0||x_1 || 0||x_2 || .. || 0||x_c
as the packed Paillier message using values v =
x_1, x_2, .., x_c. We can extract x_i (1 ≤ i ≤ c) as: x_i =
pm(v)[i] = [ pm(v) >> ((c − i)·(t + b)) ] & (2^{t+b} − 1). The
^1 This simplified adversary model is justified by legal and economic realities,
where a commercial cloud service provider may try to over-charge customers
or try to (passively) sniff information readily exposed without the liability
of committing a criminal offense, as opposed to launching any deliberate
(proactive) attack to subvert confidentiality.
homomorphic property is maintained, i.e.:
cipher = Enc(eK, pm(v)) · Enc(eK, pm(v′)) = Enc(eK, pm(v + v′))
B. Delegated Secure Sum Protocol
The protocol comprises four phases: SETUP, ENCRYPT,
COMPUTE and VERIFY. The SETUP phase is run once to
initialize the system and security parameters; the rest are
needed for each computation round.
SETUP. The goal of this phase is two-fold. First, the parties
agree on a secret Paillier key pair (eK, dK) and an initial
value roundId. Although Paillier is a public-key scheme,
we treat it as a symmetric scheme, where both public and
private keys are kept secret. Second, each party generates a
random value r_i (later used for perturbing its inputs) such
that Σ_i r_i = 0 (explained shortly).
To generate (eK, dK) and roundId, the parties first agree
on a secret X, then use it as the seed for the RNG. Having the
same source of randomness, they run the same algorithms for
creating (eK, dK) and roundId. One method for establishing
X could be to appoint a master party which generates X
and distributes it to the rest. This way, the master produces
and sends O(N) different ciphertexts to the other parties. We
instead adopt a cryptographic approach first proposed for
secret key exchange [27], in which every party contributes
its own randomness to the final value. In our context, each
party sends only 2 messages, while most of the computations
can be outsourced to the delegates. Computations are done in
a globally known group G of prime order p with generator g:
1) P_i generates a random value x_i and sends z_i = g^{x_i} to D_i.
2) D_i broadcasts z_i, then forwards a = z_{i+1}/z_{i−1} once
it has received them from the other delegates.
3) P_i computes X_i = a^{x_i} and sends it to D_i.
4) D_i broadcasts X_i, then forwards b = z_{i−1}^n and c =
X_i^{n−1} · X_{i+1}^{n−2} ⋯ X_{i−2} to P_i.
5) P_i computes X = b^{x_i} · c = g^{x_1·x_2 + .. + x_n·x_1}, which is the
shared secret seed between all parties.
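Steps 1-5 follow the Burmester-Desmedt pattern and can be exercised end-to-end with toy parameters (the prime modulus, generator and party count below are insecure demo values chosen only for illustration):

```python
import random

# Toy run of the key agreement sketched in steps 1-5.
p = 2 ** 61 - 1          # prime modulus (demo only, not secure)
g = 3
n = 4                    # number of parties

x = [random.randrange(2, p - 1) for _ in range(n)]   # step 1: secrets
z = [pow(g, xi, p) for xi in x]                      # z_i = g^{x_i}

# steps 2-3: a_i = z_{i+1} / z_{i-1}, then X_i = a_i^{x_i}
inv = lambda v: pow(v, p - 2, p)                     # modular inverse
X = [pow(z[(i + 1) % n] * inv(z[(i - 1) % n]) % p, x[i], p)
     for i in range(n)]

# steps 4-5: each party combines b = z_{i-1}^n with the chained X terms
keys = []
for i in range(n):
    key = pow(z[(i - 1) % n], n * x[i], p)
    for j in range(n - 1):
        key = key * pow(X[(i + j) % n], n - 1 - j, p) % p
    keys.append(key)

assert len(set(keys)) == 1   # every party derives the same seed X
```

The telescoping exponents make each key equal g^{x_1·x_2 + x_2·x_3 + .. + x_n·x_1}, so all parties arrive at the same seed without any single party choosing it.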
Having generated (eK, dK), P_i constructs Enc(eK, i||0) and
sends it to D_i, which broadcasts it to the other D_j and consequently
to the parties P_j. On receipt of m_j for all j ≠ i, P_i decrypts it
and checks whether the message was properly formed. If successful,
it means that X and (eK, dK) have been agreed upon by all
parties. Each party then assigns the next random number from the RNG
as roundId.
A common RNG cannot be utilized to create r_i satisfying
Σ_i r_i = 0. Instead, we adopt an approach proposed in [24].
Specifically:
1) P_i generates random values r_ij for j ≠ i.
2) P_i sends m_ij = Enc(PKp_j, r_ij), sign_i(m_ij) for all
j ≠ i to D_i, which then distributes them to D_j and
subsequently to P_j.
3) P_i decrypts and verifies the signatures of m_ji (j ≠ i). Then
it computes:
r_i = Σ_{j≠i} (r_ij − r_ji)
It can be seen that Σ_i r_i = Σ_i Σ_{j≠i} (r_ij − r_ji) = 0.
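The pairwise construction can be checked directly in a few lines (encryption and signatures are omitted; the party count and modulus are illustrative):

```python
import random

# Sketch of the pairwise mask construction above: party i draws r_ij
# for every j != i and sets r_i = sum_j (r_ij - r_ji), so the masks
# sum to zero by construction.
n = 4
M = 2 ** 32   # masks are taken modulo M (illustrative choice)

# r[i][j] is the random value party i generates for party j
r = [[random.randrange(M) if j != i else 0 for j in range(n)]
     for i in range(n)]

masks = [sum(r[i][j] - r[j][i] for j in range(n) if j != i) % M
         for i in range(n)]

assert sum(masks) % M == 0   # each r_ij appears once with each sign
```

Every r_ij appears exactly once positively (in r_i) and once negatively (in r_j), so the total cancels regardless of the random draws.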
ENCRYPT. In this phase, each party encrypts its private
input x_i and sends it to the delegate. More specifically, P_i packs
roundId and x_i + r_i into a single plaintext and encrypts it
with Paillier; the result is m_i = Enc(eK, roundId || (x_i + r_i)).
When there are multiple inputs x_{i0}, x_{i1}, .., x_{ik}, P_i
packs them into a smaller number of plaintexts m_{i0}, m_{i1}, .. with
increasing values of roundId.
COMPUTE. On receipt of m_i from party P_i, the delegate
D_i broadcasts this value to the other delegates. Once it has all the
messages from the other delegates, it computes
c = Π_{0≤i<n} m_i = Π_{0≤i<n} Enc(eK, roundId || (x_i + r_i)) = Enc(eK, n·roundId || Σ_i x_i)
and sends it back to the party. Each delegate also signs the
result message: σ_i = sign_i(i || c).
VERIFY. Once it receives c, P_i asks D_i for its signature
as well as signatures from t other delegates. P_i can decide at
random the value of t and the identities of the verifying delegates.
For instance, it can set t = 2 and ask its delegate for the signatures
from D_i, D_{i−1}, D_{i+1}: σ_i, σ_{i−1}, σ_{i+1}. Once it has successfully
verified that the signatures are correct, P_i decrypts c to get:
s = Dec(dK, c) = n·roundId || Σ_i x_i
Finally, it extracts s[0], s[1] from s, and checks that s[0] =
n·roundId before assigning s[1] as the final sum value.
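One ENCRYPT/COMPUTE/VERIFY round can be traced with a toy Paillier implementation (the primes, slot width, roundId, inputs and hand-picked zero-sum masks below are all insecure demo values, not the paper's parameters):

```python
import random

# Toy Paillier with g = N + 1; demo primes only, far too small for use.
P, Q = 104729, 104723
N = P * Q
N2 = N * N
PHI = (P - 1) * (Q - 1)          # decryption exponent

def enc(m):
    r = random.randrange(1, N)
    return pow(N + 1, m, N2) * pow(r, N, N2) % N2

def dec(c):
    return (pow(c, PHI, N2) - 1) // N * pow(PHI, -1, N) % N

SLOT = 16                        # low slot holds the masked input
round_id = 7
xs = [10, 20, 30]                # private inputs
masks = [5, -2, -3]              # zero-sum masks (sum == 0)

# ENCRYPT: m_i = Enc(roundId || (x_i + r_i))
cts = [enc((round_id << SLOT) + x + r) for x, r in zip(xs, masks)]

# COMPUTE: the delegate multiplies ciphertexts (homomorphic addition)
c = 1
for ct in cts:
    c = c * ct % N2

# VERIFY: s[0] must equal n * roundId; s[1] is then the sum
s = dec(c)
assert s >> SLOT == len(xs) * round_id    # roundId slot checks out
assert s & ((1 << SLOT) - 1) == sum(xs)   # masks cancel in the sum slot
```

If a lazy delegate skipped a ciphertext, the roundId slot would decrypt to a multiple of roundId smaller than n·roundId, and the check would fail.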
C. Security Analysis
Our protocol has the following security properties. First,
curious delegates and parties cannot see the private inputs of the
honest parties, even if they collude with each other. This
is because each input x_i is randomized with r_i. Since the
generation of r_i is done in a secure manner, only P_i is
able to recover x_i from (x_i + r_i).
Second, delegates cannot see the sum value Σ_i x_i. A delegate
has the encrypted value c = Enc(eK, n·roundId || Σ_i x_i), but
cannot extract the sum since it has no knowledge of the
decryption key. Recall that our model does not consider active
collusion between dishonest parties and delegates in which
secret keys are revealed to the delegates.
Third, the protocol computes the correct sum from the
party inputs, provided that at least one of the t verifying
delegates is honest. This means dishonest delegates cannot
skip computations or replay old values. Nor can they
replace party inputs with other values without getting caught.
The sketch of the proof is as follows. Since the delegate does
not know the encryption key, should it use an input different
from what is given by the party, the VERIFY step will fail
because s[0] ≠ n·roundId. It cannot replay old values either,
as each round is tagged with a unique value of roundId.
The delegate can skip the computation in two ways. First,
it may not use its party's input m_i during the COMPUTE
step. However, this causes the VERIFY step to fail since
s[0] ≠ n·roundId. Second, it may ignore m_j from the other
delegates. In this case, to ensure s[0] = n·roundId, the
delegate must construct c = (m_i)^n (raising m_i to the power of n
is cheaper than multiplying n different values). However, the
VERIFY step also checks the results from the other delegates,
of which one is honest, and whose result must therefore be different.
Hence, the verification will again detect D_i's laziness.
These security guarantees are achieved under the assumption
that the underlying cryptographic schemes are secure. Attacks on these
primitives will inevitably affect our protocols, but they are
outside the scope of this paper. The VERIFY step is done at the
end of every round, which can be expensive over many rounds.
We extend the protocol to support probabilistic verification, in
which verification is carried out with a probability p at any
given round. The probability of detecting misbehavior can be
made arbitrarily high after a number of verifications. Specifically,
let ck be the number of random checks and P(p, n, ck) be
the probability that misbehavior is detected after ck checks
(assuming that delegates misbehave consistently^2). Then:
P(p, n, ck) = 1 − (1 − p)^{n·ck}
D. Broadcast vs Ring-based delegated secure sum
The traditional ring-based protocol [22] can be extended to
support delegates as follows. A party P_i still encrypts (x_i + r_i)
in the same way using the shared Paillier key, and sends
it to D_i. The master delegate D_0 starts the computation by
forwarding m_0 along the pre-defined direction, whereupon
each delegate on the path multiplies in its party's input and sends
the result along the same direction. Once a message c arrives
at D_0, it is broadcast to the other delegates and subsequently
to their parties. Verification at the parties involves performing the
same checks as described above. This simple protocol ensures
that delegates cannot learn the private inputs and the sum output;
however, it is possible for lazy delegates to skip computations
and render the sum value incorrect. For example, suppose
D_0 sends m_0 to D_1, which then computes c_1 = m_0 · m_1 and
forwards it to D_2. Suppose D_2 and D_3 are both dishonest; they
can bypass the computations c_2 = c_1 · m_2 and c_3 = c_2 · m_3 by
sending c_3 = c_1 · c_1 to D_4. The final verification checks out,
but the sum is not correct.
To achieve correctness, several enhancements are needed.
The master party first splits the value v_0 = x_0 + r_0 into two
parts v^l_0 and v^r_0 such that v^l_0 + v^r_0 = v_0. D_0 then sends m^l_0
to its left neighbor and m^r_0 to its right neighbor (m^l_0, m^r_0 are
the ciphertexts of v^l_0, v^r_0). Effectively, there are two streams
of messages in opposite directions. When party P_i receives c^l_in
and c^r_in from its two neighbors, it computes c^l_out = c^l_in · m_i,
c^r_out = c^r_in · m_i and forwards them to the next neighbor.
The two messages arriving at the master delegate are c^l and c^r.
During verification, party P_i checks the following conditions:
1) Dec(c^l) = Dec(c^r) and Dec(c^l)[0] = n·roundId.
2) Dec(c^l_in)[0] = i·roundId and Dec(c^r_in)[0] = (n −
i)·roundId and Dec(c^l_in) + v_i = Dec(c^l_out) and
Dec(c^r_in) + v_i = Dec(c^r_out).
^2 If the delegate misbehaves probabilistically with some probability p′, then
the detection probability will be 1 − (1 − p·p′)^{n·ck}.
3) Dec(c^l_in)[1] + Dec(c^r_out)[1] = Dec(c^r_in)[1] +
Dec(c^l_out)[1] = Dec(c^l)[1]
This protocol's only advantage over our broadcast-based
design is that each delegate maintains only two connections, to
its immediate neighbors. Therefore, the communication cost
for the delegates is constant, whereas in our protocol each
delegate must handle O(n) messages. However, there are many
added overheads. First, the master party has to wait until
the message traverses the ring completely before broadcasting
the result. This does not scale with an increasing number of
delegates, as network latency becomes significant. Second,
parties need to establish other shared states besides the Paillier
keys. In particular, they have to agree on the identity of the
master party, and on the direction in which to forward the messages.
In the case of probabilistic verification, they must also agree on
which rounds to carry out the checks. Having many shared
states makes the system less robust in the presence of failures.
Finally, electing a node as a master party on whom the
correctness of the computation rests requires a trust assumption
that may not be realistic in decentralized settings.
III. DATA MINING APPLICATIONS
A data mining algorithm can be modeled as a process of
extracting knowledge from a database, which consists of two
iterative steps: x ← query(D), which queries a database D, and
process(x), which computes knowledge from the query results.
In a distributed setting, the first step is executed locally, and
the results are aggregated across multiple parties. In other
words, a distributed data mining algorithm now comprises
three steps: x_i ← query(D_i), x ← aggregate(x_i)
and process(x). Possible aggregate functions are: sum, set
union, scalar product, etc. [28]. Our protocol described in
the previous section provides an implementation of the
sum function. We now take three classical data mining tasks
(classification, association rule mining and clustering) and
show how they can be realized utilizing our service in a
collaborative, cloud-based setting.
A. Secure Database Outsourcing
Database outsourcing has always been an attractive option
for small-to-medium businesses, even before the era of cloud
computing. The main reasons for moving data to third-party sites
are scalability, high availability and cost effectiveness, freeing
up an enterprise's resources for its core business priorities.
With the advent of cloud computing, this trend has tremendously
accelerated.
For simplicity, we assume that databases are in relational
format and every attribute has a non-negative integer domain
(other domains could be mapped into the integer domain; the
details are not within the scope of our work). The
party is the data owner, who wishes to move the data and
computation to the cloud (or delegate). We assume that the
party retains a local copy, but most queries and processing
on the data are to be done on the cloud. This may be the
case, for instance, due to the availability of arbitrary amounts
of computing power and specialized tools (software services)
on the cloud, which are hard to achieve in-house.
Data residing at a third party and queries being executed
remotely raise several security issues: data privacy,
integrity, query completeness and query freshness. Since our
adversary model for the cloud is curious and lazy (Section
II-A), we deal only with the data privacy and
query completeness problems. Techniques for ensuring query
freshness are discussed elsewhere [29], [30].
1) Data privacy: To protect data privacy, encryption can
be used. Many encryption schemes exist, each differing from the
others in its security guarantees and the range of operations that
can be performed over the ciphertexts. When outsourcing data, a
trade-off between security and the possibility of computing over
ciphertexts must be made in order for queries to be executable
at a third party [16]. Randomized encryption (for example
AES in CBC mode with a randomized initialization vector)
offers security against adaptive chosen-plaintext attacks, but no
meaningful computations can be done. Deterministic encryption
(DET), such as AES, offers less security but facilitates
equality comparison: Enc(x) = Enc(y) ⟺ x = y. Order-preserving
encryption (OPE) [31], [32] supports inequality
comparisons on ciphertexts: Enc(x) < Enc(y) ⟺ x < y.
However, it has a weaker security guarantee than DET,
since it reveals the plaintexts' order. Another useful encryption
primitive is homomorphic encryption (HOM), such as the
Paillier scheme used in our secure sum protocol. Such schemes
are inherently malleable. Using DETs, simple queries such as
equality selection, COUNT and GROUP BY can be performed
by database engines. With OPEs, MIN, MAX, SORT and ORDER
BY are also supported. Furthermore, these can be done efficiently,
since the database engine can build its B+ tree directly
from the ciphertexts. With HOMs, aggregate functions like
SUM or AVG can be performed.
Since there are different types of queries, multiple types
of encryption must be supported at the same time. In
CryptDB [16], data is encrypted in multiple onions used for
different use-cases; each onion is multi-layered: the outer-most
layer is the most secure and the inner-most supports the most
complex operations. The database queries required for the data
mining algorithms in our work are limited to those supported
by DETs and OPEs. We adopt a simple approach that stores
two copies of the database at the cloud: one encrypted with
AES and one with OPE. Column names and table names are
also encrypted with AES. A database query is transformed into
its encrypted version by encrypting the column names, table names
and attribute values with the appropriate encryption key and
scheme. For example, the two queries:
select COUNT from t
where t.a1 = x AND t.a2 = y;
select COUNT from t
where t.a1 < x AND t.a2 < y
ORDER BY t.a1;
are translated into:
select COUNT from t_aes
where t_aes.AES(a1) = AES(x)
AND t_aes.AES(a2) = AES(y);
select COUNT from t_ope
where t_ope.AES(a1) < OPE(x)
AND t_ope.AES(a2) < OPE(y)
ORDER BY t_ope.AES(a1);
where t_aes and t_ope are the encrypted names of the AES-encrypted
and OPE-encrypted tables derived from the original
table t. We assume that both the database encryption and the query
translation are done by the data owner (or by authorized users).
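The equality-query translation can be sketched as follows; an HMAC-based deterministic pseudonym stands in for AES deterministic encryption, and the key and helper names are illustrative, not from the paper:

```python
import hashlib
import hmac

KEY = b"demo-det-key"   # hypothetical shared DET key

def det(token):
    # deterministic "encryption" stand-in: equal plaintexts always
    # map to equal tokens, enabling equality comparison on ciphertexts
    return "E" + hmac.new(KEY, token.encode(), hashlib.sha256).hexdigest()[:8]

def translate_equality_query(table, conds):
    # rewrite: select COUNT from <table> where <col> = <val> AND ..
    where = " AND ".join(f"{det(table)}.{det(c)} = {det(v)}"
                         for c, v in conds)
    return f"select COUNT from {det(table)} where {where};"

q = translate_equality_query("t", [("a1", "x"), ("a2", "y")])
assert "t" not in q.split()   # plaintext table name no longer appears
```

Because the transformation is deterministic, the cloud can evaluate the rewritten WHERE clause without ever seeing the plaintext names or values.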
2) Query completeness: Executing a query over a large
database is an expensive operation. As the cloud is lazy, it
has incentives to skip some parts of the data when performing
the query, or even to ignore the query altogether and return a
random response instead. Li et al. proposed a data structure
called the Authenticated Aggregation R-tree (AAR tree [33]) that
can produce a proof that the query has been executed over
the complete data. However, the most complex queries that the
AAR tree can support are COUNT or SUM over range selection
conditions, and the proofs are expensive to construct and
verify.
We propose a probabilistic solution that takes advantage
of the data owner having a local copy of the database. A
straightforward protocol would require the data owner to
probabilistically execute a query q over its local copy of the
data and compare the result with what is returned from the cloud.
Thus, let p be the probability that the cloud is lazy for any
given query; then the probability of it getting caught after k
checks is 1 − (1 − p)^k, which rapidly approaches 1.
Notice that even though k may be small, executing q directly
on the local database may not be desirable, as the party wishes
to be involved as little as possible.
We observe that even if the cloud ignores some parts of the
data, the data mining output may still be accurate. We conduct
experiments with the NaiveBayes and K-Means algorithms to test
the accuracy when blocks of data are removed at random.
We divide the data into b blocks, and use mis-classification
rates and root mean square errors as accuracy metrics. The
result depicted in Figure 3[b] suggests that accuracy indeed
remains high. The output of NaiveBayes, for example, is above
99.6% correct even when 30% of the blocks are removed.
Another observation is that when data is divided into blocks
and the cloud removes a considerable number of blocks, the
party may execute the query over only a small number of
blocks in order to detect the inconsistency. Our protocol is based
on [34], and proceeds as follows. The query q is transformed
into a list of smaller queries Q = q_1, q_2, .., q_b, each executed
on one block. The party sends Q to the cloud. Let r and w be the
number of queries in Q performed by the party locally and by
the cloud at its site, respectively. The probability P(b, r, w) that the cloud
stays undetected when performing only w out of the b queries is:
P(b, r, w) = (1 / C(b, r)) · Σ_{i = max(0, w+r−b)}^{min(r, w)} C(w, i) · C(min(b − w, max(1, b − i)), r − i)
Figure 3[a] shows that the probability of successful cheating
decreases exponentially as the cloud ignores more data. It
indicates that the party only needs to execute the query over
a small portion of its local data (10-15%) for the detection
to be effective. This suggests that the party may not need to
store the entire database locally: it may suffice to have 20%
of the data (unknown to the cloud delegate) and refresh it
at pre-defined intervals. We leave further investigation of this
question for future work.
In summary, our protocol is effective for ensuring query
completeness: if a large amount of data is ignored, the cloud
gets caught with very high probability; if a small part is
ignored, the data mining algorithms are largely unaffected.
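The checking game itself is easy to simulate directly; the sketch below assumes, as a simplification of [34], that cheating is detected exactly when one of the r locally checked queries falls on a block the cloud skipped:

```python
import random

def undetected_prob(b, w, r, trials=20000):
    # cloud performs a random subset of w blocks out of b; the party
    # checks r random blocks; cheating goes undetected only if every
    # checked block is one the cloud actually processed
    hits = 0
    for _ in range(trials):
        performed = set(random.sample(range(b), w))
        checked = random.sample(range(b), r)
        if all(c in performed for c in checked):
            hits += 1
    return hits / trials
```

For example, with b = 20 blocks, skipping half of them is caught far more often than skipping two, matching the exponential decay seen in Figure 3[a].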
B. Data Mining Algorithms
Algorithm 1: Naive Bayes classification
Input: Y, A, V, local party p
Output: N, N_y, N_{y,a,v} for all y, a, v
1 foreach y ∈ Y do
2   N^p_y ← QueryCount(l_x = y)
3   foreach attribute a ∈ A and v ∈ V_a do
4     N^p_{y,a,v} ← QueryCount(x_a = v, l_x = y)
5 foreach y ∈ Y do
6   N_y ← securesum(N^p_y)
7   foreach a ∈ A and v ∈ V_a do
8     N_{y,a,v} ← securesum(N^p_{y,a,v})
1) Classification (Naive Bayes): A classification algorithm
takes as input a set of labeled (training) data and outputs a
classifier that can be used to label new (test) data. Let Y
be a set of labels and l_x ∈ Y be the label of x (a multi-variate
vector). Let A = {a_1, a_2, ..} be a set of attributes (columns)
and V = {V_i} be the set of attribute domains. Let N be the
number of instances (rows), N_y be the number of instances
(rows) with label y, and N_{y,a,v} be the number of instances whose
column a has value v and whose label is y. The NaiveBayes
algorithm (Algorithm 1) returns (N, N_y, N_{y,a,v}) for all y, a, v
as the classifier [35]. The label of a new instance x is:

argmax_y ( (N_y / N) · Π_i (N_{y,i,x_i} / N_y) ).
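As an illustration of how a party could label a new instance from the aggregated counts, here is a minimal Java sketch. The dense count-array layout (counts[y][a][v] = N_{y,a,v}, assuming small integer attribute domains) is our assumption, not the paper's actual data structure.

```java
// Illustrative sketch: labeling a new instance from the aggregated counts
// (N, N_y, N_{y,a,v}) that Algorithm 1 produces via secure sums.
// The dense array layout for counts is an assumption for this example.
public class NaiveBayesPredict {

    /**
     * @param n      total number of instances N
     * @param ny     ny[y] = N_y, number of instances with label y
     * @param counts counts[y][a][v] = N_{y,a,v}
     * @param x      attribute values of the instance to label
     * @return argmax_y (N_y/N * prod_a N_{y,a,x_a}/N_y)
     */
    static int predict(double n, double[] ny, double[][][] counts, int[] x) {
        int best = -1;
        double bestScore = -1.0;
        for (int y = 0; y < ny.length; y++) {
            double score = ny[y] / n;                // prior N_y / N
            for (int a = 0; a < x.length; a++) {
                score *= counts[y][a][x[a]] / ny[y]; // likelihood N_{y,a,x_a} / N_y
            }
            if (score > bestScore) { bestScore = score; best = y; }
        }
        return best;
    }

    public static void main(String[] args) {
        // Two labels, one binary attribute; label 1 strongly prefers value 1.
        double[] ny = {4, 4};
        double[][][] counts = { { {3, 1} }, { {1, 3} } };
        System.out.println(predict(8, ny, counts, new int[]{1})); // prints 1
    }
}
```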
QueryCount(x_1 = v_1, ..) follows the protocol for querying
outsourced databases described in the previous section.
Basically, the party constructs a COUNT query of the form

select COUNT(*) from Data
where (x_1=v_1) AND ..

then encrypts the names Data, v_1, v_2, .. before sending it to
the cloud. The cloud executes the query and returns a result,
which is verified (where applicable). securesum(x) invokes the
secure sum service using x as the party's input. If verification
of the query or of the secure sum process fails, the algorithm
terminates.
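The query rewriting step can be sketched as follows; the enc() stand-in is a hypothetical placeholder for whatever deterministic encryption the party actually applies to the table name and literals.

```java
// Sketch of how a party might rewrite a COUNT query before sending it to
// the delegate: the table name and value literals are replaced by their
// encryptions. enc() is a placeholder, not the paper's actual scheme.
public class QueryRewrite {

    static String enc(String plaintext) {
        // Placeholder for a deterministic encryption known to the party.
        return "E" + Integer.toHexString(plaintext.hashCode());
    }

    static String countQuery(String table, String[] cols, String[] vals) {
        StringBuilder sb = new StringBuilder("select COUNT(*) from ")
                .append(enc(table)).append(" where ");
        for (int i = 0; i < cols.length; i++) {
            if (i > 0) sb.append(" AND ");
            sb.append("(").append(cols[i]).append("=").append(enc(vals[i])).append(")");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(countQuery("Data",
                new String[]{"x_1"}, new String[]{"v_1"}));
    }
}
```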
2) Clustering (K-Means): Clustering algorithms partition
data into separate clusters such that distance between elements
belonging to the same cluster is smaller than between ones in
different clusters. The K-Means algorithm (Algorithm 2) finds
k clusters identified by their centroids (the mean centers of the
clusters). Each party starts with a set of chosen centroids, then
computes new centroids by grouping data into clusters using
the previous centroids. The algorithm works in multiple rounds
until convergence (i.e., the set of centroids is unchanged
compared to the previous round). This algorithm is slightly
different from the standard K-Means [36], for we are only
Fig. 3: Query completeness with probabilistic verification. (a) Success probability for the adversary as a function of the work performed, for b = 100, 1000 and r = 5, 15. (b) Accuracy of data mining results (NaiveBayes and K-means, b = 100, 1000) as a function of the percentage of data used.
Algorithm 2: K-Means Clustering
Input: Number of clusters k, A, local party p
Output: Set of centroids M = {m_1, .., m_k}
1 foreach m_i ∈ M do
2   m_i = (i, i, . . . , i)
3 C^p = ∅; foreach m_i ∈ M do
4   C^p_{m_i} = ∅
5   foreach a ∈ A do
6     C^p_{m_i}(a) ← QueryGroupBy(a, m_i, M)
7     C^p_{m_i} = C^p_{m_i} ∪ C^p_{m_i}(a)
8   C^p = C^p ∪ C^p_{m_i}
9 foreach C^p[i] ∈ C^p do
10   C[i] ← securesum(C^p[i])
11 C^p ← C
12 foreach m_i ∈ M do
13   Extract C^p_{m_i} from C^p
14   foreach a ∈ A do
15     m_i(a) ← Mode(C^p_{m_i}(a))
16 Repeat from Step 3 until M converges.
dealing with integer-domain attributes (or categorical data).
Specifically, the data mode (the most frequently seen value,
computed by the Mode function) is used instead of the mean.
Many different metrics exist for quantifying the distance from
an element x to a centroid c. We use Manhattan distance
for its simplicity. In particular: Δ(x, c) = Σ_i |x_i − c_i|.
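A single-party, plaintext toy version of the two ingredients just described (Manhattan-distance assignment and mode-based centroid update) might look like the sketch below; it is an illustration only, not the distributed protocol.

```java
import java.util.HashMap;
import java.util.Map;

// Toy, single-party version of the K-Means variant described above:
// Manhattan distance for assignment, per-attribute mode (not mean)
// for the centroid update. Illustrative only.
public class ModeKMeans {

    static int manhattan(int[] x, int[] c) {
        int d = 0;
        for (int i = 0; i < x.length; i++) d += Math.abs(x[i] - c[i]);
        return d;
    }

    // Index of the centroid closest to x.
    static int nearest(int[] x, int[][] centroids) {
        int best = 0;
        for (int j = 1; j < centroids.length; j++)
            if (manhattan(x, centroids[j]) < manhattan(x, centroids[best])) best = j;
        return best;
    }

    // Most frequent value of attribute a among the given rows.
    static int mode(int[][] rows, int a) {
        Map<Integer, Integer> freq = new HashMap<>();
        int best = rows[0][a];
        for (int[] r : rows) {
            int f = freq.merge(r[a], 1, Integer::sum);
            if (f > freq.get(best)) best = r[a];
        }
        return best;
    }

    public static void main(String[] args) {
        int[][] centroids = {{0, 0}, {10, 10}};
        System.out.println(nearest(new int[]{2, 1}, centroids)); // prints 0
    }
}
```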
QueryGroupBy(a, m_i, M) asks the cloud to return a list of
frequencies for attribute a in the portion of data closest to the
centroid m_i. The database query has the following form:

select a, COUNT(*) as freq from Data
where Δ(a, m_i) < Δ(a, m_0) AND ...
AND Δ(a, m_i) < Δ(a, m_{i−1})
AND Δ(a, m_i) < Δ(a, m_{i+1}) AND ...
GROUP BY a ORDER BY a
As Δ is computed over OPE ciphertext, what is returned
from the cloud for QueryGroupBy might not be correct.
OPE's only guarantee is Enc(x) < Enc(y) ⇔ x < y;
hence it does not always follow that |Enc(x) − Enc(x')| <
|Enc(y) − Enc(y')| ⇔ |x − x'| < |y − y'|. Suppose that in
the unencrypted database an element x is closer to centroid
c_1 than to c_2; the distance based on the OPE values of x, c_1 and c_2
might indicate that x is closer to c_2. Our experimental study
(discussed later) shows that this phenomenon occurs frequently,
but the final clusters are very close to the clusters found using
the unencrypted databases.
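The order-versus-distance caveat can be demonstrated with a simulated order-preserving mapping (a strictly increasing function with random gaps, standing in for a real OPE scheme; this is an illustration, not an actual OPE construction).

```java
import java.util.Random;

// Simulated OPE: a strictly increasing random-gap mapping preserves order
// (enc[x] < enc[y] iff x < y) but not relative distances, so
// nearest-centroid decisions over ciphertexts can flip.
public class OpeDistanceDemo {

    // enc[v]: strictly increasing mapping over the domain [0, size).
    static long[] makeOpe(int size, long seed) {
        Random rnd = new Random(seed);
        long[] enc = new long[size];
        long cur = 0;
        for (int v = 0; v < size; v++) {
            cur += 1 + rnd.nextInt(1000); // positive random gap keeps order
            enc[v] = cur;
        }
        return enc;
    }

    // True iff some x is closer to c1 than c2 in plaintext, while the
    // distances over encrypted values compare the other way round.
    static boolean hasFlip(long[] enc) {
        int n = enc.length;
        for (int x = 0; x < n; x++)
            for (int c1 = 0; c1 < n; c1++)
                for (int c2 = 0; c2 < n; c2++)
                    if (Math.abs(x - c1) < Math.abs(x - c2)
                        && Math.abs(enc[x] - enc[c1]) > Math.abs(enc[x] - enc[c2]))
                        return true;
        return false;
    }

    public static void main(String[] args) {
        long[] enc = makeOpe(100, 42);
        for (int v = 1; v < enc.length; v++)
            if (enc[v] <= enc[v - 1]) throw new AssertionError("order broken");
        System.out.println("distance flip found: " + hasFlip(enc));
    }
}
```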
Algorithm 3: Apriori association rule mining
Input: Support threshold minsup and confidence threshold minconf
Output: Set of association rules X → Y
1 L_1 ← GenerateFrequentItemsetSize1()
2 k = 2
3 C_k ← GenerateCandidates(L_{k−1})
4 B_p = ∅
5 foreach c ∈ C_k do
6   t ← QueryCount(c)
7   B_p = B_p ∪ t
8 foreach 1 ≤ i ≤ |C_k| do
9   B[i] ← securesum(B_p[i])
10   Extract c.count from B[i]
11 L_k ← {c ∈ C_k | c.count ≥ minsup}
12 Increase k and repeat from line 3 until L_k = ∅
13 GenerateRules(∪_k L_k, minconf)
3) Association rule mining (Apriori): Association rule
mining algorithms extract relationships between attributes
that occur frequently in the data. An association rule is
of the form X → Y where X, Y ⊆ A. The Apriori algorithm
(Algorithm 3) first determines frequent item sets containing a
single item. GenerateFrequentItemSetSize1 issues
COUNT queries with different attributes. The results are
merged into a set of candidates (larger item sets, using
GenerateCandidates). The threshold value minsup
specifies the lower bound for item set frequency. These
steps are repeated until there is no more item set to
be found. Finally, GenerateRules generates outputs
by establishing rules between non-empty subsets and
removing rules whose confidence values are below minconf.
The details of GenerateFrequentItemSetSize1,
GenerateCandidates and GenerateRules can be
found in [37].
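The per-candidate support counting and minsup pruning of Algorithm 3 can be sketched for a single party as follows; here the COUNT queries are evaluated over a local boolean matrix purely for illustration, in place of the encrypted queries sent to the delegate.

```java
import java.util.ArrayList;
import java.util.List;

// Single-party sketch of Apriori's support-counting step: each candidate
// itemset corresponds to one COUNT query (evaluated locally here for
// illustration); candidates below minsup are pruned.
public class AprioriStep {

    // rows[i][a] == true iff row i contains item a.
    static int support(boolean[][] rows, int[] itemset) {
        int count = 0;
        for (boolean[] row : rows) {
            boolean all = true;
            for (int a : itemset) all &= row[a];
            if (all) count++;
        }
        return count;
    }

    // Keep only candidates whose support reaches minsup.
    static List<int[]> frequent(boolean[][] rows, List<int[]> candidates, int minsup) {
        List<int[]> out = new ArrayList<>();
        for (int[] c : candidates)
            if (support(rows, c) >= minsup) out.add(c);
        return out;
    }
}
```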
4 delegates
encryption (512-bit) 0.64 (0.15)
decryption (512-bit) 0.36 (0.11)
encryption (1024-bit) 4.53 (0.18)
decryption (1024-bit) 2.49 (0.17)
signing (1024-bit RSA) 1.01 (0.005)
signature verification 0.38 (0.05)
Fig. 6: Cost of cryptographic operations (ms)
IV. EVALUATION
A. Prototypes
We have implemented a prototype for the protocols discussed
in the previous sections³. The secure sum service is
implemented in Java, in which AES encryption and RSA
signatures are provided by the Crypto++ library [38], and OPE
and Paillier encryption by the CryptDB library [16]. Communications
between parties and delegates are done via Java sockets, using
the thread-per-connection model. The data mining algorithms
are implemented in Java, using the Weka library [39].
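For readers unfamiliar with the additive homomorphism the service relies on, here is a textbook Paillier sketch (the simplified g = n + 1 variant, with demo-sized keys); it is an illustration, not the CryptDB implementation used in our prototype.

```java
import java.math.BigInteger;
import java.security.SecureRandom;

// Textbook Paillier (g = n + 1 variant) illustrating the additive
// homomorphism: Enc(a) * Enc(b) mod n^2 decrypts to a + b.
// Demo-sized keys only; not the CryptDB implementation.
public class PaillierDemo {
    final BigInteger n, n2, lambda, mu;
    static final SecureRandom RND = new SecureRandom();

    PaillierDemo(int bits) {
        BigInteger p = BigInteger.probablePrime(bits / 2, RND);
        BigInteger q = BigInteger.probablePrime(bits / 2, RND);
        n = p.multiply(q);
        n2 = n.multiply(n);
        lambda = p.subtract(BigInteger.ONE).multiply(q.subtract(BigInteger.ONE));
        mu = lambda.modInverse(n); // valid for g = n + 1
    }

    BigInteger encrypt(BigInteger m) {
        BigInteger r = new BigInteger(n.bitLength() - 1, RND).add(BigInteger.ONE);
        // g^m = (n+1)^m = 1 + m*n (mod n^2)
        BigInteger gm = BigInteger.ONE.add(m.multiply(n)).mod(n2);
        return gm.multiply(r.modPow(n, n2)).mod(n2);
    }

    BigInteger decrypt(BigInteger c) {
        BigInteger u = c.modPow(lambda, n2);
        BigInteger l = u.subtract(BigInteger.ONE).divide(n); // L(u) = (u-1)/n
        return l.multiply(mu).mod(n);
    }

    public static void main(String[] args) {
        PaillierDemo ph = new PaillierDemo(512);
        // Multiplying ciphertexts adds the underlying plaintexts.
        BigInteger c = ph.encrypt(BigInteger.valueOf(20))
                         .multiply(ph.encrypt(BigInteger.valueOf(22))).mod(ph.n2);
        System.out.println(ph.decrypt(c)); // 42
    }
}
```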
We now discuss our experiments for evaluating the secure
sum service, first as a stand-alone service, then as a part
of complex data mining applications. All experiments are run
in a cluster of 16 nodes, each with a 3.0 GHz Xeon processor
and 4 GB of RAM, running the OCS5.1 (2.6.18-53El5smp)
operating system. The machines are connected via 20 Gbps
InfiniBand.
B. Secure Sum Benchmark
We first examine the secure sum service as a stand-alone
application. The salient metric here is its throughput: the number
of sum operations completed per second, especially in
comparison with the traditional, non-delegated secure sum protocol.
We set up a cloud-like environment with up to 6 parties
and 6 delegates. We emulate the condition in which network
connections for individual parties have lower capacity and speed
than those employed between cloud providers, by adding
artificial latency to messages sent from any party. Based on our
experiments with the differences in latency between ingress/egress
and intra-cluster traffic, we add an extra delay of 1
ms. In the experiments, each party starts an infinite loop that
invokes the secure sum service with a single value. Throughput
is measured per second as the number of sum values returned
successfully to the party. Figure 4(a) shows the recorded
throughput at steady state, using 512-bit Paillier encryption.
Our protocol achieves over 320 sums/sec. The effect of
increasing the number of parties/delegates on throughput is
small, although there is a slight reduction when the number
of delegates increases. This is because, with more delegates,
each will have to wait for more messages before computing
the final sums.
Figure 4(b) compares the service throughput with that of
the non-delegated protocol, using 4 parties. The throughput
is 83% for 512-bit Paillier, and drops to 35% for 1024-bit.
Cryptographic operations (Paillier encryption, decryption and
signature verification at the party, and the signing operations
at the delegates) account for the observed overhead.
³ Source code is available on request and is being sanitized for release.
Fig. 8: Preparation (or one-off) cost of encrypting and loading the database (time in ms, log scale, for the breast_cancer, breast_cancer_x50, mushroom, mushroom_x50, splice and splice_x10 datasets; bars for database encryption and database loading).
Their computational costs (quantified in terms of completion
times) are detailed in Figure 6. In the non-delegated version, the
main factors restricting throughput are network speed and
CPU, which explains the high throughput as well as the large
fluctuation. The close gap between our 512-bit service and the
non-delegated version can be explained as follows. Using 512-bit
Paillier, encryption and decryption at a party take roughly
1 ms, which is close to the overhead incurred in handling
O(n) messages to/from the other parties instead of 1 message
to/from the delegate. For 1024-bit, however, the overall cost
is so dominated by encryption/decryption operations that it
explains the low and consistent throughput. Figure 5 illustrates
the number of messages sent and received by each party during
the secure sum protocol. With respect to network cost at a
party, increasing the number of parties does not affect our
protocol.
C. Data Mining Performance
Having explored the performance and cost associated with
the secure sum service, we next evaluate the distributed data
mining applications that use the service as a building block.
We use both real and synthetic datasets for running the data
mining algorithms (Figure 7(a)). Three datasets, breast cancer
(small), mushroom (large, many rows) and splice (large, many
columns), are from the UCI Machine Learning Repository [40],
from which we synthesize larger datasets by extending them
with random values from similar distributions. Other system
parameters are summarized in Figure 7(b). The results presented
below, unless otherwise stated, are for 4 parties and
with 1024-bit Paillier encryption.
In our implementation, the party encrypts its data with AES
and OPE and uploads it to the delegate, which loads it into
a MySQL server. The party needs to break its original data
into b blocks and keep parts of it locally. We select b = 10
and load 20% of the data into the database server at the party.
With b = 10 and r = 20%, the probability of cheating
successfully is 0.77 when the delegate skips 1 block, and
drops to 0.04 when it skips 5 blocks. This is for one check,
but it becomes arbitrarily small over a period of time with multiple
checks. Each database uploaded to the delegates is divided
Fig. 4: Comparing sums/sec of delegated (our protocol) and non-delegated secure sum. (a) 512-bit Paillier encryption (sums/sec over elapsed time, for 2, 4 and 6 parties). (b) 4 parties (with delegate, with delegate at 1024 bits, and no delegate).
             | 4 parties, 4 delegates | 6 parties, 6 delegates | 4 parties, 0 delegates | 6 parties, 0 delegates
sent         | 374 (25)  | 340 (17) | 1340 (91) | 2064 (37)
received     | 320 (2)   | 321 (3)  | 1125 (23) | 1896 (55)
verification | 5 (1)     | 5 (1)    | 0         | 0
Fig. 5: Network cost (number of party messages at steady state).
database name | size (nRows × nCols)
breast cancer | 63 × 10
breast cancer x50 | 2100 × 10
mushroom | 1827 × 23
mushroom x50 | 91350 × 23
splice | 717 × 62
splice x10 | 7177 × 62
(a) Database parameters

variable name | values
data mining algorithms | NaiveBayes, Apriori, K-Means
database encryption key length | 64 bit (OPE), 128 bit (AES)
number of parties | 2, 4, 6
Paillier bit-length | 512, 1024
query verification probability | 0.1, 0.2
secure sum verification probability | 0.05, 0.1
(b) Other parameters
Fig. 7: System variables
into 10 smaller tables. When querying, the party generates
10 queries from its original query, and assembles the partial
results when they come back. With a pre-defined probability,
the party executes queries on its local database (consisting
of 2 small tables) and compares the outputs with what is
returned from the delegate. Figure 8 shows the initial, one-off
cost for encrypting the data at the party and for loading
it at its delegate. The cost is proportional to the data size,
with a maximum of 30 seconds for the mushroom x50 dataset.
Compared to the original, the encrypted data uploaded to the
delegate is larger (by a factor of 22.8 (0.14)), but the
loading times remain below 45 seconds.
The overall completion time for every data mining algorithm
can be broken down into two components: secure sum and
database query time. Figure 9 depicts this metric for different
algorithms and datasets. A common pattern is that database
queries are at least an order of magnitude more expensive
than secure sum. The longest experiment takes 33 minutes
(for running Apriori on splice x10), of which secure sum itself
takes less than 2 minutes. This observation suggests that when
used in real data mining algorithms, the overhead incurred by
using the secure sum service has little effect on the overall
performance.
1) Query time: Figure 10 illustrates the impact of increasing
data size on the query time metric. It can be observed
that query time does not always scale with the size of the
data, especially for Apriori and KMeans. In particular, the
mushroom x50 dataset is almost 50 times larger than mushroom, but
in Apriori the query time increases by more than two orders
of magnitude. Similarly, splice x10 is 10 times bigger, yet
query times for KMeans are roughly the same. These are due
to intrinsic properties of the data mining algorithms, which we
attempt to explain in the following.

Fig. 12: Local database query time (for verification) vs. database query at the cloud (time in ms, log scale, for NaiveBayes, Apriori and KMeans).
We extract the number of queries and the time per query, the
results of which are shown in Figure 11. As many queries from
the party are duplicates that are subsequently cached at the party in
Apriori (80% for the splice datasets), the figure shows only the
Fig. 9: Running times for different data mining algorithms, broken into secure sum and database query time (ms, log scale): (a) NaiveBayes, (b) Apriori, (c) K-Means.
Fig. 10: Database query time for different datasets (time in ms, log scale, for NaiveBayes, Apriori and KMeans): (a) breast cancer vs. breast_cancer_x50, (b) mushroom vs. mushroom_x50, (c) splice vs. splice_x10.
           | mushroom   | mushroom x50 | splice         | splice x10
NaiveBayes | 12.5 (0.5) | 244.5 (1.4)  | 8.8 (0.2)      | 44.5 (0.07)
Apriori    | 16 (1.1)   | 231.7 (0.1)  | 8.5 (0.02)     | 44.4 (0.02)
KMeans     | 2.8 (0.2)  | 24.5 (0.1)   | 5.3 (0.02)     | 8.7 (2.5)
(a) Time per query (ms)

           | mushroom | mushroom x50 | splice         | splice x10
NaiveBayes | 252      | 252          | 10398          | 10398
Apriori    | 132      | 4214         | 9103           | 44821
KMeans     | 690      | 2024 (307)   | 14396 (1519.8) | 5960 (2183)
(b) Number of queries
Fig. 11: Database query benchmark
number of queries that actually get executed at the delegate.
For Apriori on the mushroom datasets, bigger data size causes
longer execution time per query (from 16 to 231 ms) and a sharp increase
in the number of queries (132 to 4214). Therefore the query
time for the mushroom x50 dataset is much longer.
For KMeans on the splice datasets, however, the time per query
does not increase much, whereas the number of queries actually
decreases. This explains why the query time for splice x10 is
roughly the same as for splice. Not only is the number of queries
smaller when increasing data size, it is also not the same for every
run of KMeans on the same dataset (standard errors are
5–10%). We will revisit this behavior later.
There are three types of database queries in our experiments:
simple COUNT with at most 2 selection conditions (NaiveBayes),
COUNT with multiple selection conditions (Apriori),
and GROUP BY and ORDER BY (KMeans). Figure 11(a) shows
the difference in execution times for these queries. The seemingly
more complex queries (GROUP BY and ORDER BY)
take less time. This is because these queries are executed
over OPE databases, which are both smaller and more efficiently
managed by the database engine (since they are order-preserving,
the engine can build a B+ tree directly from them).
Figure 12 compares query execution time at the party versus
at the delegate. It is clear that the latter takes much longer.
There are three reasons: the encrypted database is larger,
the delegate executes every query over the entire database
whereas the party does so over only 20%, and the verification
process is probabilistic. The average difference is more
than two orders of magnitude. This is further evidence of the
benefit of outsourcing databases to the cloud, with which the
party only needs to perform a small amount of work locally.
2) Secure sum time: We extract and analyse the time
taken by the secure sum service. In contrast to the previous
benchmark (Section IV-B), only a small number of secure
sum operations are used during the execution of the data
mining tasks. Furthermore, one round of secure sum may
involve multiple values (as opposed to one value per round
in Section IV-B), hence many of them could be packed
into a single Paillier encryption as elaborated previously in
Section II-A and Figure 2.
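The packing idea can be sketched as follows: several small counters are laid out in disjoint bit fields of one plaintext, so a single (homomorphic) addition adds all of them at once. The 32-bit field width below is an illustrative choice; it must leave enough headroom that per-field sums cannot overflow into a neighbouring field.

```java
import java.math.BigInteger;

// Sketch of value packing: several small counters share one big plaintext
// in disjoint 32-bit fields, so one addition sums all of them at once.
// Plain BigInteger addition here mirrors what homomorphic addition of the
// corresponding Paillier ciphertexts would achieve.
public class Packing {
    static final int FIELD_BITS = 32; // illustrative field width

    static BigInteger pack(long[] values) {
        BigInteger out = BigInteger.ZERO;
        for (int i = values.length - 1; i >= 0; i--)
            out = out.shiftLeft(FIELD_BITS).add(BigInteger.valueOf(values[i]));
        return out;
    }

    static long[] unpack(BigInteger packed, int count) {
        long[] out = new long[count];
        BigInteger mask = BigInteger.ONE.shiftLeft(FIELD_BITS).subtract(BigInteger.ONE);
        for (int i = 0; i < count; i++) {
            out[i] = packed.and(mask).longValue();
            packed = packed.shiftRight(FIELD_BITS);
        }
        return out;
    }

    public static void main(String[] args) {
        BigInteger a = pack(new long[]{3, 5, 7});
        BigInteger b = pack(new long[]{1, 2, 3});
        long[] sums = unpack(a.add(b), 3);
        System.out.printf("%d %d %d%n", sums[0], sums[1], sums[2]); // 4 7 10
    }
}
```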
Fig. 13: Secure sum time with varying number of parties/delegates, using the breast cancer x50 dataset (time in ms, for NaiveBayes, Apriori and KMeans with 2 to 6 parties).
Fig. 14: Encryption times for the Apriori algorithm, with varying number of sum values and different datasets (time in ms; 512 and 1024 bits, with and without packing).
Figure 13 shows the effect of the number of parties on
secure sum. It can be observed, similarly to Figure 4, that
increasing N has almost no impact on the secure sum time.
However, the variance is visible, especially for NaiveBayes.
First, the NaiveBayes experiments on this particular dataset involve
a single round of secure sum, hence the inherent variance
is not amortized over multiple rounds. Second, when
running as part of a data mining task, database operations
inevitably interfere with the secure sum protocol and cause
more variance. For example, party P_i may have finished its
database queries and start sending values to its delegate for
secure summing, but delegate D_j (j ≠ i) may still be busy
with queries from its party; therefore D_i will have to wait until
D_j finishes and receives a value from P_j.
Figure 14 demonstrates the benefit of packing multiple
values into a single Paillier encryption. The x-axis is the
number of values the party sends to the delegate to be summed,
which is larger than the actual number of encryptions. It can
be seen that encryption time rises with the number of sum
values and with the bit length. In addition, using 1024 bits with
packing is faster than 512 bits without packing, because the
speed-up gained from packing is 15-fold, whereas the speed-up
gained from encrypting with 512 bits instead is less than 10-fold.
3) Correctness of KMeans: As noticed earlier in Figure 11,
the number of database queries for a KMeans experiment

dataset | # mismatched query results | cluster error
breast cancer | 12.4 (1.2) | 0.03 (0.006)
breast cancer x50 | 23 (5.2) | 0.03 (0.009)
mushroom | 0 | 0
mushroom x50 | 68 (13.1) | 0.02 (0.002)
splice | 508.2 (63.9) | 0.002 (0.0005)
splice x10 | 224.2 (61) | 0.01 (0.004)
Fig. 15: KMeans correctness
varies between different runs on the same dataset. This behavior
is caused by the potential error in the GROUP BY
and ORDER BY queries over OPE databases (explained in
Section III-A). In brief, the query, even if executed truthfully
by the delegate, may yield a different result from that obtained by
performing the query locally on the plain-text data. We refer to
this as a mismatched query. The errors affect convergence of the
algorithm as well as the final centroids. All of our experiments
with KMeans, however, converge to final centroids.
To quantify the differences between the centroids found by our
protocols and those found by standard KMeans on plain-text
data (called standard KMeans), we compute the cluster error for
each pair of clusters:

cluster_error(C_i, C'_i) = |σ(C_i) − σ(C'_i)| / σ(C'_i)

where C_i and C'_i are the i-th cluster found by our protocol and by
the standard KMeans respectively, and σ(C_i) is the mean squared
error (the mean squared distance of each member of a cluster
to its centroid) of cluster C_i. Figure 15 shows this metric for
all datasets, together with the number of mismatched queries.
The errors seem independent of how many mismatched queries
there are. In all cases, our protocols yield clusters whose
quality is very close to that obtained from the standard
KMeans.
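The metric can be transcribed directly; mse() below is the mean squared distance of cluster members to a centroid, written out only to make the formula concrete.

```java
// Direct transcription of the cluster error metric:
// cluster_error(C_i, C'_i) = |mse(C_i) - mse(C'_i)| / mse(C'_i),
// where mse is the mean squared distance of members to the centroid.
public class ClusterError {

    static double mse(double[][] members, double[] centroid) {
        double sum = 0;
        for (double[] m : members) {
            double d2 = 0;
            for (int i = 0; i < m.length; i++) {
                double diff = m[i] - centroid[i];
                d2 += diff * diff;
            }
            sum += d2;
        }
        return sum / members.length;
    }

    static double clusterError(double mseOurs, double mseStandard) {
        return Math.abs(mseOurs - mseStandard) / mseStandard;
    }

    public static void main(String[] args) {
        double[][] cluster = {{0, 0}, {2, 0}};
        double m = mse(cluster, new double[]{1, 0}); // (1 + 1) / 2 = 1.0
        System.out.println(clusterError(1.03, m));   // about 3% error
    }
}
```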
D. Discussion.
So far, we have quantified the cost of doing collaborative
data mining on the cloud in a secure manner. There are
overheads when using our secure sum service which, taken alone,
may not be a favorable argument for moving one's IT infrastructure
to the cloud. However, when used in the context of data
mining, the benefits of the cloud could outweigh the costs of
maintaining one's own infrastructure.
Let m be the number of unique sum messages sent and
received by the party for secure summing during a data
mining algorithm. Let α be the crypto cost of encrypting and
decrypting a message (with additive homomorphic encryption
schemes). Let q be the number of database queries and
c_q the CPU cost of each query. The computation cost of
performing the collaborative data mining algorithm on an in-house
infrastructure can be estimated as:

C = q·c_q

while the cost using our cloud-based approach is:

C_d = α·m

Thus, the overhead at each party becomes O = (C_d − C) =
(α·m − q·c_q), which diminishes quickly and becomes negative
for complex data mining algorithms (larger q) or larger
datasets (larger c_q). It has been shown in Figure 9, for example,
that the query costs are orders of magnitude larger than
the cryptographic costs incurred by the secure sum service.
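The cost model reduces to one line of arithmetic; the numbers in main below are illustrative only, not measurements.

```java
// Back-of-the-envelope helper for the cost model above:
// overhead O = alpha*m - q*c_q of the cloud-based approach versus
// in-house execution. Negative values mean the cloud is cheaper.
public class CostModel {

    static double overhead(double alpha, long m, long q, double cq) {
        return alpha * m - q * cq;
    }

    public static void main(String[] args) {
        // Illustrative numbers: crypto at 5 ms/message, queries at 40 ms each.
        System.out.println(overhead(5.0, 100, 10, 40.0));   // small job: positive
        System.out.println(overhead(5.0, 100, 1000, 40.0)); // large job: negative
    }
}
```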
Since each party communicates only with a delegate, not
only does this make security policy enforcement easier, it
also saves network costs. In particular, the number of extra
messages handled by each party in our protocols compared
to the non-cloud version is O(q − n·m), which for a given
data mining task will decrease with more parties. Even for
complex tasks where q is large, the marginal network cost will
be more than offset by the computation savings.
Furthermore, we note that a party needs to maintain (a small
amount of) local data only if it desires to verify the task results;
local data is thus needed only for resilience against lazy delegates. If
the delegates' laziness is not a concern, but only data privacy is,
then no data needs to be maintained locally at all. This means
the party could benefit greatly from storage savings.
In summary, as the workload (and the number of parties)
increases, it is more beneficial to migrate data to the cloud
and use our secure sum service for data mining tasks than to
maintain one's own infrastructure. There are other qualitative
advantages of using the cloud, such as access to diverse tools
and services that are provided on demand, which may be
difficult or expensive to acquire, deploy and maintain in-house.
V. RELATED WORK
Our work is based on several areas of research: outsourced
databases, secure multi-party computation and verifiable
computation. For outsourced databases, existing works
concern authenticated data structures for guaranteeing query
freshness [30], [29] and completeness [41], [33]. Data privacy
is considered in CryptDB [16]. Our work addresses the query
completeness property, using a probabilistic approach based
on [34]. Querying outsourced databases, especially by a third
party, may give rise to privacy issues relating to the query
outputs. This issue is not within the scope of our work, but has
been studied under the differential privacy notion [42], [43]. The
basic technique requires adding noise to the query outputs,
and has mainly been applied to COUNT queries. The more
queries there are, the lower the privacy guarantees
in such approaches. It thus remains challenging to implement
complex data mining algorithms within a restricted privacy
budget [44].
Most protocols for secure multi-party computation (first
proposed by Yao [9] and Goldreich et al. [10]) are highly
inefficient, especially under malicious adversary models. Our
work assumes a semi-honest adversary model. Because the
computation is outsourced, we have to take into account
both the parties' and the delegates' adversarial behaviors. Kamara
et al. [21] recently investigated how to outsource multi-party
computation, but considered only a single delegate. In practice,
however, different parties may be using different public
cloud service providers (and some may deploy private or
hybrid clouds), and hence investigating the multi-cloud setting
is of essence.
The use of homomorphic encryptions for privacy-preserving
addition has been studied elsewhere [23], [45], [24]. These
works share the same model in which an untrusted aggregator
collects inputs from multiple parties and computes the sum
without learning their individual values. Each delegate in our
model can be considered as such an untrusted aggregator,
which is not only curious but also potentially lazy. Our
protocol both preserves privacy and ensures correctness of
the computation. Furthermore, these related works on secure
sum do not go as far as considering their protocols as parts
of complex, cloud-based applications such as collaborative
data mining. In contrast, our effort has been equally on the
conceptual foundations as well as actual implementation and
benchmarking of the same.
Our delegated computation model is a special case of
verifiable computation, in which a client outsources its
computations to a more powerful entity and is able to later
verify the outputs. Theoretical results have shown that any
computation can be outsourced with guaranteed input and output
privacy [17]. However, a general protocol for outsourced
computation is highly inefficient [20]. Some systems propose
to detect cheating and mis-computation at the expense of data
privacy [19], [18], but they rely on probabilistic checking and
require the client to pre-compute the results or the delegate to
commit to certain values. Wang et al. [20] propose a practical
method to outsource linear programming to the cloud. These
works consider a single party and delegate, as opposed to our
model.
VI. CONCLUSIONS AND FUTURE WORK
In this paper, we have described a service that allows multiple
parties to take part in a secure multi-party computation
(sum) in which the computation is outsourced to a set of delegates.
The protocol protects data privacy and ensures correctness
of the computation under a lazy-and-curious delegate and
curious party model. We have used the service in designing a
cloud-based system for carrying out collaborative data mining.
We discussed techniques for outsourcing databases to the cloud
in a secure manner, and for checking whether the cloud has executed
queries truthfully. We have chosen three classical data mining
algorithms representative of standard tasks, NaiveBayes
(classification), Apriori (association rule mining) and KMeans
(clustering), to demonstrate how the secure sum service can be
used in complex analytics.
A prototype for the secure sum service and the data mining
applications has been implemented in Java and evaluated in
a cloud-like environment with real-world datasets. Our experimental
studies have quantified the service overhead caused
by cryptographic operations. When used within data mining
applications, however, the cost of performing database queries
is orders of magnitude more significant than that of the secure
sum. For the clustering algorithm, our cloud-based system does
not always yield the exact clusters, due to potential query
errors caused by the use of order-preserving encryption, but
they are very close to the outputs of standard KMeans running
on unencrypted databases. As workloads increase (more parties,
more complex algorithms, bigger databases), the savings achieved
by moving to the cloud outweigh the overhead incurred by
our secure sum service.
An immediate extension of our work is to conduct the
experiments on real clouds, in order to quantify the real
cost and efficiency. The cloud's elasticity will also enable us
to scale our studies to many more parties and much larger
datasets.
We have shown with a proof of concept that it is possible
to delegate multi-party computations to the cloud in a secure
manner, and to realize secure, collaborative data mining
applications. In future work, we would like to consider other
delegated computations besides sums, such as scalar vector
multiplication, min/max, etc., which will consequently enable
more complex data mining applications. Our current adversary
model for the delegate is still semi-honest; extending it to
a malicious model poses significant challenges and research
opportunities.
There exist other additively homomorphic encryption
schemes besides Paillier [25], which we intend to study and
compare in the context of our work. We have not investigated
rigorously the SETUP phase, in which the group of parties is
formed and agrees on the keys. Dynamic group memberships
could affect our protocol in interesting ways. Finally, we plan
to incorporate differential privacy techniques into the query
phase, and investigate the maximum privacy budget needed
to realize any data mining algorithm.
Acknowledgements. This work has been supported by
A*Star TSRP grant number 1021580038 for the pCloud (Privacy
in data value chains using peer-to-peer primitives) project.
REFERENCES
[1] Medlineplus, www.nlm.nih.gov/medlineplus/.
[2] xignite: on demand nancial market data, xignite.com.
[3] National oceanic and admospheric administration, www.noaa.gov.
[4] Resmap, earth image source, www.resmap.com.
[5] Yahoo! trafc web services, developer.yahoo.com/trafc, 2010.
[6] Secure multiparty computation goes live, in Financial Cryptography
and Data Security, 2009, pp. 32543.
[7] E. L. Glaeser and M. E. Kahn, Sprawl and Urban Growth. Elsevier,
2003, vol. 4, ch. 56.
[8] S. Eubank, H. Guclu, V. A. Kumar, M. V. Marathe, A. Srinivasan,
Z. Toroczkai, and N. Wang, Modelling diseas outbreaks in realistic
urban social networks, Nature, no. 429, pp. 18084, 2004.
[9] A. C. Yao, Protocols for secure computations, in 23rd Annual Sym-
posium on Foundations of Computer Science, 1982.
[10] O. Goldreich, S. Micali, and A. Wigderson, How to play any mental
game, in 19th annual ACM symposium on Theory of Computing, 1987.
[11] E. A. Abbe, A. E. Khandani, and A. W. Lo, Privacy-preserving methods
for sharing nancial risk exposures, http://arxiv.org/abs/1111.5228, Nov
2011.
[12] Y. Wei and M. B. Blake, Service-oriented computing and cloud
computing: challenges and opportunities, Internet Computing, vol. 14,
no. 6, pp. 62-75, 2010.
[13] M. Hajjat, X. Sun, Y.-W. E. Sung, D. Maltz, S. Rao, K. Sripanidkulchai,
and M. Tawarmalani, Cloudward bound: planning for beneficial migration
of enterprise applications to the cloud, in SIGCOMM, 2010.
[14] B. C. Tak, B. Urgaonkar, and A. Sivasubramaniam.
[15] Y. Chen and R. Sion, To cloud or not to cloud? musings on costs and
viability, in 2nd ACM Symposium on Cloud Computing, 2011.
[16] R. A. Popa, N. Zeldovich, and H. Balakrishnan, CryptDB: a practical
encrypted relational DBMS, CSAIL, MIT, Tech. Rep. MIT-CSAIL-TR-2011-005, 2011.
[17] R. Gennaro, C. Gentry, and B. Parno, Non-interactive verifiable
computing: outsourcing computation to untrusted workers, in CRYPTO '10,
August 2010.
[18] S. Goldwasser, Y. T. Kalai, and G. N. Rothblum, Delegating computation:
interactive proofs for muggles, in Symposium on Theory of Computing (STOC), 2008.
[19] P. Golle and I. Mironov, Uncheatable distributed computations, in
Conference on Topics in Cryptology, 2001.
[20] C. Wang, K. Ren, and J. Wang, Secure and practical outsourcing of
linear programming in cloud computing, in INFOCOM11, 2011.
[21] S. Kamara, P. Mohassel, and M. Raykova, Outsourcing multi-party
computation, http://eprint.iacr.org/2011/272.pdf, 2011.
[22] B. Schneier, Applied Cryptography. John Wiley & Sons, 1996.
[23] E. Shi, T.-H. H. Chan, E. Rieffel, R. Chow, and D. Song, Privacy-
preserving aggregation of time-series data, in Network and Distributed
System Security Symposium, 2011.
[24] K. Kursawe, G. Danezis, and M. Kohlweiss, Privacy-friendly aggrega-
tion for the smart-grid, Microsoft Research, Tech. Rep., 2011.
[25] O. Ugus, D. Westhoff, R. Laue, A. Shoufan, and S. A. Huss, Optimized
implementation of elliptic curve based additive homomorphic encryption
for wireless sensor networks, CoRR, 2009.
[26] P. Paillier, Public-key cryptosystems based on composite degree residuosity
classes, in EUROCRYPT, 1999, pp. 223-238.
[27] M. Burmester and Y. Desmedt, A secure and scalable group key
exchange system, Information Processing, 2005.
[28] C. Clifton, M. Kantarcioglu, J. Vaidya, X. Lin, and M. Y. Zhu, Tools
for privacy preserving distributed data mining, SIGKDD Explorations
Newsletter, vol. 4, no. 2, 2002.
[29] M. T. Goodrich, R. Tamassia, and A. Schwerin, Implementation of an
authenticated dictionary with skip lists and commutative hashing, in
DARPA Information Survivability Conference and Exposition, 2001, pp. 68-82.
[30] R. Merkle, Secrecy, authentication and public key systems, Ph.D.
dissertation, Dept. of Electrical Engineering, Stanford University, 1979.
[31] R. Agrawal, J. Kiernan, R. Srikant, and Y. Xu, Order preserving
encryption for numeric data, in SIGMOD 2004, 2004.
[32] A. Boldyreva, N. Chenette, Y. Lee, and A. O'Neill, Order-preserving
symmetric encryption, in EUROCRYPT '09, 2009.
[33] F. Li, M. Hadjieleftheriou, G. Kollios, and L. Reyzin, Authenticated
index structures for aggregation queries, Transactions on information
and system security, vol. 13, no. 4, 2010.
[34] R. Sion, Query execution assurance for outsourced databases, in 31st
VLDB Conference, 2005, pp. 601-612.
[35] P. Domingos and M. Pazzani, On the optimality of the simple Bayesian
classifier under zero-one loss, Machine Learning, vol. 29, pp. 103-130,
1997.
[36] J. MacQueen, Some methods for classification and analysis of multivariate
observations, in 5th Berkeley Symposium on Mathematical
Statistics and Probability, 1967, pp. 281-297.
[37] R. Agrawal and R. Srikant, Fast algorithms for mining association
rules in large databases, in 20th International Conference on Very Large
Databases, 1994, pp. 487-499.
[38] Crypto++ library 5.6.1, www.cryptopp.com.
[39] Machine Learning Group, University of Waikato, Data mining software in
Java, www.cs.waikato.ac.nz/ml/weka.
[40] UCI machine learning repository, archive.ics.uci.edu/ml/datasets.html.
[41] M. Narasimha and G. Tsudik, DSAC: an approach to ensure integrity
of outsourced databases using signature aggregation and chaining,
http://eprint.iacr.org/2005/297.ps, 2005.
[42] C. Dwork, Differential privacy, in ICALP, 2006.
[43] I. Mironov, O. Pandey, O. Reingold, and S. Vadhan, Computational
differential privacy, in CRYPTO, 2009.
[44] G. Rothblum, Privacy-preserving data analysis and computational learn-
ing: a match made in heaven, http://windowsontheory.org/2012/05/27/
privacy-preserving-data-analysis-and-computational-learning-a-match-made-in-heaven/,
2012.
[45] V. Rastogi and S. Nath, Differentially private aggregation of distributed
time-series with transformation and encryption, in SIGMOD, 2010, pp. 735-746.