
Privacy Preserving Association Rule Mining in Vertically Partitioned Data

Jaideep Vaidya
Department of Computer Sciences
Purdue University
West Lafayette, Indiana
jsvaidya@cs.purdue.edu

Chris Clifton
Department of Computer Sciences
Purdue University
West Lafayette, Indiana
clifton@cs.purdue.edu

ABSTRACT
Privacy considerations often constrain data mining projects. This paper addresses the problem of association rule mining where transactions are distributed across sources. Each site holds some attributes of each transaction, and the sites wish to collaborate to identify globally valid association rules. However, the sites must not reveal individual transaction data. We present a two-party algorithm for efficiently discovering frequent itemsets with minimum support levels, without either site revealing individual transaction values.

Categories and Subject Descriptors
H.2.8 [Database Management]: Database Applications--Data mining; H.2.4 [Database Management]: Systems--Distributed databases; H.2.7 [Database Management]: Database Administration--Security, integrity, and protection

1. INTRODUCTION
Data mining technology has emerged as a means for identifying patterns and trends from large quantities of data. Mining encompasses various algorithms such as clustering, classification, association rule mining and sequence detection. Traditionally, all these algorithms have been developed within a centralized model, with all data being gathered into a central site, and algorithms being run against that data. Privacy concerns can prevent this approach - there may not be a central site with authority to see all the data. We present a privacy preserving algorithm to mine association rules from vertically partitioned data. By vertically partitioned, we mean that each site contains some elements of a transaction. Using the traditional "market basket" example, one site may contain grocery purchases, while another has clothing purchases. Using a key such as credit card number and date, we can join these to identify relationships between purchases of clothing and groceries. However, this discloses the individual purchases at each site, possibly violating consumer privacy agreements.

There are more realistic examples. In the sub-assembly manufacturing process, different manufacturers provide components of the finished product. Cars incorporate several subcomponents (tires, electrical equipment, etc.) made by independent producers. Again, we have proprietary data collected by several parties, with a single key joining all the data sets, where mining would help detect/predict malfunctions. The recent trouble between Ford Motor and Firestone Tire provides a real-life example. Ford Explorers with Firestone tires from a specific factory had tread separation problems in certain situations, resulting in 800 injuries. Since the tires did not have problems on other vehicles, and other tires on Ford Explorers did not pose a problem, neither side felt responsible. The delay in identifying the real problem led to a public relations nightmare and the eventual replacement of 14.1 million tires [16]. Many of these were probably fine - Ford Explorers accounted for only 6.5 million of the replaced tires [11]. Both manufacturers had their own data - early generation of association rules based on all of the data may have enabled Ford and Firestone to resolve the safety problem before it became a public relations nightmare.

Informally, the problem is to mine association rules across two databases, where the columns in the table are at different sites, splitting each row. One database is designated the primary, and is the initiator of the protocol. The other database is the responder. There is a join key present in both databases. The remaining attributes are present in one database or the other, but not both. The goal is to find association rules involving attributes other than the join key.

We must also lay out the privacy constraints. Ideally we would achieve complete zero knowledge, but for a practical solution controlled information disclosure may be acceptable. Finally, we need to quantify the accuracy and the efficiency of the algorithm, in view of the security restrictions.

We first outline related work and the problem background. In Section 3, we formally define the problem and give the overall solution. Section 4 presents the component scalar product protocol, the key privacy-preserving part of the solution. We discuss what the method discloses in Section 5, and present extensions that further limit disclosure. Section 6 analyzes the security provided and communication costs of the algorithm. We conclude by summarizing the contributions of this paper and giving directions for future work.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SIGKDD '02 Edmonton, Alberta, Canada
Copyright 2002 ACM 1-58113-567-X/02/0007 ...$5.00.

2. BACKGROUND AND RELATED WORK
The centralized data mining model assumes that all the data required by any data mining algorithm is either available at or can be sent to a central site. A simple approach to data mining over multiple sources that will not share data is to run existing data mining tools at each site independently and combine the results [5, 6, 17]. However, this will often fail to give globally valid results. Issues that cause a disparity between local and global results include:

• Values for a single entity may be split across sources. Data mining at individual sites will be unable to detect cross-site correlations.

• The same item may be duplicated at different sites, and will be over-weighted in the results.

• Data at a single site is likely to be from a homogeneous population, hiding geographic or demographic distinctions between that population and others.

Algorithms have been proposed for distributed data mining. Cheung et al. proposed a method for horizontally partitioned data [8], and more recent work has addressed privacy in this model [14]. Distributed classification has also been addressed. A meta-learning approach has been developed that uses classifiers trained at different sites to develop a global classifier [6, 17]. This could protect the individual entities, but it remains to be shown that the individual classifiers do not disclose private information. Recent work has addressed classification using Bayesian Networks in vertically partitioned data [7], and situations where the distribution is itself interesting with respect to what is learned [19]. However, none of this work addresses privacy concerns.

There has been research considering how much information can be inferred, calculated or revealed from the data made available through data mining algorithms, and how to minimize the leakage of information [15, 4]. However, this has been restricted to classification, and the problem has been treated with an "all or nothing" approach. We desire quantification of the security of the process. Corporations may not require absolute zero knowledge protocols (that leak no information at all) as long as they can keep the information shared within bounds.

In [4], data perturbation techniques are used to protect individual privacy for classification, by adding random values (from a normal/Gaussian distribution of mean 0) to the actual data values. One problem with this approach is a tradeoff between privacy and the accuracy of the results [1]. More recently, data perturbation has been applied to boolean association rules [18]. One interesting feature of this work is a flexible definition of privacy; e.g., the ability to correctly guess a value of '1' from the perturbed data can be considered a greater threat to privacy than correctly learning a '0'. [15] use cryptographic protocols to achieve complete zero knowledge leakage for the ID3 classification algorithm for two parties with horizontally partitioned data.

There has been work in cooperative computation between entities that mutually distrust one another. Secure two party computation was first investigated by Yao [20], and later generalized to multiparty computation. The seminal paper by Goldreich proves existence of a secure solution for any functionality [12]. The idea is as follows: the function F to be computed is first represented as a combinatorial circuit, and then the parties run a short protocol to securely compute every gate in the circuit. Every participant gets corresponding shares of the input wires and the output wires for every gate. This approach, though appealing in its generality and simplicity, means that the size of the protocol depends on the size of the circuit, which depends on the size of the input. This is inefficient for large inputs, as in data mining. In [9], relationships have been drawn between several problems in Data Mining and Secure Multiparty Computation. Although this shows that secure solutions exist, achieving efficient secure solutions for privacy preserving distributed data mining is still open.

3. PROBLEM DEFINITION
We consider the heterogeneous database scenario considered in [7], a vertical partitioning of the database between two parties A and B. The association rule mining problem can be formally stated as follows [2]: Let $\mathcal{I} = \{i_1, i_2, \ldots, i_m\}$ be a set of literals, called items. Let $\mathcal{D}$ be a set of transactions, where each transaction $T$ is a set of items such that $T \subseteq \mathcal{I}$. Associated with each transaction is a unique identifier, called its TID. We say that a transaction $T$ contains $X$, a set of some items in $\mathcal{I}$, if $X \subseteq T$. An association rule is an implication of the form $X \Rightarrow Y$, where $X \subset \mathcal{I}$, $Y \subset \mathcal{I}$, and $X \cap Y = \emptyset$. The rule $X \Rightarrow Y$ holds in the transaction set $\mathcal{D}$ with confidence $c$ if $c\%$ of transactions in $\mathcal{D}$ that contain $X$ also contain $Y$. The rule $X \Rightarrow Y$ has support $s$ in the transaction set $\mathcal{D}$ if $s\%$ of transactions in $\mathcal{D}$ contain $X \cup Y$.

Within this framework, we consider mining boolean association rules. The absence or presence of an attribute is represented as a 0 or 1. Transactions are strings of 0 and 1.

To find out if a particular itemset is frequent, we count the number of records where the values for all the attributes in the itemset are 1. This translates into a simple mathematical problem, given the following definitions:

Let the total number of attributes be $l + m$, where A has $l$ attributes $A_1$ through $A_l$, and B has the remaining $m$ attributes $B_1$ through $B_m$. Transactions/records are a sequence of $l + m$ 1s or 0s. Let $k$ be the support threshold required, and $n$ be the total number of transactions/records.

Let $\vec{X}$ and $\vec{Y}$ represent columns in the database, i.e., $x_i = 1$ iff row $i$ has value 1 for attribute $X$. The scalar (or dot) product of two cardinality $n$ vectors $\vec{X}$ and $\vec{Y}$ is defined as:

$$\vec{X} \cdot \vec{Y} = \sum_{i=1}^{n} x_i * y_i$$

Determining if the two-itemset $\langle X Y \rangle$ is frequent thus reduces to testing if $\vec{X} \cdot \vec{Y} \geq k$.

In Section 4 we present an efficient way to compute scalar product $\vec{X} \cdot \vec{Y}$ without either side disclosing its vector. First we will show how to generalize the above protocol from two-itemsets to general association rules without sharing information other than through scalar product computation.

Generalizing this protocol to a $w$-itemset is straightforward. Assume A has $p$ attributes $a_1 \ldots a_p$ and B has $q$ attributes $b_1 \ldots b_q$, and we want to compute the frequency of the $w = p + q$-itemset $\langle a_1, \ldots, a_p, b_1, \ldots, b_q \rangle$. Each item in $\vec{X}$ ($\vec{Y}$) is composed of the product of the corresponding individual elements, i.e., $x_i = \prod_{j=1}^{p} a_j$ and $y_i = \prod_{j=1}^{q} b_j$, with each $a_j$ and $b_j$ taken at row $i$. This computes $\vec{X}$ and $\vec{Y}$ without sharing information between A and B. The scalar product protocol then securely computes the frequency of the entire $w$-itemset.
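The reduction described above can be illustrated with a minimal sketch. The helper names `column_product` and `support_count` and the toy columns are hypothetical, chosen only for illustration. The dot product here is computed in the clear, so this shows only the arithmetic that Section 4 replaces with a privacy-preserving protocol, not the protocol itself.

```python
from math import prod

def column_product(columns):
    """Row-wise product of the 0/1 columns held by one party.

    Each party can do this locally: entry i is 1 only if every one of
    that party's attributes in the itemset is 1 in row i.
    """
    return [prod(row) for row in zip(*columns)]

def support_count(x, y):
    """Scalar (dot) product of two equal-length 0/1 vectors."""
    return sum(xi * yi for xi, yi in zip(x, y))

# Toy data (assumed): 5 rows, A holds a1 and a2, B holds b1 and b2.
a1 = [1, 1, 0, 1, 1]
a2 = [1, 0, 1, 1, 1]
b1 = [1, 1, 1, 1, 0]
b2 = [1, 1, 0, 1, 1]

x = column_product([a1, a2])   # computed at A
y = column_product([b1, b2])   # computed at B
count = support_count(x, y)    # the only step that needs both sides

k = 2                          # support threshold (assumed)
is_frequent = count >= k
```

Only rows 0 and 3 contain all four attributes, so the 4-itemset has a support count of 2 and meets the assumed threshold.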

For example, suppose we want to compute if a particular 5-itemset is frequent, with A having 2 of the attributes, and B having the remaining 3 attributes. I.e., A and B want to know if the itemset $l = \langle A_a, A_b, B_a, B_b, B_c \rangle$ is frequent. A creates a new vector $\vec{X}$ of cardinality $n$ where $\vec{X} = \vec{A_a} * \vec{A_b}$ (component multiplication) and B creates a new vector $\vec{Y}$ of cardinality $n$ where $\vec{Y} = \vec{B_a} * \vec{B_b} * \vec{B_c}$. Now the scalar product of $\vec{X}$ and $\vec{Y}$ provides the (in)frequency of the itemset.

The complete algorithm to find frequent itemsets is:

1. $L_1$ = {large 1-itemsets}
2. for ($k = 2$; $L_{k-1} \neq \emptyset$; $k$++) do begin
3.    $C_k$ = apriori-gen($L_{k-1}$);
4.    for all candidates $c \in C_k$ do begin
5.       if all the attributes in $c$ are entirely at A or B
6.          that party independently calculates $c$.count
7.       else
8.          let A have $l$ of the attributes and B have the remaining $m$ attributes
9.          construct $\vec{X}$ on A's side and $\vec{Y}$ on B's side where $\vec{X} = \prod_{i=1}^{l} \vec{A_i}$ and $\vec{Y} = \prod_{i=1}^{m} \vec{B_i}$
10.         compute $c$.count $= \vec{X} \cdot \vec{Y} = \sum_{i=1}^{n} x_i * y_i$
11.      endif
12.      $L_k = L_k \cup \{c \mid c.\text{count} \geq \text{minsup}\}$
13.   end
14. end
15. Answer $= \cup_k L_k$

In step 3, the function apriori-gen takes the set of large itemsets $L_{k-1}$ found in the $(k-1)$th pass as an argument and generates the set of candidate itemsets $C_k$. This is done by generating a superset of possible candidate itemsets and pruning this set. [3] discusses the function in detail.

Given the counts and frequent itemsets, we can compute all association rules with support $\geq$ minsup.

Only the steps 1, 3, 10 and 12 require sharing information. Since the final result $\cup_k L_k$ is known to both parties, steps 1, 3 and 12 reveal no extra information to either party. We now show how to compute step 10 without revealing information.

4. THE COMPONENT ALGORITHM
Secure computation of scalar product is the key to our protocol. Scalar product protocols have been proposed in the Secure Multiparty Computation literature [10], however these cryptographic solutions do not scale well to this data mining problem. We give an algebraic solution that hides true values by placing them in equations masked with random values. The knowledge disclosed by these equations only allows computation of private values if one side learns a substantial number of the private values from an outside source. (A different algebraic technique has recently been proposed [13], however it requires at least twice the bitwise communication cost of the method presented here.)

We assume without loss of generality that $n$ is even.

Step 1: A generates randoms $R_1 \ldots R_n$. From these, $\vec{X}$, and a matrix $C$ forming coefficients for a set of linearly independent equations, A sends the following vector $\vec{X'}$ to B:

$$
\begin{aligned}
&\langle x_1 + c_{1,1} * R_1 + c_{1,2} * R_2 + \cdots + c_{1,n} * R_n \rangle \\
&\langle x_2 + c_{2,1} * R_1 + c_{2,2} * R_2 + \cdots + c_{2,n} * R_n \rangle \\
&\qquad\vdots \\
&\langle x_n + c_{n,1} * R_1 + c_{n,2} * R_2 + \cdots + c_{n,n} * R_n \rangle
\end{aligned}
$$

In step 2, B computes $S = \vec{X'} \cdot \vec{Y}$. B also calculates the following $n$ values:

$$
\begin{aligned}
&\langle c_{1,1} * y_1 + c_{2,1} * y_2 + \cdots + c_{n,1} * y_n \rangle \\
&\langle c_{1,2} * y_1 + c_{2,2} * y_2 + \cdots + c_{n,2} * y_n \rangle \\
&\qquad\vdots \\
&\langle c_{1,n} * y_1 + c_{2,n} * y_2 + \cdots + c_{n,n} * y_n \rangle
\end{aligned}
$$

But B can't send these values, since A would then have $n$ independent equations in $n$ unknowns ($y_1 \ldots y_n$), revealing the $y$ values. Instead, B generates $r$ random values, $R'_1 \ldots R'_r$. The number of values A would need to know to obtain full disclosure of B's values is governed by $r$.

B partitions the $n$ values created earlier into $r$ sets, and the $R'$ values are used to hide the equations as follows:

$$
\begin{aligned}
&\langle c_{1,1} * y_1 + c_{2,1} * y_2 + \cdots + c_{n,1} * y_n + R'_1 \rangle \\
&\qquad\vdots \\
&\langle c_{1,n/r} * y_1 + c_{2,n/r} * y_2 + \cdots + c_{n,n/r} * y_n + R'_1 \rangle \\
&\langle c_{1,(n/r+1)} * y_1 + c_{2,(n/r+1)} * y_2 + \cdots + c_{n,(n/r+1)} * y_n + R'_2 \rangle \\
&\qquad\vdots \\
&\langle c_{1,2n/r} * y_1 + c_{2,2n/r} * y_2 + \cdots + c_{n,2n/r} * y_n + R'_2 \rangle \\
&\qquad\vdots \\
&\langle c_{1,((r-1)n/r+1)} * y_1 + c_{2,((r-1)n/r+1)} * y_2 + \cdots + c_{n,((r-1)n/r+1)} * y_n + R'_r \rangle \\
&\qquad\vdots \\
&\langle c_{1,n} * y_1 + c_{2,n} * y_2 + \cdots + c_{n,n} * y_n + R'_r \rangle
\end{aligned}
$$

Then B sends $S$ and the $n$ above values to A, who writes:

$$
\begin{aligned}
S ={} & (x_1 + c_{1,1} * R_1 + c_{1,2} * R_2 + \cdots + c_{1,n} * R_n) * y_1 \\
& + (x_2 + c_{2,1} * R_1 + c_{2,2} * R_2 + \cdots + c_{2,n} * R_n) * y_2 \\
& \qquad\vdots \\
& + (x_n + c_{n,1} * R_1 + c_{n,2} * R_2 + \cdots + c_{n,n} * R_n) * y_n
\end{aligned}
$$

Simplifying further and grouping the $x_i * y_i$ terms gives:

$$
\begin{aligned}
S ={} & (x_1 * y_1 + x_2 * y_2 + \cdots + x_n * y_n) \\
& + (y_1 * c_{1,1} * R_1 + y_1 * c_{1,2} * R_2 + \cdots + y_1 * c_{1,n} * R_n) \\
& + (y_2 * c_{2,1} * R_1 + y_2 * c_{2,2} * R_2 + \cdots + y_2 * c_{2,n} * R_n) \\
& \qquad\vdots \\
& + (y_n * c_{n,1} * R_1 + y_n * c_{n,2} * R_2 + \cdots + y_n * c_{n,n} * R_n)
\end{aligned}
$$

The first line of the R.H.S. can be succinctly written as $\sum_{i=1}^{n} x_i * y_i$, the desired final result. In the remaining portion, we group all multiplicative components vertically, and rearrange the equation to factor out all the $R_i$ values, giving:

$$
\begin{aligned}
S ={} & \sum_{i=1}^{n} x_i * y_i \\
& + R_1 * (c_{1,1} * y_1 + c_{2,1} * y_2 + \cdots + c_{n,1} * y_n) \\
& + R_2 * (c_{1,2} * y_1 + c_{2,2} * y_2 + \cdots + c_{n,2} * y_n) \\
& \qquad\vdots \\
& + R_n * (c_{1,n} * y_1 + c_{2,n} * y_2 + \cdots + c_{n,n} * y_n)
\end{aligned}
$$

Adding and subtracting the same quantity from one side of the equation does not change the equation in any way.
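In matrix shorthand (our notation, introduced only for this remark), write $C = (c_{i,j})$ and $\vec{R} = (R_1, \ldots, R_n)$, so that $\vec{X'} = \vec{X} + C\vec{R}$. The expansion so far is then the single identity below, in which the $j$-th component of $C^{\top}\vec{Y}$ is exactly the $j$-th unmasked value B computed in step 2:

$$
S = \vec{X'} \cdot \vec{Y} = (\vec{X} + C\vec{R}) \cdot \vec{Y} = \vec{X} \cdot \vec{Y} + \vec{R} \cdot (C^{\top}\vec{Y}) = \sum_{i=1}^{n} x_i * y_i + \sum_{j=1}^{n} R_j * \sum_{i=1}^{n} c_{i,j} * y_i
$$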

Hence, the above equation can be rewritten as follows:

$$
\begin{aligned}
S ={} & \sum_{i=1}^{n} x_i * y_i \\
& + \{R_1 * (c_{1,1} * y_1 + c_{2,1} * y_2 + \cdots + c_{n,1} * y_n) + R_1 * R'_1 - R_1 * R'_1\} \\
& \qquad\vdots \\
& + \{R_{n/r} * (c_{1,n/r} * y_1 + c_{2,n/r} * y_2 + \cdots + c_{n,n/r} * y_n) + R_{n/r} * R'_1 - R_{n/r} * R'_1\} \\
& + \{R_{n/r+1} * (c_{1,n/r+1} * y_1 + c_{2,n/r+1} * y_2 + \cdots + c_{n,n/r+1} * y_n) + R_{n/r+1} * R'_2 - R_{n/r+1} * R'_2\} \\
& \qquad\vdots \\
& + \{R_{2n/r} * (c_{1,2n/r} * y_1 + c_{2,2n/r} * y_2 + \cdots + c_{n,2n/r} * y_n) + R_{2n/r} * R'_2 - R_{2n/r} * R'_2\} \\
& \qquad\vdots \\
& + \{R_{(r-1)n/r+1} * (c_{1,(r-1)n/r+1} * y_1 + c_{2,(r-1)n/r+1} * y_2 + \cdots + c_{n,(r-1)n/r+1} * y_n) + R_{(r-1)n/r+1} * R'_r - R_{(r-1)n/r+1} * R'_r\} \\
& \qquad\vdots \\
& + \{R_n * (c_{1,n} * y_1 + c_{2,n} * y_2 + \cdots + c_{n,n} * y_n) + R_n * R'_r - R_n * R'_r\}
\end{aligned}
$$

Now A factors out the $R_i$ from the first two components and groups the rest vertically, giving:

$$
\begin{aligned}
S ={} & \sum_{i=1}^{n} x_i * y_i \\
& + R_1 * (c_{1,1} * y_1 + c_{2,1} * y_2 + \cdots + c_{n,1} * y_n + R'_1) \\
& \qquad\vdots \\
& + R_{n/r} * (c_{1,n/r} * y_1 + c_{2,n/r} * y_2 + \cdots + c_{n,n/r} * y_n + R'_1) \\
& + R_{n/r+1} * (c_{1,n/r+1} * y_1 + c_{2,n/r+1} * y_2 + \cdots + c_{n,n/r+1} * y_n + R'_2) \\
& \qquad\vdots \\
& + R_{2n/r} * (c_{1,2n/r} * y_1 + c_{2,2n/r} * y_2 + \cdots + c_{n,2n/r} * y_n + R'_2) \\
& \qquad\vdots \\
& + R_{(r-1)n/r+1} * (c_{1,(r-1)n/r+1} * y_1 + c_{2,(r-1)n/r+1} * y_2 + \cdots + c_{n,(r-1)n/r+1} * y_n + R'_r) \\
& \qquad\vdots \\
& + R_n * (c_{1,n} * y_1 + c_{2,n} * y_2 + \cdots + c_{n,n} * y_n + R'_r) \\
& - R_1 * R'_1 - \cdots - R_{n/r} * R'_1 \\
& - R_{n/r+1} * R'_2 - \cdots - R_{2n/r} * R'_2 \\
& \qquad\vdots \\
& - R_{(r-1)n/r+1} * R'_r - \cdots - R_n * R'_r
\end{aligned}
$$

A already knows the $n$ $R_i$ values. B also sent $n$ other values; these are the coefficients of the $n$ $R_i$ values above. A multiplies the $n$ values received from B with the corresponding $R_i$ and subtracts the sum from $S$ to get:

$$
\begin{aligned}
\mathit{Temp} ={} & \sum_{i=1}^{n} x_i * y_i \\
& - R_1 * R'_1 - \cdots - R_{n/r} * R'_1 \\
& - R_{n/r+1} * R'_2 - \cdots - R_{2n/r} * R'_2 \\
& \qquad\vdots \\
& - R_{(r-1)n/r+1} * R'_r - \cdots - R_n * R'_r
\end{aligned}
$$

Factoring out the $R'_i$ gives:

$$
\begin{aligned}
\mathit{Temp} ={} & \sum_{i=1}^{n} x_i * y_i \\
& - (R_1 + R_2 + \cdots + R_{n/r}) * R'_1 \\
& - (R_{n/r+1} + R_{n/r+2} + \cdots + R_{2n/r}) * R'_2 \\
& \qquad\vdots \\
& - (R_{((r-1)n/r)+1} + R_{((r-1)n/r)+2} + \cdots + R_n) * R'_r
\end{aligned}
$$

To get the desired final result (viz. $\sum_{i=1}^{n} x_i * y_i$), A needs to add the sum of the $r$ multiplicative terms to $\mathit{Temp}$.

In step 3, A sends the $r$ values to B, and B (knowing $R'$) computes the final result. Finally B replies with the result.

4.1 Selection of $c_{i,j}$
The above protocol requires a matrix $C$ of values that form coefficients of linearly independent equations. The necessity of this is obvious from the fact that the equations are used to hide the data values. If any equation can be eliminated using less than half of the other equations, a linkage between less than $n/2$ of the unknowns is created.

With high probability, a coefficient matrix generated by a pseudo-random function will form linearly independent equations. This enables construction of the $c_{i,j}$ matrix by sharing only a seed and a generating function.

5. WHAT THIS METHOD DISCLOSES
The goal of this work is to create a practical, efficient method to compute association rules without disclosing entity values. This does not require a complete zero-knowledge solution. We will now discuss what is disclosed, and problems posed by boolean attributes. Other issues, such as preventing disclosure of what items exist at each site, or preventing a "cheater" from probing to find a specific value (e.g., by setting all values but one to 0 - a problem for true zero-knowledge scalar product protocols, but detectable and thus preventable with our method) are omitted due to space constraints.

Table 1: Security Analysis of Protocol

     Protected values     Number of randoms   Total number    Number of equations
                          generated           of unknowns     revealed
A    $x_1 \ldots x_n$     $n$                 $2n$            $n + r$
B    $y_1 \ldots y_n$     $r$                 $n + r$         $n$
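To make concrete what each side sees, the exchange of Section 4 can be sketched as below. This is a single-process simulation under stated assumptions, not the authors' implementation: the function names, the integer ranges for the random values, and the use of Python's `random` and `secrets` modules are all our own choices. We also assume that A sends $\mathit{Temp}$ together with the $r$ group sums in step 3, which the text leaves implicit, and we do not check that the generated matrix is linearly independent.

```python
import random
import secrets

def make_coefficients(n, seed):
    """n x n coefficient matrix C from a shared seed (cf. Section 4.1).

    Assumption: small positive integers; independence is not verified.
    """
    rng = random.Random(seed)
    return [[rng.randint(1, 1000) for _ in range(n)] for _ in range(n)]

def group_of(j, n, r):
    """Index of the R' value masking B's j-th equation (n/r per group)."""
    return j // (n // r)

def scalar_product(x, y, r, seed=0):
    """Simulate both parties; returns the dot product of x and y."""
    n = len(x)
    assert len(y) == n and n % r == 0
    C = make_coefficients(n, seed)  # known to both A and B

    # Step 1 (A): hide x behind n private randoms R and send x_masked.
    R = [secrets.randbelow(10**6) + 1 for _ in range(n)]
    x_masked = [x[i] + sum(C[i][j] * R[j] for j in range(n))
                for i in range(n)]

    # Step 2 (B): compute S and n equations in y, each masked by the
    # R' value of its group; send S and the masked equations to A.
    R_prime = [secrets.randbelow(10**6) + 1 for _ in range(r)]
    S = sum(x_masked[i] * y[i] for i in range(n))
    masked_eqs = [sum(C[i][j] * y[i] for i in range(n))
                  + R_prime[group_of(j, n, r)] for j in range(n)]

    # A: strip the R-weighted equations from S, leaving
    # Temp = x.y - sum_k (sum of R in group k) * R'_k.
    temp = S - sum(R[j] * masked_eqs[j] for j in range(n))
    group_sums = [sum(R[j] for j in range(n) if group_of(j, n, r) == k)
                  for k in range(r)]

    # Step 3 (B): receives temp and group_sums (assumed), adds the
    # r multiplicative terms back, and replies with the result.
    return temp + sum(group_sums[k] * R_prime[k] for k in range(r))

result = scalar_product([1, 0, 1, 1], [1, 1, 0, 1], r=2)
```

Because all arithmetic is on Python integers, the masking terms cancel exactly and `result` equals the plain dot product regardless of the random draws. The values B sees from A are `x_masked`, `temp` and `group_sums`; the values A sees from B are `S`, `masked_eqs` and the final result.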

5.1 What Must be Disclosed
The nature of the problem - each party knows its own data, and learns the resulting global association rules - results in some disclosure. For example, if we have a support threshold of 5%, a rule $A_1 \Rightarrow B_1$ holds, and exactly 5% of the items on A have item $A_1$, A knows that at least those same items on B have property $B_1$. It is acceptable for the protocol to disclose knowledge that could be obtained from the global rules and one's own database.

This method discloses more than just the presence or absence of a rule with support above a threshold. Side A learns the exact support for an itemset. This increases the probability that A will learn that a set of items on B have a given property. This occurs when the global support equals A's support, not just when A's support is at the threshold. In any case, A learns the probability that an item in the set supported by A has a property in B, computed as the ratio of the actual support to A's support.

It is unlikely that specific individual data values will be disclosed with certainty by this method. This possibility can be reduced further by allowing approximate answers.

5.2 The Trouble with {0,1}
When we mine boolean association rules, the input values ($x_i$ and $y_i$ values) are restricted to 0 or 1. This creates a disclosure risk, both with our protocol and with other scalar product protocols [10, 13]. Recall that A provides $n + r$ equations in $2n$ unknowns. From these B can get $r$ equations in only the $x_i$ values. If the equations have a unique solution, B could try all possible values of 0 and 1 for all the $x_i$ to obtain the correct solution. This $2^n$ brute force approach enables B to know the correct values for the $x_i$. An analogous situation exists for B since A can get $n - r$ equations in only the $y_i$ values.

One way to eliminate this problem is to ensure there are multiple solutions to the equations, by cleverly selecting the $c_{i,j}$ values such that it is not evident exactly which of the $x_i$/$y_i$ are 1's. One way to accomplish that is as follows. Consider the form of any equation sent by B: $c_{1,1} * y_1 + c_{2,1} * y_2 + \cdots + c_{n,1} * y_n + R'_1$. B can group the $y_i$ into pairs of 0 and 1 and selectively set the coefficient of some pairs to be the same. Thus, even if A finds a solution to the equation, it is not unique and does not disclose which of the $y_i$ is 0 and which is 1. The duplicated $c_{i,j}$ values can also be used to force multiple solutions to the equations sent by A. Thus B is unable to verify a given set of $x_i$ by checking whether they allow a unique solution to the $R_i$ values.

This requires that B communicate which $c_{i,j}$ should be the same. Alternatively, B can specify a periodic generator and communicate the order of the $x$/$y$ values so pairs match the period. This increases communication cost by $O(n)$. Also, this creates a problem in security for A. As a result of duplicating half of the $c_{i,j}$, the security of A's $x$ values decreases; if B knows an $x$ value, B can determine the value for the corresponding $x$ of the pair.

6. SECURITY/COMMUNICATION ANALYSIS

6.1 Security Analysis
The security of the scalar product protocol is based on the inability of either side to solve $k$ equations in more than $k$ unknowns. Some of the unknowns are randomly chosen, and can safely be assumed as private. However, if enough data values are known to the other party, the equations can be solved to reveal all values. Therefore, the disclosure risk in this method is based on the number of data values that the other party might know from some external source. Table 1 presents the number of unknowns and equations generated. This shows the number of data values the other party must have knowledge of to obtain full disclosure.

The scalar product protocol is used once for every candidate itemset. This could introduce extra equations. When the candidate itemset contains multiple attributes from each side, there is no question of linear equations so it does not perceptibly weaken the privacy of the data.

It is also possible to have multiple $w$-itemsets in the candidate set split as $1, w - 1$ on each side. Consider two possible candidate sets $A_1, B_1, B_2, B_3$ and $A_1, B_2, B_3$. If A uses new/different equations for each candidate set, it imperils the security of $A_1$. However, B can reuse the values sent the first time. The equations sent by B can be reused for the same combinations of $B_i$; only a new sum must be sent. This reveals an additional equation, limiting the number of times B can run the protocol. This limit is adjusted by $r$, and is unlikely to be reached if the number of entities is high relative to the number of attributes.

6.2 Communication Analysis
The total communication cost depends on the number of candidate itemsets, and can best be expressed as a (constant) multiple of the I/O cost of the apriori algorithm. Computing support for each candidate itemset requires one run of the component scalar product protocol. The cost of each run (based on the number of items $n$) is as follows: A sends one message with $n$ values. B replies with a message consisting of $n + 1$ values. A then sends a message consisting of $r$ values. Finally B sends the result, for a total of four communication rounds. The bitwise communication cost is $O(n)$ with constant approximately 2 (assuming $r$ is constant). This is summarized in Table 2.

Table 2: Communication Cost

Rounds   Bitwise cost
4        2 * n * MaxValSz = O(n)

*MaxValSz = Maximum bits to represent any input value

There is also the quadratic cost of communicating the $c_{i,j}$ values. However, this cost can be made constant by agreeing

643
on a function and a seed value to generate the values.

7. CONCLUSION AND FUTURE WORK
The major contributions of this paper are a privacy preserving association rule mining algorithm given a privacy preserving scalar product protocol, and an efficient protocol for computing scalar product while preserving privacy of the individual values. We show that it is possible to achieve good individual security with communication cost comparable to that required to build a centralized data warehouse.

There are several directions for future research. Handling multiple parties is a non-trivial extension, especially if we consider collusion between parties as well. This work is limited to boolean association rule mining. Non-categorical attributes and quantitative association rule mining are significantly more complex problems.

The same privacy issues face other types of data mining, such as Clustering, Classification, and Sequence Detection. Our grand goal is to develop methods enabling any data mining that can be done at a single site to be done across various sources, while respecting their privacy policies.

8. ACKNOWLEDGMENTS
The authors thank Murat Kantarcioglu for suggestions on addressing boolean attributes. Portions of this work were supported by Grant EIA-9903545 from the National Science Foundation, and by sponsors of the Center for Education and Research in Information Assurance and Security.

9. REFERENCES
[1] D. Agrawal and C. C. Aggarwal. On the design and quantification of privacy preserving data mining algorithms. In Proceedings of the Twentieth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, Santa Barbara, California, USA, May 21-23 2001. ACM.
[2] R. Agrawal, T. Imielinski, and A. N. Swami. Mining association rules between sets of items in large databases. In P. Buneman and S. Jajodia, editors, Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, pages 207-216, Washington, D.C., May 26-28 1993.
[3] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proceedings of the 20th International Conference on Very Large Data Bases, Santiago, Chile, Sept. 12-15 1994. VLDB.
[4] R. Agrawal and R. Srikant. Privacy-preserving data mining. In Proceedings of the 2000 ACM SIGMOD Conference on Management of Data, Dallas, TX, May 14-19 2000. ACM.
[5] P. Chan. An Extensible Meta-Learning Approach for Scalable and Accurate Inductive Learning. PhD thesis, Department of Computer Science, Columbia University, New York, NY, 1996. (Technical Report CUCS-044-96).
[6] P. Chan. On the accuracy of meta-learning for scalable data mining. Journal of Intelligent Information Systems, 8:5-28, 1997.
[7] R. Chen, K. Sivakumar, and H. Kargupta. Distributed web mining using bayesian networks from multiple data streams. In The 2001 IEEE International Conference on Data Mining. IEEE, Nov. 29 - Dec. 2 2001.
[8] D. W.-L. Cheung, V. Ng, A. W.-C. Fu, and Y. Fu. Efficient mining of association rules in distributed databases. Transactions on Knowledge and Data Engineering, 8(6):911-922, Dec. 1996.
[9] W. Du and M. J. Atallah. Secure multi-party computation problems and their applications: A review and open problems. In Proceedings of the 2001 New Security Paradigms Workshop, Cloudcroft, New Mexico, Sept. 11-13 2001.
[10] W. Du and M. J. Atallah. Secure multi-party computational geometry. In Proceedings of the Seventh International Workshop on Algorithms and Data Structures, Providence, Rhode Island, Aug. 8-10 2001.
[11] Ford Motor Corporation. Corporate citizenship report. http://www.ford.com/en/ourCompany/communityAndCulture/buildingRelationships/strategicIssues/firestoneTireRecall.htm, May 2001.
[12] O. Goldreich, S. Micali, and A. Wigderson. How to play any mental game - a completeness theorem for protocols with honest majority. In 19th ACM Symposium on the Theory of Computing, pages 218-229, 1987.
[13] I. Ioannidis, A. Grama, and M. Atallah. A secure protocol for computing dot products in clustered and distributed environments. In The International Conference on Parallel Processing, Vancouver, Canada, Aug. 18-21 2002.
[14] M. Kantarcioglu and C. Clifton. Privacy-preserving distributed mining of association rules on horizontally partitioned data. In The ACM SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD'02), June 2 2002.
[15] Y. Lindell and B. Pinkas. Privacy preserving data mining. In Advances in Cryptology - CRYPTO 2000, pages 36-54. Springer-Verlag, Aug. 20-24 2000.
[16] National Highway Traffic Safety Administration. Firestone tire recall. http://www.nhtsa.dot.gov/hot/Firestone/Index.html, May 2001.
[17] A. Prodromidis, P. Chan, and S. Stolfo. Meta-learning in distributed data mining systems: Issues and approaches, chapter 3. AAAI/MIT Press, 2000.
[18] S. J. Rizvi and J. R. Haritsa. Privacy-preserving association rule mining. In Proceedings of 28th International Conference on Very Large Data Bases. VLDB, Aug. 20-23 2002.
[19] R. Wirth, M. Borth, and J. Hipp. When distribution is part of the semantics: A new problem class for distributed knowledge discovery. In Ubiquitous Data Mining for Mobile and Distributed Environments workshop associated with the Joint 12th European Conference on Machine Learning (ECML'01) and 5th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'01), Freiburg, Germany, Sept. 3-7 2001.
[20] A. C. Yao. How to generate and exchange secrets. In Proceedings of the 27th IEEE Symposium on Foundations of Computer Science, pages 162-167. IEEE, 1986.
