
European Journal of Scientific Research

ISSN 1450-216X Vol.64 No.4 (2011), pp. 610-624


© EuroJournals Publishing, Inc. 2011
http://www.europeanjournalofscientificresearch.com

Privacy Preserving Distributed Data Mining using Randomized Site Selection

M. Rajalakshmi
Coimbatore Institute of Technology, Coimbatore, India

T. Purusothaman
Government College of Technology, Coimbatore, India

Abstract

Distributed data mining extracts hidden, useful information from data sources
distributed among several sites. When mining spans several sites, the privacy of the
participants becomes a major concern, and sensitive information pertaining to individual
sites needs strong protection. Several approaches for mining data securely in a
distributed environment have been proposed, but in the existing approaches collusion
among the participating sites may reveal sensitive information about the other sites, and
the approaches fall short of their intended goals of preserving the privacy of the
individual participating sites, reducing computational complexity and minimizing
communication overhead. The proposed method finds global frequent itemsets in a
distributed environment with minimal communication among sites and ensures a higher
degree of privacy using Elliptic Curve Cryptography (ECC) and randomized site selection.
The experimental analysis shows that the proposed method generates global frequent
itemsets even among colluding sites without affecting mining performance, and confirms
optimal communication among sites.

Keywords: Distributed data mining, privacy, secure multiparty computation, frequent


itemsets, sanitization, Elliptic Curve Cryptography (ECC)

1. Introduction
Major technological developments and innovations in the field of information technology have made it
easy for organizations to store huge amounts of data at affordable cost. Data mining
techniques come in handy to extract, from such voluminous data, useful information for strategic
decision making, whether the data is centralized or distributed (Agrawal & Srikant, 1994; Han & Kamber, 2001).
The term data mining refers to extracting or mining knowledge from a massive amount of data.
Data mining functionalities such as association rule mining, cluster analysis, classification and prediction
specify the different kinds of patterns mined. Association Rule Mining (ARM) finds interesting
associations or correlations among a large set of data items. Finding association rules among huge
numbers of business transactions can support business decisions such as catalog design,
cross marketing, etc. The best-known example of ARM is market basket analysis: the process of
analyzing customers' buying habits from the associations between the different items they place
in their shopping baskets. This analysis can help retailers develop marketing strategies.
ARM involves two stages:
(i) Finding frequent itemsets

(ii) Generating strong association rules

1.1. Association Rule Mining: Basic Concepts


Let I = {i1, i2, …, im} be a set of m distinct items. Let D denote a database of transactions, where each
transaction T is a set of items such that T ⊆ I. Each transaction has a unique identifier, called TID. A
set of items is referred to as an itemset; an itemset that contains k items is a k-itemset. The support of an
itemset is defined as the ratio of the number of occurrences of the itemset in the data source to the total
number of transactions in the data source, and shows the frequency of occurrence of the itemset.
The itemset X is said to have support s if s% of the transactions contain X. The support of an association
rule X → Y, where X is the antecedent and Y is the consequent, is given by
Support = (Number of transactions containing X ∪ Y) / (Total number of transactions)
An itemset is said to be frequent when the number of occurrences of that particular itemset in
the database is larger than a user-specified minimum support. Confidence shows the strength of the
relation among the items. The confidence of an association rule is given by
Confidence = (Number of transactions containing X ∪ Y) / (Number of transactions containing X)
An association rule is said to be strong when its confidence is larger than a user-specified
minimum confidence. Only association rules whose support and confidence exceed the minimum support
and minimum confidence are mined. Many algorithms have been proposed for frequent itemset
generation, such as Apriori, Pincer search and Frequent Pattern tree (Agrawal & Srikant, 1994; Lin
& Kedem, 2002; Han, Pei, Yin & Mao, 2004).
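The two measures above can be computed directly; the sketch below uses a toy transaction set, which is an assumption for illustration only:

```python
# Minimal sketch of support and confidence for a rule X -> Y,
# over an assumed toy transaction database.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]

def support(itemset, db):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(x, y, db):
    """support(X ∪ Y) / support(X): strength of the rule X -> Y."""
    return support(x | y, db) / support(x, db)

print(support({"bread", "milk"}, transactions))      # 2/4 = 0.5
print(confidence({"bread"}, {"milk"}, transactions)) # 2/3 ≈ 0.667
```

An itemset is then "frequent" when its support meets a user-specified minimum, and a rule is "strong" when its confidence also meets the minimum confidence.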

1.2. Distributed Data Mining


In the present situation, information is the key factor which drives and decides the success of any
organization and it is essential to share information pertaining to an individual data source for mutual
benefit. Thus, Distributed Data Mining (DDM) is considered as the right solution for many
applications, as it reduces some practical problems like voluminous data transfers, massive storage unit
requirement, security issues etc. Distributed Association Rule Mining (DARM) is a sub-area of DDM.
DARM is used to find global frequent itemsets from different data sources distributed among several
sites. The DARM scenario is illustrated in Figure 1.

Figure 1: Distributed Association Rule Mining

[Figure: each site i (i = 1, …, n) applies an ARM algorithm to its data source i to produce
local frequent itemset i; a distributed association rule mining algorithm then combines the
local frequent itemsets of all sites into the global frequent itemset.]

In DARM, the local frequent itemsets for the given minimum support are generated at the
individual sites by using data mining algorithms like Apriori, FP Growth tree, etc. (Agrawal et al.
1994; Han et al. 2001). Then, global frequent itemsets are generated by combining local frequent
itemsets of all the participating sites with the help of distributed data mining algorithm (Cheung, D.,
Ng, Fu & Fu, 1996; Ashrafi, Taniar & Smith, 2004). The strong rules generated by distributed
association rule mining algorithms satisfy both minimum global support and confidence threshold.
Let D1, D2… Dn be the data sources which are geographically distributed. Let m be the number
of items and I = {i1,i2…im} be a set of items. Global Support of an itemset is defined as the ratio of the
number of occurrences of the itemset in all the data sources to the total number of transactions in all
the data sources. An itemset is said to be globally frequent when the number of occurrences of that
particular itemset in all the data sources is larger than a user-specified minimum support.
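The global-support definition can be sketched as follows, with small hypothetical partitions standing in for the distributed data sources D1 … Dn:

```python
# Global support: occurrences of an itemset summed over all sites,
# divided by the total transaction count over all sites.
# The partitions below are assumed data for illustration.
sites = [
    [{"a", "b"}, {"a"}, {"b", "c"}],          # D1
    [{"a", "b"}, {"a", "b", "c"}],            # D2
    [{"b"}, {"a", "b"}, {"c"}, {"a", "b"}],   # D3
]

def global_support(itemset, partitions):
    count = sum(sum(itemset <= t for t in db) for db in partitions)
    total = sum(len(db) for db in partitions)
    return count / total

print(global_support({"a", "b"}, sites))  # 5/9 ≈ 0.556
```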
During global frequent itemset generation, local frequent itemsets of individual sites need to be
shared. As a result, each participating site learns the exact support counts of all the other sites. However,
in many situations the participating sites are unwilling to disclose the support counts of some of
their itemsets, which are considered sensitive information. Thus, it is essential to share information
pertaining to an individual data source without revealing sensitive information (Vaidya & Clifton,
2004). Sharing or allowing access to sensitive information of a data source in this way is likely to
cause serious privacy issues in real-world situations.
A typical example of a privacy-preserving data mining problem occurs in market basket
analysis. Consider two or more companies, each with a huge data source in which the customers'
buying habits are stored as records. They wish to jointly mine their customers' data to make
strategic decisions using the patterns mined. However, some of these companies may not want to share
with the other parties certain sensitive strategic patterns hidden within their own data. In such cases,
classical data mining solutions cannot be used, and Secure Multiparty Computation (SMC)
solutions can be applied instead to maintain privacy in distributed association rule mining (Lindell & Pinkas,
2009). The goal of SMC in distributed association rule mining is to find the global frequent itemsets
without the participating sites revealing their local support counts to each other.
In distributed privacy preserving data mining, the participating sites may be treated as honest,
semi-honest or dishonest (Clifton, 2001). The semi-honest parties are honest but try to learn more from
received information. The dishonest parties are malicious and they do not follow the defined protocols.
When all the parties are honest, the question of privacy does not arise. The real need for concealing the
data of individual sites arises when the parties are semi-honest or dishonest. Here, a new collusion-free
solution is proposed to find the global frequent itemsets among dishonest parties with minimum
communication cost and time complexity.
The remainder of the paper is organized as follows. First, related existing work is
reviewed. Second, the proposed approach and its performance evaluation are discussed. Third,
discussion and limitations are presented. Finally, conclusions are drawn and future work on maintaining
privacy is outlined.

2. Related Work
Privacy preserving data mining is an active research area that provides methods for finding patterns without
revealing sensitive data. Numerous research efforts are underway to preserve privacy both within individual
data sources and across multiple data sources (Verykios, Bertino, Provenza, Saygin & Theodoridis, 2004;
Nedunchezhian & Anbumani, 2006).
There are two broad approaches for privacy-preserving data mining (Wang, Lee, Billis, &
Jafari, 2004). The first approach alters the data before it is delivered to the data miner so that real
values are hidden. It is called data sanitization. The second approach assumes that the data is
distributed between two or more sites, and that these sites cooperate to learn the global data mining
results without revealing the data at their individual sites. The second approach was named by
Goldreich as Secure Multiparty Computation (SMC) (Goldreich, 1998; Lindell & Pinkas, 2009). Data
mining on sanitized data results in loss of accuracy, while SMC protocols give accurate results at
high computation or communication cost (Inan, Saygyn, Savas, Hintoglu & Levi, 2006).
Representative works from each of these approaches are discussed in the following sections.
Clifton & Marks (1996) have provided a well-designed scenario which clearly reveals the
importance of data sanitization. In this scenario, by providing the original unaltered database to an
external party, some strategic association rules that are crucial to the data owner are disclosed, with
serious adverse effects. The sensitive association rule hiding problem is very common in
collaborative association rule mining projects, in which a company may decide to disclose only part
of the knowledge contained in its data and hide the strategic knowledge represented by sensitive rules.
These sensitive rules must be protected before the data is shared. They also suggest different measures to
protect sensitive data, such as limiting access, altering the data, eliminating unnecessary groupings, and
augmenting the data in a single data source.
Atallah, Bertino, Elmagarmid, Ibrahim & Verykios (1999) proposed the concept of data
sanitization to resolve the association rule hiding problem. The main idea is to select some transactions
from the original database and modify them through heuristics. They also proved that the optimal
sanitization is an NP-hard problem. Moreover, data sanitization can produce a lot of I/O operations,
which greatly increases the time cost, especially when the original database includes a large number of
transactions.
The sanitizing algorithm explained in (Oliveira & Zaïane, 2003) requires two
scans of the data source. The first scan builds an index to speed up the sanitization
process, and the second scan sanitizes the original data source. This is a significant
improvement over other algorithms, which require several scans depending on the number
of association rules to be hidden. Lee, Chang & Chen (2004) define a sanitization matrix in
their work. By multiplying the original transaction database by the sanitization matrix, a new
database, sanitized for privacy concerns, is created. However, constructing the sanitization
matrix for large data sources is a tedious process.
In order to hide sensitive rules, two fundamental approaches are presented in (Verykios,
Elmagarmid, Bertino, Saygin & Dasseni, 2004). The first approach prevents rules from being
generated by hiding the frequent sets from which they are derived. The second approach reduces the
importance of the rules by setting their confidence below a user-specified threshold. The approaches
used in the paper are moderately successful in hiding sensitive data, but they are computationally
intensive and have side effects like generation of artifactual new rules and hiding the existing
legitimate rules.
Wu, Chiang & Chen (2007) have suggested a method for hiding sensitive rules with limited
side effects. Templates are generated for the sensitive rules to be hidden in order to minimize the side
effects, but the cost of template generation increases when the number of sensitive rules and
the number of items in the individual sensitive rules are large. The data sanitization approach is suitable
for privacy preservation in the case of a single data source. With multiple data sources, which require the
cooperation of sites to learn the global data mining results, the approach may fail and yield
incorrect global mining results. Thus, secure multiparty computation techniques were proposed.
The concept of Secure Multiparty Computation (SMC) was introduced in (Yao, 1986). In many
applications the data is distributed between two or more sites, and for mutual benefit these sites
cooperate to learn the global data mining results without revealing the data at their individual sites. The
basic idea of SMC is that a computation is secure if, at the end of the computation, no party knows
anything about the other participating sites beyond its own input and the results.
Vaidya & Clifton (2002) proposed a method for privacy preserving association rule mining in
vertically partitioned data. Each site holds some of the attributes of each transaction. An efficient scalar
product protocol was proposed to preserve the privacy among two parties. This protocol did not
consider the collusion among the parties and was also limited to boolean association rule mining.
Kantarcioglu & Clifton (2004) proposed a method to preserve privacy among semi-honest parties
in a distributed environment. It works in two phases, assuming no collusion among the parties, but in
some real-life situations collusion is inevitable. The first phase identifies the global candidate itemsets
using commutative encryption, which is computationally intensive, and the second phase determines the
global frequent itemsets. This work only determines the global frequent itemsets, not their exact
support counts. In contrast, Ashrafi, Taniar & Smith (2005) proposed a privacy-preserving algorithm for
finding the exact support of global frequent itemsets among semi-honest parties. Here, each site
generates its local frequent one-length items, randomization is applied to find the global one-length items
in a secure manner, and higher-length frequent itemsets are then identified with the help of the global
one-length items. Through collusion, and by closely watching the input and output of a particular site,
the local frequent itemsets of that site can be identified and used to estimate the
supports of its higher-length itemsets.
An algorithm using a Clustering Feature (CF) tree and secure sum has been proposed to preserve the privacy
of quantitative association rules over horizontally partitioned data (Luo, 2006). However, fixing a proper
threshold value for constructing the CF tree is not easy, and the efficiency of the algorithm is
unpredictable since it depends on the threshold chosen. The present
analysis proposes a novel, computationally simple secure multiparty computation algorithm for
dishonest parties.
Since Elliptic Curve Cryptography is used for encryption and decryption, the working
principle of ECC is briefly given in the following section.

3. Elliptic Curve Cryptography


Data transmitted across a network is insecure and vulnerable to many types of attack nowadays.
Communication based on asymmetric cryptography is considerably more secure than symmetric
cryptography because asymmetric cryptographic algorithms use a pair of keys:
one key, the public key, is used for encryption, and its corresponding private key must be
used for decryption. The critical feature of asymmetric cryptography, which makes it useful, is the fact
that one of the keys cannot be derived from the other. This feature of asymmetric cryptosystems
greatly simplifies key exchange and key management.
Diffie-Hellman, RSA and Elliptic Curve Cryptography (ECC) are examples of asymmetric
cryptographic algorithms (Yao, 1986). Among these, ECC is efficient due to its shorter key
size, faster key generation and faster decryption, and its inverse operation becomes hard more
quickly with increasing key length than the inverse operations of Diffie-Hellman and RSA do. The security
strength of the ECC algorithm is mainly determined by the hardness of the elliptic curve discrete logarithm
problem.
The ECC approach is mathematically richer than standard systems like RSA. The
basic units of this cryptosystem are the points (x, y) on an elliptic curve E(Fp) of the form y^2 = x^3 + ax + b,
with x, y, a, b ∈ Fp = {0, 1, 2, …, p-1}, where Fp is a finite field of prime order p. A basic
requirement for any cryptosystem is that the system be closed, i.e. any operation on an element of the
system results in another element of the system. To satisfy this condition for elliptic curves, it is
necessary to construct non-standard addition and multiplication operations.
Let BP be a publicly known base point on the curve, s the private key and PUBKEY = s*BP the
corresponding public key. To encrypt a message point P, a user picks an integer k at random and sends the
pair of points (k*BP, P+k*PUBKEY). Decryption is done by multiplying the first component by the secret
key s and subtracting the result from the second component: (P+k*PUBKEY) - s*(k*BP) = P+k*(s*BP) - s*(k*BP) = P.
The embedding process is then reversed to recover the message, m, from the point P. This system requires
a high level of mathematical abstraction to implement (Burnett, Winters & Dowling, 2002).
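The encryption and decryption steps above can be sketched in a few lines of code. The tiny curve y^2 = x^3 + 2x + 2 over F_17 with base point (5, 1), the private key, the message point and the nonce are illustrative assumptions, far too small for real security:

```python
# Toy sketch of ECC ElGamal-style encryption/decryption on an
# assumed small curve: y^2 = x^3 + 2x + 2 over F_17, base point (5, 1).
p, a = 17, 2
INF = None  # point at infinity

def add(P, Q):
    """Point addition on the curve (handles doubling and infinity)."""
    if P is INF: return Q
    if Q is INF: return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return INF
    if P == Q:  # tangent slope for doubling
        s = (3 * x1 * x1 + a) * pow(2 * y1, -1, p) % p
    else:       # chord slope for distinct points
        s = (y2 - y1) * pow(x2 - x1, -1, p) % p
    x3 = (s * s - x1 - x2) % p
    return (x3, (s * (x1 - x3) - y1) % p)

def mul(k, P):
    """Scalar multiplication k*P by double-and-add."""
    R = INF
    while k:
        if k & 1:
            R = add(R, P)
        P = add(P, P)
        k >>= 1
    return R

def neg(P):
    return INF if P is INF else (P[0], (-P[1]) % p)

BP = (5, 1)            # base point
s = 7                  # private key (assumed)
PUBKEY = mul(s, BP)    # public key s*BP

M = mul(3, BP)         # message embedded as a curve point (assumed)
k = 11                 # random nonce, fixed here for reproducibility
C1, C2 = mul(k, BP), add(M, mul(k, PUBKEY))   # ciphertext pair
recovered = add(C2, neg(mul(s, C1)))          # (M + k*s*BP) - s*(k*BP) = M
print(recovered == M)  # True
```

The decryption line mirrors the identity given in the text: subtracting s*(k*BP) cancels the blinding term k*PUBKEY.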

4. Problem Definition
Let {S1, S2, …, Sn} be the set of participating sites, where n > 2. Let D1, D2, …, Dn be the geographically
distributed data sources of sites S1, S2, …, Sn respectively, and let I = {i1, i2, …, im} be a set of items.
Each transaction T in Di, for i = 1 to n, is such that T ⊆ I. Let LLi be the local frequent itemset generated
at participating site Si and G be the global frequent itemset. To generate the global frequent
itemset G, each site sends the support counts of its local frequent itemsets to the other
participating sites. The goal of the proposed approach is to discover the global frequent
itemsets without revealing the sensitive information of any participating site, where the sites are
assumed to be colluding.

5. Proposed Approach
This paper proposes a new method to compute globally frequent itemsets from distributed data sources
while preserving the privacy of the participating sites. Each participating site is assumed to be dishonest,
and the mining process can be initiated by any one of them. The proposed method works in two phases
to protect the sensitive information of the individual participating sites and to withstand any kind of
collusion among the parties. In the first phase, global candidate itemsets are collected at the site that
initiates the mining process, using Elliptic Curve Cryptography (ECC); in the second phase, the exact
support counts of the candidate itemsets are calculated using randomized site selection, so that the global
frequent itemsets are collected at the initiator site (Rajalakshmi, Purusothaman & Pratheeba, 2010).
In phase I, each site encrypts its local frequent itemsets in the first round and decrypts them in
the second round using the ECC algorithm. Initially, the site that wants to initiate the mining process
encrypts all its local frequent itemsets and sends them to the next site. The next site encrypts all its own
local frequent itemsets along with the itemsets received from the initiator and passes them to its next site,
and so on; the procedure continues through all the sites. Once the turn reaches the initiator, the initiator
begins decrypting the encrypted message and passes it to the next site. The process continues until the
initiator site is reached again, at which point all global candidate itemsets have been collected at the
initiator site. Figure 2 illustrates the encryption process.

Figure 2: Encryption of local frequent itemsets using ECC

[Figure: Site 1 sends E1(LL1) to Site 2; Site 2 sends E2(E1(LL1), LL2) to Site 3; Site 3 sends
E3(E2(E1(LL1), LL2), LL3), and so on, until Site n sends En(…E2(E1(LL1), LL2), LL3, …, LLn)
back to Site 1.]

After collecting the global candidate itemsets in phase I, the initiator starts the second phase to count
the support of the candidate itemsets. In this process, the initiator first adds spurious support counts to all
the candidate sets, in order to prevent the disclosure of the initiator's exact support counts to the
next randomly selected site, and then randomly splits the global candidate set into two sets, called the
retained set and the released set. The retained set is kept at the current site itself,
and the released set is released to a randomly selected site. Thus, each participating site sends its
sets in two rounds. Random site selection makes it hard for colluders to identify the exact itemsets
of any particular site.

When a site sends its data for the first time, the retained set is kept at the site itself, while the
support counts of the itemsets in the released set are added to the itemsets sent by the initiator and
forwarded to the next randomly selected site. This procedure is repeated until the selected site is the
initiator. The destination site for the released set is chosen at random, excluding the immediate source
site. Because the destination is selected at random each time, avoiding the immediate source, a
particular site has different senders and receivers at different stages of the protocol, which in
turn reduces the collusion problem. The immediate source may be considered, for the sake of
termination, in certain unusual scenarios; such scenarios are very rare and harmless when they
occasionally occur. When a site is selected for the second time, the itemsets received from another
site are added to its retained itemsets. At the end of round 2, all the itemsets are collected at the
initiator, which removes the spurious support counts included in the first round, so that the global
frequent itemsets are obtained at the initiator site.
As the termination criterion, a list is maintained to keep track of the sites that have already sent
their retained set. Once the list contains all the n-1 participating sites, excluding the current site, the
spurious itemsets of the current site are eliminated and the remaining itemsets are considered the
globally frequent itemsets. The steps of the proposed method are given in Algorithm 1 and
Algorithm 2 below.

Algorithm for Phase I

Description: Given a distributed environment with a list of participating sites S, each site having a
list of local frequent itemsets LLi(k), where LLei(k) represents the set of encrypted k-itemsets at site i
and X is an individual itemset in LLi(k), determine the global candidate itemsets, GCi(k), in a secure
manner using the ECC technique.
Input: Local frequent itemsets of the participating sites, LLi(k).
Number of participating sites, n
Output: Global candidate sets, GCi(k)
begin
// Encryption phase
For each site i do // each site computes its local frequent itemsets
    Compute LLi(k) using the Apriori algorithm
    LLei(k) = φ
End for
For each site i do // each site encrypts its own itemsets
    For each X ∈ LLi(k) do
        LLei(k) = LLei(k) ∪ { Ei(X) } // encrypt with site i's key
    End for
End for
For j = 1 to n do // encryption rounds around the ring
    If j = 1 then // each site starts with its own encrypted set
        Each site i sends LLei(k) to site (i+1) mod n // n = number of sites
    else
        Each site i encrypts all items in LLe((i-j) mod n)(k) with Ei and sends them to site (i+1) mod n
    End if
End for // at the end of the encryption phase, each itemset has been encrypted by all sites
// Decryption phase
For i = 1 to n do
    Site i decrypts the items in LLei(k) using Di and sends them to site (i+1) mod n
End for
The initiator prunes the duplicate items
GCi(k) = decrypted items of all sites
End.
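The encryption/decryption ring of Phase I can be sketched with a commutative cipher (Pohlig-Hellman style modular exponentiation) standing in for the paper's ECC layer, since layering order is irrelevant for a commutative cipher; the prime, the keys and the integer encoding of itemsets below are illustrative assumptions:

```python
# Sketch of Phase I with a commutative cipher (Pohlig-Hellman style
# exponentiation mod an assumed prime) standing in for the ECC layer.
import random

P = 2**61 - 1  # prime modulus; exponentiation mod P commutes across keys

def keygen():
    """Pick an encryption exponent e invertible mod P-1, with inverse d."""
    while True:
        e = random.randrange(3, P - 1)
        try:
            return e, pow(e, -1, P - 1)
        except ValueError:      # e not coprime to P-1; try again
            continue

def enc(key, m):
    return pow(m, key, P)

# Local frequent itemsets encoded as integers (assumed encoding).
local_sets = [{11, 13}, {11, 17}, {13, 19}]
keys = [keygen() for _ in local_sets]

# After the encryption rounds every itemset carries every site's layer;
# with a commutative cipher the order does not matter, so all keys are
# applied directly. Duplicate itemsets collapse inside the set.
pool = {m for s in local_sets for m in s}
for e, _ in keys:
    pool = {enc(e, c) for c in pool}

# Decryption rounds: each site peels off its own layer in turn.
for _, d in keys:
    pool = {pow(c, d, P) for c in pool}

print(sorted(pool))  # [11, 13, 17, 19] -- the global candidate set
```

Because duplicates collapse while still encrypted, no site learns which other site contributed a given candidate.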
Algorithm for Phase II

Description: Given a distributed environment with a list of participating sites S, each site having a
list of local frequent itemsets LLi(k), determine the global frequent itemsets G from the global
candidate sets GCi(k). The variables Site and Next_Site represent the active site and the
destination site respectively. The sub-algorithm Random_Site_Select is used to randomly select the
destination site.
// Site-level variables are represented using the structure SITE.
SITE [ Actual: original local frequent itemsets with support counts;
    Received: list of itemsets and support counts received from the source site;
    Retained: portion of the itemsets retained at the site;
    Released: portion of the modified itemsets to be released to another site;
    Modified: list used for intermediate processing;
    Spurious: spurious support counts added by the initiator ]
Input: Global candidate itemsets GCi(k) and randomly generated spurious support counts. Number of
participating sites n > 2. The site that initiates the mining process is the initiator.
Output: Global frequent itemsets.
Site.Spurious = Generate_Spurious(); // add spurious support counts to the global
                                     // candidate itemsets at the initiator
Site.Modified = Site.Actual + Site.Spurious;
Split Site.Modified into Site.Retained and Site.Released;
Site_Processing (Source, Received, Done_set, Site)
begin
    Site.Received = Received;
    Site.Source = Source;
    Site.Done_set = Done_set;
    If (Site.Retained is empty and Site.Released is empty) then
    begin
        If (Site.Received is not empty) then
            Site.Modified = Site.Modified + Site.Received;
            Split Site.Modified into Site.Retained and Site.Released;
        End if
    Else if (Site.Retained is not empty) then
    begin
        Site.Released = Site.Received + Site.Retained;
        Make an entry in Done_set;
    End if
    Next_Site = Random_Site_Select (Site.Source, Site.Done_set);
    If (Next_Site is not empty) then
        Site_Processing (Site, Site.Released, Site.Done_set, Next_Site);
    Else if (Next_Site is empty) then // termination condition
        Site.Released = Site.Received – Site.Spurious;
        Global Frequent Set = Site.Released;
End.

Sub-Algorithm
Random_Site_Select (Source, Done_set)
begin
    Site_available_selection = participating site list excluding Source and the Done_set entries;
    Random_next_site = select a random site from the list Site_available_selection;
    Return Random_next_site;
End.
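The Random_Site_Select sub-algorithm and the routing behaviour it produces can be sketched as follows; the site names are assumptions for illustration:

```python
# Sketch of Random_Site_Select: draw the next destination at random
# from the participating sites, excluding the immediate source and
# the sites already recorded in the done set.
import random

def random_site_select(sites, source, done_set):
    candidates = [s for s in sites if s != source and s not in done_set]
    return random.choice(candidates) if candidates else None

# Demo with four sites (names assumed); the initiator S1 starts in the
# done set so that it acts as the terminus of the tour.
sites = ["S1", "S2", "S3", "S4"]
source, done = "S1", {"S1"}
route = ["S1"]
while (nxt := random_site_select(sites, source, done)) is not None:
    done.add(nxt)
    route.append(nxt)
    source = nxt
print(route)  # e.g. ['S1', 'S3', 'S2', 'S4']: never the same site twice in a row
```

Returning None signals the termination condition of Algorithm 2: every other site has sent its retained set, so the message goes back to the initiator.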

Example
Assume that four sites namely site 1, site 2, site 3 and site 4 are taking part in the mining process.
Table 1 shows the actual local frequent itemsets of the individual sites.

Table 1: Site-wise local frequent itemsets

Site     Itemset   Local Support
Site 1   a         8
         b         8
         ab        6
Site 2   a         5
         b         5
         c         4
         ab        5
         ac        4
Site 3   b         10
         c         6
         d         2
Site 4   a         5
         b         5
         c         5
         ac        4

The total number of transactions at each site is 10 and the local minimum support is 2. The total number of
transactions across all four sites is 40 and the global minimum support is 8.

Figure 4: Communication sequence among sites

Input → a: 18, b: 18, c: 5, ab: 11, ac: 5
Output → a: 18, b: 28, c: 15, ab: 11, ac: 8

Round I communications (Site 1 → Site 2 → Site 4 → Site 3 → Site 1):
Site 1 → Site 2: c: 5, ac: 5; Site 2 → Site 4: c: 9, ac: 9;
Site 4 → Site 3: c: 14, ac: 13; Site 3 → Site 1: c: 20, ac: 13

Round II communications (Site 1 → Site 4 → Site 2 → Site 3 → Site 1):
Site 1 → Site 4: a: 18, b: 18, ab: 11; Site 4 → Site 2: a: 23, b: 23, ab: 11;
Site 2 → Site 3: a: 28, b: 28, ab: 16; Site 3 → Site 1: a: 28, b: 38, ab: 16

In this example, site 1 is the initiator of the mining process. In phase I, the encryption
and decryption process is carried out using the ECC algorithm. At the end of phase I, site 1 removes the
duplicate itemsets, resulting in the global candidate itemset collected at site 1: {a, b, c, ab, ac}.
In phase II, site 1 adds spurious support counts to all the itemsets along with its own support
counts, and then randomly divides them into two sets, the retained set and the released set. After
adding the spurious support counts {a: 10, b: 10, c: 5, ab: 5, ac: 5}, the input set contains the itemsets
{a: 18, b: 18, c: 5, ab: 11, ac: 5}, the retained set contains {a: 18, b: 18, ab: 11} and the released set
contains {c: 5, ac: 5}. While the retained set is kept at the site itself, the released set with its
spurious counts is sent to site 2, which is chosen at random. After receiving the itemsets from site
1, site 2 updates their counts and randomly chooses site 4 to receive the updated set. Site 4 repeats
the same steps and selects site 3 as its destination. Once the list contains all the n-1
participating sites excluding the current site, the current site sends the itemsets to the initiator.
Site 1 then enters the second round of the second phase and sends its retained set
to a randomly chosen site, site 4; the retained set of site 1 is now NULL. The same process is repeated
until the retained sets of all participating sites are NULL. A list is maintained to keep track of
the sites that have already sent their retained set. Once the list contains all the n-1 participating sites
excluding the current site, the modified itemsets at the current site are the global frequent itemsets and
are sent to the initiator, site 1. The spurious support counts added at the initiator site are
eliminated, and the remaining itemsets are considered the globally frequent itemsets. Thus the
globally frequent itemsets in this example are {a: 18, b: 28, c: 15, ab: 11, ac: 8}. The
communication sequence is depicted in Figure 4.
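The example's bookkeeping can be checked by summing the local supports of Table 1; the dictionary encoding below is an assumption for illustration:

```python
# Verify the worked example: summing the local supports of Table 1
# reproduces the global counts left after the initiator strips its
# spurious counts {a: 10, b: 10, c: 5, ab: 5, ac: 5}.
local = {
    1: {"a": 8, "b": 8, "ab": 6},
    2: {"a": 5, "b": 5, "c": 4, "ab": 5, "ac": 4},
    3: {"b": 10, "c": 6, "d": 2},
    4: {"a": 5, "b": 5, "c": 5, "ac": 4},
}
candidates = ["a", "b", "c", "ab", "ac"]
totals = {x: sum(counts.get(x, 0) for counts in local.values())
          for x in candidates}
print(totals)  # {'a': 18, 'b': 28, 'c': 15, 'ab': 11, 'ac': 8}

min_support = 8  # global minimum support over the 40 transactions
globally_frequent = {x: c for x, c in totals.items() if c >= min_support}
print(globally_frequent)  # all five candidates meet the threshold
```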

6. Performance Evaluation
A distributed environment with four participating sites was simulated to evaluate the performance of the
proposed algorithm. Each site is a P-IV, 2.8 GHz machine with 2 GB RAM running the Windows operating
system. The proposed method is implemented using Java and MS-Access. Three data sources
of size 10K, 50K and 100K were synthetically generated to study the performance of the proposed work;
Table 2 shows the characteristics of the data sources. The local frequent itemsets of the participating
sites, which are the inputs to the algorithm, are generated using the Apriori algorithm for supports
varying from 10% to 30%.

Table 2: Data source characteristics

Data Set Average Transaction Length (ATL) No of Distinct Items |I| No. of Records |D|
10K 7 10 10000
50K 7 10 50000
100K 9 12 100000

Ashrafi, Taniar & Smith (2005) proposed a Privacy Preserving Distributed Association Rule
Mining Algorithm (PPDAM) which follows secure multiparty computation. This is a well-accepted
method, so we compare the performance of the proposed method against it as a benchmark and analyze
the effectiveness of the proposed approach.
Transferring voluminous data over a network can take an extremely long time and incur a
prohibitive communication cost; even a small quantity of data may create problems in wireless
network environments with limited bandwidth. In the process of preserving privacy in a distributed
environment, existing methods suffer from high communication cost. Figure 5 compares the
communication cost of the proposed method with that of PPDAM, where communication cost denotes
the number of communications among the sites.
The total communication cost (Tc) for finding global frequent itemsets using PPDAM is
620 M. Rajalakshmi and T. Purusothaman

Tc = 2n + (n-1)*Mc - 1 + (n-1)*Ml (1)

where
n = number of participating sites
Mc = maximum length of a candidate itemset
Ml = maximum length of a frequent itemset
2n = communication cost for finding 1-length itemsets
(n-1)*Mc - 1 = communication cost for broadcasting candidate itemsets to the initiator
(n-1)*Ml = communication cost for broadcasting frequent itemsets to all other participating sites
The total communication cost according to the proposed method is

Tc = 2n + 2(n/2) (2)

where
2n = communication cost for finding the global candidate sets
n/2 = communication cost for calculating the support counts of one half of the global candidate sets
It is evident from the graph in Figure 5 that as the support decreases, the number of
communications for PPDAM increases because more local frequent itemsets are generated. In the
proposed method, however, the number of communications remains the same because all the local
frequent itemsets of an individual site are sent within three communications; hence for n sites it
takes 3n communications irrespective of the value of n. In the example described in the previous
section, generating the 2-length itemset 'ab' takes only 12 communications with our method but 16
with PPDAM. Thus, the proposed method notably reduces the communication among the sites.
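Under our reading of Eqs. (1) and (2), the two costs can be computed directly. Note that Eq. (1) is evaluated literally as written, and the parameter values Mc = 2, Ml = 1 that reproduce the quoted 16 messages for PPDAM on the running example (n = 4) are our assumption.

```python
def tc_ppdam(n, mc, ml):
    # Eq. (1), read literally: 2n for the 1-length itemsets,
    # (n-1)*Mc - 1 for broadcasting candidates to the initiator,
    # (n-1)*Ml for broadcasting frequent itemsets to the other sites.
    return 2 * n + (n - 1) * mc - 1 + (n - 1) * ml

def tc_proposed(n):
    # Eq. (2): 2n for global candidate generation plus two rounds of n/2
    # communications for the support counts of each half of the candidates.
    return 2 * n + 2 * (n // 2)

# Running example: n = 4 sites, assumed Mc = 2 and Ml = 1 for PPDAM.
costs = tc_ppdam(4, 2, 1), tc_proposed(4)
```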

Figure 5: Comparison of Communication Cost

[Chart: |D|=100K, |I|=12, ATL=9; No. of sites = 4; x-axis: Support (%) at 10, 20, 30;
y-axis: No. of Communications (0-30); series: PPDAM, Proposed Method]

(a) Communication cost for 100K data source for different support (%)
Privacy Preserving Distributed Data Mining using Randomized Site Selection 621
Figure 5: Comparison of Communication Cost - continued

[Chart: |D|=100K, |I|=12, ATL=9; Support = 10%; x-axis: No. of Sites at 4, 8, 12;
y-axis: No. of Communications (0-120); series: PPDAM, Proposed Method]

(b) Communication cost for 100K data source for varying number of sites

In the PPDAM approach, the communication among sites happens in a sequential order, which
reveals the sender and the receiver of a particular site. In such a case, when the sender colludes
with the receiver, the sensitive information of the intermediate site can be revealed. In the example
illustrated in the previous section, if the existing approach is followed, all 1-length itemsets are
communicated, say, in the sequence Site 1, Site 2, Site 3 and Site 4. To learn the 1-length itemset
of Site 2, Sites 1 and 3 can collude: during obfuscation they determine the sum of the actual itemset
support and the added random number, and during deobfuscation they determine the actual random
number added to the itemset in Site 2. Using both pieces of information, the actual support count of
the 1-length itemset of Site 2 can be obtained by mere subtraction.
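This subtraction attack can be made concrete with a short sketch. The support values and the random-number range here are illustrative, and the pass is simplified to a single obfuscation round.

```python
import random

def sequential_secure_sum_attack():
    """Illustration of the collusion weakness in a fixed communication order
    (Site 1 -> Site 2 -> Site 3 -> Site 4): the two sites bracketing Site 2
    can recover its private count by subtraction. All counts are illustrative."""
    rng = random.Random(1)
    private = [5, 6, 4, 3]            # hypothetical local supports of itemset 'a'
    r = rng.randint(1, 1000)          # obfuscating random number of Site 1
    s1_out = private[0] + r           # Site 1 -> Site 2 (Site 1 knows this)
    s2_out = s1_out + private[1]      # Site 2 -> Site 3 (Site 3 receives this)
    # Sites 1 and 3 collude: the difference exposes Site 2's private count.
    recovered = s2_out - s1_out
    return recovered, private[1]
```

With randomized site selection, Site 2's sender and receiver change unpredictably each round, so no fixed pair of sites brackets it.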
In the proposed method, the global candidate itemsets are generated with the help of ECC, after
which the global frequent itemsets are computed using randomized site selection. This random
selection means that two colluding sites are no longer sufficient; all n-1 sites must collude to
identify the sensitive information of a particular site. Since colluders thus find it far harder to
obtain the sensitive information of a particular site, the proposed method is more secure than the
existing methods.
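The construction behind the ECC step is not spelled out here; one standard building block in such protocols is commutative encryption by elliptic-curve scalar multiplication, where each site's secret scalar can be applied in either order. A toy sketch over the cryptographically insecure curve y^2 = x^3 + 2x + 3 over F_97; the base point G and the scalars k1, k2 (standing in for two sites' secret keys) are our assumptions:

```python
def ec_add(P, Q, a, p):
    """Add two points on y^2 = x^3 + a*x + b over F_p (None = infinity)."""
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return None                              # P + (-P) = infinity
    if P == Q:                                   # point doubling
        m = (3 * x1 * x1 + a) * pow(2 * y1, -1, p) % p
    else:                                        # chord addition
        m = (y2 - y1) * pow(x2 - x1, -1, p) % p
    x3 = (m * m - x1 - x2) % p
    return (x3, (m * (x1 - x3) - y1) % p)

def ec_mul(k, P, a, p):
    """Double-and-add scalar multiplication k*P."""
    R = None
    while k:
        if k & 1:
            R = ec_add(R, P, a, p)
        P = ec_add(P, P, a, p)
        k >>= 1
    return R

# Toy curve y^2 = x^3 + 2x + 3 over F_97 with base point G = (3, 6);
# k1 and k2 stand in for two sites' hypothetical secret keys.
a, p, G = 2, 97, (3, 6)
k1, k2 = 11, 23
both_orders_agree = (ec_mul(k1, ec_mul(k2, G, a, p), a, p)
                     == ec_mul(k2, ec_mul(k1, G, a, p), a, p))
```

Commutativity (k1*(k2*G) = k2*(k1*G)) is what lets independently encrypted itemsets be compared without any site revealing its own contribution.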
Time complexity is another major criterion considered for performance evaluation. Figure 6
compares the time complexity of the proposed method with that of the existing method. From the
simulation results, the proposed method takes less time than PPDAM. All experiments are conducted
for supports varying from 30% to 10%. The time required to generate the global frequent itemsets on
the 100K data source ranges from 0.45 to 12.34 seconds for the existing method and from 0.36 to 5.6
seconds for the proposed method. From this, it is evident that the proposed method performs better
on dense data sets.
Figure 6: Comparison of Time Complexity

[Charts: x-axis: minimum support (%) at 10, 20, 30; y-axis: time in seconds;
series: PPDAM, Proposed Method]

(a) Time complexity of 10K data source (|D|=10K, |I|=10, ATL=7; No. of sites = 4)
(b) Time complexity of 50K data source (|D|=50K, |I|=10, ATL=7; No. of sites = 4)
(c) Time complexity of 100K data source (|D|=100K, |I|=12, ATL=9; No. of sites = 4)



From the experimental results, it can be concluded that the proposed method offers better
accuracy and privacy, lower communication cost, and lower time complexity than the existing
approaches in determining the global frequent itemsets.

7. Conclusion and Future Work


Privacy preserving data mining techniques for distributed environments have been studied, and a novel
algorithm for finding global frequent itemsets has been proposed and implemented. It is found that the
proposed method securely determines global frequent itemsets with minimal communication overhead
and time complexity. The proposed method can be used in all existing setups where secure
multiparty computation is used for global frequent itemset generation, including medical
research, fraud detection, market basket analysis and supply chain management. This work can be
extended to preserve privacy among heterogeneous data sources.

References
[1] Agrawal, R. & Srikant, R. (1994). Fast Algorithms for Mining Association Rules. In
Proceedings of the 20th VLDB Conference Santiago, Chile, 487-499.
[2] Ashrafi, M., Taniar, D. & Smith, K. (2004). ODAM: An Optimized Distributed Association
Rule Mining Algorithm. IEEE Distributed Systems Online, 5(3).
[3] Ashrafi, M., Taniar, D. & Smith, K. (2005). Privacy-Preserving Distributed Association Rule
Mining Algorithm. International Journal of Intelligent Information Technologies, 1(1), 46-69.
[4] Atallah, M., Bertino, E., Elmagarmid, A., Ibrahim, M. & Verykios, V.S. (1999). Disclosure
limitation of sensitive rules. In Proceedings of the IEEE Knowledge and Data Exchange
Workshop (KDEX'99). IEEE Computer Society, 45-52.
[5] Burnett, A., Winters, K. & Dowling, T. (2002). A Java Implementation of an Elliptic Curve
Cryptosystem. In Proceedings of the Inaugural Conference on the Principles and Practice of
Programming in Java.
[6] Cheung, D., Ng, V., Fu, A. & Fu, Y.(1996). Efficient Mining of Association Rules in
Distributed Databases. IEEE Transactions on Knowledge and Data Engineering. 8(6), 911-922.
[7] Clifton, C. (2001). Secure Multiparty Computation Problems and Their Applications: A Review
and Open Problems. In Proceedings of the Workshop on New Security Paradigms, Cloudcroft,
New Mexico.
[8] Clifton, C., Kantarcioglu, M. & Vaidya, J.(2004). Defining privacy for data mining. Book
Chapter Data Mining, Next generation challenges and future directions.
[9] Goldreich, O., Micali, S. & Wigderson, A. (1987). How to Play any Mental Game - A
Completeness Theorem for Protocols with Honest Majority. In 19th ACM Symposium on the
Theory of Computing, 218-229.
[10] Goldreich, O. (1998). Secure Multiparty Computation (Working Draft).
[11] Han, J. & Kamber, M. (2001). Data Mining: Concepts and Technique. Morgan Kaufmann
Publishers.
[12] Han, J., Pei, J., Yin, Y. & Mao, R. (2004). Mining Frequent Patterns without Candidate
Generation: A Frequent-Pattern Tree Approach. Data Mining and Knowledge Discovery, 8(1),
53-87.
[13] Inan, A., Saygin, Y., Savas, E., Hintoglu, A.A. & Levi, A. (2006). Privacy Preserving
Clustering on Horizontally Partitioned Data. In Proceedings of the 22nd International Conference
on Data Engineering Workshops (ICDEW'06).
[14] Kantarcioglu, M. & Clifton, C. (2004). Privacy-Preserving Distributed Mining of Association
Rules on Horizontally Partitioned Data. IEEE Transactions on Knowledge and Data
Engineering, 16(9).
624 M. Rajalakshmi and T. Purusothaman

[15] Lee,G. ,Chang ,C. & Chen, A.L.P .(2004). Hiding sensitive patterns in association rules mining.
In Proceedings of the 28th Annual International Computer Software and Applications
Conference (COMPSAC'04).
[16] Lin, D. & Kedem, Z.M. (1998). Pincer-Search: A New Algorithm for Discovering the
Maximum Frequent Set. In Proceedings of the 6th International Conference on Extending
Database Technology (EDBT), Valencia, 105-119.
[17] Lin, D. & Kedem, Z.M. (2002). Pincer-Search: An Efficient Algorithm for Discovering the
Maximum Frequent Set. IEEE Transactions on Knowledge and Data Engineering, 14(3),
553-566.
[18] Lindell, Y. & Pinkas, B.(2009). Secure Multiparty Computation for Privacy-Preserving Data
Mining. In the Journal of Privacy and Confidentiality, 59-98.
[19] Luo, W. (2006). An Algorithm for Privacy-preserving Quantitative Association Rules Mining.
In Proceedings of the 2nd IEEE International Symposium.
[20] Nedunchezhian, R. & Anbumani, K. (2006). Rapid Privacy Preserving Algorithm for Large
Databases. International Journal of Intelligent Information Technologies. 2(1), 68-81.
[21] Rajalakshmi, M., Purusothaman, T. & Pratheeba, S. (2010). Collusion-Free Privacy
Preserving Data Mining. International Journal of Intelligent Information Technologies, 6(4),
30-45.
[22] Saygin, Y., Verykios, S. & Elmagarmid, K. (2002). Privacy Preserving Association Rule
Mining. In Proc. 12th International Workshop on Research Issues in Data Engineering:
Engineering E-Commerce/E-Business Systems (RIDE'02), San Jose, USA, 151-158.
[23] Oliveira, S.R.M. & Zaïane, O.R. (2003). Algorithms for Balancing Privacy and
Knowledge Discovery in Association Rule Mining. In Proceedings of the Seventh International
Database Engineering and Applications Symposium (IDEAS'03), Hong Kong, China, 54-65.
[24] Vaidya, J. & Clifton, C. (2002). Privacy Preserving Association Rule Mining in Vertically
Partitioned Data. In Proceedings of SIGKDD '02, ACM.
[25] Vaidya, J. & Clifton, C. ( 2004). Privacy-Preserving Data Mining: Why, How, and When. IEEE
Security and Privacy.
[26] Verykios, V.S., Bertino, E., Provenza, I., Saygin, Y. & Theodoridis, Y. (2004). State-of-the-Art
in Privacy Preserving Data Mining. ACM SIGMOD Record, 33(1), 50-57.
[27] Verykios, V.S., Elmagarmid, A.K., Bertino, E., Saygin, Y. & Dasseni, E. (2004). Association
Rule Hiding. IEEE Transactions on Knowledge and Data Engineering, 16(4), 434-447.
[28] Wang, S., Lee, Y., Billis, S. & Jafari, A. (2004). Hiding Sensitive Items in Privacy Preserving
Association Rule Mining. IEEE International Conference on Systems, Man and Cybernetics.
[29] Wu, Y., Chiang, C. & Chen, A. (2007). Hiding Sensitive Association Rules with Limited Side
Effects. IEEE Transactions on Knowledge and Data Engineering, 19(1).
