
2009 Second International Symposium on Computational Intelligence and Design

Privacy Parallel Algorithm for Mining Association Rules and Its Application in HRM

XuePing Zhang 1,2, YanXia Zhu 1, Nan Hua 3

1. College of Information Science and Engineering, Henan University of Technology, Zhengzhou 450001, China
2. Key Laboratory of Spatial Data Mining & Information Sharing of Ministry of Education, Fuzhou 350002, China
3. Technology Department, Zhengzhou SuperHW Network Technology Co. Ltd, Zhengzhou 450000, China

Abstract-Parallel association rule mining has improved the efficiency of data mining, but it also raises the problem of privacy preservation. This paper introduces a simple and effective privacy-preserving method for parallel association rule mining, the Parallel Association Rules Mining Algorithm with Privacy preserving (PARMA-P). By applying an input Hash assignment strategy to the frequent item sets of the FP sub-trees, it conceals the frequent item sets and thereby protects the association rules. The algorithm has been used in the HRM of an enterprise, and experiments show that it protects data privacy simply and effectively.

Keywords-parallel data mining; rules mining; FP-tree; privacy preserving; Hash assignment strategy

I. INTRODUCTION

The benefits brought by data mining have drawn the attention of more and more business people and scientists. Meanwhile, privacy and information security are indispensable and will become key to enterprise competition. This has made privacy-preserving data mining, and especially parallel privacy preservation, a new research topic.

Many researchers have proposed algorithms for parallel data mining in the past few years, such as CD[1], PDM[2], DD[3] and FDM[4]. These methods are based on the distributed-processor principle[5][6][7]: each processor has its own memory and disk space, and the processors communicate with each other through the internal interconnection mechanism of the network.

The existing privacy protection techniques mainly include heuristic techniques, secure multi-party computation techniques and reconstruction techniques[8][9][10]. This article introduces a privacy-preserving parallel algorithm that combines the characteristics of the heuristic and secure multi-party techniques. It conceals the frequent item sets and thereby protects the privacy of the association rules. It has been applied to the HRM of a company and protects private data effectively.

II. PARMA-P ALGORITHM

A. Principle of PARMA-P

PARMA-P protects frequent item sets by using a heuristic technique and an input Hash allocation strategy: the host, as the trusted side, is responsible for the primary coding and for distributing the sub-FP trees by means of a Hash function. When sub-computers apply for tasks, the host hands out coded sub-trees; it later reverses the coding on the combined results of the parallel sub-mining, which protects the privacy of the association rules.

B. Description of PARMA-P

• Hash distribution and coding

Suppose $\{\alpha_1, \alpha_2, \ldots, \alpha_n\}$ is the set of frequent 1-itemsets. The subset of the FP tree whose suffix is $\alpha_i$ ($i = 1, 2, \ldots, n$) is called a suffix FP-sub-tree group, denoted $\alpha_i$-FP ($i = 1, 2, \ldots, n$). The head pointers of these groups are put in a linear list and given distribution flags: '0' indicates undistributed and '1' indicates distributed. When a sub-computer applies, the host allocates a sub-tree at random according to a Hash function. The Hash function is as follows:

$$\mathrm{Hash}(group) = \bigl(\mathrm{Rand}(\alpha) \times 100\bigr) \bmod \mathrm{count}(\text{frequent-itemsets}) \tag{1.1}$$

The results of the distribution are stored in the database, where they are used later to reduce (decode) the frequent item sets. Conflicts produced during distribution are resolved by linear probing. The frequent 1-itemsets are then mapped to a two-dimensional table by the following hash function:

$$\mathrm{Hash}(value) = \mathrm{char}\bigl(\lambda + \mathrm{convert}(\mathrm{int}(\mathrm{Rand}(\alpha) \times \beta),\ \mathrm{string})\bigr) \tag{1.2}$$

$$\beta = \mathrm{count}(\text{frequent-itemsets})^{2}, \qquad \lambda = \mathrm{Hash}(group) + 100$$

where $\lambda$ prevents coincident codes from emerging and keeps the branches of each suffix FP-sub-tree group valid.
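The allocation and coding steps of Eqs. (1.1) and (1.2) can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names `allocate_group` and `build_value_codes` are hypothetical, collisions are resolved by stepping to the next group (linear probing), and the letter prefix (F, G, ...) is assumed to be assigned one per item, as in the paper's mapping table, rather than derived from the prefix term of Eq. (1.2).

```python
import random


def allocate_group(flags, rng):
    """Allocate one suffix FP-sub-tree group to an applicant.

    flags[i] is 0 (undistributed) or 1 (distributed). The start slot follows
    Eq. (1.1); a collision moves on to the next slot (linear probing).
    Returns the allocated index, or None if every group is already taken.
    """
    n = len(flags)
    if all(flag == 1 for flag in flags):
        return None
    start = int(rng.random() * 100) % n  # Hash(group) = (Rand * 100) mod n
    for step in range(n):
        idx = (start + step) % n
        if flags[idx] == 0:
            flags[idx] = 1  # mark the group as distributed
            return idx
    return None


def build_value_codes(items, rng):
    """Build one random alias per (item, group) pair in the spirit of Eq. (1.2).

    Assumption: each item gets its own letter prefix, so aliases of different
    items can never coincide; the number is drawn below beta = n ** 2.
    """
    n = len(items)
    beta = n * n
    codes = {}
    for offset, item in enumerate(items):
        prefix = chr(ord("F") + offset)  # hypothetical prefix scheme
        codes[item] = [prefix + str(int(rng.random() * beta)) for _ in range(n)]
    return codes


if __name__ == "__main__":
    rng = random.Random(2009)
    flags = [0, 1, 0, 0, 0]  # group b-FP is already distributed
    print(allocate_group(flags, rng), flags)
    print(build_value_codes("abcde", rng))
```

Keeping the flags in a plain list mirrors the linear list of head pointers described above; in a deployment the host would persist both the flags and the aliases, since the aliases are needed later to decode the results.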

978-0-7695-3865-5/09 $26.00 © 2009 IEEE 296

DOI 10.1109/ISCID.2009.220
For example, the suffix FP-sub-tree groups of Table III are denoted $\alpha$-FP ($\alpha$ = a, b, c, d, e). Their initial allocation table is shown in Table I, where IP is the identity code of the sub-computer; the host initially leaves it unset, as the empty IP row shows.

TABLE I. THE INITIAL ALLOCATION TABLE

ID     1      2      3      4      5
Item   a-FP   b-FP   c-FP   d-FP   e-FP
Flag   0      0      0      0      0
IP

When an application is sent, the host verifies the identity of the sub-computer, allocates a suffix FP-sub-tree through the Hash function, and updates the IP and flag in Table I. A collision is resolved by linear probing. For example, if the first hash pointer produced is b-FP but its flag is '1', that is, b-FP has already been allocated to another sub-computer, the host checks the state of the following suffix FP-sub-trees until it finds one whose flag is '0', and assigns that one to the applicant. The privacy protection of the data is implemented by a heuristic technique, here mainly the mapping of the frequent item sets. Table II shows the result of the random hash mapping of the frequent 1-itemsets.

TABLE II. THE MAPPING OF FREQUENT 1-ITEMSETS

F_ID   Item   a-FP   b-FP   c-FP   d-FP
F      a      F3     F4     F9     F14
G      b      G18    G11    G6     G11
H      c      H8     H22    H25    H22
I      d      I17    I16    I13    I17
J      e      J14    J9     J2     J15

Because the hash codes are generated randomly and each suffix FP-sub-tree uses a different allocation strategy, the original data obtains reliable privacy protection.

• Process of PARMA-P

Given the transaction database shown in Table III (min-sup = 23%, that is, a support count of 2), the FP tree generated by the FP-Growth algorithm is illustrated in Figure 1 and the distribution list is shown in Table IV.

TABLE III. THE TRANSACTION DATABASE

TID    Record
T10    {a, b, e}
T20    {b, d}
T30    {b, c}
T40    {a, b, d}
T50    {a, c}
T60    {b, c}
T70    {a, c}
T80    {a, b, c, e}
T90    {a, b, c}

TABLE IV. THE DISTRIBUTION LIST

ID     1                2      3                4      5
Item   a-FP             b-FP   c-FP             d-FP   e-FP
Flag   0                0      0                0      0
IP     A: 192.168.0.1          B: 192.168.0.1

Figure 1. FP tree of transaction database (Table III)

As Table IV shows, sub-computer A is allocated the a-FP tree (Figure 2) and sub-computer B the c-FP tree (Figure 3); the other, undistributed suffix FP-sub-trees are processed by the host. The node elements of each allocated sub-tree are then updated according to Table II, which gives the mapped trees, called the a'-FP tree (Figure 4) and the c'-FP tree (Figure 5); the others need no mapping. When a new application arrives, the host executes a similar process.

Figure 2. a-FP tree

Figure 3. c-FP tree

The branches of each $\alpha'$-FP tree (such as $\alpha'$ = F3 or H25) all end in F3 or H25, and, by the monotonicity property of the support measure, the support count equals the weight of the leaf node. So the frequent item sets generated by the a'-FP tree are as follows:

F3:6, G18F3:4; and the frequent item sets generated by the c'-FP tree are as follows: H25:6, F9H25:4, G6H25:4, G6F9H25:2.

Figure 4. a'-FP tree

Figure 5. c'-FP tree

The results produced by each sub-computer are returned to the host and then reduced (decoded) according to Table II. The frequent item sets of the a'-FP tree are reduced to a:6, ba:4, and those of the c'-FP tree to c:6, ac:4, bc:4, bac:2. Because the elements of the frequent item sets differ between sub-trees and the sub-computers cannot reverse the mapping, the privacy of the data is protected.

C. Related Property and Theorem [11]

• Property of frequent item sets: if an item set is frequent, then all of its subsets must also be frequent.
• Monotonicity property of support counting: let I be a set of items and $J = 2^{I}$ be the power set of I. A measure f is monotone (or upward closed) if

$$\forall X, Y \in J:\ (X \subseteq Y) \rightarrow f(X) \le f(Y)$$

which means that if X is a subset of Y, then f(X) must not exceed f(Y). On the other hand, f is anti-monotone (or downward closed) if

$$\forall X, Y \in J:\ (X \subseteq Y) \rightarrow f(Y) \le f(X)$$

which means that if X is a subset of Y, then f(Y) must not exceed f(X).

D. Key algorithms

    /* Distribution of suffix FP-sub-trees */
    CREATE PROCEDURE Sub_FPtree_assignment
        (@sub_IP varchar(50))
    AS
    BEGIN
        DECLARE @id int, @n int
        SELECT @n = COUNT(item) FROM distribute_table  -- sub-tree dispatch table
        -- allocate only while some sub-tree is still undistributed (flag = 0)
        IF (EXISTS (SELECT * FROM distribute_table WHERE flag = 0))
        BEGIN
            -- Eq. (1.1); T-SQL uses % for modulo, and ids run from 1 to @n
            SET @id = CAST(RAND() * 100 AS int) % @n + 1
            -- linear probing: step to the next id until a free sub-tree is found
            WHILE (SELECT flag FROM distribute_table WHERE id = @id) <> 0
                SET @id = @id % @n + 1
            UPDATE distribute_table
            SET ip = @sub_IP, flag = 1 WHERE id = @id
        END
    END

    /* Frequent-item-set mapping */
    CREATE PROCEDURE Frequent_itemsets_convert
        @sub_IP varchar(50), @item varchar(50)
    AS
    ...
    FROM (SELECT *
          FROM FP_Tree
          WHERE (path_id IN
                 (SELECT path_id
                  ...
                  WHERE item = @item))) DERIVEDTBL
    ...
         (SELECT item_id
          FROM FP_Tree
          WHERE (item = @item)
          GROUP BY item_id))
    ORDER BY id

    -- copy the item/alias pairs, then write the alias into the extracted paths
    SELECT item, alias INTO hash_table_temp
    FROM hash_table
    UPDATE path_temp
    SET ip = @sub_IP, alias = (SELECT TOP 1 alias
                               FROM hash_table_temp
                               WHERE item = path_temp.item)

III. PERFORMANCE EVALUATION

The experiments in this article were run on three PCs with identical configuration: Pentium IV 2.0 GHz CPU, 512 MB memory, Windows XP operating system, SQL Server 2000 database platform and the C# language. Table V displays the hash codes of the frequent 1-itemsets. Table VI and Table VII show the frequent item sets of each sub-computer respectively.

TABLE V. THE HASH CODES OF FREQUENT 1-ITEMSETS

item                    G_01   G_02   G_03   G_04   G_11   G_12
Nationality: hui        A3     A30    A111   A103   A113   A109
Register State:         B32    B126   B148   B25    B127   B43
Sex: female             C112   C7     C147   C139   C126   C111
Position: salesman      D81    D168   D9     D145   D29    D109
Body_weight: 45-55      E125   E43    E46    E167   E31    E45
Nationality: han        F76    F34    F28    F84    F122   F77
Education: student      G70    G29    G149   G28    G105   G136
Body_height: 160-170    H46    H2     H96    H28    H67    H63
Body_height: 150-160    I7     I136   I131   I140   I85    I137
Age: 20-23              J135   J139   J101   J126   J61    J13
Nationality: bachelor   K84    K152   K164   K156   K82    K146
Register-State: chore   L47    L136   L12    L165   L138   L58
Age: 22-24              M76    M160   M30    M48    M63    M7

TABLE VI. THE FREQUENT ITEMSETS OF SUB COMPUTER A

Frequent_Item_set       level   support   CheckSum
B…|E…|F…|I…|G…|         …       …         …
B…|F…|I…|J…|G…|         …       …         …
E…|F…|I…|G…|            …       …         …
E…|I…|J…|G…|            …       …         …

TABLE VII. THE FREQUENT ITEMSETS OF SUB COMPUTER B

Frequent_Item_set       level   support   CheckSum
C…|D…|H…|A…|L…|         …       …         …
C…|D…|A…|L…|            …       …         …
C…|H…|A…|L…|            …       …         …
C…|A…|A…|L…|            …       …         …
C…|A…|L…|               …       …         …

The example above shows the mining results of sub-computers A and B after a single application. After completing their assigned tasks, sub-computers A and B may apply for new tasks, and the host redistributes sub-trees with new Hash values according to the current state. Throughout the process the raw data is completely and effectively protected from the sub-computers, which improves the security of data mining.

IV. CONCLUSION

In this paper, the proposed PARMA-P algorithm applies a privacy protection method to the parallel mining algorithm. It takes the host as the trusted side and uses an input-based allocation strategy that takes the sub-machine IP as a hash factor, so as to achieve a dual Hash mapping of the frequent item sets and ultimately protect the privacy of the association rules. The experiments show that the algorithm protects data privacy well and is practical.

ACKNOWLEDGMENT

This work is supported by the Program for New Century Excellent Talents in University of the Ministry of Education (NCET-08-0660), the Open Fund Item of the Key Laboratory of Spatial Data Mining & Information Sharing of Ministry of Education (200807), the Program for Science & Technology Innovation Talents in Universities of Henan Province (2008HASTIT012), the National Science Foundation of Henan Province (0511011000), the Science and Technology key projects of Henan Province (0624220081) and the Science and Technology key projects of Zhengzhou (064SGDG25127-9). Moreover, this project is a part of the Ph.D. Programs Foundation Research Projects of Henan University of Technology.

REFERENCES

[1] Agrawal R, Shafer J. Parallel mining of association rules[J]. IEEE Trans. on Knowledge and Data Engineering,
[2] Zou Q, Chu W, Lu B. SmartMiner: A depth first algorithm guided by tail information for mining maximal frequent item sets[Z]. In Proc. of the IEEE International Conference on Data Mining. Japan: 2002.
[3] Zaki M J. Parallel and Distributed Association Mining: A Survey[J]. IEEE Concurrency, Special Issue on Parallel Mechanisms for Data Mining, 1999, 7(4): 14-25.
[4] Cheung D, Xiao Y. Effect of Data Skewness in Parallel Mining of Association Rules[C]. Melbourne, Australia: The 12th Pacific-Asia Conference on Knowledge Discovery and Data Mining, 1998. 48-60.
[5] YongHeng Wang, ShuQiang Yang, Yan Jia. An Efficient Method for the Parallel Mining of Frequent Item sets in Very Large Text Database[J]. Computer Engineering and Science,
[6] Lei Wu, Peng Chen. Updated algorithm for mining association rules based on paralleled computation[J]. Computer Applications, 2005, 25(9): 1990-1991.
[7] Tao Chen, Wei Zhang. An Improved Paralleled Algorithm for Mining Association Rules[J]. Computer Technology and Development, 2007, 17(1): 139-141.
[8] XueMing Li, ZhiJun Liu, DongXia Qin. Privacy Preserving data mining[J]. Application Research of Computers, 2008, 25(12): 3550-3555.
[9] TingHuai Ma, MeiLi Tang. Data Mining Based on Privacy Preserving[J]. Computer Engineering, 2008, 34(9): 78-80.
[10] Peng Zhang, YunHai Tong, ShiWei Tang, et al. An Effective Method for Privacy Preserving Association Rule Mining[J]. Journal of Software, 2006, 17(8): 1764-1774.
[11] PangNing Tan, Michael Steinbach, Vipin Kumar, et al. Introduction to Data Mining[M]. Posts & Telecom Press, 2006. 247.