Scalable, Distributed and Dynamic Mining of Association Rules
1 Introduction
The process of generating large itemsets [1] is an important component of an
Association Rule Mining (ARM) algorithm. This process involves intensive disk
access, which calls for fast and efficient algorithms for generating large
itemsets. Incremental mining algorithms [5] have been proposed to handle
additions of transactions to, and deletions from, an existing database. Besides
the data itself, knowledge and the values of parameters such as support also
change dynamically. We call mining under such a dynamic environment dynamic
mining. In this paper, we propose a novel tree structure called the Pattern
Count tree (PC-tree), which is a complete and compact representation of the
database and is ideally suited for dynamic mining.
Existing parallel and distributed ARM algorithms [4] (i) require at least two
database scans and (ii) generate and test candidate itemsets to obtain large
itemsets. The sequential ARM algorithm based on the frequent pattern tree
(FP-tree), however, is shown to be superior [3] to the existing ones because it
only tests candidate itemsets for possible largeness, without generating them,
which reduces the computational requirements significantly. Even so, the
FP-tree based algorithm requires two database scans to generate frequent
itemsets.
This motivated us to propose a scheme that employs an abstraction similar to
the FP-tree but more general and complete: the Pattern Count tree (PC-tree). A
PC-tree can be constructed based on a single database scan and can be updated
dynamically. We show that we can construct a unique ordered FP-tree, called
M. Valero, V.K. Prasanna, and S. Vajapeyam (Eds.): HiPC 2000, LNCS 1970, pp. 559–566, 2000.
© Springer-Verlag Berlin Heidelberg 2000
V.S. Ananthanarayana, D.K. Subramanian, and M.N. Murty
FIGURE B: PC-TREE (surviving node labels: 527:1, 510:1).
In Figure B, a vertical link represents the c-pointer and a horizontal link
represents the s-pointer. Each node has two parts: item-name and count value.
Fig. 1. Example transaction database and the PC-tree Structure for storing patterns
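The node layout described above can be sketched in code. The following is a minimal illustration, not the authors' implementation: it assumes that the c-pointer leads to a node's first child, that the s-pointer leads to its next sibling, that each transaction is inserted as a sorted item path, and that transactions sharing a prefix share nodes. The names PCNode, insert_pattern, build_pc_tree and patterns are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class PCNode:
    """One node: item-name and count, plus c-pointer and s-pointer."""
    item: Optional[str]
    count: int = 0
    child: Optional["PCNode"] = None    # c-pointer: first child (assumed)
    sibling: Optional["PCNode"] = None  # s-pointer: next sibling (assumed)


def insert_pattern(root: PCNode, transaction: Iterable[str]) -> None:
    """Insert one transaction; an existing prefix only has its counts raised."""
    parent = root
    for item in sorted(set(transaction)):    # assumed canonical item order
        node, last = parent.child, None
        while node is not None and node.item != item:
            last, node = node, node.sibling  # walk the sibling chain
        if node is None:                     # item not yet among the children
            node = PCNode(item)
            if last is None:
                parent.child = node
            else:
                last.sibling = node
        node.count += 1
        parent = node


def build_pc_tree(transactions) -> PCNode:
    """Build the tree in a single pass over the transactions."""
    root = PCNode(item=None)
    for transaction in transactions:
        insert_pattern(root, transaction)
    return root


def patterns(node: PCNode, prefix=()):
    """Yield (item path, count) for every node below `node`."""
    child = node.child
    while child is not None:
        path = prefix + (child.item,)
        yield path, child.count
        yield from patterns(child, path)
        child = child.sibling
```

Calling build_pc_tree on a list of transactions and then dict(patterns(root)) lists every stored prefix with its count, which is one way to check that no information from the database is lost.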
[Figure: storage space in bytes (x 10^7) versus data set used, comparing DB
size, PC-tree size and LOFP-tree size; one panel for |T| = 10 and one for
|T| = 15.]
3 Dynamic Mining
Dynamic mining is the mining procedure for generating large itemsets that
accounts for change of data, change of knowledge, and change of the values of
parameters such as support. In the literature on ARM, mining under change of
data is called incremental mining. We explain below how the PC-tree handles
incremental mining, change of knowledge, and change of support value.
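As a rough illustration of two of these cases, the sketch below treats change of data as adding or deleting a transaction path in a count-annotated prefix tree, and treats change of support as re-reading the counts already stored in the tree without rescanning the database. It is an assumption-laden sketch rather than the paper's procedure: the class PCTree, its dictionary-based children, and the use of an absolute count for min_support are all hypothetical choices made for brevity.

```python
class PCTree:
    """Prefix tree with counts; children are kept in a dict for brevity."""

    def __init__(self):
        self.count = 0
        self.children = {}

    def add(self, transaction):
        """Change of data: insert a transaction, creating nodes as needed."""
        node = self
        for item in sorted(set(transaction)):
            node = node.children.setdefault(item, PCTree())
            node.count += 1

    def delete(self, transaction):
        """Change of data: remove a transaction that was added earlier."""
        items = sorted(set(transaction))
        node = self
        for item in items:               # check the path before editing it
            node = node.children.get(item)
            if node is None:
                raise KeyError(f"transaction not in tree: {items}")
        node = self
        for item in items:
            child = node.children[item]
            child.count -= 1
            if child.count == 0:         # nothing else uses this branch
                del node.children[item]
                return
            node = child

    def item_counts(self, totals=None):
        """Total count of each item over the whole tree."""
        totals = {} if totals is None else totals
        for item, child in self.children.items():
            totals[item] = totals.get(item, 0) + child.count
            child.item_counts(totals)
        return totals

    def large_items(self, min_support):
        """Change of support: re-read the tree, no database rescan."""
        return {i for i, c in self.item_counts().items() if c >= min_support}
```

Because the tree itself does not depend on the support value, calling large_items with a different threshold only re-reads the stored counts.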
4 Distributed Mining
Architecture
We propose an architecture based on Distributed Memory Machines
(DMMs). Let there be m processors P_1, P_2, ..., P_m, connected by a
high-speed gigabit switch. Each processor has a local memory and a
local disk. The processors can communicate only by passing messages. Let
DB_1, DB_2, ..., DB_n be n transaction databases at n sites, partitioned
into m non-overlapping blocks D_1, D_2, ..., D_m, where m is the number of
processors available (m ≥ n). Each block D_i has the same schema. Let t_ij be
the j-th transaction in partition D_i.
Algorithm: Distributed PC to LOFP tree Generator (DPC-LOFPG)
Steps:
1. Each processor P^i makes a pass over its data partition D_i and generates
a local PC-tree, PC_1^i, at level 1. At the end of this process, each
processor generates a local 1-itemset vector, LV^i. It is an array of size s
(where s is the number of items) in which each entry LV^i[k] holds the count
value of item k in the data partition D_i.
2. Processor P^i exchanges its local 1-itemset vector LV^i with all other
processors to develop a global 1-itemset vector, GV^i. Note that all GV^i
(= GV) are identical and yield the large 1-itemsets. The processors are
forced to synchronize at this step.
3. Each processor P^i constructs a local LOFP-tree using its local PC_1^i and
GV^i. The processors are forced to synchronize at this step.
4. Each pair of consecutive LOFP-trees (viz. LOFP_i^1 and LOFP_i^2, LOFP_i^3
and LOFP_i^4, etc.) is given to one processor, which merges them into a
single LOFP-tree. This gives the LOFP-trees at the next level (viz. the
(i+1)-th level). After every iteration the processors are forced to
synchronize. In each iteration, the number of LOFP-trees generated and the
number of processors needed are reduced by a factor of 2 with respect to the
previous iteration. Iteration stops when one (global) LOFP-tree is generated.
5. Conditional LOFP-trees are generated to obtain the large itemsets. This
step is the same as the one explained in [3].
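Steps 1 to 4 can be simulated on a single machine to make the data flow concrete. The sketch below is illustrative only and rests on several assumptions: the processors are replaced by a plain loop over partitions, message passing and synchronization are not modelled, trees are nested dictionaries of the form {item: [count, subtree]}, items are kept in plain sorted order, and min_support is an absolute count. The function names insert, merge, project and dpc_lofpg are hypothetical, and step 5 is left out.

```python
from collections import Counter


def insert(tree, items, count=1):
    """Add a sorted item path to a prefix tree {item: [count, subtree]}."""
    for item in items:
        entry = tree.setdefault(item, [0, {}])
        entry[0] += count
        tree = entry[1]


def merge(dst, src):
    """Step 4: fold the counts of tree `src` into tree `dst`."""
    for item, (count, sub) in src.items():
        entry = dst.setdefault(item, [0, {}])
        entry[0] += count
        merge(entry[1], sub)
    return dst


def project(pc_tree, large, out=None):
    """Step 3: keep only the large items of a local PC-tree."""
    out = {} if out is None else out
    for item, (count, sub) in pc_tree.items():
        if item in large:
            entry = out.setdefault(item, [0, {}])
            entry[0] += count
            project(sub, large, entry[1])
        else:                        # skip the item, keep what lies below it
            project(sub, large, out)
    return out


def dpc_lofpg(partitions, min_support):
    # Step 1: one pass per partition gives a local PC-tree and a vector LV^i.
    pc_trees, local_vectors = [], []
    for part in partitions:
        tree, lv = {}, Counter()
        for transaction in part:
            items = sorted(set(transaction))
            insert(tree, items)
            lv.update(items)
        pc_trees.append(tree)
        local_vectors.append(lv)
    # Step 2: exchanging the LV^i amounts to summing them into GV.
    gv = sum(local_vectors, Counter())
    large = {item for item, c in gv.items() if c >= min_support}
    # Step 3: each partition builds its local LOFP-tree.
    trees = [project(t, large) for t in pc_trees]
    # Step 4: merge consecutive pairs until one global tree is left.
    while len(trees) > 1:
        trees = [merge(trees[k], trees[k + 1]) if k + 1 < len(trees)
                 else trees[k]
                 for k in range(0, len(trees), 2)]
    return trees[0], large
```

Each pass of the while loop halves the number of trees, so m partitions need about log2(m) merge rounds; running the function on one partition holding all transactions gives the sequential result for comparison.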
Experiments
We compared the DPC-LOFPG algorithm with its sequential counterpart.
Figure 4 shows the block schematic diagram that depicts the stages with the
corresponding timing requirements. From the block schematic shown in
Figure 4, the time required to construct the LOFP-tree by sequential mining, T
[Figure 4: block schematic of the stages and their timing requirements.
Sequential mining: PC (time t'_pc), then LOFP (time t'_lofp). Distributed
mining: Stage 1, PC_1^1, PC_1^2, ..., PC_1^m (t_pc); Stage 2 (tr_lv);
Stage 3, LOFP_1^1, LOFP_1^2, ..., LOFP_1^m (t_lofp1); Stage 4 (tr_lofp1);
Stage 5, LOFP_j^1.]
NOTE:
1. t_pc is the time required to construct PC_1^i at Stage 1.
2. ... all-to-all broadcast.
3. t_lofp1 is the maximum time required to construct LOFP_1^i at Stage 3.
4. tr_lofpj is the maximum time required to transfer any LOFP_j^i in the
j-th iteration, for any i, at Stage 4.
5. A marker in the diagram represents the communication between the
processors followed by the synchronization.
[Figure: response time (sec.) versus number of processors; one panel for
Set 3 and one for Set 9.]
S is the response time of the system when m (> 1) processors are used, and
T is the response time when m = 1. We show in Figure 6 the efficiency of
parallelism for different numbers of processors. It may be observed from
Figure 6 that the best efficiency is 98% for Set 11 and 95% for Set 4; both
are achieved by the 2-processor system.
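For reference, efficiency of parallelism can be computed from T and S as follows. The formula used here, T / (S x m), is the standard textbook definition and is assumed rather than taken from the text above; the function name efficiency and the timings in the example call are hypothetical and are not measurements from these experiments.

```python
def efficiency(t_single: float, s_parallel: float, m: int) -> float:
    """Speedup T / S divided by the processor count m (assumed definition)."""
    if m < 1 or s_parallel <= 0:
        raise ValueError("need m >= 1 and a positive response time")
    return t_single / (s_parallel * m)


# Hypothetical timings in seconds, chosen only to show the call.
print(efficiency(t_single=120.0, s_parallel=70.0, m=2))
```

A value of 1.0 would mean that doubling the processors exactly halves the response time; communication and synchronization costs push the value below 1.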
[Figure 6: efficiency of parallelism versus number of processors; one panel
for Set 3 and Set 4, one panel for Set 9 and Set 11.]
5 Conclusions
In this paper, we proposed a novel data structure called the PC-tree, which
can be used to represent the database in a complete and compact form. A
PC-tree can be constructed using a single database scan. We use it here for
mining association rules. We have shown that the ARM algorithms based on the
PC-tree are scalable. We introduced the notion of dynamic mining, which is a
significant extension of incremental mining, and we have shown that the
PC-tree is ideally suited for dynamic mining. The proposed distributed
algorithm, DPC-LOFPG, is found to be efficient because it scans the database
only once and does not generate any candidate itemsets.
References
1. Agrawal, R., Srikant, R. Fast algorithms for mining association rules in large
databases, Proc. of 20th Int’l Conf. on VLDB, (1994), 487-499.
2. Savasere, A., Omiecinski, E., Navathe, S. An efficient algorithm for mining
association rules in large databases, Proc. of Int’l Conf. on VLDB, (1995), 432-444.
3. Han, J., Pei, J., Yin, Y. Mining Frequent Patterns without Candidate Generation,
Proc. of ACM-SIGMOD, (2000).
4. Zaki, M.J. Parallel and distributed association mining: A survey, IEEE
Concurrency, special issue on Parallel Mechanisms for Data Mining, Vol. 7, No. 4,
(1999), 14-25.
5. Thomas, S., Bodagala, S., Alsabti, K., Ranka, S. An efficient algorithm for the
incremental updation of association rules in large databases, AAAI, (1997).
6. Ananthanarayana, V.S., Subramanian, D.K., Narasimha Murty, M. Scalable,
distributed and dynamic mining of association rules using PC-trees, IISc-CSA,
Technical Report, (2000).