
Scalable, Distributed and Dynamic Mining

of Association Rules

V.S. Ananthanarayana, D.K. Subramanian, and M.N. Murty

Dept. of Computer Science and Automation,


Indian Institute of Science, Bangalore 560 012, INDIA
{anvs, dks, mnm}@csa.iisc.ernet.in

Abstract. We propose a novel pattern tree called Pattern Count tree (PC-tree) which is a complete and compact representation of the database. We show that construction of this tree and then generation of all large itemsets requires a single database scan, whereas the current algorithms need at least two database scans. The completeness property of the PC-tree with respect to the database makes it amenable for mining association rules in the context of changing data and knowledge, which we call dynamic mining. Algorithms based on the PC-tree are scalable because the PC-tree is compact. We propose a partitioned distributed architecture and an efficient distributed association rule mining algorithm based on the PC-tree structure.

1 Introduction
The process of generating large itemsets [1] is an important component of an
Association Rule Mining (ARM) algorithm. This process involves intensive disk access, necessitating fast and efficient algorithms for generating large itemsets. Incremental mining algorithms [5] have been proposed
to handle additions/deletions of transactions to/from an existing database. In
addition to the dynamic nature of data, knowledge and values of parameters like
support also change dynamically. We call mining under such a dynamic environ-
ment, dynamic mining. In this paper, we propose a novel tree structure called
Pattern Count tree (PC-tree) that is a complete and compact representation of
the database; and is ideally suited for dynamic mining.
Existing parallel and distributed ARM algorithms [4] (i) require at least two database scans and (ii) generate and test candidate itemsets to obtain large itemsets. In contrast, the sequential ARM algorithm based on the frequent pattern tree (FP-tree) is shown to be superior [3] to the existing ones because it does not generate candidate itemsets, thus reducing the computational requirements significantly. However, the FP-tree based algorithm still requires two database scans to generate frequent itemsets.
This motivated us to propose a scheme that employs an abstraction similar to the FP-tree but more general and complete: the Pattern Count tree (PC-tree). A
PC-tree can be constructed based on a single database scan and can be updated
dynamically. We show that we can construct a unique ordered FP-tree, called the Lexicographically Ordered FP-tree (LOFP-tree), from a PC-tree without scanning the database. More specifically, we propose a distributed ARM algorithm
based on this abstraction of the database, i.e., PC-tree, where the databases are
distributed and partitioned across a loosely-coupled system. Here, we convert
each partition of the database into a local PC-tree using a single database scan.
Due to high data compression, it is possible to store local PC-trees in the main
memory itself. This PC-tree, in turn, is converted into a local LOFP-tree using
the knowledge of globally large 1-itemsets in the data. These local LOFP-trees
are merged to form a single LOFP-tree, which can be used to generate all large
itemsets.

2 Pattern Count Tree (PC-Tree)


The PC-tree is a data structure that stores all the patterns occurring in the tuples of a transaction database. A count field is associated with each item in every pattern; this field is responsible for the compact realization of the database.
A. Structure of a PC-tree : Each node of the tree consists of the following
structure: item-name, count and two pointers called child (c) and sibling (s).
In a node, the item-name field specifies which item the node represents, the count field specifies the number of transactions represented by the portion of the path reaching this node, the c-pointer points to the node for the next item of the pattern, and the s-pointer points to the node that starts an alternative pattern branching from the node under consideration.
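For concreteness, this node structure can be sketched as follows. This is only an illustrative Python sketch; the class and field names (PCNode, item_name, count, child, sibling) are ours and are not prescribed by the paper.

```python
# A minimal sketch of the PC-tree node described above. The paper only
# specifies the four fields conceptually; the names here are illustrative.
class PCNode:
    def __init__(self, item_name, count=0):
        self.item_name = item_name  # which item this node represents
        self.count = count          # number of transactions sharing the path up to this node
        self.child = None           # c-pointer: node for the next item of the pattern
        self.sibling = None         # s-pointer: node starting an alternative pattern here
```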
B. Construction of the PC-tree : PC-tree construction requires a single
scan of the database. We take each transaction from the database and add it as a new branch of the PC-tree if no sub-pattern that is a prefix of the transaction already exists in the PC-tree; otherwise we put it into the existing branch e_b by incrementing the corresponding count field values of the nodes in the PC-tree. We append the remaining sub-pattern of the transaction, if any, as additional nodes with count field value equal to 1 attached to the path in e_b. Each transaction is preprocessed to put its items in lexicographical order. Refer to [6] for a detailed construction algorithm.
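A sketch of this insertion rule is given below, using the hypothetical PCNode class from the previous sketch. It assumes the transaction items have already been put in the desired (lexicographical) order by the preprocessing step; the optional count argument is our addition and simply allows a pattern to be inserted with a multiplicity greater than one.

```python
def insert_transaction(root, items, count=1):
    """Insert one already-ordered transaction (or pattern) into a count tree.

    Follows the longest matching prefix, incrementing the count of each shared
    node, and appends new nodes for the remaining items of the transaction.
    `root` is a dummy PCNode whose child points to the first stored pattern.
    """
    node = root
    for item in items:
        # search for `item` among the children of `node` (child + sibling chain)
        prev, child = None, node.child
        while child is not None and child.item_name != item:
            prev, child = child, child.sibling
        if child is None:                       # item not found: start a new branch
            child = PCNode(item)
            if prev is None:
                node.child = child
            else:
                prev.sibling = child
        child.count += count                    # shared or new node: update its count
        node = child
```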
We give below an example to explain the construction of a PC-tree. Consider
the un-normalized, single attribute based relation (transaction database)
shown in Figure 1A. The corresponding PC-tree is shown in Figure 1B.
C. Properties of the PC-tree :
Definition 2.1 Complete representation
A representation R is complete with respect to a transaction database DB,
if every pattern in each transaction in DB, is present in R.
Lemma 2.1 PC-tree is a complete representation of the database.
Definition 2.2 Compact representation
Let DB be a database of size D_s. Then, a complete representation R of DB is compact if R_s < D_s, where R_s is the size of R.
Lemma 2.2 PC-tree is a compact representation of the database.
Proof Refer to [6] for proofs of Lemmas 2.1 and 2.2.

Transaction Id    Item numbers of items purchased (Transactions)
t1                19, 40, 510, 527
t2                19, 40, 179
t3                19, 40, 125, 510
t4                527, 740
t5                527, 740, 795
t6                19

FIGURE A : Transaction Database, D

[Figure B shows the corresponding PC-tree: the root has the branches 19:4 and 527:2; under 19:4 the path continues with 40:3, which branches into 510:1 (followed by 527:1), 179:1, and 125:1 (followed by 510:1); under 527:2 the path continues with 740:2 and then 795:1.]

FIGURE B : PC-TREE

In Figure B, a vertical link represents the c-pointer and a horizontal link represents the s-pointer. Each node has two parts: item-name and count value.

Fig. 1. Example transaction database and the PC-tree Structure for storing patterns
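As an illustration of the single-scan construction, the sketches above can be used to rebuild the PC-tree of Figure 1B from the six transactions of Figure 1A; the variable names below are ours.

```python
# Building the PC-tree of Figure 1B in one pass over the database of Figure 1A.
root = PCNode(item_name=None)          # dummy root; not part of any pattern
database = [
    [19, 40, 510, 527],   # t1
    [19, 40, 179],        # t2
    [19, 40, 125, 510],   # t3
    [527, 740],           # t4
    [527, 740, 795],      # t5
    [19],                 # t6
]
for transaction in database:
    insert_transaction(root, sorted(transaction))   # preprocessing: order the items
# root.child is now the node 19:4 and its sibling is 527:2, as in Figure 1B.
```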

2.1 Construction of the LOFP-Tree from a Given PC-Tree


The PC-tree corresponding to a given database is unique, whereas the FP-tree corresponding to a given database need not be unique. If all items in each transaction of the database are ordered lexicographically, and large 1-itemsets having the same frequency are also ordered lexicographically, we call the corresponding FP-tree the Lexicographically Ordered FP-tree (LOFP-tree); it is unique, and the number of nodes in the LOFP-tree is less than or equal to that of the FP-tree. Since the PC-tree is a complete representation of the database, the LOFP-tree can be constructed directly from the PC-tree. Refer to [6] for the detailed construction process. Once the LOFP-tree is constructed from the PC-tree, generation of large itemsets from this structure is done by constructing conditional LOFP-trees. Since the procedure for constructing a conditional LOFP-tree is similar to that of a conditional FP-tree [3], we do not discuss it here.
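One possible way to realize the PC-tree to LOFP-tree conversion, again as a sketch under our own naming and reusing the PCNode and insert_transaction sketches from Section 2, is to enumerate the patterns stored in the PC-tree together with their multiplicities and to reinsert them after dropping non-large items and reordering the rest by global frequency, with lexicographic order of the item identifiers breaking ties.

```python
def enumerate_patterns(root):
    """Yield (pattern, multiplicity) pairs stored in a PC-tree.

    A transaction ends exactly at a node when the node's count exceeds the
    total count of its children; this is what makes the PC-tree complete.
    """
    def walk(node, prefix):
        child, children_total = node.child, 0
        while child is not None:
            children_total += child.count
            yield from walk(child, prefix + [child.item_name])
            child = child.sibling
        ending_here = node.count - children_total
        if prefix and ending_here > 0:
            yield prefix, ending_here
    yield from walk(root, [])

def build_lofp_tree(pc_root, global_counts, min_support_count):
    """Sketch of LOFP-tree construction from a PC-tree, without a database scan.

    `global_counts` maps each item to its known global frequency; items below
    `min_support_count` are dropped, the remaining items are ordered by
    decreasing frequency (ties broken by item order), and the reordered
    patterns are inserted into a fresh count tree.
    """
    lofp_root = PCNode(item_name=None)
    for pattern, multiplicity in enumerate_patterns(pc_root):
        kept = [i for i in pattern if global_counts.get(i, 0) >= min_support_count]
        kept.sort(key=lambda i: (-global_counts[i], i))
        if kept:
            insert_transaction(lofp_root, kept, multiplicity)
    return lofp_root
```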

2.2 Experiments on Construction of PC-Tree and LOFP-Tree


We synthesize datasets using the synthetic data generator given in [1]. Figure 2A shows the parameter description and Figure 2B shows the data sets used in our work, with a fixed support value of 0.75%.
Experiment 1 - Memory requirements : Figure 3 shows the memory requirements of the database, the PC-tree and the LOFP-tree for the datasets Set 1 to Set 11. It is clear from the graphs that even though the size of the PC-tree is greater than that of the LOFP-tree, the memory requirement of the PC-tree increases only nominally with a significant increase in the database size. For example, the ratio of the storage space of the database to that of the PC-tree increases from 7 to 10 as the database size grows from 8.5MB to 137MB, for |T| = 10. So, the PC-tree requires less memory space than the database.

|D|   Number of transactions
|T|   Average size of a transaction
|I|   Average size of maximal potentially large itemsets
|L|   Number of maximal potentially large itemsets
NOI   Number of items

FIGURE A : Parameters

NAME     |D|    |T|  |I|  |L|    NOI
Set_1    25K    10   4    1000   1000
Set_2    50K    10   4    1000   1000
Set_3    100K   10   4    1000   1000
Set_4    200K   10   4    1000   1000
Set_5    300K   10   4    1000   1000
Set_6    400K   10   4    1000   1000
Set_7    25K    15   4    1000   1000
Set_8    50K    15   4    1000   1000
Set_9    100K   15   4    1000   1000
Set_10   150K   15   4    1000   1000
Set_11   200K   15   4    1000   1000

FIGURE B : Data Set used

Fig. 2. Parameters and Data set used

[Figure 3 consists of two plots of storage space in bytes (on the order of 10^7) against the data set used, one for |T| = 10 and one for |T| = 15, each with three curves: DB size, PC-tree size and LOFP-tree size.]

Fig. 3. Comparison based on the space requirements

Experiment 2 - Timing requirements : We conducted an experiment using the data sets Set 1, Set 2, Set 3 and Set 4 to compare the time required to construct the LOFP-tree directly from the database, which requires two database scans, with its construction from the PC-tree, which requires one database scan and one scan of the PC-tree. Table 1 shows the values obtained. It can be seen that the time required to construct the LOFP-tree from the PC-tree is less than that of its construction directly from the database. These results indicate that constructing the LOFP-tree from the PC-tree is faster than direct construction of the LOFP-tree by 2 to 22 sec.

Table 1. Comparison of both algorithms based on time requirement

Data Set                                              Set 1     Set 2     Set 3     Set 4
Time to construct LOFP-tree from PC-tree              22 sec.   45 sec.   88 sec.   173 sec.
Time to construct LOFP-tree directly from database    24 sec.   48 sec.   97 sec.   195 sec.

3 Dynamic Mining

The mining procedure for generating large itemsets under change of data, change of knowledge and change of the values of parameters like support is called dynamic mining. In the ARM literature, mining under the change-of-data scenario is called incremental mining. We explain below how the PC-tree handles incremental mining, change of knowledge and change of the support value.

3.1 Incremental Mining of Large Itemsets : Incremental mining is a method for generating large itemsets over a changing transaction database in an incremental fashion, without reconsidering the part of the database which has already been mined. Effectively there are two operations which lead to changes in a database, the ADD and DELETE operations explained below (a code sketch follows the list).
1. ADD :- A new transaction is added to the transaction database : this
is handled in a PC-tree either by incrementing the count value of the
nodes corresponding to the items in the transaction or by adding new
nodes or by performing both the above activities.
2. DELETE :- An existing transaction is deleted from the transaction
database : this is handled in a PC-tree by decrementing the count value
of the nodes corresponding to the items in the transaction, if their count
value is > 1; or by deleting the existing nodes corresponding to the items
in the transaction.
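The ADD operation corresponds exactly to the insertion sketched in Section 2. A corresponding sketch of DELETE, under the same assumed node structure and with the transaction assumed to be already represented in the tree, could look like this:

```python
def delete_transaction(root, items):
    """Sketch of the DELETE operation on a PC-tree.

    Walks the branch of the (already-ordered) transaction, decrements the
    count of every node on it, and unlinks the first node whose count drops
    to zero, which removes that node together with its whole subtree.
    """
    node = root
    for item in items:
        prev, child = None, node.child
        while child is not None and child.item_name != item:
            prev, child = child, child.sibling
        if child is None:
            raise ValueError("transaction is not represented in the PC-tree")
        child.count -= 1
        if child.count == 0:                    # no remaining transaction uses this branch
            if prev is None:
                node.child = child.sibling
            else:
                prev.sibling = child.sibling
            return                              # the rest of the branch is gone with it
        node = child
```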
Definition 3.1 Large-itemset equivalent : Let DB be a transaction
database with L as the set of all large itemsets in DB. Let R1 and R2 be
any two representations of DB. Let L1 and L2 be the sets of large itemsets
generated from R1 and R2 respectively. R1 and R2 are said to be large-
itemset equivalent if L1 = L2 = L.
Definition 3.2 Appropriate representation for incremental mining (ARFIM): Let R be a large-itemset equivalent representation of DB. Let DB be updated to DB' such that DB' = DB + db1 - db2. R is an ARFIM if there exists an algorithm that generates R' using only R, db1 and db2 such that R' is large-itemset equivalent to DB'. Note that the PC-tree can be incrementally constructed by scanning DB, db1 and db2 only once.
3.2 Change of Knowledge : Knowledge representation in the mining pro-
cess is mainly by an ‘is a’ hierarchy. Let K1 be the knowledge represented
using an ‘is a’ hierarchy. Let P C be the PC-tree for the database D. We can
construct a generalized PC-tree (GPC-tree) by applying K1 to the transactions obtained from PC, without accessing D. If the knowledge K1 is changed to K2, then by applying K2 on PC we get another generalized PC-tree, again without accessing D. Thus the PC-tree handles dynamic change of knowledge without accessing the database; a sketch is given below. A change of the support value can be handled in a similar manner.
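A sketch of this step, reusing the enumerate_patterns and insert_transaction sketches from Section 2, is given below. It assumes the hierarchy K is available as a mapping is_a from each item to its generalized item, and that generalized items are drawn from the same comparable identifier space as the original items; both assumptions are ours.

```python
def generalize_pc_tree(pc_root, is_a):
    """Sketch: build a generalized PC-tree (GPC-tree) from an existing PC-tree.

    Every pattern stored in the PC-tree is mapped through the 'is a' hierarchy
    and reinserted with its multiplicity, so the database D is never read again.
    Duplicates are removed because sibling items may generalize to the same item.
    """
    gpc_root = PCNode(item_name=None)
    for pattern, multiplicity in enumerate_patterns(pc_root):
        generalized = sorted({is_a.get(item, item) for item in pattern})
        insert_transaction(gpc_root, generalized, multiplicity)
    return gpc_root
```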

Definition 3.3 A representation R is appropriate for dynamic mining of large itemsets if it satisfies the following properties :
1. R is a complete and compact representation of the database, D.
2. R can be used to handle changes in the database, knowledge and support value without scanning D to generate frequent itemsets.
The PC-tree is a representation that is appropriate for dynamic mining of large itemsets. For more details on dynamic mining, refer to [6].

4 Distributed Mining

Architecture
We propose an architecture based on Distributed Memory Machines (DMMs). Let there be m processors P_1, P_2, ..., P_m, which are connected by a high-speed gigabit switch. Each processor has a local memory and a local disk, and the processors can communicate only by passing messages. Let DB_1, DB_2, ..., DB_n be n transaction databases at n sites, partitioned into m non-overlapping blocks D_1, D_2, ..., D_m, where m is the number of processors available (m ≥ n). Each block D_i has the same schema. Let t_ij be the j-th transaction in partition D_i.
Algorithm : Distributed PC to LOFP tree Generator (DPC-LOFPG)

Steps :
1. Each processor P_i makes a pass over its data partition D_i and generates a local PC-tree, PC_1^i, at level 1. At the end of this process, each processor generates a local 1-itemset vector LV^i. It is an array of size s (where s is the number of items) in which each entry LV^i[k] holds the count corresponding to item k in the data partition D_i.
2. Processor P_i exchanges its local 1-itemset vector LV^i with all other processors to develop a global 1-itemset vector GV^i. Note that all GV^i s (= GV) are identical and give the large 1-itemsets. The processors are forced to synchronize at this step.
3. Each processor P_i constructs a local LOFP-tree using the local PC_1^i and GV^i. The processors are forced to synchronize at this step.
4. Every two alternate LOFP-trees (viz. LOFP_i^1 and LOFP_i^2, LOFP_i^3 and LOFP_i^4, etc.) are given to a processor for merging into a single merged LOFP-tree. This gives the LOFP-trees at the next level (viz. level i+1). After every iteration the processors are forced to synchronize. During each iteration, the number of LOFP-trees generated and the number of processors needed are reduced by a factor of 2 with respect to the previous iteration. The iteration stops when one (global) LOFP-tree is generated; a sketch of this pairwise merge is given after the algorithm.
5. Generation of conditional LOFP-trees to generate large itemsets. This step is the same as the one explained in [3].
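Step 4 only needs a way to merge two count trees built over the same global item order. A sequential sketch of such a merge and of the pairwise reduction, again reusing the enumerate_patterns and insert_transaction sketches from Section 2 with our own function names, is given below; in the distributed setting each merge within an iteration is carried out by a separate processor.

```python
def merge_count_trees(tree_a, tree_b):
    """Merge two LOFP-trees that use the same global item order.

    Every pattern of tree_b is reinserted into tree_a with its multiplicity,
    so tree_a then represents the union of both partitions.
    """
    for pattern, multiplicity in enumerate_patterns(tree_b):
        insert_transaction(tree_a, pattern, multiplicity)
    return tree_a

def pairwise_merge(local_trees):
    """Simulate the iterations of Step 4: halve the number of trees each round
    until a single (global) LOFP-tree remains."""
    trees = list(local_trees)
    while len(trees) > 1:
        merged = []
        for i in range(0, len(trees), 2):
            if i + 1 < len(trees):
                merged.append(merge_count_trees(trees[i], trees[i + 1]))
            else:
                merged.append(trees[i])        # odd tree out waits for the next round
        trees = merged
    return trees[0]
```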
Experiments
We compared the DPC-LOFPG algorithm with its sequential counterpart. Figure 4 shows the block schematic diagram that depicts the stages with the corresponding timing requirements.

[Figure 4 is a block schematic with two columns. Sequential Mining Algorithm: the database (D) is converted into a PC-tree (time t'_pc) and then into an LOFP-tree (time t'_lofp). DPC-LOFP Generator: the partitions D_1, D_2, ..., D_m are converted into local PC-trees PC_1^1, PC_1^2, ..., PC_1^m at Stage 1 (time t_pc); the local 1-itemset vectors LV are exchanged at Stage 2 (time tr_lv); local LOFP-trees LOFP_1^1, LOFP_1^2, ..., LOFP_1^m are built at Stage 3 (time t_lofp1); the trees are then transferred and pairwise merged over successive iterations (Stage 4 onwards, giving LOFP_2^1, ..., LOFP_2^x and so on) until the single tree LOFP_j^1 is obtained at Stage 5.]

NOTE :
1. t_pc is the time required to construct PC_1^i at Stage 1.
2. tr_lv is the total time required to transfer the LVs at Stage 2 using all-to-all broadcast.
3. t_lofp1 is the maximum time required to construct LOFP_1^i at Stage 3.
4. tr_lofpj is the maximum time required to transfer any LOFP_j^i in the j-th iteration, for any i, at Stage 4.
5. The dotted links in the figure represent the communication between the processors followed by the synchronization.

Fig. 4. Timing requirements for DPC-LOFPG and its sequential counterpart

From the block schematic shown in Figure 4, the time T required to construct the LOFP-tree by sequential mining and the time S required to construct the global LOFP-tree by the DPC-LOFPG algorithm are given by the following formulas: S = t_pc + tr_lv + {t_lofp1 + tr_lofp1 + t_lofp2 + tr_lofp2 + ... + t_lofpj + tr_lofpj}, if the number of processors at Stage 1 is 2^j, for j > 0; corresponding to S, we have T = t'_pc + t'_lofp.
To study the efficiency and response time of the parallel algorithm DPC-LOFPG relative to its sequential counterpart, we conducted a simulation study by varying the number of processors from 1 to 8. The data sets used in our experiment are Set 3, Set 4, Set 9 and Set 11. Figure 5 shows the response time for different numbers of processors.

[Figure 5 consists of two plots of response time in seconds against the number of processors (1 to 8): one for Set 3 and Set 4 (|T| = 10) and one for Set 9 and Set 11 (|T| = 15).]

Fig. 5. Response time



Efficiency of parallelism is defined as T/(S × m), where S is the response time of the system when m (> 1) processors are used and T is the response time when m = 1. We show in Figure 6 the efficiency of parallelism for different numbers of processors. It may be observed from Figure 6 that the best efficiency value is 98% for Set 11 and 95% for Set 4, both exhibited by the 2-processor system.

[Figure 6 consists of two plots of efficiency of parallelism (between 0.75 and 1) against the number of processors (2 to 8): one for Set 3 and Set 4 (|T| = 10) and one for Set 9 and Set 11 (|T| = 15).]

Fig. 6. Efficiency of Parallelism

5 Conclusions
In this paper, we proposed a novel data structure called PC-tree which can be
used to represent the database in a complete and compact form. PC-tree can be
constructed using a single database scan. We use it here for mining association
rules. We have shown that the ARM algorithms based on PC-tree are scalable.
We introduced the notion of dynamic mining, a significant extension of incremental mining, and showed that the PC-tree is ideally suited for dynamic mining. The proposed distributed algorithm, DPC-LOFPG, is found to be efficient because it scans the database only once and does not generate any candidate itemsets.

References
1. Agrawal, R., Srikant, R.: Fast algorithms for mining association rules in large databases. Proc. of 20th Int'l Conf. on VLDB (1994) 487-499.
2. Savasere, A., Omiecinski, E., Navathe, S.: An efficient algorithm for mining association rules in large databases. Proc. of Int'l Conf. on VLDB (1995) 432-444.
3. Han, J., Pei, J., Yin, Y.: Mining frequent patterns without candidate generation. Proc. of ACM-SIGMOD (2000).
4. Zaki, M.J.: Parallel and distributed association mining: A survey. IEEE Concurrency, special issue on Parallel Mechanisms for Data Mining, Vol. 7, No. 4 (1999) 14-25.
5. Thomas, S., Bodagala, S., Alsabti, K., Ranka, S.: An efficient algorithm for the incremental updation of association rules in large databases. Proc. of KDD, AAAI Press (1997).
6. Ananthanarayana, V.S., Subramanian, D.K., Narasimha Murty, M.: Scalable, distributed and dynamic mining of association rules using PC-trees. IISc-CSA Technical Report (2000).
