
Knowledge-Based Systems 33 (2012) 41–52


Efficient colossal pattern mining in high dimensional datasets


Mohammad Karim Sohrabi *, Ahmad Abdollahzadeh Barforoush
ISLAB, Computer Engineering & IT Department, Amirkabir University of Technology, 424 Hafez Ave., Tehran 15914, Iran

* Corresponding author. Fax: +98 21 66413969/64542744. E-mail addresses: Amir_sohraby@yahoo.com (M.K. Sohrabi), Ahmad@aut.ac.ir (A.A. Barforoush).

Article history: Received 18 June 2011; Received in revised form 4 March 2012; Accepted 4 March 2012; Available online 15 March 2012

Keywords: Frequent pattern mining; Colossal patterns; Bottom-up mining; Bit matrix; High dimensional dataset

Abstract

Frequent pattern mining is considered an important data mining problem which has been extensively studied over the last decade. A large number of algorithms have been developed for frequent pattern mining on traditional commercial datasets, which usually contain a huge number of transactions but a small number of items in each transaction. The advent of bioinformatics contributed to the development of a new form of dataset, called high dimensional, characterized by a small number of transactions and a large number of items in each transaction. The running time of traditional algorithms increases exponentially with increasing average transaction length, so these algorithms are not suitable for high dimensional datasets. Moreover, mining algorithms on high dimensional datasets produce a very large output set which includes small and mid-size frequent patterns that do not bear any useful information for scientists. Colossal pattern mining has been described as a solution that reduces the size of this output set: because the mining of small and mid-sized patterns is skipped, the mining process is faster, and only very large (colossal) patterns are extracted. In this paper we present an efficient vertical bottom-up method to mine frequent colossal patterns in high dimensional datasets. Our algorithm uses a bit matrix to compress the dataset and make it easy to use in the mining process. Our experimental results show that our algorithm attains very good mining efficiency on various input datasets, and our performance study shows that it substantially outperforms the best former algorithms.

© 2012 Elsevier B.V. All rights reserved.

1. Introduction

A fundamental problem for mining association rules is to mine frequent itemsets. Frequent pattern mining has numerous applications, including the analysis of customer purchase patterns, the analysis of web access patterns, the investigation of scientific or medical processes, and the analysis of DNA sequences. There exist several types of frequent pattern mining; the main types include frequent itemset mining, sequential pattern mining, and graph mining. In this paper we concentrate on itemsets and, for simplicity, use the terms "pattern" and "itemset" interchangeably.

Many efficient frequent pattern mining algorithms have been proposed in the literature [1–8,19,25,27,28]. These algorithms can be classified into three main categories. The first category includes the apriori-like algorithms, which use the method of candidate generation and test to find the frequent patterns [1–3]. The huge number of candidate sets and the repeated scans of the dataset to check the frequency of candidates are the two main problems of these algorithms, which led researchers to FP-Growth [4]. FP-Growth and its successors [5–7,24] form the second category of mining algorithms; they use a data structure called FP-tree to compress the dataset and search it without candidate generation and without multiple scans of the dataset. The third category of mining algorithms, called vertical algorithms, works with data presented in vertical data format [8]. In usual datasets, transactions are in horizontal data format: each transaction (row) has an id and the set of items (columns) which occur in it. In contrast, in vertical data format, each row of the dataset has an item and the set of transaction ids which contain this item. Vertical algorithms are used when the dataset has a small number of transactions, each of them containing a large number of items.

Based on the requested output, pattern mining algorithms follow two different approaches to mining frequent patterns: some of them try to find the complete set of all frequent patterns [1,4,8], and the others discover a compressed set of patterns [9–12,14]. A major challenge in mining frequent patterns is the fact that all the sub-patterns of a frequent pattern are frequent, and there is an exponential number of sub-patterns for each pattern, which generates a huge number of frequent patterns as a result. To overcome this problem, closed frequent pattern mining [9–13,17,25,27] and maximal frequent pattern (max-pattern) mining [14,15] were proposed.


The closed pattern set is a compact set of frequent patterns which retains all the information of the complete pattern set. A pattern is closed frequent in a dataset if it is frequent in the dataset and there exists no super-pattern of it with the same support in the dataset. A pattern is maximal frequent in a dataset if it is frequent in the dataset and there exists no frequent super-pattern of it in the dataset. Although the set of max-patterns is more compact than the set of closed frequent ones, almost all important research on compressing the set of frequent patterns has focused on closed frequent pattern mining, since the set of maximal frequent patterns does not usually contain the complete support information of the corresponding frequent patterns.

Although closed pattern mining dramatically reduces the amount of computation and the output volume of the mining process, the pattern mining problem remains very time and space consuming for high dimensional datasets. Our experiments show that for a dataset with 100 rows, where each row includes 1000 items, none of the existing horizontal algorithms can finish closed pattern mining when the threshold is set low. On the other hand, the output of mining algorithms includes a huge number of small and mid-size patterns which often carry no suitable information for many applications. Since in many applications only large-sized patterns carry applicable and suitable information, it is a good idea to find mining methods which extract only large patterns, without mining the small and mid-size ones. Mining colossal patterns without mining smaller patterns is a rather new approach to address this issue.

All former pattern mining methods (complete, closed, or maximal frequent pattern mining methods) follow a bottom-up approach to find patterns: the mining process starts from small patterns and proceeds to larger ones. However, in some applications small or medium-sized patterns do not contain useful information, and useful information can be gained only from very large-sized patterns, called colossal patterns.

The first serious algorithm for mining colossal patterns was the core pattern fusion (core-fusion) algorithm, presented in 2007 [20]. That paper presents a core pattern-fusion method which gives a good approximation. First, the paper shows why pattern-fusion's mining result favors colossal patterns over smaller-sized ones, and then it explores how pattern-fusion gives a good approximation by catching the outliers in the complete answer. The main idea of the method is that pattern-fusion merges all the small sub-patterns of a large pattern in one step, instead of expanding patterns with additional single items. This gives pattern-fusion the advantage of circumventing mid-sized patterns and progressing on a path leading to a potential colossal pattern.

This work has been followed by other, more efficient colossal pattern mining methods, such as the Colossal Pattern Miner (CPM) in 2010 [21]. CPM suggests picking the seed pattern in an intelligent way instead of picking it randomly. This method presents a way of separating the sub-patterns of overlapping colossal patterns based on their frequencies, which facilitates leaping through the enormous number of mid-sized patterns.

Our work is motivated towards developing a more efficient strategy for mining colossal patterns using a bit-wise vertical bottom-up mining approach. A vertical bottom-up search method is designed to mine the rowsets of a dataset, and a new algorithm is presented which combines this bottom-up search method with the bit-wise approach to mine the frequent colossal patterns of high dimensional datasets efficiently.

The mining approaches which use bitmap techniques to find the frequent patterns are called bit-wise approaches in this paper.

Recently, many attempts have been made to apply bitmap techniques in frequent pattern mining algorithms [15,16,22–24]. MAFIA [15] is a maximal pattern mining method which uses a vertical bitmap representation for support counting and pruning mechanisms for searching the itemset lattice.

Index-BitTableFI [16] uses a BitTable, an index array, and the corresponding computing method to mine frequent itemsets. Frequent itemsets are identified quickly by using breadth-first search in this method.

A granule is defined in [22] as a collection of the entities that have the same property. The relational model which uses these granules as attribute values is called a machine oriented data model; the model transforms data mining, in particular finding association rules, into Boolean operations. Bit-AssoRule, a fast association rule algorithm based on granular computing, is presented in [23]. Bit-AssoRule avoids the generation-and-test strategy of the apriori algorithm. Using bitmap techniques, a candidate is a large itemset if the bit count of the intersection of all the bitmaps is equal to or greater than the minimal count; the bit count is the number of 1's in the bitmap resulting from the intersection of the bitmaps.

CFP-Tree and CFP-Array [24] are two novel data structures which use lightweight compression techniques to reduce the memory consumption of FP-tree based algorithms by an order of magnitude. CFP-Tree exploits a combination of structural changes to the FP-tree and bitmap techniques.

In this study, the bit-wise method is used to make the items and patterns easy to use in a vertical bottom-up search for colossal patterns.

The remainder of the paper is organized as follows. In Section 2 we present some preliminaries and the mining task, and then, with an example, we describe our bit matrix representation. In Section 3 we explain different search strategies for mining datasets. The new colossal pattern mining method based on the bit matrix representation is explained in Section 4. We present the new algorithm in Section 5 and conduct an experimental study in Section 6. Finally, we conclude the study in Section 7.

2. Problem definition

In the first subsection of this section we present some preliminaries, and in the second one we describe the bit matrix representation of the mining task by using an example.

2.1. Preliminaries

Let I = {i1, i2, . . . , in} be a set of items, also called features (or columns). The dataset D consists of a set of rows (also called transactions) R = {r1, r2, . . . , rm}, where each row ri is a set of items with an id called Rid or Tid. Given a set of items X ⊆ I, we define the support set, denoted D(X) ⊆ R, as the maximal set of rows that contain X. The number of rows in the dataset that contain X is called the support of X; by definition, the support of X is given as Sup(X) = |D(X)|. An itemset X is called a frequent itemset or frequent pattern if Sup(X) ≥ minsup, where minsup is a user specified lower support threshold. In other words, a pattern is frequent if the number of transactions which contain that pattern is not less than the user specified threshold.

A set of items X ⊆ I is called a closed pattern if there exists no Y such that X ⊂ Y and Sup(X) = Sup(Y), i.e., there is no superset of X with the same support. Put another way, the row set that contains a superset Y must not be exactly the same as the row set that contains the set X. An itemset X is called a frequent closed pattern if it is closed and frequent.

Fig. 1 shows an example of a dataset in which the items are represented using alphabets.

Fig. 1. A sample of a transaction database (dataset).
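As a minimal sketch of these definitions (our illustration, not code from the paper), assuming the dataset is given as a list of item sets with 1-based row ids:

    # rows: list of item sets, one per transaction; row ids are 1-based.
    def support_set(X, rows):
        # D(X): the ids of all rows that contain every item of X
        return {rid for rid, items in enumerate(rows, start=1) if X <= items}

    def sup(X, rows):
        return len(support_set(X, rows))

    def is_frequent(X, rows, minsup):
        return sup(X, rows) >= minsup

    def is_closed(X, rows):
        # X is closed iff no single-item extension keeps the support unchanged
        other_items = set().union(*rows) - X
        return all(sup(X | {i}, rows) < sup(X, rows) for i in other_items)

Checking only single-item extensions suffices here, because any superset with the same support as X yields at least one one-item extension with that same support.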

Given a dataset D and a user support threshold minsup, the frequent pattern mining problem is to discover all frequent patterns with respect to minsup; similarly, the frequent closed pattern mining problem is to find all frequent closed patterns with respect to minsup.

2.2. Bit wise representation of datasets

In this subsection we explain how we can represent a dataset with a bit matrix. We will use this bit-wise representation in our algorithm throughout this paper.

Let R be a set of rows with m members and I be a set of items with n members which are sorted in some order. This order is called the item search order. The sub search space of an item contains all the items after it in item search order, but no item before it. Two item search orders have been proposed in the literature: lexicographic order and ascending frequency order. Lexicographic order sorts the items lexicographically, while ascending frequency order sorts the frequent items in ascending order of their frequencies. In this paper we use lexicographic order for the bit pattern representation approach.

We construct a bit matrix with m rows and n columns such that each row corresponds to a member of the set R and each column corresponds to an item (in lexicographic order). Each row of the dataset becomes an n-bit string such that, if the row contains item i, the ith bit of the corresponding bit string is set to 1, and otherwise it is set to 0. Therefore our dataset is compressed into a bit matrix. For example, the bit matrix corresponding to the dataset of Fig. 1 is represented in Fig. 2.

Fig. 2. Bit matrix of the dataset.

Two arrays are constructed along with the bit matrix: colSum is an n-member array that stores the frequency (support) of each item, and rowSum is an m-member array that stores the number of items in each row. We can perform a column-wise pruning on the bit matrix based on minsup and eliminate all the columns whose corresponding sum in colSum is less than minsup. Fig. 3 shows two pruned bit matrices, for minsup 3 and 4, respectively.
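The construction and the column-wise pruning can be sketched as follows (our own illustration, using Python integers as bit strings; the paper does not prescribe an implementation):

    # Build the bit matrix: row j's integer has bit k set iff the row
    # contains the k-th item in lexicographic (item search) order.
    def build_bit_matrix(rows):
        items = sorted(set().union(*rows))
        col = {item: k for k, item in enumerate(items)}
        bitrows = []
        for row in rows:
            bits = 0
            for item in row:
                bits |= 1 << col[item]
            bitrows.append(bits)
        col_sum = [sum((v >> k) & 1 for v in bitrows) for k in range(len(items))]
        row_sum = [bin(v).count("1") for v in bitrows]
        return items, bitrows, col_sum, row_sum

    def prune_columns(rows, minsup):
        # Keep only the items whose support (colSum) reaches minsup, then
        # rebuild the matrix; rowSum changes accordingly, as described above.
        items, _, col_sum, _ = build_bit_matrix(rows)
        frequent = {items[k] for k, s in enumerate(col_sum) if s >= minsup}
        return build_bit_matrix([row & frequent for row in rows])

For the dataset of Fig. 1, prune_columns(rows, 3) should reproduce the pruned matrix of Fig. 3a.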
In Fig. 3a, minsup is 3, so columns a, c, d, e, f, g and h, whose support is not less than 3, have been retained, and the other columns (with support less than 3) have been eliminated. The rowSum of the new bit matrix has changed, because some items of each row have been excluded by the pruning operation. Fig. 3b shows the pruned bit matrix with minsup = 4. Since the support of c and d is 3 in the dataset, c and d are not present in Fig. 3b; the number of items of the rows containing c or d decreases, and their corresponding entries in rowSum are affected.

Fig. 3. Pruned bit matrix.
To determine whether an itemset X is frequent, we do as follows: first we apply the "AND" operator to the columns corresponding to all the items of X and construct its colAndVector. For example, the colAndVector of some itemsets of Fig. 3a is represented in Fig. 4. If the sum of the values in a colAndVector is not less than minsup, then the itemset of that colAndVector is frequent. We can see that ac is a frequent itemset with minsup 3, while cd and cdh are infrequent. Based on the apriori principle we can say that, since cd is infrequent, all its super-patterns (such as cdh) are infrequent; we can use this principle for further pruning in bit-wise mining algorithms.

Fig. 4. ColAndVectors and frequency test.
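A sketch of this frequency test over the bit matrix (our naming; the columns are extracted from the row-wise integers built above):

    def column(k, bitrows):
        # Bit vector of column k: bit r is 1 iff row r contains item k.
        v = 0
        for r, bits in enumerate(bitrows):
            v |= ((bits >> k) & 1) << r
        return v

    def col_and_vector(item_indexes, bitrows):
        v = (1 << len(bitrows)) - 1          # all-ones vector over the m rows
        for k in item_indexes:
            v &= column(k, bitrows)
        return v

    def is_frequent_bits(item_indexes, bitrows, minsup):
        # The sum of the vector's values is its number of 1 bits.
        return bin(col_and_vector(item_indexes, bitrows)).count("1") >= minsup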

Using colAndVectors we can also see in Fig. 3a that ac is frequent with minsup 3, and ace is frequent too, because the sum of the values of its colAndVector is 3. In this case ac is not a closed frequent pattern, because there is a super-pattern with the same support in the dataset. Similarly, bf is not closed because bfg has the same support; but eg is closed, because its support is 4 and all its frequent super-patterns (such as beg) have support 3.

To determine the closedness of a frequent itemset X from the bit matrix, we do as follows: after constructing the colAndVector of X, we apply AND to the colAndVector and the column of each item that is not present in X, separately. If the result of at least one of these AND operations is equal to the colAndVector of X, then X is not closed. In Fig. 5 we can see the steps of closedness checking for the itemset X = ad. For each frequent item k which does not occur in the pattern ad, the result of the AND operation of ColAndVector(ad) and BitMat(k) (the vector corresponding to column k of the pruned bit matrix) is represented in Fig. 5. It shows that ad is not closed, because adf has the same support in the dataset.

Whenever a super-pattern is detected with the same support as pattern X, the closedness checking of pattern X is stopped and X is declared as an unclosed pattern.

Fig. 5. Closeness checking of ad.
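The closedness test then reads (continuing the sketch above):

    def is_closed_bits(item_indexes, n_items, bitrows):
        v = col_and_vector(item_indexes, bitrows)
        for k in range(n_items):
            if k in item_indexes:
                continue
            if v & column(k, bitrows) == v:   # item k occurs in every supporting row
                return False                  # a super-pattern has the same support
        return True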

3. Vertical bottom-up mining

In this section we first explain the horizontal and vertical search strategies and describe which one suits which kind of dataset. Then we consider the bottom-up and top-down searches for vertical datasets and compare their advantages and drawbacks.

3.1. Horizontal vs. vertical

There are two different ways to construct the search tree of a dataset. In the first way, we consider the set of all items (columns) used in the dataset and find the number of transactions (rows) of the dataset which contain the different combinations of those items. We refer to this approach as the horizontal method and call its related algorithms horizontal algorithms. Fig. 6 shows the search space of horizontal algorithms for six items named a, b, e, f, g and h (the frequent items in Fig. 3b).

Fig. 6. Horizontal search tree.

Horizontal algorithms are suitable for mining low dimensional (such as commercial) datasets, which have a small number of items and a large number of transactions. However, in high dimensional (such as bioinformatics) datasets with a large number of items, horizontal algorithms cannot perform the mining process efficiently, because the mining process is exponential in the number of items. Vertical algorithms are a good solution to this problem: since the number of rows in high dimensional datasets is often small, we can construct a search tree based on the combinations of row numbers. We can construct a bottom-up or a top-down search tree based on different combinations of rows.

3.1.1. Bottom-up vs. top-down

High dimensional datasets have a small number of rows and a large number of items. This characteristic helps us reduce the number of tests in the pattern mining process by constructing a vertical search strategy.
There are two vertical search strategies. The first is the bottom-up search strategy, which begins from the smallest rowsets; at each level the rowsets grow by one member. The first level of a bottom-up search tree (the root) is null. The second level has m nodes (m is the number of rows in the dataset), each of which represents a row-id. In the third level, the children of node x are constructed by combining x with one of the row-ids that is greater than x; so, for each y > x, node xy is a child of node x. The nodes of the other levels are constructed similarly: for each z > y, node xyz is a child of node xy, and so on. Fig. 7 shows a bottom-up search tree for a dataset with six rows, in which each row id is an integer between 1 and 6 (the dataset of Fig. 3).

The second search strategy is the top-down search strategy. It begins from the largest rowset and proceeds to the smaller rowsets. Fig. 8 shows the top-down search tree of the dataset.

Both top-down and bottom-up search strategies can be suitable for vertical mining methods; based on their advantages and drawbacks, applications can use them in different problem situations.
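The bottom-up enumeration can be sketched as a short recursion (our illustration; a node's children extend its rowset only with larger row ids, exactly as described above):

    # Depth-first walk of the bottom-up rowset tree over rows {1..m}.
    def expand(rowset, m):
        yield rowset
        last = rowset[-1] if rowset else 0
        for r in range(last + 1, m + 1):
            yield from expand(rowset + (r,), m)

    # list(expand((), 3)) yields (), (1,), (1,2), (1,2,3), (1,3), (2,), (2,3), (3,)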

4. Colossal itemset mining

Based on the bottom-up search strategy, we present a mining method to extract colossal patterns from high dimensional datasets.

Fig. 7. Bottom up search.


Fig. 8. Top down search.

4.1. Bit-wise vertical mining

The vertical search tree is constructed over rowsets. To determine the frequency of an itemset, we should check whether the rowset which includes all the rows that contain the itemset, and no other row, has at least minsup rows in it. Therefore we can say that only the rowsets which contain no fewer than minsup rows construct frequent itemsets. The union of all the frequent itemsets corresponding to a node constructs one and only one closed frequent itemset, which is the node's corresponding frequent closed itemset.

Let r be a rowset with k rows, named r1, r2, . . . , rk. Each itemset X which occurs in all ri (1 ≤ i ≤ k) is a corresponding itemset of r, and this corresponding itemset is frequent if and only if k ≥ minsup. Since the support of every such corresponding itemset X is k in this example, the largest corresponding itemset (which is therefore the union of all the corresponding itemsets) is the closed corresponding itemset.

We can divide all the frequent itemsets of a dataset based on their support as follows: the first category contains the itemsets with support m (where m is the number of rows of the dataset), the second category contains the itemsets with support m − 1, . . ., and the last category contains the itemsets with support minsup. Since each frequent itemset has a support no less than minsup and no more than m (the number of rows), each itemset belongs to one and only one category. By constructing a top-down or bottom-up search tree of rowsets, we can have all the combinations of rows and use the combinations with no fewer than minsup rows to mine the frequent (closed) itemsets.

Based on the bit matrix representation, we say a rowset r with k rows contains itemset X if and only if, for each item b of itemset X, the column b of all rows of rowset r is set to 1. In other words, if we apply AND to the rows of the rowset r, the resulting bit vector is 1 in the columns corresponding to all items of X.

For high dimensional datasets we present two new vertical mining methods, which explore the bottom-up and top-down search trees, respectively.

4.2. Bottom-up vertical colossal pattern mining

The bottom-up search tree provides the facility of using a parent's bit vector to form a child's bit vector and to determine whether the corresponding pattern is frequent or not.

In the vertical search tree, each node is a rowset, and its corresponding itemsets are those which occur in all rows of this rowset. In the bit-wise approach, we use a bit vector, which is the result of performing the AND operation on the bit vectors of all rows of the rowset, to determine the corresponding pattern. In the bottom-up approach we can use the bit vector of a rowset to form its children's bit vectors. Fig. 9 shows an example of using the bit matrix for bottom-up mining; in this figure we use the bit matrix of Fig. 3a to show the process. We use this facility to construct our bottom-up vertical search tree and use it for our bottom-up vertical colossal pattern mining method.

Fig. 9. Bottom up mining.
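In bit-matrix terms, the closed corresponding itemset of a rowset is simply the AND of its rows' bit strings (a sketch with our naming):

    def pattern_of_rowset(rowset, bitrows):
        # 1-bits of the result mark the items shared by all rows of the rowset.
        v = bitrows[rowset[0] - 1]            # row ids are 1-based
        for r in rowset[1:]:
            v &= bitrows[r - 1]
        return v

In the paper's scheme, the support assigned to this pattern is simply the number of rows in the rowset, and the pattern is frequent iff that number is at least minsup.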

The main idea of our method is based on a simple lemma: for rowsets x and y, if x ⊆ y then |Pattern(x)| ≥ |Pattern(y)|, where Pattern(r) is the set of all items which belong to all rows of rowset r. The proof is straightforward. Assume x = x1x2 . . . xn and y = x1x2 . . . xny1y2 . . . ym. An item 'a' belongs to Pattern(x) if for each i (1 ≤ i ≤ n) row xi includes 'a'. Now it is possible that there exists a yj (1 ≤ j ≤ m) such that yj does not contain 'a'; therefore there are some (or zero) items such as 'a' which belong to Pattern(x) but do not belong to Pattern(y). On the other hand, for each item 'a', if 'a' belongs to Pattern(y), then all rows xi (1 ≤ i ≤ n) and all rows yj (1 ≤ j ≤ m) include 'a', and so 'a' belongs to Pattern(x). Therefore Pattern(y) is a subset of Pattern(x), while Pattern(x) need not be a subset of Pattern(y), and so |Pattern(x)| ≥ |Pattern(y)|.
The direct result of this lemma for the vertical bottom-up tree is that the size of the corresponding pattern of a node in the vertical bottom-up search tree is never less than the size of any of its children's corresponding patterns. So, in each branch of this tree, the size of the patterns produced at level i of the branch is no less than the size of the patterns produced at level j (j > i) of the branch. Since the first level of the tree which produces a frequent pattern is level minsup, we can conclude that the greatest pattern of each branch of the vertical bottom-up tree is produced at level minsup. So if our method explores the tree only down to the minsup level, it can find all the colossal patterns of the dataset. Therefore our Bit-wise Vertical Bottom Up Colossal pattern mining (BVBUC) algorithm searches down to the minsup level of the tree, produces the corresponding patterns of the nodes of this level, and prunes their children. So we need to construct the vertical bottom-up search tree only to the minsup level.

Fig. 10 shows this minsup-level pruned tree for the dataset of Fig. 1. Based on this tree we can construct a pruned bit-wise vertical bottom-up tree which is limited to the minsup level. Fig. 11 shows the first level of this bit-wise tree for the bit matrix of Fig. 2.

Fig. 10. Pruned vertical bottom up tree.

Fig. 11. Level 1 of minsup-level pruned tree for the dataset of Fig. 1.

We can expand each node of this tree, constructing its children, and continue the expansion to level minsup. Fig. 12 shows the node corresponding to rowset {1} and its children, each of which is constructed by applying the AND operation to the bit vector of the parent node and the bit vector of the row that is inserted into the parent's rowset in this node. For example, the bit vector of node {1, 4} is constructed by applying AND to the two bit strings 11101011 and 11111110, which are the corresponding bit vectors of rows 1 and 4, respectively.

Fig. 12. Node {1} and its children.
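As a concrete check (a sketch using Python integer literals for the two bit strings quoted above, assuming the leftmost column is the highest bit):

    row1 = 0b11101011                 # bit vector of row 1
    row4 = 0b11111110                 # bit vector of row 4
    node_14 = row1 & row4             # bit vector of node {1, 4}
    print(format(node_14, "08b"))     # -> 11101010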
We can see the corresponding bit vector in each node. The children of each child (the children of rowsets {1, 2}, {1, 3}, {1, 4}, and {1, 5}) are shown in Figs. 13–16, respectively.

Rowset {1, 2} has four children, as shown in Fig. 13. Each child of rowset {1, 2} contains three row ids, and since minsup is 3, this is the last level of expansion in branch {1, 2}. The corresponding bit vector of rowset {1, 2} is combined by the AND operation with the corresponding bit vectors of rows {3}, {4}, {5}, and {6}, respectively, to construct the children's corresponding bit vectors, which produce the first set of output patterns. Now the output patterns can easily be checked to specify whether they are colossal, and the colossal ones are inserted into the output file.

Fig. 13. Node {1, 2} and its children.

Similarly, rowset {1, 3} constructs its three children and produces the corresponding patterns, as shown in Fig. 14; egh, bg, and beg are three new patterns produced in the branch of rowset {1, 3}.

Fig. 14. Node {1, 3} and its children.

Fig. 15 shows the two children of rowset {1, 4}; these two children produce the two patterns abg and eg. Finally, we can see in Fig. 16 the rowset {1, 5} and its only child, which produces the output pattern g.

Fig. 15. Node {1, 4} and its children.

Fig. 16. Node {1, 5} and its child.

The node of rowset {1, 6} has no child and so is not expanded. If we could know that this node has no child (and so that its branch does not reach the minsup level and does not produce any frequent pattern) when we expand node {1}, we would prune it and not construct it at all. Pruning this node and similar ones dramatically reduces the size of the tree, and the time complexity of constructing it, especially when minsup is large.

The BVBUC method prunes all branches which will not reach level minsup. To this aim, when it starts constructing a branch at level 1, the method calculates the maximum level number which the branch can reach. Based on the vertical bottom-up tree construction method, to expand a node we insert into its rowset one of the row ids that come after the maximum row id of the rowset, and so construct the corresponding rowsets of its children. So if a node in the tree of a dataset with m rows contains row id m, it cannot have any child. Therefore, if a node in level 1 of the tree contains m (the node corresponding to {m}) and minsup is greater than 1, then the node should not be constructed. If minsup is greater than 2, then the nodes in level 2 which contain m, and hence the nodes in level 1 which contain m − 1 or m, should not be constructed. For minsup > 3, only the nodes of level 1 which contain neither row m − 2 nor any later row can be constructed. Therefore, for a given minsup, only the first m − minsup + 1 nodes of level 1 are expanded.

Similarly to the above discussion, we can prove that many of the branches in the next levels can be pruned, because they will never reach the minsup level. Assume q is a node of the tree in level p. Since each node in level p of our vertical bottom-up tree has exactly p row ids, we assume node q corresponds to rowset a1a2 . . . ap, where each ai (1 ≤ i ≤ p) is a row id and, for each i < j, ai < aj. We can construct all children of node q by inserting the row ids ap + 1 to m into the rowset of q (the children's corresponding rowsets will be a1a2 . . . apap+1, a1a2 . . . apap+2, . . . , and a1a2 . . . apm). Therefore the sub-tree which starts from q (the sub-tree with root q) has m − ap levels, and the largest branch of the tree which contains node q has p nodes before q and m − ap nodes after it, so it has p + m − ap levels. So we can say: "for each node q of the vertical bottom-up tree, the largest branch of the tree which contains node q has p + m − ap levels, where p is the level of q in the tree, m is the number of the dataset's rows, and ap is the maximum row id which exists in q". For example, in a dataset with six rows, the largest branch which contains node {1, 3} has five levels, because m is 6, p is 2, and ap is 3, so the branch has p + m − ap = 2 + 6 − 3 = 5 levels. You can see this in Fig. 7.

We can see in Fig. 17, which shows the sub-tree of node {2} down to the minsup level, that the node {2, 6} is not constructed: when we want to expand node {2}, for each child we first calculate the number of levels of the largest branch which contains this child, and if this number is less than minsup, the child is not constructed. For node {2, 6} in this tree, m = 6, p = 2, and ap = 6, so the largest branch which contains this node has two levels. If minsup is greater than 2 (as in this example, in which minsup is 3), the node {2, 6} and its children (if any) are pruned.

Fig. 17. Sub-branch of node 2.

Fig. 18 shows that the nodes corresponding to rowsets {6} and {5} (and its child {5, 6}) in level 1, and the nodes corresponding to rowsets {3, 6} and {4, 6} in level 2 of the tree, are pruned.

Fig. 18. Sub-branches of nodes 3 and 4.

We can apply another pruning to the tree when we want to mine colossal patterns which have more than a user-specified number of items. When we have a minimum acceptable number of items in a colossal pattern, we can stop expanding a node if the number of items of its corresponding pattern is less than that minimum threshold. This user-specified threshold helps us prune a branch whenever we find out that its corresponding pattern cannot have enough items to be a colossal pattern. For example, if the threshold is 100 and minsup is 25, and a node with 85 items in its corresponding pattern is created at the third level of the tree, we can prune its branch, because we know it does not have any child with at least 100 items in its corresponding pattern.
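The pruning test used here can be written as a one-line helper (our naming):

    def branch_can_reach_minsup(p, m, a_p, minsup):
        # The largest branch through a node at level p whose maximum row id
        # is a_p has p + m - a_p levels.
        return p + m - a_p >= minsup

    # Examples from the text (m = 6, minsup = 3): node {1, 3} gives
    # 2 + 6 - 3 = 5 >= 3 (kept); node {2, 6} gives 2 + 6 - 6 = 2 < 3 (pruned).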

5. Algorithm

In this section we present the algorithm of our bit-wise vertical bottom-up method and describe its operation.


Fig. 19. BVBUC algorithm.

In this algorithm the user-specified threshold is a global variable: the minimum acceptable number of items which can exist in a colossal pattern. Furthermore, minsup and the main bit matrix are global too and are accessible in the algorithm.

5.1. Bit-wise vertical bottom up colossal (BVBUC)

Fig. 19 shows the bit-wise vertical bottom-up colossal mining algorithm. The input parameters of this algorithm are: m, the first row id of the parent node's rowset; max, the number of rows of the bit matrix; l, the current processing level of the tree; S, the input rowset of the algorithm; and minsup, the (user-defined) minimum support of patterns.

This is a recursive algorithm and is constructed from two main parts. In the first part (the main if block of the algorithm), the algorithm checks whether it has arrived at level minsup. If the current level of the tree is the minsup level, then the algorithm stops the expansion of the current branch. In this case, only the corresponding pattern of this node is calculated, and if it is a new colossal pattern (based on the user-specified threshold) which does not exist in the output file, the corresponding pattern and its support are written to the output file. Pattern(S) is a simple function which calculates the corresponding pattern of rowset S by applying "AND" to the bit vectors of the row-ids in S and selecting the items which have value 1 in the resulting "AND" vector. The number of row ids of rowset S is the support of Pattern(S), which is returned by the function Support(S).

In the second part (the else block), first m is inserted into rowset S and the corresponding rowset of this node is constructed. Based on the user-specified threshold, if the pattern is not colossal, the algorithm stops expanding this branch of the tree; the mining process continues only when Pattern(S) is colossal. In this case, for each row-id from m + 1 to max, the algorithm first tests whether the branch of the corresponding child can arrive at the minsup level; if it can, BVBUC is called recursively to construct the patterns corresponding to the children nodes' rowsets, by expanding this node and constructing its children. The condition of the if statement in the loop of the algorithm is equivalent to the condition p + m − ap ≥ minsup, which was explained in Section 4.2.

As we explained in Section 1, a good way to reduce the output size of pattern mining algorithms is closed pattern mining. We described in Fig. 5 how we can specify the closedness of a pattern using its bit vector. Therefore, using the closedness specification method, we can present a new version of BVBUC which mines the closed colossal frequent patterns of datasets. To obtain the closed version of BVBUC, the first part (the main if-block) of the algorithm changes as follows:

    If (l = minsup) then
    Begin
      If (Pattern (S) is colossal) then
        If (Pattern (S) is not in file) then
          If (Pattern (S) is closed) then
            Output (File, (Pattern (S), Support (S)));
    End

The only change between the closed version of BVBUC and the primary version is the condition which checks the closedness of patterns before adding them to the output file. This change makes the algorithm mine closed colossal frequent patterns.
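For illustration, the whole recursion can be sketched in a few lines (a minimal re-implementation of the description above and of Fig. 19, not the authors' code; it reuses pattern_of_rowset and branch_can_reach_minsup from the earlier sketches, and "colossal" means at least threshold items):

    output = {}                            # pattern bits -> support

    def bvbuc(m, max_rows, l, S, minsup, threshold, bitrows):
        S = S + [m]                        # insert row id m into the rowset
        bits = pattern_of_rowset(S, bitrows)
        if l == minsup:                    # first part: emit and stop expanding
            if bin(bits).count("1") >= threshold and bits not in output:
                output[bits] = len(S)      # support = number of row ids in S
        else:                              # second part: expand the children
            if bin(bits).count("1") < threshold:
                return                     # patterns only shrink further down
            for r in range(m + 1, max_rows + 1):
                if branch_can_reach_minsup(l + 1, max_rows, r, minsup):
                    bvbuc(r, max_rows, l + 1, S, minsup, threshold, bitrows)

    def mine(bitrows, minsup, threshold):
        m = len(bitrows)
        for r in range(1, m - minsup + 2): # only the first m - minsup + 1 roots
            bvbuc(r, m, 1, [], minsup, threshold, bitrows)
        return output

The closed version would additionally run the closedness test of Section 2.2 on each pattern before writing it out.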

6. Experimental results

In this section we study the performance of the three algorithms. All our experiments were performed on a computer with a 2.2 GHz Core 2 Duo CPU, 2 GB RAM and a 120 GB hard disk. All the reported runtimes include both computation time and IO time. TDClose [18] has already shown better performance than other column enumeration and row enumeration based algorithms such as CARPENTER [29], CLOSET+ [12], CHARM [11], and FPClose [13], so Pattern-Fusion compares its method only with TDClose and FPClose; FPClose, a column enumeration based algorithm, was selected because TDClose is row enumeration based. CPM does not compare its results with core fusion or any other colossal pattern mining method; it only compares its algorithm's efficiency with CLOSET+. We implemented both the core fusion and CPM algorithms and compare our method with them.

Because of the varying sizes of the datasets used for the performance study, we do not set the minsup threshold to an absolute number; instead, minsup is determined as a percentage of the number of transactions. We use five real standard datasets from FIMI [26] to compare the algorithms. Fig. 20 shows some statistical information about the datasets used for the performance study.

Fig. 20. Characteristics of test datasets.
Figs. 21–25 show the results of running the three algorithms, BVBUC (our algorithm), CPM, and Core-Fusion (core pattern fusion), on the real standard datasets. We can see that, as the minimum support increases, the performance of all the algorithms on all the datasets decreases.

Often in pattern mining algorithms a small minsup gives worse results than a greater minsup, because with a smaller minsup the mining algorithm finds more frequent items, and as the number of frequent items grows, the size of the search tree, and therefore the mining time, increases dramatically. However, in BVBUC, when minsup decreases, the number of tree levels that must be created decreases too. Therefore, when minsup is small, BVBUC obtains a very large efficiency improvement; as Figs. 21–25 show, the difference in efficiency between BVBUC and core fusion and CPM is very large when minsup is small.

Fig. 21 shows that CPM and core fusion have similar time complexity in mining the rather small dataset mushroom. Even on this small dataset, BVBUC is much better than the two former algorithms, especially when minsup is low. With a great minsup there is a very small set of frequent items, and the different algorithms have similar efficiency in mining patterns.

Fig. 21. Dataset mushroom.
The larger datasets, such as pumsb, accident and retail, behave very similarly to the mushroom dataset; BVBUC again substantially outperforms CPM and core fusion, as shown in Figs. 22–24, respectively. Fig. 25 shows that, in large datasets with a huge number of rows and items, core fusion is much better than CPM, and its efficiency is near that of our algorithm.

Fig. 22. Dataset pumsb.

Fig. 23. Dataset accident.

Fig. 24. Dataset retail.

Fig. 25. Dataset kosarak.

We also ran the algorithms on two biological datasets to compare them on microarray gene expression data. Figs. 26 and 27 show the running times of the three methods on a lung cancer (LC) dataset (http://www.chestsurg.org) and a prostate cancer (PC) dataset (http://www-genome.wi.mit.edu/mpr/prostate), respectively. There are 181 tissue samples (rows) in LC, and each sample is described by the activity level of 12,533 genes (items); PC has 102 rows and 12,600 items. The results show that BVBUC outperforms the other algorithms on microarray datasets just as it does on the standard datasets.

Fig. 26. Dataset LC.

Fig. 27. Dataset PC.


7. Conclusion

According to the apriori property, any subset of a frequent itemset is frequent. This downward closure property leads to an explosive number of patterns. The introduction of closed frequent itemsets and maximal frequent itemsets can partially alleviate this redundancy problem. However, in all studies on frequent itemset mining (closed, maximal, or the complete set of frequent itemsets), small and mid-sized itemsets are mined, while in many applications only large itemsets (colossal itemsets) are useful. Colossal pattern mining is a relatively new approach which reduces the process time of frequent pattern mining.

In this paper we presented a new algorithm (BVBUC) to mine colossal frequent patterns in very large datasets. The datasets can be large horizontally or vertically, depending on their data types. In our algorithm we use a bit matrix to compress the dataset and make it easy to use in the mining algorithm. We use a vertical bottom-up approach to enable the algorithm to mine frequent itemsets starting from the largest one; the first extracted itemset is the largest one in this approach. We also use several pruning rules to improve the efficiency of the algorithm. Our experimental results show that our algorithm attains very good mining efficiency on various input datasets. Furthermore, our performance study shows that this algorithm substantially outperforms the best previously developed algorithms.

References

[1] R. Agrawal, T. Imielinski, A. Swami, Mining association rules between sets of items in large databases, in: Proceedings of the 1993 ACM-SIGMOD International Conference on Management of Data (SIGMOD'93), Washington, DC, 1993, pp. 207–216.
[2] R. Agrawal, R. Srikant, Fast algorithms for mining association rules, in: Proceedings of the 1994 International Conference on Very Large Data Bases (VLDB'94), Santiago, Chile, September 1994, pp. 487–499.
[3] H. Mannila, H. Toivonen, A.I. Verkamo, Efficient algorithms for discovering association rules, in: KDD'94, Seattle, WA, 1994, pp. 181–192.
[4] J. Han, J. Pei, Y. Yin, Mining frequent patterns without candidate generation, in: Proceedings of the 2000 ACM-SIGMOD International Conference on Management of Data (SIGMOD'00), Dallas, TX, 2000, pp. 1–12.
[5] R. Agarwal, C.C. Aggarwal, V.V.V. Prasad, A tree projection algorithm for generation of frequent itemsets, J. Parallel Distrib. Comput. 61 (2001) 350–371.
[6] J. Liu, Y. Pan, K. Wang, J. Han, Mining frequent item sets by opportunistic projection, in: Proceedings of the 2002 ACM SIGKDD International Conference on Knowledge Discovery in Databases (KDD'02), Edmonton, Canada, 2002, pp. 239–248.
[7] G. Grahne, J. Zhu, Efficiently using prefix-trees in mining frequent itemsets, in: Proceedings of the ICDM'03 International Workshop on Frequent Itemset Mining Implementations (FIMI'03), Melbourne, FL, 2003, pp. 123–132.
[8] M.J. Zaki, Scalable algorithms for association mining, IEEE Trans. Knowl. Data Eng. 12 (2000) 339–372.
[9] N. Pasquier, Y. Bastide, R. Taouil, L. Lakhal, Discovering frequent closed itemsets for association rules, in: Proceedings of the 7th International Conference on Database Theory (ICDT'99), Israel, 1999, pp. 398–416.
[10] J. Pei, J. Han, R. Mao, CLOSET: an effective algorithm for mining frequent closed itemsets, in: Proceedings of the 2000 ACM-SIGMOD International Workshop on Data Mining and Knowledge Discovery (DMKD'00), Dallas, TX, 2000, pp. 11–20.
[11] M.J. Zaki, C.J. Hsiao, CHARM: an efficient algorithm for closed itemset mining, in: Proceedings of the 2002 SIAM International Conference on Data Mining (SDM'02), Arlington, VA, 2002, pp. 457–473.
[12] J. Wang, J. Han, J. Pei, CLOSET+: searching for the best strategies for mining frequent closed itemsets, in: Proceedings of the 2003 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'03), Washington, DC, 2003, pp. 236–245.
[13] G. Grahne, J. Zhu, Efficiently using prefix-trees in mining frequent itemsets, in: Proceedings of the ICDM'03 International Workshop on Frequent Itemset Mining Implementations (FIMI'03), Melbourne, FL, 2003, pp. 123–132.
[14] R.J. Bayardo, Efficiently mining long patterns from databases, in: Proceedings of the 1998 ACM-SIGMOD International Conference on Management of Data (SIGMOD'98), Seattle, WA, 1998, pp. 85–93.
[15] D. Burdick, M. Calimlim, J. Gehrke, MAFIA: a maximal frequent itemset algorithm for transactional databases, in: Proceedings of the 2001 International Conference on Data Engineering (ICDE'01), Heidelberg, Germany, 2001, pp. 443–452.
[16] W. Song, B. Yang, Z. Xu, Index-BitTableFI: an improved algorithm for mining frequent itemsets, Knowl.-Based Syst. 21 (2008) 507–513.
[17] B. Nair, A.K. Tripathy, Accelerating closed frequent itemset mining by elimination of null transactions, Int. J. Emerg. Trends Comput. Inform. Sci. (2011) 317–324.
[18] H. Liu, J. Han, D. Xin, Z. Shao, Mining frequent patterns on very high dimensional data: a top-down row enumeration approach, in: Proceedings of the 2006 SIAM International Conference on Data Mining (SDM'06), Bethesda, MD, 2006, pp. 280–291.
[19] J. Dong, M. Han, BitTableFI: an efficient mining frequent itemsets algorithm, Knowl.-Based Syst. 20 (4) (2007) 329–335.
[20] F. Zhu, X. Yan, J. Han, P. Yu, H. Cheng, Mining colossal frequent patterns by core pattern fusion, in: Proceedings of the 2007 Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2007.
[21] M. Dabbiru, M. Shashi, An efficient approach to colossal pattern mining, Int. J. Comput. Sci. Network Secur. (IJCSNS) 6 (2010) 304–312.
[22] T.Y. Lin, E. Louie, Finding association rules using fast bit computation: machine-oriented modeling, in: Lecture Notes in Computer Science, vol. 1932, Springer, Berlin, 2000, pp. 247–258.
[23] T.Y. Lin, X. Hu, E. Louie, A fast association rule algorithm based on bitmap and granular computing, in: Proceedings of the 12th IEEE International Conference on Fuzzy Systems, 2003, pp. 678–683.
[24] B. Schlegel, R. Gemulla, W. Lehner, Memory-efficient frequent itemset mining, in: Proceedings of the 14th International Conference on Extending Database Technology (EDBT), 2011.
[25] U. Yun, Mining lossless closed frequent patterns with weight constraints, Knowl.-Based Syst. 20 (1) (2007) 86–97.
[26] Frequent Itemset Mining Implementations Repository, <http://fimi.cs.helsinki.fi/>.
[27] C. Borgelt, X. Yang, R. Nogales-Cadenas, P. Carmona-Saez, A.D. Pascual-Montano, Finding closed frequent item sets by intersecting transactions, in: Proceedings of the 14th International Conference on Extending Database Technology (EDBT), Uppsala, Sweden, 2011, pp. 367–376.
[28] Y.J. Tsay, J.Y. Chiang, CBAR: an efficient method for mining association rules, Knowl.-Based Syst. 18 (2005) 99–105.
[29] F. Pan, G. Cong, A.K.H. Tung, J. Yang, M. Zaki, CARPENTER: finding closed patterns in long biological datasets, in: Proceedings of the 2003 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'03), Washington, DC, 2003, pp. 637–642.
