Knowledge-Based Systems
journal homepage: www.elsevier.com/locate/knosys
Article info

Article history:
Received 18 June 2011
Received in revised form 4 March 2012
Accepted 4 March 2012
Available online 15 March 2012

Keywords:
Frequent patterns mining
Colossal patterns
Bottom-up mining
Bit matrix
High dimensional dataset

Abstract

Frequent pattern mining is an important data mining problem that has been studied extensively over the last decade. A large number of algorithms have been developed for frequent pattern mining on traditional commercial datasets, which usually contain a huge number of transactions with a small number of items in each transaction. The advent of bioinformatics contributed to the development of a new form of dataset, called high dimensional, which is characterized by a small number of transactions and a large number of items in each transaction. The running time of traditional algorithms increases exponentially with increasing average transaction length, so these algorithms are not suitable for high dimensional datasets. Moreover, mining algorithms on high dimensional datasets produce a very large output set that includes small and mid-sized frequent patterns, which do not bear any useful information for scientists. Colossal pattern mining has been proposed as a solution that reduces the size of the output set. Because colossal pattern mining algorithms ignore the small and mid-sized patterns, the mining process is faster, and only very large (colossal) patterns are extracted. In this paper we present an efficient vertical bottom-up method for mining frequent colossal patterns in high dimensional datasets. Our algorithm uses a bit matrix to compress the dataset and make it easy to use in the mining process. Our experimental results show that the algorithm attains very good mining efficiency on various input datasets. Furthermore, our performance study shows that it substantially outperforms the best former algorithms.
© 2012 Elsevier B.V. All rights reserved.
1. Introduction

A fundamental problem in mining association rules is mining frequent itemsets. Frequent pattern mining has numerous applications, including analysis of customer purchase patterns, analysis of web access patterns, the investigation of scientific or medical processes, and the analysis of DNA sequences. There are several types of frequent pattern mining. The main types are frequent itemset mining, sequential pattern mining, and graph mining. In this paper we concentrate on itemsets and, for simplicity, use the terms "pattern" and "itemset" interchangeably.

Many efficient frequent pattern mining algorithms have been proposed in the literature [1–8,19,25,27,28]. These algorithms can be classified into three main categories. The first category includes the apriori-like algorithms, which use candidate generation and test to find the frequent patterns [1–3]. The huge number of candidate sets and the repeated scans of the dataset needed to check the frequency of candidates are the two main problems of these algorithms, and they led researchers to FP-Growth [4]. FP-Growth and its successors [5–7,24] form the second category of mining algorithms. They use a data structure called the FP-tree to compress the dataset and search it without candidate generation and without multiple scans of the dataset. The third category, called vertical algorithms, works with data presented in vertical data format [8]. In usual datasets, transactions are in horizontal data format: each transaction (row) has an id and a set of items (columns) that occur in it. In vertical data format, in contrast, each row of the dataset has an item and the set of ids of the transactions that contain this item. These algorithms are used when the dataset has a small number of transactions, each of which contains a large number of items.

Based on the requested output, pattern mining algorithms take two different approaches to mining frequent patterns: some of them try to find the complete set of all frequent patterns [1,4,8], and the others discover a compressed set of patterns [9–12,14]. The major challenge in mining frequent patterns is that all the sub-patterns of a frequent pattern are frequent, and each pattern has an exponential number of sub-patterns, which generates a huge number of frequent patterns as the result. To overcome this problem, closed frequent pattern mining [9–13,17,25,27] and maximal frequent pattern (max-pattern) mining [14,15] were proposed.

* Corresponding author. Fax: +98 21 66413969/64542744.
E-mail addresses: Amir_sohraby@yahoo.com (M.K. Sohrabi), Ahmad@aut.ac.ir (A.A. Barforoush).
0950-7051/$ - see front matter © 2012 Elsevier B.V. All rights reserved.
http://dx.doi.org/10.1016/j.knosys.2012.03.003
42 M.K. Sohrabi, A.A. Barforoush / Knowledge-Based Systems 33 (2012) 41–52
The closed pattern set is a compact set of frequent patterns that retains all the information of the complete pattern set. A pattern is closed frequent in a dataset if it is frequent in the dataset and no super-pattern of it has the same support in the dataset. A pattern is maximal frequent in a dataset if it is frequent in the dataset and no super-pattern of it is frequent in the dataset. Although the set of max-patterns is more compact than the set of closed frequent patterns, almost all important research on compressing the set of frequent patterns has focused on closed frequent pattern mining, since the set of maximal frequent patterns does not usually contain the complete support information of its corresponding frequent patterns.

Although closed pattern mining greatly reduces the amount of computation and the output volume of the mining process, pattern mining remains very time and space consuming for high dimensional datasets. Our experiments show that for a dataset with 100 rows, each containing 1000 items, none of the existing horizontal algorithms can finish closed pattern mining when the threshold is set low. In addition, the output of mining algorithms includes a huge number of small and mid-sized patterns, which often carry no suitable information for many applications. Since in many applications only large patterns carry applicable information, it is a good idea to find mining methods that extract only large patterns without mining the small and mid-sized ones. Mining colossal patterns without mining smaller patterns is a rather new approach to this issue.

All former pattern mining methods (complete, closed, or maximal frequent pattern mining methods) follow a bottom-up approach to find patterns. The mining process in these methods starts from small patterns and proceeds to larger ones. However, in some applications, small or medium-sized patterns often carry no useful information, and useful information can be gained only from very large patterns, called colossal patterns.

The first serious algorithm for mining colossal patterns was the core pattern fusion (core-fusion) algorithm, presented in 2007 [20]. That paper presents a core pattern-fusion method that can give a good approximation. First, it shows why pattern-fusion's mining result favors colossal patterns over smaller ones, and then it explores how pattern-fusion gives a good approximation by catching the outliers in the complete answer. The main idea of the method is that pattern-fusion merges all the small sub-patterns of a large pattern in one step instead of expanding patterns with additional single items. This allows pattern-fusion to circumvent mid-sized patterns and progress on a path leading to a potential colossal pattern.

This work was followed by other, more efficient colossal pattern mining methods, such as Colossal Pattern Miner (CPM) in 2010 [21]. CPM picks the seed pattern in an intelligent way instead of picking it randomly. The method separates the sub-patterns of overlapping colossal patterns based on their frequencies, which facilitates leaping through the enormous number of mid-sized patterns.

The present work is motivated by the aim of developing a more efficient strategy for mining colossal patterns using a bit-wise vertical bottom-up mining approach. A vertical bottom-up search method is designed to mine the rowsets of a dataset, and a new algorithm is presented that combines this bottom-up search method with the bit-wise approach to mine the frequent colossal patterns from high dimensional datasets efficiently.

In this paper, mining approaches that use bitmap techniques to find the frequent patterns are called bit-wise approaches.

Recently, many attempts have been made to apply bitmap techniques in frequent pattern mining algorithms [15,16,22–24]. MAFIA [15] is a maximal pattern mining method that uses a vertical bitmap representation for support counting and pruning mechanisms for searching the itemset lattice.

Index-BitTableFI [16] uses a BitTable, an index array, and the corresponding computing method to mine frequent itemsets. In this method, frequent itemsets are identified quickly by using breadth-first search.

In [22], a granule is defined as a collection of entities that have the same property. The relational model that uses these granules as attribute values is called the machine-oriented data model. The model transforms data mining, particularly finding association rules, into Boolean operations. Bit-AssoRule is a fast association rule algorithm based on granular computing, presented in [23]. It avoids the generation-and-test strategy of the apriori algorithm. Using bitmap techniques, a candidate is a large itemset if the bit count of the intersection of all its bitmaps is equal to or greater than the minimal count. The bit count is the number of 1's in the bitmap that results from intersecting the bitmaps.

CFP-Tree and CFP-Array [24] are two novel data structures that use lightweight compression techniques to reduce the memory consumption of FP-tree based algorithms by an order of magnitude. CFP-Tree exploits a combination of structural changes to the FP-Tree and bitmap techniques.

In this study, the bit-wise method is used to make the items and patterns easy to use in a vertical bottom-up search for colossal patterns.

The remainder of the paper is organized as follows. In Section 2, we present some preliminaries and the mining task and then, with an example, describe our bit matrix representation. In Section 3, we explain different search strategies for mining datasets. The new colossal pattern mining method based on the bit matrix representation is explained in Section 4. We present the new algorithm in Section 5 and conduct an experimental study in Section 6. Finally, we conclude the study in Section 7.

2. Problem definition

In the first subsection of this section we present some preliminaries, and in the second one we describe the bit matrix representation and the mining task by using an example.

2.1. Preliminaries

Let $I = \{i_1, i_2, \ldots, i_n\}$ be a set of items, also called features (or columns). The dataset $D$ consists of a set of rows (also called transactions) $R = \{r_1, r_2, \ldots, r_m\}$, where each row $r_i$ is a set of items with an id called Rid or Tid. Given a set of items $X \subseteq I$, we define the support set, denoted $D(X) \subseteq R$, as the maximal set of rows that contain $X$. The number of rows in the dataset that contain $X$ is called the support of $X$. By definition, the support of $X$ is given as $\mathrm{Sup}(X) = |D(X)|$. An itemset $X$ is called a frequent itemset or frequent pattern if $\mathrm{Sup}(X) \geq \mathit{minsup}$, where minsup is a user-specified lower support threshold. In other words, a pattern is frequent if the number of transactions that contain it is not less than the user-specified threshold.

A set of items $X \subseteq I$ is called a closed pattern if there exists no $Y$ such that $Y \supset X$ and $\mathrm{Sup}(X) = \mathrm{Sup}(Y)$, i.e., there is no proper superset of $X$ with the same support. Put another way, the row set that contains the superset $Y$ must not be exactly the same as the row set that contains the set $X$. An itemset $X$ is called a frequent closed pattern if it is closed and frequent.

Fig. 1 shows an example of a dataset in which the items are represented by letters of the alphabet.
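The definitions of support set, support, frequent pattern, and closed pattern can be illustrated with a minimal sketch. The toy dataset and the function names below are hypothetical and chosen only for illustration; they are not the dataset of Fig. 1 or code from the paper.

```python
# Hypothetical toy dataset (NOT the dataset of Fig. 1): row id -> set of items.
DATASET = {
    1: {"a", "b", "c"},
    2: {"a", "b", "d"},
    3: {"a", "c"},
    4: {"b", "c"},
}


def support_set(itemset, dataset):
    """D(X): the ids of all rows that contain every item of X."""
    x = set(itemset)
    return {rid for rid, items in dataset.items() if x <= items}


def support(itemset, dataset):
    """Sup(X) = |D(X)|."""
    return len(support_set(itemset, dataset))


def is_frequent(itemset, dataset, minsup):
    """X is frequent if Sup(X) >= minsup."""
    return support(itemset, dataset) >= minsup


def is_closed(itemset, dataset):
    """X is closed if no proper superset of X has the same support.

    Testing supersets with exactly one extra item is enough: if a larger
    superset Y had the same support as X, so would X plus any item of Y.
    """
    x = set(itemset)
    all_items = set().union(*dataset.values())
    sup_x = support(x, dataset)
    return all(support(x | {i}, dataset) < sup_x for i in all_items - x)
```

In this toy dataset, {a, b} is contained in rows 1 and 2 and no superset keeps both rows, so it is closed; {d} occurs only in row 2 together with a and b, so {d} is not closed.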
Fig. 1. A sample of a transaction database (dataset). Fig. 4. ColAndVectors and frequency test.
We can expand each node of this tree, construct its children, and continue the expansion down to level minsup.
Fig. 12 shows the node corresponding to rowset {1} and its children. The bit vector of each child is constructed by ANDing the bit vector of the parent node with the bit vector of the row that is added to the parent's rowset in this child. For example, the bit vector of node {1, 4} is constructed by ANDing the two bit vectors 1 1 1 0 1 0 1 1 and 1 1 1 1 1 1 1 0, which are the bit vectors of rows 1 and 4, respectively.
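As a small worked sketch of this step, the two bit vectors quoted above can be ANDed and decoded back to an itemset. The mapping of the eight columns to items a through h, left to right, is an assumption made here for illustration, and the helper names are hypothetical.

```python
# Bit vectors of rows 1 and 4 as quoted in the text (one bit per item).
ROW_1 = "11101011"
ROW_4 = "11111110"

# Assumption: the eight columns correspond to items a..h, left to right.
ITEMS = "abcdefgh"


def and_vectors(u, v):
    """Bitwise AND of two equal-length bit strings."""
    if len(u) != len(v):
        raise ValueError("bit vectors must have the same length")
    return "".join("1" if a == "1" and b == "1" else "0" for a, b in zip(u, v))


def decode(bits, items=ITEMS):
    """Return the items whose bit is set."""
    return "".join(item for item, bit in zip(items, bits) if bit == "1")


node_1_4 = and_vectors(ROW_1, ROW_4)
print(node_1_4, decode(node_1_4))  # 11101010 abceg
```

Under the assumed column order, the node for rowset {1, 4} carries the itemset shared by rows 1 and 4.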
Each node shows its corresponding bit vector. The children of each child (the children of rowsets {1, 2}, {1, 3}, {1, 4}, and {1, 5}) are shown in Figs. 13–16, respectively.
Rowset {1, 2} has four children, as shown in Fig. 13.
Each child of rowset {1, 2} contains three row ids, and since minsup is 3, this is the last level of expansion in branch {1, 2}. The bit vector of rowset {1, 2} is ANDed with the bit vectors of rows 3, 4, 5, and 6 in turn to construct the bit vectors of the children, which produce the first set of output patterns. The output patterns can then be checked to determine whether they are colossal, and the colossal ones are written to the output file.
Similarly, rowset {1, 3} constructs its three children and produces the corresponding patterns, as shown in Fig. 14. egh, bg, and beg are the three new patterns produced in the branch of rowset {1, 3}.
Fig. 14. Node 13 and its children.
Fig. 15 shows the two children of rowset {1, 4}. These two children produce the two patterns abg and eg.
Finally, Fig. 16 shows rowset {1, 5} and its only child, which produces the output pattern g.
The node of rowset {1, 6} has no child and so is not expanded. If we knew, when expanding node {1}, that this node has no child (so that its branch does not reach level minsup and does not produce any frequent pattern), we could prune it and not construct it. Pruning this node and similar ones reduces the size of the tree and the time needed to construct it dramatically, especially when minsup is large.
The BVBUC method prunes all branches that will not reach level minsup. To this end, when it starts constructing a branch at level 1, the method calculates the maximum level that the branch can reach. In the vertical bottom-up tree construction method, a node is expanded by adding to its rowset one row id that is greater than the maximum row id already in the rowset, which gives the rowset of a child. So if a node in the tree of a dataset with m rows contains row id m, it cannot have any child. Therefore, if a node in level 1 of the tree contains m (the node corresponding to {m}) and minsup is greater than 1, the node should not be constructed. If minsup is greater than 2, the nodes in level 2 that contain m should not be constructed, and so neither should the nodes in level 1 that contain m − 1 or m. For minsup > 3, only the nodes of level 1 that do not contain row m − 2 or any later row can be constructed. Therefore, for a given minsup, only the first m − minsup + 1 nodes of level 1 are expanded.
5. Algorithm
The closed version of BVBUC differs from the primary version in only one respect: a condition that checks the closedness of patterns before adding them to the output file. This change makes the algorithm mine closed colossal frequent patterns.
6. Experimental results
Fig. 20. Characteristics of test datasets.
1 http://www.chestsurg.org.
2 http://www-genome.wi.mit.edu/mpr/prostate.
7. Conclusion
References
[17] B. Nair, A.K. Tripathy, Accelerating closed frequent itemset mining by elimination of null transactions, International Journal of Emerging Trends in Computing and Information Sciences (2011) 317–324.
[18] H. Liu, J. Han, D. Xin, Z. Shao, Mining frequent patterns on very high dimensional data: a topdown row enumeration approach, in: Proceeding of the 2006 SIAM International Conference on Data Mining (SDM'06), Bethesda, MD, 2006, pp. 280–291.
[19] J. Dong, M. Han, BitTableFI: an efficient mining frequent itemsets algorithm, Knowl.-Based Syst. 20 (4) (2007) 329–335.
[20] F. Zhu, X. Yan, J. Han, P. Yu, H. Cheng, Mining colossal frequent patterns by core pattern fusion, in: Proceeding of the 2007 Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2007.
[21] M. Dabbiru, M. Shashi, An efficient approach to colossal pattern mining, Int. J. Comput. Sci. Network Secur. (IJCSNS) 6 (2010) 304–312.
[22] T.Y. Lin, E. Louie, Finding association rules using fast bit computation: machine-oriented modeling, Lecture Notes in Computer Science, vol. 1932, Springer, Berlin, 2000, pp. 247–258.
[23] T.Y. Lin, X. Hu, E. Louie, A fast association rule algorithm based on bitmap and granular computing, in: Proc. 12th IEEE Internat. Conf. on Fuzzy Systems, 2003, pp. 678–683.
[24] B. Schlegel, R. Gemulla, W. Lehner, Memory-efficient frequent itemset mining, in: Proc. 14th International Conference on Extending Database Technology (EDBT), 2011.
[25] U. Yun, Mining lossless closed frequent patterns with weight constraints, Knowl.-Based Syst. 20 (1) (2007) 86–97.
[26] Frequent Itemset Mining Implementations Repository: <http://fimi.cs.helsinki.fi/>.
[27] C. Borgelt, X. Yang, R. Nogales-Cadenas, P. Carmona-Saez, A.D. Pascual-Montano, Finding closed frequent item sets by intersecting transactions, in: Proc. 14th International Conference on Extending Database Technology (EDBT) 2011, Uppsala, Sweden, 2011, pp. 367–376.
[28] Y.J. Tsay, J.Y. Chiang, CBAR: an efficient method for mining association rules, Knowl.-Based Syst. 18 (2005) 99–105.
[29] F. Pan, G. Cong, A.K.H. Tung, J. Yang, M. Zaki, CARPENTER: finding closed patterns in long biological datasets, in: Proceeding of the 2003 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'03), Washington, DC, 2003, pp. 637–642.