
Frequent/Closed/Maximal Itemsets

1 National Institute of Informatics
2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan
e-mail: uno@nii.jp, masashi@grad.nii.ac.jp

2 Information Science and Technology, Hokkaido University
Kita 14-jo Nishi 9-chome, Sapporo 060-0814, Japan
e-mail: arim@ist.hokudai.ac.jp

Abstract: For a transaction database, a frequent itemset is an itemset included in at least a specified number of transactions. A frequent itemset P is maximal if P is included in no other frequent itemset, and closed if P is included in no other itemset included in exactly the same transactions as P. The problems of finding these frequent itemsets are fundamental in data mining, and fast implementations for solving them are needed in applications. In this paper, we propose the efficient algorithms LCM (Linear time Closed itemset Miner), LCMfreq, and LCMmax for these problems. We show the efficiency of our algorithms by computational experiments comparing them with existing algorithms.

1 Introduction

Frequent itemset mining is one of the fundamental problems in data mining and has many applications, such as association rule mining, inductive databases, and query expansion. These applications need fast implementations of frequent itemset mining. In this paper, we propose the second versions of LCM, LCMfreq, and LCMmax, for enumerating closed, all, and maximal frequent itemsets, respectively. LCM is an abbreviation of Linear time Closed itemset Miner.

In FIMI03 [7], we proposed the first version of LCM, which enumerates frequent closed itemsets. LCM uses prefix preserving closure extension (ppc extension for short), which is an extension from a closed itemset to another closed itemset. The extension induces a search tree on the set of frequent closed itemsets, so we can enumerate all closed itemsets completely and without duplication. Generating a ppc extension needs no previously obtained closed itemset. Hence, the memory use of LCM does not depend on the number of frequent closed itemsets, even if there are many of them. The time complexity of LCM is theoretically bounded by a linear function in the number of frequent closed itemsets, while that of the existing algorithms is not. We further developed algorithms for the frequency counting: occurrence deliver and a hybrid with diffsets. They reduce the practical computation time considerably. Moreover, the framework of LCM is simple: generating ppc extensions needs no sophisticated data structure such as binary trees, and LCM is implemented with only arrays. Therefore, LCM is fast, and outperforms the other algorithms on some sparse datasets.

However, the first version of LCM has no routine for reducing the database, while many existing algorithms do. Thus, its performance is poor on dense datasets with large minimum supports, which involve many unnecessary items and transactions. At FIMI03, we also proposed modifications of LCM, LCMfreq and LCMmax, for enumerating all frequent itemsets and maximal frequent itemsets. Although they are fast on some instances, whenever LCM is slow on an instance, they are also slow on that instance. Existing maximal frequent itemset mining algorithms have efficient pruning methods that reduce the number of iterations, while LCMmax does not; this is another reason for the slowness of LCMmax.

This paper proposes the second versions of the LCM algorithms. We added database reduction to LCM, so that problems on dense datasets can be solved in a short time. The second version of LCMmax includes a pruning method, so its computation time is reduced when the number of maximal frequent itemsets is small. We further developed new algorithms for checking the maximality of a frequent itemset and for taking the closure of an itemset. We compare the performance of the LCM algorithms with the other algorithms submitted to FIMI03 by computational experiments. On many instances, the LCM algorithms outperform the other algorithms.

The organization of the paper is as follows. Section 2 introduces preliminaries. The main algorithms and practical techniques of the LCM algorithms are described in Section 3. Section 4 shows the results of computational experiments, and Section 5 concludes the paper.

2 Preliminaries

Let I = {1, ..., n} be the set of items. A transaction database on I is a set T = {t1, ..., tm} such that each ti is included in I. Each ti is called a transaction. We denote by ||T|| the sum of the sizes of all transactions in T, that is, the size of the database T. A set P ⊆ I is called an itemset.

For an itemset P, a transaction including P is called an occurrence of P. The denotation of P, denoted by T(P), is the set of the occurrences of P. |T(P)| is called the frequency of P, and denoted by frq(P). For a given constant θ, called a minimum support, an itemset P is frequent if frq(P) ≥ θ. If a frequent itemset P is included in no other frequent itemset, P is called maximal. For any itemsets P and Q, T(P ∪ Q) = T(P) ∩ T(Q) holds, and if P ⊆ Q then T(Q) ⊆ T(P). An itemset P is called closed if no other itemset Q satisfies T(P) = T(Q), P ⊆ Q.

Given a set S ⊆ T of transactions, let I(S) be the set of items common to all transactions in S, i.e., I(S) = ∩_{T∈S} T. Then, we define the closure of an itemset P in T, denoted by clo(P), by I(T(P)) (= ∩_{t∈T(P)} t). For every pair of itemsets P and Q, the following properties hold [13, 14]:

(1) If P ⊆ Q, then clo(P) ⊆ clo(Q).
(2) If T(P) = T(Q), then clo(P) = clo(Q).
(3) clo(clo(P)) = clo(P).
(4) clo(P) is the unique smallest closed itemset including P.
(5) An itemset P is a closed itemset if and only if clo(P) = P.

For an itemset P and an item i ∈ P, let P(i) = P ∩ {1, ..., i} be the subset of P consisting only of elements no greater than i, called the i-prefix of P. An itemset Q is a closure extension of an itemset P if Q = clo(P ∪ {i}) holds for some i ∉ P. If Q is a closure extension of P, then Q ⊃ P and frq(Q) < frq(P). We call the item with the maximum index in P the tail of P, and denote it by tail(P).

3 Algorithms for Efficient Enumeration

In this section, we explain the techniques used in the second versions of the LCM algorithms, comparing them one by one with the techniques used by the other algorithms in the following subsections. The new techniques used in the second versions are:

3.2. anytime database reduction (reduce the frequency counting cost)
3.6. database reduction for fast checking of closedness
3.8. database reduction for fast checking of maximality
3.7. a new pruning algorithm for backtracking-based maximal frequent itemset mining.

The techniques also used in the first versions are:

3.4. occurrence deliver (compute frequencies in linear time)
3.5. ppc extension (generate closed itemsets with neither memory nor duplication)
3.3. hypercube decomposition (fast enumeration by grouping frequent itemsets by equivalence class).

The techniques used in the existing algorithms and in the first version carry citations to the previous papers.

3.1 Enumerating Frequent Itemsets

Any itemset included in a frequent itemset is itself frequent. Thereby, the property "frequent" is monotone. From this, we can construct any frequent itemset from the empty set by adding items one by one without passing through any infrequent itemset. Roughly speaking, the existing algorithms are classified into two groups, and the algorithms in both groups use this property.

The first group consists of the so-called apriori or level-by-level algorithms [1, 2]. Let Dk be the set of frequent itemsets of size k. Apriori algorithms start with D0, that is {∅}, and compute Dk from Dk-1 in increasing order of k from k = 1. Any itemset in Dk is obtained from an itemset of Dk-1 by adding an item. Apriori algorithms add every item to each itemset of Dk-1, and choose the frequent itemsets among them. If Dk = ∅ holds for some k, then Dk' = ∅ holds for any k' > k. Thus, apriori algorithms stop at such a k. This is the scheme of apriori algorithms.
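A minimal sketch of this level-by-level scheme, written in Python for brevity (the paper's implementation is in ANSI C); the tiny example database and minimum support are assumptions for illustration:

```python
# Level-by-level (apriori) enumeration: compute D_k from D_{k-1}
# until some D_k becomes empty.
def apriori(transactions, theta):
    """Return every frequent itemset of the database (minimum support theta)."""
    items = sorted(set().union(*transactions))
    frequent = []
    d_k = {frozenset()}                      # D_0 = {emptyset}
    while d_k:
        d_next = set()
        for p in d_k:                        # extend each itemset of D_{k-1}
            for e in items:
                if e not in p:
                    q = p | {e}
                    if sum(q <= t for t in transactions) >= theta:
                        d_next.add(q)        # keep only the frequent ones
        frequent.extend(d_k)
        d_k = d_next                         # stop once D_k = emptyset
    return frequent

T = [{1, 2, 3}, {1, 2}, {2, 3}, {1, 3}]
print(len(apriori(T, 2)))                    # 7 itemsets, including the empty set
```

Note how the memory use of the scheme grows with the size of the largest level Dk, which is the disadvantage discussed below.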

The other group consists of the so-called backtracking algorithms [3, 18, 19]. A backtracking algorithm is based on recursive calls. An iteration of a backtracking algorithm inputs a frequent itemset P, and generates itemsets by adding every item to P. Then, for each of these itemsets that is frequent, the iteration generates a recursive call with respect to it. To avoid duplications, an iteration of a backtracking algorithm adds only items with indices larger than the tail of P. We describe the framework of backtracking algorithms as follows.

ALGORITHM BackTracking (P: current solution)
1. Output P
2. For each e ∈ I, e > tail(P) do
3.   If P ∪ {e} is frequent then
       call BackTracking (P ∪ {e})

An execution of a backtracking algorithm gives a tree structure such that the vertices of the tree are the iterations, and edges connect two iterations if one of the iterations calls the other. If an iteration I recursively calls another iteration I', then we say that I is the parent of I', and I' is a child of I. For an iteration, the itemset received from the parent is called the current solution.

Apriori algorithms use much memory for storing Dk in memory, while backtracking algorithms use less memory since they keep only the current solution. Backtracking algorithms need no computation for maintaining previously obtained itemsets, so the computation time of backtracking algorithms is generally short. However, apriori algorithms have advantages for the frequency counting.

The LCM algorithms are based on backtracking algorithms, and use efficient techniques for the frequency counting: occurrence deliver and anytime database reduction, described below. Hence, the LCM algorithms compute the frequencies efficiently without keeping previously obtained itemsets in memory.

3.2 Maintaining Databases

In the existing studies, database reduction is said to be important for reducing the computation time. It reduces the input database by the following rules:

1. remove each item included in fewer than θ transactions,
2. remove each item included in all transactions,
3. merge the identical transactions into one.

Database reduction performs well when the minimum support is large, and many existing algorithms use it. The LCM algorithms also use database reduction.

In the existing studies, the input databases are often stored and maintained using an FP-tree (frequent pattern tree), which is a version of a prefix tree (trie) [9]. By using an FP-tree, we can search for specified transactions in the dataset efficiently. An FP-tree compresses the common prefixes, so we can decrease the memory use. In addition, an FP-tree can detect identical transactions, so we can merge them into one. This merge accelerates the frequency counting. For these reasons, FP-trees are used in many algorithms and implementations.

Although the FP-tree has many advantages, we do not use it in the implementation of LCM, but use simple arrays instead. The main reason is that LCM does not have to search for transactions in the database. The main operation of LCM is tracing the transactions in the denotation of the current solution. Thus, we do not need sophisticated data structures for searching.

The other reason is the computation time of the initialization. If we use a standard binary tree for implementing the FP-tree, the initialization of the input database takes O(||T|| + |T| log |T|) time for constructing the FP-tree in memory. Compared to this, LCM detects the identical transactions and stores the database in memory in time linear in the database size. This is because LCM uses radix sort for this task, which sorts the transactions into lexicographic order in linear time. In general, the datasets of data mining problems have many transactions, and each transaction has few items. Thus, ||T|| is usually smaller than |T| log |T|, and LCM has an advantage. Moreover, the constant factors of the computation time of binary tree operations are relatively larger than those of array operations, so LCM also has an advantage on this point. Again, we recall that LCM never searches the transactions, so each operation required by LCM can be done in constant time.

The FP-tree has an advantage in reducing the memory use, and this memory reduction can also reduce the computation time of the frequency counting. To check the efficiency of this reduction, we measured the reduction ratio achieved by the FP-tree for some datasets examined in FIMI03. The result is shown in Table 1. Each cell shows the ratio of the number of items needed to be stored by arrays and by the FP-tree. Usually, the input database is reduced in each iteration, hence we sum up the numbers over all iterations to compute the ratio.
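A small Python sketch of the three reduction rules above; ordinary sorting stands in for the linear-time radix sort the paper describes, and the example database is an assumption:

```python
from collections import Counter

def reduce_database(transactions, theta):
    """Apply the three reduction rules; return (reduced transactions, multiplicities)."""
    m = len(transactions)
    count = Counter(e for t in transactions for e in t)
    # rule 1: drop items in fewer than theta transactions;
    # rule 2: drop items included in all transactions
    keep = {e for e, c in count.items() if theta <= c < m}
    # rule 3: merge identical transactions; sorting brings duplicates together
    shrunk = sorted(tuple(sorted(e for e in t if e in keep)) for t in transactions)
    merged = Counter(shrunk)   # multiplicity of each merged transaction
    return list(merged), merged

T = [{1, 2, 4}, {1, 2, 4}, {1, 3}, {1, 2}]
ts, mult = reduce_database(T, 2)
print(ts, mult[(2, 4)])   # item 1 (in all) and item 3 (infrequent) disappear
```

Keeping a multiplicity for each merged transaction is what later allows frequencies to be counted on the reduced database, in the spirit of the weights of Section 3.8.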

In the results of our experiments, the ratio was not greater than 3 in many instances. If |I| is small, |T| is large, and the dataset has some randomness, such as accidents, the ratio was up to 6. Generally, a binary tree uses three times as much memory as an array. Thus, the performance of the FP-tree seems to be worse than that of an array in both memory use and computation time, for many datasets.

3.3 Hypercube Decomposition

LCM finds a number of frequent itemsets at once to reduce the computation time [18]. Since the itemsets obtained at once compose a hypercube in the itemset lattice, we call the technique hypercube decomposition. For a frequent itemset P, let H(P) be the set of items e satisfying e > tail(P) and T(P) = T(P ∪ {e}). Then, for any Q ⊆ H(P), T(P ∪ Q) = T(P) holds, and P ∪ Q is frequent. LCMfreq uses this property. For two itemsets P and P ∪ Q, we say that P' is between P and P ∪ Q if P ⊆ P' ⊆ P ∪ Q. In the iteration with respect to P, we output all P' between P and P ∪ H(P). This saves about 2^|H(P)| frequency countings.

To avoid duplications, we do not generate recursive calls with respect to items included in H(P). Instead, we output the frequent itemsets including items of H(P) in the recursive calls with respect to items not included in H(P). When the algorithm generates a recursive call with respect to e ∉ H(P), we pass H(P) to it. In the recursive call, we output all itemsets between P ∪ {e} and P ∪ {e} ∪ H(P) ∪ H(P ∪ {e}). Since any itemset Q satisfies T(P ∪ Q ∪ H(P)) = T(P ∪ Q), the itemsets output in the recursive calls are frequent. We describe hypercube decomposition as follows.

ALGORITHM HypercubeDecomposition (P: current solution, S: itemset)
  S' := S ∪ H(P)
  Output all itemsets including P and included in P ∪ S'
  For each item e ∈ I \ (P ∪ S'), e > tail(P) do
    If P ∪ {e} is frequent then
      call HypercubeDecomposition (P ∪ {e}, S')
  End for

3.4 Frequency Counting

Generally, the heaviest part of frequent itemset mining is the frequency counting, which is to count the number of transactions including a newly generated itemset. To reduce the computation time, existing algorithms use down project. For an itemset P, down project computes its denotation T(P) by using two subsets P1 and P2 of P. If P = P1 ∪ P2, then T(P) = T(P1) ∩ T(P2). Under the condition that the items of P1 and P2 are sorted by their indices, the intersection can be computed in O(|T(P1)| + |T(P2)|) time. Down project uses this property to compute the denotations quickly. Moreover, if |T(P1)| < θ or |T(P2)| < θ holds, we can see that P can never be frequent; this also helps to reduce the computation time. Apriori-type algorithms accelerate the frequency counting by finding a good pair of subsets P1 and P2 of P such that |T(P1)| + |T(P2)| is small, or either P1 or P2 is infrequent. A backtracking algorithm adds an item e to the current solution P in each iteration, and computes its denotation. By using T({e}), the computation time for the frequency counting is reduced to O(Σ_{e>tail(P)} (|T(P)| + |T({e})|)).

The bitmap method [5] is a technique for speeding up the computation of the intersection in down project. It uses a bitmap image (the characteristic vector) of the denotations. Taking an intersection costs O(|T|) time with bitmaps. However, a 32-bit CPU can take the intersection of 32 bits at once, so roughly speaking the computation time is reduced to 1/32. This method has a disadvantage for sparse datasets, and is not orthogonal to the anytime database reduction described below. From the results of the experiments in FIMI03, the bitmap method seems to be unsuited to sparse large datasets.

The LCM algorithms use another method for the frequency counting, called occurrence deliver [18, 19]. Occurrence deliver computes the denotations of P ∪ {e} for e = tail(P) + 1, ..., |I| at once by tracing the transactions in T(P). It uses a bucket for each item e to be added, and sets the buckets to the empty set at the beginning. Then, for each transaction t ∈ T(P), occurrence deliver inserts t into the bucket of e for each e ∈ t, e > tail(P). After these insertions, the bucket of e is equal to T(P ∪ {e}). For each transaction t, occurrence deliver takes O(|t ∩ {tail(P) + 1, ..., |I|}|) time. Thus, the computation time is O(Σ_{T∈T(P)} |T ∩ {tail(P) + 1, ..., |I|}|) = O(|T(P)| + Σ_{e>tail(P)} |T(P ∪ {e})|). This time complexity is smaller than that of down project. We describe the pseudo code of occurrence deliver in the following.

dataset                                      chess   accidents  BMS-WebView2  T40I10D100K
minimum support                              40%     30%        0.05%         0.1%
reduction factor by FP-tree                  2.27    6.01       1.9           1.57
reduction factor by hypercube decomposition  6.25    1          1.21          1
reduction factor by apriori (best)           1.11    1.34       1.35          2.85

Table 1: Efficiency test of FP-tree, hypercube decomposition, and apriori. The reduction factor of FP-tree is (sum of # of elements in the database reduced by LCM) / (sum of # of elements in the database reduced by FP-tree) over all iterations; the reduction factor of hypercube decomposition is the average number of frequent itemsets output per iteration; and the reduction factor of apriori is (sum of Σ_{e>tail(P)} |T(P ∪ {e})|) / (sum of Σ_{e∈F(P)} |T(P ∪ {e})|) over all iterations.

ALGORITHM OccurrenceDeliver (T: database, P: itemset)
1. Set Bucket[e] := ∅ for each item e > tail(P)
2. For each transaction t ∈ T(P) do
3.   For each item e ∈ t, e > tail(P) do
4.     Insert t into Bucket[e]
5.   End for
6. End for
7. Output Bucket[e] for all e > tail(P)

Let F(P) be the set of items e such that e > tail(P) and P ∪ {e} is frequent. Apriori algorithms have the possibility of finding out in a short time that P ∪ {e} is infrequent; thus, in the best case, their computation time can be reduced to O(Σ_{e∈F(P)} |T(P ∪ {e})|). If Σ_{e>tail(P), e∉F(P)} |T(P ∪ {e})| is large, occurrence deliver will be slow.

To decrease Σ_{e>tail(P), e∉F(P)} |T(P ∪ {e})|, the LCM algorithms sort the indices of the items e in increasing order of |T({e})|. As we can see in Table 1, this sort reduces Σ_{e>tail(P), e∉F(P)} |T(P ∪ {e})| to 1/4 of Σ_{e>tail(P)} |T(P ∪ {e})| in many cases. Since apriori algorithms take much time to maintain previously obtained itemsets, the possibility of speeding up by apriori algorithms is not so large.

The LCM algorithms further speed up the frequency counting by iteratively reducing the database. Suppose that an iteration I of a backtracking algorithm receives a frequent itemset P from its parent. Then, in any descendant iteration of I, no item with index smaller than tail(P) is added. Hence, any such item can be removed from the database during the execution of the descendant iterations. Similarly, the transactions not including P never include the current solution of any descendant iteration, thus such transactions can also be removed during the execution of the descendant iterations. Indeed, infrequent items can be removed, and identical transactions can be merged.

According to this, the LCM algorithms recursively reduce the database during the execution of the recursive calls. Before each recursive call, the LCM algorithms generate a reduced database according to the above discussion, and pass it to the recursive call. We call this technique anytime database reduction.

Anytime database reduction reduces the computation time of the iterations located at the lower levels of the recursion tree. In the recursion tree, many iterations are on the lower levels and few iterations are on the upper levels. Thus, anytime database reduction is expected to be efficient. In our experiments, anytime database reduction works quite well. The following table shows its efficiency: we sum up, over all iterations, the sizes of the database received from the parent, both with and without anytime database reduction. Each cell shows the sum. The reduction factor is large especially when the dataset is dense and the minimum support is large.

dataset                      connect      pumsb        BMS-WebView2  T40I10D100K
minimum support              50%          60%          0.1%          0.03%
database reduction           188319235    2125460007   2280260       1704927639
anytime database reduction   538931       7777187      521576        77371534
reduction factor             349.4        273.2        4.3           22.0

Table 2: Accumulated number of transactions in the database over all iterations
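The OccurrenceDeliver routine above can be sketched in Python as follows (transactions are given as sorted item lists; the example occurrence set is an assumption):

```python
from collections import defaultdict

def occurrence_deliver(occ_p, tail):
    """occ_p: the transactions of T(P); returns buckets with Bucket[e] = T(P + {e})."""
    bucket = defaultdict(list)       # Bucket[e] starts empty
    for t in occ_p:                  # one scan over the transactions in T(P)
        for e in t:
            if e > tail:             # each item e in t with e > tail(P)
                bucket[e].append(t)  # insert t into Bucket[e]
    return bucket

occ = [[1, 2, 3], [1, 2], [2, 3]]    # suppose this is T(P) with tail(P) = 1
b = occurrence_deliver(occ, 1)
print({e: len(ts) for e, ts in b.items()})   # bucket sizes are the frequencies
```

The cost is exactly one pass over the items larger than tail(P) in the occurrences, matching the O(|T(P)| + Σ_{e>tail(P)} |T(P ∪ {e})|) bound derived above.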

3.5 Prefix Preserving Closure Extension

Many existing algorithms for mining closed itemsets are based on frequent itemset mining. That is, the algorithms enumerate frequent itemsets, and output those that are closed. This approach is efficient when the number of frequent itemsets and the number of frequent closed itemsets do not differ much. However, if the difference between them is large, the algorithms generate many non-closed frequent itemsets, and thus will not be efficient. Many pruning methods have been developed to speed this up, but they are not complete. Thus, the computation time is not bounded by a linear function in the number of frequent closed itemsets, and there is a possibility of a super-linear increase of the computation time in the size of the output.

LCM uses prefix preserving closure extension (ppc extension for short) for generating closed itemsets [18, 19]. For a closed itemset P, we define the closure tail clo_tail(P) as the item i of minimum index satisfying clo(P(i)) = P. clo_tail(P) is always included in P. We say that P' is a ppc extension of P if P' = clo(P ∪ {e}) and P'(e - 1) = P(e - 1) hold for an item e > clo_tail(P). Let P0 be the itemset satisfying T(P0) = T. Any closed itemset P' ≠ P0 is a ppc extension of another closed itemset P, and such a P is unique for P'. Moreover, the frequency of P is strictly larger than that of P', hence ppc extension induces a rooted tree on the frequent closed itemsets. LCM starts from P0, and finds all frequent closed itemsets in a depth-first manner by recursively generating ppc extensions. The proofs for the ppc extension algorithms are described in [18, 19].

By ppc extension, the time complexity is bounded by a linear function in the number of frequent closed itemsets. Hence, the computation time of LCM is never super-linear in the number of frequent closed itemsets.

3.6 Closure Operation

To enumerate closed itemsets, we have to check whether the current solution P is a closed itemset or not. In the existing studies, there are two methods for this task. The first method is to store in memory the previously obtained itemsets that are currently maximal among the itemsets having the identical denotation. In this method, we find frequent itemsets one by one, and store them in memory while removing itemsets included in another itemset having the identical denotation. After finding all frequent itemsets, only the closed itemsets remain in memory. We call this storage method. The second method is to generate the closure of P. By adding to P all items e such that frq(P) = frq(P ∪ {e}), we can construct the closure of P. We call the second method closure operation.

LCM uses closure operations for generating ppc extensions. Similarly to the frequency counting, we use database reduction for the closure operation. Suppose that the current solution is P, the reduced database is composed of transactions S1, ..., Sh, and each Sl is obtained from transactions T1^l, ..., Tk^l of the original database. For each Sl, we define the interior intersection In(Sl) by ∩_{T∈{T1^l,...,Tk^l}} T. Here the closure of P is equal to ∩_{S∈{S1,...,Sh}} In(S). Thus, by using the interior intersections, we can efficiently construct the closure of P.

When we merge transactions to reduce the database, the interior intersections can be updated efficiently, by taking the intersection of their interior intersections. In the same way as in the frequency counting, we can remove infrequent items from the interior intersections for more reduction. The computation time for the closure operation in LCM depends on the size of the database, but not on the number of previously obtained itemsets. Thus, storage method has advantages if the number of frequent closed itemsets is small. However, for the instances with a lot of frequent closed itemsets, which take a long time to solve, LCM has an advantage.

3.7 Enumerating Maximal Frequent Itemsets

Many existing algorithms for maximal frequent itemset enumeration are based on the enumeration of frequent itemsets. In a breadth-first or depth-first manner, they enumerate frequent itemsets and output the maximal itemsets among them. To reduce the computation time, the algorithms prune the unnecessary itemsets and recursive calls.

Similarly to these algorithms, LCMmax enumerates closed itemsets by backtracking, and outputs the maximal itemsets among them. It uses a pruning to cut off unnecessary branches of the recursion. The pruning is based on a re-ordering of the indices of the items in each iteration. We explain the re-ordering in the following.

Let us consider a backtracking algorithm for enumerating frequent itemsets. Let P be the current solution of an iteration of the algorithm. Suppose that P' is a maximal frequent itemset including P. LCMmax puts new indices on the items with indices larger than tail(P) so that any item in P' has an index larger than any item not in P'. Note that this re-ordering of indices has no effect on the correctness of the algorithm.

Let e > tail(P) be an item in P', and consider the recursive call with respect to P ∪ {e}. Any frequent itemset found in the recursive call is included in P', since every item having an index larger than e is included in P', and the recursive call adds to P only items with indices larger than e. From this, we can see that by the re-ordering of indices, the recursive calls with respect to items in P' ∩ H generate no maximal frequent itemset other than P'.

According to this, an iteration of LCMmax chooses an item e* ∈ H, and generates a recursive call with respect to P ∪ {e*} to obtain a maximal frequent itemset P'. Then, it re-orders the indices of the items other than e* as above, and generates recursive calls with respect to each e > tail(P) not included in P' ∪ {e*}. In this way, we save the computation time for finding P', and by finding a large itemset, increase the efficiency of this approach. In the following, we describe LCMmax.

ALGORITHM LCMmax (P: itemset, H: items to be added)
1. H' := the set of items e in H s.t. P ∪ {e} is frequent
2. If H' = ∅ then
3.   If P ∪ {e} is infrequent for any e then
       output P; return
4. End if
5. End if
6. Choose an item e* ∈ H'; H' := H' \ {e*}
7. LCMmax (P ∪ {e*}, H')
8. P' := frequent itemset of the maximum size found in the recursive call in 7
9. For each item e ∈ H \ P' do
10.  H' := H' \ {e}
11.  LCMmax (P ∪ {e}, H')
12. End for

3.8 Checking Maximality

When LCMmax finds a frequent itemset P, it checks whether the current solution is maximal or not. We call this operation maximality check. Maximality check is a heavy task, thus many existing algorithms avoid it. They store in memory the maximal itemsets among the previously obtained frequent itemsets, and update them when they find a new itemset. When the algorithms terminate, having obtained all frequent itemsets, only the maximal frequent itemsets remain in memory. We call this storage method. If the number of maximal frequent itemsets is small, storage method is efficient. However, if the number is large, storage method needs much memory. Whenever a frequent itemset is newly found, storage method checks whether the itemset is included in some itemset in memory or not; if the number of frequent itemsets is large, this operation takes a long time.

To avoid the disadvantage of storage method, LCMmax operates a maximality check. LCMmax checks the maximality by finding an item e such that P ∪ {e} is frequent: if and only if such an e exists, P is not maximal. To do this efficiently, we reduce the database. Let us consider an iteration of LCMmax with respect to a frequent itemset P. The LCM algorithms reduce the database by anytime database reduction for the frequency counting. Suppose that the reduced database is composed of transactions S1, ..., Sh, and each Sl is obtained by merging transactions T1^l, ..., Tk^l of the original database. Let H be the set of items to be added in the iteration. Suppose that we remove from H all items e such that P ∪ {e} is infrequent. Then, for any l, T1^l ∩ H = T2^l ∩ H = ... = Tk^l ∩ H holds. For an item e and a transaction Sl, we define the weight w(e, Sl) as the number of transactions in T1^l, ..., Tk^l including e. Here the frequency of P ∪ {e} is Σ_{S∈{S1,...,Sh}} w(e, S). Thus, by using the weights, we can efficiently check the maximality, in time linear in the size of the reduced database.

When we merge transactions to reduce the database, the weights can be updated easily: for each item e, we take the sum of w(e, S) over all transactions S to be merged. In the same way as in the frequency counting, we can remove infrequent items from the database for the maximality check, for more reduction.

The computation time for the maximality check in LCMmax depends on the size of the database, but not on the number of previously obtained itemsets. Thus, storage method has advantages if the number of maximal frequent itemsets is small, but for the instances with a lot of maximal frequent itemsets, which take a long time to solve, LCMmax has an advantage.
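Putting the pieces of Section 3 together, the following Python sketch enumerates frequent closed itemsets by ppc extension (Sections 3.5 and 3.6) and then applies a maximality test in the spirit of Section 3.8. It keeps the whole database in memory and recomputes closures naively, unlike the reduced-database, array-based implementation described above, and the example data are assumptions:

```python
def closure(transactions, p):
    """Return (clo(P), frq(P)) by intersecting all occurrences of P."""
    occ = [t for t in transactions if p <= t]
    clo = frozenset.intersection(*occ) if occ else frozenset()
    return clo, len(occ)

def clo_tail(transactions, p):
    """The minimum index i with clo(P(i)) = P (0 for the root itemset)."""
    for i in sorted(p):
        if closure(transactions, frozenset(j for j in p if j <= i))[0] == p:
            return i
    return 0

def lcm(transactions, theta, items):
    """Enumerate frequent closed itemsets, one per ppc extension."""
    root, _ = closure(transactions, frozenset())     # P0, with T(P0) = T
    out = [root]
    def rec(p):
        for e in items:
            if e in p or e <= clo_tail(transactions, p):
                continue
            q, f = closure(transactions, p | {e})
            # ppc extension: the (e-1)-prefix of the closure must be preserved
            if f >= theta and {i for i in q if i < e} == {i for i in p if i < e}:
                out.append(q)
                rec(q)                               # depth-first search
    rec(root)
    return out

def is_maximal(transactions, p, theta, items):
    """Maximality check: P is maximal iff no e outside P makes P + {e} frequent."""
    return all(closure(transactions, p | {e})[1] < theta
               for e in items if e not in p)

T = [frozenset(s) for s in ({1, 2, 3}, {1, 2}, {2, 3}, {1, 3})]
closed = lcm(T, 2, [1, 2, 3])
print(sorted(map(sorted, closed)))
print([sorted(p) for p in closed if is_maximal(T, p, 2, [1, 2, 3])])
```

Because each closed itemset is reached through exactly one ppc extension, no previously found itemsets need to be stored for duplicate detection, which is the memory property claimed in the introduction.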

[Figure 1 plots CPU time (sec, logarithmic scale) against minimum support (%) on BMS-WebView-2, in three panels: all, closed, and maximal frequent itemset mining. The implementations compared are afopt, fpgrowth, lcm, mafia, dci, and patriciamine.]

Figure 1: Results 1

4 Computational Experiments

In this section, we show the results of our computational experiments. We implemented our three algorithms LCM, LCMfreq, and LCMmax. They are coded in ANSI C and compiled by gcc. The experiments were executed on a notebook PC with an AMD Athlon XP 1600+ and 224MB of memory. The performance of the LCM algorithms is compared with the algorithms which marked good scores at FIMI 03: fpgrowth [8], afopt [11], MAFIA [5, 6], kDCI [12], and PATRICIAMINE [16]. We note that kDCI and PATRICIAMINE are only for all frequent itemset mining. To reduce the time for the experiments, we stopped the execution when an algorithm took more than 10 minutes, and such runs are not plotted. The results are displayed in Figures 1 and 2. In each graph, the horizontal axis is the size of the minimum support, and the vertical axis is the CPU time on a log scale.

From the performances of the implementations, the instances were classified into three groups, within which the results are similar. Due to space limitations, we show one instance as a representative of each group.

The first group is composed of BMS-WebView1, BMS-WebView2, BMS-POS, T10I4D100K, kosarak, and retail. These datasets have many items and transactions but are sparse. We call these datasets sparse datasets. We chose BMS-WebView2 as the representative.

The second group is composed of datasets taken


[Figure 2 (plots not reproduced): CPU time (sec, log scale) versus minimum support (%), with panels for chess (all, closed, maximal) and accidents (all, closed, maximal); curves for afopt, fpgrowth, lcm, mafia, dci, and patriciamine.]

Figure 2: Results 2


from the UCI Machine Learning Repository¹: connect, chess, mushroom, pumsb, and pumsb-star. These datasets have many transactions but few items. We call these datasets middle density datasets. As a representative, we show the result of chess.

The third group is accidents. It is different from any other dataset: it has a huge number of transactions, but few items. Transactions include many items, so the dataset is very dense. We call this dataset a very dense dataset.

In almost all instances and minimum supports, the LCM algorithms perform well. When the minimum support is large, the LCM algorithms are the fastest for all instances, because of their fast initialization. For all instances with any minimum support, LCM outperforms the other closed itemset mining algorithms. This shows the efficiency of ppc extension.

For sparse datasets, the LCM algorithms are the fastest for any minimum support. The efficiency of the FP-tree is not large, and occurrence deliver works efficiently. The performances of afopt and fp-growth are quite similar for these problems. They are the second best, 2 to 10 times slower than the LCM algorithms. For enumerating frequent closed itemsets, they take much time when the number of closed itemsets is large. Although PATRICIAMINE is as fast as fp-growth and afopt, it terminated abnormally for some instances. kDCI is slow when the number of frequent itemsets is large. MAFIA was the slowest for these instances, for any minimum support.

For middle density datasets, LCM is the fastest for all instances on closed itemset mining. On all and maximal frequent itemset mining, LCMfreq and LCMmax are the fastest for large minimum supports, for any dataset. For small minimum supports, LCMfreq and LCMmax are the fastest on half of the instances. For the other instances, the results are case by case: each algorithm won in some cases.

For accidents, the LCM algorithms are the fastest when the minimum support is large. For small supports, LCM (closed) is the fastest; however, LCMfreq and LCMmax are slower than fp-growth. For this dataset, the efficiency of the FP-tree is large, and the compression ratio is up to 6. Bitmaps are also efficient because of the density. Hence, the computation time for frequency counting is short in the executions of the existing implementations. However, by ppc extension, LCM has an advantage for closed itemset mining; hence LCM (closed) is the fastest.

¹ http://www.ics.uci.edu/~mlearn/MLRepository.html

5 Conclusion

In this paper, we proposed a fast implementation of LCM for enumerating frequent closed itemsets, which is based on prefix preserving closure extension. We further gave implementations LCMfreq and LCMmax for enumerating all frequent itemsets and maximal frequent itemsets by modifying LCM. We showed by computational experiments that our implementations of LCM, LCMfreq, and LCMmax outperform the other algorithms on many datasets, especially on sparse datasets. There is a possibility of speeding up the LCM algorithms by developing more efficient maximality checking algorithms, or by developing a hybrid of array and FP-tree-like data structures.

Acknowledgment

This research is supported by joint-research funds of National Institute of Informatics.

References

[1] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules in Large Databases," In Proc. VLDB '94, pp. 487–499, 1994.

[2] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen and A. I. Verkamo, "Fast Discovery of Association Rules," In Advances in Knowledge Discovery and Data Mining, MIT Press, pp. 307–328, 1996.

[3] R. J. Bayardo Jr., "Efficiently Mining Long Patterns from Databases," In Proc. SIGMOD '98, pp. 85–93, 1998.

[4] E. Boros, V. Gurvich, L. Khachiyan, and K. Makino, "On the Complexity of Generating Maximal Frequent and Minimal Infrequent Sets," STACS 2002, pp. 133–141, 2002.

[5] D. Burdick, M. Calimlim, J. Gehrke, "MAFIA: A Maximal Frequent Itemset Algorithm for Transactional Databases," In Proc. ICDE 2001, pp. 443–452, 2001.

[6] D. Burdick, M. Calimlim, J. Flannick, J. Gehrke, and T. Yiu, "MAFIA: A Performance Study of Mining Maximal Frequent Itemsets," In Proc. IEEE ICDM'03 Workshop FIMI'03, 2003. (Available as CEUR Workshop Proc. series, Vol. 90, http://ceur-ws.org/vol-90)

[7] B. Goethals, the FIMI'03 Homepage, http://fimi.cs.helsinki.fi/, 2003.


[8] G. Grahne and J. Zhu, "Efficiently Using Prefix-trees in Mining Frequent Itemsets," In Proc. IEEE ICDM'03 Workshop FIMI'03, 2003. (Available as CEUR Workshop Proc. series, Vol. 90, http://ceur-ws.org/vol-90)

[9] J. Han, J. Pei, Y. Yin, "Mining Frequent Patterns without Candidate Generation," SIGMOD Conference 2000, pp. 1–12, 2000.

[10] R. Kohavi, C. E. Brodley, B. Frasca, L. Mason and Z. Zheng, "KDD-Cup 2000 Organizers' Report: Peeling the Onion," SIGKDD Explorations, 2(2), pp. 86–98, 2000.

[11] G. Liu, H. Lu, J. X. Yu, W. Wei, and X. Xiao, "AFOPT: An Efficient Implementation of Pattern Growth Approach," In Proc. IEEE ICDM'03 Workshop FIMI'03, 2003. (Available as CEUR Workshop Proc. series, Vol. 90, http://ceur-ws.org/vol-90)

[12] S. Orlando, C. Lucchese, P. Palmerini, R. Perego and F. Silvestri, "kDCI: a Multi-Strategy Algorithm for Mining Frequent Sets," In Proc. IEEE ICDM'03 Workshop FIMI'03, 2003. (Available as CEUR Workshop Proc. series, Vol. 90, http://ceur-ws.org/vol-90)

[13] N. Pasquier, Y. Bastide, R. Taouil, L. Lakhal, "Efficient Mining of Association Rules Using Closed Itemset Lattices," Inform. Syst., 24(1), pp. 25–46, 1999.

[14] N. Pasquier, Y. Bastide, R. Taouil, L. Lakhal, "Discovering Frequent Closed Itemsets for Association Rules," In Proc. ICDT'99, pp. 398–416, 1999.

[15] J. Pei, J. Han, R. Mao, "CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets," ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery 2000, pp. 21–30, 2000.

[16] A. Pietracaprina and D. Zandolin, "Mining Frequent Itemsets using Patricia Tries," In Proc. IEEE ICDM'03 Workshop FIMI'03, 2003. (Available as CEUR Workshop Proc. series, Vol. 90, http://ceur-ws.org/vol-90)

[17] S. Tsukiyama, M. Ide, H. Ariyoshi and I. Shirakawa, "A New Algorithm for Generating All the Maximal Independent Sets," SIAM Journal on Computing, Vol. 6, pp. 505–517, 1977.

[18] T. Uno, T. Asai, Y. Uchida, H. Arimura, "LCM: An Efficient Algorithm for Enumerating Frequent Closed Item Sets," In Proc. IEEE ICDM'03 Workshop FIMI'03, 2003. (Available as CEUR Workshop Proc. series, Vol. 90, http://ceur-ws.org/vol-90)

[19] T. Uno, T. Asai, Y. Uchida, H. Arimura, "An Efficient Algorithm for Enumerating Closed Patterns in Transaction Databases," to appear in Proc. of Discovery Science 2004, 2004.

[20] M. J. Zaki, C. Hsiao, "CHARM: An Efficient Algorithm for Closed Itemset Mining," 2nd SIAM International Conference on Data Mining (SDM'02), pp. 457–473, 2002.

[21] Z. Zheng, R. Kohavi and L. Mason, "Real World Performance of Association Rule Algorithms," KDD 2001, pp. 401–406, 2001.

