
journal homepage: www.elsevier.com/locate/eswa

DBV-Miner: A Dynamic Bit-Vector approach for fast mining frequent closed itemsets

Bay Vo a,*, Tzung-Pei Hong b,c, Bac Le d

a Department of Computer Science, Ho Chi Minh City University of Technology, Ho Chi Minh City, Viet Nam
b Department of Computer Science and Information Engineering, National University of Kaohsiung, Kaohsiung, Taiwan, ROC
c Department of Computer Science and Engineering, National Sun Yat-sen University, Kaohsiung, Taiwan, ROC
d Department of Computer Science, University of Science, Ho Chi Minh City, Viet Nam

Article info

Keywords:

BitTable

Data mining

Dynamic Bit-Vector

Frequent closed itemsets

Vertical data format

Abstract

Frequent closed itemsets (FCI) play an important role in pruning redundant rules quickly. Therefore, many algorithms for mining FCI have been developed. Algorithms based on vertical data formats have the advantages that they scan the database only once and compute the supports of itemsets quickly. In recent years, the BitTable (Dong & Han, 2007) and Index-BitTable (Song, Yang, & Xu, 2008) approaches have been applied to mining frequent itemsets with significant results. However, they always use a fixed Bit-Vector size for each item (equal to the number of transactions in the database). This consumes more memory for storing Bit-Vectors and more time for computing the intersection of Bit-Vectors. Besides, they have only been applied to mining frequent itemsets; no algorithm for mining FCI based on BitTable has been proposed. This paper introduces a new method for mining FCI from transaction databases. Firstly, the Dynamic Bit-Vector (DBV) approach is presented, and algorithms for fast computing the intersection of two DBVs are proposed. A lookup table is used for fast computing the support (the number of 1 bits in a DBV) of itemsets. Next, a subsumption concept for saving memory and computation time is discussed. Finally, an algorithm based on DBV and the subsumption concept for fast mining of frequent closed itemsets is proposed. We compare our method with CHARM and find that the proposed algorithm is more efficient than CHARM in both mining time and memory usage.

© 2012 Elsevier Ltd. All rights reserved.

1. Introduction

Mining frequent itemsets (FI) is one of the most important tasks in mining association rules. However, mining association rules from FI generates a lot of redundant rules (Bastide, Pasquier, Taouil, Stumme, & Lakhal, 2000; Pasquier, Bastide, Taouil, & Lakhal, 1999a, 1999b; Zaki, 2000, 2004), whereas mining association rules from FCI overcomes this problem. Therefore, many algorithms for mining FCI have been proposed, such as Close (Pasquier et al., 1999b), A-Close (Pasquier et al., 1999a), Closet (Pei, Han, & Mao, 2000), Closet+ (Wang, Han, & Pei, 2003), CHARM (Zaki & Hsiao, 2005), etc. They can be divided into two categories according to the database format: horizontal and vertical. Vertical-based approaches often scan the database once and rely on the divide-and-conquer strategy for fast mining of FI/FCI. Some of them are listed below.

E-mail addresses: vdbay@hcmhutech.edu.vn (B. Vo), tphong@nuk.edu.tw

(T.-P. Hong), lhbac@t.hcmus.edu.vn (B. Le).

0957-4174/$ - see front matter © 2012 Elsevier Ltd. All rights reserved.

doi:10.1016/j.eswa.2012.01.062

(i) Tidlist (Zaki & Hsiao, 2005; Zaki, Parthasarathy, Ogihara, & Li, 1997): the database is transformed into ⟨item, Tidset⟩ form. A new itemset built from two itemsets ⟨X, Tidset(X)⟩ and ⟨Y, Tidset(Y)⟩ is ⟨X ∪ Y, Tidset(X) ∩ Tidset(Y)⟩. The support of itemset X is the cardinality of Tidset(X).

(ii) BitTable (Dong & Han, 2007; Song, Yang, & Xu, 2008): the database is transformed into ⟨item, Bit-Vector⟩ form. A new itemset built from two itemsets ⟨X, Bit-Vector(X)⟩ and ⟨Y, Bit-Vector(Y)⟩ is ⟨X ∪ Y, Bit-Vector(X) ∩ Bit-Vector(Y)⟩, where Bit-Vector(X) ∩ Bit-Vector(Y) is computed by the AND operation on each bit of the Bit-Vectors. The support of an itemset X is the number of 1 bits in Bit-Vector(X).

All the above approaches must store the whole database in main memory. When the number of transactions is large, it is difficult to store it in main memory; we must then mine in secondary memory, which consumes more time.

In this paper, we propose a new approach for mining FCI based on Dynamic Bit-Vectors. Our approach has some advantages: (i) it computes the intersection of two DBVs and the support of itemsets fast; (ii) it overcomes the disadvantage of CHARM in that it checks each itemset as soon as it is created and need not use a hash table to detect non-closed itemsets; (iii) the subsumption concept is used for pruning the search space. With the subsumption concept, itemsets that have the same transaction identifiers are subsumed into one itemset. Hence, we save memory for storage and time for computing the intersection of DBVs.

The rest of this paper is organized as follows: Section 2 presents related work. Section 3 introduces the Dynamic Bit-Vector; some definitions are also presented in this section, and an algorithm for getting the intersection of two DBVs is proposed. Some theorems are developed in Section 4 and, based on them, the method for mining FCI is proposed. Section 5 presents experimental results. We conclude our work in Section 6.

2. Related work

2.1. Mining frequent closed itemsets

Frequent itemsets play an important role in the mining process. A frequent itemset can be formally defined as follows. Let D be a transaction database and I be the set of items in D. The support of an itemset X (X ⊆ I), denoted σ(X), is the number of transactions in D containing X. X is called a frequent itemset if σ(X) ≥ minSup, where minSup is a predefined minimum support threshold.

Frequent closed itemsets are a variant of frequent itemsets for reducing the number of rules. Formally, an itemset X is called a frequent closed itemset if it is frequent and there does not exist any frequent itemset Y such that X ⊂ Y and σ(X) = σ(Y). Many methods have been proposed for mining FCI from databases. They can be divided into the following four categories (Lee, Wang, Weng, Chen, & Wu, 2008; Yahia, Hamrouni, & Nguifo, 2006):

(i) Generate-and-test approaches: mainly based on the Apriori algorithm, they use the level-wise approach to discover FCI. Examples are Close (Pasquier et al., 1999b) and A-Close (Pasquier et al., 1999a).

(ii) Divide-and-conquer approaches: they adopt the divide-and-conquer strategy and use compact data structures extended from the frequent-pattern (FP) tree to mine FCI. Examples include Closet (Pei et al., 2000), Closet+ (Wang et al., 2003) and FPClose (Grahne & Zhu, 2005).

(iii) Hybrid approaches: they integrate both strategies above to mine FCI. They first transform the database into the vertical data format. These approaches develop properties and use a hash table to prune non-closed itemsets. CHARM (Zaki & Hsiao, 2005) and CloseMiner (Singh, Singh, & Mahanta, 2005) belong to this category.

(iv) Hybrid approaches without duplication: these differ from the hybrid approaches in not using the subsumption-checking technique, so that FCI need not be stored in main memory. They do not use the hash-table technique of CHARM (Zaki & Hsiao, 2005). DCI-Close (Lucchese, Orlando, & Perego, 2006), LCM (Uno, Asai, Uchida, & Arimura, 2004) and PGMiner (Moonestinghe, Fodeh, & Tan, 2006) belong to this category.

2.2. Vertical data format

Regarding the data format used in mining FI and FCI, there are two main kinds: horizontal and vertical. Algorithms for mining frequent itemsets based on the vertical data format are usually more efficient than those based on the horizontal data format (Dong & Han, 2007; Song et al., 2008; Zaki, 2000, 2004; Zaki & Hsiao, 2005; Zaki et al., 1997), because they often scan the database once and compute the supports of itemsets fast. The disadvantage is that they consume more memory for storing Tidsets (Zaki, 2000, 2004; Zaki & Hsiao, 2005; Zaki et al., 1997) or Bit-Vectors (Dong & Han, 2007; Song et al., 2008).

Some typical algorithms based on the vertical data format are

summarized as follows:

(i) Eclat (Zaki et al., 1997): proposed by Zaki et al. in 1997. The authors used the Galois connection to develop an efficient algorithm for mining frequent itemsets. The support of an itemset is the cardinality of its Tidset, so the support of X can be computed fast from Tidset(X): σ(X) = |Tidset(X)|. The authors also proposed computing Tidset(XY) as the intersection of Tidset(X) and Tidset(Y), i.e., Tidset(XY) = Tidset(X) ∩ Tidset(Y). Some improvements of Eclat were proposed in Zaki and Hsiao (2005), where the divide-and-conquer strategy was used to mine frequent itemsets fast.

(ii) BitTable-FI (Dong & Han, 2007): another way of compressing the data is to represent the database in the form of a BitTable, where each item occupies |T| bits and |T| is the number of transactions in D. When a new itemset XY is created from two itemsets X and Y, the Bit-Vector of XY is computed from the Bit-Vectors of the items in XY (by the AND operation between corresponding bytes of the Bit-Vectors). Because the cardinalities of the two Bit-Vectors are the same, the result is a Bit-Vector with a length of |T|/8 + 1 bytes. The mining algorithm in Dong and Han (2007) is based on Apriori (Agrawal & Srikant, 1994). The difference lies in computing the support: the Apriori algorithm computes the support by re-scanning the database, while BitTable-FI only computes the intersection of Bit-Vectors. The support of the resulting itemset can then be computed fast by counting the number of 1 bits in its Bit-Vector. An improvement of BitTable-FI was proposed in Song et al. (2008): the authors base it on the subsume concept and propose Index-BitTableFI for mining frequent itemsets. The subsuming works as follows: initially, items are sorted in increasing order of their supports. Each item i is considered with all successive items (in sorted order); if the Bit-Vector of item i is a subset of the Bit-Vector of item j, then item j belongs to the subsume set of item i. Mining frequent itemsets based on the subsume concept significantly improves the mining time in comparison with BitTable-FI.
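The Tidset-based support computation underlying both bullets can be sketched in Python (the paper's implementation is in C#; the function names here are illustrative, and the toy database is the one from Table 2 of this paper):

```python
def tidsets(db):
    """Convert a horizontal database {tid: items} into the vertical
    {item: tidset} format with a single scan of the database."""
    vertical = {}
    for tid, items in db.items():
        for item in items:
            vertical.setdefault(item, set()).add(tid)
    return vertical

# Toy database from Table 2 (transactions 1-6).
db = {1: "ABDEG", 2: "BCEF", 3: "ABDEG", 4: "ABCEFG",
      5: "ABCDEFG", 6: "BCDFGH"}
v = tidsets(db)

# Tidset(XY) = Tidset(X) ∩ Tidset(Y); support is its cardinality.
tid_AD = v["A"] & v["D"]
print(sorted(tid_AD), len(tid_AD))  # [1, 3, 5] 3
```

The same join rule drives Eclat's divide-and-conquer search; BitTable-FI replaces the sets with bit vectors so the intersection becomes a byte-wise AND.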

3. Dynamic Bit-Vector

As mentioned above, the Bit-Vector of each itemset always occupies a fixed size corresponding to the number of transactions in the given database, so it consumes more memory and more time when computing the intersection of Bit-Vectors. In practice, the bytes of a Bit-Vector that contain only 0 bits can be removed to reduce space and time. Thus, we develop the Dynamic Bit-Vector approach to solve this problem.

3.1. DBV data structure

Each DBV has two elements:

pos: the position of the first non-zero byte in the Bit-Vector.
Bit-Vector: the list of bytes of the Bit-Vector after removing the zero bytes from the head and the tail.

Example 1. Given a Bit-Vector (in decimal byte format) as shown in Fig. 1.

Fig. 1. 0 0 0 0 0 0 0 0 0 0 5 3 8 0 0 7 6 3 2 7 6 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

For the example above, BitTable needs 40 bytes of storage (Fig. 1), while DBV only consumes 14 bytes (12 bytes for the bit vector and 2 bytes for the position, as in Fig. 2). The DBV scheme thus needs less memory than the original BitTable approach.
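The compression step can be sketched in Python (a sketch only; `to_dbv` is an illustrative name, and the data is the 40-byte vector of Fig. 1):

```python
def to_dbv(bit_vector):
    """Compress a byte-level Bit-Vector into a DBV (pos, bytes):
    pos is the index of the first non-zero byte, and the byte list
    has leading and trailing zero bytes stripped."""
    first = next((i for i, b in enumerate(bit_vector) if b), None)
    if first is None:                 # all-zero vector: empty DBV
        return (0, [])
    last = max(i for i, b in enumerate(bit_vector) if b)
    return (first, bit_vector[first:last + 1])

# The 40-byte Bit-Vector of Fig. 1: 10 leading zeros, 12 payload
# bytes, 18 trailing zeros.
bv = [0] * 10 + [5, 3, 8, 0, 0, 7, 6, 3, 2, 7, 6, 5] + [0] * 18
print(to_dbv(bv))  # (10, [5, 3, 8, 0, 0, 7, 6, 3, 2, 7, 6, 5])
```

Note that interior zero bytes (the two 0s between 8 and 7) are kept; only the head and tail runs are removed.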

Fig. 2. DBV of the Bit-Vector in Fig. 1: pos = 10, Bit-Vector = {5, 3, 8, 0, 0, 7, 6, 3, 2, 7, 6, 5}.

Fig. 3. Intersection of the two DBVs of Example 2: {10, {5, 3, 8, 0, 0, 7, 6, 3, 2, 7, 6, 5}} & {13, {4, 3, 0, 1, 0, 4, 6, 0, 0, 5, 1, 3}} = {19, {6}}.

3.2. The intersection of two DBVs

The idea of the algorithm is as follows. Starting from the greater of the two position values of the two DBVs, perform the AND operation between the corresponding bytes; while the result is 0, the pos value is increased by 1, until the first non-zero value is reached. Next, from this position, keep performing the AND operation for each byte of the Bit-Vectors, dropping the trailing bytes whose results are all 0; this yields the resulting DBV.

Getting the intersection of two DBVs is illustrated by the following example.

Example 2. Assume that we want to compute the intersection of the two DBVs {10, {5, 3, 8, 0, 0, 7, 6, 3, 2, 7, 6, 5}} and {13, {4, 3, 0, 1, 0, 4, 6, 0, 0, 5, 1, 3}} (Fig. 3). Because 13 > 10, we start with pos = 13. At position 13, we have 0 & 4 = 0, so pos = 14. Similarly, 0 & 3 = 0, 7 & 0 = 0, 6 & 1 = 0, 3 & 0 = 0 and 2 & 4 = 0, so pos = 19. Next, because 7 & 6 = 6 ≠ 0, the outcome Bit-Vector starts with 6. Because the remaining bytes are all 0, the outcome DBV is {19, {6}}.

From the example above, the pseudo code for getting the intersection of two DBVs is presented as follows.

Input: Two DBVs {pos1, Bit-Vector1} and {pos2, Bit-Vector2}.
Output: DBV = {pos, Bit-Vector}.
Method: See Fig. 4.


Fig. 4 presents the pseudo code for computing the intersection of two DBVs. First, it finds the maximum of the two positions (line 1). Then it finds the positions (i and j) at which the two Bit-Vectors begin to overlap (lines 2 and 3). From these positions, the AND operator is used to find the first non-zero byte (lines 5–7); the pos variable is increased by 1 whenever the result of the AND operator is zero (finding the first non-zero position of the resulting DBV), and the count variable is decreased by 1 (the number of bytes of the intersection decreases by 1). i1 and j1 are the last positions at which the two Bit-Vectors can intersect (line 8). From these positions, the AND operator is used to find the last non-zero position (lines 9–11). The remaining bytes belong to the intersection of the two Bit-Vectors (lines 12–14).
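The pseudo code of Fig. 4 is not reproduced here, but the procedure it describes can be sketched in Python (a sketch under the DBV representation above; `dbv_and` is an illustrative name):

```python
def dbv_and(dbv1, dbv2):
    """Intersection of two DBVs (pos, bytes): AND the overlapping
    byte range, then strip leading and trailing zero bytes, following
    the steps described for the algorithm of Fig. 4."""
    (p1, b1), (p2, b2) = dbv1, dbv2
    lo = max(p1, p2)                        # greater of the two positions
    hi = min(p1 + len(b1), p2 + len(b2))    # end of the overlap
    if hi <= lo:                            # no overlap: empty DBV (Corollary 3.1)
        return (0, [])
    out = [b1[i - p1] & b2[i - p2] for i in range(lo, hi)]
    first = next((i for i, b in enumerate(out) if b), None)
    if first is None:                       # all AND results were zero
        return (0, [])
    last = max(i for i, b in enumerate(out) if b)
    return (lo + first, out[first:last + 1])

# The two DBVs of Example 2.
a = (10, [5, 3, 8, 0, 0, 7, 6, 3, 2, 7, 6, 5])
b = (13, [4, 3, 0, 1, 0, 4, 6, 0, 0, 5, 1, 3])
print(dbv_and(a, b))  # (19, [6]), as derived in Example 2
```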

3.3. Some definitions and corollaries

Definition 3.1. Given two DBVs, DBV1 = {pos1, Bit-Vector1} and DBV2 = {pos2, Bit-Vector2}, DBV1 is called a subset of DBV2, denoted DBV1 ⊆ DBV2, if and only if DBV1 ∩ DBV2 = DBV1.

Fig. 4. The pseudo code for getting the intersection between two DBVs.


Definition 3.2. Given two DBVs, DBV1 = {pos1, Bit-Vector1} and DBV2 = {pos2, Bit-Vector2}, DBV1 is called equal to DBV2, denoted DBV1 = DBV2, if pos1 = pos2, |Bit-Vector1| = |Bit-Vector2|, and ∀i ∈ [0..|Bit-Vector1| − 1]: Bit-Vector1[i] = Bit-Vector2[i].

Definition 3.3. Given DBV = {pos, Bit-Vector}, DBV is called empty (denoted ∅) if Bit-Vector is empty.

Corollary 3.1. Given two DBVs, DBV1 = {pos1, Bit-Vector1} and DBV2 = {pos2, Bit-Vector2}. If one of the two following cases is satisfied, then DBV1 ∩ DBV2 = ∅:

(i) pos1 + |Bit-Vector1| < pos2, or
(ii) pos2 + |Bit-Vector2| < pos1.

Proof

(i) pos1 + |Bit-Vector1| < pos2 implies that the count variable (in the algorithm of Fig. 4) will be 0. Therefore, the resulting Bit-Vector is empty, i.e., DBV1 ∩ DBV2 = ∅.
(ii) Similar to (i). □

Corollary 3.2. Consider the algorithm in Fig. 4: if count × 8 < minSup, then the itemset whose DBV is being computed is not a frequent itemset.

Proof. The number of 1 bits in the Bit-Vector is the support of the itemset. Besides, the value of the count variable (in line 4) is always larger than or equal to |Bit-Vector|, i.e., |Bit-Vector| ≤ count. Therefore, the number of 1 bits ≤ |Bit-Vector| × 8 ≤ count × 8 < minSup, so the itemset corresponding to the Bit-Vector is not frequent. □

In fact, if the count variable in line 12 satisfies count × 8 < minSup, the itemset that contains the DBV is also not a frequent itemset. We therefore modify the algorithm of Fig. 4 as in Fig. 5.

The algorithm of Fig. 5 differs from the algorithm of Fig. 4 in line 4 (lines 4 and 5 of Fig. 5) and in line 14 of Fig. 5. These lines check whether the count variable satisfies Corollary 3.2. If it does, the algorithm returns NULL for the resulting DBV (meaning that the Bit-Vector is NULL).
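The early-termination idea of Corollary 3.2 can be sketched as follows (a sketch, not the authors' exact Fig. 5 pseudo code; `dbv_and_minsup` and `min_sup_count` are illustrative names, with minSup given as an absolute transaction count):

```python
def dbv_and_minsup(dbv1, dbv2, min_sup_count):
    """AND two DBVs (pos, bytes), returning None as soon as the
    remaining byte count times 8 cannot reach the minimum support
    (Corollary 3.2)."""
    (p1, b1), (p2, b2) = dbv1, dbv2
    lo = max(p1, p2)
    hi = min(p1 + len(b1), p2 + len(b2))
    count = hi - lo
    if count * 8 < min_sup_count:      # cannot possibly be frequent
        return None
    out = [b1[i - p1] & b2[i - p2] for i in range(lo, hi)]
    # Strip zero bytes; every stripped byte lowers the support bound.
    while out and out[0] == 0:
        out.pop(0); lo += 1; count -= 1
        if count * 8 < min_sup_count:
            return None
    while out and out[-1] == 0:
        out.pop(); count -= 1
        if count * 8 < min_sup_count:
            return None
    return (lo, out) if out else None

a = (10, [5, 3, 8, 0, 0, 7, 6, 3, 2, 7, 6, 5])
b = (13, [4, 3, 0, 1, 0, 4, 6, 0, 0, 5, 1, 3])
print(dbv_and_minsup(a, b, 1))  # (19, [6])
print(dbv_and_minsup(a, b, 9))  # None: one byte holds at most 8 set bits
```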

3.4. A method for fast computing the itemset support

A limitation of the BitTable-based approach is that it consumes more time for computing the intersection of Bit-Vectors and for counting the number of 1 bits of a Bit-Vector. The algorithms in Dong and Han (2007) and Song et al. (2008) count the number of 1 bits of an itemset only after computing the intersection of the Bit-Vectors of the items in that itemset. For example, assume that we need to compute the support of itemset X = {x1, x2, . . ., xk}: we must first compute Bit-Vector(X) = Bit-Vector(x1) ∩ Bit-Vector(x2) ∩ · · · ∩ Bit-Vector(xk), and then scan Bit-Vector(X) to count the number of 1 bits. The complexity of support counting is thus O(nk), where n is the number of transactions and k is the length of X.

This paper proposes a new method for fast computing the support of a new itemset. When computing the intersection of two bytes to create a result byte, we use a lookup table to get the number of 1 bits in that byte. With this method, as soon as the Bit-Vector has been computed, we already have its number of 1 bits. Therefore, the complexity of support counting for an itemset is O(m), where m is the length (in bytes) of the DBV. Because m ≤ n, O(m) ≪ O(nk). If we consider only the time for counting the number of 1 bits in the resulting Bit-Vector, the complexity of the BitTable approach is O(n) while that of our approach is only O(1).

Because each element in a Bit-Vector is one byte, we can use a lookup table (as in Table 1) with 256 entries, where the ith entry contains the number of 1 bits of the value i.

Table 1
Lookup table for the number of 1 bits.

Value   Binary value   #bits 1
0       00000000       0
1       00000001       1
2       00000010       1
3       00000011       2
4       00000100       1
5       00000101       2
...     ...            ...
255     11111111       8
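The lookup-table support counting can be sketched in Python (names are illustrative; the table is Table 1 built programmatically, and the example DBV {19, {6}} comes from Example 2):

```python
# Build the 256-entry lookup table of Table 1 once: entry i holds the
# number of 1 bits in the byte value i.
BIT_COUNT = [bin(i).count("1") for i in range(256)]

def dbv_support(dbv):
    """Support of the itemset whose DBV is given: the total number of
    1 bits, accumulated byte by byte via the lookup table, so the cost
    is O(m) in the DBV's byte length."""
    _, byte_list = dbv
    return sum(BIT_COUNT[b] for b in byte_list)

print(dbv_support((19, [6])))   # 2 (6 = 00000110 has two 1 bits)
print(BIT_COUNT[255])           # 8
```

In the actual mining loop the table lookup would be folded into the byte-wise AND, so the count is ready the moment the intersection finishes, as described above.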

Fig. 5. The modified pseudo code for getting the intersection of two DBVs.


4.1. DBV-tree

Each node in a DBV-tree includes two elements, X and DBV(X), where X is an itemset and DBV(X) is the Dynamic Bit-Vector of the transactions containing X. An arc connecting two nodes X and Y must satisfy the condition that X and Y have the same length and the same |X| − 1 prefix items. Each child node of a node joins with all its successive nodes to create its grandchildren; this process is repeated recursively to create the DBV-tree.

The Bit-Vector and DBV of each item in Table 2 are shown in Table 3.

Fig. 6 illustrates the method of creating the DBV-tree for the database in Table 3 with the first five items {A, B, C, D, E}: the first level of the DBV-tree contains the single items and their DBVs. To create higher levels, each node is joined with all its following nodes.

For example, consider node A in level 1 of the DBV-tree in Fig. 6. DBV(B) = {0, {63}}, so DBV(AB) = DBV(A) ∩ DBV(B) = {0, {29}} ∩ {0, {63}} = {0, {29 & 63}} = {0, {29}}. A joins C to create a new node AC; DBV(C) = {0, {58}}, so DBV(AC) = {0, {24}}. A joins D to create a new node AD; DBV(D) = {0, {53}}, so DBV(AD) = {0, {21}}. A joins E to create a new node AE; DBV(E) = {0, {31}}, so DBV(AE) = {0, {29}}.

4.2. Subsumption concept

Definition 4.1. Given two itemsets X and Y, if DBV(X) ⊆ DBV(Y), then X is subsumed by Y.

The subsumption concept in our definition is more general than the definitions in Song et al. (2008) and Zaki and Hsiao (2005). In Song et al. (2008), the authors used the subsumption concept for subsuming single items, and in Zaki and Hsiao (2005), the authors used it for checking non-closed itemsets. The subsumption concept in Definition 4.1 can be used both for subsuming itemsets and for eliminating non-closed itemsets from the DBV-tree.
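The subsumption test DBV(X) ⊆ DBV(Y), i.e., every 1 bit of X is also set in Y, can be sketched as (a sketch on the (pos, bytes) representation above; `dbv_subset` is an illustrative name):

```python
def dbv_subset(dbv_x, dbv_y):
    """Subset test of Definition 3.1: DBV(X) ⊆ DBV(Y) holds exactly
    when DBV(X) ∩ DBV(Y) = DBV(X), i.e., X has no 1 bit missing in Y."""
    (px, bx), (py, by) = dbv_x, dbv_y
    for i, b in enumerate(bx):
        j = px + i - py                     # aligned index into Y's bytes
        yb = by[j] if 0 <= j < len(by) else 0
        if b & yb != b:                     # some 1 bit of X is not in Y
            return False
    return True

# Items A and G from Table 3: DBV(A) = {0, {29}} is a subset of
# DBV(G) = {0, {61}} (A occurs in transactions 1, 3, 4, 5; G in 1, 3, 4, 5, 6).
print(dbv_subset((0, [29]), (0, [61])))  # True: A is subsumed by G
print(dbv_subset((0, [61]), (0, [29])))  # False
```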

Table 2
An example database.

Transaction   Bought items
1             A, B, D, E, G
2             B, C, E, F
3             A, B, D, E, G
4             A, B, C, E, F, G
5             A, B, C, D, E, F, G
6             B, C, D, F, G, H

Theorem 4.1. If X is subsumed by Y, then σ(X) = σ(XY).

Proof. X is subsumed by Y, i.e., DBV(X) ⊆ DBV(Y), so Bit-Vector(XY) = Bit-Vector(X) ∩ Bit-Vector(Y) = Bit-Vector(X). It implies that the number of 1 bits of Bit-Vector(X) is equal to that of Bit-Vector(XY), i.e., σ(X) = σ(XY). □

Theorem 4.2.

(i) If X is subsumed by Y, then X is not a closed itemset.
(ii) If X is subsumed by Y, and Y is subsumed by X, then X and Y are not closed itemsets.

Table 3
Bit-Vector and DBV of each item in Table 2.

Item   Transactions       Bit-Vector   DBV
A      1, 3, 4, 5         011101       {0, {29}}
B      1, 2, 3, 4, 5, 6   111111       {0, {63}}
C      2, 4, 5, 6         111010       {0, {58}}
D      1, 3, 5, 6         110101       {0, {53}}
E      1, 2, 3, 4, 5      011111       {0, {31}}
F      2, 4, 5, 6         111010       {0, {58}}
G      1, 3, 4, 5, 6      111101       {0, {61}}
H      6                  100000       {0, {32}}

Proof (of Theorem 4.2)

(i) X is subsumed by Y, i.e., σ(X) = σ(XY) by Theorem 4.1. According to the definition of a frequent closed itemset in Section 2.1, X is not a closed itemset.
(ii) If X is subsumed by Y, and Y is subsumed by X, then σ(X) = σ(XY) = σ(Y). It implies that both X and Y are not closed. □


Fig. 6. DBV-tree for mining itemsets from the database in Table 3 with the first five items {A, B, C, D, E}.


Based on Theorem 4.2:

(i) If X is subsumed by Y, then X is not closed. Therefore, we can subsume X with Y to form XY.
(ii) If X is subsumed by Y, and Y is subsumed by X, then X will be subsumed by Y into XY and Y will be subsumed by X into YX. This means there exist two nodes with the same itemset, XY and YX, and it is necessary to remove one of them.

4.3. DBV-Miner: an algorithm for mining FCI using the DBV-tree

Fig. 7 presents an algorithm for mining frequent closed itemsets from transaction databases. Firstly, it creates the set of nodes in the first level of the DBV-tree; each node contains a single item whose support satisfies minSup (line 1). After that, it sorts the nodes of the first level in increasing order of their supports; the subsumption concept is used in this step to subsume items (line 2). This step differs from Index-BitTableFI in that it replaces itemset X by itemset XY if DBV(X) = DBV(XY), according to Theorems 4.1 and 4.2.i. Besides, by Theorem 4.2.ii, if DBV(X) = DBV(XY), then Y is deleted from the tree. Finally, the algorithm calls the procedure DBV-EXTEND (line 3). This procedure considers each node X (with DBV(X)) in L together with all nodes following it to create all the child nodes of X (lines 4–9). For each following node Y (with DBV(Y)) such that Y ⊄ X, the algorithm checks the condition in line 6; if it is true, it replaces Z by ZXY (by Theorem 4.2) and inserts XY as a child node into Li. Otherwise, if XY is not subsumed by any successive node of X in L, then by Lemma 4.1 (below) XY is inserted as a child node into Li (line 9). Finally, if the number of nodes in Li is larger than 1, the procedure DBV-EXTEND is called recursively to create all the child nodes of the nodes in Li (line 10).

Lemma 4.1. When node X (with DBV(X)) joins with node Y (with DBV(Y)) to create a new node XY (with DBV(XY)), if there exists in Lr a node Z (with DBV(Z)) such that X is a successive node of Z and XY is subsumed by Z, then XY is not closed.

Proof. When the algorithm joined node Z with all its successive nodes in Lr, Z was joined with X and with Y. There are two cases:

(ii) XY ⊂ Z: it means that X ⊂ Z or Y ⊂ Z. According to the algorithm, there is at least one child node Z′ of Z such that XY ⊆ Z′. It implies that XY is not closed. □

4.4. An example

Consider the database in Table 2 with minSup = 30%. Firstly, the single items of the database are sorted in increasing order of their supports: L = {A(0,29), C(0,58), D(0,53), F(0,58), E(0,31), G(0,61), B(0,63)} (H is pruned since its support < minSup).

Consider node A: A is subsumed by {G, E, B}, so A is replaced by AEGB.

Similarly, C is replaced by CFB and F is pruned (F has the same tidset as C), D is replaced by DGB, E is replaced by EB, and G is replaced by GB.

Finally, the nodes of the first level Lr are {AEGB(0,29), CFB(0,58), DGB(0,53), EB(0,31), GB(0,61), B(0,63)}, as in Fig. 8.

After subsuming the items in the first level, the algorithm calls the procedure DBV-EXTEND.

Consider node AEGB(0,29):

Li = ∅.
AEGB(0,29) joins CFB(0,58) into a new node AEGBCF(0,24); σ(AEGBCF) = 2 ≥ minSup, so AEGBCF(0,24) is inserted into the list of child nodes Li ({AEGBCF(0,24)}).
AEGB(0,29) joins DGB(0,53) into a new node AEGBD(0,21); σ(AEGBD) = 3 ≥ minSup, so AEGBD(0,21) is inserted into Li ({AEGBCF(0,24), AEGBD(0,21)}).
Consider node EB(0,31): because EB ⊂ AEGB, it is skipped. Similarly for nodes GB(0,61) and B(0,63).
After considering node AEGB(0,29) with all the nodes following it, we have Li = {AEGBCF(0,24), AEGBD(0,21)}. Because |Li| ≥ 2, DBV-EXTEND is called recursively with Lr = Li.
  Node AEGBCF(0,24) joins with node AEGBD(0,21): DBV(AEGBCFD) = {0, {16}}, and the support of AEGBCFD is 1 < minSup, so it is skipped.


Fig. 8. The first level of the DBV-tree after subsuming items: {} with children AEGB(0,29), CFB(0,58), DGB(0,53), EB(0,31), GB(0,61), B(0,63).

Consider node CFB(0,58):

Li = ∅.
CFB(0,58) joins node DGB(0,53) into a new node CFBDG(0,48); σ(CFBDG) = 2 ≥ minSup, so CFBDG(0,48) is inserted into the list of child nodes Li ({CFBDG(0,48)}).
CFB(0,58) joins node EB(0,31) into a new node CFBE(0,26); σ(CFBE) = 3 ≥ minSup, so CFBE(0,26) is inserted into Li ({CFBDG(0,48), CFBE(0,26)}).
CFB(0,58) joins node GB(0,61) into a new node CFBG(0,56); σ(CFBG) = 3 ≥ minSup, so CFBG(0,56) is inserted into Li ({CFBDG(0,48), CFBE(0,26), CFBG(0,56)}).
Because |Li| ≥ 2, DBV-EXTEND is called recursively with Lr = Li = {CFBDG(0,48), CFBE(0,26), CFBG(0,56)}.
  CFBDG(0,48) joins with CFBE(0,26): DBV(CFBDGE) = {0, {16}}, and the support of CFBDGE is 1 < minSup, so it is skipped.
  CFBDG(0,48) joins with CFBG(0,56): because CFBG ⊂ CFBDG, it is skipped.
  CFBE(0,26) joins with CFBG(0,56) into a new node CFBEG(0,24): because CFBEG(0,24) is subsumed by node AEGB(0,29), it is skipped.

Table 4
Features of the databases adopted.

Database    #Trans     #Items
Chess       3196       76
Mushroom    8124       120
Pumsb       49,046     7117
Pumsb*      49,046     7117
Connect     67,557     130
Accidents   340,183    468

Table 5
Number of FCI of the six databases under different minSup values.

Database    minSup (%)   #FCI
Chess       75           11,525
            70           23,892
            65           49,034
            60           98,392
Mushroom    40           140
            30           427
            20           1197
            10           4095
Pumsb       96           54
            94           216
            92           610
            90           1465
Pumsb*      55           116
            50           248
            45           713
            40           2610
Connect     98           135
            94           1237
            90           3486
            86           7087
Accidents   80           149
            70           529
            60           2074
            50           8057

Fig. 9 presents the DBV-tree for the database in Table 2 with minSup = 30% (mining frequent closed itemsets that are contained in at least two transactions). From Fig. 9, all the FCI that satisfy minSup are {AEGB:4, CFB:4, DGB:4, EB:5, GB:5, B:6, AEGBCF:2, AEGBD:3, CFBDG:2, CFBE:3, CFBG:3}.

According to Lemma 4.1, if XY is not closed, we can prune it from the DBV-tree. For example, consider Fig. 9: DGBE(0,21) is subsumed by node AEGB(0,29) (DBV(AEGBD) = {0, {21}}, which implies σ(DGBE) = σ(AEGBD)), so DGBE is not added to the DBV-tree. Similarly, EBG and CFBEG are subsumed by AEGB, so they are not added to the DBV-tree.

By Lemma 4.1, we need not use a hash table to check non-closed itemsets as CHARM does. Besides, an itemset is pruned immediately upon creation if it satisfies Lemma 4.1.
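The FCI listed for this example can be double-checked by brute force (an illustrative check on the Table 2 toy database, not the DBV-Miner algorithm itself; all names are ours):

```python
from itertools import combinations

# Table 2 database: transactions 1-6 and their items.
db = {1: "ABDEG", 2: "BCEF", 3: "ABDEG", 4: "ABCEFG",
      5: "ABCDEFG", 6: "BCDFGH"}
items = sorted(set("".join(db.values())))

def support(itemset):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for t in db.values() if set(itemset) <= set(t))

min_sup = 2  # 30% of 6 transactions, i.e., at least two transactions

# Enumerate all frequent itemsets (the item universe is tiny here).
frequent = {}
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        s = support(combo)
        if s >= min_sup:
            frequent[frozenset(combo)] = s

# Closed: no proper superset has the same support.
fci = {x: s for x, s in frequent.items()
       if not any(x < y and s == frequent[y] for y in frequent)}

for x, s in sorted(fci.items(), key=lambda p: (-p[1], sorted(p[0]))):
    print("".join(sorted(x)), s)
```

Running this yields exactly the eleven FCI above (e.g., ABEG with support 4 and ABCEFG with support 2, matching AEGB:4 and AEGBCF:2 up to item ordering).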


5. Experimental results

Experiments were conducted to show the performance of the proposed algorithms. They were run on a Centrino Core 2 Duo (2 × 2.53 GHz) with 4 GB of RAM, running Windows 7. The algorithms were coded in C# 2008. Six databases from http://fimi.cs.helsinki.fi/data/ (downloaded in April 2005) were used for the experiments, with their features displayed in Table 4.

Fig. 9. The DBV-tree for mining FCI from the database in Table 2 with minSup = 30% (DGBE, EBG and CFBEG are subsumed by AEGB and are not added to the tree).

Table 5 shows the number of FCI of the six databases above under different minSup values.

5.1. Comparison of mining time

Experiments were made to compare the mining times of CHARM (Zaki & Hsiao, 2005) and DBV-Miner for different minSup values. The results for the six databases are shown in Figs. 10–15.

Fig. 10 shows the mining times of CHARM (Zaki & Hsiao, 2005) and DBV-Miner on the Chess database. The results show that our approach is more efficient than CHARM. For example, with minSup = 60%, the mining time of CHARM is 19.83 (s) and that of DBV-Miner is 4.83 (s), a ratio of 4.83/19.83 × 100% ≈ 24.4%.

In the Connect database, the gap between the two approaches is very large. For example, with minSup = 86%, the mining time of CHARM is 40.03 (s), while that of DBV-Miner is 13.29 (s).

Fig. 10. Execution time of the two algorithms for Chess under different minSup values.

Fig. 11. Execution time of the two algorithms for Mushroom under different minSup values.

Fig. 12. Execution time of the two algorithms for Pumsb under different minSup values.

Fig. 13. Execution time of the two algorithms for Pumsb* under different minSup values.

Fig. 14. Execution time of the two algorithms for Connect under different minSup values.

Fig. 15. Execution time of the two algorithms for Accidents under different minSup values.


Next, experiments were conducted to compare the mining time of the proposed algorithm when replacing DBV in the DBV-tree by BitTable (Dong & Han, 2007; Song et al., 2008). The results for the six databases are shown in Figs. 16–21.

Fig. 16 shows the mining times of the BitTable-based and DBV-based versions on the Chess database. The results show that the DBV-based version is more efficient than the BitTable-based version. For example, with minSup = 60%, the mining time of the BitTable-based version is 5.65 (s) and that of the DBV-based version is 4.83 (s).

Finally, experiments were conducted to compare the total memory used (MB) for the Tidsets/Bit-Vectors of the two algorithms CHARM (Zaki & Hsiao, 2005) and DBV-Miner. The results for the six databases under different minSup values are shown in Figs. 22–27.

Fig. 16. Execution time of BitTable-based and DBV-based in the proposed algorithm for Chess under different minSup values.

Fig. 17. Execution time of BitTable-based and DBV-based in the proposed algorithm for Mushroom under different minSup values.

Fig. 18. Execution time of BitTable-based and DBV-based in the proposed algorithm for Pumsb under different minSup values.

Fig. 19. Execution time of BitTable-based and DBV-based in the proposed algorithm for Pumsb* under different minSup values.

Fig. 20. Execution time of BitTable-based and DBV-based in the proposed algorithm for Connect under different minSup values.

Fig. 21. Execution time of BitTable-based and DBV-based in the proposed algorithm for Accidents under different minSup values.


Fig. 22. Memory used (MB) of the two algorithms for Chess under different minSup values.

Fig. 23. Memory used (MB) of the two algorithms for Mushroom under different minSup values.

Fig. 24. Memory used (MB) of the two algorithms for Pumsb under different minSup values.

Fig. 25. Memory used (MB) of the two algorithms for Pumsb* under different minSup values.

Fig. 26. Memory used (MB) of the two algorithms for Connect under different minSup values.

Figs. 22–27 show the total memory used for the six databases under different minSup values. We can see that CHARM always consumes more memory than DBV-Miner. For example, for the Chess database with minSup = 60%, the total memory used for Tidsets (CHARM) is 802.6 MBs, while that for DBVs (DBV-Miner) is only 31.12 MBs.

6. Conclusion and future work

In this paper, we have proposed a new method for mining FCI

from transaction databases. This method has some advantages:

Fig. 27. Memory used (MBs) of the two algorithms for Accidents under different minSup values.

only one scan of the database is needed in the whole mining process. Next, an algorithm for fast mining FCI has been proposed. This algorithm uses DBVs to store the database. The advantages of this method are fast support computation based on the intersection of two DBVs, and the pruning of non-closed itemsets by using Lemma 4.1. Experimental results show the efficiency of this method in both mining time and memory usage.
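The intersection-based support computation summarized above can be sketched as follows, assuming for illustration that a DBV is a pair of a byte-aligned start offset and packed bits; the paper's actual data layout may differ:

```python
def intersect_dbv(dbv_a, dbv_b):
    """Intersect two dynamic bit-vectors given as (byte_offset, packed_bits)."""
    off_a, bits_a = dbv_a
    off_b, bits_b = dbv_b
    # Align both vectors on the later start offset (byte-aligned for simplicity).
    start = max(off_a, off_b)
    a = bits_a[start - off_a:]
    b = bits_b[start - off_b:]
    # zip() stops at the shorter vector, so trailing zeros are dropped.
    out = bytes(x & y for x, y in zip(a, b))
    return start, out

def support(dbv):
    # Support = number of set bits (transactions containing the itemset).
    _, bits = dbv
    return sum(bin(byte).count("1") for byte in bits)

# Item A in transactions {0,1,2,3}, item B in {2,3,4,5}, MSB-first bit order:
a = (0, bytes([0b11110000]))
b = (0, bytes([0b00111100]))
ab = intersect_dbv(a, b)
print(support(ab))  # 2 -> itemset {A, B} occurs in transactions {2, 3}
```

Because the AND and the popcount both operate on whole machine words, one intersection both produces the new itemset's DBV and yields its support with no second pass over the data.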

Mining frequent itemsets in incremental databases has been developed in recent years (Hong, Lin, & Wu, 2009; Hong & Wang, 2010; Hong, Wu, & Wang, 2009; Lin, Hong, & Lu, 2009; Zhang, Zhang, & Jin, 2009). We can see that DBV can be applied for fast mining from this kind of database: based on DBV, we can mine all FI/FCI when transactions are inserted, updated or deleted. Besides, mining association rules from a lattice is more efficient than from FI/FCI (Vo & Le, 2009, 2011a, 2011b). Therefore, we will develop an algorithm for building the frequent closed itemset lattice based on DBV.

References

Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules. In VLDB'94 (pp. 487–499).

Bastide, Y., Pasquier, N., Taouil, R., Stumme, G., & Lakhal, L. (2000). Mining minimal non-redundant association rules using frequent closed itemsets. In First international conference on computational logic (pp. 972–986).

Dong, J., & Han, M. (2007). BitTable-FI: An efficient mining frequent itemsets algorithm. Knowledge-Based Systems, 20(4), 329–335.

Grahne, G., & Zhu, J. (2005). Fast algorithms for frequent itemset mining using FP-trees. IEEE Transactions on Knowledge and Data Engineering, 17(10), 1347–1362.

Hong, T.-P., Lin, C.-W., & Wu, Y.-L. (2009). Maintenance of fast updated frequent pattern trees for record deletion. Computational Statistics and Data Analysis, 53(7), 2485–2499.

Hong, T.-P., & Wang, C.-J. (2010). An efficient and effective association-rule maintenance algorithm for record modification. Expert Systems with Applications, 37(1), 618–626.

Hong, T.-P., Wu, Y.-Y., & Wang, S.-L. (2009). An effective mining approach for up-to-date patterns. Expert Systems with Applications, 36(6), 9747–9752.

Lee, A. J. T., Wang, C. S., Weng, W. Y., Chen, J. A., & Wu, H. W. (2008). An efficient algorithm for mining closed inter-transaction itemsets. Data and Knowledge Engineering, 66(1), 68–91.

Lin, C.-W., Hong, T.-P., & Lu, W.-H. (2009). The Pre-FUFP algorithm for incremental mining. Expert Systems with Applications, 36(5), 9498–9505.

Lucchese, C., Orlando, S., & Perego, R. (2006). Fast and memory efficient mining of frequent closed itemsets. IEEE Transactions on Knowledge and Data Engineering, 18(1), 21–36.

Moonesinghe, H. D. K., Fodeh, S., & Tan, P. N. (2006). Frequent closed itemsets mining using prefix graphs with an efficient flow-based pruning strategy. In Proceedings of the 6th ICDM, Hong Kong (pp. 426–435).

Pasquier, N., Bastide, Y., Taouil, R., & Lakhal, L. (1999a). Discovering frequent closed itemsets for association rules. In Proceedings of the 5th international conference on database theory. LNCS (pp. 398–416). Jerusalem, Israel: Springer.

Pasquier, N., Bastide, Y., Taouil, R., & Lakhal, L. (1999b). Efficient mining of association rules using closed itemset lattices. Information Systems, 24(1), 25–46.

Pei, J., Han, J., & Mao, R. (2000). CLOSET: An efficient algorithm for mining frequent closed itemsets. In Proceedings of the 5th ACM-SIGMOD workshop on research issues in data mining and knowledge discovery, Dallas, Texas, USA (pp. 11–20).

Singh, N. G., Singh, S. R., & Mahanta, A. K. (2005). CloseMiner: Discovering frequent closed itemsets using frequent closed tidsets. In Proceedings of the 5th ICDM, Washington DC, USA (pp. 633–636).

Song, W., Yang, B., & Xu, Z. (2008). Index-BitTableFI: An improved algorithm for mining frequent itemsets. Knowledge-Based Systems, 21(6), 507–513.

Uno, T., Asai, T., Uchida, Y., & Arimura, H. (2004). An efficient algorithm for enumerating closed patterns in transaction databases. In Proceedings of the 7th international conference on discovery science. LNCS (pp. 16–31). Padova, Italy: Springer.

Vo, B., & Le, B. (2009). Mining traditional association rules using frequent itemsets lattice. In The 39th international conference on computers and industrial engineering, Troyes, France, July 6–8 (pp. 1401–1406). IEEE.

Vo, B., & Le, B. (2011a). Interestingness measures for mining association rules: Combination between lattice and hash tables. Expert Systems with Applications, 38(9), 11630–11640.

Vo, B., & Le, B. (2011b). Mining minimal non-redundant association rules using frequent itemsets lattice. Journal of Intelligent Systems Technology and Applications, 10(1), 92–206.

Wang, J., Han, J., & Pei, J. (2003). CLOSET+: Searching for the best strategies for mining frequent closed itemsets. In ACM SIGKDD international conference on knowledge discovery and data mining (pp. 236–245).

Yahia, S. B., Hamrouni, T., & Nguifo, E. M. (2006). Frequent closed itemset based algorithms: A thorough structural and analytical survey. ACM SIGKDD Explorations Newsletter, 8(1), 93–104.

Zaki, M. J. (2000). Generating non-redundant association rules. In Proceedings of the 6th ACM SIGKDD international conference on knowledge discovery and data mining, Boston, Massachusetts, USA (pp. 34–43).

Zaki, M. J. (2004). Mining non-redundant association rules. Data Mining and Knowledge Discovery, 9(3), 223–248.

Zaki, M. J., Parthasarathy, S., Ogihara, M., & Li, W. (1997). New algorithms for fast discovery of association rules. In Third international conference on knowledge discovery and data mining (KDD) (pp. 283–286).

Zaki, M. J., & Hsiao, C. J. (2005). Efficient algorithms for mining closed itemsets and their lattice structure. IEEE Transactions on Knowledge and Data Engineering, 17(4), 462–478.

Zhang, S., Zhang, J., & Jin, Z. (2009). A decremental algorithm of frequent itemset maintenance for mining updated databases. Expert Systems with Applications, 36(8), 10890–10895.
