
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 19, NO. 9, SEPTEMBER 2007

Compressed Hierarchical Mining of Frequent Closed Patterns from Dense Data Sets

Liping Ji, Kian-Lee Tan, Member, IEEE Computer Society, and Anthony K.H. Tung

Abstract—This paper addresses the problem of finding frequent closed patterns (FCPs) from very dense data sets. We introduce two
compressed hierarchical FCP mining algorithms: C-Miner and B-Miner. The two algorithms compress the original mining space,
hierarchically partition the whole mining task into independent subtasks, and mine each subtask progressively. The two algorithms
adopt different task partitioning strategies: C-Miner partitions the mining task based on Compact Matrix Division, whereas B-Miner
partitions the task based on Base Rows Projection. The compressed hierarchical mining algorithms enhance the mining efficiency and
facilitate a progressive refinement of results. Moreover, because the subtasks can be mined independently, C-Miner and B-Miner can
be readily parallelized without incurring significant communication overhead. We have implemented C-Miner and B-Miner, and our
performance study on synthetic data sets and real dense microarray data sets shows their effectiveness over existing schemes. We
also report experimental results on parallel versions of these two methods.

Index Terms—Frequent closed patterns, progressive, dense data sets, data mining, parallel mining.

1 INTRODUCTION

FREQUENT pattern (FP) mining [1], [5], [13], [9] is a fundamental step to several essential data mining tasks, including association analysis, correlation analysis, causality analysis, association-based classification, and clustering. However, the number of FPs can be too large for them to be of practical use, especially for dense data sets and/or when low support thresholds are used. To reduce the number of FPs, frequent closed pattern (FCP) mining has been introduced and successfully adopted for data analysis in many domains. In particular, FCPs mined from gene expression data have been used to build association rules to uncover gene regulation networks [3], [16] and to build classifiers for diagnosis [17].

Some notable FCP mining schemes include CLOSET+ [10], CHARM [12], CARPENTER [6], REPT [3], and D-miner [2]. Although these algorithms have been shown to perform well in their respective contexts, it turns out that they are not suited for applications that involve data sets with very high density, where nearly 50 percent or more of the cells contain ones (as we shall see, all the real data sets that we used in the performance study are dense): they are either very inefficient (that is, they take hours or even days to produce patterns, even with a high minimum support threshold) or may even fail (that is, run out of memory). In addition, these methods are nonprogressive; that is, the users are swarmed with all the answer patterns (after a very long wait) at a single time when the algorithm completes.

In this paper, we present a top-down compressed hierarchical framework that allows us to mine FCPs from dense data sets efficiently and progressively. The framework employs a compact strategy that compresses the original mining space to reduce the workload and then partitions the whole mining task hierarchically into a number of smaller subtasks such that 1) each subtask can be mined independently and 2) the union of the FCPs from all subtasks is equal to the FCPs obtained from the whole mining task. Based on this framework, we propose two algorithms, C-Miner and B-Miner, for efficient and progressive mining of FCPs. The two schemes differ in two ways. First, the partition methods are different: C-Miner partitions the mining task based on Compact Matrix Division, whereas B-Miner partitions the task based on Base Rows Projection. Second, because the partitioning methods are different, different pruning strategies are used.

Compared with previous FCP mining algorithms, we have made three key contributions. First, the compact framework reduces the search space and, hence, greatly enhances the mining efficiency. Second, the schemes are amenable to parallelism with little or no synchronization (and, hence, negligible communication overhead): the subtasks can be mined independently and concurrently across a number of parallel sites. This is critical, as to our knowledge, there is no reported work in the literature on parallel FCP mining. Finally, the hierarchical framework facilitates a progressive refinement of results from approximation to precision. This is very helpful in some application domains. For example, in biological applications, instead of waiting long hours for all precise results, biologists would now be provided with a general picture of the approximate results from the upper levels of the hierarchical tree in a very short response time. This way, biologists could refine only those patterns that are of interest to them.

We have implemented C-Miner and B-Miner and experimented with synthetic data sets and three real microarray data sets. Our results show that our C-Miner and B-Miner are

The authors are with the Department of Computer Science, National University of Singapore, 3 Science Drive 2, Singapore 117543. E-mail: {jiliping, tankl, atung}@comp.nus.edu.sg.
Manuscript received 19 May 2006; revised 11 Dec. 2006; accepted 9 Apr. 2007; published online 18 Apr. 2007.
For information on obtaining reprints of this article, please send e-mail to: tkde@computer.org, and reference IEEECS Log Number TKDE-0268-0506.
Digital Object Identifier no. 10.1109/TKDE.2007.1047.

superior to CLOSET+, REPT, and D-Miner on dense data sets. We also report results on parallel versions of our proposed schemes.

The rest of this paper is organized as follows: In the next section, we summarize some previous works. In Section 3, we present some preliminaries. Section 4 presents the proposed C-Miner and B-Miner algorithms. In Section 5, we report experimental results obtained from comparing C-Miner and B-Miner against some existing schemes. Finally, we conclude in Section 6.

2 RELATED WORK

There are a number of previous approaches for mining FPs [1], [5], [13], [9]. These schemes generate a large number of patterns, and many of them are redundant. To reduce the number of FPs, FCP mining algorithms have been proposed. Some notable schemes include CLOSET [8], CLOSET+ [10], and CHARM [12]. However, these three methods all adopt a feature enumeration strategy and are unable to handle data sets with a large number of features (columns) and a small number of rows (for example, biological data sets).

A recently proposed FCP mining algorithm, CARPENTER [6], is designed to deal with the special "large columns, small rows" characteristic of biological data sets. CARPENTER combines the depth-first, row enumeration strategy with some efficient search pruning techniques, which results in a scheme that outperforms traditional closed pattern (CP) mining algorithms on biological data. Another algorithm, COBBLER [7], has also been proposed to mine biological data sets. COBBLER is designed to dynamically switch between feature enumeration and row enumeration, depending on the data characteristics in the process of mining. However, the decision to switch the enumeration strategies at runtime is not very precise and is costly. Another algorithm is REPT [3]. REPT traverses the row enumeration tree using a projected transposed table. The projected transposed table is represented by a prefix tree, which is similar to the FP-tree [8]. However, unlike the FP-tree, whose nodes represent items, nodes in a prefix tree are rows. Experimental results showed that REPT is more efficient than CLOSET+ and CARPENTER [3]. In [15], another FCP mining algorithm, Linear time Closed itemset Miner (LCM), was proposed. LCM develops a new data structure that integrates bitmap, prefix tree, and array list. Unfortunately, all these algorithms do not work well when the data set is dense. In [2], a novel algorithm, D-miner, was proposed to identify closed sets of attributes (or items) for dense and highly correlated Boolean contexts. However, the efficiency of D-miner highly depends on the minimum number of the data set's rows/columns containing "0." As a result, when the data set has a relatively large number of rows and columns, D-miner loses its advantages.

Although the above algorithms may have good applications in their specific domains, they are not particularly effective for dense biological data sets with a large number of features: they either take a long time to produce all answers or, even worse, they may fail to generate any patterns as a result of running out of memory. Since CLOSET+ [10], REPT [3], and D-Miner [2] represent the state-of-the-art efficient FCP mining algorithms for relatively dense microarray data, we conduct experiments to compare our proposed schemes against them.

As data mining is computationally expensive, there have also been a number of attempts to design parallel and distributed mining algorithms. As noted in the survey paper on parallel association mining [14], most of the previous parallel pattern mining algorithms are extensions of their sequential counterparts. For example, Count Distribution is based on Apriori, ParEclat on Eclat, and APM on DIC. However, most of these incur significant communication overhead. Several recently proposed parallel FP mining algorithms [4], [11] avoid such communication cost with either new data structures or new partition methods. In [4], an algorithm called Inverted Matrix is proposed, which exploits replication across parallel nodes, and a relatively small independent tree is built to summarize co-occurrences, which ensures minimum interprocessor communication. In [11], a parallel projection approach for partitioning the transaction data is proposed to mine FPs without communicating information. However, to our knowledge, no parallel algorithms for closed FP mining have been reported in the literature.

3 PRELIMINARIES

We shall first define some notations that we will be using throughout this paper and then give the problem description.

Let R = {r1, r2, ..., rn} denote a set of rows (transactions) and C = {c1, c2, ..., cm} denote a set of columns (items/features). In the data set, each row ri contains a set of columns, and each column cj is contained in a set of rows. In this paper, we represent a data set by a binary n × m matrix O, where cell O_{i,j} corresponds to the relationship between row i and column j. A true value (that is, "1") denotes the "containing/contained" relationship and a false value otherwise. Table 1 shows an example. In the table, r3 contains c2 and c6, denoted as C(r3) = {c2, c6}, and c7 is contained in r5 and r6, denoted as R(c7) = {r5, r6}.

TABLE 1
A Sample Data Set (Matrix O)

Definition 3.1 (Column (Feature) Support Set R(C')). Given a set of columns C' ⊆ C, the maximal set of rows that contains C' is defined as the Column (Feature) Support Set R(C') ⊆ R, and this set is unique. For example, in Table 1, let C' = {c1, c4}. Then, R(C') = {r4, r6}, since r4 and r6 contain c1 and c4, and no other rows contain both columns.

Definition 3.2 (Row Support Set C(R')). Given a set of rows R' ⊆ R, the maximal set of columns that contains R' is defined as the Row Support Set C(R') ⊆ C, and this set is unique. For example, in Table 1, let R' = {r1, r2}. Then,

C(R') = {c1, c6}, since c1 and c6 are contained in r1 and r2, and no other columns are contained in both rows.

Definition 3.3 (Support |R(C')|). Given a set of columns C', the number of rows in the data set that contains C' is defined as the Support of C', denoted as |R(C')|.

Definition 3.4 (Pattern Length |C(R')|). Given a set of rows R', the number of features (columns) in the data set that contains R' is defined as the Pattern Length |C(R')|, denoted as Len.

Definition 3.5 (Closed Patterns (CP)). Given a set of features (columns) C' ⊆ C and a set of rows R' ⊆ R, a pattern f = (R' × C') ⊆ O is defined as a CP if 1) C' = C(R') and 2) R' = R(C'). For clarity, f = (R' × C') is written as f = (R', C'). Moreover, conditions 1 and 2 are referred to as closed in the column set and the row set, respectively.

Definition 3.6 (Frequent Closed Patterns (FCP)). A pattern f = (R' × C') ⊆ O is called an FCP if 1) the support |R(C')| and the pattern length |C(R')| are higher than the minimum support and minimum pattern length thresholds, respectively, and 2) f is a CP. For example, in Table 1, given that minsup = 1 and minlen = 2, the pattern f = {r2, r4} × {c1, c2, c6} will be an FCP. However, f' = {r2, r4} × {c1, c2} is not an FCP in that its column set {c1, c2} is not equal to its row support set {c1, c2, c6}. That is, f' is not closed in the column set.

Definition 3.7 (Data Density). The Data Density (denoted as Density) is defined as the percentage of cells containing the value "1" in the (Boolean) data set.

Problem definition (FCP mining). Given a data set O, our problem is to discover all FCPs with respect to a user support threshold minsup and a user pattern length threshold minlen.

Before leaving this section, we would like to point out that unlike previous FCP definitions, which refer to an FCP only as the column set C', we will often need to refer to both the column set C' and the column support set R(C') as an FCP. As such, for convenience, we will also refer to the submatrix R(C') × C' as an FCP. This is important in biological data analysis, where both genes and experimental conditions or tissue samples are very useful information in the resulting patterns. In these cases, a submatrix rather than a set of items is preferred.

4 COMPRESSED HIERARCHICAL FCP MINING

In this section, we first present the basic framework for compressed hierarchical FCP mining. We then present the two schemes, C-Miner and B-Miner, that are based on the framework. Finally, we show how the framework can be easily adapted for parallel FCP mining.

4.1 A Framework for Compressed Hierarchical FCP Mining

The basic idea of the compressed hierarchical FCP mining framework comprises three phases: the data compressing phase, the subtask generation phase, and the subtask mining phase. Let O be the original data set (matrix) to be mined. Let MineFCP(O) denote the whole mining task and Mine(Si) denote the ith subtask divided from the whole task. In the first phase, the data is compressed to reduce the mining space from O to O'. Then, in the subtask generation phase, O' is recursively¹ partitioned and decompressed into subtasks S1, S2, ..., St, t ≥ 1, such that

    MineFCP(O) = ∪_{i=1..t} Mine(Si),    (1)

and, where i ≠ j,

    Mine(Si) ∩ Mine(Sj) = ∅.    (2)

In other words, the whole mining task is split into independent subtasks such that the union of the FCPs mined from all the subtasks is equal to the actual answer. This property allows us to mine the various subtasks independently and concurrently. This way, answers obtained from a subtask can be returned immediately to the users without having to wait for all other subtasks to finish, hence reducing the response time. Moreover, since a subtask is smaller than the original whole task, it can be mined more efficiently. As already mentioned, the ability to mine the subtasks independently facilitates parallelism.

In the third phase, the subtask mining phase, each subtask is mined for FCPs independently and progressively. As each subtask is mined hierarchically, approximate results are refined from the upper levels into the lower levels progressively, and the patterns contained in the leaves are the final exact FCPs.

In the next two sections, we shall present two algorithms, C-Miner and B-Miner, that are based on this framework. The two schemes differ in how the whole task is partitioned and, hence, in the pruning strategies.

We would like to mention that, for both algorithms, we have a data preprocessing phase to remove rows and columns that do not satisfy the user-prespecified minsup and minlen thresholds for efficiency.

4.2 Algorithm C-Miner

In this section, we shall present C-Miner, which is based on Compact Rows Enumeration.

4.2.1 Compressing the Mining Space

In the first phase of C-Miner, we compress the original mining space O into a smaller compact space O' without any information loss. The compressing phase comprises two steps.

In the first step, similar rows in the original data set O, which is an n × m binary matrix, are grouped together by clustering. Any clustering algorithm can be employed here. In our experimental study, we have used CLUTO.² The number of clusters k is a user-specified parameter. In CLUTO, the desired k-way clustering solution is computed by performing a sequence of k − 1 repeated bisections. In this approach, the matrix is first clustered into two groups. Then, one of these groups is selected and bisected further. This process continues until the desired number of clusters is found. During each step, the cluster is bisected so that the resulting two-way clustering solution optimizes the clustering criterion function that

¹ In our current implementation, we adopt only one level of splitting.
² http://www-users.cs.umn.edu/~karypis/cluto/.
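The support-set operators R(·) and C(·) of Definitions 3.1-3.5 can be sketched directly in code. The matrix below is illustrative only (it is not the paper's Table 1, which is not reproduced in the text):

```python
# Sketch of Definitions 3.1-3.5 on a small hypothetical binary data set,
# stored as a map from each row to the set of columns it contains.

M = {
    "r1": {"c1", "c2"},
    "r2": {"c1", "c2", "c3"},
    "r3": {"c2", "c3"},
}
ROWS = set(M)

def R(cols):
    """Column support set R(C'): maximal set of rows containing every column in C'."""
    return {r for r in ROWS if cols <= M[r]}

def C(rows):
    """Row support set C(R'): maximal set of columns contained in every row in R'."""
    return set.intersection(*(M[r] for r in rows)) if rows else set()

def is_closed(rows, cols):
    """A pattern (R', C') is a CP iff C' = C(R') and R' = R(C')  (Definition 3.5)."""
    return cols == C(rows) and rows == R(cols)
```

On this toy matrix, ({r1, r2}, {c1, c2}) is closed, whereas ({r2}, {c1, c2}) is not, because its column set is smaller than C({r2}) = {c1, c2, c3}.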

TABLE 2
C-Miner Compact Matrix O'

TABLE 3
Cutters

    maximizes  Σ_{i=1..k} √( Σ_{v,u ∈ Ri} sim(v, u) ),

where Ri is the set of rows assigned to the ith cluster, v and u represent two rows, and sim(v, u) is the cosine similarity between two rows. The cluster to be selected for further partitioning is controlled by the rule that its bisection will optimize the value of the overall clustering criterion function the most.

In the second step, rows within the same cluster are combined to form a new compact row, called a cluster row. Let G = {r1, r2, ..., rq} be the set of rows of a particular cluster D. Then, the cluster can be represented as a q × m matrix D. The cluster row of D, denoted L = {l1, l2, ..., lm}, is formed according to the rule that lj = ∨_{i=1..q} d_{i,j}, where j = 1, 2, ..., m. That is, the cell value of the cluster row is 0 only when all of its make-up values are 0; otherwise, the cell value is 1. By the above processing, O is transformed into a compact k × m matrix O', where k is the number of clusters and k ≤ n. Given matrix O in Table 1, for example, let us suppose that its rows "r1, r2, r3" and "r5, r6" are grouped into clusters L1 and L3, respectively. Then, its compact matrix O' is shown in Table 2.

4.2.2 Partitioning the Mining Task

In the second phase of C-Miner, the compact search space O' is split into overlapping subspaces such that the mining of each subspace can be treated as an independent subtask.

Given the 2D Boolean matrix as the search space, the CPs are actually the maximal rectangles with cells all valued "1." Hence, if we could remove the useless "0"s from the 2D matrix, then we would narrow the search space greatly. Fig. 1 illustrates the zero removing principle. Let rectangle (ABCD) represent the whole data matrix and rectangle (AGEK) represent the useless "0 zone" to be removed. From the edge of (AGEK), two lines EF and GH are derived, which split the rectangle (ABCD) into two new pieces: (CDEF) and (HDGB). Each piece is still a rectangle, and the equation (CDEF) ∪ (HDGB) = (ABCD) \ (AGEK) is satisfied. In any of the new pieces, there may still exist "0 zones." The same principle can be applied until all "0 zones" are removed. We try to remove as many "0"s as possible in each splitting. Hence, we group "0"s together on the larger dimension for efficiency. For the example in Table 2, zeros in O' shall be grouped on the column dimension in that the number of columns is larger than that of the cluster rows.

We use Z to denote a set of cell groups, which are partitions of the false values (that is, "0") of the Boolean matrix. Given X ⊆ L and Y ⊆ C, an element (X, Y) ∈ Z is called a "cutter" if ∀li ∈ X and ∀cj ∈ Y, O'_{i,j} = 0. We call X and Y the left atom and the right atom of cutter (X, Y), respectively. We group the "0" cells cluster row by cluster row; hence, Z contains as many cutters as there are cluster rows of the compact matrix O'. Each cutter is composed of the cells valued 0 in the cluster row. Table 3 shows the three cutters generated from the running example of matrix O' in Table 2.

We start with the compact matrix O'(L, C) and split it recursively by using the cutters of Z until all cutters in Z are used and, consequently, all cells in each resulting node have the value "1." A cutter (X, Y) in Z is used to cut a node (L', C') if X ∩ L' ≠ ∅ and Y ∩ C' ≠ ∅. In this case, we say that the cutter is "applicable" to the node. By convention, we define the left son of (L', C') by (L' \ X, C') and the right son by (L', C' \ Y).

During the splitting process, only nodes not satisfying the minsup and minlen are pruned off. Thus, no valuable information for the FCP mining is removed. The support of a node is calculated from the number of its original rows rather than the number of its cluster rows. Each resulting leaf is a compact submatrix with all cells valued "1."

Let minsup = 2 and minlen = 2. Fig. 2 shows the splitting tree of our running example, and the compact submatrices generated are shown in Table 4 (made up of columns 2, 4, 5, and 6).

We note that the ordering in which cutters are applied affects performance. As a heuristic, cutters with more 0s are applied first, as this will result in a shorter tree (and, hence, more efficient processing).

Fig. 1. Zero removing principle.

Fig. 2. Splitting tree using cutters.
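The two-step compression and the cutter generation can be sketched as follows. The 4 × 4 matrix and the cluster assignment are illustrative assumptions (the paper clusters with CLUTO, which is not invoked here):

```python
# Sketch of C-Miner's second compression step (OR-combining each cluster's
# rows into a cluster row) and of cutter generation (one cutter per cluster
# row, whose right atom is that row's set of zero columns).

rows = {
    "r1": [1, 1, 0, 1],
    "r2": [1, 1, 0, 0],
    "r3": [0, 1, 1, 1],
    "r4": [0, 1, 1, 1],
}
clusters = {"l1": ["r1", "r2"], "l2": ["r3", "r4"]}   # assumed clustering

def cluster_row(members):
    """OR-combine member rows: a cell is 0 only if all make-up cells are 0."""
    return [max(col) for col in zip(*(rows[r] for r in members))]

compact = {l: cluster_row(ms) for l, ms in clusters.items()}

def cutters(compact_matrix, columns):
    """One cutter (left atom {l}, right atom = zero columns) per cluster row."""
    cs = []
    for l, vec in compact_matrix.items():
        zeros = {c for c, v in zip(columns, vec) if v == 0}
        if zeros:
            cs.append(({l}, zeros))
    return cs

cols = ["c1", "c2", "c3", "c4"]
Z = cutters(compact, cols)   # [({'l1'}, {'c3'}), ({'l2'}, {'c1'})]
```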



TABLE 4
Resulting Compact Submatrices and Decompressed Subspaces (minsup = 2, minlen = 2)

After applying the zero removing principle on O', all useful information for mining FCPs is kept in the leaf nodes of O', and all these leaf nodes are different from each other.

Theorem 1. Let FCPs be the set of FCPs of a 2D data set O. Let LV = {lv1, lv2, ..., lvz} be the set of all leaf nodes derived by the zero removing principle from the splitting tree with root O. Then, both 1) lvi ≠ lvj, where i ≠ j, and 2) FCPs ⊆ LV hold.

Due to space limitation, readers are referred to [18] for the proof of this theorem and all subsequent theorems in the paper.

Finally, for each compact submatrix in the leaf nodes, its cluster rows are "decompressed" back into their original rows. The decompression introduces new cells that may contain 0s in the corresponding data set. Now, each of these data sets forms a subspace from which we can mine the actual FCPs (of the original data set). The mining of each subspace can be treated as an independent subtask of the subsequent mining.

Considering the compact submatrices in Table 4, we have, after decompression, the resulting decompressed subspaces (columns 3-6).

4.2.3 Mining Subspaces to Generate FCPs

To produce the actual FCPs, in the last phase of C-Miner, each subspace is mined independently as a subtask. The mining process follows the same zero removing principle as described above. That is, we recursively remove "0"s from each subspace in the splitting tree, refining the resulting nodes progressively, from approximation to exact FCPs in the leaf nodes.

For the four subtasks of the running example in Table 4, the splitting trees are shown in Fig. 3. Each node is split into two new nodes in the next level if the cutter is applicable. We only keep and show nodes satisfying the thresholds (minsup = 2 and minlen = 2) due to space limitation.

Fig. 3. C-Miner subtask mining (minsup = 2, minlen = 2).
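The recursive splitting driven by cutters can be sketched as below. This is a minimal reading of the splitting rule (each cutter is consumed once along a path; applicable cutters cut a node (L', C') into a left son (L' \ X, C') and a right son (L', C' \ Y)); the minsup/minlen pruning and the closedness checks of Lemmas 1-2 are deliberately omitted, and the cutters come from an assumed toy matrix, not the paper's running example:

```python
# Sketch of the zero removing / splitting step: apply the remaining cutters
# recursively; nodes that exhaust all applicable cutters are all-"1" leaves.

def split(rows, cols, cutters):
    """Return the set of leaves (row set, column set) of the splitting tree."""
    rows, cols = frozenset(rows), frozenset(cols)
    if not rows or not cols:
        return set()
    for i, (X, Y) in enumerate(cutters):
        if rows & X and cols & Y:                      # cutter is applicable
            rest = cutters[i + 1:]
            return (split(rows - X, cols, rest) |      # left son (L' \ X, C')
                    split(rows, cols - Y, rest))       # right son (L', C' \ Y)
    return {(rows, cols)}                              # all-ones leaf

# Toy compact matrix: l1 = 1101 and l2 = 0111 over columns c1..c4,
# giving the cutters ({l1}, {c3}) and ({l2}, {c1}).
cutters = [({"l1"}, {"c3"}), ({"l2"}, {"c1"})]
leaves = split({"l1", "l2"}, {"c1", "c2", "c3", "c4"}, cutters)
```

On this toy input, the three leaves are ({l1}, {c1, c2, c4}), ({l2}, {c2, c3, c4}), and ({l1, l2}, {c2, c4}), each an all-ones rectangle of the compact matrix.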



TABLE 5
FCP (minsup = 2, minlen = 2)

During the mining process, not all nodes generated are useful for further splitting. False drops or redundant nodes may arise among subtasks. This is due to the subspace overlaps in the subtask partition. As a result, a pattern mined from a certain subtask may have either a superset pattern or duplicates in other subtasks. Hence, it is critical to devise independent pruning strategies for each subtask to ensure the independent mining of subtasks.

In the splitting trees in Fig. 3, we find two types of useless nodes:

1. Left son nodes that are not closed in the column set. For example, the left son L(r2r4, c1c2) (node C) in subtask 4 is not closed in the column set in that it has a superset L(r2r4, c1c2c6) (node D) in subtask 3.
2. Right son nodes and subtask roots that are not closed in the row set. For example, the right son R(r3r4, c2c6) (node A) in subtask 3 is not closed in the row set in that it has a superset R(r2r3r4, c2c6) (node B) in subtask 3.

To remove the two types of useless nodes, we develop the Column Set Checking Strategy in Lemma 1 and the Row Set Checking Strategy in Lemma 2.

Lemma 1 (Column Set Checking Strategy). Given a node f = (Rf × Cf) from subtask Si, where Rf and Cf denote the row and column sets of f, respectively, let li1, li2, ..., liu represent the cluster rows contributing to Si and let Li1, Li2, ..., Liu represent the original row sets of O that make up each cluster row, respectively. If there exists any Lix such that Rf ∩ Lix = ∅, f can be pruned off.

Proof. For subtask Si with row set Ri and column set Ci, if Rf ∩ Lix = ∅ holds, then Rf ⊆ (Ri \ Lix). There exists another subtask Sj = (Ri \ Lix) × Cj, where Ci ⊆ Cj; hence, f ⊆ Sj. Thus, ∃f' = Rf × Cf' ⊆ Sj. If Cf = Cf', then f = f', and f can be pruned off as a redundancy. If Cf ⊂ Cf', then f can be pruned off due to the unclosed column set. □

For example, subtask 4 in Fig. 3 is made up of cluster rows l1, l2, and l3, which come from the original row sets {r1, r2, r3}, {r4}, and {r5, r6}, respectively. For the left son L(r2r4, c1c2) (node C), since {r2, r4} ∩ {r5, r6} = ∅, f is pruned off due to the unclosed column set. In other words, the cluster row l3 does not contribute to the node L(r2r4, c1c2), and only cluster rows l1 and l2 contribute to the node. Since there must exist another subtask (subtask 3, in this example) that is made up of cluster rows l1 and l2 but with a larger column set {c1, c2, c5, c6}, node L(r2r4, c1c2) will either have a superset or a duplicate in that subtask. In this example, L(r2r4, c1c2c6) (node D) in subtask 3 is the superset of L(r2r4, c1c2). Also, if the cutters in subtask 3 cut off both c5 and c6 from node D, then L(r2r4, c1c2) would become a duplicate. In either case, L(r2r4, c1c2) can be safely pruned from subtask 4.

In C-Miner, the right son nodes and subtask roots never satisfy the condition; hence, only the left son nodes need the Column Set Checking.

Lemma 2 (Row Set Checking Strategy). Given a node f = (Rf × Cf), where Rf and Cf represent the row and column sets of f, respectively, let R denote the row set of the original data set O. If ∃rx ∈ (R \ Rf) such that ∀cy ∈ Cf, O_{x,y} = 1, then f can be pruned off.

Proof. Let R(Cf) represent the column support set of Cf. If ∃rx ∈ (R \ Rf) such that ∀cy ∈ Cf, O_{x,y} = 1 holds, then Rf ⊂ R(Cf). Thus, f is not closed in the row set and can be pruned off. □

Take the right son R(r3r4, c2c6) (node A) in subtask 3, for example: there exists row r2 such that for columns c2 and c6, O_{2,2} = 1 and O_{2,6} = 1. Thus, R(r3r4, c2c6) is pruned off due to the unclosed row set. In this example, R(r2r3r4, c2c6) (node B) in subtask 3 is its superset.

In C-Miner, only the right son nodes and subtask roots need the Row Set Checking.

Now, we are ready to summarize the mining process of Phase 3. We take each subtask as the root of the splitting tree and employ the Row Set Checking Strategy. If the subtask root is kept as a useful root, then cutters are applied recursively until all zeros are removed from the search space. When applying a cutter on a node, the cutter is first checked for whether it is applicable to the node. If it is applicable, then it cuts the node into the left son node and the right son node; otherwise, it is ignored. For the left son node, the monotonic constraint minsup is checked, followed by the Column Set Checking Strategy. For the right son node, the monotonic constraint minlen is checked, followed by the Row Set Checking Strategy. These checking strategies ensure that the useless search space is removed early in the mining process.

In our running example, the final resulting FCPs are shown in Table 5.

4.2.4 C-Miner Correctness

Theorem 2 shows that C-Miner correctly generates all and only the FCPs.

Theorem 2. Let FCPs be the set of FCPs of a 2D data set O. Let LVo be the set of leaf nodes derived from C-Miner on the data set O. Then, FCPs = LVo.

4.3 Algorithm B-Miner

We shall now examine algorithm B-Miner. B-Miner is based on Base Rows Projection.

4.3.1 Compressing the Mining Space

In the first phase of B-Miner, the original mining space O = R × C is compressed into a smaller mining space O' in two steps. In the first step, the row set R is divided into equidepth bins. The number of rows in each bin is the same, which is a user-specified parameter, defined as the Group
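The two pruning conditions of Lemmas 1 and 2 can be sketched as boolean predicates. The data set below is an illustrative assumption (the paper's running-example matrix is not reproduced in the text):

```python
# Sketch of the Column Set Checking (Lemma 1) and Row Set Checking
# (Lemma 2) pruning conditions on a small hypothetical data set.

O = {                           # original data set: row -> columns with "1"
    "r1": {"c1", "c2"},
    "r2": {"c1", "c2", "c3"},
    "r3": {"c2", "c3"},
}

def row_set_check_prunes(node_rows, node_cols):
    """Lemma 2: prune f = (Rf, Cf) if some row outside Rf contains all of Cf,
    since then Rf is a proper subset of R(Cf) and f is not closed in the row set."""
    return any(node_cols <= O[r] for r in O if r not in node_rows)

def column_set_check_prunes(node_rows, cluster_row_sets):
    """Lemma 1: prune f if any original row set Lix of a cluster row
    contributing to this subtask is disjoint from Rf (a smaller subtask
    then holds a superset or duplicate of f)."""
    return any(not (node_rows & L) for L in cluster_row_sets)
```

For instance, the node ({r2}, {c2, c3}) fails the row set check because r3 also contains both c2 and c3, while ({r2, r3}, {c2, c3}) passes it.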

TABLE 6 TABLE 7
B-Miner Compact Matrix O0 ðGL ¼ 2Þ Subtasks of B-Miner ðGL ¼ 2Þ

Length (GL). Given jRj ¼ n and GL ¼ k, the number of bins


will be bnkc þ 1.
In the second step, rows within each bin are combined to
form a new compact row, called Base Row. Let G ¼
fr1 ; r2 ; . . . ; rk g be the set of rows of a particular bin A. Then,
the bin can be represented as A ¼ k  m matrix. The base
row of A, denoted BP ¼ fb1W ; b2 ; . . . ; bm g, is formed according
to the rule that bj ¼ ki¼1 ai;j , where j ¼ 1; 2; . . . ; m. That
is, the cell value of the base row is 0 only when all of its
make-up values are 0; otherwise, the cell value is 1. By the
above processing, O is transformed into a compact matrix O′ of size n′ × m, where n′ is the number of bins and n′ ≤ n. Given matrix O in Table 1, for example, let GL = 2. Then, its compact matrix O′ is shown in Table 6.

4.3.2 Partitioning the Mining Task
In the second phase of B-Miner, the original mining space O is partitioned into several overlapping subspaces, and the mining of each subspace is an independent subtask.

The original rows within each base row bin may become the supporting rows of the final FCPs. Hence, given a base row Bi, the final FCPs can be divided into two categories: 1) FCPs with supporting rows coming from base row Bi's bin and 2) FCPs without supporting rows coming from base row Bi's bin. Given another base row Bj (j ≠ i), the FCPs of category 2 can be further divided into two subcategories: 3) FCPs with supporting rows coming from base row Bj's bin and 4) FCPs without supporting rows coming from base row Bj's bin. Recursively, the FCPs of category 4 can be further partitioned, given more base rows. That is, given n′ base rows, the final FCPs can be partitioned into n′ groups (FCPs with an empty supporting row set are ignored), regardless of the support thresholds. Based on this observation, we can partition the original mining space into subspaces such that each subspace contains a certain category of final FCPs, as described above. That explains why each subspace can be mined independently, and no interesting answers are missed.

Let Si = Ri × Ci represent the ith subtask obtained by projection on the base row Bi. Then, the column set Ci = {c1, c2, ..., cq} ⊆ C is projected on Bi such that, for all cj ∈ Ci, O′i,j = 1 is satisfied. That is, a column is selected only when its cell in the corresponding base row is valued "1"; otherwise, it is not included in the column set of the subspace. Given B1 in Table 6, for example, subtask S1's column set is C1 = {c1, c2, c5, c6}. Also, subtask Si's row set is Ri = {r(i-1)k+1, r(i-1)k+2, ..., rn} ⊆ R, where k is the number of original rows within each bin. As for subtask S2 in Table 6, its corresponding row set is R2 = {r3, r4, r5, r6}.

Given matrix O in Table 1, for example, according to its base rows in Table 6, there are three subtasks generated in Table 7. The final FCPs with supporting rows coming from B1 are contained in S1, and those with supporting rows coming from B2 but not B1 are contained in S2. Also, FCPs with supporting rows coming only from B3 are kept in S3.

FCPs will not be generated in a subspace that has fewer rows than minsup. Hence, the number of subspaces is n′ = ⌊(n − minsup)/k⌋ + 1 rather than ⌊n/k⌋ + 1. It is safe to ignore the latter subspaces without enough rows: column sets with enough row support have already been covered by the former subspaces. For the above example, if we set minsup = 3, then only the first two subspaces (S1 and S2) will be mined. The last subspace, S3, with only two rows, can safely be dropped.

4.3.3 Mining Subspaces to Generate FCPs
Like C-Miner, in the third phase of B-Miner, each subspace is mined independently as a subtask. The same zero removing principle is employed to build up the splitting tree. Furthermore, the results are refined progressively. For the three subtasks of the running example in Table 7, the splitting trees are shown in Fig. 4. Due to space limitation, only applicable cutters and nodes satisfying the thresholds (minsup = 2 and minlen = 2) are kept in the figure.

Like C-Miner, during the mining process, false drops or redundant nodes may arise among subtasks as well. Hence, we develop pruning strategies to ensure the independent mining of each subtask. In the splitting trees of B-Miner, there are two types of useless nodes:

1. Left son nodes that are not closed in the column set. For the example in Fig. 4, the left son L(r3r4r5r6; c1c2c5c6) (node C) in subtask 1 is not closed in the column set in that it has a superset S2(r3r4r5r6; c1c2c3c4c5c6) (root D) of subtask 2.
2. Right son nodes and subtask roots that are not closed in the row set. For the example in Fig. 4, the right son R(r3r4r5r6; c2c6) (node A) in subtask 2 is not closed in the row set in that it has a superset R(r2r3r4r5r6; c2c6) (node B) in subtask 1.

The type 2 unclosed nodes can be pruned off by using the same Row Set Checking Strategy (in Lemma 2) as in C-Miner, and only the right son nodes and subtask roots need the Row Set Checking in B-Miner. Here, we would
Fig. 4. B-Miner subtask mining (minsup = 2, minlen = 2).
only develop a new strategy for the type 1 (left son) nodes, called the Left Checking Strategy, given in Lemma 3.

Lemma 3 (Left Checking Strategy). Given a node f = (Rf × Cf) from subtask Si, where Rf and Cf denote the row and column sets of f, respectively, let Ri represent the row set from which Si's base row Bi is generated. If Rf ∩ Ri = ∅, then f can be pruned off.

Proof. Let Rsi represent the row set of subtask Si. If Rf ∩ Ri = ∅ holds, then there must exist another subtask Sj = Rsj × Csj such that Rf ⊆ Rsj ⊆ Rsi \ Ri and Cf ⊆ Csj. Hence, f ⊆ Sj. Thus, there exists f′ = R′f × C′f ⊆ Sj. If Cf = C′f, then f = f′; hence, f can be pruned off as a redundancy. If Cf ⊂ C′f, then f can be pruned off due to its unclosed column set. □

Let us take the left son L(r3r4r5r6; c1c2c5c6) (node C) in subtask 1 in Fig. 4 as an example. For subtask S1, its base row is made up of row set {r1, r2}. Since node C's supporting row set satisfies {r3, r4, r5, r6} ∩ {r1, r2} = ∅, node C can be pruned off. This is because there exists another subtask S2 whose root S2(r3r4r5r6; c1c2c3c4c5c6) (node D) is a superset of node C. Hence, node C can be removed.

Here, we summarize the mining process of Phase 3. Like C-Miner, each subtask root is first checked with the Row Set Checking Strategy. Then, if a cutter is applicable to a node, it cuts the node into left and right sons. For the left son node, the monotonic constraint minsup is checked, followed by the Left Checking Strategy. For the right son node, the monotonic constraint minlen is checked, followed by the Row Set Checking Strategy. Useless nodes are thus pruned off early by these checks.

4.3.4 B-Miner Correctness
Theorem 3 shows that B-Miner correctly generates all and only the FCPs.

Theorem 3. Let FCPs be the set of FCPs of a 2D data set O. Let LVo be the set of leaf nodes derived from B-Miner on the data set O. Then, FCPs = LVo.
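To make the preceding constructions concrete — bin compression into base rows (Section 4.3.1), subtask projection (Section 4.3.2), and the Left Checking test of Lemma 3 — here is a minimal Python sketch. The function names and the toy matrix below are ours for illustration only, not the authors' implementation or the paper's running example:

```python
# Illustrative sketch only: base-row compression, subtask projection, and
# the Lemma 3 left check, on a 0/1 matrix given as a list of row lists.

def base_rows(matrix, k):
    """OR together each bin of k consecutive rows to form the base rows."""
    bases = []
    for start in range(0, len(matrix), k):
        bin_rows = matrix[start:start + k]
        # A base-row cell is 0 only if every cell in that column of the bin is 0.
        bases.append([int(any(cells)) for cells in zip(*bin_rows)])
    return bases

def project_subtasks(matrix, k, minsup):
    """Build subtask S_i = R_i x C_i for each base row B_i.

    C_i keeps only the columns whose cell in B_i is 1; R_i keeps all rows
    from bin i onward. A subspace with fewer than minsup rows cannot hold
    an FCP, so only floor((n - minsup) / k) + 1 subtasks are generated.
    """
    subtasks = []
    for i, base in enumerate(base_rows(matrix, k)):
        rows = list(range(i * k, len(matrix)))
        if len(rows) < minsup:
            break
        cols = [j for j, b in enumerate(base) if b]
        subtasks.append((rows, cols))
    return subtasks

def left_check_prunable(node_rows, i, k):
    """Lemma 3: a node of subtask S_i whose row set is disjoint from bin i
    reappears in a later subtask, so it can be pruned as redundant."""
    bin_i = set(range(i * k, (i + 1) * k))
    return not (set(node_rows) & bin_i)
```

For instance, on a six-row matrix with k = GL = 2 and minsup = 3, only ⌊(6 − 3)/2⌋ + 1 = 2 subtasks are generated, matching the count derived in Section 4.3.2.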
4.4 Parallel FCP Mining
As noted in the previous sections, the hierarchical FCP mining framework can be easily adapted for parallel processing. In this section, we shall present the parallel FCP mining framework.

We use as our context a parallel environment that comprises a network of nodes (that is, PCs) that are loosely integrated. This would be similar to work like Seti@Home3 and Folding@Home.4 Moreover, we assume that only a source node has the original data set. In other words, we do not assume that the data set is partitioned across all participating nodes. When the data set needs to be mined, the source node will look for nodes to parallelize the mining process. Like traditional load-balanced query processing, our framework generates a large number of subtasks (larger than the number of nodes) and then allocates these subtasks to nodes to be mined concurrently and independently. It is essentially a straightforward adaptation of the hierarchical FCP mining framework and operates in three logical phases:

. Task partitioning phase. The first phase corresponds to the subtask generation phase of the hierarchical framework. Thus, the original task is partitioned into independent subtasks such that mining all the subtasks will lead to the answers of mining the original task. This phase can be done at the source node (in which case the source node generates all the subtasks). Alternatively, we can parallelize this phase by exploiting more nodes to perform the partitioning: 1) the source node will generate t1 subtasks, 2) these subtasks are then allocated to t1 nodes (including the source), and each of these t1 nodes will further partition the allocated subtask into t2 smaller subtasks, which are then further distributed to t2 nodes, and 3) the above process is repeated until a sufficient number of subtasks have been produced. For simplicity, in our experimental study, this procedure is accomplished by the source node (that is, we do not parallelize this phase).
. Subtask allocation phase. In the second phase, the source node (which acts as a coordinator) will assign a subtask to each node to mine. In each subtask mining process, since the mining of the left and right branches is also independent, the newly generated branches can be treated as sub-subtasks and further parallelized to obtain better load balancing.
. Subtask execution phase. Finally, in the third phase, which is similar to the subtask mining phase, each node independently mines the allocated subtasks.

We note that the second and third phases operate iteratively: Whenever a node completes processing its subtask, it will request the source node for another subtask. When all subtasks from the source node have been allocated, the node may also request an unfinished sub-subtask (tree branches) from its neighbors. This way, the workload is well balanced.

Now, both C-Miner and B-Miner can be parallelized under the framework. There is, however, one issue to be addressed: In order for a node to be able to mine a subtask Si independently, the pruning of false drops or redundant FCPs must be done without incurring any communication overhead between nodes. To ensure this, we need to disseminate the original data set O to all participating nodes. This cost, fortunately, is inexpensive (in terms of response time), as it can be done concurrently while the data space is being partitioned. Moreover, only one copy per node is necessary, even if multiple subtasks are allocated to a node. In addition, our real data sets are not big.

5 EXPERIMENTAL RESULTS
We have implemented C-Miner and B-Miner, as well as their parallel versions (denoted as PC-Miner and PB-Miner, respectively). We conducted a performance study to evaluate their efficiency against CLOSET+,5 REPT, and D-Miner. For our experiments, we use two real microarray data sets: the breast cancer (BC)6 data set and the prostate cancer (PC)7 data set. In such data sets, the rows represent clinical samples, whereas the items represent the activity levels of genes/proteins in the samples. In the BC data set, there are 78 tissue samples, and each sample is described by the activity level of 24,481 genes. In the PC data set, there are 102 tissue samples, each described by the activity level of 12,600 genes. The BC and PC data sets are discretized by doing an equal-width partition for each column with 20 and 4 buckets, respectively, resulting in data sets with densities of 49.76 percent (that is, 49.76 percent of the cells contain one, whereas the rest contain zero) and 49.18 percent, respectively. To study the behavior of the proposed schemes under other factors (for example, density and scalability), we also use synthetic data sets generated by the IBM data generator.8 All the experiments are run on a Pentium 4 PC with 1-Gbyte RAM. We have run a large number of experiments and shall present only representative results here. The default number of processors for the parallel algorithms is 8.

3. http://setiathome.berkeley.edu/.
4. http://folding.stanford.edu/.
5. The code is downloaded from http://illimine.cs.uiuc.edu/.
6. http://www.rii.com/publications/default.htm.
7. http://www-genome.wi.mit.edu/mpr/prostate.
8. The generator is available at http://www.cs.umbc.edu/cgiannel/assoc_gen.html.

5.1 Varying Data Set Density
In the first set of experiments, we study the effects of data set density on the execution time. We experiment on seven synthetic data sets generated by the IBM data generator with 50 rows, 500 columns, and density varying from 10 percent to 40 percent. We compare the performance of CLOSET+, REPT, D-Miner, C-Miner (Ncluster = 5), B-Miner (GL = 1), PC-Miner, and PB-Miner. We set minsup = 15 and minlen = 1. Fig. 5 shows the execution time (seconds in logarithmic scale) of each algorithm. In Fig. 5a, we find that CLOSET+ and REPT are quicker than D-Miner when the
Fig. 5. Variation of density. (a) REPT versus CLOSET+ versus D-Miner. (b) D-Miner versus the proposed schemes.
Fig. 6. Vary number of clusters (and subtasks). (a) Breast cancer. (b) Prostate cancer.
density is below 25 percent, but they become much slower than D-Miner when the density is above 30 percent. That is, although CLOSET+ and REPT are efficient on sparse data sets, they lose their advantage on dense data sets compared with D-Miner. Thus, since our focus is on dense data sets, we will not discuss them any further. Instead, we shall compare our proposed schemes with D-Miner. Fig. 5b shows that our proposed schemes C-Miner and B-Miner are much quicker than D-Miner, and that C-Miner is slightly quicker than B-Miner on denser data sets. Moreover, the parallel versions can further reduce the processing time greatly. We also note that PB-Miner is more efficient than PC-Miner.

5.2 Experiments on Real Microarray Data Sets
In the second set of experiments, we compare our proposed schemes against D-Miner on real microarray data sets by varying minsup and minlen, respectively. For C-Miner, B-Miner, and their parallel versions, one of the key parameters is the number of subspaces, which affects the execution time. For C-Miner and PC-Miner, the number of subspaces is controlled by the number of clusters. For B-Miner and PB-Miner, the number of subspaces is controlled by the GL. Hence, we begin by tuning the various algorithms (C-Miner, B-Miner, and their parallel versions) on these two parameters.

5.2.1 Tuning the Proposed Schemes
We vary the number of clusters for C-Miner and PC-Miner. The results for the two cancer data sets are shown in Fig. 6, where we set minsup = 5 and minlen = 300 for the BC data set and minsup = 10 and minlen = 1,100 for the PC data set, respectively. The results show that there is a certain "optimal" cluster number for C-Miner. We also find that more clusters lead to better load balancing in parallelism: having more clusters (hence, more subtasks) facilitates load balancing, whereas having fewer subtasks may leave some "heavy-weight" mining subtask that dominates the overall performance. Thus, in general, the more subtasks there are, the better the runtime for parallelism becomes. However, when the number of clusters increases beyond some point, the overall processing time keeps increasing, and the advantage in parallelism is affected as well. When the number of clusters is very large, the number of subtasks becomes large. This means that generating the subtasks (in phase 2) becomes costly, and processing a large number of subtasks (in phase 3) is also costly.

As for the BC and PC data sets, the execution time of C-Miner and PC-Miner keeps increasing when the number of clusters rises above 9 and 11, respectively. Hence, we suggest choosing the number of clusters below these values.
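The load-balancing tradeoff described above can be illustrated with a toy greedy work-queue scheduler (a sketch of ours, not the actual PC-Miner/PB-Miner code): whichever worker becomes free takes the next subtask, so a single coarse "heavy-weight" subtask lower-bounds the finish time, while splitting the same work into more, smaller subtasks evens out the load.

```python
import heapq

def makespan(subtask_costs, n_workers):
    """Finish time when each free worker greedily pulls the next subtask."""
    finish_times = [0.0] * n_workers            # one entry per worker
    heapq.heapify(finish_times)
    for cost in subtask_costs:
        earliest = heapq.heappop(finish_times)  # worker that frees up first
        heapq.heappush(finish_times, earliest + cost)
    return max(finish_times)

# Few, coarse subtasks: the heavy one dominates (total work is 11 either way).
print(makespan([8.0, 1.0, 1.0, 1.0], 4))   # -> 8.0
# The same heavy subtask split into four pieces: the load evens out.
print(makespan([2.0] * 4 + [1.0] * 3, 4))  # -> 3.0
```

This is also why, past some point, ever more subtasks stop helping: the per-subtask generation and bookkeeping costs of phases 2 and 3 grow with their number.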
Fig. 7. Vary GL (and subtasks). (a) Breast cancer. (b) Prostate cancer.
Fig. 8. Variation of minsup. (a) Breast cancer (Ncluster = 2, GL = 1). (b) Prostate cancer (Ncluster = 10, GL = 1).
Fig. 9. Variation of minlen. (a) Breast cancer (Ncluster = 2, GL = 1). (b) Prostate cancer (Ncluster = 10, GL = 1).
For the BC data set, the "optimal" value is 2 for C-Miner and 8 for PC-Miner. Users may choose the value according to whether they prefer a centralized or a parallel scheme. As for the PC data set, the "optimal" value is 10 for both C-Miner and PC-Miner. We shall use Ncluster = 2 for the BC data set and Ncluster = 10 for the PC data set as the defaults when experimenting with these data sets in all subsequent studies.

For B-Miner and PB-Miner, the number of subtasks is determined by the GL. We vary GL from 1 to 5. The results for the two data sets are shown in Fig. 7, where we set minsup = 5 and minlen = 300 for the BC data set and minsup = 10 and minlen = 1,100 for the PC data set, respectively. As GL increases, the number of subtasks decreases: GL = 1 gives the largest number of subtasks (that is, the number of subtasks equals the number of rows), and GL = number of rows means no partitioning (that is, one single subtask). As shown in Fig. 7, the execution time of both B-Miner and its parallel version (PB-Miner) increases with the increase of GL. GL's effect on B-Miner is relatively
small. To optimize PB-Miner, we shall use GL = 1 as the default for the data sets.

5.2.2 Varying minsup and minlen
In this set of experiments, we study the effects of minsup and minlen on the execution time. We experiment on the BC and PC data sets and compare the performance of D-Miner, C-Miner, B-Miner, PC-Miner, and PB-Miner.

First, we set minlen = 300 and vary minsup from 5 to 10 for the BC data set. We set minlen = 1,100 and vary minsup from 6 to 16 for the PC data set. The comparative results are presented in Fig. 8.

Second, we set minsup = 5 and vary minlen from 300 to 350 for the BC data set. We set minsup = 10 and vary minlen from 1,050 to 1,100 for the PC data set. The comparative results are shown in Fig. 9.

The comparative results presented in Figs. 8 and 9 show clearly that the execution time decreases with increasing minsup and minlen values. Moreover, as in the previous experiments, C-Miner, B-Miner, and their parallel versions outperformed D-Miner. C-Miner outperformed B-Miner on both data sets. For the BC data set, the parallel version of C-Miner (PC-Miner) is slightly slower than PB-Miner. This is because, with Ncluster = 2, PC-Miner has far fewer subtasks than PB-Miner. As for the PC data set, since Ncluster = 10, PC-Miner has more subtasks, so the load is well balanced and, hence, it outperforms PB-Miner.

Fig. 10. Vary number of processors.

5.3 Varying the Number of Processors
We also study the effect of the number of processors on the execution time of PC-Miner and PB-Miner. Since the results for both the BC and PC data sets show similar relative performance, we only show the results for the PC data set. We set minsup = 10, minlen = 1,100, and Ncluster = 10 for PC-Miner and GL = 1 for PB-Miner. Fig. 10 shows the execution time of PC-Miner and PB-Miner as we vary the number of processors. The execution time of both algorithms decreases as the number of processors increases. From the experiments, we find that our schemes balance the workload very well.

5.4 Scalability
To study the scalability of our proposed schemes, we generate a synthetic data set with 1,000 rows, 100,000 columns, and 10 percent density by using the IBM data generator. We set Ncluster = 2 and GL = 1 to optimize C-Miner and B-Miner, and we vary minsup and minlen for the experiments. The results are presented in Fig. 11. From the results, we see that our proposed schemes scale well to large-volume data sets. B-Miner is slightly quicker than C-Miner. Also, PB-Miner's load is better balanced due to more subtasks. As for D-Miner, it takes more than 30,000 seconds for each execution and, hence, its results are not shown in the figure.

6 CONCLUSION
In this paper, we have proposed a novel framework for mining FCPs on dense data sets. The key idea is to partition the original data sets into smaller independent subtasks such that mining the subtasks will produce the same answers as mining from the original space. Based on this framework, we proposed two new algorithms: C-Miner and B-Miner. The two schemes adopt different partitioning and pruning strategies. We also show how the framework can
Fig. 11. Scalability. (a) Variation of minsup. (b) Variation of minlen.
facilitate parallel FCP mining in a straightforward manner. Our performance study showed that both schemes and their parallel versions are efficient and scalable.

REFERENCES
[1] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules," Proc. 20th Int'l Conf. Very Large Data Bases (VLDB '94), pp. 487-499, 1994.
[2] J. Besson, C. Robardet, and J.-F. Boulicaut, "Constraint-Based Mining of Formal Concepts in Transactional Data," Proc. Eighth Pacific-Asia Conf. Knowledge Discovery and Data Mining (PAKDD '04), pp. 615-624, 2004.
[3] G. Cong, K.L. Tan, A.K.H. Tung, and F. Pan, "Mining Frequent Closed Patterns in Microarray Data," Proc. Fourth IEEE Int'l Conf. Data Mining (ICDM '04), pp. 363-366, 2004.
[4] M. El-Hajj and O.R. Zaiane, "Parallel Association Rule Mining with Minimum Inter-Processor Communication," Proc. 14th Int'l Workshop Database and Expert Systems Applications (DEXA '03), pp. 519-523, 2003.
[5] H. Mannila, H. Toivonen, and A.I. Verkamo, "Efficient Algorithms for Discovering Association Rules," Proc. 12th Nat'l Conf. Artificial Intelligence (AAAI '94) Workshop Knowledge Discovery in Databases (KDD '94), pp. 181-192, 1994.
[6] F. Pan, G. Cong, and A.K.H. Tung, "Carpenter: Finding Closed Patterns in Long Biological Datasets," Proc. Ninth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '03), pp. 642-673, 2003.
[7] F. Pan, A.K.H. Tung, G. Cong, and X. Xu, "Cobbler: Combining Column and Row Enumeration for Closed Pattern Discovery," Proc. 16th Int'l Conf. Scientific and Statistical Database Management (SSDBM '04), pp. 21-30, 2004.
[8] J. Pei, J. Han, and R. Mao, "CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets," Proc. ACM SIGMOD Int'l Workshop Data Mining and Knowledge Discovery, pp. 21-30, 2000.
[9] P. Shenoy, J. Haritsa, S. Sudarshan, G. Bhalotia, M. Bawa, and D. Shah, "Turbo-Charging Vertical Mining of Large Databases," Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '00), pp. 22-23, 2000.
[10] J. Wang, J. Han, and J. Pei, "CLOSET+: Searching for the Best Strategies for Mining Frequent Closed Itemsets," Proc. Ninth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '03), pp. 236-245, 2003.
[11] Y.G. Sucahyo, R.P. Gopalan, and A. Rudra, "Efficiently Mining Frequent Patterns from Dense Datasets Using a Cluster of Computers," Proc. 16th Australian Joint Conf. Artificial Intelligence (AI '03), pp. 233-244, 2003.
[12] M. Zaki and C. Hsiao, "CHARM: An Efficient Algorithm for Closed Association Rule Mining," Proc. Second SIAM Int'l Conf. Data Mining (SDM '02), 2002.
[13] M. Zaki, S. Parthasarathy, M. Ogihara, and W. Li, "New Algorithms for Fast Discovery of Association Rules," Proc. Third Int'l Conf. Knowledge Discovery and Data Mining (KDD '97), pp. 283-286, 1997.
[14] M.J. Zaki, "Parallel and Distributed Association Mining: A Survey," IEEE Concurrency, special issue on data mining, pp. 14-25, 1999.
[15] T. Uno, M. Kiyomi, and H. Arimura, "LCM ver. 3: Collaboration of Array, Bitmap and Prefix Tree for Frequent Itemset Mining," Proc. First Int'l Workshop Open Source Data Mining: Frequent Pattern Mining Implementations (OSDM '05), pp. 77-86, 2005.
[16] C. Creighton and S. Hanash, "Mining Gene Expression Databases for Association Rules," Bioinformatics, vol. 19, pp. 79-86, 2003.
[17] G. Cong, A.K.H. Tung, X. Xu, F. Pan, and J. Yang, "FARMER: Finding Interesting Rule Groups in Microarray Datasets," Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '04), 2004.
[18] L. Ji, "Mining Localized Co-Expressed Gene Patterns from Microarray Data," PhD dissertation, http://www.comp.nus.edu.sg/~jiliping, 2006.

Liping Ji received the bachelor's degree in information management and systems from Nanjing University, China, in 1998 and the doctoral degree in computer science from the National University of Singapore (NUS) in 2007. Her current research interest is in data mining, with particular emphasis on discovering useful patterns from gene expression (or microarray) data.

Kian-Lee Tan received the BSc (Hons) and PhD degrees in computer science from the National University of Singapore (NUS) in 1989 and 1994, respectively. He is currently an associate professor in the Department of Computer Science, National University of Singapore. His major research interests include query processing and optimization, database security, and data mining. He has published more than 200 conference/journal papers in international conferences and journals. He has also coauthored three books. He is a member of the ACM and the IEEE Computer Society.

Anthony K.H. Tung received the BSc (Second Class Honor) and MSc degrees in computer science from the National University of Singapore (NUS) in 1997 and 1998, respectively, and the PhD degree in computer sciences from Simon Fraser University (SFU) in 2001. He is currently an assistant professor in the Department of Computer Science, National University of Singapore. His research interests involve various aspects of databases and data mining (KDD), including buffer management, frequent pattern discovery, spatial clustering, outlier detection, and classification analysis. His recent interests also include data mining for microarray data and 3D protein structures, spatial indexing, sequence search, and dominant relationship analysis.