Frequent Itemsets
Mohammed J. Zaki †
Abstract
In this chapter we give an overview of the closed and maximal itemset mining prob-
lem. We survey existing methods and focus on Charm and GenMax, both state-
of-the-art algorithms that efficiently enumerate all closed and maximal patterns, re-
spectively. Charm and GenMax simultaneously explore both the itemset space and
transaction space, and use a number of optimizations to quickly prune away a large
portion of the subset search space. We conduct an extensive experimental charac-
terization of GenMax and Charm against other maximal and closed pattern mining
methods. We found that the methods have varying performance depending on the
database characteristics (mainly the distribution of the closed or maximal frequent
patterns by length). We present a systematic and realistic set of experiments showing
under which conditions a method is likely to perform well and under what condi-
tions it does not perform well. Overall, both Charm and GenMax deliver excellent
performance and outperform extant approaches for closed and maximal set mining.
However, there are a few exceptions to this, which we highlight in our experiments.
† This work was supported in part by NSF CAREER Award IIS-0092978, DOE Early Career Award DE-
1.1 INTRODUCTION
Mining frequent itemsets is a fundamental and essential problem in many data min-
ing applications such as the discovery of association rules, strong rules, correlations,
multi-dimensional patterns, and many other important discovery tasks. The problem
is formulated as follows: given a large database of transactions, where each transaction
is a set of items, find all frequent itemsets, where a frequent itemset is one that occurs
in at least a user-specified percentage of the transactions.
Many of the proposed itemset mining algorithms are a variant of Apriori [2],
which employs a bottom-up, breadth-first search that enumerates every single fre-
quent itemset. In many applications (especially in dense data) with long frequent
patterns, enumerating all possible 2^m − 2 subsets of an m-length pattern (m can
easily be 30, 40, or longer) is computationally infeasible. For example, many real-world
domains like gene expression studies, network intrusion detection, and web content
and usage mining contain patterns that are typically long. There are two current
solutions to the long pattern mining problem. The first is to mine only the maximal
frequent itemsets [5, 12, 3, 6, 8]. A frequent set is maximal if it has no frequent
superset; the set of maximal patterns is typically orders of magnitude smaller than
the set of all frequent patterns. While mining maximal sets helps in understanding the
long patterns in dense domains, it leads to a loss of information; since subset frequencies
are not available, maximal sets are not suitable for generating rules. The second solution is
to mine only the frequent closed sets [4, 15, 14, 18, 21]; a frequent set is closed if
it has no superset with the same frequency. Closed sets are lossless in the sense that
they uniquely determine the set of all frequent itemsets and their exact frequency.
At the same time closed sets can themselves be orders of magnitude smaller than
all frequent sets, especially on dense databases. Nevertheless, for some of the dense
datasets we consider in this chapter, even the set of all closed patterns would grow to
be too large. The only recourse is to mine the maximal patterns in such domains.
In this chapter we give an overview of the closed and maximal itemset mining
problem. We survey existing methods and focus on Charm and GenMax, both state-
of-the-art algorithms that efficiently enumerate all closed and maximal patterns, re-
spectively. Charm and GenMax simultaneously explore both the itemset space and
transaction space, and use a number of optimizations to quickly prune away a large
portion of the subset search space. GenMax uses a novel progressive focusing tech-
nique to eliminate non-maximal itemsets, while Charm uses a fast hash-based ap-
proach to eliminate non-closed itemsets during subsumption checking. Both utilize
diffset propagation for fast frequency checking. Diffsets [20] keep track of differ-
ences in the tids of a candidate pattern from its prefix pattern. Diffsets drastically
cut down (by orders of magnitude) the size of memory required to store intermediate
results. Thus the entire working set of patterns can fit in main memory, even
for large databases.
We conduct an extensive experimental characterization of GenMax against other
maximal pattern mining methods like MaxMiner [5] and Mafia [6]. We compare
Charm against previous methods for mining closed sets such as Close [14], Closet [15],
Mafia [6] and Pascal [4]. We found that the methods have varying performance
depending on the database characteristics (mainly the distribution of the closed or
maximal frequent patterns by length). We present a systematic and realistic set of
experiments showing under which conditions a method is likely to perform well and
under what conditions it does not perform well. Overall, both Charm and GenMax
deliver excellent performance and outperform extant approaches for closed and max-
imal set mining. However, there are a few exceptions to this, which we highlight in
our experiments.
1.2 PRELIMINARIES
The problem of mining closed and maximal frequent patterns can be formally stated
as follows: Let I = {i1 , i2 , . . . , im } be a set of m distinct items. Let D denote
a database of transactions, where each transaction has a unique identifier (tid) and
contains a set of items. The set of all tids is denoted T = {t1 , t2 , ..., tn }. A set
X ⊆ I is called an itemset; its tidset t(X) is the set of all tids of transactions that
contain X, and its support σ(X) = |t(X)|. An itemset is frequent if its support is at
least a user-specified threshold min sup; it is closed if it has no proper superset with
the same support, and maximal if it has no frequent proper superset.
Example 1 Consider our example database in Figure 1.1. There are five different
items, I = {A, C, D, T, W } and six transactions T = {1, 2, 3, 4, 5, 6}. The table
on the right shows all 19 frequent itemsets for min sup= 3. The lattice for FI is
shown in Figure 1.2; under each itemset X , we show its tidset t(X). The figure also
shows the 7 closed sets obtained by collapsing all the itemsets that have the same
tidset (closed regions of the lattice), and the 2 maximal sets (circles). It is clear that
MFI ⊆ CFI ⊆ FI.
GenMax uses backtracking search to enumerate the MFI. We first describe the back-
tracking paradigm in the context of enumerating all frequent patterns. We will sub-
sequently modify this procedure to enumerate the MFI.
Backtracking algorithms are useful for many combinatorial problems where the
solution can be represented as a set I = {i0 , i1 , ...}, where each ij is chosen from
a finite possible set, Pj . Initially I is empty; it is extended one item at a time,
as the search space is traversed. The length of I is the same as the depth of the
corresponding node in the search tree. Given a partial solution of length l, Il =
{i0 , i1 , ..., il−1 }, the possible values for the next item il comes from a subset Cl ⊆ Pl
called the combine set. If y ∈ Pl − Cl , then nodes in the subtree with root node
Il+1 = {i0 , i1 , ..., il−1 , y} will not be considered by the backtracking algorithm. Since such
subtrees have been pruned away from the original search space, the determination of
Cl is also called pruning.
[INSERT FIGURE 1.3 HERE ]
Consider the backtracking algorithm for mining all frequent patterns, shown in
Figure 1.3. The main loop tries extending Il with every item x in the current com-
bine set Cl . The first step is to compute Il+1 , which is simply Il extended with x.
The second step is to extract the new possible set of extensions, Pl+1 , which consists
only of items y in Cl that follow x. The third step is to create a new combine set for
the next pass, consisting of valid extensions. An extension is valid if the resulting
itemset is frequent. The combine set, Cl+1 , thus consists of those items in the pos-
sible set that produce a frequent itemset when used to extend Il+1 . Any item not in
the combine set refers to a pruned subtree. The final step is to recursively call the
backtrack routine for each extension. As presented, the backtrack method performs
a depth-first traversal of the search space.
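The scheme just described can be sketched in Python (a minimal rendition of the backtracking search with tidset-based frequency checks standing in for FI-combine; the names and tidsets are ours, chosen for illustration, not the chapter's pseudocode):

```python
# Depth-first backtracking enumeration of frequent itemsets (vertical format).
# Hypothetical item -> tidset map, not the chapter's example table.
tidsets = {"A": {1, 3, 4, 5}, "C": {1, 2, 3, 4, 5},
           "D": {2, 4, 5, 6}, "T": {1, 3, 5, 6}, "W": {1, 2, 3, 4, 5}}
MIN_SUP = 3

def backtrack(itemset, combine, out):
    # combine holds pairs (x, t(itemset ∪ {x})) in increasing support order.
    for i, (x, x_tids) in enumerate(combine):
        new_set = itemset + [x]                    # I_{l+1} = I_l ∪ {x}
        out.append((tuple(new_set), len(x_tids)))
        new_combine = []
        for y, y_tids in combine[i + 1:]:          # possible set P_{l+1}
            yt = x_tids & y_tids                   # t(I_{l+1} ∪ {y})
            if len(yt) >= MIN_SUP:                 # keep valid extensions only
                new_combine.append((y, yt))
        new_combine.sort(key=lambda e: len(e[1]))  # reorder by support
        backtrack(new_set, new_combine, out)

freq = []
c0 = sorted(((x, t) for x, t in tidsets.items() if len(t) >= MIN_SUP),
            key=lambda e: len(e[1]))
backtrack([], c0, freq)
```

Each recursive call extends the current partial solution with one item from its combine set, exactly as in the description above; items pruned from the combine set correspond to subtrees that are never visited.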
[INSERT Figure 1.4 HERE ]
Example 2 Consider the full subset search space shown in Figure 1.4. The back-
track search space can be considerably smaller than the full space. For example, we
start with I0 = ∅ and C0 = F1 = {A, C, D, T, W }. At level 1, each item in C0 is
added to I0 in turn. For example, A is added to obtain I1 = {A}. The possible set
for A, P1 = {C, D, T, W } consists of all items that follow A in C0 . However, from
Figure 1.1, we find that only AC , AT , and AW are frequent (at min sup=3), giving
C1 = {C, T, W }. Thus the subtree corresponding to the node AD has been pruned.
A good coverage of mining long patterns appears in [1]. Methods for finding the
maximal elements include All-MFS [10], which works by iteratively attempting to
extend a working pattern until failure. A randomized version of the algorithm that
uses vertical bit-vectors was studied, but it does not guarantee every maximal pattern
will be returned.
The Pincer-Search [12] algorithm uses the horizontal data format. It not only con-
structs the candidates in a bottom-up manner like Apriori, but also starts a top-down
search at the same time, maintaining a candidate set of maximal patterns. This can
help in reducing the number of database scans, by eliminating non-maximal sets
early. The maximal candidate set is a superset of the maximal patterns, and in gen-
eral, the overhead of maintaining it can be very high. In contrast GenMax maintains
only the current known maximal patterns for pruning.
MaxMiner [5] is another algorithm for finding the maximal elements. It uses
efficient pruning techniques to quickly narrow the search. MaxMiner employs a
breadth-first traversal of the search space; it reduces database scanning by employing
a lookahead pruning strategy, i.e., if a node together with all its extensions can be
determined to be frequent, there is no need to further process that node. It also employs
an item (re)ordering heuristic to increase the effectiveness of superset-frequency pruning.
Since MaxMiner uses the original horizontal database format, it can perform the
same number of passes over a database as Apriori does.
DepthProject [3] finds long itemsets using a depth first search of a lexicographic
tree of itemsets, and uses a counting method based on transaction projections along
its branches.
There have been several recent algorithms proposed for mining CFI. Close [14] is
an Apriori-like algorithm that directly mines frequent closed itemsets. There are two
main steps in Close. The first is to use bottom-up search to identify generators, the
smallest frequent itemset that determines a closed itemset. For example, consider the
frequent itemset lattice in Figure 1.2. The item A is a generator for the closed set
ACW , since it is the smallest itemset with the same tidset as ACW . All generators
are found using a simple modification of Apriori. After finding the frequent sets at
level k, Close compares the support of each set with its subsets at the previous level.
If the support of an itemset matches the support of any of its subsets, the itemset
cannot be a generator and is thus pruned. The second step in Close is to compute the
closed sets for all the generators found in the first step, which is done via intersection
of all the transactions in which a generator occurs. This can be done in one pass over the
database, provided all generators fit in memory. Nevertheless, computing closures
this way is an expensive operation.
The authors of Close recently developed Pascal [4], an improved algorithm for
mining closed and frequent sets. They introduce the notion of key patterns and show
that other frequent patterns can be inferred from the key patterns without access
to the database. They showed that Pascal, even though it finds both frequent and
closed sets, is typically twice as fast as Close, and ten times as fast as Apriori. Since
Pascal enumerates all patterns, it is only practical when pattern length is short (as we
shall see in the experimental section). The Closure algorithm [7] is also based on
a bottom-up search; it performs only marginally better than Apriori. Charm uses a
more efficient depth-first search over the itemset and tidset spaces.
Closet [15] uses a novel frequent pattern tree (FP-tree) structure, which is a
compressed representation of all the transactions in the database. It uses a recur-
sive divide-and-conquer and database projection approach to mine long patterns.
Mafia [6] is primarily intended for maximal pattern mining, but has an option to
mine the closed sets as well. Mafia relies on efficient compressed and projected ver-
tical bitmap based frequency computation. In contrast to Closet and Mafia, Charm
uses diffsets for fast support computation.
Recently two new algorithms for finding frequent closed itemsets have been pro-
posed, Closet+ [16] and Carpenter [13]. Closet+ combines several previously pro-
posed as well as new effective strategies into one algorithm. Carpenter mines closed
patterns in datasets that have significantly more items than there are transactions,
such as datasets that arise in biology, for example, microarray datasets. In these
datasets, there can easily be 10,000 or more items, but only 100-1000 transactions.
None of the above algorithms for CFI can deal with such a large itemset space
(e.g., 2^10000 ). Capitalizing on the fact that the tidset search space is much smaller
(e.g., 2^100 ), Carpenter enumerates closed tidsets and determines the corresponding
closed itemsets from them.
There are two main ingredients to develop an efficient CFI and MFI algorithm. The
first is the set of techniques used to remove entire branches of the search space, and
the second is the representation used to perform fast frequency computations. We
will describe below how Charm and GenMax extend the basic backtracking routine
for FI, and the progressive focusing, hash-based subsumption checking and diff-
set propagation techniques they use for fast maximality, closedness and frequency
checking. More details on Charm and GenMax, and on diffsets can be found in
[8, 21, 20].
Typically, pattern mining algorithms use a horizontal database format, such as the
one shown in Figure 1.1, where each row is a tid followed by its itemset. Con-
sider a vertical database format, where for each item we list its tidset, the set of all
transaction tids where it occurs. The vertical representation has the following ma-
jor advantages over the horizontal layout: Firstly, computing the support of itemsets
is simpler and faster with the vertical layout since it involves only the intersections
of tidsets (or compressed bit-vectors if the vertical format is stored as bitmaps [6]).
Secondly, with the vertical layout, there is an automatic “reduction” of the database
before each scan in that only those itemsets that are relevant to the following scan
of the mining process are accessed from disk. Thirdly, the vertical format is more
versatile in supporting various search strategies, including breadth-first, depth-first
or some other hybrid search.
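The horizontal-to-vertical conversion mentioned later in the experiments takes a single scan; a minimal sketch (hypothetical data, in Python):

```python
from collections import defaultdict

# Horizontal format: each row is a tid followed by its itemset (hypothetical).
horizontal = {1: ["A", "C", "W"], 2: ["C", "D"], 3: ["A", "C", "T", "W"]}

# Vertical format: for each item, the tidset of transactions containing it.
vertical = defaultdict(set)
for tid, itemset in horizontal.items():
    for item in itemset:
        vertical[item].add(tid)

# Support of an itemset is now just the size of a tidset intersection.
support_ACW = len(vertical["A"] & vertical["C"] & vertical["W"])
```

Once the vertical layout is built, no further access to the horizontal rows is needed; all support computations reduce to set intersections.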
[INSERT Figure 1.5 HERE]
Let’s consider how the FI-combine (see Figure 1.3) routine works, where the
frequency of an extension is tested. Each item x in Cl actually represents the itemset
Il ∪ {x} and stores the associated tidset for the itemset Il ∪ {x}. For the initial
invocation, since Il is empty, the tidset for each item x in Cl is identical to the tidset,
t(x), of item x. Before line 3 is called in FI-combine, we intersect the tidset of the
element Il+1 (i.e., t(Il ∪ {x})) with the tidset of element y (i.e., t(Il ∪ {y})). If
the cardinality of the resulting intersection is at least the minimum support, the extension
with y is frequent, and y′ , the new intersection result, is added to the combine set
Cl+1 for the next level. Cl+1 is kept in increasing order of the support of its elements.
Figure 1.5 shows the pseudo-code for FI-tidset-combine using this tidset intersection
based support counting.
Example 3 Suppose that we have the itemset ACT ; we show how to get its support
using the tidset intersections. We start with item A, and extend it with item C .
We find the support of AC as follows: t(AC) = t(A) ∩ t(C) = {1, 3, 4, 5} ∩
{1, 2, 3, 4, 5} = {1, 3, 4, 5}, and the support of AC is then |t(AC)| = 4. At the
next level, we need to compute the tidset for ACT using the tidsets for AC and
AT , where Il = {A} and Cl = {C, T }. We have t(ACT ) = t(AC) ∩ t(AT ) =
{1, 3, 4, 5} ∩ {1, 3, 5} = {1, 3, 5}, and its support is |t(ACT )| = 3.
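The computation in Example 3 can be replayed directly on the quoted tidsets (a Python sketch; t(AT) is taken as {1, 3, 5} per the example):

```python
t_A = {1, 3, 4, 5}
t_C = {1, 2, 3, 4, 5}
t_AT = {1, 3, 5}            # t(AT) as quoted in the example

t_AC = t_A & t_C            # t(AC) = t(A) ∩ t(C)
t_ACT = t_AC & t_AT         # t(ACT) = t(AC) ∩ t(AT)

assert (t_AC, len(t_AC)) == ({1, 3, 4, 5}, 4)
assert (t_ACT, len(t_ACT)) == ({1, 3, 5}, 3)
```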
1.4.1.1 Diffsets Propagation Despite the many advantages of the vertical format,
when the tidset cardinality gets very large (e.g., for very frequent items) the intersec-
tion time starts to become inordinately large. Furthermore, the intermediate
tidsets generated for frequent patterns can become too large to fit into main
memory. Both Charm and GenMax use a format called diffsets [20] for fast
frequency testing.
The main idea of diffsets is to avoid storing the entire tidset of each element in
the combine set. Instead we keep track of only the differences between the tidset of
itemset Il and the tidset of an element x in the combine set (which actually denotes
Il ∪ {x}). These differences in tids are stored in what we call the diffset, which is a
difference of two tidsets at the root level or a difference of two diffsets at later levels.
Furthermore, these differences are propagated all the way from a node to its children
starting from the root. In an extensive study [20], we showed that diffsets are very
short compared to their tidsets counterparts, and are highly effective in improving
the running time of vertical methods.
[INSERT Figure 1.6 HERE]
We describe next how diffsets are used, with the help of an example. At level
0, we have available the tidsets for each item in F1 . To find the combine set at this
level, we compute the diffset of y′ , denoted d(y′ ), instead of computing the tidset
of y′ as shown in line 4 in Figure 1.5. That is, d(y′ ) = t(x) − t(y). The support of
y′ is now given as σ(y′ ) = σ(x) − |d(y′ )|. At subsequent levels, we have available
the diffsets for each element in the combine list. In this case d(y′ ) = d(y) − d(x),
but the support is still given as σ(y′ ) = σ(x) − |d(y′ )| [9]. Figure 1.6 shows the
pseudo-code for computing the combine sets using diffsets.
Example 4 Suppose that we have the itemset ACT ; we show how to get its support
using the diffset propagation technique. We start with item A, and extend it with
item C . Now in order to find the support of AC we first find the diffset for AC ,
denoted d(AC) = t(A) − t(C) = {1, 3, 4, 5} − {1, 2, 3, 4, 5} = {}, and then calculate its
support as σ(AC) = σ(A) − |d(AC)| = 4 − 0 = 4. At the next level, we need to
compute the diffset for ACT using the diffsets for AC and AT , where Il = {A} and
Cl = {C, T }. The diffset of itemset ACT is given as d(ACT ) = d(AT )−d(AC) =
{4} − {} = {4}, and its support is given as σ(ACT ) = σ(AC) − |d(ACT )| = 4 − 1 = 3.
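Example 4 can be replayed the same way; here t(T) = {1, 3, 5, 6} is our own assumption, chosen only so that d(AT) = {4} as in the example:

```python
t_A = {1, 3, 4, 5}
t_C = {1, 2, 3, 4, 5}
t_T = {1, 3, 5, 6}                 # assumed; yields d(AT) = {4}
sup_A = len(t_A)

# Root level: d(y') = t(x) - t(y), and sigma(y') = sigma(x) - |d(y')|.
d_AC = t_A - t_C                   # {}  -> sigma(AC) = 4 - 0 = 4
sup_AC = sup_A - len(d_AC)
d_AT = t_A - t_T                   # {4} -> sigma(AT) = 4 - 1 = 3
sup_AT = sup_A - len(d_AT)

# Later levels: d(xy) = d(y) - d(x), support still relative to the prefix.
d_ACT = d_AT - d_AC                # {4} - {} = {4}
sup_ACT = sup_AC - len(d_ACT)      # 4 - 1 = 3
```

Note how the diffsets ({} and {4}) are far smaller than the tidsets they replace, which is exactly the memory saving the diffset format is designed to deliver.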
Two general principles for efficient search using backtracking are: 1) it is more
efficient to choose as the next subtree (branch) to explore the one whose combine
set has the fewest items. This usually results in good performance,
since it minimizes the number of frequency computations in FI-combine. 2) If we
are able to remove a node as early as possible from the backtracking search tree we
effectively prune many branches from consideration.
Reordering the elements in the current combine set to achieve the two goals is
a very effective means of cutting down the search space. The first heuristic is to
reorder the combine set in increasing order of support. This is likely to produce
small combine sets in the next level, since the items with lower frequency are less
likely to produce frequent itemsets at the next level. This heuristic was first used in
MaxMiner, and has been used in other methods since then [3, 6, 21]. At each level of
backtracking search, both Charm and GenMax reorder the combine set in increasing
order of support (this is indicated in Figures 1.3, 1.5, and 1.6).
1.4.3.1 Pruning Search Space To prune the search space we can utilize certain
properties of itemset-diffset pairs as described in [21].
Theorem 1 ([21]) Let Xi = Il+1 ∪ {yi } and Xj = Il+1 ∪ {yj } be two members of
the possible set at level l, and let Xi × d(Xi ) and Xj × d(Xj ) be the corresponding
itemset-diffset pairs. The following four properties hold:
Charm uses these four properties to prune the elements of the combine set, as
shown in Figure 1.8. CFI-diffset-combine is the same as FI-diffset-combine (in
Figure 1.6), except for lines 7-16 that are used to check the four properties outlined
in Theorem 1. According to Property 1, if the diffsets of y and Il+1 are identical
this means c(Il+1 ) = c(Il+1 ∪ {y}). In this case we prune the y branch of the
backtrack tree from further consideration (line 9) and replace Il+1 with Il+1 ∪ {y}
(line 8). By Property 2, if d(y) ⊃ d(Il+1 ) then wherever Il+1 occurs y always co-
occurs, i.e., c(Il+1 ) = c(Il+1 ∪ {y}). We replace Il+1 with Il+1 ∪ {y} (line 11).
By Property 3 if d(y) ⊂ d(Il+1 ) then wherever y occurs Il+1 always co-occurs, i.e.,
c(y) = c(Il+1 ∪ {y}). We therefore remove the entire tree under y in Cl (line 14),
but we add y to Cl+1 . Finally, if the diffsets of y and Il+1 are not equal, then no
pruning is possible; we add y to Cl+1 .
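The four cases can be sketched in tidset form (Example 5 below also reasons in tidsets); this is our simplified rendition for illustration, not Charm's diffset-based code:

```python
def combine_step(prefix, t_prefix, y, t_y, combine_next):
    # Apply Charm's four properties to candidate extension y (tidset form).
    if t_prefix == t_y:                          # Property 1: same closure;
        return prefix | {y}, t_prefix            #   merge y in, prune its branch
    if t_prefix < t_y:                           # Property 2: y always co-occurs;
        return prefix | {y}, t_prefix            #   merge y into the prefix
    if t_prefix > t_y:                           # Property 3: remove y's subtree,
        combine_next.append((y, t_prefix & t_y)) #   but carry y to C_{l+1}
        return prefix, t_prefix
    combine_next.append((y, t_prefix & t_y))     # Property 4: no pruning possible
    return prefix, t_prefix
```

For instance, with prefix A (tidset {1, 3, 4, 5}) and y = C (tidset {1, 2, 3, 4, 5}), Property 2 fires and the prefix becomes AC, mirroring Example 5.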
Example 5 We explain the working of Charm using the example database in Fig-
ure 1.1. At the first level we have C0 = {A, C, D, T, W }, and thus within the for
loop in CFI-backtrack (Figure 1.7) we have I1 = A and P1 = {C, D, T, W }. We
next determine Cl+1 using CFI-diffset-combine. We find that t(A) ⊂ t(C) (Prop-
erty 2), i.e., whenever A occurs C co-occurs, thus we replace A with I1 = AC .
Considering the next element y = D, we find y′ = AD is not frequent. The next
item y = T yields a frequent itemset y′ = AT , but t(A) ≠ t(T ), thus we set
C1 = {T }. Finally, for y = W we again find t(A) ⊂ t(W ), thus we set I1 = ACW ,
and return C1 = {T }. In the next recursion ACW T will be found to be closed.
Next consider the second iteration of the for loop in CFI-backtrack. We have
I1 = C and P1 = {D, T, W }. In CFI-diffset-combine, we find that for all three
items Property 3 is true, i.e., whenever D, T, W occur, C always co-occurs, thus, we
prune them from the backtrack tree, and we get C0 = {A, C}, but C1 = {D, T, W },
which will be recursively processed.
Whenever a new candidate closed set X is generated, Charm must check whether
there already exists a set C ∈ CFI such that X ⊂ C and σ(X) = σ(C); if so, X is a
non-closed set subsumed by C, and it should not be added to CFI. Since CFI dynamically
expands during enumeration of closed patterns, we need a very fast approach to
perform such subsumption checks.
Clearly we want to avoid comparing X with all existing elements in CFI, for this
would lead to O(|CFI|^2 ) complexity. To quickly retrieve relevant closed sets, the
obvious solution is to store CFI in a hash table. But what hash function should we use?
Since we want to perform subset checking, we can’t hash on the itemset. We could
use the support of the itemsets for the hash function. But many unrelated itemsets
may have the same support. Since Charm uses diffsets/tidsets, it seems reasonable
to use the information from the tidsets to help identify if X is subsumed. Note that
if t(Xj ) = t(Xi ), then obviously σ(Xj ) = σ(Xi ). Thus to check if X is subsumed,
we can check if t(X) = t(C) for some C ∈ CFI. This check can be performed
in O(1) time using a hash table. But obviously we cannot afford to store the actual
tidset with each closed set in CFI; the space requirements would be prohibitive.
Charm adopts a compromise solution. It computes a hash function on the tidset
and stores in the hash table a closed set along with its support. Let h(Xi ) denote
a suitable chosen hash function on the tidset t(Xi ). Before adding X to CFI, we
retrieve from the hash table all entries with the hash key h(X). For each retrieved
closed set C, we then check if σ(X) = σ(C). If yes, we next check if X ⊂ C. If
yes, then X is subsumed and we do not add it to CFI.
What is a good hash function on a tidset? Charm uses the sum of the tids in
the tidset as the hash function, i.e., h(X) = Σ_{T ∈ t(X)} T (note, this is not the same
as support, which is the cardinality of t(X)). We tried several other variations and
found there to be no performance difference. This hash function is likely to be as
good as any other due to several reasons. Firstly, by definition a closed set is one that
does not have a superset with the same support; it follows that it must have some tids
that do not appear in any other closed set. Thus the hash keys of different closed sets
will tend to be different. Secondly, even if there are several closed sets with the same
hash key, the support check we perform (i.e., if σ(X) = σ(C)) will eliminate many
closed sets whose keys are the same, but they in fact have different supports. Thirdly,
this hash function is easy to compute and it can easily be used with the diffset format.
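A simplified model of this scheme in Python (our own simplification: we store the full itemset and its support under the tid-sum key, whereas Charm's actual data structures differ):

```python
from collections import defaultdict

cfi = defaultdict(list)   # tid-sum hash key -> [(closed itemset, support), ...]

def add_if_not_subsumed(itemset, tids):
    key = sum(tids)       # h(X): sum of the tids, not their count (the support)
    sup = len(tids)
    for c, c_sup in cfi[key]:
        # Equal support and subset => X is subsumed by a known closed set C.
        if sup == c_sup and itemset <= c:
            return False
    cfi[key].append((itemset, sup))
    return True

# ACW is stored under key 1+3+4+5 = 13; AC has the same tidset, hence the
# same key and support, and is a subset, so it is rejected as non-closed.
assert add_if_not_subsumed(frozenset("ACW"), {1, 3, 4, 5})
assert not add_if_not_subsumed(frozenset("AC"), {1, 3, 4, 5})
```

Itemsets whose tid-sums collide but whose supports differ survive the support check and are stored alongside each other in the same bucket.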
1.4.4.1 Superset Checking Techniques Checking to see if the given itemset Il+1
combined with the possible set Pl+1 is subsumed by another maximal set was also
proposed in Mafia [6] under the name HUTMFI. Further pruning is possible if one
can determine, based just on the support of the combine sets, whether Il+1 ∪ Pl+1 is
guaranteed to be frequent. In this case also one can avoid processing any more branches.
This method was first introduced in MaxMiner [5], and was also used in Mafia under
the name FHUT.
In addition to sorting the initial combine set at level 0 in increasing order of sup-
port, GenMax uses another reordering heuristic based on a simple lemma.
Example 6 For our database in Figure 1.1 with min sup = 2, IF (x) is the same for
all items x ∈ F1 , and the sorted order (on support) is A, D, T, W, C . Figure 1.10
shows the backtracking search trees for maximal itemsets containing prefix items A
and D. Under the search tree for A, Figure 1.10 (a), we try to extend the partial
solution AD by adding to it item T from its combine set. We try another item W
after itemset ADT turns out to be infrequent, and so on. Since GenMax uses itemsets
which are found earlier in the search to prune the combine sets of later branches, after
finding the maximal set ADW C , GenMax skips ADC . After finding AT W C all
the remaining nodes with prefix A are pruned, and so on. The pruned branches are
shown with dashed arrows, indicating that a large part of the search tree is pruned
away.
The general dynamic subset testing problem can be solved in amortized time
O(√s log s) per operation in a sequence of s operations [17] (for us s = O(|MFI|)).
In dense domains we have thousands to millions of maximal frequent
itemsets, and the number of subset checking operations performed would be at least
that much. Can we do better?
The answer is, yes! Firstly, we observe that the two subset checks (one on line 4
and the other on line 8) can be easily reduced to only one check. Since Il+1 ∪ Pl+1 is
a superset of Il+1 , in our implementation we do superset check only for Il+1 ∪ Pl+1 .
While testing this set, we store the maximum position, say p, at which an item in
Il+1 ∪ Pl+1 is not found in a maximal set M ∈ MFI. In other words, all items
before p are subsumed by some maximal set. For the superset test for Il+1 , we check
if |Il+1 | < p. If yes, Il+1 is non-maximal. If no, we add it to MFI.
The second observation is that performing superset checking during each recur-
sive call can be redundant. For example, suppose that the cardinality of the possible
set Pl+1 is m. Then potentially, MFI-backtrack makes m redundant subset checks,
if the current MFI has not changed during these m consecutive calls. To avoid such
redundancy, a simple check status flag is used. If the flag is false, no superset check
is performed. Before each recursive call the flag is false; it becomes true whenever
Cl+1 is empty, which indicates that we have reached a leaf, and have to backtrack.
[INSERT Figure 1.11 HERE]
The O(√s log s) time bounds reported in [17] for dynamic subset testing do not
assume anything about the sequence of operations performed. In contrast, we have
full knowledge of how GenMax generates maximal sets; we use this observation to
substantially speed up the subset checking process. The main idea is to progressively
narrow down the maximal itemsets of interest as recursive calls are made. In other
words, we construct for each invocation of MFI-backtrack a list of local maximal
frequent itemsets, LM F Il . This list contains the maximal sets that can potentially
be supersets of candidates that are to be generated from the itemset Il . The only such
maximal sets are those that contain all items in Il . This way, instead of checking if
Il+1 ∪Pl+1 is contained in the full current MFI, we check only in LM F Il – the local
set of relevant maximal itemsets. This technique, that we call progressive focusing,
is extremely powerful in narrowing the search to only the most relevant maximal
itemsets, making superset checking practical on dense datasets.
Figure 1.11 shows the pseudo-code for GenMax that incorporates this optimiza-
tion (the code for the first two optimizations is not shown to avoid clutter). Before
each invocation of LMFI-backtrack a new LM F Il+1 is created, consisting of those
maximal sets in the current LM F Il that contain the item x (see line 10). Any new
maximal itemsets from a recursive call are incorporated in the current LM F Il at line
12.
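In essence, progressive focusing filters the parent's local list down to the maximal sets containing the newly added item before each recursive call; a minimal sketch (the names and the sample MFI are ours):

```python
def focus(lmfi, x):
    # LMFI_{l+1}: only maximal sets containing x can subsume a candidate
    # generated below I_{l+1} = I_l ∪ {x}, since every such candidate has x.
    return [m for m in lmfi if x in m]

def subsumed(candidate, lmfi):
    # Superset check against the narrowed local list, not the full MFI.
    return any(candidate <= m for m in lmfi)

mfi = [frozenset("ACTW"), frozenset("ACDW")]   # hypothetical current MFI
lmfi_T = focus(mfi, "T")                        # descend into the T branch

assert lmfi_T == [frozenset("ACTW")]
assert subsumed(frozenset("ATW"), lmfi_T)
assert not subsumed(frozenset("ADW"), lmfi_T)
```

As the recursion deepens, each level's local list shrinks further, so the superset checks near the leaves, where most candidates are generated, touch only a handful of maximal sets.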
Past work has demonstrated that for MFI mining DepthProject [3] is faster than
MaxMiner [5], and the latest paper shows that Mafia [6] consistently beats DepthPro-
ject. In our experimental study below, we retain MaxMiner for baseline comparison.
At the same time, MaxMiner shows good performance on some datasets, which were
not used in previous studies. We use Mafia as the current state-of-the-art method and
show how GenMax compares against it. For comparison we used the original source
or object code for MaxMiner [5] and MAFIA [6], provided to us by their authors. For
CFI mining, we used the original source or object code for Close [14], Pascal [4],
Closet [15] and Mafia [6], all provided to us by their authors. We also include a
comparison with the base Apriori algorithm [2] for mining all itemsets.
All our experiments were performed on a 400MHz Pentium PC with 256MB of
memory, running RedHat Linux 6.0. Since Closet was provided as a Windows exe-
cutable by its authors, we compared it separately on a 900 MHz Pentium III processor
with 256MB memory, running Windows 98. Timings in the figures are based on total
wall-clock time, and include all preprocessing costs (such as horizontal-to-vertical
conversion in Charm, GenMax and Mafia). The times reported also include the pro-
gram output. We believe our setup reflects realistic testing conditions (as opposed
to some previous studies which report only the CPU time or may not include output
cost).
[INSERT Figure 1.12 HERE]
We chose several real and synthetic datasets for testing the performance of the
algorithms, shown in Table 1.12. The real datasets have been used previously in the
evaluation of maximal patterns [5, 3, 6]. Typically, these real datasets are very dense,
i.e., they produce many long frequent itemsets even for high values of support. The
table shows the length of the longest maximal pattern (at the lowest minimum sup-
port used in our experiments) for the different datasets. For example on pumsb*, the
longest pattern was of length 43 (any method that mines all frequent patterns will be
impractical for such long patterns). We also chose two synthetic datasets, which have
been used as benchmarks for testing methods that mine all frequent patterns. Previ-
ous maximal set mining algorithms have not been tested on these datasets, which are
sparser compared to the real sets. All these datasets are publicly available from IBM
Almaden (www.almaden.ibm.com/cs/quest/demos.html).
[INSERT Figure 1.13 HERE]
Figure 1.13 shows the total number of frequent, closed and maximal itemsets
found for various support values. The maximal frequent itemsets are a subset of the
frequent closed itemsets. The frequent closed itemsets are, of course, a subset of all
frequent itemsets. Depending on the support value used, for the real datasets, the set
of maximal itemsets is about an order of magnitude smaller than the set of closed
itemsets, which in turn is an order of magnitude smaller than the set of all frequent
itemsets. Even for very low support values we find that the difference between max-
imal and closed remains around a factor of 10. However the gap between closed and
all frequent itemsets grows more rapidly. Similar results were obtained for other real
datasets as well. On the other hand in sparse datasets the number of closed sets is
only marginally smaller than the number of frequent sets; the number of maximal
sets is still smaller, though the differences can narrow down for low support values.
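To make the containment MFI ⊆ CFI ⊆ FI concrete, here is a brute-force Python sketch over the chapter's small example database (6 transactions over items A, C, D, T, W; min sup 50%). The function and variable names are illustrative, not taken from any of the surveyed implementations.

```python
from itertools import combinations

# The chapter's small example database: 6 transactions over items A,C,D,T,W.
db = ["ACTW", "CDW", "ACTW", "ACDW", "ACDTW", "CDT"]
min_sup = 3  # absolute support; 50% of 6 transactions

def tidset(itemset):
    """Set of transaction ids containing every item of `itemset`."""
    return {i for i, row in enumerate(db) if set(itemset) <= set(row)}

items = sorted(set("".join(db)))
FI = {}  # frequent itemset -> support
for k in range(1, len(items) + 1):
    for c in combinations(items, k):
        sup = len(tidset(c))
        if sup >= min_sup:
            FI[frozenset(c)] = sup

# Closed: no frequent proper superset has the same support.
CFI = {s: n for s, n in FI.items()
       if not any(s < u and FI[u] == n for u in FI)}
# Maximal: no frequent proper superset at all.
MFI = {s: n for s, n in FI.items() if not any(s < u for u in FI)}

print(len(FI), len(CFI), len(MFI))  # the three cardinalities shrink in turn
```

Even on this toy database the gap is visible: 19 frequent sets collapse to 7 closed sets and only 2 maximal sets.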
[INSERT FIGURE 1.14 HERE ]
1.5.2.1 Type I: Symmetric Datasets Let us first compare how the methods per-
form on datasets which exhibit a symmetric distribution of closed itemsets, namely
chess, pumsb, connect and pumsb*. We observe that Apriori, Close and Pascal work
only for very high values of support on these datasets. The best among the three is
Pascal, which can be twice as fast as Close, and up to 4 times better than Apriori. On
the other hand, Charm is several orders of magnitude better than Pascal, and it can
be run on very low support values, where none of the former three methods can be
run. Comparing with Mafia, we find that both Charm and Mafia have similar per-
formance for higher support values. However, as we lower the minimum support,
the performance gap between Charm and Mafia widens. For example at the lowest
support value plotted, Charm is about 30 times faster than Mafia on chess, about 3
times faster on connect and pumsb, and 4 times faster on pumsb*. Charm outper-
forms Closet by an order of magnitude or more, especially as support is lowered. On
chess and pumsb* it is about 10 times faster than Closet, and about 40 times faster on
pumsb. On connect Closet performs better at high supports, but Charm does better
at lower supports. The reason is that connect has transactions with a lot of overlap.
1.5.2.2 Type II: Bi-modal Datasets On the two datasets with a bi-modal distri-
bution of frequent closed patterns, namely mushroom and T40, we find that Pascal
fares better than for symmetric distributions. For higher values of support the maxi-
mum closed pattern length is relatively short, and the distribution is dominated by the
first mode. Apriori, Close and Pascal can handle this case. However, as one lowers the
minimum support the second mode starts to dominate, with longer patterns. These
methods thus quickly lose steam and become uncompetitive. Between Charm
and Mafia, up to 1% minimum support there is negligible difference, however, when
the support is lowered there is a huge difference in performance. Charm is about 20
times faster on mushroom and 10 times faster on T40 for the lowest support shown.
The gap continues to widen sharply. We find that Charm outperforms Closet by a
factor of 2 for mushroom and 4 for T40.
1.5.2.3 Type III: Right-Skewed Datasets On gazelle and T10, which have a large
number of very short closed patterns, followed by a sharp drop, we find that Apriori,
Close and Pascal remain competitive even for relatively low supports. The reason
is that T10 had a maximum pattern length of 11 at the lowest support shown, and
gazelle at 0.06% support also had a maximum pattern length of 11. The level-wise
search of these three methods is able to easily handle such short patterns. However,
for gazelle, we found that at 0.05% support the maximum pattern length suddenly
jumped to 45, and none of these three methods could be run.
T10, though a sparse dataset, is problematic for Mafia. The reason is that T10
produces long sparse bitvectors for each item, and offers little scope for bit-vector
compression and projection that Mafia relies on for efficiency. This causes Mafia
to be uncompetitive for such datasets. Similarly Mafia fails to do well on gazelle.
However, it is able to run on the lowest support value. The diffset format of Charm is
resilient to sparsity (as shown in [20]) and it continues to outperform other methods.
For the lowest support, on T10 it is twice as fast as Pascal and 15 times better than
Mafia, and it is about 70 times faster than Mafia on gazelle. Charm is about 2 times
slower than Closet on T10. The reason is that the majority of closed sets are of length
2, and the tidset/diffsets operations in Charm are relatively expensive compared to
the compact FP-tree for short patterns (max length is only 11). However, for gazelle,
which has much longer closed patterns, Charm outperforms Closet by a factor of
160!
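The diffset idea from [20] mentioned above can be illustrated with a tiny sketch (the tidsets here are hypothetical): instead of the tidset t(PX), Charm keeps only the difference d(PX) = t(P) − t(X), recovers the support as sup(PX) = sup(P) − |d(PX)|, and extends deeper sets via the recurrence d(PXY) = d(PY) − d(PX), so long tidsets never need to be re-intersected.

```python
# Hypothetical tidsets for a prefix P and two extensions X, Y.
t = {"P": {1, 2, 3, 4, 5, 6}, "X": {1, 2, 3, 4, 6}, "Y": {1, 2, 3, 5, 6}}

d_PX = t["P"] - t["X"]            # diffset of PX: {5}, one tid instead of five
d_PY = t["P"] - t["Y"]            # diffset of PY: {4}
sup_PX = len(t["P"]) - len(d_PX)  # support of PX without storing t(PX)

# Check the recurrence d(PXY) = d(PY) - d(PX) against the direct definition.
t_PX = t["P"] & t["X"]
d_PXY = d_PY - d_PX
assert d_PXY == t_PX - (t_PX & t["Y"])
print(sup_PX, d_PXY)
```

On dense data the diffsets stay far smaller than the tidsets they replace, which is the source of Charm's speed.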
1.5.3.1 Type I Datasets: Chess and Pumsb Figure 1.16 shows the performance
of the three algorithms on chess and pumsb. These Type I datasets are character-
ized by a symmetric distribution of the maximal frequent patterns (leftmost graph).
Looking at the mean of the curve, we can observe that for these datasets most of the
maximal patterns are relatively short (average length 11 for chess and 10 for pumsb).
The MFI cardinalities in Figure 1.13 show that for the support values shown, the
MFI is 2 orders of magnitude smaller than all frequent itemsets.
Compare the total execution time for the different algorithms on these datasets
(center and rightmost graphs). We use two different variants of Mafia. The first
one, labeled Mafia, does not return the exact maximal frequent set, rather it returns
a superset of all maximal patterns. The second variant, labeled MafiaPP, uses an
option to eliminate non-maximal sets in a post-processing (PP) step. Both GenMax
and MaxMiner return the exact MFI.
On chess we find that Mafia (without PP) is the fastest if one is willing to live
with a superset of the MFI. Mafia is about 10 times faster than MaxMiner. However,
notice how the running time of MafiaPP grows if one tries to find the exact MFI in
a post-pruning step. GenMax, though slower than Mafia, is significantly faster than
MafiaPP and is about 5 times better than MaxMiner. All methods, except MafiaPP,
show an exponential growth in running time (since the y-axis is in log-scale, this
appears linear) faithfully following the growth of MFI with lowering minimum sup-
port, as shown in the top center and right figures. MafiaPP shows super-exponential
growth and suffers from an approximately O(|MFI|^2) overhead in pruning non-
maximal sets and thus becomes impractical when the MFI becomes too large, i.e., at
low supports.
On pumsb, we find that GenMax is the fastest, having a slight edge over Mafia.
It is about 2 times faster than MafiaPP. We observed that the post-pruning routine in
MafiaPP works well up to around O(10^4) maximal itemsets. Since at 60% min sup we
had around that many sets, the overhead of post-processing was not significant. With
lower support the post-pruning cost becomes significant, so much so that we could
not run MafiaPP beyond 50% minimum support. MaxMiner is significantly slower
on pumsb; a factor of 10 times slower than both GenMax and Mafia.
Type I results substantiate the claim that GenMax is a highly efficient method
to mine the exact MFI. It is as fast as Mafia on pumsb and within a factor of 2 on
chess. Mafia, on the other hand is very effective in mining a superset of the MFI.
Post-pruning, in general, is not a good idea, and GenMax beats MafiaPP by a wide
margin (over 100 times better in some cases, e.g., chess at 20%). On Type I data
MaxMiner is noncompetitive.
[INSERT FIGURE 1.17 HERE ]
GenMax outperforms Mafia. MafiaPP continues to be favorable for higher supports, but once again
beyond a point post-pruning costs start to dominate. MafiaPP could not be run be-
yond the plotted points. MaxMiner remains noncompetitive (about 10 times slower).
The initial start-up time for Mafia for creating the bit-vectors is responsible for the
high offset at 50% support on pumsb*. GenMax appears to exhibit a more graceful
increase in running time than Mafia.
[INSERT Figure 1.18 HERE ]
1.5.3.3 Type III Datasets: T10I4 and T40I10 As depicted in Figure 1.18, Type
III datasets – the two synthetic ones – are characterized by an exponentially decaying
distribution of the maximal frequent patterns. Except for a few maximal sets of size
one, the vast majority of maximal patterns are of length two! After that the number of
longer patterns drops exponentially. The mean pattern length is very short compared
to Type I or Type II datasets; it is around 4-6. MFI cardinality is not much smaller
than the cardinality of all frequent patterns (see Figure 1.13). The difference is only
a factor of 10 compared to a factor of 100 for Type I and a factor of 10,000 for Type
II.
Comparing the running times we observe that MaxMiner is the best method for
this type of data. The breadth-first or level-wise search strategy used in MaxMiner
is ideal for very bushy search trees, and when the average maximal pattern length
is small. Horizontal methods are better equipped to cope with the quadratic blowup
in the number of frequent 2-itemsets since one can use array based counting to get
their frequency. On the other hand vertical methods spend much time in performing
intersections on long item tidsets or bit-vectors. GenMax gets around this problem
by using the horizontal format for computing frequent 2-itemsets (denoted F2), but
it still has to spend time performing O(|F2|) pairwise tidset intersections.
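Why horizontal counting of F2 is cheap can be sketched as follows: one pass over the transactions with an upper-triangular count array, and no tidset intersections at all. The toy database and the 2D array stand in for GenMax's actual counting code, which is not shown in the chapter.

```python
from itertools import combinations

# Toy horizontal database; items are single characters.
db = ["ACTW", "CDW", "ACTW", "ACDW", "ACDTW", "CDT"]
min_sup = 3
items = sorted(set("".join(db)))
idx = {x: i for i, x in enumerate(items)}

n = len(items)
cnt = [[0] * n for _ in range(n)]  # upper-triangular pair counts
for row in db:
    for a, b in combinations(sorted(row), 2):
        cnt[idx[a]][idx[b]] += 1   # one array update per 2-subset

F2 = [(items[i], items[j]) for i in range(n) for j in range(i + 1, n)
      if cnt[i][j] >= min_sup]
print(F2)
```

Each transaction of length m contributes m(m-1)/2 array increments, so the quadratic blowup in candidate 2-itemsets costs only cheap counter updates rather than expensive intersections of long, sparse tidsets.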
Mafia on the other hand performs O(|F1|^2) intersections, where F1 is the set of
frequent items. The overhead cost is enough to render Mafia noncompetitive on Type
III data. On T10 Mafia can be 20 or more times slower than MaxMiner. GenMax
exhibits relatively good performance, and it is about 10 times better than Mafia and
2 to 3 times worse than MaxMiner. On T40, the gap between GenMax/Mafia and
MaxMiner is smaller since there are longer maximal patterns. MaxMiner is 2 times
better than GenMax and 5 times better than Mafia. Since the MFI cardinality is
not too large, MafiaPP has almost the same running time as Mafia for high supports. Once again
MafiaPP could not be run for lower support values. It is clear that, in general, post-
pruning is not a good idea; the overhead is too much to cope with.
[INSERT Figure 1.19 HERE ]
1.6 CONCLUSIONS
This is one of the first works to comprehensively compare recent closed and maximal
pattern mining algorithms under realistic assumptions. Our timings are based on
wall-clock time and include all pre-processing costs as well as the cost of writing all
the closed and maximal itemsets to a file. We were able to distinguish three
different types of CFI distributions and four different types of MFI distributions in
our benchmark testbed. We believe these distributions to be fairly representative of
what one might see in practice, since they span both real and synthetic datasets. For
CFI mining, Type I is a symmetric/normal CFI distribution, with both small and
long mean pattern lengths, Type II is a bi-modal distribution with both long and
short modes, and Type III is a right-skewed distribution with relatively short closed
pattern lengths. For MFI mining, Type I is a normal MFI distribution with not too
long maximal patterns, Type II is a left-skewed distribution, with longer maximal
patterns, Type III is an exponential decay distribution, with extremely short maximal
patterns, and finally Type IV is an extreme left-skewed distribution, with very large
average maximal pattern length.
We noted that different algorithms perform well under different distributions.
Among the CFI mining algorithms Charm performs the best for all distribution
types, with the exception of the T10 dataset, which is sparse and has very short pattern
lengths. For connect and mushroom Closet does better for high support values, but
Charm outperforms it at lower supports. Mafia was not found to be competitive with
Charm, and neither were Pascal or Close. Among the current methods for MFI min-
ing, MaxMiner is the best for mining Type III distributions. On the remaining types,
Mafia is the best method if one is satisfied with a superset of the MFI. For very low
supports on Type II data, Mafia loses its edge. Post-pruning non-maximal patterns
typically has high overhead. It works only for high support values, and MafiaPP can-
not be run beyond a certain minimum support value. GenMax integrates pruning of
non-maximal itemsets in the process of mining using the novel progressive focusing
technique, along with other optimizations for superset checking; GenMax is the best
method for mining the exact MFI.
Acknowledgments
We would like to thank Roberto Bayardo for providing us the MaxMiner algorithm; Lotfi
Lakhal and Yves Bastide for providing us the source code for Close and Pascal; Jiawei Han,
Jian Pei, and Jianyong Wang for sending us the executable for Closet; and Johannes Gehrke
for the Mafia algorithm. We thank Roberto Bayardo for providing us the IBM real datasets,
and Ronny Kohavi and Zijian Zheng of Blue Martini Software for giving us access to the
Gazelle dataset.
3. Ramesh Agarwal, Charu Aggarwal, and V.V.V. Prasad. Depth First Generation
of Long Patterns. In 7th Int’l Conference on Knowledge Discovery and Data
Mining, August 2000.
10. D. Gunopulos, H. Mannila, and S. Saluja. Discovering all the most specific
sentences by randomized algorithms. In Intl. Conf. on Database Theory, January
1997.
11. J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation.
In ACM SIGMOD Conf. Management of Data, May 2000.
12. D-I. Lin and Z. M. Kedem. Pincer-search: A new algorithm for discovering
the maximum frequent set. In 6th Intl. Conf. Extending Database Technology,
March 1998.
13. F. Pan, G. Cong, A.K.H. Tung, J. Yang, and M.J. Zaki. CARPENTER: Finding
closed patterns in long biological datasets. In ACM SIGKDD Int’l Conf. on
Knowledge Discovery and Data Mining, August 2003.
15. J. Pei, J. Han, and R. Mao. Closet: An efficient algorithm for mining frequent
closed itemsets. In SIGMOD Int’l Workshop on Data Mining and Knowledge
Discovery, May 2000.
16. J. Wang, J. Han, and J. Pei. Closet+: Searching for the best strategies for mining
frequent closed itemsets. In ACM SIGKDD Int’l Conf. on Knowledge Discovery
and Data Mining, August 2003.
17. D.M. Yellin. An algorithm for dynamic subset and intersection testing. Theo-
retical Computer Science, 129:397–406, 1994.
20. M. J. Zaki and K. Gouda. Fast vertical mining using diffsets. In 9th ACM
SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining, August 2003.
21. M. J. Zaki and C.-J. Hsiao. CHARM: An efficient algorithm for closed itemset
mining. In 2nd SIAM International Conference on Data Mining, April 2002.
22. Q. Zou, W.W. Chu, and B. Lu. Smartminer: a depth first algorithm guided by
tail information for mining maximal frequent itemsets. In 2nd IEEE Int’l Conf.
on Data Mining, November 2002.
[Figure: example database with its frequent, closed, and maximal itemsets]
DATABASE (MINIMUM SUPPORT = 50%, i.e., 3 of 6 transactions):
Transaction  Items
1            ACTW
2            CDW
3            ACTW
4            ACDW
5            ACDTW
6            CDT
ALL FREQUENT ITEMSETS:
Support   Itemsets
100% (6)  C
83% (5)   W, CW
67% (4)   A, D, T, AC, AW, CD, CT, ACW
50% (3)   AT, DW, TW, ACT, ATW, CDW, CTW, ACTW
MAXIMAL ITEMSETS: ACTW, CDW
[Lattice diagram omitted: each frequent itemset annotated with its tidset, e.g., A (1345), C (123456), CW (12345), ACW (1345), CDW (245), ACTW (135)]
// Invoke as FI-backtrack(∅, F1 , 0)
FI-backtrack(Il , Cl , l)
1. for each x ∈ Cl
2. Il+1 = Il ∪ {x} //also add Il+1 to FI
3. Pl+1 = {y : y ∈ Cl and y > x}
4. Cl+1 = FI-combine (Il+1 , Pl+1 )
5. FI-backtrack(Il+1 , Cl+1 , l + 1)
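The pseudocode above can be rendered as a small runnable sketch. This Python version is illustrative only: the toy database and min sup are assumptions, and FI-combine is inlined as a loop of tidset intersections over the possible set Pl+1.

```python
# Toy database: each string is one transaction, items are characters.
db = ["ACTW", "CDW", "ACTW", "ACDW", "ACDTW", "CDT"]
min_sup = 3

# t[x] is the tidset of item x.
t = {}
for tid, row in enumerate(db):
    for x in row:
        t.setdefault(x, set()).add(tid)

FI = []  # all frequent itemsets, as tuples

def fi_backtrack(I, C):
    """I: current itemset; C: combine set of (item, tidset) pairs,
    each already frequent when joined with I."""
    for i, (x, tx) in enumerate(C):
        I1 = I + (x,)
        FI.append(I1)
        # Inlined FI-combine: extend I1 with every later item y.
        C1 = []
        for y, ty in C[i + 1:]:
            txy = tx & ty          # tidset intersection gives t(I1 ∪ {y})
            if len(txy) >= min_sup:
                C1.append((y, txy))
        fi_backtrack(I1, C1)

F1 = sorted((x, tx) for x, tx in t.items() if len(tx) >= min_sup)
fi_backtrack((), F1)
print(len(FI))
```

Each node of the backtrack tree corresponds to one frequent itemset, so no candidate is ever generated twice.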
Fig. 1.4 Subset/Backtrack Search Tree (min sup= 3): Circles indicate maximal sets and the
infrequent sets have been crossed out. Due to the downward closure property of support (i.e.,
all subsets of a frequent itemset must be frequent) the frequent itemsets form a border (shown
with the bold line), such that all frequent itemsets lie above the border, while all infrequent
itemsets lie below it. Since MFI determine the border, it is straightforward to obtain FI in a
single database scan if MFI is known.
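Since all subsets of a frequent itemset are frequent, the MFI implicitly encodes every frequent itemset. A minimal sketch, using the two maximal sets of the running example (the supports themselves would still require one database scan):

```python
from itertools import chain, combinations

# The two maximal sets of the running example (min sup = 3).
MFI = ["ACTW", "CDW"]

def nonempty_subsets(s):
    items = sorted(s)
    return chain.from_iterable(
        combinations(items, k) for k in range(1, len(items) + 1))

# Every frequent itemset is a subset of some maximal set (border property).
FI = sorted({"".join(sub) for m in MFI for sub in nonempty_subsets(m)})
print(len(FI))
```

The union of subsets of just two maximal sets reproduces all 19 frequent itemsets of the example.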
Fig. 1.5 FI-combine Using Tidset Intersections (* indicates a new line not in FI-combine)
Fig. 1.6 FI-combine using Diffsets (* indicates a new line not in FI-tidset-combine)
// Invocation: CFI-backtrack(∅, F1 , 0)
CFI-backtrack(Il , Cl , l)
1. for each x ∈ Cl
2. Il+1 = Il ∪ {x} //also add Il+1 to FI
3. Pl+1 = {y : y ∈ Cl and y > x}
4. Cl+1 = FI-diffset-combine (Il+1 , Pl+1 , Cl )
5. CFI-backtrack(Il+1 , Cl+1 , l + 1)
6.* if Il+1 has no superset in CFI with same support
7.* CFI= CFI ∪ Il+1
Fig. 1.7 Backtrack Algorithm for Mining CFI (* indicates a new line not in FI-backtrack)
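A runnable sketch of this scheme follows. It is a simplification of the pseudocode: plain tidsets are used instead of diffsets, the combine step is inlined, and the closedness test of lines 6-7 becomes a superset lookup in a hash keyed on support. The data is the toy example, not any benchmark.

```python
db = ["ACTW", "CDW", "ACTW", "ACDW", "ACDTW", "CDT"]
min_sup = 3
t = {}
for tid, row in enumerate(db):
    for x in row:
        t.setdefault(x, set()).add(tid)

CFI = {}  # support -> list of closed itemsets (frozensets)

def cfi_backtrack(I, C):
    for i, (x, tx) in enumerate(C):
        I1, s1 = I | {x}, len(tx)
        C1 = [(y, tx & ty) for y, ty in C[i + 1:]
              if len(tx & ty) >= min_sup]
        cfi_backtrack(I1, C1)   # explore supersets first (depth-first)
        # Lines 6-7: I1 is closed unless a known closed superset
        # has the same support; hashing on support keeps this cheap.
        if not any(I1 < M for M in CFI.get(s1, [])):
            CFI.setdefault(s1, []).append(I1)

F1 = sorted((x, tx) for x, tx in t.items() if len(tx) >= min_sup)
cfi_backtrack(frozenset(), F1)
print(sum(len(v) for v in CFI.values()))
```

Because the check runs after the recursive call, every closed superset reachable from the current branch is already in CFI when a candidate is tested.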
// Invocation: MFI-backtrack(∅, F1 , 0)
MFI-backtrack(Il , Cl , l)
1. for each x ∈ Cl
2. Il+1 = Il ∪ {x}
3. Pl+1 = {y : y ∈ Cl and y > x}
4.* if Il+1 ∪ Pl+1 has a superset in MFI
5.* return //all subsequent branches pruned!
6.* Cl+1 = CFI-diffset-combine (Il+1 , Pl+1 , Cl )
7.* if Cl+1 is empty
8.* if Il+1 has no superset in MFI
9.* MFI= MFI ∪ Il+1
10. else MFI-backtrack(Il+1 , Cl+1 , l + 1)
Fig. 1.9 Backtrack Algorithm for Mining MFI (* indicates a new line not in FI-backtrack)
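The same skeleton with the maximality logic of lines 4-5 and 7-9 can be sketched as follows (again a simplification: tidsets instead of diffsets, toy data assumed):

```python
db = ["ACTW", "CDW", "ACTW", "ACDW", "ACDTW", "CDT"]
min_sup = 3
t = {}
for tid, row in enumerate(db):
    for x in row:
        t.setdefault(x, set()).add(tid)

MFI = []  # maximal frequent itemsets (frozensets)

def mfi_backtrack(I, C):
    for i, (x, tx) in enumerate(C):
        I1 = I | {x}
        P1 = frozenset(y for y, _ in C[i + 1:])
        # Lines 4-5: if I1 plus all remaining candidates is subsumed
        # by a known maximal set, all subsequent branches are pruned.
        if any(I1 | P1 <= M for M in MFI):
            return
        C1 = [(y, tx & ty) for y, ty in C[i + 1:]
              if len(tx & ty) >= min_sup]
        if not C1:
            # Lines 7-9: a leaf is maximal if no superset is known.
            if not any(I1 <= M for M in MFI):
                MFI.append(I1)
        else:
            mfi_backtrack(I1, C1)

F1 = sorted((x, tx) for x, tx in t.items() if len(tx) >= min_sup)
mfi_backtrack(frozenset(), F1)
print(sorted("".join(sorted(m)) for m in MFI))
```

On the toy data the subsumption test prunes every non-maximal leaf, leaving only ACTW and CDW.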
// Invocation: LMFI-backtrack(∅, F1 , ∅, 0)
// LM F Il is an output parameter
LMFI-backtrack(Il , Cl , LM F Il , l)
1. for each x ∈ Cl
2. Il+1 = Il ∪ {x}
3. Pl+1 = {y : y ∈ Cl and y > x}
4. if Il+1 ∪ Pl+1 has a superset in LM F Il
5. return //subsequent branches pruned!
6. * LM F Il+1 = ∅
7. Cl+1 = CFI-diffset-combine (Il+1 , Pl+1 , Cl )
8. if Cl+1 is empty
9. if Il+1 has no superset in LM F Il
10. LM F Il = LM F Il ∪ Il+1
11.* else LM F Il+1 = {M ∈ LM F Il : x ∈ M }
12. LMFI-backtrack(Il+1 , Cl+1 , LM F Il+1 , l + 1)
13.* LM F Il = LM F Il ∪ LM F Il+1
Fig. 1.11 Mining MFI with Progressive Focusing (* indicates a new line not in MFI-
backtrack)
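Progressive focusing can be sketched by threading a local MFI through the recursion, so maximality is checked only against maximal sets relevant to the current branch rather than the full, ever-growing global MFI. This Python version is a simplified sketch (tidsets instead of diffsets, toy data, and the local MFI returned rather than passed as an output parameter):

```python
db = ["ACTW", "CDW", "ACTW", "ACDW", "ACDTW", "CDT"]
min_sup = 3
t = {}
for tid, row in enumerate(db):
    for x in row:
        t.setdefault(x, set()).add(tid)

def lmfi_backtrack(I, C, LMFI):
    """Returns the maximal sets found in this subtree; LMFI holds only
    the already-known maximal sets relevant to this branch."""
    found = []
    for i, (x, tx) in enumerate(C):
        I1 = I | {x}
        P1 = frozenset(y for y, _ in C[i + 1:])
        known = LMFI + found
        if any(I1 | P1 <= M for M in known):
            break  # line 5: all subsequent branches pruned
        C1 = [(y, tx & ty) for y, ty in C[i + 1:]
              if len(tx & ty) >= min_sup]
        if not C1:
            if not any(I1 <= M for M in known):
                found.append(I1)
        else:
            # Line 11: pass down only the maximal sets containing x.
            LMFI1 = [M for M in known if x in M]
            found += lmfi_backtrack(I1, C1, LMFI1)
    return found

F1 = sorted((x, tx) for x, tx in t.items() if len(tx) >= min_sup)
MFI = lmfi_backtrack(frozenset(), F1, [])
print(sorted("".join(sorted(m)) for m in MFI))
```

The filter on line 11 is the focusing step: deeper calls only ever compare against supersets that could actually subsume them, which keeps the superset checks cheap even when the global MFI is huge.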
Database I AL R MPL
chess 76 37 3,196 23 (20%)
connect 130 43 67,557 31 (2.5%)
mushroom 120 23 8,124 22 (0.025%)
pumsb* 7117 50 49,046 43 (2.5%)
pumsb 7117 74 49,046 27 (40%)
gazelle 498 2.5 59,601 154 (0.01%)
T10I4D100K 1000 10 100,000 13 (0.01%)
T40I10D100K 1000 40 100,000 25 (0.1%)
Fig. 1.12 Database Characteristics: I denotes the number of items, AL the average length
of a record, R the number of records, and M P L the maximum pattern length at the given
min sup.
[Plots omitted: set cardinality (log scale) versus minimum support (%), and closed-pattern frequency versus pattern length, for the test datasets]
[Plots omitted: total time (sec) versus minimum support (%) for Apriori, Close, Pascal, Mafia, Charm, and Closet, including the T10I4D100K and gazelle datasets]
Fig. 1.15 Charm versus Apriori, Close, Pascal, Mafia, and Closet
[Plots omitted: maximal-pattern frequency versus pattern length, and total time versus minimum support (%), for MaxMiner, MafiaPP, GenMax, GenMax', and Mafia]