
Data Mining and Machine Learning:

Fundamental Concepts and Algorithms


dataminingbook.info

Mohammed J. Zaki¹   Wagner Meira Jr.²

¹ Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA
² Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil

Chapter 9: Summarizing Itemsets

An Example Database

Transaction database:

Tid   Itemset
1     ABDE
2     BCE
3     ABDE
4     ABCE
5     ABCDE
6     BCD

Frequent itemsets (minsup = 3):

sup   Itemsets
6     B
5     E, BE
4     A, C, D, AB, AE, BC, BD, ABE
3     AD, CE, DE, ABD, ADE, BCE, BDE, ABDE

Maximal Frequent Itemsets

Given a binary database D ⊆ T × I, over the tids T and items I, let F denote
the set of all frequent itemsets, that is,

F = { X | X ⊆ I and sup(X) ≥ minsup }

A frequent itemset X ∈ F is called maximal if it has no frequent supersets. Let
M be the set of all maximal frequent itemsets, given as

M = { X | X ∈ F and ∄Y ⊃ X such that Y ∈ F }

The set M is a condensed representation of the set of all frequent itemsets F,
because we can determine whether any itemset X is frequent or not using M alone. If
there exists a maximal itemset Z such that X ⊆ Z, then X must be frequent;
otherwise X cannot be frequent.
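
As a concrete illustration, here is a minimal Python sketch (ours, not the book's) that recovers M from an explicitly materialized F by checking each itemset for frequent supersets; the helper name maximal_itemsets and the frozenset encoding are our own choices.

def maximal_itemsets(F):
    """Return the maximal itemsets of F (given as a collection of item strings)."""
    F = set(map(frozenset, F))
    # X is maximal iff no other frequent itemset is a strict superset of it.
    return {X for X in F if not any(X < Y for Y in F)}

# Frequent itemsets of the example database (minsup = 3).
F = ["B", "E", "BE", "A", "C", "D", "AB", "AE", "BC", "BD", "ABE",
     "AD", "CE", "DE", "ABD", "ADE", "BCE", "BDE", "ABDE"]
print(sorted("".join(sorted(X)) for X in maximal_itemsets(F)))  # ['ABDE', 'BCE']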

Closed Frequent Itemsets

Given a set of tids T and an itemset X ⊆ I, define

t(X) = { t ∈ T | t contains X }
i(T) = { x ∈ I | ∀t ∈ T, t contains x }
c(X) = i ∘ t(X) = i(t(X))

The function c is a closure operator and an itemset X is called closed if c(X) = X. It
follows that t(c(X)) = t(X). The set of all closed frequent itemsets is thus defined as

C = { X | X ∈ F and ∄Y ⊃ X such that sup(X) = sup(Y) }

That is, X is closed if all supersets of X have strictly less support: sup(X) > sup(Y) for
all Y ⊃ X.

The set of all closed frequent itemsets C is a condensed representation, as we can
determine whether an itemset X is frequent, as well as the exact support of X, using C
alone.
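
The closure operator is easy to compute directly on the example database; the following Python sketch (the helper names t, i, and c are ours) mirrors the definitions above.

D = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE", 6: "BCD"}

def t(X):
    """Tidset of X: tids of all transactions that contain X."""
    return {tid for tid, items in D.items() if set(X) <= set(items)}

def i(T):
    """Itemset common to all transactions in tidset T (all items if T is empty)."""
    return set.intersection(*(set(D[tid]) for tid in T)) if T else set("ABCDE")

def c(X):
    """Closure operator c = i ∘ t."""
    return i(t(X))

print("".join(sorted(c("AD"))))  # ABDE: AD is not closed
print("".join(sorted(c("BE"))))  # BE: BE is closed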

Minimal Generators

A frequent itemset X is a minimal generator if it has no subsets with the same
support:

G = { X | X ∈ F and ∄Y ⊂ X such that sup(X) = sup(Y) }

In other words, all subsets of X have strictly higher support, that is,
sup(X) < sup(Y), for all Y ⊂ X.

Given an equivalence class of itemsets that have the same tidset, a closed itemset
is the unique maximum element of the class, whereas the minimal generators are
the minimal elements of the class.
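
A corresponding Python check for minimal generators (again a sketch with our own helper names): since support is anti-monotone, it suffices to compare X against its immediate (k−1)-subsets.

from itertools import combinations

D = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE", 6: "BCD"}

def sup(X):
    return sum(1 for items in D.values() if set(X) <= set(items))

def is_generator(X):
    """X is a minimal generator iff every immediate subset has strictly higher support."""
    return all(sup(Y) > sup(X) for Y in combinations(X, len(X) - 1))

print(is_generator("AD"), is_generator("AB"))  # True False: sup(AB) = sup(A) = 4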

Frequent Itemsets: Closed, Minimal Generators and Maximal

[Figure: frequent itemset lattice with tidsets — level 1: A:1345, B:123456, C:2456, D:1356, E:12345; level 2: AB:1345, AD:135, AE:1345, BC:2456, BD:1356, BE:12345, CE:245, DE:135; level 3: ABD:135, ABE:1345, ADE:135, BCE:245, BDE:135; level 4: ABDE:135. In the original figure, itemsets boxed and shaded are closed, double-boxed are maximal, and boxed ones are minimal generators.]
Mining Maximal Frequent Itemsets: GenMax Algorithm

Mining maximal itemsets requires additional steps beyond simply determining the
frequent itemsets. Assuming that the set of maximal frequent itemsets is initially
empty, that is, M = ∅, each time we generate a new frequent itemset X, we have
to perform the following maximality checks (both are sketched in code below):

Subset Check: ∄Y ∈ M such that X ⊂ Y. If such a Y exists, then clearly
X is not maximal. Otherwise, we add X to M as a potentially maximal
itemset.

Superset Check: ∄Y ∈ M such that Y ⊂ X. If such a Y exists, then Y
cannot be maximal, and we have to remove it from M.
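
A minimal Python sketch of both checks combined (the helper name update_maximal is ours, not the book's):

def update_maximal(M, X):
    """Maintain M as a set of pairwise-incomparable, potentially maximal itemsets."""
    X = frozenset(X)
    if any(X < Y for Y in M):                      # subset check: X is subsumed
        return
    M.difference_update({Y for Y in M if Y < X})   # superset check: drop subsumed sets
    M.add(X)

M = set()
for X in ["AB", "ABD", "BCE", "ABDE"]:
    update_maximal(M, X)
print(sorted("".join(sorted(Y)) for Y in M))  # ['ABDE', 'BCE']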

GenMax Algorithm: Maximal Itemsets

GenMax is based on dEclat, i.e., it uses diffset intersections for support computation.
The initial call takes as input the set of frequent items along with their tidsets, ⟨i, t(i)⟩,
and the initially empty set of maximal itemsets, M. Given a set of itemset–tidset pairs,
called IT-pairs, of the form ⟨X, t(X)⟩, the recursive GenMax method works as follows.

If the union of all the itemsets, Y = ⋃ Xi, is already subsumed by (or contained in) some
maximal pattern Z ∈ M, then no maximal itemset can be generated from the current
branch, and it is pruned. Otherwise, we intersect each IT-pair ⟨Xi, t(Xi)⟩ with all the
other IT-pairs ⟨Xj, t(Xj)⟩, with j > i, to generate new candidates Xij, which are added to
the IT-pair set Pi.

If Pi is not empty, a recursive call to GenMax is made to find other potentially frequent
extensions of Xi. On the other hand, if Pi is empty, it means that Xi cannot be
extended, and it is potentially maximal. In this case, we add Xi to the set M, provided
that Xi is not contained in any previously added maximal set Z ∈ M.

GenMax Algorithm

// Initial Call: M ← ∅, P ← { ⟨i, t(i)⟩ | i ∈ I, sup(i) ≥ minsup }
GenMax (P, minsup, M):
1  Y ← ⋃ Xi
2  if ∃Z ∈ M, such that Y ⊆ Z then
3      return  // prune entire branch
4  foreach ⟨Xi, t(Xi)⟩ ∈ P do
5      Pi ← ∅
6      foreach ⟨Xj, t(Xj)⟩ ∈ P, with j > i do
7          Xij ← Xi ∪ Xj
8          t(Xij) ← t(Xi) ∩ t(Xj)
9          if sup(Xij) ≥ minsup then Pi ← Pi ∪ { ⟨Xij, t(Xij)⟩ }
10     if Pi ≠ ∅ then GenMax (Pi, minsup, M)
11     else if ∄Z ∈ M, Xi ⊆ Z then
12         M ← M ∪ {Xi}  // add Xi to maximal set
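
The following runnable Python sketch follows the pseudocode, with one stated simplification: it uses plain tidset intersections rather than the diffsets of dEclat, and all names are ours.

def genmax(P, minsup, M):
    """P: list of (itemset, tidset) pairs as frozensets; M: set of maximal itemsets."""
    Y = frozenset().union(*(X for X, _ in P))
    if any(Y <= Z for Z in M):              # line 2: prune the entire branch
        return
    for idx, (Xi, tXi) in enumerate(P):
        Pi = []
        for Xj, tXj in P[idx + 1:]:
            tXij = tXi & tXj                # line 8: tidset intersection
            if len(tXij) >= minsup:
                Pi.append((Xi | Xj, tXij))
        if Pi:
            genmax(Pi, minsup, M)           # line 10: recurse on extensions of Xi
        elif not any(Xi <= Z for Z in M):
            M.add(Xi)                       # line 12: Xi is maximal

D = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE", 6: "BCD"}
P = [(frozenset(it), frozenset(t for t, tr in D.items() if it in tr)) for it in "ABCDE"]
M = set()
genmax([p for p in P if len(p[1]) >= 3], 3, M)
print(sorted("".join(sorted(X)) for X in M))  # ['ABDE', 'BCE']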

Mining Maximal Frequent Itemsets

[Figure: GenMax search tree — root: A:1345, B:123456, C:2456, D:1356, E:12345; PA: AB:1345, AD:135, AE:1345; PB: BC:2456, BD:1356, BE:12345; PC: CE:245; PD: DE:135; PAB: ABD:135, ABE:1345; PAD: ADE:135; PBC: BCE:245; PBD: BDE:135; PABD: ABDE:135.]

Mining Closed Frequent Itemsets: Charm Algorithm

Mining closed frequent itemsets requires that we perform closure checks, that is,
whether X = c(X ). Direct closure checking can be very expensive.
Given a collection of IT-pairs {⟨Xi, t(Xi)⟩}, Charm uses the following three
properties:

Property (1): If t(Xi) = t(Xj), then c(Xi) = c(Xj) = c(Xi ∪ Xj), which implies that we can
replace every occurrence of Xi with Xi ∪ Xj and prune the branch under Xj,
because its closure is identical to the closure of Xi ∪ Xj.

Property (2): If t(Xi) ⊂ t(Xj), then c(Xi) ≠ c(Xj) but c(Xi) = c(Xi ∪ Xj), which means
that we can replace every occurrence of Xi with Xi ∪ Xj, but we cannot prune
Xj because it generates a different closure. Note that if t(Xi) ⊃ t(Xj), then
we simply interchange the roles of Xi and Xj.

Property (3): If t(Xi) ≠ t(Xj), then c(Xi) ≠ c(Xj) ≠ c(Xi ∪ Xj). In this case we cannot
remove either Xi or Xj, as each of them generates a different closure.

Charm Algorithm: Closed Itemsets

// Initial Call: C ← ∅, P ← { ⟨i, t(i)⟩ | i ∈ I, sup(i) ≥ minsup }
Charm (P, minsup, C):
1  Sort P in increasing order of support (i.e., by increasing |t(Xi)|)
2  foreach ⟨Xi, t(Xi)⟩ ∈ P do
3      Pi ← ∅
4      foreach ⟨Xj, t(Xj)⟩ ∈ P, with j > i do
5          Xij ← Xi ∪ Xj
6          t(Xij) ← t(Xi) ∩ t(Xj)
7          if sup(Xij) ≥ minsup then
8              if t(Xi) = t(Xj) then  // Property 1
9                  Replace Xi with Xij in P and Pi
10                 Remove ⟨Xj, t(Xj)⟩ from P
11             else
12                 if t(Xi) ⊂ t(Xj) then  // Property 2
13                     Replace Xi with Xij in P and Pi
14                 else  // Property 3
15                     Pi ← Pi ∪ { ⟨Xij, t(Xij)⟩ }
16     if Pi ≠ ∅ then Charm (Pi, minsup, C)
17     if ∄Z ∈ C, such that Xi ⊆ Z and t(Xi) = t(Z) then
18         C ← C ∪ {Xi}  // add Xi to closed set
Mining Frequent Closed Itemsets: Charm
Process A

[Figure: Charm after processing branch A — the root IT-pairs are A:1345, C:2456, D:1356, E:12345, B:123456; A is extended to AE and then AEB via Property 2, and PA contains AD:135, extended to ADE and then ADEB.]

Mining Frequent Closed Itemsets: Charm

[Figure: complete Charm search tree — root: A → AE → AEB:1345, C → CB:2456, D → DB:1356, E → EB:12345, B:123456; PA: AD → ADE → ADEB:135; PC: CE → CEB:245; PD: DE → DEB:135.]

Nonderivable Itemsets

An itemset is called nonderivable if its support cannot be deduced from the supports of
its subsets. The set of all frequent nonderivable itemsets is a summary or condensed
representation of the set of all frequent itemsets. Further, it is lossless with respect to
support, that is, the exact support of all other frequent itemsets can be deduced from it.
Generalized Itemsets: Let X be a k-itemset, that is, X = {x1, x2, ..., xk}. The k tidsets
t(xi) for each item xi ∈ X induce a partitioning of the set of all tids into 2^k regions,
where each partition contains the tids for some subset of items Y ⊆ X, but for none of
the remaining items Z = X \ Y.

Each partition is therefore the tidset of a generalized itemset Y Z̄, where Y consists of
regular items and Z̄ consists of negated items.

Define the support of a generalized itemset Y Z̄ as the number of transactions that
contain all items in Y but no item in Z:

sup(Y Z̄) = |{ t ∈ T | Y ⊆ i(t) and Z ∩ i(t) = ∅ }|
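
This definition translates directly into a one-line count; a small Python sketch on the example database (the helper name gen_sup is ours):

D = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE", 6: "BCD"}

def gen_sup(Y, Z):
    """Transactions containing every item of Y and no item of Z."""
    return sum(1 for tr in D.values() if set(Y) <= set(tr) and not set(Z) & set(tr))

print(gen_sup("C", "AD"))  # 1: only tid 2 (BCE) contains C but neither A nor D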

Tidset Partitioning Induced by t(A),t(C ), and t(D)

[Figure: Venn diagram of the tidset partitioning induced by t(A) = 1345, t(C) = 2456, and t(D) = 1356 — t(ACD̄) = 4, t(ACD) = 5, t(AC̄D) = 13, t(ĀCD̄) = 2, t(ĀCD) = 6, and t(AC̄D̄) = t(ĀC̄D) = t(ĀC̄D̄) = ∅.]

Inclusion–Exclusion Principle: Support Bounds

Let Y Z̄ be a generalized itemset, and let X = Y ∪ Z = YZ. The
inclusion–exclusion principle allows one to directly compute the support of Y Z̄ as
a combination of the supports for all itemsets W, such that Y ⊆ W ⊆ X:

sup(Y Z̄) = ∑_{Y⊆W⊆X} (−1)^(|W\Y|) · sup(W)
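
The identity can be verified mechanically; the Python sketch below (our helper names, on the same example database) evaluates the right-hand side for Y = C, X = ACD and compares it with the direct count:

from itertools import combinations

D = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE", 6: "BCD"}

def sup(X):
    return sum(1 for tr in D.values() if set(X) <= set(tr))

def gen_sup(Y, Z):
    return sum(1 for tr in D.values() if set(Y) <= set(tr) and not set(Z) & set(tr))

def ie_sup(Y, X):
    """Sum over all W with Y ⊆ W ⊆ X of (-1)^|W \\ Y| * sup(W)."""
    extra = sorted(set(X) - set(Y))
    return sum((-1) ** k * sup(set(Y) | set(ws))
               for k in range(len(extra) + 1)
               for ws in combinations(extra, k))

print(ie_sup("C", "ACD"), gen_sup("C", "AD"))  # 1 1: the two sides agree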

Inclusion–Exclusion for Support
Consider the generalized itemset ĀCD̄ = CĀD̄, where Y = C, Z = AD and
X = YZ = ACD. In the Venn diagram, we start with all the tids in t(C), and
remove the tids contained in t(AC) and t(CD). However, we realize that in terms
of support this removes sup(ACD) twice, so we need to add it back. In other
words, the support of CĀD̄ is given as

sup(CĀD̄) = sup(C) − sup(AC) − sup(CD) + sup(ACD)
          = 4 − 2 − 2 + 1 = 1

But this is precisely what the inclusion–exclusion formula gives:

sup(CĀD̄) = (−1)^0 sup(C)      W = C,   |W\Y| = 0
          + (−1)^1 sup(AC)     W = AC,  |W\Y| = 1
          + (−1)^1 sup(CD)     W = CD,  |W\Y| = 1
          + (−1)^2 sup(ACD)    W = ACD, |W\Y| = 2
          = sup(C) − sup(AC) − sup(CD) + sup(ACD)

Support Bounds

Because the support of any (generalized) itemset must be non-negative, we can
derive a bound on the support of X from each of the 2^k generalized itemsets by
setting sup(Y Z̄) ≥ 0. Thus, from the 2^k possible subsets Y ⊆ X, we derive
2^(k−1) lower bounds and 2^(k−1) upper bounds for sup(X):

Upper bounds (|X\Y| is odd):  sup(X) ≤ ∑_{Y⊆W⊂X} (−1)^(|X\W|+1) sup(W)    (1)

Lower bounds (|X\Y| is even): sup(X) ≥ ∑_{Y⊆W⊂X} (−1)^(|X\W|+1) sup(W)    (2)

Support Bounds for Subsets

[Figure: subset lattice of X = ACD — level 1 (|X\Y| = 1): AC, AD, CD, sign +1, inequality ≤; level 2 (|X\Y| = 2): A, C, D, sign −1, inequality ≥; level 3 (|X\Y| = 3): ∅, sign +1, inequality ≤.]

Support Bounds
Example

For example, if Y = C, then the inclusion–exclusion principle gives us

sup(CĀD̄) = sup(C) − sup(AC) − sup(CD) + sup(ACD)

Setting sup(CĀD̄) ≥ 0, we obtain

sup(ACD) ≥ sup(AC) + sup(CD) − sup(C)

From each of the partitions we get one bound, and out of the eight possible
regions, exactly four give upper bounds and the other four give lower bounds for
the support of ACD:

sup(ACD) ≥ 0                                            when Y = ACD
sup(ACD) ≤ sup(AC)                                      when Y = AC
sup(ACD) ≤ sup(AD)                                      when Y = AD
sup(ACD) ≤ sup(CD)                                      when Y = CD
sup(ACD) ≥ sup(AC) + sup(AD) − sup(A)                   when Y = A
sup(ACD) ≥ sup(AC) + sup(CD) − sup(C)                   when Y = C
sup(ACD) ≥ sup(AD) + sup(CD) − sup(D)                   when Y = D
sup(ACD) ≤ sup(AC) + sup(AD) + sup(CD)
           − sup(A) − sup(C) − sup(D) + sup(∅)          when Y = ∅
Nonderivable Itemsets
Given an itemset X, and Y ⊆ X, let IE(Y) denote the summation

IE(Y) = ∑_{Y⊆W⊂X} (−1)^(|X\W|+1) · sup(W)

Then, the sets of all upper and lower bounds for sup(X) are given as

UB(X) = { IE(Y) | Y ⊆ X, |X\Y| is odd }
LB(X) = { IE(Y) | Y ⊆ X, |X\Y| is even }

An itemset X is called nonderivable if max{LB(X)} ≠ min{UB(X)}, in which case we
know only the range of possible values, that is,

sup(X) ∈ [ max{LB(X)}, min{UB(X)} ]

X is derivable if sup(X) = max{LB(X)} = min{UB(X)}. Thus, the set of all frequent
nonderivable itemsets is given as

N = { X ∈ F | max{LB(X)} ≠ min{UB(X)} }

where F is the set of all frequent itemsets.
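
A runnable Python sketch of the derivability test built from these definitions (the helper names IE and bounds are ours; sup is the same direct count as in the earlier sketches):

from itertools import combinations

D = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE", 6: "BCD"}

def sup(X):
    return sum(1 for tr in D.values() if set(X) <= set(tr))

def IE(Y, X):
    """IE(Y) = sum over Y ⊆ W ⊂ X of (-1)^(|X \\ W| + 1) * sup(W)."""
    X, extra = set(X), sorted(set(X) - set(Y))
    total = 0
    for k in range(len(extra) + 1):
        for ws in combinations(extra, k):
            W = set(Y) | set(ws)
            if W != X:                      # W ranges over proper subsets of X only
                total += (-1) ** (len(X - W) + 1) * sup(W)
    return total

def bounds(X):
    LB, UB = [], []
    for k in range(len(X) + 1):
        for Y in combinations(sorted(X), k):
            side = UB if (len(X) - k) % 2 == 1 else LB   # odd |X\Y|: upper bound
            side.append(IE(Y, X))
    return max(LB), min(UB)

lo, up = bounds("ACD")
print(lo, up, "derivable" if lo == up else "nonderivable")  # 1 1 derivable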
Nonderivable Itemsets: Example
Consider the support bound formulas for sup(ACD). The lower bounds are

sup(ACD) ≥ 0
sup(ACD) ≥ sup(AC) + sup(AD) − sup(A) = 2 + 3 − 4 = 1
sup(ACD) ≥ sup(AC) + sup(CD) − sup(C) = 2 + 2 − 4 = 0
sup(ACD) ≥ sup(AD) + sup(CD) − sup(D) = 3 + 2 − 4 = 1

and the upper bounds are

sup(ACD) ≤ sup(AC) = 2
sup(ACD) ≤ sup(AD) = 3
sup(ACD) ≤ sup(CD) = 2
sup(ACD) ≤ sup(AC) + sup(AD) + sup(CD) − sup(A) − sup(C)
           − sup(D) + sup(∅) = 2 + 3 + 2 − 4 − 4 − 4 + 6 = 1

Thus, we have

LB(ACD) = {0, 1}, so max{LB(ACD)} = 1
UB(ACD) = {1, 2, 3}, so min{UB(ACD)} = 1

Because max{LB(ACD)} = min{UB(ACD)}, we conclude that ACD is derivable.
