
Data Mining and Machine Learning:

Fundamental Concepts and Algorithms


dataminingbook.info

Mohammed J. Zaki¹   Wagner Meira Jr.²

¹ Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA
² Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil

Chapter 9: Summarizing Itemsets

An Example Database

Transaction database:

Tid   Itemset
1     ABDE
2     BCE
3     ABDE
4     ABCE
5     ABCDE
6     BCD

Frequent itemsets (minsup = 3):

sup   Itemsets
6     B
5     E, BE
4     A, C, D, AB, AE, BC, BD, ABE
3     AD, CE, DE, ABD, ADE, BCE, BDE, ABDE

Maximal Frequent Itemsets

Given a binary database D ⊆ T × I, over the tids T and items I, let F denote
the set of all frequent itemsets, that is,

F = { X | X ⊆ I and sup(X) ≥ minsup }

A frequent itemset X ∈ F is called maximal if it has no frequent supersets. Let
M be the set of all maximal frequent itemsets, given as

M = { X | X ∈ F and ∄Y ⊃ X such that Y ∈ F }

The set M is a condensed representation of the set of all frequent itemsets F,
because we can determine whether any itemset X is frequent or not using M alone. If
there exists a maximal itemset Z such that X ⊆ Z, then X must be frequent;
otherwise X cannot be frequent.
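
As a concrete illustration, here is a minimal Python sketch (ours, not the book's) that recovers M from an explicitly materialized F by checking each itemset for frequent supersets; the helper name maximal_itemsets and the frozenset encoding are our own choices.

def maximal_itemsets(F):
    """Return the maximal itemsets of F (given as a collection of item strings)."""
    F = set(map(frozenset, F))
    # X is maximal iff no other frequent itemset is a strict superset of it.
    return {X for X in F if not any(X < Y for Y in F)}

# Frequent itemsets of the example database (minsup = 3).
F = ["B", "E", "BE", "A", "C", "D", "AB", "AE", "BC", "BD", "ABE",
     "AD", "CE", "DE", "ABD", "ADE", "BCE", "BDE", "ABDE"]
print(sorted("".join(sorted(X)) for X in maximal_itemsets(F)))  # ['ABDE', 'BCE']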

Closed Frequent Itemsets

Given a set of tids T and an itemset X ⊆ I, define

t(X) = { t ∈ T | t contains X }
i(T) = { x ∈ I | ∀t ∈ T, t contains x }
c(X) = i ∘ t(X) = i(t(X))

The function c is a closure operator and an itemset X is called closed if c(X) = X. It
follows that t(c(X)) = t(X). The set of all closed frequent itemsets is thus defined as

C = { X | X ∈ F and ∄Y ⊃ X such that sup(X) = sup(Y) }

That is, X is closed if all supersets of X have strictly less support: sup(X) > sup(Y) for
all Y ⊃ X.

The set of all closed frequent itemsets C is a condensed representation, as we can
determine whether an itemset X is frequent, as well as the exact support of X, using C
alone.
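
The closure operator is easy to compute directly on the example database; the following Python sketch (the helper names t, i, and c are ours) mirrors the definitions above.

D = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE", 6: "BCD"}

def t(X):
    """Tidset of X: tids of all transactions that contain X."""
    return {tid for tid, items in D.items() if set(X) <= set(items)}

def i(T):
    """Itemset common to all transactions in tidset T (all items if T is empty)."""
    return set.intersection(*(set(D[tid]) for tid in T)) if T else set("ABCDE")

def c(X):
    """Closure operator c = i ∘ t."""
    return i(t(X))

print("".join(sorted(c("AD"))))  # ABDE: AD is not closed
print("".join(sorted(c("BE"))))  # BE: BE is closed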

Minimal Generators

A frequent itemset X is a minimal generator if it has no subsets with the same
support:

G = { X | X ∈ F and ∄Y ⊂ X such that sup(X) = sup(Y) }

In other words, all subsets of X have strictly higher support, that is,
sup(X) < sup(Y), for all Y ⊂ X.

Given an equivalence class of itemsets that have the same tidset, a closed itemset
is the unique maximum element of the class, whereas the minimal generators are
the minimal elements of the class.
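
A corresponding Python check for minimal generators (again a sketch with our own helper names): since support is anti-monotone, it suffices to compare X against its immediate (k−1)-subsets.

from itertools import combinations

D = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE", 6: "BCD"}

def sup(X):
    return sum(1 for items in D.values() if set(X) <= set(items))

def is_generator(X):
    """X is a minimal generator iff every immediate subset has strictly higher support."""
    return all(sup(Y) > sup(X) for Y in combinations(X, len(X) - 1))

print(is_generator("AD"), is_generator("AB"))  # True False: sup(AB) = sup(A) = 4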

Frequent Itemsets: Closed, Minimal Generators and Maximal

[Figure: frequent itemset lattice with tidsets — level 1: A:1345, B:123456, C:2456, D:1356, E:12345; level 2: AB:1345, AD:135, AE:1345, BC:2456, BD:1356, BE:12345, CE:245, DE:135; level 3: ABD:135, ABE:1345, ADE:135, BCE:245, BDE:135; level 4: ABDE:135. In the original figure, itemsets boxed and shaded are closed, double-boxed are maximal, and boxed ones are minimal generators.]
Mining Maximal Frequent Itemsets: GenMax Algorithm

Mining maximal itemsets requires additional steps beyond simply determining the
frequent itemsets. Assuming that the set of maximal frequent itemsets is initially
empty, that is, M = ∅, each time we generate a new frequent itemset X, we have
to perform the following maximality checks (both are sketched in code below):

Subset Check: ∄Y ∈ M such that X ⊂ Y. If such a Y exists, then clearly
X is not maximal. Otherwise, we add X to M as a potentially maximal
itemset.

Superset Check: ∄Y ∈ M such that Y ⊂ X. If such a Y exists, then Y
cannot be maximal, and we have to remove it from M.
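
A minimal Python sketch of both checks combined (the helper name update_maximal is ours, not the book's):

def update_maximal(M, X):
    """Maintain M as a set of pairwise-incomparable, potentially maximal itemsets."""
    X = frozenset(X)
    if any(X < Y for Y in M):                      # subset check: X is subsumed
        return
    M.difference_update({Y for Y in M if Y < X})   # superset check: drop subsumed sets
    M.add(X)

M = set()
for X in ["AB", "ABD", "BCE", "ABDE"]:
    update_maximal(M, X)
print(sorted("".join(sorted(Y)) for Y in M))  # ['ABDE', 'BCE']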

GenMax Algorithm: Maximal Itemsets

GenMax is based on dEclat, i.e., it uses diffset intersections for support computation.
The initial call takes as input the set of frequent items along with their tidsets, ⟨i, t(i)⟩,
and the initially empty set of maximal itemsets, M. Given a set of itemset–tidset pairs,
called IT-pairs, of the form ⟨X, t(X)⟩, the recursive GenMax method works as follows.

If the union of all the itemsets, Y = ⋃ Xi, is already subsumed by (or contained in) some
maximal pattern Z ∈ M, then no maximal itemset can be generated from the current
branch, and it is pruned. Otherwise, we intersect each IT-pair ⟨Xi, t(Xi)⟩ with all the
other IT-pairs ⟨Xj, t(Xj)⟩, with j > i, to generate new candidates Xij, which are added to
the IT-pair set Pi.

If Pi is not empty, a recursive call to GenMax is made to find other potentially frequent
extensions of Xi. On the other hand, if Pi is empty, it means that Xi cannot be
extended, and it is potentially maximal. In this case, we add Xi to the set M, provided
that Xi is not contained in any previously added maximal set Z ∈ M.

GenMax Algorithm

// Initial Call: M ← ∅, P ← { ⟨i, t(i)⟩ | i ∈ I, sup(i) ≥ minsup }
GenMax (P, minsup, M):
1  Y ← ⋃ Xi
2  if ∃Z ∈ M, such that Y ⊆ Z then
3      return  // prune entire branch
4  foreach ⟨Xi, t(Xi)⟩ ∈ P do
5      Pi ← ∅
6      foreach ⟨Xj, t(Xj)⟩ ∈ P, with j > i do
7          Xij ← Xi ∪ Xj
8          t(Xij) ← t(Xi) ∩ t(Xj)
9          if sup(Xij) ≥ minsup then Pi ← Pi ∪ { ⟨Xij, t(Xij)⟩ }
10     if Pi ≠ ∅ then GenMax (Pi, minsup, M)
11     else if ∄Z ∈ M, Xi ⊆ Z then
12         M ← M ∪ {Xi}  // add Xi to maximal set
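
The following runnable Python sketch follows the pseudocode, with one stated simplification: it uses plain tidset intersections rather than the diffsets of dEclat, and all names are ours.

def genmax(P, minsup, M):
    """P: list of (itemset, tidset) pairs as frozensets; M: set of maximal itemsets."""
    Y = frozenset().union(*(X for X, _ in P))
    if any(Y <= Z for Z in M):              # line 2: prune the entire branch
        return
    for idx, (Xi, tXi) in enumerate(P):
        Pi = []
        for Xj, tXj in P[idx + 1:]:
            tXij = tXi & tXj                # line 8: tidset intersection
            if len(tXij) >= minsup:
                Pi.append((Xi | Xj, tXij))
        if Pi:
            genmax(Pi, minsup, M)           # line 10: recurse on extensions of Xi
        elif not any(Xi <= Z for Z in M):
            M.add(Xi)                       # line 12: Xi is maximal

D = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE", 6: "BCD"}
P = [(frozenset(it), frozenset(t for t, tr in D.items() if it in tr)) for it in "ABCDE"]
M = set()
genmax([p for p in P if len(p[1]) >= 3], 3, M)
print(sorted("".join(sorted(X)) for X in M))  # ['ABDE', 'BCE']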

Mining Maximal Frequent Itemsets

[Figure: GenMax search tree — root: A:1345, B:123456, C:2456, D:1356, E:12345; PA: AB:1345, AD:135, AE:1345; PB: BC:2456, BD:1356, BE:12345; PC: CE:245; PD: DE:135; PAB: ABD:135, ABE:1345; PAD: ADE:135; PBC: BCE:245; PBD: BDE:135; PABD: ABDE:135.]

Mining Closed Frequent Itemsets: Charm Algorithm

Mining closed frequent itemsets requires that we perform closure checks, that is,
whether X = c(X ). Direct closure checking can be very expensive.
Given a collection of IT-pairs {⟨Xi, t(Xi)⟩}, Charm uses the following three
properties:

Property (1): If t(Xi) = t(Xj), then c(Xi) = c(Xj) = c(Xi ∪ Xj), which implies that we can
replace every occurrence of Xi with Xi ∪ Xj and prune the branch under Xj,
because its closure is identical to the closure of Xi ∪ Xj.

Property (2): If t(Xi) ⊂ t(Xj), then c(Xi) ≠ c(Xj) but c(Xi) = c(Xi ∪ Xj), which means
that we can replace every occurrence of Xi with Xi ∪ Xj, but we cannot prune
Xj because it generates a different closure. Note that if t(Xi) ⊃ t(Xj), then
we simply interchange the roles of Xi and Xj.

Property (3): If t(Xi) ≠ t(Xj), then c(Xi) ≠ c(Xj) ≠ c(Xi ∪ Xj). In this case we cannot
remove either Xi or Xj, as each of them generates a different closure.

Charm Algorithm: Closed Itemsets

// Initial Call: C ← ∅, P ← { ⟨i, t(i)⟩ | i ∈ I, sup(i) ≥ minsup }
Charm (P, minsup, C):
1  Sort P in increasing order of support (i.e., by increasing |t(Xi)|)
2  foreach ⟨Xi, t(Xi)⟩ ∈ P do
3      Pi ← ∅
4      foreach ⟨Xj, t(Xj)⟩ ∈ P, with j > i do
5          Xij ← Xi ∪ Xj
6          t(Xij) ← t(Xi) ∩ t(Xj)
7          if sup(Xij) ≥ minsup then
8              if t(Xi) = t(Xj) then  // Property 1
9                  Replace Xi with Xij in P and Pi
10                 Remove ⟨Xj, t(Xj)⟩ from P
11             else
12                 if t(Xi) ⊂ t(Xj) then  // Property 2
13                     Replace Xi with Xij in P and Pi
14                 else  // Property 3
15                     Pi ← Pi ∪ { ⟨Xij, t(Xij)⟩ }
16     if Pi ≠ ∅ then Charm (Pi, minsup, C)
17     if ∄Z ∈ C, such that Xi ⊆ Z and t(Xi) = t(Z) then
18         C ← C ∪ {Xi}  // add Xi to closed set
Mining Frequent Closed Itemsets: Charm
Process A

[Figure: Charm after processing branch A — the root IT-pairs are A:1345, C:2456, D:1356, E:12345, B:123456; A is extended to AE and then AEB via Property 2, and PA contains AD:135, extended to ADE and then ADEB.]

Mining Frequent Closed Itemsets: Charm

[Figure: complete Charm search tree — root: A → AE → AEB:1345, C → CB:2456, D → DB:1356, E → EB:12345, B:123456; PA: AD → ADE → ADEB:135; PC: CE → CEB:245; PD: DE → DEB:135.]

Nonderivable Itemsets

An itemset is called nonderivable if its support cannot be deduced from the supports of
its subsets. The set of all frequent nonderivable itemsets is a summary or condensed
representation of the set of all frequent itemsets. Further, it is lossless with respect to
support, that is, the exact support of all other frequent itemsets can be deduced from it.
Generalized Itemsets: Let X be a k-itemset, that is, X = {x1, x2, ..., xk}. The k tidsets
t(xi) for each item xi ∈ X induce a partitioning of the set of all tids into 2^k regions,
where each partition contains the tids for some subset of items Y ⊆ X, but for none of
the remaining items Z = X \ Y.

Each partition is therefore the tidset of a generalized itemset Y Z̄, where Y consists of
regular items and Z̄ consists of negated items.

Define the support of a generalized itemset Y Z̄ as the number of transactions that
contain all items in Y but no item in Z:

sup(Y Z̄) = |{ t ∈ T | Y ⊆ i(t) and Z ∩ i(t) = ∅ }|
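
This definition translates directly into a one-line count; a small Python sketch on the example database (the helper name gen_sup is ours):

D = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE", 6: "BCD"}

def gen_sup(Y, Z):
    """Transactions containing every item of Y and no item of Z."""
    return sum(1 for tr in D.values() if set(Y) <= set(tr) and not set(Z) & set(tr))

print(gen_sup("C", "AD"))  # 1: only tid 2 (BCE) contains C but neither A nor D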

Tidset Partitioning Induced by t(A),t(C ), and t(D)

[Figure: Venn diagram of the tidset partitioning induced by t(A) = 1345, t(C) = 2456, and t(D) = 1356 — t(ACD̄) = 4, t(ACD) = 5, t(AC̄D) = 13, t(ĀCD̄) = 2, t(ĀCD) = 6, and t(AC̄D̄) = t(ĀC̄D) = t(ĀC̄D̄) = ∅.]

Inclusion–Exclusion Principle: Support Bounds

Let Y Z̄ be a generalized itemset, and let X = Y ∪ Z = YZ. The
inclusion–exclusion principle allows one to directly compute the support of Y Z̄ as
a combination of the supports for all itemsets W, such that Y ⊆ W ⊆ X:

sup(Y Z̄) = ∑_{Y⊆W⊆X} (−1)^(|W\Y|) · sup(W)
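
The identity can be verified mechanically; the Python sketch below (our helper names, on the same example database) evaluates the right-hand side for Y = C, X = ACD and compares it with the direct count:

from itertools import combinations

D = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE", 6: "BCD"}

def sup(X):
    return sum(1 for tr in D.values() if set(X) <= set(tr))

def gen_sup(Y, Z):
    return sum(1 for tr in D.values() if set(Y) <= set(tr) and not set(Z) & set(tr))

def ie_sup(Y, X):
    """Sum over all W with Y ⊆ W ⊆ X of (-1)^|W \\ Y| * sup(W)."""
    extra = sorted(set(X) - set(Y))
    return sum((-1) ** k * sup(set(Y) | set(ws))
               for k in range(len(extra) + 1)
               for ws in combinations(extra, k))

print(ie_sup("C", "ACD"), gen_sup("C", "AD"))  # 1 1: the two sides agree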

Inclusion–Exclusion for Support
Consider the generalized itemset ĀCD̄ = CĀD̄, where Y = C, Z = AD and
X = YZ = ACD. In the Venn diagram, we start with all the tids in t(C), and
remove the tids contained in t(AC) and t(CD). However, we realize that in terms
of support this removes sup(ACD) twice, so we need to add it back. In other
words, the support of CĀD̄ is given as

sup(CĀD̄) = sup(C) − sup(AC) − sup(CD) + sup(ACD)
          = 4 − 2 − 2 + 1 = 1

But this is precisely what the inclusion–exclusion formula gives:

sup(CĀD̄) = (−1)^0 sup(C)      W = C,   |W\Y| = 0
          + (−1)^1 sup(AC)     W = AC,  |W\Y| = 1
          + (−1)^1 sup(CD)     W = CD,  |W\Y| = 1
          + (−1)^2 sup(ACD)    W = ACD, |W\Y| = 2
          = sup(C) − sup(AC) − sup(CD) + sup(ACD)

Support Bounds

Because the support of any (generalized) itemset must be non-negative, we can
derive a bound on the support of X from each of the 2^k generalized itemsets by
setting sup(Y Z̄) ≥ 0. Thus, from the 2^k possible subsets Y ⊆ X, we derive
2^(k−1) lower bounds and 2^(k−1) upper bounds for sup(X):

Upper bounds (|X\Y| is odd):  sup(X) ≤ ∑_{Y⊆W⊂X} (−1)^(|X\W|+1) sup(W)    (1)

Lower bounds (|X\Y| is even): sup(X) ≥ ∑_{Y⊆W⊂X} (−1)^(|X\W|+1) sup(W)    (2)

Support Bounds for Subsets

[Figure: subset lattice of X = ACD — level 1 (|X\Y| = 1): AC, AD, CD, sign +1, inequality ≤; level 2 (|X\Y| = 2): A, C, D, sign −1, inequality ≥; level 3 (|X\Y| = 3): ∅, sign +1, inequality ≤.]

Support Bounds
Example

For example, if Y = C, then the inclusion–exclusion principle gives us

sup(CĀD̄) = sup(C) − sup(AC) − sup(CD) + sup(ACD)

Setting sup(CĀD̄) ≥ 0, we obtain

sup(ACD) ≥ sup(AC) + sup(CD) − sup(C)

From each of the partitions we get one bound, and out of the eight possible
regions, exactly four give upper bounds and the other four give lower bounds for
the support of ACD:

sup(ACD) ≥ 0                                            when Y = ACD
sup(ACD) ≤ sup(AC)                                      when Y = AC
sup(ACD) ≤ sup(AD)                                      when Y = AD
sup(ACD) ≤ sup(CD)                                      when Y = CD
sup(ACD) ≥ sup(AC) + sup(AD) − sup(A)                   when Y = A
sup(ACD) ≥ sup(AC) + sup(CD) − sup(C)                   when Y = C
sup(ACD) ≥ sup(AD) + sup(CD) − sup(D)                   when Y = D
sup(ACD) ≤ sup(AC) + sup(AD) + sup(CD)
           − sup(A) − sup(C) − sup(D) + sup(∅)          when Y = ∅
Nonderivable Itemsets
Given an itemset X, and Y ⊆ X, let IE(Y) denote the summation

IE(Y) = ∑_{Y⊆W⊂X} (−1)^(|X\W|+1) · sup(W)

Then, the sets of all upper and lower bounds for sup(X) are given as

UB(X) = { IE(Y) | Y ⊆ X, |X\Y| is odd }
LB(X) = { IE(Y) | Y ⊆ X, |X\Y| is even }

An itemset X is called nonderivable if max{LB(X)} ≠ min{UB(X)}, in which case we
know only the range of possible values, that is,

sup(X) ∈ [ max{LB(X)}, min{UB(X)} ]

X is derivable if sup(X) = max{LB(X)} = min{UB(X)}. Thus, the set of all frequent
nonderivable itemsets is given as

N = { X ∈ F | max{LB(X)} ≠ min{UB(X)} }

where F is the set of all frequent itemsets.
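
A runnable Python sketch of the derivability test built from these definitions (the helper names IE and bounds are ours; sup is the same direct count as in the earlier sketches):

from itertools import combinations

D = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE", 6: "BCD"}

def sup(X):
    return sum(1 for tr in D.values() if set(X) <= set(tr))

def IE(Y, X):
    """IE(Y) = sum over Y ⊆ W ⊂ X of (-1)^(|X \\ W| + 1) * sup(W)."""
    X, extra = set(X), sorted(set(X) - set(Y))
    total = 0
    for k in range(len(extra) + 1):
        for ws in combinations(extra, k):
            W = set(Y) | set(ws)
            if W != X:                      # W ranges over proper subsets of X only
                total += (-1) ** (len(X - W) + 1) * sup(W)
    return total

def bounds(X):
    LB, UB = [], []
    for k in range(len(X) + 1):
        for Y in combinations(sorted(X), k):
            side = UB if (len(X) - k) % 2 == 1 else LB   # odd |X\Y|: upper bound
            side.append(IE(Y, X))
    return max(LB), min(UB)

lo, up = bounds("ACD")
print(lo, up, "derivable" if lo == up else "nonderivable")  # 1 1 derivable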
Nonderivable Itemsets: Example
Consider the support bound formulas for sup(ACD). The lower bounds are

sup(ACD) ≥ 0
sup(ACD) ≥ sup(AC) + sup(AD) − sup(A) = 2 + 3 − 4 = 1
sup(ACD) ≥ sup(AC) + sup(CD) − sup(C) = 2 + 2 − 4 = 0
sup(ACD) ≥ sup(AD) + sup(CD) − sup(D) = 3 + 2 − 4 = 1

and the upper bounds are

sup(ACD) ≤ sup(AC) = 2
sup(ACD) ≤ sup(AD) = 3
sup(ACD) ≤ sup(CD) = 2
sup(ACD) ≤ sup(AC) + sup(AD) + sup(CD) − sup(A) − sup(C)
           − sup(D) + sup(∅) = 2 + 3 + 2 − 4 − 4 − 4 + 6 = 1

Thus, we have

LB(ACD) = {0, 1}, so max{LB(ACD)} = 1
UB(ACD) = {1, 2, 3}, so min{UB(ACD)} = 1

Because max{LB(ACD)} = min{UB(ACD)}, we conclude that ACD is derivable.
