1 Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA
2 Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
Zaki & Meira Jr. (RPI and UFMG) Data Mining and Machine Learning Chapter 9: Summarizing Itemsets 1 / 24
An Example Database
Transaction database:

Tid   Itemset
1     ABDE
2     BCE
3     ABDE
4     ABCE
5     ABCDE
6     BCD

Frequent itemsets (minsup = 3):

sup   Itemsets
6     B
5     E, BE
4     A, C, D, AB, AE, BC, BD, ABE
3     AD, CE, DE, ABD, ADE, BCE, BDE, ABDE
Maximal Frequent Itemsets
Given a binary database D ⊆ T × I over the tids T and items I, let F denote the set of all frequent itemsets, that is,

F = {X | X ⊆ I and sup(X) ≥ minsup}

A frequent itemset is maximal if it has no frequent superset. The set of all maximal frequent itemsets is

M = {X | X ∈ F and ∄Y ⊃ X such that Y ∈ F}
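The two definitions can be checked directly on the example database. The following is a brute-force sketch (not one of the mining algorithms discussed later): it enumerates F and then keeps only the itemsets with no frequent proper superset.

```python
# Brute-force computation of F and M on the example database (minsup = 3).
from itertools import combinations

db = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE", 6: "BCD"}
items = sorted(set().union(*db.values()))
minsup = 3

def sup(X):
    """Number of transactions containing every item of X."""
    return sum(1 for its in db.values() if set(X) <= set(its))

# F: all non-empty itemsets with support at least minsup
F = [frozenset(cand) for k in range(1, len(items) + 1)
     for cand in combinations(items, k) if sup(cand) >= minsup]

# M: frequent itemsets with no frequent proper superset
M = [X for X in F if not any(X < Y for Y in F)]
print(sorted("".join(sorted(X)) for X in M))  # ['ABDE', 'BCE']
```

This matches the example: every one of the 19 frequent itemsets is a subset of ABDE or BCE.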
Closed Frequent Itemsets
t(X) = {t ∈ T | t contains X}
i(T) = {x ∈ I | every t ∈ T contains x}
c(X) = i ∘ t(X) = i(t(X))

C = {X | X ∈ F and ∄Y ⊃ X such that sup(X) = sup(Y)}

X is closed if c(X) = X, equivalently if all supersets of X have strictly less support, that is, sup(X) > sup(Y) for all Y ⊃ X.

The set of all closed frequent itemsets C is a condensed representation: using C alone, we can determine whether any itemset X is frequent and, if so, its exact support.
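The operators t, i, and the closure c are easy to implement directly. A small sketch on the example database (function names mirror the text; the convention i(∅) = I handles the empty tidset):

```python
# The operators t(X), i(T), and the closure c(X) = i(t(X)).
db = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE", 6: "BCD"}
all_items = set().union(*db.values())

def t(X):
    """Tidset: tids of all transactions containing every item of X."""
    return {tid for tid, its in db.items() if set(X) <= set(its)}

def i(T):
    """Itemset contained in every transaction of the tidset T."""
    return set.intersection(*(set(db[tid]) for tid in T)) if T else set(all_items)

def c(X):
    """Closure: the smallest closed itemset containing X."""
    return i(t(X))

print("".join(sorted(c("AD"))))   # ABDE -> AD is not closed
print("".join(sorted(c("BCE"))))  # BCE  -> BCE is closed
```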
Minimal Generators
G = {X | X ∈ F and ∄Y ⊂ X such that sup(X) = sup(Y)}

In other words, all proper subsets of X have strictly higher support, that is, sup(X) < sup(Y) for all Y ⊂ X.

Given an equivalence class of itemsets that have the same tidset, the closed itemset is the unique maximum element of the class, whereas the minimal generators are the minimal elements of the class.
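The equivalence-class view can be made concrete with a brute-force sketch (for illustration only, not an efficient algorithm): group the frequent itemsets by tidset, then pick the maximum and the minimal elements of each class.

```python
# Group frequent itemsets by tidset; in each class the closed itemset is the
# unique maximum, the minimal generators are the minimal elements.
from collections import defaultdict
from itertools import combinations

db = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE", 6: "BCD"}
items = sorted(set().union(*db.values()))
minsup = 3

def t(X):
    """Tidset of X."""
    return frozenset(tid for tid, its in db.items() if set(X) <= set(its))

classes = defaultdict(list)          # tidset -> list of itemsets
for k in range(1, len(items) + 1):
    for cand in combinations(items, k):
        X = frozenset(cand)
        if len(t(X)) >= minsup:
            classes[t(X)].append(X)

for tids, members in sorted(classes.items(), key=lambda kv: sorted(kv[0])):
    closed = max(members, key=len)   # unique maximum of the class
    gens = [X for X in members if not any(Y < X for Y in members)]
    print(sorted(tids), "closed:", "".join(sorted(closed)),
          "generators:", sorted("".join(sorted(g)) for g in gens))
```

For instance, the class with tidset 135 contains AD, DE, ABD, ADE, BDE, and ABDE; its closed itemset is ABDE and its minimal generators are AD and DE.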
Frequent Itemsets: Closed, Minimal Generators and Maximal
[Figure: lattice of the frequent itemsets with their tidsets: A → 1345, B → 123456, D → 1356, E → 12345, C → 2456; AD → 135, DE → 135, AB → 1345, AE → 1345, BD → 1356, BE → 12345, BC → 2456, CE → 245; ABDE → 135. Itemsets boxed and shaded are closed, double boxed are maximal, and boxed are minimal generators.]
Mining Maximal Frequent Itemsets: GenMax Algorithm
Mining maximal itemsets requires additional steps beyond simply determining the frequent itemsets. Assuming that the set of maximal frequent itemsets is initially empty, that is, M = ∅, each time we generate a new frequent itemset X, we have to perform the following maximality checks:

Subset check: ∄Y ∈ M such that X ⊂ Y. If such a Y exists, then clearly X is not maximal. Otherwise, we add X to M as a potentially maximal itemset.

Superset check: ∄Y ∈ M such that Y ⊂ X. If such a Y exists, then Y cannot be maximal, and we have to remove it from M.
GenMax Algorithm: Maximal Itemsets
GenMax is based on dEclat, i.e., it uses diffset intersections for support computation. The initial call takes as input the set of frequent items along with their tidsets, ⟨i, t(i)⟩, and the initially empty set of maximal itemsets, M. Given a set of itemset–tidset pairs (IT-pairs) of the form ⟨X, t(X)⟩, the recursive GenMax method works as follows.

If the union of all the itemsets, Y = ∪i Xi, is already subsumed by (contained in) some maximal pattern Z ∈ M, then no maximal itemset can be generated from the current branch, and it is pruned. Otherwise, we intersect each IT-pair ⟨Xi, t(Xi)⟩ with all the other IT-pairs ⟨Xj, t(Xj)⟩, with j > i, to generate new candidates Xij, which are added to the IT-pair set Pi.

If Pi is not empty, a recursive call to GenMax is made to find other potentially frequent extensions of Xi. On the other hand, if Pi is empty, it means that Xi cannot be extended, and it is potentially maximal. In this case, we add Xi to the set M, provided that Xi is not contained in any previously added maximal set Z ∈ M.
GenMax Algorithm
// Initial Call: M ← ∅, P ← {⟨i, t(i)⟩ | i ∈ I, sup(i) ≥ minsup}
GenMax (P, minsup, M):
1  Y ← ∪i Xi
2  if ∃Z ∈ M, such that Y ⊆ Z then
3      return // prune entire branch
4  foreach ⟨Xi, t(Xi)⟩ ∈ P do
5      Pi ← ∅
6      foreach ⟨Xj, t(Xj)⟩ ∈ P, with j > i do
7          Xij ← Xi ∪ Xj
8          t(Xij) ← t(Xi) ∩ t(Xj)
9          if sup(Xij) ≥ minsup then Pi ← Pi ∪ {⟨Xij, t(Xij)⟩}
10     if Pi ≠ ∅ then GenMax (Pi, minsup, M)
11     else if ∄Z ∈ M, Xi ⊆ Z then
12         M ← M ∪ {Xi} // add Xi to maximal set
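The pseudocode translates into a short runnable sketch. For simplicity this version intersects tidsets directly rather than using the diffsets of dEclat; variable names follow the pseudocode, everything else is an implementation choice of mine.

```python
# GenMax sketch with plain tidset intersections (minsup = 3).
db = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE", 6: "BCD"}
minsup = 3

def genmax(P, M):
    # lines 1-3: prune the branch if the union of its itemsets is subsumed
    Y = frozenset().union(*(X for X, _ in P))
    if any(Y <= Z for Z in M):
        return
    for i, (Xi, tXi) in enumerate(P):
        Pi = []
        for Xj, tXj in P[i + 1:]:                 # lines 6-9
            Xij, tXij = Xi | Xj, tXi & tXj
            if len(tXij) >= minsup:
                Pi.append((Xij, tXij))
        if Pi:                                    # line 10
            genmax(Pi, M)
        elif not any(Xi <= Z for Z in M):         # lines 11-12
            M.append(Xi)

items = sorted(set().union(*db.values()))
P = [(frozenset(it), frozenset(t for t, its in db.items() if it in its))
     for it in items]
M = []
genmax([(X, T) for X, T in P if len(T) >= minsup], M)
print(sorted("".join(sorted(X)) for X in M))  # ['ABDE', 'BCE']
```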
Mining Maximal Frequent Itemsets
[Figure: GenMax search tree. Level 1: A → 1345, B → 123456, C → 2456, D → 1356, E → 12345. Level 2 (candidate sets PA, PB, PC, PD): AB → 1345, AD → 135, AE → 1345, BC → 2456, BD → 1356, BE → 12345, CE → 245, DE → 135. Level 3 (PABD): ABDE → 135.]
Mining Closed Frequent Itemsets: Charm Algorithm
Mining closed frequent itemsets requires that we perform closure checks, that is, whether X = c(X). Direct closure checking can be very expensive.

Given any two IT-pairs ⟨Xi, t(Xi)⟩ and ⟨Xj, t(Xj)⟩, Charm uses the following three properties:

Property (1): If t(Xi) = t(Xj), then c(Xi) = c(Xj) = c(Xi ∪ Xj), which implies that we can replace every occurrence of Xi with Xi ∪ Xj and prune Xj from further consideration.

Property (2): If t(Xi) ⊂ t(Xj), then c(Xi) ≠ c(Xj) but c(Xi) = c(Xi ∪ Xj), so we can replace every occurrence of Xi with Xi ∪ Xj, but we cannot prune Xj, since it leads to a different closure.

Property (3): If neither tidset contains the other, then nothing can be eliminated; Xi and Xj lead to different closed itemsets.
Charm Algorithm: Closed Itemsets
// Initial Call: C ← ∅, P ← {⟨i, t(i)⟩ | i ∈ I, sup(i) ≥ minsup}
Charm (P, minsup, C):
1  Sort P in increasing order of support (i.e., by increasing |t(Xi)|)
2  foreach ⟨Xi, t(Xi)⟩ ∈ P do
3      Pi ← ∅
4      foreach ⟨Xj, t(Xj)⟩ ∈ P, with j > i do
5          Xij ← Xi ∪ Xj
6          t(Xij) ← t(Xi) ∩ t(Xj)
7          if sup(Xij) ≥ minsup then
8              if t(Xi) = t(Xj) then // Property 1
9                  Replace Xi with Xij in P and Pi
10                 Remove ⟨Xj, t(Xj)⟩ from P
11             else
12                 if t(Xi) ⊂ t(Xj) then // Property 2
13                     Replace Xi with Xij in P and Pi
14                 else // Property 3
15                     Pi ← Pi ∪ {⟨Xij, t(Xij)⟩}
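The pseudocode above shows only the candidate processing. The following runnable sketch fills in the recursion on Pi and the final subsumption check before an itemset is declared closed; those two parts follow the standard Charm design and are my assumptions, not a verbatim transcription of the slide.

```python
# Charm sketch: closed frequent itemsets via the three properties (minsup = 3).
db = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE", 6: "BCD"}
minsup = 3

def charm(P, C):
    P = sorted(P, key=lambda p: len(p[1]))       # line 1: increasing support
    while P:
        (Xi, tXi), P = P[0], P[1:]
        Pi, rest = [], []
        for Xj, tXj in P:
            Xij, tXij = Xi | Xj, tXi & tXj
            if len(tXij) >= minsup:
                if tXi == tXj:                   # Property 1: absorb Xj, drop it
                    Xi = Xij
                    Pi = [(X | Xj, tX) for X, tX in Pi]
                elif tXi < tXj:                  # Property 2: extend Xi, keep Xj
                    Xi = Xij
                    Pi = [(X | Xj, tX) for X, tX in Pi]
                    rest.append((Xj, tXj))
                else:                            # Property 3: new candidate
                    Pi.append((Xij, tXij))
                    rest.append((Xj, tXj))
            else:
                rest.append((Xj, tXj))
        P = rest
        if Pi:
            charm(Pi, C)
        # closedness check: skip Xi if a known closed superset shares its tidset
        if not any(Xi <= Z and tXi == tZ for Z, tZ in C):
            C.append((Xi, tXi))

items = sorted(set().union(*db.values()))
P = [(frozenset(it), frozenset(t for t, its in db.items() if it in its))
     for it in items]
C = []
charm([(X, T) for X, T in P if len(T) >= minsup], C)
print(sorted("".join(sorted(X)) for X, _ in C))
```

On the example database this yields the seven closed frequent itemsets B, BE, BD, BC, ABE, BCE, and ABDE.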
Mining Frequent Closed Itemsets: Charm
[Figure: Charm after processing branch A. The prefix A is extended to AE and then AEB (tidset 1345); the remaining items are C → 2456, D → 1356, E → 12345, B → 123456. Within PA, AD is extended to ADE and then ADEB (tidset 135).]
Mining Frequent Closed Itemsets: Charm
[Figure: final Charm search tree. Branches A → AE → AEB (1345), C → CB (2456), D → DB (1356), E → EB (12345), and B (123456), with candidate sets PA, PC, PD.]
Nonderivable Itemsets
An itemset is called nonderivable if its support cannot be deduced from the supports of
its subsets. The set of all frequent nonderivable itemsets is a summary or condensed
representation of the set of all frequent itemsets. Further, it is lossless with respect to
support, that is, the exact support of all other frequent itemsets can be deduced from it.
Generalized Itemsets: Let X be a k-itemset, that is, X = {x1, x2, . . . , xk}. The k tidsets t(xi), one per item xi ∈ X, induce a partitioning of the set of all tids into 2^k regions, where each partition contains the tids of those transactions that contain some subset of items Y ⊆ X but none of the remaining items Z = X \ Y.

Each partition is therefore the tidset of a generalized itemset Y Z̄, where Y consists of regular items and Z̄ consists of negated items.

Define the support of a generalized itemset Y Z̄ as the number of transactions that contain all items in Y but no item in Z:

sup(Y Z̄) = |{t ∈ T | Y ⊆ i(t) and Z ∩ i(t) = ∅}|
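The definition translates into a one-line count. A sketch on the example database (the helper name is mine):

```python
# Support of a generalized itemset Y Z̄: all items of Y present, none of Z.
db = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE", 6: "BCD"}

def sup_gen(Y, Z):
    """sup(Y Z̄): count transactions containing all of Y and no item of Z."""
    return sum(1 for its in db.values()
               if set(Y) <= set(its) and not set(Z) & set(its))

print(sup_gen("C", "AD"))  # 1: only transaction 2 has C without A or D
```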
Tidset Partitioning Induced by t(A),t(C ), and t(D)
[Figure: Venn diagram of t(A), t(C), and t(D) for the transaction database (1: ABDE, 2: BCE, 3: ABDE, 4: ABCE, 5: ABCDE, 6: BCD). The induced partitions are: t(ACD̄) = 4, t(AC̄D̄) = ∅, t(ĀCD̄) = 2, t(ACD) = 5, t(AC̄D) = 13, t(ĀCD) = 6, t(ĀC̄D) = ∅, t(ĀC̄D̄) = ∅.]
Inclusion–Exclusion Principle: Support Bounds
sup(Y Z̄) = ∑_{Y ⊆ W ⊆ X} (−1)^{|W \ Y|} · sup(W)
Inclusion–Exclusion for Support
Consider the generalized itemset ĀCD̄ = CĀD̄, where Y = C, Z = AD, and X = Y ∪ Z = ACD. In the Venn diagram, we start with all the tids in t(C), and remove the tids contained in t(AC) and t(CD). However, we realize that in terms of support this removes sup(ACD) twice, so we need to add it back. In other words, the support of CĀD̄ is given as

sup(CĀD̄) = sup(C) − sup(AC) − sup(CD) + sup(ACD) = 4 − 2 − 2 + 1 = 1
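As a sanity check of the inclusion–exclusion identity, the following sketch compares the alternating sum over all W with Y ⊆ W ⊆ X against a direct count, for Y = C and Z = AD on the example database:

```python
# Verify sup(Y Z̄) = sum over Y ⊆ W ⊆ X of (-1)^{|W\Y|} sup(W).
from itertools import combinations

db = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE", 6: "BCD"}

def sup(X):
    """Support of an ordinary itemset X."""
    return sum(1 for its in db.values() if set(X) <= set(its))

Y, Z = set("C"), set("AD")
X = Y | Z
# inclusion-exclusion sum over W = C, AC, CD, ACD
total = sum((-1) ** len(set(W) - Y) * sup(W)
            for k in range(len(X) + 1)
            for W in combinations(sorted(X), k) if Y <= set(W))
# direct count: transactions with all of Y and none of Z
direct = sum(1 for its in db.values()
             if Y <= set(its) and not Z & set(its))
print(total, direct)  # 1 1
```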
Support Bounds
Upper bounds (|X \ Y| is odd):

    sup(X) ≤ ∑_{Y ⊆ W ⊂ X} (−1)^{|X \ W| + 1} sup(W)        (1)

Lower bounds (|X \ Y| is even):

    sup(X) ≥ ∑_{Y ⊆ W ⊂ X} (−1)^{|X \ W| + 1} sup(W)        (2)
Support Bounds for Subsets
[Figure: subset lattice of ACD. Level 1 (W ∈ {AC, AD, CD}, sign +1): upper bounds, from |X \ Y| = 1. Level 2 (W ∈ {A, C, D}, sign −1): lower bounds, from |X \ Y| = 2. Level 3 (W = ∅, sign +1): upper bound, from |X \ Y| = 3.]
Support Bounds
Example
From each of the partitions we get one bound; out of the eight possible subsets Y ⊆ ACD, exactly four give upper bounds and the other four give lower bounds for the support of ACD:

sup(ACD) ≥ 0                                       when Y = ACD
sup(ACD) ≤ sup(AC)                                 when Y = AC
sup(ACD) ≤ sup(AD)                                 when Y = AD
sup(ACD) ≤ sup(CD)                                 when Y = CD
sup(ACD) ≥ sup(AC) + sup(AD) − sup(A)              when Y = A
sup(ACD) ≥ sup(AC) + sup(CD) − sup(C)              when Y = C
sup(ACD) ≥ sup(AD) + sup(CD) − sup(D)              when Y = D
sup(ACD) ≤ sup(AC) + sup(AD) + sup(CD) − sup(A) − sup(C) − sup(D) + sup(∅)   when Y = ∅
Nonderivable Itemsets
Given an itemset X, and Y ⊆ X, let IE(Y) denote the summation

IE(Y) = ∑_{Y ⊆ W ⊂ X} (−1)^{|X \ W| + 1} · sup(W)

Then, the sets of all upper and lower bounds for sup(X) are given as

UB(X) = {IE(Y) | Y ⊆ X, |X \ Y| is odd}
LB(X) = {IE(Y) | Y ⊆ X, |X \ Y| is even}

For X = ACD we thus have

LB(ACD) = {0, 1}      max{LB(ACD)} = 1
UB(ACD) = {1, 2, 3}   min{UB(ACD)} = 1

Because max{LB(ACD)} = min{UB(ACD)}, we conclude that ACD is derivable, with sup(ACD) = 1.
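The bounds are mechanical to compute. The following sketch evaluates IE(Y) for every Y ⊆ X from Eqs. (1)–(2) and checks the derivability of ACD on the example database:

```python
# Lower/upper bounds on sup(X) from the inclusion-exclusion bounds.
from itertools import combinations

db = {1: "ABDE", 2: "BCE", 3: "ABDE", 4: "ABCE", 5: "ABCDE", 6: "BCD"}

def sup(X):
    """Support of an ordinary itemset X."""
    return sum(1 for its in db.values() if set(X) <= set(its))

def bounds(X):
    """All lower and upper bounds on sup(X), one IE(Y) per Y ⊆ X."""
    X = set(X)
    LB, UB = [], []
    for k in range(len(X) + 1):
        for Ys in combinations(sorted(X), k):
            Y = set(Ys)
            # IE(Y): sum over all W with Y ⊆ W ⊂ X
            ie = sum((-1) ** (len(X - set(W)) + 1) * sup(W)
                     for m in range(len(X))
                     for W in combinations(sorted(X), m) if Y <= set(W))
            (UB if len(X - Y) % 2 == 1 else LB).append(ie)
    return LB, UB

LB, UB = bounds("ACD")
print(sorted(set(LB)), sorted(set(UB)))  # [0, 1] [1, 2, 3]
print(max(LB), min(UB))                  # 1 1 -> ACD is derivable
```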
Data Mining and Machine Learning:
Fundamental Concepts and Algorithms
dataminingbook.info