Anaum Hamid
https://sites.google.com/site/anaumhamid/data-mining/lectures
Basic Concepts
§ Transaction
§ Dataset
§ Itemset

Mining 2013 – Mining Frequent Patterns, Association, and Correlations
Basic Concepts
Association Rules

Generate C3 candidates from L2 by joining L2 ⋈ L2:
§ Two joining (lexicographically ordered) k-itemsets must share their first k-1 items → {I1, I2} is not joined with {I2, I4}
§ Not all subsets are frequent → Prune
§ Joining L3 yields the candidate {I1, I2, I3, I5}, which is pruned, so C4 = ∅ → Terminate
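The join-and-prune step above can be sketched as follows (a minimal illustration; itemsets are represented as sorted tuples, and the L2 used in the usage example is the one from the running I1–I5 example):

```python
from itertools import combinations

def apriori_gen(Lk):
    """Generate candidate (k+1)-itemsets from the frequent k-itemsets Lk.

    Join: two lexicographically ordered k-itemsets are joined only if they
    share their first k-1 items. Prune: a candidate is dropped if any of
    its k-subsets is not frequent.
    """
    if not Lk:
        return []
    Lk = sorted(Lk)
    frequent = set(Lk)
    k = len(Lk[0])
    candidates = []
    for i in range(len(Lk)):
        for j in range(i + 1, len(Lk)):
            a, b = Lk[i], Lk[j]
            if a[:k - 1] == b[:k - 1]:                  # share first k-1 items
                cand = a + (b[-1],)
                # prune: every k-subset of the candidate must be in Lk
                if all(s in frequent for s in combinations(cand, k)):
                    candidates.append(cand)
    return candidates

L2 = [("I1", "I2"), ("I1", "I3"), ("I1", "I5"),
      ("I2", "I3"), ("I2", "I4"), ("I2", "I5")]
C3 = apriori_gen(L2)   # {I1, I2} is never joined with {I2, I4}
```

Applying `apriori_gen` to the resulting C3 joins {I1, I2, I3} and {I1, I2, I5} into {I1, I2, I3, I5}, which is pruned because its subset {I1, I3, I5} is not frequent, so C4 = ∅.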
The Apriori Algorithm—Exercise

Tid  Items
10   A, C, D
20   B, C, E
30   A, B, C, E
40   B, E
The Apriori Algorithm—Exercise (Solution)

Sup_min = 2

Database TDB:
Tid  Items
10   A, C, D
20   B, C, E
30   A, B, C, E
40   B, E

1st scan → C1:
Itemset  sup
{A}      2
{B}      3
{C}      3
{D}      1
{E}      3

L1:
Itemset  sup
{A}      2
{B}      3
{C}      3
{E}      3

Join L1 ⋈ L1 → C2:
{A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

2nd scan → C2 counts:
Itemset  sup
{A, B}   1
{A, C}   2
{A, E}   1
{B, C}   2
{B, E}   3
{C, E}   2

Prune (sup < Sup_min) → L2:
Itemset  sup
{A, C}   2
{B, C}   2
{B, E}   3
{C, E}   2
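The whole exercise can be checked with a compact Apriori sketch (candidates are counted directly in the same scan that prunes them, rather than in a separate pass; note that a third scan additionally surfaces {B, C, E} with support 2):

```python
def apriori(transactions, min_sup):
    """Return every frequent itemset (as a sorted tuple) with its support count."""
    transactions = [frozenset(t) for t in transactions]
    # C1 / L1: count single items, keep those meeting min_sup
    counts = {}
    for t in transactions:
        for i in t:
            counts[(i,)] = counts.get((i,), 0) + 1
    Lk = {s: c for s, c in counts.items() if c >= min_sup}
    frequent = dict(Lk)
    k = 2
    while Lk:
        prev = sorted(Lk)
        Lk = {}
        for x in range(len(prev)):
            for y in range(x + 1, len(prev)):
                a, b = prev[x], prev[y]
                if a[:k - 2] == b[:k - 2]:      # join: share first k-2 items
                    cand = a + (b[-1],)
                    c = sum(1 for t in transactions if t.issuperset(cand))
                    if c >= min_sup:            # count and prune in one scan
                        Lk[cand] = c
        frequent.update(Lk)
        k += 1
    return frequent

result = apriori([["A", "C", "D"], ["B", "C", "E"],
                  ["A", "B", "C", "E"], ["B", "E"]], min_sup=2)
```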
Mining Frequent Itemsets
Generating Association Rules from Frequent Itemsets

confidence(A ⇒ B) = P(B | A) = support_count(A ∪ B) / support_count(A)
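The confidence formula turns directly into a rule generator. The sketch below enumerates every non-empty proper subset A of a frequent itemset; the support counts are taken from the earlier exercise database, and min_conf = 0.7 is an assumed example threshold:

```python
from itertools import combinations

def gen_rules(itemset, support, min_conf):
    """For one frequent itemset, emit every rule A => B with
    confidence = support_count(A ∪ B) / support_count(A) >= min_conf."""
    rules = []
    whole = support[itemset]
    for r in range(1, len(itemset)):
        for A in combinations(itemset, r):
            conf = whole / support[A]
            if conf >= min_conf:
                B = tuple(i for i in itemset if i not in A)
                rules.append((A, B, conf))
    return rules

# Support counts from the exercise database (Sup_min = 2)
support = {("B",): 3, ("C",): 3, ("E",): 3,
           ("B", "C"): 2, ("B", "E"): 3, ("C", "E"): 2,
           ("B", "C", "E"): 2}
rules = gen_rules(("B", "C", "E"), support, min_conf=0.7)
```

Only {B, C} ⇒ {E} and {C, E} ⇒ {B} survive: both have confidence 2/2 = 1.0, while every other split of {B, C, E} has confidence 2/3 < 0.7.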
Bottleneck of Frequent-Pattern Mining
v The core of the Apriori algorithm:
§ Use frequent (k – 1)-itemsets to generate candidate frequent k-itemsets
§ Use database scan and pattern matching to collect counts for
the candidate itemsets
v Bottleneck: candidate-generation-and-test
v Can we avoid candidate generation?
Mining Frequent Patterns Without Candidate Generation
v Compress a large database into a compact,
Frequent-Pattern tree (FP-tree) structure
§ highly condensed, but complete for frequent pattern
mining
§ avoid costly database scans
v Develop an efficient, FP-tree-based frequent
pattern mining method
§ A divide-and-conquer methodology: decompose
mining tasks into smaller ones
§ Avoid candidate generation: sub-database test only!
Mining Frequent Patterns Without Candidate Generation
v Grow long patterns from short ones using local
frequent items
§ “abc” is a frequent pattern
§ Get all transactions having “abc”: DB|abc
§ “d” is a local frequent item in DB|abc → “abcd” is a frequent pattern
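One pattern-growth step can be sketched directly from this idea; the toy database below is assumed for illustration ("abc" is frequent, and "d" is locally frequent in DB|abc, so "abcd" is frequent too):

```python
from collections import Counter

def grow(db, pattern, min_sup):
    """One pattern-growth step: restrict to DB|pattern (the transactions
    containing `pattern`), then extend `pattern` by each item that is
    locally frequent in that projected database."""
    proj = [t for t in db if set(pattern) <= set(t)]            # DB|pattern
    local = Counter(i for t in proj for i in t if i not in pattern)
    return [pattern + (i,) for i, c in local.items() if c >= min_sup]

db = [("a", "b", "c", "d"), ("a", "b", "c"),
      ("a", "b", "c", "d"), ("b", "d")]
grown = grow(db, ("a", "b", "c"), min_sup=2)
```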
Mining Frequent Itemsets
FP-Growth

Compare the support count of each candidate with min_supp. Transactional data example, N = 9, min_supp count = 2:

TID   List of items
T100  I1, I2, I5
T200  I2, I4
T300  I2, I3
T400  I1, I2, I4
T500  I1, I3
T600  I2, I3
T700  I1, I3
T800  I1, I2, I3, I5
T900  I1, I2, I3

Scan the dataset for the support count of each candidate:

C1:
Itemset  Support count
{I1}     6
{I2}     7
{I3}     6
{I4}     2
{I5}     2

L1 – Reordered:
Itemset  Support count
{I2}     7
{I1}     6
{I3}     6
{I4}     2
{I5}     2
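The two tables above can be reproduced in a few lines (ties in the reordering are broken by item name here, which matches the slide's L1):

```python
from collections import Counter

# The nine transactions from the table; min_supp count = 2.
db = [["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"], ["I1", "I2", "I4"],
      ["I1", "I3"], ["I2", "I3"], ["I1", "I3"], ["I1", "I2", "I3", "I5"],
      ["I1", "I2", "I3"]]

counts = Counter(i for t in db for i in t)                      # C1
# L1 reordered: descending support count, ties broken by item name
order = sorted(counts, key=lambda i: (-counts[i], i))
# Rewrite each transaction with only frequent items, in L1 order
reordered = [sorted([i for i in t if counts[i] >= 2], key=order.index)
             for t in db]
```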
Mining Frequent Itemsets
FP-Growth – FP-tree Construction

The FP-tree starts with the root null { }.

L1 – Reordered:
Itemset  Support count
{I2}     7
{I1}     6
{I3}     6
{I4}     2
{I5}     2
Mining Frequent Itemsets
FP-Growth – FP-tree Construction

Transactions are inserted in L1-reordered form, one at a time, starting from the root null { }:
§ T100 (I2, I1, I5) creates the path I2:1 → I1:1 → I5:1
§ T200 (I2, I4) shares the prefix I2, incrementing it to I2:2, and adds a new branch I4:1
§ Each later transaction follows the same rule: counts along the shared prefix are incremented, and the remaining suffix creates new nodes

After all nine transactions, the FP-tree is:

null { }
├── I2:7
│   ├── I1:4
│   │   ├── I5:1
│   │   ├── I4:1
│   │   └── I3:2
│   │       └── I5:1
│   ├── I4:1
│   └── I3:2
└── I1:2
    └── I3:2

Each header-table entry ({I2}:7, {I1}:6, {I3}:6, {I4}:2, {I5}:2) keeps a node link to its occurrences in the tree.

To mine I5, eliminate the transactions not including I5, then eliminate I5 itself; the remaining prefixes form I5's conditional pattern base. {I3, I5} is discarded because its frequency is below the min_support count threshold.

Reordered transactions:
TID   List of items
T100  I2, I1, I5
T200  I2, I4
T300  I2, I3
T400  I2, I1, I4
T500  I1, I3
T600  I2, I3
T700  I1, I3
T800  I2, I1, I3, I5
T900  I2, I1, I3
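The insertion rule can be sketched as a small tree builder (a minimal version; the transactions are assumed to arrive already L1-reordered, and no header-table node links are kept):

```python
class Node:
    """One FP-tree node: item label, count, and children keyed by item."""
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

def insert(root, transaction):
    """Insert a frequency-ordered transaction: counts along the shared
    prefix are incremented; the remaining suffix creates new nodes."""
    node = root
    for item in transaction:
        if item not in node.children:
            node.children[item] = Node(item)
        node = node.children[item]
        node.count += 1

root = Node(None)                       # null { }
for t in [["I2", "I1", "I5"], ["I2", "I4"], ["I2", "I3"],
          ["I2", "I1", "I4"], ["I1", "I3"], ["I2", "I3"], ["I1", "I3"],
          ["I2", "I1", "I3", "I5"], ["I2", "I1", "I3"]]:
    insert(root, t)
```

After the loop, the node counts match the tree on the slide: I2:7 and I1:2 under the root, I1:4 under I2, and I3:2 under I2 → I1.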
Mining Frequent Itemsets
FP-Growth – Conditional FP-tree Construction

For I5, the conditional FP-tree is built from I5's conditional pattern base, starting again from the root null { }.
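Extracting I5's conditional pattern base can be sketched as follows (a minimal version using nested dicts for the tree; it rebuilds the FP-tree from the reordered transactions rather than reusing node links):

```python
def build(transactions):
    """FP-tree as nested dicts: item -> [count, children]."""
    root = {}
    for t in transactions:
        children = root
        for item in t:
            node = children.setdefault(item, [0, {}])
            node[0] += 1
            children = node[1]
    return root

def prefix_paths(tree, target, path=()):
    """Conditional pattern base for `target`: each prefix path leading to
    an occurrence of `target`, paired with that occurrence's count."""
    base = []
    for item, (count, children) in tree.items():
        if item == target:
            if path:
                base.append((list(path), count))
        else:
            base.extend(prefix_paths(children, target, path + (item,)))
    return base

db = [["I2", "I1", "I5"], ["I2", "I4"], ["I2", "I3"], ["I2", "I1", "I4"],
      ["I1", "I3"], ["I2", "I3"], ["I1", "I3"],
      ["I2", "I1", "I3", "I5"], ["I2", "I1", "I3"]]
base = prefix_paths(build(db), "I5")
```

The base comes out as {I2, I1 : 1} and {I2, I1, I3 : 1}: I2 and I1 each reach a local count of 2, while I3 reaches only 1, which is why {I3, I5} falls below the min_support count threshold.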
Exercise: Construct FP-Tree (Solution)

min_support = 3

TID  Items bought                 (ordered) frequent items
100  {f, a, c, d, g, i, m, p}     {f, c, a, m, p}
200  {a, b, c, f, l, m, o}        {f, c, a, b, m}
300  {b, f, h, j, o, w}           {f, b}
400  {b, c, k, s, p}              {c, b, p}
500  {a, f, c, e, l, p, m, n}     {f, c, a, m, p}

1. Scan DB once, find frequent 1-itemsets (single-item patterns)
2. Sort frequent items in frequency-descending order → f-list
3. Scan DB again, construct the FP-tree

Header Table:
Item  frequency  head
f     4
c     4
a     3
b     3
m     3
p     3

FP-tree:
{}
├── f:4
│   ├── c:3
│   │   └── a:3
│   │       ├── m:2
│   │       │   └── p:2
│   │       └── b:1
│   │           └── m:1
│   └── b:1
└── c:1
    └── b:1
        └── p:1
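Steps 1 and 2 of this exercise can be checked in a few lines. Ties in frequency can be broken arbitrarily; the f-list is fixed here to the order the slide uses:

```python
from collections import Counter

# Transactions from the exercise; min_support = 3.
db = [["f", "a", "c", "d", "g", "i", "m", "p"],
      ["a", "b", "c", "f", "l", "m", "o"],
      ["b", "f", "h", "j", "o", "w"],
      ["b", "c", "k", "s", "p"],
      ["a", "f", "c", "e", "l", "p", "m", "n"]]

counts = Counter(i for t in db for i in t)
frequent = {i for i, c in counts.items() if c >= 3}
# Frequency-descending f-list, ties broken as on the slide
flist = ["f", "c", "a", "b", "m", "p"]
assert set(flist) == frequent
# Ordered frequent items per transaction (third column of the table)
ordered = [[i for i in flist if i in t] for t in db]
```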