Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
Mining Association Rules
• Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
2. Rule Generation
– Generate high confidence rules from each frequent itemset, where each rule
is a binary partitioning of a frequent itemset
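The rule-generation step can be sketched as follows. This is a minimal illustration, not the slides' own code: the function names `support_count` and `rules_from_itemset`, the toy transactions, and the threshold `minconf = 0.7` are all assumptions. It uses the standard definition confidence(X → Y) = support(X ∪ Y) / support(X) and assumes the itemset passed in is frequent, so every antecedent has non-zero support.

```python
from itertools import combinations

def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def rules_from_itemset(itemset, transactions, minconf):
    """Return (antecedent, consequent, confidence) for each binary
    partition of a frequent itemset whose confidence is >= minconf."""
    itemset = frozenset(itemset)
    whole = support_count(itemset, transactions)
    rules = []
    for r in range(1, len(itemset)):            # size of the antecedent
        for lhs in combinations(sorted(itemset), r):
            lhs = frozenset(lhs)
            conf = whole / support_count(lhs, transactions)
            if conf >= minconf:
                rules.append((tuple(sorted(lhs)),
                              tuple(sorted(itemset - lhs)),
                              conf))
    return rules

# Hypothetical toy database, used only for illustration
transactions = [
    {"A", "B"}, {"A", "B", "C"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"},
]
rules = rules_from_itemset({"A", "B"}, transactions, minconf=0.7)
for lhs, rhs, conf in rules:
    print(lhs, "->", rhs, conf)
```

Here {A, B} occurs in 3 transactions and A and B each occur in 4, so both A → B and B → A have confidence 0.75 and pass the assumed threshold.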
Data Mining
• Apriori principle:
• If an itemset is frequent, then all of its subsets must also
be frequent
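• Equivalently, if an itemset is found to be infrequent, all of its supersets can be pruned
[Figure: itemset lattice over the items A, B, C, D, E, from the single items down to ABCDE. One itemset is marked "found to be infrequent" and all of its supersets are marked "pruned".]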
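The pruning shown in the lattice can be sketched in a few lines. The choice of AB as the infrequent itemset is hypothetical (the text does not say which itemset is marked), and the helper name `supersets` is an assumption made for illustration.

```python
from itertools import combinations

items = ["A", "B", "C", "D", "E"]

def supersets(itemset, items):
    """All proper supersets of `itemset` in the lattice over `items`."""
    base = frozenset(itemset)
    rest = [i for i in items if i not in base]
    result = []
    for r in range(1, len(rest) + 1):
        for extra in combinations(rest, r):
            result.append("".join(sorted(base | set(extra))))
    return result

# Hypothetical case: AB has been found to be infrequent
pruned = supersets("AB", items)
print(pruned)   # ['ABC', 'ABD', 'ABE', 'ABCD', 'ABCE', 'ABDE', 'ABCDE']
```

With three remaining items there are 2^3 - 1 = 7 supersets, none of which needs its support counted.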
1/30/22
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Apriori Algorithm
Method:
• Let k=1
• Generate frequent itemsets of length 1
• Repeat until no new frequent itemsets are identified
  – Generate length (k+1) candidate itemsets from length k frequent itemsets
  – Prune candidate itemsets containing subsets of length k that are infrequent
  – Count the support of each candidate by scanning the DB
  – Eliminate candidates that are infrequent, leaving only those that are frequent
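The method above can be sketched in Python as follows. This is a minimal, unoptimised illustration under stated assumptions: the function name `apriori`, the toy database and the threshold `minsup = 3` are hypothetical, candidates are generated by joining any two frequent k-itemsets whose union has k+1 items, and support is counted by a plain scan per candidate.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return {frozenset: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # k = 1: count single items and keep the frequent ones
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= minsup}
    result = dict(frequent)
    k = 1
    while frequent:
        # Generate length-(k+1) candidates from frequent k-itemsets
        prev = list(frequent)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        # Prune candidates that contain an infrequent k-subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k))}
        # Count the support of each candidate by scanning the transactions
        counts = {c: sum(1 for t in transactions if c <= t)
                  for c in candidates}
        # Keep only the candidates that are frequent
        frequent = {c: n for c, n in counts.items() if n >= minsup}
        result.update(frequent)
        k += 1
    return result

# Hypothetical toy database
db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
result = apriori(db, minsup=3)
for itemset, count in sorted(result.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), count)
```

On this toy database every single item and every pair is frequent, while {a, b, c} is generated as a candidate but eliminated because it occurs in only 2 transactions.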
Illustrating Apriori Principle
Minimum Support = 3

Transactions
TID   Items
1     Bread, Milk
2     Bread, Diaper, Butter, Eggs
3     Milk, Diaper, Butter, Coke
4     Bread, Milk, Diaper, Butter
5     Bread, Milk, Diaper, Coke

Items (1-itemsets)
Item     Count
Bread    4
Coke     2
Milk     4
Butter   3
Diaper   4
Eggs     1

Pairs (2-itemsets) (no need to generate candidates involving Coke or Eggs)
Itemset            Count
{Bread,Milk}       3
{Bread,Butter}     2
{Bread,Diaper}     3
{Milk,Butter}      2
{Milk,Diaper}      3
{Butter,Diaper}    3

Triplets (3-itemsets)
Itemset                Count
{Bread,Milk,Diaper}    2 (only transactions 4 and 5 contain all three items)

• If every subset is considered: $\binom{6}{1} + \binom{6}{2} + \binom{6}{3} = 6 + 15 + 20 = 41$ candidates
• With support-based pruning: 6 + 6 + 1 = 13 candidates
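The two candidate counts on this slide can be checked with a short script. This is a sketch written for these notes, with hypothetical variable names; it counts candidates of size 1 to 3 with and without support-based pruning on the five transactions listed on the slide.

```python
from itertools import combinations
from math import comb

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Butter", "Eggs"},
    {"Milk", "Diaper", "Butter", "Coke"},
    {"Bread", "Milk", "Diaper", "Butter"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
minsup = 3
items = sorted(set().union(*transactions))

def support(itemset):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)

# Without pruning: every itemset of size 1, 2 or 3 is a candidate
brute_force = sum(comb(len(items), k) for k in (1, 2, 3))

# With support-based pruning: build each level from the frequent level below
c1 = [(i,) for i in items]
f1 = sorted(i for (i,) in c1 if support((i,)) >= minsup)
c2 = list(combinations(f1, 2))
f2 = [c for c in c2 if support(c) >= minsup]
c3 = [c for c in combinations(f1, 3)
      if all(pair in f2 for pair in combinations(c, 2))]
pruned = len(c1) + len(c2) + len(c3)

print(brute_force)  # 41
print(pruned)       # 13
```

Only Bread, Butter, Diaper and Milk survive the first level, which gives 6 candidate pairs, and {Bread, Diaper, Milk} is the single triplet whose pairs are all frequent.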
Association Mining: Hash Tree
Hash function: items 1, 4, 7 go to the left branch; items 2, 5, 8 to the middle branch; items 3, 6, 9 to the right branch
Candidate hash tree: stores the 15 candidate 3-itemsets
145, 124, 457, 125, 458, 159, 136, 234, 567, 345, 356, 357, 689, 367, 368
[Figure: candidate hash tree with 9 leaf nodes. The figure is repeated three times, highlighting in turn the branch reached by hashing on 1, 4 or 7, on 2, 5 or 8, and on 3, 6 or 9.]
Subset operation on the transaction 1 2 3 5 6:
• At the root, split the transaction as 1 + 2356, 2 + 356, 3 + 56
• At the next level, split 1 + 2356 as 12 + 356, 13 + 56, 15 + 6
• Visited 5 out of 9 leaf nodes
• Matched the transaction against 11 out of 15 candidates
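A hash tree of this kind can be sketched as below. Several details are assumptions not stated on the slides: the hash function is taken to be `item % 3` (which reproduces the grouping 1,4,7 / 2,5,8 / 3,6,9), a leaf is split once it holds more than `max_leaf = 3` candidates, and the names `Node`, `insert` and `contained` are hypothetical. The resulting tree need not have the same shape as the one in the figure, so the sketch reports only which candidates are contained in the transaction, not how many leaves are visited.

```python
class Node:
    """Hash-tree node: a leaf stores candidates, an interior node children."""
    def __init__(self):
        self.children = {}
        self.itemsets = []
        self.is_leaf = True

def h(item):
    # Assumed hash function: 1,4,7 -> 1; 2,5,8 -> 2; 3,6,9 -> 0
    return item % 3

def insert(node, itemset, depth=0, max_leaf=3):
    """Insert a sorted candidate tuple, hashing on its item at `depth`."""
    if not node.is_leaf:
        child = node.children.setdefault(h(itemset[depth]), Node())
        insert(child, itemset, depth + 1, max_leaf)
        return
    node.itemsets.append(itemset)
    # Split an overfull leaf, unless every item has already been hashed
    if len(node.itemsets) > max_leaf and depth < len(itemset):
        node.is_leaf = False
        stored, node.itemsets = node.itemsets, []
        for s in stored:
            child = node.children.setdefault(h(s[depth]), Node())
            insert(child, s, depth + 1, max_leaf)

def contained(node, transaction, start=0, found=None):
    """Collect the candidates in the tree that are subsets of `transaction`."""
    if found is None:
        found = set()
    if node.is_leaf:
        found.update(s for s in node.itemsets if set(s) <= set(transaction))
        return found
    # Hash on each remaining item in turn (1 + 2356, 2 + 356, ...)
    for i in range(start, len(transaction)):
        child = node.children.get(h(transaction[i]))
        if child is not None:
            contained(child, transaction, i + 1, found)
    return found

candidates = [(1, 4, 5), (1, 2, 4), (4, 5, 7), (1, 2, 5), (4, 5, 8),
              (1, 5, 9), (1, 3, 6), (2, 3, 4), (5, 6, 7), (3, 4, 5),
              (3, 5, 6), (3, 5, 7), (6, 8, 9), (3, 6, 7), (3, 6, 8)]
root = Node()
for c in candidates:
    insert(root, c)

matches = sorted(contained(root, (1, 2, 3, 5, 6)))
print(matches)   # [(1, 2, 5), (1, 3, 6), (3, 5, 6)]
```

The point of the structure is that only the leaves reachable by hashing the transaction's items are compared against it, instead of all 15 candidates.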
Closed Patterns and Max-Patterns
• A long pattern contains a combinatorial number of sub-patterns, e.g.,
{a1, …, a100} contains $\binom{100}{1} + \binom{100}{2} + \dots + \binom{100}{100} = 2^{100} - 1 \approx 1.27 \times 10^{30}$ sub-patterns!
• Solution: Mine closed frequent patterns and maximal frequent patterns
instead
• An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X
• An itemset X is a maximal pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X
• Closed patterns are a lossless compression of the frequent patterns
  – Reduces the number of patterns and rules
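To see why closed patterns are a lossless compression, the sketch below recovers the support of an arbitrary frequent itemset from the closed patterns alone: it is the largest support among the closed patterns that contain it. The two-transaction database, the minimum support count of 1, and the function name `support_from_closed` are assumptions chosen for illustration; items a1 to a100 are represented by the integers 1 to 100.

```python
def support_from_closed(itemset, closed):
    """Support of a frequent itemset = largest support among its closed
    supersets; None means the itemset is not frequent."""
    itemset = frozenset(itemset)
    sups = [s for c, s in closed.items() if itemset <= c]
    return max(sups) if sups else None

# Hypothetical database with two transactions and minimum support count 1:
#   T1 = {a1, ..., a100}, T2 = {a1, ..., a50}
# Its closed frequent patterns and their supports:
closed = {
    frozenset(range(1, 101)): 1,   # {a1, ..., a100}
    frozenset(range(1, 51)): 2,    # {a1, ..., a50}
}

n_frequent = 2**100 - 1            # every non-empty subset of T1 is frequent
print(len(closed), "closed patterns stand for", n_frequent, "frequent patterns")
print(support_from_closed({2, 45}, closed))   # 2: both items are in T1 and T2
print(support_from_closed({8, 55}, closed))   # 1: a55 occurs only in T1
```

The single maximal pattern {a1, ..., a100} would still identify which itemsets are frequent, but not their supports, which is why only the closed representation is lossless.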
Maximal vs Closed Itemsets
[Figure: itemset lattice over {A, B, C, D, E}, starting from null, with each itemset annotated by the ids of the transactions that contain it.]

Itemset   Transaction ids
AB        1, 2
AC        1, 2, 4
AD        2, 4
AE        4
BC        1, 2, 3
BD        2
BE        3
CD        2, 4
CE        3, 4
DE        4, 5
ABC       1, 2
ABD       2
ACD       2, 4
ACE       4
ADE       4
BCD       2
BCE       3
CDE       4
ABCD      2
ACDE      4

Not supported by any transactions: ABE, BDE, ABCE, ABDE, BCDE, ABCDE

# Closed = 9
# Maximal = 4
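The counts of closed and maximal itemsets for this lattice can be reproduced with a short script. Two things are assumptions here: the five transactions are reconstructed from the transaction ids annotated on the 2-itemsets, and the minimum support count is taken to be 2, the threshold under which the counts 9 and 4 come out. Variable names are hypothetical.

```python
from itertools import combinations

# Assumption: transactions reconstructed from the ids on the lattice
transactions = {
    1: set("ABC"),
    2: set("ABCD"),
    3: set("BCE"),
    4: set("ACDE"),
    5: set("DE"),
}
minsup = 2          # assumed minimum support count
items = "ABCDE"

# Support count of every frequent itemset
support = {}
for r in range(1, len(items) + 1):
    for combo in combinations(items, r):
        s = frozenset(combo)
        n = sum(1 for t in transactions.values() if s <= t)
        if n >= minsup:
            support[s] = n

def frequent_supersets(x):
    """Frequent itemsets that strictly contain x."""
    return [y for y in support if x < y]

# Closed: no superset with the same support (an infrequent superset always
# has lower support, so checking the frequent ones is enough)
closed = [x for x in support
          if all(support[y] < support[x] for y in frequent_supersets(x))]
# Maximal: no frequent superset at all
maximal = [x for x in support if not frequent_supersets(x)]

print(len(support), len(closed), len(maximal))      # 14 9 4
print(sorted("".join(sorted(x)) for x in maximal))  # ['ABC', 'ACD', 'CE', 'DE']
```

Every maximal itemset in the output is also closed, and every closed itemset is frequent, so the three collections are nested.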
[Figure: nested sets. Maximal Frequent Itemsets ⊆ Closed Frequent Itemsets ⊆ Frequent Itemsets.]