
Data Mining

BITS Pilani
Pilani | Dubai | Goa | Hyderabad

M5: Association Analysis

5.2 Apriori Algorithm

Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
Mining Association Rules

• Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup

2. Rule Generation
– Generate high confidence rules from each frequent itemset, where each rule
is a binary partitioning of a frequent itemset

• Frequent itemset generation is computationally expensive
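As a sketch of the rule-generation step, the code below enumerates every binary partition X → Y of a frequent itemset and keeps the rules whose confidence s(X ∪ Y) / s(X) meets a threshold. It assumes the 5-transaction market-basket DB used as the running example in this module:

```python
from itertools import combinations

# The 5-transaction market-basket DB used as the running example in this deck.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Butter", "Eggs"},
    {"Milk", "Diaper", "Butter", "Coke"},
    {"Bread", "Milk", "Diaper", "Butter"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    """Number of transactions containing the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def rules(freq_itemset, minconf):
    """Step 2: every binary partition X -> Y of a frequent itemset is a
    candidate rule; keep those with confidence s(X ∪ Y) / s(X) >= minconf."""
    out = []
    for r in range(1, len(freq_itemset)):
        for lhs in combinations(sorted(freq_itemset), r):
            X = set(lhs)
            conf = support(freq_itemset) / support(X)
            if conf >= minconf:
                out.append((X, freq_itemset - X, conf))
    return out

for X, Y, conf in rules({"Milk", "Diaper"}, minconf=0.7):
    print(sorted(X), "->", sorted(Y), round(conf, 2))
# ['Diaper'] -> ['Milk'] 0.75
# ['Milk'] -> ['Diaper'] 0.75
```

For the frequent pair {Milk, Diaper} (support 3), both directions of the rule have confidence 3/4 = 0.75 on this DB.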

Data Mining

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Reducing Number of Candidates

• Apriori principle:
• If an itemset is frequent, then all of its subsets must also
be frequent

• The Apriori principle holds due to the following property of the support
measure:

∀X, Y : (X ⊆ Y) ⇒ s(X) ≥ s(Y)

• The support of an itemset never exceeds the support of its subsets
• This is known as the anti-monotone property of support
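As a sketch, the anti-monotone property can be checked exhaustively on the small market-basket DB used in this module (item names as in the example transactions):

```python
from itertools import combinations

# Same 5-transaction DB as elsewhere in this module.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Butter", "Eggs"},
    {"Milk", "Diaper", "Butter", "Coke"},
    {"Bread", "Milk", "Diaper", "Butter"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    return sum(1 for t in transactions if set(itemset) <= t)

items = sorted(set().union(*transactions))
# For every itemset Y of size <= 3 and every proper subset X of Y: s(X) >= s(Y).
for k in range(2, 4):
    for Y in combinations(items, k):
        for r in range(1, k):
            for X in combinations(Y, r):
                assert support(X) >= support(Y)
print("anti-monotonicity verified on this DB")
```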


Illustrating Apriori Principle
[Figure: itemset lattice over the items {A, B, C, D, E}, from the null set
down to ABCDE. Once an itemset (AB in this example) is found to be
infrequent, all of its supersets are pruned from the search space.]


Apriori: A Candidate Generation-and-Test Approach

• Apriori pruning principle: if any itemset is infrequent, its supersets
should not be generated or tested!

• Method:
  – Initially, scan the DB once to get the frequent 1-itemsets
  – Generate length-(k+1) candidate itemsets from the length-k frequent
    itemsets
  – Test the candidates against the DB
  – Terminate when no frequent or candidate set can be generated

TID  Items
1    Bread, Milk
2    Bread, Diaper, Butter, Eggs
3    Milk, Diaper, Butter, Coke
4    Bread, Milk, Diaper, Butter
5    Bread, Milk, Diaper, Coke

Apriori Algorithm

Method:
• Let k=1
• Generate frequent itemsets of length 1
• Repeat until no new frequent itemsets are identified
• Generate length (k+1) candidate itemsets from length k
frequent itemsets
• Prune candidate itemsets containing subsets of length k that
are infrequent
• Count the support of each candidate by scanning the DB
• Eliminate candidates that are infrequent, leaving only those
that are frequent
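The loop above can be sketched in Python. This is a minimal in-memory version (real implementations use smarter support counting, such as the hash tree discussed later in this module):

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Minimal sketch of the level-wise Apriori loop described above."""
    transactions = [frozenset(t) for t in transactions]

    def support(c):
        return sum(1 for t in transactions if c <= t)

    items = sorted(set().union(*transactions))
    # k = 1: frequent 1-itemsets.
    Lk = {frozenset([i]) for i in items if support(frozenset([i])) >= minsup}
    frequent, k = set(Lk), 1
    while Lk:
        # Generate length-(k+1) candidates from length-k frequent itemsets.
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # Prune candidates containing an infrequent length-k subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k))}
        # Count supports with one DB scan; eliminate infrequent candidates.
        Lk = {c for c in candidates if support(c) >= minsup}
        frequent |= Lk
        k += 1
    return frequent

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Butter", "Eggs"},
    {"Milk", "Diaper", "Butter", "Coke"},
    {"Bread", "Milk", "Diaper", "Butter"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
freq = apriori(transactions, minsup=3)
print(len(freq))  # 8: four frequent items and four frequent pairs
```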



Important Details of Apriori
• How to generate candidates?
• Step 1: self-joining Lk
• Step 2: pruning
• Example of Candidate-generation
• L3={abc, abd, acd, ace, bcd}
• Self-joining: L3*L3
• abcd from abc and abd
• acde from acd and ace
• Pruning:
• acde is removed because ade is not in L3
• C4={abcd}
• How to count supports of candidates?
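The self-joining and pruning steps illustrated above (L3 * L3) can be sketched as:

```python
from itertools import combinations

def candidate_gen(Lk, k):
    """Self-join Lk with itself, then prune candidates that have an
    infrequent length-k subset (the two steps shown above)."""
    Lk = sorted(Lk)
    joined = set()
    for a in Lk:
        for b in Lk:
            # Join itemsets that agree on their first k-1 items.
            if a[:k - 1] == b[:k - 1] and a[k - 1] < b[k - 1]:
                joined.add(a + (b[k - 1],))
    Lset = set(Lk)
    return {c for c in joined
            if all(s in Lset for s in combinations(c, k))}

L3 = [tuple(s) for s in ("abc", "abd", "acd", "ace", "bcd")]
C4 = candidate_gen(L3, 3)
print(C4)  # {('a', 'b', 'c', 'd')}; acde was pruned since ade is not in L3
```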

Illustrating Apriori Principle
If every subset is considered:
C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41 candidates
With support-based pruning:
6 + 6 + 1 = 13 candidates

Minimum Support = 3

Items (1-itemsets):
Item     Count
Bread    4
Coke     2
Milk     4
Butter   3
Diaper   4
Eggs     1

Pairs (2-itemsets):
(No need to generate candidates involving Coke or Eggs)
Itemset            Count
{Bread, Milk}      3
{Bread, Butter}    2
{Bread, Diaper}    3
{Milk, Butter}     2
{Milk, Diaper}     3
{Butter, Diaper}   3

Triplets (3-itemsets):
Itemset                  Count
{Bread, Milk, Diaper}    2

TID  Items
1    Bread, Milk
2    Bread, Diaper, Butter, Eggs
3    Milk, Diaper, Butter, Coke
4    Bread, Milk, Diaper, Butter
5    Bread, Milk, Diaper, Coke
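The candidate counts on this slide can be reproduced directly, as a sketch, with supports computed against the five transactions above:

```python
from itertools import combinations
from math import comb

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Butter", "Eggs"},
    {"Milk", "Diaper", "Butter", "Coke"},
    {"Bread", "Milk", "Diaper", "Butter"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
MINSUP = 3

def support(itemset):
    return sum(1 for t in transactions if set(itemset) <= t)

# Without pruning: every itemset of size 1..3 over the 6 items is a candidate.
brute = comb(6, 1) + comb(6, 2) + comb(6, 3)                 # 6 + 15 + 20 = 41

# With support-based pruning, only 6 + 6 + 1 = 13 candidates are ever counted.
items = sorted(set().union(*transactions))                    # 6 single items
L1 = [i for i in items if support([i]) >= MINSUP]             # 4 frequent items
pairs = list(combinations(L1, 2))                             # 6 candidate pairs
L2 = set(p for p in pairs if support(p) >= MINSUP)            # 4 frequent pairs
triples = {tuple(sorted(set(a) | set(b))) for a in L2 for b in L2
           if len(set(a) | set(b)) == 3}
triples = [t for t in triples
           if all(s in L2 for s in combinations(t, 2))]       # 1 candidate triple
counted = len(items) + len(pairs) + len(triples)
print(brute, counted)  # 41 13
```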


How to Count Supports of Candidates?

• Why is counting the supports of candidates a problem?
• The total number of candidates can be very large
• One transaction may contain many candidates
• Method:
• Candidate itemsets (with ordered items) are stored in a
hash-tree
• Leaf node of hash-tree contains a list of itemsets and counts
• Interior node contains a hash table
• Subset function: finds all the candidates contained in a
transaction

Association Mining: Hash Tree

[Figure: candidate hash tree storing 15 candidate 3-itemsets: 145, 124, 457,
125, 458, 159, 136, 345, 356, 357, 689, 367, 368, 234, 567. The hash function
sends items 1, 4, 7 to the left branch, items 2, 5, 8 to the middle branch,
and items 3, 6, 9 to the right branch. Here the first item is hashed on
1, 4 or 7.]


Association Mining: Hash Tree

[Figure: the same candidate hash tree, highlighting the branch reached by
hashing on items 2, 5 or 8.]



Association Mining: Hash Tree

[Figure: the same candidate hash tree, highlighting the branch reached by
hashing on items 3, 6 or 9.]



Subset Operation Using Hash Tree

[Figure: finding the candidates contained in transaction {1, 2, 3, 5, 6}. At
the root, the transaction is split into the prefixes 1 + {2 3 5 6},
2 + {3 5 6} and 3 + {5 6}, and each prefix is hashed down the corresponding
branch of the candidate hash tree.]



Subset Operation Using Hash Tree

[Figure: one level down, the branch for prefix 1 is expanded into
1 2 + {3 5 6}, 1 3 + {5 6} and 1 5 + {6}, hashing on the second item.]



Subset Operation Using Hash Tree

[Figure: continuing the recursion down to the leaves. The traversal visits
5 out of 9 leaf nodes and matches the transaction against 11 out of 15
candidates.]
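For comparison, a brute-force check enumerates all C(5,3) = 10 size-3 subsets of the transaction and tests each against the candidate set; the hash tree reaches the same matches while visiting only 5 of the 9 leaves:

```python
from itertools import combinations

# The 15 candidate 3-itemsets stored in the hash tree above.
candidates = {frozenset(map(int, c)) for c in
              "145 124 457 125 458 159 136 345 356 357 689 367 368 234 567".split()}

transaction = [1, 2, 3, 5, 6]

# Brute force: intersect all size-3 subsets of the transaction with the
# candidate set.
matched = {frozenset(s) for s in combinations(transaction, 3)} & candidates
print(sorted(sorted(m) for m in matched))  # [[1, 2, 5], [1, 3, 6], [3, 5, 6]]
```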


Factors Affecting Complexity

• Choice of minimum support threshold


• lowering support threshold results in more frequent itemsets
• this may increase number of candidates and max length of frequent
itemsets
• Dimensionality (number of items) of the data set
• more space is needed to store support count of each item
• if number of frequent items also increases, both computation and I/O
costs may also increase
• Size of database
• since Apriori makes multiple passes, run time of algorithm may increase
with number of transactions
• Average transaction width
• transaction width increases with denser data sets
• this may increase the max length of frequent itemsets and the number of
hash tree traversals (the number of subsets in a transaction increases with
its width)

Closed Patterns and Max-Patterns
• A long pattern contains a combinatorial number of sub-patterns; e.g.,
{a1, …, a100} contains C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈
1.27 × 10^30 sub-patterns!
• Solution: Mine closed frequent patterns and maximal frequent patterns
instead
• An itemset X is closed if X is frequent and there exists no super-pattern
Y ⊃ X with the same support as X
• An itemset X is a maximal pattern if X is frequent and there exists no
frequent super-pattern Y ⊃ X
• A closed pattern is a lossless compression of the frequent patterns
• It reduces the number of patterns and rules

Closed Patterns and Max-Patterns

• Exercise. DB = {<a1, …, a100>, <a1, …, a50>}
• Min_sup = 1
• What is the set of closed itemsets?
  • <a1, …, a100>: 1
  • <a1, …, a50>: 2
• What is the set of maximal patterns?
  • <a1, …, a100>: 1
• How many patterns are there in all?
  • 2^100 − 1 ≈ 1.27 × 10^30
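The exercise can be checked programmatically, using the standard fact that the closed itemsets of a DB are exactly the intersections of non-empty groups of transactions (with two transactions there are only three such groups, and two of them coincide):

```python
# Closed itemsets of a DB are exactly the intersections of non-empty groups
# of transactions. With two transactions there are only three such groups.
t1 = frozenset(range(1, 101))     # <a1, ..., a100>, support 1
t2 = frozenset(range(1, 51))      # <a1, ..., a50>,  support 2 (t1 contains it)

closed = {t1: 1, t1 & t2: 2}      # t1 & t2 == t2, so there are 2 closed sets
maximal = {t1}                    # only <a1, ..., a100> has no frequent superset

# Total number of non-empty patterns at min_sup = 1:
print(2 ** 100 - 1)  # 1267650600228229401496703205375, about 1.27 * 10^30
```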

Maximal vs Closed Itemsets

TID  Items
1    ABC
2    ABCD
3    BCE
4    ACDE
5    DE

[Figure: itemset lattice over {A, B, C, D, E}, with each node annotated by
the IDs of the transactions that contain it (e.g. AB: 1,2; C: 1,2,3,4;
ABCD: 2). ABCDE is not supported by any transaction.]


Maximal vs Closed Frequent Itemsets

[Figure: the same itemset lattice with minimum support = 2. Closed itemsets
are marked either "closed but not maximal" (e.g. C, support 4) or "closed and
maximal" (e.g. ABC, support 2). In total: # Closed = 9, # Maximal = 4.]
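The counts # Closed = 9 and # Maximal = 4 can be reproduced by brute force over this five-transaction DB. As a sketch, checking immediate supersets (one extra item) suffices for both tests because support is anti-monotone:

```python
from itertools import combinations

# Transactions from the lattice example above; minimum support = 2.
transactions = [set("ABC"), set("ABCD"), set("BCE"), set("ACDE"), set("DE")]
MINSUP = 2

def support(itemset):
    return sum(1 for t in transactions if itemset <= t)

items = set().union(*transactions)
frequent = {frozenset(c): support(set(c))
            for k in range(1, len(items) + 1)
            for c in combinations(sorted(items), k)
            if support(set(c)) >= MINSUP}

# Closed: no immediate superset has the same support.
closed = {x for x, s in frequent.items()
          if not any(support(x | {i}) == s for i in items - x)}
# Maximal: no immediate superset is frequent.
maximal = {x for x in frequent
           if not any(frozenset(x | {i}) in frequent for i in items - x)}
print(len(closed), len(maximal))  # 9 4
```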


Maximal vs Closed Itemsets

Frequent
Itemsets

Closed
Frequent
Itemsets

Maximal
Frequent
Itemsets



Alternative Methods for Frequent Itemset Generation
• Representation of the database: horizontal vs vertical data layout

Horizontal Data Layout (TID → items):
TID  Items
1    A, B, E
2    B, C, D
3    C, E
4    A, C, D
5    A, B, C, D
6    A, E
7    A, B
8    A, B, C
9    A, C, D
10   B

Vertical Data Layout (item → TID-list):
Item  TIDs
A     1, 4, 5, 6, 7, 8, 9
B     1, 2, 5, 7, 8, 10
C     2, 3, 4, 8, 9
D     2, 4, 5, 9
E     1, 3, 6
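A sketch of support counting under the vertical layout (the idea behind algorithms such as ECLAT): the support of an itemset is the size of the intersection of its items' TID-lists, so no pass over the original transactions is needed:

```python
# TID-lists for each item, read off the vertical layout above.
tidlists = {
    "A": {1, 4, 5, 6, 7, 8, 9},
    "B": {1, 2, 5, 7, 8, 10},
    "C": {2, 3, 4, 8, 9},
    "D": {2, 4, 5, 9},
    "E": {1, 3, 6},
}

def support(items):
    """Support = size of the intersection of the items' TID-lists."""
    return len(set.intersection(*(tidlists[i] for i in items)))

print(support("AB"))   # 4 (transactions 1, 5, 7, 8)
print(support("ABD"))  # 1 (transaction 5)
```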


Prescribed Text Books

Author(s), Title, Edition, Publishing House

T1  Tan P. N., Steinbach M. & Kumar V., "Introduction to Data Mining", Pearson Education
T2  Han J., Kamber M. & Pei J., "Data Mining: Concepts and Techniques", 3rd Edition, Morgan Kaufmann
R1  Kotu V. & Deshpande B., "Predictive Analytics and Data Mining: Concepts and Practice with RapidMiner", Morgan Kaufmann

