
Data Mining

BITS Pilani
Pilani | Dubai | Goa | Hyderabad

M5: Association Analysis

5.2 Apriori Algorithm

Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
Mining Association Rules

• Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup

2. Rule Generation
– Generate high confidence rules from each frequent itemset, where each rule
is a binary partitioning of a frequent itemset

• Frequent itemset generation is computationally expensive
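As a sketch of the rule-generation step, the code below enumerates every binary partition X → Y of a frequent itemset and keeps the rules whose confidence s(X ∪ Y) / s(X) meets a threshold. It assumes the 5-transaction market-basket DB used as the running example in this module:

```python
from itertools import combinations

# The 5-transaction market-basket DB used as the running example in this deck.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Butter", "Eggs"},
    {"Milk", "Diaper", "Butter", "Coke"},
    {"Bread", "Milk", "Diaper", "Butter"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    """Number of transactions containing the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def rules(freq_itemset, minconf):
    """Step 2: every binary partition X -> Y of a frequent itemset is a
    candidate rule; keep those with confidence s(X ∪ Y) / s(X) >= minconf."""
    out = []
    for r in range(1, len(freq_itemset)):
        for lhs in combinations(sorted(freq_itemset), r):
            X = set(lhs)
            conf = support(freq_itemset) / support(X)
            if conf >= minconf:
                out.append((X, freq_itemset - X, conf))
    return out

for X, Y, conf in rules({"Milk", "Diaper"}, minconf=0.7):
    print(sorted(X), "->", sorted(Y), round(conf, 2))
# ['Diaper'] -> ['Milk'] 0.75
# ['Milk'] -> ['Diaper'] 0.75
```

For the frequent pair {Milk, Diaper} (support 3), both directions of the rule have confidence 3/4 = 0.75 on this DB.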

Data Mining

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Reducing Number of Candidates

• Apriori principle:
• If an itemset is frequent, then all of its subsets must also
be frequent

• The Apriori principle holds due to the following property of the support
measure:

∀X, Y : (X ⊆ Y) ⇒ s(X) ≥ s(Y)

• The support of an itemset never exceeds the support of its subsets
• This is known as the anti-monotone property of support
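As a sketch, the anti-monotone property can be checked exhaustively on the small market-basket DB used in this module (item names as in the example transactions):

```python
from itertools import combinations

# Same 5-transaction DB as elsewhere in this module.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Butter", "Eggs"},
    {"Milk", "Diaper", "Butter", "Coke"},
    {"Bread", "Milk", "Diaper", "Butter"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    return sum(1 for t in transactions if set(itemset) <= t)

items = sorted(set().union(*transactions))
# For every itemset Y of size <= 3 and every proper subset X of Y: s(X) >= s(Y).
for k in range(2, 4):
    for Y in combinations(items, k):
        for r in range(1, k):
            for X in combinations(Y, r):
                assert support(X) >= support(Y)
print("anti-monotonicity verified on this DB")
```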


Illustrating Apriori Principle
[Figure: itemset lattice over the items {A, B, C, D, E}, from the null set
down to ABCDE. Once an itemset (AB in this example) is found to be
infrequent, all of its supersets are pruned from the search space.]


Apriori: A Candidate Generation-and-Test Approach

• Apriori pruning principle: if any itemset is infrequent, its supersets
should not be generated or tested!

• Method:
  – Initially, scan the DB once to get the frequent 1-itemsets
  – Generate length-(k+1) candidate itemsets from the length-k frequent
    itemsets
  – Test the candidates against the DB
  – Terminate when no frequent or candidate set can be generated

TID  Items
1    Bread, Milk
2    Bread, Diaper, Butter, Eggs
3    Milk, Diaper, Butter, Coke
4    Bread, Milk, Diaper, Butter
5    Bread, Milk, Diaper, Coke

Apriori Algorithm

Method:
• Let k=1
• Generate frequent itemsets of length 1
• Repeat until no new frequent itemsets are identified
• Generate length (k+1) candidate itemsets from length k
frequent itemsets
• Prune candidate itemsets containing subsets of length k that
are infrequent
• Count the support of each candidate by scanning the DB
• Eliminate candidates that are infrequent, leaving only those
that are frequent
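The loop above can be sketched in Python. This is a minimal in-memory version (real implementations use smarter support counting, such as the hash tree discussed later in this module):

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Minimal sketch of the level-wise Apriori loop described above."""
    transactions = [frozenset(t) for t in transactions]

    def support(c):
        return sum(1 for t in transactions if c <= t)

    items = sorted(set().union(*transactions))
    # k = 1: frequent 1-itemsets.
    Lk = {frozenset([i]) for i in items if support(frozenset([i])) >= minsup}
    frequent, k = set(Lk), 1
    while Lk:
        # Generate length-(k+1) candidates from length-k frequent itemsets.
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # Prune candidates containing an infrequent length-k subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k))}
        # Count supports with one DB scan; eliminate infrequent candidates.
        Lk = {c for c in candidates if support(c) >= minsup}
        frequent |= Lk
        k += 1
    return frequent

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Butter", "Eggs"},
    {"Milk", "Diaper", "Butter", "Coke"},
    {"Bread", "Milk", "Diaper", "Butter"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
freq = apriori(transactions, minsup=3)
print(len(freq))  # 8: four frequent items and four frequent pairs
```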



Important Details of Apriori
• How to generate candidates?
• Step 1: self-joining Lk
• Step 2: pruning
• Example of Candidate-generation
• L3={abc, abd, acd, ace, bcd}
• Self-joining: L3*L3
• abcd from abc and abd
• acde from acd and ace
• Pruning:
• acde is removed because ade is not in L3
• C4={abcd}
• How to count supports of candidates?
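The self-joining and pruning steps illustrated above (L3 * L3) can be sketched as:

```python
from itertools import combinations

def candidate_gen(Lk, k):
    """Self-join Lk with itself, then prune candidates that have an
    infrequent length-k subset (the two steps shown above)."""
    Lk = sorted(Lk)
    joined = set()
    for a in Lk:
        for b in Lk:
            # Join itemsets that agree on their first k-1 items.
            if a[:k - 1] == b[:k - 1] and a[k - 1] < b[k - 1]:
                joined.add(a + (b[k - 1],))
    Lset = set(Lk)
    return {c for c in joined
            if all(s in Lset for s in combinations(c, k))}

L3 = [tuple(s) for s in ("abc", "abd", "acd", "ace", "bcd")]
C4 = candidate_gen(L3, 3)
print(C4)  # {('a', 'b', 'c', 'd')}; acde was pruned since ade is not in L3
```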

Illustrating Apriori Principle
If every subset is considered:
C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41 candidates
With support-based pruning:
6 + 6 + 1 = 13 candidates

Minimum Support = 3

Items (1-itemsets):
Item     Count
Bread    4
Coke     2
Milk     4
Butter   3
Diaper   4
Eggs     1

Pairs (2-itemsets):
(No need to generate candidates involving Coke or Eggs)
Itemset            Count
{Bread, Milk}      3
{Bread, Butter}    2
{Bread, Diaper}    3
{Milk, Butter}     2
{Milk, Diaper}     3
{Butter, Diaper}   3

Triplets (3-itemsets):
Itemset                  Count
{Bread, Milk, Diaper}    2

TID  Items
1    Bread, Milk
2    Bread, Diaper, Butter, Eggs
3    Milk, Diaper, Butter, Coke
4    Bread, Milk, Diaper, Butter
5    Bread, Milk, Diaper, Coke
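The candidate counts on this slide can be reproduced directly, as a sketch, with supports computed against the five transactions above:

```python
from itertools import combinations
from math import comb

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Butter", "Eggs"},
    {"Milk", "Diaper", "Butter", "Coke"},
    {"Bread", "Milk", "Diaper", "Butter"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
MINSUP = 3

def support(itemset):
    return sum(1 for t in transactions if set(itemset) <= t)

# Without pruning: every itemset of size 1..3 over the 6 items is a candidate.
brute = comb(6, 1) + comb(6, 2) + comb(6, 3)                 # 6 + 15 + 20 = 41

# With support-based pruning, only 6 + 6 + 1 = 13 candidates are ever counted.
items = sorted(set().union(*transactions))                    # 6 single items
L1 = [i for i in items if support([i]) >= MINSUP]             # 4 frequent items
pairs = list(combinations(L1, 2))                             # 6 candidate pairs
L2 = set(p for p in pairs if support(p) >= MINSUP)            # 4 frequent pairs
triples = {tuple(sorted(set(a) | set(b))) for a in L2 for b in L2
           if len(set(a) | set(b)) == 3}
triples = [t for t in triples
           if all(s in L2 for s in combinations(t, 2))]       # 1 candidate triple
counted = len(items) + len(pairs) + len(triples)
print(brute, counted)  # 41 13
```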


How to Count Supports of Candidates?

• Why is counting the supports of candidates a problem?
• The total number of candidates can be very large
• One transaction may contain many candidates
• Method:
• Candidate itemsets (with ordered items) are stored in a
hash-tree
• Leaf node of hash-tree contains a list of itemsets and counts
• Interior node contains a hash table
• Subset function: finds all the candidates contained in a
transaction

Association Mining: Hash Tree

[Figure: candidate hash tree storing 15 candidate 3-itemsets: 145, 124, 457,
125, 458, 159, 136, 345, 356, 357, 689, 367, 368, 234, 567. The hash function
sends items 1, 4, 7 to the left branch, items 2, 5, 8 to the middle branch,
and items 3, 6, 9 to the right branch. Here the first item is hashed on
1, 4 or 7.]


Association Mining: Hash Tree

[Figure: the same candidate hash tree, highlighting the branch reached by
hashing on items 2, 5 or 8.]



Association Mining: Hash Tree

[Figure: the same candidate hash tree, highlighting the branch reached by
hashing on items 3, 6 or 9.]



Subset Operation Using Hash Tree

[Figure: finding the candidates contained in transaction {1, 2, 3, 5, 6}. At
the root, the transaction is split into the prefixes 1 + {2 3 5 6},
2 + {3 5 6} and 3 + {5 6}, and each prefix is hashed down the corresponding
branch of the candidate hash tree.]



Subset Operation Using Hash Tree

[Figure: one level down, the branch for prefix 1 is expanded into
1 2 + {3 5 6}, 1 3 + {5 6} and 1 5 + {6}, hashing on the second item.]



Subset Operation Using Hash Tree

[Figure: continuing the recursion down to the leaves. The traversal visits
5 out of 9 leaf nodes and matches the transaction against 11 out of 15
candidates.]
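For comparison, a brute-force check enumerates all C(5,3) = 10 size-3 subsets of the transaction and tests each against the candidate set; the hash tree reaches the same matches while visiting only 5 of the 9 leaves:

```python
from itertools import combinations

# The 15 candidate 3-itemsets stored in the hash tree above.
candidates = {frozenset(map(int, c)) for c in
              "145 124 457 125 458 159 136 345 356 357 689 367 368 234 567".split()}

transaction = [1, 2, 3, 5, 6]

# Brute force: intersect all size-3 subsets of the transaction with the
# candidate set.
matched = {frozenset(s) for s in combinations(transaction, 3)} & candidates
print(sorted(sorted(m) for m in matched))  # [[1, 2, 5], [1, 3, 6], [3, 5, 6]]
```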


Factors Affecting Complexity

• Choice of minimum support threshold


• lowering support threshold results in more frequent itemsets
• this may increase number of candidates and max length of frequent
itemsets
• Dimensionality (number of items) of the data set
• more space is needed to store support count of each item
• if number of frequent items also increases, both computation and I/O
costs may also increase
• Size of database
• since Apriori makes multiple passes, run time of algorithm may increase
with number of transactions
• Average transaction width
• transaction width increases with denser data sets
• this may increase the max length of frequent itemsets and the number of
hash tree traversals (the number of subsets in a transaction increases with
its width)

Closed Patterns and Max-Patterns
• A long pattern contains a combinatorial number of sub-patterns; e.g.,
{a1, …, a100} contains C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈
1.27 × 10^30 sub-patterns!
• Solution: Mine closed frequent patterns and maximal frequent patterns
instead
• An itemset X is closed if X is frequent and there exists no super-pattern
Y ⊃ X with the same support as X
• An itemset X is a maximal pattern if X is frequent and there exists no
frequent super-pattern Y ⊃ X
• A closed pattern is a lossless compression of the frequent patterns
• It reduces the number of patterns and rules

Closed Patterns and Max-Patterns

• Exercise. DB = {<a1, …, a100>, <a1, …, a50>}
• Min_sup = 1
• What is the set of closed itemsets?
  • <a1, …, a100>: 1
  • <a1, …, a50>: 2
• What is the set of maximal patterns?
  • <a1, …, a100>: 1
• How many patterns are there in all?
  • 2^100 − 1 ≈ 1.27 × 10^30
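The exercise can be checked programmatically, using the standard fact that the closed itemsets of a DB are exactly the intersections of non-empty groups of transactions (with two transactions there are only three such groups, and two of them coincide):

```python
# Closed itemsets of a DB are exactly the intersections of non-empty groups
# of transactions. With two transactions there are only three such groups.
t1 = frozenset(range(1, 101))     # <a1, ..., a100>, support 1
t2 = frozenset(range(1, 51))      # <a1, ..., a50>,  support 2 (t1 contains it)

closed = {t1: 1, t1 & t2: 2}      # t1 & t2 == t2, so there are 2 closed sets
maximal = {t1}                    # only <a1, ..., a100> has no frequent superset

# Total number of non-empty patterns at min_sup = 1:
print(2 ** 100 - 1)  # 1267650600228229401496703205375, about 1.27 * 10^30
```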

Maximal vs Closed Itemsets

TID  Items
1    ABC
2    ABCD
3    BCE
4    ACDE
5    DE

[Figure: itemset lattice over {A, B, C, D, E}, with each node annotated by
the IDs of the transactions that contain it (e.g. AB: 1,2; C: 1,2,3,4;
ABCD: 2). ABCDE is not supported by any transaction.]


Maximal vs Closed Frequent Itemsets

[Figure: the same itemset lattice with minimum support = 2. Closed itemsets
are marked either "closed but not maximal" (e.g. C, support 4) or "closed and
maximal" (e.g. ABC, support 2). In total: # Closed = 9, # Maximal = 4.]
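The counts # Closed = 9 and # Maximal = 4 can be reproduced by brute force over this five-transaction DB. As a sketch, checking immediate supersets (one extra item) suffices for both tests because support is anti-monotone:

```python
from itertools import combinations

# Transactions from the lattice example above; minimum support = 2.
transactions = [set("ABC"), set("ABCD"), set("BCE"), set("ACDE"), set("DE")]
MINSUP = 2

def support(itemset):
    return sum(1 for t in transactions if itemset <= t)

items = set().union(*transactions)
frequent = {frozenset(c): support(set(c))
            for k in range(1, len(items) + 1)
            for c in combinations(sorted(items), k)
            if support(set(c)) >= MINSUP}

# Closed: no immediate superset has the same support.
closed = {x for x, s in frequent.items()
          if not any(support(x | {i}) == s for i in items - x)}
# Maximal: no immediate superset is frequent.
maximal = {x for x in frequent
           if not any(frozenset(x | {i}) in frequent for i in items - x)}
print(len(closed), len(maximal))  # 9 4
```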


Maximal vs Closed Itemsets

Frequent
Itemsets

Closed
Frequent
Itemsets

Maximal
Frequent
Itemsets



Alternative Methods for Frequent Itemset Generation
• Representation of the database: horizontal vs vertical data layout

Horizontal Data Layout (TID → items):
TID  Items
1    A, B, E
2    B, C, D
3    C, E
4    A, C, D
5    A, B, C, D
6    A, E
7    A, B
8    A, B, C
9    A, C, D
10   B

Vertical Data Layout (item → TID-list):
Item  TIDs
A     1, 4, 5, 6, 7, 8, 9
B     1, 2, 5, 7, 8, 10
C     2, 3, 4, 8, 9
D     2, 4, 5, 9
E     1, 3, 6
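A sketch of support counting under the vertical layout (the idea behind algorithms such as ECLAT): the support of an itemset is the size of the intersection of its items' TID-lists, so no pass over the original transactions is needed:

```python
# TID-lists for each item, read off the vertical layout above.
tidlists = {
    "A": {1, 4, 5, 6, 7, 8, 9},
    "B": {1, 2, 5, 7, 8, 10},
    "C": {2, 3, 4, 8, 9},
    "D": {2, 4, 5, 9},
    "E": {1, 3, 6},
}

def support(items):
    """Support = size of the intersection of the items' TID-lists."""
    return len(set.intersection(*(tidlists[i] for i in items)))

print(support("AB"))   # 4 (transactions 1, 5, 7, 8)
print(support("ABD"))  # 1 (transaction 5)
```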


Prescribed Text Books

Author(s), Title, Edition, Publishing House

T1  Tan P. N., Steinbach M. & Kumar V., "Introduction to Data Mining", Pearson Education
T2  Han J., Kamber M. & Pei J., "Data Mining: Concepts and Techniques", 3rd Edition, Morgan Kaufmann
R1  Kotu V. & Deshpande B., "Predictive Analytics and Data Mining: Concepts and Practice with RapidMiner", Morgan Kaufmann

