
Lecture 06 & 07 – Association Rule Mining

Anaum Hamid

https://sites.google.com/site/anaumhamid/data-mining/lectures
Gentle Reminder

"Switch off" your mobile phone or switch it to "silent mode".
Agenda

- The Basics
  - Market Basket Analysis
  - Frequent Itemsets
  - Association Rules
- Frequent Itemset Mining Methods
  - Apriori Algorithm
  - Generating Association Rules from Frequent Itemsets
  - FP-Growth
- Pattern Evaluation Methods
Basic Concepts

Which items are frequently purchased together by customers?
Basket Data Analysis

Transaction database:
D = {{butter, bread, milk, sugar};
     {butter, flour, milk, sugar};
     {butter, eggs, milk, salt};
     {eggs};
     {butter, flour, milk, salt, sugar}}

Question of interest: which items are bought together frequently?

Itemset frequencies in D:

  Itemset                  Frequency
  {butter}                 4
  {milk}                   4
  {butter, milk}           4
  {sugar}                  3
  {butter, sugar}          3
  {milk, sugar}            3
  {butter, milk, sugar}    3
  {eggs}                   2

Applications:
- Improved store layout
- Cross marketing
- Customer shopping behavior analysis
- Catalogue design
- Focused attached mailings / add-on sales
- Maintenance agreements (what should the store do to boost maintenance agreement sales?)
- Home electronics (what other products should the store stock up on?)
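As a quick illustration (a sketch, not part of the original slides), the frequencies in the table above can be reproduced with a few lines of Python; the variable names are illustrative.

from itertools import combinations
from collections import Counter

D = [
    {"butter", "bread", "milk", "sugar"},
    {"butter", "flour", "milk", "sugar"},
    {"butter", "eggs", "milk", "salt"},
    {"eggs"},
    {"butter", "flour", "milk", "salt", "sugar"},
]

counts = Counter()
for basket in D:
    # count every non-empty subset of the basket up to size 3
    for k in range(1, 4):
        for itemset in combinations(sorted(basket), k):
            counts[itemset] += 1

print(counts[("butter", "milk")])           # 4
print(counts[("butter", "milk", "sugar")])  # 3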
Basic Concepts – A Real Example

How should a store place software (SW), hardware (HW), and accessories?
Basic Concepts – Frequent Itemsets

Key terms:
- Transaction: a set of items bought together, identified by a TID.
- Transaction dataset: the set D of all transactions.
- Itemset: a set of one or more items; a k-itemset contains k items.
- Occurrence frequency (support count): the number of transactions that contain the itemset.
- Frequent itemset: an itemset whose occurrence frequency is at least the min_support count threshold.
Basic Concepts – Association Rules

- If the frequency of itemset I satisfies the min_support count, then I is a frequent itemset.
- If a rule satisfies both the min_support and min_confidence thresholds, it is said to be strong.
- The problem of mining association rules therefore reduces to mining frequent itemsets.
- Association rule mining becomes a two-step process:
  1. Find all frequent itemsets, i.e. itemsets with frequency ≥ a predetermined min_support count (the most costly step).
  2. Generate strong association rules from the frequent itemsets, i.e. rules that also satisfy min_confidence.
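A minimal sketch of the two threshold checks (not from the slides), assuming support counts are stored in a dictionary keyed by sorted item tuples as in the counting sketch above; function and parameter names are illustrative.

def is_frequent(itemset, counts, min_support_count):
    return counts[tuple(sorted(itemset))] >= min_support_count

def is_strong(antecedent, consequent, counts, min_support_count, min_confidence):
    union = tuple(sorted(set(antecedent) | set(consequent)))
    if counts[union] < min_support_count:
        return False                      # the rule's itemset must be frequent
    confidence = counts[union] / counts[tuple(sorted(antecedent))]
    return confidence >= min_confidence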
Basic Concepts – Association Rules

- If the min_support count is set too low, a huge number of frequent itemsets results.
- A frequent itemset of length 100 contains C(100,1) = 100 frequent 1-itemsets, C(100,2) frequent 2-itemsets, and so on.
- The total number of nonempty subsets contained in an itemset of 100 items is
  C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27 × 10^30,
  which is too large to enumerate even for computers.
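The size of this sum can be checked directly; the snippet below is illustrative and assumes Python 3.8+ for math.comb.

import math

total = sum(math.comb(100, k) for k in range(1, 101))
assert total == 2**100 - 1
print(f"{total:.3e}")   # ~1.268e+30 -> far too many candidates to enumerate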
Agenda – Frequent Itemset Mining Methods

- Apriori Algorithm
- Generating Association Rules from Frequent Itemsets
- FP-Growth
Mining Frequent Itemsets – Apriori Algorithm

- Finds frequent itemsets by exploiting prior knowledge of frequent itemset properties.
- A level-wise search, where frequent k-itemsets are used to explore (k+1)-itemsets.
- The search proceeds as follows:
  1. Find the frequent 1-itemsets → L1.
  2. Use L1 to find the frequent 2-itemsets → L2.
  3. … continue until no more frequent k-itemsets can be found.
- Each Lk requires a full scan of the dataset.
- To improve efficiency, use the Apriori property: "All nonempty subsets of a frequent itemset must also be frequent." If a set cannot pass the test, all of its supersets will fail the same test as well: if P(I) < min_support then P(I ∪ A) < min_support.
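As a rough illustration of this level-wise search, here is a compact sketch (an assumption, not the slides' pseudocode), with transactions given as Python sets and itemsets kept as frozensets.

from itertools import combinations

def apriori(transactions, min_support_count):
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    # L1: frequent 1-itemsets
    Lk = {frozenset([i]) for i in items
          if sum(i in t for t in transactions) >= min_support_count}
    frequent = set(Lk)
    while Lk:
        k = len(next(iter(Lk))) + 1
        # join step: combine itemsets from the previous level
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # prune step (Apriori property): every (k-1)-subset must be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        # one full dataset scan per level to count candidate support
        Lk = {c for c in candidates
              if sum(c <= t for t in transactions) >= min_support_count}
        frequent |= Lk
    return frequent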
Mining Frequent Itemsets – Apriori Algorithm

Transactional data example (N = 9, min_supp count = 2):

  TID    List of items
  T100   I1, I2, I5
  T200   I2, I4
  T300   I2, I3
  T400   I1, I2, I4
  T500   I1, I3
  T600   I2, I3
  T700   I1, I3
  T800   I1, I2, I3, I5
  T900   I1, I2, I3

Scan the dataset for the count of each candidate, then compare each candidate's support with min_supp:

  C1 Itemset   Support count        L1 Itemset   Support count
  {I1}         6                    {I1}         6
  {I2}         7                    {I2}         7
  {I3}         6                    {I3}         6
  {I4}         2                    {I4}         2
  {I5}         2                    {I5}         2
Mining Frequent Itemsets – Apriori Algorithm

Generate the C2 candidates from L1 by joining L1 ⋈ L1, scan the dataset for the count of each candidate, then compare candidate support with min_supp:

  C2 Itemset    C2 Itemset   Support count    L2 Itemset   Support count
  {I1, I2}      {I1, I2}     4                {I1, I2}     4
  {I1, I3}      {I1, I3}     4                {I1, I3}     4
  {I1, I4}      {I1, I4}     1                {I1, I5}     2
  {I1, I5}      {I1, I5}     2                {I2, I3}     4
  {I2, I3}      {I2, I3}     4                {I2, I4}     2
  {I2, I4}      {I2, I4}     2                {I2, I5}     2
  {I2, I5}      {I2, I5}     2
  {I3, I4}      {I3, I4}     0
  {I3, I5}      {I3, I5}     1
  {I4, I5}      {I4, I5}     0
Mining Frequent Itemsets – Apriori Algorithm

Generate the C3 candidates from L2 by joining L2 ⋈ L2:

  C3 = L2 ⋈ L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}

Two joining (lexicographically ordered) k-itemsets must share their first k−1 items, so {I1, I2} is not joined with {I2, I4}. Not all subsets of the joined candidates are frequent → prune using the Apriori property. Scan the dataset for the count of each remaining candidate, then compare candidate support with min_supp:

  C3 Itemset     Support count     L3 Itemset     Support count
  {I1, I2, I3}   2                 {I1, I2, I3}   2
  {I1, I2, I5}   2                 {I1, I2, I5}   2

Joining L3 ⋈ L3 yields the candidate {I1, I2, I3, I5}; not all of its subsets are frequent → prune (Apriori property), so C4 = φ → terminate.
The Apriori Algorithm – Exercise

Database TDB (min_sup = 2):

  TID   Items
  10    A, C, D
  20    B, C, E
  30    A, B, C, E
  40    B, E
The Apriori Algorithm – Exercise (Solution)

min_sup = 2

1st scan (C1 → L1):

  C1 Itemset   sup        L1 Itemset   sup
  {A}          2          {A}          2
  {B}          3          {B}          3
  {C}          3          {C}          3
  {D}          1          {E}          3
  {E}          3

2nd scan (C2 → L2):

  C2 Itemset   sup        L2 Itemset   sup
  {A, B}       1          {A, C}       2
  {A, C}       2          {B, C}       2
  {A, E}       1          {B, E}       3
  {B, C}       2          {C, E}       2
  {B, E}       3
  {C, E}       2

3rd scan (C3 → L3):

  C3 Itemset    L3 Itemset   sup
  {B, C, E}     {B, C, E}    2

C4 = φ → terminate.
Mining Frequent Itemsets – Apriori Algorithm

In summary, each pass generates Ck from Lk−1 in order to find Lk, in two steps:
- Join step: generate candidates by joining Lk−1 with itself.
- Prune step: remove candidates that have an infrequent (k−1)-subset.
Mining Frequent Itemsets – Generating Association Rules from Frequent Itemsets

  confidence(A ⇒ B) = P(B|A) = support_count(A ∪ B) / support_count(A)
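A sketch of rule generation based on this formula (not the slides' own code); it assumes support_count is a dictionary from frozensets to counts, and the function name is illustrative.

from itertools import combinations

def rules_from_itemset(itemset, support_count, min_confidence):
    itemset = frozenset(itemset)
    rules = []
    for r in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(itemset, r)):
            consequent = itemset - antecedent
            conf = support_count[itemset] / support_count[antecedent]
            if conf >= min_confidence:
                rules.append((set(antecedent), set(consequent), conf))
    return rules

# Support counts taken from the running example (N = 9):
support_count = {
    frozenset(s): c for s, c in [
        ({"I1"}, 6), ({"I2"}, 7), ({"I5"}, 2),
        ({"I1", "I2"}, 4), ({"I1", "I5"}, 2), ({"I2", "I5"}, 2),
        ({"I1", "I2", "I5"}, 2),
    ]
}
print(rules_from_itemset({"I1", "I2", "I5"}, support_count, 0.7))
# -> the three rules with 100% confidence (see the worked example below)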
Mining Frequent Itemsets – Generating Association Rules from Frequent Itemsets

Using the transactional data example above (N = 9), take the frequent itemsets {I1, I2, I3}: 2 and {I1, I2, I5}: 2. For {I1, I2, I5}, the nonempty proper subsets and the confidences of the corresponding rules are:

  Nonempty subset   Confidence
  {I1, I2}          2/4 = 50%
  {I1, I5}          2/2 = 100%
  {I2, I5}          2/2 = 100%
  {I1}              2/6 = 33%
  {I2}              2/7 = 29%
  {I5}              2/2 = 100%

For a min_confidence of 70%, only the rules with 100% confidence are strong.
Important Details of Apriori

- How are candidates generated? (See the sketch after this list.)
  - Step 1: self-join Lk
  - Step 2: prune
- Example of candidate generation:
  - L3 = {abc, abd, acd, ace, bcd}
  - Self-join: L3 * L3
    - abcd from abc and abd
    - acde from acd and ace
  - Prune:
    - acde is removed because ade is not in L3
  - C4 = {abcd}
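A possible rendering of this join-and-prune step in Python (illustrative, not the slides' pseudocode); itemsets are represented as lexicographically sorted tuples.

from itertools import combinations

def gen_candidates(Lk):
    """Self-join k-itemsets sharing their first k-1 items, then prune."""
    Lset = set(Lk)
    Lk = sorted(Lset)
    k = len(Lk[0])
    joined = {a[:k - 1] + (a[k - 1], b[k - 1])
              for a in Lk for b in Lk
              if a[:k - 1] == b[:k - 1] and a[k - 1] < b[k - 1]}
    # prune: every k-subset of a candidate must itself be frequent (in Lk)
    return {c for c in joined if all(s in Lset for s in combinations(c, k))}

L3 = {("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"),
      ("a", "c", "e"), ("b", "c", "d")}
print(gen_candidates(L3))   # {('a', 'b', 'c', 'd')}; acde is pruned (ade not in L3)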
Methods to Improve Apriori's Efficiency

- Hash-based itemset counting: a k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent.
- Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans.
- Partitioning: any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB.
- Sampling: mine on a subset of the given data with a lower support threshold, plus a method to verify completeness.
- Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent.
Bottleneck of Frequent-Pattern Mining

- The core of the Apriori algorithm:
  - Use frequent (k−1)-itemsets to generate candidate frequent k-itemsets.
  - Use database scans and pattern matching to collect counts for the candidate itemsets.
- Multiple database scans are costly:
  - Needs (n + 1) scans, where n is the length of the longest pattern.
- Mining long patterns needs many passes of scanning and generates lots of candidates:
  - To find the frequent itemset i1 i2 … i100:
    - Number of scans: 100
    - Number of candidates: C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27 × 10^30!
- Bottleneck: candidate generation and test.
- Can we avoid candidate generation?
Mining Frequent Patterns Without Candidate Generation

- Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure:
  - Highly condensed, but complete for frequent pattern mining.
  - Avoids costly database scans.
- Develop an efficient FP-tree-based frequent pattern mining method:
  - A divide-and-conquer methodology: decompose mining tasks into smaller ones.
  - Avoid candidate generation: sub-database test only!
Mining Frequent Patterns Without Candidate Generation

- Grow long patterns from short ones using local frequent items:
  - "abc" is a frequent pattern.
  - Get all transactions containing "abc": DB|abc.
  - If "d" is a local frequent item in DB|abc, then abcd is a frequent pattern.
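The slides build the FP-tree graphically; as a rough sketch (an assumption, not the textbook's exact data structure), a node of such a tree could be represented as follows. The field names are illustrative and are reused in the later sketches.

class FPNode:
    def __init__(self, item=None, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent        # for climbing prefix paths bottom-up
        self.children = {}          # item -> FPNode, so common prefixes are shared
        self.node_link = None       # next node carrying the same item (header table chain)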
Mining Frequent Itemsets – FP-Growth

Transactional data example (N = 9, min_supp count = 2). Scan the dataset for the count of each candidate, compare candidate support with min_supp, then reorder L1 by descending support count:

  TID    List of items         C1 Itemset   Support count     L1 (reordered)   Support count
  T100   I1, I2, I5            {I1}         6                 {I2}             7
  T200   I2, I4                {I2}         7                 {I1}             6
  T300   I2, I3                {I3}         6                 {I3}             6
  T400   I1, I2, I4            {I4}         2                 {I4}             2
  T500   I1, I3                {I5}         2                 {I5}             2
  T600   I2, I3
  T700   I1, I3
  T800   I1, I2, I3, I5
  T900   I1, I2, I3
Mining Frequent Itemsets – FP-Growth – FP-tree Construction

Start the FP-tree with a null root { }.

  L1 (reordered)   Support count
  {I2}             7
  {I1}             6
  {I3}             6
  {I4}             2
  {I5}             2
Mining Frequent Itemsets – FP-Growth – FP-tree Construction

Transactions with items reordered according to L1:

  TID    List of items
  T100   I2, I1, I5
  T200   I2, I4
  T300   I2, I3
  T400   I2, I1, I4
  T500   I1, I3
  T600   I2, I3
  T700   I1, I3
  T800   I2, I1, I3, I5
  T900   I2, I1, I3

Insert T100: null { } → I2:1 → I1:1 → I5:1

The order of items is kept throughout path construction, with common prefixes shared whenever applicable.
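A short sketch of this insertion step (an assumption, not the slides' exact procedure), reusing the FPNode class sketched earlier; `header` is an illustrative dictionary mapping each item to the head of its node-link chain.

def insert_transaction(root, ordered_items, header):
    node = root
    for item in ordered_items:
        child = node.children.get(item)
        if child is None:
            child = FPNode(item, parent=node)
            node.children[item] = child
            # thread the new node onto the item's node-link chain
            child.node_link = header.get(item)
            header[item] = child
        child.count += 1
        node = child

# e.g.  root, header = FPNode(), {}
#       insert_transaction(root, ["I2", "I1", "I5"], header)   # T100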
Mining Frequent Itemsets – FP-Growth – FP-tree Construction

Insert T200 (I2, I4): the I2 prefix is shared, so the count of I2 becomes 2 and a new branch I4:1 is added under I2.

Insert T300 (I2, I3): the I2 prefix is shared again, so the count of I2 becomes 3 and a new branch I3:1 is added under I2.

The remaining transactions T400–T900 are inserted in the same way.
Mining Frequent Itemsets – FP-Growth – FP-tree Construction

The completed FP-tree. Node links from the header table support tree traversal: trace the node-link path for each header entry and you get that item's support count.

  null { }
  ├── I2:7
  │   ├── I1:4
  │   │   ├── I5:1
  │   │   ├── I4:1
  │   │   └── I3:2
  │   │       └── I5:1
  │   ├── I4:1
  │   └── I3:2
  └── I1:2
      └── I3:2

  Header table (L1 reordered, with node links):
  Itemset   Support count
  {I2}      7
  {I1}      6
  {I3}      6
  {I4}      2
  {I5}      2
Mining Frequent Itemsets – FP-Growth – Frequent Pattern Mining

FP-Growth is a bottom-up algorithm: start from the leaves (the least frequent header items) and work up toward the root. I5, for example, has two paths to the root: (I2, I1, I5: 1) and (I2, I1, I3, I5: 1). Note that {I3, I5} has a frequency below the min_support count threshold.
Mining Frequent Itemsets – FP-Growth – Conditional FP-tree Construction

For I5: keep only the transactions that contain I5 (T100 and T800) and eliminate I5 itself. This gives I5's conditional pattern base {(I2, I1: 1), (I2, I1, I3: 1)} and, as the slides build it up, the tree

  null { } → I2:2 → I1:2 → I3:1

Since I3 appears only once here, it is infrequent in this conditional tree and is dropped, leaving the conditional FP-tree ⟨I2:2, I1:2⟩. The frequent patterns generated for I5 are {I2, I5: 2}, {I1, I5: 2}, and {I2, I1, I5: 2}.
Mining Frequent Itemsets – FP-Growth

  Item   Conditional Pattern Base           Conditional FP-tree      Frequent Patterns Generated
  I5     {{I2, I1: 1}, {I2, I1, I3: 1}}     ⟨I2:2, I1:2⟩             {I2, I5: 2}, {I1, I5: 2}, {I2, I1, I5: 2}
  I4     {{I2, I1: 1}, {I2: 1}}             ⟨I2:2⟩                   {I2, I4: 2}
  I3     {{I2, I1: 2}, {I2: 1}, {I1: 2}}    ⟨I2:4, I1:2⟩, ⟨I1:2⟩     {I2, I3: 4}, {I1, I3: 4}, {I2, I1, I3: 2}
  I1     {{I2: 4}}                          ⟨I2:4⟩                   {I2, I1: 4}

(The conditional pattern base lists the prefix paths to each item, i.e. the paths for which the item is a suffix, after eliminating infrequent items.)
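A sketch of how a conditional pattern base could be collected from the FP-tree (illustrative names, assuming the FPNode fields and header dictionary sketched earlier): follow the item's node-link chain and climb parent pointers up to the root.

def conditional_pattern_base(item, header):
    base = []
    node = header.get(item)
    while node is not None:
        path, parent = [], node.parent
        while parent is not None and parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        if path:
            base.append((list(reversed(path)), node.count))
        node = node.node_link
    return base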
Mining Frequent Itemsets – FP-Growth – Conditional FP-tree Construction

For I4: keep only the transactions containing I4 (T200 and T400) and eliminate I4, giving the conditional pattern base {(I2, I1: 1), (I2: 1)}; after removing the infrequent I1, the conditional FP-tree is ⟨I2:2⟩.

For I3: keep only the transactions containing I3 and eliminate I3, giving the conditional pattern base {(I2, I1: 2), (I2: 1), (I1: 2)} and the conditional FP-tree ⟨I2:4, I1:2⟩, ⟨I1:2⟩.
Exercise: Construct the FP-Tree

  TID   Items bought
  100   {f, a, c, d, g, i, m, p}
  200   {a, b, c, f, l, m, o}
  300   {b, f, h, j, o, w}
  400   {b, c, k, s, p}
  500   {a, f, c, e, l, p, m, n}

min_support = 3
Exercise: Construct the FP-Tree (Solution)

min_support = 3

  TID   Items bought               (ordered) frequent items
  100   {f, a, c, d, g, i, m, p}   {f, c, a, m, p}
  200   {a, b, c, f, l, m, o}      {f, c, a, b, m}
  300   {b, f, h, j, o, w}         {f, b}
  400   {b, c, k, s, p}            {c, b, p}
  500   {a, f, c, e, l, p, m, n}   {f, c, a, m, p}

Steps:
1. Scan the DB once and find the frequent 1-itemsets (single-item patterns).
2. Sort the frequent items in descending frequency order to get the f-list.
3. Scan the DB again and construct the FP-tree.

  Header table:
  Item   frequency
  f      4
  c      4
  a      3
  b      3
  m      3
  p      3

  FP-tree:
  {}
  ├── f:4
  │   ├── c:3
  │   │   └── a:3
  │   │       ├── m:2
  │   │       │   └── p:2
  │   │       └── b:1
  │   │           └── m:1
  │   └── b:1
  └── c:1
      └── b:1
          └── p:1

F-list = f-c-a-b-m-p
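A small sketch (not from the slides) of reordering a transaction by the f-list and dropping infrequent items before FP-tree insertion; the names flist, rank, and reorder are illustrative.

flist = ["f", "c", "a", "b", "m", "p"]            # from the header table above
rank = {item: i for i, item in enumerate(flist)}

def reorder(transaction):
    return sorted((i for i in transaction if i in rank), key=rank.get)

print(reorder({"a", "b", "c", "f", "l", "m", "o"}))   # ['f', 'c', 'a', 'b', 'm']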


Agenda – Pattern Evaluation Methods
Pattern Evaluation Methods

- Not all association rules are interesting:
  - buys(X, "computer games") ⇒ buys(X, "videos") [support = 40%, confidence = 66%]
  - But P("videos") is already 75% > 66%.
  - The two items are negatively associated: buying one decreases the likelihood of buying the other.
- We need to measure the "real strength" of a rule.
- Correlation analysis: A ⇒ B [support, confidence, correlation]
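A quick numeric check of the example above. The transaction counts used here (10,000 total, 6,000 with computer games, 7,500 with videos, 4,000 with both) are an assumption chosen to be consistent with the 40% / 66% / 75% figures on the slide; the lift measure is used as the correlation check.

n, games, videos, both = 10_000, 6_000, 7_500, 4_000
support    = both / n                    # 0.40
confidence = both / games                # ~0.667
lift       = confidence / (videos / n)   # ~0.89 < 1 -> negatively correlated
print(round(support, 2), round(confidence, 2), round(lift, 2))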
