
Lecture 06 & 07 – Association Rule Mining

Anaum Hamid

https://sites.google.com/site/anaumhamid/data-mining/lectures
Gentle Reminder

"Switch off" your mobile phone or switch it to "silent mode".
Agenda

- The Basics
  - Market Basket Analysis
  - Frequent Itemsets
  - Association Rules
- Frequent Itemset Mining Methods
  - Apriori Algorithm
  - Generating Association Rules from Frequent Itemsets
  - FP-Growth
- Pattern Evaluation Methods
Basic Concepts

Which items are frequently purchased together by customers?
Basket Data Analysis

Transaction database:
D = {{butter, bread, milk, sugar};
     {butter, flour, milk, sugar};
     {butter, eggs, milk, salt};
     {eggs};
     {butter, flour, milk, salt, sugar}}

Question of interest: which items are bought together frequently?

Itemset frequencies in D:

  Itemset                  Frequency
  {butter}                 4
  {milk}                   4
  {butter, milk}           4
  {sugar}                  3
  {butter, sugar}          3
  {milk, sugar}            3
  {butter, milk, sugar}    3
  {eggs}                   2

Applications:
- Improved store layout
- Cross marketing
- Customer shopping behavior analysis
- Catalogue design
- Focused attached mailings / add-on sales
- Maintenance agreements (what should the store do to boost maintenance agreement sales?)
- Home electronics (what other products should the store stock up on?)
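As a quick illustration (a sketch, not part of the original slides), the frequencies in the table above can be reproduced with a few lines of Python; the variable names are illustrative.

from itertools import combinations
from collections import Counter

D = [
    {"butter", "bread", "milk", "sugar"},
    {"butter", "flour", "milk", "sugar"},
    {"butter", "eggs", "milk", "salt"},
    {"eggs"},
    {"butter", "flour", "milk", "salt", "sugar"},
]

counts = Counter()
for basket in D:
    # count every non-empty subset of the basket up to size 3
    for k in range(1, 4):
        for itemset in combinations(sorted(basket), k):
            counts[itemset] += 1

print(counts[("butter", "milk")])           # 4
print(counts[("butter", "milk", "sugar")])  # 3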
Basic Concepts – A Real Example

How should a store place software (SW), hardware (HW), and accessories?
Basic Concepts – Frequent Itemsets

Key terms:
- Transaction: a set of items bought together, identified by a TID.
- Transaction dataset: the set D of all transactions.
- Itemset: a set of one or more items; a k-itemset contains k items.
- Occurrence frequency (support count): the number of transactions that contain the itemset.
- Frequent itemset: an itemset whose occurrence frequency is at least the min_support count threshold.
Basic Concepts – Association Rules

- If the frequency of itemset I satisfies the min_support count, then I is a frequent itemset.
- If a rule satisfies both the min_support and min_confidence thresholds, it is said to be strong.
- The problem of mining association rules therefore reduces to mining frequent itemsets.
- Association rule mining becomes a two-step process:
  1. Find all frequent itemsets, i.e. itemsets with frequency ≥ a predetermined min_support count (the most costly step).
  2. Generate strong association rules from the frequent itemsets, i.e. rules that also satisfy min_confidence.
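A minimal sketch of the two threshold checks (not from the slides), assuming support counts are stored in a dictionary keyed by sorted item tuples as in the counting sketch above; function and parameter names are illustrative.

def is_frequent(itemset, counts, min_support_count):
    return counts[tuple(sorted(itemset))] >= min_support_count

def is_strong(antecedent, consequent, counts, min_support_count, min_confidence):
    union = tuple(sorted(set(antecedent) | set(consequent)))
    if counts[union] < min_support_count:
        return False                      # the rule's itemset must be frequent
    confidence = counts[union] / counts[tuple(sorted(antecedent))]
    return confidence >= min_confidence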
Basic Concepts – Association Rules

- If the min_support count is set too low, a huge number of frequent itemsets results.
- A frequent itemset of length 100 contains C(100,1) = 100 frequent 1-itemsets, C(100,2) frequent 2-itemsets, and so on.
- The total number of nonempty subsets contained in an itemset of 100 items is
  C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27 × 10^30,
  which is too large to enumerate even for computers.
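The size of this sum can be checked directly; the snippet below is illustrative and assumes Python 3.8+ for math.comb.

import math

total = sum(math.comb(100, k) for k in range(1, 101))
assert total == 2**100 - 1
print(f"{total:.3e}")   # ~1.268e+30 -> far too many candidates to enumerate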
Agenda – Frequent Itemset Mining Methods

- Apriori Algorithm
- Generating Association Rules from Frequent Itemsets
- FP-Growth
Mining Frequent Itemsets – Apriori Algorithm

- Finds frequent itemsets by exploiting prior knowledge of frequent itemset properties.
- A level-wise search, where frequent k-itemsets are used to explore (k+1)-itemsets.
- The search proceeds as follows:
  1. Find the frequent 1-itemsets → L1.
  2. Use L1 to find the frequent 2-itemsets → L2.
  3. … continue until no more frequent k-itemsets can be found.
- Each Lk requires a full scan of the dataset.
- To improve efficiency, use the Apriori property: "All nonempty subsets of a frequent itemset must also be frequent." If a set cannot pass the test, all of its supersets will fail the same test as well: if P(I) < min_support then P(I ∪ A) < min_support.
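As a rough illustration of this level-wise search, here is a compact sketch (an assumption, not the slides' pseudocode), with transactions given as Python sets and itemsets kept as frozensets.

from itertools import combinations

def apriori(transactions, min_support_count):
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    # L1: frequent 1-itemsets
    Lk = {frozenset([i]) for i in items
          if sum(i in t for t in transactions) >= min_support_count}
    frequent = set(Lk)
    while Lk:
        k = len(next(iter(Lk))) + 1
        # join step: combine itemsets from the previous level
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # prune step (Apriori property): every (k-1)-subset must be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        # one full dataset scan per level to count candidate support
        Lk = {c for c in candidates
              if sum(c <= t for t in transactions) >= min_support_count}
        frequent |= Lk
    return frequent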
Mining Frequent Itemsets – Apriori Algorithm

Transactional data example (N = 9, min_supp count = 2):

  TID    List of items
  T100   I1, I2, I5
  T200   I2, I4
  T300   I2, I3
  T400   I1, I2, I4
  T500   I1, I3
  T600   I2, I3
  T700   I1, I3
  T800   I1, I2, I3, I5
  T900   I1, I2, I3

Scan the dataset for the count of each candidate, then compare each candidate's support with min_supp:

  C1 Itemset   Support count        L1 Itemset   Support count
  {I1}         6                    {I1}         6
  {I2}         7                    {I2}         7
  {I3}         6                    {I3}         6
  {I4}         2                    {I4}         2
  {I5}         2                    {I5}         2
Mining Frequent Itemsets – Apriori Algorithm

Generate the C2 candidates from L1 by joining L1 ⋈ L1, scan the dataset for the count of each candidate, then compare candidate support with min_supp:

  C2 Itemset    C2 Itemset   Support count    L2 Itemset   Support count
  {I1, I2}      {I1, I2}     4                {I1, I2}     4
  {I1, I3}      {I1, I3}     4                {I1, I3}     4
  {I1, I4}      {I1, I4}     1                {I1, I5}     2
  {I1, I5}      {I1, I5}     2                {I2, I3}     4
  {I2, I3}      {I2, I3}     4                {I2, I4}     2
  {I2, I4}      {I2, I4}     2                {I2, I5}     2
  {I2, I5}      {I2, I5}     2
  {I3, I4}      {I3, I4}     0
  {I3, I5}      {I3, I5}     1
  {I4, I5}      {I4, I5}     0
Mining Frequent Itemsets – Apriori Algorithm

Generate the C3 candidates from L2 by joining L2 ⋈ L2:

  C3 = L2 ⋈ L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}

Two joining (lexicographically ordered) k-itemsets must share their first k−1 items, so {I1, I2} is not joined with {I2, I4}. Not all subsets of the joined candidates are frequent → prune using the Apriori property. Scan the dataset for the count of each remaining candidate, then compare candidate support with min_supp:

  C3 Itemset     Support count     L3 Itemset     Support count
  {I1, I2, I3}   2                 {I1, I2, I3}   2
  {I1, I2, I5}   2                 {I1, I2, I5}   2

Joining L3 ⋈ L3 yields the candidate {I1, I2, I3, I5}; not all of its subsets are frequent → prune (Apriori property), so C4 = φ → terminate.
The Apriori Algorithm – Exercise

Database TDB (min_sup = 2):

  TID   Items
  10    A, C, D
  20    B, C, E
  30    A, B, C, E
  40    B, E
The Apriori Algorithm – Exercise (Solution)

min_sup = 2

1st scan (C1 → L1):

  C1 Itemset   sup        L1 Itemset   sup
  {A}          2          {A}          2
  {B}          3          {B}          3
  {C}          3          {C}          3
  {D}          1          {E}          3
  {E}          3

2nd scan (C2 → L2):

  C2 Itemset   sup        L2 Itemset   sup
  {A, B}       1          {A, C}       2
  {A, C}       2          {B, C}       2
  {A, E}       1          {B, E}       3
  {B, C}       2          {C, E}       2
  {B, E}       3
  {C, E}       2

3rd scan (C3 → L3):

  C3 Itemset    L3 Itemset   sup
  {B, C, E}     {B, C, E}    2

C4 = φ → terminate.
Mining Frequent Itemsets – Apriori Algorithm

In summary, each pass generates Ck from Lk−1 in order to find Lk, in two steps:
- Join step: generate candidates by joining Lk−1 with itself.
- Prune step: remove candidates that have an infrequent (k−1)-subset.
Mining Frequent Itemsets – Generating Association Rules from Frequent Itemsets

  confidence(A ⇒ B) = P(B|A) = support_count(A ∪ B) / support_count(A)
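A sketch of rule generation based on this formula (not the slides' own code); it assumes support_count is a dictionary from frozensets to counts, and the function name is illustrative.

from itertools import combinations

def rules_from_itemset(itemset, support_count, min_confidence):
    itemset = frozenset(itemset)
    rules = []
    for r in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(itemset, r)):
            consequent = itemset - antecedent
            conf = support_count[itemset] / support_count[antecedent]
            if conf >= min_confidence:
                rules.append((set(antecedent), set(consequent), conf))
    return rules

# Support counts taken from the running example (N = 9):
support_count = {
    frozenset(s): c for s, c in [
        ({"I1"}, 6), ({"I2"}, 7), ({"I5"}, 2),
        ({"I1", "I2"}, 4), ({"I1", "I5"}, 2), ({"I2", "I5"}, 2),
        ({"I1", "I2", "I5"}, 2),
    ]
}
print(rules_from_itemset({"I1", "I2", "I5"}, support_count, 0.7))
# -> the three rules with 100% confidence (see the worked example below)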
Mining Frequent Itemsets – Generating Association Rules from Frequent Itemsets

Using the transactional data example above (N = 9), take the frequent itemsets {I1, I2, I3}: 2 and {I1, I2, I5}: 2. For {I1, I2, I5}, the nonempty proper subsets and the confidences of the corresponding rules are:

  Nonempty subset   Confidence
  {I1, I2}          2/4 = 50%
  {I1, I5}          2/2 = 100%
  {I2, I5}          2/2 = 100%
  {I1}              2/6 = 33%
  {I2}              2/7 = 29%
  {I5}              2/2 = 100%

For a min_confidence of 70%, only the rules with 100% confidence are strong.
Important Details of Apriori

- How are candidates generated? (See the sketch after this list.)
  - Step 1: self-join Lk
  - Step 2: prune
- Example of candidate generation:
  - L3 = {abc, abd, acd, ace, bcd}
  - Self-join: L3 * L3
    - abcd from abc and abd
    - acde from acd and ace
  - Prune:
    - acde is removed because ade is not in L3
  - C4 = {abcd}
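A possible rendering of this join-and-prune step in Python (illustrative, not the slides' pseudocode); itemsets are represented as lexicographically sorted tuples.

from itertools import combinations

def gen_candidates(Lk):
    """Self-join k-itemsets sharing their first k-1 items, then prune."""
    Lset = set(Lk)
    Lk = sorted(Lset)
    k = len(Lk[0])
    joined = {a[:k - 1] + (a[k - 1], b[k - 1])
              for a in Lk for b in Lk
              if a[:k - 1] == b[:k - 1] and a[k - 1] < b[k - 1]}
    # prune: every k-subset of a candidate must itself be frequent (in Lk)
    return {c for c in joined if all(s in Lset for s in combinations(c, k))}

L3 = {("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"),
      ("a", "c", "e"), ("b", "c", "d")}
print(gen_candidates(L3))   # {('a', 'b', 'c', 'd')}; acde is pruned (ade not in L3)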
Methods to Improve Apriori's Efficiency

- Hash-based itemset counting: a k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent.
- Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans.
- Partitioning: any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB.
- Sampling: mine on a subset of the given data with a lower support threshold, plus a method to verify completeness.
- Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent.
Bottleneck of Frequent-Pattern Mining

- The core of the Apriori algorithm:
  - Use frequent (k−1)-itemsets to generate candidate frequent k-itemsets.
  - Use database scans and pattern matching to collect counts for the candidate itemsets.
- Multiple database scans are costly:
  - Needs (n + 1) scans, where n is the length of the longest pattern.
- Mining long patterns needs many passes of scanning and generates lots of candidates:
  - To find the frequent itemset i1 i2 … i100:
    - Number of scans: 100
    - Number of candidates: C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27 × 10^30!
- Bottleneck: candidate generation and test.
- Can we avoid candidate generation?
Mining Frequent Patterns Without Candidate Generation

- Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure:
  - Highly condensed, but complete for frequent pattern mining.
  - Avoids costly database scans.
- Develop an efficient FP-tree-based frequent pattern mining method:
  - A divide-and-conquer methodology: decompose mining tasks into smaller ones.
  - Avoid candidate generation: sub-database test only!
Mining Frequent Patterns Without Candidate Generation

- Grow long patterns from short ones using local frequent items:
  - "abc" is a frequent pattern.
  - Get all transactions containing "abc": DB|abc.
  - If "d" is a local frequent item in DB|abc, then abcd is a frequent pattern.
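The slides build the FP-tree graphically; as a rough sketch (an assumption, not the textbook's exact data structure), a node of such a tree could be represented as follows. The field names are illustrative and are reused in the later sketches.

class FPNode:
    def __init__(self, item=None, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent        # for climbing prefix paths bottom-up
        self.children = {}          # item -> FPNode, so common prefixes are shared
        self.node_link = None       # next node carrying the same item (header table chain)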
Mining Frequent Itemsets – FP-Growth

Transactional data example (N = 9, min_supp count = 2). Scan the dataset for the count of each candidate, compare candidate support with min_supp, then reorder L1 by descending support count:

  TID    List of items         C1 Itemset   Support count     L1 (reordered)   Support count
  T100   I1, I2, I5            {I1}         6                 {I2}             7
  T200   I2, I4                {I2}         7                 {I1}             6
  T300   I2, I3                {I3}         6                 {I3}             6
  T400   I1, I2, I4            {I4}         2                 {I4}             2
  T500   I1, I3                {I5}         2                 {I5}             2
  T600   I2, I3
  T700   I1, I3
  T800   I1, I2, I3, I5
  T900   I1, I2, I3
Mining Frequent Itemsets – FP-Growth – FP-tree Construction

Start the FP-tree with a null root { }.

  L1 (reordered)   Support count
  {I2}             7
  {I1}             6
  {I3}             6
  {I4}             2
  {I5}             2
Mining Frequent Itemsets – FP-Growth – FP-tree Construction

Transactions with items reordered according to L1:

  TID    List of items
  T100   I2, I1, I5
  T200   I2, I4
  T300   I2, I3
  T400   I2, I1, I4
  T500   I1, I3
  T600   I2, I3
  T700   I1, I3
  T800   I2, I1, I3, I5
  T900   I2, I1, I3

Insert T100: null { } → I2:1 → I1:1 → I5:1

The order of items is kept throughout path construction, with common prefixes shared whenever applicable.
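A short sketch of this insertion step (an assumption, not the slides' exact procedure), reusing the FPNode class sketched earlier; `header` is an illustrative dictionary mapping each item to the head of its node-link chain.

def insert_transaction(root, ordered_items, header):
    node = root
    for item in ordered_items:
        child = node.children.get(item)
        if child is None:
            child = FPNode(item, parent=node)
            node.children[item] = child
            # thread the new node onto the item's node-link chain
            child.node_link = header.get(item)
            header[item] = child
        child.count += 1
        node = child

# e.g.  root, header = FPNode(), {}
#       insert_transaction(root, ["I2", "I1", "I5"], header)   # T100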
Mining Frequent Itemsets – FP-Growth – FP-tree Construction

Insert T200 (I2, I4): the I2 prefix is shared, so the count of I2 becomes 2 and a new branch I4:1 is added under I2.

Insert T300 (I2, I3): the I2 prefix is shared again, so the count of I2 becomes 3 and a new branch I3:1 is added under I2.

The remaining transactions T400–T900 are inserted in the same way.
Mining Frequent Itemsets – FP-Growth – FP-tree Construction

The completed FP-tree. Node links from the header table support tree traversal: trace the node-link path for each header entry and you get that item's support count.

  null { }
  ├── I2:7
  │   ├── I1:4
  │   │   ├── I5:1
  │   │   ├── I4:1
  │   │   └── I3:2
  │   │       └── I5:1
  │   ├── I4:1
  │   └── I3:2
  └── I1:2
      └── I3:2

  Header table (L1 reordered, with node links):
  Itemset   Support count
  {I2}      7
  {I1}      6
  {I3}      6
  {I4}      2
  {I5}      2
Mining Frequent Itemsets – FP-Growth – Frequent Pattern Mining

FP-Growth is a bottom-up algorithm: start from the leaves (the least frequent header items) and work up toward the root. I5, for example, has two paths to the root: (I2, I1, I5: 1) and (I2, I1, I3, I5: 1). Note that {I3, I5} has a frequency below the min_support count threshold.
Mining Frequent Itemsets – FP-Growth – Conditional FP-tree Construction

For I5: keep only the transactions that contain I5 (T100 and T800) and eliminate I5 itself. This gives I5's conditional pattern base {(I2, I1: 1), (I2, I1, I3: 1)} and, as the slides build it up, the tree

  null { } → I2:2 → I1:2 → I3:1

Since I3 appears only once here, it is infrequent in this conditional tree and is dropped, leaving the conditional FP-tree ⟨I2:2, I1:2⟩. The frequent patterns generated for I5 are {I2, I5: 2}, {I1, I5: 2}, and {I2, I1, I5: 2}.
Mining Frequent Itemsets – FP-Growth

  Item   Conditional Pattern Base           Conditional FP-tree      Frequent Patterns Generated
  I5     {{I2, I1: 1}, {I2, I1, I3: 1}}     ⟨I2:2, I1:2⟩             {I2, I5: 2}, {I1, I5: 2}, {I2, I1, I5: 2}
  I4     {{I2, I1: 1}, {I2: 1}}             ⟨I2:2⟩                   {I2, I4: 2}
  I3     {{I2, I1: 2}, {I2: 1}, {I1: 2}}    ⟨I2:4, I1:2⟩, ⟨I1:2⟩     {I2, I3: 4}, {I1, I3: 4}, {I2, I1, I3: 2}
  I1     {{I2: 4}}                          ⟨I2:4⟩                   {I2, I1: 4}

(The conditional pattern base lists the prefix paths to each item, i.e. the paths for which the item is a suffix, after eliminating infrequent items.)
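A sketch of how a conditional pattern base could be collected from the FP-tree (illustrative names, assuming the FPNode fields and header dictionary sketched earlier): follow the item's node-link chain and climb parent pointers up to the root.

def conditional_pattern_base(item, header):
    base = []
    node = header.get(item)
    while node is not None:
        path, parent = [], node.parent
        while parent is not None and parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        if path:
            base.append((list(reversed(path)), node.count))
        node = node.node_link
    return base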
Mining Frequent Itemsets – FP-Growth – Conditional FP-tree Construction

For I4: keep only the transactions containing I4 (T200 and T400) and eliminate I4, giving the conditional pattern base {(I2, I1: 1), (I2: 1)}; after removing the infrequent I1, the conditional FP-tree is ⟨I2:2⟩.

For I3: keep only the transactions containing I3 and eliminate I3, giving the conditional pattern base {(I2, I1: 2), (I2: 1), (I1: 2)} and the conditional FP-tree ⟨I2:4, I1:2⟩, ⟨I1:2⟩.
Exercise: Construct the FP-Tree

  TID   Items bought
  100   {f, a, c, d, g, i, m, p}
  200   {a, b, c, f, l, m, o}
  300   {b, f, h, j, o, w}
  400   {b, c, k, s, p}
  500   {a, f, c, e, l, p, m, n}

min_support = 3
Exercise: Construct the FP-Tree (Solution)

min_support = 3

  TID   Items bought               (ordered) frequent items
  100   {f, a, c, d, g, i, m, p}   {f, c, a, m, p}
  200   {a, b, c, f, l, m, o}      {f, c, a, b, m}
  300   {b, f, h, j, o, w}         {f, b}
  400   {b, c, k, s, p}            {c, b, p}
  500   {a, f, c, e, l, p, m, n}   {f, c, a, m, p}

Steps:
1. Scan the DB once and find the frequent 1-itemsets (single-item patterns).
2. Sort the frequent items in descending frequency order to get the f-list.
3. Scan the DB again and construct the FP-tree.

  Header table:
  Item   frequency
  f      4
  c      4
  a      3
  b      3
  m      3
  p      3

  FP-tree:
  {}
  ├── f:4
  │   ├── c:3
  │   │   └── a:3
  │   │       ├── m:2
  │   │       │   └── p:2
  │   │       └── b:1
  │   │           └── m:1
  │   └── b:1
  └── c:1
      └── b:1
          └── p:1

F-list = f-c-a-b-m-p
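A small sketch (not from the slides) of reordering a transaction by the f-list and dropping infrequent items before FP-tree insertion; the names flist, rank, and reorder are illustrative.

flist = ["f", "c", "a", "b", "m", "p"]            # from the header table above
rank = {item: i for i, item in enumerate(flist)}

def reorder(transaction):
    return sorted((i for i in transaction if i in rank), key=rank.get)

print(reorder({"a", "b", "c", "f", "l", "m", "o"}))   # ['f', 'c', 'a', 'b', 'm']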


Agenda – Pattern Evaluation Methods
Pattern Evaluation Methods

- Not all association rules are interesting:
  - buys(X, "computer games") ⇒ buys(X, "videos") [support = 40%, confidence = 66%]
  - But P("videos") is already 75% > 66%.
  - The two items are negatively associated: buying one decreases the likelihood of buying the other.
- We need to measure the "real strength" of a rule.
- Correlation analysis: A ⇒ B [support, confidence, correlation]
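A quick numeric check of the example above. The transaction counts used here (10,000 total, 6,000 with computer games, 7,500 with videos, 4,000 with both) are an assumption chosen to be consistent with the 40% / 66% / 75% figures on the slide; the lift measure is used as the correlation check.

n, games, videos, both = 10_000, 6_000, 7_500, 4_000
support    = both / n                    # 0.40
confidence = both / games                # ~0.667
lift       = confidence / (videos / n)   # ~0.89 < 1 -> negatively correlated
print(round(support, 2), round(confidence, 2), round(lift, 2))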
