
Mining Association Rules
Motivation
• Discovering relations among transactional data
• Example – market basket analysis
  ❖ Discovery of customers' buying habits: which items are frequently purchased by a customer in a single trip?
  ❖ Helps develop marketing strategies
• Issues:
  ❖ How to formulate association rules
  ❖ How to determine interesting association rules
  ❖ How to discover interesting association rules efficiently in large data sets?
Formulating Association Rules
• Example: a customer that purchases coffee tends to also buy sugar is represented as:
      coffee ⇒ sugar [support = 10%, confidence = 70%]
• support = 10%: 10% of all customers purchase both coffee and sugar
• confidence = 70%: 70% of the customers who buy coffee also buy sugar
• Thresholds: support must be at least r, confidence at least c
• Users set thresholds to indicate interestingness

  Example transaction database:
    1  coffee, bread
    2  coffee, meat, apple
    3  coffee, sugar, noodle, salt
    4  coffee, sugar, orange, potato
    5  coffee, sugar, tomato
    6  bread, sugar, bean
    7  milk, egg
    8  milk, fish

  Total customers: 8
  Customers who bought coffee: 5
  Customers who bought both coffee and sugar: 3
  Support: 3/8 = 37.5%
  Confidence: 3/5 = 60%
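To make the computation concrete, here is a minimal Python sketch (not from the lecture; the transaction list and the function name are illustrative) that computes the support and confidence of {coffee} ⇒ {sugar} over the eight example transactions:

    # Minimal sketch: support and confidence for {coffee} => {sugar}
    transactions = [
        {"coffee", "bread"},
        {"coffee", "meat", "apple"},
        {"coffee", "sugar", "noodle", "salt"},
        {"coffee", "sugar", "orange", "potato"},
        {"coffee", "sugar", "tomato"},
        {"bread", "sugar", "bean"},
        {"milk", "egg"},
        {"milk", "fish"},
    ]

    def frequency(itemset, transactions):
        """Number of transactions that contain every item in `itemset`."""
        return sum(1 for t in transactions if itemset <= t)

    antecedent, consequent = {"coffee"}, {"sugar"}
    both = frequency(antecedent | consequent, transactions)
    support = both / len(transactions)                    # 3/8 = 0.375
    confidence = both / frequency(antecedent, transactions)  # 3/5 = 0.600

    print(f"support = {support:.3f}, confidence = {confidence:.3f}")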
Formulating Association Rule (cont.)
• In terms of probability
  Let X = (X1, X2) be defined as follows: for a random customer c, X1 = 1 if c buys coffee and 0 otherwise; X2 = 1 if c buys sugar, 0 otherwise
  coffee ⇒ sugar [support = 10%, confidence = 70%] is interpreted as:
  ❖ p(X1 = 1, X2 = 1) = 10% and p(X2 = 1 | X1 = 1) = 70%
  ❖ or simply
  ❖ p(coffee, sugar) = 10% and p(sugar | coffee) = 70%
Formulating Association Rule (cont.)
• Concepts
  ❖ I = {i1, …, im} is a set of items
  ❖ D = {T1, …, Tn} is a set where, for all i, Ti ⊆ I (Ti is called a transaction; D is referred to as a transaction database)
  ❖ An association rule is an implication A ⇒ B, where A, B ⊂ I and A ∩ B = ∅
  ❖ A ⇒ B holds in D with support s and confidence r if
      s = |{T ∈ D : A ∪ B ⊆ T}| / |D|   and   r = |{T ∈ D : A ∪ B ⊆ T}| / |{T ∈ D : A ⊆ T}|
  ❖ If we view any U ⊆ I as the event that a randomly selected transaction from D contains U, then p(A ∪ B) = s and p(B | A) = r
Formulating Association Rule (cont.)
  (Recall: I = {i1, …, im}, D = {T1, …, Tn}, A ⊂ I, B ⊂ I, A ∩ B = ∅)
• Association rule A ⇒ B is valid with respect to the support threshold r and confidence threshold c if A ⇒ B holds with support s ≥ r and confidence f ≥ c
• Additional concepts
  ❖ k-itemset: any subset of I that contains exactly k items
  ❖ Occurrence frequency of itemset t, denoted frequency(t): the number of transactions in D that contain t (another term used: support count)
  ❖ Itemset t is frequent with respect to support threshold r if frequency(t)/|D| ≥ r
  ❖ Implication: A ∪ B being frequent with respect to r is a necessary condition for A ⇒ B to be valid
Formulating Association Rule
• Let I = {apple, bread, bean, coffee, egg, fish, milk, meat, noodle, orange, potato, salt, sugar, tomato}
• Let D be the transaction database from the earlier example (transactions 1–8)
• Let the support threshold be 30% and the confidence threshold be 60%
• Consider the association rule {coffee} ⇒ {sugar}
  ❖ The occurrence frequency of {coffee, sugar} is 3
  ❖ {coffee, sugar} is a frequent 2-itemset, since 3/8 ≥ 30%
  ❖ The occurrence frequency of {coffee} is 5
  ❖ The confidence for {coffee} ⇒ {sugar} is 3/5 ≥ 60%
  ❖ So {coffee} ⇒ {sugar} is a valid association rule w.r.t. the given support and confidence thresholds
Formulating Association Rule
• Let I and D be as on the previous slide, with support threshold 30% and confidence threshold 60%
• Consider the association rule {milk} ⇒ {egg}
  ❖ The occurrence frequency of {milk, egg} is 1
  ❖ {milk, egg} is not a frequent 2-itemset, since 1/8 < 30%
  ❖ {milk} ⇒ {egg} is not a valid association rule w.r.t. the given thresholds
Mining Association Rules
• Goal: discover all the valid association rules with respect to the given support threshold r and confidence threshold c
• Steps:
  ❖ Find all frequent itemsets w.r.t. r
  ❖ Generate association rules from the frequent itemsets w.r.t. c
• Approaches to frequent itemset search
  ❖ Naive approach (see the sketch below)
      – scan the itemset space
      – for each itemset, count its frequency (scanning all the transactions) and compare with r
      – high cost – the number of itemsets is huge (2^|I| − 1 non-empty subsets)
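A short Python sketch of this naive enumeration (illustrative only; it assumes I is a set of comparable items and D is a list of transactions represented as sets):

    from itertools import combinations

    def naive_frequent_itemsets(I, D, min_support):
        """Enumerate every non-empty subset of I and count it with a full scan of D."""
        frequent = {}
        for k in range(1, len(I) + 1):
            for itemset in combinations(sorted(I), k):
                count = sum(1 for t in D if set(itemset) <= t)
                if count / len(D) >= min_support:
                    frequent[itemset] = count
        # examines 2^|I| - 1 candidate itemsets, each requiring a pass over D
        return frequent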
A naive approach for finding all frequent itemsets??

[Itemset lattice over I = {A, B, C, D, E}, from the empty set down to the full itemset:]
  null
  A  B  C  D  E
  AB  AC  AD  AE  BC  BD  BE  CD  CE  DE
  ABC  ABD  ABE  ACD  ACE  ADE  BCD  BCE  BDE  CDE
  ABCD  ABCE  ABDE  ACDE  BCDE
  ABCDE
Apriori Algorithm for AR Mining
• Apriori property
  ❖ Let t1 and t2 be any itemsets with t2 ⊆ t1. Then
      t1 is frequent ⇒ t2 is frequent
      or equivalently, t2 is not frequent ⇒ t1 is not frequent
  ❖ So if we know that an itemset is not frequent, there is no need to check its supersets
• Based on the second implication (the contrapositive), we can prune the search space
• After pruning, the remaining itemsets are called candidate itemsets
• For each candidate itemset, we count the transactions that contain it to determine if it is frequent
Illustrating the Apriori principle
[Figure: the itemset lattice over {A, B, C, D, E}; an itemset found to be not frequent is marked, and all of its supersets are pruned from the search space.]
Apriori Algorithm (cont.)
• Assumes the items are kept in a fixed order within every itemset and transaction
• Works in ascending order of itemset size k:
  1. Find all the frequent 1-itemsets (by counting)
  2. Join (i.e., union) each pair of frequent 1-itemsets into a 2-itemset
  3. In general, join each pair of frequent (k−1)-itemsets into a k-itemset
  4. From these, generate the candidate k-itemsets
  5. Get the transaction count for each candidate k-itemset and then collect the frequent ones
  6. Repeat this process until the candidate set becomes ∅
• Issues
  ❖ How to join (step 3)?
  ❖ How to generate candidates (step 4)?
Apriori Algorithm (cont.)
• Let U and V be a pair of (k−1)-itemsets; we join them in the following way
  ❖ Condition: they share the first k−2 items
  ❖ Keep these k−2 items, then add the two remaining items, one from each set
  ❖ Example (k−1 = 4):
      – join {1,4,5,7} and {1,4,5,9}: ok, get {1,4,5,7,9}
      – join {1,4,5,7} and {1,2,4,8}: no
      – join {1,4,5,7} and {4,5,7,9}: no
• Let W be the resulting set after joining U and V
  ❖ Discard it if one of its (k−1)-subitemsets is not frequent (this is where the Apriori property is applied)
  ❖ All the k-itemsets that have not been discarded constitute the candidate k-itemsets
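A small Python sketch of this join-and-prune step (illustrative, not the lecture's code; itemsets are represented as sorted tuples so that "the first k−2 items" is well defined):

    def join_and_prune(frequent_k_minus_1):
        """frequent_k_minus_1: a set of sorted (k-1)-tuples known to be frequent."""
        prev = sorted(frequent_k_minus_1)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                u, v = prev[i], prev[j]
                # join condition: the first k-2 items are identical
                if u[:-1] == v[:-1]:
                    w = tuple(sorted(set(u) | set(v)))
                    # prune: every (k-1)-subset of w must itself be frequent
                    subsets = (w[:m] + w[m + 1:] for m in range(len(w)))
                    if all(s in frequent_k_minus_1 for s in subsets):
                        candidates.add(w)
        return candidates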
Apriori Algorithm – an Example
• I = {1, 2, 3, 4, 5}
• D = { {1,2,3,4}, {1,2,4}, {2,4,5}, {1,2,5}, {2,4} }
• Support threshold: 40% (minimum support count: 2)
• Steps
  1. 1-itemsets: {1}, {2}, {3}, {4}, {5}
  2. Frequent 1-itemsets: {1}, {2}, {4}, {5}
  3. Join frequent 1-itemsets: {1,2}, {1,4}, {1,5}, {2,4}, {2,5}, {4,5}
  4. Candidate 2-itemsets: {1,2}, {1,4}, {1,5}, {2,4}, {2,5}, {4,5}
  5. Frequent 2-itemsets: {1,2}, {1,4}, {2,4}, {2,5}
  6. Join frequent 2-itemsets: {1,2,4}, {2,4,5}
  7. Candidate 3-itemsets: {1,2,4}
  8. Frequent 3-itemsets: {1,2,4}
  9. Join frequent 3-itemsets: ∅
  10. Candidate 4-itemsets: ∅
  11. Stop
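Continuing the sketch, a hypothetical driver loop that reuses the join_and_prune function from the previous block reproduces the result of this example (again, illustrative code rather than the lecture's):

    D = [{1, 2, 3, 4}, {1, 2, 4}, {2, 4, 5}, {1, 2, 5}, {2, 4}]
    min_support = 0.4
    min_count = min_support * len(D)  # 2

    def count(itemset):
        return sum(1 for t in D if set(itemset) <= t)

    # frequent 1-itemsets
    level = {(i,) for t in D for i in t if count((i,)) >= min_count}
    all_frequent = set(level)
    while level:
        candidates = join_and_prune(level)                      # join + prune
        level = {c for c in candidates if count(c) >= min_count}  # count, keep frequent
        all_frequent |= level

    # all_frequent now holds (1,), (2,), (4,), (5,), (1,2), (1,4), (2,4), (2,5), (1,2,4)
    print(sorted(all_frequent))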
Correctness
• Does the Apriori algorithm find all frequent itemsets?
  ❖ i.e., do the candidate k-itemsets include all the frequent k-itemsets?
  ❖ We require two (k−1)-itemsets U and V to share the first k−2 items in order to be joined. Does this condition jeopardize correctness?
  ❖ Suppose U and V do not share the first k−2 items, and let W = U ∪ V be a k-itemset. W will not be generated from joining U and V.
      – Case 1, W is not frequent: not a problem.
      – Case 2, W is frequent: will W still be generated from some other pair? Yes: by the Apriori property every (k−1)-subset of W is frequent, and in particular the two (k−1)-subsets obtained by dropping the last and the second-to-last item of W share the first k−2 items, so joining them produces W.
Generating Association Rules
• Let S be any frequent itemset
• For each non-empty proper subset a ⊂ S, calculate
      freq(S) / freq(a)
• If this value is not smaller than the confidence threshold, then output the following association rule:
      a ⇒ S − a
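A minimal sketch of this rule-generation step in Python (illustrative names; it assumes freq is a dictionary of support counts for all frequent itemsets, which Apriori has already produced, since every subset of a frequent itemset is also frequent):

    from itertools import combinations

    def rules_from_itemset(S, freq, min_confidence):
        """S: a frozenset that is frequent; freq: dict mapping frozensets to support counts."""
        rules = []
        for k in range(1, len(S)):                  # non-empty proper subsets of S
            for a in combinations(sorted(S), k):
                a = frozenset(a)
                confidence = freq[S] / freq[a]      # conf(a => S - a) = freq(S) / freq(a)
                if confidence >= min_confidence:
                    rules.append((a, S - a, confidence))
        return rules

With S = {coffee, sugar} and the counts from the earlier example, freq(S)/freq({coffee}) = 3/5 reproduces the 60% confidence computed before.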
Pattern Evaluation
• The support–confidence framework can only help exclude uninteresting rules
• It does not necessarily guarantee the interestingness of the rules generated
• How to make a judgement?
  ❖ Mostly determined by users subjectively
  ❖ May differ from user to user
  ❖ Some objective measures may be used in limited contexts
Interestingness Measure: Correlations (Lift)
• play basketball ⇒ eat cereal [40%, 66.7%] is misleading when the overall percentage of students eating cereal is 75% > 66.7%
• play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate, although it has lower support and confidence
• Measure of dependent/correlated events: lift (larger → higher correlation)

      lift(U, V) = P(U, V) / (P(U) P(V))

                 Basketball   Not basketball   Sum (row)
  Cereal         2000         1750             3750
  Not cereal     1000         250              1250
  Sum (col.)     3000         2000             5000

      lift(B, C)  = (2000/5000) / ((3000/5000) × (3750/5000)) = 0.89
      lift(B, ¬C) = (1000/5000) / ((3000/5000) × (1250/5000)) = 1.33
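The two lift values can be checked directly from the contingency table, for example with this small Python snippet (illustrative):

    n = 5000
    p_basketball = 3000 / n
    p_cereal = 3750 / n
    p_not_cereal = 1250 / n

    lift_b_c = (2000 / n) / (p_basketball * p_cereal)          # ~0.89
    lift_b_not_c = (1000 / n) / (p_basketball * p_not_cereal)  # ~1.33
    print(round(lift_b_c, 2), round(lift_b_not_c, 2))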
2 correlation Test for A and B
• Notation:
  ❖ n: total number of transactions
  ❖ Dom(A) = {a1, …, ac}
  ❖ Dom(B) = {b1, …, br}
  ❖ (Ai, Bj): the joint event that A = ai and B = bj

      χ² = Σ_{i=1..c} Σ_{j=1..r} (a_ij − e_ij)² / e_ij

  where
  ❖ a_ij: observed frequency of event (Ai, Bj)
  ❖ e_ij = count(A = ai) × count(B = bj) / n: expected frequency of (Ai, Bj)
  ❖ count(A = ai): number of tuples with A = ai
  ❖ count(B = bj): number of tuples with B = bj
• Common practice: A and B are correlated if the p-value of χ² with (c−1)(r−1) degrees of freedom is smaller than 0.05
• Let B and C be two random variables with
  • Dom(B) = {Basketball, Not-basketball}
  • Dom(C) = {Cereal, Not-cereal}
• The contingency table (expected frequencies in parentheses):

                 Basketball     Not-basketball   Sum (row)
  Cereal         2000 (2250)    1750 (1500)      3750
  Not-cereal     1000 (750)     250 (500)        1250
  Sum (col.)     3000           2000             5000

      χ² = (2000−2250)²/2250 + (1750−1500)²/1500 + (1000−750)²/750 + (250−500)²/500
         = 277.78

• The p-value of 277.78 with one degree of freedom is far below 0.05
• So B and C are strongly correlated
• Observing the data, they are negatively correlated
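The χ² statistic above can be verified, for example, with SciPy's chi2_contingency (passing correction=False so the plain Pearson statistic is computed, matching the hand calculation):

    from scipy.stats import chi2_contingency

    observed = [[2000, 1750],   # cereal:     basketball, not basketball
                [1000,  250]]   # not cereal: basketball, not basketball
    chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
    print(round(chi2, 2), dof, p_value)   # ~277.78, dof = 1, p-value far below 0.05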
Multi-level AR
• Association rules may involve concepts at different abstraction levels
Multi-level AR
• In some cases, it is difficult to find interesting patterns at very low abstraction levels
• It may be easier to find strong associations between general concepts
  ❖ Example:
  ❖ laptop ⇒ printer may be a strong rule
  ❖ Dell XPS 16 Notebook ⇒ Canon 7420 may not be

  TID    Items purchased
  T100   Apple 17 Pro Notebook, HP Photosmart Pro b9180, Canon 7420 Printer
  T200   Microsoft Office Pro 2010, Microsoft Wireless Optical Mouse 5000
  T300   Logitech VX Namo Cordless Laser Mouse, Fellowes CEL Wrist Rest
  T400   Dell Studio XPS 16 Notebook, Canon PowerShot SD1400
  T500   Lenovo ThinkPad X200 Tablet PC, Symantec Norton Antivirus 2010
  …
Multi-level AR
• Multilevel AR can be mined efficiently using the support–confidence framework
• Either a top-down or a bottom-up approach can be used
  ❖ Counts are accumulated to compute the frequent itemsets at each level
  ❖ At each level, any AR mining algorithm can be used
• We can also define a cross-level Apriori property
  ❖ Cross-level Apriori property: the count of any itemset is not higher than that of its parent, so the parent of a frequent itemset is also frequent
  ❖ Example: frequency(Desktop, Office) ≤ frequency(Computer, Software)
Multi-level AR
• Variation 1: uniform minimum support for all levels
  ❖ Pros: simplicity
  ❖ Cons: lower-level concepts are unlikely to occur with the same frequency as higher-level concepts
Multi-level AR
• Variation 2: reduced minimum support at lower levels
  ❖ Pros: higher flexibility
  ❖ Cons: increased complexity of the mining process
  ❖ Note: the Apriori property may not always hold across levels
• Variation 3: group-based support
  ❖ Domain experts have insight into the specifics of individual items
  ❖ Setting different supports for different groups may be more realistic
  ❖ For example, you may set a low support threshold for expensive items