You are on page 1of 15

Mining Association Rules

What is Association rule mining


Apriori Algorithm
Mining Association Rules Additional Measures of rule interestingness
Advanced Techniques

1 2

What Is Association Rule Mining? What Is Association Rule Mining?


Rule form
Association rule mining
Antecedent Consequent [support, confidence]
Finding frequent patterns, associations, correlations, or causal
structures among sets of items in transaction databases
(support and confidence are user defined measures of interestingness)

Understand customer buying habits by finding associations and


correlations between the different items that customers place in Examples
their shopping basket
buys(x, computer) buys(x, financial management
software) [0.5%, 60%]
Applications
Basket data analysis, cross-marketing, catalog design, loss-leader
analysis, web log analysis, fraud detection (supervisor->examiner) age(x, 30..39) ^ income(x, 42..48K) buys(x, car) [1%,75%]

3 4
How can Association Rules be used? How can Association Rules be used?
Probably mom was
calling dad at Let the rule discovered be
work to buy
diapers on way {Bagels,...} {Potato Chips}
home and he
decided to buy a Potato chips as consequent => Can be used to determine what
six-pack as well. should be done to boost its sales
The retailer
could move Bagels in the antecedent => Can be used to see which products
diapers and beers would be affected if the store discontinues selling bagels
to separate
places and
position high- Bagels in antecedent and Potato chips in the consequent =>
profit items of
interest to young Can be used to see what products should be sold with Bagels
fathers along the to promote sale of Potato Chips
path.
5 6

Rule basic Measures: Support and Confidence


Association Rule: Basic Concepts
A B [ s, c ]
Given:
(1) database of transactions, Support: denotes the frequency of the rule within
(2) each transaction is a list of items purchased by a customer
transactions. A high value means that the rule involve a
in a visit great part of database.
support(A B [ s, c ]) = p(A B)
Find:
all rules that correlate the presence of one set of items
(itemset) with that of another set of items Confidence: denotes the percentage of transactions
containing A which contain also B. It is an estimation of
E.g., 98% of people who purchase tires and auto accessories conditioned probability .
also get automotive services done
confidence(A B [ s, c ]) = p(B|A) = sup(A,B)/sup(A).

7 8
Example Mining Association Rules
Itemset:
A,B or B,E,F
Trans. Id Purchased Items What is Association rule mining
1 A,D Support of an itemset:
2 A,C Sup(A,B)=1 Sup(A,C)=2 Apriori Algorithm
3 A,B,C Frequent pattern: Additional Measures of rule interestingness
4 B,E,F Given min. sup=2, {A,C} is a
frequent pattern Advanced Techniques

For minimum support = 50% and minimum confidence = 50%, we have


the following rules

A => C with 50% support and 66% confidence


C => A with 50% support and 100% confidence
9 10

Boolean association rules Mining Association Rules - An Example


Transaction ID Items Bought Min. support 50%
Each transaction is Min. confidence 50%
represented by a 2000 A,B,C
Boolean vector 1000 A,C Frequent Itemset Support
4000 A,D {A} 75%
5000 B,E,F {B} 50%
{C} 50%
{A,C} 50%

For rule A C :
support = support({A , C }) = 50%
confidence = support({A , C }) / support({A }) = 66.6%

11 12
The Apriori principle Apriori principle

Any subset of a frequent itemset must No superset of any infrequent itemset should be
be frequent generated or tested
Many item combinations can be pruned

A transaction containing {beer, diaper, nuts} also contains


{beer, diaper}

{beer, diaper, nuts} is frequent {beer, diaper} must also be


frequent

13 14

Itemset Lattice Apriori principle for pruning candidates


null

null
If an itemset is
infrequent, then all of its
A B C D E

supersets must also be


A B C D E

infrequent
AB AC AD AE BC BD BE CD CE DE
AB AC AD AE BC BD BE CD CE DE

ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE Found to be
Infrequent ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE

ABCD ABCE ABDE ACDE BCDE


Pruned ABCD ABCE ABDE ACDE BCDE

supersets
ABCDE ABCDE

15 16
Mining Frequent Itemsets (the Key Step)
The Apriori Algorithm Example
Min. support: 2 transactions
Database D C1 itemset sup.
Find the frequent itemsets: the sets of items that TID Items {1} 2 L1 itemset sup.
have minimum support 100 134 {2} 3 {1} 2

A subset of a frequent itemset must also be a frequent


200 235 Scan D {3} 3 {2} 3
300 1235 {4} 1 {3} 3
itemset {5} 3 {5} 3
400 25
Generate length (k+1) candidate itemsets from length k
C2 C2 itemset

frequent itemsets, and itemset sup


Test the candidates against DB to determine which are in fact
L2 itemset sup {1 2} 1 {1 2}
{1 3} 2 {1 3} 2 {1 3}
frequent
{2 3} 2 {1 5} 1 Scan D {1 5}
{2 3} 2 {2 3}
{2 5} 3
{2 5} 3 {2 5}
Use the frequent itemsets to generate association {3 5} 2
{3 5} 2 {3 5}
rules.
C3 itemset Scan D L3 itemset sup
Generation is straightforward {2 3 5} {2 3 5} 2
17 18

How to Generate Candidates? How to Generate Candidates?


The items in Lk-1 are listed in an order
Step 2: pruning
Step 1: self-joining Lk-1 for all itemsets c in Ck do
insert into Ck
for all (k-1)-subsets s of c do
select p.item1, p.item2, , p.itemk-1, q.itemk-1 if (s is not in Lk-1) then delete c from Ck
from Lk-1 p, Lk-1 q
where p.item1=q.item1, , p.itemk-2=q.itemk-2, p.itemk-1 < q.itemk-1
A D E F

A D E
=

<

A D F

19 20
The Apriori Algorithm
Example of Generating Candidates
Ck: Candidate itemset of size k Lk : frequent itemset of size k
L3={abc, abd, acd, ace, bcd} Join Step: Ck is generated by joining Lk-1 with itself

Prune Step: Any (k-1)-itemset that is not frequent cannot be a


Self-joining: L3*L3 subset of a frequent k-itemset
Algorithm:
abcd from abc and abd


L1 = {frequent items};
acde from acd and ace for (k = 1; Lk !=; k++) do begin
Ck+1 = candidates generated from Lk;
for each transaction t in database do
Pruning (before counting its support):
increment the count of all candidates in Ck+1 that are
contained in t
acde is removed because ade is not in L3
Lk+1 = candidates in Ck+1 with min_support
end
C4={abcd} return L = k Lk;
21 22

How to Count Supports of Candidates? Generating AR from frequent intemsets

Why counting supports of candidates a problem? support_count({A,B})

Confidence (AB) = P(B|A) =


The total number of candidates can be very huge
support_count({A})
One transaction may contain many candidates
For every frequent itemset x, generate all non-empty
Method: subsets of x
Candidate itemsets are stored in a hash-tree
Leaf node of hash-tree contains a list of itemsets and counts
Interior node contains a hash table For every non-empty subset s of x, output the rule
Subset function: finds all the candidates contained in a transaction
s (x-s) if
support_count({x})
min_conf
support_count({s})

23 24
From Frequent Itemsets to Association Rules Generating Rules: example
Q: Given frequent set {A,B,E}, what are possible Trans-ID Items Min_support: 60%
Min_confidence: 75%
association rules? 1 ACD
A => B, E 2 BCE
A, B => E 3 ABCE
A, E => B
4 BE
5 ABCE Frequent Itemset Support
B => A, E
{BCE},{AC} 60%
B, E => A
Rule Conf. {BC},{CE},{A} 60%
E => A, B {BC} =>{E} 100% {BE},{B},{C},{E} 80%
__ => A,B,E (empty rule), or true => A,B,E {BE} =>{C} 75%
{CE} =>{B} 100%
{B} =>{CE} 75%
{C} =>{BE} 75%
{E} =>{BC} 75%
25 26

Exercice 0.4*7= 2.8

C1 L1 C2 L2
TID Items
Bread 6 Bread 6 Bread,Milk 5 Bread,Milk 5
1 Bread, Milk, Chips, Mustard Milk 6 Milk 6 Bread,Beer 3 Bread,Beer 3
2 Beer, Diaper, Bread, Eggs Converta os dados para o Chips 2 Beer 4 Bread,Diaper 5 Bread,Diaper 5
Mustard 2 Diaper 6 Bread,Coke 2 Milk,Beer 3
3 Beer, Coke, Diaper, Milk formato booleano e para um
Beer 4 Coke 3 Milk,Beer 3 Milk,Diaper 5
suporte de 40%, aplique o
4 Beer, Bread, Diaper, Milk, Chips algoritmo apriori.
Diaper 6 Milk,Diaper 5 Milk,Coke 3
Eggs 1 Milk,Coke 3 Beer,Diaper 4
5 Coke, Bread, Diaper, Milk
Coke 3 Beer,Diaper 4 Diaper,Coke 3
6 Beer, Bread, Diaper, Milk, Mustard Beer,Coke 1
7 Coke, Bread, Diaper, Milk Diaper,Coke 3

C3 L3
Bread Milk Chips Mustard Beer Diaper Eggs Coke
Bread,Milk,Beer 2 Bread,Milk,Diaper 4
1 1 1 1 0 0 0 0
Bread,Milk,Diaper 4 Bread,Beer,Diaper 3
1 0 0 0 1 1 1 0
Bread,Beer,Diaper 3 Milk,Beer,Diaper 3
0 1 0 0 1 1 0 1
Milk,Beer,Diaper 3 Milk,Diaper,Coke 3
1 1 1 0 1 1 0 0 Milk,Beer,Coke
1 1 0 0 0 1 0 1 Milk,Diaper,Coke 3
1 1 0 1 1 1 0 0
1 1 0 0 0 1 0 1 8 + C 28 + C 38 = 92 >> 24
27 28
Challenges of Frequent Pattern Mining Improving Aprioris Efficiency

Challenges
Problem with Apriori: every pass goes over whole data.
Multiple scans of transaction database AprioriTID: Generates candidates as apriori but DB is
Huge number of candidates used for counting support only on the first pass.
Tedious workload of support counting for candidates Needs much more memory than Apriori
Improving Apriori: general ideas Builds a storage set C^k that stores in memory the frequent sets
Reduce number of transaction database scans per transaction
Shrink number of candidates
AprioriHybrid: Use Apriori in initial passes; Estimate the


Facilitate support counting of candidates
size of C^k; Switch to AprioriTid when C^k is expected to

fit in memory
The switch takes time, but it is still better in most cases

29 30

C^1 L1
TID Items TID Set-of-itemsets Itemset Support Improving Aprioris Efficiency
100 134 100 { {1},{3},{4} } {1} 2
200 235 200 { {2},{3},{5} } {2} 3 Transaction reduction: A transaction that does not
300 1235 {3} 3
300 { {1},{2},{3},{5} }
contain any frequent k-itemset is useless in subsequent
400 25 400 { {2},{5} } {5} 3
scans
C2 C^2
itemset TID Set-of-itemsets
L2
{1 2} 100 { {1 3} } Itemset Support Sampling: mining on a subset of given data.
{1 3} 200 { {2 3},{2 5} {3 5} } {1 3} 2
The sample should fit in memory
{1 5} 300 { {1 2},{1 3},{1 5}, {2 3} 3
{2 5} 3 Use lower support threshold to reduce the probability of
{2 3} {2 3}, {2 5}, {3 5} }
{3 5} 2
missing some itemsets.
{2 5} 400 { {2 5} }
{3 5}
The rest of the DB is used to determine the actual itemset
C^3 L3 count.
C3 TID Set-of-itemsets
Itemset Support
itemset 200 { {2 3 5} }
{2 3 5} 2
{2 3 5} 300 { {2 3 5} }
31 32
Improving Aprioris Efficiency Improving Aprioris Efficiency

Partitioning: Any itemset that is potentially frequent in DB must be


Dynamic itemset counting: partitions the DB into several blocks
frequent in at least one of the partitions of DB (2 DB scans)
each marked by a start point.
(support in a partition is lowered to be proportional to the number of
elements in the partition) At each start point, DIC estimates the support of all itemsets that are
currently counted and adds new itemsets to the set of candidate
Phase I
itemsets if all its subsets are estimated to be frequent.
Phase II
If DIC adds all frequent itemsets to the set of candidate itemsets
Combine
Divide D Find frequent Find global during the first scan, it will have counted each itemsets exact support
results to
into n itemsets local frequent Freq.
Trans.
Non- to each
form a global
itemsets itemsets at some point during the second scan;
in D set of
overlapping partition among in D
candidate thus DIC can complete in two scans.
partitions (parallel alg.) candidates
itemset

33 34

Comment Mining Association Rules

Traditional methods such as database queries: What is Association rule mining


support hypothesis verification about a relationship such as
Apriori Algorithm

the co-occurrence of diapers & beer.

Additional Measures of rule interestingness


Data Mining methods automatically discover significant
Advanced Techniques


associations rules from data.
Find whatever patterns exist in the database, without the user
having to specify in advance what to look for (data driven).
Therefore allow finding unexpected correlations

35 36
Interestingness Measurements Objective measures of rule interest
Are all of the strong association rules discovered interesting
Support
enough to present to the user?

How can we measure the interestingness of a rule? Confidence or strength


Lift or Interest or Correlation

Subjective measures Conviction


A rule (pattern) is interesting if Leverage or Piatetsky-Shapiro
it is unexpected (surprising to the user); and/or Coverage
actionable (the user can do something with it)
(only the user can judge the interestingness of a rule)

37 38

Criticism to Support and Confidence Lift of a Rule


Example 1: (Aggarwal & Yu,
basketball not basketball sum(row)
PODS98) Lift (Correlation, Interest)
cereal 2000 1750 3750 75%
Among 5000 students not cereal 1000 250 1250 2 5%

3000 play basketball sum(col.) 3000 2000 5000 sup(A, B ) P (B | A)


3750 eat cereal Lift(A B ) = =
sup(A) sup(B ) P (B )
60% 40%

2000 both play basket ball


and eat cereal
A and B negatively correlated, if the value is less than 1;
play basketball eat cereal [40%, 66.7%] otherwise A and B positively correlated

misleading because the overall percentage of students eating cereal is rule Support Lift
75% which is higher than 66.7%. X 1 1 1 1 0 0 0 0 X Y 25% 2.00
Y 1 1 0 0 0 0 0 0 XZ 37.50% 0.86
play basketball not eat cereal [20%, 33.3%]
Z 0 1 1 1 1 1 1 1 YZ 12.50% 0.57
is more accurate, although with lower support and confidence
39 40
Lift of a Rule Problems With Lift
Example 1 (cont)
Rules that hold 100% of the time may not have the highest
2000
possible lift. For example, if 5% of people are Vietnam veterans
5000
play basketball eat cereal [40%, 66.7%] LIFT =
3000 3750
= 0.89
and 90% of the people are more than 5 years old, we get a lift of

5000 5000
0.05/(0.05*0.9)=1.11 which is only slightly above 1 for the rule
1000 Vietnam veterans -> more than 5 years old.
5000
play basketball not eat cereal [20%, 33.3%] LIFT = 3000 1250
= 1.33

5000 5000
And, lift is symmetric:

basketball not basketball sum(row) not eat cereal play basketball [20%, 80%]
cereal 2000 1750 3750
1000
not cereal 1000 250 1250 LIFT = 5000 = 1.33
1250 3000

sum(col.) 3000 2000 5000 5000 5000
41 42

Conviction of a Rule Leverage of a Rule


Note that A -> B can be rewritten as (A,B) Leverage or Piatetsky-Shapiro

sup(A) sup(B ) P (A) P (B ) P (A)(1 P (B ))


Conv (A B ) = = =
sup(A, B ) P (A, B ) P (A) P (A, B ) PS (A B ) = sup(A, B ) sup(A) sup(B )

Conviction is a measure of the implication and has value 1 if items are


unrelated. PS (or Leverage):
3000 3750
1
5000

5000
is the proportion of additional elements covered by both the
play basketball eat cereal [40%, 66.7%] Conv =
3000 2000
= 0.75

5000 5000
premise and consequence above the expected if independent.
eat cereal play basketball conv:0.85

3000 1250
play basketball not eat cereal [20%, 33.3%] 1
5000

5000
Conv = = 1.125
not eat cereal play basketball conv:1.43 3000 1000


5000 5000

43 44
Coverage of a Rule Association Rules Visualization

coverage(A B ) = sup(A)

The coloured column indicates the association rule BC.


Different icon colours are used to show different metadata values of
the association rule.
45 46

Association Rules Visualization


Association Rules Visualization

Size of ball equates to total support

Height
equates to
confidence

47 48
Association Rules Visualization - Ball graph
The Ball graph Explained

A ball graph consists of a set of nodes and arrows. All the nodes are yellow, green
or blue. The blue nodes are active nodes representing the items in the rule in which
the user is interested. The yellow nodes are passive representing items related to
the active nodes in some way. The green nodes merely assist in visualizing two or
more items in either the head or the body of the rule.
A circular node represents a frequent (large) data item. The volume of the ball
represents the support of the item. Only those items which occur sufficiently
frequently are shown
An arrow between two nodes represents the rule implication between the two
items. An arrow will be drawn only when the support of a rule is no less than the
minimum support

49 50

Association Rules Visualization Mining Association Rules

What is Association rule mining


Apriori Algorithm
FP-tree Algorithm
Additional Measures of rule interestingness
Advanced Techniques

51 52
Multiple-Level Association Rules Multi-Dimensional Association Rules
FoodStuff
Single-dimensional rules:
Frozen Refrigerated Fresh Bakery Etc... buys(X, milk) buys(X, bread)

Vegetable Fruit Dairy Etc....


Multi-dimensional rules: 2 dimensions or predicates
Banana Apple Orange Etc... Inter-dimension association rules (no repeated predicates)
age(X,19-25) occupation(X,student) buys(X,coke)
Fresh Bakery [20%, 60%]
hybrid-dimension association rules (repeated predicates)
Dairy Bread [6%, 50%]
age(X,19-25) buys(X, popcorn) buys(X, coke
Fruit Bread [1%, 50%] is not valid

Items often form hierarchy.


Flexible support settings: Items at the lower level are expected to have lower support.
Transaction database can be encoded based on dimensions and levels explore shared multi-level
mining
53 54

Quantitative Association Rules Sequential Pattern Mining


Given
A database of customer transactions ordered by increasing
age(X,30-34) income(X,24K - 48K) buys(X,high resolution

transaction time
TV)
Each transaction is a set of items
A sequence is an ordered list of itemsets
Example:
Mining Sequential Patterns 10% of customers bought Foundation and Ringworld" in one
transaction, followed by Ringworld Engineers" in another transaction.

10% of customers bought 10% is called the support of the pattern


(a transaction may contain more books than those in the pattern)
Foundation and Ringworld in one transaction,
Problem
followed by
Find all sequential patterns supported by more than a user-specified
Ringworld Engineers in another transaction. percentage of data sequences

55 56
Application Difficulties Some Suggestions

Wal-Mart knows that customers who buy Barbie dolls By increasing the price of Barbie doll and giving the type of candy bar
free, wal-mart can reinforce the buying habits of that particular types of
(it sells one every 20 seconds) have a 60% likelihood of buyer
buying one of three types of candy bars. What does Highest margin candy to be placed near dolls.
Wal-Mart do with information like that?
Special promotions for Barbie dolls with candy at a slightly higher margin.
'I don't have a clue,' says Wal-Mart's chief of Take a poorly selling product X and incorporate an offer on this which is
merchandising, Lee Scott. based on buying Barbie and Candy. If the customer is likely to buy these
two products anyway then why not try to increase sales on X?
Probably they can not only bundle candy of type A with Barbie dolls, but
can also introduce new candy of Type N in this bundle while offering
See - KDnuggets 98:01 for many ideas discount on whole bundle. As bundle is going to sell because of Barbie
www.kdnuggets.com/news/98/n01.html dolls & candy of type A, candy of type N can get free ride to customers
houses. And with the fact that you like something, if you see it often,
Candy of type N can become popular.

61 62

References
Jiawei Han and Micheline Kamber, Data Mining:
Concepts and Techniques, 2000

Vipin Kumar and Mahesh Joshi, Tutorial on High


Performance Data Mining , 1999
Rakesh Agrawal, Ramakrishnan Srikan, Fast Algorithms
for Mining Association Rules, Proc VLDB, 1994
(http://www.cs.tau.ac.il/~fiat/dmsem03/Fast%20Algorithms%20for%20Mining%20Association%
20Rules.ppt)

Alpio Jorge, seleco de regras: medidas de interesse


e meta queries,
(http://www.liacc.up.pt/~amjorge/Aulas/madsad/ecd2/ecd2_Aulas_AR_3_2003.pdf)
Thank you !!!
63 64