
Mining Association Rules

• What is Association rule mining
• Apriori Algorithm
• Additional Measures of rule interestingness
• Advanced Techniques

What Is Association Rule Mining?

• Association rule mining
  • Finding frequent patterns, associations, correlations, or causal structures among sets of items in transaction databases
• Understand customer buying habits by finding associations and correlations between the different items that customers place in their "shopping basket"
• Applications
  • Basket data analysis, cross-marketing, catalog design, loss-leader analysis, web log analysis, fraud detection (supervisor -> examiner)

What Is Association Rule Mining?

• Rule form
  Antecedent → Consequent [support, confidence]
  (support and confidence are user-defined measures of interestingness)
• Examples
  • buys(x, "computer") → buys(x, "financial management software") [0.5%, 60%]
  • age(x, "30..39") ∧ income(x, "42..48K") → buys(x, "car") [1%, 75%]
How can Association Rules be used?

Probably mom was calling dad at work to buy diapers on the way home, and he decided to buy a six-pack as well. The retailer could move diapers and beer to separate places and position high-profit items of interest to young fathers along the path.

How can Association Rules be used?

• Let the rule discovered be {Bagels,...} → {Potato Chips}
• Potato chips as consequent => can be used to determine what should be done to boost its sales
• Bagels in the antecedent => can be used to see which products would be affected if the store discontinues selling bagels
• Bagels in the antecedent and Potato chips in the consequent => can be used to see what products should be sold with Bagels to promote the sale of Potato Chips

Association Rule: Basic Concepts

• Given:
  • (1) a database of transactions
  • (2) each transaction is a list of items purchased by a customer in a visit
• Find:
  • all rules that correlate the presence of one set of items (itemset) with that of another set of items
  • E.g., 98% of people who purchase tires and auto accessories also get automotive services done

Rule basic Measures: Support and Confidence

A ⇒ B [s, c]

• Support: denotes the frequency of the rule within transactions. A high value means that the rule involves a great part of the database.
  support(A ⇒ B [s, c]) = p(A ∪ B)
• Confidence: denotes the percentage of transactions containing A which also contain B. It is an estimate of the conditional probability.
  confidence(A ⇒ B [s, c]) = p(B|A) = sup(A,B) / sup(A)
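The two measures map directly onto code. Below is a minimal sketch (not part of the original slides) that computes support and confidence over transactions stored as Python sets; the small basket dataset is invented for illustration.

# Minimal sketch: support and confidence of a rule A => B over a list of
# transactions represented as Python sets.

def support(itemset, transactions):
    # Fraction of transactions that contain every item of `itemset`
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    # Estimate of P(consequent | antecedent) = sup(A ∪ B) / sup(A)
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

# Toy basket data (hypothetical)
transactions = [
    {"bread", "milk"},
    {"bread", "diaper", "beer"},
    {"milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "beer"},
]
A, B = {"diaper"}, {"beer"}
print(support(A | B, transactions))    # 0.75
print(confidence(A, B, transactions))  # 1.0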
Example

Trans. Id   Purchased Items
1           A, D
2           A, C
3           A, B, C
4           B, E, F

• Itemset: e.g. {A,B} or {B,E,F}
• Support of an itemset: sup(A,B) = 1, sup(A,C) = 2
• Frequent pattern: given min. sup = 2, {A,C} is a frequent pattern

For minimum support = 50% and minimum confidence = 50%, we have the following rules:
• A => C with 50% support and 66% confidence
• C => A with 50% support and 100% confidence

Mining Association Rules

• What is Association rule mining
• Apriori Algorithm
• Additional Measures of rule interestingness
• Advanced Techniques

Boolean association rules

Each transaction is represented by a Boolean vector.

Mining Association Rules - An Example

Min. support 50%, Min. confidence 50%

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

Frequent Itemset   Support
{A}                75%
{B}                50%
{C}                50%
{A,C}              50%

For rule A ⇒ C:
  support = support({A,C}) = 50%
  confidence = support({A,C}) / support({A}) = 66.6%
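A Boolean encoding is straightforward to build by hand. The following short sketch (plain Python, added for illustration) encodes the four transactions of this example, one column per item.

# Sketch: encoding the transactions above as Boolean vectors (one column per item).

transactions = {
    2000: {"A", "B", "C"},
    1000: {"A", "C"},
    4000: {"A", "D"},
    5000: {"B", "E", "F"},
}

items = sorted(set().union(*transactions.values()))   # ['A','B','C','D','E','F']
boolean_rows = {tid: [int(i in t) for i in items] for tid, t in transactions.items()}

for tid, row in boolean_rows.items():
    print(tid, row)
# 2000 [1, 1, 1, 0, 0, 0]
# 1000 [1, 0, 1, 0, 0, 0]
# 4000 [1, 0, 0, 1, 0, 0]
# 5000 [0, 1, 0, 0, 1, 1]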
The Apriori principle

Any subset of a frequent itemset must be frequent.

• A transaction containing {beer, diaper, nuts} also contains {beer, diaper}
• {beer, diaper, nuts} is frequent → {beer, diaper} must also be frequent

Apriori principle

• No superset of any infrequent itemset should be generated or tested
• Many item combinations can be pruned

Itemset Lattice

[Figure: the lattice of all itemsets over {A, B, C, D, E}, from the single items down to ABCDE]

Apriori principle for pruning candidates

If an itemset is infrequent, then all of its supersets must also be infrequent.

[Figure: the same lattice; once an itemset is found to be infrequent, all of its supersets are pruned]
Mining Frequent Itemsets (the Key Step)

• Find the frequent itemsets: the sets of items that have minimum support
  • A subset of a frequent itemset must also be a frequent itemset
  • Generate length (k+1) candidate itemsets from length k frequent itemsets, and test the candidates against the DB to determine which are in fact frequent
• Use the frequent itemsets to generate association rules
  • Generation is straightforward

The Apriori Algorithm — Example

Min. support: 2 transactions

Database D:
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D → C1:        L1:
itemset   sup       itemset   sup
{1}       2         {1}       2
{2}       3         {2}       3
{3}       3         {3}       3
{4}       1         {5}       3
{5}       3

C2 (from L1):   Scan D → C2 with counts:   L2:
itemset         itemset   sup              itemset   sup
{1 2}           {1 2}     1                {1 3}     2
{1 3}           {1 3}     2                {2 3}     2
{1 5}           {1 5}     1                {2 5}     3
{2 3}           {2 3}     2                {3 5}     2
{2 5}           {2 5}     3
{3 5}           {3 5}     2

C3 (from L2):   Scan D → L3:
itemset         itemset   sup
{2 3 5}         {2 3 5}   2
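For reference, here is a compact Python sketch (not from the slides) that reproduces this level-wise run on the same database D with min. support = 2; candidates are generated naively by unioning pairs of frequent (k-1)-itemsets and then pruned with the Apriori principle.

# Sketch of the level-wise Apriori run on database D (min. support = 2 transactions).
from itertools import combinations

D = {100: {1, 3, 4}, 200: {2, 3, 5}, 300: {1, 2, 3, 5}, 400: {2, 5}}
min_sup = 2

def count(candidates):
    # support count of each candidate itemset over the transactions of D
    return {c: sum(1 for t in D.values() if c <= t) for c in candidates}

# Level 1
C1 = [frozenset([i]) for i in set().union(*D.values())]
L = {1: {c: s for c, s in count(C1).items() if s >= min_sup}}

# Levels k >= 2
k = 2
while L[k - 1]:
    prev = list(L[k - 1])
    # naive candidate generation: union pairs of frequent (k-1)-itemsets
    Ck = {a | b for a, b in combinations(prev, 2) if len(a | b) == k}
    # Apriori pruning: drop candidates with an infrequent (k-1)-subset
    Ck = {c for c in Ck
          if all(frozenset(s) in L[k - 1] for s in combinations(c, k - 1))}
    L[k] = {c: s for c, s in count(Ck).items() if s >= min_sup}
    k += 1

for level, itemsets in L.items():
    if itemsets:
        print(level, sorted((tuple(sorted(c)), s) for c, s in itemsets.items()))
# 1 [((1,), 2), ((2,), 3), ((3,), 3), ((5,), 3)]
# 2 [((1, 3), 2), ((2, 3), 2), ((2, 5), 3), ((3, 5), 2)]
# 3 [((2, 3, 5), 2)]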

How to Generate Candidates?

• The items in Lk-1 are listed in an order
• Step 1: self-joining Lk-1

  insert into Ck
  select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
  from Lk-1 p, Lk-1 q
  where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

  Example: A D E joined with A D F (equal on A and D, with E < F) yields the candidate A D E F

How to Generate Candidates?

• Step 2: pruning

  for all itemsets c in Ck do
    for all (k-1)-subsets s of c do
      if (s is not in Lk-1) then delete c from Ck
Example of Generating Candidates

• L3 = {abc, abd, acd, ace, bcd}
• Self-joining: L3 * L3
  • abcd from abc and abd
  • acde from acd and ace
• Pruning (before counting its support):
  • acde is removed because ade is not in L3
• C4 = {abcd}

The Apriori Algorithm

• Ck: candidate itemsets of size k; Lk: frequent itemsets of size k
• Join Step: Ck is generated by joining Lk-1 with itself
• Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset
• Algorithm:

  L1 = {frequent items};
  for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
      increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
  end
  return L = ∪k Lk;
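The join and prune steps can be sketched in a few lines of Python; the snippet below reproduces the L3 example above with itemsets held as sorted tuples (an illustration only, not the hash-tree based implementation of the original algorithm).

# Sketch of candidate generation (self-join, then prune) on L3 = {abc, abd, acd, ace, bcd}.
from itertools import combinations

L3 = [tuple(s) for s in ("abc", "abd", "acd", "ace", "bcd")]
k = 4

# Step 1: self-join -- merge two (k-1)-itemsets that agree on their first k-2 items
joined = {p + (q[-1],)
          for p in L3 for q in L3
          if p[:-1] == q[:-1] and p[-1] < q[-1]}
print(sorted(joined))   # [('a','b','c','d'), ('a','c','d','e')]

# Step 2: prune -- drop any candidate that has an infrequent (k-1)-subset
L3_set = set(L3)
C4 = [c for c in sorted(joined)
      if all(s in L3_set for s in combinations(c, k - 1))]
print(C4)               # [('a','b','c','d')]  -- acde dropped because ade is not in L3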

How to Count Supports of Candidates?

• Why is counting supports of candidates a problem?
  • The total number of candidates can be very huge
  • One transaction may contain many candidates
• Method:
  • Candidate itemsets are stored in a hash-tree
  • A leaf node of the hash-tree contains a list of itemsets and counts
  • An interior node contains a hash table
  • Subset function: finds all the candidates contained in a transaction

Generating AR from frequent itemsets

• Confidence(A ⇒ B) = P(B|A) = support_count({A,B}) / support_count({A})
• For every frequent itemset x, generate all non-empty subsets of x
• For every non-empty subset s of x, output the rule "s ⇒ (x - s)" if
  support_count({x}) / support_count({s}) ≥ min_conf
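A sketch of this rule-generation loop in Python; the support counts are taken from the worked example on the next slide (5 transactions, min_confidence = 75%).

# Sketch: generate rules s => (x - s) from a frequent itemset x, keeping those
# whose confidence sup(x)/sup(s) reaches min_conf.
from itertools import combinations

support_count = {      # counts over the 5 transactions of the example that follows
    frozenset("B"): 4, frozenset("C"): 4, frozenset("E"): 4,
    frozenset("BC"): 3, frozenset("BE"): 4, frozenset("CE"): 3,
    frozenset("BCE"): 3,
}
min_conf = 0.75

def rules_from(x):
    x = frozenset(x)
    for r in range(1, len(x)):                     # all non-empty proper subsets s of x
        for s in map(frozenset, combinations(x, r)):
            conf = support_count[x] / support_count[s]
            if conf >= min_conf:                   # keep the rule s => (x - s)
                yield sorted(s), sorted(x - s), conf

for lhs, rhs, conf in rules_from("BCE"):
    print(lhs, "=>", rhs, round(conf, 2))          # prints the six rules of the next slide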
From Frequent Itemsets to Association Rules

• Q: Given the frequent set {A,B,E}, what are the possible association rules?
  • A => B, E
  • A, B => E
  • A, E => B
  • B => A, E
  • B, E => A
  • E => A, B
  • __ => A, B, E (empty rule), or true => A, B, E

Generating Rules: example

Min_support: 60%, Min_confidence: 75%

Trans-ID   Items
1          A C D
2          B C E
3          A B C E
4          B E
5          A B C E

Frequent Itemset         Support
{BCE}, {AC}              60%
{BC}, {CE}, {A}          60%
{BE}, {B}, {C}, {E}      80%

Rule            Conf.
{BC} => {E}     100%
{BE} => {C}     75%
{CE} => {B}     100%
{B} => {CE}     75%
{C} => {BE}     75%
{E} => {BC}     75%

Exercise

TID   Items
1     Bread, Milk, Chips, Mustard
2     Beer, Diaper, Bread, Eggs
3     Beer, Coke, Diaper, Milk
4     Beer, Bread, Diaper, Milk, Chips
5     Coke, Bread, Diaper, Milk
6     Beer, Bread, Diaper, Milk, Mustard
7     Coke, Bread, Diaper, Milk

Convert the data to the Boolean format and, for a support of 40%, apply the Apriori algorithm.

Minimum support count: 0.4 * 7 = 2.8 transactions

Boolean format:
Bread  Milk  Chips  Mustard  Beer  Diaper  Eggs  Coke
1      1     1      1        0     0       0     0
1      0     0      0        1     1       1     0
0      1     0      0        1     1       0     1
1      1     1      0        1     1       0     0
1      1     0      0        0     1       0     1
1      1     0      1        1     1       0     0
1      1     0      0        0     1       0     1

C1                 L1
Bread     6        Bread    6
Milk      6        Milk     6
Chips     2        Beer     4
Mustard   2        Diaper   6
Beer      4        Coke     3
Diaper    6
Eggs      1
Coke      3

C2                       L2
Bread,Milk      5        Bread,Milk     5
Bread,Beer      3        Bread,Beer     3
Bread,Diaper    5        Bread,Diaper   5
Bread,Coke      2        Milk,Beer      3
Milk,Beer       3        Milk,Diaper    5
Milk,Diaper     5        Milk,Coke      3
Milk,Coke       3        Beer,Diaper    4
Beer,Diaper     4        Diaper,Coke    3
Beer,Coke       1
Diaper,Coke     3

C3                                 L3
Bread,Milk,Beer      2             Bread,Milk,Diaper   4
Bread,Milk,Diaper    4             Bread,Beer,Diaper   3
Bread,Beer,Diaper    3             Milk,Beer,Diaper    3
Milk,Beer,Diaper     3             Milk,Diaper,Coke    3
Milk,Beer,Coke       (pruned)
Milk,Diaper,Coke     3

Number of possible itemsets of size ≤ 3: C(8,1) + C(8,2) + C(8,3) = 8 + 28 + 56 = 92, far more (>>) than the 24 candidate itemsets Apriori actually counts (8 in C1, 10 in C2, 6 in C3).
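One way to check the exercise by machine is the third-party mlxtend package (assumed here to be installed); its TransactionEncoder produces the Boolean format and its apriori function takes the 40% support threshold directly.

# Sketch using mlxtend (assumed installed) to verify the exercise above.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

baskets = [
    ["Bread", "Milk", "Chips", "Mustard"],
    ["Beer", "Diaper", "Bread", "Eggs"],
    ["Beer", "Coke", "Diaper", "Milk"],
    ["Beer", "Bread", "Diaper", "Milk", "Chips"],
    ["Coke", "Bread", "Diaper", "Milk"],
    ["Beer", "Bread", "Diaper", "Milk", "Mustard"],
    ["Coke", "Bread", "Diaper", "Milk"],
]

te = TransactionEncoder()
df = pd.DataFrame(te.fit_transform(baskets), columns=te.columns_)   # Boolean format

frequent = apriori(df, min_support=0.4, use_colnames=True)           # 40% support
rules = association_rules(frequent, metric="confidence", min_threshold=0.5)
print(frequent)
print(rules[["antecedents", "consequents", "support", "confidence"]])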
Challenges of Frequent Pattern Mining

• Challenges
  • Multiple scans of the transaction database
  • Huge number of candidates
  • Tedious workload of support counting for candidates
• Improving Apriori: general ideas
  • Reduce the number of transaction database scans
  • Shrink the number of candidates
  • Facilitate support counting of candidates

Improving Apriori's Efficiency

• Problem with Apriori: every pass goes over the whole data.
• AprioriTID: generates candidates as Apriori does, but the DB is used for counting support only on the first pass.
  • Needs much more memory than Apriori
  • Builds a storage set C^k that stores in memory the frequent sets per transaction
• AprioriHybrid: use Apriori in the initial passes; estimate the size of C^k; switch to AprioriTID when C^k is expected to fit in memory.
  • The switch takes time, but it is still better in most cases

Improving Apriori's Efficiency — AprioriTID example

TID   Items       C^1: TID  Set-of-itemsets          L1: Itemset  Support
100   1 3 4            100  { {1},{3},{4} }              {1}      2
200   2 3 5            200  { {2},{3},{5} }              {2}      3
300   1 2 3 5          300  { {1},{2},{3},{5} }          {3}      3
400   2 5              400  { {2},{5} }                  {5}      3

C2: itemset     C^2: TID  Set-of-itemsets                 L2: Itemset  Support
{1 2}                100  { {1 3} }                           {1 3}    2
{1 3}                200  { {2 3},{2 5},{3 5} }               {2 3}    2
{1 5}                300  { {1 2},{1 3},{1 5},                {2 5}    3
{2 3}                       {2 3},{2 5},{3 5} }               {3 5}    2
{2 5}                400  { {2 5} }
{3 5}

C3: itemset     C^3: TID  Set-of-itemsets     L3: Itemset  Support
{2 3 5}              200  { {2 3 5} }             {2 3 5}  2
                     300  { {2 3 5} }

Improving Apriori's Efficiency

• Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans
• Sampling: mining on a subset of the given data
  • The sample should fit in memory
  • Use a lower support threshold to reduce the probability of missing some itemsets
  • The rest of the DB is used to determine the actual itemset counts
Improving Apriori's Efficiency

• Partitioning: any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB (2 DB scans)
  • (support in a partition is lowered to be proportional to the number of elements in the partition)
  • Phase I: divide D into n non-overlapping partitions; find the frequent itemsets local to each partition (parallel alg.); combine the results to form a global set of candidate itemsets
  • Phase II: scan D again to find the global frequent itemsets among the candidates

Improving Apriori's Efficiency

• Dynamic itemset counting (DIC): partitions the DB into several blocks, each marked by a start point.
• At each start point, DIC estimates the support of all itemsets that are currently counted and adds new itemsets to the set of candidate itemsets if all of their subsets are estimated to be frequent.
• If DIC adds all frequent itemsets to the set of candidate itemsets during the first scan, it will have counted each itemset's exact support at some point during the second scan; thus DIC can complete in two scans.

Comment

• Traditional methods such as database queries:
  • support hypothesis verification about a relationship, such as the co-occurrence of diapers & beer.
• Data Mining methods automatically discover significant association rules from data.
  • They find whatever patterns exist in the database, without the user having to specify in advance what to look for (data driven).
  • Therefore they allow finding unexpected correlations.

Mining Association Rules

• What is Association rule mining
• Apriori Algorithm
• Additional Measures of rule interestingness
• Advanced Techniques
Interestingness Measurements

• Are all of the strong association rules discovered interesting enough to present to the user?
• How can we measure the interestingness of a rule?
• Subjective measures
  • A rule (pattern) is interesting if
    • it is unexpected (surprising to the user); and/or
    • actionable (the user can do something with it)
    • (only the user can judge the interestingness of a rule)

Objective measures of rule interest

• Support
• Confidence or strength
• Lift or Interest or Correlation
• Conviction
• Leverage or Piatetsky-Shapiro
• Coverage

Criticism to Support and Confidence

• Example 1 (Aggarwal & Yu, PODS98)
  • Among 5000 students
    • 3000 play basketball
    • 3750 eat cereal
    • 2000 both play basketball and eat cereal

                basketball    not basketball   sum(row)
  cereal        2000          1750             3750 (75%)
  not cereal    1000          250              1250 (25%)
  sum(col.)     3000 (60%)    2000 (40%)       5000

• play basketball ⇒ eat cereal [40%, 66.7%] is misleading, because the overall percentage of students eating cereal is 75%, which is higher than 66.7%.
• play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate, although with lower support and confidence.

Lift of a Rule

• Lift (Correlation, Interest)

  Lift(A → B) = sup(A,B) / (sup(A) · sup(B)) = P(B|A) / P(B)

• A and B are negatively correlated if the value is less than 1; otherwise A and B are positively correlated.

  X: 1 1 1 1 0 0 0 0        rule     Support   Lift
  Y: 1 1 0 0 0 0 0 0        X ⇒ Y    25%       2.00
  Z: 0 1 1 1 1 1 1 1        X ⇒ Z    37.50%    0.86
                            Y ⇒ Z    12.50%    0.57
Lift of a Rule

• Example 1 (cont)

  play basketball ⇒ eat cereal [40%, 66.7%]
    Lift = (2000/5000) / ((3000/5000) × (3750/5000)) = 0.89

  play basketball ⇒ not eat cereal [20%, 33.3%]
    Lift = (1000/5000) / ((3000/5000) × (1250/5000)) = 1.33

                basketball    not basketball   sum(row)
  cereal        2000          1750             3750
  not cereal    1000          250              1250
  sum(col.)     3000          2000             5000

Problems With Lift

• Rules that hold 100% of the time may not have the highest possible lift. For example, if 5% of people are Vietnam veterans and 90% of the people are more than 5 years old, we get a lift of 0.05/(0.05*0.9) = 1.11, which is only slightly above 1, for the rule
  • Vietnam veterans → more than 5 years old.
• And, lift is symmetric:
  • not eat cereal ⇒ play basketball [20%, 80%]
    Lift = (1000/5000) / ((1250/5000) × (3000/5000)) = 1.33

Conviction of a Rule

• Note that A → B can be rewritten as ¬(A, ¬B)

  Conv(A → B) = sup(A) · sup(¬B) / sup(A, ¬B) = P(A) · P(¬B) / P(A, ¬B) = P(A)(1 − P(B)) / (P(A) − P(A,B))

• Conviction is a measure of the implication and has value 1 if items are unrelated.

  play basketball ⇒ eat cereal [40%, 66.7%]
    Conv = (3000/5000)(1 − 3750/5000) / (3000/5000 − 2000/5000) = 0.75
  eat cereal ⇒ play basketball   Conv = 0.85
  play basketball ⇒ not eat cereal [20%, 33.3%]
    Conv = (3000/5000)(1 − 1250/5000) / (3000/5000 − 1000/5000) = 1.125
  not eat cereal ⇒ play basketball   Conv = 1.43

Leverage of a Rule

• Leverage or Piatetsky-Shapiro

  PS(A → B) = sup(A,B) − sup(A) · sup(B)

• PS (or Leverage) is the proportion of additional elements covered by both the premise and the consequent, above what would be expected if they were independent.
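These measures are one-liners in code. The sketch below (added for illustration) evaluates lift, conviction and leverage on the basketball/cereal figures from the preceding slides.

# Sketch: lift, conviction and leverage from the basketball/cereal contingency table
# (probabilities expressed as fractions of the 5000 students).

def lift(p_a, p_b, p_ab):
    return p_ab / (p_a * p_b)

def conviction(p_a, p_b, p_ab):
    return p_a * (1 - p_b) / (p_a - p_ab)

def leverage(p_a, p_b, p_ab):
    return p_ab - p_a * p_b

p_basket, p_cereal, p_both = 3000 / 5000, 3750 / 5000, 2000 / 5000
print(round(lift(p_basket, p_cereal, p_both), 2))        # 0.89
print(round(conviction(p_basket, p_cereal, p_both), 3))  # 0.75
print(round(leverage(p_basket, p_cereal, p_both), 2))    # -0.05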
Coverage of a Rule

  coverage(A → B) = sup(A)

Association Rules Visualization

[Figure] The coloured column indicates the association rule B → C. Different icon colours are used to show different metadata values of the association rule.

Association Rules Visualization

[Figure]

Association Rules Visualization

[Figure] The size of a ball equates to total support; height equates to confidence.
Association Rules Visualization - Ball graph

[Figure]

The Ball graph Explained

• A ball graph consists of a set of nodes and arrows. All the nodes are yellow, green or blue. The blue nodes are active nodes representing the items in the rule in which the user is interested. The yellow nodes are passive, representing items related to the active nodes in some way. The green nodes merely assist in visualizing two or more items in either the head or the body of the rule.
• A circular node represents a frequent (large) data item. The volume of the ball represents the support of the item. Only those items which occur sufficiently frequently are shown.
• An arrow between two nodes represents the rule implication between the two items. An arrow will be drawn only when the support of a rule is no less than the minimum support.

Association Rules Visualization

[Figure]

Mining Association Rules

• What is Association rule mining
• Apriori Algorithm
• FP-tree Algorithm
• Additional Measures of rule interestingness
• Advanced Techniques
Multiple-Level Association Rules

Items often form a hierarchy:
  FoodStuff → Frozen, Refrigerated, Fresh, Bakery, Etc...
  Fresh → Vegetable, Fruit, Dairy, Etc...
  Fruit → Banana, Apple, Orange, Etc...

• Fresh ⇒ Bakery [20%, 60%]
• Dairy ⇒ Bread [6%, 50%]
• Fruit ⇒ Bread [1%, 50%] is not valid
• Flexible support settings: items at the lower level are expected to have lower support.
• The transaction database can be encoded based on dimensions and levels; explore shared multi-level mining.

Multi-Dimensional Association Rules

• Single-dimensional rules:
  • buys(X, "milk") ⇒ buys(X, "bread")
• Multi-dimensional rules: ≥ 2 dimensions or predicates
  • Inter-dimension association rules (no repeated predicates)
    • age(X, "19-25") ∧ occupation(X, "student") ⇒ buys(X, "coke")
  • Hybrid-dimension association rules (repeated predicates)
    • age(X, "19-25") ∧ buys(X, "popcorn") ⇒ buys(X, "coke")

Quantitative Association Rules

  age(X, "30-34") ∧ income(X, "24K - 48K") ⇒ buys(X, "high resolution TV")

Mining Sequential Patterns

  10% of customers bought "Foundation" and "Ringworld" in one transaction, followed by "Ringworld Engineers" in another transaction.

Sequential Pattern Mining

• Given
  • A database of customer transactions ordered by increasing transaction time
  • Each transaction is a set of items
  • A sequence is an ordered list of itemsets
• Example:
  • 10% of customers bought "Foundation" and "Ringworld" in one transaction, followed by "Ringworld Engineers" in another transaction.
  • 10% is called the support of the pattern
  • (a transaction may contain more books than those in the pattern)
• Problem
  • Find all sequential patterns supported by more than a user-specified percentage of data sequences
Application Difficulties

• Wal-Mart knows that customers who buy Barbie dolls (it sells one every 20 seconds) have a 60% likelihood of buying one of three types of candy bars. What does Wal-Mart do with information like that?
• 'I don't have a clue,' says Wal-Mart's chief of merchandising, Lee Scott.
• See KDnuggets 98:01 for many ideas: www.kdnuggets.com/news/98/n01.html

Some Suggestions

• By increasing the price of the Barbie doll and giving that type of candy bar for free, Wal-Mart can reinforce the buying habits of that particular type of buyer.
• Highest-margin candy to be placed near the dolls.
• Special promotions for Barbie dolls with candy at a slightly higher margin.
• Take a poorly selling product X and incorporate an offer on it based on buying Barbie and candy. If the customer is likely to buy these two products anyway, then why not try to increase sales of X?
• Probably they can not only bundle candy of type A with Barbie dolls, but also introduce a new candy of type N into this bundle while offering a discount on the whole bundle. As the bundle is going to sell because of the Barbie dolls & candy of type A, candy of type N gets a free ride into customers' houses. And since people tend to like what they see often, candy of type N can become popular.

References

• Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", 2000
• Vipin Kumar and Mahesh Joshi, "Tutorial on High Performance Data Mining", 1999
• Rakesh Agrawal, Ramakrishnan Srikant, "Fast Algorithms for Mining Association Rules", Proc. VLDB, 1994
  (http://www.cs.tau.ac.il/~fiat/dmsem03/Fast%20Algorithms%20for%20Mining%20Association%20Rules.ppt)
• Alípio Jorge, "selecção de regras: medidas de interesse e meta queries"
  (http://www.liacc.up.pt/~amjorge/Aulas/madsad/ecd2/ecd2_Aulas_AR_3_2003.pdf)

Thank you !!!
