Association rule mining
• Proposed by Agrawal et al. in 1993.
• It is an important data mining model studied extensively by the database
and data mining community.
• Assume all data are categorical.
• No good algorithm for numeric data.
• Initially used for Market Basket Analysis to find how items purchased by
customers are related.
“Item set” = the items (e.g., products) comprising the antecedent or consequent
Definition: Frequent Itemset
• Itemset
  • A collection of one or more items
  • Example: {Milk, Bread, Diaper}
• k-itemset
  • An itemset that contains k items
• Support count (σ)
  • Frequency of occurrence of an itemset
  • E.g. σ({Milk, Bread, Diaper}) = 2
• Support
  • Fraction of transactions that contain an itemset
  • E.g. s({Milk, Bread, Diaper}) = 2/5
• Frequent Itemset
  • An itemset whose support is greater than or equal to a minsup threshold

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
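The definitions above can be checked on the five example transactions. A minimal sketch (the function names are my own; the data set is the one from the table above):

```python
# The five example transactions from the table above.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """sigma(X): number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """s(X): fraction of transactions that contain `itemset`."""
    return support_count(itemset, transactions) / len(transactions)

print(support_count({"Milk", "Bread", "Diaper"}, transactions))  # 2
print(support({"Milk", "Bread", "Diaper"}, transactions))        # 0.4 = 2/5
```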
The model: data
• I = {i1, i2, …, im}: a set of items.
• Transaction t :
• t is a set of items, and t ⊆ I.
• Transaction Database T: a set of transactions T = {t1, t2, …, tn}.
I: itemset
{cucumber, parsley, onion, tomato, salt, bread, olives, cheese, butter}
T: set of transactions
1 {{cucumber, parsley, onion, tomato, salt, bread},
2 {tomato, cucumber, parsley},
3 {tomato, cucumber, olives, onion, parsley},
4 {tomato, cucumber, onion, bread},
5 {tomato, salt, onion},
6 {bread, cheese}
7 {tomato, cheese, cucumber}
8 {bread, butter}}
The model: Association rules
• A transaction t contains X, a set of items (itemset) in I, if X ⊆ t.
• An association rule is an implication of the form:
  X → Y, where X, Y ⊂ I, and X ∩ Y = ∅
Rule strength measures
• Support: The rule holds with support sup in T (the transaction data set) if
  sup% of transactions contain X ∪ Y.
  • sup = probability that a transaction contains X ∪ Y, i.e. Pr(X ∪ Y)
    (percentage of transactions that contain X ∪ Y)
• Confidence: The rule holds in T with confidence conf if conf% of transactions
  that contain X also contain Y.
  • conf = conditional probability that a transaction having X also contains Y,
    i.e. Pr(Y | X)
    (ratio of the number of transactions that contain X ∪ Y to the number that
    contain X)
• An association rule is a pattern stating that when X occurs, Y occurs with a
  certain probability.
Support and Confidence
• Support count: The support count of an itemset X, denoted by X.count, in a
data set T is the number of transactions in T that contain X. Assume T has n
transactions.
• Then,
  support = (X ∪ Y).count / n
  confidence = (X ∪ Y).count / X.count
Goal: Find all rules that satisfy the user-specified minimum support (minsup)
and minimum confidence (minconf).
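The two ratios above translate directly into code. A minimal sketch on the earlier five-transaction example (helper names are my own):

```python
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def count(itemset):
    """X.count: number of transactions containing `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def rule_support(X, Y):
    """sup(X -> Y) = (X union Y).count / n"""
    return count(X | Y) / len(transactions)

def rule_confidence(X, Y):
    """conf(X -> Y) = (X union Y).count / X.count"""
    return count(X | Y) / count(X)

X, Y = {"Milk", "Diaper"}, {"Beer"}
print(rule_support(X, Y))     # 0.4  (2 of 5 transactions)
print(rule_confidence(X, Y))  # 0.666... (2 of the 3 transactions with X)
```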
Definition: Association Rule
• Association Rule: an implication expression of the form X → Y,
  where X and Y are itemsets
• Example: {Milk, Diaper} → {Beer}

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
Basic Concept: Association Rules

Transaction-id  Items bought
10              A, B, C
20              A, C
30              A, D
40              B, E, F

Let min_support = 50%, min_conf = 50%:
A → C (support 50%, confidence 66.7%)
C → A (support 50%, confidence 100%)

Frequent pattern  Support
{A}               75%
{B}               50%
{C}               50%
{A, C}            50%
Association Rule Mining Task
• Given a set of transactions T, the goal of association rule mining is to find all
rules having
• support ≥ minsup threshold
• confidence ≥ minconf threshold
• Brute-force approach:
• List all possible association rules
• Compute the support and confidence for each rule
• Prune rules that fail the minsup and minconf thresholds
Computationally prohibitive!
Frequent Itemset Generation
• Brute-force approach:
• Each itemset in the lattice is a candidate frequent itemset
• Count the support of each candidate by scanning the database
TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

• Match each transaction against every candidate
  (N = number of transactions, M = number of candidate itemsets,
  w = maximum transaction width)
• Complexity ~ O(NMw) => expensive, since M = 2^d for d items!
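To see how quickly the O(NMw) cost blows up, a back-of-the-envelope sketch (the function name is my own):

```python
# Rough cost of the brute-force support count: N transactions,
# M = 2**d - 1 non-empty candidate itemsets over d items,
# transaction width w -> one subset test per (transaction, candidate) pair.
def brute_force_cost(N, d, w):
    M = 2 ** d - 1  # every non-empty subset of the d items is a candidate
    return N * M * w

# Even a modest catalogue explodes: with d = 20 items, M already exceeds a million,
# so 1000 transactions of width 5 cost over 5 billion subset tests.
print(brute_force_cost(N=1000, d=20, w=5))  # 5242875000
```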
Brute-force approach: given d items, the total number of possible association
rules is

R = Σ_{k=1}^{d-1} [ C(d, k) × Σ_{j=1}^{d-k} C(d-k, j) ] = 3^d − 2^{d+1} + 1
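The closed form can be checked against the double sum directly (function names are my own):

```python
from math import comb

def rule_count_sum(d):
    """Direct double sum: choose k items for the antecedent, then j of the
    remaining d-k items for the consequent."""
    return sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
               for k in range(1, d))

def rule_count_closed(d):
    """Closed form: R = 3^d - 2^(d+1) + 1."""
    return 3 ** d - 2 ** (d + 1) + 1

# The two formulas agree for every d.
for d in range(1, 10):
    assert rule_count_sum(d) == rule_count_closed(d)

print(rule_count_closed(6))  # 602 possible rules over just 6 items
```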
1. Rule Generation
– Generate high confidence rules from each frequent itemset, where
each rule is a binary partitioning of a frequent itemset
The Apriori algorithm
• The algorithm uses a level-wise search, where k-itemsets are used to explore
(k+1)-itemsets
• In this algorithm, frequent subsets are extended one item at a time (this step
is known as candidate generation process)
• Then groups of candidates are tested against the data.
• It identifies the frequent individual items in the database and extends them
  to larger and larger itemsets, as long as those itemsets appear sufficiently
  often in the database.
• The Apriori algorithm determines frequent itemsets that can be used to
  derive association rules, which highlight general trends in the database.
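The level-wise search described above can be sketched as follows. This is a minimal illustration, not an optimized implementation; the function and variable names are my own:

```python
from itertools import combinations

def apriori(transactions, minsup_count):
    """Level-wise Apriori: frequent k-itemsets seed the (k+1)-candidates."""
    def count(s):
        return sum(1 for t in transactions if s <= t)

    items = sorted({i for t in transactions for i in t})
    frequent = {}  # frozenset -> support count
    # L1: frequent individual items
    current = [frozenset([i]) for i in items
               if count(frozenset([i])) >= minsup_count]
    k = 1
    while current:
        for s in current:
            frequent[s] = count(s)
        # Candidate generation: join frequent k-itemsets into (k+1)-itemsets,
        # keeping only those whose every k-subset is frequent (Apriori property),
        # then test the survivors against the data.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k + 1}
        current = [c for c in candidates
                   if all(frozenset(sub) in frequent
                          for sub in combinations(c, k))
                   and count(c) >= minsup_count]
        k += 1
    return frequent

# The toy data set from the "Basic Concept" slide, minsup = 50% (count >= 2):
T = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
for s, c in apriori(T, minsup_count=2).items():
    print(sorted(s), c)  # {A}: 3, {B}: 2, {C}: 2, {A,C}: 2
```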
The Apriori algorithm
• The Apriori algorithm takes advantage of the fact that any subset of a
frequent itemset is also a frequent itemset.
• i.e., if {l1, l2} is a frequent itemset, then {l1} and {l2} must also be frequent itemsets.
• Let us assume:
• minimum confidence threshold is 60%
Association Rules with confidence
• R1 : 1,3 -> 5
– Confidence = sc{1,3,5}/sc{1,3} = 2/3 = 66.66% (R1 is selected)
• R2 : 1,5 -> 3
– Confidence = sc{1,5,3}/sc{1,5} = 2/2 = 100% (R2 is selected)
• R3 : 3,5 -> 1
– Confidence = sc{3,5,1}/sc{3,5} = 2/3 = 66.66% (R3 is selected)
• R4 : 1 -> 3,5
– Confidence = sc{1,3,5}/sc{1} = 2/3 = 66.66% (R4 is selected)
• R5 : 3 -> 1,5
– Confidence = sc{3,1,5}/sc{3} = 2/4 = 50% (R5 is REJECTED)
• R6 : 5 -> 1,3
– Confidence = sc{5,1,3}/sc{5} = 2/4 = 50% (R6 is REJECTED)
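The six checks above can be reproduced from the support counts alone. A small sketch (the `sc` table holds the support counts implied by the worked example; the helper name is my own):

```python
# Support counts implied by the worked example above.
sc = {
    frozenset({1}): 3, frozenset({3}): 4, frozenset({5}): 4,
    frozenset({1, 3}): 3, frozenset({1, 5}): 2, frozenset({3, 5}): 3,
    frozenset({1, 3, 5}): 2,
}

def confidence(X, Y):
    """conf(X -> Y) = sc(X union Y) / sc(X)."""
    return sc[frozenset(X | Y)] / sc[frozenset(X)]

rules = [({1, 3}, {5}), ({1, 5}, {3}), ({3, 5}, {1}),
         ({1}, {3, 5}), ({3}, {1, 5}), ({5}, {1, 3})]
for X, Y in rules:
    c = confidence(X, Y)
    verdict = "selected" if c >= 0.6 else "REJECTED"
    print(sorted(X), "->", sorted(Y), f"{c:.2%}", verdict)
```

With the 60% threshold, R1-R4 pass and R5-R6 fail, matching the slide.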
How to efficiently generate rules?
• In general, confidence does not have an anti-monotone property
c(ABC→D) can be larger or smaller than c(AB →D)
• But confidence of rules generated from the same itemset has an anti-monotone
property
• e.g., L= {A,B,C,D}
c(ABC→D) ≥ c(AB→CD) ≥ c(A→BCD)
Rule generation for Apriori Algorithm
• A candidate rule is generated by merging two rules that share the same prefix
in the rule consequent
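The merge step can be sketched as follows (an assumption-laden illustration: rules are (antecedent, consequent) pairs of sorted tuples, and the function name is my own). For example, joining CD → AB with BD → AC, whose consequents share the prefix A, yields the candidate rule D → ABC:

```python
def merge_rules(rule1, rule2):
    """Join two rules from the same frequent itemset whose m-item consequents
    share an (m-1)-item prefix into a candidate with an (m+1)-item consequent."""
    (a1, c1), (a2, c2) = rule1, rule2
    if c1[:-1] != c2[:-1]:  # consequents must share the same prefix
        return None
    consequent = tuple(sorted(set(c1) | set(c2)))
    # The antecedent is whatever remains of the original itemset.
    antecedent = tuple(sorted((set(a1) | set(a2)) - set(consequent)))
    return (antecedent, consequent)

# Joining CD -> AB with BD -> AC produces the candidate D -> ABC.
print(merge_rules((("C", "D"), ("A", "B")),
                  (("B", "D"), ("A", "C"))))  # (('D',), ('A', 'B', 'C'))
```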