You are on page 1of 15

Mining Association Rules

 Proposed by Rakesh Agrawal et al in 1993.


 It is an important data mining model studied extensively
by the database and data mining community.
 The term association describes a relationship between a
set of items that people tend to buy together .
 For example, if customers buy two-wheelers, there is a
possibility that they also buy some accessories such as
seat cover, helmet, gloves,etc.
 The discovery of association rules is one of the major
tasks involved in data mining.
 The task of mining association rules is to find interesting
releationship among various items in a given data set.
 Let us consider some examples of associations given as
follows:

CS583, Bing Liu, UIC 2


 A person who buys a mobile is also likely to buy some
accessories such as mobile cover,hands free,etc.
 A person who buys bread is also likely to buy butter and jam.
 Someone who buys eggs is also likely to buy bread.
 Someone who bought the book Data Structure using C is also
likely to buy programming in C.
 An association rule has the form X  Y
WhereX={X1,X2,…Xn } and Y={Y1,Y2,…Yn } are the
disjoint sets of items, that is, X Y =  . It stands that if
a person buys an item X, he/she is likely to buy an item Y.
 The set X  Y is called an itemset – a set of items a
custormer tends to buy together. The item X is called the
antecedent, while Y is called the consequent of the rule.
 An example of association rule is
 Mobile  mobile cover, head set

CS583, Bing Liu, UIC 3


For the association rule to be of interest to an analyst, the rule should
satisfy two interest measures: namely support and confidence
Support (also known as prevalence):

 It is the percentage or fraction of the total transactions that satisfy both the
antecedent and consequent of the rule.
 If the support is low, it implies that there is no strong evidence that the items

in the itemset X  Y are bought together.


 Support can be calculated as

support (X  Y) = P (X  Y )
For example , suppose only 0.002% of the customers buy mobile and
chocolates, then the support for the rule mobile  chocolates will be low.
Confidence (also known as strength) ;
 It is the probability that a customer will buy the items in the set Y if he/she
purchases the items in the set X. It is computed as
Confidence (XY) = P(Y|X) = support (X  Y ) / support (X)
Ex : the association rule mobile  mobile cover has he confidence of 80% if
80% of the purchase that include mobile also include mobile cover.
In general, association rules are said to be interesting or strong if they
satisfy both a minimum support threshold (min_sup) and minimum
confidence threshold (min_conf).
CS583, Bing Liu, UIC 4
Association rule mining
 Initially used for Market Basket Analysis to find
how items purchased by customers are related.
Bread  Milk [sup = 5%, conf = 100%]

CS583, Bing Liu, UIC 5


 Objective:

Given a transaction T, the problem of mining


association rules is to discover all association
rules in T that have support and confidence
greater or equal to user specified minimum
support (minsup) and minimum
confidence (minconf).

CS583, Bing Liu, UIC 6


The model: data

 I = {i1, i2, …, im}: a set of items.


 Transaction t :
 t a set of items, and t  I.

 Transaction Database T: a set of transactions


T = {t1, t2, …, tn}.

CS583, Bing Liu, UIC 7


Transaction data: supermarket data
 Market basket transactions:
t1: {bread, cheese, milk}
t2: {apple, eggs, salt, yogurt}
… …
tn: {biscuit, eggs, milk}
 Concepts:
 An item: an item/article in a basket
 I: the set of all items sold in the store
 A transaction: items purchased in a basket; it may
have TID (transaction ID)
 A transactional dataset: A set of transactions
CS583, Bing Liu, UIC 8
The model: rules
 A transaction t contains X, a set of items
(itemset) in I, if X  t.
 An association rule is an implication of the
form:
X  Y, where X, Y  I, and X Y = 

 An itemset is a set of items.


 E.g., X = {milk, bread, butter} is an itemset.
 A k-itemset is an itemset with k items.
 E.g., {milk, bread, butter} is a 3-itemset

CS583, Bing Liu, UIC 9


Rule strength measures
 Support: The rule holds with support sup in T
(the transaction data set) if sup% of
transactions contain X  Y.
 sup = Pr(X  Y).
 Confidence: The rule holds in T with
confidence conf if conf% of transactions that
contain X also contain Y.
 conf = Pr(Y | X)
 An association rule X-> Y states that when X
occurs, Y occurs with certain probability.

CS583, Bing Liu, UIC 10


Support and Confidence
 Support count: The support count of an
itemset X, denoted by X.count, in a data set
T is the number of transactions in T that
contain X. Assume T has n transactions.
 Then,
( X  Y ).count
support 
n
( X  Y ).count
confidence 
X .count
CS583, Bing Liu, UIC 11
Goal and key features
 Goal: Find all rules that satisfy the user-
specified minimum support (minsup) and
minimum confidence (minconf).
 Key Features
 Completeness: find all rules.
 No target item(s) on the right-hand-side
 Mining with data on hard disk (not in memory)

CS583, Bing Liu, UIC 12


t1: Beef, Chicken, Milk
An example t2: Beef, Cheese
t3: Cheese, Boots
t4: Beef, Chicken, Cheese
t5: Beef, Chicken, Clothes, Cheese, Milk
 Transaction data t6: Chicken, Clothes, Milk
 Assume: t7: Chicken, Milk, Clothes

minsup = 30%
minconf = 80%
 An example frequent itemset (also called large
itemset whose support is greater than minsup):
{Chicken, Clothes, Milk} [sup = 3/7]
 Association rules from the itemset:
Clothes  Milk, Chicken [sup = 3/7, conf = 3/3]

Clothes, Chicken  Milk, [sup = 3/7, conf = 3/3]
CS583, Bing Liu, UIC 13
Transaction data representation
 A simplistic view of shopping baskets,
 Some important information not considered.
E.g,
 the quantity of each item purchased and
 the price paid.

CS583, Bing Liu, UIC 14


Many mining algorithms
 There are a large number of them!!
 They use different strategies and data structures.
 Their resulting sets of rules are all the same.
 Given a transaction data set T, and a minimum support and
a minimum confidence, the set of association rules existing
in T is uniquely determined.
 Any algorithm should find the same set of rules
although their computational efficiencies and
memory requirements may be different.
 We will discuss only one: the Apriori Algorithm

CS583, Bing Liu, UIC 15

You might also like