
Data mining and data warehousing

Lecture 06
Association Rule Mining
Definitions: Association Rule Mining
Given a set of transactions, find rules that will predict the occurrence of an
item based on the occurrences of other items in the transaction.
It is an example of unsupervised, undirected data mining.
Example:
Set of transactions:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Association rules found in the transactions:
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}
Applications of association mining in Business enterprises
1. Market Basket Analysis:
• Association rules are often used by retail stores to analyze market basket
transactions.
• Given a database of customer transactions, where each transaction is
represented as a set of items, the aim is to find groups of items that are
frequently purchased together, e.g., the well-known beer-and-diaper case.
• The discovered association rules can be used by management to increase the
effectiveness (and reduce the cost) of advertising, target marketing,
inventory control, and stock placement on the floor.
2. Credit Cards / Banking Services: analyzing payments, where each
card/account is represented as a transaction containing the set of the
customer's payments.
Applications of association mining in business enterprises

Market Basket Analysis: Example
• A grocery store has weekly specials for which advertising supplements are
created for the local newspaper.
• When an item, such as peanut butter, has been designated to go on sale,
management determines what other items are frequently purchased with peanut
butter. They find that bread is purchased with peanut butter 30% of the time
and that jelly is purchased with it 40% of the time.
• Based on these associations, special displays of jelly and bread are placed
near the peanut butter, which is on sale.
• They also decide not to put these items on sale. These actions are aimed at
increasing overall sales volume by taking advantage of the frequency with
which these items are purchased together.
Applications of association mining in the medical field
1. Medical treatments: finding symptoms that occur together, where each
patient is represented as a transaction containing the ordered set of
diseases. For example, describing patients who exhibited symptom A and also
exhibited symptom B in the same disease category during the same season.
2. Drug analysis: finding shared substructures in a group of effective drugs.
For example, describing patients who were treated with drug X and developed
side effect B at a particular rate.
3. Gene association analysis: determining commonly occurring subsequences in
a group of genes.
Definitions: Association Rule

An association rule is an implication expression of the form

Itemset1 ⇒ Itemset2

where:
• Itemset1 and Itemset2 are disjoint
• Itemset2 is non-empty

This means that if a transaction includes Itemset1, then it also contains
Itemset2.
Meaning of Association Rules

• Association rules do not represent any sort of causality or correlation
between the two itemsets.
• The implication means co-occurrence!
• X → Y does not mean that X causes Y; there is no causality implied.
Association rules: Example

Given a set of market-basket transactions:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

the following association rules can be generated:

{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}
Association rules: Example

{Diaper} → {Beer}

[Diagram: two overlapping circles, one for customers who buy diapers and one
for customers who buy beer; the overlap represents customers who buy both.]
Types of Association Rules

1. Actionable rules – contain high-quality, actionable information.
2. Trivial rules – information already well known by those familiar with the
business.
3. Inexplicable rules – have no explanation and do not suggest any action.

Trivial and inexplicable rules occur most often.
Definitions

Transaction: an event that associates an itemset (I) with its transaction ID
(TID).
TID: a unique identifier that is associated with each transaction.

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
Definitions

Itemset (I): a collection of one or more items (order is not important).
Example: I = {Milk, Bread, Diaper}
Item: a value or element of an itemset, e.g., Bread.
k-itemset: an itemset that contains k items.

Data set: a set of transactions (itemsets with IDs).
Database: a collection of related data sets.

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
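As a concrete illustration (an addition to these notes, not part of the
original slides), such a transaction database can be represented in Python as
a list of sets; the code sketches later in this lecture assume this
representation:

```python
# A minimal sketch: the five-transaction market-basket database
# represented as a list of Python sets (TIDs are list indices + 1).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
```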
Definitions

Support count (σ): the frequency of occurrence of an itemset.

Example:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

σ({Milk, Bread, Diaper}) = 2
Evaluation Metrics for Association Rules

There are two main categories of measures of the interestingness of
association rules, namely subjective measures and objective measures.

1. Subjective measures:
An association rule (pattern) is interesting if
a) it is unexpected: surprising to the user; and/or
b) it is actionable: the user can do something with it.
Evaluation Metrics for Association Rules

2. Objective measures:
An association rule (pattern) is interesting if it scores equal to or greater
than the required minimum on measures such as:
a) minimum support, and/or
b) minimum confidence
c) simplicity
d) lift
e) leverage
f) conviction
Measures

• In most cases, it is sufficient to focus on a combination of support,
confidence, and either lift or leverage to quantitatively measure the
"quality" of a rule.
• However, the real value of a rule, in terms of usefulness and
actionability, is subjective and depends heavily on the particular domain and
business objectives.
Objective measures:

(a) Support (utility): the fraction of transactions that contain an itemset,
i.e., the proportion of transactions that contain both X and Y:

supp(X → Y) = σ(X ∪ Y) / m

where σ(X ∪ Y) is the number of records (transactions) that contain every
item of X and Y, and m is the total number of transactions.

(b) Confidence (certainty): the ratio of the number of transactions that
contain both X and Y to the number of transactions that contain X. Confidence
measures how often items in Y appear in transactions that contain X:

conf(X → Y) = σ(X ∪ Y) / σ(X)
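As a minimal sketch (an illustrative addition, assuming the list-of-sets
transactions representation introduced earlier), these two measures can be
computed as follows:

```python
def support_count(itemset, transactions):
    # sigma(itemset): number of transactions containing every item of itemset.
    return sum(1 for t in transactions if itemset <= t)

def support(X, Y, transactions):
    # supp(X -> Y) = sigma(X u Y) / m
    return support_count(X | Y, transactions) / len(transactions)

def confidence(X, Y, transactions):
    # conf(X -> Y) = sigma(X u Y) / sigma(X)
    return support_count(X | Y, transactions) / support_count(X, transactions)

# For the rule {Milk, Diaper} -> {Beer} on the five-transaction database:
print(support({"Milk", "Diaper"}, {"Beer"}, transactions))     # 2/5 = 0.4
print(confidence({"Milk", "Diaper"}, {"Beer"}, transactions))  # 2/3 ~ 0.67
```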
Support and confidence

Support and confidence are the main objective measures of association rules.
• Exact rule: a rule that has a confidence value of 100%.
• Even if confidence reaches high values, the rule is not useful unless the
support value is high as well.
• Strong rules: rules that have both high confidence and high support.

Rule Evaluation Metrics: Example

Given a set of market-basket transactions:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

determine the support and confidence of the rule {Milk, Diaper} → {Beer}.

Solution:
s = σ(Milk, Diaper, Beer) / m = 2/5 = 0.4
c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 ≈ 0.67
Rule Evaluation Metrics: Example

Given that the minimum support is 50% and the minimum confidence is 50%, find
all rules X & Y → Z with minimum confidence and support.

Transaction ID  Items Bought
2000            A, B, C
1000            A, C
4000            A, D
5000            B, E, F

Solution:
A → C (support 50%, confidence 66.6%)
C → A (support 50%, confidence 100%)
Rule Evaluation Metrics: Other objective measures

c) Simplicity
• A simple rule has smaller itemsets.
• With smaller itemsets, a rule is easier to interpret.
• The length of a rule can be limited by a user-defined threshold.
Example:
Consider the following association rule that holds between cocoa powder and
bread, milk, and salt:
buys(cocoa powder) → buys(bread, milk, salt)
Simpler rules might be:
buys(cocoa powder) → buys(milk)
buys(cocoa powder) → buys(bread)
buys(cocoa powder) → buys(salt)

Rule Evaluation Metrics: Other objective measures

d) Lift measures how far from independence X and Y are; that is, how many
times more often X and Y occur together than expected if they were
statistically independent.
• It ranges from 0 to ∞.
• Values close to 1 imply that X and Y are independent and the rule is not
interesting.
• Values far from 1 indicate that the evidence of X provides information
about Y; that is, X and Y are associated.
• Lift measures co-occurrence only (not implication).
• It is defined as:

lift(X → Y) = conf(X → Y) / supp(Y) = supp(X ∪ Y) / (supp(X) × supp(Y))
Rule Evaluation Metrics: Other objective measures

• Conviction is a measure of implication and has value 1 if the items are
unrelated (independent).
• It compares the observed frequency with which X appears without Y against
the frequency expected if X and Y were independent.
• Conviction attempts to measure the degree of implication of a rule.
• Conviction is sensitive to rule direction:
conv(X → Y) ≠ conv(Y → X).
Rule Evaluation Metrics: Other objective measures

• Conviction ranges from 0 to ∞.
• Conviction is 1 if X and Y are independent; hence such a rule is not
interesting.
• Conviction values far from 1 indicate interesting rules.
• Conviction was developed as an alternative to confidence, which was found
not to capture the direction of associations adequately.
• Unlike confidence, conviction considers the support of both the antecedent
and the consequent.
• It is defined as:

conv(X → Y) = (1 − supp(Y)) / (1 − conf(X → Y))
Rule Evaluation Metrics: Other objective measures

e) Leverage measures the novelty of a rule by determining how much more
co-occurrence is observed than would be expected under independence.
• That is, it compares the co-occurrence of X and Y with the expected support
if X and Y were independent.
• It ranges from −0.25 to 0.25.
• High leverage implies high support; that is, the rule is interesting.
• It is computed as follows:

leverage(X → Y) = supp(X ∪ Y) − supp(X) × supp(Y)
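A minimal sketch (an illustrative addition, not from the original slides) of
lift, leverage, and conviction in Python, reusing the support_count, support,
and confidence helpers defined earlier:

```python
def lift(X, Y, transactions):
    # lift(X -> Y) = supp(X u Y) / (supp(X) * supp(Y))
    m = len(transactions)
    sx = support_count(X, transactions) / m
    sy = support_count(Y, transactions) / m
    return support(X, Y, transactions) / (sx * sy)

def leverage(X, Y, transactions):
    # leverage(X -> Y) = supp(X u Y) - supp(X) * supp(Y)
    m = len(transactions)
    sx = support_count(X, transactions) / m
    sy = support_count(Y, transactions) / m
    return support(X, Y, transactions) - sx * sy

def conviction(X, Y, transactions):
    # conv(X -> Y) = (1 - supp(Y)) / (1 - conf(X -> Y)); infinite when conf = 1
    sy = support_count(Y, transactions) / len(transactions)
    c = confidence(X, Y, transactions)
    return float("inf") if c == 1 else (1 - sy) / (1 - c)
```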
Association Rule Mining goal

• Given a set of transactions T, the goal of association rule mining is to
find all rules having support ≥ a minsup threshold and confidence ≥ a minconf
threshold.
• This means that rule generation should ensure the production of rules that
satisfy the minimum confidence and minimum support thresholds.

Achieving the association rule mining goal

Given a set of transactions T, the goal of association rule mining is
achieved by executing the following two main tasks:

1. Generating frequent itemsets whose support count ≥ the minsup count
threshold, where the support count is the occurrence frequency of an itemset.

2. Generating association rules from those frequent itemsets, with
confidence ≥ the minconf threshold.

These two tasks are executed iteratively until no new rules emerge.

1. Generating frequent itemsets

Frequent itemset generation involves generating all itemsets whose
support ≥ minsup.

A frequent itemset is an itemset whose support is greater than or equal to
the minimum support count (the occurrence frequency of an itemset).
1. Generating frequent itemsets

Example:
Given that the minimum support count = 2, find the frequent itemsets in the
following transaction database.

TID  Products
1    A, B, E
2    B, D
3    B, C
4    A, B, D
5    A, C
6    B, C
7    A, C
8    A, B, C, E
9    A, B, C

Solution:
• sup({A, B, E}) = 2
• sup({B, C}) = 4
• and others
1. Generating frequent itemsets

Apriori principle
This is a rule used in association mining algorithms to create frequent
itemsets from a given set of itemsets.
The Apriori principle states that if an itemset is frequent, then all of its
subsets must also be frequent.
The Apriori principle holds due to the following property of the support
measure:

∀X, Y : (X ⊆ Y) ⇒ s(X) ≥ s(Y)

where X is a subset and Y is an itemset.
The support of an itemset never exceeds the support of its subsets.
This is known as the anti-monotone property of support.
1. Generating frequent itemsets

Apriori principle example:
Suppose {A, C} is frequent. Since each occurrence of {A, C} includes both A
and C, both {A} and {C} must also be frequent.
A similar argument holds for larger itemsets.
Almost all association rule algorithms are based on this subset property.

The Apriori principle is also known as the apriori property or the subset
property.

2. Generating association rules from frequent itemsets

Frequent itemsets != association rules
• Association rules are generated from frequent itemsets.
• One more step is required to find association rules.
There are two main approaches for generating association rules:
1. the brute-force approach
2. the two-step approach
1. Brute-force approach:

Steps:
1. Each itemset in the transaction database is a candidate frequent itemset
(that is, minsup count = 1).
2. List all possible association rules.
3. Compute the support and confidence for each rule.
4. Prune rules that fail the minsup and minconf thresholds.
A sketch of this approach follows below.
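As a minimal illustrative sketch (an addition to these notes), the
brute-force approach can be written as follows; it enumerates every candidate
itemset and every binary partition of it, which is exponential in the number
of items and motivates the Apriori pruning described later:

```python
from itertools import combinations

def brute_force_rules(transactions, minsup, minconf):
    # Enumerate all rules X -> Y; keep those with support >= minsup
    # and confidence >= minconf. Assumes minsup > 0, so sigma(X) > 0
    # whenever confidence is computed.
    items = sorted(set().union(*transactions))
    rules = []
    for k in range(2, len(items) + 1):
        for candidate in combinations(items, k):   # candidate itemset
            s = set(candidate)
            for r in range(1, len(s)):             # binary partitions X -> Y
                for X in combinations(sorted(s), r):
                    X = set(X)
                    Y = s - X
                    sup = support(X, Y, transactions)
                    if sup >= minsup:
                        conf = confidence(X, Y, transactions)
                        if conf >= minconf:
                            rules.append((X, Y, sup, conf))
    return rules
```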
Brute-force approach: Example

Q: Given the frequent itemset {A, B, E}, what association rules have
minsup = 1 and minconf = 50%?

TID  List of items
1    A, B, E
2    B, D
3    B, C
4    A, B, D
5    A, C
6    B, C
7    A, C
8    A, B, C, E
9    A, B, C

Solution:
A, B → E : conf = 2/4 = 50%
A, E → B : conf = 2/2 = 100%
B, E → A : conf = 2/2 = 100%
E → A, B : conf = 2/2 = 100%

NB: Rules originating from the same itemset have identical support counts but
can have different confidence.
Brute-force approach: Example

Rules that do not qualify (i.e., confidence < the 50% minconf):
A → B, E : conf = 2/6 ≈ 33% < 50%
B → A, E : conf = 2/7 ≈ 28% < 50%
∅ → A, B, E : conf = 2/9 ≈ 22% < 50%
2. Two-step approach:

Steps:
i. Frequent itemset generation: generate all itemsets whose support ≥ minsup.
ii. Rule generation: generate high-confidence rules from each frequent
itemset, where each rule is a binary partitioning of a frequent itemset.
2. Two-step approach Example

Apriori Algorithm
• Apriori is a Latin word that means "from what comes before".
• The algorithm is called Apriori because it uses knowledge from the previous
iteration phase to produce frequent itemsets.
• It attempts to find subsets which are common (frequent) in a given database
of itemsets (e.g., collections of items bought by customers).

2. Two-step approach Example

The Apriori algorithm is used to generate frequent itemsets and thereby
reduce the number of candidate itemsets. The algorithm makes several passes
over the transaction list.

Apriori Algorithm, Step 1:
• Let k = 1.
• Generate frequent itemsets of length 1.
• Repeat until no new frequent itemsets are identified:
  • Generate length-(k+1) candidate itemsets from the length-k frequent
    itemsets.
  • Prune candidate itemsets containing subsets of length k that are
    infrequent.
  • Count the support of each candidate by scanning the DB.
  • Eliminate candidates that are infrequent, leaving only those that are
    frequent.
A sketch of this step is shown below.
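A minimal sketch of Step 1 (an illustrative addition; the function and
variable names are my own, and the list-of-sets transactions representation
from earlier is assumed):

```python
from itertools import combinations

def apriori_frequent_itemsets(transactions, minsup_count):
    # Returns {frozenset(itemset): support_count} for all frequent itemsets.
    # k = 1: count individual items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= minsup_count}
    result = dict(frequent)
    k = 1
    while frequent:
        # Generate length-(k+1) candidates by joining length-k frequent
        # itemsets.
        prev = list(frequent)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        # Prune candidates that have an infrequent length-k subset
        # (the Apriori / anti-monotone principle).
        candidates = {c for c in candidates
                      if all(frozenset(sub) in frequent
                             for sub in combinations(c, k))}
        # Count support with one scan of the DB, then eliminate infrequent
        # candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= minsup_count}
        result.update(frequent)
        k += 1
    return result
```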
2. Two-step approach Example

Apriori Algorithm, Step 2:
For each frequent itemset X:
  For each proper non-empty subset A of X:
    • Let B = X − A.
    • A → B is an association rule if confidence(A → B) ≥ minConf,
      where support(A → B) = support(A ∪ B), and
      confidence(A → B) = support(A ∪ B) / support(A).
A sketch of this step is shown below.
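A matching sketch of Step 2 (an illustrative addition, reusing
apriori_frequent_itemsets from the Step 1 sketch):

```python
from itertools import combinations

def apriori_rules(transactions, minsup_count, minconf):
    # Generate all rules A -> B with confidence >= minconf
    # from the frequent itemsets.
    frequent = apriori_frequent_itemsets(transactions, minsup_count)
    rules = []
    for X, sup_X in frequent.items():
        if len(X) < 2:
            continue
        for r in range(1, len(X)):                 # proper non-empty subsets A
            for A in combinations(X, r):
                A = frozenset(A)
                B = X - A
                # conf(A -> B) = support(A u B) / support(A); A is a subset of
                # a frequent itemset, so its count is already in `frequent`.
                conf = sup_X / frequent[A]
                if conf >= minconf:
                    rules.append((set(A), set(B), sup_X, conf))
    return rules
```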
Apriori Algorithm: Example

1. Given the following transaction database, use the Apriori algorithm to
find the frequent 3-itemsets, where the minimum support count = 3.

TID  Items
1    Milk, Beer
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Bread, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

2. Generate strong association rules using the Apriori algorithm, given that
the minimum confidence = 1.

Important observation: the counts of subsets can't be smaller than the count
of an itemset!
Apriori Algorithm Example: 1. Generating frequent itemsets

Minimum support count = 3

Items (1-itemsets):
Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Pairs (2-itemsets) (no need to generate candidates involving Coke or Eggs):
Itemset          Count
{Bread, Milk}    3
{Bread, Beer}    2
{Bread, Diaper}  4
{Milk, Beer}     2
{Milk, Diaper}   3
{Beer, Diaper}   2

Triplets (3-itemsets): write all possible 3-itemsets and prune the list based
on the infrequent 2-itemsets.
Itemset                Count
{Bread, Milk, Diaper}  3

NB: Counts of subsets can't be smaller than the count of an itemset!
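For reference (an illustrative usage, not part of the original slides), the
Step 1 sketch reproduces these tables:

```python
transactions2 = [
    {"Milk", "Beer"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Bread", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
frequent = apriori_frequent_itemsets(transactions2, minsup_count=3)
# Among others: frequent[frozenset({"Bread", "Milk", "Diaper"})] == 3
```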
Apriori Algorithm Example: 2. Rule generation

• For each frequent k-itemset, where k > 1, generate high-confidence rules
with one item in the consequent.
• Using these rules, iteratively generate high-confidence rules with more
than one item in the consequent.
• If any rule has low confidence, then all rules whose consequents are
supersets of its consequent can be pruned (not generated).

For the frequent itemset {Bread, Milk, Diaper}:

1-item rules:
{Bread, Milk} → {Diaper} (s = 0.6, c = 1.0)
{Milk, Diaper} → {Bread} (s = 0.6, c = 1.0)
{Diaper, Bread} → {Milk} (s = 0.6, c = 0.75)

2-item rules:
{Bread} → {Milk, Diaper} (s = 0.6, c = 0.75)
{Diaper} → {Milk, Bread} (s = 0.6, c = 0.75)
{Milk} → {Diaper, Bread} (s = 0.6, c = 0.75)
Apriori Algorithm: Example 2
[Worked example: generating frequent itemsets]

Apriori Algorithm: generating association rules
[Worked example: generating association rules]

Exercise:
1. Use the Apriori algorithm to generate association rules from the following
frequent 3-itemset: {b, c, e}.
2. Use the Weka software to mine the data set using the Apriori algorithm and
compare your rules with those generated by Weka.
Interpreting association mining results generated by Weka

The output can be interpreted as follows:
• Lift = 1.1
• Number of cycles performed: 13
• Size of the set of large itemsets L(1): 22
• 22 one-item sets
• 36 two-item sets
• 3 three-item sets
Association rule mining parameters

When performing association rule mining using Weka, the following parameters
need to be set:
1. Required number of rules to output: the required number of rules
(default = 10).
2. Metric type: the metric by which to rank rules; the options include
confidence, lift, leverage, and conviction. The default metric is confidence.
3. Minimum metric score of a rule (C): the minimum confidence of a rule
(default = 0.9). Weka considers only rules with scores higher than this
value.
4. Delta (D) for minimum support: the rate by which the minimum support is
decreased in each iteration (default = 0.05).
Association rule mining parameters

5. Upper bound for minimum support: the highest allowed value of support
(default = 1.0).
6. Lower bound for minimum support: the lowest allowed value of support
(default = 0.1). The minimum support can be set in the lower-bound minimum
support field.

• Usually, Apriori in Weka starts with the upper-bound support and
incrementally decreases the support (by delta increments, which by default
are set to 0.05, i.e., 5%).
• The algorithm halts when either the specified number of rules has been
generated or the lower bound for minimum support is reached.
Association rule mining parameters

7. Itemsets (I): specifies whether the itemsets found are also output
(default = no).
8. Remove columns (R): remove columns that contain all missing values
(default = no).
9. Class index: the index of the class attribute. If it is set to −1, the
last attribute is taken as the class attribute.
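As a hedged illustration (the flag letters are inferred from the parameter
names above; check them against your Weka version's documentation), these
options can also be passed to Apriori on the Weka command line:

```
java weka.associations.Apriori -t weather.nominal.arff \
     -N 10 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -I
```

Here -N is the number of rules, -C the minimum metric score, -D the delta,
-U and -M the upper and lower bounds for minimum support, and -I requests
that the itemsets found are also output.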
Interpreting association mining results generated by Weka

• Upper-bound and lower-bound support: Weka starts from the upper-bound
(highest allowed) support and iteratively reduces the minimum support until
it finds the required number of rules with the given minimum confidence.
• The significance-testing option is applicable only in the case of
confidence and is by default not used (-1.0).
Classification vs Association Rules

Classification Rules:
1. Focus on one target field
2. Specify class in all cases
3. Measures: accuracy, coverage

Association Rules:
1. Many target fields
2. Applicable in some cases
3. Measures: support, confidence, lift, leverage, conviction
Lab exercise

• Use the Weka software to perform association rule mining on the weather
data.
• Identify the number of large itemsets.
• Interpret the rules generated.
• Determine the quality of each rule by checking the values of the
evaluation measures.
Online reading materials
• Association mining using Weka:
http://facweb.cs.depaul.edu/mobasher/classes/ect584/WEKA/associate.html
End

Thank you

Questions
