
Data mining and data warehousing

Lecture 06
Association Rule Mining
Definitions: Association Rule Mining
Given a set of transactions, find rules that will predict the occurrence of an
item based on the occurrences of other items in the transaction.
It is an example of unsupervised, undirected data mining.
Example:
Set of transactions:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Association rules found in the transactions:
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}
Applications of association mining in Business enterprises
1. Market Basket Analysis:
• Association rules are often used by retail stores to analyze market basket
transactions.
• Given a database of customer transactions, where each transaction is
represented as a set of items, the aim is to find groups of items that are
frequently purchased together, e.g., the well-known beer-and-diaper case.
• The discovered association rules can be used by management to increase the
effectiveness (and reduce the cost) of advertising, target marketing,
inventory control, and stock placement on the floor.
2. Credit Cards / Banking Services: analyzing payments, where each
card/account is represented as a transaction containing the set of the
customer's payments.
Applications of association mining in business enterprises

Market Basket Analysis: Example
• A grocery store has weekly specials for which advertising supplements are
created for the local newspaper.
• When an item, such as peanut butter, has been designated to go on sale,
management determines what other items are frequently purchased with peanut
butter. They find that bread is purchased with peanut butter 30% of the time
and that jelly is purchased with it 40% of the time.
• Based on these associations, special displays of jelly and bread are placed
near the peanut butter, which is on sale.
• They also decide not to put these items on sale. These actions are aimed at
increasing overall sales volume by taking advantage of the frequency with
which these items are purchased together.
Applications of association mining in the medical field
1. Medical treatments: finding symptoms that occur together, where each
patient is represented as a transaction containing the ordered set of
diseases. For example, describing patients who exhibited symptom A and also
exhibited symptom B in the same disease category during the same season.
2. Drug analysis: finding shared substructures in a group of effective drugs.
For example, describing patients who were treated with drug X and developed
side effect B at a particular rate.
3. Gene association analysis: determining commonly occurring subsequences in
a group of genes.
Definitions: Association Rule

An association rule is an implication expression of the form

Itemset1 ⇒ Itemset2

where:
• Itemset1 and Itemset2 are disjoint
• Itemset2 is non-empty

This means that if a transaction includes Itemset1, then it also contains
Itemset2.
Meaning of Association Rules

• Association rules do not represent any sort of causality or correlation
between the two itemsets.
• The implication means co-occurrence!
• X → Y does not mean that X causes Y; there is no causality implied.
Association rules: Example

Given a set of market-basket transactions:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

the following association rules can be generated:

{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}
Association rules: Example

{Diaper} → {Beer}

[Diagram: two overlapping circles, one for customers who buy diapers and one
for customers who buy beer; the overlap represents customers who buy both.]
Types of Association Rules

1. Actionable rules – contain high-quality, actionable information.
2. Trivial rules – information already well known by those familiar with the
business.
3. Inexplicable rules – have no explanation and do not suggest any action.

Trivial and inexplicable rules occur most often.
Definitions

Transaction: an event that associates an itemset (I) with its transaction ID
(TID).
TID: a unique identifier that is associated with each transaction.

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
Definitions

Itemset (I): a collection of one or more items (order is not important).
Example: I = {Milk, Bread, Diaper}
Item: a value or element of an itemset, e.g., Bread.
k-itemset: an itemset that contains k items.

Data set: a set of transactions (itemsets with IDs).
Database: a collection of related data sets.

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
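As a concrete illustration (an addition to these notes, not part of the
original slides), such a transaction database can be represented in Python as
a list of sets; the code sketches later in this lecture assume this
representation:

```python
# A minimal sketch: the five-transaction market-basket database
# represented as a list of Python sets (TIDs are list indices + 1).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
```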
Definitions

Support count (σ): the frequency of occurrence of an itemset.

Example:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

σ({Milk, Bread, Diaper}) = 2
Evaluation Metrics for Association Rules

There are two main categories of measures of the interestingness of
association rules, namely subjective measures and objective measures.

1. Subjective measures:
An association rule (pattern) is interesting if
a) it is unexpected: surprising to the user; and/or
b) it is actionable: the user can do something with it.
Evaluation Metrics for Association Rules

2. Objective measures:
An association rule (pattern) is interesting if it scores equal to or greater
than the required minimum on measures such as:
a) minimum support, and/or
b) minimum confidence
c) simplicity
d) lift
e) leverage
f) conviction
Measures

• In most cases, it is sufficient to focus on a combination of support,
confidence, and either lift or leverage to quantitatively measure the
"quality" of a rule.
• However, the real value of a rule, in terms of usefulness and
actionability, is subjective and depends heavily on the particular domain and
business objectives.
Objective measures:

(a) Support (utility): the fraction of transactions that contain an itemset,
i.e., the proportion of transactions that contain both X and Y:

supp(X → Y) = σ(X ∪ Y) / m

where σ(X ∪ Y) is the number of records (transactions) that contain every
item of X and Y, and m is the total number of transactions.

(b) Confidence (certainty): the ratio of the number of transactions that
contain both X and Y to the number of transactions that contain X. Confidence
measures how often items in Y appear in transactions that contain X:

conf(X → Y) = σ(X ∪ Y) / σ(X)
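As a minimal sketch (an illustrative addition, assuming the list-of-sets
transactions representation introduced earlier), these two measures can be
computed as follows:

```python
def support_count(itemset, transactions):
    # sigma(itemset): number of transactions containing every item of itemset.
    return sum(1 for t in transactions if itemset <= t)

def support(X, Y, transactions):
    # supp(X -> Y) = sigma(X u Y) / m
    return support_count(X | Y, transactions) / len(transactions)

def confidence(X, Y, transactions):
    # conf(X -> Y) = sigma(X u Y) / sigma(X)
    return support_count(X | Y, transactions) / support_count(X, transactions)

# For the rule {Milk, Diaper} -> {Beer} on the five-transaction database:
print(support({"Milk", "Diaper"}, {"Beer"}, transactions))     # 2/5 = 0.4
print(confidence({"Milk", "Diaper"}, {"Beer"}, transactions))  # 2/3 ~ 0.67
```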
Support and confidence

Support and confidence are the main objective measures of association rules.
• Exact rule: a rule that has a confidence value of 100%.
• Even if confidence reaches high values, the rule is not useful unless the
support value is high as well.
• Strong rules: rules that have both high confidence and high support.

Rule Evaluation Metrics: Example

Given a set of market-basket transactions:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

determine the support and confidence of the rule {Milk, Diaper} → {Beer}.

Solution:
s = σ(Milk, Diaper, Beer) / m = 2/5 = 0.4
c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 ≈ 0.67
Rule Evaluation Metrics: Example

Given that the minimum support is 50% and the minimum confidence is 50%, find
all rules X & Y → Z with minimum confidence and support.

Transaction ID  Items Bought
2000            A, B, C
1000            A, C
4000            A, D
5000            B, E, F

Solution:
A → C (support 50%, confidence 66.6%)
C → A (support 50%, confidence 100%)
Rule Evaluation Metrics: Other objective measures

c) Simplicity
• A simple rule has smaller itemsets.
• With smaller itemsets, a rule is easier to interpret.
• The length of a rule can be limited by a user-defined threshold.
Example:
Consider the following association rule that holds between cocoa powder and
bread, milk, and salt:
buys(cocoa powder) → buys(bread, milk, salt)
Simpler rules might be:
buys(cocoa powder) → buys(milk)
buys(cocoa powder) → buys(bread)
buys(cocoa powder) → buys(salt)

Rule Evaluation Metrics: Other objective measures

d) Lift measures how far from independence X and Y are; that is, how many
times more often X and Y occur together than expected if they were
statistically independent.
• It ranges from 0 to ∞.
• Values close to 1 imply that X and Y are independent and the rule is not
interesting.
• Values far from 1 indicate that the evidence of X provides information
about Y; that is, X and Y are associated.
• Lift measures co-occurrence only (not implication).
• It is defined as:

lift(X → Y) = conf(X → Y) / supp(Y) = supp(X ∪ Y) / (supp(X) × supp(Y))
Rule Evaluation Metrics: Other objective measures

• Conviction is a measure of implication and has value 1 if the items are
unrelated (independent).
• It compares the observed frequency with which X appears without Y against
the frequency expected if X and Y were independent.
• Conviction attempts to measure the degree of implication of a rule.
• Conviction is sensitive to rule direction:
conv(X → Y) ≠ conv(Y → X).
Rule Evaluation Metrics: Other objective measures

• Conviction ranges from 0 to ∞.
• Conviction is 1 if X and Y are independent; hence such a rule is not
interesting.
• Conviction values far from 1 indicate interesting rules.
• Conviction was developed as an alternative to confidence, which was found
not to capture the direction of associations adequately.
• Unlike confidence, conviction considers the support of both the antecedent
and the consequent.
• It is defined as:

conv(X → Y) = (1 − supp(Y)) / (1 − conf(X → Y))
Rule Evaluation Metrics: Other objective measures

e) Leverage measures the novelty of a rule by determining how much more
co-occurrence is observed than would be expected under independence.
• That is, it compares the co-occurrence of X and Y with the expected support
if X and Y were independent.
• It ranges from −0.25 to 0.25.
• High leverage implies high support; that is, the rule is interesting.
• It is computed as follows:

leverage(X → Y) = supp(X ∪ Y) − supp(X) × supp(Y)
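A minimal sketch (an illustrative addition, not from the original slides) of
lift, leverage, and conviction in Python, reusing the support_count, support,
and confidence helpers defined earlier:

```python
def lift(X, Y, transactions):
    # lift(X -> Y) = supp(X u Y) / (supp(X) * supp(Y))
    m = len(transactions)
    sx = support_count(X, transactions) / m
    sy = support_count(Y, transactions) / m
    return support(X, Y, transactions) / (sx * sy)

def leverage(X, Y, transactions):
    # leverage(X -> Y) = supp(X u Y) - supp(X) * supp(Y)
    m = len(transactions)
    sx = support_count(X, transactions) / m
    sy = support_count(Y, transactions) / m
    return support(X, Y, transactions) - sx * sy

def conviction(X, Y, transactions):
    # conv(X -> Y) = (1 - supp(Y)) / (1 - conf(X -> Y)); infinite when conf = 1
    sy = support_count(Y, transactions) / len(transactions)
    c = confidence(X, Y, transactions)
    return float("inf") if c == 1 else (1 - sy) / (1 - c)
```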
Association Rule Mining goal

• Given a set of transactions T, the goal of association rule mining is to
find all rules having support ≥ a minsup threshold and confidence ≥ a minconf
threshold.
• This means that rule generation should ensure the production of rules that
satisfy the minimum confidence and minimum support thresholds.

Achieving the association rule mining goal

Given a set of transactions T, the goal of association rule mining is
achieved by executing the following two main tasks:

1. Generating frequent itemsets whose support count ≥ the minsup count
threshold, where the support count is the occurrence frequency of an itemset.

2. Generating association rules from those frequent itemsets, with
confidence ≥ the minconf threshold.

These two tasks are executed iteratively until no new rules emerge.

1. Generating frequent itemsets

Frequent itemset generation involves generating all itemsets whose
support ≥ minsup.

A frequent itemset is an itemset whose support is greater than or equal to
the minimum support count (the occurrence frequency of an itemset).
1. Generating frequent itemsets

Example:
Given that the minimum support count = 2, find the frequent itemsets in the
following transaction database.

TID  Products
1    A, B, E
2    B, D
3    B, C
4    A, B, D
5    A, C
6    B, C
7    A, C
8    A, B, C, E
9    A, B, C

Solution:
• sup({A, B, E}) = 2
• sup({B, C}) = 4
• and others
1. Generating frequent itemsets

Apriori principle
This is a rule used in association mining algorithms to create frequent
itemsets from a given set of itemsets.
The Apriori principle states that if an itemset is frequent, then all of its
subsets must also be frequent.
The Apriori principle holds due to the following property of the support
measure:

∀X, Y : (X ⊆ Y) ⇒ s(X) ≥ s(Y)

where X is a subset and Y is an itemset.
The support of an itemset never exceeds the support of its subsets.
This is known as the anti-monotone property of support.
1. Generating frequent itemsets

Apriori principle example:
Suppose {A, C} is frequent. Since each occurrence of {A, C} includes both A
and C, both {A} and {C} must also be frequent.
A similar argument holds for larger itemsets.
Almost all association rule algorithms are based on this subset property.

The Apriori principle is also known as the apriori property or the subset
property.

2. Generating association rules from frequent itemsets

Frequent itemsets != association rules
• Association rules are generated from frequent itemsets.
• One more step is required to find association rules.
There are two main approaches for generating association rules:
1. the brute-force approach
2. the two-step approach
1. Brute-force approach:

Steps:
1. Each itemset in the transaction database is a candidate frequent itemset
(that is, minsup count = 1).
2. List all possible association rules.
3. Compute the support and confidence for each rule.
4. Prune rules that fail the minsup and minconf thresholds.
A sketch of this approach follows below.
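As a minimal illustrative sketch (an addition to these notes), the
brute-force approach can be written as follows; it enumerates every candidate
itemset and every binary partition of it, which is exponential in the number
of items and motivates the Apriori pruning described later:

```python
from itertools import combinations

def brute_force_rules(transactions, minsup, minconf):
    # Enumerate all rules X -> Y; keep those with support >= minsup
    # and confidence >= minconf. Assumes minsup > 0, so sigma(X) > 0
    # whenever confidence is computed.
    items = sorted(set().union(*transactions))
    rules = []
    for k in range(2, len(items) + 1):
        for candidate in combinations(items, k):   # candidate itemset
            s = set(candidate)
            for r in range(1, len(s)):             # binary partitions X -> Y
                for X in combinations(sorted(s), r):
                    X = set(X)
                    Y = s - X
                    sup = support(X, Y, transactions)
                    if sup >= minsup:
                        conf = confidence(X, Y, transactions)
                        if conf >= minconf:
                            rules.append((X, Y, sup, conf))
    return rules
```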
Brute-force approach: Example

Q: Given the frequent itemset {A, B, E}, what association rules have
minsup = 1 and minconf = 50%?

TID  List of items
1    A, B, E
2    B, D
3    B, C
4    A, B, D
5    A, C
6    B, C
7    A, C
8    A, B, C, E
9    A, B, C

Solution:
A, B → E : conf = 2/4 = 50%
A, E → B : conf = 2/2 = 100%
B, E → A : conf = 2/2 = 100%
E → A, B : conf = 2/2 = 100%

NB: Rules originating from the same itemset have identical support counts but
can have different confidence.
Brute-force approach: Example

Rules that do not qualify (i.e., confidence < the 50% minconf):
A → B, E : conf = 2/6 ≈ 33% < 50%
B → A, E : conf = 2/7 ≈ 28% < 50%
∅ → A, B, E : conf = 2/9 ≈ 22% < 50%
2. Two-step approach:

Steps:
i. Frequent itemset generation: generate all itemsets whose support ≥ minsup.
ii. Rule generation: generate high-confidence rules from each frequent
itemset, where each rule is a binary partitioning of a frequent itemset.
2. Two-step approach Example

Apriori Algorithm
• Apriori is a Latin word that means "from what comes before".
• The algorithm is called Apriori because it uses knowledge from the previous
iteration phase to produce frequent itemsets.
• It attempts to find subsets which are common (frequent) in a given database
of itemsets (e.g., collections of items bought by customers).

2. Two-step approach Example

The Apriori algorithm is used to generate frequent itemsets and thereby
reduce the number of candidate itemsets. The algorithm makes several passes
over the transaction list.

Apriori Algorithm, Step 1:
• Let k = 1.
• Generate frequent itemsets of length 1.
• Repeat until no new frequent itemsets are identified:
  • Generate length-(k+1) candidate itemsets from the length-k frequent
    itemsets.
  • Prune candidate itemsets containing subsets of length k that are
    infrequent.
  • Count the support of each candidate by scanning the DB.
  • Eliminate candidates that are infrequent, leaving only those that are
    frequent.
A sketch of this step is shown below.
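A minimal sketch of Step 1 (an illustrative addition; the function and
variable names are my own, and the list-of-sets transactions representation
from earlier is assumed):

```python
from itertools import combinations

def apriori_frequent_itemsets(transactions, minsup_count):
    # Returns {frozenset(itemset): support_count} for all frequent itemsets.
    # k = 1: count individual items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= minsup_count}
    result = dict(frequent)
    k = 1
    while frequent:
        # Generate length-(k+1) candidates by joining length-k frequent
        # itemsets.
        prev = list(frequent)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        # Prune candidates that have an infrequent length-k subset
        # (the Apriori / anti-monotone principle).
        candidates = {c for c in candidates
                      if all(frozenset(sub) in frequent
                             for sub in combinations(c, k))}
        # Count support with one scan of the DB, then eliminate infrequent
        # candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= minsup_count}
        result.update(frequent)
        k += 1
    return result
```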
2. Two-step approach Example

Apriori Algorithm, Step 2:
For each frequent itemset X:
  For each proper non-empty subset A of X:
    • Let B = X − A.
    • A → B is an association rule if confidence(A → B) ≥ minConf,
      where support(A → B) = support(A ∪ B), and
      confidence(A → B) = support(A ∪ B) / support(A).
A sketch of this step is shown below.
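A matching sketch of Step 2 (an illustrative addition, reusing
apriori_frequent_itemsets from the Step 1 sketch):

```python
from itertools import combinations

def apriori_rules(transactions, minsup_count, minconf):
    # Generate all rules A -> B with confidence >= minconf
    # from the frequent itemsets.
    frequent = apriori_frequent_itemsets(transactions, minsup_count)
    rules = []
    for X, sup_X in frequent.items():
        if len(X) < 2:
            continue
        for r in range(1, len(X)):                 # proper non-empty subsets A
            for A in combinations(X, r):
                A = frozenset(A)
                B = X - A
                # conf(A -> B) = support(A u B) / support(A); A is a subset of
                # a frequent itemset, so its count is already in `frequent`.
                conf = sup_X / frequent[A]
                if conf >= minconf:
                    rules.append((set(A), set(B), sup_X, conf))
    return rules
```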
Apriori Algorithm: Example

1. Given the following transaction database, use the Apriori algorithm to
find the frequent 3-itemsets, where the minimum support count = 3.

TID  Items
1    Milk, Beer
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Bread, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

2. Generate strong association rules using the Apriori algorithm, given that
the minimum confidence = 1.

Important observation: the counts of subsets can't be smaller than the count
of an itemset!
Apriori Algorithm Example: 1. Generating frequent itemsets

Minimum support count = 3

Items (1-itemsets):
Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Pairs (2-itemsets) (no need to generate candidates involving Coke or Eggs):
Itemset          Count
{Bread, Milk}    3
{Bread, Beer}    2
{Bread, Diaper}  4
{Milk, Beer}     2
{Milk, Diaper}   3
{Beer, Diaper}   2

Triplets (3-itemsets): write all possible 3-itemsets and prune the list based
on the infrequent 2-itemsets.
Itemset                Count
{Bread, Milk, Diaper}  3

NB: Counts of subsets can't be smaller than the count of an itemset!
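For reference (an illustrative usage, not part of the original slides), the
Step 1 sketch reproduces these tables:

```python
transactions2 = [
    {"Milk", "Beer"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Bread", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
frequent = apriori_frequent_itemsets(transactions2, minsup_count=3)
# Among others: frequent[frozenset({"Bread", "Milk", "Diaper"})] == 3
```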
Apriori Algorithm Example: 2. Rule generation

• For each frequent k-itemset, where k > 1, generate high-confidence rules
with one item in the consequent.
• Using these rules, iteratively generate high-confidence rules with more
than one item in the consequent.
• If any rule has low confidence, then all rules whose consequents are
supersets of its consequent can be pruned (not generated).

For the frequent itemset {Bread, Milk, Diaper}:

1-item rules:
{Bread, Milk} → {Diaper} (s = 0.6, c = 1.0)
{Milk, Diaper} → {Bread} (s = 0.6, c = 1.0)
{Diaper, Bread} → {Milk} (s = 0.6, c = 0.75)

2-item rules:
{Bread} → {Milk, Diaper} (s = 0.6, c = 0.75)
{Diaper} → {Milk, Bread} (s = 0.6, c = 0.75)
{Milk} → {Diaper, Bread} (s = 0.6, c = 0.75)
Apriori Algorithm: Example 2
[Worked example: generating frequent itemsets]

Apriori Algorithm: generating association rules
[Worked example: generating association rules]

Exercise:
1. Use the Apriori algorithm to generate association rules from the following
frequent 3-itemset: {b, c, e}.
2. Use the Weka software to mine the data set using the Apriori algorithm and
compare your rules with those generated by Weka.
Interpreting association mining results generated by Weka

The output can be interpreted as follows:
• Lift = 1.1
• Number of cycles performed: 13
• Size of the set of large itemsets L(1): 22
• 22 one-item sets
• 36 two-item sets
• 3 three-item sets
Association rule mining parameters

When performing association rule mining using Weka, the following parameters
need to be set:
1. Required number of rules to output: the required number of rules
(default = 10).
2. Metric type: the metric by which to rank rules; the options include
confidence, lift, leverage, and conviction. The default metric is confidence.
3. Minimum metric score of a rule (C): the minimum confidence of a rule
(default = 0.9). Weka considers only rules with scores higher than this
value.
4. Delta (D) for minimum support: the rate by which the minimum support is
decreased in each iteration (default = 0.05).
Association rule mining parameters

5. Upper bound for minimum support: the highest allowed value of support
(default = 1.0).
6. Lower bound for minimum support: the lowest allowed value of support
(default = 0.1). The minimum support can be set in the lower-bound minimum
support field.

• Usually, Apriori in Weka starts with the upper-bound support and
incrementally decreases the support (by delta increments, which by default
are set to 0.05, i.e., 5%).
• The algorithm halts when either the specified number of rules has been
generated or the lower bound for minimum support is reached.
Association rule mining parameters

7. Itemsets (I): specifies whether the itemsets found are also output
(default = no).
8. Remove columns (R): remove columns that contain all missing values
(default = no).
9. Class index: the index of the class attribute. If it is set to −1, the
last attribute is taken as the class attribute.
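As a hedged illustration (the flag letters are inferred from the parameter
names above; check them against your Weka version's documentation), these
options can also be passed to Apriori on the Weka command line:

```
java weka.associations.Apriori -t weather.nominal.arff \
     -N 10 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -I
```

Here -N is the number of rules, -C the minimum metric score, -D the delta,
-U and -M the upper and lower bounds for minimum support, and -I requests
that the itemsets found are also output.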
Interpreting association mining results generated by Weka

• Upper-bound and lower-bound support: Weka starts from the upper-bound
(highest allowed) support and iteratively reduces the minimum support until
it finds the required number of rules with the given minimum confidence.
• The significance-testing option is applicable only in the case of
confidence and is by default not used (-1.0).
Classification vs Association Rules

Classification Rules:
1. Focus on one target field
2. Specify class in all cases
3. Measures: accuracy, coverage

Association Rules:
1. Many target fields
2. Applicable in some cases
3. Measures: support, confidence, lift, leverage, conviction
Lab exercise

• Use the Weka software to perform association rule mining on the weather
data.
• Identify the number of large itemsets.
• Interpret the rules generated.
• Determine the quality of each rule by checking the values of the
evaluation measures.
Online reading materials
• Association mining using Weka:
http://facweb.cs.depaul.edu/mobasher/classes/ect584/WEKA/associate.html
End

Thank you

Questions
