
Data Mining

Association Analysis: Basic Concepts and Algorithms

Part I
Instructor: Wei Ding

Association Rule Mining

• Given a set of transactions, find rules that will predict the
  occurrence of an item based on the occurrences of other
  items in the transaction

Market-Basket transactions:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Association Rules:
  {Diaper} → {Beer},
  {Milk, Bread} → {Eggs, Coke},
  {Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!



Definition: Frequent Itemset
• Itemset
  – A collection of one or more items
    ◦ Example: {Milk, Bread, Diaper}
  – k-itemset
    ◦ An itemset that contains k items

• Support count (σ)
  – Frequency of occurrence of an itemset
  – E.g., σ({Milk, Bread, Diaper}) = 2

• Support (s)
  – Fraction of transactions that contain an itemset
  – E.g., s({Milk, Bread, Diaper}) = 2/5

• Frequent Itemset
  – An itemset whose support is greater than or equal to a
    minsup threshold (the minsup threshold is specified by the user)

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke


Definition: Association Rule


• Association Rule
  – An implication expression of the form X → Y,
    where X and Y are itemsets
  – Example: {Milk, Diaper} → {Beer}

• Rule Evaluation Metrics
  – Support (s)
    ◦ Fraction of transactions that contain both X and Y
  – Confidence (c)
    ◦ Measures how often items in Y appear in
      transactions that contain X

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example: {Milk, Diaper} → {Beer}

  s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4

  c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 ≈ 0.67
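To make the two metrics concrete, here is a minimal Python sketch (an illustration, not code from the slides) that computes support and confidence of a rule X → Y over a list of transactions; the five-transaction data set mirrors the table above.

def support_count(itemset, transactions):
    """sigma(itemset): number of transactions containing every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

def rule_metrics(X, Y, transactions):
    """Return (support, confidence) of the rule X -> Y."""
    both = support_count(X | Y, transactions)
    return both / len(transactions), both / support_count(X, transactions)

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
s, c = rule_metrics({"Milk", "Diaper"}, {"Beer"}, transactions)
print(s, c)  # 0.4 0.666...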
Association Rule Mining Task

• Given a set of transactions T, the goal of
  association rule mining is to find all rules having
  – support ≥ minsup threshold
  – confidence ≥ minconf threshold

• Brute-force approach:
  – List all possible association rules
  – Compute the support and confidence for each rule
  – Prune rules that fail the minsup and minconf
    thresholds
  ⇒ Computationally prohibitive!

Mining Association Rules

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Rules:
  {Milk, Diaper} → {Beer}   (s=0.4, c=0.67)
  {Milk, Beer} → {Diaper}   (s=0.4, c=1.0)
  {Diaper, Beer} → {Milk}   (s=0.4, c=0.67)
  {Beer} → {Milk, Diaper}   (s=0.4, c=0.67)
  {Diaper} → {Milk, Beer}   (s=0.4, c=0.5)
  {Milk} → {Diaper, Beer}   (s=0.4, c=0.5)

Observations:
• All the above rules are binary partitions of the same itemset:
  {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but
  can have different confidence
• Thus, we may decouple the support and confidence requirements
Mining Association Rules

• Two-step approach:
  1. Frequent Itemset Generation
     – Generate all itemsets whose support ≥ minsup

  2. Rule Generation
     – Generate high-confidence rules from each frequent itemset,
       where each rule is a binary partitioning of a frequent itemset
       (a sketch of this step follows below)

• Frequent itemset generation is still
  computationally expensive
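As an illustration of the rule-generation step, the following Python sketch (my own illustration, not the book's code) enumerates every binary partition of a frequent itemset and keeps the rules whose confidence clears minconf; support_counts is an assumed precomputed mapping from frozensets to support counts, as frequent itemset generation would produce.

from itertools import combinations

def generate_rules(freq_itemset, support_counts, minconf):
    """Split a frequent itemset into every non-empty antecedent X and
    consequent Y = itemset - X; keep rules with confidence >= minconf."""
    rules = []
    items = frozenset(freq_itemset)
    for r in range(1, len(items)):
        for antecedent in combinations(sorted(items), r):
            X = frozenset(antecedent)
            conf = support_counts[items] / support_counts[X]
            if conf >= minconf:
                rules.append((set(X), set(items - X), conf))
    return rules

# Support counts for {Milk, Diaper, Beer} and its subsets, read off the table:
support_counts = {
    frozenset({"Milk"}): 4, frozenset({"Diaper"}): 4, frozenset({"Beer"}): 3,
    frozenset({"Milk", "Diaper"}): 3, frozenset({"Milk", "Beer"}): 2,
    frozenset({"Diaper", "Beer"}): 3,
    frozenset({"Milk", "Diaper", "Beer"}): 2,
}
for X, Y, c in generate_rules({"Milk", "Diaper", "Beer"}, support_counts, 0.6):
    print(X, "->", Y, round(c, 2))  # reproduces the c=0.67 and c=1.0 rules above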


Frequent Itemset Generation


[Itemset lattice over d = 5 items, from the empty set at the top to the
full set at the bottom:]

null
A  B  C  D  E
AB  AC  AD  AE  BC  BD  BE  CD  CE  DE
ABC  ABD  ABE  ACD  ACE  ADE  BCD  BCE  BDE  CDE
ABCD  ABCE  ABDE  ACDE  BCDE
ABCDE

Given d items, there are 2^d possible candidate itemsets.
Frequent Itemset Generation

• Brute-force approach:
  – Each itemset in the lattice is a candidate frequent itemset
  – Count the support of each candidate by scanning the
    database

Transactions:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

  – Match each transaction against every candidate
  – Complexity ~ O(NMw), where N is the number of transactions,
    M the number of candidates, and w the maximum transaction width
    ⇒ expensive, since M = 2^d !!!
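A direct rendering of this brute-force scheme in Python (for illustration only; it exhibits exactly the exponential behavior the slide warns about):

from itertools import combinations

def brute_force_frequent(transactions, minsup_count):
    """Enumerate every non-empty candidate itemset over the item universe
    and count its support with a full scan of the database per candidate."""
    items = sorted(set().union(*transactions))
    frequent = {}
    for k in range(1, len(items) + 1):              # all 2^d - 1 candidates
        for candidate in combinations(items, k):
            c = frozenset(candidate)
            count = sum(1 for t in transactions if c <= t)
            if count >= minsup_count:
                frequent[c] = count
    return frequent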

Computational Complexity
• Given d unique items:
  – Total number of itemsets = 2^d
  – Total number of possible association rules:

    R = Σ_{k=1}^{d−1} [ C(d, k) × Σ_{j=1}^{d−k} C(d−k, j) ]
      = 3^d − 2^{d+1} + 1

    If d = 6, R = 602 rules
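The closed form can be checked numerically with a few lines of Python (a hypothetical sanity check, not part of the slides):

from math import comb

def rule_count(d):
    """Number of possible association rules over d items (double-sum formula)."""
    return sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
               for k in range(1, d))

assert rule_count(6) == 3**6 - 2**7 + 1 == 602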



Frequent Itemset Generation Strategies

• Reduce the number of candidates (M)
  – Complete search: M = 2^d
  – Use pruning techniques to reduce M

• Reduce the number of comparisons (NM)
  – Use efficient data structures to store the candidates or
    transactions
  – No need to match every candidate against every
    transaction


Reducing Number of Candidates

• Apriori principle:
  – If an itemset is frequent, then all of its subsets must also
    be frequent

• The Apriori principle holds due to the following property
  of the support measure:

    ∀X, Y : (X ⊆ Y) ⇒ s(X) ≥ s(Y)

  – Support of an itemset never exceeds the support of its
    subsets
  – This is known as the anti-monotone property of support
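For instance, in the market-basket table above, s({Milk}) = 4/5 ≥ s({Milk, Diaper}) = 3/5 ≥ s({Milk, Diaper, Beer}) = 2/5; once an itemset such as {Milk, Diaper} falls below minsup, no superset of it can be frequent, so the entire branch of the lattice can be pruned.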
Illustrating Apriori Principle

[Lattice figure: once an itemset is found to be infrequent, all of its
supersets are pruned from the search space.]

Illustrating Apriori Principle

Items (1-itemsets):

Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Minimum Support = 3

Pairs (2-itemsets); no need to generate candidates involving Coke or Eggs:

Itemset          Count
{Bread, Milk}    3
{Bread, Beer}    2
{Bread, Diaper}  3
{Milk, Beer}     2
{Milk, Diaper}   3
{Beer, Diaper}   3

Triplets (3-itemsets):

Itemset                Count
{Bread, Milk, Diaper}  3

If every subset is considered: C(6,1) + C(6,2) + C(6,3) = 41 candidates.
With support-based pruning: 6 + 6 + 1 = 13.



Apriori Algorithm

• Method:
  – Let k = 1
  – Generate frequent itemsets of length 1
  – Repeat until no new frequent itemsets are identified:
    ◦ Generate length-(k+1) candidate itemsets from length-k
      frequent itemsets
    ◦ Prune candidate itemsets containing subsets of length k that
      are infrequent
    ◦ Count the support of each candidate by scanning the DB
    ◦ Eliminate candidates that are infrequent, leaving only those
      that are frequent
    ◦ Increment k

(A runnable sketch of this loop follows.)
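A compact Python sketch of the Apriori loop, assuming transactions are sets of items; candidates are generated by joining length-k frequent itemsets that share their first k−1 items (itemsets kept as sorted tuples). This is an illustrative implementation, not the book's reference code.

from itertools import combinations

def apriori(transactions, minsup_count):
    """Sketch of the Apriori loop: generate, prune, count, eliminate."""
    # Frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    frequent = {c: n for c, n in counts.items() if n >= minsup_count}
    all_frequent = dict(frequent)
    k = 1
    while frequent:
        # Generate length-(k+1) candidates from length-k frequent itemsets.
        prev = sorted(frequent)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                if prev[i][:-1] == prev[j][:-1]:      # share first k-1 items
                    candidates.add(tuple(sorted(set(prev[i]) | set(prev[j]))))
        # Prune candidates that contain an infrequent length-k subset.
        candidates = {c for c in candidates
                      if all(s in frequent for s in combinations(c, k))}
        # Count support of the surviving candidates with one DB scan.
        counts = {c: sum(1 for t in transactions if set(c) <= t)
                  for c in candidates}
        # Eliminate infrequent candidates; increment k.
        frequent = {c: n for c, n in counts.items() if n >= minsup_count}
        all_frequent.update(frequent)
        k += 1
    return all_frequent

# With the market-basket table and minsup count 3, this counts exactly the
# 6 + 6 + 1 = 13 itemsets from the previous slide, ending in {Bread, Milk, Diaper}.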


Factors Affecting Complexity

• Choice of minimum support threshold
  – lowering the support threshold results in more frequent itemsets
  – this may increase the number of candidates and the max length of
    frequent itemsets

• Dimensionality (number of items) of the data set
  – more space is needed to store the support count of each item
  – if the number of frequent items also increases, both computation
    and I/O costs may also increase

• Size of database
  – since Apriori makes multiple passes, the run time of the algorithm
    may increase with the number of transactions

• Average transaction width
  – transaction width increases with denser data sets
  – this may increase the max length of frequent itemsets



Maximal Frequent Itemset
An itemset is maximal frequent if none of its immediate supersets
is frequent.

[Lattice figure: a border separates the frequent itemsets from the
infrequent ones; the maximal frequent itemsets lie just inside that
border.]


Closed Itemset

• An itemset is closed if none of its immediate supersets
  has the same support as the itemset

TID  Items
1    {A, B}
2    {B, C, D}
3    {A, B, C, D}
4    {A, B, D}
5    {A, B, C, D}

Itemset   Support      Itemset       Support
{A}       4            {A, B, C}     2
{B}       5            {A, B, D}     3
{C}       3            {A, C, D}     2
{D}       4            {B, C, D}     3
{A, B}    4            {A, B, C, D}  2
{A, C}    2
{A, D}    3
{B, C}    3
{B, D}    4
{C, D}    3

Assume the minimum support threshold is 3.
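Given support counts like those above, closed and maximal frequent itemsets can be identified by a direct check of immediate supersets. A small illustrative sketch (the function name and input format are my own assumptions):

def closed_and_maximal(supports, items, minsup):
    """supports: dict mapping frozenset -> support count, for all itemsets.
    Closed: no immediate superset has the same support.
    Maximal: frequent, and no immediate superset is frequent."""
    closed, maximal = [], []
    for itemset, s in supports.items():
        supersets = [itemset | {i} for i in items if i not in itemset]
        if all(supports.get(sup, 0) < s for sup in supersets):
            closed.append(itemset)
        if s >= minsup and all(supports.get(sup, 0) < minsup for sup in supersets):
            maximal.append(itemset)
    return closed, maximal

Note that every maximal frequent itemset is necessarily closed: if an immediate superset had the same support, that superset would also be frequent.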



Maximal vs Closed Itemsets
TID  Items
1    ABC
2    ABCD
3    BCE
4    ACDE
5    DE

[Lattice figure: each itemset is annotated with the IDs of the
transactions that contain it, e.g. A → 124, B → 123, C → 1234,
D → 245, E → 345; ABCDE at the bottom is not supported by any
transaction.]


Maximal vs Closed Frequent Itemsets


Minimum support = 2

[Lattice figure, with the same transaction-ID annotations as the
previous slide: frequent itemsets whose immediate supersets all have
strictly smaller support are closed (some closed but not maximal);
closed itemsets with no frequent immediate superset are both closed
and maximal. Here # Closed = 9 and # Maximal = 4.]



Maximal vs Closed Itemsets

[Figure: nested regions showing that the maximal frequent itemsets are
a subset of the closed frequent itemsets, which are in turn a subset of
all frequent itemsets.]
