
Data Mining

Association Analysis: Basic Concepts and Algorithms

Part I
Instructor: Wei Ding

Association Rule Mining

• Given a set of transactions, find rules that will predict the
  occurrence of an item based on the occurrences of other
  items in the transaction

Market-Basket transactions:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Association Rules:
  {Diaper} → {Beer},
  {Milk, Bread} → {Eggs, Coke},
  {Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!



Definition: Frequent Itemset
• Itemset
  – A collection of one or more items
    ◦ Example: {Milk, Bread, Diaper}
  – k-itemset
    ◦ An itemset that contains k items

• Support count (σ)
  – Frequency of occurrence of an itemset
  – E.g., σ({Milk, Bread, Diaper}) = 2

• Support (s)
  – Fraction of transactions that contain an itemset
  – E.g., s({Milk, Bread, Diaper}) = 2/5

• Frequent Itemset
  – An itemset whose support is greater than or equal to a
    minsup threshold (the minsup threshold is specified by the user)

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke


Definition: Association Rule


• Association Rule
  – An implication expression of the form X → Y,
    where X and Y are itemsets
  – Example: {Milk, Diaper} → {Beer}

• Rule Evaluation Metrics
  – Support (s)
    ◦ Fraction of transactions that contain both X and Y
  – Confidence (c)
    ◦ Measures how often items in Y appear in
      transactions that contain X

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example: {Milk, Diaper} → {Beer}

  s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4

  c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 ≈ 0.67
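To make the two metrics concrete, here is a minimal Python sketch (an illustration, not code from the slides) that computes support and confidence of a rule X → Y over a list of transactions; the five-transaction data set mirrors the table above.

def support_count(itemset, transactions):
    """sigma(itemset): number of transactions containing every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

def rule_metrics(X, Y, transactions):
    """Return (support, confidence) of the rule X -> Y."""
    both = support_count(X | Y, transactions)
    return both / len(transactions), both / support_count(X, transactions)

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
s, c = rule_metrics({"Milk", "Diaper"}, {"Beer"}, transactions)
print(s, c)  # 0.4 0.666...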
Association Rule Mining Task

• Given a set of transactions T, the goal of
  association rule mining is to find all rules having
  – support ≥ minsup threshold
  – confidence ≥ minconf threshold

• Brute-force approach:
  – List all possible association rules
  – Compute the support and confidence for each rule
  – Prune rules that fail the minsup and minconf
    thresholds
  ⇒ Computationally prohibitive!

Mining Association Rules

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Rules:
  {Milk, Diaper} → {Beer}   (s=0.4, c=0.67)
  {Milk, Beer} → {Diaper}   (s=0.4, c=1.0)
  {Diaper, Beer} → {Milk}   (s=0.4, c=0.67)
  {Beer} → {Milk, Diaper}   (s=0.4, c=0.67)
  {Diaper} → {Milk, Beer}   (s=0.4, c=0.5)
  {Milk} → {Diaper, Beer}   (s=0.4, c=0.5)

Observations:
• All the above rules are binary partitions of the same itemset:
  {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but
  can have different confidence
• Thus, we may decouple the support and confidence requirements
Mining Association Rules

• Two-step approach:
  1. Frequent Itemset Generation
     – Generate all itemsets whose support ≥ minsup

  2. Rule Generation
     – Generate high-confidence rules from each frequent itemset,
       where each rule is a binary partitioning of a frequent itemset
       (a sketch of this step follows below)

• Frequent itemset generation is still
  computationally expensive
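As an illustration of the rule-generation step, the following Python sketch (my own illustration, not the book's code) enumerates every binary partition of a frequent itemset and keeps the rules whose confidence clears minconf; support_counts is an assumed precomputed mapping from frozensets to support counts, as frequent itemset generation would produce.

from itertools import combinations

def generate_rules(freq_itemset, support_counts, minconf):
    """Split a frequent itemset into every non-empty antecedent X and
    consequent Y = itemset - X; keep rules with confidence >= minconf."""
    rules = []
    items = frozenset(freq_itemset)
    for r in range(1, len(items)):
        for antecedent in combinations(sorted(items), r):
            X = frozenset(antecedent)
            conf = support_counts[items] / support_counts[X]
            if conf >= minconf:
                rules.append((set(X), set(items - X), conf))
    return rules

# Support counts for {Milk, Diaper, Beer} and its subsets, read off the table:
support_counts = {
    frozenset({"Milk"}): 4, frozenset({"Diaper"}): 4, frozenset({"Beer"}): 3,
    frozenset({"Milk", "Diaper"}): 3, frozenset({"Milk", "Beer"}): 2,
    frozenset({"Diaper", "Beer"}): 3,
    frozenset({"Milk", "Diaper", "Beer"}): 2,
}
for X, Y, c in generate_rules({"Milk", "Diaper", "Beer"}, support_counts, 0.6):
    print(X, "->", Y, round(c, 2))  # reproduces the c=0.67 and c=1.0 rules above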


Frequent Itemset Generation


[Itemset lattice over d = 5 items, from the empty set at the top to the
full set at the bottom:]

null
A  B  C  D  E
AB  AC  AD  AE  BC  BD  BE  CD  CE  DE
ABC  ABD  ABE  ACD  ACE  ADE  BCD  BCE  BDE  CDE
ABCD  ABCE  ABDE  ACDE  BCDE
ABCDE

Given d items, there are 2^d possible candidate itemsets.
Frequent Itemset Generation

• Brute-force approach:
  – Each itemset in the lattice is a candidate frequent itemset
  – Count the support of each candidate by scanning the
    database

Transactions:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

  – Match each transaction against every candidate
  – Complexity ~ O(NMw), where N is the number of transactions,
    M the number of candidates, and w the maximum transaction width
    ⇒ expensive, since M = 2^d !!!
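A direct rendering of this brute-force scheme in Python (for illustration only; it exhibits exactly the exponential behavior the slide warns about):

from itertools import combinations

def brute_force_frequent(transactions, minsup_count):
    """Enumerate every non-empty candidate itemset over the item universe
    and count its support with a full scan of the database per candidate."""
    items = sorted(set().union(*transactions))
    frequent = {}
    for k in range(1, len(items) + 1):              # all 2^d - 1 candidates
        for candidate in combinations(items, k):
            c = frozenset(candidate)
            count = sum(1 for t in transactions if c <= t)
            if count >= minsup_count:
                frequent[c] = count
    return frequent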

Computational Complexity
• Given d unique items:
  – Total number of itemsets = 2^d
  – Total number of possible association rules:

    R = Σ_{k=1}^{d−1} [ C(d, k) × Σ_{j=1}^{d−k} C(d−k, j) ]
      = 3^d − 2^{d+1} + 1

    If d = 6, R = 602 rules
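The closed form can be checked numerically with a few lines of Python (a hypothetical sanity check, not part of the slides):

from math import comb

def rule_count(d):
    """Number of possible association rules over d items (double-sum formula)."""
    return sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
               for k in range(1, d))

assert rule_count(6) == 3**6 - 2**7 + 1 == 602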



Frequent Itemset Generation Strategies

• Reduce the number of candidates (M)
  – Complete search: M = 2^d
  – Use pruning techniques to reduce M

• Reduce the number of comparisons (NM)
  – Use efficient data structures to store the candidates or
    transactions
  – No need to match every candidate against every
    transaction


Reducing Number of Candidates

• Apriori principle:
  – If an itemset is frequent, then all of its subsets must also
    be frequent

• The Apriori principle holds due to the following property
  of the support measure:

    ∀X, Y : (X ⊆ Y) ⇒ s(X) ≥ s(Y)

  – Support of an itemset never exceeds the support of its
    subsets
  – This is known as the anti-monotone property of support
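For instance, in the market-basket table above, s({Milk}) = 4/5 ≥ s({Milk, Diaper}) = 3/5 ≥ s({Milk, Diaper, Beer}) = 2/5; once an itemset such as {Milk, Diaper} falls below minsup, no superset of it can be frequent, so the entire branch of the lattice can be pruned.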
Illustrating Apriori Principle

[Lattice figure: once an itemset is found to be infrequent, all of its
supersets are pruned from the search space.]

Illustrating Apriori Principle

Items (1-itemsets):

Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Minimum Support = 3

Pairs (2-itemsets); no need to generate candidates involving Coke or Eggs:

Itemset          Count
{Bread, Milk}    3
{Bread, Beer}    2
{Bread, Diaper}  3
{Milk, Beer}     2
{Milk, Diaper}   3
{Beer, Diaper}   3

Triplets (3-itemsets):

Itemset                Count
{Bread, Milk, Diaper}  3

If every subset is considered: C(6,1) + C(6,2) + C(6,3) = 41 candidates.
With support-based pruning: 6 + 6 + 1 = 13.



Apriori Algorithm

• Method:
  – Let k = 1
  – Generate frequent itemsets of length 1
  – Repeat until no new frequent itemsets are identified:
    ◦ Generate length-(k+1) candidate itemsets from length-k
      frequent itemsets
    ◦ Prune candidate itemsets containing subsets of length k that
      are infrequent
    ◦ Count the support of each candidate by scanning the DB
    ◦ Eliminate candidates that are infrequent, leaving only those
      that are frequent
    ◦ Increment k

(A runnable sketch of this loop follows.)
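A compact Python sketch of the Apriori loop, assuming transactions are sets of items; candidates are generated by joining length-k frequent itemsets that share their first k−1 items (itemsets kept as sorted tuples). This is an illustrative implementation, not the book's reference code.

from itertools import combinations

def apriori(transactions, minsup_count):
    """Sketch of the Apriori loop: generate, prune, count, eliminate."""
    # Frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    frequent = {c: n for c, n in counts.items() if n >= minsup_count}
    all_frequent = dict(frequent)
    k = 1
    while frequent:
        # Generate length-(k+1) candidates from length-k frequent itemsets.
        prev = sorted(frequent)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                if prev[i][:-1] == prev[j][:-1]:      # share first k-1 items
                    candidates.add(tuple(sorted(set(prev[i]) | set(prev[j]))))
        # Prune candidates that contain an infrequent length-k subset.
        candidates = {c for c in candidates
                      if all(s in frequent for s in combinations(c, k))}
        # Count support of the surviving candidates with one DB scan.
        counts = {c: sum(1 for t in transactions if set(c) <= t)
                  for c in candidates}
        # Eliminate infrequent candidates; increment k.
        frequent = {c: n for c, n in counts.items() if n >= minsup_count}
        all_frequent.update(frequent)
        k += 1
    return all_frequent

# With the market-basket table and minsup count 3, this counts exactly the
# 6 + 6 + 1 = 13 itemsets from the previous slide, ending in {Bread, Milk, Diaper}.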


Factors Affecting Complexity

• Choice of minimum support threshold
  – lowering the support threshold results in more frequent itemsets
  – this may increase the number of candidates and the max length of
    frequent itemsets

• Dimensionality (number of items) of the data set
  – more space is needed to store the support count of each item
  – if the number of frequent items also increases, both computation
    and I/O costs may also increase

• Size of database
  – since Apriori makes multiple passes, the run time of the algorithm
    may increase with the number of transactions

• Average transaction width
  – transaction width increases with denser data sets
  – this may increase the max length of frequent itemsets



Maximal Frequent Itemset
An itemset is maximal frequent if none of its immediate supersets
is frequent.

[Lattice figure: a border separates the frequent itemsets from the
infrequent ones; the maximal frequent itemsets lie just inside that
border.]


Closed Itemset

• An itemset is closed if none of its immediate supersets
  has the same support as the itemset

TID  Items
1    {A, B}
2    {B, C, D}
3    {A, B, C, D}
4    {A, B, D}
5    {A, B, C, D}

Itemset   Support      Itemset       Support
{A}       4            {A, B, C}     2
{B}       5            {A, B, D}     3
{C}       3            {A, C, D}     2
{D}       4            {B, C, D}     3
{A, B}    4            {A, B, C, D}  2
{A, C}    2
{A, D}    3
{B, C}    3
{B, D}    4
{C, D}    3

Assume the minimum support threshold is 3.
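Given support counts like those above, closed and maximal frequent itemsets can be identified by a direct check of immediate supersets. A small illustrative sketch (the function name and input format are my own assumptions):

def closed_and_maximal(supports, items, minsup):
    """supports: dict mapping frozenset -> support count, for all itemsets.
    Closed: no immediate superset has the same support.
    Maximal: frequent, and no immediate superset is frequent."""
    closed, maximal = [], []
    for itemset, s in supports.items():
        supersets = [itemset | {i} for i in items if i not in itemset]
        if all(supports.get(sup, 0) < s for sup in supersets):
            closed.append(itemset)
        if s >= minsup and all(supports.get(sup, 0) < minsup for sup in supersets):
            maximal.append(itemset)
    return closed, maximal

Note that every maximal frequent itemset is necessarily closed: if an immediate superset had the same support, that superset would also be frequent.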



Maximal vs Closed Itemsets
TID  Items
1    ABC
2    ABCD
3    BCE
4    ACDE
5    DE

[Lattice figure: each itemset is annotated with the IDs of the
transactions that contain it, e.g. A → 124, B → 123, C → 1234,
D → 245, E → 345; ABCDE at the bottom is not supported by any
transaction.]


Maximal vs Closed Frequent Itemsets


Minimum support = 2

[Lattice figure, with the same transaction-ID annotations as the
previous slide: frequent itemsets whose immediate supersets all have
strictly smaller support are closed (some closed but not maximal);
closed itemsets with no frequent immediate superset are both closed
and maximal. Here # Closed = 9 and # Maximal = 4.]



Maximal vs Closed Itemsets

[Figure: nested regions showing that the maximal frequent itemsets are
a subset of the closed frequent itemsets, which are in turn a subset of
all frequent itemsets.]
