
Association Rule Mining

Mining Association Rules in Large Databases


- Association rule mining
- Algorithms: Apriori and FP-Growth
- Max and closed patterns
- Mining various kinds of association/correlation rules

Max-patterns & Closed Patterns


If there are frequent patterns with many items, enumerating all of them is costly. We may instead be interested in finding only the boundary frequent patterns. There are two types:

Max-patterns

A frequent pattern {a1, ..., a100} contains

  C(100,1) + C(100,2) + ... + C(100,100) = 2^100 - 1 ≈ 1.27 * 10^30

frequent sub-patterns!

Max-pattern: a frequent pattern that has no proper frequent super-pattern.

Example (Min_sup = 2):

  Tid | Items
  ----+-----------
  10  | A,B,C,D,E
  20  | B,C,D,E
  30  | A,C,D,F

BCDE and ACD are max-patterns; BCD is not a max-pattern, since BCDE is a frequent super-pattern.
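To make the definition concrete, here is a minimal brute-force sketch (deliberately naive, not MaxMiner) that enumerates all frequent itemsets of the three-transaction example above and keeps those with no frequent proper super-pattern; all names are illustrative:

```python
from itertools import combinations

# Toy database from the slide above; Min_sup = 2. Brute force, illustrative only.
transactions = [set("ABCDE"), set("BCDE"), set("ACDF")]
min_sup = 2

def support(itemset):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

items = sorted(set().union(*transactions))
frequent = [frozenset(c)
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if support(frozenset(c)) >= min_sup]

# Max-pattern: a frequent pattern with no proper frequent super-pattern.
max_patterns = [p for p in frequent if not any(p < q for q in frequent)]
print(sorted("".join(sorted(p)) for p in max_patterns))  # ['ACD', 'BCDE']
```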

Maximal Frequent Itemset


An itemset is maximal frequent if none of its immediate supersets is frequent
[Figure: itemset lattice showing the border between frequent and infrequent itemsets; the maximal frequent itemsets lie just inside the border.]

Closed Itemset

An itemset is closed if none of its immediate supersets has the same support as the itemset
  TID | Items
  ----+-----------
  1   | {A,B}
  2   | {B,C,D}
  3   | {A,B,C,D}
  4   | {A,B,D}
  5   | {A,B,C,D}

  Itemset | Support      Itemset   | Support
  --------+--------      ----------+--------
  {A}     | 4            {A,B,C}   | 2
  {B}     | 5            {A,B,D}   | 3
  {C}     | 3            {A,C,D}   | 2
  {D}     | 4            {B,C,D}   | 3
  {A,B}   | 4            {A,B,C,D} | 2
  {A,C}   | 2
  {A,D}   | 3
  {B,C}   | 3
  {B,D}   | 4
  {C,D}   | 3
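Applying the definition directly to the table above gives a simple (if inefficient) closure test; the helper names below are my own:

```python
# The five transactions from the table above; brute-force closure test.
transactions = [{"A", "B"}, {"B", "C", "D"}, {"A", "B", "C", "D"},
                {"A", "B", "D"}, {"A", "B", "C", "D"}]

def support(itemset):
    return sum(1 for t in transactions if itemset <= t)

def is_closed(itemset):
    """Closed: no immediate superset has the same support."""
    s = support(itemset)
    rest = {"A", "B", "C", "D"} - itemset
    return all(support(itemset | {y}) < s for y in rest)

print(is_closed({"C"}))                  # False: {C} and {C,D} both have support 3
print(is_closed({"B"}))                  # True: support 5, all supersets smaller
print(is_closed({"A", "B", "C", "D"}))   # True: no supersets at all
```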

Maximal vs Closed Itemsets


  TID | Items
  ----+-------
  1   | ABC
  2   | ABCD
  3   | BCE
  4   | ACDE
  5   | DE

[Figure: the full itemset lattice from null to ABCDE, each node annotated with the IDs of the transactions containing it (e.g., A: 124, B: 123, C: 1234, D: 245, E: 345); ABCDE is not supported by any transaction.]

Maximal vs Closed Frequent Itemsets


Minimum support = 2

[Figure: the same itemset lattice with the frequent itemsets highlighted; the closed itemsets are marked, distinguishing those that are closed but not maximal from those that are both closed and maximal.]

# Closed = 9, # Maximal = 4

Maximal vs Closed Itemsets

[Figure: maximal frequent itemsets are a subset of closed frequent itemsets, which are a subset of all frequent itemsets.]

MaxMiner: Mining Max-patterns


Idea: generate the complete set-enumeration tree one level at a time, pruning where applicable.

  (ABCD)
  ├── A (BCD)
  │   ├── AB (CD)
  │   │   ├── ABC (D)
  │   │   │   └── ABCD ()
  │   │   └── ABD ()
  │   ├── AC (D)
  │   │   └── ACD ()
  │   └── AD ()
  ├── B (CD)
  │   ├── BC (D)
  │   │   └── BCD ()
  │   └── BD ()
  ├── C (D)
  │   └── CD ()
  └── D ()

Local Pruning Techniques (e.g., at node A)

Check the frequency of ABCD and of AB, AC, AD:
- If ABCD is frequent, prune the whole sub-tree rooted at A.
- If AC is NOT frequent, remove C from the parenthesis (the tail) before expanding A.

(Set-enumeration tree as above.)

Algorithm MaxMiner
Initially, generate one node N = (ABCD), where h(N) = ∅ (the head) and t(N) = {A,B,C,D} (the tail). When considering whether to expand N:

- If h(N) ∪ t(N) is frequent, do not expand N.
- If for some i ∈ t(N), h(N) ∪ {i} is NOT frequent, remove i from t(N) before expanding N.
- Apply the global pruning technique below.

Global Pruning Technique (across sub-trees)

When a max-pattern is identified (e.g., ABCD), prune all nodes N (e.g., B, C and D) where h(N) ∪ t(N) is a subset of it (e.g., of ABCD).

(Set-enumeration tree as above.)

Example
At the root node (ABCDEF), count the frequency of the full head-plus-tail itemset and of each single item:

  Items  | Frequency
  -------+----------
  ABCDEF | 0
  A      | 2
  B      | 2
  C      | 3
  D      | 3
  E      | 2
  F      | 1

  Tid | Items
  ----+-----------
  10  | A,B,C,D,E
  20  | B,C,D,E
  30  | A,C,D,F

Min_sup = 2. ABCDEF is infrequent, so the node is expanded; F is infrequent and is removed from the tail, yielding children A (BCDE), B (CDE), C (DE), D (E), E (). Max patterns so far: (none)

Example (continued)
At node A, count ABCDE (= h ∪ t) and each single-item extension of the head:

  Items | Frequency
  ------+----------
  ABCDE | 1
  AB    | 1
  AC    | 2
  AD    | 2
  AE    | 1

Min_sup = 2. ABCDE is infrequent, so node A is expanded; AB and AE are infrequent, so B and E are removed from A's tail, yielding children AC (D) and AD (). Max patterns so far: (none)

Example (continued)
At node B, h(B) ∪ t(B) = BCDE is checked first:

  Items | Frequency
  ------+----------
  BCDE  | 2

Min_sup = 2. BCDE is frequent, so node B is not expanded (BC, BD and BE need not be examined). Max patterns: BCDE

Example (continued)
At node AC, h(AC) ∪ t(AC) = ACD is checked:

  Items | Frequency
  ------+----------
  ACD   | 2

Min_sup = 2. ACD is frequent, so node AC is not expanded. Max patterns: BCDE, ACD
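The whole run above can be reproduced with a compact sketch of MaxMiner's head/tail expansion. This is my own simplified rendering: it applies the two local pruning rules, skips the global cross-sub-tree pruning, and instead removes non-maximal candidates with a final subset filter:

```python
# Simplified MaxMiner sketch: head/tail expansion with local pruning only.
transactions = [set("ABCDE"), set("BCDE"), set("ACDF")]
min_sup = 2

def frequent(itemset):
    return sum(1 for t in transactions if itemset <= t) >= min_sup

def expand(head, tail, out):
    # Rule 1: if h(N) ∪ t(N) is frequent, do not expand N.
    if frequent(head | set(tail)):
        out.append(head | set(tail))
        return
    # Rule 2: remove every i from the tail with h(N) ∪ {i} infrequent.
    tail = [i for i in tail if frequent(head | {i})]
    for k, i in enumerate(tail):
        expand(head | {i}, tail[k + 1:], out)

candidates = []
expand(frozenset(), sorted(set().union(*transactions)), candidates)
# Without global pruning, some non-maximal candidates (e.g. CDE) survive;
# a final subset filter keeps only the true max-patterns.
maximal = [p for p in candidates if not any(p < q for q in candidates)]
print(sorted("".join(sorted(p)) for p in maximal))  # ['ACD', 'BCDE']
```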

Frequent Closed Patterns

For a frequent itemset X, if there exists no item y such that every transaction containing X also contains y, then X is a frequent closed pattern.

- "ab" is a frequent closed pattern in the table below.
- Closed patterns are a concise representation of frequent patterns: they reduce the number of patterns and rules.
- N. Pasquier et al., ICDT'99.

Min_sup = 2

  TID | Items
  ----+---------
  10  | a, b, c
  20  | a, b, c
  30  | a, b, d
  40  | a, b, d
  50  | e, f

Max Pattern vs. Frequent Closed Pattern

- Max pattern ⇒ closed pattern: if itemset X is a max pattern, adding any item to it would not yield a frequent pattern; thus there exists no item y such that every transaction containing X also contains y.
- Closed pattern ⇏ max pattern: in the table below (Min_sup = 2), "ab" is a closed pattern, but not max, since its super-patterns abc and abd are also frequent.

Min_sup = 2

  TID | Items
  ----+---------
  10  | a, b, c
  20  | a, b, c
  30  | a, b, d
  40  | a, b, d
  50  | e, f

Mining Frequent Closed Patterns: CLOSET

F-list: list of all frequent items in support-ascending order
- Here, F-list = d-a-f-e-c (Min_sup = 2)

  TID | Items
  ----+---------------
  10  | a, c, d, e, f
  20  | a, b, e
  30  | c, e, f
  40  | a, c, d, f
  50  | c, e, f

Divide the search space:
- Patterns having d
- Patterns having a but not d, etc.

Find frequent closed patterns recursively:
- Every transaction having d also contains c, f and a ⇒ cfad is a frequent closed pattern.

J. Pei, J. Han & R. Mao, "CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets", DMKD'00.
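A sketch of the first step of this divide-and-conquer, under the assumption that plain Python sets are an acceptable stand-in for CLOSET's conditional databases (the recursive closure computation is omitted):

```python
from collections import Counter

# Transactions from the slide; Min_sup = 2.
transactions = [set("acdef"), set("abe"), set("cef"), set("acdf"), set("cef")]
min_sup = 2

# F-list: frequent items in support-ascending order.
counts = Counter(i for t in transactions for i in t)
flist = sorted((i for i, c in counts.items() if c >= min_sup),
               key=lambda i: counts[i])
print(flist)  # d (2) and a (3) first; c, e, f tie at 4 in arbitrary order

# Divide the search space: the d-conditional database holds the
# transactions that contain d, with d itself removed.
cond_d = [t - {"d"} for t in transactions if "d" in t]

# Every item appearing in ALL of those transactions joins d to form
# a closed pattern: here c, f, a, giving cfad with support 2.
common = set.intersection(*cond_d) & set(flist)
print(sorted(common | {"d"}))  # ['a', 'c', 'd', 'f']
```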


Multiple-Level Association Rules

- Items often form a hierarchy.
- Items at the lower levels are expected to have lower support.
- Rules regarding itemsets at the appropriate levels can be quite useful.
- A transactional database can be encoded based on dimensions and levels.
- We can explore shared multi-level mining.

[Figure: concept hierarchy with food at the top; milk (skim, 2% fat, ...) and bread (wheat, white) below; brand names such as Garelick (milk) and Wonder (bread) at the bottom level.]

Mining Multi-Level Associations

A top-down, progressive-deepening approach:
- First find high-level strong rules: milk ⇒ bread [20%, 60%].
- Then find their lower-level "weaker" rules: 2% fat milk ⇒ wheat bread [6%, 50%].

Variations at mining multiple-level association rules:
- Level-crossed association rules: skim milk ⇒ Wonder wheat bread
- Association rules with multiple, alternative hierarchies: full fat milk ⇒ Wonder bread


Multi-level Association: Uniform Support vs. Reduced Support

Uniform Support: the same minimum support for all levels
- (+) Only one minimum support threshold is needed, and there is no need to examine itemsets containing any item whose ancestors do not have minimum support.
- (-) Lower-level items do not occur as frequently. If the support threshold is too high, we miss low-level associations; if it is too low, we generate too many high-level associations.

Multi-level Association: Uniform Support vs. Reduced Support

Reduced Support: reduced minimum support at lower levels.

There are 4 search strategies (a sketch of the single-item filtering strategy follows this list):
- Level-by-level independent: independent search at all levels (no misses).
- Level-cross filtering by k-itemset: prune a k-pattern if the corresponding k-pattern at the upper level is infrequent.
- Level-cross filtering by single item: prune an item if its parent node is infrequent.
- Controlled level-cross filtering by single item: also consider "subfrequent" items that pass a passage threshold.

Uniform Support
Multi-level mining with uniform support:
- Level 1 (min_sup = 5%): Milk [support = 10%] is frequent
- Level 2 (min_sup = 5%): full fat Milk [support = 6%] is frequent; Skim Milk [support = 4%] is pruned

Reduced Support
Multi-level mining with reduced support:
- Level 1 (min_sup = 5%): Milk [support = 10%] is frequent
- Level 2 (min_sup = 3%): full fat Milk [support = 6%] is frequent; Skim Milk [support = 4%] is frequent


Pattern Evaluation

Association rule algorithms tend to produce too many rules:
- many of them are uninteresting or redundant
- redundant, e.g., if {A,B,C} ⇒ {D} and {A,B} ⇒ {D} have the same support & confidence

Interestingness measures can be used to prune/rank the derived patterns. In the original formulation of association rules, support & confidence are the only measures used.

Computing Interestingness Measure

Given a rule X ⇒ Y, the information needed to compute rule interestingness can be obtained from a contingency table.

Contingency table for X ⇒ Y:

        |  Y  | ¬Y  |
  ------+-----+-----+-----
    X   | f11 | f10 | f1+
   ¬X   | f01 | f00 | f0+
  ------+-----+-----+-----
        | f+1 | f+0 | |T|

- f11: support count of X and Y
- f10: support count of X and ¬Y
- f01: support count of ¬X and Y
- f00: support count of ¬X and ¬Y

The table is used to define various measures: support, confidence, lift, Gini, J-measure, etc.


Drawback of Confidence

        | Coffee | ¬Coffee |
  ------+--------+---------+-----
   Tea  |   15   |    5    |  20
  ¬Tea  |   75   |    5    |  80
  ------+--------+---------+-----
        |   90   |   10    | 100

Association rule: Tea ⇒ Coffee

Confidence = P(Coffee | Tea) = 15/20 = 0.75, but P(Coffee) = 0.9. Although the confidence is high, the rule is misleading: P(Coffee | ¬Tea) = 75/80 = 0.9375 is even higher.

Statistical Independence

Population of 1000 students:
- 600 students know how to swim (S)
- 700 students know how to bike (B)
- 420 students know how to swim and bike (S,B)

P(S∧B) = 420/1000 = 0.42
P(S) × P(B) = 0.6 × 0.7 = 0.42

- P(S∧B) = P(S) × P(B): statistical independence
- P(S∧B) > P(S) × P(B): positively correlated
- P(S∧B) < P(S) × P(B): negatively correlated


Statistical-based Measures

Measures that take into account statistical dependence:

  Lift = P(Y|X) / P(Y)

  Interest = P(X,Y) / (P(X) P(Y))

  PS = P(X,Y) - P(X) P(Y)

  φ-coefficient = (P(X,Y) - P(X) P(Y)) / sqrt( P(X)[1 - P(X)] P(Y)[1 - P(Y)] )
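A small sketch that computes these measures from the four counts of a 2x2 contingency table; the function name and dictionary layout are my own, and the call reproduces the Tea ⇒ Coffee numbers used on the surrounding slides:

```python
from math import sqrt

def measures(f11, f10, f01, f00):
    """Interestingness measures for X => Y from a 2x2 contingency table."""
    n = f11 + f10 + f01 + f00
    p_xy, p_x, p_y = f11 / n, (f11 + f10) / n, (f11 + f01) / n
    return {
        "support":    p_xy,
        "confidence": p_xy / p_x,              # P(Y|X)
        "lift":       (p_xy / p_x) / p_y,      # P(Y|X) / P(Y)
        "interest":   p_xy / (p_x * p_y),      # equals lift numerically
        "PS":         p_xy - p_x * p_y,
        "phi":        (p_xy - p_x * p_y)
                      / sqrt(p_x * (1 - p_x) * p_y * (1 - p_y)),
    }

m = measures(15, 5, 75, 5)  # the Tea => Coffee table
print(round(m["confidence"], 4), round(m["lift"], 4))  # 0.75 0.8333
```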

Example: Lift/Interest

        | Coffee | ¬Coffee |
  ------+--------+---------+-----
   Tea  |   15   |    5    |  20
  ¬Tea  |   75   |    5    |  80
  ------+--------+---------+-----
        |   90   |   10    | 100

Association rule: Tea ⇒ Coffee

Confidence = P(Coffee | Tea) = 0.75, but P(Coffee) = 0.9.
Lift = 0.75 / 0.9 = 0.8333 (< 1, therefore Tea and Coffee are negatively associated).


Drawback of Lift & Interest

        |  Y  | ¬Y  |                  |  Y  | ¬Y  |
  ------+-----+-----+-----       ------+-----+-----+-----
    X   | 10  |  0  |  10          X   | 90  |  0  |  90
   ¬X   |  0  | 90  |  90         ¬X   |  0  | 10  |  10
  ------+-----+-----+-----       ------+-----+-----+-----
        | 10  | 90  | 100             | 90  | 10  | 100

  Lift = 0.1 / (0.1 × 0.1) = 10       Lift = 0.9 / (0.9 × 0.9) = 1.11

Although X and Y co-occur in 90% of the transactions in the second table, its lift is far lower than that of the first table, where they co-occur in only 10%: lift rewards rare patterns.

Statistical independence: if P(X,Y) = P(X) P(Y), then Lift = 1.
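Running the same lift computation on the two tables above makes the drawback visible; a standalone sketch, assuming the same counting convention as before:

```python
def lift(f11, f10, f01, f00):
    n = f11 + f10 + f01 + f00
    return (f11 / n) / (((f11 + f10) / n) * ((f11 + f01) / n))

print(round(lift(10, 0, 0, 90), 2))  # 10.0  (X and Y co-occur in 10% of data)
print(round(lift(90, 0, 0, 10), 2))  # 1.11  (X and Y co-occur in 90% of data)
```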

There are lots of measures proposed in the literature

Some measures are good for certain applications, but not for others

What criteria should we use to determine whether a measure is good or bad?

What about Apriori-style support-based pruning? How does it affect these measures?


Properties of A Good Measure

Piatetsky-Shapiro: 3 properties a good measure M must satisfy:

1. M(A,B) = 0 if A and B are statistically independent.
2. M(A,B) increases monotonically with P(A,B) when P(A) and P(B) remain unchanged.
3. M(A,B) decreases monotonically with P(A) [or P(B)] when P(A,B) and P(B) [or P(A)] remain unchanged.
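As a quick sanity check (my own derivation, not from the slides), the PS measure defined earlier satisfies all three properties:

```latex
% PS(A,B) = P(A,B) - P(A)P(B)
\begin{align*}
\text{(1)}\quad & P(A,B) = P(A)\,P(B) \;\Rightarrow\; PS = 0,\\
\text{(2)}\quad & \frac{\partial PS}{\partial P(A,B)} = 1 > 0
  \quad\text{(increases with } P(A,B)\text{)},\\
\text{(3)}\quad & \frac{\partial PS}{\partial P(A)} = -P(B) \le 0
  \quad\text{(decreases with } P(A)\text{; symmetric in } P(B)\text{)}.
\end{align*}
```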

Comparing Different Measures

10 examples of contingency tables:

  Example |  f11 |  f10 |  f01 |  f00
  --------+------+------+------+------
  E1      | 8123 |   83 |  424 | 1370
  E2      | 8330 |    2 |  622 | 1046
  E3      | 9481 |   94 |  127 |  298
  E4      | 3954 | 3080 |    5 | 2961
  E5      | 2886 | 1363 | 1320 | 4431
  E6      | 1500 | 2000 |  500 | 6000
  E7      | 4000 | 2000 | 1000 | 3000
  E8      | 4000 | 2000 | 2000 | 2000
  E9      | 1720 | 7121 |    5 | 1154
  E10     |   61 | 2483 |    4 | 7452

[Figure: rankings of the contingency tables under various measures.]

Property under Variable Permutation

        |  A  | ¬A               |  B  | ¬B
  ------+-----+-----       ------+-----+-----
    B   |  p  |  r            A  |  p  |  q
   ¬B   |  q  |  s           ¬A  |  r  |  s

Does M(A,B) = M(B,A)?

- Symmetric measures: support, lift, collective strength, cosine, Jaccard, etc.
- Asymmetric measures: confidence, conviction, Laplace, J-measure, etc.
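A quick numeric illustration on the Tea/Coffee table from earlier (standalone, same counting convention as before): confidence changes when the rule is reversed, while lift does not:

```python
f11, f10, f01, f00 = 15, 5, 75, 5  # X = Tea, Y = Coffee
n = f11 + f10 + f01 + f00

conf_xy = f11 / (f11 + f10)   # conf(Tea => Coffee)
conf_yx = f11 / (f11 + f01)   # conf(Coffee => Tea)
lift = (f11 / n) / ((f11 + f10) / n * (f11 + f01) / n)  # symmetric in X, Y

print(round(conf_xy, 3), round(conf_yx, 3), round(lift, 3))  # 0.75 0.167 0.833
```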
