Association Rule Mining
(Lecture slides from Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004)
Definition: given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.
Market-basket transactions:

TID | Items
1   | Bread, Milk
2   | Bread, Diaper, Beer, Eggs
3   | Milk, Diaper, Beer, Coke
4   | Bread, Milk, Diaper, Beer
5   | Bread, Milk, Diaper, Coke
Itemset: a collection of one or more items, e.g., {Milk, Bread, Diaper}. A k-itemset is an itemset that contains k items.
Support (s): the fraction of transactions that contain an itemset, e.g., s({Milk, Bread, Diaper}) = 2/5.
Frequent itemset: an itemset whose support is greater than or equal to a minsup threshold.
Association rule: an implication expression of the form X -> Y, where X and Y are itemsets. Example: {Milk, Diaper} -> {Beer}.

Rule evaluation metrics:
- Support (s): the fraction of transactions that contain both X and Y.
- Confidence (c): measures how often items in Y appear in transactions that contain X.

Example, for {Milk, Diaper} -> {Beer} over the transactions above:
s = sigma({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4
c = sigma({Milk, Diaper, Beer}) / sigma({Milk, Diaper}) = 2/3 = 0.67
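As a minimal illustration (not part of the original slides; the function names are ours), both metrics can be computed directly from a transaction list:

    # Support and confidence over the market-basket data above.
    transactions = [
        {"Bread", "Milk"},
        {"Bread", "Diaper", "Beer", "Eggs"},
        {"Milk", "Diaper", "Beer", "Coke"},
        {"Bread", "Milk", "Diaper", "Beer"},
        {"Bread", "Milk", "Diaper", "Coke"},
    ]

    def support(itemset):
        """Fraction of transactions containing every item in `itemset`."""
        return sum(set(itemset) <= t for t in transactions) / len(transactions)

    def confidence(lhs, rhs):
        """c(lhs -> rhs) = s(lhs union rhs) / s(lhs)."""
        return support(set(lhs) | set(rhs)) / support(lhs)

    print(support({"Milk", "Diaper", "Beer"}))       # 0.4
    print(confidence({"Milk", "Diaper"}, {"Beer"}))  # 0.666...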
Association rule mining task: given a set of transactions T, the goal of association rule mining is to find all rules having
- support >= minsup threshold, and
- confidence >= minconf threshold.

Brute-force approach: list all possible association rules, compute the support and confidence for each rule, and prune rules that fail the minsup and minconf thresholds. This is computationally prohibitive!
Example of rules (from the transactions above):

{Milk, Diaper} -> {Beer}   (s=0.4, c=0.67)
{Milk, Beer} -> {Diaper}   (s=0.4, c=1.0)
{Diaper, Beer} -> {Milk}   (s=0.4, c=0.67)
{Beer} -> {Milk, Diaper}   (s=0.4, c=0.67)
{Diaper} -> {Milk, Beer}   (s=0.4, c=0.5)
{Milk} -> {Diaper, Beer}   (s=0.4, c=0.5)

Observations:
- All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}.
- Rules originating from the same itemset have identical support but can have different confidence.
- Thus, we may decouple the support and confidence requirements.
Mining association rules: two-step approach.
1. Frequent itemset generation: generate all itemsets whose support >= minsup.
2. Rule generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset.
Frequent itemset generation is the computationally expensive step.
[Figure: itemset lattice over five items A-E, from the null set down to ABCDE. Given d items, there are 2^d possible candidate itemsets.]
Brute-force approach: each itemset in the lattice is a candidate frequent itemset. Count the support of each candidate by scanning the database, matching each transaction against every candidate. With N transactions, M candidates, and maximum transaction width w, the complexity is roughly O(NMw) -- expensive, since M = 2^d!
Computational complexity: given d unique items, the total number of itemsets is 2^d, and the total number of possible association rules is

    R = \sum_{k=1}^{d-1} \binom{d}{k} \sum_{j=1}^{d-k} \binom{d-k}{j} = 3^d - 2^{d+1} + 1

For d = 6, R = 3^6 - 2^7 + 1 = 602 rules.
Frequent itemset generation strategies:
- Reduce the number of candidates (M): use pruning techniques such as the Apriori principle.
- Reduce the number of transactions (N): reduce the size of N as the size of the itemset increases; used by DHP and vertical-based mining algorithms.
- Reduce the number of comparisons (NM): use efficient data structures to store the candidates or transactions, so there is no need to match every candidate against every transaction.
Apriori principle: if an itemset is frequent, then all of its subsets must also be frequent. Formally,

    \forall X, Y : (X \subseteq Y) \Rightarrow s(X) \ge s(Y)

i.e., the support of an itemset never exceeds the support of its subsets. This is known as the anti-monotone property of support.
[Figure: itemset lattice in which {A,B} is found to be infrequent; all of its supersets (ABC, ABD, ..., up to ABCDE) are pruned without counting their support.]
Illustrating the Apriori principle (minimum support count = 3):

Items (1-itemsets):
Item   | Count
Bread  | 4
Coke   | 2
Milk   | 4
Beer   | 3
Diaper | 4
Eggs   | 1

Pairs (2-itemsets), generated only from the frequent items Bread, Milk, Beer, Diaper:
Itemset         | Count
{Bread, Milk}   | 3
{Bread, Beer}   | 2
{Bread, Diaper} | 3
{Milk, Beer}    | 2
{Milk, Diaper}  | 3
{Beer, Diaper}  | 3

Triplets (3-itemsets):
Itemset               | Count
{Bread, Milk, Diaper} | 2

If every subset were considered, C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41 candidates; with support-based pruning, only 6 + 6 + 1 = 13.
Apriori algorithm. Method:
- Let k = 1. Generate frequent itemsets of length 1.
- Repeat until no new frequent itemsets are identified:
  - generate length-(k+1) candidate itemsets from length-k frequent itemsets;
  - prune candidate itemsets containing subsets of length k that are infrequent;
  - count the support of each candidate by scanning the DB;
  - eliminate candidates that are infrequent, leaving only those that are frequent.
A sketch of this loop appears below.
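A compact Python sketch of the loop, under the assumption that transactions are sets and minsup is an absolute count (illustrative, not the textbook's reference implementation):

    from itertools import combinations

    def apriori(transactions, minsup):
        """Level-wise generate-prune-count loop described above."""
        transactions = [set(t) for t in transactions]
        def count(c):
            return sum(c <= t for t in transactions)
        items = {i for t in transactions for i in t}
        frequent = {frozenset([i]) for i in items if count(frozenset([i])) >= minsup}
        result, k = set(frequent), 1
        while frequent:
            # Generate length-(k+1) candidates from length-k frequent itemsets
            candidates = {a | b for a in frequent for b in frequent if len(a | b) == k + 1}
            # Prune candidates that contain an infrequent length-k subset
            candidates = {c for c in candidates
                          if all(frozenset(s) in frequent for s in combinations(c, k))}
            # Count support by scanning the DB; keep only the frequent candidates
            frequent = {c for c in candidates if count(c) >= minsup}
            result |= frequent
            k += 1
        return result

For example, apriori(transactions, 2) on the earlier market-basket data returns every itemset with support count >= 2, including {Milk, Diaper, Beer}.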
Support counting: scan the database of transactions to determine the support of each candidate itemset. To reduce the number of comparisons, store the candidates in a hash structure: instead of matching each transaction against every candidate, match it only against the candidates contained in the hashed buckets.
[Figure: hash tree of candidate 3-itemsets. A hash function h(p) = p mod 3 routes items down branches {1,4,7}, {2,5,8}, {3,6,9}; leaf nodes hold candidates such as {3,6,7} and {3,6,8}.]
Subset operation: given a transaction t, what are the possible subsets of size 3? For t = {1, 2, 3, 5, 6}, enumerate level by level: fix the first item, then the second, then the third. [Figure: subset enumeration tree for t, whose leaves are the C(5,3) = 10 subsets of 3 items.] A sketch of the enumeration follows.
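In Python this enumeration is exactly itertools.combinations (a sketch of the subset step only, not of the hash tree itself):

    from itertools import combinations

    t = [1, 2, 3, 5, 6]   # transaction items, kept in sorted order
    for s in combinations(t, 3):
        print(s)
    # (1,2,3), (1,2,5), (1,2,6), (1,3,5), (1,3,6), (1,5,6),
    # (2,3,5), (2,3,6), (2,5,6), (3,5,6)  -- C(5,3) = 10 subsets

Each generated subset is then hashed down the tree to locate the candidate leaves it could match, instead of comparing the transaction against every candidate.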
[Figure: subset operation using the hash tree. The transaction's items are hashed with h(p) = p mod 3 (branches {1,4,7}, {2,5,8}, {3,6,9}); prefixes such as 2+{3,5,6} and 3+{5,6} are routed to leaves holding candidates such as {2,3,4}, {5,6,7}, {3,5,6}, {3,5,7}, {6,8,9}, {3,6,7}, {3,6,8}, {1,2,5}, {4,5,8}, {1,5,9}. Only candidates in the visited leaves are compared against the transaction.]
Factors affecting complexity -- size of database: since Apriori makes multiple passes, the run time of the algorithm may increase with the number of transactions.
Compact representation of frequent itemsets: some itemsets are redundant because they have identical support to their supersets; this motivates compact representations such as maximal and closed frequent itemsets.
Maximal frequent itemset: an itemset is maximal frequent if none of its immediate supersets is frequent. [Figure: itemset lattice showing the border between frequent and infrequent itemsets; the maximal frequent itemsets lie just inside the border.]
Closed itemset: an itemset is closed if none of its immediate supersets has the same support as the itemset.

Itemset | Support
{A}     | 4
{B}     | 5
{C}     | 3
{D}     | 4
{A,B}   | 4
{A,C}   | 2
{A,D}   | 3
{B,C}   | 3
{B,D}   | 4
{C,D}   | 3
[Figure: itemset lattice annotated with the IDs of the transactions containing each itemset; an itemset is closed exactly when no immediate superset is contained in the same set of transactions.]
Maximal vs closed frequent itemsets. [Figure: the same annotated lattice with frequent, closed, and maximal itemsets marked: # Closed = 9, # Maximal = 4. Maximal frequent itemsets are a subset of closed frequent itemsets, which are a subset of all frequent itemsets.]
Alternative methods for frequent itemset generation -- traversal of the itemset lattice: general-to-specific vs specific-to-general. [Figure: three search directions relative to the frequent itemset border, between the null set and {a1, a2, ..., an}: (a) general-to-specific, (b) specific-to-general, (c) bidirectional.]
Traversal of the itemset lattice -- equivalence classes. [Figure: the lattice over {A,B,C,D} partitioned into equivalence classes, by prefix label versus by suffix label.]
Traversal of the itemset lattice -- breadth-first vs depth-first. [Figure: breadth-first (level-wise, as in Apriori) versus depth-first traversal of the lattice.]
Representation of the database: horizontal data layout (each row is a transaction with its list of items) versus vertical data layout (each item with its list of transaction IDs).
FP-growth algorithm: use a compressed representation of the database (an FP-tree). Once the FP-tree has been constructed, use a recursive divide-and-conquer approach to mine the frequent itemsets.
FP-tree construction. Transaction database:

TID | Items
1   | {A,B}
2   | {B,C,D}
3   | {A,C,D,E}
4   | {A,D,E}
5   | {A,B,C}
6   | {A,B,C,D}
7   | {B,C}
8   | {A,B,C}
9   | {A,B,D}
10  | {B,C,E}

[Figure: after reading TID=1, the tree is null -> A:1 -> B:1; after reading TID=2, a second branch null -> B:1 -> C:1 -> D:1 is added.]
FP-tree construction (continued). [Figure: the completed FP-tree for all ten transactions, rooted at null with children A:7 and B:3; node-link pointers connect occurrences of the same item and are used to assist frequent itemset generation.]
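A minimal sketch of the insertion step, assuming items within each transaction are already sorted in a fixed global order (as on the slide); the header table and node-links are omitted for brevity:

    class FPNode:
        def __init__(self, item, parent):
            self.item, self.parent = item, parent
            self.count = 0
            self.children = {}   # item -> FPNode

    def insert(root, transaction):
        """Walk/extend the path for one transaction, incrementing counts."""
        node = root
        for item in transaction:
            child = node.children.get(item)
            if child is None:
                child = node.children[item] = FPNode(item, node)
            child.count += 1
            node = child

    root = FPNode(None, None)
    for t in [["A","B"], ["B","C","D"], ["A","C","D","E"], ["A","D","E"],
              ["A","B","C"], ["A","B","C","D"], ["B","C"], ["A","B","C"],
              ["A","B","D"], ["B","C","E"]]:
        insert(root, t)
    # root.children["A"].count == 7 and root.children["B"].count == 3,
    # matching the A:7 and B:3 branches on the slide.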
FP-growth: to find itemsets ending in D, collect the conditional pattern base for D by following D's node-links upward:
P = {(A:1,B:1,C:1), (A:1,B:1), (A:1,C:1), (A:1), (B:1,C:1)}.
Recursively apply FP-growth on P. Frequent itemsets found (with support > 1): AD, BD, CD, ACD, BCD.
Tree projection -- set enumeration tree. [Figure: set enumeration tree over {A,B,C,D,E}; e.g., the possible extensions of node A are E(A) = {B,C,D,E}, giving children AB, AC, AD, AE, and so on down to ABCDE.]
Tree projection: items are listed in lexicographic order, and each node P stores the following information:
- the itemset for node P;
- the list of possible lexicographic extensions of P: E(P);
- a pointer to the projected database of its ancestor node;
- a bitvector recording which transactions in the projected database contain the itemset.
Projected database. For node A, each transaction keeps only the items that follow A lexicographically, and only if the transaction contains A:

Original database:          Projected database for node A:
TID | Items                 TID | Items
1   | {A,B}                 1   | {B}
2   | {B,C,D}               2   | {}
3   | {A,C,D,E}             3   | {C,D,E}
4   | {A,D,E}               4   | {D,E}
5   | {A,B,C}               5   | {B,C}
6   | {A,B,C,D}             6   | {B,C,D}
7   | {B,C}                 7   | {}
8   | {A,B,C}               8   | {B,C}
9   | {A,B,D}               9   | {B,D}
10  | {B,C,E}               10  | {}
ECLAT: for each item, store a list of transaction IDs (a TID-list); this is the vertical data layout.
ECLAT: determine the support of any k-itemset by intersecting the TID-lists of two of its (k-1)-subsets. For example:

A: 1, 4, 5, 6, 7, 8, 9
B: 1, 2, 5, 7, 8, 10
A intersect B -> AB: 1, 5, 7, 8

Three traversal approaches: top-down, bottom-up, and hybrid.
Advantage: very fast support counting.
Disadvantage: intermediate TID-lists may become too large for memory.
A sketch of the intersection step follows.
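In code the intersection step is a single set operation (a sketch using Python sets as TID-lists):

    tidlists = {
        "A": {1, 4, 5, 6, 7, 8, 9},
        "B": {1, 2, 5, 7, 8, 10},
    }
    tids_AB = tidlists["A"] & tidlists["B"]   # {1, 5, 7, 8}
    support_AB = len(tids_AB)                 # support count of {A, B} = 4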
Rule generation: given a frequent itemset L, find all non-empty subsets f of L such that f -> L - f satisfies the minimum confidence requirement. If {A,B,C,D} is a frequent itemset, the candidate rules are:

ABC -> D, ABD -> C, ACD -> B, BCD -> A, AB -> CD, AC -> BD, AD -> BC, BC -> AD, BD -> AC, CD -> AB, A -> BCD, B -> ACD, C -> ABD, D -> ABC

If |L| = k, then there are 2^k - 2 candidate association rules (ignoring L -> empty set and empty set -> L).
Rule generation: how do we efficiently generate rules from frequent itemsets? In general, confidence does not have an anti-monotone property: c(ABC -> D) can be larger or smaller than c(AB -> D). But the confidence of rules generated from the same itemset does have an anti-monotone property. E.g., for L = {A,B,C,D}:

c(ABC -> D) >= c(AB -> CD) >= c(A -> BCD)

Confidence is anti-monotone with respect to the number of items on the right-hand side of the rule. [Figure: lattice of the 14 rules generated from {A,B,C,D}; once a rule is found to have low confidence, all rules below it (those with larger consequents derived from it) are pruned.]
Rule generation for the Apriori algorithm: a candidate rule is generated by merging two rules that share the same prefix in the rule consequent. E.g., join(CD -> AB, BD -> AC) produces the candidate rule D -> ABC. Prune rule D -> ABC if its subset AD -> BC does not have high confidence. A brute-force sketch of rule generation appears below.
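A brute-force sketch (ours) that enumerates all 2^k - 2 candidate rules from one frequent itemset; a faithful Apriori implementation would instead grow consequents level-wise with the merge/prune step just described. It assumes L is frequent in `transactions` (a list of sets):

    from itertools import combinations

    def rules_from_itemset(L, transactions, minconf):
        """Emit every rule f -> L - f from frequent itemset L meeting minconf."""
        L = frozenset(L)
        def sup(s):
            return sum(s <= t for t in transactions)
        rules = []
        for r in range(1, len(L)):
            for f in map(frozenset, combinations(L, r)):
                conf = sup(L) / sup(f)   # c(f -> L-f) = sigma(L) / sigma(f)
                if conf >= minconf:
                    rules.append((set(f), set(L - f), conf))
        return rules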
Effect of support distribution: how should the minsup threshold be set?
- If minsup is set too high, we could miss itemsets involving interesting rare items (e.g., expensive products).
- If minsup is set too low, mining becomes computationally expensive and the number of itemsets is very large.
A single minimum support threshold may therefore not be effective.
Multiple minimum support: let MS(i) be the minimum support for item i, e.g., MS(Milk) = 5%, MS(Coke) = 3%, MS(Broccoli) = 0.1%, MS(Salmon) = 0.5%. Then MS({Milk, Broccoli}) = min(MS(Milk), MS(Broccoli)) = 0.1%.

Challenge: support is no longer anti-monotone. Suppose Support(Milk, Coke) = 1.5% and Support(Milk, Coke, Broccoli) = 0.5%: then {Milk, Coke} is infrequent (1.5% < 3%) while {Milk, Coke, Broccoli} is frequent (0.5% >= 0.1%). A small sketch follows.
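The itemset threshold is just a minimum over item thresholds (a one-line sketch using the example values above):

    MS = {"Milk": 0.05, "Coke": 0.03, "Broccoli": 0.001, "Salmon": 0.005}

    def ms(itemset):
        """Minimum support threshold for an itemset under multiple min supports."""
        return min(MS[i] for i in itemset)

    ms({"Milk", "Broccoli"})   # 0.001, i.e. 0.1%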
[Figures: itemset lattice under multiple minimum supports, with item supports A = 0.10%, B = 0.20%, C = 0.30%, D = 0.50%, E = 3% and thresholds MS(A) = 0.25%, MS(B) = 0.26%, MS(C) = 0.29%, MS(D) = 0.05%, MS(E) = 4.20%, marking which itemsets survive.]
Multiple minimum support requires modifications to Apriori:
- L1: the set of frequent items;
- F1: the set of items whose support is >= MS(1), where MS(1) = min_i(MS(i));
- C2: candidate itemsets of size 2, generated from F1 instead of L1.
Modifications to Apriori (continued): in traditional Apriori, a candidate (k+1)-itemset is generated by merging two frequent itemsets of size k, and the candidate is pruned if it contains any infrequent subset of size k. With multiple minimum supports, the pruning step is modified: prune only if the infrequent subset contains the first item (the item with the lowest minimum support). E.g., for candidate = {Broccoli, Coke, Milk}, ordered according to minimum support: {Broccoli, Coke} and {Broccoli, Milk} are frequent but {Coke, Milk} is infrequent; the candidate is not pruned, because {Coke, Milk} does not contain the first item, Broccoli.
Pattern evaluation: association rule algorithms tend to produce too many rules, many of which are uninteresting or redundant (redundant if, e.g., {A,B,C} -> {D} and {A,B} -> {D} have the same support and confidence). Interestingness measures can be used to prune or rank the derived patterns. In the original formulation of association rules, support and confidence are the only measures used.
Computing interestingness measures:
Given a rule X -> Y, the information needed to compute rule interestingness can be obtained from a 2x2 contingency table:

          Y     not-Y
X        f11    f10   | f1+
not-X    f01    f00   | f0+
         f+1    f+0   | |T|

f11: support count of X and Y
f10: support count of X and not-Y
f01: support count of not-X and Y
f00: support count of not-X and not-Y

These counts are used to define various measures: support, confidence, lift, Gini, J-measure, etc.
Drawback of confidence. Consider:

          Coffee  not-Coffee
Tea         15        5        20
not-Tea     75        5        80
            90       10       100

Confidence(Tea -> Coffee) = 15/20 = 0.75 looks high, yet the rule is misleading: P(Coffee) = 0.9, and P(Coffee | not-Tea) = 75/80 = 0.9375 is even higher.
Statistical independence. Population of 1000 students:
- 600 students know how to swim (S);
- 700 students know how to bike (B);
- 420 students know how to swim and bike (S, B).
P(S and B) = 420/1000 = 0.42, and P(S) x P(B) = 0.6 x 0.7 = 0.42.
- P(S and B) = P(S) x P(B): statistical independence
- P(S and B) > P(S) x P(B): positively correlated
- P(S and B) < P(S) x P(B): negatively correlated
Statistical-based measures, i.e., measures that take into account statistical dependence:

    \mathrm{Lift} = \frac{P(Y \mid X)}{P(Y)}

    \mathrm{Interest} = \frac{P(X, Y)}{P(X)\,P(Y)}

    \mathrm{PS} = P(X, Y) - P(X)\,P(Y)

    \phi = \frac{P(X, Y) - P(X)\,P(Y)}{\sqrt{P(X)\,[1 - P(X)]\,P(Y)\,[1 - P(Y)]}}
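These formulas translate directly from the contingency-table counts (a sketch; note that Lift and Interest coincide when computed from the same table, since P(Y|X)/P(Y) = P(X,Y)/(P(X)P(Y))):

    from math import sqrt

    def stat_measures(f11, f10, f01, f00):
        """Lift/Interest, PS and phi-coefficient from a 2x2 contingency table."""
        n = f11 + f10 + f01 + f00
        px, py, pxy = (f11 + f10) / n, (f11 + f01) / n, f11 / n
        lift = pxy / (px * py)          # = P(Y|X) / P(Y)
        ps = pxy - px * py
        phi = ps / sqrt(px * (1 - px) * py * (1 - py))
        return lift, ps, phi

    stat_measures(15, 5, 75, 5)   # Tea/Coffee table: lift = 0.15/0.18 = 0.833...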
Example of lift/interest, using the Tea/Coffee table above: for the rule Tea -> Coffee, confidence = P(Coffee | Tea) = 0.75, but P(Coffee) = 0.9, so Lift = 0.75/0.9 = 0.8333 (< 1, therefore Tea and Coffee are negatively associated).
Drawback of lift and interest. Compare two contingency tables:

          Y   not-Y                    Y   not-Y
X        10     0     10      X       90     0     90
not-X     0    90     90      not-X    0    10     10
         10    90    100              90    10    100

Lift = 0.1/(0.1 x 0.1) = 10            Lift = 0.9/(0.9 x 0.9) = 1.11

The rarer pattern gets the much higher lift even though X and Y are perfectly associated in both tables (statistical independence would give Lift = 1).
There are many measures proposed in the literature. Some measures are good for certain applications, but not for others. What criteria should we use to determine whether a measure is good or bad? And what about Apriori-style support-based pruning -- how does it affect these measures?
Ten example contingency tables:

Example | f11  | f10  | f01  | f00
E1      | 8123 |   83 |  424 | 1370
E2      | 8330 |    2 |  622 | 1046
E3      | 9481 |   94 |  127 |  298
E4      | 3954 | 3080 |    5 | 2961
E5      | 2886 | 1363 | 1320 | 4431
E6      | 1500 | 2000 |  500 | 6000
E7      | 4000 | 2000 | 1000 | 3000
E8      | 4000 | 2000 | 2000 | 2000
E9      | 1720 | 7121 |    5 | 1154
E10     |   61 | 2483 |    4 | 7452

Rankings of these tables can differ widely from one interestingness measure to another.
Property under variable permutation:

          B   not-B                    A   not-A
A         p     q            B         p     r
not-A     r     s            not-B     q     s

Does M(A,B) = M(B,A)?
Symmetric measures: support, lift, collective strength, cosine, Jaccard, etc.
Asymmetric measures: confidence, conviction, Laplace, J-measure, etc.
Property under row/column scaling (grade-gender example, Mosteller 1968): scaling the columns of the table (e.g., multiplying one gender's column by 2x and the other's by 10x) should not change the measured association. Mosteller: the underlying association should be independent of the relative number of male and female students in the samples.
Property under inversion. [Figure: bit vectors for items across N transactions, panels (a)-(c); inverting a vector (swapping its 0s and 1s) changes the value of many measures even though the association structure is mirrored.]
Example: phi-coefficient, which is analogous to the correlation coefficient for continuous variables. Compare:

          Y   not-Y                    Y   not-Y
X        60    10     70      X       20    10     30
not-X    10    20     30      not-X   10    60     70
         70    30    100              30    70    100

    \phi = \frac{0.6 - 0.7 \times 0.7}{\sqrt{0.7 \times 0.3 \times 0.7 \times 0.3}} = 0.5238
    \qquad
    \phi = \frac{0.2 - 0.3 \times 0.3}{\sqrt{0.7 \times 0.3 \times 0.7 \times 0.3}} = 0.5238

The phi-coefficient is the same for both tables, even though co-occurrence of X and Y dominates the first and co-absence dominates the second.
Property under null addition:

          B   not-B                    B   not-B
A         p     q            A         p     q
not-A     r     s            not-A     r    s + k

Adding k unrelated records (extra co-absences) arguably should not change a measure.
Invariant measures: support, cosine, Jaccard, etc.
Non-invariant measures: correlation, Gini, mutual information, odds ratio, etc.
[Table: 21 interestingness measures -- phi, lambda, alpha (odds ratio), Q (Yule's Q), Y (Yule's Y), kappa, M (mutual information), J, G (Gini), s (support), c (confidence), L (Laplace), V (conviction), I (interest), IS, PS, F (certainty factor), AV, S (collective strength), zeta (Jaccard), K (Klosgen) -- with their ranges and whether each satisfies properties P1-P3 and O1-O3'.]
Support-based pruning: most association rule mining algorithms use the support measure to prune rules and itemsets. To study the effect of support pruning on the correlation of itemsets:
- generate 10,000 random contingency tables;
- compute support and pairwise correlation for each table;
- apply support-based pruning and examine the tables that are removed.
A sketch of this experiment appears below.
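A sketch of the simulation (the random-table construction is ours; any scheme producing four positive cells summing to N works):

    import random
    from math import sqrt

    def random_table(n=1000):
        """One random 2x2 contingency table with cell counts summing to n."""
        a, b, c = sorted(random.sample(range(1, n), 3))
        return a, b - a, c - b, n - c          # f11, f10, f01, f00

    def phi(f11, f10, f01, f00):
        n = f11 + f10 + f01 + f00
        px, py, pxy = (f11 + f10) / n, (f11 + f01) / n, f11 / n
        return (pxy - px * py) / sqrt(px * (1 - px) * py * (1 - py))

    tables = [random_table() for _ in range(10000)]
    minsup = 0.05
    removed = [t for t in tables if t[0] / sum(t) < minsup]
    # Compare the phi distributions of removed vs retained tables to see
    # what support-based pruning throws away.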
[Figure: histograms of the correlation values of the 10,000 contingency tables at increasing support thresholds; support-based pruning tends to eliminate mostly negatively correlated itemsets.]
Effect of support-based pruning on other measures. Steps:
- generate 10,000 contingency tables;
- rank each table according to the different measures;
- compute the pairwise correlation between the measures.
[Figure: pairwise rank correlations among the 21 measures (kappa, Klosgen, Yule's Q and Y, confidence, Laplace, IS, support, Jaccard, lambda, Gini, J-measure, mutual information, ...) with no support pruning (all pairs). Red cells indicate correlation between a pair of measures > 0.85; 40.14% of the pairs have correlation > 0.85.]
[Figure: the same pairwise rank correlations after support-based pruning (0.5% <= support <= 50%): 61.45% of the pairs have correlation > 0.85.]
[Figure: with tighter pruning (0.5% <= support <= 30%): 76.42% of the pairs have correlation > 0.85. As support-based pruning tightens, the measures increasingly agree.]
Subjective interestingness measures.
Objective measure: rank patterns based on statistics computed from data (e.g., the 21 measures of association: support, confidence, Laplace, Gini, mutual information, Jaccard, etc.).
Subjective measure: rank patterns according to the user's interpretation:
- a pattern is subjectively interesting if it contradicts the expectation of a user;
- a pattern is subjectively interesting if it is actionable.
Interestingness via unexpectedness: model the expectation of users (prior knowledge), marking patterns expected to be frequent (+) or infrequent (-) against patterns actually found to be frequent or infrequent; a pattern is unexpected when expectation and evidence disagree. We need to combine the expectation of users with evidence from the data (i.e., the extracted patterns).
Interestingness via unexpectedness -- web data: domain knowledge comes in the form of the site structure. Given an itemset F = {X1, X2, ..., Xk}, where the Xi are web pages:
- L: the number of links connecting the pages;
- lfactor = L / (k x (k - 1));
- cfactor = 1 if the link graph is connected, 0 if it is disconnected.
Use Dempster-Shafer theory to combine the domain knowledge with the evidence from the data.