
Frequent Pattern (FP) Growth
(FP-Tree)
Challenges of Frequent Pattern Mining

 Challenges
    Multiple scans of the transaction database
    Huge number of candidates
    Tedious workload of support counting for candidates
 Improving Apriori: general ideas
    Reduce passes of transaction database scans
    Shrink the number of candidates
    Facilitate support counting of candidates
Bottleneck of Frequent-pattern Mining

 Multiple database scans are costly
 Mining long patterns needs many passes of scanning and generates lots of candidates
    To find the frequent itemset i1i2…i100, # of scans: 100
 Bottleneck: candidate generation and test
 Can we avoid candidate generation?
Methods to Improve Apriori’s Efficiency

 Transaction reduction
    A transaction that does not contain any frequent k-itemset is useless in subsequent scans
 Partitioning
    Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB

Methods to Improve Apriori’s Efficiency

 Sampling
    Mine on a subset of the given data with a lowered support threshold, plus a method to determine completeness

Mining Frequent Patterns Without Candidate Generation

 Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure
    highly condensed, but complete for frequent pattern mining
    avoids costly database scans
 Develop an efficient, FP-tree-based frequent pattern mining method
    A divide-and-conquer methodology: decompose mining tasks into smaller ones
    Avoid candidate generation: sub-database test only

Mining Frequent Patterns Without Candidate Generation

 Grow long patterns from short ones using local frequent items
    "abc" is a frequent pattern
    Get all transactions having "abc": DB|abc
    If "d" is a local frequent item in DB|abc, then abcd is a frequent pattern
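
As a small illustration of the projected database DB|abc, here is a minimal Python sketch; the toy transaction list db is a hypothetical example, not from the slides:

    from collections import Counter

    # Hypothetical toy database: each transaction is a set of items
    db = [{"a", "b", "c", "d"}, {"a", "b", "c"}, {"a", "c", "e"}, {"b", "c", "d"}]

    # DB|abc: the transactions that contain all of a, b, c
    db_abc = [t for t in db if {"a", "b", "c"} <= t]

    # Count the remaining items to find local frequent items such as d
    local_counts = Counter(item for t in db_abc for item in t - {"a", "b", "c"})
    print(db_abc, local_counts)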
Steps

 1) Build a frequency table of all items
 2) Order the items in each transaction in descending frequency order, keeping only items whose support >= minimum support (see the sketch after this list)
 3) Draw the FP-tree
 4) Find the frequent patterns from the FP-tree
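
A minimal Python sketch of steps 1 and 2, using the transactions of the example that follows and an absolute minimum support count of 3. Note that ties between equally frequent items may be broken in any fixed order, so the computed f-list can order ties differently than the slides' f-c-a-b-m-p:

    from collections import Counter

    min_support = 3
    transactions = [
        ["f", "a", "c", "d", "g", "i", "m", "p"],
        ["a", "b", "c", "f", "l", "m", "o"],
        ["b", "f", "h", "j", "o", "w"],
        ["b", "c", "k", "s", "p"],
        ["a", "f", "c", "e", "l", "p", "m", "n"],
    ]

    # Step 1: frequency table of each item
    freq = Counter(item for t in transactions for item in t)

    # Step 2: f-list = frequent items in descending frequency order
    flist = [i for i, n in freq.most_common() if n >= min_support]
    rank = {item: pos for pos, item in enumerate(flist)}

    # Reorder each transaction along the f-list, dropping infrequent items
    ordered = [sorted((i for i in t if i in rank), key=rank.get) for t in transactions]
    print(ordered)   # the first transaction becomes ['f', 'c', 'a', 'm', 'p']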
Example: FP-growth

TID   Items bought
100   {f, a, c, d, g, i, m, p}
200   {a, b, c, f, l, m, o}
300   {b, f, h, j, o, w}
400   {b, c, k, s, p}
500   {a, f, c, e, l, p, m, n}

Minimum support = 3
Construct FP-tree from a Transaction Database

TID   Items bought                (ordered) frequent items
100   {f, a, c, d, g, i, m, p}    {f, c, a, m, p}
200   {a, b, c, f, l, m, o}       {f, c, a, b, m}
300   {b, f, h, j, o, w}          {f, b}
400   {b, c, k, s, p}             {c, b, p}
500   {a, f, c, e, l, p, m, n}    {f, c, a, m, p}

min_support = 3

1. Scan DB once, find frequent 1-itemsets (single-item patterns)
2. Sort the frequent items in frequency-descending order to get the f-list
3. Scan DB again, construct the FP-tree

Header table (item : frequency, each with a head pointer into the tree): f : 4, c : 4, a : 3, b : 3, m : 3, p : 3
F-list = f-c-a-b-m-p

The resulting FP-tree (root {}):

{}
├─ f:4
│  ├─ c:3
│  │  └─ a:3
│  │     ├─ m:2
│  │     │  └─ p:2
│  │     └─ b:1
│  │        └─ m:1
│  └─ b:1
└─ c:1
   └─ b:1
      └─ p:1
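
Continuing the sketch above, a minimal FP-tree builder; the FPNode class and build_fptree function are illustrative, not from the slides:

    class FPNode:
        def __init__(self, item, parent):
            self.item, self.parent = item, parent
            self.count = 0
            self.children = {}              # item -> FPNode

    def build_fptree(ordered_transactions):
        root = FPNode(None, None)
        header = {}                         # item -> list of nodes (the node-links)
        for t in ordered_transactions:      # items already in f-list order
            node = root
            for item in t:
                child = node.children.get(item)
                if child is None:
                    child = FPNode(item, node)
                    node.children[item] = child
                    header.setdefault(item, []).append(child)
                node = child
                node.count += 1             # shared prefixes accumulate counts
        return root, header

    tree, header = build_fptree(ordered)    # 'ordered' from the previous sketch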
Benefits of the FP-tree Structure

 Reduces irrelevant information: infrequent items are gone
 Items are stored in frequency-descending order: the more frequently an item occurs, the more likely its nodes are shared
Partition Patterns and Databases

 Frequent patterns can be partitioned into subsets according to the f-list
    F-list = f-c-a-b-m-p
    Patterns containing p
    Patterns having m but no p
    …
    Patterns having c but none of a, b, m, p
    Pattern f
Find Patterns Having p From p's Conditional Database

 Start at the frequent-item header table in the FP-tree
 Traverse the FP-tree by following the node-links of each frequent item p
 Accumulate all the transformed prefix paths of item p to form p's conditional pattern base

Conditional pattern bases, read off the FP-tree above via the header table:

item   conditional pattern base
c      f:3
a      fc:3
b      fca:1, f:1, c:1
m      fca:2, fcab:1
p      fcam:2, cb:1
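
Under the same assumptions as the earlier sketches, the conditional pattern base of an item can be collected by walking each node-link back to the root:

    def conditional_pattern_base(item, header):
        """Collect (prefix path, count) pairs for every occurrence of item."""
        base = []
        for node in header[item]:                  # follow the node-links
            path, parent = [], node.parent
            while parent is not None and parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                base.append((path[::-1], node.count))
        return base

    print(conditional_pattern_base("m", header))
    # [(['f', 'c', 'a'], 2), (['f', 'c', 'a', 'b'], 1)]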


From Conditional Pattern Bases to Conditional FP-trees

 For each pattern base
    Accumulate the count for each item in the base
    Construct the FP-tree for the frequent items of the pattern base

m-conditional pattern base: fca:2, fcab:1
m-conditional FP-tree: {} → f:3 → c:3 → a:3
All frequent patterns relating to m: m, fm, cm, am, fcm, fam, cam, fcam
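
Worked step, assuming min_support = 3: accumulating counts over the m-conditional pattern base gives f:3, c:3, a:3, and b:1, so b is pruned and the remaining items form the single-path m-conditional FP-tree {} → f:3 → c:3 → a:3 shown above.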
Recursion: Mining Each Conditional FP-tree

Starting from the m-conditional FP-tree ({} → f:3 → c:3 → a:3):

 Cond. pattern base of "am": (fc:3) → am-conditional FP-tree: {} → f:3 → c:3
 Cond. pattern base of "cm": (f:3) → cm-conditional FP-tree: {} → f:3
 Cond. pattern base of "cam": (f:3) → cam-conditional FP-tree: {} → f:3
Mining Frequent Patterns With FP-trees

 Idea: frequent pattern growth
    Recursively grow frequent patterns by pattern and database partition
 Method (a recursive sketch follows this list)
    For each frequent item, construct its conditional pattern base, and then its conditional FP-tree
    Repeat the process on each newly created conditional FP-tree
    Until the resulting FP-tree is empty, or it contains only one path; a single path generates all the combinations of its sub-paths, each of which is a frequent pattern
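
Putting the pieces together, a minimal recursive FP-growth sketch in Python; it reuses the hypothetical build_fptree and conditional_pattern_base helpers from the earlier sketches and favors clarity over efficiency:

    def fp_growth(tree, header, min_support, suffix=()):
        """Recursively mine an FP-tree, yielding (pattern, support) pairs."""
        for item, nodes in header.items():
            support = sum(n.count for n in nodes)
            if support < min_support:
                continue
            pattern = suffix + (item,)
            yield pattern, support
            # Build the conditional pattern base, then recurse on its FP-tree
            base = conditional_pattern_base(item, header)
            cond_transactions = []
            for path, count in base:
                cond_transactions.extend([path] * count)   # expand counts for simplicity
            if cond_transactions:
                cond_tree, cond_header = build_fptree(cond_transactions)
                yield from fp_growth(cond_tree, cond_header, min_support, pattern)

    for pattern, support in fp_growth(tree, header, min_support=3):
        print(sorted(pattern), support)    # e.g. ['a', 'c', 'f', 'm'] 3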
Why Is FP-Growth the Winner?

 Divide-and-conquer:
    decompose both the mining task and the DB according to the frequent patterns obtained so far
    leads to a focused search of smaller databases
 Other factors
    no candidate generation, no candidate test
    compressed database: the FP-tree structure
    no repeated scan of the entire database
    basic operations are counting local frequent items and building sub-FP-trees: no pattern search and matching
From Association Mining to Correlation Analysis
Interestingness Measurements

 Objective measures
    Two popular measurements:
      support
      confidence
 Subjective measures
    A rule (pattern) is interesting if
      it is unexpected (surprising to the user), and/or
      it is actionable (the user can do something with it)
Criticism of Support and Confidence

 Example: among 5000 students
    3000 play basketball
    3750 eat cereal
    2000 both play basketball and eat cereal
 play basketball ⇒ eat cereal [40%, 66.7%] is misleading, because the overall percentage of students eating cereal is 75%, which is higher than 66.7%
 play basketball ⇒ not eat cereal [20%, 33.3%] is far more accurate, although it has lower support and confidence

            basketball   not basketball   sum(row)
cereal      2000         1750             3750
not cereal  1000         250              1250
sum(col.)   3000         2000             5000
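
A worked check of the bracketed numbers, computed over all 5000 students: support(basketball ⇒ cereal) = 2000/5000 = 40% and confidence = 2000/3000 ≈ 66.7%, yet P(cereal) = 3750/5000 = 75%, so playing basketball actually lowers the chance of eating cereal; likewise support(basketball ⇒ not cereal) = 1000/5000 = 20% and confidence = 1000/3000 ≈ 33.3%.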
Other Interestingness Measures: Interest

 Interest (correlation, lift):

      interest(A, B) = P(A ∪ B) / (P(A) × P(B))

   where P(A ∪ B) denotes the probability that a transaction contains both A and B
 takes both P(A) and P(B) into consideration
 A and B are negatively correlated if the value is less than 1; otherwise A and B are positively correlated

X   1 1 1 1 0 0 0 0
Y   1 1 0 0 0 0 0 0
Z   0 1 1 1 1 1 1 1

Itemset   Support   Interest
X,Y       25%       2
X,Z       37.5%     0.9
Y,Z       12.5%     0.57
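
Worked check over the 8 transactions above: P(X ∪ Y) = 2/8 = 25%, P(X) = 4/8, P(Y) = 2/8, so interest(X, Y) = 0.25 / (0.5 × 0.25) = 2 (positive correlation); P(Y ∪ Z) = 1/8 = 12.5%, so interest(Y, Z) = 0.125 / (0.25 × 0.875) ≈ 0.57 (negative correlation).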
Criticism of Support and Confidence (cont.)

 Example
    X and Y: positively correlated
    X and Z: negatively correlated
 We need a measure of dependent or correlated events:

      corr(A, B) = P(A ∪ B) / (P(A) × P(B))

X   1 1 1 1 0 0 0 0
Y   1 1 0 0 0 0 0 0
Z   0 1 1 1 1 1 1 1

Itemset   Support   Interest
X,Y       25%       2
X,Z       37.5%     0.9
Y,Z       12.5%     0.57

Rule      Support   Confidence
X ⇒ Y     25%       50%
X ⇒ Z     37.5%     75%
Interestingness Measure: Correlations (Lift)

 play basketball ⇒ eat cereal [40%, 66.7%] is misleading
    The overall % of students eating cereal is 75% > 66.7%
 play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate, although it has lower support and confidence
 Measure of dependent/correlated events: lift

      lift(A, B) = P(A ∪ B) / (P(A) × P(B))

            Basketball   Not basketball   Sum(row)
Cereal      2000         1750             3750
Not cereal  1000         250              1250
Sum(col.)   3000         2000             5000

lift(B, C) = (2000/5000) / ((3000/5000) × (3750/5000)) = 0.89
lift(B, ¬C) = (1000/5000) / ((3000/5000) × (1250/5000)) = 1.33
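
A quick numeric check of these lift values in Python; the variable names are illustrative:

    n = 5000
    both, b_not_c = 2000, 1000               # basketball & cereal, basketball & not cereal
    p_b, p_c, p_nc = 3000 / n, 3750 / n, 1250 / n

    lift_bc = (both / n) / (p_b * p_c)       # 0.4 / 0.45  -> ~0.89 (negative correlation)
    lift_bnc = (b_not_c / n) / (p_b * p_nc)  # 0.2 / 0.15  -> ~1.33 (positive correlation)
    print(round(lift_bc, 2), round(lift_bnc, 2))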
