
Frequent Pattern (FP) Growth
(FP-Tree)
Challenges of Frequent Pattern Mining

 Challenges
    Multiple scans of the transaction database
    Huge number of candidates
    Tedious workload of support counting for candidates
 Improving Apriori: general ideas
    Reduce passes of transaction database scans
    Shrink the number of candidates
    Facilitate support counting of candidates
Bottleneck of Frequent-pattern Mining

 Multiple database scans are costly
 Mining long patterns needs many passes of scanning and generates lots of candidates
    To find the frequent itemset i1i2…i100, # of scans: 100
 Bottleneck: candidate generation and test
 Can we avoid candidate generation?
Methods to Improve Apriori’s Efficiency

 Transaction reduction
    A transaction that does not contain any frequent k-itemset is useless in subsequent scans
 Partitioning
    Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB

Methods to Improve Apriori’s Efficiency

 Sampling
    Mine on a subset of the given data with a lowered support threshold, plus a method to determine completeness

Mining Frequent Patterns Without Candidate Generation

 Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure
    highly condensed, but complete for frequent pattern mining
    avoids costly database scans
 Develop an efficient, FP-tree-based frequent pattern mining method
    A divide-and-conquer methodology: decompose mining tasks into smaller ones
    Avoid candidate generation: sub-database test only

Mining Frequent Patterns Without Candidate Generation

 Grow long patterns from short ones using local frequent items
    "abc" is a frequent pattern
    Get all transactions having "abc": DB|abc
    If "d" is a local frequent item in DB|abc, then abcd is a frequent pattern
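
As a small illustration of the projected database DB|abc, here is a minimal Python sketch; the toy transaction list db is a hypothetical example, not from the slides:

    from collections import Counter

    # Hypothetical toy database: each transaction is a set of items
    db = [{"a", "b", "c", "d"}, {"a", "b", "c"}, {"a", "c", "e"}, {"b", "c", "d"}]

    # DB|abc: the transactions that contain all of a, b, c
    db_abc = [t for t in db if {"a", "b", "c"} <= t]

    # Count the remaining items to find local frequent items such as d
    local_counts = Counter(item for t in db_abc for item in t - {"a", "b", "c"})
    print(db_abc, local_counts)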
Steps

 1) Build a frequency table of all items
 2) Order the items in each transaction in descending frequency order, keeping only items whose support >= minimum support (see the sketch after this list)
 3) Draw the FP-tree
 4) Find the frequent patterns from the FP-tree
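
A minimal Python sketch of steps 1 and 2, using the transactions of the example that follows and an absolute minimum support count of 3. Note that ties between equally frequent items may be broken in any fixed order, so the computed f-list can order ties differently than the slides' f-c-a-b-m-p:

    from collections import Counter

    min_support = 3
    transactions = [
        ["f", "a", "c", "d", "g", "i", "m", "p"],
        ["a", "b", "c", "f", "l", "m", "o"],
        ["b", "f", "h", "j", "o", "w"],
        ["b", "c", "k", "s", "p"],
        ["a", "f", "c", "e", "l", "p", "m", "n"],
    ]

    # Step 1: frequency table of each item
    freq = Counter(item for t in transactions for item in t)

    # Step 2: f-list = frequent items in descending frequency order
    flist = [i for i, n in freq.most_common() if n >= min_support]
    rank = {item: pos for pos, item in enumerate(flist)}

    # Reorder each transaction along the f-list, dropping infrequent items
    ordered = [sorted((i for i in t if i in rank), key=rank.get) for t in transactions]
    print(ordered)   # the first transaction becomes ['f', 'c', 'a', 'm', 'p']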
Example: FP-growth

TID   Items bought
100   {f, a, c, d, g, i, m, p}
200   {a, b, c, f, l, m, o}
300   {b, f, h, j, o, w}
400   {b, c, k, s, p}
500   {a, f, c, e, l, p, m, n}

Minimum support = 3
Construct FP-tree from a Transaction Database

TID   Items bought                (ordered) frequent items
100   {f, a, c, d, g, i, m, p}    {f, c, a, m, p}
200   {a, b, c, f, l, m, o}       {f, c, a, b, m}
300   {b, f, h, j, o, w}          {f, b}
400   {b, c, k, s, p}             {c, b, p}
500   {a, f, c, e, l, p, m, n}    {f, c, a, m, p}

min_support = 3

1. Scan DB once, find frequent 1-itemsets (single-item patterns)
2. Sort the frequent items in frequency-descending order to get the f-list
3. Scan DB again, construct the FP-tree

Header table (item : frequency, each with a head pointer into the tree): f : 4, c : 4, a : 3, b : 3, m : 3, p : 3
F-list = f-c-a-b-m-p

The resulting FP-tree (root {}):

{}
├─ f:4
│  ├─ c:3
│  │  └─ a:3
│  │     ├─ m:2
│  │     │  └─ p:2
│  │     └─ b:1
│  │        └─ m:1
│  └─ b:1
└─ c:1
   └─ b:1
      └─ p:1
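
Continuing the sketch above, a minimal FP-tree builder; the FPNode class and build_fptree function are illustrative, not from the slides:

    class FPNode:
        def __init__(self, item, parent):
            self.item, self.parent = item, parent
            self.count = 0
            self.children = {}              # item -> FPNode

    def build_fptree(ordered_transactions):
        root = FPNode(None, None)
        header = {}                         # item -> list of nodes (the node-links)
        for t in ordered_transactions:      # items already in f-list order
            node = root
            for item in t:
                child = node.children.get(item)
                if child is None:
                    child = FPNode(item, node)
                    node.children[item] = child
                    header.setdefault(item, []).append(child)
                node = child
                node.count += 1             # shared prefixes accumulate counts
        return root, header

    tree, header = build_fptree(ordered)    # 'ordered' from the previous sketch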
Benefits of the FP-tree Structure

 Reduces irrelevant information: infrequent items are gone
 Items are stored in frequency-descending order: the more frequently an item occurs, the more likely its nodes are shared
Partition Patterns and Databases

 Frequent patterns can be partitioned into subsets according to the f-list
    F-list = f-c-a-b-m-p
    Patterns containing p
    Patterns having m but no p
    …
    Patterns having c but none of a, b, m, p
    Pattern f
Find Patterns Having p From p's Conditional Database

 Start at the frequent-item header table in the FP-tree
 Traverse the FP-tree by following the node-links of each frequent item p
 Accumulate all the transformed prefix paths of item p to form p's conditional pattern base

Conditional pattern bases, read off the FP-tree above via the header table:

item   conditional pattern base
c      f:3
a      fc:3
b      fca:1, f:1, c:1
m      fca:2, fcab:1
p      fcam:2, cb:1
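
Under the same assumptions as the earlier sketches, the conditional pattern base of an item can be collected by walking each node-link back to the root:

    def conditional_pattern_base(item, header):
        """Collect (prefix path, count) pairs for every occurrence of item."""
        base = []
        for node in header[item]:                  # follow the node-links
            path, parent = [], node.parent
            while parent is not None and parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                base.append((path[::-1], node.count))
        return base

    print(conditional_pattern_base("m", header))
    # [(['f', 'c', 'a'], 2), (['f', 'c', 'a', 'b'], 1)]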


From Conditional Pattern Bases to Conditional FP-trees

 For each pattern base
    Accumulate the count for each item in the base
    Construct the FP-tree for the frequent items of the pattern base

m-conditional pattern base: fca:2, fcab:1
m-conditional FP-tree: {} → f:3 → c:3 → a:3
All frequent patterns relating to m: m, fm, cm, am, fcm, fam, cam, fcam
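
Worked step, assuming min_support = 3: accumulating counts over the m-conditional pattern base gives f:3, c:3, a:3, and b:1, so b is pruned and the remaining items form the single-path m-conditional FP-tree {} → f:3 → c:3 → a:3 shown above.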
Recursion: Mining Each Conditional FP-tree

Starting from the m-conditional FP-tree ({} → f:3 → c:3 → a:3):

 Cond. pattern base of "am": (fc:3) → am-conditional FP-tree: {} → f:3 → c:3
 Cond. pattern base of "cm": (f:3) → cm-conditional FP-tree: {} → f:3
 Cond. pattern base of "cam": (f:3) → cam-conditional FP-tree: {} → f:3
Mining Frequent Patterns With FP-trees

 Idea: frequent pattern growth
    Recursively grow frequent patterns by pattern and database partition
 Method (a recursive sketch follows this list)
    For each frequent item, construct its conditional pattern base, and then its conditional FP-tree
    Repeat the process on each newly created conditional FP-tree
    Until the resulting FP-tree is empty, or it contains only one path; a single path generates all the combinations of its sub-paths, each of which is a frequent pattern
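
Putting the pieces together, a minimal recursive FP-growth sketch in Python; it reuses the hypothetical build_fptree and conditional_pattern_base helpers from the earlier sketches and favors clarity over efficiency:

    def fp_growth(tree, header, min_support, suffix=()):
        """Recursively mine an FP-tree, yielding (pattern, support) pairs."""
        for item, nodes in header.items():
            support = sum(n.count for n in nodes)
            if support < min_support:
                continue
            pattern = suffix + (item,)
            yield pattern, support
            # Build the conditional pattern base, then recurse on its FP-tree
            base = conditional_pattern_base(item, header)
            cond_transactions = []
            for path, count in base:
                cond_transactions.extend([path] * count)   # expand counts for simplicity
            if cond_transactions:
                cond_tree, cond_header = build_fptree(cond_transactions)
                yield from fp_growth(cond_tree, cond_header, min_support, pattern)

    for pattern, support in fp_growth(tree, header, min_support=3):
        print(sorted(pattern), support)    # e.g. ['a', 'c', 'f', 'm'] 3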
Why Is FP-Growth the Winner?

 Divide-and-conquer:
    decompose both the mining task and the DB according to the frequent patterns obtained so far
    leads to a focused search of smaller databases
 Other factors
    no candidate generation, no candidate test
    compressed database: the FP-tree structure
    no repeated scan of the entire database
    basic operations are counting local frequent items and building sub-FP-trees: no pattern search and matching
From Association Mining to Correlation Analysis
Interestingness Measurements

 Objective measures
    Two popular measurements:
      support
      confidence
 Subjective measures
    A rule (pattern) is interesting if
      it is unexpected (surprising to the user), and/or
      it is actionable (the user can do something with it)
Criticism of Support and Confidence

 Example: among 5000 students
    3000 play basketball
    3750 eat cereal
    2000 both play basketball and eat cereal
 play basketball ⇒ eat cereal [40%, 66.7%] is misleading, because the overall percentage of students eating cereal is 75%, which is higher than 66.7%
 play basketball ⇒ not eat cereal [20%, 33.3%] is far more accurate, although it has lower support and confidence

            basketball   not basketball   sum(row)
cereal      2000         1750             3750
not cereal  1000         250              1250
sum(col.)   3000         2000             5000
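
A worked check of the bracketed numbers, computed over all 5000 students: support(basketball ⇒ cereal) = 2000/5000 = 40% and confidence = 2000/3000 ≈ 66.7%, yet P(cereal) = 3750/5000 = 75%, so playing basketball actually lowers the chance of eating cereal; likewise support(basketball ⇒ not cereal) = 1000/5000 = 20% and confidence = 1000/3000 ≈ 33.3%.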
Other Interestingness Measures: Interest

 Interest (correlation, lift):

      interest(A, B) = P(A ∪ B) / (P(A) × P(B))

   where P(A ∪ B) denotes the probability that a transaction contains both A and B
 takes both P(A) and P(B) into consideration
 A and B are negatively correlated if the value is less than 1; otherwise A and B are positively correlated

X   1 1 1 1 0 0 0 0
Y   1 1 0 0 0 0 0 0
Z   0 1 1 1 1 1 1 1

Itemset   Support   Interest
X,Y       25%       2
X,Z       37.5%     0.9
Y,Z       12.5%     0.57
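
Worked check over the 8 transactions above: P(X ∪ Y) = 2/8 = 25%, P(X) = 4/8, P(Y) = 2/8, so interest(X, Y) = 0.25 / (0.5 × 0.25) = 2 (positive correlation); P(Y ∪ Z) = 1/8 = 12.5%, so interest(Y, Z) = 0.125 / (0.25 × 0.875) ≈ 0.57 (negative correlation).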
Criticism of Support and Confidence (cont.)

 Example
    X and Y: positively correlated
    X and Z: negatively correlated
 We need a measure of dependent or correlated events:

      corr(A, B) = P(A ∪ B) / (P(A) × P(B))

X   1 1 1 1 0 0 0 0
Y   1 1 0 0 0 0 0 0
Z   0 1 1 1 1 1 1 1

Itemset   Support   Interest
X,Y       25%       2
X,Z       37.5%     0.9
Y,Z       12.5%     0.57

Rule      Support   Confidence
X ⇒ Y     25%       50%
X ⇒ Z     37.5%     75%
Interestingness Measure: Correlations (Lift)

 play basketball ⇒ eat cereal [40%, 66.7%] is misleading
    The overall % of students eating cereal is 75% > 66.7%
 play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate, although it has lower support and confidence
 Measure of dependent/correlated events: lift

      lift(A, B) = P(A ∪ B) / (P(A) × P(B))

            Basketball   Not basketball   Sum(row)
Cereal      2000         1750             3750
Not cereal  1000         250              1250
Sum(col.)   3000         2000             5000

lift(B, C) = (2000/5000) / ((3000/5000) × (3750/5000)) = 0.89
lift(B, ¬C) = (1000/5000) / ((3000/5000) × (1250/5000)) = 1.33
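
A quick numeric check of these lift values in Python; the variable names are illustrative:

    n = 5000
    both, b_not_c = 2000, 1000               # basketball & cereal, basketball & not cereal
    p_b, p_c, p_nc = 3000 / n, 3750 / n, 1250 / n

    lift_bc = (both / n) / (p_b * p_c)       # 0.4 / 0.45  -> ~0.89 (negative correlation)
    lift_bnc = (b_not_c / n) / (p_b * p_nc)  # 0.2 / 0.15  -> ~1.33 (positive correlation)
    print(round(lift_bc, 2), round(lift_bnc, 2))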
