
Mining Association Rules with FP-Tree
Mining Frequent Itemsets without Candidate Generation
• In many cases, the Apriori candidate generate-and-test method significantly reduces the size of candidate sets, leading to good performance gains.
• However, it suffers from two nontrivial costs:
  • It may generate a huge number of candidates (for example, with 10^4 frequent 1-itemsets, it may generate more than 10^7 candidate 2-itemsets).
  • It may need to scan the database many times.
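As a quick sanity check on the numbers above, the candidate-join step over 10^4 frequent 1-itemsets can produce every unordered pair, i.e. C(10^4, 2) candidate 2-itemsets:

```python
from math import comb

# Number of candidate 2-itemsets joinable from 10^4 frequent 1-itemsets.
n = 10**4
candidates = comb(n, 2)
print(candidates)  # 49995000, i.e. roughly 5 * 10^7 -- "more than 10^7"
```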
Association Rules with Apriori

Minimum support = 2
Minimum confidence = 70%

TID   Items
T100  I1, I2, I5
T200  I2, I4
T300  I2, I3
T400  I1, I2, I4
T500  I1, I3
T600  I2, I3
T700  I1, I3
T800  I1, I2, I3, I5
T900  I1, I2, I3
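For a dataset this small, the frequent itemsets under minimum support 2 can be checked by brute-force enumeration. This is illustrative only: avoiding exactly this exhaustive count is the point of Apriori and FP-growth.

```python
from itertools import combinations
from collections import Counter

transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
    {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
    {"I1", "I2", "I3"},
]
min_support = 2

# Count every itemset occurring in some transaction (exponential blow-up,
# tolerable only for toy data).
counts = Counter()
for t in transactions:
    for k in range(1, len(t) + 1):
        for subset in combinations(sorted(t), k):
            counts[subset] += 1

frequent = {s: c for s, c in counts.items() if c >= min_support}
print(frequent[("I1", "I2")])  # 4 -- {I1, I2} appears in T100, T400, T800, T900
```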
Bottleneck of Frequent-Pattern Mining
• Multiple database scans are costly
• Mining long patterns needs many passes of scanning and generates lots of candidates
  • To find the frequent itemset i1i2…i100: # of scans = 100
• Bottleneck: candidate generation-and-test
• Can we avoid candidate generation?
Process of FP-Growth
• FP-Growth allows frequent itemset discovery without candidate itemset generation. Two-step approach:
  • Step 1: Build a compact data structure called the FP-tree
    • Built using 2 passes over the data set.
  • Step 2: Extract frequent itemsets directly from the FP-tree
Step 1: FP-Tree Construction
• The FP-tree is constructed using 2 passes over the data set.
Pass 1:
• Scan the data and find the support of each item.
• Discard infrequent items.
• Sort the frequent items in decreasing order of support.
  • Use this order when building the FP-tree, so common prefixes can be shared.
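Pass 1 can be sketched in a few lines of Python, using the running example's transactions. The alphabetical tie-break is an illustrative choice; any fixed tie-break works.

```python
from collections import Counter

# The running example's transactions.
transactions = [
    ["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"], ["I1", "I2", "I4"],
    ["I1", "I3"], ["I2", "I3"], ["I1", "I3"], ["I1", "I2", "I3", "I5"],
    ["I1", "I2", "I3"],
]
min_support = 2

# Pass 1: scan the data and count the support of each item.
support = Counter(item for t in transactions for item in t)

# Discard infrequent items, then sort by decreasing support
# (ties broken alphabetically -- arbitrary but fixed).
order = sorted((i for i in support if support[i] >= min_support),
               key=lambda i: (-support[i], i))
print(order)  # ['I2', 'I1', 'I3', 'I4', 'I5']
```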
Step 1: FP-Tree Construction
Pass 2:
Nodes correspond to items and carry a counter.
1. FP-Growth reads one transaction at a time and maps it to a path.
2. A fixed item order is used, so paths can overlap when transactions share items (when they have the same prefix).
   • In this case, the counters are incremented.
3. Pointers are maintained between nodes containing the same item, creating singly linked lists (dotted lines).
   • The more paths overlap, the higher the compression. The FP-tree may then fit in memory.
4. Frequent itemsets are extracted from the FP-tree.
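A minimal sketch of pass 2 in Python, assuming the transactions of the running example. The `Node` class and the list-based header table are simplifications of the usual node-link structure, introduced here for illustration.

```python
from collections import Counter

class Node:
    """FP-tree node: an item, a counter, and links to its children."""
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 0, parent
        self.children = {}

def build_fp_tree(transactions, min_support):
    # Pass 1: item supports and the fixed decreasing-support order.
    support = Counter(i for t in transactions for i in t)
    rank = {i: r for r, i in enumerate(
        sorted((i for i in support if support[i] >= min_support),
               key=lambda i: (-support[i], i)))}
    root = Node(None, None)
    header = {}  # item -> list of its nodes (stands in for the node-links)
    # Pass 2: map each transaction to a path, sharing common prefixes.
    for t in transactions:
        items = sorted((i for i in t if i in rank), key=rank.get)
        node = root
        for item in items:
            if item not in node.children:
                child = Node(item, node)
                node.children[item] = child
                header.setdefault(item, []).append(child)
            node = node.children[item]
            node.count += 1  # overlapping paths just bump the counter
    return root, header

transactions = [
    ["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"], ["I1", "I2", "I4"],
    ["I1", "I3"], ["I2", "I3"], ["I1", "I3"], ["I1", "I2", "I3", "I5"],
    ["I1", "I2", "I3"],
]
root, header = build_fp_tree(transactions, min_support=2)
# All 7 transactions containing I2 share the single path starting at I2.
print(root.children["I2"].count)  # 7
```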
Association Rules
• Let’s have an example:

T100  1, 2, 5
T200  2, 4
T300  2, 3
T400  1, 2, 4
T500  1, 3
T600  2, 3
T700  1, 3
T800  1, 2, 3, 5
T900  1, 2, 3
FP Tree
FP-Tree Size
• The FP-tree usually has a smaller size than the uncompressed data, because transactions typically share items (and hence prefixes).
  • Best-case scenario: all transactions contain the same set of items.
    • 1 path in the FP-tree.
  • Worst-case scenario: every transaction has a unique set of items (no items in common).
    • The size of the FP-tree is at least as large as the original data.
    • Storage requirements for the FP-tree are higher: the pointers between nodes and the counters must also be stored.
• The size of the FP-tree depends on how the items are ordered.
  • Ordering by decreasing support is typically used, but it does not always lead to the smallest tree (it is a heuristic).
Step 2: Frequent Itemset Generation
• FP-Growth extracts frequent itemsets from the FP-tree.
• Bottom-up algorithm: works from the leaves towards the root.
• Divide and conquer: first look for frequent itemsets ending in e, then de, etc., then d, then cd, etc.
• First, extract the prefix-path sub-trees ending in an item (or itemset).
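The divide-and-conquer step can be sketched by mining conditional pattern bases recursively. This toy version represents each (conditional) database as a list of (ordered path, count) pairs instead of an actual tree, which keeps the idea visible at the cost of efficiency.

```python
from collections import Counter

transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
    {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
    {"I1", "I2", "I3"},
]
min_support = 2

def fp_growth(patterns, min_support, suffix=()):
    # `patterns` is a (conditional) pattern base: (ordered item tuple, count).
    support = Counter()
    for items, count in patterns:
        for i in items:
            support[i] += count
    result = {}
    for item in (i for i in support if support[i] >= min_support):
        new_suffix = (item,) + suffix
        result[new_suffix] = support[item]
        # Conditional pattern base for `item`: the prefix of every path
        # containing it, carrying that path's count.
        conditional = [(items[:items.index(item)], count)
                       for items, count in patterns if item in items]
        result.update(fp_growth(conditional, min_support, new_suffix))
    return result

# Global pass: order items by decreasing support, drop infrequent ones.
counts = Counter(i for t in transactions for i in t)
rank = {i: r for r, i in enumerate(
    sorted((i for i in counts if counts[i] >= min_support),
           key=lambda i: (-counts[i], i)))}
base = [(tuple(sorted((i for i in t if i in rank), key=rank.get)), 1)
        for t in transactions]

freq = fp_growth(base, min_support)
flat = {frozenset(k): v for k, v in freq.items()}
print(flat[frozenset({"I1", "I2", "I5"})])  # 2 -- found via T100 and T800
```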
Mining the FP-Tree
Benefits of the FP-Tree Structure
• Completeness
  • Preserves complete information for frequent-pattern mining
  • Never breaks a long pattern of any transaction
• Compactness
  • Reduces irrelevant info: infrequent items are gone
  • Items are in frequency-descending order: the more frequently an item occurs, the more likely it is to be shared
  • Never larger than the original database (not counting node-links and the count fields)
  • For the Connect-4 DB, the compression ratio can be over 100
Exercise
• A dataset has five transactions. Let min_support = 60% and min_confidence = 80%.
• Find all frequent itemsets using an FP-tree.

TID  Items_bought
T1   M, O, N, K, E, Y
T2   D, O, N, K, E, Y
T3   M, A, K, E
T4   M, U, C, K, Y
T5   C, O, O, K, I, E
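One way to check an answer to the exercise is brute-force enumeration (60% of 5 transactions gives an absolute minimum support of 3); again, this is a verification aid, not the FP-tree method itself.

```python
from itertools import combinations
from collections import Counter

transactions = [
    {"M", "O", "N", "K", "E", "Y"}, {"D", "O", "N", "K", "E", "Y"},
    {"M", "A", "K", "E"}, {"M", "U", "C", "K", "Y"},
    {"C", "O", "K", "I", "E"},
]
min_support = 3  # 60% of 5 transactions

# Count every itemset occurring in some transaction.
counts = Counter()
for t in transactions:
    for k in range(1, len(t) + 1):
        for subset in combinations(sorted(t), k):
            counts[subset] += 1

frequent = {s: c for s, c in counts.items() if c >= min_support}
for itemset, c in sorted(frequent.items(), key=lambda x: (len(x[0]), x[0])):
    print("".join(itemset), c)
```

The output lists K, E, M, O, Y as frequent 1-itemsets; KE, KM, KO, KY, EO as frequent 2-itemsets; and KEO as the only frequent 3-itemset.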
Association Rules with Apriori

Frequent 1-itemsets: K:5, E:4, M:3, O:3, Y:3
Candidate 2-itemsets (with counts): KE:4, KM:3, KO:3, KY:3, EM:2, EO:3, EY:2, MO:1, MY:2, OY:2
Frequent 2-itemsets: KE, KM, KO, KY, EO
Frequent 3-itemset: KEO
Association Rules with FP Tree

Header table (items in decreasing support order): K:5, E:4, M:3, O:3, Y:3
Association Rules with FP Tree

Item   Conditional pattern base   Conditional frequent items   Frequent itemsets
Y      {KEMO:1, KEO:1, KM:1}      K:3                          KY
O      {KEM:1, KE:2}              K:3, E:3                     KO, EO, KEO
M      {KE:2, K:1}                K:3                          KM
E      {K:4}                      K:4                          KE
Why Is FP-Growth the Winner?
• Divide-and-conquer:
  • Decompose both the mining task and the DB according to the frequent patterns obtained so far
  • Leads to focused search of smaller databases
• Other factors:
  • No candidate generation, no candidate test
  • Compressed database: the FP-tree structure
  • No repeated scans of the entire database
  • Basic operations are counting local frequent items and building sub-FP-trees; no pattern search and matching
Applications of Association Rules
• Association rule learning is an unsupervised learning method that tests for dependence between one data element and another, with the goal of discovering interesting relations among the variables of a dataset.
• It relies on rule-interestingness measures (such as support and confidence) to find relations between variables in a database.
• Association rule learning is an important machine learning technique, and it is employed in:
  • Market basket analysis
  • Web usage mining
  • Continuous production, etc.
• In market basket analysis, it is widely used by large retailers to find relations among items.
• Association rules were originally derived from point-of-sale data recording which products are purchased together. Although their roots are in point-of-sale transactions, association rules can be used outside the retail market to find relationships among other kinds of “baskets.”
• There are various applications of association rules, for example:
  • Items purchased on a credit card, such as rental cars and hotel rooms, give insight into the next products a customer is likely to buy.
  • Optional services purchased by telecom users (call waiting, call forwarding, DSL, speed dial, etc.) help decide how to bundle these features to maximize revenue.
  • Banking services used by retail customers (money market accounts, CDs, investment services, car loans, etc.) help recognize customers likely to need other services.
  • Unusual groups of insurance claims can be an indication of fraud and can trigger further investigation.
  • Medical patient histories can suggest likely complications based on certain combinations of treatments.
• Association rules sometimes fail to live up to expectations. For instance, they are not the best method for producing cross-selling models in markets such as retail banking, because the rules end up describing previous marketing promotions. Also, in retail banking, customers frequently start with a checking account and then a savings account; differentiation among products does not occur until customers hold higher-end products.
• The Apriori algorithm needs frequent itemsets to create association rules.
• It is designed to work on databases that contain transactions.
• The algorithm uses a breadth-first search and a hash tree to count candidate itemsets efficiently.
• It is generally used for market basket analysis, helping to understand which products tend to be purchased together. It is also used in healthcare to discover adverse drug reactions in patients.
