
Association Rules

Association mining

Association rule mining:
Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
Applications:
Basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.
Association mining

Let A = {i1, i2, …, im} be a set of items.
Transaction t: a set of items, with t ⊆ A.
Transaction database T: a set of transactions T = {t1, t2, …, tn}.
Association mining

Transaction data: supermarket data


Market basket transactions:
t1: {bread, cheese, milk}
t2: {apple, eggs, salt, yogurt}
… …
tn: {biscuit, eggs, milk}
Concepts:
• An item: an item/article in a basket
• I: the set of all items sold in the store
• A transaction: items purchased in a basket; it
may have TID (transaction ID)
• A transactional dataset: A set of transactions
Association mining

Support
A transaction t is said to support an item Ii if Ii is present in t. t is said to support a subset of items X ⊆ A if t supports each item I in X.
An itemset X ⊆ A has support s in T, denoted by s(X)T, if s% of the transactions in T support X.
The support of an item is the percentage of transactions in which that item occurs.
A rule X → Y holds with support sup in T (the transaction data set) if sup% of transactions contain X ∪ Y.
sup = Pr(X ∪ Y).
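As a minimal sketch (not from the slides), support can be computed directly from a list of transactions; the helper names below are illustrative:

```python
def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset.issubset(t))

def support(itemset, transactions):
    """Fraction of transactions that contain `itemset`."""
    return support_count(itemset, transactions) / len(transactions)

transactions = [
    {"bread", "cheese", "milk"},
    {"apple", "eggs", "salt", "yogurt"},
    {"biscuit", "eggs", "milk"},
]
print(support({"milk"}, transactions))          # 2/3
print(support({"eggs", "milk"}, transactions))  # 1/3
```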
Association mining

Example: A = {beer, diaper}

Item                  No. of transactions
Beer                  20,000
Diaper                30,000
Diaper and beer       10,000
Total transactions    500,000

Support (beer, diaper) = 10,000 / 500,000 = 2%
Association mining

A = {ANN, CC, DS, TC, CG}
T = {t1, t2, t3, t4, t5, t6}

Transaction   Items
t1            ANN, CC, TC, CG
t2            CC, DS, CG
t3            ANN, CC, DS, CG
t4            ANN, CC, TC, CG
t5            ANN, CC, DS, TC, CG
t6            CC, DS, TC

Support (DS) = 4/6 = 66.6%
Support (ANN, CC) = 4/6
Association mining

TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke

Support count (σ)
• Frequency of occurrence of an itemset
• E.g. σ({Milk, Bread, Diaper}) = 2
Support
• Fraction of transactions that contain an itemset
• E.g. s({Milk, Bread, Diaper}) = 2/5
Association mining

Confidence
For a given transaction database T, an association rule is an expression of the form X → Y, where X and Y are subsets of A. X → Y holds with confidence τ if τ% of the transactions in T that support X also support Y.
The rule holds in T with confidence conf if conf% of the transactions that contain X also contain Y.
conf = Pr(Y | X)
The confidence of an association rule X → Y is the ratio of the number of transactions that contain X ∪ Y to the number of transactions that contain X.
Confidence measures how much a particular item depends on another.
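A corresponding sketch for confidence, reusing the hypothetical `support_count` helper from the support example above:

```python
def confidence(X, Y, transactions):
    """conf(X -> Y) = count(X ∪ Y) / count(X), an estimate of Pr(Y | X)."""
    return support_count(set(X) | set(Y), transactions) / support_count(X, transactions)

# With the counts from the beer/diaper table (10,000 joint, 30,000 diaper
# and 20,000 beer transactions):
#   conf(diaper -> beer) = 10,000 / 30,000 ≈ 33%
#   conf(beer -> diaper) = 10,000 / 20,000 = 50%
```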
Association mining

Confidence Example
confidence = (X ∪ Y).count / X.count
People who buy diapers also buy beer:
= no. of transactions (beer, diaper) / no. of diaper transactions
= 10,000 / 30,000 = 33.33%
People who buy beer also buy diapers:
= 10,000 / 20,000 = 50%
Association mining

Confidence Example
Confidence (ANN → CC)
= no. of transactions containing both books / no. of transactions containing ANN
= 4/4 = 100%
Confidence (CC → ANN)
= no. of transactions containing both books / no. of transactions containing CC
= 4/6 = 66%
Association mining

Min. support 50%, min. confidence 50%

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

Frequent Itemset   Support
{A}                75%
{B}                50%
{C}                50%
{A, C}             50%

For rule A → C:
support = support({A, C}) = 50%
confidence = support({A, C}) / support({A}) = 66.6%
Association mining

Assume minsup = 30% and minconf = 80%, with transactions:
t1: Beef, Chicken, Milk
t2: Beef, Cheese
t3: Cheese, Boots
t4: Beef, Chicken, Cheese
t5: Beef, Chicken, Clothes, Cheese, Milk
t6: Chicken, Clothes, Milk
t7: Chicken, Milk, Clothes

An example frequent itemset:
{Chicken, Clothes, Milk}   [sup = 3/7]
Association rules from the itemset:
Clothes → Milk, Chicken    [sup = 3/7, conf = 3/3]
… …
Clothes, Chicken → Milk    [sup = 3/7, conf = 3/3]
Association mining

Association rule mining is a two-step process:
Find all frequent itemsets
• Each of these itemsets occurs at least as frequently as a predetermined minimum support count, min_sup.
Generate strong association rules from the frequent itemsets
• These rules must satisfy minimum support and minimum confidence.
Methods to discover
Association Rules

The algorithm developed must be efficient:
It should reduce I/O operations
It should be computationally efficient
Methods to discover
Association Rules

Problem Decomposition: the task can be decomposed into 2 sub-problems
Find all sets of items whose support is greater than the user-specified minimum support σ; such an itemset is called a frequent itemset.
Use the frequent itemsets to generate the desired rules. If ABCD and AB are frequent itemsets, then we can determine whether the rule AB → CD holds by checking
• s({A,B,C,D}) / s({A,B}) ≥ τ, where s(X) is the support of X in T
Methods to discover
Association Rules

Frequent Set:
Let T be the transaction database and σ be the user-specified minimum support.
An itemset X ⊆ A is said to be a frequent itemset in T with respect to σ if s(X)T ≥ σ.
Example: let σ = 50%,
• then {ANN, CC, TC} is a frequent set, as it is supported by 3 transactions out of 6 (any subset of this set is also a frequent set)
• But {ANN, CC, DS} is not a frequent set (so no set that properly contains this set is a frequent set)
Methods to discover
Association Rules

Properties of Frequent Set:


Downward Closure property: any subset
of a frequent set is a frequent set
Upward Closure property: any superset
of an infrequent set is an infrequent set.
Maximal Frequent Set
A frequent set X is a maximal frequent set if it is a frequent set and no superset of it is a frequent set,
i.e. there exists no super-itemset Y such that X ⊂ Y and Y is frequent in D.
Methods to discover
Association Rules

Closed
An itemset X is closed in a dataset D if there exists no proper super-itemset Y such that Y has the same support count as X in D.
Closed frequent itemset
An itemset X is a closed frequent itemset in D if X is both closed and frequent in D.
Border set
An itemset is a border set if it is not a frequent set, but all its proper subsets are frequent sets.
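These definitions can be checked mechanically once the support counts are known. The sketch below assumes `counts` is a dict mapping frozensets to support counts for every itemset of interest (names and structure are illustrative, not from the slides):

```python
from itertools import combinations

def is_frequent(X, counts, minsup):
    return counts.get(frozenset(X), 0) >= minsup

def is_maximal_frequent(X, counts, minsup, items):
    """Frequent, and no superset obtained by adding one more item is frequent
    (by downward closure this rules out all larger supersets too)."""
    X = frozenset(X)
    return is_frequent(X, counts, minsup) and all(
        not is_frequent(X | {i}, counts, minsup) for i in items if i not in X)

def is_closed(X, counts):
    """No proper superset has the same support count as X."""
    X = frozenset(X)
    return all(counts[Y] < counts[X] for Y in counts if X < Y)

def is_border(X, counts, minsup):
    """Not frequent itself, but every proper subset is frequent."""
    X = frozenset(X)
    return not is_frequent(X, counts, minsup) and all(
        is_frequent(S, counts, minsup)
        for r in range(1, len(X)) for S in combinations(X, r))
```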
Methods to discover
Association Rules

(Figure: nested sets showing Maximal Frequent Itemsets ⊆ Closed Frequent Itemsets ⊆ Frequent Itemsets.)
Methods to discover
Association Rules

Example

A1 A2 A3 A4 A5 A6 A7 A8 A9
1  0  0  0  1  1  0  1  0
0  1  0  1  0  0  0  1  0
0  0  0  1  1  0  1  0  0
0  1  1  0  0  0  0  0  0
0  0  0  0  1  1  1  0  0
0  1  1  1  0  0  0  0  0
0  1  0  0  0  1  1  0  1
0  0  0  0  1  0  0  0  0
0  0  0  0  0  0  0  1  0
0  0  1  0  1  0  1  0  0

Assume σ = 20%. As we have 10 records, an itemset supported by at least 2 transactions is a frequent set.
Methods to discover
Association Rules

Example

X        Support Count
{1}      1
{2}      4
{3}      3
{4}      3
{5}      5
{6}      3
{7}      4
{8}      3
{9}      1
{5,6}    2
{5,7}    3
{6,7}    3
{5,6,7}  1

Frequent count: {1} is not a frequent itemset, but {3} is, with respect to σ.
{5,6,7} is a border set.
Methods to discover
Association Rules

Large Itemset
An itemset whose number of occurrences is above the threshold s.
The most common approach to finding association rules is to break the problem into two parts:
Find the large itemsets
Generate rules from the frequent itemsets
Methods to discover
Association Rules

Finding large itemsets is easy but very costly.
If we count every itemset that appears in any transaction, and the set of items has size m, there are 2^m subsets.
Since we are not interested in the empty set, the potential number of large itemsets is then 2^m − 1.
Most algorithms are therefore based on reducing the number of itemsets to be counted.
These potentially large itemsets are called candidates, and the set of all counted (potentially large) itemsets is called the candidate itemset.
Methods to discover
Association Rules

ARGen algorithm
Notation: D is the database of transactions, I the items, L the large itemsets, s the support, τ the confidence.

R = ∅
for each l ∈ L do
    for each x ⊂ l such that x ≠ ∅ do
        if support(l) / support(x) ≥ τ then
            R = R ∪ {x → (l − x)}
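A Python sketch of the same ARGen idea: for each large itemset l, emit the rule x → (l − x) whenever the confidence ratio reaches τ. Function and variable names are illustrative; `support` is assumed to map each frequent itemset (as a frozenset) to its count:

```python
from itertools import combinations

def argen(large_itemsets, support, min_conf):
    """Generate rules x -> (l - x) from each large itemset l."""
    rules = []
    for l in map(frozenset, large_itemsets):
        for r in range(1, len(l)):                 # all non-empty proper subsets x
            for x in map(frozenset, combinations(l, r)):
                conf = support[l] / support[x]     # support(l) / support(x)
                if conf >= min_conf:
                    rules.append((set(x), set(l - x), conf))
    return rules
```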
Apriori Algorithm
Apriori Algorithm
Proposed by R. Agrawal and R. Srikant
Uses prior knowledge of frequent itemset properties
Uses an iterative approach known as level-wise search, where frequent k-itemsets are used to explore (k+1)-itemsets
• The set of frequent 1-itemsets is found by scanning the database to accumulate the count for each item and collecting those items that satisfy minimum support. The resulting set is L1.
Apriori Algorithm
Apriori Algorithm
• Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on.
• This process continues until no more frequent k-itemsets can be found.
• The finding of each Lk requires one full scan of the database.
To improve efficiency, a property called the Apriori property is used to reduce the search space.
Apriori Property
• All non-empty subsets of a frequent itemset must also be frequent. (Downward Closure)
Apriori Algorithm
Apriori Algorithm (2 steps)
Find all itemsets with a specified minimal
support (coverage).
• An itemset is just a specific set of items, e.g.
{apples, cheese}. The Apriori algorithm can
efficiently find all itemsets whose coverage is
above a given minimum.
Use these itemsets to help generate interesting rules.
• Having done stage 1, we have considerably
narrowed down the possibilities, and can do
reasonably fast processing of the large itemsets to
generate candidate rules.
Apriori Algorithm
The join step
To find Lk, a set of candidate k-itemsets is generated by joining Lk-1 with itself.
This set of candidates is denoted by Ck.
Let l1, l2 be itemsets in Lk-1.
The join Lk-1 ⋈ Lk-1 is performed where members of Lk-1 are joinable if their first (k-2) items are in common. That is, the members l1, l2 of Lk-1 are joined if
(l1[1]=l2[1]) ∧ (l1[2]=l2[2]) ∧ … ∧ (l1[k-2]=l2[k-2]) ∧ (l1[k-1] < l2[k-1])
The condition (l1[k-1] < l2[k-1]) ensures that no duplicates are produced.
The resulting itemset is
{l1[1], l1[2], …, l1[k-2], l1[k-1], l2[k-1]}
Apriori Algorithm
The join step
note that there is always an ordering of the items.
a < b will mean that a comes before b in
alphabetical order.
Suppose we have Lk and wish to generate Ck+1
First we take every distinct pair of sets in Lk
{a1, a2 , … ak} and {b1, b2 , … bk}, and do this:
in all cases where
{a1, a2 , … ak-1} = {b1, b2 , … bk-1}, and ak < bk,
{a1, a2 , … ak, bk} is a candidate k+1-itemset.
Apriori Algorithm
The join step
Suppose the 2-itemsets are:
L2 = { {milk, noodles}, {milk, pasta}, {noodles, meat}, {noodles, peas}, {noodles, pasta} }

The pairs that satisfy {a1, a2, …, ak-1} = {b1, b2, …, bk-1} and ak < bk are:
{milk, noodles} | {milk, pasta}
{noodles, peas} | {noodles, meat}
{noodles, peas} | {noodles, pasta}
{noodles, meat} | {noodles, pasta}

So the candidate 3-itemsets are:
{milk, noodles, pasta}, {noodles, peas, meat}, {noodles, peas, pasta}, {noodles, meat, pasta}
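A small sketch of the join step, treating each itemset in Lk-1 as a sorted tuple and merging pairs that agree on everything but their last item (illustrative code, not from the slides):

```python
def apriori_join(L_prev):
    """L_prev: frequent (k-1)-itemsets as sorted tuples; returns candidate k-itemsets."""
    L_prev = sorted(L_prev)
    candidates = []
    for i in range(len(L_prev)):
        for j in range(i + 1, len(L_prev)):
            l1, l2 = L_prev[i], L_prev[j]
            # joinable if the first k-2 items agree and l1[k-1] < l2[k-1]
            if l1[:-1] == l2[:-1] and l1[-1] < l2[-1]:
                candidates.append(l1 + (l2[-1],))
    return candidates

print(apriori_join([(1, 3), (2, 3), (2, 5), (3, 5)]))   # [(2, 3, 5)]
```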
Apriori Algorithm
The Prune step
Ck is a superset of Lk; that is, its members may or may not be frequent, but all the frequent k-itemsets are included in Ck.
A database scan determining the count of each candidate in Ck would result in the determination of Lk.
That is, all the candidates having a count not less than the minimum support count are included in Lk.
This can make Ck very large, so we have to reduce the size of Ck.
Apriori Algorithm
The Prune step
To reduce the size of Ck:
• If any (k-1)-subset of a candidate k-itemset is not in Lk-1, then the candidate cannot be frequent and is removed from Ck.
• This subset testing can be done quickly by maintaining a hash tree of frequent itemsets.
Apriori Algorithm
The Prune step
Now we have some candidate k+1 itemsets,
and are guaranteed to have all of the ones
that possibly could be large, but we have the
chance to maybe prune out some more
before we enter the next stage of Apriori
that counts their support.
In the prune step, we take the candidate (k+1)-itemsets we have, and remove any for which some k-subset of it is not a large k-itemset. Such a candidate couldn't possibly be a large (k+1)-itemset.
Apriori Algorithm
The Prune step
E.g. in the current example, we have (n = noodles, m = milk, p = peas, t = pasta, q = meat):
L2 = { {m, n}, {m, t}, {n, q}, {n, p}, {n, t} }
And the candidate (k+1)-itemsets so far:
{m, n, t}, {n, p, q}, {n, p, t}, {n, q, t}
Now, {p, q} is not a frequent 2-itemset, so {n, p, q} is pruned.
{p, t} is not a frequent 2-itemset, so {n, p, t} is pruned.
{q, t} is not a frequent 2-itemset, so {n, q, t} is pruned.
After this we finally have C3 = {{milk, noodles, pasta}}.
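The prune step can be sketched in the same style: drop any candidate that has a (k-1)-subset outside Lk-1 (illustrative code):

```python
from itertools import combinations

def apriori_prune(candidates, L_prev):
    """Keep only candidates all of whose (k-1)-subsets are frequent."""
    frequent = set(map(frozenset, L_prev))
    return [c for c in candidates
            if all(frozenset(s) in frequent
                   for s in combinations(c, len(c) - 1))]

# In the slides' example this keeps only {milk, noodles, pasta} out of the
# four candidate 3-itemsets.
```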
Apriori Algorithm Example (1)
Database D (Min support = 50%):
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D → C1:          L1:
itemset  sup          itemset  sup
{1}      2            {1}      2
{2}      3            {2}      3
{3}      3            {3}      3
{4}      1            {5}      3
{5}      3

C2 (join of L1 with itself): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

Scan D → C2 counts:   L2:
itemset  sup          itemset  sup
{1 2}    1            {1 3}    2
{1 3}    2            {2 3}    2
{1 5}    1            {2 5}    3
{2 3}    2            {3 5}    2
{2 5}    3
{3 5}    2

C3: {2 3 5}   Scan D → L3: {2 3 5} with sup 2
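Putting join, prune, and counting together, a minimal end-to-end Apriori sketch that reproduces this example (the dataset and the 50% minimum support are taken from the slide; the code itself is an illustrative sketch):

```python
from itertools import combinations
from collections import defaultdict

transactions = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
min_count = 2                      # 50% of 4 transactions

def frequent_counts(candidates):
    """Scan the database and keep candidates meeting the minimum support count."""
    counts = defaultdict(int)
    for t in transactions:
        for c in candidates:
            if set(c).issubset(t):
                counts[c] += 1
    return {c: n for c, n in counts.items() if n >= min_count}

items = sorted({i for t in transactions for i in t})
L = frequent_counts([(i,) for i in items])         # L1
all_frequent = dict(L)

k = 2
while L:
    prev = sorted(L)
    # join: merge (k-1)-itemsets that agree on their first k-2 items
    cands = [a + (b[-1],) for a in prev for b in prev
             if a[:-1] == b[:-1] and a[-1] < b[-1]]
    # prune: every (k-1)-subset must itself be frequent
    cands = [c for c in cands if all(s in L for s in combinations(c, k - 1))]
    L = frequent_counts(cands)
    all_frequent.update(L)
    k += 1

print(all_frequent)
# expected: {1}:2, {2}:3, {3}:3, {5}:3, {1,3}:2, {2,3}:2, {2,5}:3, {3,5}:2, {2,3,5}:2
```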
Apriori Algorithm Example (2)
(Table: a 20-transaction binary matrix over the items a, b, c, d, e, f, g; each row marks with 1 the items present in that transaction.)
We will assume this is our transaction database D, and we will assume minsup is 4 (20%).
Apriori Algorithm Example
First we find all the large 1-itemsets. I.e., in this case,
all the 1-itemsets that are contained by at least 4
records in the DB.
In this example, that’s all of them. So,
L1 = {{a}, {b}, {c}, {d}, {e}, {f}, {g}}
Now we set k = 2 and run apriori-gen to generate C2
The join step when k=2 just gives us the set of all
alphabetically ordered pairs from L1, and we cannot
prune any away, so we have C2
= {{a, b}, {a, c}, {a, d}, {a, e}, {a, f}, {a, g}, {b, c},
{b, d}, {b, e}, {b, f}, {b, g}, {c, d}, {c, e}, {c,
f}, {c, g},
{d, e}, {d, f}, {d, g}, {e, f}, {e, g}, {f, g}}
Apriori Algorithm Example
The Apriori algorithm now tells us to set a counter for each of these to 0.
Now take each record in the DB in turn, and find which of those in C2 are contained in it.
The first record r1 is: {a, b, d, g}. Those of C2 it contains are:
{a, b}, {a, d}, {a, g}, {b, d}, {b, g}, {d, g}.
Hence Cr1 = {{a, b}, {a, d}, {a, g}, {b, d}, {b, g}, {d, g}}, and we increment the counters of these itemsets.
Apriori Algorithm Example
The second record r2 is: {c, d, e}; Cr2 = {{c, d}, {c, e}, {d, e}}, and we increment the counters for these three itemsets.

After all 20 records, we look at the counters, and in this case we will find that the itemsets with counters >= minsup (4) are:
L2 = {{a, c}, {a, d}, {c, d}, {c, e}, {c, f}}
Apriori Algorithm Example
We now set k = 3 and run apriori-gen on L2 .
The join step finds the following pairs that
meet the required pattern:
{a, c}:{a, d}   {c, d}:{c, e}   {c, d}:{c, f}   {c, e}:{c, f}
This leads to the candidate 3-itemsets:
{a, c, d}, {c, d, e}, {c, d, f}, {c, e, f}
We prune {c, d, e} since {d, e} is not in L2
We prune {c, d, f} since {d, f} is not in L2
We prune {c, e, f} since {e, f} is not in L2
Apriori Algorithm Example
We are left with C3 = {{a, c, d}}.
We now count how many records contain {a, c, d}. The count is 4, so L3 = {{a, c, d}}.
We now set k = 4, but when we run apriori-gen on L3 we get the empty set, and hence we find L4 = {}.
This means we now finish, and return the set of all of the non-empty Lk's.
Apriori Algorithm Example(3)
A1 A2 A3 A4 A5 A6 A7 A8 A9
1 0 0 0 1 1 0 1 0
0 1 0 1 0 0 0 1 0
0 0 0 1 1 0 1 0 0
0 1 1 0 0 0 0 0 0
0 0 0 0 1 1 1 0 0
0 1 1 1 0 0 0 0 0
0 1 0 0 0 1 1 0 1
0 0 0 0 1 0 0 0 0
0 0 0 0 0 0 0 1 0
0 0 1 0 1 0 1 0 0
0 0 1 0 1 0 1 0 0
0 0 0 0 1 1 0 1 0
0 1 0 1 0 1 1 0 0
1 0 1 0 1 0 1 0 0
0 1 1 0 0 0 0 0 1
Apriori Algorithm Example(3)
K=1: read the database to count the support of the 1-itemsets. The 1-itemsets and their support counts are:
{1}  2
{2}  6
{3}  6
{4}  4
{5}  8
{6}  5
{7}  7
{8}  4
{9}  2

L1 = { {2}:6, {3}:6, {4}:4, {5}:8, {6}:5, {7}:7, {8}:4 }
Apriori Algorithm Example(3)
K=2: candidate generation
C2 = { {2,3},{2,4},{2,5},{2,6},{2,7},{2,8}, {3,4},{3,5},{3,6},{3,7},{3,8}, {4,5},{4,6},{4,7},{4,8}, {5,6},{5,7},{5,8}, {6,7},{6,8}, {7,8} }
The pruning step does not change C2.
Read the database to count the support of the elements of C2 to get
L2 = { {2,3}:3, {2,4}:3, {3,5}:3, {3,7}:3, {5,6}:3, {5,7}:5, {6,7}:3 }
Apriori Algorithm Example(3)
K=3: candidate generation
Using {2,3} and {2,4}, we get {2,3,4}
Using {3,5} and {3,7}, we get {3,5,7}
Using {5,6} and {5,7}, we get {5,6,7}
So C3 = { {2,3,4}, {3,5,7}, {5,6,7} }
The pruning step prunes {2,3,4}, since not all of its subsets of size 2, i.e. {2,3}, {2,4}, {3,4}, are present in L2; the other two remain.
Thus C3 = { {3,5,7}, {5,6,7} }
Read the database to count the support of the itemsets in C3:
L3 = { {3,5,7}:3 }
Apriori Algorithm Example(3)
K=4
Since L3 contains only one element, C4 is empty and hence the algorithm stops.
Return the set of frequent sets along with their respective support values:
L = L1 ∪ L2 ∪ L3
Generating association Rule
Once the frequent itemsets have been found, we next generate strong association rules from them.
Strong association rules:
rules that satisfy both minimum support and minimum confidence.
confidence(X → Y) = (X ∪ Y).count / X.count
Generating association Rule
Example: L3 = { {3,5,7}:3 }
First find the non-empty proper subsets:
{3,5}, {3,7}, {5,7}, {3}, {5}, {7}
3 ∧ 5 → 7   confidence = count(3∧5∧7) / count(3∧5) = 3/3 = 100%
3 ∧ 7 → 5   confidence = 3/3 = 100%
5 ∧ 7 → 3   confidence = 3/5 = 60%
3 → 5 ∧ 7   confidence = 3/6 = 50%
5 → 3 ∧ 7   confidence = 3/8 = 37%
7 → 5 ∧ 3   confidence = 3/7 = 42%
Generating association Rule
Example: if the minimum confidence is 60%, then the rules generated are:
3 ∧ 5 → 7   confidence = 3/3 = 100%
3 ∧ 7 → 5   confidence = 3/3 = 100%
5 ∧ 7 → 3   confidence = 3/5 = 60%
The remaining rules (3 → 5∧7 at 50%, 5 → 3∧7 at 37%, 7 → 5∧3 at 42%) fall below the minimum confidence and are discarded.
Another example
Overview
The FP-tree contains a compressed representation of the transaction database.
A trie (prefix-tree) data structure is used.
Each transaction is a path in the tree – paths can overlap.
Once the FP-tree is constructed, the recursive, divide-and-conquer FP-Growth algorithm is used to enumerate all frequent itemsets.
FP-tree Construction
The FP-tree is a trie (prefix tree).

TID   Items
1     {A,B}
2     {B,C,D}
3     {A,C,D,E}
4     {A,D,E}
5     {A,B,C}
6     {A,B,C,D}
7     {B,C}
8     {A,B,C}
9     {A,B,D}
10    {B,C,E}

Since transactions are sets of items, we need to transform them into ordered sequences so that we can have prefixes. Otherwise, there is no common prefix between the sets {A,B} and {B,C,A}.
We need to impose an order on the items. Initially, assume a lexicographic ordering.
FP-tree Construction
Initially the tree is empty: just the null root.
FP-tree Construction
Reading transaction TID = 1: {A,B}
The path null → A:1 → B:1 is created.
Node label = item:support
Each node in the tree has a label consisting of the item and the support (the number of transactions that reach that node, i.e. follow that path).
FP-tree Construction
Reading transaction TID = 2: {B,C,D}
A new path null → B:1 → C:1 → D:1 is created.
Each transaction is a path in the tree.
We add pointers between nodes that refer to the same item.
FP-tree Construction
After reading transactions TID = 1 and 2 the tree has two branches: null → A:1 → B:1 and null → B:1 → C:1 → D:1.
Header Table: Item → Pointer, with one entry for each of A, B, C, D, E.
The Header Table and the pointers assist in computing the itemset support.
FP-tree Construction
Reading transaction TID = 3: {A,C,D,E}
The transaction shares the prefix A with an existing branch, so A's count is incremented to A:2 and a new path C:1 → D:1 → E:1 is attached under A.
Each transaction is a path in the tree.
FP-Tree Construction
Each transaction is a path in the tree. After reading the whole transaction database, the FP-tree is:

null
  A:7
    B:5
      C:3
        D:1
      D:1
    C:1
      D:1
        E:1
    D:1
      E:1
  B:3
    C:3
      D:1
      E:1

Header table: Item → Pointer, one entry for each of A, B, C, D, E.
Pointers are used to assist frequent itemset generation.
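A compact sketch of the construction: each transaction, with its items in a fixed order, is inserted as a path that shares prefixes with earlier transactions, and node-links chain together the nodes of the same item for the header table (class and function names are illustrative, not a reference implementation):

```python
class FPNode:
    def __init__(self, item, parent=None):
        self.item, self.count, self.parent = item, 0, parent
        self.children = {}      # item -> FPNode
        self.next = None        # node-link to the next node with the same item

def build_fp_tree(transactions, order):
    root, header = FPNode(None), {}
    for t in transactions:
        node = root
        for item in sorted(t, key=order.index):     # impose the item ordering
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, parent=node)
                node.children[item] = child
                child.next, header[item] = header.get(item), child   # extend node-link chain
            child.count += 1
            node = child
    return root, header

transactions = [{"A","B"}, {"B","C","D"}, {"A","C","D","E"}, {"A","D","E"},
                {"A","B","C"}, {"A","B","C","D"}, {"B","C"}, {"A","B","C"},
                {"A","B","D"}, {"B","C","E"}]
root, header = build_fp_tree(transactions, order=list("ABCDE"))
print(root.children["A"].count, root.children["B"].count)   # 7 3, matching A:7 and B:3
```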
FP-tree size
Every transaction is a path in the FP-tree
The size of the tree depends on the
compressibility of the data
Extreme case: all transactions are the same; the FP-tree is then a single branch.
Extreme case: all transactions are different; the size of the tree is then the same as that of the database (bigger, actually, since we need additional pointers).
Item ordering
The size of the tree also depends on the ordering of the items.
Heuristic: order the items according to their frequency, from larger to smaller.
We would need to do an extra pass over the dataset to count the frequencies.
Example: σ(A)=7, σ(B)=8, σ(C)=7, σ(D)=5, σ(E)=3, giving the ordering B, A, C, D, E.

TID   Items         TID   Items (reordered)
1     {A,B}         1     {B,A}
2     {B,C,D}       2     {B,C,D}
3     {A,C,D,E}     3     {A,C,D,E}
4     {A,D,E}       4     {A,D,E}
5     {A,B,C}       5     {B,A,C}
6     {A,B,C,D}     6     {B,A,C,D}
7     {B,C}         7     {B,C}
8     {A,B,C}       8     {B,A,C}
9     {A,B,D}       9     {B,A,D}
10    {B,C,E}       10    {B,C,E}
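A sketch of the heuristic: one extra pass counts the item frequencies, then every transaction is rewritten with its items sorted from most to least frequent (illustrative helpers, not from the slides):

```python
from collections import Counter

def frequency_order(transactions, min_count=0):
    """Items sorted from most to least frequent (optionally dropping rare items)."""
    freq = Counter(item for t in transactions for item in t)
    return [item for item, n in freq.most_common() if n >= min_count]

def reorder(transactions, order):
    """Rewrite each transaction with its items in the given order."""
    rank = {item: i for i, item in enumerate(order)}
    return [sorted((i for i in t if i in rank), key=rank.__getitem__)
            for t in transactions]

# With the counts above (B:8, A:7, C:7, D:5, E:3) the ordering is
# B, A, C, D, E; ties such as A and C are broken arbitrarily.
```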
Finding Frequent Itemsets
Input: The FP-tree
Output: All Frequent Itemsets and their
support
Method:
Divide and Conquer:
Consider all itemsets that end in: E, D, C, B, A
• For each possible ending item, consider the itemsets whose last item is one of the items preceding it in the ordering
• E.g., for E, consider all itemsets with last item D, C, B, A. This way we get all the itemsets ending in DE, CE, BE, AE
• Proceed recursively this way.
• Do this for all items.
Frequent Itemsets

(Figure: the lattice of all itemsets over {A, B, C, D, E}, grouped by their last item (E, D, C, B, A), then by the two-item suffixes DE, CE, BE, AE, CD, BD, AD, BC, AC, AB, and so on up to ABCDE. At each node we ask: is this itemset frequent?)

We can generate all itemsets this way.
We expect the FP-tree to be a lot smaller than this full lattice of itemsets.
Using the FP-tree to find frequent itemsets

(Figure: the complete FP-tree and header table built from the 10-transaction database above.)

Bottom-up traversal of the tree: first itemsets ending in E, then D, etc., each time a suffix-based class.
Finding Frequent Itemsets
Subproblem: find frequent itemsets ending in E.
(Figure: the FP-tree with the paths leading to the E nodes highlighted.)
We will then see how to compute the support for the possible itemsets.
Finding Frequent Itemsets
In the same way we consider the sub-problems for itemsets ending in D, then C, then B, then A, each time highlighting the corresponding paths of the FP-tree.
Algorithm
For each suffix X
Phase 1
Construct the prefix tree for X as shown before,
and compute the support using the header table
and the pointers

Phase 2
If X is frequent, construct the conditional FP-tree
for X in the following steps
1. Recompute support
2. Prune infrequent items
3. Prune leaves and recurse
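A rough sketch of this recursion, reusing the hypothetical `FPNode`, `build_fp_tree`, `frequency_order`, and `reorder` helpers sketched earlier. For each item in the header table it sums the counts along the node-links (Phase 1), records the itemset if frequent, then builds the conditional FP-tree from the item's prefix paths and recurses (Phase 2):

```python
def fp_growth(root, header, min_count, suffix=frozenset(), out=None):
    """Enumerate frequent itemsets from an FP-tree by suffix-based recursion."""
    out = {} if out is None else out
    for item in list(header):
        # Phase 1: support of {item} ∪ suffix = sum of counts over the node-links
        support, node = 0, header[item]
        while node is not None:
            support, node = support + node.count, node.next
        if support < min_count:
            continue
        new_suffix = suffix | {item}
        out[new_suffix] = support
        # Phase 2: collect the prefix paths of `item` (its conditional pattern base)
        paths, node = [], header[item]
        while node is not None:
            path, parent = [], node.parent
            while parent is not None and parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            paths.extend(set(path) for _ in range(node.count))
            node = node.next
        # build the conditional FP-tree for `item` and recurse on it
        if any(paths):
            cond_order = frequency_order(paths, min_count)
            cond_root, cond_header = build_fp_tree(reorder(paths, cond_order), cond_order)
            fp_growth(cond_root, cond_header, min_count, new_suffix, out)
    return out

# On the 10-transaction example with min_count = 2 this should report, among
# others, the E-suffix itemsets {E}, {D,E}, {C,E}, {A,E} found in the walkthrough.
```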
Example
Phase 1 – construct the prefix tree for suffix E.
Find all prefix paths that contain E.
(Figure: the FP-tree with the three paths ending in E highlighted.)
Prefix paths for E: {A,C,D,E}, {A,D,E}, {B,C,E}
Example
Phase 1 – construct the prefix tree: keep only the prefix paths that contain E.
(Figure: the reduced tree: null → A:7 → C:1 → D:1 → E:1, null → A:7 → D:1 → E:1, and null → B:3 → C:3 → E:1.)
Prefix paths for E: {A,C,D,E}, {A,D,E}, {B,C,E}
Example
Compute the support for E (minsup = 2).
How? Follow the pointers while summing up the counts of the E nodes: 1 + 1 + 1 = 3 > 2, so E is frequent.
{E} is frequent, so we can now consider the suffixes DE, CE, BE, AE.
Example
E is frequent, so we proceed with Phase 2.
Phase 2: convert the prefix tree of E into a conditional FP-tree.
Two changes:
(1) Recompute support
(2) Prune infrequent items
Example
Recompute support.
The support counts of some of the nodes include transactions that do not end in E. For example, in the path null → B → C → E the counts of B and C also include the {B, C} transactions that do not contain E.
The support of each node is set to the sum of the supports of the leaves with label E in its subtree.
Example
(Figure sequence: recomputing the supports on the prefix paths for E: C:3 becomes C:1, B:3 becomes B:1, and A:7 becomes A:2, since only that many of the transactions on each path actually contain E.)

Truncate: delete the nodes of E themselves, leaving the paths null → A:2 → C:1 → D:1, null → A:2 → D:1, and null → B:1 → C:1.
Example
Prune infrequent items: in the conditional FP-tree some nodes may have support less than minsup; e.g., B needs to be pruned.
This means that B appears together with E fewer than minsup times.
Example
(Figure: after pruning B, what remains is null → A:2 → C:1 → D:1, null → A:2 → D:1, and null → C:1.)

This is the conditional FP-tree for E.
We now repeat the algorithm for {D, E}, {C, E}, {A, E}.
Example
Phase 1: find all prefix paths that contain D (suffix DE) in the conditional FP-tree for E: A:2 → C:1 → D:1 and A:2 → D:1.

Compute the support of {D,E} by following the pointers in the tree:
1 + 1 = 2 ≥ 2 = minsup, so {D,E} is frequent.

Phase 2: construct the conditional FP-tree for {D,E}:
1. Recompute support
2. Prune nodes
Example
Recompute the supports (A becomes A:2, C stays at C:1), delete the D nodes, and prune C, whose support 1 is below minsup.

The final conditional FP-tree for {D,E} contains only A:2.
The support of A is ≥ minsup, so {A,D,E} is frequent.
Since the tree has a single node, we return to the next subproblem.
Example
(Back to the conditional FP-tree for E.) We repeat the algorithm for the next suffix, {C,E}.
Example
Phase 1: find all prefix paths that contain C (suffix CE) in the conditional FP-tree for E: A:2 → C:1 and C:1.

Compute the support of {C,E} by following the pointers in the tree:
1 + 1 = 2 ≥ 2 = minsup, so {C,E} is frequent.

Phase 2: construct the conditional FP-tree for {C,E}:
1. Recompute support
2. Prune nodes
Example
Recompute the supports (A becomes A:1), delete the C nodes, and prune A, whose support 1 is below minsup.
The conditional FP-tree for {C,E} is empty, so we return to the previous subproblem.
Example
(Back to the conditional FP-tree for E.) We repeat the algorithm for the last suffix, {A,E}.
Example
Phase 1: find all prefix paths that contain A (suffix AE) in the conditional FP-tree for E: just A:2.
Compute the support of {A,E} by following the pointers in the tree: 2 ≥ minsup, so {A,E} is frequent.
There is no conditional FP-tree for {A,E}, so this branch of the recursion ends.
Example
So for E we have the following frequent
itemsets
{E}, {D,E}, {C,E}, {A,E}

We proceed with D
Example
We now repeat the process for suffix D on the full FP-tree.

Phase 1 – construct the prefix tree: find all prefix paths that contain D. Following the pointers, the support of D is 5 > minsup, so D is frequent.

Phase 2 – convert the prefix tree into the conditional FP-tree for D.
(Figure sequence: the supports along the prefix paths are recomputed: B:5 becomes B:2, A:7 becomes A:3, C:3 becomes C:1, and B:3 becomes B:1; then the D nodes are deleted.)

We then construct the conditional FP-trees for {C,D}, {B,D}, {A,D}, and so on….
Observations
At each recursive step we solve a
subproblem
Construct the prefix tree
Compute the new support
Prune nodes
Subproblems are disjoint, so we never consider the same itemset twice.
Support computation is efficient – it happens together with the computation of the frequent itemsets.
Observations
The efficiency of the algorithm depends on the compaction factor of the dataset.
If the tree is bushy, then the algorithm does not work well: it greatly increases the number of subproblems that need to be solved.
