
Data Mining: Concepts and Techniques

Mining Frequent Patterns, Associations, and Correlations
What Is Frequent Pattern Analysis?

 Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
 First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining
 Motivation: finding inherent regularities in data
  What products were often purchased together?
  What are the subsequent purchases after buying a PC?
  What kinds of DNA are sensitive to this new drug?
  Can we automatically classify web documents?
What Is Frequent Pattern Analysis?
Applications

 Basket data analysis: Market basket analysis may provide the retailer with information to understand the purchase behavior of a buyer. This information enables the retailer to understand the buyer's needs and rearrange the store's layout accordingly, develop cross-promotional programs, or even attract new buyers.
 Cross-marketing: Cross-promotion is a form of marketing promotion where customers of one product or service are targeted with promotion of a related product. Cross-promotion may involve two or more companies working together in promoting a service or product, in a way that benefits both.
Why Is Frequent Pattern Mining Important?

 Frequent pattern: an intrinsic and important property of datasets
 Foundation for many essential data mining tasks
  Association, correlation, and causality analysis
  Sequential and structural (e.g., sub-graph) patterns
  Pattern analysis in multimedia, time-series, and stream data
  Classification: discriminative frequent pattern analysis
  Cluster analysis: frequent pattern-based clustering
Association Rule Mining

 Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction

Market-Basket transactions:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Association Rules:
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!
Definition: Frequent Itemset

 Itemset
  A collection of one or more items
  Example: {Milk, Bread, Diaper}
 k-itemset
  An itemset that contains k items
 Support count (σ)
  Frequency of occurrence of an itemset
  E.g. σ({Milk, Bread, Diaper}) = 2
 Support (s)
  Fraction of transactions that contain an itemset
  E.g. s({Milk, Bread, Diaper}) = 2/5
 Frequent Itemset
  An itemset whose support is greater than or equal to a minsup threshold

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
Definition: Association Rule

 Association Rule
  An implication expression of the form X → Y, where X and Y are itemsets
  Example: {Milk, Diaper} → {Beer}
 Rule Evaluation Metrics
  Support (s): fraction of transactions that contain both X and Y
  Confidence (c): measures how often items in Y appear in transactions that contain X

Example: {Milk, Diaper} → {Beer}

s = σ({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4
c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 ≈ 0.67

[Figure: Venn diagram of customers who buy milk and diaper, customers who buy beer, and customers who buy all three.]

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
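To make the two metrics concrete, the following minimal Python sketch (an illustration added here, not part of the original slides; support_count and rule_metrics are ad hoc names) recomputes s and c for {Milk, Diaper} → {Beer} on the five example transactions.

# Compute support and confidence for the rule {Milk, Diaper} -> {Beer}.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """sigma(itemset): number of transactions containing every item of itemset."""
    return sum(1 for t in transactions if itemset <= t)

def rule_metrics(X, Y, transactions):
    """Return (support, confidence) of the rule X -> Y."""
    s = support_count(X | Y, transactions) / len(transactions)
    c = support_count(X | Y, transactions) / support_count(X, transactions)
    return s, c

print(rule_metrics({"Milk", "Diaper"}, {"Beer"}, transactions))  # (0.4, 0.666...)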
Association Rule Mining Task

 Given a set of transactions T, the goal of association rule mining is to find all rules having
  support ≥ minsup threshold
  confidence ≥ minconf threshold

 Brute-force approach:
  List all possible association rules
  Compute the support and confidence for each rule
  Prune rules that fail the minsup and minconf thresholds
  Computationally prohibitive!
Mining Association Rules

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Rules:
{Milk, Diaper} → {Beer}   (s=0.4, c=0.67)
{Milk, Beer} → {Diaper}   (s=0.4, c=1.0)
{Diaper, Beer} → {Milk}   (s=0.4, c=0.67)
{Beer} → {Milk, Diaper}   (s=0.4, c=0.67)
{Diaper} → {Milk, Beer}   (s=0.4, c=0.5)
{Milk} → {Diaper, Beer}   (s=0.4, c=0.5)

Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but can have different confidence
• Thus, we may decouple the support and confidence requirements
Example

TID  date      items_bought
100  10/10/99  {F, A, D, B}
200  15/10/99  {D, A, C, E, B}
300  19/10/99  {C, A, B, E}
400  20/10/99  {B, A, D}

What are the support and confidence of the rule {B, D} → {A}?

 Support:
  percentage of tuples that contain {A, B, D} = 3/4 = 75%
 Confidence:
  (number of tuples that contain {A, B, D}) / (number of tuples that contain {B, D}) = 3/3 = 100%
Mining Association Rules

 Two-step approach:
  1. Frequent Itemset Generation
     – Generate all itemsets whose support ≥ minsup
  2. Rule Generation
     – Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset

 Frequent itemset generation is still computationally expensive
Frequent Itemset Generation

[Figure: itemset lattice rooted at null over items A–E, from the 1-itemsets A, …, E up to ABCDE.]

Given d items, there are 2^d possible candidate itemsets.
Brute-force algorithm for association-rule mining

• List all possible association rules
• Compute the support and confidence for each rule
• Prune rules that fail the minsup and minconf thresholds
• Computationally prohibitive!
Frequent Itemset Generation

 Brute-force approach:
  Each itemset in the lattice is a candidate frequent itemset
  Count the support of each candidate by scanning the database of N transactions against the list of M candidates
  Match each transaction against every candidate
  With maximum transaction width w, this requires on the order of N × M × w comparisons

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke
How many association rules are there?

 Given d unique items in I:
  Total number of itemsets = 2^d
  Total number of possible association rules:

    R = Σ_{k=1}^{d-1} [ C(d, k) × Σ_{j=1}^{d-k} C(d-k, j) ] = 3^d − 2^(d+1) + 1

 If d = 6, R = 602 rules
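As a quick check (not part of the slides), the closed form can be verified by direct enumeration in Python for d = 6:

# Verify R = 3^d - 2^(d+1) + 1 by direct enumeration for d = 6.
from math import comb

d = 6
R_enumerated = sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
                   for k in range(1, d))
R_closed_form = 3**d - 2**(d + 1) + 1
print(R_enumerated, R_closed_form)  # both print 602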


Frequent Itemset Generation Strategies

 Reduce the number of candidates (M)
  Complete search: M = 2^d
  Use pruning techniques to reduce M
 Reduce the number of transactions (N)
  Reduce the size of N as the size of the itemset increases
  Used by DHP and vertical-based mining algorithms
 Reduce the number of comparisons (N M)
  Use efficient data structures to store the candidates or transactions
  No need to match every candidate against every transaction
Reducing Number of Candidates

 Apriori principle:
  If an itemset is frequent, then all of its subsets must also be frequent

    ∀X, Y: (X ⊆ Y) ⇒ s(X) ≥ s(Y)

 The Apriori principle holds due to the following property of the support measure:
  The support of an itemset never exceeds the support of its subsets
  This is known as the anti-monotone property of support
Illustrating Apriori Principle

[Figure: itemset lattice over A–E. Once an itemset (e.g., AB) is found to be infrequent, all of its supersets are pruned.]
Closed Patterns and Max-Patterns

 An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X (proposed by Pasquier, et al. @ ICDT'99)
 An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X (proposed by Bayardo @ SIGMOD'98)
 Closed patterns are a lossless compression of frequent patterns
  Reducing the # of patterns and rules
Closed patterns

 An itemset is closed if none of its immediate supersets has the same support as the itemset.

Consider 4 transactions, and assume that minsup = 2:

TID  Items
10   a, b, c, d, e
20   a, b, d
30   b, e, a, c
40   b, c, d, e

 {b, c} is a frequent itemset: it appears in transactions 10, 30, and 40, so it has a support of 3. {b, c} is not a closed pattern because it is contained in a larger itemset, {b, c, e}, having the same support.
 {b, c, d} has a support of 2. It is also not a closed pattern because it is contained in a larger itemset, {b, c, d, e}, having the same support. {b, c, d, e} is a closed pattern because it is not included in any other pattern having the same support.
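To make the definitions operational, here is a small brute-force Python sketch (an illustration, not from the slides) that derives the frequent, closed, and maximal itemsets for the four transactions above with minsup = 2.

# Enumerate frequent, closed, and maximal itemsets (brute force, small data only).
from itertools import combinations

transactions = [set("abcde"), set("abd"), set("abce"), set("bcde")]
minsup = 2
items = sorted(set().union(*transactions))

def support(itemset):
    return sum(1 for t in transactions if itemset <= t)

frequent = {frozenset(c): support(set(c))
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if support(set(c)) >= minsup}

closed = {X for X, s in frequent.items()
          if not any(X < Y and frequent[Y] == s for Y in frequent)}
maximal = {X for X in frequent
           if not any(X < Y for Y in frequent)}

print(len(frequent), len(closed), len(maximal))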
Why are closed patterns interesting?

 Suppose s({A, B}) = s({A}), i.e., conf({A} → {B}) = 1
 Then for every itemset X, s({A} ∪ X) = s({A, B} ∪ X)
 No need to count the frequencies of the sets X ∪ {A, B} in the database!
 If there are lots of rules with confidence 1, then a significant amount of work can be saved
 Very useful if there are strong correlations between the items and when the transactions in the database are similar
Why are closed patterns interesting?

 Closed patterns and their frequencies alone are a sufficient representation of the frequencies of all frequent patterns

 Proof: Consider a frequent itemset X:
  X is closed ⇒ s(X) is known
  X is not closed ⇒ s(X) = max {s(Y) | Y is closed and X ⊂ Y}
Maximal patterns

 Frequent patterns without a proper frequent super-pattern
 Example (min_sup = 2): BCDE and ACD are max-patterns, but BCD is not a max-pattern, since its frequent superset BCDE exists
Too many frequent itemsets

 If there are frequent patterns with many items, enumerating all of them is costly.
 If {a1, …, a100} is a frequent itemset, then there are

    C(100, 1) + C(100, 2) + … + C(100, 100) = 2^100 − 1 ≈ 1.27 × 10^30

  frequent sub-patterns!
 Solution: Mine closed patterns and max-patterns instead
Maximal vs Closed sets

 Knowing all maximal patterns (and their frequencies) allows us to reconstruct the set of frequent patterns
 Knowing all closed patterns and their frequencies allows us to reconstruct the set of all frequent patterns and their frequencies

[Figure: Venn diagram with Maximal Frequent Itemsets ⊆ Closed Frequent Itemsets ⊆ Frequent Itemsets.]
Closed Patterns and Max-Patterns

 Exercise: Suppose a DB contains only two transactions
  <a1, …, a100>, <a1, …, a50>
  Let minsup = 1
 What is the set of closed itemsets?
  {a1, …, a100}: 1
  {a1, …, a50}: 2
 What is the set of max-patterns?
  {a1, …, a100}: 1
 What is the set of all patterns?
  {a1}: 2, …, {a1, a2}: 2, …, {a1, a51}: 1, …, {a1, a2, …, a100}: 1
  A big number: 2^100 − 1. Why?
Scalable Frequent Itemset Mining Methods

 Apriori: A Candidate Generation-and-Test Approach
 Improving the Efficiency of Apriori
 FPGrowth: A Frequent Pattern-Growth Approach
 ECLAT: Frequent Pattern Mining with Vertical Data Format
Illustrating Apriori Principle

Minimum Support = 3

Items (1-itemsets):
Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Pairs (2-itemsets) — no need to generate candidates involving Coke or Eggs:
Itemset          Count
{Bread, Milk}    3
{Bread, Beer}    2
{Bread, Diaper}  3
{Milk, Beer}     2
{Milk, Diaper}   3
{Beer, Diaper}   3

Triplets (3-itemsets):
Itemset                Count
{Bread, Milk, Diaper}  3

If every itemset is considered: 6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41 candidates
With support-based pruning: 6 + 6 + 1 = 13 candidates
The Apriori algorithm

 Uses prior knowledge of frequent itemset properties.
 It is an iterative algorithm, known as level-wise search or breadth-first search.
 The search proceeds level by level:
  First determine the set of frequent 1-itemsets: L1
  Then determine the set of frequent 2-itemsets using L1: L2
  Etc.
 Computing each level Li requires a scan over the n transactions of the database, i.e., O(n) per level.
 Reduction of the search space:
  In the worst case, what is the number of itemsets at a level Li?
  Apriori uses the "Apriori Property"
Apriori Property

 It is an anti-monotone property: if a set cannot pass a test, all of its supersets will fail the same test as well.
 It is called anti-monotone because the property is monotonic in the context of failing a test.
 All nonempty subsets of a frequent itemset must also be frequent.
  An itemset I is not frequent if it does not satisfy the minimum support threshold: s(I) < min_sup
  If an item A is added to the itemset I, then the resulting itemset I ∪ {A} cannot occur more frequently than I, so I ∪ {A} is not frequent either
  Therefore, s(I ∪ {A}) < min_sup
How does the Apriori algorithm use the "Apriori property"?

 In the computation of the itemsets in Lk using Lk-1
 It is done in two steps:
  Join
  Prune
The Apriori algorithm: Join Step

 The set of candidate k-itemsets, Ck, is generated by joining Lk-1 with itself: Lk-1 ⋈ Lk-1
 Given two itemsets l1 and l2 of Lk-1, with their items sorted:
    l1 = l1[1], l1[2], …, l1[k-2], l1[k-1]
    l2 = l2[1], l2[2], …, l2[k-2], l2[k-1]
 l1 and l2 are joined only if they differ in their last item (no duplicate generation), i.e.
    l1[1] = l2[1], l1[2] = l2[2], …, l1[k-2] = l2[k-2] and l1[k-1] < l2[k-1]
 The resulting candidate itemset is:
    l1[1], l1[2], …, l1[k-1], l2[k-1]
The Apriori algorithm: Prune Step

 Ck is a superset of Lk → an itemset in Ck may or may not be frequent
 Lk: test each generated itemset against the database:
  Scan the database to determine the count of each generated itemset and include those that have a count no less than the minimum support count.
  This may require intensive computation.
 Use the Apriori property to reduce the search space:
  Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset.
  Remove from Ck any k-itemset that has a (k-1)-subset not in Lk-1 (i.e., a subset that is not frequent).
  Efficient implementation: maintain a hash table of all frequent itemsets.
Example of Candidate Generation and Pruning

 L3 = {abc, abd, acd, ace, bcd}
 Self-joining: L3 ⋈ L3
  abcd from abc and abd
  acde from acd and ace
 Pruning:
  acde is removed because ade is not in L3
 C4 = {abcd}
Example (Supmin = 2)

Database TDB:
Tid  Items
10   A, C, D
20   B, C, E
30   A, B, C, E
40   B, E

1st scan → C1:        L1:
Itemset  sup          Itemset  sup
{A}      2            {A}      2
{B}      3            {B}      3
{C}      3            {C}      3
{D}      1            {E}      3
{E}      3

C2 (generated from L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}

2nd scan → C2 counts:  L2:
Itemset  sup           Itemset  sup
{A, B}   1             {A, C}   2
{A, C}   2             {B, C}   2
{A, E}   1             {B, E}   3
{B, C}   2             {C, E}   2
{B, E}   3
{C, E}   2

C3: {B, C, E}    3rd scan → L3: {B, C, E} with sup 2
Details: the algorithm

Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

Algorithm Apriori
  L1 = find_frequent_1-itemsets(D);
  for (k = 2; Lk-1 ≠ ∅; k++) {
      Ck = apriori_gen(Lk-1);            // candidates generated from Lk-1
      for each transaction t ∈ D {       // scan D for counts
          Ct = subset(Ck, t);            // get the subsets of t that are candidates
          for each candidate c ∈ Ct
              c.count++;
      }
      Lk = {c ∈ Ck | c.count ≥ min_sup}; // candidates in Ck with min support
  }
  return L = ∪k Lk;
apriori-gen function

Function apriori_gen(Lk-1: frequent (k-1)-itemsets)
  Ck = ∅;
  for all itemsets f1, f2 ∈ Lk-1
      with f1 = {i1, …, ik-2, ik-1}
      and  f2 = {i1, …, ik-2, i'k-1}
      and  ik-1 < i'k-1 {
      c = {i1, …, ik-1, i'k-1};          // join f1 and f2
      Ck = Ck ∪ {c};
      for each (k-1)-subset s of c {
          if (s ∉ Lk-1) then
              delete c from Ck;          // prune
      }
  }
  return Ck;
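The pseudocode above translates almost directly into Python. The sketch below (an illustration under the slide's definitions, not a reference implementation; apriori is a name chosen here) mines the small TDB example with min_sup = 2.

# Apriori sketch: candidate generation by self-join, pruning, then support counting.
from itertools import combinations

def apriori(transactions, min_sup):
    transactions = [frozenset(t) for t in transactions]
    # L1: frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            counts[frozenset([item])] = counts.get(frozenset([item]), 0) + 1
    Lk = {s for s, c in counts.items() if c >= min_sup}
    L = dict((s, counts[s]) for s in Lk)
    k = 2
    while Lk:
        # join: unions of two frequent (k-1)-itemsets of size k,
        # then prune candidates with an infrequent (k-1)-subset
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k}
        Ck = {c for c in Ck
              if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        # one pass over the database to count supports
        counts = {c: sum(1 for t in transactions if c <= t) for c in Ck}
        Lk = {c for c, n in counts.items() if n >= min_sup}
        L.update((c, counts[c]) for c in Lk)
        k += 1
    return L

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(apriori(tdb, min_sup=2))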
The Apriori Algorithm: Example

TID   List of Items
T100  I1, I2, I5
T200  I2, I4
T300  I2, I3
T400  I1, I2, I4
T500  I1, I3
T600  I2, I3
T700  I1, I3
T800  I1, I2, I3, I5
T900  I1, I2, I3

 Consider a database, D, consisting of 9 transactions.
 Suppose the minimum support count required is 2 (i.e. min_sup = 2/9 ≈ 22%).
 Let the minimum confidence required be 70%.
 We first find the frequent itemsets using the Apriori algorithm.
 Then, association rules will be generated using min. support & min. confidence.

Step 1: Generating 1-itemset Frequent Pattern

Scan D for the count of each candidate, then compare the candidate support counts with the minimum support count:

C1:                      L1:
Itemset  Sup.Count       Itemset  Sup.Count
{I1}     6               {I1}     6
{I2}     7               {I2}     7
{I3}     6               {I3}     6
{I4}     2               {I4}     2
{I5}     2               {I5}     2

• In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, C1.
• The set of frequent 1-itemsets, L1, consists of the candidate 1-itemsets satisfying minimum support.
Step 2: Generating 2-itemset Frequent Pattern

Generate C2 candidates from L1, scan D for the count of each candidate, then compare the candidate support counts with the minimum support count:

C2 (candidates):   C2 (with counts):     L2:
{I1, I2}           {I1, I2}  4           {I1, I2}  4
{I1, I3}           {I1, I3}  4           {I1, I3}  4
{I1, I4}           {I1, I4}  1           {I1, I5}  2
{I1, I5}           {I1, I5}  2           {I2, I3}  4
{I2, I3}           {I2, I3}  4           {I2, I4}  2
{I2, I4}           {I2, I4}  2           {I2, I5}  2
{I2, I5}           {I2, I5}  2
{I3, I4}           {I3, I4}  0
{I3, I5}           {I3, I5}  1
{I4, I5}           {I4, I5}  0
Step 2: Generating 2-itemset Frequent Pattern [Cont.]

 To discover the set of frequent 2-itemsets, L2, the algorithm uses L1 ⋈ L1 to generate a candidate set of 2-itemsets, C2.
 Next, the transactions in D are scanned and the support count for each candidate itemset in C2 is accumulated (as shown in the middle table).
 The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-itemsets in C2 having minimum support.
 Note: We haven't used the Apriori Property yet.
Step 3: Generating 3-itemset Frequent Pattern

Generate C3 candidates from L2, scan D for the count of each candidate, then compare with the minimum support count:

C3 (after pruning):   C3 (with counts):     L3:
{I1, I2, I3}          {I1, I2, I3}  2       {I1, I2, I3}  2
{I1, I2, I5}          {I1, I2, I5}  2       {I1, I2, I5}  2

 The generation of the set of candidate 3-itemsets, C3, involves use of the Apriori Property.
 In order to find C3, we compute L2 ⋈ L2.
 C3 = L2 ⋈ L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}.
 Now the Join step is complete, and the Prune step will be used to reduce the size of C3. The Prune step helps to avoid heavy computation due to large Ck.
Step 3: Generating 3-itemset Frequent Pattern [Cont.]

 Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we can determine that the four latter candidates cannot possibly be frequent. How?
 For example, let's take {I1, I2, I3}. Its 2-item subsets are {I1, I2}, {I1, I3} & {I2, I3}. Since all 2-item subsets of {I1, I2, I3} are members of L2, we keep {I1, I2, I3} in C3.
 Let's take another example, {I2, I3, I5}, which shows how the pruning is performed. Its 2-item subsets are {I2, I3}, {I2, I5} & {I3, I5}.
 BUT, {I3, I5} is not a member of L2 and hence is not frequent, violating the Apriori Property. Thus we have to remove {I2, I3, I5} from C3.
 Therefore, C3 = {{I1, I2, I3}, {I1, I2, I5}} after checking all members of the result of the Join operation for pruning.
 Now, the transactions in D are scanned in order to determine L3, consisting of those candidate 3-itemsets in C3 having minimum support.
Step 4: Generating 4-itemset Frequent Pattern

 The algorithm uses L3 ⋈ L3 to generate a candidate set of 4-itemsets, C4. Although the join results in {{I1, I2, I3, I5}}, this itemset is pruned since its subset {I2, I3, I5} is not frequent.
 Thus, C4 = φ, and the algorithm terminates, having found all of the frequent itemsets. This completes our Apriori Algorithm.
 What's Next? These frequent itemsets will be used to generate strong association rules (where strong association rules satisfy both minimum support & minimum confidence).
Step 5: Generating Association Rules from Frequent Itemsets

 Procedure:
  For each frequent itemset "l", generate all nonempty proper subsets of l.
  For every nonempty subset s of l, output the rule "s → (l − s)" if support_count(l) / support_count(s) >= min_conf, where min_conf is the minimum confidence threshold.
 Back To Example:
  We had L = {{I1}, {I2}, {I3}, {I4}, {I5}, {I1,I2}, {I1,I3}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}, {I1,I2,I3}, {I1,I2,I5}}.
  Let's take l = {I1, I2, I5}.
  Its nonempty proper subsets are {I1,I2}, {I1,I5}, {I2,I5}, {I1}, {I2}, {I5}.
Step 5: Generating Association Rules from Frequent Itemsets [Cont.]

 Let the minimum confidence threshold be, say, 70%.
 The resulting association rules are shown below, each listed with its confidence.
  R1: I1 ∧ I2 → I5
   Confidence = sc{I1,I2,I5} / sc{I1,I2} = 2/4 = 50%. R1 is rejected.
  R2: I1 ∧ I5 → I2
   Confidence = sc{I1,I2,I5} / sc{I1,I5} = 2/2 = 100%. R2 is selected.
  R3: I2 ∧ I5 → I1
   Confidence = sc{I1,I2,I5} / sc{I2,I5} = 2/2 = 100%. R3 is selected.
Step 5: Generating Association Rules from Frequent Itemsets [Cont.]

  R4: I1 → I2 ∧ I5
   Confidence = sc{I1,I2,I5} / sc{I1} = 2/6 = 33%. R4 is rejected.
  R5: I2 → I1 ∧ I5
   Confidence = sc{I1,I2,I5} / sc{I2} = 2/7 = 29%. R5 is rejected.
  R6: I5 → I1 ∧ I2
   Confidence = sc{I1,I2,I5} / sc{I5} = 2/2 = 100%. R6 is selected.

In this way, we have found three strong association rules.
Generating association rules from frequent itemsets

 Finding the frequent itemsets from transactions in a database D
 Generating strong association rules:
  Confidence(A ⇒ B) = P(B|A) = support_count(A ∪ B) / support_count(A)
  support_count(A ∪ B) – number of transactions containing the itemset A ∪ B
  support_count(A) – number of transactions containing the itemset A
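A minimal sketch of this rule-generation step (illustrative; generate_rules is an ad hoc name). It takes a dictionary mapping frequent itemsets to their support counts, such as the one produced by the earlier Apriori sketch, and emits every rule meeting min_conf.

# Rule generation: for each frequent itemset l, emit s -> (l - s)
# whenever support_count(l) / support_count(s) >= min_conf.
from itertools import combinations

def generate_rules(frequent, min_conf):
    """frequent: dict mapping frozenset itemsets to their support counts."""
    rules = []
    for l, count_l in frequent.items():
        if len(l) < 2:
            continue
        for r in range(1, len(l)):
            for s in combinations(l, r):
                s = frozenset(s)
                conf = count_l / frequent[s]
                if conf >= min_conf:
                    rules.append((set(s), set(l - s), conf))
    return rules

# Example: using the counts from the 9-transaction example for l = {I1, I2, I5}
frequent = {frozenset(["I1"]): 6, frozenset(["I2"]): 7, frozenset(["I5"]): 2,
            frozenset(["I1", "I2"]): 4, frozenset(["I1", "I5"]): 2,
            frozenset(["I2", "I5"]): 2, frozenset(["I1", "I2", "I5"]): 2}
for lhs, rhs, conf in generate_rules(frequent, min_conf=0.7):
    print(lhs, "->", rhs, f"conf={conf:.2f}")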
Challenges

 Multiple scans of the transaction database
 Huge number of candidates
 Tedious workload of support counting for candidates

 Improving Apriori: general ideas
  Reduce the number of passes over the transaction database
  Shrink the number of candidates
  Facilitate support counting of candidates
  Easily parallelized
Methods to Improve Apriori's Efficiency

 Hash-based itemset counting: A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent.
 Transaction reduction: A transaction that does not contain any frequent k-itemset is useless in subsequent scans.
 Partitioning: Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB.
 Sampling: mining on a subset of the given data, with a lowered support threshold + a method to determine the completeness.
 Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent.
Reducing Number of Comparisons

 Candidate counting:
  Scan the database of transactions to determine the support of each candidate itemset
  To reduce the number of comparisons, store the candidates in a hash structure
  Instead of matching each transaction against every candidate, match it against only the candidates contained in the hashed buckets

[Figure: the N transactions are streamed against a hash structure whose buckets hold the candidate itemsets.]
Improving the Efficiency of Apriori

 Several techniques have been introduced to improve the efficiency of Apriori:
  Hash-based technique: hashing itemset counts

 Example transaction DB:

TID   List of Transactions
T100  I1, I2, I5
T200  I2, I4
T300  I2, I3
T400  I1, I2, I4
T500  I1, I3
T600  I2, I3
T700  I1, I3
T800  I1, I2, I3, I5
T900  I1, I2, I3
Create a hash table for candidate 2-itemsets:

 Generate all 2-itemsets for each transaction in the transaction DB
 H(x, y) = ((order of x) × 10 + (order of y)) mod 7
 A 2-itemset whose corresponding bucket count is below the support threshold cannot be frequent.

Bucket @      0        1        2        3        4        5        6
Bucket count  2        2        4        2        2        4        4
Contents      {I1,I4}  {I1,I5}  {I2,I3}  {I2,I4}  {I2,I5}  {I1,I2}  {I1,I3}
              {I3,I5}  {I1,I5}  {I2,I3}  {I2,I4}  {I2,I5}  {I1,I2}  {I1,I3}
                                {I2,I3}                    {I1,I2}  {I1,I3}
                                {I2,I3}                    {I1,I2}  {I1,I3}

 Remember: support(x → y) = percentage of transactions that contain x and y. Therefore, if the minimum support count is 3, then the itemsets in buckets 0, 1, 3, and 4 cannot be frequent and should not be included in C2.
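A small Python sketch of this hash-based (DHP-style) filter (illustrative, not from the slides): every 2-itemset of every transaction is hashed into 7 buckets with the slide's hash function, and candidates whose bucket count falls below the minimum support count are discarded.

# DHP-style hashing of 2-itemsets into 7 buckets.
from itertools import combinations

db = [["I1","I2","I5"], ["I2","I4"], ["I2","I3"], ["I1","I2","I4"], ["I1","I3"],
      ["I2","I3"], ["I1","I3"], ["I1","I2","I3","I5"], ["I1","I2","I3"]]
order = lambda item: int(item[1:])          # "I3" -> 3, the item's order

def bucket(x, y):
    return (order(x) * 10 + order(y)) % 7   # H(x, y) from the slide

counts = [0] * 7
for t in db:
    for x, y in combinations(sorted(t, key=order), 2):
        counts[bucket(x, y)] += 1
print(counts)  # [2, 2, 4, 2, 2, 4, 4]

min_sup_count = 3
def may_be_frequent(x, y):
    return counts[bucket(x, y)] >= min_sup_count
print(may_be_frequent("I1", "I4"))  # False: bucket 0 has count 2 < 3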
How to Count Supports of Candidates?

 Why is counting supports of candidates a problem?
  The total number of candidates can be very large
  One transaction may contain many candidates
 Method:
  The candidate itemset Ck is stored in a hash-tree.
  A leaf node of the hash-tree contains a list of itemsets and counts.
  An interior node contains a hash table keyed by items (i.e., an item hashes to a bucket) and each bucket points to a child node at the next level.
  Subset function: finds all the candidates contained in a transaction.
  Increment the count per candidate and return the frequent ones.
Example: Using a Hash-Tree for Ck to Count Support

Store the candidate set C4 below in a hash-tree with a maximum of 2 itemsets per leaf node:

<a, b, c, d>
<a, b, e, f>
<a, b, h, j>
<a, d, e, f>
<b, c, e, f>
<b, d, f, h>
<c, e, g, k>
<c, f, g, h>

[Figure: the resulting hash-tree. At depth 0 the root hashes on the first item (a, b, or c); each interior node at depth d hashes on the (d+1)-th item; leaves hold at most 2 candidate suffixes.]

How to Build a Hash Tree on a Candidate Set

[Figure: building the hash-tree on the candidate set C4 above (max 2 itemsets per leaf node), inserting each candidate by hashing on successive items until a leaf is reached, and splitting leaves that overflow.]
How to Use a Hash-Tree for Ck to Count Support

For each transaction T, process T through the hash-tree to find the members of Ck contained in T and increment their counts. After all transactions are processed, eliminate the candidates with less than minimum support.

Example: Find the candidates in C4 contained in T = <a, b, c, e, f, g, h>

[Figure: T is pushed down the hash-tree; the counts of <a, b, e, f>, <b, c, e, f>, and <c, f, g, h> are incremented from 0 to 1, while the remaining candidates are not contained in T.]
Hash-Tree: Example

To locate the leaf for a candidate k1, k2, k3:
1) Depth 1: hash(k1)
2) Depth 2: hash(k2)
3) Depth 3: hash(k3)
Generate Hash Tree

Suppose there are 15 candidate itemsets of length 3:
{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}

We need:
 A hash function (here: items 1, 4, 7 / 2, 5, 8 / 3, 6, 9 map to three branches)
 A max leaf size: the maximum number of itemsets stored in a leaf node (if the number of candidate itemsets exceeds the max leaf size, split the node)

[Figure: the resulting candidate hash-tree with the 15 itemsets distributed over its leaves.]
Association Rule Discovery: Hash tree

[Figures: the candidate hash-tree built with the hash function 1,4,7 / 2,5,8 / 3,6,9. Hashing an item on 1, 4 or 7 follows the first branch, on 2, 5 or 8 the second branch, and on 3, 6 or 9 the third branch, at every level of the tree.]
Subset Operation

Given a transaction t = {1, 2, 3, 5, 6}, what are the possible subsets of size 3?

[Figure: systematic enumeration level by level. Level 1 fixes the first item (1, 2, or 3), level 2 fixes the second, and level 3 lists the resulting 3-item subsets: 123, 125, 126, 135, 136, 156, 235, 236, 256, 356.]
Subset Operation Using Hash Tree

[Figures: the transaction 1 2 3 5 6 is recursively split against the hash function (1,4,7 / 2,5,8 / 3,6,9): first into 1+2356, 2+356, 3+56, then into 12+356, 13+56, 15+6, etc., and each prefix is hashed down the candidate hash-tree to the leaves it can match.]

In the end, the transaction is matched against 11 out of the 15 candidates.
Transaction reduction

 Reduce the number of transactions scanned in future iterations.
 A transaction that does not contain any frequent k-itemset cannot contain any frequent (k+1)-itemset: do not include such a transaction in subsequent scans.
Partitioning

 First scan:
  Subdivide the transactions of database D into n non-overlapping partitions
  If the (relative) minimum support in D is min_sup, then the minimum support count for a partition is min_sup × the number of transactions in that partition
  Local frequent itemsets are determined in each partition
  A local frequent itemset may not be a frequent itemset in D
 Second scan:
  The frequent itemsets of D are determined from the local frequent itemsets
Reducing Scans via Partitioning

 Divide the dataset D into n non-overlapping partitions, D1, D2, …, Dn, so that each partition fits into memory.
 Find the frequent itemsets Fi in each Di, with support ≥ minSup.
  If an itemset is frequent in D, it must be frequent in some Di.
 The union of all Fi forms a candidate set of the frequent itemsets in D; get their counts.
 Often this requires only two scans of D.
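A sketch of this two-scan partition scheme (illustrative only; it reuses the apriori sketch given earlier, and partition_mine is a hypothetical helper name).

# Two-scan partition-based mining: scan 1 mines each partition locally,
# scan 2 counts the merged candidates over the whole database.
def partition_mine(transactions, min_sup_frac, n_parts):
    parts = [transactions[i::n_parts] for i in range(n_parts)]
    candidates = set()
    for part in parts:  # first scan: local frequent itemsets per partition
        local_min = max(1, int(min_sup_frac * len(part)))
        candidates |= set(apriori(part, local_min))   # apriori() from the earlier sketch
    # second scan: count the candidate union over the whole database
    frequent = {}
    global_min = min_sup_frac * len(transactions)
    for c in candidates:
        count = sum(1 for t in transactions if c <= set(t))
        if count >= global_min:
            frequent[c] = count
    return frequent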
Sampling and Dynamic itemset counting

 Sampling: select a sample of the original database and mine frequent patterns within the sample using Apriori
  Scan the database once to verify the frequent itemsets found in the sample; only the borders of the closure of the frequent patterns are checked
  Example: check abcd instead of ab, ac, …, etc.
  Scan the database again to find missed frequent patterns
 Dynamic itemset counting: adding candidate itemsets at different points during a scan
Alternative Methods for Frequent Itemset Generation

 Representation of the database: horizontal vs vertical data layout

Horizontal Data Layout          Vertical Data Layout (TID-lists)
TID  Items                      A: 1, 4, 5, 6, 7, 8, 9
1    A,B,E                      B: 1, 2, 5, 7, 8, 10
2    B,C,D                      C: 2, 3, 4, 5, 8, 9
3    C,E                        D: 2, 4, 5, 9
4    A,C,D                      E: 1, 3, 6
5    A,B,C,D
6    A,E
7    A,B
8    A,B,C
9    A,C,D
10   B
Factors Affecting Complexity

 Choice of minimum support threshold


 lowering support threshold results in more frequent itemsets
 this may increase number of candidates and max length of frequent
itemsets
 Dimensionality (number of items) of the data set
 more space is needed to store support count of each item
 if number of frequent items increases, both computation and I/O costs may
also increase
 Size of database
 since Apriori makes multiple passes, run time of algorithm may increase
with number of transactions
 Average transaction width
 transaction width increases with denser data sets
 This may increase max length of frequent itemsets and traversals of hash
tree (number of subsets in a transaction increases with its width)
Frequent Pattern Mining: An Example

Given a transaction database DB and a minimum support threshold ξ, find all frequent patterns (itemsets) with support no less than ξ.

Input DB:
TID  Items bought
100  {f, a, c, d, g, i, m, p}
200  {a, b, c, f, l, m, o}
300  {b, f, h, j, o}
400  {b, c, k, s, p}
500  {a, f, c, e, l, p, m, n}

Minimum support: ξ = 3
Output: all frequent patterns, i.e., f, a, …, fa, fac, fam, fm, am, …

Problem Statement: How can we efficiently find all frequent patterns?
Apriori: Candidate Generation and Test

 Main steps of the Apriori algorithm:
  Use frequent (k − 1)-itemsets (Lk-1) to generate candidate frequent k-itemsets Ck
  Scan the database and count each pattern in Ck to get the frequent k-itemsets Lk

 E.g., on the DB above, the Apriori iterations are:
  C1: f, a, c, d, g, i, m, p, l, o, h, j, k, s, b, e, n
  L1: f, a, c, m, b, p
  C2: fa, fc, fm, fp, ac, am, …, bp
  L2: fa, fc, fm, …
Performance Bottlenecks of Apriori

 The bottleneck of Apriori: candidate generation
 Huge candidate sets:
  For 10^4 frequent 1-itemsets, Apriori will generate more than 10^7 candidate 2-itemsets.
  To discover a frequent pattern of size 100, e.g., {a1, a2, …, a100}, one needs to generate 2^100 ≈ 10^30 candidates.
 Multiple scans of the database:
  Needs (n + 1) scans, where n is the length of the longest pattern.
Overview of FP-Growth: Ideas

 Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure
  Highly condensed, but complete for frequent pattern mining
  Avoids costly database scans
 Develop an efficient, FP-tree-based frequent pattern mining method
  A divide-and-conquer methodology: decompose mining tasks into smaller ones
  Avoid candidate generation, thus improving performance
Algorithm 1: FP-tree construction

Input: A transaction database DB and a minimum support threshold ξ.
Output: FP-tree, the frequent-pattern tree of DB.
Method: The FP-tree is constructed as follows.
1. Scan the transaction database DB once. Collect F, the set of frequent items, and the support of each frequent item. Sort F in support-descending order as FList, the list of frequent items.
2. Create the root of an FP-tree, T, and label it as "null". For each transaction Trans in DB do the following:
   i. Select the frequent items in Trans and sort them according to the order of FList. Let the sorted frequent-item list in Trans be [p | P], where p is the first element and P is the remaining list. Call insert_tree([p | P], T).
   ii. The function insert_tree([p | P], T) works as follows. If T has a child N such that N.item-name = p.item-name, then increment N's count by 1; else create a new node N with its count initialized to 1, its parent link pointing to T, and its node-link linked to the nodes with the same item-name via the node-link structure. If P is nonempty, call insert_tree(P, N) recursively.
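A compact Python sketch of this construction under the algorithm above (illustrative, not the reference implementation; FPNode, insert_tree and build_fp_tree are names chosen here).

# FP-tree construction following Algorithm 1.
from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 1, parent
        self.children = {}          # item -> FPNode
        self.node_link = None       # next node carrying the same item-name

def insert_tree(items, node, header):
    if not items:
        return
    p, rest = items[0], items[1:]
    child = node.children.get(p)
    if child:                        # shared prefix: just bump the count
        child.count += 1
    else:                            # new node, linked into the item's node-link chain
        child = node.children[p] = FPNode(p, node)
        child.node_link, header[p] = header.get(p), child
    insert_tree(rest, child, header)

def build_fp_tree(db, min_sup):
    freq = {i: c for i, c in Counter(i for t in db for i in t).items() if c >= min_sup}
    flist = sorted(freq, key=freq.get, reverse=True)       # support-descending order
    root, header = FPNode(None, None), {}
    for trans in db:
        ordered = [i for i in flist if i in trans]          # frequent items, in FList order
        insert_tree(ordered, root, header)
    return root, header

db = [set("facdgimp"), set("abcflmo"), set("bfhjo"), set("bcksp"), set("afcelpmn")]
root, header = build_fp_tree(db, min_sup=3)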
Step 1: FP-Tree Construction

The FP-Tree is constructed using 2 passes over the data set.

Pass 1:
 Scan the data and find the support of each item.
 Discard infrequent items.
 Sort the frequent items in decreasing order based on their support.
 Use this order when building the FP-Tree, so common prefixes can be shared.
Step 1: FP-Tree Construction

Pass 2:
Nodes correspond to items and have a counter
1. FP-Growth reads 1 transaction at a time and maps it to a path
2. Fixed order is used, so paths can overlap when transactions share
items (when they have the same prefix ).
 In this case, counters are incremented
3. Pointers are maintained between nodes containing the same item,
creating single linked lists (Here, shown by dotted lines)
 The more paths that overlap, the higher the compression. FP-tree may
fit in memory.
4. Frequent itemsets extracted from the FP-Tree.
Step 1: FP-Tree Construction (Example)

[Figure: the FP-tree grown transaction by transaction from the example data set.]
FP-Tree size

 The FP-Tree usually has a smaller size than the uncompressed data
- typically many transactions share items (and hence prefixes).
 Best case scenario: all transactions contain the same set of items.
 1 path in the FP-tree
 Worst case scenario: every transaction has a unique set of items (no
items in common)
 Size of the FP-tree is at least as large as the original data.
 Storage requirements for the FP-tree are higher - need to store the pointers
between the nodes and the counters.
 The size of the FP-tree depends on how the items are ordered
 Ordering by decreasing support is typically used but it does not
always lead to the smallest tree (it's a heuristic).
Step 2: Frequent Itemset Generation

 FP-Growth extracts frequent itemsets from the FP-tree.


 Bottom-up algorithm - from the leaves towards the root
 Divide and conquer: first look for frequent itemsets ending in e,
then de, etc. . . then d, then cd, etc. . .
 First, extract prefix path sub-trees ending in an item(set). (hint:
use the linked lists)
Prefix path sub-trees (Example)

[Figure: the prefix-path sub-trees extracted from the FP-tree for each item.]
Step 2: Frequent Itemset Generation

 Each prefix path sub-tree is processed recursively to extract the frequent itemsets. Solutions are then merged.
  E.g. the prefix path sub-tree for e will be used to extract frequent itemsets ending in e, then in de, ce, be and ae, then in ade, bde, cde, etc.
 Divide and conquer approach
Conditional FP-Tree

 The FP-Tree that would be built if we only consider transactions


containing a particular itemset (and then removing that itemset
from all transactions).
 Example: FP-Tree conditional on e.
Example

Let minSup = 2 and extract all frequent itemsets containing e.

1. Obtain the prefix path sub-tree for e:


Example

2. Check if e is a frequent item by adding the counts along the


linked list (dotted line). If so, extract it.
 Yes, count =3 so {e} is extracted as a frequent itemset.

3. As e is frequent, find frequent itemsets ending in e. i.e. de, ce, be


and ae.
Example

4. Use the conditional FP-tree for e to find frequent itemsets


ending in de, ce and ae
 Note that be is not considered as b is not in the conditional
FP-tree for e.
 For each of them (e.g. de), find the prefix paths from the
conditional tree for e, extract frequent itemsets, generate
conditional FP-tree, etc... (recursive)
Example

 Example: e -> de -> ade ({d,e}, {a,d,e} are found to be frequent)

 Example: e -> ce ({c,e} is found to be frequent)


Result

Frequent itemsets found (ordered by suffix and by the order in which they are found).
Construct FP-tree

Two Steps:
1. Scan the transaction DB for the first time, find frequent items (1-
item patterns) and order them into a list L in frequency
descending order.
e.g., L={f:4, c:4, a:3, b:3, m:3, p:3}
In the format of (item-name, support)
2. For each transaction, order its frequent items according to the
order in L; Scan DB the second time, construct FP-tree by putting
each frequency ordered transaction onto it.
Another FP-tree Example: Step 1

Step 1: Scan the DB for the first time to generate L (a by-product of the first scan of the database).

TID  Items bought
100  {f, a, c, d, g, i, m, p}
200  {a, b, c, f, l, m, o}
300  {b, f, h, j, o}
400  {b, c, k, s, p}
500  {a, f, c, e, l, p, m, n}

L (item : frequency):
f : 4
c : 4
a : 3
b : 3
m : 3
p : 3
FP-tree Example: Step 2

Step 2: Scan the DB for the second time and order the frequent items in each transaction.

TID  Items bought                (ordered) frequent items
100  {f, a, c, d, g, i, m, p}    {f, c, a, m, p}
200  {a, b, c, f, l, m, o}       {f, c, a, b, m}
300  {b, f, h, j, o}             {f, b}
400  {b, c, k, s, p}             {c, b, p}
500  {a, f, c, e, l, p, m, n}    {f, c, a, m, p}
FP-tree Example: Step 2 (construct FP-tree)

[Figures: the FP-tree grows as each ordered transaction is inserted. After {f, c, a, m, p}: a single path f:1–c:1–a:1–m:1–p:1. After {f, c, a, b, m}: the shared prefix f–c–a has its counts raised to 2 and a new branch b:1–m:1 is added. Inserting {f, b}, {c, b, p}, and the second {f, c, a, m, p} yields the final tree with root children f:4 and c:1. Each transaction corresponds to one path in the FP-tree; node-links connect nodes carrying the same item.]
Construction Example

Final FP-tree:

[Figure: root has children f:4 and c:1. Under f:4: c:3 → a:3, with a:3 branching into m:2 → p:2 and b:1 → m:1, plus a separate child b:1 under f:4. Under the root's c:1: b:1 → p:1. A header table with one entry per frequent item (f, c, a, b, m, p) points to the head of each item's node-link chain.]
FP-Tree structure

The frequent-pattern tree (FP-tree) is a compact structure that


stores quantitative information about frequent patterns in a
database.
1. One root labeled as “null“ with a set of item prefix sub-trees as the
children of the root, and a frequent-item header table.
2. Each node in the item prefix sub-trees has three fields:
 item-name : register which item this node represents,
 count, the number of transactions represented by the portion of the path
reaching this node,
 node-link that links to the next node in the FP-tree carrying the same item-
name, or null if there is none.
3. Each entry in the frequent-item header table has two fields,
 item-name, and
 head of node-link that points to the first node in the FP-tree carrying the item-
name.
Advantages of the FP-tree Structure

 The most significant advantage of the FP-tree


 Scan the DB twice only.
 Completeness:
 the FP-tree contains all the information related to mining
frequent patterns (given the min-support threshold). Why?
 Compactness:
 The size of the tree is bounded by the occurrences of frequent
items
 The height of the tree is bounded by the maximum number of
items in a transaction
Questions: Why descending order?

 If items are inserted in an unordered way, the two identical transactions below produce two separate paths instead of one shared path.

TID  (unordered) frequent items
100  {f, a, c, m, p}
500  {a, f, c, p, m}

[Figure: a tree with two distinct branches, f–a–c–m–p and a–f–c–p–m, each with count 1, even though the two transactions contain the same items.]
Questions: What about ascending order?

TID  (ascending) frequent items
100  {p, m, a, c, f}
200  {m, b, a, c, f}
300  {b, f}
400  {p, b, c}
500  {p, m, a, c, f}

[Figure: the tree built in ascending frequency order.]

This tree is larger than the FP-tree, because in the FP-tree more frequent items have a higher position, which results in fewer branches.
Mining Frequent Patterns Using FP-
tree

 General idea (divide-and-conquer)


Recursively grow frequent patterns using the FP-tree: looking for
shorter ones recursively and then concatenating the suffix:
 For each frequent item, construct its conditional pattern base,
and then its conditional FP-tree;
 Repeat the process on each newly created conditional FP-tree
until the resulting FP-tree is empty, or it contains only one path
(single path will generate all the combinations of its sub-paths,
each of which is a frequent pattern)
Mining Frequent Patterns Using FP-
tree

 General idea (divide-and-conquer)


• Recursively grow frequent patterns using the FP-tree: looking for shorter
ones recursively and then concatenating the suffix
 Method
• For each frequent item, construct its conditional pattern base, and
then its conditional FP-tree;
• Repeat the process on each newly created conditional FP-tree
• Until the resulting FP-tree is empty, or it contains only one path (single
path will generate all the combinations of its sub-paths, each of which
is a frequent pattern)
Major Steps to Mine FP-tree

Starting the processing from the end of list L:


1. Construct conditional pattern base for each item in the header table

2. Construct conditional FP-tree from each conditional pattern base

3. Recursively mine conditional FP-trees and grow frequent patterns


obtained so far

• If the conditional FP-tree contains a single path, simply enumerate


all the patterns
Step 1: Construct Conditional Pattern Base

 Start at the bottom of the frequent-item header table in the FP-tree
 Traverse the FP-tree by following the node-link of each frequent item
 Accumulate all transformed prefix paths of that item to form its conditional pattern base

Conditional pattern bases:
item  cond. pattern base
p     fcam:2, cb:1
m     fca:2, fcab:1
b     fca:1, f:1, c:1
a     fc:3
c     f:3
f     {}

[Figures: for each suffix item (p, m, b, a, c), the prefix paths collected from the FP-tree by following its node-links.]
Properties of FP-Tree

 Node-link property
 For any frequent item ai, all the possible frequent patterns that contain
ai can be obtained by following ai's node-links, starting from ai's head
in the FP-tree header.
 Prefix path property
 To calculate the frequent patterns for a node ai in a path P, only the
prefix sub-path of ai in P need to be accumulated, and its frequency
count should carry the same count as node ai.
Step 2: Construct Conditional FP-tree

 For each pattern base:
  Accumulate the count for each item in the base
  Construct the conditional FP-tree for the frequent items of the pattern base

Example: m's conditional pattern base is {fca:2, fcab:1}. Accumulating counts gives f:3, c:3, a:3 (b:1 is dropped as infrequent), so the m-conditional FP-tree is the single path f:3–c:3–a:3.
Step 3: Recursively mine the conditional FP-tree

[Figure: the recursion starting from the m-conditional FP-tree (f:3, c:3, a:3). Adding "a" gives the conditional FP-tree of "am" (f:3, c:3); adding "c" gives that of "cm" (f:3); adding "f" gives "fm". From "am" one obtains "cam" (f:3) and "fam"; from "cam" one obtains "fcam". Each step emits a frequent pattern: m, am, cm, fm, cam, fam, fcm, fcam.]
Principles of FP-Growth

 Pattern growth property
  Let α be a frequent itemset in DB, B be α's conditional pattern base, and β be an itemset in B. Then α ∪ β is a frequent itemset in DB iff β is frequent in B.
 Is "fcabm" a frequent pattern?
  "fcab" is a branch of m's conditional pattern base
  "b" is NOT frequent in the transactions containing "fcab"
  So "bm" is NOT a frequent itemset.
Conditional Pattern Bases and Conditional FP-Trees

Item  Conditional pattern base        Conditional FP-tree
p     {(fcam:2), (cb:1)}              {(c:3)} | p
m     {(fca:2), (fcab:1)}             {(f:3, c:3, a:3)} | m
b     {(fca:1), (f:1), (c:1)}         Empty
a     {(fc:3)}                        {(f:3, c:3)} | a
c     {(f:3)}                         {(f:3)} | c
f     Empty                           Empty

(items processed from the end of list L)
Single FP-tree Path Generation

 Suppose an FP-tree T has a single path P. The complete set of frequent patterns of T can be generated by enumerating all the combinations of the sub-paths of P.

Example: the m-conditional FP-tree is the single path f:3–c:3–a:3. All frequent patterns concerning m are m together with the combinations of {f, c, a} and m:
m, fm, cm, am, fcm, fam, cam, fcam
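A small Python sketch of this enumeration (illustrative; single_path_patterns is an ad hoc name): every non-empty subset of the single path's items, concatenated with the suffix m.

# Enumerate all patterns from a single-path conditional FP-tree.
from itertools import combinations

def single_path_patterns(path_items, suffix):
    """All frequent patterns formed by combining sub-paths of a single path with the suffix."""
    patterns = [set(suffix)]                      # the suffix itself, e.g. {m}
    for r in range(1, len(path_items) + 1):
        for combo in combinations(path_items, r):
            patterns.append(set(combo) | set(suffix))
    return patterns

print(single_path_patterns(["f", "c", "a"], ["m"]))
# {m}, {f,m}, {c,m}, {a,m}, {f,c,m}, {f,a,m}, {c,a,m}, {f,c,a,m}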
Summary of FP-Growth Algorithm

 Mining frequent patterns can be viewed as first mining 1-itemset


and progressively growing each 1-itemset by mining on its
conditional pattern base recursively

 Transform a frequent k-itemset mining problem into a sequence of


k frequent 1-itemset mining problems via a set of conditional
pattern bases
Efficiency Analysis

Facts: usually
1. FP-tree is much smaller than the size of the DB
2. Pattern base is smaller than original FP-tree
3. Conditional FP-tree is smaller than pattern base
 mining process works on a set of usually much smaller pattern
bases and conditional FP-trees
 Divide-and-conquer and dramatic scale of shrinking
Exam Questions

Q1: What are the main drawbacks of Apriori-like approaches, and why?

A: The main disadvantages of Apriori-like approaches are:
1. It is costly to generate the candidate sets;
2. It incurs multiple scans of the database.
The reason is that Apriori is based on the following heuristic/downward-closure property: if any length-k pattern is not frequent in the database, no length-(k+1) super-pattern can ever be frequent.
The two steps in Apriori are candidate generation and test. If the set of frequent 1-itemsets is huge, then the generation of successive candidate itemsets is quite costly, and so is the testing.
Exam Questions

Q2: What is an FP-Tree?

Previous answer: An FP-Tree is a tree data structure that represents the database in a compact way. It is constructed by mapping each frequency-ordered transaction onto a path in the FP-Tree.

Other answer: An FP-Tree is an extended prefix-tree structure that represents the transaction database in a compact and complete way. Only frequent length-1 items have nodes in the tree, and the tree nodes are arranged in such a way that more frequently occurring nodes have better chances of sharing nodes than less frequently occurring ones. Each transaction in the database is mapped to one path in the FP-Tree.
Exam Questions

Q3: What is the most significant advantage of the FP-Tree? Why is the FP-Tree complete with respect to frequent pattern mining?

 A: Efficiency: the most significant advantage of the FP-tree is that it requires only two scans of the underlying database to construct the FP-tree. This efficiency is especially apparent for databases with prolific and long patterns, or when mining frequent patterns with a low support threshold.
 As each transaction in the database is mapped to one path in the FP-Tree, the frequent-itemset information of each transaction is completely stored in the FP-Tree. Moreover, one path in the FP-Tree may represent frequent itemsets of multiple transactions without ambiguity, since the path representing each transaction must start from the root of its item-prefix sub-tree.
FP-Growth Method: An Example

TID   List of Items
T100  I1, I2, I5
T200  I2, I4
T300  I2, I3
T400  I1, I2, I4
T500  I1, I3
T600  I2, I3
T700  I1, I3
T800  I1, I2, I3, I5
T900  I1, I2, I3

 Consider the same database, D, of 9 transactions as in the previous example.
 Suppose the minimum support count required is 2 (i.e. min_sup = 2/9 ≈ 22%).
 The first scan of the database is the same as in Apriori, deriving the set of 1-itemsets and their support counts.
 The set of frequent items is sorted in order of descending support count.
 The resulting set is denoted L = {I2:7, I1:6, I3:6, I4:2, I5:2}.
FP-Growth Method: Construction of FP-Tree

[Figure: the FP-tree for D. The root "null" has children I2:7 and I1:2. Under I2:7 are I1:4, I3:2, and I4:1; under I1:4 are I3:2, I4:1, and I5:1; a further I5:1 hangs below that I3:2. Under the root's I1:2 branch is I3:2. The header table lists each item (I2:7, I1:6, I3:6, I4:2, I5:2) with a node-link into the tree.]

An FP-Tree that registers compressed, frequent pattern information.
Mining the FP-Tree by Creating
Conditional (sub) pattern bases
Steps:
1. Start from each frequent length-1 pattern (as an initial suffix
pattern).
2. Construct its conditional pattern base which consists of the set of
prefix paths in the FP-Tree co-occurring with suffix pattern.
3. Then, Construct its conditional FP-Tree & perform mining on such
a tree.
4. The pattern growth is achieved by concatenation of the suffix
pattern with the frequent patterns generated from a conditional
FP-Tree.
5. The union of all frequent patterns (generated by step 4) gives the
required frequent itemset.
Finding frequent itemsets without
candidate generation
 The first scan of the database is the same as Apriori, which
derives the set of frequent items (1-itemsets) and their support
counts (frequencies).
 Let the minimum support count be 2.
 The set of frequent items is sorted in the order of descending
support count.
 This resulting set or list is denoted by L. Thus, L= {I2:7, I1:6,
I3:6, I4:2, I5:2}
Finding frequent itemsets without
candidate generation
An FP-tree is then constructed as follows.
 First, create the root of the tree, labeled with “null”.
 Scan database D a second time. The items in each transaction are
processed in L order (i.e., sorted according to descending support
count), and a branch is created for each transaction.
 For example, the scan of the first transaction, “T100: I1, I2, I5,”
which contains three items (I2, I1, I5 in L order), leads to the
construction of the first branch of the tree with three nodes, <I2:
1>, <I1: 1>, and <I5: 1>, where I2 is linked as a child to the root,
I1 is linked to I2, and I5 is linked to I1.
Finding frequent itemsets without
candidate generation
 The second transaction, T200, contains the items I2 and I4 in L
order, which would result in a branch where I2 is linked to the root
and I4 is linked to I2.
 However, this branch would share a common prefix, I2, with the
existing path for T100.
 Therefore, we instead increment the count of the I2 node by 1, and
create a new node, <I4: 1>, which is linked as a child to<I2: 2>.
 In general, when considering the branch to be added for a
transaction, the count of each node along a common prefix is
incremented by 1, and nodes for the items following the prefix are
created and linked accordingly.
Finding frequent itemsets without
candidate generation
 To facilitate tree traversal, an item header table is built so that
each item points to its occurrences in the tree via a chain of
node-links.
 The tree obtained after scanning all the transactions is shown
with the associated node-links.
 In this way, the problem of mining frequent patterns in
databases is transformed into that of mining the FP-tree.
Finding frequent itemsets without
candidate generation
The FP-tree is mined as follows.
 Start from each frequent length-1 pattern (as an initial suffix
pattern), construct its conditional pattern base (a “sub-
database,” which consists of the set of prefix paths in the FP-tree
co-occurring with the suffix pattern), then construct its (conditional)
FP-tree, and perform mining recursively on the tree.
 The pattern growth is achieved by the concatenation of the suffix
pattern with the frequent patterns generated from a conditional FP-
tree.
FP-Tree Example Continued

Item  Conditional pattern base         Conditional FP-Tree        Frequent patterns generated
I5    {(I2 I1: 1), (I2 I1 I3: 1)}      <I2: 2, I1: 2>             I2 I5: 2, I1 I5: 2, I2 I1 I5: 2
I4    {(I2 I1: 1), (I2: 1)}            <I2: 2>                    I2 I4: 2
I3    {(I2 I1: 2), (I2: 2), (I1: 2)}   <I2: 4, I1: 2>, <I1: 2>    I2 I3: 4, I1 I3: 4, I2 I1 I3: 2
I1    {(I2: 4)}                        <I2: 4>                    I2 I1: 4

Mining the FP-Tree by creating conditional (sub-)pattern bases:

Let's start from I5.
 I5 is involved in 2 branches, namely {I2 I1 I5: 1} and {I2 I1 I3 I5: 1}.
 Therefore, considering I5 as the suffix, its 2 corresponding prefix paths are {I2 I1: 1} and {I2 I1 I3: 1}, which form its conditional pattern base.
FP-Tree Example Continued

 Out of these, only I1 & I2 are selected for the conditional FP-Tree, because I3 does not satisfy the minimum support count:
  For I1, the support count in the conditional pattern base = 1 + 1 = 2
  For I2, the support count in the conditional pattern base = 1 + 1 = 2
  For I3, the support count in the conditional pattern base = 1
  Thus the support count for I3 is less than the required min_sup, which is 2 here.
 Now we have the conditional FP-Tree.
 All frequent patterns corresponding to suffix I5 are generated by considering all possible combinations of I5 and the conditional FP-Tree.
 The same procedure is applied to suffixes I4, I3 and I1.
 Note: I2 is not taken into consideration as a suffix because it doesn't have any prefix at all.
FP-Tree Example Continued

 For I4, its two prefix paths form the conditional pattern base, <I2 I1:
1>, <I2: 1>, which generates a single-node conditional FP-tree, <I2:
2>, and derives one frequent pattern, <I2, I4: 2>.
 Similar to the preceding analysis, I3's conditional pattern base is <I2,
I1: 2>, <I2: 2>, <I1: 2>.
 Its conditional FP-tree has two branches, <I2: 4, I1: 2> and <I1: 2>,
as shown , which generates the set of patterns <I2, I3: 4>, <I1, I3:
4>, <I2, I1, I3: 2>.
 Finally, I1's conditional pattern base is <I2: 4>, with an FP-tree that
contains only one node, <I2: 4>, which generates one frequent
pattern, <I2, I1: 4>.
Benefits of the FP-tree Structure

 Completeness
 Preserve complete information for frequent pattern mining
 Never break a long pattern of any transaction
 Compactness
 Reduce irrelevant info—infrequent items are gone
 Items in frequency descending order: the more frequently
occurring, the more likely to be shared
 Never be larger than the original database (not count node-links
and the count field)
The Frequent Pattern Growth Mining
Method

 Idea: Frequent pattern growth


 Recursively grow frequent patterns by pattern and database
partition
 Method
 For each frequent item, construct its conditional pattern-base,
and then its conditional FP-tree
 Repeat the process on each newly created conditional FP-tree
 Until the resulting FP-tree is empty, or it contains only one path
—single path will generate all the combinations of its sub-paths,
each of which is a frequent pattern
Why Frequent Pattern Growth Fast ?

 Performance study shows


 FP-growth is an order of magnitude faster than Apriori,
and is also faster than tree-projection
 Reasoning
 No candidate generation, no candidate test
 Use compact data structure
 Eliminate repeated database scan
 Basic operation is counting and FP-tree building
Another Example

First scan: determine frequent 1-itemsets, then build the header table.

TID  Items           Item  Count
1    {A,B}           B     8
2    {B,C,D}         A     7
3    {A,C,D,E}       C     7
4    {A,D,E}         D     5
5    {A,B,C}         E     3
6    {A,B,C,D}
7    {B,C}
8    {A,B,C}
9    {A,B,D}
10   {B,C,E}
FP-tree construction

After reading TID=1 ({A,B}): the tree is null → A:1 → B:1.
After reading TID=2 ({B,C,D}): a second branch null → B:1 → C:1 → D:1 is added.
FP-Tree Construction

[Figure: the final FP-tree for the 10-transaction database. The root has children A:7 and B:3. Under A:7: B:5, C:1 (→ D:1 → E:1), and D:1 (→ E:1); under A–B:5: C:3 (→ D:1) and D:1; under the root's B:3: C:3 with children D:1 and E:1. A header table over A, B, C, D, E holds pointers that are used to assist frequent itemset generation.]
FP-growth

Conditional pattern base for D:
P = {(A:1, B:1, C:1), (A:1, B:1), (A:1, C:1), (A:1), (B:1, C:1)}

Recursively apply FP-growth on P.

Frequent itemsets found (with sup > 1): AD, BD, CD, ACD, BCD
Tree Projection

Set enumeration tree (items in lexicographic order):

[Figure: the set-enumeration tree rooted at null over items A–E. Possible extensions: E(A) = {B, C, D, E}; E(ABC) = {D, E}.]
Tree Projection
 Items are listed in lexicographic order
 Each node P stores the following information:
 Itemset for node P
 List of possible lexicographic extensions of P: E(P)
 Pointer to the projected database of its ancestor node
 Bitvector indicating which transactions in the projected database contain the itemset
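As a rough illustration, such a node could be represented as follows; the class and field names are hypothetical, not taken from the tree-projection paper.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ProjectionNode:
    itemset: tuple                               # itemset represented by this node, e.g. ('A', 'B')
    extensions: List[str]                        # E(P): items that may lexicographically extend the itemset
    parent: Optional["ProjectionNode"] = None    # ancestor whose projected database is reused
    bitvector: int = 0                           # bit i set <=> projected transaction i contains the itemset

# Example: node AB with possible extensions C, D, E, covering projected transactions 0 and 4
ab = ProjectionNode(itemset=('A', 'B'), extensions=['C', 'D', 'E'], bitvector=0b10001)
print(ab.itemset, ab.extensions, bin(ab.bitvector))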
Projected Database
 Original database vs. the projected database for node A:

TID   Original items    Projected at node A
1     {A,B}             {B}
2     {B,C,D}           {}
3     {A,C,D,E}         {C,D,E}
4     {A,D,E}           {D,E}
5     {A,B,C}           {B,C}
6     {A,B,C,D}         {B,C,D}
7     {B,C}             {}
8     {A,B,C}           {B,C}
9     {A,B,D}           {B,D}
10    {B,C,E}           {}

 For each transaction T that contains A, the projected transaction at node A is T ∩ E(A); transactions without A project to the empty set (a small sketch follows below).
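A small sketch of this projection, assuming E(A) = {B, C, D, E} as above; the dictionary-comprehension form is just one possible implementation.

transactions = {
    1: {'A', 'B'}, 2: {'B', 'C', 'D'}, 3: {'A', 'C', 'D', 'E'}, 4: {'A', 'D', 'E'},
    5: {'A', 'B', 'C'}, 6: {'A', 'B', 'C', 'D'}, 7: {'B', 'C'}, 8: {'A', 'B', 'C'},
    9: {'A', 'B', 'D'}, 10: {'B', 'C', 'E'},
}
extensions_of_a = {'B', 'C', 'D', 'E'}   # E(A)

# Keep T ∩ E(A) for transactions containing A; others project to the empty set
projected = {tid: (t & extensions_of_a if 'A' in t else set())
             for tid, t in transactions.items()}
print(sorted(projected[3]), projected[2])  # ['C', 'D', 'E'] set()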
ECLAT
 For each item, store a list of transaction ids (tids)

Horizontal data layout:
TID  Items
1    A,B,E
2    B,C,D
3    C,E
4    A,C,D
5    A,B,C,D
6    A,E
7    A,B
8    A,B,C
9    A,C,D
10   B

Vertical data layout (TID-lists):
A: 1, 4, 5, 6, 7, 8, 9
B: 1, 2, 5, 7, 8, 10
C: 2, 3, 4, 5, 8, 9
D: 2, 4, 5, 9
E: 1, 3, 6
ECLAT
 Determine the support of any k-itemset by intersecting the tid-lists of two of its (k-1)-subsets.

A: 1, 4, 5, 6, 7, 8, 9
B: 1, 2, 5, 7, 8, 10
A ∩ B: 1, 5, 7, 8
 3 traversal approaches:
 top-down, bottom-up and hybrid
 Advantage: very fast support counting
 Disadvantage: intermediate tid-lists may become too large for
memory
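The following sketch combines both ECLAT steps on the example above: it builds the vertical tid-lists from the horizontal layout and intersects two of them to obtain a support count; variable names are illustrative.

from collections import defaultdict

horizontal = {
    1: {'A', 'B', 'E'}, 2: {'B', 'C', 'D'}, 3: {'C', 'E'}, 4: {'A', 'C', 'D'},
    5: {'A', 'B', 'C', 'D'}, 6: {'A', 'E'}, 7: {'A', 'B'}, 8: {'A', 'B', 'C'},
    9: {'A', 'C', 'D'}, 10: {'B'},
}

# Convert to the vertical layout: item -> set of transaction ids
tidlists = defaultdict(set)
for tid, items in horizontal.items():
    for item in items:
        tidlists[item].add(tid)

# Support of {A, B} = size of the intersection of the two tid-lists
ab_tids = tidlists['A'] & tidlists['B']
print(sorted(ab_tids), len(ab_tids))  # [1, 5, 7, 8] 4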
Rule Generation
 Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L − f satisfies the minimum confidence requirement
 If {A,B,C,D} is a frequent itemset, the candidate rules are:
ABC → D, ABD → C, ACD → B, BCD → A,
A → BCD, B → ACD, C → ABD, D → ABC,
AB → CD, AC → BD, AD → BC, BC → AD,
BD → AC, CD → AB
 If |L| = k, then there are 2^k − 2 candidate association rules (ignoring L → ∅ and ∅ → L)
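A minimal sketch of this enumeration; candidate_rules is an illustrative helper rather than a library function.

from itertools import combinations

def candidate_rules(itemset):
    """All non-empty proper subsets f of the itemset, as candidate rules f -> itemset - f."""
    items = set(itemset)
    rules = []
    for r in range(1, len(items)):                 # proper, non-empty antecedents only
        for antecedent in combinations(sorted(items), r):
            consequent = items - set(antecedent)
            rules.append((antecedent, tuple(sorted(consequent))))
    return rules

rules = candidate_rules({'A', 'B', 'C', 'D'})
print(len(rules))   # 14 = 2**4 - 2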
Rule Generation
 How to efficiently generate rules from frequent itemsets?
 In general, confidence does not have an anti-monotone property:
c(ABC → D) can be larger or smaller than c(AB → D)
 But the confidence of rules generated from the same itemset does have an anti-monotone property
 e.g., for L = {A,B,C,D}:
c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
 Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule
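A small numeric check of this chain, using hypothetical support counts chosen only for illustration: the three rules share the numerator sup(ABCD), and sup(ABC) ≤ sup(AB) ≤ sup(A), so the confidence can only shrink as items move to the RHS.

# Hypothetical support counts, for illustration only
sup = {'A': 60, 'AB': 40, 'ABC': 25, 'ABCD': 20}

c_abc_d = sup['ABCD'] / sup['ABC']   # c(ABC -> D)  = 0.80
c_ab_cd = sup['ABCD'] / sup['AB']    # c(AB  -> CD) = 0.50
c_a_bcd = sup['ABCD'] / sup['A']     # c(A   -> BCD) ≈ 0.33

assert c_abc_d >= c_ab_cd >= c_a_bcd
print(c_abc_d, c_ab_cd, round(c_a_bcd, 2))  # 0.8 0.5 0.33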
Correlation Concepts
 Two itemsets A and B are independent (the occurrence of A is independent of the occurrence of itemset B) iff
P(A ∪ B) = P(A) × P(B)
 Otherwise A and B are dependent and correlated
 The measure of correlation between A and B is given by the formula:
corr(A,B) = P(A ∪ B) / (P(A) × P(B))
Correlation Concepts [Cont.]
 corr(A,B) > 1 means that A and B are positively correlated, i.e. the occurrence of one makes the occurrence of the other more likely.
 corr(A,B) < 1 means that the occurrence of A is negatively correlated with (i.e. discourages) the occurrence of B.
 corr(A,B) = 1 means that A and B are independent and there is no correlation between them.
Association & Correlation
 The correlation formula can be rewritten as
corr(A,B) = P(B|A) / P(B)
 We already know that
Support(A ⇒ B) = P(A ∪ B)
Confidence(A ⇒ B) = P(B|A)
 It follows that Confidence(A ⇒ B) = corr(A,B) × P(B)
 So correlation, support and confidence are all different, but correlation provides extra information about the association rule (A ⇒ B).
 We say that the correlation corr(A,B) provides the LIFT of the association rule (A ⇒ B), i.e. A is said to increase (or lift) the likelihood of B by the factor corr(A,B).
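As a worked example, the support, confidence and lift of the rule A ⇒ B can be computed on the ten-transaction database used in the FP-tree example above; the numbers follow directly from the formulas on this slide.

transactions = [
    {'A', 'B'}, {'B', 'C', 'D'}, {'A', 'C', 'D', 'E'}, {'A', 'D', 'E'},
    {'A', 'B', 'C'}, {'A', 'B', 'C', 'D'}, {'B', 'C'}, {'A', 'B', 'C'},
    {'A', 'B', 'D'}, {'B', 'C', 'E'},
]
n = len(transactions)

p_a = sum('A' in t for t in transactions) / n             # P(A)   = 0.7
p_b = sum('B' in t for t in transactions) / n             # P(B)   = 0.8
p_ab = sum({'A', 'B'} <= t for t in transactions) / n     # P(A∪B) = 0.5 (both items present)

support = p_ab                     # 0.5
confidence = p_ab / p_a            # P(B|A) ≈ 0.714
lift = p_ab / (p_a * p_b)          # ≈ 0.893 < 1: A and B are slightly negatively correlated
print(round(support, 2), round(confidence, 3), round(lift, 3))  # 0.5 0.714 0.893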
Correlation Rules
 A correlation rule is a set of items {i1, i2, ..., in} whose occurrences are correlated.
 The correlation value is given by the correlation formula, and the χ² (chi-square) test is used to determine whether the correlation is statistically significant. The χ² test can also detect negative correlation. We can also form minimal correlated itemsets, etc.
 Limitations: the χ² test is less accurate on sparse data tables and can be misleading for contingency tables larger than 2×2.
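A minimal sketch of the χ² statistic on a 2×2 contingency table for two items A and B; the observed counts are hypothetical, and 3.84 is the usual 5% critical value for one degree of freedom.

# Hypothetical 2x2 contingency table of transaction counts:
#               B present   B absent
# A present        400         350
# A absent         200          50
observed = [[400, 350], [200, 50]]

total = sum(sum(row) for row in observed)
row_sums = [sum(row) for row in observed]
col_sums = [sum(col) for col in zip(*observed)]

chi2 = 0.0
for i in range(2):
    for j in range(2):
        expected = row_sums[i] * col_sums[j] / total
        chi2 += (observed[i][j] - expected) ** 2 / expected

print(round(chi2, 1))   # ≈ 55.6 for these counts
print(chi2 > 3.84)      # significant at the 5% level (1 degree of freedom)? -> True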
Partition: Scan Database Only Twice
 Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
 Scan 1: partition the database and find the local frequent patterns
 Scan 2: consolidate the global frequent patterns
 A. Savasere, E. Omiecinski, and S. Navathe, VLDB'95

DB = DB1 + DB2 + ... + DBk
If sup1(i) < σ|DB1|, sup2(i) < σ|DB2|, ..., supk(i) < σ|DBk|, then sup(i) < σ|DB|
(an item infrequent in every partition cannot be frequent in the whole database; a partition-based sketch follows below)
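A minimal sketch of the two-scan partition idea; the local_frequent helper below uses brute-force subset counting purely for illustration, whereas a real implementation would run Apriori (or FP-growth) on each partition.

from collections import Counter
from itertools import combinations

def local_frequent(partition, sigma):
    """Brute force: all itemsets whose relative support in this partition is >= sigma."""
    counts = Counter()
    for t in partition:
        for r in range(1, len(t) + 1):
            for itemset in combinations(sorted(t), r):
                counts[itemset] += 1
    return {i for i, c in counts.items() if c >= sigma * len(partition)}

def partition_mine(db, k, sigma):
    size = (len(db) + k - 1) // k
    partitions = [db[i:i + size] for i in range(0, len(db), size)]

    # Scan 1: the union of locally frequent itemsets is a superset of the global ones
    candidates = set()
    for p in partitions:
        candidates |= local_frequent(p, sigma)

    # Scan 2: count every candidate over the whole database
    return {c for c in candidates
            if sum(set(c) <= t for t in db) >= sigma * len(db)}

db = [{'A', 'B'}, {'B', 'C'}, {'A', 'B', 'C'}, {'A', 'C'}, {'B', 'C'}, {'A', 'B', 'C'}]
print(sorted(partition_mine(db, k=2, sigma=0.5)))
# [('A',), ('A', 'B'), ('A', 'C'), ('B',), ('B', 'C'), ('C',)]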
DHP: Reduce the Number of Candidates
 A k-itemset whose corresponding hash-bucket count is below the threshold cannot be frequent
 Candidates: a, b, c, d, e
 While scanning for 1-itemsets, hash each transaction's 2-itemsets into buckets and count the buckets (a bucket-counting sketch follows below), e.g.:

Bucket count   Itemsets hashed into the bucket
35             {ab, ad, ae}
88             {bd, be, de}
...            ...
102            {yz, qs, wt}

 Frequent 1-itemsets: a, b, d, e
 ab is not a candidate 2-itemset if the count of the bucket holding {ab, ad, ae} is below the support threshold
 J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. SIGMOD'95
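A minimal sketch of the bucket-counting step, with a toy hash function and an arbitrary bucket count chosen only for illustration; note that a surviving bucket count is only an upper bound, so passing the test does not make a pair frequent.

from itertools import combinations

transactions = [{'a', 'b', 'e'}, {'b', 'd', 'e'}, {'a', 'b', 'd', 'e'}, {'a', 'c'}]
NUM_BUCKETS = 7
min_sup = 2

def bucket(pair):
    # Toy hash: sum of character codes modulo the number of buckets
    return sum(ord(ch) for item in pair for ch in item) % NUM_BUCKETS

bucket_count = [0] * NUM_BUCKETS
for t in transactions:
    for pair in combinations(sorted(t), 2):
        bucket_count[bucket(pair)] += 1   # count every 2-itemset's bucket during the first scan

def may_be_frequent(pair):
    # A 2-itemset can be pruned when its entire bucket's count is below the threshold
    return bucket_count[bucket(pair)] >= min_sup

print(may_be_frequent(('a', 'b')), may_be_frequent(('a', 'c')))  # True False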
Sampling for Frequent Patterns
 Select a sample of the original database, and mine frequent patterns within the sample using Apriori
 Scan the database once to verify the frequent itemsets found in the sample; only the borders of the closure of the frequent patterns are checked
 Example: check abcd instead of ab, ac, ..., etc.
 Scan the database again to find missed frequent patterns
 H. Toivonen. Sampling large databases for association rules. In
VLDB’96
DIC: Reduce Number of Scans
 Once both A and D are determined frequent, the counting of AD begins
 Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins
 (Figure: the itemset lattice over {A, B, C, D} next to the transaction stream, contrasting when Apriori and DIC begin counting 1-, 2- and 3-itemsets.)
 S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. SIGMOD'97
Scaling FP-growth by Database
Projection
 What if the FP-tree cannot fit in memory?
 DB projection
 First partition a database into a set of projected DBs
 Then construct and mine FP-tree for each projected DB
 Parallel projection vs. partition projection techniques
 Parallel projection
 Project the DB in parallel for each frequent item
 Parallel projection is space costly
 All the partitions can be processed in parallel
 Partition projection
 Partition the DB based on the ordered frequent items
 Passing the unprocessed parts to the subsequent partitions
Partition-Based Projection
 Parallel projection needs a lot of disk space
 Partition projection saves it
 (Figure: the transaction DB {fcamp, fcabm, fb, cbp, fcamp} is split into p-, m-, b-, a-, c- and f-projected DBs; the m-projected DB is further split into am- and cm-projected DBs such as {fc, fc, fc} and {f, f, f}.)
Advantages of the Pattern Growth Approach
 Divide-and-conquer:
 Decompose both the mining task and DB according to the
frequent patterns obtained so far
 Lead to focused search of smaller databases
 Other factors
 No candidate generation, no candidate test
 Compressed database: FP-tree structure
 No repeated scan of the entire database
 Basic operations: counting local frequent items and building sub-FP-trees; no pattern search and matching
 A good open-source implementation and refinement of FP-growth:
 FPGrowth+ (Grahne and Zhu, FIMI'03)
Further Improvements of Mining Methods
 AFOPT (Liu, et al. @ KDD'03)
 A "push-right" method for mining condensed frequent pattern (CFP) trees
 Carpenter (Pan, et al. @ KDD'03)
 Mines data sets with few rows but numerous columns
 Constructs a row-enumeration tree for efficient mining
 FPgrowth+ (Grahne and Zhu, FIMI'03)
 Efficiently Using Prefix-Trees in Mining Frequent Itemsets, Proc. ICDM'03 Int. Workshop on Frequent Itemset Mining Implementations (FIMI'03), Melbourne, FL, Nov. 2003
 TD-Close (Liu, et al., SDM'06)