
IT496: Introduction to Data Mining

Lecture - 08
Frequent Itemset Mining

Arpit Rana
25th Aug 2022
Frequent Itemset Generation

The Apriori Principle (a.k.a. downward closure property of frequent patterns)


If an itemset is frequent, then all of its subsets must also be frequent.

Every transaction containing {c, d, e} also contains, for example, {c, d} or {e}.

Therefore, if {c, d, e} is frequent, then all of its subsets are also frequent.
Frequent Itemset Generation

The Apriori Pruning Principle (a.k.a. anti-monotone property of support measure)


If an itemset is infrequent, then all of its supersets must also be infrequent.

Support-based Pruning

Any transaction that does not contain {a, b} cannot contain any of its supersets, for example, {a, b, c} or {a, b, c, e}.
Frequent Itemset Generation

Apriori Method
● Initially, scan T once to get the frequent 1-itemsets
● Generate length-k candidate itemsets from length-(k−1) frequent itemsets
● Test the candidates against T
● Terminate when no frequent or candidate itemset can be generated

Apriori Method example (minsup = 3):

Candidate 1-itemsets
Itemset     Count
Cereal      3
Bread       4
Coke        2
Diapers     4
Milk        4
Sugar       1

Candidate 2-itemsets
Itemset             Count
{Cereal, Bread}     2
{Cereal, Diapers}   3
{Cereal, Milk}      2
{Bread, Diapers}    3
{Bread, Milk}       3
{Diapers, Milk}     3

Candidate 3-itemsets
Itemset                  Count
{Bread, Diapers, Milk}   2

Frequent Itemset Generation

Apriori Pseudocode

Ck – candidate k-itemsets
Fk – frequent k-itemsets

The total number of iterations needed by the algorithm is kmax + 1, where kmax is the maximum size of the frequent itemsets.
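A minimal Python sketch of the Apriori loop described above, assuming set-valued transactions and the basket data implied by the example tables (function and variable names are illustrative, not from the lecture):

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return all frequent itemsets (as frozensets) with support >= minsup."""
    # k = 1: scan T once to get the frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Fk = {s for s, n in counts.items() if n >= minsup}

    frequent = set(Fk)
    k = 2
    while Fk:  # the loop runs kmax + 1 times, as noted above
        # Candidate generation: merge pairs of frequent (k-1)-itemsets that
        # differ in exactly one item (a simplified Fk-1 x Fk-1 merge).
        Ck = {a | b for a in Fk for b in Fk if len(a | b) == k}
        # Candidate pruning: drop candidates with an infrequent (k-1)-subset.
        Ck = {c for c in Ck
              if all(frozenset(s) in Fk for s in combinations(c, k - 1))}
        # Support counting: one pass over T per level.
        counts = {c: 0 for c in Ck}
        for t in transactions:
            ts = frozenset(t)
            for c in Ck:
                if c <= ts:
                    counts[c] += 1
        Fk = {c for c, n in counts.items() if n >= minsup}
        frequent |= Fk
        k += 1
    return frequent

# The basket data implied by the example tables above:
T = [{'Bread', 'Milk'},
     {'Bread', 'Diapers', 'Cereal', 'Sugar'},
     {'Milk', 'Diapers', 'Cereal', 'Coke'},
     {'Bread', 'Milk', 'Diapers', 'Cereal'},
     {'Bread', 'Milk', 'Diapers', 'Coke'}]
print(apriori(T, minsup=3))  # frequent 1- and 2-itemsets; no frequent 3-itemset
```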
Frequent Itemset Generation

Apriori: Candidate Generation


An effective candidate generation procedure must be complete and non-redundant.

Completeness

● A candidate generation procedure is said to be complete if it does not omit any frequent itemset.
● To ensure completeness, the set of candidate itemsets must subsume the set of all frequent
itemsets, i.e., ∀k : Fk ⊆ Ck.

Non-redundancy

● A candidate generation procedure is non-redundant if it does not generate the same candidate itemset more than once.

Also, an effective candidate generation procedure should avoid generating too many unnecessary candidates, i.e., candidates with at least one infrequent subset (which therefore cannot be frequent themselves).
Frequent Itemset Generation

Brute-Force Method
● The brute-force method considers every k-itemset as a potential candidate.
● It then prunes unnecessary candidates, i.e., those with at least one infrequent subset (see the sketch after the example below).
● The number of candidate itemsets generated at level k is C(d, k) (d choose k), where d is the total number of items.
[Figure: brute-force generation and pruning of 3-itemset candidates. From the items Cereal, Bread, Coke, Diapers, Milk, and Sugar, candidate generation enumerates all 3-itemsets ({Cereal, Bread, Coke}, …, {Bread, Diapers, Milk}, …, {Diapers, Milk, Sugar}); candidate pruning then leaves only {Bread, Diapers, Milk}.]
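A minimal sketch of the brute-force generate-then-prune step (names are illustrative):

```python
from itertools import combinations

def brute_force_candidates(items, k, Fk_minus_1):
    """Enumerate all C(d, k) k-itemsets over d items, then prune any
    candidate that has at least one infrequent (k-1)-subset."""
    Ck = [frozenset(c) for c in combinations(sorted(items), k)]
    return [c for c in Ck
            if all(frozenset(s) in Fk_minus_1 for s in combinations(c, k - 1))]

items = ['Cereal', 'Bread', 'Coke', 'Diapers', 'Milk', 'Sugar']
F2 = {frozenset(p) for p in [('Cereal', 'Diapers'), ('Bread', 'Diapers'),
                             ('Bread', 'Milk'), ('Diapers', 'Milk')]}
# C(6, 3) = 20 candidates are generated; only {Bread, Diapers, Milk} survives
print(brute_force_candidates(items, 3, F2))
```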


Frequent Itemset Generation
Fk−1 × F1 Method

● It extends each frequent (k−1)-itemset with frequent 1-itemsets that are not part of the (k−1)-itemset.
● It then prunes unnecessary candidates, i.e., those with at least one infrequent subset.
● Lexicographic ordering of the frequent itemsets results in less redundant candidates (see the sketch below).

Example:
Frequent 2-itemsets: {Cereal, Diapers}, {Bread, Diapers}, {Bread, Milk}, {Diapers, Milk}
Frequent 1-itemsets: Cereal, Bread, Diapers, Milk
Candidate Generation: {Cereal, Bread, Diapers}, {Cereal, Diapers, Milk}, {Bread, Diapers, Milk}, {Cereal, Bread, Milk}
Candidate Pruning: {Bread, Diapers, Milk}
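A minimal sketch of the Fk−1 × F1 extension step under the lexicographic-ordering convention (names are illustrative):

```python
def fk_minus_1_x_f1(Fk_minus_1, F1):
    """Extend each frequent (k-1)-itemset with frequent items not already
    in it; extending only with items that come after the itemset's largest
    item (lexicographically) avoids generating the same candidate twice."""
    candidates = set()
    for itemset in Fk_minus_1:
        for single in F1:
            (item,) = single          # each member of F1 is a 1-itemset
            if item > max(itemset):   # "extend to the right" only
                candidates.add(itemset | single)
    return candidates

F2 = {frozenset(p) for p in [('Cereal', 'Diapers'), ('Bread', 'Diapers'),
                             ('Bread', 'Milk'), ('Diapers', 'Milk')]}
F1 = {frozenset([i]) for i in ['Cereal', 'Bread', 'Diapers', 'Milk']}
# With the lexicographic restriction, only {Cereal, Diapers, Milk} and
# {Bread, Diapers, Milk} are generated (a subset of the four candidates
# in the example above); completeness is preserved because the first k-1
# items of any frequent k-itemset are themselves frequent.
print(fk_minus_1_x_f1(F2, F1))
```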


Frequent Itemset Generation

Fk−1 × Fk-1 Method

● This method is used in the Apriori algorithm.
● It merges a pair of frequent (k−1)-itemsets only if their first k−2 items, arranged in lexicographic order, are identical (see the sketch below).
● Is this method complete?

Example:
Frequent 2-itemsets: {Cereal, Diapers}, {Bread, Diapers}, {Bread, Milk}, {Diapers, Milk}
Candidate Generation: {Bread, Diapers, Milk}
Candidate Pruning: {Bread, Diapers, Milk}
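A minimal sketch of the Fk−1 × Fk−1 merge step (names are illustrative):

```python
def fk_minus_1_x_fk_minus_1(Fk_minus_1, k):
    """Merge two frequent (k-1)-itemsets only if their first k-2 items,
    in lexicographic order, are identical."""
    ordered = sorted(tuple(sorted(s)) for s in Fk_minus_1)
    candidates = set()
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if a[:k - 2] != b[:k - 2]:
                break                 # sorted order: no later set shares a's prefix
            candidates.add(frozenset(a + b[k - 2:]))   # a's items + b's last item
    return candidates

F2 = {frozenset(p) for p in [('Cereal', 'Diapers'), ('Bread', 'Diapers'),
                             ('Bread', 'Milk'), ('Diapers', 'Milk')]}
print(fk_minus_1_x_fk_minus_1(F2, 3))  # only {Bread, Diapers, Milk}
```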


Frequent Itemset Generation

Apriori: Candidate Pruning

Let X = {i1, i2, . . . , ik} be a candidate k-itemset. Its k proper (k−1)-subsets are X − {ij}, ∀ j = 1, 2, . . . , k.

If any of them is infrequent, then X is immediately pruned by the Apriori principle (anti-monotone property).

● Brute-Force Method
For each candidate k-itemset, candidate pruning requires checking only k subsets of size k−1.

● Fk−1 × F1 Method
We only need to check k−1 subsets, since the (k−1)-itemset it was extended from is already known to be frequent.

● Fk−1 × Fk-1 Method
We only need to check k−2 subsets, since the two merged (k−1)-itemsets are already known to be frequent.
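A minimal pruning sketch, applicable to any of the three generation methods (names are illustrative):

```python
from itertools import combinations

def apriori_prune(Ck, Fk_minus_1, k):
    """Keep only candidates all of whose (k-1)-subsets are frequent.
    After an Fk-1 x Fk-1 merge, two of the k subsets are the merged
    parents and are already known to be frequent, so only k-2 of these
    membership checks are actually new."""
    return {c for c in Ck
            if all(frozenset(s) in Fk_minus_1
                   for s in combinations(sorted(c), k - 1))}
```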
Frequent Itemset Generation

Support Counting: Brute-force Method


● Determine the support count for every candidate itemset in the lattice structure.
○ We need to compare each candidate against every transaction; if the candidate is contained in the transaction, its support count is incremented.

TID   Itemsets
1     {Bread, Milk}
2     {Bread, Diapers, Cereal, Sugar}
3     {Milk, Diapers, Cereal, Coke}
4     {Bread, Milk, Diapers, Cereal}
5     {Bread, Milk, Diapers, Coke}

With N transactions, M candidate itemsets, and maximum transaction width w, this requires O(NMw) comparisons, which is quite expensive.
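A minimal sketch of this brute-force counting pass (names are illustrative):

```python
def count_support_brute_force(transactions, candidates):
    """O(N * M * w): compare each of the M candidates against each of the
    N transactions, where w is the maximum transaction width."""
    counts = {c: 0 for c in candidates}
    for t in transactions:            # N iterations
        ts = frozenset(t)
        for c in counts:              # M iterations
            if c <= ts:               # subset test on a width-w transaction
                counts[c] += 1
    return counts
```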


Frequent Itemset Generation

Support Counting Using A Simple Hash Structure


● Create a dictionary (hash table) that stores the candidate itemsets as keys and their numbers of appearances as values.
● For each transaction, increment the counter of every candidate itemset contained in it (a minimal sketch follows).
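A minimal dictionary-based counting sketch, assuming candidates of a fixed length k (names are illustrative):

```python
from itertools import combinations

def count_support_with_dict(transactions, candidates, k):
    """Keys are candidate k-itemsets; for each transaction, enumerate its
    k-subsets and bump the counter of each subset that is a candidate,
    so no per-candidate scan of the transaction is needed."""
    counts = {frozenset(c): 0 for c in candidates}
    for t in transactions:
        for subset in combinations(sorted(t), k):
            key = frozenset(subset)
            if key in counts:         # O(1) expected lookup per subset
                counts[key] += 1
    return counts
```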
Frequent Itemset Generation

Example

Suppose we have the following candidate 3-itemsets (C3):

{1 4 5}, {1 3 6}, {4 5 7}, {1 2 5}, {4 5 8},
{1 5 9}, {1 2 4}, {2 3 4}, {5 6 7}, {3 4 5},
{3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}

The hash table stores the counts of the candidate itemsets as they have been computed so far:

key        value
{1 4 5}    0
{1 3 6}    2
{4 5 7}    1
{1 2 5}    3
...        ...
{3 5 6}    3
{3 5 7}    5
{6 8 9}    2
{3 6 7}    4
{3 6 8}    1
Frequent Itemset Generation

Example
Suppose we have the transaction {1, 2, 3, 5, 6}, which generates C(5, 3) (i.e., 10) itemsets of length 3.

We can systematically enumerate the subsets of three items from this transaction, as the snippet below shows.
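A one-loop sketch of this enumeration:

```python
from itertools import combinations

# the C(5, 3) = 10 length-3 subsets of transaction {1, 2, 3, 5, 6}
for subset in combinations([1, 2, 3, 5, 6], 3):
    print(subset)   # (1, 2, 3), (1, 2, 5), ..., (3, 5, 6)
```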
Frequent Itemset Generation

Example

Given the transaction {1, 2, 3, 5, 6}, which generates the following 10 itemsets of length 3:

{1 2 3}, {1 2 5}, {1 2 6},
{1 3 5}, {1 3 6}, {1 5 6},
{2 3 5}, {2 3 6}, {2 5 6},
{3 5 6}

Increment the counters for these itemsets in the dictionary:

key        value
{1 4 5}    0
{1 3 6}    3
{4 5 7}    1
{1 2 5}    4
...        ...
{3 5 6}    4
{3 5 7}    5
{6 8 9}    2
{3 6 7}    4
{3 6 8}    1

Can we use a better hash structure to store candidates?
Frequent Itemset Generation

Support Counting Using A Hash Tree

● Create a hash tree that stores the candidate itemsets at its leaf nodes.
○ We need a hash function h(x); here, h(x) = x mod 3.

Example
Suppose we have the same candidate 3-itemsets:

{1 4 5}, {1 3 6}, {4 5 7}, {1 2 5}, {4 5 8},
{1 5 9}, {1 2 4}, {2 3 4}, {5 6 7}, {3 4 5},
{3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}

At the i-th level of the tree, we hash on the i-th item.
Frequent Itemset Generation

Support Counting Using A Hash Tree

Given a transaction {1, 2, 3, 5, 6} and a candidate 3-itemset hash tree (h(x) = x mod 3), we can perform the subset operation by traversing the tree:

● 5 out of the 9 leaf nodes are visited.
● 9 out of the 15 itemsets are compared against the transaction.
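A minimal hash-tree sketch, assuming a leaf capacity of 3 and the slide's hash function h(x) = x mod 3; the exact tree shape (and hence the leaf counts above) depends on the leaf capacity, and all names are illustrative:

```python
class Node:
    def __init__(self):
        self.children = {}   # hash bucket -> child Node
        self.itemsets = []   # candidates stored here while this is a leaf

def h(x):
    return x % 3             # hash function from the slide

MAX_LEAF = 3                 # split a leaf once it holds more than 3 itemsets

def insert(node, itemset, level=0):
    """Insert a candidate (a sorted tuple); at level i we hash on item i."""
    if not node.children:                      # leaf node
        node.itemsets.append(itemset)
        if len(node.itemsets) > MAX_LEAF and level < len(itemset):
            for stored in node.itemsets:       # split: push itemsets down
                child = node.children.setdefault(h(stored[level]), Node())
                insert(child, stored, level + 1)
            node.itemsets = []
    else:
        child = node.children.setdefault(h(itemset[level]), Node())
        insert(child, itemset, level + 1)

def count_subsets(node, t, counts, start=0, visited=None):
    """Increment counts of candidates contained in sorted transaction t,
    visiting each leaf at most once per transaction."""
    if visited is None:
        visited = set()
    if not node.children:                      # leaf: compare its itemsets
        if id(node) not in visited:
            visited.add(id(node))
            for itemset in node.itemsets:
                if set(itemset) <= set(t):
                    counts[itemset] = counts.get(itemset, 0) + 1
        return
    for i in range(start, len(t)):             # hash on each remaining item
        child = node.children.get(h(t[i]))
        if child is not None:
            count_subsets(child, t, counts, i + 1, visited)

candidates = [(1, 4, 5), (1, 3, 6), (4, 5, 7), (1, 2, 5), (4, 5, 8),
              (1, 5, 9), (1, 2, 4), (2, 3, 4), (5, 6, 7), (3, 4, 5),
              (3, 5, 6), (3, 5, 7), (6, 8, 9), (3, 6, 7), (3, 6, 8)]
root = Node()
for c in candidates:
    insert(root, c)

counts = {}
count_subsets(root, (1, 2, 3, 5, 6), counts)
print(counts)   # {(1, 2, 5): 1, (1, 3, 6): 1, (3, 5, 6): 1}
```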
Frequent Itemset Generation

Factors Affecting Computational Complexity of Apriori Algorithm

Support Threshold
● Lowering the support threshold results in more frequent itemsets
● This may increase the number of candidates and the maximum length of frequent itemsets

Number of Items (dimensionality)

● More space is needed to store the support count of each item
● If the number of frequent items also increases, both computation and I/O costs may increase
Frequent Itemset Generation

Factors Affecting Computational Complexity of Apriori Algorithm

Number of Transactions
● Since Apriori makes multiple passes over the data, the run time of the algorithm may increase with the number of transactions

Average Transaction Width

● Transaction width increases with denser data sets
● This may increase the maximum length of frequent itemsets and the number of hash tree traversals (the number of subsets in a transaction increases with its width)
Frequent Itemset Generation

Exercise

● Computational complexity analysis of Apriori Algorithm


IT496: Introduction to Data Mining

Next lecture: Association Rule Mining


29th August 2022
