
IT496: Introduction to Data Mining

Lecture - 08
Frequent Itemset Mining

Arpit Rana
25th Aug 2022
Frequent Itemset Generation

The Apriori Principle (a.k.a. downward closure property of frequent patterns)


If an itemset is frequent, then all of its subsets must also be frequent.

Every transaction containing {c, d, e} also contains, for example, {c, d} or {e}.

Therefore, if {c, d, e} is frequent, then all of its subsets are also frequent.
Frequent Itemset Generation

The Apriori Pruning Principle (a.k.a. anti-monotone property of support measure)


If an itemset is infrequent, then all of its supersets must also be infrequent.

Support-based Pruning

Any transaction that does not contain {a, b} cannot contain any of its supersets, for example, {a, b, c} or {a, b, c, e}.
Frequent Itemset Generation

Apriori Method
● Initially, scan T once to get the frequent 1-itemsets
● Generate length-k candidate itemsets from length-(k−1) frequent itemsets
● Test the candidates against T
● Terminate when no frequent or candidate itemset can be generated

Apriori Method example (minsup = 3):

Candidate 1-itemsets
Itemset     Count
Cereal      3
Bread       4
Coke        2
Diapers     4
Milk        4
Sugar       1

Candidate 2-itemsets
Itemset             Count
{Cereal, Bread}     2
{Cereal, Diapers}   3
{Cereal, Milk}      2
{Bread, Diapers}    3
{Bread, Milk}       3
{Diapers, Milk}     3

Candidate 3-itemsets
Itemset                  Count
{Bread, Diapers, Milk}   2

Frequent Itemset Generation

Apriori Pseudocode

Ck – candidate k-itemsets
Fk – frequent k-itemsets

The total number of iterations needed by the algorithm is kmax + 1, where kmax is the maximum size of the frequent itemsets.
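A minimal Python sketch of the Apriori loop described above, assuming set-valued transactions and the basket data implied by the example tables (function and variable names are illustrative, not from the lecture):

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return all frequent itemsets (as frozensets) with support >= minsup."""
    # k = 1: scan T once to get the frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Fk = {s for s, n in counts.items() if n >= minsup}

    frequent = set(Fk)
    k = 2
    while Fk:  # the loop runs kmax + 1 times, as noted above
        # Candidate generation: merge pairs of frequent (k-1)-itemsets that
        # differ in exactly one item (a simplified Fk-1 x Fk-1 merge).
        Ck = {a | b for a in Fk for b in Fk if len(a | b) == k}
        # Candidate pruning: drop candidates with an infrequent (k-1)-subset.
        Ck = {c for c in Ck
              if all(frozenset(s) in Fk for s in combinations(c, k - 1))}
        # Support counting: one pass over T per level.
        counts = {c: 0 for c in Ck}
        for t in transactions:
            ts = frozenset(t)
            for c in Ck:
                if c <= ts:
                    counts[c] += 1
        Fk = {c for c, n in counts.items() if n >= minsup}
        frequent |= Fk
        k += 1
    return frequent

# The basket data implied by the example tables above:
T = [{'Bread', 'Milk'},
     {'Bread', 'Diapers', 'Cereal', 'Sugar'},
     {'Milk', 'Diapers', 'Cereal', 'Coke'},
     {'Bread', 'Milk', 'Diapers', 'Cereal'},
     {'Bread', 'Milk', 'Diapers', 'Coke'}]
print(apriori(T, minsup=3))  # frequent 1- and 2-itemsets; no frequent 3-itemset
```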
Frequent Itemset Generation

Apriori: Candidate Generation


An effective candidate generation procedure must be complete and non-redundant.

Completeness

● A candidate generation procedure is said to be complete if it does not omit any frequent itemset.
● To ensure completeness, the set of candidate itemsets must subsume the set of all frequent
itemsets, i.e., ∀k : Fk ⊆ Ck.

Non-redundancy

● A candidate generation procedure is non-redundant if it does not generate the same candidate itemset more than once.

Also, an effective candidate generation procedure should avoid generating too many unnecessary candidates, i.e., candidates with at least one infrequent subset (which therefore cannot be frequent themselves).
Frequent Itemset Generation

Brute-Force Method
● The brute-force method considers every k-itemset as a potential candidate.
● It then prunes unnecessary candidates, i.e., those with at least one infrequent subset (see the sketch after the example below).
● The number of candidate itemsets generated at level k is C(d, k) (d choose k), where d is the total number of items.
[Figure: brute-force generation and pruning of 3-itemset candidates. From the items Cereal, Bread, Coke, Diapers, Milk, and Sugar, candidate generation enumerates all 3-itemsets ({Cereal, Bread, Coke}, …, {Bread, Diapers, Milk}, …, {Diapers, Milk, Sugar}); candidate pruning then leaves only {Bread, Diapers, Milk}.]
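A minimal sketch of the brute-force generate-then-prune step (names are illustrative):

```python
from itertools import combinations

def brute_force_candidates(items, k, Fk_minus_1):
    """Enumerate all C(d, k) k-itemsets over d items, then prune any
    candidate that has at least one infrequent (k-1)-subset."""
    Ck = [frozenset(c) for c in combinations(sorted(items), k)]
    return [c for c in Ck
            if all(frozenset(s) in Fk_minus_1 for s in combinations(c, k - 1))]

items = ['Cereal', 'Bread', 'Coke', 'Diapers', 'Milk', 'Sugar']
F2 = {frozenset(p) for p in [('Cereal', 'Diapers'), ('Bread', 'Diapers'),
                             ('Bread', 'Milk'), ('Diapers', 'Milk')]}
# C(6, 3) = 20 candidates are generated; only {Bread, Diapers, Milk} survives
print(brute_force_candidates(items, 3, F2))
```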


Frequent Itemset Generation
Fk−1 × F1 Method

● It extends each frequent (k−1)-itemset with frequent 1-itemsets that are not part of the (k−1)-itemset.
● It then prunes unnecessary candidates, i.e., those with at least one infrequent subset.
● Lexicographic ordering of the frequent itemsets results in less redundant candidates (see the sketch below).

Example:
Frequent 2-itemsets: {Cereal, Diapers}, {Bread, Diapers}, {Bread, Milk}, {Diapers, Milk}
Frequent 1-itemsets: Cereal, Bread, Diapers, Milk
Candidate Generation: {Cereal, Bread, Diapers}, {Cereal, Diapers, Milk}, {Bread, Diapers, Milk}, {Cereal, Bread, Milk}
Candidate Pruning: {Bread, Diapers, Milk}
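A minimal sketch of the Fk−1 × F1 extension step under the lexicographic-ordering convention (names are illustrative):

```python
def fk_minus_1_x_f1(Fk_minus_1, F1):
    """Extend each frequent (k-1)-itemset with frequent items not already
    in it; extending only with items that come after the itemset's largest
    item (lexicographically) avoids generating the same candidate twice."""
    candidates = set()
    for itemset in Fk_minus_1:
        for single in F1:
            (item,) = single          # each member of F1 is a 1-itemset
            if item > max(itemset):   # "extend to the right" only
                candidates.add(itemset | single)
    return candidates

F2 = {frozenset(p) for p in [('Cereal', 'Diapers'), ('Bread', 'Diapers'),
                             ('Bread', 'Milk'), ('Diapers', 'Milk')]}
F1 = {frozenset([i]) for i in ['Cereal', 'Bread', 'Diapers', 'Milk']}
# With the lexicographic restriction, only {Cereal, Diapers, Milk} and
# {Bread, Diapers, Milk} are generated (a subset of the four candidates
# in the example above); completeness is preserved because the first k-1
# items of any frequent k-itemset are themselves frequent.
print(fk_minus_1_x_f1(F2, F1))
```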


Frequent Itemset Generation

Fk−1 × Fk-1 Method

● This method is used in the Apriori algorithm.
● It merges a pair of frequent (k−1)-itemsets only if their first k−2 items, arranged in lexicographic order, are identical (see the sketch below).
● Is this method complete?

Example:
Frequent 2-itemsets: {Cereal, Diapers}, {Bread, Diapers}, {Bread, Milk}, {Diapers, Milk}
Candidate Generation: {Bread, Diapers, Milk}
Candidate Pruning: {Bread, Diapers, Milk}
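A minimal sketch of the Fk−1 × Fk−1 merge step (names are illustrative):

```python
def fk_minus_1_x_fk_minus_1(Fk_minus_1, k):
    """Merge two frequent (k-1)-itemsets only if their first k-2 items,
    in lexicographic order, are identical."""
    ordered = sorted(tuple(sorted(s)) for s in Fk_minus_1)
    candidates = set()
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if a[:k - 2] != b[:k - 2]:
                break                 # sorted order: no later set shares a's prefix
            candidates.add(frozenset(a + b[k - 2:]))   # a's items + b's last item
    return candidates

F2 = {frozenset(p) for p in [('Cereal', 'Diapers'), ('Bread', 'Diapers'),
                             ('Bread', 'Milk'), ('Diapers', 'Milk')]}
print(fk_minus_1_x_fk_minus_1(F2, 3))  # only {Bread, Diapers, Milk}
```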


Frequent Itemset Generation

Apriori: Candidate Pruning

Let X = {i1, i2, . . . , ik} be a candidate k-itemset. Its k proper (k−1)-subsets are X − {ij}, ∀ j = 1, 2, . . . , k.

If any of them is infrequent, then X is immediately pruned by the Apriori principle (anti-monotone property).

● Brute-Force Method
For each candidate k-itemset, candidate pruning requires checking only k subsets of size k−1.

● Fk−1 × F1 Method
We only need to check k−1 subsets, since the (k−1)-itemset it was extended from is already known to be frequent.

● Fk−1 × Fk-1 Method
We only need to check k−2 subsets, since the two merged (k−1)-itemsets are already known to be frequent.
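A minimal pruning sketch, applicable to any of the three generation methods (names are illustrative):

```python
from itertools import combinations

def apriori_prune(Ck, Fk_minus_1, k):
    """Keep only candidates all of whose (k-1)-subsets are frequent.
    After an Fk-1 x Fk-1 merge, two of the k subsets are the merged
    parents and are already known to be frequent, so only k-2 of these
    membership checks are actually new."""
    return {c for c in Ck
            if all(frozenset(s) in Fk_minus_1
                   for s in combinations(sorted(c), k - 1))}
```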
Frequent Itemset Generation

Support Counting: Brute-force Method


● Determine the support count for every candidate itemset in the lattice structure.
○ We need to compare each candidate against every transaction; if the candidate is contained in the transaction, its support count is incremented.

TID   Itemsets
1     {Bread, Milk}
2     {Bread, Diapers, Cereal, Sugar}
3     {Milk, Diapers, Cereal, Coke}
4     {Bread, Milk, Diapers, Cereal}
5     {Bread, Milk, Diapers, Coke}

With N transactions, M candidate itemsets, and maximum transaction width w, this requires O(NMw) comparisons, which is quite expensive.
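A minimal sketch of this brute-force counting pass (names are illustrative):

```python
def count_support_brute_force(transactions, candidates):
    """O(N * M * w): compare each of the M candidates against each of the
    N transactions, where w is the maximum transaction width."""
    counts = {c: 0 for c in candidates}
    for t in transactions:            # N iterations
        ts = frozenset(t)
        for c in counts:              # M iterations
            if c <= ts:               # subset test on a width-w transaction
                counts[c] += 1
    return counts
```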


Frequent Itemset Generation

Support Counting Using A Simple Hash Structure


● Create a dictionary (hash table) that stores the candidate itemsets as keys and their numbers of appearances as values.
● For each transaction, increment the counter of every candidate itemset contained in it (a minimal sketch follows).
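A minimal dictionary-based counting sketch, assuming candidates of a fixed length k (names are illustrative):

```python
from itertools import combinations

def count_support_with_dict(transactions, candidates, k):
    """Keys are candidate k-itemsets; for each transaction, enumerate its
    k-subsets and bump the counter of each subset that is a candidate,
    so no per-candidate scan of the transaction is needed."""
    counts = {frozenset(c): 0 for c in candidates}
    for t in transactions:
        for subset in combinations(sorted(t), k):
            key = frozenset(subset)
            if key in counts:         # O(1) expected lookup per subset
                counts[key] += 1
    return counts
```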
Frequent Itemset Generation

Example

Suppose we have the following candidate 3-itemsets (C3):

{1 4 5}, {1 3 6}, {4 5 7}, {1 2 5}, {4 5 8},
{1 5 9}, {1 2 4}, {2 3 4}, {5 6 7}, {3 4 5},
{3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}

The hash table stores the counts of the candidate itemsets as they have been computed so far:

key        value
{1 4 5}    0
{1 3 6}    2
{4 5 7}    1
{1 2 5}    3
...        ...
{3 5 6}    3
{3 5 7}    5
{6 8 9}    2
{3 6 7}    4
{3 6 8}    1
Frequent Itemset Generation

Example
Suppose we have the transaction {1, 2, 3, 5, 6}, which generates C(5, 3) (i.e., 10) itemsets of length 3.

We can systematically enumerate the subsets of three items from this transaction, as the snippet below shows.
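A one-loop sketch of this enumeration:

```python
from itertools import combinations

# the C(5, 3) = 10 length-3 subsets of transaction {1, 2, 3, 5, 6}
for subset in combinations([1, 2, 3, 5, 6], 3):
    print(subset)   # (1, 2, 3), (1, 2, 5), ..., (3, 5, 6)
```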
Frequent Itemset Generation

Example

Given the transaction {1, 2, 3, 5, 6}, which generates the following 10 itemsets of length 3:

{1 2 3}, {1 2 5}, {1 2 6},
{1 3 5}, {1 3 6}, {1 5 6},
{2 3 5}, {2 3 6}, {2 5 6},
{3 5 6}

Increment the counters for these itemsets in the dictionary:

key        value
{1 4 5}    0
{1 3 6}    3
{4 5 7}    1
{1 2 5}    4
...        ...
{3 5 6}    4
{3 5 7}    5
{6 8 9}    2
{3 6 7}    4
{3 6 8}    1

Can we use a better hash structure to store candidates?
Frequent Itemset Generation

Support Counting Using A Hash Tree

● Create a hash tree that stores the candidate itemsets at its leaf nodes.
○ We need a hash function h(x); here, h(x) = x mod 3.

Example
Suppose we have the same candidate 3-itemsets:

{1 4 5}, {1 3 6}, {4 5 7}, {1 2 5}, {4 5 8},
{1 5 9}, {1 2 4}, {2 3 4}, {5 6 7}, {3 4 5},
{3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}

At the i-th level of the tree, we hash on the i-th item.
Frequent Itemset Generation

Support Counting Using A Hash Tree

Given a transaction {1, 2, 3, 5, 6} and a candidate 3-itemset hash tree (h(x) = x mod 3), we can perform the subset operation by traversing the tree:

● 5 out of the 9 leaf nodes are visited.
● 9 out of the 15 itemsets are compared against the transaction.
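A minimal hash-tree sketch, assuming a leaf capacity of 3 and the slide's hash function h(x) = x mod 3; the exact tree shape (and hence the leaf counts above) depends on the leaf capacity, and all names are illustrative:

```python
class Node:
    def __init__(self):
        self.children = {}   # hash bucket -> child Node
        self.itemsets = []   # candidates stored here while this is a leaf

def h(x):
    return x % 3             # hash function from the slide

MAX_LEAF = 3                 # split a leaf once it holds more than 3 itemsets

def insert(node, itemset, level=0):
    """Insert a candidate (a sorted tuple); at level i we hash on item i."""
    if not node.children:                      # leaf node
        node.itemsets.append(itemset)
        if len(node.itemsets) > MAX_LEAF and level < len(itemset):
            for stored in node.itemsets:       # split: push itemsets down
                child = node.children.setdefault(h(stored[level]), Node())
                insert(child, stored, level + 1)
            node.itemsets = []
    else:
        child = node.children.setdefault(h(itemset[level]), Node())
        insert(child, itemset, level + 1)

def count_subsets(node, t, counts, start=0, visited=None):
    """Increment counts of candidates contained in sorted transaction t,
    visiting each leaf at most once per transaction."""
    if visited is None:
        visited = set()
    if not node.children:                      # leaf: compare its itemsets
        if id(node) not in visited:
            visited.add(id(node))
            for itemset in node.itemsets:
                if set(itemset) <= set(t):
                    counts[itemset] = counts.get(itemset, 0) + 1
        return
    for i in range(start, len(t)):             # hash on each remaining item
        child = node.children.get(h(t[i]))
        if child is not None:
            count_subsets(child, t, counts, i + 1, visited)

candidates = [(1, 4, 5), (1, 3, 6), (4, 5, 7), (1, 2, 5), (4, 5, 8),
              (1, 5, 9), (1, 2, 4), (2, 3, 4), (5, 6, 7), (3, 4, 5),
              (3, 5, 6), (3, 5, 7), (6, 8, 9), (3, 6, 7), (3, 6, 8)]
root = Node()
for c in candidates:
    insert(root, c)

counts = {}
count_subsets(root, (1, 2, 3, 5, 6), counts)
print(counts)   # {(1, 2, 5): 1, (1, 3, 6): 1, (3, 5, 6): 1}
```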
Frequent Itemset Generation

Factors Affecting Computational Complexity of Apriori Algorithm

Support Threshold
● Lowering the support threshold results in more frequent itemsets
● This may increase the number of candidates and the maximum length of frequent itemsets

Number of Items (dimensionality)

● More space is needed to store the support count of each item
● If the number of frequent items also increases, both computation and I/O costs may increase
Frequent Itemset Generation

Factors Affecting Computational Complexity of Apriori Algorithm

Number of Transactions
● Since Apriori makes multiple passes over the data, the run time of the algorithm may increase with the number of transactions

Average Transaction Width

● Transaction width increases with denser data sets
● This may increase the maximum length of frequent itemsets and the number of hash tree traversals (the number of subsets in a transaction increases with its width)
Frequent Itemset Generation

Exercise

● Computational complexity analysis of Apriori Algorithm


IT496: Introduction to Data Mining

Next lecture: Association Rule Mining


29th August 2022
