Lecture - 08
Frequent Itemset Mining
Arpit Rana
25th Aug 2022
Frequent Itemset Generation
Every transaction having {c, d, e} also contains, for example, {c, d} or {e}. Therefore, if {c, d, e} is frequent, then all of its subsets are also frequent.
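This anti-monotone property can be checked directly on a small example; a minimal Python sketch, with a made-up transaction database for illustration:

```python
from itertools import combinations

# Toy transaction database (made-up data for illustration)
transactions = [
    {"c", "d", "e"},
    {"a", "c", "d", "e"},
    {"b", "c", "d", "e"},
    {"a", "b"},
]

def support(itemset):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

# Every subset of {c, d, e} has support at least that of {c, d, e}:
s_cde = support({"c", "d", "e"})
for k in (1, 2):
    for sub in combinations({"c", "d", "e"}, k):
        assert support(set(sub)) >= s_cde  # anti-monotone property
print(s_cde)  # support of {c, d, e} in the toy data
```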
Support-based Pruning
Apriori Method
● Initially, scan T once to get the frequent 1-itemsets
● Generate length-k candidate itemsets from the length-(k−1) frequent itemsets
● Test the candidates against T
● Terminate when no frequent or candidate set can be generated
Apriori Pseudocode
Ck – candidate k-itemsets
Fk – frequent k-itemsets
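The loop described above (scan T for F1, generate Ck from Fk−1, test the candidates against T, stop when nothing is generated) can be sketched in Python; the transaction list is made up for illustration:

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return all frequent itemsets (as frozensets) with support >= minsup."""
    def support(c):
        return sum(1 for t in transactions if c <= t)

    # F1: frequent 1-itemsets, obtained from one scan of T
    items = {i for t in transactions for i in t}
    Fk = {frozenset([i]) for i in items if support(frozenset([i])) >= minsup}
    frequent = set(Fk)
    k = 2
    while Fk:
        # Candidate generation: merge (k-1)-itemsets whose union has size k
        Ck = {a | b for a in Fk for b in Fk if len(a | b) == k}
        # Candidate pruning: every (k-1)-subset must itself be frequent
        Ck = {c for c in Ck
              if all(frozenset(s) in Fk for s in combinations(c, k - 1))}
        # Support counting: test the surviving candidates against T
        Fk = {c for c in Ck if support(c) >= minsup}
        frequent |= Fk
        k += 1  # terminate when no frequent or candidate set is generated
    return frequent

# Usage with made-up data and minsup = 2
T = [{"bread", "milk"}, {"bread", "diapers"}, {"bread", "milk", "diapers"}]
print(sorted(sorted(f) for f in apriori(T, 2)))
```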
Completeness
● A candidate generation procedure is said to be complete if it does not omit any frequent
itemsets.
● To ensure completeness, the set of candidate itemsets must subsume the set of all frequent
itemsets, i.e., ∀k : Fk ⊆ Ck.
Non-redundant
Also, an effective candidate generation procedure should avoid generating too many unnecessary candidates, i.e., candidates for which at least one subset is infrequent (and which therefore cannot be frequent themselves).
Brute-Force Method
● The brute-force method considers every k-itemset as a potential candidate, and
● It then prunes any unnecessary candidates, i.e., those with at least one infrequent subset.
● The number of candidate itemsets generated at level k is equal to C(d, k) ("d choose k"), where d is the number of distinct items.
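The per-level count is the binomial coefficient C(d, k), so the brute-force method generates 2^d − 1 candidates in total; a quick check with Python's math.comb (d = 6 is an arbitrary example):

```python
import math

d = 6  # number of distinct items (arbitrary example value)
# Candidates the brute-force method generates at each level k = 1..d
per_level = [math.comb(d, k) for k in range(1, d + 1)]
print(per_level)       # [6, 15, 20, 15, 6, 1]
print(sum(per_level))  # 63, i.e. 2**d - 1 candidates in total
```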
Candidate Generation (Fk−1 × F1 Method)
● It extends each frequent (k−1)-itemset with a frequent 1-itemset that is not part of the (k−1)-itemset.
● It then prunes any unnecessary candidates, i.e., those with at least one infrequent subset.
[Figure: frequent 2-itemsets {Bread, Milk}, {Bread, Diapers}, {Cereal, Diapers} are each extended with the frequent 1-itemsets Bread, Milk, Diapers, Cereal to form candidate 3-itemsets]
If any subset of a candidate itemset X is infrequent, then X is immediately pruned by using the Apriori principle (the anti-monotone property of support).
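The Fk−1 × F1 generation and pruning steps can be sketched as follows, using the frequent itemsets from the candidate-generation example above:

```python
from itertools import combinations

# Frequent itemsets from the example above
F2 = [{"Bread", "Milk"}, {"Bread", "Diapers"}, {"Cereal", "Diapers"}]
F1 = [{"Bread"}, {"Milk"}, {"Diapers"}, {"Cereal"}]

# F(k-1) x F1: extend each frequent (k-1)-itemset with a frequent
# 1-itemset that is not already part of it
candidates = {frozenset(a | b) for a in F2 for b in F1 if not b <= a}

# Apriori-based pruning: drop X if any (k-1)-subset of X is infrequent
frequent_2 = {frozenset(s) for s in F2}
pruned = {c for c in candidates
          if all(frozenset(s) in frequent_2 for s in combinations(c, 2))}
print(sorted(sorted(c) for c in candidates))
print(sorted(sorted(c) for c in pruned))
```

With only these three frequent 2-itemsets, all four generated candidates are pruned: each contains an infrequent 2-subset (e.g. {Milk, Diapers}), illustrating how aggressively the Apriori principle cuts the search space.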
● Brute-Force Method
For each candidate k-itemset, candidate pruning requires checking all k subsets of size k−1.
● Fk−1 × F1 Method
We need to check only k−1 of the subsets, since the one used to generate the candidate is already known to be frequent.
● Fk−1 × Fk−1 Method
We need to check only k−2 of the subsets, since the two merged to generate the candidate are already known to be frequent.
[Figure: support counting over 15 candidate 3-itemsets — {1 4 5}, {1 3 6}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 2 4}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8} — shown with their support counts, e.g. {1 3 6} → 2, {4 5 7} → 1, {1 2 5} → 3, {6 8 9} → 2, {3 6 7} → 4, {3 6 8} → 1]
Example
Suppose we have the transaction {1, 2, 3, 5, 6}, which generates C(5, 3) = 10 itemsets of length 3: {1 2 3}, {1 2 5}, {1 2 6}, {1 3 5}, {1 3 6}, {1 5 6}, {2 3 5}, {2 3 6}, {2 5 6}, {3 5 6}. Each of these must be matched against the candidate itemsets to update their support counts.
Can we use a better hash structure to store candidates?
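The ten length-3 itemsets of the transaction can be enumerated directly with itertools.combinations:

```python
from itertools import combinations

transaction = [1, 2, 3, 5, 6]
# All length-3 subsets of the transaction, in lexicographic order
subsets = [set(c) for c in combinations(transaction, 3)]
print(len(subsets))  # 10 = C(5, 3)
```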
● Create a hash tree that stores the candidate itemsets at its leaf nodes:
○ Need a hash function – h(x)
Example
Suppose we have the same candidate 3-itemsets and the hash function h(x) = x mod 3.
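How h(x) = x mod 3 splits the candidates at the root of the hash tree can be sketched as follows, using the candidate list from the support-counting example above (branching on the first item of each candidate):

```python
# Candidate 3-itemsets from the support-counting example
candidates = [
    (1, 4, 5), (1, 3, 6), (4, 5, 7), (1, 2, 5), (4, 5, 8),
    (1, 5, 9), (1, 2, 4), (2, 3, 4), (5, 6, 7), (3, 4, 5),
    (3, 5, 6), (3, 5, 7), (6, 8, 9), (3, 6, 7), (3, 6, 8),
]

def h(x):
    return x % 3  # hash function from the slide

# Root level of the hash tree: branch on the first item of each candidate
buckets = {0: [], 1: [], 2: []}
for c in candidates:
    buckets[h(c[0])].append(c)

for b, cs in buckets.items():
    print(b, cs)
```

A full hash tree would recurse: each bucket is split again by hashing the second item, then the third, until the leaves hold few enough candidates to scan directly.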
Support Threshold
● Lowering the support threshold results in more frequent itemsets
● This may increase the number of candidates and the maximum length of the frequent itemsets
Number of Transactions
● Since Apriori makes multiple passes over the data, the run time of the algorithm may increase with the number of transactions
Exercise