
Apriori Algorithm

What is an Association Rule?


Association rule learning is a rule-based machine learning method for discovering
interesting relations between variables in large databases. It is intended to identify
strong rules discovered in databases using some measures of interestingness.
Useful Concepts
Support is an indication of how frequently the itemset appears in the dataset.

Support (apple) = (Transactions involving apple) / (Total Transactions)
Useful Concepts
Confidence is an indication of how often the rule has been found to be true.

Confidence (apple → beer) = (Transactions involving both apple and beer) / (Transactions involving apple)
Useful Concepts
Lift: This says how likely item Y is purchased when item X is purchased, while
controlling for how popular item Y is.

Lift (apple → beer) = Confidence (apple → beer) / Support (beer)

If the lift value is less than 1, the customers are unlikely to buy the two items together;
a value greater than 1 means the items are bought together more often than expected, and
the greater the value, the stronger the association.
Exercise
Out of 2000 transactions, 200 contain jam and 300 contain bread. Of the 300 transactions
that contain bread, 100 also contain jam.

Calculate the support, confidence, and lift for the rule jam → bread.


Solution
Support (Jam) = (Transactions involving jam) / (Total transactions)
= 200/2000 = 10%
Confidence (Jam → Bread) = (Transactions involving both bread and jam) / (Transactions involving jam)
= 100/200 = 50%
Support (Bread) = (Transactions involving bread) / (Total transactions)
= 300/2000 = 15%
Lift (Jam → Bread) = Confidence (Jam → Bread) / Support (Bread)
= 50% / 15% ≈ 3.33
Since the lift is greater than 1, jam and bread are bought together more often than would
be expected if the two items were independent.
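As a cross-check, here is a minimal Python sketch of the same calculations; the function and variable names are illustrative, not from the slides:

    # Minimal sketch of the support / confidence / lift calculations above.
    def support(itemset_count, total):
        # Fraction of all transactions that contain the itemset.
        return itemset_count / total

    def confidence(both_count, antecedent_count):
        # Of the transactions containing the antecedent, the fraction that also contain the consequent.
        return both_count / antecedent_count

    def lift(rule_confidence, consequent_support):
        # Confidence of X -> Y divided by the support of Y.
        return rule_confidence / consequent_support

    total = 2000
    jam, bread, jam_and_bread = 200, 300, 100

    s_jam = support(jam, total)                              # 0.10
    c_jam_bread = confidence(jam_and_bread, jam)             # 0.50
    l_jam_bread = lift(c_jam_bread, support(bread, total))   # ~3.33
    print(s_jam, c_jam_bread, l_jam_bread)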
The Apriori Algorithm: Basic
The Apriori Algorithm is an influential algorithm for mining frequent itemsets for
Boolean association rules.

Key Concepts:
• Frequent Itemsets: The sets of items that have minimum support (denoted by Lk for the
frequent k-itemsets).
• Apriori Property: Any subset of a frequent itemset must also be frequent.
• Join Operation: To find Lk, a set of candidate k-itemsets is generated by joining Lk-1
with itself.
The Apriori Algorithm in a Nutshell
1. Find the frequent itemsets: the sets of items that have minimum support
2. A subset of a frequent itemset must also be a frequent itemset
• i.e., if {A, B} is a frequent itemset, both {A} and {B} must also be frequent
itemsets
3. Iteratively find frequent itemsets with cardinality from 1 to k (k-itemset)
4. Use the frequent itemsets to generate association rules.
The Apriori Algorithm: Example
Consider a database, D, consisting of 9 transactions.

TID   List of Items
T1    I1, I2, I5
T2    I2, I4
T3    I2, I3
T4    I1, I2, I4
T5    I1, I3
T6    I2, I3
T7    I1, I3
T8    I1, I2, I3, I5
T9    I1, I2, I3

• Suppose the minimum support count required is 2 (i.e., min_sup = 2/9 ≈ 22%).
• Let the minimum confidence required be 70%.
• We first have to find the frequent itemsets using the Apriori algorithm.
• Then, association rules will be generated using min. support & min. confidence.
The Apriori Algorithm: Example
Step 1: Generating 1-itemset Frequent Pattern
Scan D for the count of each candidate to obtain C1; then compare each candidate's support
count with the minimum support count to obtain L1.

C1
Itemset   Support Count
{I1}      6
{I2}      7
{I3}      6
{I4}      2
{I5}      2

L1
Itemset   Support Count
{I1}      6
{I2}      7
{I3}      6
{I4}      2
{I5}      2

The set of frequent 1-itemsets, L1, consists of the candidate 1-itemsets satisfying
minimum support.
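A small Python sketch of this first pass over the example database (variable names such as transactions and min_sup_count are my own, chosen for illustration):

    # Sketch: build C1 (candidate 1-itemsets with counts) and L1 (those meeting min support).
    from collections import Counter

    transactions = [
        {'I1', 'I2', 'I5'}, {'I2', 'I4'}, {'I2', 'I3'},
        {'I1', 'I2', 'I4'}, {'I1', 'I3'}, {'I2', 'I3'},
        {'I1', 'I3'}, {'I1', 'I2', 'I3', 'I5'}, {'I1', 'I2', 'I3'},
    ]
    min_sup_count = 2

    # C1: count every individual item across all transactions.
    c1 = Counter(item for t in transactions for item in t)

    # L1: keep the items whose count meets the minimum support count.
    l1 = {frozenset([item]): count for item, count in c1.items() if count >= min_sup_count}
    print(l1)   # all five items survive: I1:6, I2:7, I3:6, I4:2, I5:2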
Step 2: Generating 2-itemset Frequent Pattern
Generate the candidates C2 from L1 (L1 Join L1), then scan D for the count of each candidate.

C2
Itemset    Support Count
{I1,I2}    4
{I1,I3}    4
{I1,I4}    1
{I1,I5}    2
{I2,I3}    4
{I2,I4}    2
{I2,I5}    2
{I3,I4}    0
{I3,I5}    1
{I4,I5}    0

Compare each candidate's support count with the minimum support count to obtain L2.

L2
Itemset    Support Count
{I1,I2}    4
{I1,I3}    4
{I1,I5}    2
{I2,I3}    4
{I2,I4}    2
{I2,I5}    2
The Apriori Algorithm: Example
Step 2: Generating 2-itemset Frequent Pattern
• To discover the set of frequent 2-itemsets, L2, the algorithm uses L1 Join L1 to
generate a candidate set of 2-itemsets, C2.
• Next, the transactions in D are scanned and the support count for each candidate
itemset in C2 is accumulated (as shown in the C2 table above).
• The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate
2-itemsets in C2 having minimum support (a code sketch of this step follows below).
Note: We have not used the Apriori property yet.
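Continuing the sketch from Step 1 (it reuses the illustrative transactions, min_sup_count, and l1 defined there), the join and counting for the 2-itemsets could look like this:

    # Sketch: join L1 with L1 to form C2, count each candidate, and keep L2.
    from itertools import combinations

    items = sorted({item for itemset in l1 for item in itemset})

    # C2: every 2-item combination of frequent 1-items (this is L1 Join L1).
    c2 = [frozenset(pair) for pair in combinations(items, 2)]

    # Scan D once, accumulating the support count of each candidate.
    c2_counts = {cand: sum(1 for t in transactions if cand <= t) for cand in c2}

    # L2: candidates whose support count meets the minimum support count.
    l2 = {cand: count for cand, count in c2_counts.items() if count >= min_sup_count}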
Step 3: Generating 3-itemset Frequent Pattern
Scan D for the count of each candidate in C3; then compare the candidate support counts with
the minimum support count to obtain L3.

C3
Itemset       Support Count
{I1,I2,I3}    2
{I1,I2,I5}    2

L3
Itemset       Support Count
{I1,I2,I3}    2
{I1,I2,I5}    2
The Apriori Algorithm: Example
The generation of the set of candidate 3-itemsets, C3 , involves use of the Apriori
Property.
• In order to find C3, we compute L2 Join L2.
• C3 = L2 Join L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3,I5}, {I2, I4, I5}}.
• Now the Join step is complete, and the Prune step will be used to reduce the size of C3.
The Prune step helps to avoid heavy computation due to a large Ck.
The Apriori Algorithm: Example
For example, let's take {I1, I2, I3}. Its 2-item subsets are {I1, I2}, {I1, I3} & {I2, I3}.
Since all 2-item subsets of {I1, I2, I3} are members of L2, we keep {I1, I2, I3} in C3.
• Let's take another example, {I2, I3, I5}, which shows how the pruning is performed.
Its 2-item subsets are {I2, I3}, {I2, I5} & {I3, I5}.
• BUT, {I3, I5} is not a member of L2 and hence is not frequent, violating the Apriori
property. Thus, we have to remove {I2, I3, I5} from C3.
• Therefore, C3 = {{I1, I2, I3}, {I1, I2, I5}} after checking all members of the result of
the Join operation for pruning.
• Now, the transactions in D are scanned in order to determine L3, consisting of those
candidate 3-itemsets in C3 having minimum support (the join and prune are sketched in code below).
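Continuing the same sketch, the join and prune for the 3-itemsets could be written as follows; the helper name has_infrequent_subset is mine, not from the slides:

    # Sketch: join L2 with L2 (2-itemsets sharing their first item), then prune with the Apriori property.
    from itertools import combinations

    def has_infrequent_subset(candidate, prev_frequent):
        # True if any (k-1)-subset of the candidate is missing from L(k-1).
        k = len(candidate)
        return any(frozenset(sub) not in prev_frequent
                   for sub in combinations(candidate, k - 1))

    # Join step: merge two 2-itemsets of L2 that agree on their first item.
    pairs = sorted(sorted(s) for s in l2)
    c3_joined = {frozenset(a) | frozenset(b)
                 for i, a in enumerate(pairs) for b in pairs[i + 1:]
                 if a[0] == b[0]}
    # c3_joined = {I1I2I3, I1I2I5, I1I3I5, I2I3I4, I2I3I5, I2I4I5}

    # Prune step: drop candidates with an infrequent 2-subset, e.g. {I2,I3,I5} (since {I3,I5} is not in L2).
    c3 = [cand for cand in c3_joined if not has_infrequent_subset(cand, l2)]

    # Scan D to obtain L3 from the surviving candidates.
    c3_counts = {cand: sum(1 for t in transactions if cand <= t) for cand in c3}
    l3 = {cand: n for cand, n in c3_counts.items() if n >= min_sup_count}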
Step 4: Generating 4-itemset Frequent Pattern

The algorithm uses L3 Join L3 to generate a candidate set of 4-itemsets, C4. Although the
join results in {{I1, I2, I3, I5}}, this itemset is pruned since its subset {I2, I3, I5} is not
frequent.
• Thus, C4 = φ, and the algorithm terminates, having found all of the frequent itemsets. This
completes the frequent-itemset mining phase of the Apriori algorithm.
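Putting the steps together, here is a compact, self-contained (and unoptimized) Python sketch of the whole frequent-itemset loop; it is an illustration of the procedure described above, not a reference implementation:

    # Sketch: the full Apriori frequent-itemset loop (join, prune, scan, repeat).
    from itertools import combinations

    def apriori_frequent_itemsets(transactions, min_sup_count):
        # Returns a dict mapping each frequent itemset (frozenset) to its support count.
        counts = {}
        for t in transactions:
            for item in t:
                key = frozenset([item])
                counts[key] = counts.get(key, 0) + 1
        current = {s: c for s, c in counts.items() if c >= min_sup_count}   # L1
        frequent = dict(current)
        k = 2
        while current:
            # Join: merge (k-1)-itemsets that share their first k-2 items,
            # then prune candidates that have an infrequent (k-1)-subset (Apriori property).
            prev = sorted(sorted(s) for s in current)
            candidates = set()
            for i, a in enumerate(prev):
                for b in prev[i + 1:]:
                    if a[:k - 2] == b[:k - 2]:
                        cand = frozenset(a) | frozenset(b)
                        if len(cand) == k and all(frozenset(sub) in current
                                                  for sub in combinations(cand, k - 1)):
                            candidates.add(cand)
            # Scan D once to count the surviving candidates and keep Lk.
            counted = {c: sum(1 for t in transactions if c.issubset(t)) for c in candidates}
            current = {c: n for c, n in counted.items() if n >= min_sup_count}
            frequent.update(current)
            k += 1
        return frequent

    # For the 9-transaction example with min_sup_count = 2, this returns the same
    # L1, L2, and L3 itemsets as the manual walkthrough above.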
Step 5: Generating Association Rules from Frequent Itemsets

Procedure:
• For each frequent itemset l, generate all nonempty subsets of l.
• For every nonempty subset s of l, output the rule "s → (l - s)" if
support_count(l) / support_count(s) >= min_conf, where min_conf is the minimum confidence
threshold.

Example
We had L = {{I1}, {I2}, {I3}, {I4}, {I5}, {I1,I2}, {I1,I3}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}, {I1,I2,I3}, {I1,I2,I5}}.
– Let's take l = {I1, I2, I5}. Its nonempty subsets are {I1,I2}, {I1,I5}, {I2,I5}, {I1}, {I2}, {I5}.
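A short sketch of this rule-generation procedure, continuing from the apriori_frequent_itemsets sketch above (the names are again illustrative):

    # Sketch: generate rules s -> (l - s) whose confidence meets min_conf.
    from itertools import combinations

    def rules_from_itemset(l, support_counts, min_conf):
        # support_counts maps frozensets to support counts (e.g. the dict returned above);
        # by the Apriori property every nonempty subset of a frequent l is present in it.
        rules = []
        for r in range(1, len(l)):
            for antecedent in combinations(l, r):
                s = frozenset(antecedent)
                conf = support_counts[l] / support_counts[s]
                if conf >= min_conf:
                    rules.append((set(s), set(l - s), conf))
        return rules

    # e.g. for l = {I1, I2, I5} and min_conf = 0.7 this keeps
    # I1 ^ I5 -> I2, I2 ^ I5 -> I1, and I5 -> I1 ^ I2 (rules R2, R3, R6 below).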
Step 5: Generating Association Rules from Frequent Itemsets

• Let the minimum confidence threshold be, say, 70%.


• The resulting association rules are shown below, each listed with its confidence.
– R1: I1 ^ I2 I5
• Confidence = sc{I1,I2,I5}/sc{I1,I2} = 2/4 = 50%
• R1 is Rejected.
– R2: I1 ^ I5  I2
• Confidence = sc{I1,I2,I5}/sc{I1,I5} = 2/2 = 100%
• R2 is Selected.
– R3: I2 ^ I5  I1
• Confidence = sc{I1,I2,I5}/sc{I2,I5} = 2/2 = 100%
Step 5: Generating Association Rules from Frequent Itemsets

– R4: I1 I2 ^ I5
• Confidence = sc{I1,I2,I5}/sc{I1} = 2/6 = 33%
• R4 is Rejected.
– R5: I2  I1 ^ I5
• Confidence = sc{I1,I2,I5}/{I2} = 2/7 = 29%
• R5 is Rejected.
– R6: I5  I1 ^ I2
• Confidence = sc{I1,I2,I5}/ {I5} = 2/2 = 100%
• R6 is Selected.
The Apriori Algorithm: Exercise
A database has five transactions. Let min_sup = 60% and min_conf = 80%.

TID     Items_bought
T100    {M, O, N, K, E, Y}
T200    {D, O, N, K, E, Y}
T300    {M, A, K, E}
T400    {M, U, C, K, Y}
T500    {C, O, O, K, I, E}

(a) Find all frequent itemsets using the Apriori algorithm.
(b) List all of the strong association rules (with support s and confidence c).
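One way to check your answer is to reuse the apriori_frequent_itemsets sketch from earlier (an illustrative helper, not a library function); min support of 60% over 5 transactions corresponds to a support count of 3:

    # Sketch: run the earlier apriori_frequent_itemsets sketch on the exercise data.
    exercise_transactions = [
        {'M', 'O', 'N', 'K', 'E', 'Y'},
        {'D', 'O', 'N', 'K', 'E', 'Y'},
        {'M', 'A', 'K', 'E'},
        {'M', 'U', 'C', 'K', 'Y'},
        {'C', 'O', 'O', 'K', 'I', 'E'},
    ]
    # min_sup = 60% of 5 transactions => a minimum support count of 3.
    frequent = apriori_frequent_itemsets(exercise_transactions, min_sup_count=3)
    for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), -kv[1])):
        print(set(itemset), count, f"support = {count / 5:.0%}")
    # Strong rules can then be listed with rules_from_itemset(...) using min_conf = 0.8.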
The Apriori Algorithm: improvement
Hash-Based Technique: This method uses a hash-based structure called a hash table for
generating the k-itemsets and their corresponding counts. It uses a hash function for
generating the table, for example:

h(x, y) = ((order of x) * 10 + (order of y)) mod 7
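A tiny sketch of how such a hash function can be used to prune candidate 2-itemsets; here the "order" of an item is assumed to be its numeric index (I1 → 1, I2 → 2, ...), and transactions refers to the earlier example sketch:

    # Sketch: hash every 2-itemset seen in the data into one of 7 buckets during the first scan.
    from collections import Counter
    from itertools import combinations

    order = {'I1': 1, 'I2': 2, 'I3': 3, 'I4': 4, 'I5': 5}

    def h(x, y):
        return (order[x] * 10 + order[y]) % 7

    bucket_counts = Counter()
    for t in transactions:                      # transactions from the earlier sketch
        for x, y in combinations(sorted(t), 2):
            bucket_counts[h(x, y)] += 1

    # A 2-itemset whose bucket count is below the minimum support count cannot itself be
    # frequent, so it can be removed from C2 before the second scan of D.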


The Apriori Algorithm: improvement
Transaction Reduction: This method reduces the number of transactions scanned in later
iterations. Transactions that do not contain any frequent items are marked or removed.
Partitioning: This method requires only two database scans to mine the frequent itemsets. It
relies on the fact that for any itemset to be potentially frequent in the database, it must be
frequent in at least one of the partitions of the database.
The Apriori Algorithm: improvement
Sampling: This method picks a random sample S from database D and then searches for
frequent itemsets in S. A globally frequent itemset may be missed this way; the risk can be
reduced by lowering min_sup.
Dynamic Itemset Counting: This technique can add new candidate itemsets at any marked
start point during the scan of the database.
Mining multilevel association rules
Approach to mining multilevel association rules
Using uniform minimum support for all levels: The same minimum support threshold is used
when mining at each level of abstraction.
+ One minimum support threshold. No need to examine itemsets containing any item
whose ancestors do not have minimum support.
– Lower-level items do not occur as frequently. If the support threshold is
too high → miss low-level associations
too low → generate too many high-level associations
Approach to mining multilevel association rules
Using reduced minimum support at lower levels: Each level of abstraction has its own
minimum support threshold.
Approach to mining multilevel association rules
1. Level-by-level independent: This is a full-breadth search, where no background knowledge
of frequent itemsets is used for pruning. Each node is examined, regardless of whether or
not its parent node is found to be frequent.
2. Level-cross filtering by single item: An item at the i-th level is examined if and only if its
parent node at the (i-1)-th level is frequent. If a node is frequent, its children will be
examined; otherwise, its descendants are pruned from the search.
Approach to mining multilevel association rules
3. Level-cross filtering by k-itemset: A k-itemset at the i-th level is examined if and only if its
corresponding parent k-itemset at the (i-1)-th level is frequent.
Mining multidimensional association
Single-dimensional or intra-dimension association
buys(X "IBM home computer") => buys(X "Sony b=w printer")
where X is a variable representing customers who purchased items in AllElectronics
transactions. ()
Multidimensional association rules
Association rules that involve two or more dimensions or predicates can be referred to as
multidimensional association rules.
age(X "19-24") ^ occupation(X "student") => buys(X "laptop")
contains three predicates (age, occupation, and buys), each of which occurs only once in the
rule. Hence, we say that it has no repeated predicates. Multidimensional association rules
Mining multidimensional association
Hybrid-dimension association rule
Multidimensional association rules with repeated predicates, i.e., rules that contain multiple
occurrences of some predicate, are called hybrid-dimension association rules. For example:
age(X, "19-24") ^ buys(X, "laptop") => buys(X, "b/w printer")