
Apriori Algorithm

What is an Association Rule?


Association rule learning is a rule-based machine learning method for discovering
interesting relations between variables in large databases. It is intended to identify
strong rules discovered in databases using some measures of interestingness.
Useful Concepts
Support is an indication of how frequently the itemset appears in the dataset.

Support (apple) = (Transactions involving apple) / (Total Transactions)
Useful Concepts
Confidence is an indication of how often the rule has been found to be true.

Confidence (apple → beer) = (Transactions involving both apple and beer) / (Transactions involving apple)
Useful Concepts
Lift: This says how likely item Y is purchased when item X is purchased, while
controlling for how popular item Y is.

Lift (apple → beer) = Confidence (apple → beer) / Support (beer)

If the lift value is less than 1, the customers are unlikely to buy the two items together;
a value greater than 1 means the items are bought together more often than expected, and
the greater the value, the stronger the association.
Exercise
Out of 2000 transactions, 200 contain jam and 300 contain bread. Of the 300 transactions
that contain bread, 100 also contain jam.

Calculate the support, confidence, and lift for the rule jam → bread.


Solution
Support (Jam) = (Transactions involving jam) / (Total transactions)
= 200/2000 = 10%
Confidence (Jam → Bread) = (Transactions involving both bread and jam) / (Transactions involving jam)
= 100/200 = 50%
Support (Bread) = (Transactions involving bread) / (Total transactions)
= 300/2000 = 15%
Lift (Jam → Bread) = Confidence (Jam → Bread) / Support (Bread)
= 50% / 15% ≈ 3.33
Since the lift is greater than 1, jam and bread are bought together more often than would
be expected if the two items were independent.
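As a cross-check, here is a minimal Python sketch of the same calculations; the function and variable names are illustrative, not from the slides:

    # Minimal sketch of the support / confidence / lift calculations above.
    def support(itemset_count, total):
        # Fraction of all transactions that contain the itemset.
        return itemset_count / total

    def confidence(both_count, antecedent_count):
        # Of the transactions containing the antecedent, the fraction that also contain the consequent.
        return both_count / antecedent_count

    def lift(rule_confidence, consequent_support):
        # Confidence of X -> Y divided by the support of Y.
        return rule_confidence / consequent_support

    total = 2000
    jam, bread, jam_and_bread = 200, 300, 100

    s_jam = support(jam, total)                              # 0.10
    c_jam_bread = confidence(jam_and_bread, jam)             # 0.50
    l_jam_bread = lift(c_jam_bread, support(bread, total))   # ~3.33
    print(s_jam, c_jam_bread, l_jam_bread)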
The Apriori Algorithm: Basic
The Apriori Algorithm is an influential algorithm for mining frequent itemsets for
Boolean association rules.

Key Concepts:
• Frequent Itemsets: The sets of items that have minimum support (denoted by Lk for the
frequent k-itemsets).
• Apriori Property: Any subset of a frequent itemset must also be frequent.
• Join Operation: To find Lk, a set of candidate k-itemsets is generated by joining Lk-1
with itself.
The Apriori Algorithm in a Nutshell
1. Find the frequent itemsets: the sets of items that have minimum support
2. A subset of a frequent itemset must also be a frequent itemset
• i.e., if {A, B} is a frequent itemset, both {A} and {B} must also be frequent
itemsets
3. Iteratively find frequent itemsets with cardinality from 1 to k (k-itemset)
4. Use the frequent itemsets to generate association rules.
The Apriori Algorithm: Example
Consider a database, D, consisting of 9 transactions.

TID   List of Items
T1    I1, I2, I5
T2    I2, I4
T3    I2, I3
T4    I1, I2, I4
T5    I1, I3
T6    I2, I3
T7    I1, I3
T8    I1, I2, I3, I5
T9    I1, I2, I3

• Suppose the minimum support count required is 2 (i.e., min_sup = 2/9 ≈ 22%).
• Let the minimum confidence required be 70%.
• We first have to find the frequent itemsets using the Apriori algorithm.
• Then, association rules will be generated using min. support & min. confidence.
The Apriori Algorithm: Example
Step 1: Generating 1-itemset Frequent Pattern
Scan D for the count of each candidate to obtain C1; then compare each candidate's support
count with the minimum support count to obtain L1.

C1
Itemset   Support Count
{I1}      6
{I2}      7
{I3}      6
{I4}      2
{I5}      2

L1
Itemset   Support Count
{I1}      6
{I2}      7
{I3}      6
{I4}      2
{I5}      2

The set of frequent 1-itemsets, L1, consists of the candidate 1-itemsets satisfying
minimum support.
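A small Python sketch of this first pass over the example database (variable names such as transactions and min_sup_count are my own, chosen for illustration):

    # Sketch: build C1 (candidate 1-itemsets with counts) and L1 (those meeting min support).
    from collections import Counter

    transactions = [
        {'I1', 'I2', 'I5'}, {'I2', 'I4'}, {'I2', 'I3'},
        {'I1', 'I2', 'I4'}, {'I1', 'I3'}, {'I2', 'I3'},
        {'I1', 'I3'}, {'I1', 'I2', 'I3', 'I5'}, {'I1', 'I2', 'I3'},
    ]
    min_sup_count = 2

    # C1: count every individual item across all transactions.
    c1 = Counter(item for t in transactions for item in t)

    # L1: keep the items whose count meets the minimum support count.
    l1 = {frozenset([item]): count for item, count in c1.items() if count >= min_sup_count}
    print(l1)   # all five items survive: I1:6, I2:7, I3:6, I4:2, I5:2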
Step 2: Generating 2-itemset Frequent Pattern
Generate the candidates C2 from L1 (L1 Join L1), then scan D for the count of each candidate.

C2
Itemset    Support Count
{I1,I2}    4
{I1,I3}    4
{I1,I4}    1
{I1,I5}    2
{I2,I3}    4
{I2,I4}    2
{I2,I5}    2
{I3,I4}    0
{I3,I5}    1
{I4,I5}    0

Compare each candidate's support count with the minimum support count to obtain L2.

L2
Itemset    Support Count
{I1,I2}    4
{I1,I3}    4
{I1,I5}    2
{I2,I3}    4
{I2,I4}    2
{I2,I5}    2
The Apriori Algorithm: Example
Step 2: Generating 2-itemset Frequent Pattern
• To discover the set of frequent 2-itemsets, L2, the algorithm uses L1 Join L1 to
generate a candidate set of 2-itemsets, C2.
• Next, the transactions in D are scanned and the support count for each candidate
itemset in C2 is accumulated (as shown in the C2 table above).
• The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate
2-itemsets in C2 having minimum support (a code sketch of this step follows below).
Note: We have not used the Apriori property yet.
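Continuing the sketch from Step 1 (it reuses the illustrative transactions, min_sup_count, and l1 defined there), the join and counting for the 2-itemsets could look like this:

    # Sketch: join L1 with L1 to form C2, count each candidate, and keep L2.
    from itertools import combinations

    items = sorted({item for itemset in l1 for item in itemset})

    # C2: every 2-item combination of frequent 1-items (this is L1 Join L1).
    c2 = [frozenset(pair) for pair in combinations(items, 2)]

    # Scan D once, accumulating the support count of each candidate.
    c2_counts = {cand: sum(1 for t in transactions if cand <= t) for cand in c2}

    # L2: candidates whose support count meets the minimum support count.
    l2 = {cand: count for cand, count in c2_counts.items() if count >= min_sup_count}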
Step 3: Generating 3-itemset Frequent Pattern
Scan D for the count of each candidate in C3; then compare the candidate support counts with
the minimum support count to obtain L3.

C3
Itemset       Support Count
{I1,I2,I3}    2
{I1,I2,I5}    2

L3
Itemset       Support Count
{I1,I2,I3}    2
{I1,I2,I5}    2
The Apriori Algorithm: Example
The generation of the set of candidate 3-itemsets, C3 , involves use of the Apriori
Property.
• In order to find C3, we compute L2 Join L2.
• C3 = L2 Join L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3,I5}, {I2, I4, I5}}.
• Now the Join step is complete, and the Prune step will be used to reduce the size of C3.
The Prune step helps to avoid heavy computation due to a large Ck.
The Apriori Algorithm: Example
For example, let's take {I1, I2, I3}. Its 2-item subsets are {I1, I2}, {I1, I3} & {I2, I3}.
Since all 2-item subsets of {I1, I2, I3} are members of L2, we keep {I1, I2, I3} in C3.
• Let's take another example, {I2, I3, I5}, which shows how the pruning is performed.
Its 2-item subsets are {I2, I3}, {I2, I5} & {I3, I5}.
• BUT, {I3, I5} is not a member of L2 and hence is not frequent, violating the Apriori
property. Thus, we have to remove {I2, I3, I5} from C3.
• Therefore, C3 = {{I1, I2, I3}, {I1, I2, I5}} after checking all members of the result of
the Join operation for pruning.
• Now, the transactions in D are scanned in order to determine L3, consisting of those
candidate 3-itemsets in C3 having minimum support (the join and prune are sketched in code below).
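Continuing the same sketch, the join and prune for the 3-itemsets could be written as follows; the helper name has_infrequent_subset is mine, not from the slides:

    # Sketch: join L2 with L2 (2-itemsets sharing their first item), then prune with the Apriori property.
    from itertools import combinations

    def has_infrequent_subset(candidate, prev_frequent):
        # True if any (k-1)-subset of the candidate is missing from L(k-1).
        k = len(candidate)
        return any(frozenset(sub) not in prev_frequent
                   for sub in combinations(candidate, k - 1))

    # Join step: merge two 2-itemsets of L2 that agree on their first item.
    pairs = sorted(sorted(s) for s in l2)
    c3_joined = {frozenset(a) | frozenset(b)
                 for i, a in enumerate(pairs) for b in pairs[i + 1:]
                 if a[0] == b[0]}
    # c3_joined = {I1I2I3, I1I2I5, I1I3I5, I2I3I4, I2I3I5, I2I4I5}

    # Prune step: drop candidates with an infrequent 2-subset, e.g. {I2,I3,I5} (since {I3,I5} is not in L2).
    c3 = [cand for cand in c3_joined if not has_infrequent_subset(cand, l2)]

    # Scan D to obtain L3 from the surviving candidates.
    c3_counts = {cand: sum(1 for t in transactions if cand <= t) for cand in c3}
    l3 = {cand: n for cand, n in c3_counts.items() if n >= min_sup_count}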
Step 4: Generating 4-itemset Frequent Pattern

The algorithm uses L3 Join L3 to generate a candidate set of 4-itemsets, C4. Although the
join results in {{I1, I2, I3, I5}}, this itemset is pruned since its subset {I2, I3, I5} is not
frequent.
• Thus, C4 = φ, and the algorithm terminates, having found all of the frequent itemsets. This
completes the frequent-itemset mining phase of the Apriori algorithm.
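Putting the steps together, here is a compact, self-contained (and unoptimized) Python sketch of the whole frequent-itemset loop; it is an illustration of the procedure described above, not a reference implementation:

    # Sketch: the full Apriori frequent-itemset loop (join, prune, scan, repeat).
    from itertools import combinations

    def apriori_frequent_itemsets(transactions, min_sup_count):
        # Returns a dict mapping each frequent itemset (frozenset) to its support count.
        counts = {}
        for t in transactions:
            for item in t:
                key = frozenset([item])
                counts[key] = counts.get(key, 0) + 1
        current = {s: c for s, c in counts.items() if c >= min_sup_count}   # L1
        frequent = dict(current)
        k = 2
        while current:
            # Join: merge (k-1)-itemsets that share their first k-2 items,
            # then prune candidates that have an infrequent (k-1)-subset (Apriori property).
            prev = sorted(sorted(s) for s in current)
            candidates = set()
            for i, a in enumerate(prev):
                for b in prev[i + 1:]:
                    if a[:k - 2] == b[:k - 2]:
                        cand = frozenset(a) | frozenset(b)
                        if len(cand) == k and all(frozenset(sub) in current
                                                  for sub in combinations(cand, k - 1)):
                            candidates.add(cand)
            # Scan D once to count the surviving candidates and keep Lk.
            counted = {c: sum(1 for t in transactions if c.issubset(t)) for c in candidates}
            current = {c: n for c, n in counted.items() if n >= min_sup_count}
            frequent.update(current)
            k += 1
        return frequent

    # For the 9-transaction example with min_sup_count = 2, this returns the same
    # L1, L2, and L3 itemsets as the manual walkthrough above.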
Step 5: Generating Association Rules from Frequent Itemsets

Procedure:
• For each frequent itemset l, generate all nonempty subsets of l.
• For every nonempty subset s of l, output the rule "s → (l - s)" if
support_count(l) / support_count(s) >= min_conf, where min_conf is the minimum confidence
threshold.

Example
We had L = {{I1}, {I2}, {I3}, {I4}, {I5}, {I1,I2}, {I1,I3}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}, {I1,I2,I3}, {I1,I2,I5}}.
– Let's take l = {I1, I2, I5}. Its nonempty subsets are {I1,I2}, {I1,I5}, {I2,I5}, {I1}, {I2}, {I5}.
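A short sketch of this rule-generation procedure, continuing from the apriori_frequent_itemsets sketch above (the names are again illustrative):

    # Sketch: generate rules s -> (l - s) whose confidence meets min_conf.
    from itertools import combinations

    def rules_from_itemset(l, support_counts, min_conf):
        # support_counts maps frozensets to support counts (e.g. the dict returned above);
        # by the Apriori property every nonempty subset of a frequent l is present in it.
        rules = []
        for r in range(1, len(l)):
            for antecedent in combinations(l, r):
                s = frozenset(antecedent)
                conf = support_counts[l] / support_counts[s]
                if conf >= min_conf:
                    rules.append((set(s), set(l - s), conf))
        return rules

    # e.g. for l = {I1, I2, I5} and min_conf = 0.7 this keeps
    # I1 ^ I5 -> I2, I2 ^ I5 -> I1, and I5 -> I1 ^ I2 (rules R2, R3, R6 below).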
Step 5: Generating Association Rules from Frequent Itemsets

• Let the minimum confidence threshold be, say, 70%.


• The resulting association rules are shown below, each listed with its confidence.
– R1: I1 ^ I2 I5
• Confidence = sc{I1,I2,I5}/sc{I1,I2} = 2/4 = 50%
• R1 is Rejected.
– R2: I1 ^ I5  I2
• Confidence = sc{I1,I2,I5}/sc{I1,I5} = 2/2 = 100%
• R2 is Selected.
– R3: I2 ^ I5  I1
• Confidence = sc{I1,I2,I5}/sc{I2,I5} = 2/2 = 100%
Step 5: Generating Association Rules from Frequent Itemsets

– R4: I1 I2 ^ I5
• Confidence = sc{I1,I2,I5}/sc{I1} = 2/6 = 33%
• R4 is Rejected.
– R5: I2  I1 ^ I5
• Confidence = sc{I1,I2,I5}/{I2} = 2/7 = 29%
• R5 is Rejected.
– R6: I5  I1 ^ I2
• Confidence = sc{I1,I2,I5}/ {I5} = 2/2 = 100%
• R6 is Selected.
The Apriori Algorithm: Exercise
A database has five transactions. Let min_sup = 60% and min_conf = 80%.

TID     Items_bought
T100    {M, O, N, K, E, Y}
T200    {D, O, N, K, E, Y}
T300    {M, A, K, E}
T400    {M, U, C, K, Y}
T500    {C, O, O, K, I, E}

(a) Find all frequent itemsets using the Apriori algorithm.
(b) List all of the strong association rules (with support s and confidence c).
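One way to check your answer is to reuse the apriori_frequent_itemsets sketch from earlier (an illustrative helper, not a library function); min support of 60% over 5 transactions corresponds to a support count of 3:

    # Sketch: run the earlier apriori_frequent_itemsets sketch on the exercise data.
    exercise_transactions = [
        {'M', 'O', 'N', 'K', 'E', 'Y'},
        {'D', 'O', 'N', 'K', 'E', 'Y'},
        {'M', 'A', 'K', 'E'},
        {'M', 'U', 'C', 'K', 'Y'},
        {'C', 'O', 'O', 'K', 'I', 'E'},
    ]
    # min_sup = 60% of 5 transactions => a minimum support count of 3.
    frequent = apriori_frequent_itemsets(exercise_transactions, min_sup_count=3)
    for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), -kv[1])):
        print(set(itemset), count, f"support = {count / 5:.0%}")
    # Strong rules can then be listed with rules_from_itemset(...) using min_conf = 0.8.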
The Apriori Algorithm: improvement
Hash-Based Technique: This method uses a hash-based structure called a hash table for
generating the k-itemsets and their corresponding counts. It uses a hash function for
generating the table, for example:

h(x, y) = ((order of x) * 10 + (order of y)) mod 7
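A tiny sketch of how such a hash function can be used to prune candidate 2-itemsets; here the "order" of an item is assumed to be its numeric index (I1 → 1, I2 → 2, ...), and transactions refers to the earlier example sketch:

    # Sketch: hash every 2-itemset seen in the data into one of 7 buckets during the first scan.
    from collections import Counter
    from itertools import combinations

    order = {'I1': 1, 'I2': 2, 'I3': 3, 'I4': 4, 'I5': 5}

    def h(x, y):
        return (order[x] * 10 + order[y]) % 7

    bucket_counts = Counter()
    for t in transactions:                      # transactions from the earlier sketch
        for x, y in combinations(sorted(t), 2):
            bucket_counts[h(x, y)] += 1

    # A 2-itemset whose bucket count is below the minimum support count cannot itself be
    # frequent, so it can be removed from C2 before the second scan of D.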


The Apriori Algorithm: improvement
Transaction Reduction: This method reduces the number of transactions scanned in later
iterations. Transactions that do not contain any frequent items are marked or removed.
Partitioning: This method requires only two database scans to mine the frequent itemsets. It
relies on the fact that for any itemset to be potentially frequent in the database, it must be
frequent in at least one of the partitions of the database.
The Apriori Algorithm: improvement
Sampling: This method picks a random sample S from database D and then searches for
frequent itemsets in S. A globally frequent itemset may be missed this way; the risk can be
reduced by lowering min_sup.
Dynamic Itemset Counting: This technique can add new candidate itemsets at any marked
start point during the scan of the database.
Mining multilevel association rules
Approach to mining multilevel association rules
Using uniform minimum support for all levels: The same minimum support threshold is used
when mining at each level of abstraction.
+ One minimum support threshold. No need to examine itemsets containing any item
whose ancestors do not have minimum support.
– Lower-level items do not occur as frequently. If the support threshold is
too high → miss low-level associations
too low → generate too many high-level associations
Approach to mining multilevel association rules
Using reduced minimum support at lower levels: Each level of abstraction has its own
minimum support threshold.
Approach to mining multilevel association rules
1. Level-by-level independent: This is a full-breadth search, where no background knowledge
of frequent itemsets is used for pruning. Each node is examined, regardless of whether or
not its parent node is found to be frequent.
2. Level-cross filtering by single item: An item at the i-th level is examined if and only if its
parent node at the (i-1)-th level is frequent. If a node is frequent, its children will be
examined; otherwise, its descendants are pruned from the search.
Approach to mining multilevel association rules
3. Level-cross filtering by k-itemset: A k-itemset at the i-th level is examined if and only if its
corresponding parent k-itemset at the (i-1)-th level is frequent.
Mining multidimensional association
Single-dimensional or intra-dimension association
buys(X "IBM home computer") => buys(X "Sony b=w printer")
where X is a variable representing customers who purchased items in AllElectronics
transactions. ()
Multidimensional association rules
Association rules that involve two or more dimensions or predicates can be referred to as
multidimensional association rules.
age(X "19-24") ^ occupation(X "student") => buys(X "laptop")
contains three predicates (age, occupation, and buys), each of which occurs only once in the
rule. Hence, we say that it has no repeated predicates. Multidimensional association rules
Mining multidimensional association
Hybrid-dimension association rule
Multidimensional association rules with repeated predicates, i.e., rules that contain multiple
occurrences of some predicate, are called hybrid-dimension association rules. For example:
age(X, "19-24") ^ buys(X, "laptop") => buys(X, "b/w printer")