
BITS Pilani, Hyderabad Campus
Dr. Aruna Malapati
Asst Professor, Department of CSIS

Association Rule Mining using Apriori and FP-Tree
Today’s Learning Objectives

• Generate rules from a given frequent k-itemset
• Define and use the anti-monotone property of confidence to prune rules
• Define and use maximal frequent and closed itemsets
• Efficiently compute support using vertical partitioning of the data (the ECLAT algorithm)
• Construct an FP-tree from transaction data
• Extract the frequent itemsets using conditional FP-trees



Rule Generation

• Given a frequent k-itemset Y, how many association rules can it produce?
  2^k − 2, ignoring rules with an empty antecedent or consequent.
• An association rule can be extracted by partitioning the itemset Y into two non-empty subsets, X and Y − X, such that X → Y − X satisfies the confidence threshold.

If Y is the frequent 3-itemset {1,2,3}, what rules can it generate?

Computing confidence requires only support counts: c(X → Y − X) = supp(Y) / supp(X), and both supports are already available from frequent itemset generation.
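A minimal sketch (not from the slides) of enumerating the 2^k − 2 candidate rules of a frequent itemset by iterating over its non-empty proper subsets:

    from itertools import chain, combinations

    def candidate_rules(itemset):
        """Yield all 2^k - 2 candidate rules X -> Y - X from a frequent
        k-itemset Y (non-empty antecedent and consequent)."""
        items = set(itemset)
        proper_subsets = chain.from_iterable(
            combinations(items, r) for r in range(1, len(items)))
        for antecedent in proper_subsets:
            X = set(antecedent)
            yield X, items - X

    for X, rest in candidate_rules({1, 2, 3}):
        print(sorted(X), "->", sorted(rest))  # 2^3 - 2 = 6 rules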
Rule Generation

• Given a frequent itemset L, find all non-empty proper subsets f ⊂ L such that f → L − f satisfies the minimum confidence requirement
  – If {A,B,C,D} is a frequent itemset, candidate rules:
    ABC → D, ABD → C, ACD → B, BCD → A,
    A → BCD, B → ACD, C → ABD, D → ABC,
    AB → CD, AC → BD, AD → BC, BC → AD,
    BD → AC, CD → AB
• If |L| = k, then there are 2^k − 2 candidate association rules (ignoring L → ∅ and ∅ → L)



Confidence-based pruning



Rule Generation

• How can we efficiently generate rules from frequent itemsets?
  – In general, confidence does not have an anti-monotone property:
    c(ABC → D) can be larger or smaller than c(AB → D)
  – But confidence of rules generated from the same itemset has an anti-monotone property
  – e.g., L = {A,B,C,D}:
    c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
    (the numerator supp(ABCD) stays fixed, while the antecedent's support can only grow as the antecedent shrinks)
• Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule



Rule generation using
Apriori

• Apriori uses a level-wise approach for rule generation, where each level corresponds to the number of items in the rule consequent.
• Initially, all high-confidence rules that have only one item in the rule consequent are extracted; these are then used to generate subsequent rules (see the sketch below).
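A rough sketch of this level-wise procedure with confidence-based pruning (assuming a `support` dictionary from frozensets to counts, which is not part of the slides):

    from itertools import combinations

    def generate_rules(freq_itemset, support, min_conf):
        """Level-wise rule generation: grow the consequent one item at a
        time, pruning via the anti-monotone property of confidence."""
        k = len(freq_itemset)
        rules = []
        consequents = [frozenset([i]) for i in freq_itemset]  # level 1
        while consequents and len(consequents[0]) < k:
            survivors = []
            for cons in consequents:
                antecedent = freq_itemset - cons
                conf = support[freq_itemset] / support[antecedent]
                if conf >= min_conf:
                    rules.append((antecedent, cons, conf))
                    survivors.append(cons)
                # else: every rule whose consequent is a superset of
                # `cons` is pruned (anti-monotone property).
            consequents = list({a | b for a, b in combinations(survivors, 2)
                                if len(a | b) == len(survivors[0]) + 1})
        return rules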



Rule Generation for Apriori
Algorithm
Lattice of rules for L = {A,B,C,D}, from ABCD ⇒ { } at the top down to single-item antecedents:

  BCD ⇒ A, ACD ⇒ B, ABD ⇒ C, ABC ⇒ D
  CD ⇒ AB, BD ⇒ AC, BC ⇒ AD, AD ⇒ BC, AC ⇒ BD, AB ⇒ CD
  D ⇒ ABC, C ⇒ ABD, B ⇒ ACD, A ⇒ BCD

[Figure: once a rule in the lattice is found to have low confidence, all rules below it (those whose consequents are supersets of its consequent) are pruned.]
Maximal Frequent Itemset

• A compact representation of frequent itemsets is extremely important in association rule mining.
• If the length of a frequent itemset is k, we know all of its 2^k subsets are also frequent, because of the downward closure property.
• Sometimes, when the computation is very expensive and we are not interested in all associations, generating these additional subsets can be avoided: we can just look at the frequent itemsets of maximum length (see the sketch below).
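A minimal sketch (illustrative, not from the slides) of filtering a collection of frequent itemsets down to the maximal ones:

    def maximal_frequent(frequent):
        """Keep itemsets none of whose immediate supersets is frequent.
        `frequent` is a set of frozensets; checking immediate supersets
        suffices because of downward closure."""
        return [X for X in frequent
                if not any(len(Y) == len(X) + 1 and X < Y for Y in frequent)]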



Compression of Itemset
Information
TID  Items
1    ABC
2    ABCD
3    BCE
4    ACDE
5    DE

[Figure: the itemset lattice over {A,B,C,D,E}, with each node annotated by the IDs of the transactions that contain it; itemsets supported by no transaction (e.g., ABCDE) carry no annotation.]


Maximal Frequent Itemset
An itemset is maximal frequent if none of its immediate supersets is frequent
[Figure: the itemset lattice split by the border between frequent and infrequent itemsets; the maximal frequent itemsets are the frequent itemsets lying immediately below the border.]
Maximal vs Closed Itemsets
[Figure: the same transaction-annotated itemset lattice as on the previous slides, repeated here to compare maximal and closed itemsets.]


Maximal vs Closed
Frequent Itemsets
Minimum support = 2

An itemset is closed if none of its immediate supersets has exactly the same support; it is closed frequent if, in addition, its support is at least minsup.

[Figure: the lattice with support counts, highlighting itemsets that are closed but not maximal and itemsets that are both closed and maximal. Here # Closed = 9 and # Maximal = 4.]
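Given the support of every frequent itemset (e.g., from Apriori), the closed frequent ones could be filtered out as in this sketch (illustrative, not from the slides):

    def closed_frequent(support, min_support):
        """Keep frequent itemsets no immediate superset of which has the
        same support. `support` maps frozensets to counts and must cover
        all frequent itemsets (an equal-support superset is frequent)."""
        return [X for X, s in support.items()
                if s >= min_support and
                not any(len(Y) == len(X) + 1 and X < Y and support[Y] == s
                        for Y in support)]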


Maximal vs Closed Itemsets

[Figure: nested sets: maximal frequent itemsets ⊆ closed frequent itemsets ⊆ frequent itemsets.]


Determining the support of non-
closed frequent itemsets
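The slide's figure is not reproduced here. The standard property it illustrates: the support of a non-closed frequent itemset equals the largest support among its immediate supersets, so supports can be filled in from the closed itemsets, processing longer itemsets first. A sketch under that assumption:

    def fill_supports(closed_support, all_frequent):
        """Recover supp(X) for non-closed frequent itemsets as the max
        support of X's immediate supersets, longest itemsets first.
        `closed_support` maps closed frozensets to counts."""
        support = dict(closed_support)
        for X in sorted(all_frequent, key=len, reverse=True):
            if X not in support:
                support[X] = max(support[Y] for Y in support
                                 if len(Y) == len(X) + 1 and X < Y)
        return support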



Characteristics of Apriori
algorithm



Weaknesses of Apriori



Alternative methods for generating frequent
itemsets: Traversal of itemset lattice



Alternative methods for generating
frequent itemsets: Equivalence classes



Prefix and suffix trees



Alternative Methods for Frequent Itemset
Generation: Breadth-first vs Depth-first



Alternative Methods for
Frequent Itemset Generation
Representation of the database: horizontal vs. vertical data layout

Horizontal data layout:
TID  Items
1    A,B,E
2    B,C,D
3    C,E
4    A,C,D
5    A,B,C,D
6    A,E
7    A,B
8    A,B,C
9    A,C,D
10   B

Vertical data layout (item → list of TIDs containing it):
A: 1, 4, 5, 6, 7, 8, 9
B: 1, 2, 5, 7, 8, 10
C: 2, 3, 4, 5, 8, 9
D: 2, 4, 5, 9
E: 1, 3, 6
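In the vertical layout, the support of an itemset is the size of the intersection of its items' TID-lists; this is the core of ECLAT. A minimal sketch (illustrative, not the canonical implementation):

    def eclat(prefix, items, min_support, frequent):
        """Depth-first search over the itemset lattice using TID-set
        intersections. `items` is a list of (item, tidset) pairs whose
        tidsets already meet min_support."""
        while items:
            item, tids = items.pop()
            frequent[frozenset(prefix + [item])] = len(tids)
            suffix = []
            for other, other_tids in items:
                common = tids & other_tids
                if len(common) >= min_support:
                    suffix.append((other, common))
            eclat(prefix + [item], suffix, min_support, frequent)

    # Vertical layout from the table above (support threshold = 2):
    vertical = {'A': {1, 4, 5, 6, 7, 8, 9}, 'B': {1, 2, 5, 7, 8, 10},
                'C': {2, 3, 4, 5, 8, 9}, 'D': {2, 4, 5, 9}, 'E': {1, 3, 6}}
    freq = {}
    eclat([], sorted(vertical.items()), 2, freq)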



Bottleneck of Frequent-
pattern Mining
• Multiple database scans are costly
• Mining long patterns needs many passes of scanning
and generates lots of candidates
– To find the frequent itemset i1 i2 … i100:
  • # of scans: 100
  • # of candidates: C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27 × 10^30 !
• Bottleneck: candidate-generation-and-test
• Can we avoid candidate generation?



FP-growth Algorithm

• Encodes the data set into a compact data structure called an FP-tree
• FP-Growth allows frequent itemset discovery without candidate itemset generation. Two-step approach:
  – Step 1: Build a compact data structure called the FP-tree, using 2 passes over the data set.
  – Step 2: Extract frequent itemsets directly from the FP-tree.



Step 1: FP-Tree
Construction
• The FP-tree is constructed using 2 passes over the data set.
• Pass 1:
  – Scan the data and find the support of each item.
  – Discard infrequent items.
  – Sort the frequent items in decreasing order of support.
  – Use this order when building the FP-tree, so that common prefixes can be shared.



Step 1: FP-Tree
Construction
Pass 2: nodes correspond to items and carry a counter.
1. FP-Growth reads one transaction at a time and maps it to a path.
2. A fixed item order is used, so paths can overlap when transactions share items (i.e., have the same prefix); in that case, the shared counters are incremented.
3. Pointers are maintained between nodes containing the same item, creating singly linked lists (the dotted lines). The more paths overlap, the higher the compression; the FP-tree may then fit in memory.
4. Frequent itemsets are extracted from the FP-tree.

Both passes are sketched in the code below.
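A compact sketch of the two-pass construction (a from-scratch illustration, not the original authors' code):

    from collections import defaultdict

    class Node:
        def __init__(self, item, parent):
            self.item, self.parent = item, parent
            self.count = 1
            self.children = {}

    def build_fp_tree(transactions, min_support):
        # Pass 1: count item supports and keep only frequent items.
        counts = defaultdict(int)
        for t in transactions:
            for item in t:
                counts[item] += 1
        frequent = {i for i, c in counts.items() if c >= min_support}

        root = Node(None, None)
        header = defaultdict(list)  # item -> linked list of its nodes
        # Pass 2: insert transactions, items in decreasing support order.
        for t in transactions:
            items = sorted((i for i in t if i in frequent),
                           key=lambda i: (-counts[i], i))
            node = root
            for item in items:
                if item in node.children:
                    node.children[item].count += 1  # shared prefix
                else:
                    child = Node(item, node)
                    node.children[item] = child
                    header[item].append(child)  # maintain the node links
                node = node.children[item]
        return root, header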



FP-tree construction

TID  Items
1    {A,B}
2    {B,C,D}
3    {A,C,D,E}
4    {A,D,E}
5    {A,B,C}
6    {A,B,C,D}
7    {B,C}
8    {A,B,C}
9    {A,B,D}
10   {B,C,E}

After reading TID=1:  null → A:1 → B:1
After reading TID=2:  null now has two branches: A:1 → B:1 and B:1 → C:1 → D:1


FP-Tree Construction
Transaction database:
TID  Items
1    {A,B}
2    {B,C,D}
3    {A,C,D,E}
4    {A,D,E}
5    {A,B,C}
6    {A,B,C,D}
7    {B,C}
8    {A,B,C}
9    {A,B,D}
10   {B,C,E}

Final FP-tree (node:count; transactions inserted with items in the order A, B, C, D, E):

null
├── A:7
│   ├── B:5
│   │   ├── C:3 ── D:1
│   │   └── D:1
│   ├── C:1 ── D:1 ── E:1
│   └── D:1 ── E:1
└── B:3
    └── C:3
        ├── D:1
        └── E:1

Header table: Item → Pointer, for items A, B, C, D, E. The pointers chain together all nodes holding the same item and are used to assist frequent itemset generation.
FP-Tree size
• The FP-tree is usually smaller than the uncompressed data, because many transactions typically share items (and hence prefixes).
  – Best case: all transactions contain the same set of items, giving a single path in the FP-tree.
  – Worst case: every transaction has a unique set of items (no items in common), so the FP-tree is at least as large as the original data, and its storage requirements are higher because of the node pointers and counters.
• The size of the FP-tree also depends on how the items are ordered.
• Ordering by decreasing support is typically used, but it does not always lead to the smallest tree (it is a heuristic).


FP-Tree Construction with
different ordering scheme

[Figure: the FP-trees obtained when items are ordered by lowest support vs. by highest support.]


Step 2: Frequent Itemset
Generation

• FP-Growth extracts frequent itemsets from the FP-tree.
• Bottom-up algorithm: it works from the leaves towards the root.
• Divide and conquer: first look for frequent itemsets ending in e, then de, etc., then d, then cd, etc.
• First, extract the prefix-path sub-trees ending in an item(set) (hint: use the linked lists).


Prefix path sub-trees
(Example)



Step 2: Frequent Itemset
Generation
• Each prefix-path sub-tree is processed recursively to extract the frequent itemsets; the solutions are then merged.
  – E.g., the prefix-path sub-tree for e will be used to extract frequent itemsets ending in e, then in de, ce, be and ae, then in cde, bde, ade, etc.
  – Divide and conquer approach (sketched below).
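A sketch of the recursive mining step, reusing the `Node` and `build_fp_tree` sketch from the construction slide (illustrative, not the original algorithm's exact pseudocode):

    def fp_growth(header, counts, suffix, min_support, results):
        """For each item (bottom-up, i.e., lowest support first), record
        the itemset {item} + suffix, collect the item's prefix paths
        (its conditional pattern base), then recurse on the conditional
        FP-tree built from those paths."""
        for item in sorted(header, key=lambda i: counts[i]):
            support = sum(node.count for node in header[item])
            if support < min_support:
                continue
            new_suffix = [item] + suffix
            results[frozenset(new_suffix)] = support
            # Conditional pattern base: prefix paths ending at `item`.
            cond_patterns = []
            for node in header[item]:
                path, parent = [], node.parent
                while parent is not None and parent.item is not None:
                    path.append(parent.item)
                    parent = parent.parent
                cond_patterns.extend([path] * node.count)
            if any(cond_patterns):
                cond_root, cond_header = build_fp_tree(cond_patterns,
                                                       min_support)
                cond_counts = {i: sum(n.count for n in nodes)
                               for i, nodes in cond_header.items()}
                fp_growth(cond_header, cond_counts, new_suffix,
                          min_support, results)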



Example



Conditional FP-Tree



Result

• Frequent itemsets found (ordered by suffix and by the order in which they are found):


Pros and cons of FP-growth

• Advantages of FP-Growth
  – only 2 passes over the data set
  – “compresses” the data set
  – no candidate generation
  – much faster than Apriori
• Disadvantages of FP-Growth
  – the FP-tree may not fit in memory!
  – the FP-tree is expensive to build


Limitations of the Support-Confidence Framework

• Pattern mining generates a large set of patterns/rules.
• Not all of the generated patterns/rules are interesting.
• Interestingness measures: objective vs. subjective.
  – Objective: measured using mathematical formulas and the same for any user, e.g., support and confidence.
  – Subjective: one person’s trash can be another’s treasure.


Example



Interestingness measure: Lift
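The slide’s figure is not reproduced here. For reference, the standard definition (with supports taken as fractions of all transactions) is:

  lift(X → Y) = c(X → Y) / supp(Y) = supp(X ∪ Y) / (supp(X) × supp(Y))

Lift > 1 suggests X and Y are positively correlated, lift = 1 suggests independence, and lift < 1 suggests negative correlation.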



Take home message

• To reduce the generation of low-confidence rules, we use the anti-monotone property of confidence discussed above.
• Instead of generating all frequent itemsets, we can make use of maximal frequent and closed itemsets.
• The ECLAT algorithm exploits a vertical partitioning of the database to compute the support of itemsets efficiently.
• The frequent-pattern (FP) tree organizes the itemsets based on their support.


min_support = 3

TID  Items bought               (ordered) frequent items
100  {f, a, c, d, g, i, m, p}   {f, c, a, m, p}
200  {a, b, c, f, l, m, o}      {f, c, a, b, m}
300  {b, f, h, j, o}            {f, b}
400  {b, c, k, s, p}            {c, b, p}
500  {a, f, c, e, l, p, m, n}   {f, c, a, m, p}

Item frequencies: f 4, c 4, a 3, b 3, m 3, p 3

FP-tree (items ordered by decreasing support):

root
├── f:4
│   ├── c:3
│   │   └── a:3
│   │       ├── m:2 ── p:2
│   │       └── b:1 ── m:1
│   └── b:1
└── c:1
    └── b:1 ── p:1

Header table: f, c, a, b, m, p, each pointing to the chain of its nodes.

Example in which the items are ordered by decreasing support.
Example

TID  Items bought       (ordered) frequent items
1    {K, A, D, B}       {A, B, D}
2    {D, A, C, E, B}    {A, B, D}
3    {C, A, B, E}       {A, B}
4    {B, A, D}          {A, B, D}

Item frequencies: A 4, B 4, D 3, C 2, E 2, K 1 (with min_support = 3, only A, B and D are kept)
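As a quick check of the sketches above (hypothetical driver code tying `build_fp_tree` and `fp_growth` to this example):

    transactions = [['K', 'A', 'D', 'B'], ['D', 'A', 'C', 'E', 'B'],
                    ['C', 'A', 'B', 'E'], ['B', 'A', 'D']]
    root, header = build_fp_tree(transactions, min_support=3)
    counts = {i: sum(n.count for n in nodes) for i, nodes in header.items()}
    results = {}
    fp_growth(header, counts, [], 3, results)
    # Expected: {A}:4, {B}:4, {D}:3, {A,B}:4, {A,D}:3, {B,D}:3, {A,B,D}:3
    for itemset, support in sorted(results.items(), key=lambda kv: len(kv[0])):
        print(sorted(itemset), support)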
