
BITS Pilani, Hyderabad Campus
Dr. Aruna Malapati
Asst Professor, Department of CSIS

Association Rule Mining using Apriori and FP-Tree
Today’s Learning Objectives

• Generate rules from a given frequent k-itemset
• Define and use the anti-monotone property of confidence to prune rules
• Define and use maximal frequent and closed itemsets
• Efficiently compute support using vertical partitioning of the data (the ECLAT algorithm)
• Construct an FP-tree from transaction data
• Extract the frequent itemsets using conditional FP-trees



Rule Generation

• Given a frequent k-itemset Y, how many association rules can it produce?
  2^k − 2, ignoring rules with an empty antecedent or consequent.
• An association rule can be extracted by partitioning the itemset Y into two non-empty subsets, X and Y − X, such that X → Y − X satisfies the confidence threshold.

If Y is the frequent 3-itemset {1,2,3}, what rules can it generate?

Computing confidence requires only support counts: c(X → Y − X) = supp(Y) / supp(X), and both supports are already available from frequent itemset generation.
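A minimal sketch (not from the slides) of enumerating the 2^k − 2 candidate rules of a frequent itemset by iterating over its non-empty proper subsets:

    from itertools import chain, combinations

    def candidate_rules(itemset):
        """Yield all 2^k - 2 candidate rules X -> Y - X from a frequent
        k-itemset Y (non-empty antecedent and consequent)."""
        items = set(itemset)
        proper_subsets = chain.from_iterable(
            combinations(items, r) for r in range(1, len(items)))
        for antecedent in proper_subsets:
            X = set(antecedent)
            yield X, items - X

    for X, rest in candidate_rules({1, 2, 3}):
        print(sorted(X), "->", sorted(rest))  # 2^3 - 2 = 6 rules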
Rule Generation

• Given a frequent itemset L, find all non-empty proper subsets f ⊂ L such that f → L − f satisfies the minimum confidence requirement
  – If {A,B,C,D} is a frequent itemset, candidate rules:
    ABC → D, ABD → C, ACD → B, BCD → A,
    A → BCD, B → ACD, C → ABD, D → ABC,
    AB → CD, AC → BD, AD → BC, BC → AD,
    BD → AC, CD → AB
• If |L| = k, then there are 2^k − 2 candidate association rules (ignoring L → ∅ and ∅ → L)



Confidence-based pruning



Rule Generation

• How can we efficiently generate rules from frequent itemsets?
  – In general, confidence does not have an anti-monotone property:
    c(ABC → D) can be larger or smaller than c(AB → D)
  – But confidence of rules generated from the same itemset has an anti-monotone property
  – e.g., L = {A,B,C,D}:
    c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
    (the numerator supp(ABCD) stays fixed, while the antecedent's support can only grow as the antecedent shrinks)
• Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule



Rule generation using
Apriori

• Apriori uses a level-wise approach for rule generation, where each level corresponds to the number of items in the rule consequent.
• Initially, all high-confidence rules that have only one item in the rule consequent are extracted; these are then used to generate subsequent rules (see the sketch below).
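A rough sketch of this level-wise procedure with confidence-based pruning (assuming a `support` dictionary from frozensets to counts, which is not part of the slides):

    from itertools import combinations

    def generate_rules(freq_itemset, support, min_conf):
        """Level-wise rule generation: grow the consequent one item at a
        time, pruning via the anti-monotone property of confidence."""
        k = len(freq_itemset)
        rules = []
        consequents = [frozenset([i]) for i in freq_itemset]  # level 1
        while consequents and len(consequents[0]) < k:
            survivors = []
            for cons in consequents:
                antecedent = freq_itemset - cons
                conf = support[freq_itemset] / support[antecedent]
                if conf >= min_conf:
                    rules.append((antecedent, cons, conf))
                    survivors.append(cons)
                # else: every rule whose consequent is a superset of
                # `cons` is pruned (anti-monotone property).
            consequents = list({a | b for a, b in combinations(survivors, 2)
                                if len(a | b) == len(survivors[0]) + 1})
        return rules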



Rule Generation for Apriori
Algorithm
Lattice of rules for L = {A,B,C,D}, from ABCD ⇒ { } at the top down to single-item antecedents:

  BCD ⇒ A, ACD ⇒ B, ABD ⇒ C, ABC ⇒ D
  CD ⇒ AB, BD ⇒ AC, BC ⇒ AD, AD ⇒ BC, AC ⇒ BD, AB ⇒ CD
  D ⇒ ABC, C ⇒ ABD, B ⇒ ACD, A ⇒ BCD

[Figure: once a rule in the lattice is found to have low confidence, all rules below it (those whose consequents are supersets of its consequent) are pruned.]
Maximal Frequent Itemset

• A compact representation of frequent itemsets is extremely important in association rule mining.
• If the length of a frequent itemset is k, we know all of its 2^k subsets are also frequent, because of the downward closure property.
• Sometimes, when the computation is very expensive and we are not interested in all associations, generating these additional subsets can be avoided: we can just look at the frequent itemsets of maximum length (see the sketch below).
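A minimal sketch (illustrative, not from the slides) of filtering a collection of frequent itemsets down to the maximal ones:

    def maximal_frequent(frequent):
        """Keep itemsets none of whose immediate supersets is frequent.
        `frequent` is a set of frozensets; checking immediate supersets
        suffices because of downward closure."""
        return [X for X in frequent
                if not any(len(Y) == len(X) + 1 and X < Y for Y in frequent)]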



Compression of Itemset
Information
TID  Items
1    ABC
2    ABCD
3    BCE
4    ACDE
5    DE

[Figure: the itemset lattice over {A,B,C,D,E}, with each node annotated by the IDs of the transactions that contain it; itemsets supported by no transaction (e.g., ABCDE) carry no annotation.]


Maximal Frequent Itemset
An itemset is maximal frequent if none of its immediate supersets is frequent
[Figure: the itemset lattice split by the border between frequent and infrequent itemsets; the maximal frequent itemsets are the frequent itemsets lying immediately below the border.]
Maximal vs Closed Itemsets
[Figure: the same transaction-annotated itemset lattice as on the previous slides, repeated here to compare maximal and closed itemsets.]


Maximal vs Closed
Frequent Itemsets
Minimum support = 2

An itemset is closed if none of its immediate supersets has exactly the same support; it is closed frequent if, in addition, its support is at least minsup.

[Figure: the lattice with support counts, highlighting itemsets that are closed but not maximal and itemsets that are both closed and maximal. Here # Closed = 9 and # Maximal = 4.]
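Given the support of every frequent itemset (e.g., from Apriori), the closed frequent ones could be filtered out as in this sketch (illustrative, not from the slides):

    def closed_frequent(support, min_support):
        """Keep frequent itemsets no immediate superset of which has the
        same support. `support` maps frozensets to counts and must cover
        all frequent itemsets (an equal-support superset is frequent)."""
        return [X for X, s in support.items()
                if s >= min_support and
                not any(len(Y) == len(X) + 1 and X < Y and support[Y] == s
                        for Y in support)]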


Maximal vs Closed Itemsets

[Figure: nested sets: maximal frequent itemsets ⊆ closed frequent itemsets ⊆ frequent itemsets.]


Determining the support of non-
closed frequent itemsets
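The slide's figure is not reproduced here. The standard property it illustrates: the support of a non-closed frequent itemset equals the largest support among its immediate supersets, so supports can be filled in from the closed itemsets, processing longer itemsets first. A sketch under that assumption:

    def fill_supports(closed_support, all_frequent):
        """Recover supp(X) for non-closed frequent itemsets as the max
        support of X's immediate supersets, longest itemsets first.
        `closed_support` maps closed frozensets to counts."""
        support = dict(closed_support)
        for X in sorted(all_frequent, key=len, reverse=True):
            if X not in support:
                support[X] = max(support[Y] for Y in support
                                 if len(Y) == len(X) + 1 and X < Y)
        return support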



Characteristics of Apriori
algorithm



Weaknesses of Apriori



Alternative methods for generating frequent
itemsets: Traversal of itemset lattice



Alternative methods for generating
frequent itemsets: Equivalence classes



Prefix and suffix trees



Alternative Methods for Frequent Itemset
Generation: Breadth-first vs Depth-first



Alternative Methods for
Frequent Itemset Generation
Representation of the database: horizontal vs. vertical data layout

Horizontal data layout:
TID  Items
1    A,B,E
2    B,C,D
3    C,E
4    A,C,D
5    A,B,C,D
6    A,E
7    A,B
8    A,B,C
9    A,C,D
10   B

Vertical data layout (item → list of TIDs containing it):
A: 1, 4, 5, 6, 7, 8, 9
B: 1, 2, 5, 7, 8, 10
C: 2, 3, 4, 5, 8, 9
D: 2, 4, 5, 9
E: 1, 3, 6
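In the vertical layout, the support of an itemset is the size of the intersection of its items' TID-lists; this is the core of ECLAT. A minimal sketch (illustrative, not the canonical implementation):

    def eclat(prefix, items, min_support, frequent):
        """Depth-first search over the itemset lattice using TID-set
        intersections. `items` is a list of (item, tidset) pairs whose
        tidsets already meet min_support."""
        while items:
            item, tids = items.pop()
            frequent[frozenset(prefix + [item])] = len(tids)
            suffix = []
            for other, other_tids in items:
                common = tids & other_tids
                if len(common) >= min_support:
                    suffix.append((other, common))
            eclat(prefix + [item], suffix, min_support, frequent)

    # Vertical layout from the table above (support threshold = 2):
    vertical = {'A': {1, 4, 5, 6, 7, 8, 9}, 'B': {1, 2, 5, 7, 8, 10},
                'C': {2, 3, 4, 5, 8, 9}, 'D': {2, 4, 5, 9}, 'E': {1, 3, 6}}
    freq = {}
    eclat([], sorted(vertical.items()), 2, freq)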



Bottleneck of Frequent-
pattern Mining
• Multiple database scans are costly
• Mining long patterns needs many passes of scanning
and generates lots of candidates
– To find the frequent itemset i1 i2 … i100:
  • # of scans: 100
  • # of candidates: C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27 × 10^30 !
• Bottleneck: candidate-generation-and-test
• Can we avoid candidate generation?



FP-growth Algorithm

• Encodes the data set into a compact data structure called an FP-tree
• FP-Growth allows frequent itemset discovery without candidate itemset generation. Two-step approach:
  – Step 1: Build a compact data structure called the FP-tree, using 2 passes over the data set.
  – Step 2: Extract frequent itemsets directly from the FP-tree.



Step 1: FP-Tree
Construction
• The FP-tree is constructed using 2 passes over the data set.
• Pass 1:
  – Scan the data and find the support of each item.
  – Discard infrequent items.
  – Sort the frequent items in decreasing order of support.
  – Use this order when building the FP-tree, so that common prefixes can be shared.



Step 1: FP-Tree
Construction
Pass 2: nodes correspond to items and carry a counter.
1. FP-Growth reads one transaction at a time and maps it to a path.
2. A fixed item order is used, so paths can overlap when transactions share items (i.e., have the same prefix); in that case, the shared counters are incremented.
3. Pointers are maintained between nodes containing the same item, creating singly linked lists (the dotted lines). The more paths overlap, the higher the compression; the FP-tree may then fit in memory.
4. Frequent itemsets are extracted from the FP-tree.

Both passes are sketched in the code below.
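A compact sketch of the two-pass construction (a from-scratch illustration, not the original authors' code):

    from collections import defaultdict

    class Node:
        def __init__(self, item, parent):
            self.item, self.parent = item, parent
            self.count = 1
            self.children = {}

    def build_fp_tree(transactions, min_support):
        # Pass 1: count item supports and keep only frequent items.
        counts = defaultdict(int)
        for t in transactions:
            for item in t:
                counts[item] += 1
        frequent = {i for i, c in counts.items() if c >= min_support}

        root = Node(None, None)
        header = defaultdict(list)  # item -> linked list of its nodes
        # Pass 2: insert transactions, items in decreasing support order.
        for t in transactions:
            items = sorted((i for i in t if i in frequent),
                           key=lambda i: (-counts[i], i))
            node = root
            for item in items:
                if item in node.children:
                    node.children[item].count += 1  # shared prefix
                else:
                    child = Node(item, node)
                    node.children[item] = child
                    header[item].append(child)  # maintain the node links
                node = node.children[item]
        return root, header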



FP-tree construction

TID  Items
1    {A,B}
2    {B,C,D}
3    {A,C,D,E}
4    {A,D,E}
5    {A,B,C}
6    {A,B,C,D}
7    {B,C}
8    {A,B,C}
9    {A,B,D}
10   {B,C,E}

After reading TID=1:  null → A:1 → B:1
After reading TID=2:  null now has two branches: A:1 → B:1 and B:1 → C:1 → D:1


FP-Tree Construction
Transaction database:
TID  Items
1    {A,B}
2    {B,C,D}
3    {A,C,D,E}
4    {A,D,E}
5    {A,B,C}
6    {A,B,C,D}
7    {B,C}
8    {A,B,C}
9    {A,B,D}
10   {B,C,E}

Final FP-tree (node:count; transactions inserted with items in the order A, B, C, D, E):

null
├── A:7
│   ├── B:5
│   │   ├── C:3 ── D:1
│   │   └── D:1
│   ├── C:1 ── D:1 ── E:1
│   └── D:1 ── E:1
└── B:3
    └── C:3
        ├── D:1
        └── E:1

Header table: Item → Pointer, for items A, B, C, D, E. The pointers chain together all nodes holding the same item and are used to assist frequent itemset generation.
FP-Tree size
• The FP-tree is usually smaller than the uncompressed data, because many transactions typically share items (and hence prefixes).
  – Best case: all transactions contain the same set of items, giving a single path in the FP-tree.
  – Worst case: every transaction has a unique set of items (no items in common), so the FP-tree is at least as large as the original data, and its storage requirements are higher because of the node pointers and counters.
• The size of the FP-tree also depends on how the items are ordered.
• Ordering by decreasing support is typically used, but it does not always lead to the smallest tree (it is a heuristic).


FP-Tree Construction with
different ordering scheme

[Figure: the FP-trees obtained when items are ordered by lowest support vs. by highest support.]


Step 2: Frequent Itemset
Generation

• FP-Growth extracts frequent itemsets from the FP-tree.
• Bottom-up algorithm: it works from the leaves towards the root.
• Divide and conquer: first look for frequent itemsets ending in e, then de, etc., then d, then cd, etc.
• First, extract the prefix-path sub-trees ending in an item(set) (hint: use the linked lists).


Prefix path sub-trees
(Example)



Step 2: Frequent Itemset
Generation
• Each prefix-path sub-tree is processed recursively to extract the frequent itemsets; the solutions are then merged.
  – E.g., the prefix-path sub-tree for e will be used to extract frequent itemsets ending in e, then in de, ce, be and ae, then in cde, bde, ade, etc.
  – Divide and conquer approach (sketched below).
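A sketch of the recursive mining step, reusing the `Node` and `build_fp_tree` sketch from the construction slide (illustrative, not the original algorithm's exact pseudocode):

    def fp_growth(header, counts, suffix, min_support, results):
        """For each item (bottom-up, i.e., lowest support first), record
        the itemset {item} + suffix, collect the item's prefix paths
        (its conditional pattern base), then recurse on the conditional
        FP-tree built from those paths."""
        for item in sorted(header, key=lambda i: counts[i]):
            support = sum(node.count for node in header[item])
            if support < min_support:
                continue
            new_suffix = [item] + suffix
            results[frozenset(new_suffix)] = support
            # Conditional pattern base: prefix paths ending at `item`.
            cond_patterns = []
            for node in header[item]:
                path, parent = [], node.parent
                while parent is not None and parent.item is not None:
                    path.append(parent.item)
                    parent = parent.parent
                cond_patterns.extend([path] * node.count)
            if any(cond_patterns):
                cond_root, cond_header = build_fp_tree(cond_patterns,
                                                       min_support)
                cond_counts = {i: sum(n.count for n in nodes)
                               for i, nodes in cond_header.items()}
                fp_growth(cond_header, cond_counts, new_suffix,
                          min_support, results)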



Example



Conditional FP-Tree



Result

• Frequent itemsets found (ordered by suffix and by the order in which they are found):


Pros and cons of FP-growth

• Advantages of FP-Growth
  – only 2 passes over the data set
  – “compresses” the data set
  – no candidate generation
  – much faster than Apriori
• Disadvantages of FP-Growth
  – the FP-tree may not fit in memory!
  – the FP-tree is expensive to build


Limitations of the Support-Confidence Framework

• Pattern mining generates a large set of patterns/rules.
• Not all of the generated patterns/rules are interesting.
• Interestingness measures: objective vs. subjective.
  – Objective: measured using mathematical formulas and the same for any user, e.g., support and confidence.
  – Subjective: one person’s trash can be another’s treasure.


Example



Interestingness measure: Lift
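The slide’s figure is not reproduced here. For reference, the standard definition (with supports taken as fractions of all transactions) is:

  lift(X → Y) = c(X → Y) / supp(Y) = supp(X ∪ Y) / (supp(X) × supp(Y))

Lift > 1 suggests X and Y are positively correlated, lift = 1 suggests independence, and lift < 1 suggests negative correlation.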



Take home message

• To reduce the generation of low-confidence rules, we use the anti-monotone property of confidence discussed above.
• Instead of generating all frequent itemsets, we can make use of maximal frequent and closed itemsets.
• The ECLAT algorithm exploits a vertical partitioning of the database to compute the support of itemsets efficiently.
• The frequent-pattern (FP) tree organizes the itemsets based on their support.


min_support = 3

TID  Items bought               (ordered) frequent items
100  {f, a, c, d, g, i, m, p}   {f, c, a, m, p}
200  {a, b, c, f, l, m, o}      {f, c, a, b, m}
300  {b, f, h, j, o}            {f, b}
400  {b, c, k, s, p}            {c, b, p}
500  {a, f, c, e, l, p, m, n}   {f, c, a, m, p}

Item frequencies: f 4, c 4, a 3, b 3, m 3, p 3

FP-tree (items ordered by decreasing support):

root
├── f:4
│   ├── c:3
│   │   └── a:3
│   │       ├── m:2 ── p:2
│   │       └── b:1 ── m:1
│   └── b:1
└── c:1
    └── b:1 ── p:1

Header table: f, c, a, b, m, p, each pointing to the chain of its nodes.

Example in which the items are ordered by decreasing support.
Example

TID  Items bought       (ordered) frequent items
1    {K, A, D, B}       {A, B, D}
2    {D, A, C, E, B}    {A, B, D}
3    {C, A, B, E}       {A, B}
4    {B, A, D}          {A, B, D}

Item frequencies: A 4, B 4, D 3, C 2, E 2, K 1 (with min_support = 3, only A, B and D are kept)
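As a quick check of the sketches above (hypothetical driver code tying `build_fp_tree` and `fp_growth` to this example):

    transactions = [['K', 'A', 'D', 'B'], ['D', 'A', 'C', 'E', 'B'],
                    ['C', 'A', 'B', 'E'], ['B', 'A', 'D']]
    root, header = build_fp_tree(transactions, min_support=3)
    counts = {i: sum(n.count for n in nodes) for i, nodes in header.items()}
    results = {}
    fp_growth(header, counts, [], 3, results)
    # Expected: {A}:4, {B}:4, {D}:3, {A,B}:4, {A,D}:3, {B,D}:3, {A,B,D}:3
    for itemset, support in sorted(results.items(), key=lambda kv: len(kv[0])):
        print(sorted(itemset), support)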
