Applications
Basket data analysis: Market basket analysis provides the
retailer with information to understand the purchase behavior of
buyers. This information enables the retailer to understand
buyers' needs and redesign the store's layout accordingly, develop
cross-promotional programs, or even attract new buyers.
Market-Basket Transactions and Example Association Rules

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example association rules:
  {Diaper} → {Beer}
  {Milk, Bread} → {Eggs, Coke}
  {Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!
Definition: Frequent Itemset

Itemset
  A collection of one or more items
  Example: {Milk, Bread, Diaper}
k-itemset
  An itemset that contains k items
Support count (σ)
  Frequency of occurrence of an itemset
  E.g. σ({Milk, Bread, Diaper}) = 2 (transactions 4 and 5 above)
Support (s)
  Fraction of transactions that contain an itemset
  E.g. s({Milk, Bread, Diaper}) = 2/5
Frequent Itemset
  An itemset whose support is greater than or equal to a minsup threshold
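To make the definitions concrete, here is a small Python sketch (the transactions are the five example baskets above; the function names are my own, not part of any library):

    # Support count and support of an itemset over the example transactions.
    transactions = [
        {"Bread", "Milk"},
        {"Bread", "Diaper", "Beer", "Eggs"},
        {"Milk", "Diaper", "Beer", "Coke"},
        {"Bread", "Milk", "Diaper", "Beer"},
        {"Bread", "Milk", "Diaper", "Coke"},
    ]

    def support_count(itemset, transactions):
        """sigma(X): number of transactions containing every item of X."""
        return sum(1 for t in transactions if itemset <= t)

    def support(itemset, transactions):
        """s(X): fraction of transactions containing X."""
        return support_count(itemset, transactions) / len(transactions)

    X = {"Milk", "Bread", "Diaper"}
    print(support_count(X, transactions))  # 2 (transactions 4 and 5)
    print(support(X, transactions))        # 0.4 (= 2/5)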
Definition: Association Rule

Association Rule
  An implication expression of the form X → Y, where X and Y are itemsets
  Example: {Milk, Diaper} → {Beer}

[Figure: Venn diagram contrasting customers who buy both milk and diaper with customers who buy all three items, illustrating the antecedent and consequent of the rule.]
Brute-force approach:
List all possible association rules
Compute the support and confidence for each rule
Prune rules that fail the minsup and minconf thresholds
Computationally prohibitive!
Mining Association Rules

Confidence, e.g. for the rule {B, D} → {A}:

  confidence = (number of tuples that contain {A, B, D}) / (number of tuples that contain {B, D}) × 100%
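As a sketch, the same ratio in Python, reusing the transactions list from the support example above (function name is my own):

    def confidence(lhs, rhs, transactions):
        """c(X → Y) = sigma(X ∪ Y) / sigma(X); assumes X occurs at least once."""
        both = sum(1 for t in transactions if lhs | rhs <= t)
        lhs_count = sum(1 for t in transactions if lhs <= t)
        return both / lhs_count

    print(confidence({"Milk", "Diaper"}, {"Beer"}, transactions))
    # sigma({Milk, Diaper, Beer}) = 2, sigma({Milk, Diaper}) = 3, so c = 2/3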
Mining Association Rules

Two-step approach:
1. Frequent Itemset Generation
   – Generate all itemsets whose support ≥ minsup
2. Rule Generation
   – Generate high-confidence rules from each frequent
     itemset, where each rule is a binary partitioning of a
     frequent itemset
[Figure: itemset lattice over items A–E, from the 1-itemsets (A, B, C, D, E) through the 2-itemsets (AB, AC, …, DE) up to ABCDE; given d items, there are 2^d possible candidate itemsets.]

• Frequent itemset generation is still computationally prohibitive!
Frequent Itemset Generation

Brute-force approach:
  Each itemset in the lattice is a candidate frequent itemset.
  Count the support of each candidate by scanning the database:
  match each of the N transactions (table above) against each of the
  M candidates, where w is the maximum transaction width.
  Complexity ~ O(NMw), which is expensive since M = 2^d.

Given d unique items, the total number of possible association rules is

  $R = \sum_{k=1}^{d-1} \left[ \binom{d}{k} \times \sum_{j=1}^{d-k} \binom{d-k}{j} \right] = 3^d - 2^{d+1} + 1$
Apriori principle:
If an itemset is frequent, then all of its subsets must also be
frequent. This follows from the anti-monotone property of support:

  ∀X, Y : (X ⊆ Y) ⇒ s(X) ≥ s(Y)

[Figure: itemset lattice over {A, B, C, D, E} in which one itemset (e.g. AB) is found to be infrequent, so all of its supersets, up to ABCDE, are pruned.]
Closed Patterns and Max-Patterns

Motivation: there are often too many frequent itemsets. Two compact
representations help:
  An itemset X is closed if X is frequent and no proper super-itemset
  of X has the same support count (otherwise X is not closed).
  An itemset X is a max-pattern if X is frequent and no proper
  super-itemset of X is frequent.
Algorithm Apriori

L1 = find_frequent_1-itemsets(D);
for (k = 2; Lk−1 ≠ ∅; k++) {
    Ck = apriori_gen(Lk−1);            // candidates generated from Lk−1
    for each transaction t ∈ D {       // scan D for counts
        Ct = subset(Ck, t);            // get the subsets of t that are candidates
        for each candidate c ∈ Ct
            c.count++;
    }
    Lk = {c ∈ Ck | c.count ≥ min_sup}; // candidates in Ck with min_support
}
return L = ∪k Lk;
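A runnable Python sketch of the pseudocode above (my simplification: the candidate generation joins all pairs of frequent (k−1)-itemsets and prunes by the Apriori principle; min_sup is an absolute count):

    from itertools import combinations

    def apriori(transactions, min_sup):
        """Return a dict mapping each frequent itemset (frozenset) to its count."""
        counts = {}
        for t in transactions:                  # L1: frequent 1-itemsets
            for item in t:
                key = frozenset([item])
                counts[key] = counts.get(key, 0) + 1
        L = {s: c for s, c in counts.items() if c >= min_sup}
        frequent, k = dict(L), 2
        while L:
            prev = set(L)                       # join, then prune (Apriori principle)
            candidates = {a | b for a in prev for b in prev
                          if len(a | b) == k
                          and all(frozenset(x) in prev
                                  for x in combinations(a | b, k - 1))}
            counts = {c: 0 for c in candidates}
            for t in transactions:              # one scan of D per level
                ts = set(t)
                for c in candidates:
                    if c <= ts:
                        counts[c] += 1
            L = {s: c for s, c in counts.items() if c >= min_sup}
            frequent.update(L)
            k += 1
        return frequent

    # With the market-basket transactions above and min_sup = 3, the result
    # includes {Bread, Milk}: 3, {Bread, Diaper}: 3, {Milk, Diaper}: 3 and
    # {Beer, Diaper}: 3; no 3-itemset reaches the threshold.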
apriori-gen function

Step 1: Generating 1-itemset Frequent Patterns

C1 (candidates with support counts)  →  L1 (frequent 1-itemsets)
…                                       …
{I4}: 2                                 {I4}: 2
{I5}: 2                                 {I5}: 2

• In the first iteration of the algorithm, each item is a member of the set of
  candidate 1-itemsets, C1.
• The set of frequent 1-itemsets, L1, consists of the candidate 1-itemsets
  satisfying minimum support.
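The slide describing apriori_gen itself appears truncated, so here is a sketch of the standard join and prune steps: join two (k−1)-itemsets that agree on their first k−2 items (in sorted order), then discard any candidate with an infrequent (k−1)-subset. The example values come from the L2 computed in Step 2 below.

    from itertools import combinations

    def apriori_gen(L_prev, k):
        """Generate candidate k-itemsets from frequent (k-1)-itemsets."""
        prev = sorted(sorted(s) for s in L_prev)   # each itemset as a sorted list
        prev_set = {frozenset(s) for s in L_prev}
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                a, b = prev[i], prev[j]
                if a[:k - 2] == b[:k - 2]:         # join condition
                    c = frozenset(a) | frozenset(b)
                    if len(c) == k and all(frozenset(x) in prev_set
                                           for x in combinations(sorted(c), k - 1)):
                        candidates.add(c)          # prune step passed
        return candidates

    # With L2 = {{I1,I2}, {I1,I3}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}},
    # apriori_gen(L2, 3) yields only {I1,I2,I3} and {I1,I2,I5}; e.g.
    # {I1,I3,I5} is pruned because its subset {I3,I5} is not frequent.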
Step 2: Generating 2-itemset Frequent Patterns

Generate candidate 2-itemsets C2 from L1; scan D for the count of each
candidate; compare each candidate's support count with the minimum
support count to obtain L2.

C2 (from L1)   C2 (after scanning D)   L2
{I1, I2}       {I1, I2}: 4             {I1, I2}: 4
{I1, I3}       {I1, I3}: 4             {I1, I3}: 4
{I1, I4}       {I1, I4}: 1             {I1, I5}: 2
{I1, I5}       {I1, I5}: 2             {I2, I3}: 4
{I2, I3}       {I2, I3}: 4             {I2, I4}: 2
{I2, I4}       {I2, I4}: 2             {I2, I5}: 2
{I2, I5}       {I2, I5}: 2
{I3, I4}       {I3, I4}: 0
{I3, I5}       {I3, I5}: 1
{I4, I5}       {I4, I5}: 0
Generating Association Rules from Frequent Itemsets [Cont.]

R4: I1 → I2 ∧ I5
  Confidence = sc{I1, I2, I5} / sc{I1} = 2/6 = 33%. R4 is rejected.
R5: I2 → I1 ∧ I5
  Confidence = sc{I1, I2, I5} / sc{I2} = 2/7 = 29%. R5 is rejected.
R6: I5 → I1 ∧ I2
  Confidence = sc{I1, I2, I5} / sc{I5} = 2/2 = 100%. R6 is selected.
Methods to Improve Apriori's Efficiency

Apriori is easily parallelized; the following techniques further improve
its efficiency.

Candidate counting:
  Scan the database of transactions to determine the support of
  each candidate itemset.
  To reduce the number of comparisons, store the candidates in a
  hash structure: instead of matching each transaction against every
  candidate, match it against the candidates contained in the hashed buckets.
Storing the candidate set C4 below in a hash tree with a maximum of
2 itemsets per leaf node. Interior nodes contain hash pointers; at
depth d, a candidate is hashed on its d-th item.

C4 = { ⟨a, b, c, d⟩, ⟨a, b, e, f⟩, ⟨a, b, h, j⟩, ⟨a, d, e, f⟩,
       ⟨b, c, e, f⟩, ⟨b, d, f, h⟩, ⟨c, e, g, k⟩, ⟨c, f, g, h⟩ }

[Figure: the resulting hash tree, with branches labeled a, b, c at depth 0 and candidate suffixes such as ⟨c, e, f⟩, ⟨e, g, k⟩, ⟨d, f, h⟩, ⟨f, g, h⟩ stored in the leaves under their shared prefixes.]
How to Build a Hash Tree on a Candidate Set

Example: building the hash tree on the candidate set C4 of the previous
slide, with a maximum of 2 itemsets per leaf node. Inserting a candidate
strips off one item per level: e.g. ⟨a, b, c, d⟩ descends the branch for a,
then the branch for b, leaving the suffix ⟨c, d⟩ at the leaf.

[Figure: intermediate and final trees; suffixes such as ⟨b, c, d⟩, ⟨b, e, f⟩, ⟨b, h, j⟩, ⟨d, e, f⟩ hang under branch a, and shorter suffixes such as ⟨c, d⟩, ⟨e, f⟩, ⟨h, j⟩ appear one level deeper.]

For each transaction T, process T through the hash tree to find the members
of Ck contained in T and increment their counts. After all transactions are
processed, eliminate the candidates with less than minimum support.
To insert a candidate ⟨k1, k2, k3⟩, hash on one item per level:
1) Depth 1: hash(k1)
2) Depth 2: hash(k2)
3) Depth 3: hash(k3)
Generate Hash Tree

Suppose we store 15 candidate 3-itemsets in a hash tree:
{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4},
{5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}.
The hash function sends items 1, 4, 7 to one branch, 2, 5, 8 to a second,
and 3, 6, 9 to a third.

[Figure: the hash tree, with the subtree reached by hashing on 1, 4 or 7 highlighted.]
Association Rule Discovery: Hash Tree

[Figure: the same hash tree, with the subtree reached by hashing on 2, 5 or 8 highlighted.]
Association Rule Discovery: Hash Tree

[Figure: the same hash tree, with the subtree reached by hashing on 3, 6 or 9 highlighted.]
Subset Operation

Given a transaction t = {1, 2, 3, 5, 6}, what are the possible subsets
of size 3? They can be enumerated systematically, fixing one item at a
time in increasing order:

Level 1: 1 + {2 3 5 6},  2 + {3 5 6},  3 + {5 6}
Level 2: 12 + {3 5 6},  13 + {5 6},  15 + {6},  23 + {5 6},  25 + {6},  35 + {6}
Level 3: the ten 3-subsets
         123, 125, 126, 135, 136, 156, 235, 236, 256, 356
Subset Operation Using Hash Tree

At the root, hash on each item of t that can begin a 3-subset:
1 + {2 3 5 6}, 2 + {3 5 6}, 3 + {5 6}. At the next level, e.g. within
the branch for 1: 12 + {3 5 6}, 13 + {5 6}, 15 + {6}.

[Figure: hash tree traversal for these prefixes.]
Subset Operation Using Hash Tree [Cont.]

Continuing the traversal with the hash function, only the buckets that
can contain subsets of the transaction 1 2 3 5 6 are visited. In the end,
the transaction is matched against 11 out of 15 candidates instead of all 15.
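Pulling the last several slides together, here is a minimal Python sketch of a candidate hash tree. It is my own simplified structure, not the exact slide layout: candidates are sorted tuples of k integers, interior nodes hash on the item at the current depth, and leaves split once they exceed max_leaf itemsets.

    class Node:
        def __init__(self):
            self.children = {}   # bucket -> Node (when interior)
            self.leaf = []       # candidate itemsets (when leaf)
            self.is_leaf = True

    class HashTree:
        """Sketch of a candidate hash tree for k-itemsets (sorted int tuples)."""
        def __init__(self, k, buckets=3, max_leaf=2):
            self.root, self.k = Node(), k
            self.buckets, self.max_leaf = buckets, max_leaf
            self.counts = {}                       # candidate -> support count

        def insert(self, itemset):
            self.counts[itemset] = 0
            self._insert(itemset, self.root, 0)

        def _insert(self, itemset, node, depth):
            if not node.is_leaf:
                b = itemset[depth] % self.buckets  # hash on the item at this depth
                child = node.children.setdefault(b, Node())
                return self._insert(itemset, child, depth + 1)
            node.leaf.append(itemset)
            if len(node.leaf) > self.max_leaf and depth < self.k:
                node.is_leaf, pending, node.leaf = False, node.leaf, []
                for iset in pending:               # split: re-hash one level deeper
                    self._insert(iset, node, depth)

        def count(self, t, node=None, start=0):
            """Increment counts of all candidates contained in sorted transaction t."""
            node = node or self.root
            if node.is_leaf:
                tset = set(t)
                for c in node.leaf:
                    if set(c) <= tset:
                        self.counts[c] += 1
                return
            seen = set()                           # visit each child bucket once
            for i in range(start, len(t)):
                b = t[i] % self.buckets
                if b in node.children and b not in seen:
                    seen.add(b)
                    self.count(t, node.children[b], i + 1)

    # Usage with the 15 example candidates from the slides above:
    cands = [(1,4,5), (1,2,4), (4,5,7), (1,2,5), (4,5,8), (1,5,9), (1,3,6),
             (2,3,4), (5,6,7), (3,4,5), (3,5,6), (3,5,7), (6,8,9), (3,6,7), (3,6,8)]
    tree = HashTree(k=3)
    for c in cands:
        tree.insert(c)
    tree.count((1, 2, 3, 5, 6))  # {1,2,5}, {1,3,6}, {3,5,6} get their counts incremented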
Partitioning

First scan:
  Subdivide the transactions of database D into n non-overlapping
  partitions.
  If the minimum support in D is min_sup, then the minimum support
  count for a partition is min_sup × the number of transactions in
  that partition.
Second scan:
  Global frequent itemsets are determined from the local frequent
  itemsets.
Reducing Scans via Partitioning

Divide the dataset D into n non-overlapping partitions, D1, D2, …, Dn,
so that each partition fits into memory.
Find the frequent itemsets Fi in each Di, with the support threshold
scaled to the partition size.
If an itemset is frequent in D, it must be frequent in some Di, so the
union of all Fi forms a candidate set for the frequent itemsets of D;
a second scan obtains their global counts.
Often this requires only two scans of D.
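A sketch of the two-scan scheme, assuming `mine(partition, local_min_count)` is any in-memory frequent-itemset miner (e.g. the apriori sketch earlier) returning itemsets as frozensets; the function name and parameters are my own:

    def partitioned_apriori(D, min_sup_fraction, n_partitions, mine):
        """Two-scan partitioning: mine each partition locally, then verify globally."""
        size = (len(D) + n_partitions - 1) // n_partitions
        parts = [D[i:i + size] for i in range(0, len(D), size)]
        candidates = set()
        for p in parts:                           # scan 1: local frequent itemsets
            local_min = min_sup_fraction * len(p) # threshold scaled to the partition
            candidates |= set(mine(p, local_min))
        counts = {c: 0 for c in candidates}
        for t in D:                               # scan 2: global counts of the union
            ts = set(t)
            for c in candidates:
                if c <= ts:
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_sup_fraction * len(D)}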
Sampling and Dynamic Itemset Counting

Select a sample of the original database and mine frequent patterns
within the sample using Apriori.
Scan the database once to verify the frequent itemsets found in the
sample; only the borders of the closure of the frequent patterns are
checked.
  Example: check abcd instead of ab, ac, …, etc.
Scan the database again to find missed frequent patterns.
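A minimal sketch of the sampling idea, again assuming `mine` is any in-memory miner; the threshold lowering factor is an illustrative heuristic, not a value from the slides, and the second scan for missed patterns is omitted:

    import random

    def sample_and_verify(D, min_sup_fraction, sample_fraction, mine):
        """Mine a sample with a lowered threshold, then verify on the full DB."""
        sample = random.sample(D, int(len(D) * sample_fraction))
        lowered = 0.9 * min_sup_fraction * len(sample)  # reduce risk of misses
        candidates = mine(sample, lowered)
        frequent = {}
        for c in candidates:                            # one verification scan of D
            n = sum(1 for t in D if c <= set(t))
            if n >= min_sup_fraction * len(D):
                frequent[c] = n
        return frequent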
Representation of the Database
Horizontal vs. vertical data layout

Horizontal data layout:
TID  Items
1    A, B, E
2    B, C, D
3    C, E
4    A, C, D
5    A, B, C, D
6    A, E
7    A, B
8    A, B, C
9    A, C, D
10   B

Vertical data layout (TID-lists):
A: 1, 4, 5, 6, 7, 8, 9
B: 1, 2, 5, 7, 8, 10
C: 2, 3, 4, 5, 6, 8, 9
D: 2, 4, 5, 9
E: 1, 3, 6
Improving the Efficiency of Apriori

Transaction reduction:
  Reduce the number of transactions scanned in future iterations.
  A transaction that does not contain any frequent k-itemset cannot
  contain any frequent (k+1)-itemset, so such transactions can be
  omitted from subsequent scans.
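A one-function sketch of transaction reduction (function name is my own; frequent_k is the set of frequent k-itemsets as frozensets):

    def reduce_transactions(D, frequent_k):
        """Keep only transactions containing at least one frequent k-itemset;
        the rest cannot contain any frequent (k+1)-itemset."""
        kept = []
        for t in D:
            ts = set(t)
            if any(c <= ts for c in frequent_k):
                kept.append(t)
        return kept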
Factors Affecting Complexity

The minimum support threshold is a dominant factor: a low threshold
(e.g. ξ = 3 in the running example below) yields many frequent patterns,
i.e., f, a, …, fa, fac, fam, fm, am, …

Problem statement: how do we efficiently find all frequent patterns?
Apriori: Candidate Generation and Test

Main steps of the Apriori algorithm:
  Use the frequent (k − 1)-itemsets (Lk−1) to generate candidate
  frequent k-itemsets Ck (candidate generation).
  Scan the database and count each pattern in Ck to obtain the
  frequent k-itemsets Lk (candidate test).

Example:
TID  Items bought
100  {f, a, c, d, g, i, m, p}
200  {a, b, c, f, l, m, o}
300  {b, f, h, j, o}
400  {b, c, k, s, p}
500  {a, f, c, e, l, p, m, n}

Apriori iterations:
C1: f, a, c, d, g, i, m, p, l, o, h, j, k, s, b, e, n
L1: f, a, c, m, b, p
C2: fa, fc, fm, fp, ac, am, …, bp
L2: fa, fc, fm, …
…
Performance Bottlenecks of Apriori

The bottleneck of Apriori is candidate generation.
Huge candidate sets:
  For 10^4 frequent 1-itemsets, Apriori will generate 10^7
  candidate 2-itemsets.
  To discover a frequent pattern of size 100, e.g. {a1, a2, …, a100},
  one needs to generate 2^100 ≈ 10^30 candidates.
Multiple scans of the database:
  Apriori needs (n + 1) scans, where n is the length of the longest pattern.
Overview of FP-Growth: Ideas

Order items by descending support count, and use this order when
building the FP-Tree, so common prefixes can be shared.
Step 1: FP-Tree Construction

Pass 2:
Nodes correspond to items and have a counter.
1. FP-Growth reads one transaction at a time and maps it to a path.
2. A fixed item order is used, so paths can overlap when transactions
   share items (i.e., when they have the same prefix); in this case,
   the counters along the shared prefix are incremented.
3. Pointers are maintained between nodes containing the same item,
   creating singly linked lists (shown as dotted lines in the figures).
   The more paths overlap, the higher the compression; the FP-tree may
   then fit in memory.
4. Frequent itemsets are then extracted from the FP-Tree.
Step 1: FP-Tree Construction (Example)

[Figure: the FP-tree grows as transactions are read one by one.]
FP-Tree Size

The FP-Tree is usually smaller than the uncompressed data because
typically many transactions share items (and hence prefixes).
Best case scenario: all transactions contain the same set of items,
giving a single path in the FP-tree.
Worst case scenario: every transaction has a unique set of items (no
items in common).
  The FP-tree is then at least as large as the original data, and its
  storage requirements are higher because of the pointers between
  nodes and the counters.
The size of the FP-tree also depends on how the items are ordered.
Ordering by decreasing support is typically used, but it does not
always lead to the smallest tree (it's a heuristic).
Step 2: Frequent Itemset Generation

Frequent itemsets are found ordered by suffix and by the order in
which they are discovered.
Construct FP-tree

Two steps:
1. Scan the transaction DB for the first time; find the frequent items
   (1-item patterns) and order them into a list L in frequency
   descending order, in the format (item-name: support).
   e.g., L = {f:4, c:4, a:3, b:3, m:3, p:3}
2. Scan the DB a second time; for each transaction, order its frequent
   items according to L, and construct the FP-tree by inserting each
   frequency-ordered transaction into it.
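A minimal Python sketch of this two-scan construction. The class and function names are my own; the optional `order` parameter lets us fix L explicitly, since by default ties in support are broken arbitrarily:

    from collections import Counter, defaultdict

    class FPNode:
        def __init__(self, item, parent):
            self.item, self.parent = item, parent
            self.count = 0
            self.children = {}                    # item -> FPNode

    def build_fptree(transactions, min_sup, order=None):
        """Two-scan FP-tree construction sketch."""
        # Scan 1: frequent items and the order L
        freq = Counter(i for t in transactions for i in t)
        freq = {i: c for i, c in freq.items() if c >= min_sup}
        if order is None:
            order = sorted(freq, key=lambda i: -freq[i])
        rank = {i: r for r, i in enumerate(order)}
        # Scan 2: insert each frequency-ordered transaction into the tree
        root = FPNode(None, None)
        header = defaultdict(list)                # item -> its nodes (node-links)
        for t in transactions:
            node = root
            for i in sorted((i for i in t if i in freq), key=rank.get):
                if i not in node.children:
                    node.children[i] = FPNode(i, node)
                    header[i].append(node.children[i])
                node = node.children[i]
                node.count += 1                   # shared prefixes just increment
        return root, header

    # The example DB above, min_sup = 3, with the slides' order L = f,c,a,b,m,p:
    db = [list("facdgimp"), list("abcflmo"), list("bfhjo"), list("bcksp"),
          list("afcelpmn")]
    root, header = build_fptree(db, 3, order=list("fcabmp"))
    # root.children now holds f (count 4) and c (count 1), as in the final FP-tree.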
Another FP-tree Example: Steps 1 and 2

The list L is a by-product of the first scan of the database. In the
second scan, the frequent items of each transaction are ordered
according to L and inserted into the tree.

[Figure: inserting {f, c, a, m, p} creates the path f:1 → c:1 → a:1 → m:1 → p:1; inserting {f, c, a, b, m} shares the prefix, incrementing it to f:2, c:2, a:2 and branching into b:1 → m:1.]
Final FP-tree

Header table (item → head of node-links): f, c, a, b, m, p

{}
├─ f:4
│  ├─ c:3
│  │  └─ a:3
│  │     ├─ m:2
│  │     │  └─ p:2
│  │     └─ b:1
│  │        └─ m:1
│  └─ b:1
└─ c:1
   └─ b:1
      └─ p:1
Questions?
Example:

TID  (ascended) frequent items
100  {p, m, a, c, f}
200  {m, b, a, c, f}
300  {b, f}
400  {p, b, c}
500  {p, m, a, c, f}

[Figure: FP-tree built with ascending-frequency order; the root branches into p:3, m:2 and c:1, giving a bushier tree than the descending-order one.]
Node-link property
  For any frequent item ai, all the possible frequent patterns that
  contain ai can be obtained by following ai's node-links, starting
  from ai's head in the FP-tree header.
Prefix path property
  To calculate the frequent patterns for a node ai in a path P, only
  the prefix sub-path of ai in P needs to be accumulated, and its
  frequency count carries the same count as node ai.
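The prefix path property can be made concrete with a short sketch that reuses the FPNode and header from the build_fptree example above (so the node-links are approximated by the header's node lists):

    def conditional_pattern_base(item, header):
        """For each node of `item`, walk parent pointers to the root and emit
        the prefix path, carrying that node's count."""
        base = []
        for node in header[item]:
            path, p = [], node.parent
            while p is not None and p.item is not None:
                path.append(p.item)
                p = p.parent
            if path:
                base.append((list(reversed(path)), node.count))
        return base

    # With the FP-tree built above:
    print(conditional_pattern_base('m', header))
    # -> [(['f', 'c', 'a'], 2), (['f', 'c', 'a', 'b'], 1)], i.e. fca:2, fcab:1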
Step 2: Construct Conditional FP-tree

Starting from the header table of the FP-tree above (f:4, c:4, a:3,
b:3, m:3, p:3), m's conditional pattern base is fca:2, fcab:1.
Accumulating these prefix paths yields the m-conditional FP-tree, the
single path:

{} → f:3 → c:3 → a:3
Step 3: Recursively Mine the Conditional FP-tree

The conditional FP-tree of "m" is (fca:3). Each recursion step adds one
item from the conditional tree ("a", "c" or "f") and emits the
corresponding frequent pattern:
  conditional FP-tree of "am": (fc:3)
  conditional FP-tree of "cm": (f:3)
  conditional FP-tree of "cam": (f:3)
  conditional FP-tree of "fam": empty; frequent pattern fam:3
  conditional FP-tree of "fcm": empty; frequent pattern fcm:3
Principles of FP-Growth

For each frequent item (processed in the order of L), construct its
conditional pattern base and its conditional FP-tree:

Item  Conditional pattern base  Conditional FP-tree
…     …                         …
c     {(f:3)}                   {(f:3)} | c
f     Empty                     Empty
Single FP-tree Path Generation

If the conditional FP-tree is a single path, all frequent patterns can
be enumerated directly. All frequent patterns concerning m are
combinations of {f, c, a} and m:
  m, fm, cm, am, fcm, fam, cam, fcam
(m-conditional FP-tree: {} → f:3 → c:3 → a:3)
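A short sketch of this enumeration (function name is my own; the suffix's own support is passed in so that the pattern consisting of the suffix alone is included):

    from itertools import chain, combinations

    def single_path_patterns(path, suffix, suffix_count):
        """When a conditional FP-tree is a single path, every frequent pattern
        is a subset of the path's items plus the suffix; its support is the
        minimum count involved."""
        subsets = chain.from_iterable(combinations(path, r)
                                      for r in range(len(path) + 1))
        return [(tuple(i for i, _ in s) + (suffix,),
                 min([c for _, c in s] + [suffix_count]))
                for s in subsets]

    # m-conditional FP-tree is the single path f:3 -> c:3 -> a:3, and m itself
    # has support 3, so all 8 patterns m, fm, cm, am, fcm, fam, cam, fcam
    # come out with support 3:
    print(single_path_patterns([('f', 3), ('c', 3), ('a', 3)], 'm', 3))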
Summary of the FP-Growth Algorithm

Facts (usually):
1. The FP-tree is much smaller than the size of the DB.
2. A pattern base is smaller than the original FP-tree.
3. A conditional FP-tree is smaller than its pattern base.
The mining process therefore works on a set of usually much smaller
pattern bases and conditional FP-trees: divide-and-conquer with a
dramatic scale of shrinking.
Exam Questions

Q1: What are the main drawbacks of Apriori-like approaches, and why?
A: Candidate generation (huge candidate sets) and the repeated scans
of the database, as discussed under "Performance Bottlenecks of
Apriori" above.

Q2: What is an FP-Tree?
A: An FP-Tree stores the database in a compact way. It is constructed
by mapping each frequency-ordered transaction onto a path in the
FP-Tree.
Other answer: An FP-Tree is an extended prefix-tree structure that
preserves the complete information needed for frequent pattern mining
while compressing the database.
Example: mining the FP-tree below.

Header table:
Item  Support count  Node-link
I2    7              →
I1    6              →
I3    6              →
I4    2              →
I5    2              →

null {}
├─ I2:7
│  ├─ I1:4
│  │  ├─ I5:1
│  │  ├─ I3:2
│  │  │  └─ I5:1
│  │  └─ I4:1
│  ├─ I3:2
│  └─ I4:1
└─ I1:2
   └─ I3:2

For I4, its two prefix paths form the conditional pattern base,
{⟨I2 I1: 1⟩, ⟨I2: 1⟩}, which generates a single-node conditional
FP-tree, ⟨I2: 2⟩, and derives one frequent pattern, ⟨I2, I4: 2⟩.
Similar to the preceding analysis, I3's conditional pattern base is
{⟨I2, I1: 2⟩, ⟨I2: 2⟩, ⟨I1: 2⟩}.
Its conditional FP-tree has two branches, ⟨I2: 4, I1: 2⟩ and ⟨I1: 2⟩,
which generate the set of patterns ⟨I2, I3: 4⟩, ⟨I1, I3: 4⟩ and
⟨I2, I1, I3: 2⟩.
Finally, I1's conditional pattern base is {⟨I2: 4⟩}, with an FP-tree
that contains only one node, ⟨I2: 4⟩, which generates one frequent
pattern, ⟨I2, I1: 4⟩.
Benefits of the FP-tree Structure

Completeness
  Preserves complete information for frequent pattern mining.
  Never breaks a long pattern of any transaction.
Compactness
  Reduces irrelevant information: infrequent items are gone.
  Items are kept in frequency-descending order: the more frequently
  an item occurs, the more likely it is to be shared.
  The tree is never larger than the original database (not counting
  node-links and the count fields).
The Frequent Pattern Growth Mining Method

Transaction database:
TID  Items
1    {A, B}
2    {B, C, D}
3    {A, C, D, E}
4    {A, D, E}
5    {A, B, C}
6    {A, B, C, D}
7    {B, C}
8    {A, B, C}
9    {A, B, D}
10   {B, C, E}

Item support counts:
B: 8
A: 7
C: 7
D: 5
E: 3
FP-tree construction

[Figure: after reading TID = 1, the tree contains a single path under the null root.]
FP-Tree Construction

Reading the full transaction database (table above) yields the final
FP-tree.

[Figure: the root branches into A:7 and B:3; A:7 branches into B:5 and C:3; deeper nodes include C:1 and D:1.]
Tree Projection

Itemsets are enumerated in a lexicographic lattice; each node is
extended only with items that come after it.
  Possible extension: E(A) = {B, C, D, E}
  Possible extension: E(ABC) = {D, E}

[Figure: lexicographic itemset lattice over {A, B, C, D, E}, from single items down to ABCDE.]
ECLAT

ECLAT uses the vertical data layout shown earlier: each item is stored
with its TID-list. The support of any k-itemset is determined by
intersecting the TID-lists of two of its (k−1)-subsets:

t(A)  = {1, 4, 5, 6, 7, 8, 9}
t(B)  = {1, 2, 5, 7, 8, 10}
t(AB) = t(A) ∩ t(B) = {1, 5, 7, 8}

3 traversal approaches:
  top-down, bottom-up and hybrid
Advantage: very fast support counting.
Disadvantage: intermediate TID-lists may become too large for memory.
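A sketch of the bottom-up traversal (function name is my own; tidlists maps each item to its set of TIDs, and the data is the vertical layout from the table above):

    def eclat(tidlists, min_sup, prefix=()):
        """Extend the current prefix with each item whose intersected
        TID-list stays frequent; recurse on the conditional TID-lists."""
        frequent = []
        items = sorted(tidlists)
        for i, a in enumerate(items):
            ta = tidlists[a]
            if len(ta) < min_sup:
                continue
            frequent.append((prefix + (a,), len(ta)))
            # TID-lists of the extended itemsets come from intersecting
            # the TID-lists of two (k-1)-subsets
            cond = {b: ta & tidlists[b] for b in items[i + 1:]}
            cond = {b: t for b, t in cond.items() if len(t) >= min_sup}
            frequent += eclat(cond, min_sup, prefix + (a,))
        return frequent

    tidlists = {'A': {1, 4, 5, 6, 7, 8, 9}, 'B': {1, 2, 5, 7, 8, 10},
                'C': {2, 3, 4, 5, 6, 8, 9}, 'D': {2, 4, 5, 9}, 'E': {1, 3, 6}}
    print(eclat(tidlists, min_sup=4))
    # Includes (('A',), 7), (('B',), 6) and (('A', 'B'), 4),
    # since t(AB) = t(A) ∩ t(B) = {1, 5, 7, 8}.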
Rule Generation

Confidence does not have an anti-monotone property:
c(ABC → D) can be larger or smaller than c(AB → D).
Rules are generated by binary partitioning of each frequent itemset,
as described earlier.

Method

For each frequent item, construct its conditional pattern base (its
projected database), and then its conditional FP-tree; mine each
recursively. E.g. in the running example, the am-projected DB is
{fc, fc, fc} and the cm-projected DB is {f, f, f}.
Advantages of the Pattern Growth Approach

Divide-and-conquer:
  Decompose both the mining task and the DB according to the
  frequent patterns obtained so far.
  This leads to focused searches of smaller databases.
Other factors:
  No candidate generation, no candidate test.
  Compressed database: the FP-tree structure.
  No repeated scans of the entire database.
  Basic operations are counting local frequent items and building
  sub FP-trees: no pattern search and matching.
A good open-source implementation and refinement of FP-Growth:
FPGrowth+ (Grahne and Zhu, FIMI'03).
Further Improvements of Mining Methods