Professional Documents
Culture Documents
Chapitre Association Mining
Chapitre Association Mining
— Chapter 1—
Introduction
Basic concepts
Association rules
1
Introduction
What Is Frequent Pattern Analysis?
Introduction
Motivation: Finding inherent regularities in data
What products were often purchased together?
What are the subsequent purchases after buying a PC?
What kinds of DNA are sensitive to this new drug?
Can we automatically classify web documents?
Applications
Basket data analysis, cross-marketing, catalog design, sale
campaign analysis, customer shopping behavior analysis, Web log
(click stream) analysis, and DNA sequence analysis
2
Basic Concepts
Basic Concepts
(absolute) support, or, support count of X:
Frequency or occurrence of an itemset X in D
3
Basic Concepts
X I, Y I and X Y =
| {T D | ( X Y ) T } |
|D|
Data Mining: Concepts and Techniques-
September 20, 2022 Mariem Gzara 7
Basic Concepts
| {T D | X Y T } |
| {T D | X T } |
4
Basic Concepts
The downward closure property of frequent patterns
Basic Concepts
Example
TransactionID Items
2000 A,B,C
I={A,B,C,D,E,F}
1000 A,C
minsup = 50%,
4000 A,D
minconf = 50%
5000 B,E,F
Support
(A): 75%, (B), (C): 50%, (D), (E), (F): 25%,
(A, C): 50%, (A, B), (A, D), (B, C), (B, E), (B, F), (E, F): 25%
Association rules
A C (support = 50%, confidence = 66.6%)
C A (support = 50%, confidence = 100%)
10
5
Mining Association rules
large databases D
11
12
6
Mining Association rules
1. Mining frequent itemsets: Apriori algorithm, FP-
growth method, etc.
2. Generating association rules from frequent
itemsets:
For each frequent itemset l, generate all nonempty
subsets of l,
For every nonempty subset s of l, output the rule R "
s(l -s) " if R fulfills the minimum confidence
requirement. Note that for simplicity, a single-item is
often desired. Data Mining: Concepts and Techniques-
September 20, 2022 Mariem Gzara 13
13
14
7
Apriori: A Candidate Generation & Test Approach
Method:
Initially, scan DB once to get frequent 1-itemset
Generate length (k+1) candidate itemsets from length
k frequent itemsets
Test the candidates against DB
Terminate when no frequent or candidate set can be
generated
15
C3 Itemset
3rd scan L3 Itemset sup
{B, C, E} {B, C, E} 2
Data Mining: Concepts and Techniques-
September 20, 2022 Mariem Gzara 16
16
8
The Apriori Algorithm (Pseudo-Code)
Ck: Candidate itemset of size k
Lk : frequent itemset of size k
L1 = {frequent items};
for (k = 1; Lk !=; k++) do begin
Ck+1 = candidates generated from Lk;
for each transaction t in database do
increment the count of all candidates in Ck+1 that
are contained in t
Lk+1 = candidates in Ck+1 with min_support
end
return k Lk;
Data Mining: Concepts and Techniques-
September 20, 2022 Mariem Gzara 17
17
Implementation of Apriori
18
9
Implementation of Apriori
19
Implementation of Apriori
Example of Candidate-generation
L3={abc, abd, acd, ace, bcd}
Self-joining: L3*L3
abcd from abc and abd
acde from acd and ace
Pruning:
acde is removed because ade is not in L3
C4 = {abcd}
20
10
Candidate Generation: An SQL Implementation
SQL Implementation of candidate generation
Suppose the items in Lk-1 are listed in an order
Step 1: self-joining Lk-1
insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1=q.item1, …, p.itemk-2=q.itemk-2, p.itemk-1 <
q.itemk-1
Step 2: pruning
forall itemsets c in Ck do
forall (k-1)-subsets s of c do
if (s is not in Lk-1) then delete c from Ck
Use object-relational extensions like UDFs, BLOBs, and Table functions for
efficient implementation [S. Sarawagi, S. Thomas, and R. Agrawal. Integrating
association rule mining with relational database systems: Alternatives and
implications. SIGMOD’98]
Data Mining: Concepts and Techniques-
September 20, 2022 Mariem Gzara 21
21
22
11
Pattern-Growth Approach: Mining Frequent
Patterns Without Candidate Generation
Bottlenecks of the Apriori approach
Breadth-first (i.e., level-wise) search
Candidate generation and test
Often generates a huge number of candidates
If there are 104 frequent 1-itemsets, the Apriori algorithm will need to
generate more than 107 candidate 2-itemsets
To discover a frequent pattern of size 100, it has to generate at least 2100-
11030
Expensive (time, space)
Subset checking (Computationally expensive)
Support counting is expensive
Multiple database scans (I/O)
23
The FPGrowth Approach (J. Han, J. Pei, and Y. Yin, SIGMOD’ 00)
Depth-first search
Major philosophy: Grow long patterns from short ones using local
frequent items only
24
12
Construct FP-tree from a Transaction Database
TID Items bought
100 {f, a, c, d, g, i, m, p} min_support = 3
200 {a, b, c, f, l, m, o}
300 {b, f, h, j, o, w}
400 {b, c, k, s, p}
500 {a, f, c, e, l, p, m, n}
25
26
13
Construct FP-tree from a Transaction Database
To facilitate tree traversal, an item header table is built so that each item
points to its occurrences in the tree via a chain of node-links
the problem of mining frequent patterns in databases is
transformed to that of mining the FP-tree.
{}
Header Table
27
28
14
FP-Tree Construction Algorithm (2/2)
4. for each transaction TD do
Sort the items in T in the F-list order
If (P= Ø) then
create a branch with all items in T and initialize their count to 1
Attach the branch to the root of the FP-tree
Else
increment the count of all nodes in P by one
create a branch with all items in { T - P} and initialize their
counts to 1
attach the created branch to the leaf node in P
End for
return (the FP-tree)
END Data Mining: Concepts and Techniques-
September 20, 2022 Mariem Gzara 29
29
according to F-list
F-list = f-c-a-b-m-p
Patterns containing p
Patterns having m but no p
…
Patterns having c but no a nor b, m, p
Pattern f
The 1-itemsets are considered in reverse F-list order
Completeness and non-redundency
Data Mining: Concepts and Techniques-
September 20, 2022 Mariem Gzara 30
30
15
Mining Frequent Patterns Using FP-tree
31
32
16
Mining Frequent Patterns Using FP-tree
{}
Header Table Conditional pattern bases
f:4 c:1 item cond. pattern base
Item frequency head
f 4 c f:3
c 4 c:3 b:1 b:1 a fc:3
a 3
b 3 a:3 p:1 b fca:1, f:1, c:1
m 3 m fca:2, fcab:1
p 3 m:2 b:1 p fcam:2, cb:1
p:2 m:1
Data Mining: Concepts and Techniques-
September 20, 2022 Mariem Gzara 33
33
Mining FP-tree
From Conditional Pattern-bases to Conditional FP-trees
pattern base
34
17
Algorithm FP-growth (T, )
Input
a tree T: a tree transaction database
35
count(xi)=count(ai)
Determine Dxi the conditional pattern base for
xi
TxiFP-tree(Dxi, minsup)
End for
36
18
Advantages of the Pattern Growth Approach
Divide-and-conquer:
Decompose both the mining task and DB according to the
frequent patterns obtained so far
Lead to focused search of smaller databases
Other factors
No candidate generation, no candidate test
Compressed database: FP-tree structure
No repeated scan of entire database
Basic ops: counting local freq items and building sub FP-tree, no
pattern search and matching
A good open-source implementation and refinement of FPGrowth
FPGrowth+ (Grahne and J. Zhu, FIMI'03)
Data Mining: Concepts and Techniques-
September 20, 2022 Mariem Gzara 37
37
References
R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB'94.
H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering
association rules. KDD'94.
S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication
rules for market basket analysis. SIGMOD'97.
S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with
relational database systems: Alternatives and implications. SIGMOD'98.
R. Agarwal, C. Aggarwal, and V. V. V. Prasad. A tree projection algorithm for generation of
frequent itemsets. J. Parallel and Distributed Computing:02.
J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation.
SIGMOD’ 00.
G. Liu, H. Lu, W. Lou, J. X. Yu. On Computing, Storing and Querying Frequent Patterns.
KDD'03.
G. Grahne and J. Zhu, Efficiently Using Prefix-Trees in Mining Frequent Itemsets, Proc.
ICDM'03 Int. Workshop on Frequent Itemset Mining Implementations (FIMI'03),
Melbourne, FL, Nov. 2003
Data Mining: Concepts and Techniques-
September 20, 2022 Mariem Gzara 38
38
19