
Data Mining:

Concepts and Techniques

— Chapter 1 —

Mining Frequent Patterns, Association and Correlations
Mariem Gzara
Master Recherche en Informatique


Chapter 1: Mining Frequent Patterns, Association and Correlations

 Introduction

 Basic concepts

 Association rules


Introduction
What Is Frequent Pattern Analysis?

 Frequent pattern: a pattern (a set of items, subsequences, substructures,
etc.) that occurs frequently in a data set
 Examples: a set of items such as milk and bread; a subsequence such as
buying first a PC, then a digital camera, and then a memory card.

 First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context
of frequent itemsets and association rule mining


Introduction
 Motivation: Finding inherent regularities in data
 What products were often purchased together?
 What are the subsequent purchases after buying a PC?
 What kinds of DNA are sensitive to this new drug?
 Can we automatically classify web documents?

 Applications
 Basket data analysis, cross-marketing, catalog design, sale
campaign analysis, customer shopping behavior analysis, Web log
(click stream) analysis, and DNA sequence analysis


Basic Concepts

 Items: I = {i1, ..., im}, the set of all items

 Itemset X: a set of one or more items, X ⊆ I

 Database D: a set of transactions Ti, where Ti ⊆ I

 Transaction T contains X iff X ⊆ T

 Length of an itemset X: the number of items it contains

 k-itemset: an itemset of length k, X = {x1, …, xk}



Basic Concepts
 (absolute) support, or support count, of X:
frequency of occurrence of the itemset X in D

 (relative) support, s: the fraction of
transactions that contain X (i.e., the probability
that a transaction contains X)

s = P(X) = |{T ∈ D | X ⊆ T}| / |D|

 An itemset X is frequent if X's support is no less
than a minsup threshold:

|{T ∈ D | X ⊆ T}| / |D| ≥ minsup
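As a concrete illustration, here is a minimal Python sketch of these two
definitions (the transaction data and function names are illustrative, not
from the slides):

D = [{"milk", "bread"}, {"milk"}, {"bread", "butter"},
     {"milk", "bread", "butter"}]

def support_count(X, D):
    """Absolute support: number of transactions T in D with X ⊆ T."""
    return sum(1 for T in D if X <= T)

def support(X, D):
    """Relative support: fraction of transactions that contain X."""
    return support_count(X, D) / len(D)

X = {"milk", "bread"}
minsup = 0.5
print(support_count(X, D))        # 2
print(support(X, D))              # 0.5
print(support(X, D) >= minsup)    # True, so X is frequent at minsup = 50%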

Basic Concepts

 Association rule: an implication of the form X ⇒ Y, where
X ⊆ I, Y ⊆ I and X ∩ Y = ∅

 support, s: the probability that a transaction contains X ∪ Y

s = support(X ⇒ Y) = support(X ∪ Y) = P(X ∪ Y)
  = (number of transactions containing both X and Y) / (total number of transactions)
  = |{T ∈ D | (X ∪ Y) ⊆ T}| / |D|

Basic Concepts

 confidence, c: the conditional probability that a
transaction having X also contains Y

c = confidence(X ⇒ Y) = P(Y | X) = P(X ∪ Y) / P(X)
  = (number of transactions containing both X and Y) / (number of transactions containing X)
  = |{T ∈ D | (X ∪ Y) ⊆ T}| / |{T ∈ D | X ⊆ T}|
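The same representation extends directly to rule support and confidence; a
small self-contained sketch (it uses the four-transaction database of the
example that appears two slides below):

D = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]

def support(X, D):
    """Fraction of transactions containing itemset X."""
    return sum(1 for T in D if X <= T) / len(D)

def confidence(X, Y, D):
    """conf(X ⇒ Y) = support(X ∪ Y) / support(X)."""
    return support(X | Y, D) / support(X, D)

print(support({"A"} | {"C"}, D))      # 0.5, the support of A ⇒ C
print(confidence({"A"}, {"C"}, D))    # 0.666..., the confidence of A ⇒ C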


Basic Concepts
 The downward closure property of frequent patterns

 Any subset of a frequent itemset must be frequent

 If {a, b, c} is frequent, so is {a, b}, since every
transaction having {a, b, c} also contains {a, b}


Basic Concepts

 Example

TransactionID   Items
2000            A, B, C
1000            A, C
4000            A, D
5000            B, E, F

I = {A, B, C, D, E, F}, minsup = 50%, minconf = 50%

Support
(A): 75%; (B), (C): 50%; (D), (E), (F): 25%
(A, C): 50%; (A, B), (A, D), (B, C), (B, E), (B, F), (E, F): 25%

Association rules
A ⇒ C (support = 50%, confidence = 66.6%)
C ⇒ A (support = 50%, confidence = 100%)



Mining Association rules

Task: discover all association rules that have
support ≥ minsup and confidence ≥ minconf in a
large database D



Mining Association rules

In general, association rule mining can be viewed
as a two-step process:

1. Find all frequent itemsets; that is, find all itemsets
with support ≥ minsup

2. From the frequent itemsets, generate association
rules satisfying the minimum support and minimum
confidence conditions



Mining Association rules
1. Mining frequent itemsets: Apriori algorithm, FP-
growth method, etc.
2. Generating association rules from frequent
itemsets:
 For each frequent itemset l, generate all nonempty
proper subsets of l
 For every nonempty proper subset s of l, output the rule
R: s ⇒ (l − s) if R fulfills the minimum confidence
requirement. Note that for simplicity, a single-item
consequent is often desired. (A sketch of this step follows.)
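A possible Python sketch of this rule-generation step (the helper names are
assumptions; "supports" stands for the itemset-to-relative-support map produced
by step 1, and by the downward closure property every proper subset of a
frequent itemset has a recorded support):

from itertools import chain, combinations

def proper_nonempty_subsets(l):
    """All nonempty proper subsets of itemset l."""
    return chain.from_iterable(combinations(l, r) for r in range(1, len(l)))

def generate_rules(supports, minconf):
    rules = []
    for l, sup_l in supports.items():
        if len(l) < 2:
            continue
        for s in map(frozenset, proper_nonempty_subsets(l)):
            conf = sup_l / supports[s]   # conf(s => l-s) = sup(l) / sup(s)
            if conf >= minconf:
                rules.append((set(s), set(l - s), sup_l, conf))
    return rules

supports = {frozenset({"A"}): 0.75, frozenset({"C"}): 0.5,
            frozenset({"A", "C"}): 0.5}
for X, Y, sup, conf in generate_rules(supports, minconf=0.5):
    print(X, "=>", Y, f"(support={sup:.0%}, confidence={conf:.1%})")
# Prints (in some order):
# {'A'} => {'C'} (support=50%, confidence=66.7%)
# {'C'} => {'A'} (support=50%, confidence=100.0%)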


Apriori: A Candidate Generation & Test Approach

 Apriori pruning principle: if there is any itemset that is
infrequent, its supersets should not be generated/tested!
(Agrawal & Srikant @ VLDB'94; Mannila, et al. @ KDD'94)



Apriori: A Candidate Generation & Test Approach

 Method:
 Initially, scan DB once to get frequent 1-itemset
 Generate length (k+1) candidate itemsets from length
k frequent itemsets
 Test the candidates against DB
 Terminate when no frequent or candidate set can be
generated



The Apriori Algorithm—An Example


Supmin = 2

Database TDB
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1: {A}:2, {B}:3, {C}:3, {E}:3

C2 (generated from L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan → C2: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3 (generated from L2): {B,C,E}
3rd scan → L3: {B,C,E}:2

The Apriori Algorithm (Pseudo-Code)
Ck: Candidate itemset of size k
Lk : frequent itemset of size k

L1 = {frequent items};
for (k = 1; Lk ≠ ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that
        are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ⋃k Lk;
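For reference, a compact Python rendering of this pseudocode (an illustrative
sketch, not a reference implementation; itemsets are frozensets and
min_support is an absolute count). Run on the example database above, it
reproduces L1, L2 and L3:

from itertools import combinations

def apriori(D, min_support):
    """Return {itemset: support count} for all frequent itemsets in D."""
    D = [set(t) for t in D]
    # One scan for the frequent 1-itemsets (L1)
    counts = {}
    for t in D:
        for i in t:
            counts[frozenset([i])] = counts.get(frozenset([i]), 0) + 1
    L = {x: c for x, c in counts.items() if c >= min_support}
    frequent = dict(L)
    k = 1
    while L:
        # Join step: pairs of frequent k-itemsets whose union has k+1 items
        candidates = {a | b for a in L for b in L if len(a | b) == k + 1}
        # Prune step: drop candidates having an infrequent k-subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in L for s in combinations(c, k))}
        # One scan of D to count the surviving candidates
        counts = {c: sum(1 for t in D if c <= t) for c in candidates}
        L = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(L)
        k += 1
    return frequent

TDB = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(apriori(TDB, min_support=2))
# {A}:2, {B}:3, {C}:3, {E}:3, {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2, {B,C,E}:2,
# matching the example above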


Implementation of Apriori

 How to generate candidates?


 Step 1: self-joining Lk
 Step 2: pruning
 The join step:
 All items within a transaction or itemset are sorted in
lexicographic order (or any other total order)
For a k-itemset li, the items are sorted such that
li[1] < li[2] < … < li[k]
 Ck+1 = Lk ⋈ Lk : the join operation
 l1 ∈ Lk, l2 ∈ Lk: l1 and l2 are joined if
(l1[1]=l2[1]) ∧ (l1[2]=l2[2]) ∧ … ∧ (l1[k-1]=l2[k-1]) ∧ (l1[k]<l2[k])
 The resulting itemset formed by joining l1 and l2 is {l1[1], l1[2], …, l1[k], l2[k]}



Implementation of Apriori

 The prune step:

 Lk+1 ⊆ Ck+1 : Ck+1 is a superset of Lk+1

 If any k-subset of a candidate (k+1)-itemset is
not in Lk, then the candidate can be removed
from Ck+1



Implementation of Apriori
 Example of Candidate-generation
 L3={abc, abd, acd, ace, bcd}
 Self-joining: L3*L3
 abcd from abc and abd
 acde from acd and ace
 Pruning:
 acde is removed because ade is not in L3
 C4 = {abcd}
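The join and prune steps of this example can be sketched as follows, keeping
itemsets as sorted tuples so the prefix-equality join of the previous slides
applies directly (names are illustrative):

from itertools import combinations

def join_and_prune(Lk):
    """Generate C(k+1) from the frequent k-itemsets Lk (sorted tuples)."""
    k = len(next(iter(Lk)))
    Ck1 = set()
    for l1 in Lk:
        for l2 in Lk:
            # Join: first k-1 items equal, and l1's last item < l2's last item
            if l1[:k - 1] == l2[:k - 1] and l1[k - 1] < l2[k - 1]:
                c = l1 + (l2[k - 1],)
                # Prune: every k-subset of the candidate must be in Lk
                if all(s in Lk for s in combinations(c, k)):
                    Ck1.add(c)
    return Ck1

L3 = {("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"),
      ("a", "c", "e"), ("b", "c", "d")}
print(join_and_prune(L3))   # {('a', 'b', 'c', 'd')}; acde is pruned (ade not in L3)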



Candidate Generation: An SQL Implementation
 SQL Implementation of candidate generation
 Suppose the items in Lk-1 are listed in an order
 Step 1: self-joining Lk-1
insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1=q.item1 and … and p.itemk-2=q.itemk-2
  and p.itemk-1 < q.itemk-1
 Step 2: pruning
forall itemsets c in Ck do
forall (k-1)-subsets s of c do
if (s is not in Lk-1) then delete c from Ck
 Use object-relational extensions like UDFs, BLOBs, and Table functions for
efficient implementation [S. Sarawagi, S. Thomas, and R. Agrawal. Integrating
association rule mining with relational database systems: Alternatives and
implications. SIGMOD’98]


Further Improvement of the Apriori Method

 Major computational challenges


 Multiple scans of transaction database
 Huge number of candidates
 Tedious workload of support counting for candidates
 Improving Apriori: general ideas
 Reduce passes of transaction database scans
 Shrink number of candidates
 Facilitate support counting of candidates



Pattern-Growth Approach: Mining Frequent
Patterns Without Candidate Generation
 Bottlenecks of the Apriori approach
 Breadth-first (i.e., level-wise) search
 Candidate generation and test
 Often generates a huge number of candidates
 If there are 10⁴ frequent 1-itemsets, the Apriori algorithm will need to
generate more than 10⁷ candidate 2-itemsets
 To discover a frequent pattern of size 100, it has to generate at least
2¹⁰⁰ − 1 ≈ 10³⁰ candidates
 Expensive (time, space)
 Subset checking (Computationally expensive)
 Support counting is expensive
 Multiple database scans (I/O)



Frequent Pattern-Growth Approach: Mining
Frequent Patterns Without Candidate Generation

 The FPGrowth Approach (J. Han, J. Pei, and Y. Yin, SIGMOD'00)

 Depth-first search

 Avoid explicit candidate generation

 Major philosophy: Grow long patterns from short ones using local
frequent items only

 “abc” is a frequent pattern

 Get all transactions having “abc”, i.e., project the DB on abc: DB|abc

 “d” is a local frequent item in DB|abc ⇒ abcd is a frequent pattern



Construct FP-tree from a Transaction Database
TID   Items bought
100   {f, a, c, d, g, i, m, p}
200   {a, b, c, f, l, m, o}
300   {b, f, h, j, o, w}
400   {b, c, k, s, p}
500   {a, f, c, e, l, p, m, n}

min_support = 3

1. Scan DB once, find the frequent 1-itemsets (single-item patterns):
   f: 4, c: 4, a: 3, b: 3, m: 3, p: 3
2. Sort the frequent items in frequency descending order to obtain the F-list
3. Scan DB again, construct the FP-tree

F-list = {f:4, c:4, a:3, b:3, m:3, p:3}


Construct FP-tree from a Transaction Database

3. Construct the FP-tree


 Create a NULL node modeling the root
 Map each transaction in the database into a
path of the FP-tree
 Each transaction builds a path of the FP-tree
 A path is a set of linked nodes where each node
models an item of the given transaction
 Items of a transaction are considered in the F-List
order to build the corresponding path



Construct FP-tree from a Transaction Database

To facilitate tree traversal, an item header table is built so that each item
points to its occurrences in the tree via a chain of node-links.
 The problem of mining frequent patterns in databases is thus
transformed into that of mining the FP-tree.

Header Table
Item   frequency   head
f      4
c      4
a      3
b      3
m      3
p      3

FP-tree for the example database:
{}
├─ f:4
│  ├─ c:3
│  │  └─ a:3
│  │     ├─ m:2
│  │     │  └─ p:2
│  │     └─ b:1
│  │        └─ m:1
│  └─ b:1
└─ c:1
   └─ b:1
      └─ p:1


FP-Tree Construction Algorithm (1/2)

FP-tree (D, minSup)


 Input
 D: a transaction database
 minSup: the minimum support count threshold

 Output: the corresponding FP-tree


BEGIN
1. Scan the database once; and compute the set of
frequent items F-list
2. Sort items in F-list by the decreasing order of frequency
3. Create a root for the FP-tree and label it with NULL


FP-Tree Construction Algorithm (2/2)
4. for each transaction T ∈ D do
 Sort the items in T in the F-list order

 Find the path P in the FP-tree sharing a common prefix with T

 If (P = ∅) then
 create a branch with all items in T and initialize their counts to 1
 attach the branch to the root of the FP-tree
 Else
 increment the count of all nodes in P by one
 create a branch with all items in {T − P} and initialize their
counts to 1
 attach the created branch to the leaf node of P
End for
return (the FP-tree)
END
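A minimal Python sketch of this construction (the header table and node-links
of the next slides are omitted here for brevity; all names are assumptions):

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}   # item -> FPNode

def build_fptree(D, min_sup):
    """Two scans of D: compute the F-list, then insert each transaction."""
    freq = {}
    for t in D:
        for i in t:
            freq[i] = freq.get(i, 0) + 1
    flist = [i for i, c in sorted(freq.items(), key=lambda kv: -kv[1])
             if c >= min_sup]
    rank = {i: r for r, i in enumerate(flist)}
    root = FPNode(None, None)   # the NULL root
    for t in D:
        node = root
        # Keep only frequent items, ordered by the F-list, then walk/extend
        for i in sorted((j for j in t if j in rank), key=rank.get):
            child = node.children.get(i)
            if child is None:
                child = node.children[i] = FPNode(i, node)
            child.count += 1
            node = child
    return root, flist

D = [set("facdgimp"), set("abcflmo"), set("bfhjow"),
     set("bcksp"), set("aefclpmn")]
root, flist = build_fptree(D, min_sup=3)
print(flist)   # f, c, a, b, m, p in descending frequency (ties may reorder)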


Mining Frequent Patterns Using FP-tree


 General Idea: Partition Patterns and Databases:
 Frequent patterns can be partitioned into subsets
according to the F-list
 F-list = f-c-a-b-m-p
 Patterns containing p
 Patterns having m but no p
 …
 Patterns having c but none of a, b, m, p
 Pattern f
 The 1-itemsets are considered in reverse F-list order
 Completeness and non-redundancy


Mining Frequent Patterns Using FP-tree

 General idea (divide-and-conquer)


 Recursively grow frequent pattern path using the FP-
tree
 Method
 For each item, construct its conditional pattern-base,
and then its conditional FP-tree
 Repeat the process on each newly created conditional
FP-tree
 This recursive process stops when the resulting
FP-tree is empty, or it contains only one path (a single
path will generate all the combinations of its sub-paths, each of
which is a frequent pattern)


Mining Frequent Patterns Using FP-tree

 Principles of Frequent Pattern Growth:

 Let α be a frequent itemset in DB, B be α's
conditional pattern base, and β be an itemset in B.
Then α ∪ β is a frequent itemset in DB if β is
frequent in B.
 “abcdef” is a frequent pattern, if and only if
 “abcde” is a frequent pattern, and
 “f” is frequent in the set of transactions containing
“abcde”


Mining Frequent Patterns Using FP-tree

 Find Patterns Having p From p's Conditional Database:

 Start at the frequent-item header table of the FP-tree
 Traverse the FP-tree by following the link of each frequent item p
 Accumulate all of the transformed prefix paths of item p to form p's
conditional pattern base

Conditional pattern bases (for the example FP-tree)
item   cond. pattern base
c      f:3
a      fc:3
b      fca:1, f:1, c:1
m      fca:2, fcab:1
p      fcam:2, cb:1


Mining FP-tree
From Conditional Pattern-bases to Conditional FP-trees

 For each pattern-base

 Accumulate the count for each item in the base

 Construct the FP-tree for the frequent items of the
pattern base

m-conditional pattern base: fca:2, fcab:1

m-conditional FP-tree (b is dropped since its count in the base is 1 < 3):
{}
└─ f:3
   └─ c:3
      └─ a:3

All frequent patterns related to m:
m, fm, cm, am, fcm, fam, cam, fcam


Algorithm FP-growth (T, α)

 Input
 T: an FP-tree

 α: a suffix frequent pattern

 Output: a set of grown frequent patterns

 BEGIN
1. If (T contains a single path P) then
 Let C = {c1, c2, …, cn} be the set of all combinations
of the nodes in path P
 ∀ci ∈ C: generate pattern xi = ci ∪ α, with the support
count of xi = the minimum support count of the nodes in ci
2. else


Algorithm FP-growth (T, α)

 For each item ai in the header table of T do

 Generate pattern xi = ai ∪ α with
count(xi) = count(ai)

 Determine Dxi, the conditional pattern base for xi

 Txi ← FP-tree(Dxi, minsup)

 If (Txi ≠ ∅) then call FP-growth(Txi, xi)

 End for
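Putting the two algorithms together, here is a self-contained FP-growth sketch
in Python (illustrative names; a conditional pattern base is a list of
(prefix path, count) pairs as in the slides, and each recursion builds the
corresponding conditional FP-tree):

from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 0
        self.children = {}

def build_tree(weighted_transactions, min_sup):
    """Build an FP-tree from (itemset, count) pairs; return (root, header)."""
    freq = defaultdict(int)
    for t, w in weighted_transactions:
        for i in t:
            freq[i] += w
    flist = [i for i, c in sorted(freq.items(), key=lambda kv: -kv[1])
             if c >= min_sup]
    rank = {i: r for r, i in enumerate(flist)}
    root, header = FPNode(None, None), defaultdict(list)  # header: item -> nodes
    for t, w in weighted_transactions:
        node = root
        for i in sorted((j for j in t if j in rank), key=rank.get):
            child = node.children.get(i)
            if child is None:
                child = node.children[i] = FPNode(i, node)
                header[i].append(child)
            child.count += w
            node = child
    return root, header

def fp_growth(weighted_transactions, min_sup, suffix=frozenset()):
    """Return {frequent itemset: support count}, grown from the given suffix."""
    _, header = build_tree(weighted_transactions, min_sup)
    patterns = {}
    for item, nodes in header.items():
        pattern = suffix | {item}
        patterns[pattern] = sum(n.count for n in nodes)
        # Conditional pattern base: the prefix path of every node carrying item
        base = []
        for n in nodes:
            path, p = [], n.parent
            while p.item is not None:
                path.append(p.item)
                p = p.parent
            if path:
                base.append((path, n.count))
        # Recurse on the conditional FP-tree of this pattern
        patterns.update(fp_growth(base, min_sup, pattern))
    return patterns

D = [set("facdgimp"), set("abcflmo"), set("bfhjow"),
     set("bcksp"), set("aefclpmn")]
result = fp_growth([(t, 1) for t in D], min_sup=3)
print(sorted((tuple(sorted(p)), s) for p, s in result.items()))
# includes ('m',):3, ('f','m'):3, ('c','m'):3, ('a','m'):3, ..., ('a','c','f','m'):3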


Advantages of the Pattern Growth Approach

 Divide-and-conquer:
 Decompose both the mining task and DB according to the
frequent patterns obtained so far
 Lead to focused search of smaller databases
 Other factors
 No candidate generation, no candidate test
 Compressed database: FP-tree structure
 No repeated scan of entire database
 Basic ops: counting local freq items and building sub FP-tree, no
pattern search and matching
 A good open-source implementation and refinement of FPGrowth
 FPGrowth+ (Grahne and J. Zhu, FIMI'03)


References
 R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items
in large databases. SIGMOD'93.
 R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB'94.
 H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering
association rules. KDD'94.
 S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication
rules for market basket analysis. SIGMOD'97.
 S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with
relational database systems: Alternatives and implications. SIGMOD'98.
 R. Agarwal, C. Aggarwal, and V. V. V. Prasad. A tree projection algorithm for generation of
frequent itemsets. J. Parallel and Distributed Computing, 2002.
 J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation.
SIGMOD’ 00.
 G. Liu, H. Lu, W. Lou, J. X. Yu. On Computing, Storing and Querying Frequent Patterns.
KDD'03.
 G. Grahne and J. Zhu. Efficiently using prefix-trees in mining frequent itemsets. Proc.
ICDM'03 Int. Workshop on Frequent Itemset Mining Implementations (FIMI'03), Melbourne,
FL, Nov. 2003.


