
Data Mining:

Concepts and Techniques

— Chapter 1 —

Mining Frequent Patterns, Association and Correlations
Mariem Gzara
Master Recherche en Informatique


Chapter 1: Mining Frequent Patterns, Association and Correlations

 Introduction

 Basic concepts

 Association rules


Introduction
What Is Frequent Pattern Analysis?

 Frequent pattern: a pattern (a set of items, subsequences, substructures,
etc.) that occurs frequently in a data set
 Examples: a set of items such as milk and bread; a subsequence such as
buying first a PC, then a digital camera, and then a memory card.

 First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context
of frequent itemsets and association rule mining


Introduction
 Motivation: Finding inherent regularities in data
 What products were often purchased together?
 What are the subsequent purchases after buying a PC?
 What kinds of DNA are sensitive to this new drug?
 Can we automatically classify web documents?

 Applications
 Basket data analysis, cross-marketing, catalog design, sale
campaign analysis, customer shopping behavior analysis, Web log
(click stream) analysis, and DNA sequence analysis


Basic Concepts

 Items: I = {i1, ..., im}, the set of all items

 Itemset X: a set of one or more items, X ⊆ I

 Database D: a set of transactions Ti, where Ti ⊆ I

 Transaction T contains X iff X ⊆ T

 Length of an itemset X: the number of items it contains

 k-itemset: an itemset of length k, X = {x1, …, xk}



Basic Concepts
 (absolute) support, or support count, of X:
frequency of occurrence of the itemset X in D

 (relative) support, s: the fraction of
transactions that contain X (i.e., the probability
that a transaction contains X)

s = P(X) = |{T ∈ D | X ⊆ T}| / |D|

 An itemset X is frequent if X's support is no less
than a minsup threshold:

|{T ∈ D | X ⊆ T}| / |D| ≥ minsup
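As a concrete illustration, here is a minimal Python sketch of these two
definitions (the transaction data and function names are illustrative, not
from the slides):

D = [{"milk", "bread"}, {"milk"}, {"bread", "butter"},
     {"milk", "bread", "butter"}]

def support_count(X, D):
    """Absolute support: number of transactions T in D with X ⊆ T."""
    return sum(1 for T in D if X <= T)

def support(X, D):
    """Relative support: fraction of transactions that contain X."""
    return support_count(X, D) / len(D)

X = {"milk", "bread"}
minsup = 0.5
print(support_count(X, D))        # 2
print(support(X, D))              # 0.5
print(support(X, D) >= minsup)    # True, so X is frequent at minsup = 50%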

Basic Concepts

 Association rule: an implication of the form X ⇒ Y, where
X ⊆ I, Y ⊆ I and X ∩ Y = ∅

 support, s: the probability that a transaction contains X ∪ Y

s = support(X ⇒ Y) = support(X ∪ Y) = P(X ∪ Y)
  = (number of transactions containing both X and Y) / (total number of transactions)
  = |{T ∈ D | (X ∪ Y) ⊆ T}| / |D|

Basic Concepts

 confidence, c: the conditional probability that a
transaction having X also contains Y

c = confidence(X ⇒ Y) = P(Y | X) = P(X ∪ Y) / P(X)
  = (number of transactions containing both X and Y) / (number of transactions containing X)
  = |{T ∈ D | (X ∪ Y) ⊆ T}| / |{T ∈ D | X ⊆ T}|
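The same representation extends directly to rule support and confidence; a
small self-contained sketch (it uses the four-transaction database of the
example that appears two slides below):

D = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]

def support(X, D):
    """Fraction of transactions containing itemset X."""
    return sum(1 for T in D if X <= T) / len(D)

def confidence(X, Y, D):
    """conf(X ⇒ Y) = support(X ∪ Y) / support(X)."""
    return support(X | Y, D) / support(X, D)

print(support({"A"} | {"C"}, D))      # 0.5, the support of A ⇒ C
print(confidence({"A"}, {"C"}, D))    # 0.666..., the confidence of A ⇒ C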


Basic Concepts
 The downward closure property of frequent patterns

 Any subset of a frequent itemset must be frequent

 If {a, b, c} is frequent, so is {a, b}, since every
transaction having {a, b, c} also contains {a, b}


Basic Concepts

 Example

TransactionID   Items
2000            A, B, C
1000            A, C
4000            A, D
5000            B, E, F

I = {A, B, C, D, E, F}, minsup = 50%, minconf = 50%

Support
(A): 75%; (B), (C): 50%; (D), (E), (F): 25%
(A, C): 50%; (A, B), (A, D), (B, C), (B, E), (B, F), (E, F): 25%

Association rules
A ⇒ C (support = 50%, confidence = 66.6%)
C ⇒ A (support = 50%, confidence = 100%)



Mining Association rules

Task: discover all association rules that have
support ≥ minsup and confidence ≥ minconf in a
large database D



Mining Association rules

In general, association rule mining can be viewed
as a two-step process:

1. Find all frequent itemsets; that is, find all itemsets
with support ≥ minsup

2. From the frequent itemsets, generate association
rules satisfying the minimum support and minimum
confidence conditions



Mining Association rules
1. Mining frequent itemsets: Apriori algorithm, FP-
growth method, etc.
2. Generating association rules from frequent
itemsets:
 For each frequent itemset l, generate all nonempty
proper subsets of l
 For every nonempty proper subset s of l, output the rule
R: s ⇒ (l − s) if R fulfills the minimum confidence
requirement. Note that for simplicity, a single-item
consequent is often desired. (A sketch of this step follows.)
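A possible Python sketch of this rule-generation step (the helper names are
assumptions; "supports" stands for the itemset-to-relative-support map produced
by step 1, and by the downward closure property every proper subset of a
frequent itemset has a recorded support):

from itertools import chain, combinations

def proper_nonempty_subsets(l):
    """All nonempty proper subsets of itemset l."""
    return chain.from_iterable(combinations(l, r) for r in range(1, len(l)))

def generate_rules(supports, minconf):
    rules = []
    for l, sup_l in supports.items():
        if len(l) < 2:
            continue
        for s in map(frozenset, proper_nonempty_subsets(l)):
            conf = sup_l / supports[s]   # conf(s => l-s) = sup(l) / sup(s)
            if conf >= minconf:
                rules.append((set(s), set(l - s), sup_l, conf))
    return rules

supports = {frozenset({"A"}): 0.75, frozenset({"C"}): 0.5,
            frozenset({"A", "C"}): 0.5}
for X, Y, sup, conf in generate_rules(supports, minconf=0.5):
    print(X, "=>", Y, f"(support={sup:.0%}, confidence={conf:.1%})")
# Prints (in some order):
# {'A'} => {'C'} (support=50%, confidence=66.7%)
# {'C'} => {'A'} (support=50%, confidence=100.0%)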


Apriori: A Candidate Generation & Test Approach

 Apriori pruning principle: if there is any itemset that is
infrequent, its supersets should not be generated/tested!
(Agrawal & Srikant @ VLDB'94; Mannila, et al. @ KDD'94)



Apriori: A Candidate Generation & Test Approach

 Method:
 Initially, scan DB once to get frequent 1-itemset
 Generate length (k+1) candidate itemsets from length
k frequent itemsets
 Test the candidates against DB
 Terminate when no frequent or candidate set can be
generated



The Apriori Algorithm—An Example


Supmin = 2

Database TDB
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1: {A}:2, {B}:3, {C}:3, {E}:3

C2 (generated from L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan → C2: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3 (generated from L2): {B,C,E}
3rd scan → L3: {B,C,E}:2

The Apriori Algorithm (Pseudo-Code)
Ck: Candidate itemset of size k
Lk : frequent itemset of size k

L1 = {frequent items};
for (k = 1; Lk ≠ ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that
        are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ⋃k Lk;
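For reference, a compact Python rendering of this pseudocode (an illustrative
sketch, not a reference implementation; itemsets are frozensets and
min_support is an absolute count). Run on the example database above, it
reproduces L1, L2 and L3:

from itertools import combinations

def apriori(D, min_support):
    """Return {itemset: support count} for all frequent itemsets in D."""
    D = [set(t) for t in D]
    # One scan for the frequent 1-itemsets (L1)
    counts = {}
    for t in D:
        for i in t:
            counts[frozenset([i])] = counts.get(frozenset([i]), 0) + 1
    L = {x: c for x, c in counts.items() if c >= min_support}
    frequent = dict(L)
    k = 1
    while L:
        # Join step: pairs of frequent k-itemsets whose union has k+1 items
        candidates = {a | b for a in L for b in L if len(a | b) == k + 1}
        # Prune step: drop candidates having an infrequent k-subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in L for s in combinations(c, k))}
        # One scan of D to count the surviving candidates
        counts = {c: sum(1 for t in D if c <= t) for c in candidates}
        L = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(L)
        k += 1
    return frequent

TDB = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(apriori(TDB, min_support=2))
# {A}:2, {B}:3, {C}:3, {E}:3, {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2, {B,C,E}:2,
# matching the example above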


Implementation of Apriori

 How to generate candidates?


 Step 1: self-joining Lk
 Step 2: pruning
 The join step:
 All items within a transaction or itemset are sorted in
lexicographic order (or any other total order)
For a k-itemset li, the items are sorted such that
li[1] < li[2] < … < li[k]
 Ck+1 = Lk ⋈ Lk : the join operation
 l1 ∈ Lk, l2 ∈ Lk: l1 and l2 are joined if
(l1[1]=l2[1]) ∧ (l1[2]=l2[2]) ∧ … ∧ (l1[k-1]=l2[k-1]) ∧ (l1[k]<l2[k])
 The resulting itemset formed by joining l1 and l2 is {l1[1], l1[2], …, l1[k], l2[k]}



Implementation of Apriori

 The prune step:

 Lk+1 ⊆ Ck+1 : Ck+1 is a superset of Lk+1

 If any k-subset of a candidate (k+1)-itemset is
not in Lk, then the candidate can be removed
from Ck+1



Implementation of Apriori
 Example of Candidate-generation
 L3={abc, abd, acd, ace, bcd}
 Self-joining: L3*L3
 abcd from abc and abd
 acde from acd and ace
 Pruning:
 acde is removed because ade is not in L3
 C4 = {abcd}
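The join and prune steps of this example can be sketched as follows, keeping
itemsets as sorted tuples so the prefix-equality join of the previous slides
applies directly (names are illustrative):

from itertools import combinations

def join_and_prune(Lk):
    """Generate C(k+1) from the frequent k-itemsets Lk (sorted tuples)."""
    k = len(next(iter(Lk)))
    Ck1 = set()
    for l1 in Lk:
        for l2 in Lk:
            # Join: first k-1 items equal, and l1's last item < l2's last item
            if l1[:k - 1] == l2[:k - 1] and l1[k - 1] < l2[k - 1]:
                c = l1 + (l2[k - 1],)
                # Prune: every k-subset of the candidate must be in Lk
                if all(s in Lk for s in combinations(c, k)):
                    Ck1.add(c)
    return Ck1

L3 = {("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"),
      ("a", "c", "e"), ("b", "c", "d")}
print(join_and_prune(L3))   # {('a', 'b', 'c', 'd')}; acde is pruned (ade not in L3)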



Candidate Generation: An SQL Implementation
 SQL Implementation of candidate generation
 Suppose the items in Lk-1 are listed in an order
 Step 1: self-joining Lk-1
insert into Ck
select p.item1, p.item2, …, p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1=q.item1 and … and p.itemk-2=q.itemk-2
  and p.itemk-1 < q.itemk-1
 Step 2: pruning
forall itemsets c in Ck do
forall (k-1)-subsets s of c do
if (s is not in Lk-1) then delete c from Ck
 Use object-relational extensions like UDFs, BLOBs, and Table functions for
efficient implementation [S. Sarawagi, S. Thomas, and R. Agrawal. Integrating
association rule mining with relational database systems: Alternatives and
implications. SIGMOD’98]


Further Improvement of the Apriori Method

 Major computational challenges


 Multiple scans of transaction database
 Huge number of candidates
 Tedious workload of support counting for candidates
 Improving Apriori: general ideas
 Reduce passes of transaction database scans
 Shrink number of candidates
 Facilitate support counting of candidates



Pattern-Growth Approach: Mining Frequent
Patterns Without Candidate Generation
 Bottlenecks of the Apriori approach
 Breadth-first (i.e., level-wise) search
 Candidate generation and test
 Often generates a huge number of candidates
 If there are 10⁴ frequent 1-itemsets, the Apriori algorithm will need to
generate more than 10⁷ candidate 2-itemsets
 To discover a frequent pattern of size 100, it has to generate at least
2¹⁰⁰ − 1 ≈ 10³⁰ candidates
 Expensive (time, space)
 Subset checking (Computationally expensive)
 Support counting is expensive
 Multiple database scans (I/O)



Frequent Pattern-Growth Approach: Mining
Frequent Patterns Without Candidate Generation

 The FPGrowth Approach (J. Han, J. Pei, and Y. Yin, SIGMOD'00)

 Depth-first search

 Avoid explicit candidate generation

 Major philosophy: Grow long patterns from short ones using local
frequent items only

 “abc” is a frequent pattern

 Get all transactions having “abc”, i.e., project the DB on abc: DB|abc

 “d” is a local frequent item in DB|abc ⇒ abcd is a frequent pattern



Construct FP-tree from a Transaction Database
TID   Items bought
100   {f, a, c, d, g, i, m, p}
200   {a, b, c, f, l, m, o}
300   {b, f, h, j, o, w}
400   {b, c, k, s, p}
500   {a, f, c, e, l, p, m, n}

min_support = 3

1. Scan DB once, find the frequent 1-itemsets (single-item patterns):
   f: 4, c: 4, a: 3, b: 3, m: 3, p: 3
2. Sort the frequent items in frequency descending order to obtain the F-list
3. Scan DB again, construct the FP-tree

F-list = {f:4, c:4, a:3, b:3, m:3, p:3}


Construct FP-tree from a Transaction Database

3. Construct the FP-tree


 Create a NULL node modeling the root
 Map each transaction in the database into a
path of the FP-tree
 Each transaction builds a path of the FP-tree
 A path is a set of linked nodes where each node
models an item of the given transaction
 Items of a transaction are considered in the F-List
order to build the corresponding path



Construct FP-tree from a Transaction Database

To facilitate tree traversal, an item header table is built so that each item
points to its occurrences in the tree via a chain of node-links.
 The problem of mining frequent patterns in databases is thus
transformed into that of mining the FP-tree.

Header Table
Item   frequency   head
f      4
c      4
a      3
b      3
m      3
p      3

FP-tree for the example database:
{}
├─ f:4
│  ├─ c:3
│  │  └─ a:3
│  │     ├─ m:2
│  │     │  └─ p:2
│  │     └─ b:1
│  │        └─ m:1
│  └─ b:1
└─ c:1
   └─ b:1
      └─ p:1


FP-Tree Construction Algorithm (1/2)

FP-tree (D, minSup)


 Input
 D: a transaction database
 minSup: the minimum support count threshold

 Output: the corresponding FP-tree


BEGIN
1. Scan the database once; and compute the set of
frequent items F-list
2. Sort items in F-list by the decreasing order of frequency
3. Create a root for the FP-tree and label it with NULL


FP-Tree Construction Algorithm (2/2)
4. for each transaction T ∈ D do
 Sort the items in T in the F-list order

 Find the path P in the FP-tree sharing a common prefix with T

 If (P = ∅) then
 create a branch with all items in T and initialize their counts to 1
 attach the branch to the root of the FP-tree
 Else
 increment the count of all nodes in P by one
 create a branch with all items in {T − P} and initialize their
counts to 1
 attach the created branch to the leaf node of P
End for
return (the FP-tree)
END
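A minimal Python sketch of this construction (the header table and node-links
of the next slides are omitted here for brevity; all names are assumptions):

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}   # item -> FPNode

def build_fptree(D, min_sup):
    """Two scans of D: compute the F-list, then insert each transaction."""
    freq = {}
    for t in D:
        for i in t:
            freq[i] = freq.get(i, 0) + 1
    flist = [i for i, c in sorted(freq.items(), key=lambda kv: -kv[1])
             if c >= min_sup]
    rank = {i: r for r, i in enumerate(flist)}
    root = FPNode(None, None)   # the NULL root
    for t in D:
        node = root
        # Keep only frequent items, ordered by the F-list, then walk/extend
        for i in sorted((j for j in t if j in rank), key=rank.get):
            child = node.children.get(i)
            if child is None:
                child = node.children[i] = FPNode(i, node)
            child.count += 1
            node = child
    return root, flist

D = [set("facdgimp"), set("abcflmo"), set("bfhjow"),
     set("bcksp"), set("aefclpmn")]
root, flist = build_fptree(D, min_sup=3)
print(flist)   # f, c, a, b, m, p in descending frequency (ties may reorder)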


Mining Frequent Patterns Using FP-tree


 General Idea: Partition Patterns and Databases:
 Frequent patterns can be partitioned into subsets
according to the F-list
 F-list = f-c-a-b-m-p
 Patterns containing p
 Patterns having m but no p
 …
 Patterns having c but none of a, b, m, p
 Pattern f
 The 1-itemsets are considered in reverse F-list order
 Completeness and non-redundancy


Mining Frequent Patterns Using FP-tree

 General idea (divide-and-conquer)


 Recursively grow frequent pattern path using the FP-
tree
 Method
 For each item, construct its conditional pattern-base,
and then its conditional FP-tree
 Repeat the process on each newly created conditional
FP-tree
 This recursive process stops when the resulting
FP-tree is empty, or it contains only one path (a single
path will generate all the combinations of its sub-paths, each of
which is a frequent pattern)


Mining Frequent Patterns Using FP-tree

 Principles of Frequent Pattern Growth:

 Let α be a frequent itemset in DB, B be α's
conditional pattern base, and β be an itemset in B.
Then α ∪ β is a frequent itemset in DB if β is
frequent in B.
 “abcdef” is a frequent pattern, if and only if
 “abcde” is a frequent pattern, and
 “f” is frequent in the set of transactions containing
“abcde”


Mining Frequent Patterns Using FP-tree

 Find Patterns Having p From p's Conditional Database:

 Start at the frequent-item header table of the FP-tree
 Traverse the FP-tree by following the link of each frequent item p
 Accumulate all of the transformed prefix paths of item p to form p's
conditional pattern base

Conditional pattern bases (for the example FP-tree)
item   cond. pattern base
c      f:3
a      fc:3
b      fca:1, f:1, c:1
m      fca:2, fcab:1
p      fcam:2, cb:1


Mining FP-tree
From Conditional Pattern-bases to Conditional FP-trees

 For each pattern-base

 Accumulate the count for each item in the base

 Construct the FP-tree for the frequent items of the
pattern base

m-conditional pattern base: fca:2, fcab:1

m-conditional FP-tree (b is dropped since its count in the base is 1 < 3):
{}
└─ f:3
   └─ c:3
      └─ a:3

All frequent patterns related to m:
m, fm, cm, am, fcm, fam, cam, fcam


Algorithm FP-growth (T, α)

 Input
 T: an FP-tree

 α: a suffix frequent pattern

 Output: a set of grown frequent patterns

 BEGIN
1. If (T contains a single path P) then
 Let C = {c1, c2, …, cn} be the set of all combinations
of the nodes in path P
 ∀ci ∈ C: generate pattern xi = ci ∪ α, with the support
count of xi = the minimum support count of the nodes in ci
2. else


Algorithm FP-growth (T, α)

 For each item ai in the header table of T do

 Generate pattern xi = ai ∪ α with
count(xi) = count(ai)

 Determine Dxi, the conditional pattern base for xi

 Txi ← FP-tree(Dxi, minsup)

 If (Txi ≠ ∅) then call FP-growth(Txi, xi)

 End for
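Putting the two algorithms together, here is a self-contained FP-growth sketch
in Python (illustrative names; a conditional pattern base is a list of
(prefix path, count) pairs as in the slides, and each recursion builds the
corresponding conditional FP-tree):

from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 0
        self.children = {}

def build_tree(weighted_transactions, min_sup):
    """Build an FP-tree from (itemset, count) pairs; return (root, header)."""
    freq = defaultdict(int)
    for t, w in weighted_transactions:
        for i in t:
            freq[i] += w
    flist = [i for i, c in sorted(freq.items(), key=lambda kv: -kv[1])
             if c >= min_sup]
    rank = {i: r for r, i in enumerate(flist)}
    root, header = FPNode(None, None), defaultdict(list)  # header: item -> nodes
    for t, w in weighted_transactions:
        node = root
        for i in sorted((j for j in t if j in rank), key=rank.get):
            child = node.children.get(i)
            if child is None:
                child = node.children[i] = FPNode(i, node)
                header[i].append(child)
            child.count += w
            node = child
    return root, header

def fp_growth(weighted_transactions, min_sup, suffix=frozenset()):
    """Return {frequent itemset: support count}, grown from the given suffix."""
    _, header = build_tree(weighted_transactions, min_sup)
    patterns = {}
    for item, nodes in header.items():
        pattern = suffix | {item}
        patterns[pattern] = sum(n.count for n in nodes)
        # Conditional pattern base: the prefix path of every node carrying item
        base = []
        for n in nodes:
            path, p = [], n.parent
            while p.item is not None:
                path.append(p.item)
                p = p.parent
            if path:
                base.append((path, n.count))
        # Recurse on the conditional FP-tree of this pattern
        patterns.update(fp_growth(base, min_sup, pattern))
    return patterns

D = [set("facdgimp"), set("abcflmo"), set("bfhjow"),
     set("bcksp"), set("aefclpmn")]
result = fp_growth([(t, 1) for t in D], min_sup=3)
print(sorted((tuple(sorted(p)), s) for p, s in result.items()))
# includes ('m',):3, ('f','m'):3, ('c','m'):3, ('a','m'):3, ..., ('a','c','f','m'):3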


Advantages of the Pattern Growth Approach

 Divide-and-conquer:
 Decompose both the mining task and DB according to the
frequent patterns obtained so far
 Lead to focused search of smaller databases
 Other factors
 No candidate generation, no candidate test
 Compressed database: FP-tree structure
 No repeated scan of entire database
 Basic ops: counting local freq items and building sub FP-tree, no
pattern search and matching
 A good open-source implementation and refinement of FPGrowth
 FPGrowth+ (Grahne and J. Zhu, FIMI'03)


References
 R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items
in large databases. SIGMOD'93.
 R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB'94.
 H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering
association rules. KDD'94.
 S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication
rules for market basket analysis. SIGMOD'97.
 S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with
relational database systems: Alternatives and implications. SIGMOD'98.
 R. Agarwal, C. Aggarwal, and V. V. V. Prasad. A tree projection algorithm for generation of
frequent itemsets. J. Parallel and Distributed Computing, 2002.
 J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation.
SIGMOD’ 00.
 G. Liu, H. Lu, W. Lou, J. X. Yu. On Computing, Storing and Querying Frequent Patterns.
KDD'03.
 G. Grahne and J. Zhu. Efficiently using prefix-trees in mining frequent itemsets. Proc.
ICDM'03 Int. Workshop on Frequent Itemset Mining Implementations (FIMI'03), Melbourne,
FL, Nov. 2003.


