
Mining Frequent Patterns,

Associations and
Correlations

Dr. Abdul Majid


adlmjd@yahoo.com

Mining Frequent Patterns

Frequent Patterns, Associations and Correlations


 Basic Concepts

 Frequent Itemset Mining Methods

 Which Patterns Are Interesting?—Pattern Evaluation

Methods

 Summary

Basic Concepts: Definitions
 Frequent patterns:
 patterns that occur frequently in data
 The kinds of frequent patterns:
 A frequent itemset pattern: a set of items that frequently
appear together in a transactional data set, such as milk and
bread.
 A frequent sequential pattern: such as the pattern that
customers tend to purchase first a PC, followed by a digital
camera, and then a memory card, is a frequent sequential
pattern.

Basic Concepts: Definitions
 Mining frequent patterns leads to the discovery of
interesting associations and correlations within data.
 Applications:
 Basket data analysis,
 Cross-marketing,
 Catalog design,
 Sale campaign analysis
 Web log (click stream) analysis,
 DNA sequence analysis,
 etc.

Market Basket Analysis
 Market basket analysis
 A typical example of frequent itemset mining
 Finding associations between the different items that
customers place in their “shopping baskets”

Market Basket Analysis
 The discovery of such associations can help retailers to:
 Develop marketing strategies
 If customers tend to purchase computers and printers together,
then having a sale on printers may encourage the sale of printers
as well as computers.
 Design different store layouts
 If customers who purchase computers also tend to buy antivirus
software at the same time, then placing the hardware display
close to the software display may help increase the sales of both
items.

Market Basket Analysis
 Market basket data can be represented in a binary
format.

 The Boolean vectors can be analyzed for buying patterns that reflect items that are frequently associated or purchased together.
 These patterns can be represented in the form of
association rules.

Market Basket Analysis
 The information that customers who purchase computers also tend to buy antivirus software at the same time is represented in the association rule:

computer ⇒ antivirus_software [support = 2%, confidence = 60%]

 Measures of rule interestingness:
 A support of 2% means that 2% of all the transactions under analysis
show that computer and antivirus software are purchased together.
 A confidence of 60% means that 60% of the customers who purchased
a computer also bought the antivirus software.
 Typically, association rules are considered interesting if they satisfy
both
 a minimum support threshold
 a minimum confidence threshold
 Such thresholds can be set by users or domain experts.
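As a concrete sketch, support and confidence can be computed directly from a transaction list (the item names and transactions below are illustrative, not data from the slides):

```python
# A minimal sketch of support/confidence computation over toy transactions.
def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(x, y, transactions):
    """support(X u Y) / support(X): fraction of X-buyers who also buy Y."""
    return support(x | y, transactions) / support(x, transactions)

transactions = [
    {"computer", "antivirus"},
    {"computer", "printer"},
    {"computer", "antivirus", "printer"},
    {"printer"},
]
s = support({"computer", "antivirus"}, transactions)       # 2/4 = 0.5
c = confidence({"computer"}, {"antivirus"}, transactions)  # 2/3
```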

Market Basket Analysis: Applications – (1)
 Items = products; baskets = sets of products someone
bought in one trip to the store.
 Example application: given that many people buy B and
D together:
 Run a sale on D; raise price of B.
 Only useful if many buy D & B.

Market Basket Analysis: Applications – (2)
 Baskets = sentences; items = documents containing
those sentences.
 Items that appear together too often could represent
plagiarism.
 Notice items do not have to be “in” baskets.

Market Basket Analysis: Applications – (3)
 Baskets = Web pages; items = words.
 Unusual words appearing together in a large number of
documents, e.g., “Brad” and “Angelina,” may indicate an
interesting relationship.

Basic Concepts: Frequent Patterns

Tid  Items bought
10   Pepsi, Nuts, Diaper
20   Pepsi, Coffee, Diaper
30   Pepsi, Diaper, Eggs
40   Nuts, Eggs, Milk
50   Nuts, Coffee, Diaper, Eggs, Milk

 itemset: a set of one or more items
 k-itemset: X = {x1, …, xk}
 (absolute) support, or support count, of X: frequency or number of occurrences of the itemset X
 (relative) support, s: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
 An itemset X is frequent if X's support is no less than a minsup threshold
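The support counts for this table can be reproduced in a few lines (a sketch; item names as in the slide):

```python
# Reproducing the slide's support counts over its five transactions.
db = {
    10: {"Pepsi", "Nuts", "Diaper"},
    20: {"Pepsi", "Coffee", "Diaper"},
    30: {"Pepsi", "Diaper", "Eggs"},
    40: {"Nuts", "Eggs", "Milk"},
    50: {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
}

def sup_count(itemset):
    """Absolute support: number of transactions containing `itemset`."""
    return sum(1 for items in db.values() if itemset <= items)

counts = {
    "Pepsi": sup_count({"Pepsi"}),                   # 3
    "Diaper": sup_count({"Diaper"}),                 # 4
    "Pepsi,Diaper": sup_count({"Pepsi", "Diaper"}),  # 3
}
```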
Basic Concepts: Association Rules

Tid  Items bought
10   Pepsi, Nuts, Diaper
20   Pepsi, Coffee, Diaper
30   Pepsi, Diaper, Eggs
40   Nuts, Eggs, Milk
50   Nuts, Coffee, Diaper, Eggs, Milk

 Find all the rules X ⇒ Y with minimum support and confidence
 support, s: probability that a transaction contains X ∪ Y
 confidence, c: conditional probability that a transaction having X also contains Y

Let minsup = 50%, minconf = 50%
Frequent patterns: Pepsi:3, Nuts:3, Diaper:4, Eggs:3, {Pepsi, Diaper}:3
 Association rules (many more exist):
 Pepsi ⇒ Diaper (60%, 100%)
 Diaper ⇒ Pepsi (60%, 75%)
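The two rules can be checked mechanically from the slide's table (a sketch):

```python
# Support and confidence of the slide's two rules over its transaction table.
db = [
    {"Pepsi", "Nuts", "Diaper"},
    {"Pepsi", "Coffee", "Diaper"},
    {"Pepsi", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def sup(itemset):
    return sum(1 for t in db if itemset <= t) / len(db)

def rule(x, y):
    """Return (support, confidence) of the rule X => Y."""
    s = sup(x | y)
    return s, s / sup(x)

r1 = rule({"Pepsi"}, {"Diaper"})   # -> support 0.6 (60%), confidence 1.0 (100%)
r2 = rule({"Diaper"}, {"Pepsi"})   # -> support 0.6 (60%), confidence 0.75 (75%)
```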
Basic Concepts: Support and Confidence
Mining Association Rules

 Two-step approach for mining association rules:
 1. Frequent Itemset Generation
 Generate all itemsets whose support ≥ min_sup
 2. Rule Generation
 Generate rules that satisfy minimum support and minimum confidence
 Frequent itemset generation is computationally expensive
 The second step is much less costly

Frequent Pattern Generation
 Mining frequent itemsets often generates a huge number
of itemsets satisfying the minimum support (min_sup)
threshold, especially when min_sup is set low.
 This is because if an itemset is frequent, each of its
subsets is frequent as well.
 Long pattern
 A long pattern will contain a combinatorial number of shorter, frequent sub-patterns.
 For example, a frequent itemset of length 100, such as {a1, a2, … ,
a100}, contains:
 frequent 1-itemsets: a1, a2, …, a100
 frequent 2-itemsets: (a1, a2), (a1, a3), : : : , (a99, a100), and so on
 The total number of frequent itemsets: (100 choose 1) + (100 choose 2) + … + (100 choose 100) = 2^100 − 1 ≈ 1.27 × 10^30

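The count can be checked directly:

```python
import math

# The number of nonempty subsets of a 100-itemset: sum of binomial
# coefficients C(100, 1) + ... + C(100, 100), which equals 2^100 - 1.
total = sum(math.comb(100, k) for k in range(1, 101))
print(total == 2**100 - 1)   # True; total is about 1.27 * 10**30
```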
Frequent Patterns Generations
 This is too huge a number of itemsets for any computer to
compute or store. Solution: Mine closed patterns and max-
patterns instead
 Closed itemset
 An itemset X is closed in a data set S if there exists no proper super-itemset Y, Y ⊃ X, such that Y has the same support count as X in S.
 Closed frequent itemset
 An itemset X is a closed frequent itemset in set S if X is both
closed and frequent in S.
 Maximal frequent itemset
 An itemset X is a maximal frequent itemset (or max-itemset) in set S if X is frequent, and there exists no super-itemset Y such that Y ⊃ X and Y is frequent in S.
Frequent Patterns Generations
 Suppose that a transaction database has only two
transactions:
 {<a1, a2, …, a100>; <a1, a2, …, a50>}.
 Let the minimum support count threshold be min_sup =
1.
 We find two closed frequent itemsets and their support
counts, that is,
 C = { {a1, a2, …, a100} : 1; {a1, a2, …, a50} :2 }.
 There is one maximal frequent itemset:
 M = {{a1, a2, …, a100} : 1}.
 Notice that we cannot include {a1, a2, …, a50} as a maximal frequent itemset because it has a frequent superset, {a1, a2, …, a100}.
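A scaled-down analog of this example (items a–d instead of a1..a100, transactions {a,b,c,d} and {a,b}) keeps brute-force enumeration feasible and shows both definitions in code (a sketch):

```python
from itertools import combinations

# Scaled-down analog of the slide's two-transaction example.
db = [frozenset("abcd"), frozenset("ab")]
min_sup = 1
items = sorted(set().union(*db))

def sup(s):
    return sum(1 for t in db if s <= t)

freq = [frozenset(c) for r in range(1, len(items) + 1)
        for c in combinations(items, r) if sup(frozenset(c)) >= min_sup]

# closed: no proper superset with the same support
closed = [x for x in freq if not any(x < y and sup(y) == sup(x) for y in freq)]
# maximal: no frequent proper superset at all
maximal = [x for x in freq if not any(x < y for y in freq)]
# closed  -> {a,b} (support 2) and {a,b,c,d} (support 1)
# maximal -> {a,b,c,d} only: {a,b} has the frequent superset {a,b,c,d}
```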

Kinds of Frequent Pattern Mining
 There are many kinds of frequent patterns, association
rules, and correlation relationships.
 Frequent pattern mining can be classified in various
ways, based on:
 the completeness of patterns to be mined
 the levels of abstraction involved in the rule set
 the number of data dimensions involved in the rule
 the types of values handled in the rule
 the kinds of rules to be mined
 the kinds of patterns to be mined

Kinds of Frequent Pattern Mining
 Based on the completeness of patterns to be mined:
 Complete set of frequent itemsets
 Closed frequent itemsets
 Maximal frequent itemsets
 Constrained frequent itemsets
 those that satisfy a set of user-defined constraints
 Top-k frequent itemsets
 the k most frequent itemsets for a user-specified value

Kinds of Frequent Pattern Mining
 Based on the levels of abstraction involved in the rule
set:
 Single-level association rules
 the rules within a given set reference items or attributes at a single level of abstraction, e.g.

 Multilevel association rules


 methods for association rule mining can find rules at differing levels
of abstraction, e.g.

 “computer” is a higher-level abstraction of “laptop computer”

Kinds of Frequent Pattern Mining
 Based on the number of data dimensions involved in the
rule:
 Single-dimensional association rule
 If an association rule references only one dimension, e.g.

 refer to only one dimension, buys

 Multidimensional association rule


 If a rule references two or more dimensions, e.g.

 the dimensions are age, income, and buys

Kinds of Frequent Pattern Mining
 Based on the types of values handled in the rule
 Boolean association rule
 If a rule involves associations between the presence or absence
of items. e.g.

 Quantitative association rule


 If a rule describes associations between quantitative items or
attributes, e.g.

Kinds of Frequent Pattern Mining
 Based on the kinds of rules to be mined:
 Association rules
 the most popular kind of rules generated from frequent patterns.
 such mining can generate a large number of rules, many of
which are redundant or do not indicate a correlation relationship
among itemsets.
 Correlation rules
 the discovered associations can be further analyzed to uncover
statistical correlations

Kinds of Frequent Pattern Mining
 Based on the kinds of patterns to be mined
  Frequent itemset mining
 the mining of frequent itemsets (sets of items) from transactional
or relational data sets.

  Sequential pattern mining


 searches for frequent subsequences in a sequence data set,
where a sequence records an ordering of events.
 For example, we can study the order in which items are
frequently purchased.
 For instance, customers may tend to first buy a PC, followed by a
digital camera, and then a memory card.

Mining Frequent Patterns
 Basic Concepts

 Frequent Itemset Mining Methods

 Which Patterns Are Interesting?—Pattern Evaluation

Methods

 Summary

Frequent Patterns: Complete Search Approach
 Complete search approach:
 List all possible itemsets: M = 2^d − 1 (for d items)
 Count the support of each itemset by scanning the database
 Eliminate itemsets that fail the min_sup
 ⇒ Computationally prohibitive!

 Reduce the number of candidates (M)


 Use pruning techniques to reduce M
 e.g. Apriori algorithm

The Downward Closure Property and Scalable Mining Methods
 The downward closure property of frequent patterns
 Any subset of a frequent itemset must be frequent

 If {pepsi, diaper, nuts} is frequent, so is {pepsi, diaper}
 i.e., every transaction having {pepsi, diaper, nuts} also contains {pepsi, diaper}

 Scalable mining methods: Three major approaches
 Apriori (Agrawal & Srikant@VLDB’94)

 Freq. pattern growth (FPgrowth—Han, Pei & Yin

@SIGMOD’00)
 Vertical data format approach (Charm—Zaki & Hsiao

@SDM’02)

Apriori
 Apriori pruning principle: If there is any itemset which is
infrequent, its superset should not be generated/tested!
(Agrawal & Srikant @VLDB’94, Mannila, et al. @ KDD’ 94)
 Apriori Property: All nonempty subsets of a frequent itemset
must also be frequent.
 The name of the algorithm is based on the fact that the
algorithm uses prior knowledge of frequent itemset
properties.
 Apriori employs an iterative approach, where k-itemsets are used to explore (k+1)-itemsets.
Apriori Property

Apriori Pruning Principle

Apriori
 Let k=1
 Scan DB once to get frequent k-itemset
 Repeat until no new frequent or candidate itemsets are
identified
 Generate length (k+1) candidate itemsets from length k frequent
itemsets
 Prune candidate itemsets containing subsets of length k that are
infrequent
 Scan DB to count the support of each candidate
 Eliminate candidates that are infrequent, leaving only those that
are frequent

The Apriori Algorithm—An Example

min_sup = 2

Database TDB:
Tid  Items
10   A, C, D
20   B, C, E
30   A, B, C, E
40   B, E

1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1: {A}:2, {B}:3, {C}:3, {E}:3

C2 (join on L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan → counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3 (join on L2, after pruning): {B,C,E}
3rd scan → L3: {B,C,E}:2
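The full loop on this TDB can be sketched in a few lines of Python (a minimal in-memory Apriori, not an optimized implementation; the join step is simplified to all size-k unions, with the prune step discarding over-generated candidates):

```python
from itertools import combinations

# Minimal Apriori reproducing the walk-through above (min_sup = 2).
db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
min_sup = 2

def frequent(cands):
    return {c for c in cands if sum(1 for t in db if c <= t) >= min_sup}

L = frequent(frozenset([i]) for t in db for i in t)   # L1: {A},{B},{C},{E}
result = set(L)
k = 2
while L:
    cands = {a | b for a in L for b in L if len(a | b) == k}   # join (simplified)
    cands = {c for c in cands                                  # prune step
             if all(frozenset(s) in L for s in combinations(c, k - 1))}
    L = frequent(cands)
    result |= L
    k += 1
# result == L1 u L2 u L3, including {B,C,E} with support 2
```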
The Apriori Algorithm—An Example: Pruning

The Apriori Algorithm (Pseudo-Code)

Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
Implementation of Apriori

Frequent Patterns, Associations and Correlations


 How to generate candidates?
 Step 1: self-joining Lk
 Step 2: pruning
 Example of Candidate-generation
 L3={abc, abd, acd, ace, bcd}
 Self-joining: L3*L3
 abcd from abc and abd
 acde from acd and ace
 Pruning:
 acde is removed because ade is not in L3
 C4 = {abcd}
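The join and prune steps for this example can be sketched as follows (prefix-based self-join, as in Apriori's candidate generation):

```python
from itertools import combinations

# Apriori candidate generation for the slide's example: L3 -> C4.
L3 = [frozenset(s) for s in ("abc", "abd", "acd", "ace", "bcd")]

def self_join(Lk):
    """Join two k-itemsets that agree on their first k-1 (sorted) items."""
    out = set()
    tuples = [tuple(sorted(s)) for s in Lk]
    for a in tuples:
        for b in tuples:
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                out.add(frozenset(a) | {b[-1]})
    return out          # here: {abcd, acde}

def prune(cands, Lk):
    """Drop candidates having an infrequent (k-1)-subset."""
    Lk = set(Lk)
    return {c for c in cands
            if all(frozenset(s) in Lk for s in combinations(sorted(c), len(c) - 1))}

C4 = prune(self_join(L3), L3)   # acde removed because ade is not in L3
```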
Improvement of the Apriori Method
 Major computational challenges
 Multiple scans of transaction database
 Huge number of candidates
 Tedious workload of support counting for candidates
 Improving Apriori: general ideas
 Reduce passes of transaction database scans
 Shrink number of candidates
 Facilitate support counting of candidates

Pattern-Growth Approach: Mining Frequent Patterns Without Candidate Generation

 Bottlenecks of the Apriori approach


 Breadth-first (i.e., level-wise) search
 Candidate generation and test
 Often generates a huge number of candidates
 The FPGrowth Approach (J. Han, J. Pei, and Y. Yin,
SIGMOD’ 00)
 Depth-first search
 Avoid explicit candidate generation

Construct FP-tree from a Transaction Database
TID  Items bought                 (ordered) frequent items
100  {f, a, c, d, g, i, m, p}     {f, c, a, m, p}
200  {a, b, c, f, l, m, o}        {f, c, a, b, m}
300  {b, f, h, j, o, w}           {f, b}
400  {b, c, k, s, p}              {c, b, p}
500  {a, f, c, e, l, p, m, n}     {f, c, a, m, p}

min_support = 3

1. Scan DB once, find frequent 1-itemsets (single-item patterns)
2. Sort frequent items in frequency-descending order: F-list = f-c-a-b-m-p
3. Scan DB again, construct the FP-tree

Header table (item : frequency): f:4, c:4, a:3, b:3, m:3, p:3

FP-tree root-to-leaf paths (node counts in parentheses; shared prefixes are stored once in the tree):
f(4) – c(3) – a(3) – m(2) – p(2)
f(4) – c(3) – a(3) – b(1) – m(1)
f(4) – b(1)
c(1) – b(1) – p(1)
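Steps 1–2 can be sketched directly on the slide's transactions (the F-list itself is taken from the slide, since the ordering of equally frequent items is a free choice):

```python
from collections import Counter

# FP-tree construction, steps 1-2: count items, keep frequent ones,
# and reorder each transaction by the F-list (min_support = 3).
db = [list("facdgimp"), list("abcflmo"), list("bfhjow"),
      list("bcksp"), list("afcelpmn")]
min_support = 3

counts = Counter(i for t in db for i in t)
frequent = {i for i, c in counts.items() if c >= min_support}
# frequent == {'f','c','a','b','m','p'}; f and c occur 4 times, the rest 3

flist = ["f", "c", "a", "b", "m", "p"]   # F-list as given on the slide
ordered = [[i for i in flist if i in t] for t in db]
# ordered[0] == ['f','c','a','m','p'], matching the slide's right column
```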
Partition Patterns and Databases

 Frequent patterns can be partitioned into subsets
according to f-list
 F-list = f-c-a-b-m-p
 Patterns containing p
 Patterns having m but no p
 …
 Patterns having c but no a nor b, m, p
 Pattern f
 Completeness and non-redundancy

Find Patterns Having P From P-conditional Database
 Starting at the frequent item header table in the FP-tree
 Traverse the FP-tree by following the link of each frequent item p
 Accumulate all the transformed prefix paths of item p to form p's conditional pattern base

Conditional pattern bases (item : conditional pattern base):
c : f:3
a : fc:3
b : fca:1, f:1, c:1
m : fca:2, fcab:1
p : fcam:2, cb:1
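The table above can be derived mechanically from the FP-tree's root-to-leaf paths (a sketch; the paths and counts below are transcribed from the tree built earlier):

```python
# Deriving conditional pattern bases from FP-tree paths.
# Each entry: (root-to-leaf path, number of transactions following it).
paths = [
    (["f", "c", "a", "m", "p"], 2),
    (["f", "c", "a", "b", "m"], 1),
    (["f", "b"], 1),
    (["c", "b", "p"], 1),
]

def conditional_pattern_base(item):
    """Prefix paths ending just before `item`, with their path counts."""
    base = []
    for path, count in paths:
        if item in path:
            prefix = path[:path.index(item)]
            if prefix:
                base.append((prefix, count))
    return base

# conditional_pattern_base("m") -> [(['f','c','a'], 2), (['f','c','a','b'], 1)]
# conditional_pattern_base("b") -> [(['f','c','a'], 1), (['f'], 1), (['c'], 1)]
```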
From Conditional Pattern-bases to Conditional FP-trees

 For each pattern base:
 Accumulate the count for each item in the base
 Construct the FP-tree for the frequent items of the pattern base

m-conditional pattern base: fca:2, fcab:1
m-conditional FP-tree: {} → f:3 → c:3 → a:3
All frequent patterns relating to m: m, fm, cm, am, fcm, fam, cam, fcam
Recursion: Mining Each Conditional FP-tree

m-conditional FP-tree: {} → f:3 → c:3 → a:3

Cond. pattern base of "am": (fc:3) → am-conditional FP-tree: {} → f:3 → c:3
Cond. pattern base of "cm": (f:3) → cm-conditional FP-tree: {} → f:3
Cond. pattern base of "cam": (f:3) → cam-conditional FP-tree: {} → f:3
Construct FP-tree from a Transaction Database: Another Example
Extension of Pattern Growth Mining Methodology
 Mining closed frequent itemsets and max-patterns
 CLOSET (DMKD’00), FPclose, and FPMax (Grahne & Zhu, Fimi’03)

 Mining sequential patterns


 PrefixSpan (ICDE’01), CloSpan (SDM’03), BIDE (ICDE’04)

 Mining graph patterns


 gSpan (ICDM’02), CloseGraph (KDD’03)

 Constraint-based mining of frequent patterns


 Convertible constraints (ICDE’01), gPrune (PAKDD’03)

 Computing iceberg data cubes with complex measures


 H-tree, H-cubing, and Star-cubing (SIGMOD’01, VLDB’03)

 Pattern-growth-based Clustering
 MaPle (Pei, et al., ICDM’03)

 Pattern-Growth-Based Classification
 Mining frequent and discriminative patterns (Cheng, et al, ICDE’07)

Mining Frequent Itemsets Using the Vertical Data Format
 Data can be presented in item-TID set format (i.e., {item :
TID set})
 This is known as the vertical data format.

Mining Frequent Itemsets Using the Vertical Data Format
 Vertical format: t(AB) = {T11, T25, …}
 tid-list: list of trans.-ids containing an itemset
 Deriving frequent patterns based on vertical intersections
 t(X) = t(Y): X and Y always happen together
 t(X) ⊂ t(Y): transaction having X always has Y
 Using diffset to accelerate mining
 Only keep track of differences of tids
 t(X) = {T1, T2, T3}, t(XY) = {T1, T3}
 Diffset (XY, X) = {T2}
 Eclat (Zaki et al. @KDD’97)
 Mining Closed patterns using vertical format: CHARM (Zaki &
Hsiao@SDM’02)
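Tid-list intersection and the diffset trick can be sketched as follows (toy tid-lists, not tied to a specific table in the slides):

```python
# Vertical-format mining: support via tid-list intersection, plus diffsets.
t = {
    "A": {1, 2, 3, 5},
    "B": {1, 3, 4, 5},
}
t_AB = t["A"] & t["B"]   # tid-list of itemset AB by intersection: {1, 3, 5}
sup_AB = len(t_AB)       # support count of AB = 3

diffset = t["A"] - t_AB  # diffset(AB, A): tids having A but not AB = {2}
# support(AB) = support(A) - |diffset|, so only the (often much smaller)
# diffset needs to be stored instead of the full tid-list.
```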

Mining Frequent Itemsets Using the Vertical Data Format
 Mining can be performed on this
data set by intersecting the TID
sets of every pair of frequent
single items.
 The minimum support count is 2.
 The itemsets {I1, I4} and {I3, I5} each contain only one transaction, so they do not belong to the set of frequent 2-itemsets.
 Based on the Apriori property, a given 3-itemset is a candidate 3-itemset only if every one of its 2-itemset subsets is frequent.

Mining Frequent Closed Patterns: CLOSET
 Flist: list of all frequent items in support-ascending order
 Flist = d-a-f-e-c, min_sup = 2

TID  Items
10   a, c, d, e, f
20   a, b, e
30   c, e, f
40   a, c, d, f
50   c, e, f

 Divide search space:
 Patterns having d
 Patterns having d but no a, etc.
 Find frequent closed patterns recursively
 Every transaction having d also has cfa ⇒ cfad is a frequent closed pattern
 J. Pei, J. Han & R. Mao. “CLOSET: An Efficient Algorithm for
Mining Frequent Closed Itemsets", DMKD'00.

MaxMiner: Mining Max-Patterns

Tid  Items
10   A, B, C, D, E
20   B, C, D, E
30   A, C, D, F

 1st scan: find frequent items: A, B, C, D, E
 2nd scan: find support for AB, AC, AD, AE, ABCDE; BC, BD, BE, BCDE; CD, CE, CDE, DE (potential max-patterns)
 Since BCDE is a max-pattern, no need to check BCD, BDE, CDE in a later scan
 R. Bayardo. Efficiently mining long patterns from databases. SIGMOD'98
Mining Frequent Patterns

 Basic Concepts

 Frequent Itemset Mining Methods

 Which Patterns Are Interesting?—Pattern Evaluation

Methods

 Summary

Interestingness Measure: Correlations (Lift)

 play basketball ⇒ eat cereal [40%, 66.7%] is misleading
 The overall % of students eating cereal is 75% > 66.7%
 play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate, although with lower support and confidence
 Measure of dependent/correlated events: lift

lift = P(A ∪ B) / (P(A) × P(B))

            Basketball   Not basketball   Sum (row)
Cereal      2000         1750             3750
Not cereal  1000         250              1250
Sum (col.)  3000         2000             5000

lift(B, C)  = (2000/5000) / ((3000/5000) × (3750/5000)) = 0.89
lift(B, ¬C) = (1000/5000) / ((3000/5000) × (1250/5000)) = 1.33
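The two lift values follow directly from the contingency table:

```python
# Lift for the basketball/cereal contingency table (5000 students).
n = 5000
n_basketball, n_cereal, n_both = 3000, 3750, 2000
n_basketball_no_cereal = 1000

def lift(n_xy, n_x, n_y):
    """P(X and Y) / (P(X) * P(Y)); 1 means independence."""
    return (n_xy / n) / ((n_x / n) * (n_y / n))

l_cereal = lift(n_both, n_basketball, n_cereal)
# ~0.89 (< 1): basketball and cereal are negatively correlated
l_no_cereal = lift(n_basketball_no_cereal, n_basketball, n - n_cereal)
# ~1.33 (> 1): basketball and "not cereal" are positively correlated
```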

Interestingness Measure

 Over 20 interestingness measures have been proposed (see Tan, Kumar, Srivastava @KDD'02)
 Which are good ones?

Home Work
 (Implementation project) Using a programming
language that you are familiar with, such as C++ or Java,
implement three frequent itemset mining algorithms (1)
Apriori [AS94b], (2) FP-growth [HPY00], and (3) Eclat
[Zak00] (mining using the vertical data format). Compare
the performance of each algorithm with various kinds of
large data sets. Write a report to analyze the situations
(e.g., data size, data distribution, minimal support
threshold setting, and pattern density) where one
algorithm may perform better than the others, and state
why.

Home Work
 (Implementation project) The DBLP data set (www.informatik.uni-
trier.de/ley/db/) consists of over one million entries of research
papers published in computer science conferences and journals.
Among these entries, there are a good number of authors that have
coauthor relationships.
 (a) Propose a method to efficiently mine a set of coauthor
relationships that are closely correlated (e.g., often coauthoring
papers together).
 (b) Based on the mining results and the pattern evaluation
measures, discuss which measure may convincingly uncover close
collaboration patterns better than others.
 (c) Based on the study in (a), develop a method that can roughly predict advisor and advisee relationships and the approximate period for such advisory supervision.

The End
Thanks
