
Mining Frequent Patterns,

Associations and
Correlations

Dr. Abdul Majid


adlmjd@yahoo.com

Mining Frequent Patterns

Frequent Patterns, Associations and Correlations


 Basic Concepts

 Frequent Itemset Mining Methods

 Which Patterns Are Interesting?—Pattern Evaluation

Methods

 Summary

Basic Concepts: Definitions
 Frequent patterns:
 patterns that occur frequently in data
 The kinds of frequent patterns:
 A frequent itemset pattern: a set of items that frequently
appear together in a transactional data set, such as milk and
bread.
 A frequent sequential pattern: such as the pattern that
customers tend to purchase first a PC, followed by a digital
camera, and then a memory card, is a frequent sequential
pattern.

Basic Concepts: Definitions
 Mining frequent patterns leads to the discovery of
interesting associations and correlations within data.
 Applications:
 Basket data analysis,
 Cross-marketing,
 Catalog design,
 Sale campaign analysis
 Web log (click stream) analysis,
 DNA sequence analysis,
 etc.

Market Basket Analysis
 Market basket analysis
 A typical example of frequent itemset mining
 Finding associations between the different items that
customers place in their “shopping baskets”

Market Basket Analysis
 The discovery of such associations can help retailers to:
 Develop marketing strategies
 If customers tend to purchase computers and printers together,
then having a sale on printers may encourage the sale of printers
as well as computers.
 Design different store layouts
 If customers who purchase computers also tend to buy antivirus
software at the same time, then placing the hardware display
close to the software display may help increase the sales of both
items.

Market Basket Analysis
 Market basket data can be represented in a binary
format.

 The Boolean vectors can be analyzed for buying patterns that reflect items that are frequently associated or purchased together.
 These patterns can be represented in the form of
association rules.

Market Basket Analysis
 The information that customers who purchase computers also tend to buy antivirus software at the same time is represented in the association rule:

computer ⇒ antivirus_software [support = 2%, confidence = 60%]

 Measures of rule interestingness:
 A support of 2% means that 2% of all the transactions under analysis
show that computer and antivirus software are purchased together.
 A confidence of 60% means that 60% of the customers who purchased
a computer also bought the antivirus software.
 Typically, association rules are considered interesting if they satisfy
both
 a minimum support threshold
 a minimum confidence threshold
 Such thresholds can be set by users or domain experts.
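As a concrete sketch, support and confidence can be computed directly from a transaction list (the item names and transactions below are illustrative, not data from the slides):

```python
# A minimal sketch of support/confidence computation over toy transactions.
def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(x, y, transactions):
    """support(X u Y) / support(X): fraction of X-buyers who also buy Y."""
    return support(x | y, transactions) / support(x, transactions)

transactions = [
    {"computer", "antivirus"},
    {"computer", "printer"},
    {"computer", "antivirus", "printer"},
    {"printer"},
]
s = support({"computer", "antivirus"}, transactions)       # 2/4 = 0.5
c = confidence({"computer"}, {"antivirus"}, transactions)  # 2/3
```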

Market Basket Analysis: Applications – (1)
 Items = products; baskets = sets of products someone
bought in one trip to the store.
 Example application: given that many people buy B and
D together:
 Run a sale on D; raise price of B.
 Only useful if many buy D & B.

Market Basket Analysis: Applications – (2)
 Baskets = sentences; items = documents containing
those sentences.
 Items that appear together too often could represent
plagiarism.
 Notice items do not have to be “in” baskets.

Market Basket Analysis: Applications – (3)
 Baskets = Web pages; items = words.
 Unusual words appearing together in a large number of
documents, e.g., “Brad” and “Angelina,” may indicate an
interesting relationship.

Basic Concepts: Frequent Patterns

Tid  Items bought
10   Pepsi, Nuts, Diaper
20   Pepsi, Coffee, Diaper
30   Pepsi, Diaper, Eggs
40   Nuts, Eggs, Milk
50   Nuts, Coffee, Diaper, Eggs, Milk

 itemset: a set of one or more items
 k-itemset: X = {x1, …, xk}
 (absolute) support, or support count, of X: frequency or number of occurrences of the itemset X
 (relative) support, s: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
 An itemset X is frequent if X's support is no less than a minsup threshold
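The support counts for this table can be reproduced in a few lines (a sketch; item names as in the slide):

```python
# Reproducing the slide's support counts over its five transactions.
db = {
    10: {"Pepsi", "Nuts", "Diaper"},
    20: {"Pepsi", "Coffee", "Diaper"},
    30: {"Pepsi", "Diaper", "Eggs"},
    40: {"Nuts", "Eggs", "Milk"},
    50: {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
}

def sup_count(itemset):
    """Absolute support: number of transactions containing `itemset`."""
    return sum(1 for items in db.values() if itemset <= items)

counts = {
    "Pepsi": sup_count({"Pepsi"}),                   # 3
    "Diaper": sup_count({"Diaper"}),                 # 4
    "Pepsi,Diaper": sup_count({"Pepsi", "Diaper"}),  # 3
}
```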
Basic Concepts: Association Rules

Tid  Items bought
10   Pepsi, Nuts, Diaper
20   Pepsi, Coffee, Diaper
30   Pepsi, Diaper, Eggs
40   Nuts, Eggs, Milk
50   Nuts, Coffee, Diaper, Eggs, Milk

 Find all the rules X ⇒ Y with minimum support and confidence
 support, s: probability that a transaction contains X ∪ Y
 confidence, c: conditional probability that a transaction having X also contains Y

Let minsup = 50%, minconf = 50%
Frequent patterns: Pepsi:3, Nuts:3, Diaper:4, Eggs:3, {Pepsi, Diaper}:3
 Association rules (many more exist):
 Pepsi ⇒ Diaper (60%, 100%)
 Diaper ⇒ Pepsi (60%, 75%)
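The two rules can be checked mechanically from the slide's table (a sketch):

```python
# Support and confidence of the slide's two rules over its transaction table.
db = [
    {"Pepsi", "Nuts", "Diaper"},
    {"Pepsi", "Coffee", "Diaper"},
    {"Pepsi", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def sup(itemset):
    return sum(1 for t in db if itemset <= t) / len(db)

def rule(x, y):
    """Return (support, confidence) of the rule X => Y."""
    s = sup(x | y)
    return s, s / sup(x)

r1 = rule({"Pepsi"}, {"Diaper"})   # -> support 0.6 (60%), confidence 1.0 (100%)
r2 = rule({"Diaper"}, {"Pepsi"})   # -> support 0.6 (60%), confidence 0.75 (75%)
```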
Basic Concepts: Support and Confidence
Mining Association Rules

 Two-step approach for mining association rules:
 1. Frequent Itemset Generation
 Generate all itemsets whose support ≥ min_sup
 2. Rule Generation
 Generate rules that satisfy minimum support and minimum confidence
 Frequent itemset generation is computationally expensive
 The second step is much less costly

Frequent Pattern Generation
 Mining frequent itemsets often generates a huge number
of itemsets satisfying the minimum support (min_sup)
threshold, especially when min_sup is set low.
 This is because if an itemset is frequent, each of its
subsets is frequent as well.
 Long pattern
 A long pattern will contain a combinatorial number of shorter, frequent sub-patterns.
 For example, a frequent itemset of length 100, such as {a1, a2, … ,
a100}, contains:
 frequent 1-itemsets: a1, a2, …, a100
 frequent 2-itemsets: (a1, a2), (a1, a3), : : : , (a99, a100), and so on
 The total number of frequent itemsets: (100 choose 1) + (100 choose 2) + … + (100 choose 100) = 2^100 − 1 ≈ 1.27 × 10^30

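The count can be checked directly:

```python
import math

# The number of nonempty subsets of a 100-itemset: sum of binomial
# coefficients C(100, 1) + ... + C(100, 100), which equals 2^100 - 1.
total = sum(math.comb(100, k) for k in range(1, 101))
print(total == 2**100 - 1)   # True; total is about 1.27 * 10**30
```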
Frequent Patterns Generations
 This is too huge a number of itemsets for any computer to
compute or store. Solution: Mine closed patterns and max-
patterns instead
 Closed itemset
 An itemset X is closed in a data set S if there exists no proper super-itemset Y, Y ⊃ X, such that Y has the same support count as X in S.
 Closed frequent itemset
 An itemset X is a closed frequent itemset in set S if X is both
closed and frequent in S.
 Maximal frequent itemset
 An itemset X is a maximal frequent itemset (or max-itemset) in set S if X is frequent, and there exists no super-itemset Y such that Y ⊃ X and Y is frequent in S.
Frequent Patterns Generations
 Suppose that a transaction database has only two
transactions:
 {<a1, a2, …, a100>; <a1, a2, …, a50>}.
 Let the minimum support count threshold be min_sup =
1.
 We find two closed frequent itemsets and their support
counts, that is,
 C = { {a1, a2, …, a100} : 1; {a1, a2, …, a50} :2 }.
 There is one maximal frequent itemset:
 M = {{a1, a2, …, a100} : 1}.
 Notice that we cannot include {a1, a2, …, a50} as a maximal frequent itemset because it has a frequent superset, {a1, a2, …, a100}.
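A scaled-down analog of this example (items a–d instead of a1..a100, transactions {a,b,c,d} and {a,b}) keeps brute-force enumeration feasible and shows both definitions in code (a sketch):

```python
from itertools import combinations

# Scaled-down analog of the slide's two-transaction example.
db = [frozenset("abcd"), frozenset("ab")]
min_sup = 1
items = sorted(set().union(*db))

def sup(s):
    return sum(1 for t in db if s <= t)

freq = [frozenset(c) for r in range(1, len(items) + 1)
        for c in combinations(items, r) if sup(frozenset(c)) >= min_sup]

# closed: no proper superset with the same support
closed = [x for x in freq if not any(x < y and sup(y) == sup(x) for y in freq)]
# maximal: no frequent proper superset at all
maximal = [x for x in freq if not any(x < y for y in freq)]
# closed  -> {a,b} (support 2) and {a,b,c,d} (support 1)
# maximal -> {a,b,c,d} only: {a,b} has the frequent superset {a,b,c,d}
```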

Kinds of Frequent Pattern Mining
 There are many kinds of frequent patterns, association
rules, and correlation relationships.
 Frequent pattern mining can be classified in various
ways, based on:
 the completeness of patterns to be mined
 the levels of abstraction involved in the rule set
 the number of data dimensions involved in the rule
 the types of values handled in the rule
 the kinds of rules to be mined
 the kinds of patterns to be mined

Kinds of Frequent Pattern Mining
 Based on the completeness of patterns to be mined:
 Complete set of frequent itemsets
 Closed frequent itemsets
 Maximal frequent itemsets
 Constrained frequent itemsets
 those that satisfy a set of user-defined constraints
 Top-k frequent itemsets
 the k most frequent itemsets for a user-specified value

Kinds of Frequent Pattern Mining
 Based on the levels of abstraction involved in the rule
set:
 Single-level association rules
 the rules within a given set reference items or attributes at a single level of abstraction, e.g.

 Multilevel association rules


 methods for association rule mining can find rules at differing levels
of abstraction, e.g.

 “computer” is a higher-level abstraction of “laptop computer”

Kinds of Frequent Pattern Mining
 Based on the number of data dimensions involved in the
rule:
 Single-dimensional association rule
 If an association rule references only one dimension, e.g.

 refer to only one dimension, buys

 Multidimensional association rule


 If a rule references two or more dimensions, e.g.

 the dimensions are age, income, and buys

Kinds of Frequent Pattern Mining
 Based on the types of values handled in the rule
 Boolean association rule
 If a rule involves associations between the presence or absence
of items. e.g.

 Quantitative association rule


 If a rule describes associations between quantitative items or
attributes, e.g.

Kinds of Frequent Pattern Mining
 Based on the kinds of rules to be mined:
 Association rules
 the most popular kind of rules generated from frequent patterns.
 such mining can generate a large number of rules, many of
which are redundant or do not indicate a correlation relationship
among itemsets.
 Correlation rules
 the discovered associations can be further analyzed to uncover
statistical correlations

Kinds of Frequent Pattern Mining
 Based on the kinds of patterns to be mined
  Frequent itemset mining
 the mining of frequent itemsets (sets of items) from transactional
or relational data sets.

  Sequential pattern mining


 searches for frequent subsequences in a sequence data set,
where a sequence records an ordering of events.
 For example, we can study the order in which items are
frequently purchased.
 For instance, customers may tend to first buy a PC, followed by a
digital camera, and then a memory card.

Mining Frequent Patterns
 Basic Concepts

 Frequent Itemset Mining Methods

 Which Patterns Are Interesting?—Pattern Evaluation

Methods

 Summary

Frequent Patterns: Complete Search Approach
 Complete search approach:
 List all possible itemsets: M = 2^d − 1 (for d items)
 Count the support of each itemset by scanning the database
 Eliminate itemsets that fail the min_sup
 ⇒ Computationally prohibitive!

 Reduce the number of candidates (M)


 Use pruning techniques to reduce M
 e.g. Apriori algorithm

The Downward Closure Property and Scalable Mining Methods
 The downward closure property of frequent patterns
 Any subset of a frequent itemset must be frequent

 If {pepsi, diaper, nuts} is frequent, so is {pepsi, diaper}
 i.e., every transaction having {pepsi, diaper, nuts} also contains {pepsi, diaper}

 Scalable mining methods: Three major approaches
 Apriori (Agrawal & Srikant@VLDB’94)

 Freq. pattern growth (FPgrowth—Han, Pei & Yin

@SIGMOD’00)
 Vertical data format approach (Charm—Zaki & Hsiao

@SDM’02)

Apriori
 Apriori pruning principle: If there is any itemset which is
infrequent, its superset should not be generated/tested!
(Agrawal & Srikant @VLDB’94, Mannila, et al. @ KDD’ 94)
 Apriori Property: All nonempty subsets of a frequent itemset
must also be frequent.
 The name of the algorithm is based on the fact that the
algorithm uses prior knowledge of frequent itemset
properties.
 Apriori employs an iterative approach, where k-itemsets are used to explore (k+1)-itemsets.
Apriori Property

Apriori Pruning Principle

Apriori
 Let k=1
 Scan DB once to get frequent k-itemset
 Repeat until no new frequent or candidate itemsets are
identified
 Generate length (k+1) candidate itemsets from length k frequent
itemsets
 Prune candidate itemsets containing subsets of length k that are
infrequent
 Scan DB to count the support of each candidate
 Eliminate candidates that are infrequent, leaving only those that
are frequent

The Apriori Algorithm—An Example

min_sup = 2

Database TDB:
Tid  Items
10   A, C, D
20   B, C, E
30   A, B, C, E
40   B, E

1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1: {A}:2, {B}:3, {C}:3, {E}:3

C2 (join on L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan → counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3 (join on L2, after pruning): {B,C,E}
3rd scan → L3: {B,C,E}:2
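The full loop on this TDB can be sketched in a few lines of Python (a minimal in-memory Apriori, not an optimized implementation; the join step is simplified to all size-k unions, with the prune step discarding over-generated candidates):

```python
from itertools import combinations

# Minimal Apriori reproducing the walk-through above (min_sup = 2).
db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
min_sup = 2

def frequent(cands):
    return {c for c in cands if sum(1 for t in db if c <= t) >= min_sup}

L = frequent(frozenset([i]) for t in db for i in t)   # L1: {A},{B},{C},{E}
result = set(L)
k = 2
while L:
    cands = {a | b for a in L for b in L if len(a | b) == k}   # join (simplified)
    cands = {c for c in cands                                  # prune step
             if all(frozenset(s) in L for s in combinations(c, k - 1))}
    L = frequent(cands)
    result |= L
    k += 1
# result == L1 u L2 u L3, including {B,C,E} with support 2
```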
The Apriori Algorithm—An Example: Pruning

The Apriori Algorithm (Pseudo-Code)

Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
Implementation of Apriori

Frequent Patterns, Associations and Correlations


 How to generate candidates?
 Step 1: self-joining Lk
 Step 2: pruning
 Example of Candidate-generation
 L3={abc, abd, acd, ace, bcd}
 Self-joining: L3*L3
 abcd from abc and abd
 acde from acd and ace
 Pruning:
 acde is removed because ade is not in L3
 C4 = {abcd}
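The join and prune steps for this example can be sketched as follows (prefix-based self-join, as in Apriori's candidate generation):

```python
from itertools import combinations

# Apriori candidate generation for the slide's example: L3 -> C4.
L3 = [frozenset(s) for s in ("abc", "abd", "acd", "ace", "bcd")]

def self_join(Lk):
    """Join two k-itemsets that agree on their first k-1 (sorted) items."""
    out = set()
    tuples = [tuple(sorted(s)) for s in Lk]
    for a in tuples:
        for b in tuples:
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                out.add(frozenset(a) | {b[-1]})
    return out          # here: {abcd, acde}

def prune(cands, Lk):
    """Drop candidates having an infrequent (k-1)-subset."""
    Lk = set(Lk)
    return {c for c in cands
            if all(frozenset(s) in Lk for s in combinations(sorted(c), len(c) - 1))}

C4 = prune(self_join(L3), L3)   # acde removed because ade is not in L3
```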
Improvement of the Apriori Method
 Major computational challenges
 Multiple scans of transaction database
 Huge number of candidates
 Tedious workload of support counting for candidates
 Improving Apriori: general ideas
 Reduce passes of transaction database scans
 Shrink number of candidates
 Facilitate support counting of candidates

Pattern-Growth Approach: Mining Frequent Patterns Without Candidate Generation

 Bottlenecks of the Apriori approach


 Breadth-first (i.e., level-wise) search
 Candidate generation and test
 Often generates a huge number of candidates
 The FPGrowth Approach (J. Han, J. Pei, and Y. Yin,
SIGMOD’ 00)
 Depth-first search
 Avoid explicit candidate generation

Construct FP-tree from a Transaction Database
TID  Items bought                 (ordered) frequent items
100  {f, a, c, d, g, i, m, p}     {f, c, a, m, p}
200  {a, b, c, f, l, m, o}        {f, c, a, b, m}
300  {b, f, h, j, o, w}           {f, b}
400  {b, c, k, s, p}              {c, b, p}
500  {a, f, c, e, l, p, m, n}     {f, c, a, m, p}

min_support = 3

1. Scan DB once, find frequent 1-itemsets (single-item patterns)
2. Sort frequent items in frequency-descending order: F-list = f-c-a-b-m-p
3. Scan DB again, construct the FP-tree

Header table (item : frequency): f:4, c:4, a:3, b:3, m:3, p:3

FP-tree root-to-leaf paths (node counts in parentheses; shared prefixes are stored once in the tree):
f(4) – c(3) – a(3) – m(2) – p(2)
f(4) – c(3) – a(3) – b(1) – m(1)
f(4) – b(1)
c(1) – b(1) – p(1)
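Steps 1–2 can be sketched directly on the slide's transactions (the F-list itself is taken from the slide, since the ordering of equally frequent items is a free choice):

```python
from collections import Counter

# FP-tree construction, steps 1-2: count items, keep frequent ones,
# and reorder each transaction by the F-list (min_support = 3).
db = [list("facdgimp"), list("abcflmo"), list("bfhjow"),
      list("bcksp"), list("afcelpmn")]
min_support = 3

counts = Counter(i for t in db for i in t)
frequent = {i for i, c in counts.items() if c >= min_support}
# frequent == {'f','c','a','b','m','p'}; f and c occur 4 times, the rest 3

flist = ["f", "c", "a", "b", "m", "p"]   # F-list as given on the slide
ordered = [[i for i in flist if i in t] for t in db]
# ordered[0] == ['f','c','a','m','p'], matching the slide's right column
```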
Partition Patterns and Databases

 Frequent patterns can be partitioned into subsets
according to f-list
 F-list = f-c-a-b-m-p
 Patterns containing p
 Patterns having m but no p
 …
 Patterns having c but no a nor b, m, p
 Pattern f
 Completeness and non-redundancy

Find Patterns Having P From P-conditional Database
 Starting at the frequent item header table in the FP-tree
 Traverse the FP-tree by following the link of each frequent item p
 Accumulate all the transformed prefix paths of item p to form p's conditional pattern base

Conditional pattern bases (item : conditional pattern base):
c : f:3
a : fc:3
b : fca:1, f:1, c:1
m : fca:2, fcab:1
p : fcam:2, cb:1
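The table above can be derived mechanically from the FP-tree's root-to-leaf paths (a sketch; the paths and counts below are transcribed from the tree built earlier):

```python
# Deriving conditional pattern bases from FP-tree paths.
# Each entry: (root-to-leaf path, number of transactions following it).
paths = [
    (["f", "c", "a", "m", "p"], 2),
    (["f", "c", "a", "b", "m"], 1),
    (["f", "b"], 1),
    (["c", "b", "p"], 1),
]

def conditional_pattern_base(item):
    """Prefix paths ending just before `item`, with their path counts."""
    base = []
    for path, count in paths:
        if item in path:
            prefix = path[:path.index(item)]
            if prefix:
                base.append((prefix, count))
    return base

# conditional_pattern_base("m") -> [(['f','c','a'], 2), (['f','c','a','b'], 1)]
# conditional_pattern_base("b") -> [(['f','c','a'], 1), (['f'], 1), (['c'], 1)]
```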
From Conditional Pattern-bases to Conditional FP-trees

 For each pattern base:
 Accumulate the count for each item in the base
 Construct the FP-tree for the frequent items of the pattern base

m-conditional pattern base: fca:2, fcab:1
m-conditional FP-tree: {} → f:3 → c:3 → a:3
All frequent patterns relating to m: m, fm, cm, am, fcm, fam, cam, fcam
Recursion: Mining Each Conditional FP-tree

m-conditional FP-tree: {} → f:3 → c:3 → a:3

Cond. pattern base of "am": (fc:3) → am-conditional FP-tree: {} → f:3 → c:3
Cond. pattern base of "cm": (f:3) → cm-conditional FP-tree: {} → f:3
Cond. pattern base of "cam": (f:3) → cam-conditional FP-tree: {} → f:3
Construct FP-tree from a Transaction Database: Another Example
Extension of Pattern Growth Mining Methodology
 Mining closed frequent itemsets and max-patterns
 CLOSET (DMKD’00), FPclose, and FPMax (Grahne & Zhu, Fimi’03)

 Mining sequential patterns


 PrefixSpan (ICDE’01), CloSpan (SDM’03), BIDE (ICDE’04)

 Mining graph patterns


 gSpan (ICDM’02), CloseGraph (KDD’03)

 Constraint-based mining of frequent patterns


 Convertible constraints (ICDE’01), gPrune (PAKDD’03)

 Computing iceberg data cubes with complex measures


 H-tree, H-cubing, and Star-cubing (SIGMOD’01, VLDB’03)

 Pattern-growth-based Clustering
 MaPle (Pei, et al., ICDM’03)

 Pattern-Growth-Based Classification
 Mining frequent and discriminative patterns (Cheng, et al, ICDE’07)

Mining Frequent Itemsets Using the Vertical Data Format
 Data can be presented in item-TID set format (i.e., {item :
TID set})
 This is known as the vertical data format.

Mining Frequent Itemsets Using the Vertical Data Format
 Vertical format: t(AB) = {T11, T25, …}
 tid-list: list of trans.-ids containing an itemset
 Deriving frequent patterns based on vertical intersections
 t(X) = t(Y): X and Y always happen together
 t(X) ⊂ t(Y): transaction having X always has Y
 Using diffset to accelerate mining
 Only keep track of differences of tids
 t(X) = {T1, T2, T3}, t(XY) = {T1, T3}
 Diffset (XY, X) = {T2}
 Eclat (Zaki et al. @KDD’97)
 Mining Closed patterns using vertical format: CHARM (Zaki &
Hsiao@SDM’02)
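Tid-list intersection and the diffset trick can be sketched as follows (toy tid-lists, not tied to a specific table in the slides):

```python
# Vertical-format mining: support via tid-list intersection, plus diffsets.
t = {
    "A": {1, 2, 3, 5},
    "B": {1, 3, 4, 5},
}
t_AB = t["A"] & t["B"]   # tid-list of itemset AB by intersection: {1, 3, 5}
sup_AB = len(t_AB)       # support count of AB = 3

diffset = t["A"] - t_AB  # diffset(AB, A): tids having A but not AB = {2}
# support(AB) = support(A) - |diffset|, so only the (often much smaller)
# diffset needs to be stored instead of the full tid-list.
```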

Mining Frequent Itemsets Using the Vertical Data Format
 Mining can be performed on this
data set by intersecting the TID
sets of every pair of frequent
single items.
 The minimum support count is 2.
 The itemsets {I1, I4} and {I3, I5} each contain only one transaction, so they do not belong to the set of frequent 2-itemsets.
 Based on the Apriori property, a given 3-itemset is a candidate 3-itemset only if every one of its 2-itemset subsets is frequent.

Mining Frequent Closed Patterns: CLOSET
 Flist: list of all frequent items in support-ascending order
 Flist = d-a-f-e-c, min_sup = 2

TID  Items
10   a, c, d, e, f
20   a, b, e
30   c, e, f
40   a, c, d, f
50   c, e, f

 Divide search space:
 Patterns having d
 Patterns having d but no a, etc.
 Find frequent closed patterns recursively
 Every transaction having d also has cfa ⇒ cfad is a frequent closed pattern
 J. Pei, J. Han & R. Mao. “CLOSET: An Efficient Algorithm for
Mining Frequent Closed Itemsets", DMKD'00.

MaxMiner: Mining Max-Patterns

Tid  Items
10   A, B, C, D, E
20   B, C, D, E
30   A, C, D, F

 1st scan: find frequent items: A, B, C, D, E
 2nd scan: find support for AB, AC, AD, AE, ABCDE; BC, BD, BE, BCDE; CD, CE, CDE, DE (potential max-patterns)
 Since BCDE is a max-pattern, no need to check BCD, BDE, CDE in a later scan
 R. Bayardo. Efficiently mining long patterns from databases. SIGMOD'98
Mining Frequent Patterns

 Basic Concepts

 Frequent Itemset Mining Methods

 Which Patterns Are Interesting?—Pattern Evaluation

Methods

 Summary

Interestingness Measure: Correlations (Lift)

 play basketball ⇒ eat cereal [40%, 66.7%] is misleading
 The overall % of students eating cereal is 75% > 66.7%
 play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate, although with lower support and confidence
 Measure of dependent/correlated events: lift

lift = P(A ∪ B) / (P(A) × P(B))

            Basketball   Not basketball   Sum (row)
Cereal      2000         1750             3750
Not cereal  1000         250              1250
Sum (col.)  3000         2000             5000

lift(B, C)  = (2000/5000) / ((3000/5000) × (3750/5000)) = 0.89
lift(B, ¬C) = (1000/5000) / ((3000/5000) × (1250/5000)) = 1.33
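The two lift values follow directly from the contingency table:

```python
# Lift for the basketball/cereal contingency table (5000 students).
n = 5000
n_basketball, n_cereal, n_both = 3000, 3750, 2000
n_basketball_no_cereal = 1000

def lift(n_xy, n_x, n_y):
    """P(X and Y) / (P(X) * P(Y)); 1 means independence."""
    return (n_xy / n) / ((n_x / n) * (n_y / n))

l_cereal = lift(n_both, n_basketball, n_cereal)
# ~0.89 (< 1): basketball and cereal are negatively correlated
l_no_cereal = lift(n_basketball_no_cereal, n_basketball, n - n_cereal)
# ~1.33 (> 1): basketball and "not cereal" are positively correlated
```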

Interestingness Measure

 Over 20 interestingness measures have been proposed (see Tan, Kumar, Srivastava @KDD'02)
 Which are good ones?

Home Work
 (Implementation project) Using a programming
language that you are familiar with, such as C++ or Java,
implement three frequent itemset mining algorithms (1)
Apriori [AS94b], (2) FP-growth [HPY00], and (3) Eclat
[Zak00] (mining using the vertical data format). Compare
the performance of each algorithm with various kinds of
large data sets. Write a report to analyze the situations
(e.g., data size, data distribution, minimal support
threshold setting, and pattern density) where one
algorithm may perform better than the others, and state
why.

Home Work
 (Implementation project) The DBLP data set (www.informatik.uni-
trier.de/ley/db/) consists of over one million entries of research
papers published in computer science conferences and journals.
Among these entries, there are a good number of authors that have
coauthor relationships.
 (a) Propose a method to efficiently mine a set of coauthor
relationships that are closely correlated (e.g., often coauthoring
papers together).
 (b) Based on the mining results and the pattern evaluation
measures, discuss which measure may convincingly uncover close
collaboration patterns better than others.
 (c) Based on the study in (a), develop a method that can roughly predict advisor and advisee relationships and the approximate period for such advisory supervision.

The End
Thanks
