

Unit 5 - Concept Description & Association Rule Mining
 What is concept description?
 Data generalization and summarization-based
characterization
 Analytical characterization: Analysis of attribute relevance
 Market basket analysis
 Finding frequent item sets
 Apriori algorithm
 Improved Apriori algorithm
 Incremental ARM
 Associative Classification
Concept Description
 Descriptive vs. predictive data mining
 Descriptive mining: describes concepts or task-relevant data sets in concise, summarative, informative, discriminative forms
 Predictive mining: based on data and analysis, constructs models for the database and predicts the trend and properties of unknown data
 Concept description:
 Characterization: provides a concise and succinct
summarization of the given collection of data
 Comparison: provides descriptions comparing two or more
collections of data
Concept Description vs. OLAP
 Concept description:
 can handle complex data types of the attributes and
their aggregations
 a more automated process
 OLAP:
 restricted to a small number of dimension and measure
types
 user-controlled process
Unit 5 - Concept Description & Association Rule Mining
 What is concept description?
 Data generalization and summarization-based
characterization
 Analytical characterization: Analysis of attribute relevance
 Market basket analysis
 Finding frequent item sets
 Apriori algorithm
 Improved Apriori algorithm
 Incremental ARM
 Associative Classification
Data Generalization and Summarization-based
Characterization
 Data generalization
 A process which abstracts a large set of task-relevant data in a database from low conceptual levels to higher ones.
(Figure: conceptual levels 1 through 5, from low to high.)
 Approaches:
 Data cube approach (OLAP approach)
 Attribute-oriented induction approach
Characterization: Data Cube Approach
(without using AO-Induction)
 Perform computations and store results in data cubes
 Strength
 An efficient implementation of data generalization
 Computation of various kinds of measures
 e.g., count( ), sum( ), average( ), max( )
 Generalization and specialization can be performed on a data cube
by roll-up and drill-down
 Limitations
 handle only dimensions of simple nonnumeric data and measures of
simple aggregated numeric values.
 Lack of intelligent analysis: cannot tell which dimensions should be used or what level the generalization should reach
Attribute-Oriented Induction
 Not confined to categorical data or particular measures.
 How is it done?
 Collect the task-relevant data (initial relation) using a relational database query
 Perform generalization by attribute removal or attribute generalization
 Apply aggregation by merging identical, generalized tuples and accumulating their respective counts
 Interactive presentation with users
Basic Principles of Attribute-Oriented
Induction
 Data focusing: task-relevant data, including dimensions, and the result
is the initial relation.
 Attribute-removal: remove attribute A if there is a large set of distinct
values for A but (1) there is no generalization operator on A, or (2) A’s
higher level concepts are expressed in terms of other attributes.
 Attribute-generalization: If there is a large set of distinct values for A,
and there exists a set of generalization operators on A, then select an
operator and generalize A.
 Attribute generalization threshold control: typically 2-8 distinct values per attribute, either user-specified or default.
 Generalized relation threshold control: controls the size of the final relation/rule set.
Basic Algorithm for Attribute-Oriented
Induction
 InitialRel: Query processing of task-relevant data, deriving the initial
relation.
 PreGen: Based on the analysis of the number of distinct values in each
attribute, determine generalization plan for each attribute: removal? or
how high to generalize?
 PrimeGen: Based on the PreGen plan, perform generalization to the
right level to derive a “prime generalized relation”, accumulating the
counts.
 Presentation: User interaction: (1) adjust levels by drilling, (2) pivoting,
(3) mapping into rules, cross tabs, visualization presentations.
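As a concrete illustration of the PreGen/PrimeGen steps, here is a minimal Python sketch (the toy relation, concept hierarchies, and helper names below are assumptions for illustration, not the Big_University_DB example):

from collections import Counter

# Task-relevant initial relation: (name, major, birth_place, gpa)
initial_relation = [
    ("Jim",   "CS",      "Vancouver, Canada", 3.67),
    ("Scott", "CS",      "Montreal, Canada",  3.70),
    ("Laura", "Physics", "Seattle, USA",      3.83),
]

# Assumed generalization operators (concept hierarchies)
major_to_field = {"CS": "Science", "Physics": "Science"}

def place_to_country(place):   # city-level value -> country
    return place.split(",")[-1].strip()

def gpa_to_grade(gpa):         # numeric GPA -> categorical grade
    return "Excellent" if gpa >= 3.75 else "Very-good"

# PreGen/PrimeGen: remove 'name', generalize the other attributes, then merge
# identical generalized tuples and accumulate their counts.
prime_relation = Counter(
    (major_to_field[major], place_to_country(bp), gpa_to_grade(gpa))
    for _, major, bp, gpa in initial_relation
)
for (field, country, grade), count in prime_relation.items():
    print(field, country, grade, "count =", count)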
Example
 DMQL: Describe general characteristics of graduate students in
the Big-University database
use Big_University_DB
mine characteristics as “Science_Students”
in relevance to name, gender, major, birth_place,
birth_date, residence, phone#, gpa
from student
where status in “graduate”
 Corresponding SQL statement:
Select name, gender, major, birth_place, birth_date,
residence, phone#, gpa
from student
where status in {“Msc”, “MBA”, “PhD” }
Class Characterization: An Example

Initial relation:
Name           | Gender | Major   | Birth-Place           | Birth_date | Residence                | Phone #  | GPA
Jim Woodman    | M      | CS      | Vancouver, BC, Canada | 8-12-76    | 3511 Main St., Richmond  | 687-4598 | 3.67
Scott Lachance | M      | CS      | Montreal, Que, Canada | 28-7-75    | 345 1st Ave., Richmond   | 253-9106 | 3.70
Laura Lee      | F      | Physics | Seattle, WA, USA      | 25-8-70    | 125 Austin Ave., Burnaby | 420-5232 | 3.83
…              | …      | …       | …                     | …          | …                        | …        | …

Generalization plan: Name removed; Gender retained; Major generalized to Sci/Eng/Bus; Birth-Place to country; Birth_date to age range; Residence to city; Phone # removed; GPA to grade categories (Excellent, Very-good, …)

Prime generalized relation:
Gender | Major   | Birth_region | Age_range | Residence | GPA       | Count
M      | Science | Canada       | 20-25     | Richmond  | Very-good | 16
F      | Science | Foreign      | 25-30     | Burnaby   | Excellent | 22
…      | …       | …            | …         | …         | …         | …

Crosstab (Gender vs. Birth_region):
Gender | Canada | Foreign | Total
M      | 16     | 14      | 30
F      | 10     | 22      | 32
Total  | 26     | 36      | 62
Presentation of Generalized Results
 Generalized relation:
Relations where some or all attributes are generalized, with counts
or other aggregation values accumulated.
 Cross tabulation:
Mapping results into cross tabulation form (similar to contingency
tables).
 Visualization techniques:
Pie charts, bar charts, curves, cubes, and other visual forms.
 Quantitative characteristic rules:
Mapping generalized results into characteristic rules with quantitative information associated with them, e.g.,
grad(x) ∧ male(x) ⇒ birth_region(x) = "Canada" [t: 53%] ∨ birth_region(x) = "foreign" [t: 47%]
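The t-weights in this rule are just within-class proportions; as a small sketch, using the male-student counts from the crosstab above (16 Canada-born and 14 foreign-born):

male_counts = {"Canada": 16, "foreign": 14}
total = sum(male_counts.values())                 # 30 male graduate students
for region, count in male_counts.items():
    print(f'birth_region = "{region}": t = {count / total:.0%}')
# birth_region = "Canada": t = 53%
# birth_region = "foreign": t = 47%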
Unit 5 - Concept Description & Association Rule Mining
 What is concept description?
 Data generalization and summarization-based
characterization
 Analytical characterization: Analysis of attribute relevance
 Association Mining
 Market basket analysis
 Finding frequent item sets
 Apriori algorithm
 Improved Apriori algorithm
 Incremental ARM
 Associative Classification
Association Mining
 Association rule mining:
 Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
 Applications:
 Basket data analysis, cross-marketing, catalog design,
loss-leader analysis, clustering, classification, etc.
What Is Frequent Pattern Analysis?
 Frequent pattern: a pattern (a set of items, subsequences, substructures,
etc.) that occurs frequently in a data set
 First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of
frequent itemsets and association rule mining
 Motivation: Finding inherent regularities in data
 What products were often purchased together?— Beer and diapers?!
 What are the subsequent purchases after buying a PC?
 What kinds of DNA are sensitive to this new drug?
 Can we automatically classify web documents?
 Applications
 Basket data analysis, cross-marketing, catalog design, sale campaign
analysis, Web log (click stream) analysis, and DNA sequence analysis.
Why Is Freq. Pattern Mining Important?
 Discloses an intrinsic and important property of data sets
 Forms the foundation for many essential data mining tasks
 Association, correlation, and causality analysis
 Sequential, structural (e.g., sub-graph) patterns
 Pattern analysis in spatiotemporal, multimedia, time-series,
and stream data
 Classification: associative classification
 Cluster analysis: frequent pattern-based clustering
 Data warehousing: iceberg cube and cube-gradient
 Semantic data compression: fascicles
 Broad applications
Basic Concepts: Frequent Patterns and Association
Rules
 Itemset X = {x1, …, xk}
 Find all the rules X ⇒ Y with minimum support and confidence
 support, s: probability that a transaction contains X ∪ Y
 confidence, c: conditional probability that a transaction having X also contains Y

Transaction-id | Items bought
10 | A, B, D
20 | A, C, D
30 | A, D, E
40 | B, E, F
50 | B, C, D, E, F

(Figure: customers who buy beer, diapers, or both.)

Let supmin = 50%, confmin = 50%
Frequent patterns: {A:3, B:3, D:4, E:3, AD:3}
Association rules:
A ⇒ D (60%, 100%)
D ⇒ A (60%, 75%)
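As a quick check of the numbers above, a minimal Python sketch (the transaction list is the table on this slide; support and confidence follow the definitions given here):

transactions = [
    {"A", "B", "D"},            # 10
    {"A", "C", "D"},            # 20
    {"A", "D", "E"},            # 30
    {"B", "E", "F"},            # 40
    {"B", "C", "D", "E", "F"},  # 50
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    return support(lhs | rhs) / support(lhs)

print(support({"A", "D"}))        # 0.6  -> both rules have 60% support
print(confidence({"A"}, {"D"}))   # 1.0  -> A => D holds with 100% confidence
print(confidence({"D"}, {"A"}))   # 0.75 -> D => A holds with 75% confidence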
Closed Patterns and Max-Patterns
 A long pattern contains a combinatorial number of sub-patterns, e.g., {a1, …, a100} contains C(100,1) + C(100,2) + … + C(100,100) = 2^100 - 1 ≈ 1.27*10^30 sub-patterns!
 Solution: Mine closed patterns and max-patterns instead
 An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X (proposed by Pasquier, et al. @ ICDT'99)
 An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X (proposed by Bayardo @ SIGMOD'98)
 Closed pattern is a lossless compression of freq. patterns
 Reducing the # of patterns and rules
Closed Patterns and Max-Patterns
 Exercise. DB = {<a1, …, a100>, <a1, …, a50>}
 Min_sup = 1.
 What is the set of closed itemsets?
 <a1, …, a100>: 1
 <a1, …, a50>: 2
 What is the set of max-patterns?
 <a1, …, a100>: 1
 What is the set of all frequent patterns?
 Too many to enumerate: every non-empty subset of {a1, …, a100} is frequent, so there are 2^100 - 1 of them!
Chapter 5: Mining Frequent Patterns,
Association and Correlations
 Basic concepts and a road map
 Efficient and scalable frequent itemset mining
methods
 Mining various kinds of association rules
 From association mining to correlation analysis
 Constraint-based association mining
 Summary
Scalable Methods for Mining Frequent Patterns
 The downward closure property of frequent patterns
 Any subset of a frequent itemset must be frequent
 If {beer, diaper, nuts} is frequent, so is {beer, diaper}
 i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}
 Scalable mining methods: Two major approaches
 Apriori
 Freq. pattern growth
Apriori: A Candidate Generation-and-Test Approach
 Apriori principle: All nonempty subsets of a frequent itemset must also be frequent.
 Apriori pruning principle: If there is any itemset which is infrequent, its superset should not be generated/tested!
 Method:
 Initially, scan DB once to get frequent 1-itemset
 Generate length (k+1) candidate itemsets from length k
frequent itemsets
 Test the candidates against DB
 Terminate when no frequent or candidate set can be
generated
The Apriori Algorithm - An Example (supmin = 2)

Database TDB:
Tid | Items
10  | A, C, D
20  | B, C, E
30  | A, B, C, E
40  | B, E

1st scan -> C1 counts: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1 (frequent 1-itemsets): {A}:2, {B}:3, {C}:3, {E}:3

C2 generated from L1: {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan -> C2 counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3 generated from L2: {B,C,E}
3rd scan -> L3: {B,C,E}:2
The Apriori Algorithm
 Pseudo-code:
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
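For reference, a compact Python sketch of the same loop (our own function and variable names, not the textbook's code); run on the TDB example above with min_sup = 2, it reproduces L1, L2, and L3:

from itertools import combinations
from collections import defaultdict

def apriori(transactions, min_sup):
    transactions = [frozenset(t) for t in transactions]
    counts = defaultdict(int)                       # first scan: count 1-itemsets
    for t in transactions:
        for item in t:
            counts[frozenset([item])] += 1
    Lk = {c: n for c, n in counts.items() if n >= min_sup}
    all_frequent, k = dict(Lk), 1
    while Lk:
        prev = list(Lk)
        Ck1 = set()                                 # self-join Lk, then prune
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                if len(union) == k + 1 and all(
                        frozenset(s) in Lk for s in combinations(union, k)):
                    Ck1.add(union)
        counts = defaultdict(int)                   # scan DB, count the candidates
        for t in transactions:
            for c in Ck1:
                if c <= t:
                    counts[c] += 1
        Lk = {c: n for c, n in counts.items() if n >= min_sup}
        all_frequent.update(Lk)
        k += 1
    return all_frequent

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
for itemset, sup in sorted(apriori(tdb, 2).items(),
                           key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)      # ends with ['B', 'C', 'E'] 2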
Important Details of Apriori
 How to generate candidates?
 Step 1: self-joining Lk
 Step 2: pruning
 How to count supports of candidates?
 Example of Candidate-generation
 L3={abc, abd, acd, ace, bcd}
 Self-joining: L3*L3
 abcd from abc and abd
 acde from acd and ace
 Pruning:
 acde is removed because ade is not in L3
 C4={abcd}
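The same two steps as a small Python sketch (gen_candidates is our own helper; its join pairs any two k-itemsets whose union has k+1 items, a slight simplification of the lexicographic self-join, but after pruning it yields the same C4):

from itertools import combinations

def gen_candidates(Lk):
    Lk = {frozenset(x) for x in Lk}
    k = len(next(iter(Lk)))
    joined = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}      # self-join
    return [c for c in joined
            if all(frozenset(s) in Lk for s in combinations(c, k))]      # prune

L3 = ["abc", "abd", "acd", "ace", "bcd"]
print([''.join(sorted(c)) for c in gen_candidates(L3)])   # ['abcd']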
Challenges of Frequent Pattern Mining
 Challenges
 Multiple scans of transaction database
 Huge number of candidates
 Tedious workload of support counting for candidates
 Improving Apriori: general ideas
 Reduce passes of transaction database scans
 Shrink number of candidates
 Facilitate support counting of candidates
Methods to Improve Apriori’s Efficiency
 Hash-based itemset counting: A k-itemset whose
corresponding hashing bucket count is below the
threshold cannot be frequent.
 Transaction reduction: A transaction that does not contain
any frequent k-itemset is useless in subsequent scans.
 Partitioning: Any itemset that is potentially frequent in DB
must be frequent in at least one of the partitions of DB.
 Sampling: mining on a subset of given data, lower support
threshold + a method to determine the completeness.
 Dynamic itemset counting: add new candidate itemsets
only when all of their subsets are estimated to be frequent.
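As an illustration of the first idea, a hedged sketch of hash-based 2-itemset counting (in the spirit of DHP); the hash function and number of buckets are arbitrary choices here. Because a bucket count is an upper bound on the support of every 2-itemset hashed into it, candidates falling into light buckets can be pruned safely (collisions can only cause false positives, never false negatives):

from itertools import combinations

transactions = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
min_sup, n_buckets = 2, 7

# While doing the first scan, also hash every 2-itemset of each transaction.
buckets = [0] * n_buckets
for t in transactions:
    for pair in combinations(sorted(t), 2):
        buckets[hash(pair) % n_buckets] += 1

def may_be_frequent(pair):
    # A 2-itemset can be frequent only if its bucket count reaches min_sup.
    return buckets[hash(tuple(sorted(pair))) % n_buckets] >= min_sup

print(may_be_frequent({"B", "E"}))   # True ({B, E} actually occurs 3 times)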
Outline of the Presentation
 Frequent Pattern Mining: Problem statement and an example
 Review of Apriori-like Approaches
 FP-Growth:
 Overview
 FP-tree: structure, construction and advantages
 FP-growth: FP-tree → conditional pattern bases → conditional FP-trees → frequent patterns
 Experiments
 Discussion:
 Improvement of FP-growth
 Concluding Remarks
Frequent Pattern Mining: An Example
Given a transaction database DB and a minimum support threshold ξ, find all frequent patterns (itemsets) with support no less than ξ.

Input DB:
TID | Items bought
100 | {f, a, c, d, g, i, m, p}
200 | {a, b, c, f, l, m, o}
300 | {b, f, h, j, o}
400 | {b, c, k, s, p}
500 | {a, f, c, e, l, p, m, n}

Minimum support: ξ = 3

Output: all frequent patterns, i.e., f, a, …, fa, fac, fam, fm, am, …

Problem statement: How can all frequent patterns be found efficiently?
Apriori
 Main steps of the Apriori algorithm:
 Candidate generation: use the frequent (k-1)-itemsets (Lk-1) to generate candidate frequent k-itemsets (Ck)
 Candidate test: scan the database and count each pattern in Ck to get the frequent k-itemsets (Lk)
 E.g., on the DB above:

TID | Items bought
100 | {f, a, c, d, g, i, m, p}
200 | {a, b, c, f, l, m, o}
300 | {b, f, h, j, o}
400 | {b, c, k, s, p}
500 | {a, f, c, e, l, p, m, n}

Apriori iterations:
C1: f, a, c, d, g, i, m, p, l, o, h, j, k, s, b, e, n
L1: f, a, c, m, b, p
C2: fa, fc, fm, fp, ac, am, …, bp
L2: fa, fc, fm, …
…
Performance Bottlenecks of Apriori
 Bottlenecks of Apriori: candidate generation and test
 Huge candidate sets:
 10^4 frequent 1-itemsets will generate roughly 10^7 candidate 2-itemsets
 To discover a frequent pattern of size 100, e.g., {a1, a2, …, a100}, one needs to generate on the order of 2^100 ≈ 10^30 candidates
 Candidate tests incur multiple scans of the database: every round of candidate counting requires another full pass over the DB
Overview of FP-Growth: Ideas
 Compress a large database into a compact, Frequent-Pattern tree (FP-tree) structure
 highly compacted, but complete for frequent pattern mining
 avoid costly repeated database scans
 Develop an efficient, FP-tree-based frequent pattern mining method (FP-growth)
 A divide-and-conquer methodology: decompose mining tasks into smaller ones
 Avoid candidate generation: sub-database test only
FP-tree: Construction and Design

Construct the FP-tree in two steps:
1. Scan the transaction DB for the first time, find frequent
items (single item patterns) and order them into a list L
in frequency descending order.
e.g., L={f:4, c:4, a:3, b:3, m:3, p:3}
In the format of (item-name, support)
2. For each transaction, order its frequent items according
to the order in L; Scan DB the second time, construct FP-
tree by putting each frequency ordered transaction onto
it.
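A minimal Python sketch of these two steps (FPNode, build_fptree, and the header-table layout are our own choices, not the paper's exact data structures); node-links are kept as per-item lists in the header table:

from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}                    # item -> child FPNode

def build_fptree(transactions, min_sup):
    # Step 1: find frequent items and the frequency-descending list L
    freq = Counter(item for t in transactions for item in t)
    L = [i for i, c in freq.most_common() if c >= min_sup]
    order = {item: rank for rank, item in enumerate(L)}

    # Step 2: insert each transaction with its frequent items ordered by L
    root, header = FPNode(None, None), {item: [] for item in L}
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in order), key=order.get):
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)    # extend item's node-link list
            child.count += 1
            node = child
    return root, header, L

db = [list("facdgimp"), list("abcflmo"), list("bfhjo"), list("bcksp"), list("afcelpmn")]
root, header, L = build_fptree(db, min_sup=3)
print(L)                                  # frequency-descending item list (tie order may vary)
print(sum(n.count for n in header["m"]))  # 3: total count of m accumulated in the tree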
FP-tree Example: Step 1
Step 1: Scan the DB once, find the frequent 1-itemsets

TID | Items bought
100 | {f, a, c, d, g, i, m, p}
200 | {a, b, c, f, l, m, o}
300 | {b, f, h, j, o}
400 | {b, c, k, s, p}
500 | {a, f, c, e, l, p, m, n}

Item frequencies (minimum support 3): f:4, c:4, a:3, b:3, m:3, p:3
FP-tree Example: Step 2
Step 2: Scan the DB a second time, order the frequent items in each transaction according to L

TID | Items bought             | (ordered) frequent items
100 | {f, a, c, d, g, i, m, p} | {f, c, a, m, p}
200 | {a, b, c, f, l, m, o}    | {f, c, a, b, m}
300 | {b, f, h, j, o}          | {f, b}
400 | {b, c, k, s, p}          | {c, b, p}
500 | {a, f, c, e, l, p, m, n} | {f, c, a, m, p}
FP-tree Example: Step 2 (continued)
Step 2: construct the FP-tree by inserting each ordered transaction as a path from the root.
(Figure: the tree after inserting {f, c, a, m, p} and then {f, c, a, b, m}; the two transactions share the prefix f-c-a, whose counts become 2.)
NOTE: Each transaction corresponds to one path in the FP-tree.
(Figure: the tree after inserting {f, b}, then {c, b, p}, then the second {f, c, a, m, p}; counts along shared prefixes are incremented, and node-links connect the nodes that carry the same item.)
Construction Example: the Final FP-tree

Header table (item → head of node-links): f, c, a, b, m, p

Paths of the final FP-tree from the root {}:
f:4 - c:3 - a:3 - m:2 - p:2
f:4 - c:3 - a:3 - b:1 - m:1
f:4 - b:1
c:1 - b:1 - p:1
FP-Tree Definition
 FP-tree is a frequent pattern tree . Formally, FP-tree is a tree
structure defined below:
1. One root labeled as "null", a set of item-prefix sub-trees as the children of the root, and a frequent-item header table.
2. Each node in the item-prefix sub-trees has three fields:
 item-name: registers which item this node represents,
 count: the number of transactions represented by the portion of the path reaching this node,
 node-link: links to the next node in the FP-tree carrying the same item-name, or null if there is none.
3. Each entry in the frequent-item header table has two fields,
 item-name, and
 head of node-link that points to the first node in the FP-tree carrying
the item-name.
Advantages of the FP-tree Structure
 The most significant advantage of the FP-tree
 Scan the DB only twice, and twice only
 Completeness:
 the FP-tree contains all the information related to mining frequent patterns (given the min-support threshold). Why?
 Compactness:
 The size of the tree is bounded by the occurrences of frequent items
 The height of the tree is bounded by the maximum number of items in a transaction
Questions?
 Why descending order?
 Example 1: the two transactions below contain the same frequent items, but if they are inserted without a consistent order they form two separate paths instead of one shared path:

TID | (unordered) frequent items
100 | {f, a, c, m, p}
500 | {a, f, c, p, m}

(Figure: one branch f-a-c-m-p, each node with count 1, and a second branch a-f-c-p-m, each node with count 1, with no prefix sharing.)
Questions?
 Example 2: what if the items are ordered in ascending frequency order?

TID | (ascending) frequent items
100 | {p, m, a, c, f}
200 | {m, b, a, c, f}
300 | {b, f}
400 | {p, b, c}
500 | {p, m, a, c, f}

(Figure: the resulting tree, with branches p:3, m:2, and c:1 under the root.)

This tree is larger than the FP-tree, because in the FP-tree the more frequent items sit higher, so more prefixes are shared and fewer branches are created.
FP-growth:
Mining Frequent Patterns
Using FP-tree
Mining Frequent Patterns Using FP-tree
 General idea (divide-and-conquer)
Recursively grow frequent patterns using the FP-tree:
looking for shorter ones recursively and then concatenating
the suffix:
 For each frequent item, construct its conditional pattern
base, and then its conditional FP-tree;
 Repeat the process on each newly created conditional FP-
tree until the resulting FP-tree is empty, or it contains
only one path (single path will generate all the
combinations of its sub-paths, each of which is a frequent
pattern)
3 Major Steps
Starting the processing from the end of list L:
Step 1:
Construct conditional pattern base for each item in the header
table
Step 2
Construct conditional FP-tree from each conditional pattern base
Step 3
Recursively mine conditional FP-trees and grow frequent patterns
obtained so far. If the conditional FP-tree contains a single path,
simply enumerate all the patterns
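To keep a runnable sketch short, the Python below recurses directly on conditional pattern bases (lists of prefix item sets with counts) instead of materializing each conditional FP-tree; it follows the same divide-and-conquer recursion and, on the example DB with ξ = 3, finds all 18 frequent patterns, including fcam:

from collections import Counter

def pattern_growth(cond_base, suffix, min_sup, order, results):
    # cond_base: list of (item_set, count) pairs conditioned on `suffix`
    freq = Counter()
    for items, count in cond_base:
        for item in items:
            freq[item] += count
    for item in sorted(freq, key=order.get, reverse=True):   # from the end of L
        if freq[item] < min_sup:
            continue
        pattern = suffix | {item}
        results[frozenset(pattern)] = freq[item]
        # conditional pattern base of `pattern`: keep only the prefix items of `item`
        projected = [({i for i in items if order[i] < order[item]}, count)
                     for items, count in cond_base if item in items]
        pattern_growth(projected, pattern, min_sup, order, results)

db = [set("facdgimp"), set("abcflmo"), set("bfhjo"), set("bcksp"), set("afcelpmn")]
min_sup = 3
item_freq = Counter(i for t in db for i in t)
L = [i for i, c in item_freq.most_common() if c >= min_sup]   # frequency-descending list
order = {item: rank for rank, item in enumerate(L)}
base = [({i for i in t if i in order}, 1) for t in db]

results = {}
pattern_growth(base, set(), min_sup, order, results)
print(len(results))                    # 18 frequent patterns for this DB
print(results[frozenset("fcam")])      # 3 -> fcam is frequent, as shown on the next slides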
Step 1: Construct Conditional Pattern Base
 Starting at the bottom of frequent-item header table in the FP-tree
 Traverse the FP-tree by following the link of each frequent item
 Accumulate all of transformed prefix paths of that item to form a
conditional pattern base
(Figure: the FP-tree with node-links from the header table.)

Conditional pattern bases:
item | conditional pattern base
p    | fcam:2, cb:1
m    | fca:2, fcab:1
b    | fca:1, f:1, c:1
a    | fc:3
c    | f:3
f    | {}
Properties of FP-Tree
 Node-link property
 For any frequent item ai, all the possible frequent patterns that
contain ai can be obtained by following ai's node-links, starting from
ai's head in the FP-tree header.
 Prefix path property
 To calculate the frequent patterns for a node ai in a path P, only the
prefix sub-path of ai in P need to be accumulated, and its frequency
count should carry the same count as node ai.
Step 2: Construct Conditional FP-tree
 For each pattern base
 Accumulate the count for each item in the base
 Construct the conditional FP-tree for the frequent items of the
pattern base
(Example: m's conditional pattern base is {fca:2, fcab:1}; accumulating counts gives f:3, c:3, a:3, b:1, so after dropping the infrequent b the m-conditional FP-tree is the single path f:3 - c:3 - a:3.)
Step 3: Recursively mine the conditional FP-tree
Example: the recursion starting from m's conditional FP-tree (fca:3).
 Mining the tree for "m" (fca:3) grows the patterns am, cm, fm, with conditional FP-trees "am": (fc:3), "cm": (f:3), "fm": empty.
 Mining the tree for "am" (fc:3) grows cam and fam, with conditional FP-trees "cam": (f:3), "fam": empty.
 Mining the tree for "cm" (f:3) grows fcm; mining the tree for "cam" (f:3) grows fcam.
 All frequent patterns containing m (each with support 3): m, am, cm, fm, cam, fam, fcm, fcam.
Principles of FP-Growth
 Pattern growth property
 Let α be a frequent itemset in DB, B be α's conditional pattern base, and β be an itemset in B. Then α ∪ β is a frequent itemset in DB iff β is frequent in B.
 Is "fcabm" a frequent pattern?
 "fcab" is a branch of m's conditional pattern base
 "b" is NOT frequent in transactions containing "fcab"
 so "bm" is NOT a frequent itemset, and neither is "fcabm"
Conditional Pattern Bases and Conditional FP-Trees

Item | Conditional pattern base | Conditional FP-tree
p    | {(fcam:2), (cb:1)}       | {(c:3)}|p
m    | {(fca:2), (fcab:1)}      | {(f:3, c:3, a:3)}|m
b    | {(fca:1), (f:1), (c:1)}  | Empty
a    | {(fc:3)}                 | {(f:3, c:3)}|a
c    | {(f:3)}                  | {(f:3)}|c
f    | Empty                    | Empty

(Items are listed in reverse order of L.)
Single FP-tree Path Generation
 Suppose an FP-tree T has a single path P. The complete set of frequent
pattern of T can be generated by enumeration of all the combinations of
the sub-paths of P
Example: the m-conditional FP-tree is the single path f:3 - c:3 - a:3, so all frequent patterns concerning m are the combinations of {f, c, a} appended with m:
m, fm, cm, am, fcm, fam, cam, fcam
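This enumeration is a one-liner with itertools (the variable names below are just for illustration):

from itertools import chain, combinations

path_items = ["f", "c", "a"]
suffix = "m"
patterns = ["".join(c) + suffix
            for c in chain.from_iterable(combinations(path_items, k)
                                         for k in range(len(path_items) + 1))]
print(patterns)   # ['m', 'fm', 'cm', 'am', 'fcm', 'fam', 'cam', 'fcam']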
Summary of FP-Growth Algorithm
 Mining frequent patterns can be viewed as first mining 1-itemsets and progressively growing each 1-itemset by mining its conditional pattern base recursively
 Transform a frequent k-itemset mining problem into a sequence of k frequent 1-itemset mining problems via a set of conditional pattern bases