Approaches:
Data cube approach (OLAP approach)
Attribute-oriented induction approach
Characterization: Data Cube Approach
(without using AO-Induction)
Perform computations and store results in data cubes
Strengths
An efficient implementation of data generalization
Computation of various kinds of measures,
e.g., count(), sum(), average(), max()
Generalization and specialization can be performed on a data cube
by roll-up and drill-down (a minimal roll-up sketch follows this list)
Limitations
It handles only dimensions of simple non-numeric data and measures of
simple aggregated numeric values
It lacks intelligent analysis: it cannot tell which dimensions should be
used or what levels the generalization should reach
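Following up on the roll-up point above, here is a minimal Python sketch of roll-up on a toy cube. It assumes the cube is stored as a dict keyed by (city, major) with a count measure; the names and the city-to-country hierarchy are illustrative, not part of the original slide.

from collections import defaultdict

cube = {
    ("Vancouver", "CS"): 5, ("Richmond", "CS"): 3,
    ("Vancouver", "Physics"): 2, ("Seattle", "Physics"): 4,
}

# Hypothetical concept hierarchy: city -> country.
city_to_country = {"Vancouver": "Canada", "Richmond": "Canada",
                   "Seattle": "USA"}

def roll_up(cube, hierarchy):
    """Roll the city dimension up to country by summing the count measure."""
    rolled = defaultdict(int)
    for (city, major), count in cube.items():
        rolled[(hierarchy[city], major)] += count
    return dict(rolled)

print(roll_up(cube, city_to_country))
# {('Canada', 'CS'): 8, ('Canada', 'Physics'): 2, ('USA', 'Physics'): 4}

Drill-down is the inverse step: navigating from the country level back to the stored city-level cells.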
Attribute-Oriented Induction
Not confined to categorical data or particular measures.
How is it done?
Collect the task-relevant data (the initial relation) using a
relational database query
Perform generalization by attribute removal or attribute
generalization.
Apply aggregation by merging identical, generalized tuples
and accumulating their respective counts.
Interactive presentation with users.
Basic Principles of Attribute-Oriented
Induction
Data focusing: task-relevant data, including dimensions, and the result
is the initial relation.
Attribute-removal: remove attribute A if there is a large set of distinct
values for A but (1) there is no generalization operator on A, or (2) A’s
higher level concepts are expressed in terms of other attributes.
Attribute-generalization: If there is a large set of distinct values for A,
and there exists a set of generalization operators on A, then select an
operator and generalize A.
Attribute-threshold control: typically 2-8 distinct values; user-specified or default.
Generalized relation threshold control: control the final relation/rule
size.
Basic Algorithm for Attribute-Oriented
Induction
InitialRel: Query processing of task-relevant data, deriving the initial
relation.
PreGen: Based on the analysis of the number of distinct values in each
attribute, determine generalization plan for each attribute: removal? or
how high to generalize?
PrimeGen: Based on the PreGen plan, perform generalization to the
right level to derive a “prime generalized relation”, accumulating the
counts.
Presentation: User interaction: (1) adjust levels by drilling, (2) pivoting,
(3) mapping into rules, cross tabs, visualization presentations.
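The InitialRel-PreGen-PrimeGen pipeline above can be rendered in a few lines of Python. This is a minimal sketch, assuming tuples are dicts and each generalizable attribute comes with a single-level generalization function; real AOI climbs a multi-level concept hierarchy until the attribute threshold is satisfied. All names are illustrative.

from collections import Counter

def aoi(tuples, generalizers, attr_threshold=8):
    """PreGen: decide per attribute to retain, generalize, or remove;
    PrimeGen: merge identical generalized tuples, accumulating counts."""
    attrs = list(tuples[0])
    plan = {}
    for a in attrs:
        if len({t[a] for t in tuples}) <= attr_threshold:
            plan[a] = lambda v: v           # few distinct values: retain
        elif a in generalizers:
            plan[a] = generalizers[a]       # climb one level of the hierarchy
        else:
            plan[a] = None                  # no generalization operator: remove
    generalized = Counter()
    for t in tuples:
        key = tuple((a, plan[a](t[a])) for a in attrs if plan[a] is not None)
        generalized[key] += 1               # merge identical generalized tuples
    return generalized

students = [{"major": "CS", "gpa": 3.67}, {"major": "CS", "gpa": 3.70},
            {"major": "Physics", "gpa": 3.83}]
gens = {"gpa": lambda g: "excellent" if g >= 3.75 else "very-good"}
print(aoi(students, gens, attr_threshold=2))
# (('major', 'CS'), ('gpa', 'very-good')) -> 2
# (('major', 'Physics'), ('gpa', 'excellent')) -> 1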
Example
DMQL: Describe general characteristics of graduate students in
the Big-University database
use Big_University_DB
mine characteristics as “Science_Students”
in relevance to name, gender, major, birth_place,
birth_date, residence, phone#, gpa
from student
where status in “graduate”
Corresponding SQL statement:
Select name, gender, major, birth_place, birth_date,
residence, phone#, gpa
from student
where status in ('Msc', 'MBA', 'PhD')
Class Characterization: An Example

Initial relation:

Name           | Gender | Major   | Birth-Place           | Birth_date | Residence                | Phone #  | GPA
Jim Woodman    | M      | CS      | Vancouver, BC, Canada | 8-12-76    | 3511 Main St., Richmond  | 687-4598 | 3.67
Scott Lachance | M      | CS      | Montreal, Que, Canada | 28-7-75    | 345 1st Ave., Richmond   | 253-9106 | 3.70
Laura Lee      | F      | Physics | Seattle, WA, USA      | 25-8-70    | 125 Austin Ave., Burnaby | 420-5232 | 3.83
...            | ...    | ...     | ...                   | ...        | ...                      | ...      | ...

Generalization plan: Name removed; Gender retained; Major generalized to
{Sci, Eng, Bus}; Birth-Place to country; Birth_date to age range;
Residence to city; Phone # removed; GPA generalized to {Excl, VG, ...}.

Prime generalized relation:

Gender | Major   | Birth_region | Age_range | Residence | GPA       | Count
M      | Science | Canada       | 20-25     | Richmond  | Very-good | 16
F      | Science | Foreign      | 25-30     | Burnaby   | Excellent | 22
...    | ...     | ...          | ...       | ...       | ...       | ...

Crosstab of count by Gender and Birth_region:

Gender | Canada | Foreign | Total
M      | 16     | 14      | 30
F      | 10     | 22      | 32
Total  | 26     | 36      | 62
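As a concrete illustration of mapping a generalized relation into the crosstab above, here is a minimal Python sketch; the (gender, region, count) tuple layout is an assumption of the sketch, not part of the original example.

from collections import defaultdict

rows = [("M", "Canada", 16), ("M", "Foreign", 14),
        ("F", "Canada", 10), ("F", "Foreign", 22)]

crosstab = defaultdict(lambda: defaultdict(int))
for gender, region, count in rows:
    crosstab[gender][region] += count        # cell value
    crosstab[gender]["Total"] += count       # row margin
    crosstab["Total"][region] += count       # column margin
    crosstab["Total"]["Total"] += count      # grand total

for g in ("M", "F", "Total"):
    print(g, {r: crosstab[g][r] for r in ("Canada", "Foreign", "Total")})
# M {'Canada': 16, 'Foreign': 14, 'Total': 30}
# F {'Canada': 10, 'Foreign': 22, 'Total': 32}
# Total {'Canada': 26, 'Foreign': 36, 'Total': 62}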
Presentation of Generalized Results
Generalized relation:
Relations where some or all attributes are generalized, with counts
or other aggregation values accumulated.
Cross tabulation:
Mapping results into cross tabulation form (similar to contingency
tables).
Visualization techniques:
Pie charts, bar charts, curves, cubes, and other visual forms.
Quantitative characteristic rules:
Mapping generalized result into characteristic rules with quantitative
information associated with it, e.g.,
∀x, grad(x) ∧ male(x) ⇒
    birth_region(x) = "Canada" [t: 53%] ∨ birth_region(x) = "foreign" [t: 47%]
The t-weights follow from the crosstab above: 16/30 ≈ 53% of male graduates
were born in Canada and 14/30 ≈ 47% abroad.
Unit 5: Concept Description &
Association Rule Mining
What is concept description?
Data generalization and summarization-based
characterization
Analytical characterization: Analysis of attribute relevance
Association Mining
Market basket analysis
Finding frequent item sets
Apriori algorithm
Improved Apriori algorithm
Incremental ARM
Associative Classification
Association Mining: The Apriori Algorithm
Pseudo-code (Ck: candidate itemsets of size k; Lk: frequent itemsets of size k):
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1
        that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
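A runnable Python rendering of this loop, as a minimal sketch rather than an optimized implementation: min_support is an absolute count, candidates are generated by unioning pairs of frequent k-itemsets and pruning those with an infrequent k-subset, and the toy DB is the five-transaction example used later in these slides (transactions 300-500 reduced to their frequent items, as listed in the FP-tree example).

from collections import Counter
from itertools import combinations

def apriori(db, min_support):
    # L1: frequent single items
    counts = Counter(i for t in db for i in t)
    L = [{frozenset([i]) for i, c in counts.items() if c >= min_support}]
    k = 1
    while L[-1]:
        # generate C(k+1) from Lk: union pairs, then prune candidates
        # that contain an infrequent k-subset
        Ck1 = {a | b for a in L[-1] for b in L[-1] if len(a | b) == k + 1}
        Ck1 = {c for c in Ck1
               if all(frozenset(s) in L[-1] for s in combinations(c, k))}
        # one scan of the DB counts every candidate contained in each t
        supp = Counter()
        for t in db:
            supp.update(c for c in Ck1 if c <= frozenset(t))
        L.append({c for c in Ck1 if supp[c] >= min_support})
        k += 1
    return set().union(*L)   # the union of all Lk

db = [set("facdgimp"), set("abcflmo"), set("bf"), set("pbc"), set("pmacf")]
print(sorted(''.join(sorted(s)) for s in apriori(db, 3)))
# 18 patterns: ['a', 'ac', 'acf', 'acfm', 'acm', 'af', 'afm', 'am',
#               'b', 'c', 'cf', 'cfm', 'cm', 'cp', 'f', 'fm', 'm', 'p']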
Important Details of Apriori
How to generate candidates?
Step 1: self-joining Lk
Step 2: pruning
How to count supports of candidates?
Example of Candidate-generation
L3={abc, abd, acd, ace, bcd}
Self-joining: L3*L3
abcd from abc and abd
acde from acd and ace
Pruning:
acde is removed because ade is not in L3
C4={abcd}
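The same candidate-generation step, spelled out in Python on the slide's example: self-join L3 by merging pairs that agree on their first k-1 items, then prune candidates with an infrequent (k-1)-subset. Itemsets are kept as sorted strings purely for readability.

from itertools import combinations

L3 = {"abc", "abd", "acd", "ace", "bcd"}   # itemsets as sorted strings

# Step 1: self-joining L3 * L3 (pairs sharing the first k-1 = 2 items)
C4 = {a + b[-1] for a in L3 for b in L3
      if a[:-1] == b[:-1] and a[-1] < b[-1]}
print(C4)          # {'abcd', 'acde'}

# Step 2: pruning -- every 3-subset of a candidate must itself be in L3
C4 = {c for c in C4
      if all(''.join(s) in L3 for s in combinations(c, 3))}
print(C4)          # {'abcd'}  ('acde' pruned because 'ade' is not in L3)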
Challenges of Frequent Pattern Mining
Challenges
Multiple scans of the transaction database and a potentially huge
number of candidate itemsets
Outline
Frequent Pattern Mining: Problem statement and an
example
Review of Apriori-like Approaches
FP-Growth:
Overview
FP-tree:
structure, construction and advantages
FP-growth:
FP-tree → conditional pattern bases → conditional FP-trees →
frequent patterns
Experiments
Discussion:
Improvement of FP-growth
Concluding Remarks
Frequent Pattern Mining: An
Example
Given a transaction database DB and a minimum support threshold ξ,
find all frequent patterns (item sets) with support no less than ξ.
Minimum support: ξ = 3
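The support notion can be stated directly in code. A minimal sketch, reusing the toy DB from the Apriori sketch above (transactions 300-500 reduced to their frequent items):

def support(itemset, db):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in db if set(itemset) <= set(t))

db = [set("facdgimp"), set("abcflmo"), set("bf"), set("pbc"), set("pmacf")]
print(support("fca", db))   # 3 -> {f, c, a} is frequent at xi = 3
print(support("fp", db))    # 2 -> {f, p} is not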
Apriori: Main Steps

Candidate generation: use frequent (k-1)-itemsets (Lk-1) to generate
candidate k-itemsets Ck.
Candidate test: scan the database and count each pattern in Ck to get the
frequent k-itemsets Lk.

E.g., for the transactions

TID | Items bought
100 | {f, a, c, d, g, i, m, p}
200 | {a, b, c, f, l, m, o}
... | ...

the first Apriori iteration gives
C1 = {f, a, c, d, g, i, m, p, l, o, h, j, k, s, b, e, n}
L1 = {f, a, c, m, b, p}
Performance Bottlenecks of Apriori
The bottleneck is candidate generation and test: the candidate sets can be
huge, and every pass requires another scan of the whole database.
Overview of FP-Growth: Ideas
Compress a large database into a compact Frequent-Pattern tree
(FP-tree) structure
highly compacted, but complete for frequent pattern mining
avoid costly repeated database scans
FP-tree:
Construction and Design
Construct FP-tree
Two Steps:
1. Scan the transaction DB once; find the frequent items (single-item
patterns) and order them into a list L in descending frequency order,
e.g., L = {f:4, c:4, a:3, b:3, m:3, p:3}
(in the format item-name:support).
2. Scan the DB a second time; order the frequent items of each transaction
according to L, and construct the FP-tree by inserting each
frequency-ordered transaction into it.
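A minimal Python sketch of these two steps, with node fields matching the FP-tree definition given a few slides below (item-name, count, node-link) plus parent/children pointers; all names are illustrative.

from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 1, parent
        self.children = {}      # item -> child FPNode
        self.link = None        # next node carrying the same item-name

def build_fptree(db, min_support):
    # Step 1: first scan -- frequent items, ordered by descending count
    counts = Counter(i for t in db for i in t)
    L = [i for i, c in counts.most_common() if c >= min_support]
    rank = {i: r for r, i in enumerate(L)}
    # Step 2: second scan -- insert each frequency-ordered transaction
    root = FPNode(None, None)
    header = {i: None for i in L}   # item -> head of its node-link chain
    for t in db:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            child = node.children.get(item)
            if child:
                child.count += 1    # shared prefix: just bump the count
            else:
                child = FPNode(item, node)
                node.children[item] = child
                # thread the node-link (here the new node becomes the chain
                # head; the definition's header points at the first node,
                # which makes no difference for mining)
                child.link, header[item] = header[item], child
            node = child
    return root, header, L

db = [set("facdgimp"), set("abcflmo"), set("bf"), set("pbc"), set("pmacf")]
root, header, L = build_fptree(db, min_support=3)
print(L)   # e.g. ['f', 'c', 'a', 'b', 'm', 'p'] -- items with equal counts
           # may tie-break differently from the slide's L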
FP-tree Example: step 1
Step 1: Scan the DB once, find the frequent 1-itemsets
FP-tree Example: step 2
Step 2: scan the DB for the second time, order frequent items
in each transaction
FP-tree Example: step 2
Step 2: construct FP-tree

[Figure: inserting the first two ordered transactions. {f, c, a, m, p}
creates the path f:1 → c:1 → a:1 → m:1 → p:1 under the root {};
{f, c, a, b, m} then increments the shared prefix to f:2 → c:2 → a:2
and branches into b:1 → m:1.]
Construction Example
Final FP-tree (header table: each of f, c, a, b, m, p points to the head
of its node-link chain):

{}
├── f:4
│   ├── c:3
│   │   └── a:3
│   │       ├── m:2
│   │       │   └── p:2
│   │       └── b:1
│   │           └── m:1
│   └── b:1
└── c:1
    └── b:1
        └── p:1
FP-Tree Definition
An FP-tree is a frequent-pattern tree. Formally, an FP-tree is a tree
structure defined as follows:
1. One root labeled "null", a set of item-prefix subtrees as the children
of the root, and a frequent-item header table.
2. Each node in the item-prefix subtrees has three fields:
item-name: registers which item this node represents;
count: the number of transactions represented by the portion of the path
reaching this node;
node-link: links to the next node in the FP-tree carrying the same
item-name, or null if there is none.
3. Each entry in the frequent-item header table has two fields:
item-name, and
head of node-link, which points to the first node in the FP-tree carrying
the item-name.
Advantages of the FP-tree Structure
The most significant advantage of the FP-tree
Scan the DB twice, and only twice.
Completeness:
the FP-tree contains all the information related to mining frequent
patterns (given the min-support threshold). Why?
Compactness:
The size of the tree is bounded by the occurrences of frequent items
Questions?
Why descending order?
[Figure: Example 1 — inserting the same transactions without the shared
frequency-descending order yields separate branches from the root,
f:1 → m:1 → p:1 and a:1 → p:1 → m:1, so prefixes are not shared.]
Questions?
Example 2: what happens with ascending frequency order?

TID | (ascending) frequent items
100 | {p, m, a, c, f}
200 | {m, b, a, c, f}
300 | {b, f}
400 | {p, b, c}
500 | {p, m, a, c, f}

[Figure: the FP-tree built from these ascending-ordered transactions;
prefixes are shared far less, so the tree has more branches and nodes
than the descending-order tree above.]
FP-growth:
Mining Frequent Patterns
Using FP-tree
Mining Frequent Patterns Using FP-tree
General idea (divide-and-conquer)
Recursively grow frequent patterns using the FP-tree:
looking for shorter ones recursively and then concatenating
the suffix:
For each frequent item, construct its conditional pattern
base, and then its conditional FP-tree;
Repeat the process on each newly created conditional FP-tree
until the resulting FP-tree is empty or contains only one path
(a single path generates all the combinations of its sub-paths,
each of which is a frequent pattern)
3 Major Steps
Starting the processing from the end of list L:
Step 1:
Construct conditional pattern base for each item in the header
table
Step 2
Construct conditional FP-tree from each conditional pattern base
Step 3
Recursively mine conditional FP-trees and grow frequent patterns
obtained so far. If the conditional FP-tree contains a single path,
simply enumerate all the patterns
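These three steps translate into a compact recursive miner, sketched below; the next slides then walk through each step on the example. The sketch assumes the FPNode/build_fptree definitions from the construction sketch earlier, and for brevity it materializes each conditional pattern base as a list of prefix paths (a path repeated once per count), which is correct but not space-efficient.

def fp_growth(db, min_support, suffix=frozenset()):
    root, header, L = build_fptree(db, min_support)
    patterns = {}
    for item in reversed(L):                 # process from the end of list L
        new_suffix = suffix | {item}
        node, base, supp = header[item], [], 0
        while node:                          # follow the node-links
            supp += node.count
            path, p = [], node.parent        # Step 1: collect the prefix path
            while p and p.item is not None:
                path.append(p.item)
                p = p.parent
            base.extend([path] * node.count) # weight the path by its count
            node = node.link
        patterns[new_suffix] = supp          # suffix ∪ {item} is frequent
        if any(base):                        # Steps 2-3: recurse on the base
            patterns.update(fp_growth(base, min_support, new_suffix))
    return patterns

db = [set("facdgimp"), set("abcflmo"), set("bf"), set("pbc"), set("pmacf")]
pats = fp_growth(db, min_support=3)
print(len(pats))                  # 18 -- same result as the Apriori sketch
print(pats[frozenset("fcam")])    # 3
print(frozenset("bm") in pats)    # False: bm is not frequent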
Step 1: Construct Conditional Pattern Base
Starting at the bottom of frequent-item header table in the FP-tree
Traverse the FP-tree by following the link of each frequent item
Accumulate all the transformed prefix paths of that item to form the
item's conditional pattern base

Conditional pattern bases from the example FP-tree:

item | conditional pattern base
p    | fcam:2, cb:1
m    | fca:2, fcab:1
b    | fca:1, f:1, c:1
a    | fc:3
c    | f:3
f    | empty
Properties of FP-Tree
Node-link property
For any frequent item ai, all the possible frequent patterns that
contain ai can be obtained by following ai's node-links, starting from
ai's head in the FP-tree header.
Prefix path property
To calculate the frequent patterns for a node ai in a path P, only the
prefix sub-path of ai in P needs to be accumulated, and its frequency
count should carry the same count as node ai.
Step 2: Construct Conditional FP-tree
For each pattern base
Accumulate the count for each item in the base
Construct the conditional FP-tree for the frequent items of the
pattern base
For m, the conditional pattern base is fca:2, fcab:1. Accumulating the
counts gives f:3, c:3, a:3, b:1; b falls below the support threshold, so
the m-conditional FP-tree is the single path

{}
└── f:3
    └── c:3
        └── a:3

(the m-conditional FP-tree)
Step 3: Recursively mine the conditional FP-tree

conditional FP-tree of "m": (fca:3)
├── add "a" → conditional FP-tree of "am": (fc:3)
│   ├── add "c" → conditional FP-tree of "cam": (f:3)
│   │   └── add "f" → frequent pattern "fcam": 3
│   └── add "f" → frequent pattern "fam": 3
├── add "c" → conditional FP-tree of "cm": (f:3)
│   └── add "f" → frequent pattern "fcm": 3
└── add "f" → frequent pattern "fm": 3

Each recursion adds one item from the conditional base; the frequent
patterns grown from m are am, cm, fm, cam, fam, fcm, and fcam, each with
support 3.
Principles of FP-Growth
Pattern growth property
Let α be a frequent itemset in DB, B be α's conditional pattern base, and
β be an itemset in B. Then α ∪ β is a frequent itemset in DB iff β is
frequent in B.
Is “fcabm ” a frequent pattern?
“fcab” is a branch of m's conditional pattern base
“b” is NOT frequent in transactions containing “fcab ”
“bm” is NOT a frequent itemset.
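The check above can be spelled out numerically; a tiny sketch using m's conditional pattern base from the earlier slide:

base = [("fca", 2), ("fcab", 1)]            # m's conditional pattern base
support_b = sum(c for items, c in base if "b" in items)
print(support_b)   # 1 < 3, so "bm" (and therefore "fcabm") is not frequent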
Conditional Pattern Bases and Conditional FP-Trees

item  conditional pattern base   conditional FP-tree
p     fcam:2, cb:1               (c:3) | p
m     fca:2, fcab:1              (f:3, c:3, a:3) | m
b     fca:1, f:1, c:1            empty
a     fc:3                       (f:3, c:3) | a
c     f:3                        (f:3) | c
f     empty                      empty
Summary of FP-Growth Algorithm
Mining frequent patterns can be viewed as first mining frequent
1-itemsets and then progressively growing each 1-itemset by mining
its conditional pattern base recursively