

Unit 5 - Concept Description & Association Rule Mining
 What is concept description?
 Data generalization and summarization-based
characterization
 Analytical characterization: Analysis of attribute relevance
 Market basket analysis
 Finding frequent item sets
 Apriori algorithm
 Improved Apriori algorithm
 Incremental ARM
 Associative Classification
Concept Description
 Descriptive vs. predictive data mining
 Descriptive mining: describes concepts or task-relevant data sets in concise, summarative, informative, discriminative forms
 Predictive mining: based on data and analysis, constructs models for the database and predicts the trend and properties of unknown data
 Concept description:
 Characterization: provides a concise and succinct
summarization of the given collection of data
 Comparison: provides descriptions comparing two or more
collections of data
Concept Description vs. OLAP
 Concept description:
 can handle complex data types of the attributes and
their aggregations
 a more automated process
 OLAP:
 restricted to a small number of dimension and measure
types
 user-controlled process
Unit 5 - Concept Description & Association Rule Mining
 What is concept description?
 Data generalization and summarization-based
characterization
 Analytical characterization: Analysis of attribute relevance
 Market basket analysis
 Finding frequent item sets
 Apriori algorithm
 Improved Apriori algorithm
 Incremental ARM
 Associative Classification
Data Generalization and Summarization-based
Characterization
 Data generalization
 A process which abstracts a large set of task-relevant data in a database from low conceptual levels to higher ones.
(Figure: conceptual levels 1 through 5, from low to high.)
 Approaches:
 Data cube approach (OLAP approach)
 Attribute-oriented induction approach
Characterization: Data Cube Approach
(without using AO-Induction)
 Perform computations and store results in data cubes
 Strength
 An efficient implementation of data generalization
 Computation of various kinds of measures
 e.g., count( ), sum( ), average( ), max( )
 Generalization and specialization can be performed on a data cube
by roll-up and drill-down
 Limitations
 handle only dimensions of simple nonnumeric data and measures of
simple aggregated numeric values.
 Lack of intelligent analysis: cannot tell which dimensions should be used or what level the generalization should reach
Attribute-Oriented Induction
 Not confined to categorical data or particular measures.
 How is it done?
 Collect the task-relevant data (initial relation) using a relational database query
 Perform generalization by attribute removal or attribute generalization
 Apply aggregation by merging identical, generalized tuples and accumulating their respective counts
 Interactive presentation with users
Basic Principles of Attribute-Oriented
Induction
 Data focusing: task-relevant data, including dimensions, and the result
is the initial relation.
 Attribute-removal: remove attribute A if there is a large set of distinct
values for A but (1) there is no generalization operator on A, or (2) A’s
higher level concepts are expressed in terms of other attributes.
 Attribute-generalization: If there is a large set of distinct values for A,
and there exists a set of generalization operators on A, then select an
operator and generalize A.
 Attribute generalization threshold control: typically 2-8 distinct values per attribute, either user-specified or default.
 Generalized relation threshold control: controls the size of the final relation/rule set.
Basic Algorithm for Attribute-Oriented
Induction
 InitialRel: Query processing of task-relevant data, deriving the initial
relation.
 PreGen: Based on the analysis of the number of distinct values in each
attribute, determine generalization plan for each attribute: removal? or
how high to generalize?
 PrimeGen: Based on the PreGen plan, perform generalization to the
right level to derive a “prime generalized relation”, accumulating the
counts.
 Presentation: User interaction: (1) adjust levels by drilling, (2) pivoting,
(3) mapping into rules, cross tabs, visualization presentations.
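As a concrete illustration of the PreGen/PrimeGen steps, here is a minimal Python sketch (the toy relation, concept hierarchies, and helper names below are assumptions for illustration, not the Big_University_DB example):

from collections import Counter

# Task-relevant initial relation: (name, major, birth_place, gpa)
initial_relation = [
    ("Jim",   "CS",      "Vancouver, Canada", 3.67),
    ("Scott", "CS",      "Montreal, Canada",  3.70),
    ("Laura", "Physics", "Seattle, USA",      3.83),
]

# Assumed generalization operators (concept hierarchies)
major_to_field = {"CS": "Science", "Physics": "Science"}

def place_to_country(place):   # city-level value -> country
    return place.split(",")[-1].strip()

def gpa_to_grade(gpa):         # numeric GPA -> categorical grade
    return "Excellent" if gpa >= 3.75 else "Very-good"

# PreGen/PrimeGen: remove 'name', generalize the other attributes, then merge
# identical generalized tuples and accumulate their counts.
prime_relation = Counter(
    (major_to_field[major], place_to_country(bp), gpa_to_grade(gpa))
    for _, major, bp, gpa in initial_relation
)
for (field, country, grade), count in prime_relation.items():
    print(field, country, grade, "count =", count)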
Example
 DMQL: Describe general characteristics of graduate students in
the Big-University database
use Big_University_DB
mine characteristics as “Science_Students”
in relevance to name, gender, major, birth_place,
birth_date, residence, phone#, gpa
from student
where status in “graduate”
 Corresponding SQL statement:
Select name, gender, major, birth_place, birth_date,
residence, phone#, gpa
from student
where status in {“Msc”, “MBA”, “PhD” }
Class Characterization: An Example

Initial relation:
Name           | Gender | Major   | Birth-Place           | Birth_date | Residence                | Phone #  | GPA
Jim Woodman    | M      | CS      | Vancouver, BC, Canada | 8-12-76    | 3511 Main St., Richmond  | 687-4598 | 3.67
Scott Lachance | M      | CS      | Montreal, Que, Canada | 28-7-75    | 345 1st Ave., Richmond   | 253-9106 | 3.70
Laura Lee      | F      | Physics | Seattle, WA, USA      | 25-8-70    | 125 Austin Ave., Burnaby | 420-5232 | 3.83
…              | …      | …       | …                     | …          | …                        | …        | …

Generalization plan: Name removed; Gender retained; Major generalized to Sci/Eng/Bus; Birth-Place to country; Birth_date to age range; Residence to city; Phone # removed; GPA to grade categories (Excellent, Very-good, …)

Prime generalized relation:
Gender | Major   | Birth_region | Age_range | Residence | GPA       | Count
M      | Science | Canada       | 20-25     | Richmond  | Very-good | 16
F      | Science | Foreign      | 25-30     | Burnaby   | Excellent | 22
…      | …       | …            | …         | …         | …         | …

Crosstab (Gender vs. Birth_region):
Gender | Canada | Foreign | Total
M      | 16     | 14      | 30
F      | 10     | 22      | 32
Total  | 26     | 36      | 62
Presentation of Generalized Results
 Generalized relation:
Relations where some or all attributes are generalized, with counts
or other aggregation values accumulated.
 Cross tabulation:
Mapping results into cross tabulation form (similar to contingency
tables).
 Visualization techniques:
Pie charts, bar charts, curves, cubes, and other visual forms.
 Quantitative characteristic rules:
Mapping generalized results into characteristic rules with quantitative information associated with them, e.g.,
grad(x) ∧ male(x) ⇒ birth_region(x) = "Canada" [t: 53%] ∨ birth_region(x) = "foreign" [t: 47%]
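The t-weights in this rule are just within-class proportions; as a small sketch, using the male-student counts from the crosstab above (16 Canada-born and 14 foreign-born):

male_counts = {"Canada": 16, "foreign": 14}
total = sum(male_counts.values())                 # 30 male graduate students
for region, count in male_counts.items():
    print(f'birth_region = "{region}": t = {count / total:.0%}')
# birth_region = "Canada": t = 53%
# birth_region = "foreign": t = 47%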
Unit 5 - Concept Description & Association Rule Mining
 What is concept description?
 Data generalization and summarization-based
characterization
 Analytical characterization: Analysis of attribute relevance
 Association Mining
 Market basket analysis
 Finding frequent item sets
 Apriori algorithm
 Improved Apriori algorithm
 Incremental ARM
 Associative Classification
Association Mining
 Association rule mining:
 Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
 Applications:
 Basket data analysis, cross-marketing, catalog design,
loss-leader analysis, clustering, classification, etc.
What Is Frequent Pattern Analysis?
 Frequent pattern: a pattern (a set of items, subsequences, substructures,
etc.) that occurs frequently in a data set
 First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of
frequent itemsets and association rule mining
 Motivation: Finding inherent regularities in data
 What products were often purchased together?— Beer and diapers?!
 What are the subsequent purchases after buying a PC?
 What kinds of DNA are sensitive to this new drug?
 Can we automatically classify web documents?
 Applications
 Basket data analysis, cross-marketing, catalog design, sale campaign
analysis, Web log (click stream) analysis, and DNA sequence analysis.
Why Is Freq. Pattern Mining Important?
 Discloses an intrinsic and important property of data sets
 Forms the foundation for many essential data mining tasks
 Association, correlation, and causality analysis
 Sequential, structural (e.g., sub-graph) patterns
 Pattern analysis in spatiotemporal, multimedia, time-series,
and stream data
 Classification: associative classification
 Cluster analysis: frequent pattern-based clustering
 Data warehousing: iceberg cube and cube-gradient
 Semantic data compression: fascicles
 Broad applications
Basic Concepts: Frequent Patterns and Association
Rules
 Itemset X = {x1, …, xk}
 Find all the rules X ⇒ Y with minimum support and confidence
 support, s: probability that a transaction contains X ∪ Y
 confidence, c: conditional probability that a transaction having X also contains Y

Transaction-id | Items bought
10 | A, B, D
20 | A, C, D
30 | A, D, E
40 | B, E, F
50 | B, C, D, E, F

(Figure: customers who buy beer, diapers, or both.)

Let supmin = 50%, confmin = 50%
Frequent patterns: {A:3, B:3, D:4, E:3, AD:3}
Association rules:
A ⇒ D (60%, 100%)
D ⇒ A (60%, 75%)
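As a quick check of the numbers above, a minimal Python sketch (the transaction list is the table on this slide; support and confidence follow the definitions given here):

transactions = [
    {"A", "B", "D"},            # 10
    {"A", "C", "D"},            # 20
    {"A", "D", "E"},            # 30
    {"B", "E", "F"},            # 40
    {"B", "C", "D", "E", "F"},  # 50
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    return support(lhs | rhs) / support(lhs)

print(support({"A", "D"}))        # 0.6  -> both rules have 60% support
print(confidence({"A"}, {"D"}))   # 1.0  -> A => D holds with 100% confidence
print(confidence({"D"}, {"A"}))   # 0.75 -> D => A holds with 75% confidence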
Closed Patterns and Max-Patterns
 A long pattern contains a combinatorial number of sub-patterns, e.g., {a1, …, a100} contains C(100,1) + C(100,2) + … + C(100,100) = 2^100 - 1 ≈ 1.27*10^30 sub-patterns!
 Solution: Mine closed patterns and max-patterns instead
 An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X (proposed by Pasquier, et al. @ ICDT'99)
 An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X (proposed by Bayardo @ SIGMOD'98)
 Closed pattern is a lossless compression of freq. patterns
 Reducing the # of patterns and rules
Closed Patterns and Max-Patterns
 Exercise. DB = {<a1, …, a100>, <a1, …, a50>}
 Min_sup = 1.
 What is the set of closed itemsets?
 <a1, …, a100>: 1
 <a1, …, a50>: 2
 What is the set of max-patterns?
 <a1, …, a100>: 1
 What is the set of all frequent patterns?
 Too many to enumerate: every non-empty subset of {a1, …, a100} is frequent, so there are 2^100 - 1 of them!
Chapter 5: Mining Frequent Patterns,
Association and Correlations
 Basic concepts and a road map
 Efficient and scalable frequent itemset mining
methods
 Mining various kinds of association rules
 From association mining to correlation analysis
 Constraint-based association mining
 Summary
Scalable Methods for Mining Frequent Patterns
 The downward closure property of frequent patterns
 Any subset of a frequent itemset must be frequent
 If {beer, diaper, nuts} is frequent, so is {beer, diaper}
 i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}
 Scalable mining methods: Two major approaches
 Apriori
 Freq. pattern growth
Apriori: A Candidate Generation-and-Test Approach
 Apriori principle: All nonempty subsets of a frequent itemset must also be frequent.
 Apriori pruning principle: If there is any itemset which is infrequent, its superset should not be generated/tested!
 Method:
 Initially, scan DB once to get frequent 1-itemset
 Generate length (k+1) candidate itemsets from length k
frequent itemsets
 Test the candidates against DB
 Terminate when no frequent or candidate set can be
generated
The Apriori Algorithm - An Example (supmin = 2)

Database TDB:
Tid | Items
10  | A, C, D
20  | B, C, E
30  | A, B, C, E
40  | B, E

1st scan -> C1 counts: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1 (frequent 1-itemsets): {A}:2, {B}:3, {C}:3, {E}:3

C2 generated from L1: {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan -> C2 counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3 generated from L2: {B,C,E}
3rd scan -> L3: {B,C,E}:2
The Apriori Algorithm
 Pseudo-code:
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
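For reference, a compact Python sketch of the same loop (our own function and variable names, not the textbook's code); run on the TDB example above with min_sup = 2, it reproduces L1, L2, and L3:

from itertools import combinations
from collections import defaultdict

def apriori(transactions, min_sup):
    transactions = [frozenset(t) for t in transactions]
    counts = defaultdict(int)                       # first scan: count 1-itemsets
    for t in transactions:
        for item in t:
            counts[frozenset([item])] += 1
    Lk = {c: n for c, n in counts.items() if n >= min_sup}
    all_frequent, k = dict(Lk), 1
    while Lk:
        prev = list(Lk)
        Ck1 = set()                                 # self-join Lk, then prune
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                if len(union) == k + 1 and all(
                        frozenset(s) in Lk for s in combinations(union, k)):
                    Ck1.add(union)
        counts = defaultdict(int)                   # scan DB, count the candidates
        for t in transactions:
            for c in Ck1:
                if c <= t:
                    counts[c] += 1
        Lk = {c: n for c, n in counts.items() if n >= min_sup}
        all_frequent.update(Lk)
        k += 1
    return all_frequent

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
for itemset, sup in sorted(apriori(tdb, 2).items(),
                           key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)      # ends with ['B', 'C', 'E'] 2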
Important Details of Apriori
 How to generate candidates?
 Step 1: self-joining Lk
 Step 2: pruning
 How to count supports of candidates?
 Example of Candidate-generation
 L3={abc, abd, acd, ace, bcd}
 Self-joining: L3*L3
 abcd from abc and abd
 acde from acd and ace
 Pruning:
 acde is removed because ade is not in L3
 C4={abcd}
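The same two steps as a small Python sketch (gen_candidates is our own helper; its join pairs any two k-itemsets whose union has k+1 items, a slight simplification of the lexicographic self-join, but after pruning it yields the same C4):

from itertools import combinations

def gen_candidates(Lk):
    Lk = {frozenset(x) for x in Lk}
    k = len(next(iter(Lk)))
    joined = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}      # self-join
    return [c for c in joined
            if all(frozenset(s) in Lk for s in combinations(c, k))]      # prune

L3 = ["abc", "abd", "acd", "ace", "bcd"]
print([''.join(sorted(c)) for c in gen_candidates(L3)])   # ['abcd']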
Challenges of Frequent Pattern Mining
 Challenges
 Multiple scans of transaction database
 Huge number of candidates
 Tedious workload of support counting for candidates
 Improving Apriori: general ideas
 Reduce passes of transaction database scans
 Shrink number of candidates
 Facilitate support counting of candidates
Methods to Improve Apriori’s Efficiency
 Hash-based itemset counting: A k-itemset whose
corresponding hashing bucket count is below the
threshold cannot be frequent.
 Transaction reduction: A transaction that does not contain
any frequent k-itemset is useless in subsequent scans.
 Partitioning: Any itemset that is potentially frequent in DB
must be frequent in at least one of the partitions of DB.
 Sampling: mining on a subset of given data, lower support
threshold + a method to determine the completeness.
 Dynamic itemset counting: add new candidate itemsets
only when all of their subsets are estimated to be frequent.
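As an illustration of the first idea, a hedged sketch of hash-based 2-itemset counting (in the spirit of DHP); the hash function and number of buckets are arbitrary choices here. Because a bucket count is an upper bound on the support of every 2-itemset hashed into it, candidates falling into light buckets can be pruned safely (collisions can only cause false positives, never false negatives):

from itertools import combinations

transactions = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
min_sup, n_buckets = 2, 7

# While doing the first scan, also hash every 2-itemset of each transaction.
buckets = [0] * n_buckets
for t in transactions:
    for pair in combinations(sorted(t), 2):
        buckets[hash(pair) % n_buckets] += 1

def may_be_frequent(pair):
    # A 2-itemset can be frequent only if its bucket count reaches min_sup.
    return buckets[hash(tuple(sorted(pair))) % n_buckets] >= min_sup

print(may_be_frequent({"B", "E"}))   # True ({B, E} actually occurs 3 times)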
Outline of the Presentation
 Frequent Pattern Mining: Problem statement and an example
 Review of Apriori-like Approaches
 FP-Growth:
 Overview
 FP-tree: structure, construction and advantages
 FP-growth: FP-tree → conditional pattern bases → conditional FP-trees → frequent patterns
 Experiments
 Discussion:
 Improvement of FP-growth
 Concluding Remarks
Frequent Pattern Mining: An Example
Given a transaction database DB and a minimum support threshold ξ, find all frequent patterns (itemsets) with support no less than ξ.

Input DB:
TID | Items bought
100 | {f, a, c, d, g, i, m, p}
200 | {a, b, c, f, l, m, o}
300 | {b, f, h, j, o}
400 | {b, c, k, s, p}
500 | {a, f, c, e, l, p, m, n}

Minimum support: ξ = 3

Output: all frequent patterns, i.e., f, a, …, fa, fac, fam, fm, am, …

Problem statement: How can all frequent patterns be found efficiently?
Apriori
 Main steps of the Apriori algorithm:
 Candidate generation: use the frequent (k-1)-itemsets (Lk-1) to generate candidate frequent k-itemsets (Ck)
 Candidate test: scan the database and count each pattern in Ck to get the frequent k-itemsets (Lk)
 E.g., on the DB above:

TID | Items bought
100 | {f, a, c, d, g, i, m, p}
200 | {a, b, c, f, l, m, o}
300 | {b, f, h, j, o}
400 | {b, c, k, s, p}
500 | {a, f, c, e, l, p, m, n}

Apriori iterations:
C1: f, a, c, d, g, i, m, p, l, o, h, j, k, s, b, e, n
L1: f, a, c, m, b, p
C2: fa, fc, fm, fp, ac, am, …, bp
L2: fa, fc, fm, …
…
Performance Bottlenecks of Apriori
 Bottlenecks of Apriori: candidate generation and test
 Huge candidate sets:
 10^4 frequent 1-itemsets will generate roughly 10^7 candidate 2-itemsets
 To discover a frequent pattern of size 100, e.g., {a1, a2, …, a100}, one needs to generate on the order of 2^100 ≈ 10^30 candidates
 Candidate tests incur multiple scans of the database: every round of candidate counting requires another full pass over the DB
Overview of FP-Growth: Ideas
 Compress a large database into a compact, Frequent-Pattern tree (FP-tree) structure
 highly compacted, but complete for frequent pattern mining
 avoid costly repeated database scans
 Develop an efficient, FP-tree-based frequent pattern mining method (FP-growth)
 A divide-and-conquer methodology: decompose mining tasks into smaller ones
 Avoid candidate generation: sub-database test only
FP-tree: Construction and Design

Construct the FP-tree in two steps:
1. Scan the transaction DB for the first time, find frequent
items (single item patterns) and order them into a list L
in frequency descending order.
e.g., L={f:4, c:4, a:3, b:3, m:3, p:3}
In the format of (item-name, support)
2. For each transaction, order its frequent items according
to the order in L; Scan DB the second time, construct FP-
tree by putting each frequency ordered transaction onto
it.
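A minimal Python sketch of these two steps (FPNode, build_fptree, and the header-table layout are our own choices, not the paper's exact data structures); node-links are kept as per-item lists in the header table:

from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}                    # item -> child FPNode

def build_fptree(transactions, min_sup):
    # Step 1: find frequent items and the frequency-descending list L
    freq = Counter(item for t in transactions for item in t)
    L = [i for i, c in freq.most_common() if c >= min_sup]
    order = {item: rank for rank, item in enumerate(L)}

    # Step 2: insert each transaction with its frequent items ordered by L
    root, header = FPNode(None, None), {item: [] for item in L}
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in order), key=order.get):
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)    # extend item's node-link list
            child.count += 1
            node = child
    return root, header, L

db = [list("facdgimp"), list("abcflmo"), list("bfhjo"), list("bcksp"), list("afcelpmn")]
root, header, L = build_fptree(db, min_sup=3)
print(L)                                  # frequency-descending item list (tie order may vary)
print(sum(n.count for n in header["m"]))  # 3: total count of m accumulated in the tree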
FP-tree Example: Step 1
Step 1: Scan the DB once, find the frequent 1-itemsets

TID | Items bought
100 | {f, a, c, d, g, i, m, p}
200 | {a, b, c, f, l, m, o}
300 | {b, f, h, j, o}
400 | {b, c, k, s, p}
500 | {a, f, c, e, l, p, m, n}

Item frequencies (minimum support 3): f:4, c:4, a:3, b:3, m:3, p:3
FP-tree Example: Step 2
Step 2: Scan the DB a second time, order the frequent items in each transaction according to L

TID | Items bought             | (ordered) frequent items
100 | {f, a, c, d, g, i, m, p} | {f, c, a, m, p}
200 | {a, b, c, f, l, m, o}    | {f, c, a, b, m}
300 | {b, f, h, j, o}          | {f, b}
400 | {b, c, k, s, p}          | {c, b, p}
500 | {a, f, c, e, l, p, m, n} | {f, c, a, m, p}
FP-tree Example: Step 2 (continued)
Step 2: construct the FP-tree by inserting each ordered transaction as a path from the root.
(Figure: the tree after inserting {f, c, a, m, p} and then {f, c, a, b, m}; the two transactions share the prefix f-c-a, whose counts become 2.)
NOTE: Each transaction corresponds to one path in the FP-tree.
(Figure: the tree after inserting {f, b}, then {c, b, p}, then the second {f, c, a, m, p}; counts along shared prefixes are incremented, and node-links connect the nodes that carry the same item.)
Construction Example: the Final FP-tree

Header table (item → head of node-links): f, c, a, b, m, p

Paths of the final FP-tree from the root {}:
f:4 - c:3 - a:3 - m:2 - p:2
f:4 - c:3 - a:3 - b:1 - m:1
f:4 - b:1
c:1 - b:1 - p:1
FP-Tree Definition
 FP-tree is a frequent pattern tree . Formally, FP-tree is a tree
structure defined below:
1. One root labeled as "null", a set of item-prefix sub-trees as the children of the root, and a frequent-item header table.
2. Each node in the item-prefix sub-trees has three fields:
 item-name: registers which item this node represents,
 count: the number of transactions represented by the portion of the path reaching this node,
 node-link: links to the next node in the FP-tree carrying the same item-name, or null if there is none.
3. Each entry in the frequent-item header table has two fields,
 item-name, and
 head of node-link that points to the first node in the FP-tree carrying
the item-name.
Advantages of the FP-tree Structure
 The most significant advantage of the FP-tree
 Scan the DB only twice, and twice only
 Completeness:
 the FP-tree contains all the information related to mining frequent patterns (given the min-support threshold). Why?
 Compactness:
 The size of the tree is bounded by the occurrences of frequent items
 The height of the tree is bounded by the maximum number of items in a transaction
Questions?
 Why descending order?
 Example 1: the two transactions below contain the same frequent items, but if they are inserted without a consistent order they form two separate paths instead of one shared path:

TID | (unordered) frequent items
100 | {f, a, c, m, p}
500 | {a, f, c, p, m}

(Figure: one branch f-a-c-m-p, each node with count 1, and a second branch a-f-c-p-m, each node with count 1, with no prefix sharing.)
Questions?
 Example 2: what if the items are ordered in ascending frequency order?

TID | (ascending) frequent items
100 | {p, m, a, c, f}
200 | {m, b, a, c, f}
300 | {b, f}
400 | {p, b, c}
500 | {p, m, a, c, f}

(Figure: the resulting tree, with branches p:3, m:2, and c:1 under the root.)

This tree is larger than the FP-tree, because in the FP-tree the more frequent items sit higher, so more prefixes are shared and fewer branches are created.
FP-growth:
Mining Frequent Patterns
Using FP-tree
Mining Frequent Patterns Using FP-tree
 General idea (divide-and-conquer)
Recursively grow frequent patterns using the FP-tree:
looking for shorter ones recursively and then concatenating
the suffix:
 For each frequent item, construct its conditional pattern
base, and then its conditional FP-tree;
 Repeat the process on each newly created conditional FP-
tree until the resulting FP-tree is empty, or it contains
only one path (single path will generate all the
combinations of its sub-paths, each of which is a frequent
pattern)
3 Major Steps
Starting the processing from the end of list L:
Step 1:
Construct conditional pattern base for each item in the header
table
Step 2
Construct conditional FP-tree from each conditional pattern base
Step 3
Recursively mine conditional FP-trees and grow frequent patterns
obtained so far. If the conditional FP-tree contains a single path,
simply enumerate all the patterns
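To keep a runnable sketch short, the Python below recurses directly on conditional pattern bases (lists of prefix item sets with counts) instead of materializing each conditional FP-tree; it follows the same divide-and-conquer recursion and, on the example DB with ξ = 3, finds all 18 frequent patterns, including fcam:

from collections import Counter

def pattern_growth(cond_base, suffix, min_sup, order, results):
    # cond_base: list of (item_set, count) pairs conditioned on `suffix`
    freq = Counter()
    for items, count in cond_base:
        for item in items:
            freq[item] += count
    for item in sorted(freq, key=order.get, reverse=True):   # from the end of L
        if freq[item] < min_sup:
            continue
        pattern = suffix | {item}
        results[frozenset(pattern)] = freq[item]
        # conditional pattern base of `pattern`: keep only the prefix items of `item`
        projected = [({i for i in items if order[i] < order[item]}, count)
                     for items, count in cond_base if item in items]
        pattern_growth(projected, pattern, min_sup, order, results)

db = [set("facdgimp"), set("abcflmo"), set("bfhjo"), set("bcksp"), set("afcelpmn")]
min_sup = 3
item_freq = Counter(i for t in db for i in t)
L = [i for i, c in item_freq.most_common() if c >= min_sup]   # frequency-descending list
order = {item: rank for rank, item in enumerate(L)}
base = [({i for i in t if i in order}, 1) for t in db]

results = {}
pattern_growth(base, set(), min_sup, order, results)
print(len(results))                    # 18 frequent patterns for this DB
print(results[frozenset("fcam")])      # 3 -> fcam is frequent, as shown on the next slides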
Step 1: Construct Conditional Pattern Base
 Starting at the bottom of frequent-item header table in the FP-tree
 Traverse the FP-tree by following the link of each frequent item
 Accumulate all of transformed prefix paths of that item to form a
conditional pattern base
(Figure: the FP-tree with node-links from the header table.)

Conditional pattern bases:
item | conditional pattern base
p    | fcam:2, cb:1
m    | fca:2, fcab:1
b    | fca:1, f:1, c:1
a    | fc:3
c    | f:3
f    | {}
Properties of FP-Tree
 Node-link property
 For any frequent item ai, all the possible frequent patterns that
contain ai can be obtained by following ai's node-links, starting from
ai's head in the FP-tree header.
 Prefix path property
 To calculate the frequent patterns for a node ai in a path P, only the
prefix sub-path of ai in P need to be accumulated, and its frequency
count should carry the same count as node ai.
Step 2: Construct Conditional FP-tree
 For each pattern base
 Accumulate the count for each item in the base
 Construct the conditional FP-tree for the frequent items of the
pattern base
(Example: m's conditional pattern base is {fca:2, fcab:1}; accumulating counts gives f:3, c:3, a:3, b:1, so after dropping the infrequent b the m-conditional FP-tree is the single path f:3 - c:3 - a:3.)
Step 3: Recursively mine the conditional FP-tree
Example: the recursion starting from m's conditional FP-tree (fca:3).
 Mining the tree for "m" (fca:3) grows the patterns am, cm, fm, with conditional FP-trees "am": (fc:3), "cm": (f:3), "fm": empty.
 Mining the tree for "am" (fc:3) grows cam and fam, with conditional FP-trees "cam": (f:3), "fam": empty.
 Mining the tree for "cm" (f:3) grows fcm; mining the tree for "cam" (f:3) grows fcam.
 All frequent patterns containing m (each with support 3): m, am, cm, fm, cam, fam, fcm, fcam.
Principles of FP-Growth
 Pattern growth property
 Let α be a frequent itemset in DB, B be α's conditional pattern base, and β be an itemset in B. Then α ∪ β is a frequent itemset in DB iff β is frequent in B.
 Is "fcabm" a frequent pattern?
 "fcab" is a branch of m's conditional pattern base
 "b" is NOT frequent in transactions containing "fcab"
 so "bm" is NOT a frequent itemset, and neither is "fcabm"
Conditional Pattern Bases and Conditional FP-Trees

Item | Conditional pattern base | Conditional FP-tree
p    | {(fcam:2), (cb:1)}       | {(c:3)}|p
m    | {(fca:2), (fcab:1)}      | {(f:3, c:3, a:3)}|m
b    | {(fca:1), (f:1), (c:1)}  | Empty
a    | {(fc:3)}                 | {(f:3, c:3)}|a
c    | {(f:3)}                  | {(f:3)}|c
f    | Empty                    | Empty

(Items are listed in reverse order of L.)
Single FP-tree Path Generation
 Suppose an FP-tree T has a single path P. The complete set of frequent
pattern of T can be generated by enumeration of all the combinations of
the sub-paths of P
Example: the m-conditional FP-tree is the single path f:3 - c:3 - a:3, so all frequent patterns concerning m are the combinations of {f, c, a} appended with m:
m, fm, cm, am, fcm, fam, cam, fcam
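This enumeration is a one-liner with itertools (the variable names below are just for illustration):

from itertools import chain, combinations

path_items = ["f", "c", "a"]
suffix = "m"
patterns = ["".join(c) + suffix
            for c in chain.from_iterable(combinations(path_items, k)
                                         for k in range(len(path_items) + 1))]
print(patterns)   # ['m', 'fm', 'cm', 'am', 'fcm', 'fam', 'cam', 'fcam']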
Summary of FP-Growth Algorithm
 Mining frequent patterns can be viewed as first mining 1-itemsets and progressively growing each 1-itemset by mining its conditional pattern base recursively
 Transform a frequent k-itemset mining problem into a sequence of k frequent 1-itemset mining problems via a set of conditional pattern bases