
CP610 Data Analysis

Mining Frequent Patterns and Associations: Advanced Methods
Midterm
• Feb 14th 16:00-17:20
• Room BA102
• 20 multiple choice questions
• 5 short-answer questions
• You can bring a one-sided handwritten sheet for reference
• Covers lectures from week 1 to week 5
Advanced Frequent Pattern Mining
• Pattern Mining in Multi-Level, Multi-Dimensional Space
• Mining Multi-Level Association
• Mining Multi-Dimensional Association
• Mining Quantitative Association Rules
• Mining Rare Patterns and Negative Patterns
• Constraint-Based Frequent Pattern Mining
• Mining High-Dimensional Data and Colossal Patterns
• Mining Compressed or Approximate Patterns
• Pattern Exploration and Application
• Summary

Multiple-Level Association Rules
• Patterns or association rules may have items or
concepts residing at high, low, or multiple
abstraction levels.
Mining Multiple-Level Association
Rules: Concept hierarchy
• Concept hierarchy for the items: a sequence of
mappings from a set of low-level concepts to a
higher-level, more general concept set.
Mining Multiple-Level Association
Rules: A support-confidence
framework
• Top-down strategy
• Counts are accumulated for the calculation of frequent
itemsets at each concept level
• Starting at concept level 1 and working downward in the
hierarchy
• For each level, any algorithm for discovering frequent
itemsets may be used, such as Apriori or its variations.
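The top-down strategy can be sketched as follows (hypothetical item names and hierarchy; the small Apriori-style routine stands in for any frequent-itemset algorithm):

```python
from itertools import combinations

# Hypothetical concept hierarchy: leaf item -> level-1 concept
hierarchy = {"2% milk": "milk", "skim milk": "milk",
             "wheat bread": "bread", "white bread": "bread"}

transactions = [
    {"2% milk", "wheat bread"},
    {"skim milk", "wheat bread"},
    {"2% milk", "white bread"},
    {"skim milk"},
]

def frequent_itemsets(transactions, min_sup):
    """Tiny Apriori-style miner: returns {itemset: support count}."""
    items = sorted({i for t in transactions for i in t})
    freq, k = {}, 1
    candidates = [frozenset([i]) for i in items]
    while candidates:
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        frequent_k = [c for c, n in counts.items() if n >= min_sup]
        freq.update({c: counts[c] for c in frequent_k})
        # join step: merge frequent k-itemsets into (k+1)-candidates
        candidates = list({a | b for a in frequent_k for b in frequent_k
                           if len(a | b) == k + 1})
        k += 1
    return freq

def mine_at_level_1(transactions, min_sup):
    """Generalize each transaction to level-1 concepts, then mine.
    Lower levels reuse the same miner on the original items."""
    lifted = [{hierarchy.get(i, i) for i in t} for t in transactions]
    return frequent_itemsets(lifted, min_sup)
```

At level 1 with min_sup = 3, {milk, bread} is frequent; descending a level (typically with a lower threshold) then examines the specialized items themselves.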
Mining Multiple-Level Association Rules: Flexible
support settings

• Flexible support settings
• Items at the lower levels are expected to have lower support
• Exploration of shared multi-level mining (Agrawal & Srikant @VLDB'95, Han & Fu @VLDB'95)

Uniform support: min_sup = 5% at Level 1 and min_sup = 5% at Level 2
Reduced support: min_sup = 5% at Level 1 and min_sup = 3% at Level 2

Example: Milk [support = 10%] at Level 1; 2% Milk [support = 6%] and Skim Milk [support = 4%] at Level 2. Under uniform support, Skim Milk (4% < 5%) is pruned; under reduced support, both 2% Milk and Skim Milk remain frequent.
Multi-level Association: Flexible Support and
Redundancy filtering
• Flexible min-support thresholds: Some items are more valuable but less
frequent
• Use non-uniform, group-based min-support
• E.g., {diamond, watch, camera}: 0.05%; {bread, milk}: 5%; …

• Redundancy Filtering: Some rules may be redundant due to “ancestor”


relationships between items
• milk ⇒ wheat bread [support = 8%, confidence = 70%]
• 2% milk ⇒ wheat bread [support = 2%, confidence = 72%]
The first rule is an ancestor of the second rule

• A rule is redundant if its support is close to the "expected" value, based on the rule's ancestor
• Suppose 1/4 of milk items are 2% milk. Does the second rule provide novel information? Its expected support is 8% × 1/4 = 2%, which equals its actual support, so it adds no new information: the rule is redundant.
Mining Multi-Dimensional Association

• Single-dimensional rules:
buys(X, "milk") ⇒ buys(X, "bread")

• Multi-dimensional rules: ≥ 2 dimensions or predicates
• Inter-dimension association rules (no repeated predicates):
age(X, "19-25") ∧ occupation(X, "student") ⇒ buys(X, "coke")
• Hybrid-dimension association rules (repeated predicates):
age(X, "19-25") ∧ buys(X, "popcorn") ⇒ buys(X, "coke")

• Search for frequent predicate sets rather than frequent itemsets
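A minimal sketch of the frequent-predicate-set search (hypothetical attribute table; each tuple becomes a set of (predicate, value) pairs so that ordinary itemset counting applies):

```python
from itertools import combinations

# Hypothetical relational data: one tuple per customer
rows = [
    {"age": "19-25", "occupation": "student", "buys": "coke"},
    {"age": "19-25", "occupation": "student", "buys": "coke"},
    {"age": "26-35", "occupation": "engineer", "buys": "coke"},
    {"age": "19-25", "occupation": "student", "buys": "chips"},
]

# Each tuple becomes a set of (predicate, value) "items"
predicate_sets = [frozenset(r.items()) for r in rows]

def frequent_predicate_sets(data, min_sup, k):
    """Count every k-subset of predicates; keep those meeting min_sup."""
    counts = {}
    for t in data:
        for combo in combinations(sorted(t), k):
            counts[combo] = counts.get(combo, 0) + 1
    return {c: n for c, n in counts.items() if n >= min_sup}
```

Here the frequent 3-predicate set {age = 19-25, occupation = student, buys = coke} (support 2) corresponds directly to an inter-dimension association rule.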

Mining Quantitative Associations

Methods to discover associations with quantitative attributes


1. Static discretization based on predefined concept hierarchies
2. Dynamic discretization based on data distribution (quantitative rules, e.g., Agrawal & Srikant @SIGMOD'96)
3. Clustering: distance-based association (e.g., Yang & Miller @SIGMOD'97)
   • One-dimensional clustering, then association
4. Statistical theory (e.g., Aumann & Lindell @KDD'99)
   Sex = female ⇒ Wage: mean = $7/hr (overall mean = $9/hr)

Negative and Rare Patterns
• Rare patterns: very low support but interesting
• E.g., buying Rolex watches
• Mining approach: set individualized or special group-based support thresholds for valuable items

• Negative patterns: itemsets X and Y are both frequent but rarely occur together
• E.g., it is unlikely that one buys bread and bagels together

Defining Negative Correlated Patterns (I)
• Definition 1 (support-based)
• If itemsets X and Y are both frequent but rarely occur together, i.e.,
  sup(X ∪ Y) < sup(X) × sup(Y),
  then X and Y are negatively correlated, and X ∪ Y is a negatively correlated pattern.
• If sup(X ∪ Y) ≪ sup(X) × sup(Y),
  then X and Y are strongly negatively correlated, and X ∪ Y is a strongly negatively correlated pattern.

• Problem: A store sold 100 packages of A and 100 packages of B, with only one transaction containing both A and B.
• When there are 200 transactions in total, we have
  s(A ∪ B) = 0.005, s(A) × s(B) = 0.25, so s(A ∪ B) < s(A) × s(B)
• When there are 10^5 transactions, we have
  s(A ∪ B) = 1/10^5, s(A) × s(B) = 1/10^3 × 1/10^3 = 1/10^6, so s(A ∪ B) > s(A) × s(B)
• Where is the problem? Null transactions: the support-based definition is not null-invariant!

Defining Negative Correlated Patterns (II)
• Definition 2 (Kulczynski measure-based): If itemsets X and Y are frequent but (P(X|Y) + P(Y|X))/2 < ε, where ε is a negative-pattern threshold, then X and Y are negatively correlated.
• Ex. For the same package problem, no matter whether there are 200 or 10^5 transactions, if ε = 0.02 we have
  (P(A|B) + P(B|A))/2 = (0.01 + 0.01)/2 = 0.01 < ε
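Both definitions can be checked numerically. In this sketch the Kulczynski value stays at 0.01 no matter how many null transactions are added, while the support-based verdict flips (the threshold ε = 0.02 is an assumed value chosen so the comparison goes through):

```python
def support_based_negative(n_total, n_a, n_b, n_ab):
    """Definition 1: s(A U B) < s(A) * s(B)."""
    return (n_ab / n_total) < (n_a / n_total) * (n_b / n_total)

def kulczynski(n_a, n_b, n_ab):
    """(P(A|B) + P(B|A)) / 2; independent of the total transaction count."""
    return (n_ab / n_b + n_ab / n_a) / 2

# 100 packages of A, 100 of B, one transaction containing both
small = support_based_negative(200, 100, 100, 1)     # True: looks negative
large = support_based_negative(10**5, 100, 100, 1)   # False: verdict flips!
kulc = kulczynski(100, 100, 1)                       # 0.01 in both scenarios
```

Definition 1 changes its answer when null transactions are added, while the Kulczynski measure stays at 0.01 < ε = 0.02, illustrating null-invariance.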

Constraints in Data Mining

• Knowledge type constraint:


• association, classification, etc.
• Data constraint — using SQL-like queries
• find product pairs sold together in stores in Chicago this year
• Dimension/level constraint
• in relevance to region, price, brand, customer category
• Interestingness constraint
• strong rules: min_support ≥ 3%, min_confidence ≥ 60%
• Rule (or pattern) constraint
• small sales (price < $10) triggers big sales (sum > $200)

Meta-Rule Guided Mining
• A meta-rule can be in a rule form with partially instantiated predicates and constants
P1(X, Y) ∧ P2(X, W) ⇒ buys(X, "iPad")
• The resulting rule derived can be
age(X, "15-25") ∧ profession(X, "student") ⇒ buys(X, "iPad")
• In general, it can be in the form
P1 ∧ P2 ∧ … ∧ Pl ⇒ Q1 ∧ Q2 ∧ … ∧ Qr
• Method to find meta-rules:
• Find frequent (l + r)-predicate sets (based on a min-support threshold)
• Push constraints deeply into the mining process when possible (see the remaining discussion of constraint-push techniques)
• Use confidence, correlation, and other measures when possible

Naïve Algorithm: Apriori + Constraint

Database D (TID : items):
100 : 1 3 4
200 : 2 3 5
300 : 1 2 3 5
400 : 2 5

Scan D → C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
L1: {1}:2, {2}:3, {3}:3, {5}:3

C2: {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
Scan D → counts: {1 2}:1, {1 3}:2, {1 5}:1, {2 3}:2, {2 5}:3, {3 5}:2
L2: {1 3}:2, {2 3}:2, {2 5}:3, {3 5}:2

C3: {2 3 5}; Scan D → L3: {2 3 5}:2
Naïve Algorithm: Apriori + Constraint

The same Apriori run as above, now with the constraint Sum{S.price} < 5: the naïve approach first mines all frequent itemsets (through L3 = {2 3 5}) and only afterwards filters out those that violate the constraint.
How can we use rule constraints
to prune the search space?
• Pruning the pattern search space
• Check candidate patterns and decide whether a pattern can be pruned

• Pruning the data search space
• Check the data set to determine whether a particular piece of data can contribute to the subsequent generation of satisfiable patterns (for a particular pattern) in the remaining mining process
Constraint-Based Frequent Pattern Mining
• Pattern space pruning constraints
• Anti-monotonic: if constraint c is violated, further mining along that branch can be terminated
• Monotonic: if c is satisfied, there is no need to check c again
• Succinct: c must be satisfied, so one can start with only the data sets satisfying c
• Convertible: c is neither monotonic nor anti-monotonic, but it can be converted into one of them if the items in a transaction are properly ordered
• Inconvertible

• Data space pruning constraints

Pattern Space Pruning with Anti-Monotonicity Constraints

• A constraint C is anti-monotone if, whenever a super-pattern satisfies C, all of its sub-patterns do too
• In other words, anti-monotonicity: if an itemset S violates the constraint, so does every superset of S
• Ex. 1: sum(S.price) ≤ v is anti-monotone
• Ex. 2: sum(S.price) ≥ v is not anti-monotone
• Ex. 3: support_count(S) ≥ v is anti-monotone: the core property used in Apriori

• Application: anti-monotone constraints can be pushed into the mining process, e.g., applied at each iteration of Apriori-style algorithms
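A sketch of pushing an anti-monotone constraint into an Apriori-style loop (hypothetical prices and transactions): candidates violating sum(S.price) ≤ v are discarded before counting, and nothing is lost because all their supersets would violate the constraint too.

```python
price = {"a": 10, "b": 20, "c": 30, "d": 40}   # hypothetical item prices

transactions = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"b", "c", "d"}]

def apriori_with_sum_constraint(transactions, min_sup, max_sum):
    """Level-wise mining with the anti-monotone constraint
    sum(S.price) <= max_sum applied at candidate-generation time."""
    items = sorted({i for t in transactions for i in t})
    freq, k = {}, 1
    candidates = [frozenset([i]) for i in items]
    while candidates:
        # push the constraint: a violating candidate can never recover
        candidates = [c for c in candidates
                      if sum(price[i] for i in c) <= max_sum]
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        frequent_k = [c for c, n in counts.items() if n >= min_sup]
        freq.update({c: counts[c] for c in frequent_k})
        candidates = list({a | b for a in frequent_k for b in frequent_k
                           if len(a | b) == k + 1})
        k += 1
    return freq
```

With min_sup = 2 and max_sum = 40, the candidate {b, c} (total price 50) is never even counted, although it is frequent, while {a, c} (total price 40) survives.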

Pattern Space Pruning with Monotonicity Constraints

• A constraint C is monotone if, once a pattern satisfies C, we do not need to check C in subsequent mining
• Alternatively, monotonicity: if an itemset S satisfies the constraint, so does every superset of S
• Ex. 1: sum(S.price) ≥ v is monotone
• Ex. 2: min(S.price) ≤ v is monotone
• Ex. 3: range(S.profit) < v is, by contrast, anti-monotone

• Application: no further testing of the constraint is needed on super-itemsets once it is satisfied

Pattern Space Pruning with Succinctness
• Succinctness:
• Given A1, the set of items satisfying a succinct constraint C, any set S satisfying C is based on A1, i.e., S contains a subset belonging to A1
• Idea: whether an itemset S satisfies constraint C can be determined from the selection of items alone, without looking at the transaction database
• min(S.price) > v is succinct
• sum(S.price) > v is not succinct

• Application: if C is succinct, then C is pre-counting pushable; there is no need to check the rule constraint iteratively during the mining process.
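A sketch of pre-counting pushing for the succinct constraint min(S.price) > v (hypothetical prices): since every item of a satisfying itemset must cost more than v, the item universe and the transactions can be filtered once, up front, and any miner then runs on the projection.

```python
price = {"milk": 3, "bread": 2, "camera": 300, "watch": 150}

transactions = [{"milk", "bread"}, {"camera", "watch"},
                {"milk", "camera", "watch"}, {"camera", "watch"}]

def project_for_min_price(transactions, v):
    """min(S.price) > v holds iff every item costs more than v,
    so restrict the data to expensive items before mining."""
    allowed = {i for i, p in price.items() if p > v}
    projected = [t & allowed for t in transactions]
    return [t for t in projected if t]      # drop emptied transactions
```

No per-candidate constraint check is needed afterwards: every itemset mined from the projection satisfies the constraint by construction.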
Constrained Apriori: Push a Succinct Constraint Deep

The same Apriori run as in the naïve-algorithm slide, now with the succinct constraint min{S.price} ≤ 1. Because the constraint is succinct, it can be pushed deep: only candidates that contain at least one item with price ≤ 1 are ever generated and counted, instead of mining all frequent itemsets first and filtering afterwards.
Constraint-Based Mining — A General Picture

Constraint | Anti-monotone | Monotone | Succinct
v ∈ S | no | yes | yes
S ⊇ V | no | yes | yes
S ⊆ V | yes | no | yes
min(S) ≤ v | no | yes | yes
min(S) ≥ v | yes | no | yes
max(S) ≤ v | yes | no | yes
max(S) ≥ v | no | yes | yes
count(S) ≤ v | yes | no | weakly
count(S) ≥ v | no | yes | weakly
sum(S) ≤ v (∀a ∈ S, a ≥ 0) | yes | no | no
sum(S) ≥ v (∀a ∈ S, a ≥ 0) | no | yes | no
range(S) ≤ v | yes | no | no
range(S) ≥ v | no | yes | no
avg(S) θ v, θ ∈ {=, ≤, ≥} | convertible | convertible | no
support(S) ≥ ξ | yes | no | no
support(S) ≤ ξ | no | yes | no
Convertible Constraints: Ordering Data in Transactions

TDB (min_sup = 2):
TID 10: a, b, c, d, f
TID 20: b, c, d, f, g, h
TID 30: a, c, d, e, f
TID 40: c, e, f, g

Item profits: a: 40, b: 0, c: -20, d: 10, e: -30, f: 30, g: 20, h: -10

• Convert tough constraints into anti-monotone or monotone constraints by properly ordering the items
• Examine C: avg(S.profit) ≥ 25
• Order items in value-descending order: <a, f, g, d, b, h, c, e>
• If an itemset afb violates C, so do afbh and every other afb* extension
• The constraint becomes anti-monotone!
Strongly Convertible Constraints

• A constraint that is both convertible monotone and convertible anti-monotone is called strongly convertible
• avg(X) ≥ 25 is convertible anti-monotone w.r.t. the item-value descending order R: <a, f, g, d, b, h, c, e>
• If an itemset af violates constraint C, so does every itemset with af as a prefix, such as afd
• avg(X) ≥ 25 is convertible monotone w.r.t. the item-value ascending order R⁻¹: <e, c, h, b, d, g, f, a>
• If an itemset d satisfies constraint C, so do itemsets df and dfa, which have d as a prefix
• Thus, avg(X) ≥ 25 is strongly convertible

(Item profits: a: 40, b: 0, c: -20, d: 10, e: -30, f: 30, g: 20, h: -10)

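The prefix behavior under the descending order R can be checked directly (profit table from the slide):

```python
profit = {"a": 40, "b": 0, "c": -20, "d": 10, "e": -30,
          "f": 30, "g": 20, "h": -10}

# Item-value descending order R = <a, f, g, d, b, h, c, e>
R = sorted(profit, key=profit.get, reverse=True)

def avg_profit(itemset):
    return sum(profit[i] for i in itemset) / len(itemset)

def prefix_satisfaction(threshold=25):
    """With itemsets grown as prefixes of R, avg(X) >= threshold is
    anti-monotone: once a prefix fails, every longer prefix fails too."""
    flags, prefix = [], []
    for item in R:
        prefix.append(item)
        flags.append(avg_profit(prefix) >= threshold)
    return flags
```

The flags come out [True, True, True, True, False, False, False, False]: <a, f, g, d> is the last prefix whose average is still ≥ 25, and the constraint never turns back on once violated.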
What Constraints Are Convertible?

Constraint | Convertible anti-monotone | Convertible monotone | Strongly convertible
avg(S) ≤ v, ≥ v | yes | yes | yes
median(S) ≤ v, ≥ v | yes | yes | yes
sum(S) ≤ v (items of any value, v ≥ 0) | yes | no | no
sum(S) ≤ v (items of any value, v ≤ 0) | no | yes | no
sum(S) ≥ v (items of any value, v ≥ 0) | no | yes | no
sum(S) ≥ v (items of any value, v ≤ 0) | yes | no | no
……
Constraint-Based Frequent Pattern Mining
• Pattern space pruning constraints
• Data space pruning constraints
• Data succinct: the data space can be pruned at the initial step of the pattern-mining process
• Ex.: patterns must contain "digital camera"
• Data anti-monotonic: if a transaction t does not satisfy c, then t can be pruned from further mining
• Ex.: sum(I.price) ≤ $100

Mining Long Patterns

• Colossal pattern: a frequent pattern of very large size (a long pattern)

• Mining long patterns is needed in bioinformatics,


social network analysis, software engineering, …
• For example, analyze microarray data that contain a large
number of genes (e.g., 10,000 to 100,000) but only a small
number of samples (e.g., 100 to 1000).

Challenges of mining long
patterns
• The curse of the "downward closure" property of frequent patterns:
• Any sub-pattern of a frequent pattern is frequent
• If {a1, a2, …, a100} is frequent, then {a1}, {a2}, …, {a100}, {a1, a2}, {a1, a3}, …, {a1, a100}, {a1, a2, a3}, … are all frequent: about 2^100 frequent itemsets in total!
• Whether searching breadth-first (e.g., Apriori) or depth-first (e.g., FP-growth), the step-by-step "small to large" growing paradigm must examine this many patterns, which leads to combinatorial explosion!
Example Dataset (min_support σ = 20):
T1  = 2 3 4 … 39 40
T2  = 1 3 4 … 39 40
⋮
T40 = 1 2 3 4 … 39
T41 = 41 42 43 … 79
T42 = 41 42 43 … 79
⋮
T60 = 41 42 43 … 79

# of mid-sized (length = 20) closed patterns: C(40, 20) ≈ 1.4 × 10^11
# of long patterns (length = 39): just one, α = {41, 42, …, 79} of size 39
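The counts in this example can be verified directly:

```python
import math

# Mid-sized closed patterns of length 20 over items {1, ..., 40}
mid_sized = math.comb(40, 20)     # 137,846,528,820, roughly 1.4e11

# Long pattern of length 39: just one, alpha = {41, ..., 79}
alpha = set(range(41, 80))
```

So a step-by-step miner would wade through on the order of 10^11 mid-sized patterns before reaching the single colossal one.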

Question: How to find it without generating an exponential number of mid-sized patterns?
Colossal Pattern Set: Small but Interesting

• It is often the case that only a small number of patterns are colossal, i.e., of large size

• Colossal patterns usually carry greater importance than patterns of small size

Mining Colossal Patterns: Philosophy

• No hope for completeness

• Jumping out of the swamp of the mid-sized results

• Striving for mining almost complete colossal patterns

Pattern-Fusion Strategy: Core Patterns

• Core patterns: intuitively, for a frequent pattern α, a sub-pattern β is a τ-core pattern of α if β shares a similar support set with α, i.e.,

  |D_α| / |D_β| ≥ τ,  0 < τ ≤ 1,

where τ is called the core ratio and D_X denotes the set of transactions containing pattern X

Example: Core Patterns
A data set with 400 transactions; let τ = 0.5.
(ab) occurs in the (abe) transactions 100 times and in the (abcef) transactions 100 times, so sup(ab) = 200 and sup(abcef)/sup(ab) = 100/200 = 0.5; hence (ab) is a 0.5-core pattern of (abcef).

Transaction (# of Ts) | Core patterns (τ = 0.5)
(abe)   (100) | (abe), (ab), (be), (ae), (e)
(bcf)   (100) | (bcf), (bc), (bf)
(acf)   (100) | (acf), (ac), (af)
(abcef) (100) | (ab), (ac), (af), (ae), (bc), (bf), (be), (ce), (fe), (e), (abc), (abf), (abe), (ace), (acf), (afe), (bcf), (bce), (bfe), (cfe), (abcf), (abce), (bcfe), (acfe), (abfe), (abcef)
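A sketch that recomputes the τ-core patterns from the grouped data above, checking a few entries of the (abcef) row:

```python
from itertools import combinations

# The 400 transactions, grouped: transaction itemset -> count
groups = {frozenset("abe"): 100, frozenset("bcf"): 100,
          frozenset("acf"): 100, frozenset("abcef"): 100}

def support(pattern):
    return sum(n for t, n in groups.items() if pattern <= t)

def core_patterns(alpha, tau=0.5):
    """All tau-core patterns beta of alpha: sup(alpha)/sup(beta) >= tau."""
    sup_a = support(alpha)
    cores = set()
    for k in range(1, len(alpha) + 1):
        for combo in combinations(sorted(alpha), k):
            beta = frozenset(combo)
            if support(beta) and sup_a / support(beta) >= tau:
                cores.add(beta)
    return cores
```

For α = (abcef): sup(ab) = 200, so (ab) is a core pattern (ratio exactly 0.5), while (c) alone, with sup(c) = 300, falls below the ratio and is excluded.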

Robustness of Core Patterns
• A colossal pattern has far more core
patterns than a small-sized pattern

• A colossal pattern can be generated


by merging a set of core patterns

• A random draw from the complete set of patterns of size c is more likely to pick a core descendant of a colossal pattern
Idea of Pattern-Fusion Algorithm
• Generate a complete set of frequent patterns up to a small
size
• Randomly pick a pattern β; β has a high probability of being a core-descendant of some colossal pattern α
• Identify all of α's core-descendants in this complete set and merge them; this generates a much larger core-descendant of α
• The set of all core patterns of α is bounded in a "ball" of diameter r(τ), where the distance between two patterns is measured on their supporting transaction sets
Pattern-Fusion: The Algorithm

• Initialization (Initial pool): Use an existing algorithm to mine all


frequent patterns up to a small size, e.g., 3
• Iteration (Iterative Pattern Fusion) to find k Colossal patterns:
• At each iteration, k seed patterns are randomly picked from
the current pattern pool
• For each seed pattern, we find all the patterns within a
bounding ball centered at the seed pattern
• All these patterns found are fused together to generate a set
of super-patterns. All the super-patterns thus generated
form a new pool for the next iteration
• Termination: support threshold
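A toy sketch of one fusion step (hypothetical patterns and supporting-transaction sets; the pattern distance here is assumed to be the Jaccard-style distance on support sets, and the seed, fixed here for illustration, would in practice be picked at random):

```python
# Hypothetical initial pool: pattern -> supporting transaction ids
support_sets = {
    frozenset("ab"): {1, 2, 3, 4},
    frozenset("ac"): {1, 2, 3},
    frozenset("bc"): {1, 2, 3, 4},
    frozenset("de"): {7, 8, 9},
    frozenset("df"): {7, 8},
}

def dist(p, q):
    """Distance between two patterns via their supporting sets."""
    a, b = support_sets[p], support_sets[q]
    return 1 - len(a & b) / len(a | b)

def fuse(seed, pool, radius):
    """Fuse the seed with every pool pattern inside its bounding ball,
    producing one (hopefully much larger) core-descendant."""
    ball = [p for p in pool if dist(seed, p) <= radius]
    return frozenset().union(*ball)
```

With seed (ab) and radius 0.5 the ball contains (ab), (ac), (bc), and fusing them yields (abc); repeating this with k random seeds per iteration builds the pool for the next round.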

Why Is Pattern-Fusion Efficient?

• A bounded-breadth pattern
tree traversal

• Ability to identify “short-cuts”


and take “leaps”

Mining Compressed Patterns by
Pattern Clustering
• Clustering is the automatic process of grouping
similar objects together
• The frequent closed patterns are clustered using a tightness measure called δ-cluster.
• A representative pattern is selected for each cluster, thereby offering a compressed version of the set of frequent patterns.
Mining Compressed Patterns:
Distance Measure
• Let P1 and P2 be two closed patterns, with supporting transaction sets T(P1) and T(P2), respectively. The pattern distance of P1 and P2 is defined as

  Pat_Dist(P1, P2) = 1 − |T(P1) ∩ T(P2)| / |T(P1) ∪ T(P2)|

Example: T(P1) = {t1, t2, t3, t4, t5} and T(P2) = {t1, t2, t3, t4, t6}. The distance between P1 and P2 is 1 − 4/6 = 1/3.
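The example computes directly (the pattern distance used here is the Jaccard distance between the two supporting transaction sets):

```python
def pattern_distance(t1, t2):
    """1 - |T(P1) intersect T(P2)| / |T(P1) union T(P2)|."""
    return 1 - len(t1 & t2) / len(t1 | t2)

T1 = {"t1", "t2", "t3", "t4", "t5"}
T2 = {"t1", "t2", "t3", "t4", "t6"}
d = pattern_distance(T1, T2)   # 1 - 4/6 = 1/3
```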
Mining Compressed Patterns:
Expression of patterns
• Given two patterns A and B, we say B can be expressed by A if O(B) ⊆ O(A), where O(A) is the corresponding itemset of pattern A.
• Assume patterns P1, P2, …, Pk are in the same cluster. The representative pattern Pr of the cluster should be able to express all the other patterns in the cluster.
Mining Compressed Patterns :
𝛿 –clustering
• A pattern P is δ-covered by another pattern P′ if O(P) ⊆ O(P′) and the pattern distance of P and P′ is at most δ

• A set of patterns forms a δ-cluster if there exists a representative pattern Pr such that for each pattern P in the set, P is δ-covered by Pr
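A small checker for the δ-cover condition (hypothetical patterns and supporting sets; the representative must be a superset of P and close to it in pattern distance):

```python
def is_delta_covered(p, p_rep, T, delta):
    """P is delta-covered by Pr if O(P) is a subset of O(Pr) and the
    pattern distance between them is at most delta.
    T maps a pattern (frozenset of items) to its transaction-id set."""
    if not p <= p_rep:
        return False
    d = 1 - len(T[p] & T[p_rep]) / len(T[p] | T[p_rep])
    return d <= delta

# Hypothetical closed patterns with their supporting transactions
T = {frozenset("ab"): {1, 2, 3, 4, 5},
     frozenset("abc"): {1, 2, 3, 4}}
```

Here (ab) is 0.25-covered by (abc): the itemset is contained and the distance is 1 − 4/5 = 0.2; with δ = 0.1 the cover fails.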
Pattern compression problem
• Given a transaction database, a minimum support
min_sup, and the cluster quality measure 𝛿
• The pattern compression problem is to find a set of representative patterns R such that
• for each frequent pattern P (with respect to min_sup), there is a representative pattern Pr ∈ R that δ-covers P, and
• the value of |R| is minimized.

A detailed approach: Liu, Guimei, Haojun Zhang, and Limsoon Wong. "A flexible approach to finding representative pattern sets." IEEE
Transactions on Knowledge and Data Engineering (TKDE) 26.7 (2013): 1562-1574.
How to Understand and Interpret Patterns?

• Do they all make sense?
• What do they mean?
• How are they useful?
(Figure: the {diaper, beer} pattern)

Frequent patterns by themselves carry only morphological information and simple statistics; interpreting them requires semantic information.
Not all frequent patterns are useful, only meaningful ones.
→ Annotate patterns with semantic information
A Dictionary Analogy
Word: "pattern" (from Merriam-Webster):
• Non-semantic info.
• Definitions indicating semantics
• Examples of usage
• Synonyms
• Related words
(Figure: analogous semantic annotation of the pattern {frequent, pattern})
Automatic annotation of frequent
patterns: Context Modeling
• A context unit is a basic object in a database, D,
that carries semantic information and co-occurs
with at least one frequent pattern, p, in at least one
transaction in D.
• A context unit can be an item, a pattern, or even a
transaction, depending on the specific task and data.
• The context of a pattern, p, is a selected set of
weighted context units (referred to as context
indicators) in the database
Semantic Analysis with Context Models

• Task 1: Extract the strongest context indicators
• E.g., rank by cosine similarity
• Task 2: Extract representative transactions
• Represent each transaction as a context vector, then measure similarity
• Task 3: Extract semantically similar patterns
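These similarity computations can be sketched as follows (entirely hypothetical context vectors; in practice the indicator weights come from the context-modeling step above):

```python
import math

# Hypothetical context vectors: context indicator -> weight
context = {
    "frequent pattern": {"mining": 0.9, "apriori": 0.7, "support": 0.6},
    "association rule": {"mining": 0.8, "support": 0.7, "confidence": 0.6},
    "neural network":   {"training": 0.9, "backprop": 0.8},
}

def cosine(u, v):
    """Cosine similarity between two sparse weight vectors."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v)
```

In this toy data, "frequent pattern" scores higher against "association rule" than against "neural network", so the former would be extracted as a semantically similar pattern.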


Example: Annotations Generated for Frequent
Patterns in the DBLP Data Set
Summary

• Mining patterns in multi-level, multi-dimensional space


• Mining rare and negative patterns
• Constraint-based pattern mining
• Specialized methods for mining high-dimensional data and
colossal patterns
• Mining compressed or approximate patterns
• Pattern exploration and understanding: Semantic annotation
of frequent patterns

