
PAKDD 2004 Tutorial
Advances and Issues in Frequent Pattern Mining

Osmar R. Zaïane
Mohammad El-Hajj

Database Laboratory, University of Alberta, Canada


What Is Frequent Pattern Mining?

• What is a frequent pattern?
  – A pattern (set of items, sequence, etc.) that occurs frequently in a database [AIS93]
• Frequent patterns are an important form of regularity:
  – What products were often purchased together? — beers and diapers!
  – What are the consequences of a hurricane?
  – What is the next target after buying a PC?
© Osmar R. Zaïane & Mohammad El-Hajj, 2004 — PAKDD 2004, Sydney — Tutorial: Advances in Frequent Pattern Mining

Why Study Frequent Pattern Mining?

• Frequent pattern mining is a foundation for several essential data mining tasks:
  – association, correlation, causality
  – sequential patterns
  – partial periodicity, cyclic/temporal associations
• Applications:
  – basket data analysis, cross-marketing, catalog design, loss-leader analysis
  – clustering, classification, Web log sequences, DNA analysis, etc.

General Outline

• Association Rules
• Different Frequent Patterns
• Different Lattice Traversal Approaches
• Different Transactional Layouts
• State-of-the-Art Algorithms
  – For All Frequent Patterns
  – For Frequent Closed Patterns
  – For Frequent Maximal Patterns
• Adding Constraints
• Parallel and Distributed Mining
• Visualization of Association Rules
• Frequent Sequential Pattern Mining

What Is Association Mining?

• Association rule mining searches for relationships between items in a dataset:
  – Finding association, correlation, or causal structures among sets of items or
    objects in transaction databases, relational databases, and other information
    repositories.
  – Rule form: “Body → Head [support, confidence]”.
• Examples:
  – buys(x, “bread”) → buys(x, “milk”) [0.6%, 65%]
  – major(x, “CS”) ^ takes(x, “DB”) → grade(x, “A”) [1%, 75%]

Basic Concepts

A transaction is a set of items: T = {i_a, i_b, …, i_t}, with T ⊆ I, where I is the
set of all possible items {i_1, i_2, …, i_n}. D, the task-relevant data, is a set of
transactions.

An association rule has the form P ⇒ Q, where P ⊂ I, Q ⊂ I, and P ∩ Q = ∅.
P ⇒ Q holds in D with support s, and P ⇒ Q has confidence c in the transaction
set D:

  Support(P ⇒ Q) = Probability(P ∪ Q)
  Confidence(P ⇒ Q) = Probability(Q | P)
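The two definitions above can be computed directly by counting transactions; here is a minimal sketch over a hypothetical toy database (the item names are illustrative only, not from the tutorial):

```python
def support(itemset, db):
    """support(P => Q) = Probability(P ∪ Q): the fraction of transactions
    that contain every item of the combined itemset."""
    return sum(1 for t in db if itemset <= t) / len(db)

def confidence(p, q, db):
    """confidence(P => Q) = Probability(Q | P) = support(P ∪ Q) / support(P)."""
    return support(p | q, db) / support(p, db)

# Hypothetical toy database
db = [frozenset(t) for t in ({'bread', 'milk'}, {'bread', 'milk', 'eggs'},
                             {'bread'}, {'milk'})]
print(support({'bread', 'milk'}, db))       # 0.5
print(confidence({'bread'}, {'milk'}, db))  # 0.666...
```

Note that confidence is just the rule's support renormalized by the support of its body, which is why both are computed from the same counts.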
Association Rule Mining

Association rule mining is a two-step process:
• Frequent Itemset Mining (FIM): find all itemsets meeting a support threshold
  (e.g., abc with support 2).
• Association Rules Generation: from each frequent itemset, emit the rules meeting
  a confidence threshold (e.g., ab → c, b → ac).

Frequent Itemset Generation

[Figure: the itemset lattice over items A–E, from the empty set (null) through all
2-, 3- and 4-itemsets up to ABCDE.]

• Frequent itemset generation is still computationally expensive: given d items,
  there are 2^d possible candidate itemsets.

Frequent Itemset Generation

• Brute-force approach (basic approach):
  – Each itemset in the lattice is a candidate frequent itemset
  – Count the support of each candidate by scanning the database:

      TID  Items
      1    Bread, Milk
      2    Bread, Diaper, Beer, Eggs
      3    Milk, Diaper, Beer, Coke
      4    Bread, Milk, Diaper, Beer
      5    Bread, Milk, Diaper, Coke

  – Match each of the N transactions (of width w) against every one of the M
    candidates
  – Complexity ~ O(NMw) ⇒ expensive, since M = 2^d !!!

An Influential Mining Methodology — The Apriori Algorithm

• The Apriori method:
  – Proposed by Agrawal & Srikant, 1994
  – A similar level-wise algorithm by Mannila et al., 1994
• Major idea:
  – A subset of a frequent itemset must be frequent
    • E.g., if {beer, diaper, nuts} is frequent, {beer, diaper} must be.
    • If an itemset is infrequent, none of its supersets can be frequent!
  – A powerful, scalable candidate-set pruning technique:
    • It reduces the number of candidate k-itemsets dramatically (for k > 2)

Illustrating the Apriori Principle

[Figure: the lattice over A–E. Frequent subsets sit toward the null end; once an
itemset is found infrequent, all of its supersets are pruned from the candidate
space.]

The Apriori Algorithm

  Ck: candidate itemsets of size k
  Lk: frequent itemsets of size k

  L1 = {frequent items};
  for (k = 1; Lk != ∅; k++) do begin
      Ck+1 = candidates generated from Lk;
      for each transaction t in database do
          increment the count of all candidates in Ck+1 that are contained in t;
      Lk+1 = candidates in Ck+1 with min_support;
  end
  return ∪k Lk;
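The level-wise pseudocode above can be sketched as a small Python program. This is a minimal toy implementation (real implementations count candidates with hash trees rather than one subset test per candidate):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise Apriori sketch: L1 -> C2 -> L2 -> ... Returns every frequent
    itemset with its absolute support count."""
    transactions = [frozenset(t) for t in transactions]
    # First pass: frequent single items (L1)
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    all_frequent = dict(frequent)
    k = 1
    while frequent:
        # Candidate generation: extend Lk by one item, prune any candidate
        # with an infrequent k-subset (the Apriori property)
        items = sorted({i for s in frequent for i in s})
        candidates = set()
        for s in frequent:
            for i in items:
                if i not in s:
                    c = s | {i}
                    if len(c) == k + 1 and all(frozenset(sub) in frequent
                                               for sub in combinations(c, k)):
                        candidates.add(frozenset(c))
        # One database pass counts all surviving candidates
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {c: n for c, n in counts.items() if n >= min_support}
        all_frequent.update(frequent)
        k += 1
    return all_frequent

# The slides' example database, minimum support 2
db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
result = apriori(db, 2)
print(result[frozenset({2, 3, 5})])  # 2
```

On the slides' example this reproduces L1–L3 exactly: {4} is pruned at level 1, {1,2} and {1,5} at level 2, and only {2,3,5} survives as a 3-itemset.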
The Apriori Algorithm — Example

Database D (minimum support > 1):

  TID  Items
  100  1 3 4
  200  2 3 5
  300  1 2 3 5
  400  2 5

Scan D → C1: {1}:2 {2}:3 {3}:3 {4}:1 {5}:3
L1 (prune {4}): {1}:2 {2}:3 {3}:3 {5}:3

C2 (from L1): {1 2} {1 3} {1 5} {2 3} {2 5} {3 5}
Scan D → {1 2}:1 {1 3}:2 {1 5}:1 {2 3}:2 {2 5}:3 {3 5}:2
L2: {1 3}:2 {2 3}:2 {2 5}:3 {3 5}:2

C3: {2 3 5} (note: {1,2,3}, {1,2,5} and {1,3,5} are not in C3 — each contains an
infrequent 2-subset)
Scan D → L3: {2 3 5}:2

Generating Association Rules from Frequent Itemsets

• Only strong association rules are generated.
• Frequent itemsets satisfy the minimum support threshold.
• Strong association rules satisfy the minimum confidence threshold.
• Confidence(P ⇒ Q) = Prob(Q | P) = Support(P ∪ Q) / Support(P)

  For each frequent itemset f, generate all non-empty subsets of f.
  For every non-empty subset s of f do
      output rule s ⇒ (f − s) if support(f)/support(s) ≥ min_confidence
  end
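The rule-generation procedure above translates almost line-for-line into code; a minimal sketch, fed with the frequent itemsets and supports found in the example:

```python
from itertools import combinations

def generate_rules(freq, min_conf):
    """From frequent itemsets with known supports, emit every rule s -> (f - s)
    whose confidence support(f)/support(s) meets min_conf."""
    rules = []
    for f, sup_f in freq.items():
        if len(f) < 2:
            continue
        for r in range(1, len(f)):                 # all non-empty proper subsets
            for s in combinations(f, r):
                s = frozenset(s)
                conf = sup_f / freq[s]             # Apriori guarantees s is in freq
                if conf >= min_conf:
                    rules.append((s, f - s, conf))
    return rules

# Frequent itemsets of the slides' example (minimum support 2)
freq = {frozenset({2}): 3, frozenset({3}): 3, frozenset({5}): 3,
        frozenset({2, 3}): 2, frozenset({2, 5}): 3, frozenset({3, 5}): 2,
        frozenset({2, 3, 5}): 2}
rules = generate_rules(freq, 1.0)
for s, c, conf in rules:
    print(sorted(s), "->", sorted(c), conf)
```

At 100% confidence this yields four rules: 2 ⇒ 5, 5 ⇒ 2, {2,3} ⇒ 5 and {3,5} ⇒ 2 — exactly the subsets whose support equals that of the whole itemset.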

Interestingness Measures

• play basketball ⇒ eat cereal [40%, 66.7%] is misleading:
  – The overall percentage of students eating cereal is 75%, which is higher than
    66.7%.
• play basketball ⇒ not eat cereal [20%, 33.3%] is less deceptive, although it has
  lower support and confidence.

              Basketball  Not basketball  Sum (row)
  Cereal      2000        1750            3750
  Not cereal  1000        250             1250
  Sum (col.)  3000        2000            5000

• We need a measure of dependent or correlated events:

  corr(A, B) = P(A ∪ B) / (P(A) P(B))

Adding Correlations or Lifts to Support and Confidence

• Example:
  – X and Y: positively correlated       X  1 1 1 1 0 0 0 0
  – X and Z: negatively correlated       Y  1 1 0 0 0 0 0 0
  – yet the support and confidence       Z  0 1 1 1 1 1 1 1
    of X ⇒ Z dominate:

    Rule   Support  Confidence
    X ⇒ Y  25%      50%
    X ⇒ Z  37.50%   75%

• P(B|A)/P(B) is also called the lift of rule A ⇒ B.
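The X/Y/Z table above makes a good check: lift exposes the negative X–Z correlation that support and confidence hide. A minimal sketch, encoding the eight rows of the table as itemsets:

```python
def support(itemset, rows):
    return sum(1 for r in rows if itemset <= r) / len(rows)

def lift(antecedent, consequent, rows):
    """lift(A -> B) = P(A and B) / (P(A) * P(B)):
    > 1 positive correlation, < 1 negative, = 1 independence."""
    return support(antecedent | consequent, rows) / (
        support(antecedent, rows) * support(consequent, rows))

# The slide's 8-row example: X = 11110000, Y = 11000000, Z = 01111111
rows = [frozenset(r) for r in ({'X', 'Y'}, {'X', 'Y', 'Z'}, {'X', 'Z'},
                               {'X', 'Z'}, {'Z'}, {'Z'}, {'Z'}, {'Z'})]
print(lift({'X'}, {'Y'}, rows))  # 2.0   (positively correlated)
print(lift({'X'}, {'Z'}, rows))  # ~0.857 (negative, despite 75% confidence)
```

So X ⇒ Z "dominates" on support and confidence yet has lift below 1, while X ⇒ Y has lift 2.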

Other Frequent Patterns

• A frequent pattern {a1, …, a100} implies C(100,1) + C(100,2) + … + C(100,100) =
  2^100 − 1 ≈ 1.27×10^30 frequent sub-patterns!
• Alternatives to mining all frequent patterns:
  – Frequent Closed Patterns
  – Frequent Maximal Patterns
• Maximal frequent itemsets ⊆ Closed frequent itemsets ⊆ All frequent itemsets

Frequent Closed Patterns

• For a frequent itemset X, if there exists no item y such that every transaction
  containing X also contains y, then X is a frequent closed pattern.
• In other words, X is closed if there is no item y, not already in X, that always
  accompanies X in all transactions where X occurs.
• A concise representation of frequent patterns: all frequent itemsets, with their
  supports, can be generated from the frequent closed ones.
• Reduces the number of patterns and rules.
• N. Pasquier et al., ICDT’99

Example — transactions {a,b,c,d}, {a,b,c}, {b,d}; minimum support 2:

  All frequent itemsets:      a:2 b:3 c:2 d:2 ab:2 ac:2 bc:2 bd:2 abc:2
  Frequent closed itemsets:   b:3 bd:2 abc:2
Frequent Maximal Patterns

• A frequent itemset X is maximal if there is no other frequent itemset Y that is
  a superset of X.
• In other words, no other frequent pattern includes a maximal pattern.
• An even more concise representation of frequent patterns, but the support
  information is lost: all frequent patterns can be generated from the frequent
  maximal ones, but without their respective supports.
• R. Bayardo, SIGMOD’98

Example — transactions {a,b,c,d}, {a,b,c}, {b,d}; minimum support 2:

  All frequent itemsets:       a:2 b:3 c:2 d:2 ab:2 ac:2 bc:2 bd:2 abc:2
  Frequent maximal itemsets:   bd:2 abc:2

Maximal vs. Closed Itemsets

  TID  Items
  1    ABC
  2    ABCD
  3    ABC
  4    ACDE
  5    DE

[Figure: the lattice over A–E, each itemset annotated with the set of transaction
ids containing it — e.g. A: 1.2.3.4, B: 1.2.3, C: 1.2.3.4, D: 2.4.5, E: 4.5,
AC: 1.2.3.4, ABC: 1.2.3, ACD: 2.4, ACDE: 4. Itemsets such as ABCDE are not
supported by any transaction.]
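Both definitions are simple containment tests over the full set of frequent itemsets; a small sketch that filters a support-annotated result down to its closed and maximal subsets, using the closed-pattern example from the slides:

```python
def closed_and_maximal(freq):
    """Given {frequent itemset: support}, return the closed itemsets (no
    superset with the same support) and the maximal ones (no frequent
    superset at all), per the definitions on the slides."""
    closed, maximal = {}, {}
    for x, sup in freq.items():
        supersets = [y for y in freq if x < y]
        if not any(freq[y] == sup for y in supersets):
            closed[x] = sup
        if not supersets:
            maximal[x] = sup
    return closed, maximal

# The slides' example: transactions {a,b,c,d}, {a,b,c}, {b,d}, min support 2
freq = {frozenset('a'): 2, frozenset('b'): 3, frozenset('c'): 2,
        frozenset('d'): 2, frozenset('ab'): 2, frozenset('ac'): 2,
        frozenset('bc'): 2, frozenset('bd'): 2, frozenset('abc'): 2}
closed, maximal = closed_and_maximal(freq)
print(sorted(''.join(sorted(s)) for s in closed))   # ['abc', 'b', 'bd']
print(sorted(''.join(sorted(s)) for s in maximal))  # ['abc', 'bd']
```

The output matches the slides: nine frequent itemsets collapse to three closed ones (b, bd, abc) and two maximal ones (bd, abc), and every maximal itemset is also closed.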

Maximal vs. Closed Itemsets (continued)

With minimum support 2 on the database above:
• ABC (tid-list 1.2.3) is closed and maximal.
• AC (tid-list 1.2.3.4) is closed but not maximal — ABC is a frequent superset.
• The maximal frequent itemsets form the frequent-pattern border between the
  frequent and infrequent parts of the lattice.

Mining the Pattern Lattice

• Breadth-First
  – Uses the current items at level k to generate the items of level k+1 (many
    database scans)
• Depth-First
  – Uses a current item at level k to generate all its supersets (favored when
    mining long frequent patterns)
• Hybrid approach
  – Mines in both directions at the same time
• Leap traversal approach
  – Jumps to selected nodes in the lattice
• There is also the notion of traversal direction: top-down (level k then level
  k+1) vs. bottom-up (level k+1 then level k), between the bottom of the lattice
  (null) and its top (the full itemset).

Breadth-First (Bottom-Up Example)

Database: T1 ABC, T2 ABCD, T3 ABC, T4 ACDE, T5 DE; minimum support 2.

• A superset becomes a candidate only if ALL of its subsets are frequent.
• Proceeding level by level from the bottom (null) upward, 18 candidates must be
  checked.

Depth-First (Top-Down Example)

Same database and support threshold.

• A subset becomes a candidate if it is marked or if one of its supersets is a
  candidate.
• Proceeding from the top (ABCDE) downward, 23 candidates must be checked.
One Hybrid Example

Same database (T1 ABC, T2 ABCD, T3 ABC, T4 ACDE, T5 DE), minimum support 2.

• Mines from both ends of the lattice at the same time; a superset is a candidate
  only if ALL of its subsets are frequent.
• 19 candidates must be checked.

Leap Traversal Example

• An itemset is a candidate if it is marked or if it is a subset of more than one
  infrequent marked superset.
• Only 10 candidates must be checked, yielding 5 frequent patterns; the supports
  of unmarked patterns are obtained without checking them directly.
• How to find the support of an itemset:
  1. A full scan of the database, OR
  2. Intelligent techniques: the support of an itemset is the summation of the
     supports of its marked supersets.

Finding Maximal Patterns Using the Leap Traversal Approach

Database: T1 ABC, T2 ABCD, T3 ABC, T4 ACDE, T5 DE; minimum support 2.

• Selectively jumps in the lattice.
• Finds local maximals, then generates all patterns from these local maximals.

Step 1: Define the actual paths — mark the distinct transactions as paths in the
lattice (ABC: 1.2.3, ABCD: 2, ACDE: 4, DE: 4.5).

Step 2: Intersect the non-frequent marked paths with each other (e.g.
ABCD ∩ ACDE = ACD).

Step 3: Remove the non-frequent paths, and the frequent paths that are subsets
of other frequent paths.
Finding Maximal Patterns Using the Leap Traversal Approach (result)

With minimum support 2, the surviving local maximals are the maximal frequent
patterns: ABC, ACD and DE.

When to Use a Given Strategy

• Breadth-First
  – Suitable for short frequent patterns
  – Unsuitable for long frequent patterns
• Depth-First
  – Suitable for long frequent patterns
  – In general not scalable when long candidate patterns are not frequent
• Leap Traversal
  – Suitable for cases having short and long frequent patterns simultaneously
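The three leap steps can be sketched compactly. This is a simplification of the slides' procedure, assuming the marked paths are simply the distinct transactions: infrequent paths are intersected to jump to promising nodes, and the frequent survivors are filtered down to maximals:

```python
from itertools import combinations

def leap_maximal(transactions, min_support):
    """Leap-traversal sketch: mark distinct transactions as lattice paths,
    intersect the infrequent ones to find local maximals, keep the frequent
    maximal survivors (simplified from the tutorial's description)."""
    db = [frozenset(t) for t in transactions]
    sup = lambda s: sum(1 for t in db if s <= t)
    paths = set(db)                                   # Step 1: mark actual paths
    frequent = {p for p in paths if sup(p) >= min_support}
    infrequent = paths - frequent
    changed = True
    while changed:                                    # Step 2: intersect
        changed = False
        for a, b in combinations(sorted(infrequent, key=sorted), 2):
            c = a & b
            if c and c not in frequent | infrequent:
                (frequent if sup(c) >= min_support else infrequent).add(c)
                changed = True
    # Step 3: drop frequent paths subsumed by other frequent paths
    return {p for p in frequent if not any(p < q for q in frequent)}

db = [{'A', 'B', 'C'}, {'A', 'B', 'C', 'D'}, {'A', 'B', 'C'},
      {'A', 'C', 'D', 'E'}, {'D', 'E'}]
print(sorted(''.join(sorted(p)) for p in leap_maximal(db, 2)))  # ['ABC', 'ACD', 'DE']
```

On the slides' running example the only intersection needed is ABCD ∩ ACDE = ACD (support 2), and the result is the maximal set {ABC, ACD, DE}.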

Empirical Tests

Connect database (long frequent patterns):

  Support %  Size of Largest  F1-Size  Total Frequent  Breadth  Depth   Leap
  95         9                17       2205            15626    6654    3044
  90         12               21       27127           394648   144426  29263
  80         15               28       119418          —        —       120177
  75         17               30       176365          —        —       177244
  70         18               31       627952          —        —       628963
  65         19               33       1368337         —        —       1369585
  60         20               36       2908632         —        —       2910175
  55         21               37       5996892         —        —       5998715

T10I4D100K (short frequent patterns):

  Support  Size of Largest  F1-Size  Total Frequent  Breadth  Depth    Leap
  1000     3                375      385             20       10       29
  750      4                463      561             262      124      291
  500      5                569      1073            1492     731      1546
  250      8                717      7703            36994    36309    26666
  100      10               797      27532           264645   458275   200113
  50       10               839      53385           1129779  4923723  1067050

Chess (mixed-length frequent patterns):

  Support %  Size of Largest  F1-Size  Total Frequent  Breadth  Depth   Leap
  95         5                9        78              165      90      136
  90         7                13       628             2768     1842    1191
  85         8                16       2690            20871    9667    4318
  80         10               20       8282            91577    49196   12021
  75         11               23       20846           292363   160362  28986
  70         13               24       48939           731740   560103  65093

[Charts: for each dataset, the number of generated candidates relative to the
number of frequent patterns, plotted against support %, for the Breadth, Depth
and Leap strategies.]

Transactional Layouts

• Horizontal Layout
  – Each transaction is recorded as a list of items:

      Transaction ID  Items
      1               A G D C B
      2               B C H E D
      3               B D E A M
      4               C E F A N
      5               A B N O P
      6               A C Q R G
      7               A C H I G
      8               L E F K B
      9               A F M N O
      10              C F P J R
      11              A D B H I
      12              D E B K L
      13              M D C G O
      14              C F P Q J
      15              B D E F I
      16              J E B A D
      17              A K E F C
      18              C D L B A

  – Candidacy generation can be removed (FP-Growth)
  – Superfluous processing

Transactional Layouts

• Vertical Layout
  – A tid-list is kept for each item:

      Item  Transaction IDs
      A     1 3 4 5 6 7 9 11 16 17 18
      B     1 2 3 5 8 11 12 15 16 18
      C     1 2 4 6 7 10 13 14 17 18
      D     1 2 3 11 12 13 15 16 18
      E     2 3 4 8 12 15 16 17
      F     4 8 9 10 14 15 17
      G     1 6 7 13
      H     2 7 11
      I     7 11 15
      J     10 14 16
      K     8 12 17
      L     8 12 18
      M     3 9 13
      N     4 5 9
      O     5 9 13
      P     5 10 14
      Q     6 14
      R     6 10

  – Minimizes superfluous processing
  – Candidacy generation is required

• Bitmap Layout
  – A matrix: rows represent transactions, columns represent items; if the item
    occurs in the transaction the cell value is 1, otherwise 0. For example, T1
    (A G D C B) has 1s in the A, B, C, D and G columns and 0s elsewhere.
  – Similar to the horizontal layout
  – Suitable for datasets with small dimensionality
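With the vertical layout, support counting reduces to tid-list intersection — the idea behind ECLAT, one of the algorithms named later in the tutorial. A minimal sketch over a hypothetical three-item vertical database:

```python
def eclat_supports(tidlists, min_support, prefix=frozenset()):
    """ECLAT-style sketch: the tid-list of an itemset is the intersection of
    its items' tid-lists, so support counting needs no database rescans."""
    out = {}
    items = sorted(tidlists)
    for i, a in enumerate(items):
        tids = tidlists[a]
        if len(tids) < min_support:
            continue
        itemset = prefix | {a}
        out[itemset] = len(tids)
        # Conditional vertical database for extensions of `itemset`
        ext = {b: tids & tidlists[b] for b in items[i + 1:]}
        out.update(eclat_supports(ext, min_support, itemset))
    return out

# Tiny vertical database: item -> set of transaction ids (illustrative only)
vdb = {'A': {1, 3, 4}, 'B': {1, 2, 3}, 'C': {2, 3, 4}}
res = eclat_supports(vdb, 2)
print(res[frozenset({'A', 'B'})])  # 2  ({1,3,4} ∩ {1,2,3} = {1,3})
```

Each recursive call carries only the intersected tid-lists, which is why the vertical layout minimizes superfluous processing even though candidacy generation is still required.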
Transactional Layouts

• Inverted Matrix Layout (El-Hajj and Zaiane, ACM SIGKDD’03)
  – Minimizes superfluous processing
  – Candidacy generation can be reduced
  – Appropriate for interactive mining

Items are sorted by ascending support in an index (location, item, support). Each
occurrence of an item in a transaction is a cell of the transactional array holding
a (row, column) pointer to the next item of the same transaction in the matrix
order; (¤,¤) marks the end of a transaction.

  Loc  Item  Sup  1       2       3       4       5       6       7       8       9       10      11
  1    R     2    (2,1)   (3,2)
  2    Q     2    (12,2)  (3,3)
  3    P     3    (4,1)   (9,1)   (9,2)
  4    O     3    (5,2)   (5,3)   (6,3)
  5    N     3    (13,1)  (17,4)  (6,2)
  6    M     3    (14,2)  (13,3)  (12,4)
  7    L     3    (8,1)   (8,2)   (15,9)
  8    K     3    (13,2)  (14,5)  (13,7)
  9    J     3    (13,4)  (13,5)  (14,7)
  10   I     3    (11,2)  (11,3)  (13,6)
  11   H     3    (14,1)  (12,3)  (15,4)
  12   G     4    (15,1)  (16,4)  (16,5)  (15,6)
  13   F     7    (14,3)  (14,4)  (18,7)  (16,6)  (16,8)  (14,6)  (14,8)
  14   E     8    (15,2)  (15,3)  (16,3)  (17,5)  (15,5)  (15,7)  (15,8)  (16,9)
  15   D     9    (16,1)  (16,2)  (17,2)  (17,6)  (17,7)  (16,7)  (17,8)  (17,9)  (16,10)
  16   C     10   (17,1)  (17,2)  (18,3)  (18,5)  (18,6)  (¤,¤)   (¤,¤)   (¤,¤)   (18,10)  (17,10)
  17   B     10   (18,1)  (¤,¤)   (18,2)  (18,4)  (¤,¤)   (18,8)  (¤,¤)   (¤,¤)   (18,9)   (18,11)
  18   A     11   (¤,¤)   (¤,¤)   (¤,¤)   (¤,¤)   (¤,¤)   (¤,¤)   (¤,¤)   (¤,¤)   (¤,¤)    (¤,¤)   (¤,¤)

Why The Matrix Layout?

• Interactive mining: in the usual KDD pipeline (databases / data warehouse →
  selection and transformation → data mining → evaluation and knowledge
  presentation), changing the support level means expensive steps — the whole
  process is redone.

Why The Matrix Layout?

Repetitive tasks and repetitive (I/O) readings cause superfluous processing in
the horizontal layout:

• With support > 4, the frequent 1-itemsets are {A, B, C, D, E, F}, yet every
  scan still reads the infrequent items {G, H, I, J, K, L, M, N, O, P, Q, R} in
  all 18 transactions, contributing nothing.
• With support > 9, only {A, B, C} are frequent, and every scan still reads the
  infrequent items {D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R}.

Transactional Layouts

• Inverted Matrix Layout — the support border
  – Because the index is sorted by ascending support, a minimum-support threshold
    (here support > 4) defines a border in the matrix: the rows above it (R, Q, P,
    O, N, M, L, K, J, I, H, G) contain only infrequent items and are never read.
    Mining traverses only the rows of the frequent items F, E, D, C, B and A
    below the border.
Transactional Layouts

• Inverted Matrix Layout, support > 4: only the frequent rows (F, E, D, C, B, A)
  of the transactional array are read, which corresponds to the transactions
  reduced to their frequent items:

    T1  A D C B     T7   A C        T13  D C
    T2  B C E D     T8   E F B      T14  C F
    T3  B D E A     T9   A F        T15  B D E F
    T4  C E F A     T10  C F        T16  E B A D
    T5  A B         T11  A D B      T17  A E F C
    T6  A C         T12  D E B      T18  C D B A

The Algorithms (State of the Art)

• All frequent patterns: Apriori, FP-Growth, COFI*, ECLAT
• Frequent closed patterns: CHARM, CLOSET+, COFI-CLOSED
• Frequent maximal patterns: MaxMiner, MAFIA, GENMAX, COFI-MAX

All-Apriori

Problems with Apriori (R. Agrawal, R. Srikant, VLDB’94):
• Repetitive I/O scans — a high number of data scans
• Huge computation to generate candidate itemsets
• Generation of candidate itemsets is expensive (huge candidate sets):
  – 10^4 frequent 1-itemsets will generate ~10^7 candidate 2-itemsets
  – To discover a frequent pattern of size 100, e.g., {a1, a2, …, a100}, one needs
    to generate 2^100 ≈ 10^30 candidates.

Frequent Pattern Growth

• The first algorithm that allows frequent pattern mining without generating
  candidate sets
• Requires a Frequent Pattern Tree (FP-tree)

All-FP-Growth

FP-Growth (J. Han, J. Pei, Y. Yin, SIGMOD’00):
• 2 I/O scans
• Reduced candidacy generation
• High memory requirements
• Claims to be 1 order of magnitude faster than Apriori
• Builds an FP-tree, then recursively mines conditional pattern bases and
  conditional FP-trees

Frequent Pattern Tree — first scan (required support: 3)

Transactions:
  F, A, C, D, G, I, M, P
  A, B, C, F, L, M, O
  B, F, H, J, O
  A, F, C, E, L, P, M, N
  B, C, K, S, P
  F, M, C, B, A

Item counts: F:5, C:5, A:4, B:4, M:4, P:3 (frequent);
D:1, E:1, G:1, H:1, I:1, J:1, K:1, L:1, O:1, … (infrequent, discarded).

Frequent Pattern Tree — second scan

Each transaction is reduced to its frequent items, reordered by descending
frequency (e.g., F, A, C, D, G, I, M, P → F, C, A, M, P). The ordered
transactions of the running example — F,C,A,M,P; F,C,A,B,M; F,B; C,B,P;
F,C,A,M,P; C,A,M; F,B — are inserted one at a time, sharing common prefixes:

• F,C,A,M,P: a single path root–F:1–C:1–A:1–M:1–P:1
• F,C,A,B,M: shares the prefix F,C,A (counts become 2) and branches into B:1–M:1
• F,B: F becomes 3, with a new child B:1
• C,B,P: a new branch C:1–B:1–P:1 directly under the root
• F,C,A,M,P: increments the first path (F:4, C:3, A:3, M:2, P:2)
• C,A,M: the root’s C becomes 2 and branches into A:1–M:1
• F,B: F becomes 5 and its child B becomes 2

Final tree (the header table F:5, C:5, A:4, B:4, M:4, P:3 links to the nodes of
each item):

  root
    F:5
      C:3
        A:3
          M:2
            P:2
          B:1
            M:1
      B:2
    C:2
      B:1
        P:1
      A:1
        M:1

All-FP-Growth

[Figure, slide 59: the complete FP-tree, with counts F:5, C:3, A:3, M:2, P:2 along the main branch plus the B and C-rooted side branches]

Frequent Pattern Growth mines the tree bottom-up from the header table (F:5, C:5, A:4, B:4, M:4, P:3), starting with the least frequent item P. Reading P's prefix paths gives its conditional pattern base, with item counts C:3, F:2, A:2, M:2, B:1; only C reaches the required support of 3, so P's conditional FP-tree is the single node C:3, yielding <C:3, P:3>.
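The step mined on this slide, collecting P's conditional pattern base and keeping only its locally frequent items, can be sketched as follows (the pattern base is entered by hand from the tree, and `conditional_frequent_items` is a hypothetical helper name):

```python
from collections import Counter

def conditional_frequent_items(pattern_base, min_sup):
    """Count the items occurring in a conditional pattern base, given as
    (prefix_path, count) pairs, and keep those meeting min_sup."""
    counts = Counter()
    for path, cnt in pattern_base:
        for item in path:
            counts[item] += cnt
    return {i: c for i, c in counts.items() if c >= min_sup}

# P's conditional pattern base read off the tree: the two F-C-A-M
# prefix paths (combined count 2) and one C-B path (count 1).
base_P = [(("F", "C", "A", "M"), 2), (("C", "B"), 1)]
locally_frequent = conditional_frequent_items(base_P, 3)  # only C:3 survives
```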
All-FP-Growth

[Figure, slide 61: mining item M. Its conditional pattern bases are F:2, C:2, A:2; F:1, C:1, A:1, B:1; and C:1, A:1. FP-Growth recursively builds and mines the conditional trees for each of them]

All-COFI

COFI algorithm big picture:
– 2 I/O scans
– Reduced candidacy generation
– Small memory footprint
[Figure, slide 62: from the FP-tree, one small Co-Occurrence Frequent-Item tree is built and mined at a time; the M-COFI-tree holds the branches <F:3, C:3, A:3, M:3> and <C:1, A:1> and is discarded before the A, C and F COFI-trees are built]
El-Hajj and Zaïane, DaWak'03

All-COFI
Co-Occurrences Frequent Item tree

Start with item P: find the locally frequent items with respect to P. Only C:3 qualifies, giving the single frequent-path-base PC:3; all subsets of PC:3 are frequent and have the same support.
[Figure, slide 63: the P-COFI-tree, root P:3:0 with child C:3:0; each node carries a support count and a participation count]

Then item M: the locally frequent items with respect to M are A:4, C:4, F:3.
[Figure, slide 64: the M-COFI-tree, root M:4:0 with a branch through A:4:0, C:4:0, F:3:0; walking that branch yields the frequent-path-base FCA:3]

All-COFI
Co-Occurrences Frequent Item tree

[Figures, slides 65–66: after the frequent-path-base FCA:3 is recorded, the participation counters along its branch are raised to 3 (M:4:3, A:4:3, C:4:3, F:3:3); the remaining branch then contributes the second frequent-path-base CA:1]
All-COFI
Co-Occurrences Frequent Item tree: how to mine the frequent-path-bases

Three approaches; in all of them, the support of any candidate pattern is the summation of the supports of the frequent-path-bases that are supersets of it.

1: Bottom-Up
2: Top-Down
[Figure, slides 67–68: the lattice generated from the bases FCA:3 and CA:1, with pairs CA:4, CF:3, AF:3 and singletons C:4, A:4, F:3]
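The support rule stated above (a candidate's support is the sum over the frequent-path-bases that contain it) is easy to sketch; the base list is M's from the example, and the function name is illustrative:

```python
def support_from_path_bases(pattern, path_bases):
    """COFI support rule: a candidate's support is the sum of the
    supports of the frequent-path-bases that are supersets of it."""
    p = set(pattern)
    return sum(cnt for base, cnt in path_bases if p <= set(base))

bases = [("FCA", 3), ("CA", 1)]          # M's frequent-path-bases
sup_CA = support_from_path_bases("CA", bases)   # contained in both: 3 + 1 = 4
sup_CF = support_from_path_bases("CF", bases)   # contained only in FCA: 3
```

This reproduces the lattice counts on the slide: CA:4, CF:3, AF:3, C:4, A:4, F:3.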

All-COFI All-ECLAT
Co-Occurrences Frequent Item tree ECLAT
How to mine frequent-path-bases • For each item, store a list of transaction ids
(tids) Horizontal
Three approaches: Data Layout
Support of any pattern is the Vertical Data Layout
3: Leap-Traversal summation of the supports of its TID Items
A B C D E
supersets of frequent-path-bases 1 A,B,E 1 1 2 2 1
2 B,C,D 4 2 3 4 3
1) Intersect non frequent path bases
FCA: 3 3 C,E 5 5 4 5 6
FCA: 3 ∩ CA:1 = CA 4 A,C,D 6 7 8 9
5 A,B,C,D 7 8 9
2) Find subsets of the only CA:4 CF:3 AF:3 6 A,E 8 10
7 A,B 9
frequent paths (sure to be frequent) 8 A,B,C
C:4 A:4 F:3
3) Find the support of each pattern 9 A,C,D
10 B TID-list
M.J.Zaki IEEE transactions on Knowledge and data Engineering 00
© Osmar R. Zaïane & Mohammad El-Hajj, 2004 PAKDD 2004, Sydney Tutorial: Advances in Frequent Pattern Mining [69] © Osmar R. Zaïane & Mohammad El-Hajj, 2004 PAKDD 2004, Sydney Tutorial: Advances in Frequent Pattern Mining [70]

All-ECLAT
ECLAT
• Determine the support of any k-itemset by intersecting the tid-lists of two of its (k-1)-subsets, e.g. t(A) = {1, 4, 5, 6, 7, 8, 9} ∧ t(B) = {1, 2, 5, 7, 8, 10} → t(AB) = {1, 5, 7, 8}.
• ECLAT first finds all frequent patterns with respect to item A (AB, AC, …, ABC, ABD, ACD, ABCD, …), then all frequent patterns with respect to item B (BC, BD, …, BCD, BDE, BCDE, …), and so on.
• 3 traversal approaches: top-down, bottom-up and hybrid
• Advantage: very fast support counting; few scans of the database (best case 2)
• Disadvantage: intermediate tid-lists may become too large for memory
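The tid-list intersection at the heart of ECLAT, shown here on the slides' ten-transaction example (the function name is illustrative):

```python
from functools import reduce

def tidlist_support(itemset, tidlists):
    """Intersect the per-item tid-lists to get the itemset's tid-list;
    the support is the size of the intersection."""
    tids = reduce(set.intersection, (tidlists[i] for i in itemset))
    return tids, len(tids)

# Vertical layout of the 10-transaction example from the slides.
tidlists = {'A': {1, 4, 5, 6, 7, 8, 9}, 'B': {1, 2, 5, 7, 8, 10},
            'C': {2, 3, 4, 5, 8, 9},    'D': {2, 4, 5, 9},
            'E': {1, 3, 6}}
tids_AB, sup_AB = tidlist_support("AB", tidlists)   # {1, 5, 7, 8}, 4
```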
CLOSED-CHARM
CHARM
– Uses the vertical data format
– Delivers closed patterns based on tid-set intersections (e.g. t(A) ∧ t(B) → t(AB))
– Diffsets: do not store the entire tidsets, only the differences between tidsets
– Uses diffsets to accelerate mining

Diffset intersections (example):

TIDSET database (tids 1–6):
t(A) = 1, 3, 4, 5
t(C) = 1, 2, 3, 4, 5, 6
t(D) = 2, 4, 5, 6
t(T) = 1, 3, 5, 6
t(W) = 1, 2, 3, 4, 5

DIFFSET database:
d(A) = 2, 6
d(C) = ∅
d(D) = 1, 3
d(T) = 2, 4
d(W) = 6

Diffset recurrence: d(AD) = d(D) − d(A) = {1, 3} − {2, 6} = {1, 3}
or, equivalently, d(AD) = t(A) − t(D) = {1, 3, 4, 5} − {2, 4, 5, 6} = {1, 3}
Support: σ(AD) = σ(A) − |d(AD)| = 4 − 2 = 2

[Figure, slide 74: the prefix tree of intersections over A, C, D, T, W, from the pairs (AC, AD, AT, AW, CD, CT, CW, DT, DW, TW) through the triples (ACT, ACW, ATW, CDW, CTW) down to ACTW]

M. J. Zaki and C.-J. Hsiao, SIAM'02
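The diffset bookkeeping above can be checked numerically; note that the support recurrence subtracts from the support of the prefix A, i.e. σ(AD) = σ(A) − |d(AD)|. A small sketch with illustrative names:

```python
def diffset(t_x, t_y):
    """d(XY) = t(X) - t(Y): the tids where X occurs but Y does not."""
    return t_x - t_y

# The slide's example over tids 1..6.
tA, tD = {1, 3, 4, 5}, {2, 4, 5, 6}
d_AD = diffset(tA, tD)            # {1, 3}
sup_AD = len(tA) - len(d_AD)      # sigma(AD) = sigma(A) - |d(AD)| = 2
```

Storing only these small differences instead of full tidsets is what makes deep intersections cheap in CHARM.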

CLOSED-CLOSET+
CLOSET+
– Uses the horizontal format of transactions
– Based on the FP-Growth model
– Divides the search space
– Finds closed itemsets recursively
– Two-level hash-indexed result-tree structure for dense datasets
– Pseudo-projection based upward-checking for sparse datasets

• Two-level hash-indexed result tree
• Compressed result-tree structure
• Search-space shrinking for subset checking
– If itemset Sc can be absorbed by another already mined itemset Sa, they have the following relationships:
1) sup(Sc) = sup(Sa)
2) length(Sc) < length(Sa)
3) ∀ i ∈ Sc: i ∈ Sa
– Measures to enhance the checking:
» Two-level hash indices: support and itemID
» Record length information in each result-tree node

J. Wang, J. Han, J. Pei. SIGKDD'03

CLOSED-CLOSET+
CLOSET+
Pseudo-projection based upward checking
• The result-tree may consume much more memory for sparse datasets
• Subset checking without maintenance of history itemsets:
– For a certain prefix X, as long as we can find any item which (1) appears in each prefix path w.r.t. prefix X, and (2) does not belong to X, any itemset with prefix X will be non-closed; otherwise, if there is no such item, the union of X and the complete set of its locally frequent items with support sup(X) forms a closed itemset.
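The upward check described above (an item outside the prefix that appears in every prefix path proves non-closedness) is a short set computation. A sketch, with the prefix paths given explicitly as tuples and `has_closure_blocker` a hypothetical name:

```python
def has_closure_blocker(prefix, prefix_paths):
    """CLOSET+-style upward check: True if some item outside `prefix`
    occurs in every prefix path, i.e. no itemset with this prefix
    can be closed on its own."""
    common = set.intersection(*(set(p) for p in prefix_paths))
    return bool(common - set(prefix))

# B appears in both prefix paths of prefix (A,), so A always co-occurs
# with B here and prefix A alone cannot lead to a closed itemset.
blocked = has_closure_blocker(("A",), [("B", "C"), ("B", "D")])   # True
clear = has_closure_blocker(("A",), [("B", "C"), ("D",)])         # False
```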
CLOSED-COFI-CLOSED
COFI-CLOSED
– Uses the horizontal format
– COFI mining approach
– Leap-traversal approach
– Divides the search space
– One global tree holds the discovered closed patterns

Example (finding closed frequent patterns for a COFI-tree with frequent-path-bases ABCD:2, ABCE:1, ADE:1, ABCDE:1):
Step 1: Assign the frequent-path-bases to their locations in OPB
Step 2: Apply the leap-traversal approach on the frequent-path-bases
Step 3: Find the global support of each candidate pattern
Step 4: Remove non-frequent patterns, and frequent patterns that have a superset of frequent patterns with the same support
[Figures, slides 79–80: the OPB table after each step, with intersection-generated candidates such as ABC:4, AD:4, AE:3, ABCD:3, ABCE:2, ADE:2]

M. El-Hajj and O. Zaïane

MAXIMAL-MAXMINER
MaxMiner: Mining Max-patterns

TID  Items
10   A, B, C, D, E
20   B, C, D, E
30   A, C, D, F

• 1st scan: find the frequent items: A, B, C, D, E
• 2nd scan: find the support of the potential max-patterns: AB, AC, AD, AE, ABCDE; BC, BD, BE, BCDE; CD, CE, CDE
• Since BCDE is a max-pattern, there is no need to check BCD, BDE or CDE in a later scan

– Abandons a bottom-up traversal
– Attempts to "look ahead": identify a long frequent itemset, then prune all its subsets
– Uses a set-enumeration tree
[Figure, slide 82: the set-enumeration tree over items 1–4, from {} through 1, 2, 3, 4 and their combinations down to 1,2,3,4]

R. J. Bayardo, ACM SIGMOD'98
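MaxMiner's look-ahead can be sketched directly on the slide's three-transaction example (support counting is naive here, and the function names are illustrative):

```python
def support(itemset, transactions):
    """Naive support count: how many transactions contain the itemset."""
    s = set(itemset)
    return sum(1 for t in transactions if s <= set(t))

def lookahead_prune(head, tail, transactions, min_sup):
    """MaxMiner look-ahead: if head + tail is frequent as a whole, every
    subset is frequent too, so the subtree below `head` can be skipped."""
    return support(head + tail, transactions) >= min_sup

db = [list("ABCDE"), list("BCDE"), list("ACDF")]
# With min_sup = 2, B with tail {C, D, E} gives BCDE, which is frequent,
# so BCD, BDE and CDE never need to be counted individually.
pruned = lookahead_prune(["B"], ["C", "D", "E"], db, 2)   # True
```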

MAXIMAL-MAFIA
MAFIA: A Maximal Frequent Itemset Algorithm for Transactional Databases
– Vertical bitmap representation
– Look-ahead pruning technique
– Integrates a depth-first traversal of the itemset lattice with effective pruning mechanisms
[Figure, slide 83: the itemset lattice over items 1–4, organized by levels 0–4, with a cut marking the pruned region]

Pruning mechanisms:
• PEP (Parent Equivalence Pruning): for newNode = C ∪ i (where i is a child of C), check newNode.support == C.support; if so, move i from C.tail to C.head
• FHUT (Frequent Head Union Tail): newNode = C ∪ I (where I is the leftmost child in the tail)
• HUTMFI: check whether Head ∪ Tail is in the MFI; if so, stop searching and return

D. Burdick, M. Calimlim, and J. Gehrke, ICDE'01
MAXIMAL-GENMAX
GenMax
– Progressive focusing to improve superset checking
– Superset checking techniques:
  – Do the superset check only for Il+1 ∪ Pl+1
  – Use a check_status flag
  – Local maximal frequent itemsets
  – Reorder the combine set
– Vertical approach
– The database needs to be loaded into main memory
K. Gouda and M. Zaki, ICDM'01

MAXIMAL-COFIMAX
COFI-MAX
– COFI approach
– Leap-traversal searching
– Pruning techniques:
(A) Skip building some COFI-trees
(B) Remove all frequent-path-bases that are subsets of already found maximal patterns
(C) Count the support of frequent-path-bases early, and remove the frequent ones from the leap-traversal approach
M. El-Hajj and O. Zaïane

MAXIMAL-COFIMAX
COFI-MAX
Example (finding maximal frequent patterns for a COFI-tree with frequent-path-bases ABCD:2, ABCE:1, ADE:1, ABCDE:1):
Step 1: Assign the frequent-path-bases to their locations in OPB
Step 2: Find the global support of each frequent-path-base
Step 3: Apply the leap-traversal approach on the non-frequent frequent-path-bases
Step 4: Remove non-frequent patterns, and frequent patterns that have a superset among the frequent patterns
[Figures, slide 87: the OPB table after each step]

Mining relatively small datasets
[Charts, slide 88: run time in seconds vs. number of transactions (10K–250K) on small synthetic datasets; COFI-ALL vs. FP-Growth vs. MAFIA for all patterns, COFI-CLOSED vs. FPCLOSED vs. MAFIA for closed patterns, and COFI-MAX vs. FPMAX vs. MAFIA for maximal patterns]

Mining extremely large datasets
[Charts, slide 89: run time in seconds vs. number of transactions (5M–100M); COFI-ALL vs. FP-Growth for all patterns, COFI-CLOSED vs. FP-CLOSED for closed patterns, and COFI-MAX vs. FPMAX for maximal patterns]

Mining real datasets (low and high support)
[Charts, slide 90: run time in seconds vs. support on the pumsb dataset; all patterns at high support (90%–65%) with COFI-ALL, FP-Growth, MAFIA and Eclat; closed patterns at high support with COFI-CLOSED, FP-CLOSE, MAFIA and CHARM; maximal patterns at low support (40%–15%) with COFI-MAX, FPMAX, MAFIA and GENMAX]
Which algorithm is the winner?

Not clear yet. With relatively small datasets we can find different winners:
1. By using different datasets
2. By changing the support level
3. By changing the implementations

What about extremely large datasets (hundreds of millions of transactions)?
– Most of the existing algorithms do not run on such sizes
– Vertical approaches and bitmap approaches cannot load the transactions into main memory
– Approaches based on repeated scans cannot afford to scan these huge databases many times

Requirements: we need algorithms that
1) do not require multiple scans of the database
2) leave a small footprint in main memory at any given time

Constraint-based Data Mining
Restricting Association Rules

• Finding all the patterns in a database autonomously? — unrealistic!
– The patterns could be too many but not focused!
• Data mining should be an interactive process
– The user directs what is to be mined using a data mining query language (or a graphical user interface)
• Constraint-based mining
– User flexibility: provides constraints on what is to be mined
– System optimization: explores such constraints for efficient mining

Restricting association rules is useful for interactive and ad-hoc mining: it reduces the set of association rules discovered and confines them to more relevant rules.
• Before mining
✓ Knowledge type constraints: classification, etc.
✓ Data constraints: SQL-like queries (DMQL)
✓ Dimension/level constraints: relevance to some dimensions and some concept levels
• While mining
✓ Rule constraints: form, size, and content
✓ Interestingness constraints: support, confidence, correlation
• After mining
✓ Querying association rules

Constrained Frequent Pattern Mining: A Mining Query Optimization Problem
• Given a frequent pattern mining query with a set of constraints C, the algorithm should be:
– sound: it only finds frequent sets that satisfy the given constraints C
– complete: all frequent sets satisfying the given constraints C are found
• A naïve solution: first find all frequent sets, and then test them for constraint satisfaction
• More efficient approaches:
– Analyze the properties of constraints comprehensively
– Push them as deeply as possible inside the frequent pattern computation

Rule Constraints in Association Mining
• Two kinds of rule constraints:
– Rule form constraints: meta-rule guided mining
• P(x, y) ^ Q(x, w) −> takes(x, "database systems")
– Rule content constraints: constraint-based query optimization (Ng, et al., SIGMOD'98)
• sum(LHS) < 100 ^ min(LHS) > 20 ^ count(LHS) > 3 ^ sum(RHS) > 1000
• 1-variable vs. 2-variable constraints (Lakshmanan, et al., SIGMOD'99):
– 1-var: a constraint confining only one side (L/R) of the rule, e.g., as shown above
– 2-var: a constraint confining both sides (L and R)
• sum(LHS) < min(RHS) ^ max(RHS) < 5 * sum(LHS)
Anti-Monotonicity in Constraint-Based Mining
• Anti-monotonicity: when an itemset S violates the constraint, so does any of its supersets
– sum(S.Price) ≤ v is anti-monotone
– sum(S.Price) ≥ v is not anti-monotone
• Example. C: range(S.profit) ≤ 15 is anti-monotone
– Itemset ab violates C
– So does every superset of ab

Monotonicity in Constraint-Based Mining
• Monotonicity: when an itemset S satisfies the constraint, so does any of its supersets
– sum(S.Price) ≥ v is monotone
– min(S.Price) ≤ v is monotone
• Example. C: range(S.profit) ≥ 15 is monotone
– Itemset ab satisfies C
– So does every superset of ab

TDB (min_sup = 2):
TID  Transaction
10   a, b, c, d, f
20   b, c, d, f, g, h
30   a, c, d, e, f
40   c, e, f, g

Item  Profit
a     40
b     0
c     -20
d     10
e     -30
f     30
g     20
h     -10
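With the profit table above, the anti-monotone behaviour of range(S.profit) ≤ 15 can be verified directly. A small sketch; `rng` is an illustrative helper name:

```python
profit = {'a': 40, 'b': 0, 'c': -20, 'd': 10, 'e': -30,
          'f': 30, 'g': 20, 'h': -10}

def rng(itemset):
    """range(S.profit) = max - min of the profits of S's items."""
    vals = [profit[i] for i in itemset]
    return max(vals) - min(vals)

# Anti-monotone: once "ab" violates range <= 15, every superset does,
# since adding items can only widen the range.
r_ab, r_abc = rng("ab"), rng("abc")   # 40 and 60, both > 15
# Monotone: once "ab" satisfies range >= 15, every superset does too.
both_satisfy = (r_ab >= 15) and (r_abc >= 15)   # True
```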

Which Constraints Are Monotone or Anti-Monotone? SQL-based Constraints

Constraint                       Monotone   Anti-Monotone
v ∈ S                            yes        no
S ⊇ V                            yes        no
S ⊆ V                            no         yes
min(S) ≤ v                       yes        no
min(S) ≥ v                       no         yes
max(S) ≤ v                       no         yes
max(S) ≥ v                       yes        no
count(S) ≤ v                     no         yes
count(S) ≥ v                     yes        no
sum(S) ≤ v (∀ a ∈ S, a ≥ 0)      no         yes
sum(S) ≥ v (∀ a ∈ S, a ≥ 0)      yes        no
range(S) ≤ v                     no         yes
range(S) ≥ v                     yes        no
support(S) ≥ ξ                   no         yes
support(S) ≤ ξ                   yes        no

State Of The Art
Constraint pushing techniques have been proven to be effective in reducing the explored portion of the search space in constrained frequent pattern mining tasks.
Anti-monotone constraints:
• Easy to push
• Always profitable to do so
Monotone constraints:
• Hard to push
• Should we push them, or not?

Dual Miner (C. Bucila, J. Gehrke, D. Kifer and W. White, SIGKDD'02)
• A dual pruning algorithm for itemsets with constraints
• Based on MAFIA:
1. Bitmap approach
2. Does not compute frequent items with their support
3. Performance issues (many tests are required)
[Figure, slide 101: the itemset lattice over a, b, c, d. If support(a) < min_sup, all supersets of a can be pruned; if support(cd) > max_sup, all subsets of cd can be pruned]

FP-Growth with constraints
Checks only monotone constraints (plus the support, which is an anti-monotone constraint).
Once a frequent itemset satisfies the monotone constraint, all frequent itemsets having it as a prefix are also guaranteed to satisfy the constraint.
J. Pei, J. Han, L. Lakshmanan, ICDE'01
COFI-With Constraints
Anti-monotone constraint checking:
1. Remove individual items that do not satisfy the anti-monotone constraint (they do not even participate in the FP-tree building process). Reduces the FP-tree size.
2. For each A-COFI-tree, remove any locally frequent item B where B does not satisfy the anti-monotone constraint. Reduces the COFI-tree size.
3. No constraint checking is needed for an A-COFI-tree if the itemset X satisfies the anti-monotone constraint, where X is the set of all locally frequent items with respect to A. Reduces testing.

Monotone constraint checking:
1. Remove any frequent-path-bases that do not satisfy the monotone constraint. Reduces lattice traversal.
2. Do not create an A-COFI-tree if all its local items X together with the item A violate the monotone constraint. Reduces the number of COFI-trees.
3. No constraint checking is needed for an A-COFI-tree if the item A satisfies the monotone constraint, since any itemset containing A will also satisfy it. Reduces testing.

M. El-Hajj and O. Zaïane

Parallel KDD systems
Challenges and issues (Requirements & Design issues)

Requirements:
– Algorithm evaluation
– Process support
– Location transparency
– Data transparency
– System transparency
– Security, quality of service
– Availability, fault tolerance and mobility

Design issues:
– Large size
– High dimensionality
– Data location
– Data skew
– Dynamic load balancing
– Parallel database management systems vs. parallel file systems
– Parallel I/O minimization
– Parallel incremental mining algorithms
– Parallel dividing of the datasets
– Parallel merging of discoveries

Parallel Association Rule Mining Algorithms

Three families: replication algorithms, partitioning algorithms, and hybrid algorithms.

Replication algorithms: replicate the candidate sets and partition the database, so each processor counts its own data partition against the full candidate set.
• Count Distribution algorithm (R. Agrawal et al., 1996)
• Parallel Partition algorithm (Ashok Savasere et al., 1995)
• Fast Distributed algorithm (David Cheung et al., 1996)
• Parallel Data Mining algorithm (J. S. Park et al., 1996)

Partitioning algorithms: partition the candidate set across processors, each of which then needs to scan the whole database.
• Data Distribution algorithm (R. Agrawal et al., 1996)
• Intelligent Data Distribution algorithm (E.-H. Han et al., 1997)
• Non-Partitioned Apriori, Simple Partitioned, Hash Partitioned Apriori and HPA-ELD algorithms (Takahiko et al., 1996)
Parallel Association Rule Mining Algorithms

Hybrid algorithms: combine both ideas, replication and partitioning.
• Hybrid Distribution algorithm (E.-H. Han et al., 1997)
• Multiple Local Frequent Pattern Tree algorithm (Zaïane et al., 2001)

Presentation of Association Rules (Table Form)
[Screenshot, slide 110: DBMiner listing the discovered association rules in table form]

Visualization of Association Rules in Plane Form
[Screenshot, slide 111: DBMiner plane visualization]

Visualization of Association Rules Using a Rule Graph
[Screenshot, slide 112: DBMiner rule-graph visualization]

Visualization of Association Rules Using a Table Graph (DBMiner Web version)
[Screenshot, slide 113: DBMiner Web version table-graph visualization]

What is Sequence Mining?
• Input
– A database D of sequences called data-sequences, in which:
• I = {i1, i2, …, in} is the set of items
• each sequence is a list of transactions ordered by transaction-time
• each transaction consists of the fields: (sequence-id, transaction-id), transaction-time and a set of items
• Problem
– Discover all the sequential patterns with a user-specified minimum support
Input Database: example

Database D, sorted by sequence-id and transaction time:
Sequence-Id  Transaction Time  Items
C1           1                 Ringworld
C1           2                 Foundation
C1           15                Ringworld Engineers, Second Foundation
C2           1                 Foundation, Ringworld
C2           20                Foundation and Empire
C2           50                Ringworld Engineers

Example pattern: "45% of customers who bought Foundation will buy Foundation and Empire within the next month."

Subsequence: a sequence <a1, a2, …, an> is contained in another sequence <b1, b2, …, bm> if there exist integers i1 < i2 < … < in such that a1 ⊆ bi1, a2 ⊆ bi2, …, an ⊆ bin. Examples:
<(3) (4 5) (8)> is contained in <(7) (3 8) (9) (4 5 6) (8)> √
<(3) (5)> is not contained in <(3 5)> X
<(3) (5)> is contained in <(3 5) (3 5)> √

k-sequence: a sequence that consists of k items; for example, both <x, y> and <(x, y)> are 2-sequences.

Example

Database sorted by customer ID and transaction time:
Customer ID  Transaction Time  Items Bought
1            June 1            30
1            June 30           90
2            June 10           10, 20
2            June 15           30
2            June 20           40, 60, 70
3            June 25           30, 50, 70
4            June 20           30
4            June 25           40, 70
4            June 30           90
5            June 12           90

Customer-sequence version of the database:
Customer ID  Customer Sequence
1            <(30) (90)>
2            <(10 20) (30) (40 60 70)>
3            <(30 50 70)>
4            <(30) (40 70) (90)>
5            <(90)>

Answer set: the sequential patterns supported by at least 2 customers (minimum support 25%):
<(30) (90)>
<(30) (40 70)>
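The containment test defined above can be implemented greedily: match each element of the candidate subsequence against the earliest later transaction that is a superset of it (if any embedding exists, the leftmost one does). A minimal sketch, with an illustrative function name:

```python
def contains(sub, seq):
    """True if <a1..an> = sub is contained in <b1..bm> = seq, i.e. each
    element set a_k is a subset of some b_{i_k} with i_1 < ... < i_n."""
    j = 0
    for a in sub:
        # Advance to the first remaining transaction that covers a.
        while j < len(seq) and not set(a) <= set(seq[j]):
            j += 1
        if j == len(seq):
            return False
        j += 1          # each a_k must use a strictly later transaction
    return True

# The three examples from the slide:
ex1 = contains([(3,), (4, 5), (8,)],
               [(7,), (3, 8), (9,), (4, 5, 6), (8,)])   # True
ex2 = contains([(3,), (5,)], [(3, 5)])                  # False
ex3 = contains([(3,), (5,)], [(3, 5), (3, 5)])          # True
```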

Algorithms

• AprioriAll (1995)
• GSP: Generalized Sequential Patterns (1996)
  – Both proposed by Rakesh Agrawal and Ramakrishnan Srikant
  – Both need multiple scans over the database
  – GSP outperforms AprioriAll and is able to discover generalized sequential patterns
• SPADE (1998)
  – Proposed by Mohammed J. Zaki
  – Sequential Pattern Discovery using Equivalence classes
• FreeSpan (2000)
• PrefixSpan (2001)
  – Both proposed by Jian Pei et al.
  – Based on the FP-Tree and FP-Growth model

References: Frequent-pattern Mining

• B. Lent, A. Swami, and J. Widom. Clustering association rules. ICDE'97, 220-231, Birmingham, England.
• R. Meo, G. Psaila, and S. Ceri. A new SQL-like operator for mining association rules. VLDB'96, 122-133, Bombay, India.
• R.J. Miller and Y. Yang. Association rules over interval data. SIGMOD'97, 452-461, Tucson, Arizona.
• A. Savasere, E. Omiecinski, and S. Navathe. Mining for strong negative associations in a large database of customer transactions. ICDE'98, 494-502, Orlando, FL, Feb. 1998.
• D. Tsur, J. D. Ullman, S. Abiteboul, C. Clifton, R. Motwani, and S. Nestorov. Query flocks: A generalization of association-rule mining. SIGMOD'98, 1-12, Seattle, Washington.
• J. Pei, A.K.H. Tung, and J. Han. Fault-Tolerant Frequent Pattern Mining: Problems and Challenges. SIGMOD DMKD'01, Santa Barbara, CA.
• Mohammad El-Hajj and Osmar R. Zaïane. Non-Recursive Generation of Frequent K-itemsets from Frequent Pattern Tree Representations. Proc. 5th Int. Conf. on Data Warehousing and Knowledge Discovery (DaWaK'2003), pp. 371-380, Prague, Czech Republic, September 3-5, 2003.
• Mohammad El-Hajj and Osmar R. Zaïane. Inverted Matrix: Efficient Discovery of Frequent Items in Large Datasets in the Context of Interactive Mining. Proc. 2003 Int. Conf. on Data Mining and Knowledge Discovery (ACM SIGKDD), pp. 109-118, Washington, DC, USA, August 24-27, 2003.

Constraint-based Frequent-pattern Mining

• G. Grahne, L. Lakshmanan, and X. Wang. Efficient mining of constrained correlated sets. ICDE'00, 512-521, San Diego, CA, Feb. 2000.
• Y. Fu and J. Han. Meta-rule-guided mining of association rules in relational databases. KDOOD'95, 39-46, Singapore, Dec. 1995.
• J. Han, L. V. S. Lakshmanan, and R. T. Ng. "Constraint-Based, Multidimensional Data Mining". COMPUTER (special issue on Data Mining), 32(8): 46-50, 1999.
• L. V. S. Lakshmanan, R. Ng, J. Han, and A. Pang. "Optimization of Constrained Frequent Set Queries with 2-Variable Constraints". SIGMOD'99.
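The pattern-growth idea behind FreeSpan and PrefixSpan — recursively projecting the sequence database on each frequent item instead of generating and re-scanning candidates as AprioriAll and GSP do — can be sketched in a few lines. This is a simplified illustration only, not the authors' implementation: events are single items rather than itemsets, projected databases are materialized as plain lists, and the `prefixspan` function and toy database are invented for this example.

```python
from collections import defaultdict

def prefixspan(sequences, min_support, prefix=None):
    """Minimal PrefixSpan-style sketch: grow each frequent prefix by
    projecting the database on the prefix's last item and recursing."""
    if prefix is None:
        prefix = []
    patterns = []
    # Count, per item, how many (projected) sequences contain it.
    counts = defaultdict(int)
    for seq in sequences:
        for item in set(seq):
            counts[item] += 1
    for item, support in sorted(counts.items()):
        if support < min_support:
            continue
        new_prefix = prefix + [item]
        patterns.append((new_prefix, support))
        # Project: keep the suffix after the first occurrence of `item`.
        projected = []
        for seq in sequences:
            if item in seq:
                suffix = seq[seq.index(item) + 1:]
                if suffix:
                    projected.append(suffix)
        patterns.extend(prefixspan(projected, min_support, new_prefix))
    return patterns

db = [['a', 'b', 'c'], ['a', 'c'], ['a', 'b', 'c'], ['b', 'c']]
print(prefixspan(db, min_support=2))
```

Because each recursive call only sees suffixes of sequences that already contain the current prefix, no candidate sequence is ever generated and tested against the full database — the key contrast with the multi-scan Apriori-style algorithms above.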

References: Constraint-based Frequent Pattern Mining

• R. Ng, L.V.S. Lakshmanan, J. Han, and A. Pang. "Exploratory mining and pruning optimizations of constrained association rules." SIGMOD'98.
• J. Pei, J. Han, and L. V. S. Lakshmanan. "Mining Frequent Itemsets with Convertible Constraints". Proc. 2001 Int. Conf. on Data Engineering (ICDE'01), April 2001.
• J. Pei and J. Han. "Can We Push More Constraints into Frequent Pattern Mining?". Proc. 2000 Int. Conf. on Knowledge Discovery and Data Mining (KDD'00), Boston, MA, August 2000.
• R. Srikant, Q. Vu, and R. Agrawal. Mining association rules with item constraints. KDD'97, 67-73, Newport Beach, California.
• C. Bucila, J. E. Gehrke, D. Kifer, and W. White. DualMiner: A dual-pruning algorithm for itemsets with constraints. Proc. SIGKDD 2002.
• M. Kamber, J. Han, and J. Y. Chiang. Metarule-guided mining of multi-dimensional association rules using data cubes. KDD'97, 207-210, Newport Beach, California.

Mining Maximal and Closed Patterns

• R. J. Bayardo. Efficiently mining long patterns from databases. SIGMOD'98, 85-93, Seattle, Washington.
• J. Pei, J. Han, and R. Mao. "CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets". Proc. 2000 ACM-SIGMOD Int. Workshop on Data Mining and Knowledge Discovery (DMKD'00), Dallas, TX, May 2000.
• N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for association rules. ICDT'99, 398-416, Jerusalem, Israel, Jan. 1999.
• M. Zaki. Generating Non-Redundant Association Rules. KDD'00, Boston, MA, Aug. 2000.
• M. Zaki. CHARM: An Efficient Algorithm for Closed Association Rule Mining. TR99-10, Department of Computer Science, Rensselaer Polytechnic Institute.
• M. Zaki. Fast Vertical Mining Using Diffsets. TR01-1, Department of Computer Science, Rensselaer Polytechnic Institute.
• D. Burdick, M. Calimlim, and J. Gehrke. MAFIA: A maximal frequent itemset algorithm for transactional databases. ICDE 2001, IEEE Computer Society, 2001.

References: Sequential Pattern Mining

• R. Agrawal and R. Srikant. Mining sequential patterns. ICDE'95, 3-14, Taipei, Taiwan.
• R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. EDBT'96.
• J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M.-C. Hsu. "FreeSpan: Frequent Pattern-Projected Sequential Pattern Mining". Proc. 2000 Int. Conf. on Knowledge Discovery and Data Mining (KDD'00), Boston, MA, August 2000.
• H. Mannila, H. Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery, 1:259-289, 1997.
• J. Pei, J. Han, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. "PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth". Proc. 2001 Int. Conf. on Data Engineering (ICDE'01), Heidelberg, Germany, April 2001.
• B. Ozden, S. Ramaswamy, and A. Silberschatz. Cyclic association rules. ICDE'98, 412-421, Orlando, FL.
• S. Ramaswamy, S. Mahajan, and A. Silberschatz. On the discovery of interesting patterns in association rules. VLDB'98, 368-379, New York, NY.
• M.J. Zaki. Efficient enumeration of frequent sequences. CIKM'98, November 1998.
• M.N. Garofalakis, R. Rastogi, and K. Shim. SPIRIT: Sequential Pattern Mining with Regular Expression Constraints. VLDB 1999: 223-234, Edinburgh, Scotland.
References: Frequent-pattern Mining in Spatial, Multimedia, Text & Web Databases

• K. Koperski, J. Han, and G. B. Marchisio. "Mining Spatial and Image Data through Progressive Refinement Methods". Revue internationale de géomatique (European Journal of GIS and Spatial Analysis), 9(4):425-440, 1999.
• A. K. H. Tung, H. Lu, J. Han, and L. Feng. "Breaking the Barrier of Transactions: Mining Inter-Transaction Association Rules". Int. Conf. on Knowledge Discovery and Data Mining (KDD'99), San Diego, CA, Aug. 1999, pp. 297-301.
• J. Han, G. Dong, and Y. Yin. "Efficient Mining of Partial Periodic Patterns in Time Series Database". Proc. 1999 Int. Conf. on Data Engineering (ICDE'99), Sydney, Australia, March 1999, pp. 106-115.
• H. Lu, L. Feng, and J. Han. "Beyond Intra-Transaction Association Analysis: Mining Multi-Dimensional Inter-Transaction Association Rules". ACM Transactions on Information Systems (TOIS'00), 18(4): 423-454, 2000.
• O. R. Zaïane, M. Xin, and J. Han. "Discovering Web Access Patterns and Trends by Applying OLAP and Data Mining Technology on Web Logs". Proc. Advances in Digital Libraries Conf., Santa Barbara, CA, April 1998, pp. 19-29.
• O. R. Zaïane, J. Han, and H. Zhu. "Mining Recurrent Items in Multimedia with Progressive Resolution Refinement". Proc. 2000 Int. Conf. on Data Engineering (ICDE'00), San Diego, CA, Feb. 2000, pp. 461-470.

FIM for Classification & Data Cube Computation

• K. Beyer and R. Ramakrishnan. Bottom-up computation of sparse and iceberg cubes. SIGMOD 1999, pp. 359-370.
• M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. D. Ullman. Computing iceberg queries efficiently. VLDB'98, 299-310, New York, NY, Aug. 1998.
• J. Han, J. Pei, G. Dong, and K. Wang. Computing Iceberg Data Cubes with Complex Measures. SIGMOD 2001.
• B. Liu, W. Hsu, and Y. Ma. Integrating classification and association rule mining. SIGKDD 1998, pp. 80-86.
• M-L. Antonie and O. R. Zaïane. Text Document Categorization by Term Association. Proc. IEEE Int. Conf. on Data Mining (ICDM'2002), pp. 19-26, Maebashi City, Japan.

References: Parallel Frequent Itemset Mining

• R. Agrawal and J. Shafer. Parallel Mining of Association Rules. IEEE Transactions on Knowledge and Data Engineering, 8 (1996), pp. 962-969.
• Barry Wilkinson and Michael Allen. Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers. Alan Apt, New Jersey, USA, 1999.
• D. Cheung, K. Hu, and S. Xia. Asynchronous Parallel Algorithm for Mining Association Rules on Shared-Memory Multi-Processors. Proc. 10th ACM Symp. on Parallel Algorithms and Architectures, ACM Press, NY, 1998, pp. 279-288.
• David Wai-Lok Cheung, Jiawei Han, Vincent Ng, Ada Wai-Chee Fu, and Yongjian Fu. A Fast Distributed Algorithm for Mining Association Rules. PDIS 1996, pp. 31-42.
• David Wai-Lok Cheung and Yongqiao Xiao. Effect of Data Distribution in Parallel Mining of Associations. Data Mining and Knowledge Discovery, 3(3): 291-314 (1999).
• E-H. Han, G. Karypis, and V. Kumar. Scalable parallel data mining for association rules. ACM SIGMOD Conf. on Management of Data, May 1997.
• J. S. Park, M. Chen, and P. S. Yu. Efficient parallel data mining for association rules. ACM Int. Conf. on Information and Knowledge Management, November 1995.
• Mohammed J. Zaki and Ching-Tien Ho (editors). Large-Scale Parallel Data Mining. Lecture Notes in Artificial Intelligence, State-of-the-Art Survey, Volume 1759, Springer-Verlag, 2000.

References: Parallel Frequent Itemset Mining (continued)

• Mohammed J. Zaki. Parallel and Distributed Association Mining: A Survey. IEEE Concurrency, special issue on Parallel Mechanisms for Data Mining, 7(4), pp. 14-25, December 1999.
• Osmar R. Zaïane, Mohammad El-Hajj, and Paul Lu. Fast Parallel Association Rule Mining without Candidacy Generation. Proc. IEEE 2001 Int. Conf. on Data Mining (ICDM'2001), San Jose, CA, USA.
• Mohammad El-Hajj and Osmar R. Zaïane. Parallel Association Rule Mining with Minimum Inter-Processor Communication. Fifth Int. Workshop on Parallel and Distributed Databases (PaDD'2003), in conjunction with the 14th Int. Conf. on Database and Expert Systems Applications (DEXA 2003), pp. 519-523, Prague, Czech Republic, September 1-5, 2003.
• R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A.I. Verkamo. Fast discovery of association rules. In Fayyad et al. (eds.), AAAI Press, Menlo Park, Calif., pp. 307-328, 1996.
• D. Skillicorn. Strategies for Parallel Data Mining. IEEE Concurrency, 7 (1999), pp. 26-35.
• Srinivasan Parthasarathy, Mohammed Zaki, Mitsunori Ogihara, and Wei Li. Parallel Data Mining for Association Rules on Shared-Memory Systems. Knowledge and Information Systems, 3(1), pp. 1-29, Feb 2001.
• Takahiko Shintani and Masaru Kitsuregawa. Hash-Based Parallel Algorithms for Mining Association Rules. Proc. IEEE Fourth Int. Conf. on Parallel and Distributed Information Systems, pp. 19-30, Dec. 1996.
• W. A. Maniatty and Mohammed J. Zaki. A Requirement Analysis for Parallel KDD Systems. 3rd IPDPS Workshop on High Performance Data Mining, Cancun, Mexico, May 2000.
