
PAKDD 2004 Tutorial
Advances and Issues in Frequent Pattern Mining

Osmar R. Zaïane
Mohammad El-Hajj

Database Laboratory, University of Alberta, Canada


What Is Frequent Pattern Mining?

• What is a frequent pattern?
  – A pattern (set of items, sequence, etc.) that occurs frequently in a database [AIS93]
• Frequent patterns are an important form of regularity:
  – What products were often purchased together? — beers and diapers!
  – What are the consequences of a hurricane?
  – What is the next target after buying a PC?
© Osmar R. Zaïane & Mohammad El-Hajj, 2004 — PAKDD 2004, Sydney — Tutorial: Advances in Frequent Pattern Mining

Why Study Frequent Pattern Mining?

• Frequent pattern mining is a foundation for several essential data mining tasks:
  – association, correlation, causality
  – sequential patterns
  – partial periodicity, cyclic/temporal associations
• Applications:
  – basket data analysis, cross-marketing, catalog design, loss-leader analysis
  – clustering, classification, Web log sequences, DNA analysis, etc.

General Outline

• Association Rules
• Different Frequent Patterns
• Different Lattice Traversal Approaches
• Different Transactional Layouts
• State-of-the-Art Algorithms
  – For All Frequent Patterns
  – For Frequent Closed Patterns
  – For Frequent Maximal Patterns
• Adding Constraints
• Parallel and Distributed Mining
• Visualization of Association Rules
• Frequent Sequential Pattern Mining

What Is Association Mining?

• Association rule mining searches for relationships between items in a dataset:
  – Finding association, correlation, or causal structures among sets of items or
    objects in transaction databases, relational databases, and other information
    repositories.
  – Rule form: “Body → Head [support, confidence]”.
• Examples:
  – buys(x, “bread”) → buys(x, “milk”) [0.6%, 65%]
  – major(x, “CS”) ^ takes(x, “DB”) → grade(x, “A”) [1%, 75%]

Basic Concepts

A transaction is a set of items: T = {i_a, i_b, …, i_t}, with T ⊆ I, where I is the
set of all possible items {i_1, i_2, …, i_n}. D, the task-relevant data, is a set of
transactions.

An association rule has the form P ⇒ Q, where P ⊂ I, Q ⊂ I, and P ∩ Q = ∅.
P ⇒ Q holds in D with support s, and P ⇒ Q has confidence c in the transaction
set D:

  Support(P ⇒ Q) = Probability(P ∪ Q)
  Confidence(P ⇒ Q) = Probability(Q | P)
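The two definitions above can be computed directly by counting transactions; here is a minimal sketch over a hypothetical toy database (the item names are illustrative only, not from the tutorial):

```python
def support(itemset, db):
    """support(P => Q) = Probability(P ∪ Q): the fraction of transactions
    that contain every item of the combined itemset."""
    return sum(1 for t in db if itemset <= t) / len(db)

def confidence(p, q, db):
    """confidence(P => Q) = Probability(Q | P) = support(P ∪ Q) / support(P)."""
    return support(p | q, db) / support(p, db)

# Hypothetical toy database
db = [frozenset(t) for t in ({'bread', 'milk'}, {'bread', 'milk', 'eggs'},
                             {'bread'}, {'milk'})]
print(support({'bread', 'milk'}, db))       # 0.5
print(confidence({'bread'}, {'milk'}, db))  # 0.666...
```

Note that confidence is just the rule's support renormalized by the support of its body, which is why both are computed from the same counts.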
Association Rule Mining

Association rule mining is a two-step process:
• Frequent Itemset Mining (FIM): find all itemsets meeting a support threshold
  (e.g., abc with support 2).
• Association Rules Generation: from each frequent itemset, emit the rules meeting
  a confidence threshold (e.g., ab → c, b → ac).

Frequent Itemset Generation

[Figure: the itemset lattice over items A–E, from the empty set (null) through all
2-, 3- and 4-itemsets up to ABCDE.]

• Frequent itemset generation is still computationally expensive: given d items,
  there are 2^d possible candidate itemsets.

Frequent Itemset Generation

• Brute-force approach (basic approach):
  – Each itemset in the lattice is a candidate frequent itemset
  – Count the support of each candidate by scanning the database:

      TID  Items
      1    Bread, Milk
      2    Bread, Diaper, Beer, Eggs
      3    Milk, Diaper, Beer, Coke
      4    Bread, Milk, Diaper, Beer
      5    Bread, Milk, Diaper, Coke

  – Match each of the N transactions (of width w) against every one of the M
    candidates
  – Complexity ~ O(NMw) ⇒ expensive, since M = 2^d !!!

An Influential Mining Methodology — The Apriori Algorithm

• The Apriori method:
  – Proposed by Agrawal & Srikant, 1994
  – A similar level-wise algorithm by Mannila et al., 1994
• Major idea:
  – A subset of a frequent itemset must be frequent
    • E.g., if {beer, diaper, nuts} is frequent, {beer, diaper} must be.
    • If an itemset is infrequent, none of its supersets can be frequent!
  – A powerful, scalable candidate-set pruning technique:
    • It reduces the number of candidate k-itemsets dramatically (for k > 2)

Illustrating the Apriori Principle

[Figure: the lattice over A–E. Frequent subsets sit toward the null end; once an
itemset is found infrequent, all of its supersets are pruned from the candidate
space.]

The Apriori Algorithm

  Ck: candidate itemsets of size k
  Lk: frequent itemsets of size k

  L1 = {frequent items};
  for (k = 1; Lk != ∅; k++) do begin
      Ck+1 = candidates generated from Lk;
      for each transaction t in database do
          increment the count of all candidates in Ck+1 that are contained in t;
      Lk+1 = candidates in Ck+1 with min_support;
  end
  return ∪k Lk;
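The level-wise pseudocode above can be sketched as a small Python program. This is a minimal toy implementation (real implementations count candidates with hash trees rather than one subset test per candidate):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise Apriori sketch: L1 -> C2 -> L2 -> ... Returns every frequent
    itemset with its absolute support count."""
    transactions = [frozenset(t) for t in transactions]
    # First pass: frequent single items (L1)
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    all_frequent = dict(frequent)
    k = 1
    while frequent:
        # Candidate generation: extend Lk by one item, prune any candidate
        # with an infrequent k-subset (the Apriori property)
        items = sorted({i for s in frequent for i in s})
        candidates = set()
        for s in frequent:
            for i in items:
                if i not in s:
                    c = s | {i}
                    if len(c) == k + 1 and all(frozenset(sub) in frequent
                                               for sub in combinations(c, k)):
                        candidates.add(frozenset(c))
        # One database pass counts all surviving candidates
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {c: n for c, n in counts.items() if n >= min_support}
        all_frequent.update(frequent)
        k += 1
    return all_frequent

# The slides' example database, minimum support 2
db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
result = apriori(db, 2)
print(result[frozenset({2, 3, 5})])  # 2
```

On the slides' example this reproduces L1–L3 exactly: {4} is pruned at level 1, {1,2} and {1,5} at level 2, and only {2,3,5} survives as a 3-itemset.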
The Apriori Algorithm — Example

Database D (minimum support > 1):

  TID  Items
  100  1 3 4
  200  2 3 5
  300  1 2 3 5
  400  2 5

Scan D → C1: {1}:2 {2}:3 {3}:3 {4}:1 {5}:3
L1 (prune {4}): {1}:2 {2}:3 {3}:3 {5}:3

C2 (from L1): {1 2} {1 3} {1 5} {2 3} {2 5} {3 5}
Scan D → {1 2}:1 {1 3}:2 {1 5}:1 {2 3}:2 {2 5}:3 {3 5}:2
L2: {1 3}:2 {2 3}:2 {2 5}:3 {3 5}:2

C3: {2 3 5} (note: {1,2,3}, {1,2,5} and {1,3,5} are not in C3 — each contains an
infrequent 2-subset)
Scan D → L3: {2 3 5}:2

Generating Association Rules from Frequent Itemsets

• Only strong association rules are generated.
• Frequent itemsets satisfy the minimum support threshold.
• Strong association rules satisfy the minimum confidence threshold.
• Confidence(P ⇒ Q) = Prob(Q | P) = Support(P ∪ Q) / Support(P)

  For each frequent itemset f, generate all non-empty subsets of f.
  For every non-empty subset s of f do
      output rule s ⇒ (f − s) if support(f)/support(s) ≥ min_confidence
  end
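The rule-generation procedure above translates almost line-for-line into code; a minimal sketch, fed with the frequent itemsets and supports found in the example:

```python
from itertools import combinations

def generate_rules(freq, min_conf):
    """From frequent itemsets with known supports, emit every rule s -> (f - s)
    whose confidence support(f)/support(s) meets min_conf."""
    rules = []
    for f, sup_f in freq.items():
        if len(f) < 2:
            continue
        for r in range(1, len(f)):                 # all non-empty proper subsets
            for s in combinations(f, r):
                s = frozenset(s)
                conf = sup_f / freq[s]             # Apriori guarantees s is in freq
                if conf >= min_conf:
                    rules.append((s, f - s, conf))
    return rules

# Frequent itemsets of the slides' example (minimum support 2)
freq = {frozenset({2}): 3, frozenset({3}): 3, frozenset({5}): 3,
        frozenset({2, 3}): 2, frozenset({2, 5}): 3, frozenset({3, 5}): 2,
        frozenset({2, 3, 5}): 2}
rules = generate_rules(freq, 1.0)
for s, c, conf in rules:
    print(sorted(s), "->", sorted(c), conf)
```

At 100% confidence this yields four rules: 2 ⇒ 5, 5 ⇒ 2, {2,3} ⇒ 5 and {3,5} ⇒ 2 — exactly the subsets whose support equals that of the whole itemset.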

Interestingness Measures

• play basketball ⇒ eat cereal [40%, 66.7%] is misleading:
  – The overall percentage of students eating cereal is 75%, which is higher than
    66.7%.
• play basketball ⇒ not eat cereal [20%, 33.3%] is less deceptive, although it has
  lower support and confidence.

              Basketball  Not basketball  Sum (row)
  Cereal      2000        1750            3750
  Not cereal  1000        250             1250
  Sum (col.)  3000        2000            5000

• We need a measure of dependent or correlated events:

  corr(A, B) = P(A ∪ B) / (P(A) P(B))

Adding Correlations or Lifts to Support and Confidence

• Example:
  – X and Y: positively correlated       X  1 1 1 1 0 0 0 0
  – X and Z: negatively correlated       Y  1 1 0 0 0 0 0 0
  – yet the support and confidence       Z  0 1 1 1 1 1 1 1
    of X ⇒ Z dominate:

    Rule   Support  Confidence
    X ⇒ Y  25%      50%
    X ⇒ Z  37.50%   75%

• P(B|A)/P(B) is also called the lift of rule A ⇒ B.
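The X/Y/Z table above makes a good check: lift exposes the negative X–Z correlation that support and confidence hide. A minimal sketch, encoding the eight rows of the table as itemsets:

```python
def support(itemset, rows):
    return sum(1 for r in rows if itemset <= r) / len(rows)

def lift(antecedent, consequent, rows):
    """lift(A -> B) = P(A and B) / (P(A) * P(B)):
    > 1 positive correlation, < 1 negative, = 1 independence."""
    return support(antecedent | consequent, rows) / (
        support(antecedent, rows) * support(consequent, rows))

# The slide's 8-row example: X = 11110000, Y = 11000000, Z = 01111111
rows = [frozenset(r) for r in ({'X', 'Y'}, {'X', 'Y', 'Z'}, {'X', 'Z'},
                               {'X', 'Z'}, {'Z'}, {'Z'}, {'Z'}, {'Z'})]
print(lift({'X'}, {'Y'}, rows))  # 2.0   (positively correlated)
print(lift({'X'}, {'Z'}, rows))  # ~0.857 (negative, despite 75% confidence)
```

So X ⇒ Z "dominates" on support and confidence yet has lift below 1, while X ⇒ Y has lift 2.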

Other Frequent Patterns

• A frequent pattern {a1, …, a100} implies C(100,1) + C(100,2) + … + C(100,100) =
  2^100 − 1 ≈ 1.27×10^30 frequent sub-patterns!
• Alternatives to mining all frequent patterns:
  – Frequent Closed Patterns
  – Frequent Maximal Patterns
• Maximal frequent itemsets ⊆ Closed frequent itemsets ⊆ All frequent itemsets

Frequent Closed Patterns

• For a frequent itemset X, if there exists no item y such that every transaction
  containing X also contains y, then X is a frequent closed pattern.
• In other words, X is closed if there is no item y, not already in X, that always
  accompanies X in all transactions where X occurs.
• A concise representation of frequent patterns: all frequent itemsets, with their
  supports, can be generated from the frequent closed ones.
• Reduces the number of patterns and rules.
• N. Pasquier et al., ICDT’99

Example — transactions {a,b,c,d}, {a,b,c}, {b,d}; minimum support 2:

  All frequent itemsets:      a:2 b:3 c:2 d:2 ab:2 ac:2 bc:2 bd:2 abc:2
  Frequent closed itemsets:   b:3 bd:2 abc:2
Frequent Maximal Patterns

• A frequent itemset X is maximal if there is no other frequent itemset Y that is
  a superset of X.
• In other words, no other frequent pattern includes a maximal pattern.
• An even more concise representation of frequent patterns, but the support
  information is lost: all frequent patterns can be generated from the frequent
  maximal ones, but without their respective supports.
• R. Bayardo, SIGMOD’98

Example — transactions {a,b,c,d}, {a,b,c}, {b,d}; minimum support 2:

  All frequent itemsets:       a:2 b:3 c:2 d:2 ab:2 ac:2 bc:2 bd:2 abc:2
  Frequent maximal itemsets:   bd:2 abc:2

Maximal vs. Closed Itemsets

  TID  Items
  1    ABC
  2    ABCD
  3    ABC
  4    ACDE
  5    DE

[Figure: the lattice over A–E, each itemset annotated with the set of transaction
ids containing it — e.g. A: 1.2.3.4, B: 1.2.3, C: 1.2.3.4, D: 2.4.5, E: 4.5,
AC: 1.2.3.4, ABC: 1.2.3, ACD: 2.4, ACDE: 4. Itemsets such as ABCDE are not
supported by any transaction.]
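Both definitions are simple containment tests over the full set of frequent itemsets; a small sketch that filters a support-annotated result down to its closed and maximal subsets, using the closed-pattern example from the slides:

```python
def closed_and_maximal(freq):
    """Given {frequent itemset: support}, return the closed itemsets (no
    superset with the same support) and the maximal ones (no frequent
    superset at all), per the definitions on the slides."""
    closed, maximal = {}, {}
    for x, sup in freq.items():
        supersets = [y for y in freq if x < y]
        if not any(freq[y] == sup for y in supersets):
            closed[x] = sup
        if not supersets:
            maximal[x] = sup
    return closed, maximal

# The slides' example: transactions {a,b,c,d}, {a,b,c}, {b,d}, min support 2
freq = {frozenset('a'): 2, frozenset('b'): 3, frozenset('c'): 2,
        frozenset('d'): 2, frozenset('ab'): 2, frozenset('ac'): 2,
        frozenset('bc'): 2, frozenset('bd'): 2, frozenset('abc'): 2}
closed, maximal = closed_and_maximal(freq)
print(sorted(''.join(sorted(s)) for s in closed))   # ['abc', 'b', 'bd']
print(sorted(''.join(sorted(s)) for s in maximal))  # ['abc', 'bd']
```

The output matches the slides: nine frequent itemsets collapse to three closed ones (b, bd, abc) and two maximal ones (bd, abc), and every maximal itemset is also closed.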

Maximal vs. Closed Itemsets (continued)

With minimum support 2 on the database above:
• ABC (tid-list 1.2.3) is closed and maximal.
• AC (tid-list 1.2.3.4) is closed but not maximal — ABC is a frequent superset.
• The maximal frequent itemsets form the frequent-pattern border between the
  frequent and infrequent parts of the lattice.

Mining the Pattern Lattice

• Breadth-First
  – Uses the current items at level k to generate the items of level k+1 (many
    database scans)
• Depth-First
  – Uses a current item at level k to generate all its supersets (favored when
    mining long frequent patterns)
• Hybrid approach
  – Mines in both directions at the same time
• Leap traversal approach
  – Jumps to selected nodes in the lattice
• There is also the notion of traversal direction: top-down (level k then level
  k+1) vs. bottom-up (level k+1 then level k), between the bottom of the lattice
  (null) and its top (the full itemset).

Breadth-First (Bottom-Up Example)

Database: T1 ABC, T2 ABCD, T3 ABC, T4 ACDE, T5 DE; minimum support 2.

• A superset becomes a candidate only if ALL of its subsets are frequent.
• Proceeding level by level from the bottom (null) upward, 18 candidates must be
  checked.

Depth-First (Top-Down Example)

Same database and support threshold.

• A subset becomes a candidate if it is marked or if one of its supersets is a
  candidate.
• Proceeding from the top (ABCDE) downward, 23 candidates must be checked.
One Hybrid Example

Same database (T1 ABC, T2 ABCD, T3 ABC, T4 ACDE, T5 DE), minimum support 2.

• Mines from both ends of the lattice at the same time; a superset is a candidate
  only if ALL of its subsets are frequent.
• 19 candidates must be checked.

Leap Traversal Example

• An itemset is a candidate if it is marked or if it is a subset of more than one
  infrequent marked superset.
• Only 10 candidates must be checked, yielding 5 frequent patterns; the supports
  of unmarked patterns are obtained without checking them directly.
• How to find the support of an itemset:
  1. A full scan of the database, OR
  2. Intelligent techniques: the support of an itemset is the summation of the
     supports of its marked supersets.

Finding Maximal Patterns Using the Leap Traversal Approach

Database: T1 ABC, T2 ABCD, T3 ABC, T4 ACDE, T5 DE; minimum support 2.

• Selectively jumps in the lattice.
• Finds local maximals, then generates all patterns from these local maximals.

Step 1: Define the actual paths — mark the distinct transactions as paths in the
lattice (ABC: 1.2.3, ABCD: 2, ACDE: 4, DE: 4.5).

Step 2: Intersect the non-frequent marked paths with each other (e.g.
ABCD ∩ ACDE = ACD).

Step 3: Remove the non-frequent paths, and the frequent paths that are subsets
of other frequent paths.
Finding Maximal Patterns Using the Leap Traversal Approach (result)

With minimum support 2, the surviving local maximals are the maximal frequent
patterns: ABC, ACD and DE.

When to Use a Given Strategy

• Breadth-First
  – Suitable for short frequent patterns
  – Unsuitable for long frequent patterns
• Depth-First
  – Suitable for long frequent patterns
  – In general not scalable when long candidate patterns are not frequent
• Leap Traversal
  – Suitable for cases having short and long frequent patterns simultaneously
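The three leap steps can be sketched compactly. This is a simplification of the slides' procedure, assuming the marked paths are simply the distinct transactions: infrequent paths are intersected to jump to promising nodes, and the frequent survivors are filtered down to maximals:

```python
from itertools import combinations

def leap_maximal(transactions, min_support):
    """Leap-traversal sketch: mark distinct transactions as lattice paths,
    intersect the infrequent ones to find local maximals, keep the frequent
    maximal survivors (simplified from the tutorial's description)."""
    db = [frozenset(t) for t in transactions]
    sup = lambda s: sum(1 for t in db if s <= t)
    paths = set(db)                                   # Step 1: mark actual paths
    frequent = {p for p in paths if sup(p) >= min_support}
    infrequent = paths - frequent
    changed = True
    while changed:                                    # Step 2: intersect
        changed = False
        for a, b in combinations(sorted(infrequent, key=sorted), 2):
            c = a & b
            if c and c not in frequent | infrequent:
                (frequent if sup(c) >= min_support else infrequent).add(c)
                changed = True
    # Step 3: drop frequent paths subsumed by other frequent paths
    return {p for p in frequent if not any(p < q for q in frequent)}

db = [{'A', 'B', 'C'}, {'A', 'B', 'C', 'D'}, {'A', 'B', 'C'},
      {'A', 'C', 'D', 'E'}, {'D', 'E'}]
print(sorted(''.join(sorted(p)) for p in leap_maximal(db, 2)))  # ['ABC', 'ACD', 'DE']
```

On the slides' running example the only intersection needed is ABCD ∩ ACDE = ACD (support 2), and the result is the maximal set {ABC, ACD, DE}.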

Empirical Tests

Connect database (long frequent patterns):

  Support %  Size of Largest  F1-Size  Total Frequent  Breadth  Depth   Leap
  95         9                17       2205            15626    6654    3044
  90         12               21       27127           394648   144426  29263
  80         15               28       119418          —        —       120177
  75         17               30       176365          —        —       177244
  70         18               31       627952          —        —       628963
  65         19               33       1368337         —        —       1369585
  60         20               36       2908632         —        —       2910175
  55         21               37       5996892         —        —       5998715

T10I4D100K (short frequent patterns):

  Support  Size of Largest  F1-Size  Total Frequent  Breadth  Depth    Leap
  1000     3                375      385             20       10       29
  750      4                463      561             262      124      291
  500      5                569      1073            1492     731      1546
  250      8                717      7703            36994    36309    26666
  100      10               797      27532           264645   458275   200113
  50       10               839      53385           1129779  4923723  1067050

Chess (mixed-length frequent patterns):

  Support %  Size of Largest  F1-Size  Total Frequent  Breadth  Depth   Leap
  95         5                9        78              165      90      136
  90         7                13       628             2768     1842    1191
  85         8                16       2690            20871    9667    4318
  80         10               20       8282            91577    49196   12021
  75         11               23       20846           292363   160362  28986
  70         13               24       48939           731740   560103  65093

[Charts: for each dataset, the number of generated candidates relative to the
number of frequent patterns, plotted against support %, for the Breadth, Depth
and Leap strategies.]

Transactional Layouts

• Horizontal Layout
  – Each transaction is recorded as a list of items:

      Transaction ID  Items
      1               A G D C B
      2               B C H E D
      3               B D E A M
      4               C E F A N
      5               A B N O P
      6               A C Q R G
      7               A C H I G
      8               L E F K B
      9               A F M N O
      10              C F P J R
      11              A D B H I
      12              D E B K L
      13              M D C G O
      14              C F P Q J
      15              B D E F I
      16              J E B A D
      17              A K E F C
      18              C D L B A

  – Candidacy generation can be removed (FP-Growth)
  – Superfluous processing

Transactional Layouts

• Vertical Layout
  – A tid-list is kept for each item:

      Item  Transaction IDs
      A     1 3 4 5 6 7 9 11 16 17 18
      B     1 2 3 5 8 11 12 15 16 18
      C     1 2 4 6 7 10 13 14 17 18
      D     1 2 3 11 12 13 15 16 18
      E     2 3 4 8 12 15 16 17
      F     4 8 9 10 14 15 17
      G     1 6 7 13
      H     2 7 11
      I     7 11 15
      J     10 14 16
      K     8 12 17
      L     8 12 18
      M     3 9 13
      N     4 5 9
      O     5 9 13
      P     5 10 14
      Q     6 14
      R     6 10

  – Minimizes superfluous processing
  – Candidacy generation is required

• Bitmap Layout
  – A matrix: rows represent transactions, columns represent items; if the item
    occurs in the transaction the cell value is 1, otherwise 0. For example, T1
    (A G D C B) has 1s in the A, B, C, D and G columns and 0s elsewhere.
  – Similar to the horizontal layout
  – Suitable for datasets with small dimensionality
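With the vertical layout, support counting reduces to tid-list intersection — the idea behind ECLAT, one of the algorithms named later in the tutorial. A minimal sketch over a hypothetical three-item vertical database:

```python
def eclat_supports(tidlists, min_support, prefix=frozenset()):
    """ECLAT-style sketch: the tid-list of an itemset is the intersection of
    its items' tid-lists, so support counting needs no database rescans."""
    out = {}
    items = sorted(tidlists)
    for i, a in enumerate(items):
        tids = tidlists[a]
        if len(tids) < min_support:
            continue
        itemset = prefix | {a}
        out[itemset] = len(tids)
        # Conditional vertical database for extensions of `itemset`
        ext = {b: tids & tidlists[b] for b in items[i + 1:]}
        out.update(eclat_supports(ext, min_support, itemset))
    return out

# Tiny vertical database: item -> set of transaction ids (illustrative only)
vdb = {'A': {1, 3, 4}, 'B': {1, 2, 3}, 'C': {2, 3, 4}}
res = eclat_supports(vdb, 2)
print(res[frozenset({'A', 'B'})])  # 2  ({1,3,4} ∩ {1,2,3} = {1,3})
```

Each recursive call carries only the intersected tid-lists, which is why the vertical layout minimizes superfluous processing even though candidacy generation is still required.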
Transactional Layouts

• Inverted Matrix Layout (El-Hajj and Zaiane, ACM SIGKDD’03)
  – Minimizes superfluous processing
  – Candidacy generation can be reduced
  – Appropriate for interactive mining

Items are sorted by ascending support in an index (location, item, support). Each
occurrence of an item in a transaction is a cell of the transactional array holding
a (row, column) pointer to the next item of the same transaction in the matrix
order; (¤,¤) marks the end of a transaction.

  Loc  Item  Sup  1       2       3       4       5       6       7       8       9       10      11
  1    R     2    (2,1)   (3,2)
  2    Q     2    (12,2)  (3,3)
  3    P     3    (4,1)   (9,1)   (9,2)
  4    O     3    (5,2)   (5,3)   (6,3)
  5    N     3    (13,1)  (17,4)  (6,2)
  6    M     3    (14,2)  (13,3)  (12,4)
  7    L     3    (8,1)   (8,2)   (15,9)
  8    K     3    (13,2)  (14,5)  (13,7)
  9    J     3    (13,4)  (13,5)  (14,7)
  10   I     3    (11,2)  (11,3)  (13,6)
  11   H     3    (14,1)  (12,3)  (15,4)
  12   G     4    (15,1)  (16,4)  (16,5)  (15,6)
  13   F     7    (14,3)  (14,4)  (18,7)  (16,6)  (16,8)  (14,6)  (14,8)
  14   E     8    (15,2)  (15,3)  (16,3)  (17,5)  (15,5)  (15,7)  (15,8)  (16,9)
  15   D     9    (16,1)  (16,2)  (17,2)  (17,6)  (17,7)  (16,7)  (17,8)  (17,9)  (16,10)
  16   C     10   (17,1)  (17,2)  (18,3)  (18,5)  (18,6)  (¤,¤)   (¤,¤)   (¤,¤)   (18,10)  (17,10)
  17   B     10   (18,1)  (¤,¤)   (18,2)  (18,4)  (¤,¤)   (18,8)  (¤,¤)   (¤,¤)   (18,9)   (18,11)
  18   A     11   (¤,¤)   (¤,¤)   (¤,¤)   (¤,¤)   (¤,¤)   (¤,¤)   (¤,¤)   (¤,¤)   (¤,¤)    (¤,¤)   (¤,¤)

Why The Matrix Layout?

• Interactive mining: in the usual KDD pipeline (databases / data warehouse →
  selection and transformation → data mining → evaluation and knowledge
  presentation), changing the support level means expensive steps — the whole
  process is redone.

Why The Matrix Layout?

Repetitive tasks and repetitive (I/O) readings cause superfluous processing in
the horizontal layout:

• With support > 4, the frequent 1-itemsets are {A, B, C, D, E, F}, yet every
  scan still reads the infrequent items {G, H, I, J, K, L, M, N, O, P, Q, R} in
  all 18 transactions, contributing nothing.
• With support > 9, only {A, B, C} are frequent, and every scan still reads the
  infrequent items {D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R}.

Transactional Layouts

• Inverted Matrix Layout — the support border
  – Because the index is sorted by ascending support, a minimum-support threshold
    (here support > 4) defines a border in the matrix: the rows above it (R, Q, P,
    O, N, M, L, K, J, I, H, G) contain only infrequent items and are never read.
    Mining traverses only the rows of the frequent items F, E, D, C, B and A
    below the border.
Transactional Layouts

• Inverted Matrix Layout, support > 4: only the frequent rows (F, E, D, C, B, A)
  of the transactional array are read, which corresponds to the transactions
  reduced to their frequent items:

    T1  A D C B     T7   A C        T13  D C
    T2  B C E D     T8   E F B      T14  C F
    T3  B D E A     T9   A F        T15  B D E F
    T4  C E F A     T10  C F        T16  E B A D
    T5  A B         T11  A D B      T17  A E F C
    T6  A C         T12  D E B      T18  C D B A

The Algorithms (State of the Art)

• All frequent patterns: Apriori, FP-Growth, COFI*, ECLAT
• Frequent closed patterns: CHARM, CLOSET+, COFI-CLOSED
• Frequent maximal patterns: MaxMiner, MAFIA, GENMAX, COFI-MAX

All-Apriori

Problems with Apriori (R. Agrawal, R. Srikant, VLDB’94):
• Repetitive I/O scans — a high number of data scans
• Huge computation to generate candidate itemsets
• Generation of candidate itemsets is expensive (huge candidate sets):
  – 10^4 frequent 1-itemsets will generate ~10^7 candidate 2-itemsets
  – To discover a frequent pattern of size 100, e.g., {a1, a2, …, a100}, one needs
    to generate 2^100 ≈ 10^30 candidates.

Frequent Pattern Growth

• The first algorithm that allows frequent pattern mining without generating
  candidate sets
• Requires a Frequent Pattern Tree (FP-tree)

All-FP-Growth

FP-Growth (J. Han, J. Pei, Y. Yin, SIGMOD’00):
• 2 I/O scans
• Reduced candidacy generation
• High memory requirements
• Claims to be 1 order of magnitude faster than Apriori
• Builds an FP-tree, then recursively mines conditional pattern bases and
  conditional FP-trees

Frequent Pattern Tree — first scan (required support: 3)

Transactions:
  F, A, C, D, G, I, M, P
  A, B, C, F, L, M, O
  B, F, H, J, O
  A, F, C, E, L, P, M, N
  B, C, K, S, P
  F, M, C, B, A

Item counts: F:5, C:5, A:4, B:4, M:4, P:3 (frequent);
D:1, E:1, G:1, H:1, I:1, J:1, K:1, L:1, O:1, … (infrequent, discarded).

Frequent Pattern Tree — second scan

Each transaction is reduced to its frequent items, reordered by descending
frequency (e.g., F, A, C, D, G, I, M, P → F, C, A, M, P). The ordered
transactions of the running example — F,C,A,M,P; F,C,A,B,M; F,B; C,B,P;
F,C,A,M,P; C,A,M; F,B — are inserted one at a time, sharing common prefixes:

• F,C,A,M,P: a single path root–F:1–C:1–A:1–M:1–P:1
• F,C,A,B,M: shares the prefix F,C,A (counts become 2) and branches into B:1–M:1
• F,B: F becomes 3, with a new child B:1
• C,B,P: a new branch C:1–B:1–P:1 directly under the root
• F,C,A,M,P: increments the first path (F:4, C:3, A:3, M:2, P:2)
• C,A,M: the root’s C becomes 2 and branches into A:1–M:1
• F,B: F becomes 5 and its child B becomes 2

Final tree (the header table F:5, C:5, A:4, B:4, M:4, P:3 links to the nodes of
each item):

  root
    F:5
      C:3
        A:3
          M:2
            P:2
          B:1
            M:1
      B:2
    C:2
      B:1
        P:1
      A:1
        M:1

All-FP-Growth

[Figure, slide 59: the complete FP-tree, with counts F:5, C:3, A:3, M:2, P:2 along the main branch plus the B and C-rooted side branches]

Frequent Pattern Growth mines the tree bottom-up from the header table (F:5, C:5, A:4, B:4, M:4, P:3), starting with the least frequent item P. Reading P's prefix paths gives its conditional pattern base, with item counts C:3, F:2, A:2, M:2, B:1; only C reaches the required support of 3, so P's conditional FP-tree is the single node C:3, yielding <C:3, P:3>.
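The step mined on this slide, collecting P's conditional pattern base and keeping only its locally frequent items, can be sketched as follows (the pattern base is entered by hand from the tree, and `conditional_frequent_items` is a hypothetical helper name):

```python
from collections import Counter

def conditional_frequent_items(pattern_base, min_sup):
    """Count the items occurring in a conditional pattern base, given as
    (prefix_path, count) pairs, and keep those meeting min_sup."""
    counts = Counter()
    for path, cnt in pattern_base:
        for item in path:
            counts[item] += cnt
    return {i: c for i, c in counts.items() if c >= min_sup}

# P's conditional pattern base read off the tree: the two F-C-A-M
# prefix paths (combined count 2) and one C-B path (count 1).
base_P = [(("F", "C", "A", "M"), 2), (("C", "B"), 1)]
locally_frequent = conditional_frequent_items(base_P, 3)  # only C:3 survives
```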
All-FP-Growth

[Figure, slide 61: mining item M. Its conditional pattern bases are F:2, C:2, A:2; F:1, C:1, A:1, B:1; and C:1, A:1. FP-Growth recursively builds and mines the conditional trees for each of them]

All-COFI

COFI algorithm big picture:
– 2 I/O scans
– Reduced candidacy generation
– Small memory footprint
[Figure, slide 62: from the FP-tree, one small Co-Occurrence Frequent-Item tree is built and mined at a time; the M-COFI-tree holds the branches <F:3, C:3, A:3, M:3> and <C:1, A:1> and is discarded before the A, C and F COFI-trees are built]
El-Hajj and Zaïane, DaWak'03

All-COFI
Co-Occurrences Frequent Item tree

Start with item P: find the locally frequent items with respect to P. Only C:3 qualifies, giving the single frequent-path-base PC:3; all subsets of PC:3 are frequent and have the same support.
[Figure, slide 63: the P-COFI-tree, root P:3:0 with child C:3:0; each node carries a support count and a participation count]

Then item M: the locally frequent items with respect to M are A:4, C:4, F:3.
[Figure, slide 64: the M-COFI-tree, root M:4:0 with a branch through A:4:0, C:4:0, F:3:0; walking that branch yields the frequent-path-base FCA:3]

All-COFI
Co-Occurrences Frequent Item tree

[Figures, slides 65–66: after the frequent-path-base FCA:3 is recorded, the participation counters along its branch are raised to 3 (M:4:3, A:4:3, C:4:3, F:3:3); the remaining branch then contributes the second frequent-path-base CA:1]
All-COFI
Co-Occurrences Frequent Item tree: how to mine the frequent-path-bases

Three approaches; in all of them, the support of any candidate pattern is the summation of the supports of the frequent-path-bases that are supersets of it.

1: Bottom-Up
2: Top-Down
[Figure, slides 67–68: the lattice generated from the bases FCA:3 and CA:1, with pairs CA:4, CF:3, AF:3 and singletons C:4, A:4, F:3]
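The support rule stated above (a candidate's support is the sum over the frequent-path-bases that contain it) is easy to sketch; the base list is M's from the example, and the function name is illustrative:

```python
def support_from_path_bases(pattern, path_bases):
    """COFI support rule: a candidate's support is the sum of the
    supports of the frequent-path-bases that are supersets of it."""
    p = set(pattern)
    return sum(cnt for base, cnt in path_bases if p <= set(base))

bases = [("FCA", 3), ("CA", 1)]          # M's frequent-path-bases
sup_CA = support_from_path_bases("CA", bases)   # contained in both: 3 + 1 = 4
sup_CF = support_from_path_bases("CF", bases)   # contained only in FCA: 3
```

This reproduces the lattice counts on the slide: CA:4, CF:3, AF:3, C:4, A:4, F:3.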

All-COFI All-ECLAT
Co-Occurrences Frequent Item tree ECLAT
How to mine frequent-path-bases • For each item, store a list of transaction ids
(tids) Horizontal
Three approaches: Data Layout
Support of any pattern is the Vertical Data Layout
3: Leap-Traversal summation of the supports of its TID Items
A B C D E
supersets of frequent-path-bases 1 A,B,E 1 1 2 2 1
2 B,C,D 4 2 3 4 3
1) Intersect non frequent path bases
FCA: 3 3 C,E 5 5 4 5 6
FCA: 3 ∩ CA:1 = CA 4 A,C,D 6 7 8 9
5 A,B,C,D 7 8 9
2) Find subsets of the only CA:4 CF:3 AF:3 6 A,E 8 10
7 A,B 9
frequent paths (sure to be frequent) 8 A,B,C
C:4 A:4 F:3
3) Find the support of each pattern 9 A,C,D
10 B TID-list
M.J.Zaki IEEE transactions on Knowledge and data Engineering 00
© Osmar R. Zaïane & Mohammad El-Hajj, 2004 PAKDD 2004, Sydney Tutorial: Advances in Frequent Pattern Mining [69] © Osmar R. Zaïane & Mohammad El-Hajj, 2004 PAKDD 2004, Sydney Tutorial: Advances in Frequent Pattern Mining [70]

All-ECLAT
ECLAT
• Determine the support of any k-itemset by intersecting the tid-lists of two of its (k-1)-subsets, e.g. t(A) = {1, 4, 5, 6, 7, 8, 9} ∧ t(B) = {1, 2, 5, 7, 8, 10} → t(AB) = {1, 5, 7, 8}.
• ECLAT first finds all frequent patterns with respect to item A (AB, AC, …, ABC, ABD, ACD, ABCD, …), then all frequent patterns with respect to item B (BC, BD, …, BCD, BDE, BCDE, …), and so on.
• 3 traversal approaches: top-down, bottom-up and hybrid
• Advantage: very fast support counting; few scans of the database (best case 2)
• Disadvantage: intermediate tid-lists may become too large for memory
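The tid-list intersection at the heart of ECLAT, shown here on the slides' ten-transaction example (the function name is illustrative):

```python
from functools import reduce

def tidlist_support(itemset, tidlists):
    """Intersect the per-item tid-lists to get the itemset's tid-list;
    the support is the size of the intersection."""
    tids = reduce(set.intersection, (tidlists[i] for i in itemset))
    return tids, len(tids)

# Vertical layout of the 10-transaction example from the slides.
tidlists = {'A': {1, 4, 5, 6, 7, 8, 9}, 'B': {1, 2, 5, 7, 8, 10},
            'C': {2, 3, 4, 5, 8, 9},    'D': {2, 4, 5, 9},
            'E': {1, 3, 6}}
tids_AB, sup_AB = tidlist_support("AB", tidlists)   # {1, 5, 7, 8}, 4
```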
CLOSED-CHARM
CHARM
– Uses the vertical data format
– Delivers closed patterns based on tid-set intersections (e.g. t(A) ∧ t(B) → t(AB))
– Diffsets: do not store the entire tidsets, only the differences between tidsets
– Uses diffsets to accelerate mining

Diffset intersections (example):

TIDSET database (tids 1–6):
t(A) = 1, 3, 4, 5
t(C) = 1, 2, 3, 4, 5, 6
t(D) = 2, 4, 5, 6
t(T) = 1, 3, 5, 6
t(W) = 1, 2, 3, 4, 5

DIFFSET database:
d(A) = 2, 6
d(C) = ∅
d(D) = 1, 3
d(T) = 2, 4
d(W) = 6

Diffset recurrence: d(AD) = d(D) − d(A) = {1, 3} − {2, 6} = {1, 3}
or, equivalently, d(AD) = t(A) − t(D) = {1, 3, 4, 5} − {2, 4, 5, 6} = {1, 3}
Support: σ(AD) = σ(A) − |d(AD)| = 4 − 2 = 2

[Figure, slide 74: the prefix tree of intersections over A, C, D, T, W, from the pairs (AC, AD, AT, AW, CD, CT, CW, DT, DW, TW) through the triples (ACT, ACW, ATW, CDW, CTW) down to ACTW]

M. J. Zaki and C.-J. Hsiao, SIAM'02
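The diffset bookkeeping above can be checked numerically; note that the support recurrence subtracts from the support of the prefix A, i.e. σ(AD) = σ(A) − |d(AD)|. A small sketch with illustrative names:

```python
def diffset(t_x, t_y):
    """d(XY) = t(X) - t(Y): the tids where X occurs but Y does not."""
    return t_x - t_y

# The slide's example over tids 1..6.
tA, tD = {1, 3, 4, 5}, {2, 4, 5, 6}
d_AD = diffset(tA, tD)            # {1, 3}
sup_AD = len(tA) - len(d_AD)      # sigma(AD) = sigma(A) - |d(AD)| = 2
```

Storing only these small differences instead of full tidsets is what makes deep intersections cheap in CHARM.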

CLOSED-CLOSET+
CLOSET+
– Uses the horizontal format of transactions
– Based on the FP-Growth model
– Divides the search space
– Finds closed itemsets recursively
– Two-level hash-indexed result-tree structure for dense datasets
– Pseudo-projection based upward-checking for sparse datasets

• Two-level hash-indexed result tree
• Compressed result-tree structure
• Search-space shrinking for subset checking
– If itemset Sc can be absorbed by another already mined itemset Sa, they have the following relationships:
1) sup(Sc) = sup(Sa)
2) length(Sc) < length(Sa)
3) ∀ i ∈ Sc: i ∈ Sa
– Measures to enhance the checking:
» Two-level hash indices: support and itemID
» Record length information in each result-tree node

J. Wang, J. Han, J. Pei. SIGKDD'03

CLOSED-CLOSET+
CLOSET+
Pseudo-projection based upward checking
• The result-tree may consume much more memory for sparse datasets
• Subset checking without maintenance of history itemsets:
– For a certain prefix X, as long as we can find any item which (1) appears in each prefix path w.r.t. prefix X, and (2) does not belong to X, any itemset with prefix X will be non-closed; otherwise, if there is no such item, the union of X and the complete set of its locally frequent items with support sup(X) forms a closed itemset.
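The upward check described above (an item outside the prefix that appears in every prefix path proves non-closedness) is a short set computation. A sketch, with the prefix paths given explicitly as tuples and `has_closure_blocker` a hypothetical name:

```python
def has_closure_blocker(prefix, prefix_paths):
    """CLOSET+-style upward check: True if some item outside `prefix`
    occurs in every prefix path, i.e. no itemset with this prefix
    can be closed on its own."""
    common = set.intersection(*(set(p) for p in prefix_paths))
    return bool(common - set(prefix))

# B appears in both prefix paths of prefix (A,), so A always co-occurs
# with B here and prefix A alone cannot lead to a closed itemset.
blocked = has_closure_blocker(("A",), [("B", "C"), ("B", "D")])   # True
clear = has_closure_blocker(("A",), [("B", "C"), ("D",)])         # False
```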
CLOSED-COFI-CLOSED
COFI-CLOSED
– Uses the horizontal format
– COFI mining approach
– Leap-traversal approach
– Divides the search space
– One global tree holds the discovered closed patterns

Example (finding closed frequent patterns for a COFI-tree with frequent-path-bases ABCD:2, ABCE:1, ADE:1, ABCDE:1):
Step 1: Assign the frequent-path-bases to their locations in OPB
Step 2: Apply the leap-traversal approach on the frequent-path-bases
Step 3: Find the global support of each candidate pattern
Step 4: Remove non-frequent patterns, and frequent patterns that have a superset of frequent patterns with the same support
[Figures, slides 79–80: the OPB table after each step, with intersection-generated candidates such as ABC:4, AD:4, AE:3, ABCD:3, ABCE:2, ADE:2]

M. El-Hajj and O. Zaïane

MAXIMAL-MAXMINER
MaxMiner: Mining Max-patterns

TID  Items
10   A, B, C, D, E
20   B, C, D, E
30   A, C, D, F

• 1st scan: find the frequent items: A, B, C, D, E
• 2nd scan: find the support of the potential max-patterns: AB, AC, AD, AE, ABCDE; BC, BD, BE, BCDE; CD, CE, CDE
• Since BCDE is a max-pattern, there is no need to check BCD, BDE or CDE in a later scan

– Abandons a bottom-up traversal
– Attempts to "look ahead": identify a long frequent itemset, then prune all its subsets
– Uses a set-enumeration tree
[Figure, slide 82: the set-enumeration tree over items 1–4, from {} through 1, 2, 3, 4 and their combinations down to 1,2,3,4]

R. J. Bayardo, ACM SIGMOD'98
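MaxMiner's look-ahead can be sketched directly on the slide's three-transaction example (support counting is naive here, and the function names are illustrative):

```python
def support(itemset, transactions):
    """Naive support count: how many transactions contain the itemset."""
    s = set(itemset)
    return sum(1 for t in transactions if s <= set(t))

def lookahead_prune(head, tail, transactions, min_sup):
    """MaxMiner look-ahead: if head + tail is frequent as a whole, every
    subset is frequent too, so the subtree below `head` can be skipped."""
    return support(head + tail, transactions) >= min_sup

db = [list("ABCDE"), list("BCDE"), list("ACDF")]
# With min_sup = 2, B with tail {C, D, E} gives BCDE, which is frequent,
# so BCD, BDE and CDE never need to be counted individually.
pruned = lookahead_prune(["B"], ["C", "D", "E"], db, 2)   # True
```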

MAXIMAL-MAFIA
MAFIA: A Maximal Frequent Itemset Algorithm for Transactional Databases
– Vertical bitmap representation
– Look-ahead pruning technique
– Integrates a depth-first traversal of the itemset lattice with effective pruning mechanisms
[Figure, slide 83: the itemset lattice over items 1–4, organized by levels 0–4, with a cut marking the pruned region]

Pruning mechanisms:
• PEP (Parent Equivalence Pruning): for newNode = C ∪ i (where i is a child of C), check newNode.support == C.support; if so, move i from C.tail to C.head
• FHUT (Frequent Head Union Tail): newNode = C ∪ I (where I is the leftmost child in the tail)
• HUTMFI: check whether Head ∪ Tail is in the MFI; if so, stop searching and return

D. Burdick, M. Calimlim, and J. Gehrke, ICDE'01
MAXIMAL-GENMAX
GenMax
– Progressive focusing to improve superset checking
– Superset checking techniques:
  – Do the superset check only for Il+1 ∪ Pl+1
  – Use a check_status flag
  – Local maximal frequent itemsets
  – Reorder the combine set
– Vertical approach
– The database needs to be loaded into main memory
K. Gouda and M. Zaki, ICDM'01

MAXIMAL-COFIMAX
COFI-MAX
– COFI approach
– Leap-traversal searching
– Pruning techniques:
(A) Skip building some COFI-trees
(B) Remove all frequent-path-bases that are subsets of already found maximal patterns
(C) Count the support of frequent-path-bases early, and remove the frequent ones from the leap-traversal approach
M. El-Hajj and O. Zaïane

MAXIMAL-COFIMAX
COFI-MAX
Example (finding maximal frequent patterns for a COFI-tree with frequent-path-bases ABCD:2, ABCE:1, ADE:1, ABCDE:1):
Step 1: Assign the frequent-path-bases to their locations in OPB
Step 2: Find the global support of each frequent-path-base
Step 3: Apply the leap-traversal approach on the non-frequent frequent-path-bases
Step 4: Remove non-frequent patterns, and frequent patterns that have a superset among the frequent patterns
[Figures, slide 87: the OPB table after each step]

Mining relatively small datasets
[Charts, slide 88: run time in seconds vs. number of transactions (10K–250K) on small synthetic datasets; COFI-ALL vs. FP-Growth vs. MAFIA for all patterns, COFI-CLOSED vs. FPCLOSED vs. MAFIA for closed patterns, and COFI-MAX vs. FPMAX vs. MAFIA for maximal patterns]

Mining extremely large datasets
[Charts, slide 89: run time in seconds vs. number of transactions (5M–100M); COFI-ALL vs. FP-Growth for all patterns, COFI-CLOSED vs. FP-CLOSED for closed patterns, and COFI-MAX vs. FPMAX for maximal patterns]

Mining real datasets (low and high support)
[Charts, slide 90: run time in seconds vs. support on the pumsb dataset; all patterns at high support (90%–65%) with COFI-ALL, FP-Growth, MAFIA and Eclat; closed patterns at high support with COFI-CLOSED, FP-CLOSE, MAFIA and CHARM; maximal patterns at low support (40%–15%) with COFI-MAX, FPMAX, MAFIA and GENMAX]
Which algorithm is the winner?

Not clear yet. With relatively small datasets we can find different winners:
1. By using different datasets
2. By changing the support level
3. By changing the implementations

What about extremely large datasets (hundreds of millions of transactions)?
– Most of the existing algorithms do not run on such sizes
– Vertical approaches and bitmap approaches cannot load the transactions into main memory
– Approaches based on repeated scans cannot afford to scan these huge databases many times

Requirements: we need algorithms that
1) do not require multiple scans of the database
2) leave a small footprint in main memory at any given time

Constraint-based Data Mining
Restricting Association Rules

• Finding all the patterns in a database autonomously? — unrealistic!
– The patterns could be too many but not focused!
• Data mining should be an interactive process
– The user directs what is to be mined using a data mining query language (or a graphical user interface)
• Constraint-based mining
– User flexibility: provides constraints on what is to be mined
– System optimization: explores such constraints for efficient mining

Restricting association rules is useful for interactive and ad-hoc mining: it reduces the set of association rules discovered and confines them to more relevant rules.
• Before mining
✓ Knowledge type constraints: classification, etc.
✓ Data constraints: SQL-like queries (DMQL)
✓ Dimension/level constraints: relevance to some dimensions and some concept levels
• While mining
✓ Rule constraints: form, size, and content
✓ Interestingness constraints: support, confidence, correlation
• After mining
✓ Querying association rules

Constrained Frequent Pattern Mining: A Mining Query Optimization Problem
• Given a frequent pattern mining query with a set of constraints C, the algorithm should be:
– sound: it only finds frequent sets that satisfy the given constraints C
– complete: all frequent sets satisfying the given constraints C are found
• A naïve solution: first find all frequent sets, and then test them for constraint satisfaction
• More efficient approaches:
– Analyze the properties of constraints comprehensively
– Push them as deeply as possible inside the frequent pattern computation

Rule Constraints in Association Mining
• Two kinds of rule constraints:
– Rule form constraints: meta-rule guided mining
• P(x, y) ^ Q(x, w) −> takes(x, "database systems")
– Rule content constraints: constraint-based query optimization (Ng, et al., SIGMOD'98)
• sum(LHS) < 100 ^ min(LHS) > 20 ^ count(LHS) > 3 ^ sum(RHS) > 1000
• 1-variable vs. 2-variable constraints (Lakshmanan, et al., SIGMOD'99):
– 1-var: a constraint confining only one side (L/R) of the rule, e.g., as shown above
– 2-var: a constraint confining both sides (L and R)
• sum(LHS) < min(RHS) ^ max(RHS) < 5 * sum(LHS)
Anti-Monotonicity in Constraint-Based Mining
• Anti-monotonicity: when an itemset S violates the constraint, so does any of its supersets
– sum(S.Price) ≤ v is anti-monotone
– sum(S.Price) ≥ v is not anti-monotone
• Example. C: range(S.profit) ≤ 15 is anti-monotone
– Itemset ab violates C
– So does every superset of ab

Monotonicity in Constraint-Based Mining
• Monotonicity: when an itemset S satisfies the constraint, so does any of its supersets
– sum(S.Price) ≥ v is monotone
– min(S.Price) ≤ v is monotone
• Example. C: range(S.profit) ≥ 15 is monotone
– Itemset ab satisfies C
– So does every superset of ab

TDB (min_sup = 2):
TID  Transaction
10   a, b, c, d, f
20   b, c, d, f, g, h
30   a, c, d, e, f
40   c, e, f, g

Item  Profit
a     40
b     0
c     -20
d     10
e     -30
f     30
g     20
h     -10
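With the profit table above, the anti-monotone behaviour of range(S.profit) ≤ 15 can be verified directly. A small sketch; `rng` is an illustrative helper name:

```python
profit = {'a': 40, 'b': 0, 'c': -20, 'd': 10, 'e': -30,
          'f': 30, 'g': 20, 'h': -10}

def rng(itemset):
    """range(S.profit) = max - min of the profits of S's items."""
    vals = [profit[i] for i in itemset]
    return max(vals) - min(vals)

# Anti-monotone: once "ab" violates range <= 15, every superset does,
# since adding items can only widen the range.
r_ab, r_abc = rng("ab"), rng("abc")   # 40 and 60, both > 15
# Monotone: once "ab" satisfies range >= 15, every superset does too.
both_satisfy = (r_ab >= 15) and (r_abc >= 15)   # True
```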

Which Constraints Are Monotone or Anti-Monotone? SQL-based Constraints

Constraint                       Monotone   Anti-Monotone
v ∈ S                            yes        no
S ⊇ V                            yes        no
S ⊆ V                            no         yes
min(S) ≤ v                       yes        no
min(S) ≥ v                       no         yes
max(S) ≤ v                       no         yes
max(S) ≥ v                       yes        no
count(S) ≤ v                     no         yes
count(S) ≥ v                     yes        no
sum(S) ≤ v (∀ a ∈ S, a ≥ 0)      no         yes
sum(S) ≥ v (∀ a ∈ S, a ≥ 0)      yes        no
range(S) ≤ v                     no         yes
range(S) ≥ v                     yes        no
support(S) ≥ ξ                   no         yes
support(S) ≤ ξ                   yes        no

State Of The Art
Constraint pushing techniques have been proven to be effective in reducing the explored portion of the search space in constrained frequent pattern mining tasks.
Anti-monotone constraints:
• Easy to push
• Always profitable to do so
Monotone constraints:
• Hard to push
• Should we push them, or not?

Dual Miner (C. Bucila, J. Gehrke, D. Kifer and W. White, SIGKDD'02)
• A dual pruning algorithm for itemsets with constraints
• Based on MAFIA:
1. Bitmap approach
2. Does not compute frequent items with their support
3. Performance issues (many tests are required)
[Figure, slide 101: the itemset lattice over a, b, c, d. If support(a) < min_sup, all supersets of a can be pruned; if support(cd) > max_sup, all subsets of cd can be pruned]

FP-Growth with constraints
Checks only monotone constraints (plus the support, which is an anti-monotone constraint).
Once a frequent itemset satisfies the monotone constraint, all frequent itemsets having it as a prefix are also guaranteed to satisfy the constraint.
J. Pei, J. Han, L. Lakshmanan, ICDE'01
COFI-With Constraints
Anti-monotone constraint checking:
1. Remove individual items that do not satisfy the anti-monotone constraint (they do not even participate in the FP-tree building process). Reduces the FP-tree size.
2. For each A-COFI-tree, remove any locally frequent item B where B does not satisfy the anti-monotone constraint. Reduces the COFI-tree size.
3. No constraint checking is needed for an A-COFI-tree if the itemset X satisfies the anti-monotone constraint, where X is the set of all locally frequent items with respect to A. Reduces testing.

Monotone constraint checking:
1. Remove any frequent-path-bases that do not satisfy the monotone constraint. Reduces lattice traversal.
2. Do not create an A-COFI-tree if all its local items X together with the item A violate the monotone constraint. Reduces the number of COFI-trees.
3. No constraint checking is needed for an A-COFI-tree if the item A satisfies the monotone constraint, since any itemset containing A will also satisfy it. Reduces testing.

M. El-Hajj and O. Zaïane

Parallel KDD systems
Challenges and issues (Requirements & Design issues)

Requirements:
– Algorithm evaluation
– Process support
– Location transparency
– Data transparency
– System transparency
– Security, quality of service
– Availability, fault tolerance and mobility

Design issues:
– Large size
– High dimensionality
– Data location
– Data skew
– Dynamic load balancing
– Parallel database management systems vs. parallel file systems
– Parallel I/O minimization
– Parallel incremental mining algorithms
– Parallel dividing of the datasets
– Parallel merging of discoveries

Parallel Association Rule Mining Algorithms

Three families: replication algorithms, partitioning algorithms, and hybrid algorithms.

Replication algorithms: replicate the candidate sets and partition the database, so each processor counts its own data partition against the full candidate set.
• Count Distribution algorithm (R. Agrawal et al., 1996)
• Parallel Partition algorithm (Ashok Savasere et al., 1995)
• Fast Distributed algorithm (David Cheung et al., 1996)
• Parallel Data Mining algorithm (J. S. Park et al., 1996)

Partitioning algorithms: partition the candidate set across processors, each of which then needs to scan the whole database.
• Data Distribution algorithm (R. Agrawal et al., 1996)
• Intelligent Data Distribution algorithm (E.-H. Han et al., 1997)
• Non-Partitioned Apriori, Simple Partitioned, Hash Partitioned Apriori and HPA-ELD algorithms (Takahiko et al., 1996)
Parallel Association Rule Mining Algorithms

Hybrid algorithms: combine both ideas, replication and partitioning.
• Hybrid Distribution algorithm (E.-H. Han et al., 1997)
• Multiple Local Frequent Pattern Tree algorithm (Zaïane et al., 2001)

Presentation of Association Rules (Table Form)
[Screenshot, slide 110: DBMiner listing the discovered association rules in table form]

Visualization of Association Rules in Plane Form
[Screenshot, slide 111: DBMiner plane visualization]

Visualization of Association Rules Using a Rule Graph
[Screenshot, slide 112: DBMiner rule-graph visualization]

Visualization of Association Rules Using a Table Graph (DBMiner Web version)
[Screenshot, slide 113: DBMiner Web version table-graph visualization]

What is Sequence Mining?
• Input
– A database D of sequences called data-sequences, in which:
• I = {i1, i2, …, in} is the set of items
• each sequence is a list of transactions ordered by transaction-time
• each transaction consists of the fields: (sequence-id, transaction-id), transaction-time and a set of items
• Problem
– Discover all the sequential patterns with a user-specified minimum support
Input Database: example

Database D, sorted by sequence-id and transaction time:
Sequence-Id  Transaction Time  Items
C1           1                 Ringworld
C1           2                 Foundation
C1           15                Ringworld Engineers, Second Foundation
C2           1                 Foundation, Ringworld
C2           20                Foundation and Empire
C2           50                Ringworld Engineers

Example pattern: "45% of customers who bought Foundation will buy Foundation and Empire within the next month."

Subsequence: a sequence <a1, a2, …, an> is contained in another sequence <b1, b2, …, bm> if there exist integers i1 < i2 < … < in such that a1 ⊆ bi1, a2 ⊆ bi2, …, an ⊆ bin. Examples:
<(3) (4 5) (8)> is contained in <(7) (3 8) (9) (4 5 6) (8)> √
<(3) (5)> is not contained in <(3 5)> X
<(3) (5)> is contained in <(3 5) (3 5)> √

k-sequence: a sequence that consists of k items; for example, both <x, y> and <(x, y)> are 2-sequences.

Example

Database sorted by customer ID and transaction time:
Customer ID  Transaction Time  Items Bought
1            June 1            30
1            June 30           90
2            June 10           10, 20
2            June 15           30
2            June 20           40, 60, 70
3            June 25           30, 50, 70
4            June 20           30
4            June 25           40, 70
4            June 30           90
5            June 12           90

Customer-sequence version of the database:
Customer ID  Customer Sequence
1            <(30) (90)>
2            <(10 20) (30) (40 60 70)>
3            <(30 50 70)>
4            <(30) (40 70) (90)>
5            <(90)>

Answer set: the sequential patterns supported by at least 2 customers (minimum support 25%):
<(30) (90)>
<(30) (40 70)>
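The containment test defined above can be implemented greedily: match each element of the candidate subsequence against the earliest later transaction that is a superset of it (if any embedding exists, the leftmost one does). A minimal sketch, with an illustrative function name:

```python
def contains(sub, seq):
    """True if <a1..an> = sub is contained in <b1..bm> = seq, i.e. each
    element set a_k is a subset of some b_{i_k} with i_1 < ... < i_n."""
    j = 0
    for a in sub:
        # Advance to the first remaining transaction that covers a.
        while j < len(seq) and not set(a) <= set(seq[j]):
            j += 1
        if j == len(seq):
            return False
        j += 1          # each a_k must use a strictly later transaction
    return True

# The three examples from the slide:
ex1 = contains([(3,), (4, 5), (8,)],
               [(7,), (3, 8), (9,), (4, 5, 6), (8,)])   # True
ex2 = contains([(3,), (5,)], [(3, 5)])                  # False
ex3 = contains([(3,), (5,)], [(3, 5), (3, 5)])          # True
```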

Algorithms

• AprioriAll (1995)
• GSP: Generalized Sequential Patterns (1996)
  – Both proposed by Rakesh Agrawal and Ramakrishnan Srikant
  – Both need multiple scans over the database
  – GSP outperforms AprioriAll and is able to discover generalized sequential patterns
• SPADE (1998)
  – Proposed by Mohammed J. Zaki
  – Sequential Pattern Discovery using Equivalence classes
• FreeSpan (2000)
• PrefixSpan (2001)
  – Both proposed by Jian Pei et al.
  – Based on the FP-Tree and FP-Growth model

References: Frequent-pattern Mining

• B. Lent, A. Swami, and J. Widom. Clustering association rules. ICDE'97, 220-231, Birmingham, England.
• R. Meo, G. Psaila, and S. Ceri. A new SQL-like operator for mining association rules. VLDB'96, 122-133, Bombay, India.
• R.J. Miller and Y. Yang. Association rules over interval data. SIGMOD'97, 452-461, Tucson, Arizona.
• A. Savasere, E. Omiecinski, and S. Navathe. Mining for strong negative associations in a large database of customer transactions. ICDE'98, 494-502, Orlando, FL, Feb. 1998.
• D. Tsur, J. D. Ullman, S. Abiteboul, C. Clifton, R. Motwani, and S. Nestorov. Query flocks: A generalization of association-rule mining. SIGMOD'98, 1-12, Seattle, Washington.
• J. Pei, A.K.H. Tung, and J. Han. Fault-Tolerant Frequent Pattern Mining: Problems and Challenges. SIGMOD DMKD'01, Santa Barbara, CA.
• Mohammad El-Hajj and Osmar R. Zaïane. Non-Recursive Generation of Frequent K-itemsets from Frequent Pattern Tree Representations. Proc. 5th Int. Conf. on Data Warehousing and Knowledge Discovery (DaWaK'2003), pp. 371-380, Prague, Czech Republic, September 3-5, 2003.
• Mohammad El-Hajj and Osmar R. Zaïane. Inverted Matrix: Efficient Discovery of Frequent Items in Large Datasets in the Context of Interactive Mining. Proc. 2003 Int. Conf. on Data Mining and Knowledge Discovery (ACM SIGKDD), pp. 109-118, Washington, DC, USA, August 24-27, 2003.

Constraint-based Frequent-pattern Mining

• G. Grahne, L. Lakshmanan, and X. Wang. Efficient mining of constrained correlated sets. ICDE'00, 512-521, San Diego, CA, Feb. 2000.
• Y. Fu and J. Han. Meta-rule-guided mining of association rules in relational databases. KDOOD'95, 39-46, Singapore, Dec. 1995.
• J. Han, L. V. S. Lakshmanan, and R. T. Ng. "Constraint-Based, Multidimensional Data Mining". COMPUTER (special issue on Data Mining), 32(8): 46-50, 1999.
• L. V. S. Lakshmanan, R. Ng, J. Han, and A. Pang. "Optimization of Constrained Frequent Set Queries with 2-Variable Constraints". SIGMOD'99.
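The pattern-growth idea behind FreeSpan and PrefixSpan — recursively projecting the sequence database on each frequent item instead of generating and re-scanning candidates as AprioriAll and GSP do — can be sketched in a few lines. This is a simplified illustration only, not the authors' implementation: events are single items rather than itemsets, projected databases are materialized as plain lists, and the `prefixspan` function and toy database are invented for this example.

```python
from collections import defaultdict

def prefixspan(sequences, min_support, prefix=None):
    """Minimal PrefixSpan-style sketch: grow each frequent prefix by
    projecting the database on the prefix's last item and recursing."""
    if prefix is None:
        prefix = []
    patterns = []
    # Count, per item, how many (projected) sequences contain it.
    counts = defaultdict(int)
    for seq in sequences:
        for item in set(seq):
            counts[item] += 1
    for item, support in sorted(counts.items()):
        if support < min_support:
            continue
        new_prefix = prefix + [item]
        patterns.append((new_prefix, support))
        # Project: keep the suffix after the first occurrence of `item`.
        projected = []
        for seq in sequences:
            if item in seq:
                suffix = seq[seq.index(item) + 1:]
                if suffix:
                    projected.append(suffix)
        patterns.extend(prefixspan(projected, min_support, new_prefix))
    return patterns

db = [['a', 'b', 'c'], ['a', 'c'], ['a', 'b', 'c'], ['b', 'c']]
print(prefixspan(db, min_support=2))
```

Because each recursive call only sees suffixes of sequences that already contain the current prefix, no candidate sequence is ever generated and tested against the full database — the key contrast with the multi-scan Apriori-style algorithms above.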

References: Constraint-based Frequent Pattern Mining

• R. Ng, L.V.S. Lakshmanan, J. Han, and A. Pang. "Exploratory mining and pruning optimizations of constrained association rules." SIGMOD'98.
• J. Pei, J. Han, and L. V. S. Lakshmanan. "Mining Frequent Itemsets with Convertible Constraints". Proc. 2001 Int. Conf. on Data Engineering (ICDE'01), April 2001.
• J. Pei and J. Han. "Can We Push More Constraints into Frequent Pattern Mining?". Proc. 2000 Int. Conf. on Knowledge Discovery and Data Mining (KDD'00), Boston, MA, August 2000.
• R. Srikant, Q. Vu, and R. Agrawal. Mining association rules with item constraints. KDD'97, 67-73, Newport Beach, California.
• C. Bucila, J. E. Gehrke, D. Kifer, and W. White. DualMiner: A dual-pruning algorithm for itemsets with constraints. Proc. SIGKDD 2002.
• M. Kamber, J. Han, and J. Y. Chiang. Metarule-guided mining of multi-dimensional association rules using data cubes. KDD'97, 207-210, Newport Beach, California.

Mining Maximal and Closed Patterns

• R. J. Bayardo. Efficiently mining long patterns from databases. SIGMOD'98, 85-93, Seattle, Washington.
• J. Pei, J. Han, and R. Mao. "CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets". Proc. 2000 ACM-SIGMOD Int. Workshop on Data Mining and Knowledge Discovery (DMKD'00), Dallas, TX, May 2000.
• N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for association rules. ICDT'99, 398-416, Jerusalem, Israel, Jan. 1999.
• M. Zaki. Generating Non-Redundant Association Rules. KDD'00, Boston, MA, Aug. 2000.
• M. Zaki. CHARM: An Efficient Algorithm for Closed Association Rule Mining. TR99-10, Department of Computer Science, Rensselaer Polytechnic Institute.
• M. Zaki. Fast Vertical Mining Using Diffsets. TR01-1, Department of Computer Science, Rensselaer Polytechnic Institute.
• D. Burdick, M. Calimlim, and J. Gehrke. MAFIA: A maximal frequent itemset algorithm for transactional databases. ICDE 2001, IEEE Computer Society, 2001.

References: Sequential Pattern Mining

• R. Agrawal and R. Srikant. Mining sequential patterns. ICDE'95, 3-14, Taipei, Taiwan.
• R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. EDBT'96.
• J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M.-C. Hsu. "FreeSpan: Frequent Pattern-Projected Sequential Pattern Mining". Proc. 2000 Int. Conf. on Knowledge Discovery and Data Mining (KDD'00), Boston, MA, August 2000.
• H. Mannila, H. Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery, 1:259-289, 1997.
• J. Pei, J. Han, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. "PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth". Proc. 2001 Int. Conf. on Data Engineering (ICDE'01), Heidelberg, Germany, April 2001.
• B. Ozden, S. Ramaswamy, and A. Silberschatz. Cyclic association rules. ICDE'98, 412-421, Orlando, FL.
• S. Ramaswamy, S. Mahajan, and A. Silberschatz. On the discovery of interesting patterns in association rules. VLDB'98, 368-379, New York, NY.
• M.J. Zaki. Efficient enumeration of frequent sequences. CIKM'98, November 1998.
• M.N. Garofalakis, R. Rastogi, and K. Shim. SPIRIT: Sequential Pattern Mining with Regular Expression Constraints. VLDB 1999: 223-234, Edinburgh, Scotland.
References: Frequent-pattern Mining in Spatial, Multimedia, Text & Web Databases

• K. Koperski, J. Han, and G. B. Marchisio. "Mining Spatial and Image Data through Progressive Refinement Methods". Revue internationale de géomatique (European Journal of GIS and Spatial Analysis), 9(4):425-440, 1999.
• A. K. H. Tung, H. Lu, J. Han, and L. Feng. "Breaking the Barrier of Transactions: Mining Inter-Transaction Association Rules". Int. Conf. on Knowledge Discovery and Data Mining (KDD'99), San Diego, CA, Aug. 1999, pp. 297-301.
• J. Han, G. Dong, and Y. Yin. "Efficient Mining of Partial Periodic Patterns in Time Series Database". Proc. 1999 Int. Conf. on Data Engineering (ICDE'99), Sydney, Australia, March 1999, pp. 106-115.
• H. Lu, L. Feng, and J. Han. "Beyond Intra-Transaction Association Analysis: Mining Multi-Dimensional Inter-Transaction Association Rules". ACM Transactions on Information Systems (TOIS'00), 18(4): 423-454, 2000.
• O. R. Zaïane, M. Xin, and J. Han. "Discovering Web Access Patterns and Trends by Applying OLAP and Data Mining Technology on Web Logs". Proc. Advances in Digital Libraries Conf., Santa Barbara, CA, April 1998, pp. 19-29.
• O. R. Zaïane, J. Han, and H. Zhu. "Mining Recurrent Items in Multimedia with Progressive Resolution Refinement". Proc. 2000 Int. Conf. on Data Engineering (ICDE'00), San Diego, CA, Feb. 2000, pp. 461-470.

FIM for Classification & Data Cube Computation

• K. Beyer and R. Ramakrishnan. Bottom-up computation of sparse and iceberg cubes. SIGMOD 1999, pp. 359-370.
• M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. D. Ullman. Computing iceberg queries efficiently. VLDB'98, 299-310, New York, NY, Aug. 1998.
• J. Han, J. Pei, G. Dong, and K. Wang. Computing Iceberg Data Cubes with Complex Measures. SIGMOD 2001.
• B. Liu, W. Hsu, and Y. Ma. Integrating classification and association rule mining. SIGKDD 1998, pp. 80-86.
• M-L. Antonie and O. R. Zaïane. Text Document Categorization by Term Association. Proc. IEEE Int. Conf. on Data Mining (ICDM'2002), pp. 19-26, Maebashi City, Japan.

References: Parallel Frequent Itemset Mining

• R. Agrawal and J. Shafer. Parallel Mining of Association Rules. IEEE Transactions on Knowledge and Data Engineering, 8 (1996), pp. 962-969.
• Barry Wilkinson and Michael Allen. Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers. Alan Apt, New Jersey, USA, 1999.
• D. Cheung, K. Hu, and S. Xia. Asynchronous Parallel Algorithm for Mining Association Rules on Shared-Memory Multi-Processors. Proc. 10th ACM Symp. on Parallel Algorithms and Architectures, ACM Press, NY, 1998, pp. 279-288.
• David Wai-Lok Cheung, Jiawei Han, Vincent Ng, Ada Wai-Chee Fu, and Yongjian Fu. A Fast Distributed Algorithm for Mining Association Rules. PDIS 1996, pp. 31-42.
• David Wai-Lok Cheung and Yongqiao Xiao. Effect of Data Distribution in Parallel Mining of Associations. Data Mining and Knowledge Discovery, 3(3): 291-314 (1999).
• E-H. Han, G. Karypis, and V. Kumar. Scalable parallel data mining for association rules. ACM SIGMOD Conf. on Management of Data, May 1997.
• J. S. Park, M. Chen, and P. S. Yu. Efficient parallel data mining for association rules. ACM Int. Conf. on Information and Knowledge Management, November 1995.
• Mohammed J. Zaki and Ching-Tien Ho (editors). Large-Scale Parallel Data Mining. Lecture Notes in Artificial Intelligence, State-of-the-Art Survey, Volume 1759, Springer-Verlag, 2000.

References: Parallel Frequent Itemset Mining (continued)

• Mohammed J. Zaki. Parallel and Distributed Association Mining: A Survey. IEEE Concurrency, special issue on Parallel Mechanisms for Data Mining, 7(4), pp. 14-25, December 1999.
• Osmar R. Zaïane, Mohammad El-Hajj, and Paul Lu. Fast Parallel Association Rule Mining without Candidacy Generation. Proc. IEEE 2001 Int. Conf. on Data Mining (ICDM'2001), San Jose, CA, USA.
• Mohammad El-Hajj and Osmar R. Zaïane. Parallel Association Rule Mining with Minimum Inter-Processor Communication. Fifth Int. Workshop on Parallel and Distributed Databases (PaDD'2003), in conjunction with the 14th Int. Conf. on Database and Expert Systems Applications (DEXA 2003), pp. 519-523, Prague, Czech Republic, September 1-5, 2003.
• R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A.I. Verkamo. Fast discovery of association rules. In Fayyad et al. (eds.), AAAI Press, Menlo Park, Calif., pp. 307-328, 1996.
• D. Skillicorn. Strategies for Parallel Data Mining. IEEE Concurrency, 7 (1999), pp. 26-35.
• Srinivasan Parthasarathy, Mohammed Zaki, Mitsunori Ogihara, and Wei Li. Parallel Data Mining for Association Rules on Shared-Memory Systems. Knowledge and Information Systems, 3(1), pp. 1-29, Feb 2001.
• Takahiko Shintani and Masaru Kitsuregawa. Hash-Based Parallel Algorithms for Mining Association Rules. Proc. IEEE Fourth Int. Conf. on Parallel and Distributed Information Systems, pp. 19-30, Dec. 1996.
• W. A. Maniatty and Mohammed J. Zaki. A Requirement Analysis for Parallel KDD Systems. 3rd IPDPS Workshop on High Performance Data Mining, Cancun, Mexico, May 2000.
