This action might not be possible to undo. Are you sure you want to continue?
Christian Borgelt
lntc¦¦i¸cnt Lata Ana¦ysis and Grajhica¦ `odc¦s lcscarch ¹nit
Lurojcan Ccntcr tor Sott Comjutin¸
c, Gonza¦o Guticrrcz Quiros s,n, 33o00 `icrcs, Sjain
christian.borgelt@softcomputing.es
http://www.softcomputing.es/
http://www.borgelt.net/
http://www.borgelt.net/teach/fpm/
Christian Borgelt Frequent Pattern Mining 1
Overview
Frequent Pattern Mining comjriscs
• Ircqucnt ltcm Sct `inin¸ and Association lu¦c lnduction
• Ircqucnt Scqucncc `inin¸
• Ircqucnt Trcc `inin¸
• Ircqucnt Grajh `inin¸
Application Areas ot Ircqucnt lattcrn `inin¸ inc¦udc
• `arkct Laskct Ana¦ysis
• C¦ick Strcam Ana¦ysis
• \c¦ Link Ana¦ysis
• Gcnomc Ana¦ysis
• Lru¸ Lcsi¸n (`o¦ccu¦ar Ira¸mcnt `inin¸)
Christian Borgelt Frequent Pattern Mining 2
Frequent Item Set Mining
Christian Borgelt Frequent Pattern Mining 3
Frequent Item Set Mining: Motivation
• Ircqucnt ltcm Sct `inin¸ is a mcthod tor market basket analysis
• lt aims at ﬁndin¸ rc¸u¦aritics in thc shojjin¸ ¦chavior ot customcrs
ot sujcrmarkcts, mai¦ordcr comjanics, on¦inc shojs ctc
• `orc sjcciﬁca¦¦y
Find sets of products that are frequently bought together.
• lossi¦¦c ajj¦ications ot tound trcqucnt itcm scts
◦ lmjrovc arran¸cmcnt ot jroducts in shc¦vcs, on a cata¦o¸’s ja¸cs ctc
◦ Sujjort crosssc¦¦in¸ (su¸¸cstion ot othcr jroducts), jroduct ¦und¦in¸
◦ Iraud dctcction, tcchnica¦ dcjcndcncc ana¦ysis ctc
• Ottcn tound jattcrns arc cxjrcsscd as association rules, tor cxamj¦c
If a customcr ¦uys bread and wine,
then shc,hc wi¦¦ jro¦a¦¦y a¦so ¦uy cheese
Christian Borgelt Frequent Pattern Mining 4
Frequent Item Set Mining: Basic Notions
• Lct B ¦i
1
, . . . , i
m
¦ ¦c a sct ot items This sct is ca¦¦cd thc item base
ltcms may ¦c jroducts, sjccia¦ cquijmcnt itcms, scrvicc ojtions ctc
• Any su¦sct I ⊆ B is ca¦¦cd an item set
An itcm sct may ¦c any sct ot jroducts that can ¦c ¦ou¸ht (to¸cthcr)
• Lct T (t
1
, . . . , t
n
) with ∀k, 1 ≤ k ≤ n t
k
⊆ B ¦c a vcctor ot
transactions ovcr B This vcctor is ca¦¦cd thc transaction database
A transaction data¦asc can ¦ist, tor cxamj¦c, thc scts ot jroducts
¦ou¸ht ¦y thc customcrs ot a sujcrmarkct in a ¸ivcn jcriod ot timc
Lvcry transaction is an itcm sct, ¦ut somc itcm scts may not ajjcar in T
Transactions nccd not ¦c jairwisc diﬀcrcnt it may ¦c t
j
t
k
tor j , k
T may a¦so ¦c dcﬁncd as a bag or multiset ot transactions
Thc sct B may not ¦c cxj¦icitc¦y ¸ivcn, ¦ut on¦y imj¦icit¦y as B
n
k1
t
k
Christian Borgelt Frequent Pattern Mining 5
Frequent Item Set Mining: Basic Notions
Lct I ⊆ B ¦c an itcm sct and T a transaction data¦asc ovcr B
• A transaction t ∈ T covers thc itcm sct I or
thc itcm sct I is contained in a transaction t ∈ T iﬀ I ⊆ t
• Thc sct K
T
(I) ¦k ∈ ¦1, . . . , n¦ [ I ⊆ t
k
¦ is ca¦¦cd thc cover ot I wrt T
Thc covcr ot an itcm sct is thc indcx sct ot thc transactions that covcr it
lt may a¦so ¦c dcﬁncd as a vcctor ot a¦¦ transactions that covcr it
(which, howcvcr, is comj¦icatcd to writc in a torma¦¦y corrcct way)
• Thc va¦uc s
T
(I) [K
T
(I)[ is ca¦¦cd thc (absolute) support ot I wrt T
Thc va¦uc σ
T
(I)
1
n
[K
T
(I)[ is ca¦¦cd thc relative support ot I wrt T
Thc sujjort ot I is thc num¦cr or traction ot transactions that contain it
Somctimcs σ
T
(I) is a¦so ca¦¦cd thc (relative) frequency ot I wrt T
Christian Borgelt Frequent Pattern Mining 6
Frequent Item Set Mining: Basic Notions
A¦tcrnativc Lcﬁnition ot Transactions
• A transaction ovcr an itcm ¦asc B is a tuj¦c t (tid, J), whcrc
◦ tid is a uniquc transaction identiﬁer and
◦ J ⊆ B is an itcm sct
• A transaction database T ¦t
1
, . . . , t
n
¦ is a set ot transactions
A simj¦c sct can ¦c uscd, sincc transactions diﬀcr at ¦cast in thcir idcntiﬁcr
• A transaction t (tid, J) covers an itcm sct I iﬀ I ⊆ J
• Thc sct K
T
(I) ¦tid [ ∃J ⊆ B ∃t ∈ T t (tid, J) ∧ I ⊆ J¦
is thc cover ot I wrt T
lcmark lt thc transaction data¦asc is dcﬁncd as a vcctor, thcrc is an imj¦icit
transaction idcntiﬁcr, namc¦y thc josition ot thc transaction in thc vcctor
Christian Borgelt Frequent Pattern Mining 7
Frequent Item Set Mining: Formal Deﬁnition
Given:
• a sct B ¦i
1
, . . . , i
m
¦ ot itcms, thc item base,
• a vcctor T (t
1
, . . . , t
n
) ot transactions ovcr B, thc transaction database,
• a num¦cr s
min
∈ l`, 0 < s
min
≤ n, or (cquiva¦cnt¦y)
a num¦cr σ
min
∈ ll, 0 < σ
min
≤ 1, thc minimum support
Desired:
• thc sct ot frequent item sets, that is,
thc sct F
T
(s
min
) ¦I ⊆ B [ s
T
(I) ≥ s
min
¦ or (cquiva¦cnt¦y)
thc sct Φ
T
(σ
min
) ¦I ⊆ B [ σ
T
(I) ≥ σ
min
¦
`otc that with thc rc¦ations s
min
⌈nσ
min
⌉ and σ
min
1
n
s
min
thc two vcrsions can casi¦y ¦c transtormcd into cach othcr
Christian Borgelt Frequent Pattern Mining 8
Frequent Item Sets: Example
transaction data¦asc
1 ¦a, d, e¦
2 ¦b, c, d¦
3 ¦a, c, e¦
! ¦a, c, d, e¦
` ¦a, e¦
o ¦a, c, d¦
¨ ¦b, c¦
S ¦a, c, d, e¦
9 ¦b, c, e¦
10 ¦a, d, e¦
trcqucnt itcm scts
0 itcms 1 itcm 2 itcms 3 itcms
∅ 10 ¦a¦ ¨ ¦a, c¦ ! ¦a, c, d¦ 3
¦b¦ 3 ¦a, d¦ ` ¦a, c, e¦ 3
¦c¦ ¨ ¦a, e¦ o ¦a, d, e¦ !
¦d¦ o ¦b, c¦ 3
¦e¦ ¨ ¦c, d¦ !
¦c, e¦ !
¦d, e¦ !
• Thc minimum sujjort is s
min
3 or σ
min
0.3 30/ in this cxamj¦c
• Thcrc arc 2
`
32 jossi¦¦c itcm scts ovcr B ¦a, b, c, d, e¦
• Thcrc arc 1o trcqucnt itcm scts (¦ut on¦y 10 transactions)
Christian Borgelt Frequent Pattern Mining 9
Searching for Frequent Item Sets
Christian Borgelt Frequent Pattern Mining 10
Properties of the Support of Item Sets
• A brute force approach that travcrscs a¦¦ jossi¦¦c itcm scts, dctcrmincs thcir
sujjort, and discards intrcqucnt itcm scts is usua¦¦y infeasible
Thc num¦cr ot jossi¦¦c itcm scts ¸rows cxjoncntia¦¦y with thc num¦cr ot itcms
A tyjica¦ sujcrmarkct oﬀcrs thousands ot diﬀcrcnt jroducts
• Idea: Considcr thc jrojcrtics ot thc sujjort, in jarticu¦ar
∀I ∀J ⊇ I K
T
(J) ⊆ K
T
(I).
This jrojcrty ho¦ds, sincc ∀t ∀I ∀J ⊇ I J ⊆ t → I ⊆ t
Lach additiona¦ itcm is anothcr condition a transaction has to satisty
Transactions that do not satisty this condition arc rcmovcd trom thc covcr
• lt to¦¦ows ∀I ∀J ⊇ I s
T
(J) ≤ s
T
(I).
That is If an item set is extended, its support cannot increase.
Onc a¦so says that sujjort is antimonotone or downward closed
Christian Borgelt Frequent Pattern Mining 11
Properties of the Support of Item Sets
• Irom ∀I ∀J ⊇ I s
T
(J) ≤ s
T
(I) it to¦¦ows immcdiatc¦y
∀s
min
∀I ∀J ⊇ I s
T
(I) < s
min
→ s
T
(J) < s
min
.
That is No superset of an infrequent item set can be frequent.
• This jrojcrty is ottcn rctcrrcd to as thc Apriori Property
lationa¦c Somctimcs wc can know a priori, that is, ¦ctorc chcckin¸ its sujjort
¦y acccssin¸ thc ¸ivcn transaction data¦asc, that an itcm sct cannot ¦c trcqucnt
• Ot coursc, thc contrajosition ot this imj¦ication a¦so ho¦ds
∀s
min
∀I ∀J ⊆ I s
T
(I) ≥ s
min
→ s
T
(J) ≥ s
min
.
That is All subsets of a frequent item set are frequent.
• This su¸¸csts a comjrcsscd rcjrcscntation ot thc sct ot trcqucnt itcm scts
(which wi¦¦ ¦c cxj¦orcd ¦atcr maxima¦ and c¦oscd trcqucnt itcm scts)
Christian Borgelt Frequent Pattern Mining 12
Reminder: Partially Ordered Sets
• A partial order is a ¦inary rc¦ation ≤ ovcr a sct S which satisﬁcs ∀a, b, c ∈ S
◦ a ≤ a (rcﬂcxivity)
◦ a ≤ b ∧ b ≤ a ⇒ a b (antisymmctry)
◦ a ≤ b ∧ b ≤ c ⇒ a ≤ c (transitivity)
• A sct with a jartia¦ ordcr is ca¦¦cd a partially ordered set (or poset tor short)
• Lct a and b ¦c two distinct c¦cmcnts ot a jartia¦¦y ordcrcd sct (S, ≤)
◦ it a ≤ b or b ≤ a, thcn a and b arc ca¦¦cd comparable
◦ it ncithcr a ≤ b nor b ≤ a, thcn a and b arc ca¦¦cd incomparable
• lt a¦¦ jairs ot c¦cmcnts ot thc undcr¦yin¸ sct S arc comjara¦¦c,
thc ordcr ≤ is ca¦¦cd a total order or a linear order
• ln a tota¦ ordcr thc rcﬂcxivity axiom is rcj¦accd ¦y thc stron¸cr axiom
◦ a ≤ b ∨ b ≤ a (tota¦ity)
Christian Borgelt Frequent Pattern Mining 13
Properties of the Support of Item Sets
Monotonicity in Calculus and Analysis
• A tunction f ll → ll is ca¦¦cd monotonically nondecreasing
it ∀x, y x ≤ y ⇒ f(x) ≤ f(y)
• A tunction f ll → ll is ca¦¦cd monotonically nonincreasing
it ∀x, y x ≤ y ⇒ f(x) ≥ f(y)
Monotonicity in Order Theory
• Ordcr thcory is conccrncd with ar¦itrary jartia¦¦y ordcrcd scts
Thc tcrms increasing and decreasing arc avoidcd, ¦ccausc thcy ¦osc thcir jictoria¦
motivation as soon as scts arc considcrcd that arc not tota¦¦y ordcrcd
• A tunction f S → R, whcrc S and R arc two jartia¦¦y ordcrcd scts, is ca¦¦cd
monotone or orderpreserving it ∀x, y ∈ S x ≤
S
y ⇒ f(x) ≤
R
f(y)
• A tunction f S → R, is ca¦¦cd
antimonotone or orderreversing it ∀x, y ∈ S x ≤
S
y ⇒ f(x) ≥
R
f(y)
• ln this scnsc thc sujjort ot an itcm sct is antimonotonc
Christian Borgelt Frequent Pattern Mining 14
Properties of Frequent Item Sets
• A su¦sct R ot a jartia¦¦y ordcrcd sct (S, ≤) is ca¦¦cd downward closed
it tor any c¦cmcnt ot thc sct a¦¦ sma¦¦cr c¦cmcnts arc a¦so in it
∀x ∈ R ∀y ∈ S y ≤ x ⇒ y ∈ R
ln this casc thc su¦sct R is a¦so ca¦¦cd a lower set
• Thc notions ot upward closed and upper set arc dcﬁncd ana¦o¸ous¦y
• Ior cvcry s
min
thc sct ot trcqucnt itcm scts F
T
(s
min
) is downward c¦oscd
wrt thc jartia¦¦y ordcrcd sct (2
B
, ⊆), whcrc 2
B
dcnotcs thc jowcrsct ot B
∀X ∈ F
T
(s
min
) ∀Y ⊆ B Y ⊆ X ⇒ Y ∈ F
T
(s
min
)
• Sincc thc sct ot trcqucnt itcm scts is induccd ¦y thc sujjort tunction,
thc notions ot up or downward closed arc transtcrrcd to thc sujjort tunction
Any sct ot itcm scts induccd ¦y a sujjort thrcsho¦d θ is uj or downward c¦oscd
F
T
(θ) ¦S ⊆ B [ s
T
(S) ≥ θ¦ is downward c¦oscd,
G
T
(θ) ¦S ⊆ B [ s
T
(S) < θ¦ is ujward c¦oscd
Christian Borgelt Frequent Pattern Mining 15
Reminder: Partially Ordered Sets and Hasse Diagrams
• A ﬁnitc jartia¦¦y ordcrcd sct (S, ≤) can ¦c dcjictcd as a (dircctcd) acyc¦ic ¸rajh G,
which is ca¦¦cd Hasse diagram
• G has thc c¦cmcnts ot S as nodcs
Thc cd¸cs arc sc¦cctcd accordin¸ to
lt a and b arc c¦cmcnts ot S with a < b
(that is, a ≤ b and not a b) and
thcrc is no c¦cmcnt ¦ctwccn a and b
(that is, no c ∈ S with a < c < b),
thcn thcrc is an cd¸c trom a to b
• Sincc thc ¸rajh is acyc¦ic
(thcrc is no dircctcd cyc¦c),
thc ¸rajh can a¦ways ¦c dcjictcd
such that a¦¦ cd¸cs ¦cad downwards
• Thc Lassc dia¸ram ot a tota¦ or
¦incar ordcr is a chain
a b c d e
ab ac ad ae bc bd be cd ce de
abc abd abe acd ace ade bcd bce bde cde
abcd abce abde acde bcde
abcde
Lassc dia¸ram ot (¦a, b, c, d, e¦, ⊆)
(Ld¸c dircctions arc omittcd.
a¦¦ cd¸cs ¦cad downwards)
Christian Borgelt Frequent Pattern Mining 16
Searching for Frequent Item Sets
• Thc standard scarch jroccdurc is an enumeration approach,
that cnumcratcs candidatc itcm scts and chccks thcir sujjort
• lt imjrovcs ovcr thc ¦rutc torcc ajjroach ¦y cxj¦oitin¸ thc apriori property
to skij itcm scts that cannot ¦c trcqucnt ¦ccausc thcy havc an intrcqucnt su¦sct
• Thc search space is thc partially ordered set (2
B
, ⊆)
• Thc structurc ot thc jartia¦¦y ordcrcd sct (2
B
, ⊆) hc¦js to idcntity
thosc itcm scts that can ¦c skijjcd duc to thc ajriori jrojcrty
⇒ topdown search (trom cmjty sct,oncc¦cmcnt scts to ¦ar¸cr scts)
• Sincc a jartia¦¦y ordcrcd sct can convcnicnt¦y ¦c dcjictcd ¦y a Hasse diagram,
wc wi¦¦ usc such dia¸rams to i¦¦ustratc thc scarch
• `otc that thc scarch may havc to visit an cxjoncntia¦ num¦cr ot itcm scts
ln jracticc, howcvcr, thc scarch timcs arc ottcn ¦cara¦¦c,
at ¦cast it thc minimum sujjort is not choscn too ¦ow
Christian Borgelt Frequent Pattern Mining 17
Searching for Frequent Item Sets
Idea: ¹sc thc jrojcrtics
ot thc sujjort to or¸anizc
thc scarch tor a¦¦ trcqucnt
itcm scts, csjccia¦¦y thc
apriori property
∀I ∀J ⊃ I
s
T
(I) < s
min
→ s
T
(J) < s
min
.
Sincc thcsc jrojcrtics rc
¦atc thc sujjort ot an itcm
sct to thc sujjort ot its
subsets and supersets,
it is rcasona¦¦c to or¸anizc
thc scarch ¦ascd on thc
structurc ot thc partially
ordered set (2
B
, ⊆)
Hasse diagram tor ﬁvc itcms ¦a, b, c, d, e¦ B
a b c d e
ab ac ad ae bc bd be cd ce de
abc abd abe acd ace ade bcd bce bde cde
abcd abce abde acde bcde
abcde (2
B
, ⊆)
Christian Borgelt Frequent Pattern Mining 18
Hasse Diagrams and Frequent Item Sets
transaction data¦asc
1 ¦a, d, e¦
2 ¦b, c, d¦
3 ¦a, c, e¦
! ¦a, c, d, e¦
` ¦a, e¦
o ¦a, c, d¦
¨ ¦b, c¦
S ¦a, c, d, e¦
9 ¦b, c, e¦
10 ¦a, d, e¦
L¦uc ¦oxcs arc trcqucnt
itcm scts, whitc ¦oxcs
intrcqucnt itcm scts
Lassc dia¸ram with trcqucnt itcm scts (s
min
3)
a b c d e
ab ac ad ae bc bd be cd ce de
abc abd abe acd ace ade bcd bce bde cde
abcd abce abde acde bcde
abcde
Christian Borgelt Frequent Pattern Mining 19
The Apriori Algorithm
A¸rawa¦ and Srikant 199!
Christian Borgelt Frequent Pattern Mining 20
Searching for Frequent Item Sets
One possible scheme for the search:
• Lctcrminc thc sujjort ot thc onc c¦cmcnt itcm scts
and discard thc intrcqucnt itcms
• Iorm candidatc itcm scts with two itcms (¦oth itcms must ¦c trcqucnt),
dctcrminc thcir sujjort, and discard thc intrcqucnt itcm scts
• Iorm candidatc itcm scts with thrcc itcms (a¦¦ jairs must ¦c trcqucnt),
dctcrminc thcir sujjort, and discard thc intrcqucnt itcm scts
• Continuc ¦y tormin¸ candidatc itcm scts with tour, ﬁvc ctc itcms
unti¦ no candidatc itcm sct is trcqucnt
This is thc ¸cncra¦ schcmc ot thc Apriori Algorithm
lt is ¦ascd on two main stcjs candidate generation and pruning
A¦¦ cnumcration a¦¸orithms arc ¦ascd on thcsc stcjs in somc torm
Christian Borgelt Frequent Pattern Mining 21
The Apriori Algorithm 1
function ajriori (B, T, s
min
)
begin (∗ — Ajriori a¦¸orithm ∗)
k 1. (∗ initia¦izc thc itcm sct sizc ∗)
E
k
i∈B
¦¦i¦¦. (∗ start with sin¸¦c c¦cmcnt scts ∗)
F
k
jrunc(E
k
, T, s
min
). (∗ and dctcrminc thc trcqucnt oncs ∗)
while F
k
, ∅ do begin (∗ whi¦c thcrc arc trcqucnt itcm scts ∗)
E
k+1
candidatcs(F
k
). (∗ crcatc candidatcs with onc itcm morc ∗)
F
k+1
jrunc(E
k+1
, T, s
min
). (∗ and dctcrminc thc trcqucnt itcm scts ∗)
k k + 1. (∗ incrcmcnt thc itcm countcr ∗)
end.
return
k
j1
F
j
. (∗ rcturn thc trcqucnt itcm scts ∗)
end (∗ ajriori ∗)
E
j
candidatc itcm scts ot sizc j, F
j
trcqucnt itcm scts ot sizc j
Christian Borgelt Frequent Pattern Mining 22
The Apriori Algorithm 2
function candidatcs (F
k
)
begin (∗ — ¸cncratc candidatcs with k + 1 itcms ∗)
E ∅. (∗ initia¦izc thc sct ot candidatcs ∗)
forall f
1
, f
2
∈ F
k
(∗ travcrsc a¦¦ jairs ot trcqucnt itcm scts ∗)
with f
1
¦i
1
, . . . , i
k−1
, i
k
¦ (∗ that diﬀcr on¦y in onc itcm and ∗)
and f
2
¦i
1
, . . . , i
k−1
, i
′
k
¦ (∗ arc in a ¦cxico¸rajhic ordcr ∗)
and i
k
< i
′
k
do begin (∗ (thc ordcr is ar¦itrary, ¦ut ﬁxcd) ∗)
f f
1
∪ f
2
¦i
1
, . . . , i
k−1
, i
k
, i
′
k
¦. (∗ union has k + 1 itcms ∗)
if ∀i ∈ f f −¦i¦ ∈ F
k
(∗ it a¦¦ su¦scts with k itcms arc trcqucnt, ∗)
then E E ∪ ¦f¦. (∗ add thc ncw itcm sct to thc candidatcs ∗)
end. (∗ (othcrwisc it cannot ¦c trcqucnt) ∗)
return E. (∗ rcturn thc ¸cncratcd candidatcs ∗)
end (∗ candidatcs ∗)
Christian Borgelt Frequent Pattern Mining 23
The Apriori Algorithm 3
function jrunc (E, T, s
min
)
begin (∗ — jrunc intrcqucnt candidatcs ∗)
forall e ∈ E do (∗ initia¦izc thc sujjort countcrs ∗)
s
T
(e) 0. (∗ ot a¦¦ candidatcs to ¦c chcckcd ∗)
forall t ∈ T do (∗ travcrsc thc transactions ∗)
forall e ∈ E do (∗ travcrsc thc candidatcs ∗)
if e ⊆ t (∗ it transaction contains thc candidatc, ∗)
then s
T
(e) s
T
(e) + 1. (∗ incrcmcnt thc sujjort countcr ∗)
F ∅. (∗ initia¦izc thc sct ot trcqucnt candidatcs ∗)
forall e ∈ E do (∗ travcrsc thc candidatcs ∗)
if s
T
(e) ≥ s
min
(∗ it a candidatc is trcqucnt, ∗)
then F F ∪ ¦e¦. (∗ add it to thc sct ot trcqucnt itcm scts ∗)
return F. (∗ rcturn thc jruncd sct ot candidatcs ∗)
end (∗ jrunc ∗)
Christian Borgelt Frequent Pattern Mining 24
Improving the Candidate Generation
Christian Borgelt Frequent Pattern Mining 25
Searching for Frequent Item Sets
• Thc Ajriori a¦¸orithm scarchcs thc jartia¦ ordcr tojdown ¦cvc¦ ¦y ¦cvc¦
• Co¦¦cctin¸ thc trcqucnt itcm scts ot sizc k in a set F
k
has draw¦acks
A trcqucnt itcm sct ot sizc k + 1 can ¦c tormcd in
j
k(k + 1)
2
jossi¦¦c ways (Ior intrcqucnt itcm scts thc num¦cr may ¦c sma¦¦cr)
As a conscqucncc, thc candidatc ¸cncration stcj may carry out a ¦ot ot
rcdundant work, sincc it suﬃccs to ¸cncratc cach candidatc itcm sct oncc
• Question: Can wc rcducc or cvcn c¦iminatc this rcdundant work´
More generally:
Low can wc makc surc that any candidatc itcm sct is ¸cncratcd at most oncc´
• Idea: Assi¸n to cach itcm sct a uniquc jarcnt itcm sct,
trom which this itcm sct is to ¦c ¸cncratcd
Christian Borgelt Frequent Pattern Mining 26
Searching for Frequent Item Sets
• A corc jro¦¦cm is that an itcm sct ot sizc k (that is, with k itcms)
can ¦c ¸cncratcd in k' diﬀcrcnt ways (on k' jaths in thc Lassc dia¸ram),
¦ccausc in jrincij¦c thc itcms may ¦c addcd in any ordcr
• lt wc considcr an itcm ¦y itcm jroccss ot ¦ui¦din¸ an itcm sct
(which can ¦c ima¸incd as a ¦cvc¦wisc travcrsa¦ ot thc jartia¦ ordcr),
thcrc arc k jossi¦¦c ways ot tormin¸ an itcm sct ot sizc k
trom itcm scts ot sizc k −1 ¦y addin¸ thc rcmainin¸ itcm
• lt is o¦vious that it suﬃccs to considcr cach itcm sct at most oncc in ordcr
to ﬁnd thc trcqucnt oncs (intrcqucnt itcm scts nccd not ¦c ¸cncratcd at a¦¦)
• Question: Can wc rcducc or cvcn c¦iminatc this varicty´
More generally:
Low can wc makc surc that any candidatc itcm sct is ¸cncratcd at most oncc´
• Idea: Assi¸n to cach itcm sct a uniquc jarcnt itcm sct,
trom which this itcm sct is to ¦c ¸cncratcd
Christian Borgelt Frequent Pattern Mining 27
Searching for Frequent Item Sets
• \c havc to scarch thc jartia¦¦y ordcrcd sct (2
B
, ⊆) , its Lassc dia¸ram
• Assi¸nin¸ uniquc jarcnts turns thc Lassc dia¸ram into a trcc
• Travcrsin¸ thc rcsu¦tin¸ trcc cxj¦orcs cach itcm sct cxact¦y oncc
Lassc dia¸ram and a jossi¦¦c trcc tor ﬁvc itcms
a b c d e
ab ac ad ae bc bd be cd ce de
abc abd abe acd ace ade bcd bce bde cde
abcd abce abde acde bcde
abcde
a b c d e
ab ac ad ae bc bd be cd ce de
abc abd abe acd ace ade bcd bce bde cde
abcd abce abde acde bcde
abcde
Christian Borgelt Frequent Pattern Mining 28
Searching with Unique Parents
Principle of a Search Algorithm based on Unique Parents:
• Base Loop:
◦ Travcrsc a¦¦ oncc¦cmcnt itcm scts (thcir uniquc jarcnt is thc cmjty sct)
◦ lccursivc¦y jroccss a¦¦ oncc¦cmcnt itcm scts that arc trcqucnt
• Recursive Processing:
Ior a ¸ivcn trcqucnt itcm sct I
◦ Gcncratc a¦¦ cxtcnsions J ot I ¦y onc itcm (that is, J ⊃ I, [J[ [I[ + 1)
tor which thc itcm sct I is thc choscn uniquc jarcnt
◦ Ior a¦¦ J it J is trcqucnt, jroccss J rccursivc¦y, othcrwisc discard J
• Questions:
◦ Low can wc torma¦¦y assi¸n uniquc jarcnts´
◦ Low can wc makc surc that wc ¸cncratc on¦y thosc cxtcnsions
tor which thc itcm sct that is cxtcndcd is thc choscn uniquc jarcnt´
Christian Borgelt Frequent Pattern Mining 29
Assigning Unique Parents
• Iorma¦¦y, thc sct ot a¦¦ possible parents ot an itcm sct I is
P(I) ¦J ⊂ I [ ,∃K J ⊂ K ⊂ I¦.
ln othcr words, thc jossi¦¦c jarcnts ot I arc its maximal proper subsets
• ln ordcr to sin¸¦c out onc c¦cmcnt ot P(I), thc canonical parent p
c
(I),
wc can simj¦y dcﬁnc an (ar¦itrary, ¦ut ﬁxcd) ¸¦o¦a¦ ordcr ot thc itcms
i
1
< i
2
< i
3
< < i
n
.
Thcn thc canonica¦ jarcnt ot an itcm sct I can ¦c dcﬁncd as thc itcm sct
p
c
(I) I −¦max
i∈I
i¦ (or p
c
(I) I −¦min
i∈I
i¦),
whcrc thc maximum (or minimum) is takcn wrt thc choscn ordcr ot thc itcms
• Lvcn thou¸h this ajjroach is strai¸httorward and simj¦c,
wc rctormu¦atc it now in tcrms ot a canonical form ot an itcm sct,
in ordcr to ¦ay thc toundations tor thc study ot trcqucnt (su¦)¸rajh minin¸
Christian Borgelt Frequent Pattern Mining 30
Canonical Forms of Item Sets
Christian Borgelt Frequent Pattern Mining 31
Canonical Forms
Thc mcanin¸ ot thc word “canonica¦”
(sourcc Oxtord Advanccd Lcarncr’s Lictionary — Lncyc¦ojcdic Ldition)
canon , kan
c
n, n 1 ¸cncra¦ ru¦c, standard or jrincij¦c, ¦y which sth is ,ud¸cd
This ﬁlm oﬀends against all the canons of good taste.
canonical ,k
c
n
a
nIk¦, adj 3 standard. acccjtcd
• A canonical form ot somcthin¸ is a standard rcjrcscntation ot it
• Thc canonica¦ torm must ¦c uniquc (othcrwisc it cou¦d not ¦c standard)
`cvcrthc¦css thcrc arc ottcn scvcra¦ jossi¦¦c choiccs tor a canonica¦ torm
Lowcvcr, onc must ﬁx onc ot thcm tor a ¸ivcn ajj¦ication
• ln thc to¦¦owin¸ wc wi¦¦ dcﬁnc a standard rcjrcscntation ot an itcm sct,
and ¦atcr standard rcjrcscntations ot a ¸rajh, a scqucncc, a trcc ctc
• This canonica¦ torm wi¦¦ ¦c uscd to assi¸n uniquc jarcnts to a¦¦ itcm scts
Christian Borgelt Frequent Pattern Mining 32
A Canonical Form for Item Sets
• An itcm sct is rcjrcscntcd ¦y a code word. cach ¦cttcr rcjrcscnts an itcm
Thc codc word is a word ovcr thc a¦jha¦ct B, thc sct ot a¦¦ itcms
• Thcrc arc k' jossi¦¦c codc words tor an itcm sct ot sizc k,
¦ccausc thc itcms may ¦c ¦istcd in any ordcr
• Ly introducin¸ an (ar¦itrary, ¦ut ﬁxcd) order of the items,
and ¦y comjarin¸ codc words ¦cxico¸rajhica¦¦y wrt this ordcr,
wc can dcﬁnc an ordcr on thcsc codc words
Lxamj¦c abc < bac < bca < cab ctc tor thc itcm sct ¦a, b, c¦ and a < b < c
• Thc ¦cxico¸rajhica¦¦y sma¦¦cst (or, a¦tcrnativc¦y, ¸rcatcst) codc word
tor an itcm sct is dcﬁncd to ¦c its canonical code word
O¦vious¦y thc canonica¦ codc word ¦ists thc itcms in thc choscn, ﬁxcd ordcr
lcmark Thcsc cxj¦anations may ajjcar o¦tuscatcd, sincc thc corc idca and thc rcsu¦t arc vcry simj¦c
Lowcvcr, thc vicw dcvc¦ojcd hcrc wi¦¦ hc¦j us a ¦ot whcn wc turn to trcqucnt (su¦)¸rajh minin¸
Christian Borgelt Frequent Pattern Mining 33
Canonical Forms and Canonical Parents
• Lct I ¦c an itcm sct and w
c
(I) its canonica¦ codc word
Thc canonical parent p
c
(I) ot thc itcm sct I is thc itcm sct
dcscri¦cd ¦y thc longest proper preﬁx ot thc codc word w
c
(I)
• Sincc thc canonica¦ codc word ot an itcm sct ¦ists its itcms in thc choscn ordcr,
this dcﬁnition is cquiva¦cnt to
p
c
(I) I −¦max
a∈I
a¦.
• General Recursive Processing with Canonical Forms:
Ior a ¸ivcn trcqucnt itcm sct I
◦ Gcncratc a¦¦ jossi¦¦c cxtcnsions J ot I ¦y onc itcm (J ⊃ I, [J[ [I[ + 1)
◦ Iorm thc canonica¦ codc word w
c
(J) ot cach cxtcndcd itcm sct J
◦ Ior cach J it thc ¦ast ¦cttcr ot w
c
(J) is thc itcm addcd to I to torm J
and J is trcqucnt, jroccss J rccursivc¦y, othcrwisc discard J
Christian Borgelt Frequent Pattern Mining 34
The Preﬁx Property
• `otc that thc considcrcd itcm sct codin¸ schcmc has thc preﬁx property
The longest proper preﬁx of the canonical code word of any item set
is a canonical code word itself.
⇒ \ith thc ¦on¸cst jrojcr jrcﬁx ot thc canonica¦ codc word ot an itcm sct I
wc not on¦y know thc canonica¦ jarcnt ot I, ¦ut a¦so its canonica¦ codc word
• Lxamj¦c Considcr thc itcm sct I ¦a, b, d, e¦
◦ Thc canonica¦ codc word ot I is abde
◦ Thc ¦on¸cst jrojcr jrcﬁx ot abde is abd
◦ abd is thc canonica¦ codc word ot p
c
(I) ¦a, b, d¦
• `otc that thc jrcﬁx jrojcrty immcdiatc¦y imj¦ics
Every preﬁx of a canonical code word is a canonical code word itself.
(ln thc to¦¦owin¸ ¦oth statcmcnts arc ca¦¦cd thc preﬁx property, sincc thcy arc o¦vious¦y cquiva¦cnt)
Christian Borgelt Frequent Pattern Mining 35
Searching with the Preﬁx Property
Thc jrcﬁx jrojcrty a¦¦ows us to simplify the search scheme
• Thc ¸cncra¦ rccursivc jroccssin¸ schcmc with canonica¦ torms rcquircs
to construct thc canonical code word ot cach crcatcd itcm sct
in ordcr to dccidc whcthcr it has to ¦c jroccsscd rccursivc¦y or not
⇒ \c know thc canonica¦ codc word ot cvcry itcm sct that is jroccsscd rccursivc¦y
• \ith this codc word wc know, duc to thc preﬁx property, thc canonica¦
codc words ot a¦¦ chi¦d itcm scts that havc to ¦c cxj¦orcd in thc rccursion
with the exception of the last letter (that is, thc addcd itcm)
⇒ \c on¦y havc to chcck whcthcr thc codc word that rcsu¦ts trom ajjcndin¸
thc addcd itcm to thc ¸ivcn canonica¦ codc word is canonica¦ or not
• Advantage:
Chcckin¸ whcthcr a ¸ivcn codc word is canonica¦ can ¦c simj¦cr,tastcr
than constructin¸ a canonica¦ codc word trom scratch
Christian Borgelt Frequent Pattern Mining 36
Searching with the Preﬁx Property
Principle of a Search Algorithm based on the Preﬁx Property:
• Base Loop:
◦ Travcrsc a¦¦ jossi¦¦c itcms, that is,
thc canonica¦ codc words ot a¦¦ oncc¦cmcnt itcm scts
◦ lccursivc¦y jroccss cach codc word that dcscri¦cs a trcqucnt itcm sct
• Recursive Processing:
Ior a ¸ivcn (canonica¦) codc word ot a trcqucnt itcm sct
◦ Gcncratc a¦¦ jossi¦¦c cxtcnsions ¦y onc itcm
This is donc ¦y simj¦y appending the item to thc codc word
◦ Chcck whcthcr thc cxtcndcd codc word is thc canonical code word
ot thc itcm sct that is dcscri¦cd ¦y thc cxtcndcd codc word
(and, ot coursc, whcthcr thc dcscri¦cd itcm sct is trcqucnt)
lt it is, jroccss thc cxtcndcd codc word rccursivc¦y, othcrwisc discard it
Christian Borgelt Frequent Pattern Mining 37
Searching with the Preﬁx Property: Examples
• Sujjosc thc itcm ¦asc is B ¦a, b, c, d, e¦ and ¦ct us assumc that
wc simj¦y usc thc a¦jha¦ctica¦ ordcr to dcﬁnc a canonica¦ torm (as ¦ctorc)
• Considcr thc rccursivc jroccssin¸ ot thc codc word acd
(this codc word is canonica¦, ¦ccausc its ¦cttcrs arc in a¦jha¦ctica¦ ordcr)
◦ Sincc acd contains ncithcr b nor e, its cxtcnsions arc acdb and acde
◦ Thc codc word acdb is not canonica¦ and thus it is discardcd
(¦ccausc d > b — notc that it suﬃccs to comjarc thc ¦ast two ¦cttcrs)
◦ Thc codc word acde is canonica¦ and thcrctorc it is jroccsscd rccursivc¦y
• Considcr thc rccursivc jroccssin¸ ot thc codc word bc
◦ Thc cxtcndcd codc words arc bca, bcd and bce
◦ bca is not canonica¦ and thus discardcd
bcd and bce arc canonica¦ and thcrctorc jroccsscd rccursivc¦y
Christian Borgelt Frequent Pattern Mining 38
Searching with the Preﬁx Property
Exhaustive Search
• Thc preﬁx property is a ncccssary condition tor cnsurin¸
that a¦¦ canonica¦ codc words can ¦c constructcd in thc scarch
¦y ajjcndin¸ cxtcnsions (itcms) to visitcd canonica¦ codc words
• Sujjosc thc jrcﬁx jrojcrty wou¦d not ho¦d Thcn
◦ Thcrc cxist a canonica¦ codc word w and a jrcﬁx v ot w,
such that v is not a canonica¦ codc word
◦ Iormin¸ w ¦y rcjcatcd¦y ajjcndin¸ itcms must torm v ﬁrst
(othcrwisc thc jrcﬁx wou¦d diﬀcr)
◦ \hcn v is constructcd in thc scarch, it is discardcd,
¦ccausc it is not canonica¦
◦ As a conscqucncc, thc canonica¦ codc word w can ncvcr ¦c rcachcd
⇒ Thc simj¦iﬁcd scarch schcmc can ¦c cxhaustivc on¦y it thc jrcﬁx jrojcrty ho¦ds
Christian Borgelt Frequent Pattern Mining 39
Searching with Canonical Forms
Straightforward Improvement of the Extension Step:
• Thc considcrcd canonica¦ torm ¦ists thc itcms in thc choscn itcm ordcr
⇒ lt thc addcd itcm succccds a¦¦ a¦rcady jrcscnt itcms in thc choscn ordcr,
thc rcsu¦t is in canonica¦ torm
∧ lt thc addcd itcm jrcccdcs any ot thc a¦rcady jrcscnt itcms in thc choscn ordcr,
thc rcsu¦t is not in canonica¦ torm
• As a conscqucncc, wc havc a vcry simj¦c canonical extension rule
(that is, a ru¦c that ¸cncratcs a¦¦ chi¦drcn and on¦y canonica¦ codc words)
• Ajj¦icd to thc Ajriori a¦¸orithm, this mcans that wc ¸cncratc candidatcs
ot sizc k + 1 ¦y com¦inin¸ two trcqucnt itcm scts f
1
¦i
1
, . . . , i
k−1
, i
k
¦
and f
2
¦i
1
, . . . , i
k−1
, i
′
k
¦ on¦y it i
k
< i
′
k
and ∀j, 1 ≤ j < k i
j
< i
j+1
`otc that it suﬃccs to comjarc thc ¦ast ¦cttcrs,itcms i
k
and i
′
k
it a¦¦ trcqucnt itcm scts arc rcjrcscntcd ¦y canonica¦ codc words
Christian Borgelt Frequent Pattern Mining 40
Searching with Canonical Forms
Final Search Algorithm based on Canonical Forms:
• Base Loop:
◦ Travcrsc a¦¦ jossi¦¦c itcms, that is,
thc canonica¦ codc words ot a¦¦ oncc¦cmcnt itcm scts
◦ lccursivc¦y jroccss cach codc word that dcscri¦cs a trcqucnt itcm sct
• Recursive Processing:
Ior a ¸ivcn (canonica¦) codc word ot a trcqucnt itcm sct
◦ Gcncratc a¦¦ jossi¦¦c cxtcnsions ¦y a sin¸¦c itcm,
whcrc this itcm succccds thc ¦ast ¦cttcr (itcm) ot thc ¸ivcn codc word
This is donc ¦y simj¦y appending the item to thc codc word
◦ lt thc itcm sct dcscri¦cd ¦y thc rcsu¦tin¸ cxtcndcd codc word is trcqucnt,
jroccss thc codc word rccursivc¦y, othcrwisc discard it
• This scarch schcmc ¸cncratcs cach candidatc itcm sct at most once
Christian Borgelt Frequent Pattern Mining 41
Canonical Parents and Preﬁx Trees
• ltcm scts, whosc canonica¦ codc words sharc thc samc ¦on¸cst jrojcr jrcﬁx,
arc si¦¦in¸s, ¦ccausc thcy havc (¦y dcﬁnition) thc samc canonica¦ jarcnt
• This a¦¦ows us to rcjrcscnt thc canonica¦ jarcnt trcc as a preﬁx tree or trie
Canonica¦ jarcnt trcc,jrcﬁx trcc and jrcﬁx trcc with mcr¸cd si¦¦in¸s tor ﬁvc itcms
a b c d e
ab ac ad ae bc bd be cd ce de
abc abd abe acd ace ade bcd bce bde cde
abcd abce abde acde bcde
abcde
a b c d e
b c d e c d e d e e
c d e d e e d e e e
d e e e e
e
a b c d
b c d c d d
c d d d
d
Christian Borgelt Frequent Pattern Mining 42
Canonical Parents and Preﬁx Trees
a b c d e
ab ac ad ae bc bd be cd ce de
abc abd abe acd ace ade bcd bce bde cde
abcd abce abde acde bcde
abcde
a
b c
d
b
c d c d d
c d d d
d
A (tu¦¦) jrcﬁx trcc tor thc ﬁvc itcms a, b, c, d, e
• Lascd on a ¸¦o¦a¦ ordcr ot thc itcms (which can ¦c ar¦itrary)
• Thc itcm scts countcd in a nodc consist ot
◦ a¦¦ itcms ¦a¦c¦in¸ thc cd¸cs to thc nodc (common jrcﬁx) and
◦ onc itcm to¦¦owin¸ thc ¦ast cd¸c ¦a¦c¦ in thc itcm ordcr
Christian Borgelt Frequent Pattern Mining 43
Search Tree Pruning
ln ajj¦ications thc scarch trcc tcnds to ¸ct vcry ¦ar¸c, so jrunin¸ is nccdcd
• Structural Pruning:
◦ Lxtcnsions ¦ascd on canonica¦ codc words rcmovc sujcrﬂuous jaths
◦ Lxj¦ains thc un¦a¦anccd structurc ot thc tu¦¦ jrcﬁx trcc
• Support Based Pruning:
◦ No superset of an infrequent item set can be frequent.
(apriori property)
◦ `o countcrs tor itcm scts havin¸ an intrcqucnt su¦sct arc nccdcd
• Size Based Pruning:
◦ lrunc thc trcc it a ccrtain dcjth (a ccrtain sizc ot thc itcm scts) is rcachcd
◦ ldca Scts with too many itcms can ¦c diﬃcu¦t to intcrjrct
Christian Borgelt Frequent Pattern Mining 44
The Order of the Items
• Thc structurc ot thc (structura¦¦y jruncd) jrcﬁx trcc
o¦vious¦y dcjcnds on thc choscn ordcr ot thc itcms
• ln jrincij¦c, thc ordcr is ar¦itrary (that is, any ordcr can ¦c uscd)
Lowcvcr, thc num¦cr and thc sizc ot thc nodcs that arc visitcd in thc scarch
diﬀcrs considcra¦¦y dcjcndin¸ on thc ordcr
As a conscqucncc, thc cxccution timcs ot trcqucnt itcm sct minin¸ a¦¸orithms
can diﬀcr considcra¦¦y dcjcndin¸ on thc itcm ordcr
• \hich ordcr ot thc itcms is ¦cst (¦cads to thc tastcst scarch)
can dcjcnd on thc trcqucnt itcm sct minin¸ a¦¸orithm uscd
Advanccd mcthods cvcn adajt thc ordcr ot thc itcms durin¸ thc scarch
(that is, usc diﬀcrcnt, ¦ut “comjati¦¦c” ordcrs in diﬀcrcnt ¦ranchcs)
• Lcuristics tor choosin¸ an itcm ordcr arc usua¦¦y ¦ascd
on (conditiona¦) indcjcndcncc assumjtions
Christian Borgelt Frequent Pattern Mining 45
The Order of the Items
Heuristics for Choosing the Item Order
• Basic Idea: independence assumption
lt is j¦ausi¦¦c that trcqucnt itcm scts consist ot trcqucnt itcms
◦ Sort thc itcms wrt thcir sujjort (trcqucncy ot occurrcncc)
◦ Sort dcsccndin¸¦y lrcﬁx trcc has tcwcr, ¦ut ¦ar¸cr nodcs
◦ Sort asccndin¸¦y lrcﬁx trcc has morc, ¦ut sma¦¦cr nodcs
• Extension of this Idea:
Sort itcms wrt thc sum ot thc sizcs ot thc transactions that covcr thcm
◦ ldca thc sum ot transaction sizcs a¦so cajturcs imj¦icit¦y thc trcqucncy
ot jairs, trij¦cts ctc (thou¸h, ot coursc, on¦y to somc dc¸rcc)
◦ Lmjirica¦ cvidcncc ¦cttcr jcrtormancc than simj¦c trcqucncy sortin¸
Christian Borgelt Frequent Pattern Mining 46
Searching the Preﬁx Tree
a b c d e
b c d e c d e d e e
c d e d e e d e e e
d e e e e
e
a b c d
b c d c d d
c d d d
d
a b c d e
b c d e c d e d e e
c d e d e e d e e e
d e e e e
e
a b c d
b c d c d d
c d d d
d
• Apriori ◦ Lrcadthﬁrst,¦cvc¦wisc scarch (itcm scts ot samc sizc)
◦ Su¦scts tcsts on transactions to ﬁnd thc sujjort ot itcm scts
• Eclat ◦ Lcjthﬁrst scarch (itcm scts with samc jrcﬁx)
◦ lntcrscction ot transaction ¦ists to ﬁnd thc sujjort ot itcm scts
Christian Borgelt Frequent Pattern Mining 47
Searching the Preﬁx Tree Levelwise
(Ajriori A¦¸orithm lcvisitcd)
Christian Borgelt Frequent Pattern Mining 48
Apriori: Basic Ideas
• Thc itcm scts arc chcckcd in thc order of increasing size
(breadthﬁrst/levelwise traversal ot thc jrcﬁx trcc)
• Thc canonica¦ torm ot itcm scts and thc induccd jrcﬁx trcc arc uscd
to cnsurc that cach candidatc itcm sct is ¸cncratcd at most oncc
• Thc a¦rcady ¸cncratcd ¦cvc¦s arc uscd to cxccutc a priori jrunin¸
ot thc candidatc itcm scts (usin¸ thc apriori property)
(a priori: ¦ctorc acccssin¸ thc transaction data¦asc to dctcrminc thc sujjort)
• Transactions arc rcjrcscntcd as simj¦c arrays ot itcms
(soca¦¦cd horizontal transaction representation, scc a¦so ¦c¦ow)
• Thc sujjort ot a candidatc itcm sct is comjutcd
¦y chcckin¸ whcthcr thcy arc su¦scts ot a transaction
or ¦y ¸cncratin¸ and ﬁndin¸ su¦scts ot a transaction
Christian Borgelt Frequent Pattern Mining 49
Apriori: Levelwise Search
1 ¦a, d, e¦
2 ¦b, c, d¦
3 ¦a, c, e¦
! ¦a, c, d, e¦
` ¦a, e¦
o ¦a, c, d¦
¨ ¦b, c¦
S ¦a, c, d, e¦
9 ¦b, c, e¦
10 ¦a, d, e¦
a : 7 b : 3 c : 7 d : 6 e : 7
• Lxamj¦c transaction data¦asc with ` itcms and 10 transactions
• `inimum sujjort 30/, that is, at ¦cast 3 transactions must contain thc itcm sct
• A¦¦ onc itcm scts arc trcqucnt → tu¦¦ sccond ¦cvc¦ is nccdcd
Christian Borgelt Frequent Pattern Mining 50
Apriori: Levelwise Search
1 ¦a, d, e¦
2 ¦b, c, d¦
3 ¦a, c, e¦
! ¦a, c, d, e¦
` ¦a, e¦
o ¦a, c, d¦
¨ ¦b, c¦
S ¦a, c, d, e¦
9 ¦b, c, e¦
10 ¦a, d, e¦
a : 7 b : 3 c : 7 d : 6 e : 7
a
b c
d
b : 0 c : 4 d : 5 e : 6 c : 3 d : 1 e : 1 d : 4 e : 4 e : 4
• Lctcrminin¸ thc sujjort ot itcm scts Ior cach itcm sct travcrsc thc data¦asc
and count thc transactions that contain it (hi¸h¦y incﬃcicnt)
• Lcttcr Travcrsc thc trcc tor cach transaction and ﬁnd thc itcm scts it contains
(cﬃcicnt can ¦c imj¦cmcntcd as a simj¦c dou¦¦y rccursivc jroccdurc)
Christian Borgelt Frequent Pattern Mining 51
Apriori: Levelwise Search
1 ¦a, d, e¦
2 ¦b, c, d¦
3 ¦a, c, e¦
! ¦a, c, d, e¦
` ¦a, e¦
o ¦a, c, d¦
¨ ¦b, c¦
S ¦a, c, d, e¦
9 ¦b, c, e¦
10 ¦a, d, e¦
a : 7 b : 3 c : 7 d : 6 e : 7
a
b c
d
b : 0 c : 4 d : 5 e : 6 c : 3 d : 1 e : 1 d : 4 e : 4 e : 4
• `inimum sujjort 30/, that is, at ¦cast 3 transactions must contain thc itcm sct
• lntrcqucnt itcm scts ¦a, b¦, ¦b, d¦, ¦b, e¦
• Thc su¦trccs startin¸ at thcsc itcm scts can ¦c jruncd
Christian Borgelt Frequent Pattern Mining 52
Apriori: Levelwise Search
1 ¦a, d, e¦
2 ¦b, c, d¦
3 ¦a, c, e¦
! ¦a, c, d, e¦
` ¦a, e¦
o ¦a, c, d¦
¨ ¦b, c¦
S ¦a, c, d, e¦
9 ¦b, c, e¦
10 ¦a, d, e¦
a : 7 b : 3 c : 7 d : 6 e : 7
a
b c
d
b : 0 c : 4 d : 5 e : 6 c : 3 d : 1 e : 1 d : 4 e : 4 e : 4
c d c d
d : ? e : ? e : ? d : ? e : ? e : ?
• Gcncratc candidatc itcm scts with 3 itcms (jarcnts must ¦c trcqucnt)
• Lctorc countin¸, chcck whcthcr thc candidatcs contain an intrcqucnt itcm sct
◦ An itcm sct with k itcms has k su¦scts ot sizc k −1
◦ Thc jarcnt itcm sct is on¦y onc ot thcsc su¦scts
Christian Borgelt Frequent Pattern Mining 53
Apriori: Levelwise Search
1 ¦a, d, e¦
2 ¦b, c, d¦
3 ¦a, c, e¦
! ¦a, c, d, e¦
` ¦a, e¦
o ¦a, c, d¦
¨ ¦b, c¦
S ¦a, c, d, e¦
9 ¦b, c, e¦
10 ¦a, d, e¦
a : 7 b : 3 c : 7 d : 6 e : 7
a
b c
d
b : 0 c : 4 d : 5 e : 6 c : 3 d : 1 e : 1 d : 4 e : 4 e : 4
c d c d
d : ? e : ? e : ? d : ? e : ? e : ?
• Thc itcm scts ¦b, c, d¦ and ¦b, c, e¦ can ¦c jruncd, ¦ccausc
◦ ¦b, c, d¦ contains thc intrcqucnt itcm sct ¦b, d¦ and
◦ ¦b, c, e¦ contains thc intrcqucnt itcm sct ¦b, e¦
Christian Borgelt Frequent Pattern Mining 54
Apriori: Levelwise Search
1 ¦a, d, e¦
2 ¦b, c, d¦
3 ¦a, c, e¦
! ¦a, c, d, e¦
` ¦a, e¦
o ¦a, c, d¦
¨ ¦b, c¦
S ¦a, c, d, e¦
9 ¦b, c, e¦
10 ¦a, d, e¦
a : 7 b : 3 c : 7 d : 6 e : 7
a
b c
d
b : 0 c : 4 d : 5 e : 6 c : 3 d : 1 e : 1 d : 4 e : 4 e : 4
c d c d
d : 3 e : 3 e : 4 d : ? e : ? e : 2
• On¦y thc rcmainin¸ tour itcm scts ot sizc 3 arc cva¦uatcd
• `o othcr itcm scts ot sizc 3 can ¦c trcqucnt
Christian Borgelt Frequent Pattern Mining 55
Apriori: Levelwise Search
1 ¦a, d, e¦
2 ¦b, c, d¦
3 ¦a, c, e¦
! ¦a, c, d, e¦
` ¦a, e¦
o ¦a, c, d¦
¨ ¦b, c¦
S ¦a, c, d, e¦
9 ¦b, c, e¦
10 ¦a, d, e¦
a : 7 b : 3 c : 7 d : 6 e : 7
a
b c
d
b : 0 c : 4 d : 5 e : 6 c : 3 d : 1 e : 1 d : 4 e : 4 e : 4
c d c d
d : 3 e : 3 e : 4 d : ? e : ? e : 2
• `inimum sujjort 30/, that is, at ¦cast 3 transactions must contain thc itcm sct
• lntrcqucnt itcm sct ¦c, d, e¦
Christian Borgelt Frequent Pattern Mining 56
Apriori: Levelwise Search
1 ¦a, d, e¦
2 ¦b, c, d¦
3 ¦a, c, e¦
! ¦a, c, d, e¦
` ¦a, e¦
o ¦a, c, d¦
¨ ¦b, c¦
S ¦a, c, d, e¦
9 ¦b, c, e¦
10 ¦a, d, e¦
a : 7 b : 3 c : 7 d : 6 e : 7
a
b c
d
b : 0 c : 4 d : 5 e : 6 c : 3 d : 1 e : 1 d : 4 e : 4 e : 4
c d c d
d : 3 e : 3 e : 4 d : ? e : ? e : 2
d
e : ?
• Gcncratc candidatc itcm scts with ! itcms (jarcnts must ¦c trcqucnt)
• Lctorc countin¸, chcck whcthcr thc candidatcs contain an intrcqucnt itcm sct
Christian Borgelt Frequent Pattern Mining 57
Apriori: Levelwise Search
1 ¦a, d, e¦
2 ¦b, c, d¦
3 ¦a, c, e¦
! ¦a, c, d, e¦
` ¦a, e¦
o ¦a, c, d¦
¨ ¦b, c¦
S ¦a, c, d, e¦
9 ¦b, c, e¦
10 ¦a, d, e¦
a : 7 b : 3 c : 7 d : 6 e : 7
a
b c
d
b : 0 c : 4 d : 5 e : 6 c : 3 d : 1 e : 1 d : 4 e : 4 e : 4
c d c d
d : 3 e : 3 e : 4 d : ? e : ? e : 2
d
e : ?
• Thc itcm sct ¦a, c, d, e¦ can ¦c jruncd,
¦ccausc it contains thc intrcqucnt itcm sct ¦c, d, e¦
• Conscqucncc `o candidatc itcm scts with tour itcms
• Iourth acccss to thc transaction data¦asc is not ncccssary
Christian Borgelt Frequent Pattern Mining 58
Apriori: Node Organization 1
ldca Ojtimizc thc or¸anization ot thc countcrs and thc chi¦d jointcrs
Direct Indexing:
• Lach nodc is a simj¦c vcctor (array) ot countcrs
• An itcm is uscd as a dircct indcx to ﬁnd thc countcr
• Advanta¸c Countcr acccss is cxtrcmc¦y tast
• Lisadvanta¸c `cmory usa¸c can ¦c hi¸h duc to “¸ajs” in thc indcx sjacc
Sorted Vectors:
• Lach nodc is a vcctor (array) ot itcm,countcr jairs
• A ¦inary scarch is ncccssary to ﬁnd thc countcr tor an itcm
• Advanta¸c `cmory usa¸c may ¦c sma¦¦cr, no unncccssary countcrs
• Lisadvanta¸c Countcr acccss is s¦owcr duc to thc ¦inary scarch
Christian Borgelt Frequent Pattern Mining 59
Apriori: Node Organization 2
Hash Tables:
• Lach nodc is a vcctor (array) ot itcm,countcr jairs (c¦oscd hashin¸)
• Thc indcx ot a countcr is comjutcd trom thc itcm codc
• Advanta¸c Iastcr countcr acccss than with ¦inary scarch
• Lisadvanta¸c Li¸hcr mcmory usa¸c than sortcd vcctors (jairs, ﬁ¦¦ ratc)
Thc ordcr ot thc itcms cannot ¦c cxj¦oitcd
Child Pointers:
• Thc dccjcst ¦cvc¦ ot thc itcm sct trcc docs not nccd chi¦d jointcrs
• Icwcr chi¦d jointcrs than countcrs arc nccdcd
→ lt jays to rcjrcscnt thc chi¦d jointcrs in a scjaratc array
• Thc sortcd array ot itcm,countcr jairs can ¦c rcuscd tor a ¦inary scarch
Christian Borgelt Frequent Pattern Mining 60
Apriori: Item Coding
• ltcms arc codcd as consccutivc intc¸crs startin¸ with 0
(nccdcd tor thc dircct indcxin¸ ajjroach)
• Thc sizc and thc num¦cr ot thc “¸ajs” in thc indcx sjacc
dcjcnds on how thc itcms arc codcd
• ldca lt is j¦ausi¦¦c that trcqucnt itcm scts consist ot trcqucnt itcms
◦ Sort thc itcms wrt thcir trcqucncy (¸rouj trcqucnt itcms)
◦ Sort dcsccndin¸¦y jrcﬁx trcc has tcwcr nodcs
◦ Sort asccndin¸¦y thcrc arc tcwcr and sma¦¦cr indcx “¸ajs”
◦ Lmjirica¦ cvidcncc sortin¸ asccndin¸¦y is ¦cttcr
• Lxtcnsion Sort itcms wrt thc sum ot thc sizcs
ot thc transactions that covcr thcm
◦ Lmjirica¦ cvidcncc ¦cttcr than simj¦c itcm trcqucncics
Christian Borgelt Frequent Pattern Mining 61
Apriori: Recursive Counting
• Thc itcms in a transaction arc sortcd (asccndin¸ itcm codcs)
• lroccssin¸ a transaction is a doubly recursive procedure
To jroccss a transaction tor a nodc ot thc itcm sct trcc
◦ Go to thc chi¦d corrcsjondin¸ to thc ﬁrst itcm in thc transaction and
count thc rcst ot thc transaction rccursivc¦y tor that chi¦d
(ln thc currcnt¦y dccjcst ¦cvc¦ ot thc trcc wc incrcmcnt thc countcr
corrcsjondin¸ to thc itcm instcad ot ¸oin¸ to thc chi¦d nodc)
◦ Liscard thc ﬁrst itcm ot thc transaction and
jroccss it rccursivc¦y tor thc nodc itsc¦t
• Ojtimizations
◦ Lircct¦y skij a¦¦ itcms jrcccdin¸ thc ﬁrst itcm in thc nodc
◦ A¦ort thc rccursion it thc ﬁrst itcm is ¦cyond thc ¦ast onc in thc nodc
◦ A¦ort thc rccursion it a transaction is too short to rcach thc dccjcst ¦cvc¦
Christian Borgelt Frequent Pattern Mining 62
Apriori: Recursive Counting
a : 7 b : 3 c : 7 d : 6 e : 7
b : 0 c : 4 d : 5 e : 6 c : 3 d : 1 e : 1 d : 4 e : 4 e : 4
a
b c
d
c d c d
d : e : e : d : ? e : ? e : 0 0 0 0
c d e a
a
transaction
to count
¦a, c, d, e¦
currcnt
itcm sct sizc 3
a : 7 b : 3 c : 7 d : 6 e : 7
b : 0 c : 4 d : 5 e : 6 c : 3 d : 1 e : 1 d : 4 e : 4 e : 4
a
b c
d
c d c d
d : e : e : d : ? e : ? e : 0 0 0 0
d e c
c
c d e
jroccssin¸ a
jroccssin¸ c
Christian Borgelt Frequent Pattern Mining 63
Apriori: Recursive Counting
a : 7 b : 3 c : 7 d : 6 e : 7
b : 0 c : 4 d : 5 e : 6 c : 3 d : 1 e : 1 d : 4 e : 4 e : 4
a
b c
d
c d c d
d : e : e : d : ? e : ? e : 1 1 0 0
d e
d e
c d e
jroccssin¸ a
jroccssin¸ c
jroccssin¸ d e
a : 7 b : 3 c : 7 d : 6 e : 7
b : 0 c : 4 d : 5 e : 6 c : 3 d : 1 e : 1 d : 4 e : 4 e : 4
a
b c
d
c d c d
d : e : e : d : ? e : ? e : 1 1 0 0
e d
d
c d e
jroccssin¸ a
jroccssin¸ d
Christian Borgelt Frequent Pattern Mining 64
Apriori: Recursive Counting
a : 7 b : 3 c : 7 d : 6 e : 7
b : 0 c : 4 d : 5 e : 6 c : 3 d : 1 e : 1 d : 4 e : 4 e : 4
a
b c
d
c d c d
d : e : e : d : ? e : ? e : 1 1 1 0
e
e
c d e
jroccssin¸ a
jroccssin¸ d
jroccssin¸ e
a : 7 b : 3 c : 7 d : 6 e : 7
b : 0 c : 4 d : 5 e : 6 c : 3 d : 1 e : 1 d : 4 e : 4 e : 4
a
b c
d
c d c d
d : e : e : d : ? e : ? e : 1 1 1 0
e
c d e
jroccssin¸ a
jroccssin¸ e
(skijjcd
too tcw itcms)
Christian Borgelt Frequent Pattern Mining 65
Apriori: Recursive Counting
a : 7 b : 3 c : 7 d : 6 e : 7
b : 0 c : 4 d : 5 e : 6 c : 3 d : 1 e : 1 d : 4 e : 4 e : 4
a
b c
d
c d c d
d : e : e : d : ? e : ? e : 1 1 1 0
d e c
c
jroccssin¸ c
a : 7 b : 3 c : 7 d : 6 e : 7
b : 0 c : 4 d : 5 e : 6 c : 3 d : 1 e : 1 d : 4 e : 4 e : 4
a
b c
d
c d c d
d : e : e : d : ? e : ? e : 1 1 1 0
e d
d
d e
jroccssin¸ c
jroccssin¸ d
Christian Borgelt Frequent Pattern Mining 66
Apriori: Recursive Counting
a : 7 b : 3 c : 7 d : 6 e : 7
b : 0 c : 4 d : 5 e : 6 c : 3 d : 1 e : 1 d : 4 e : 4 e : 4
a
b c
d
c d c d
d : e : e : d : ? e : ? e : 1 1 1 1
e
e
d e
jroccssin¸ c
jroccssin¸ d
jroccssin¸ e
a : 7 b : 3 c : 7 d : 6 e : 7
b : 0 c : 4 d : 5 e : 6 c : 3 d : 1 e : 1 d : 4 e : 4 e : 4
a
b c
d
c d c d
d : e : e : d : ? e : ? e : 1 1 1 1
e
d e
jroccssin¸ c
jroccssin¸ e
(skijjcd
too tcw itcms)
Christian Borgelt Frequent Pattern Mining 67
Apriori: Recursive Counting
a : 7 b : 3 c : 7 d : 6 e : 7
b : 0 c : 4 d : 5 e : 6 c : 3 d : 1 e : 1 d : 4 e : 4 e : 4
a
b c
d
c d c d
d : e : e : d : ? e : ? e : 1 1 1 1
e d
jroccssin¸ d
(skijjcd
too tcw itcms)
• lroccssin¸ an itcm sct in a nodc is casi¦y imj¦cmcntcd as a simj¦c ¦ooj
• Ior cach itcm thc rcmainin¸ suﬃx is jroccsscd in thc corrcsjondin¸ chi¦d
• lt thc (currcnt¦y) dccjcst trcc ¦cvc¦ is rcachcd,
countcrs arc incrcmcntcd tor cach itcm in thc transaction
• lt thc rcmainin¸ transaction (suﬃx) is too short to rcach
thc (currcnt¦y) dccjcst ¦cvc¦, jroccssin¸ is tcrminatcd
Christian Borgelt Frequent Pattern Mining 68
Apriori: Transaction Representation
Direct Representation:
• Lach transaction is rcjrcscntcd as an array ot itcms
• Thc transactions arc storcd in a simj¦c ¦ist or array
Organization as a Preﬁx Tree:
• Thc itcms in cach transaction arc sortcd (ar¦itrary, ¦ut ﬁxcd ordcr)
• Transactions with thc samc jrcﬁx arc ¸roujcd to¸cthcr
• Advanta¸c a common jrcﬁx is jroccsscd on¦y oncc
• Gains trom this or¸anization dcjcnd on how thc itcms arc codcd
◦ Common transaction jrcﬁxcs arc morc ¦ikc¦y
it thc itcms arc sortcd with dcsccndin¸ trcqucncy
◦ Lowcvcr an asccndin¸ ordcr is ¦cttcr tor thc scarch
and this dominatcs thc cxccution timc
Christian Borgelt Frequent Pattern Mining 69
Apriori: Transactions as a Preﬁx Tree
transaction
data¦asc
a, d, e
b, c, d
a, c, e
a, c, d, e
a, e
a, c, d
b, c
a, c, d, e
b, c, e
a, d, e
¦cxico¸rajhica¦¦y
sortcd
a, c, d
a, c, d, e
a, c, d, e
a, c, e
a, d, e
a, d, e
a, e
b, c
b, c, d
b, c, e
preﬁx tree
representation
a
b
c
d
e
c
d
e
e
d
e
e
: 7
: 3
: 4
: 2
: 1
: 3
: 3
: 1
: 2
: 1
: 1
: 2
• ltcms in transactions arc sortcd wrt somc ar¦itrary ordcr,
transactions arc sortcd ¦cxico¸rajhica¦¦y, thcn a jrcﬁx trcc is constructcd
• Advantage: idcntica¦ transaction jrcﬁxcs arc jroccsscd on¦y oncc
Christian Borgelt Frequent Pattern Mining 70
Summary Apriori
Basic Processing Scheme
• Lrcadthﬁrst,¦cvc¦wisc travcrsa¦ ot thc jartia¦¦y ordcrcd sct (2
B
, ⊆)
• Candidatcs arc tormcd ¦y mcr¸in¸ itcm scts that diﬀcr in on¦y onc itcm
• Sujjort countin¸ can ¦c donc with a dou¦¦y rccursivc jroccdurc
Advantages
• “lcrtcct” jrunin¸ ot intrcqucnt candidatc itcm scts (with intrcqucnt su¦scts)
Disadvantages
• Can rcquirc a ¦ot ot mcmory (sincc a¦¦ trcqucnt itcm scts arc rcjrcscntcd)
• Sujjort countin¸ takcs vcry ¦on¸ tor ¦ar¸c transactions
Software
• http://www.borgelt.net/apriori.html
Christian Borgelt Frequent Pattern Mining 71
Searching the Preﬁx Tree DepthFirst
(Lc¦at, Il¸rowth and othcr a¦¸orithms)
Christian Borgelt Frequent Pattern Mining 72
DepthFirst Search and Conditional Databases
• A dcjthﬁrst scarch can a¦so ¦c sccn as a divideandconquer scheme
Iirst ﬁnd a¦¦ trcqucnt itcm scts that contain a choscn itcm,
thcn a¦¦ trcqucnt itcm scts that do not contain it
• Gcncra¦ scarch jroccdurc
◦ Lct thc itcm ordcr ¦c a < b < c <
◦ lcstrict thc transaction data¦asc to thosc transactions that contain a
This is thc conditional database for the preﬁx a
lccursivc¦y scarch this conditiona¦ data¦asc tor trcqucnt itcm scts
and add thc jrcﬁx a to a¦¦ trcqucnt itcm scts tound in thc rccursion
◦ lcmovc thc itcm a trom thc transactions in thc full transaction data¦asc
This is thc conditional database for item sets without a
lccursivc¦y scarch this conditiona¦ data¦asc tor trcqucnt itcm scts
• \ith this schcmc on¦y trcqucnt oncc¦cmcnt itcm scts havc to ¦c dctcrmincd
Lar¸cr itcm scts rcsu¦t trom addin¸ jossi¦¦c jrcﬁxcs
Christian Borgelt Frequent Pattern Mining 73
DepthFirst Search and Conditional Databases
a b c d e
ab ac ad ae bc bd be cd ce de
abc abd abe acd ace ade bcd bce bde cde
abcd abce abde acde bcde
abcde
a
b c
d
b
c d c d d
c d d d
d
sj¦it into su¦jro¦¦cms wrt itcm a
• ¦¦uc itcm sct containin¸ only itcm a
¸rccn itcm scts containin¸ itcm a (and at ¦cast onc othcr itcm)
rcd itcm scts not containin¸ itcm a (¦ut at ¦cast onc othcr itcm)
• ¸rccn conditiona¦ data¦asc with transactions containin¸ itcm a
rcd conditiona¦ data¦asc with a¦¦ transactions, ¦ut with itcm a rcmovcd
Christian Borgelt Frequent Pattern Mining 74
DepthFirst Search and Conditional Databases
a b c d e
ab ac ad ae bc bd be cd ce de
abc abd abe acd ace ade bcd bce bde cde
abcd abce abde acde bcde
abcde
a
b c
d
b
c d c d d
c d d d
d
sj¦it into su¦jro¦¦cms wrt itcm b
• ¦¦uc itcm scts ¦a¦ and ¦a, b¦
¸rccn itcm scts containin¸ itcms a and b (and at ¦cast onc othcr itcm)
rcd itcm scts containin¸ itcm a (and at ¦cast onc othcr itcm), ¦ut not itcm b
• ¸rccn data¦asc with transactions containin¸ ¦oth itcms a and b
rcd data¦asc with transactions containin¸ itcm a, ¦ut with itcm b rcmovcd
Christian Borgelt Frequent Pattern Mining 75
DepthFirst Search and Conditional Databases
a b c d e
ab ac ad ae bc bd be cd ce de
abc abd abe acd ace ade bcd bce bde cde
abcd abce abde acde bcde
abcde
a
b c
d
b
c d c d d
c d d d
d
sj¦it into su¦jro¦¦cms wrt itcm b
• ¦¦uc itcm sct containin¸ only itcm b
¸rccn itcm scts containin¸ itcm b (and at ¦cast onc othcr itcm), ¦ut not itcm a
rcd itcm scts containin¸ ncithcr itcm a nor b (¦ut at ¦cast onc othcr itcm)
• ¸rccn data¦asc with transactions containin¸ itcm b, ¦ut not itcm a
rcd data¦asc with a¦¦ transactions, ¦ut with itcms a and b rcmovcd
Christian Borgelt Frequent Pattern Mining 76
Formal Description of the DivideandConquer Scheme
• Gcncra¦¦y, a dividcandconqucr schcmc can ¦c dcscri¦cd as a sct ot (su¦)jro¦¦cms
◦ Thc initia¦ (su¦)jro¦¦cm is thc actua¦ jro¦¦cm to so¦vc
◦ A su¦jro¦¦cm is jroccsscd ¦y sj¦ittin¸ it into sma¦¦cr su¦jro¦¦cms,
which arc thcn jroccsscd rccursivc¦y
• A¦¦ su¦jro¦¦cms that occur in trcqucnt itcm sct minin¸ can ¦c dcﬁncd ¦y
◦ a conditional transaction database and
◦ a preﬁx (ot itcms)
Thc jrcﬁx is a sct ot itcms that has to ¦c addcd to a¦¦ trcqucnt itcm scts
that arc discovcrcd in thc conditiona¦ transaction data¦asc
• Iorma¦¦y, a¦¦ su¦jro¦¦cms arc tuj¦cs S (D, P),
whcrc D is a conditiona¦ transaction data¦asc and P ⊆ B is a jrcﬁx
• Thc initia¦ jro¦¦cm, with which thc rccursion is startcd, is S (T, ∅),
whcrc T is thc transaction data¦asc to minc and thc jrcﬁx is cmjty
Christian Borgelt Frequent Pattern Mining 77
Formal Description of the DivideandConquer Scheme
A su¦jro¦¦cm S
0
(T
0
, P
0
) is jroccsscd as to¦¦ows
• Choosc an itcm i ∈ B
0
, whcrc B
0
is thc sct ot itcms occurrin¸ in T
0
• lt s
T
0
(i) ≥ s
min
(whcrc s
T
0
(i) is thc sujjort ot thc itcm i in T
0
)
◦ lcjort thc itcm sct P
0
∪ ¦i¦ as trcqucnt with thc sujjort s
T
0
(i)
◦ Iorm thc su¦jro¦¦cm S
1
(T
1
, P
1
) with P
1
P
0
∪ ¦i¦
T
1
comjriscs a¦¦ transactions in T
0
that contain thc itcm i,
¦ut with thc itcm i rcmovcd (and cmjty transactions rcmovcd)
◦ lt T
1
is not cmjty, jroccss S
1
rccursivc¦y
• ln any casc (that is, rc¸ard¦css ot whcthcr s
T
0
(i) ≥ s
min
or not)
◦ Iorm thc su¦jro¦¦cm S
2
(T
2
, P
2
), whcrc P
2
P
0
T
2
comjriscs a¦¦ transactions in T
0
(whcthcr thcy contain i or not),
¦ut a¸ain with thc itcm i rcmovcd (and cmjty transactions rcmovcd)
◦ lt T
2
is not cmjty, jroccss S
2
rccursivc¦y
Christian Borgelt Frequent Pattern Mining 78
DivideandConquer Recursion
Subproblem Tree
(T, ∅)
.
.
.
.
.
.
.
.
.
.
.
.
. .,
a












¸
a
(T
a
, ¦a¦)
Z
Z
Z
Z
Z »
b
`
`
`
`
`·
b
(T
a
, ∅)
Z
Z
Z
Z
Z»
b
`
`
`
`
`·
b
(T
ab
, ¦a, b¦)
/
/
/
/
/
/
/
/º
c
`
`
`
`
`*
c
(T
a
b
, ¦a¦)
/
/
/
/
/
/
/
/º
c
`
`
`
`
` *
c
(T
ab
, ¦b¦)
/
/
/
/
/
/
/
/ º
c
`
`
`
`
` *
c
(T
a
b
, ∅)
/
/
/
/
/
/
/
/ º
c
`
`
`
`
` *
c
(T
abc
, ¦a, b, c¦)
(T
ab c
, ¦a, b¦)
(T
a
bc
, ¦a, c¦)
(T
a
b c
, ¦a¦)
(T
abc
, ¦b, c¦)
(T
ab c
, ¦b¦)
(T
a
bc
, ¦c¦)
(T
a
b c
, ∅)
• Lranch to thc ¦ctt inc¦udc an itcm (ﬁrst su¦jro¦¦cm)
• Lranch to thc ri¸ht cxc¦udc an itcm (sccond su¦jro¦¦cm)
(ltcms in thc indiccs ot thc conditiona¦ transaction data¦ascs T havc ¦ccn rcmovcd trom thcm)
Christian Borgelt Frequent Pattern Mining 79
Perfect Extensions
Thc scarch can casi¦y ¦c imjrovcd with soca¦¦cd perfect extension pruning
• Lct T ¦c a transaction data¦asc ovcr an itcm ¦asc B
Givcn an itcm sct I, an itcm a / ∈ I is ca¦¦cd a perfect extension ot I wrt T,
iﬀ thc itcm scts I and I ∪ ¦a¦ havc thc samc sujjort s
T
(I) s
T
(I ∪ ¦a¦)
(that is, it a¦¦ transactions containin¸ thc itcm sct I a¦so contain thc itcm a)
• lcrtcct cxtcnsions havc thc to¦¦owin¸ jrojcrtics
◦ lt thc itcm a is a jcrtcct cxtcnsion ot an itcm sct I,
thcn a is a¦so a jcrtcct cxtcnsion ot any itcm sct J ⊇ I (as ¦on¸ as a / ∈ J)
This can most casi¦y ¦c sccn ¦y considcrin¸ that K
T
(I) ⊆ K
T
(¦a¦)
and hcncc K
T
(J) ⊆ K
T
(¦a¦), sincc K
T
(J) ⊆ K
T
(I)
◦ lt X
T
(I) is thc sct ot a¦¦ jcrtcct cxtcnsions ot an itcm sct I wrt T
(that is, it X
T
(I) ¦i ∈ B −I [ s
T
(I ∪ ¦i¦) s
T
(I)¦),
thcn a¦¦ scts I ∪ J with J ∈ 2
X
T
(I)
havc thc samc sujjort as I
(whcrc 2
M
dcnotcs thc jowcr sct ot a sct M)
Christian Borgelt Frequent Pattern Mining 80
Perfect Extensions: Examples
transaction data¦asc
1 ¦a, d, e¦
2 ¦b, c, d¦
3 ¦a, c, e¦
! ¦a, c, d, e¦
` ¦a, e¦
o ¦a, c, d¦
¨ ¦b, c¦
S ¦a, c, d, e¦
9 ¦b, c, e¦
10 ¦a, d, e¦
trcqucnt itcm scts
0 itcms 1 itcm 2 itcms 3 itcms
∅ 10 ¦a¦ ¨ ¦a, c¦ ! ¦a, c, d¦ 3
¦b¦ 3 ¦a, d¦ ` ¦a, c, e¦ 3
¦c¦ ¨ ¦a, e¦ o ¦a, d, e¦ !
¦d¦ o ¦b, c¦ 3
¦e¦ ¨ ¦c, d¦ !
¦c, e¦ !
¦d, e¦ !
• c is a jcrtcct cxtcnsion ot ¦b¦ as ¦b¦ and ¦b, c¦ ¦oth havc sujjort 3
• a is a jcrtcct cxtcnsion ot ¦d, e¦ as ¦d, e¦ and ¦a, d, e¦ ¦oth havc sujjort !
• Thcrc arc no othcr jcrtcct cxtcnsions in this cxamj¦c
tor a minimum sujjort ot s
min
3
Christian Borgelt Frequent Pattern Mining 81
Perfect Extension Pruning
• Considcr a¸ain thc ori¸ina¦ divideandconquer scheme
A su¦jro¦¦cm S
0
(T
0
, P
0
) is sj¦it into
◦ a su¦jro¦¦cm S
1
(T
1
, P
1
) to ﬁnd a¦¦ trcqucnt itcm scts
that contain an itcm i ∈ B
0
,
◦ a su¦jro¦¦cm S
2
(T
2
, P
2
) to ﬁnd a¦¦ trcqucnt itcm scts
that do not contain thc itcm i
• Sujjosc thc itcm i is a perfect extension ot thc jrcﬁx P
0
◦ Lct F
1
and F
2
¦c thc scts ot trcqucnt itcm scts
that arc rcjortcd whcn jroccssin¸ S
1
and S
2
, rcsjcctivc¦y
◦ lt is I ∪ ¦i¦ ∈ F
1
⇔ I ∈ F
2
◦ Thc rcason is that ¸cncra¦¦y P
1
P
2
∪ ¦i¦ and in this casc T
1
T
2
,
¦ccausc a¦¦ transaction in T
0
contain itcm i (i is a jcrtcct cxtcnsion)
• Thcrctorc it suﬃccs to so¦vc onc su¦jro¦¦cm (namc¦y S
2
)
and to construct thc so¦ution ot thc othcr (S
1
) ¦y addin¸ itcm i
Christian Borgelt Frequent Pattern Mining 82
Perfect Extension Pruning
• lcrtcct cxtcnsions can ¦c cxj¦oitcd ¦y co¦¦cctin¸ thcsc itcms in thc rccursion,
in a third c¦cmcnt ot a su¦jro¦¦cm dcscrijtion
• Iorma¦¦y, a su¦jro¦¦cm is a trij¦ct S (T, P, X), whcrc
◦ T is a conditional transaction database,
◦ P is thc sct ot preﬁx items tor T,
◦ X is thc sct ot perfect extension items
• Oncc idcntiﬁcd, jcrtcct cxtcnsion itcms arc no ¦on¸cr jroccsscd in thc rccursion,
¦ut arc on¦y uscd to ¸cncratc a¦¦ sujcrscts ot thc jrcﬁx havin¸ thc samc sujjort
Conscqucnt¦y, thcy arc rcmovcd trom thc conditiona¦ data¦ascs
This tcchniquc is a¦so known as hypercube decomposition
• Thc dividcandconqucr schcmc has ¦asica¦¦y thc samc structurc
as without jcrtcct cxtcnsion jrunin¸
Lowcvcr, thc cxact way in which jcrtcct cxtcnsions arc co¦¦cctcd
can dcjcnd on thc sjcciﬁc a¦¸orithm uscd
Christian Borgelt Frequent Pattern Mining 83
Reporting Frequent Item Sets
• \ith thc dcscri¦cd dividcandconqucr schcmc,
itcm scts arc rcjortcd in lexicographic order
• This can ¦c cxj¦oitcd tor eﬃcient item set reporting
◦ Thc jrcﬁx P is a strin¸, which is cxtcndcd whcn an itcm is addcd to P
◦ Thus on¦y onc itcm nccds to ¦c tormattcd jcr rcjortcd trcqucnt itcm sct,
thc rcst is a¦rcady tormattcd in thc strin¸
◦ Lacktrackin¸ thc scarch (rcturn trom rccursion)
rcmovcs an itcm trom thc jrcﬁx strin¸
◦ This schcmc can sjccd uj thc outjut considcra¦¦y
Lxamj¦c a (¨)
a c (!)
a c d (3)
a c e (3)
a d (`)
a d e (!)
a e (o)
b (3)
b c (3)
c (¨)
c d (!)
c e (!)
d (o)
d e (!)
e (¨)
Christian Borgelt Frequent Pattern Mining 84
Global and Local Item Order
• ¹j to now wc assumcd that thc itcm ordcr is (¸¦o¦a¦¦y) ﬁxcd,
and dctcrmincd at thc vcry ¦c¸innin¸ ¦ascd on hcuristics
• Lowcvcr, thc dcscri¦cd dividcandconqucr schcmc shows
that a ¸¦o¦a¦¦y ﬁxcd itcm ordcr is morc rcstrictivc than ncccssary
◦ Thc itcm uscd to sj¦it thc currcnt su¦jro¦¦cm can ¦c any itcm
that occurs in thc conditiona¦ transaction data¦asc ot thc su¦jro¦¦cm
◦ Thcrc is no nccd to choosc thc samc itcm tor sj¦ittin¸ si¦¦in¸ su¦jro¦¦cms
(as a ¸¦o¦a¦ itcm ordcr wou¦d rcquirc us to do)
◦ Thc samc hcuristics uscd tor dctcrminin¸ a ¸¦o¦a¦ itcm ordcr su¸¸cst
that thc sj¦it itcm tor a ¸ivcn su¦jro¦¦cm shou¦d ¦c sc¦cctcd trom
thc (conditiona¦¦y) most trcqucnt itcm(s)
• As a conscqucncc, thc itcm ordcrs may diﬀcr tor cvcry ¦ranch ot thc scarch trcc
◦ Lowcvcr, two su¦jro¦¦cms must sharc thc itcm ordcr that is ﬁxcd
¦y thc common jart ot thcir jaths trom thc root (initia¦ su¦jro¦¦cm)
Christian Borgelt Frequent Pattern Mining 85
Item Order: DivideandConquer Recursion
Subproblem Tree
(T, ∅)
.
.
.
.
.
.
.
.
.
.
.
.
. .,
a












¸
a
(T
a
, ¦a¦)
Z
Z
Z
Z
Z»
b
`
`
`
`
`·
b
(T
a
, ∅)
Z
Z
Z
Z
Z»
c
`
`
`
`
`·
c
(T
ab
, ¦a, b¦)
/
/
/
/
/
/
/
/ º
d
`
`
`
`
` *
d
(T
a
b
, ¦a¦)
/
/
/
/
/
/
/
/ º
e
`
`
`
`
` *
e
(T
ac
, ¦c¦)
/
/
/
/
/
/
/
/ º
f
`
`
`
`
` *
f
(T
a c
, ∅)
/
/
/
/
/
/
/
/ º
g
`
`
`
`
` *
g
(T
abd
, ¦a, b, d¦)
(T
ab
d
, ¦a, b¦)
(T
a
be
, ¦a, e¦)
(T
a
b e
, ¦a¦)
(T
acf
, ¦c, f¦)
(T
ac
f
, ¦c¦)
(T
a cg
, ¦g¦)
(T
a c g
, ∅)
• A¦¦ ¦oca¦ itcm ordcrs start with a < . . .
• A¦¦ su¦jro¦¦cms on thc ¦ctt sharc a < b < . . .,
A¦¦ su¦jro¦¦cms on thc ri¸ht sharc a < c < . . .
Christian Borgelt Frequent Pattern Mining 86
Global and Local Item Order
Loca¦ itcm ordcrs havc advanta¸cs and disadvanta¸cs
• Advantage
◦ ln somc data scts thc ordcr ot thc conditiona¦ itcm trcqucncics
diﬀcrs considcra¦¦y trom thc ¸¦o¦a¦ ordcr
◦ Such data scts can somctimcs ¦c jroccsscd si¸niﬁcant¦y tastcr
with ¦oca¦ itcm ordcrs (dcjcndin¸ on thc a¦¸orithm)
• Disadvantage
◦ Thc data structurc ot thc conditiona¦ data¦ascs must a¦¦ow us
to dctcrminc conditiona¦ itcm trcqucncics quick¦y
◦ `ot havin¸ a ¸¦o¦a¦¦y ﬁxcd itcm ordcr can makc it morc diﬃcu¦t
to dctcrminc conditiona¦ transaction data¦ascs wrt sj¦it itcms
(dcjcndin¸ on thc cmj¦oycd data structurc)
◦ Thc ¸ains trom thc ¦cttcr itcm ordcr may ¦c ¦ost a¸ain
duc to thc morc comj¦cx jroccssin¸ , conditionin¸ schcmc
Christian Borgelt Frequent Pattern Mining 87
Transaction Database Representation
Christian Borgelt Frequent Pattern Mining 88
Transaction Database Representation
• Lc¦at, Il¸rowth and scvcra¦ othcr trcqucnt itcm sct minin¸ a¦¸orithms
rc¦y on thc dcscri¦cd ¦asic dividcandconqucr schcmc
Thcy diﬀcr main¦y in how thcy rcjrcscnt thc conditiona¦ transaction data¦ascs
• Thc main ajjroachcs arc horizonta¦ and vcrtica¦ rcjrcscntations
◦ ln a horizontal representation, thc data¦asc is storcd as a ¦ist (or array)
ot transactions, cach ot which is a ¦ist (or array) ot thc itcms containcd in it
◦ ln a vertical representation, a data¦asc is rcjrcscntcd ¦y ﬁrst rctcrrin¸
with a ¦ist (or array) to thc diﬀcrcnt itcms Ior cach itcm a ¦ist (or array) ot
idcntiﬁcrs is storcd, which indicatc thc transactions that contain thc itcm
• Lowcvcr, this distinction is not jurc, sincc thcrc arc many a¦¸orithms
that usc a com¦ination ot thc two torms ot rcjrcscntin¸ a data¦asc
• Ircqucnt itcm sct minin¸ a¦¸orithms a¦so diﬀcr in
how thcy construct ncw conditiona¦ data¦ascs trom a ¸ivcn onc
Christian Borgelt Frequent Pattern Mining 89
Transaction Database Representation
• Thc Ajriori a¦¸orithm uscs a horizontal transaction representation
cach transaction is an array ot thc containcd itcms
◦ `otc that thc a¦tcrnativc jrcﬁx trcc or¸anization
is sti¦¦ an csscntia¦¦y horizontal rcjrcscntation
• Thc a¦tcrnativc is a vertical transaction representation
◦ Ior cach itcm a transaction list is crcatcd
◦ Thc transaction ¦ist ot itcm a indicatcs thc transactions that contain it,
that is, it rcjrcscnts its cover K
T
(¦a¦)
◦ Advanta¸c thc transaction ¦ist tor a jair ot itcms can ¦c comjutcd ¦y
intcrscctin¸ thc transaction ¦ists ot thc individua¦ itcms
◦ Gcncra¦¦y, a vcrtica¦ transaction rcjrcscntation can cxj¦oit
∀I, J ⊆ B K
T
(I ∪ J) K
T
(I) ∩ K
T
(J).
• A com¦incd rcjrcscntation is thc frequent pattern tree (to ¦c discusscd ¦atcr)
Christian Borgelt Frequent Pattern Mining 90
Transaction Database Representation
• Horizontal Representation: List itcms tor cach transaction
• Vertical Representation: List transactions tor cach itcm
horizonta¦ rcjrcscntation
1 a, d, e
2 b, c, d
3 a, c, e
! a, c, d, e
` a, e
o a, c, d
¨ b, c
S a, c, d, e
9 b, c, e
10 a, d, e
vcrtica¦ rcjrcscntation
a b c d e
1 2 2 1 1
3 ¨ 3 2 3
! 9 ! ! !
` o o `
o ¨ S S
S S 10 9
10 9 10
matrix rcjrcscntation
a b c d e
1 1 0 0 1 1
2 0 1 1 1 0
3 1 0 1 0 1
! 1 0 1 1 1
` 1 0 0 0 1
o 1 0 1 1 0
¨ 0 1 1 0 0
S 1 0 1 1 1
9 0 1 1 0 1
10 1 0 0 1 1
Christian Borgelt Frequent Pattern Mining 91
Transaction Database Representation
transaction
data¦asc
a, d, e
b, c, d
a, c, e
a, c, d, e
a, e
a, c, d
b, c
a, c, d, e
b, c, e
a, d, e
¦cxico¸rajhica¦¦y
sortcd
a, c, d
a, c, d, e
a, c, d, e
a, c, e
a, d, e
a, d, e
a, e
b, c
b, c, d
b, c, e
preﬁx tree
representation
a
b
c
d
e
c
d
e
e
d
e
e
: 7
: 3
: 4
: 2
: 1
: 3
: 3
: 1
: 2
: 1
: 1
: 2
• `otc that a jrcﬁx trcc rcjrcscntation is a comjrcsscd horizonta¦ rcjrcscntation
• Principle: cqua¦ jrcﬁxcs ot transactions arc mcr¸cd
• This is most cﬀcctivc it thc itcms arc sortcd dcsccndin¸¦y wrt thcir sujjort
Christian Borgelt Frequent Pattern Mining 92
The Eclat Algorithm
Zaki, larthasarathy, O¸ihara, and Li 199¨
Christian Borgelt Frequent Pattern Mining 93
Eclat: Basic Ideas
• Thc itcm scts arc chcckcd in lexicographic order
(depthﬁrst traversal ot thc jrcﬁx trcc)
• Thc scarch schcmc is thc samc as thc ¸cncra¦ schcmc tor scarchin¸
with canonica¦ torms havin¸ thc jrcﬁx jrojcrty and josscssin¸
a jcrtcct cxtcnsion ru¦c (¸cncratc on¦y canonica¦ cxtcnsions)
• Lc¦at ¸cncratcs morc candidatc itcm scts than Ajriori,
¦ccausc it (usua¦¦y) docs not storc thc sujjort ot a¦¦ visitcd itcm scts
∗
As a conscqucncc it cannot tu¦¦y cxj¦oit thc Ajriori jrojcrty tor jrunin¸
• Lc¦at uscs a jurc¦y vertical transaction representation
• `o su¦sct tcsts and no su¦sct ¸cncration arc nccdcd to comjutc thc sujjort
Thc sujjort ot itcm scts is rathcr dctcrmincd ¦y intcrscctin¸ transaction ¦ists
∗
`otc that Lc¦at cannot tu¦¦y cxj¦oit thc Ajriori jrojcrty, ¦ccausc it docs not store thc sujjort ot a¦¦
cxj¦orcd itcm scts, not ¦ccausc it cannot know it lt a¦¦ comjutcd sujjort va¦ucs wcrc storcd, it cou¦d
¦c imj¦cmcntcd in such a way that a¦¦ sujjort va¦ucs nccdcd tor tu¦¦ a priori jrunin¸ wcrc avai¦a¦¦c
Christian Borgelt Frequent Pattern Mining 94
Eclat: Subproblem Split
1
3
4
5
6
8
10
a
7
2
7
9
b
3
2
3
4
6
7
8
9
c
7
1
2
4
6
8
10
d
6
1
3
4
5
8
9
10
e
7
b
0
3
4
6
8
c
4
1
4
6
8
10
d
5
1
3
4
5
8
10
e
6
2
7
9
b
3
2
3
4
6
7
8
9
c
7
1
2
4
6
8
10
d
6
1
3
4
5
8
9
10
e
7
↑
Conditiona¦
data¦asc
tor jrcﬁx a
(1st su¦jro¦¦cm)
← Conditiona¦
data¦asc
with itcm a
rcmovcd
(2nd su¦jro¦¦cm)
a
7
b
3
c
7
d
6
e
7
b
0
c
4
d
5
e
6
b
3
c
7
d
6
e
7
↑
Conditiona¦
data¦asc
tor jrcﬁx a
(1st su¦jro¦¦cm)
← Conditiona¦
data¦asc
with itcm a
rcmovcd
(2nd su¦jro¦¦cm)
Christian Borgelt Frequent Pattern Mining 95
Eclat: DepthFirst Search
1 ¦a, d, e¦
2 ¦b, c, d¦
3 ¦a, c, e¦
! ¦a, c, d, e¦
` ¦a, e¦
o ¦a, c, d¦
¨ ¦b, c¦
S ¦a, c, d, e¦
9 ¦b, c, e¦
10 ¦a, d, e¦
a : 7 b : 3 c : 7 d : 6 e : 7
• Iorm a transaction ¦ist tor cach itcm Lcrc ¦it vcctor rcjrcscntation
◦ ¸rcy itcm is containcd in transaction
◦ whitc itcm is not containcd in transaction
• Transaction data¦asc is nccdcd on¦y oncc (tor thc sin¸¦c itcm transaction ¦ists)
Christian Borgelt Frequent Pattern Mining 96
Eclat: DepthFirst Search
1 ¦a, d, e¦
2 ¦b, c, d¦
3 ¦a, c, e¦
! ¦a, c, d, e¦
` ¦a, e¦
o ¦a, c, d¦
¨ ¦b, c¦
S ¦a, c, d, e¦
9 ¦b, c, e¦
10 ¦a, d, e¦
a : 7 b : 3 c : 7 d : 6 e : 7
a
b : 0 c : 4 d : 5 e : 6
• lntcrscct thc transaction ¦ist tor itcm a
with thc transaction ¦ists ot a¦¦ othcr itcms (conditional database tor itcm a)
• Count thc num¦cr ot ¦its that arc sct (num¦cr ot containin¸ transactions)
This yic¦ds thc sujjort ot a¦¦ itcm scts with thc jrcﬁx a
Christian Borgelt Frequent Pattern Mining 97
Eclat: DepthFirst Search
1 ¦a, d, e¦
2 ¦b, c, d¦
3 ¦a, c, e¦
! ¦a, c, d, e¦
` ¦a, e¦
o ¦a, c, d¦
¨ ¦b, c¦
S ¦a, c, d, e¦
9 ¦b, c, e¦
10 ¦a, d, e¦
a : 7 b : 3 c : 7 d : 6 e : 7
a
b : 0 c : 4 d : 5 e : 6
• Thc itcm sct ¦a, b¦ is intrcqucnt and can ¦c jruncd
• A¦¦ othcr itcm scts with thc jrcﬁx a arc trcqucnt
and arc thcrctorc kcjt and jroccsscd rccursivc¦y
Christian Borgelt Frequent Pattern Mining 98
Eclat: DepthFirst Search
1 ¦a, d, e¦
2 ¦b, c, d¦
3 ¦a, c, e¦
! ¦a, c, d, e¦
` ¦a, e¦
o ¦a, c, d¦
¨ ¦b, c¦
S ¦a, c, d, e¦
9 ¦b, c, e¦
10 ¦a, d, e¦
a : 7 b : 3 c : 7 d : 6 e : 7
a
b : 0 c : 4 d : 5 e : 6
c
d : 3 e : 3
• lntcrscct thc transaction ¦ist tor thc itcm sct ¦a, c¦
with thc transaction ¦ists ot thc itcm scts ¦a, x¦, x ∈ ¦d, e¦
• lcsu¦t Transaction ¦ists tor thc itcm scts ¦a, c, d¦ and ¦a, c, e¦
• Count thc num¦cr ot ¦its that arc sct (num¦cr ot containin¸ transactions)
This yic¦ds thc sujjort ot a¦¦ itcm scts with thc jrcﬁx ac
Christian Borgelt Frequent Pattern Mining 99
Eclat: DepthFirst Search
1 ¦a, d, e¦
2 ¦b, c, d¦
3 ¦a, c, e¦
! ¦a, c, d, e¦
` ¦a, e¦
o ¦a, c, d¦
¨ ¦b, c¦
S ¦a, c, d, e¦
9 ¦b, c, e¦
10 ¦a, d, e¦
a : 7 b : 3 c : 7 d : 6 e : 7
a
b : 0 c : 4 d : 5 e : 6
c
d : 3 e : 3
d
e : 2
• lntcrscct thc transaction ¦ists tor thc itcm scts ¦a, c, d¦ and ¦a, c, e¦
• lcsu¦t Transaction ¦ist tor thc itcm sct ¦a, c, d, e¦
• \ith Ajriori this itcm sct cou¦d ¦c jruncd ¦ctorc countin¸,
¦ccausc it was known that ¦c, d, e¦ is intrcqucnt
Christian Borgelt Frequent Pattern Mining 100
Eclat: DepthFirst Search
1 ¦a, d, e¦
2 ¦b, c, d¦
3 ¦a, c, e¦
! ¦a, c, d, e¦
` ¦a, e¦
o ¦a, c, d¦
¨ ¦b, c¦
S ¦a, c, d, e¦
9 ¦b, c, e¦
10 ¦a, d, e¦
a : 7 b : 3 c : 7 d : 6 e : 7
a
b : 0 c : 4 d : 5 e : 6
c
d : 3 e : 3
d
e : 2
• Thc itcm sct ¦a, c, d, e¦ is not trcqucnt (sujjort 2,20/) and thcrctorc jruncd
• Sincc thcrc is no transaction ¦ist ¦ctt (and thus no intcrscction jossi¦¦c),
thc rccursion is tcrminatcd and thc scarch ¦acktracks
Christian Borgelt Frequent Pattern Mining 101
Eclat: DepthFirst Search
1 ¦a, d, e¦
2 ¦b, c, d¦
3 ¦a, c, e¦
! ¦a, c, d, e¦
` ¦a, e¦
o ¦a, c, d¦
¨ ¦b, c¦
S ¦a, c, d, e¦
9 ¦b, c, e¦
10 ¦a, d, e¦
a : 7 b : 3 c : 7 d : 6 e : 7
a
b : 0 c : 4 d : 5 e : 6
c
d : 3 e : 3
d
e : 2
d
e : 4
• Thc scarch ¦acktracks to thc sccond ¦cvc¦ ot thc scarch trcc and
intcrscct thc transaction ¦ist tor thc itcm scts ¦a, d¦ and ¦a, e¦
• lcsu¦t Transaction ¦ist tor thc itcm sct ¦a, d, e¦
• Sincc thcrc is on¦y onc transaction ¦ist ¦ctt (and thus no intcrscction jossi¦¦c),
thc rccursion is tcrminatcd and thc scarch ¦acktracks a¸ain
Christian Borgelt Frequent Pattern Mining 102
Eclat: DepthFirst Search
1 ¦a, d, e¦
2 ¦b, c, d¦
3 ¦a, c, e¦
! ¦a, c, d, e¦
` ¦a, e¦
o ¦a, c, d¦
¨ ¦b, c¦
S ¦a, c, d, e¦
9 ¦b, c, e¦
10 ¦a, d, e¦
a : 7 b : 3 c : 7 d : 6 e : 7
a
b : 0 c : 4 d : 5 e : 6
c
d : 3 e : 3
d
e : 2
d
e : 4
b
c : 3 d : 1 e : 1
• Thc scarch ¦acktracks to thc ﬁrst ¦cvc¦ ot thc scarch trcc and
intcrscct thc transaction ¦ist tor b with thc transaction ¦ists tor c, d, and e
• lcsu¦t Transaction ¦ists tor thc itcm scts ¦b, c¦, ¦b, d¦, and ¦b, e¦
Christian Borgelt Frequent Pattern Mining 103
Eclat: DepthFirst Search
1 ¦a, d, e¦
2 ¦b, c, d¦
3 ¦a, c, e¦
! ¦a, c, d, e¦
` ¦a, e¦
o ¦a, c, d¦
¨ ¦b, c¦
S ¦a, c, d, e¦
9 ¦b, c, e¦
10 ¦a, d, e¦
a : 7 b : 3 c : 7 d : 6 e : 7
a
b : 0 c : 4 d : 5 e : 6
c
d : 3 e : 3
d
e : 2
d
e : 4
b
c : 3 d : 1 e : 1
• On¦y onc itcm sct has suﬃcicnt sujjort → jrunc a¦¦ su¦trccs
• Sincc thcrc is on¦y onc transaction ¦ist ¦ctt (and thus no intcrscction jossi¦¦c),
thc rccursion is tcrminatcd and thc scarch ¦acktracks a¸ain
Christian Borgelt Frequent Pattern Mining 104
Eclat: DepthFirst Search
1 ¦a, d, e¦
2 ¦b, c, d¦
3 ¦a, c, e¦
! ¦a, c, d, e¦
` ¦a, e¦
o ¦a, c, d¦
¨ ¦b, c¦
S ¦a, c, d, e¦
9 ¦b, c, e¦
10 ¦a, d, e¦
a : 7 b : 3 c : 7 d : 6 e : 7
a
b : 0 c : 4 d : 5 e : 6
c
d : 3 e : 3
d
e : 2
d
e : 4
b
c : 3 d : 1 e : 1
c
d : 4 e : 4
• Lacktrack to thc ﬁrst ¦cvc¦ ot thc scarch trcc and
intcrscct thc transaction ¦ist tor c with thc transaction ¦ists tor d and e
• lcsu¦t Transaction ¦ists tor thc itcm scts ¦c, d¦ and ¦c, e¦
Christian Borgelt Frequent Pattern Mining 105
Eclat: DepthFirst Search
1 ¦a, d, e¦
2 ¦b, c, d¦
3 ¦a, c, e¦
! ¦a, c, d, e¦
` ¦a, e¦
o ¦a, c, d¦
¨ ¦b, c¦
S ¦a, c, d, e¦
9 ¦b, c, e¦
10 ¦a, d, e¦
a : 7 b : 3 c : 7 d : 6 e : 7
a
b : 0 c : 4 d : 5 e : 6
c
d : 3 e : 3
d
e : 2
d
e : 4
b
c : 3 d : 1 e : 1
c
d : 4 e : 4
d
e : 2
• lntcrscct thc transaction ¦ist tor thc itcm scts ¦c, d¦ and ¦c, e¦
• lcsu¦t Transaction ¦ist tor thc itcm sct ¦c, d, e¦
Christian Borgelt Frequent Pattern Mining 106
Eclat: DepthFirst Search
1 ¦a, d, e¦
2 ¦b, c, d¦
3 ¦a, c, e¦
! ¦a, c, d, e¦
` ¦a, e¦
o ¦a, c, d¦
¨ ¦b, c¦
S ¦a, c, d, e¦
9 ¦b, c, e¦
10 ¦a, d, e¦
a : 7 b : 3 c : 7 d : 6 e : 7
a
b : 0 c : 4 d : 5 e : 6
c
d : 3 e : 3
d
e : 2
d
e : 4
b
c : 3 d : 1 e : 1
c
d : 4 e : 4
d
e : 2
• Thc itcm sct ¦c, d, e¦ is not trcqucnt (sujjort 2,20/) and thcrctorc jruncd
• Sincc thcrc is no transaction ¦ist ¦ctt (and thus no intcrscction jossi¦¦c),
thc rccursion is tcrminatcd and thc scarch ¦acktracks
Christian Borgelt Frequent Pattern Mining 107
Eclat: DepthFirst Search
1 ¦a, d, e¦
2 ¦b, c, d¦
3 ¦a, c, e¦
! ¦a, c, d, e¦
` ¦a, e¦
o ¦a, c, d¦
¨ ¦b, c¦
S ¦a, c, d, e¦
9 ¦b, c, e¦
10 ¦a, d, e¦
a : 7 b : 3 c : 7 d : 6 e : 7
a
b : 0 c : 4 d : 5 e : 6
c
d : 3 e : 3
d
e : 2
d
e : 4
b
c : 3 d : 1 e : 1
c
d : 4 e : 4
d
e : 2
d
e : 4
• Thc scarch ¦acktracks to thc ﬁrst ¦cvc¦ ot thc scarch trcc and
intcrscct thc transaction ¦ist tor d with thc transaction ¦ist tor e
• lcsu¦t Transaction ¦ist tor thc itcm sct ¦d, e¦
• \ith this stcj thc scarch is ﬁnishcd
Christian Borgelt Frequent Pattern Mining 108
Eclat: DepthFirst Search
1 ¦a, d, e¦
2 ¦b, c, d¦
3 ¦a, c, e¦
! ¦a, c, d, e¦
` ¦a, e¦
o ¦a, c, d¦
¨ ¦b, c¦
S ¦a, c, d, e¦
9 ¦b, c, e¦
10 ¦a, d, e¦
a : 7 b : 3 c : 7 d : 6 e : 7
a
b : 0 c : 4 d : 5 e : 6
c
d : 3 e : 3
d
e : 2
d
e : 4
b
c : 3 d : 1 e : 1
c
d : 4 e : 4
d
e : 2
d
e : 4
• Thc tound trcqucnt itcm scts coincidc, ot coursc,
with thosc tound ¦y thc Ajriori a¦¸orithm
• Lowcvcr, a tundamcnta¦ diﬀcrcncc is that
Lc¦at usua¦¦y on¦y writcs tound trcqucnt itcm scts to an outjut ﬁ¦c,
whi¦c Ajriori kccjs thc who¦c scarch trcc in main mcmory
Christian Borgelt Frequent Pattern Mining 109
Eclat: DepthFirst Search
1 ¦a, d, e¦
2 ¦b, c, d¦
3 ¦a, c, e¦
! ¦a, c, d, e¦
` ¦a, e¦
o ¦a, c, d¦
¨ ¦b, c¦
S ¦a, c, d, e¦
9 ¦b, c, e¦
10 ¦a, d, e¦
a : 7 b : 3 c : 7 d : 6 e : 7
a
b : 0 c : 4 d : 5 e : 6
c
d : 3 e : 3
d
e : 2
d
e : 4
b
c : 3 d : 1 e : 1
c
d : 4 e : 4
d
e : 2
d
e : 4
• `otc that thc itcm sct ¦a, c, d, e¦ cou¦d ¦c jruncd ¦y Ajriori without comjutin¸
its sujjort, ¦ccausc thc itcm sct ¦c, d, e¦ is intrcqucnt
• Thc samc can ¦c achicvcd with Lc¦at it thc dcjthﬁrst travcrsa¦ ot thc jrcﬁx trcc
is carricd out trom ri¸ht to ¦ctt and comjutcd sujjort va¦ucs arc storcd
lt is dc¦ata¦¦c whcthcr thc jotcntia¦ ¸ains ,ustity thc mcmory rcquircmcnt
Christian Borgelt Frequent Pattern Mining 110
Eclat: Bit Matrices and Item Coding
Bit Matrices
• lcjrcscnt transactions as a ¦it matrix
◦ Lach co¦umn corrcsjonds to an itcm
◦ Lach row corrcsjonds to a transaction
• `orma¦ and sjarsc rcjrcscntation ot ¦it matriccs
◦ `orma¦ onc mcmory ¦it jcr matrix ¦it, zcros rcjrcscntcd
◦ Sjarsc ¦ists ot row indiccs ot sct ¦its (transaction ¦ists)
• \hich rcjrcscntation is jrctcra¦¦c dcjcnds on
thc ratio ot sct ¦its to c¦carcd ¦its
Item Coding
• Sortin¸ thc itcms asccndin¸¦y wrt thcir trcqucncy (individua¦ or
transaction sizc sum) ¦cads to a ¦cttcr structurc ot thc scarch trcc
Christian Borgelt Frequent Pattern Mining 111
Eclat: Intersecting Transaction Lists
function iscct (src1, src2 tid¦ist)
begin (∗ — intcrscct two transaction id ¦ists ∗)
var dst tid¦ist. (∗ crcatcd intcrscction ∗)
while ¦oth src1 and src2 arc not cmjty do begin
if hcad(src1) < hcad(src2) (∗ skij transaction idcntiﬁcrs that arc ∗)
then src1 tai¦(src1). (∗ uniquc to thc ﬁrst sourcc ¦ist ∗)
elseif hcad(src1) > hcad(src2) (∗ skij transaction idcntiﬁcrs that arc ∗)
then src2 tai¦(src2). (∗ uniquc to thc sccond sourcc ¦ist ∗)
else begin (∗ it transaction id is in ¦oth sourccs ∗)
dstajjcnd(hcad(src1)). (∗ ajjcnd it to thc outjut ¦ist ∗)
src1 tai¦(src1). src2 tai¦(src2).
end. (∗ rcmovc thc transtcrrcd transaction id ∗)
end. (∗ trom ¦oth sourcc ¦ists ∗)
return dst. (∗ rcturn thc crcatcd intcrscction ∗)
end. (∗ tunction iscct() ∗)
Christian Borgelt Frequent Pattern Mining 112
Eclat: Transaction Ranges
transaction
data¦asc
a, d, e
b, c, d
a, c, e
a, c, d, e
a, e
a, c, d
b, c
a, c, d, e
b, c, e
a, d, e
itcm
trcqucncics
a ¨
b 3
c ¨
d o
e ¨
sortcd ¦y
trcqucncy
a, e, d
c, d, b
a, c, e
a, c, e, d
a, e
a, c, d
c, b
a, c, e, d
c, e, b
a, e, d
¦cxico¸rajhica¦¦y
sortcd
1 a, c, e
2 a, c, e, d
3 a, c, e, d
! a, c, d
` a, e
o a, e, d
¨ a, e, d
S c, e, b
9 c, d, b
10 c, b
a
1
¨
c
1
!
S
10
e
1
3
`
¨
S
S
d
2
3
!
!
o
¨
9
9
b
S
S
9
9
10
10
• Thc transaction ¦ists can ¦c comjrcsscd ¦y com¦inin¸
consccutivc transaction idcntiﬁcrs into ran¸cs
• Lxj¦oit itcm trcqucncics and cnsurc su¦sct rc¦ations ¦ctwccn ran¸cs
trom ¦owcr to hi¸hcr trcquccics, so that intcrscctin¸ thc ¦ists is casy
Christian Borgelt Frequent Pattern Mining 113
Eclat: Diﬀerence sets (Diﬀsets)
• ln a conditiona¦ data¦asc, a¦¦ transaction ¦ists arc “ﬁ¦tcrcd” ¦y thc jrcﬁx
On¦y transactions containcd in thc transaction ¦ist tor thc jrcﬁx
can ¦c in thc transaction ¦ists ot thc conditiona¦ data¦asc
• This su¸¸csts thc idca to usc diﬀsets to rcjrcscnt conditiona¦ data¦ascs
∀I ∀a / ∈ I D
T
(a [ I) K
T
(I) −K
T
(I ∪ ¦a¦)
D
T
(a [ I) contains thc indiccs ot thc transactions that contain I ¦ut not a
• Thc sujjort ot dircct sujcrscts ot I can now ¦c comjutcd as
∀I ∀a / ∈ I s
T
(I ∪ ¦a¦) s
T
(I) −[D
T
(a [ I)[.
Thc diﬀscts tor thc ncxt ¦cvc¦ can ¦c comjutcd ¦y
∀I ∀a, b / ∈ I, a , b D
T
(b [ I ∪ ¦a¦) D
T
(b [ I) −D
T
(a [ I)
• Ior somc transaction data¦ascs, usin¸ diﬀscts sjccds uj thc scarch considcra¦¦y
Christian Borgelt Frequent Pattern Mining 114
Eclat: Diﬀsets
Proof of the Formula for the Next Level:
D
T
(b [ I ∪ ¦a¦) K
T
(I ∪ ¦a¦) −K
T
(I ∪ ¦a, b¦)
¦k [ I ∪ ¦a¦ ⊆ t
k
¦ −¦k [ I ∪ ¦a, b¦ ⊆ t
k
¦
¦k [ I ⊆ t
k
∧ a ∈ t
k
¦
−¦k [ I ⊆ t
k
∧ a ∈ t
k
∧ b ∈ t
k
¦
¦k [ I ⊆ t
k
∧ a ∈ t
k
∧ b / ∈ t
k
¦
¦k [ I ⊆ t
k
∧ b / ∈ t
k
¦
−¦k [ I ⊆ t
k
∧ b / ∈ t
k
∧ a / ∈ t
k
¦
¦k [ I ⊆ t
k
∧ b / ∈ t
k
¦
−¦k [ I ⊆ t
k
∧ a / ∈ t
k
¦
(¦k [ I ⊆ t
k
¦ −¦k [ I ∪ ¦b¦ ⊆ t
k
¦)
−(¦k [ I ⊆ t
k
¦ −¦k [ I ∪ ¦a¦ ⊆ t
k
¦)
(K
T
(I) −K
T
(I ∪ ¦b¦)
−(K
T
(I) −K
T
(I ∪ ¦a¦)
D(b [ I) −D(a [ I)
Christian Borgelt Frequent Pattern Mining 115
Summary Eclat
Basic Processing Scheme
• Lcjthﬁrst travcrsa¦ ot thc jrcﬁx trcc
• Lata is rcjrcscntcd as ¦ists ot transaction idcntiﬁcrs (onc jcr itcm)
• Sujjort countin¸ is donc ¦y intcrscctin¸ ¦ists ot transaction idcntiﬁcrs
Advantages
• Lcjthﬁrst scarch rcduccs mcmory rcquircmcnts
• ¹sua¦¦y (considcra¦¦y) tastcr than Ajriori
Disadvantages
• \ith a sjarsc transaction ¦ist rcjrcscntation (row indiccs)
cc¦at is diﬃcu¦t to cxccutc tor modcrn jroccssors (¦ranch jrcdiction)
Software
• http://www.borgelt.net/eclat.html
Christian Borgelt Frequent Pattern Mining 116
The SaM Algorithm
Sj¦it and `cr¸c A¦¸orithm Lor¸c¦t 200S
Christian Borgelt Frequent Pattern Mining 117
SaM: Basic Ideas
• Thc itcm scts arc chcckcd in lexicographic order
(depthﬁrst traversal ot thc jrcﬁx trcc)
• Stcj ¦y stcj c¦imination ot itcms trom thc transaction data¦asc
• lccursivc jroccssin¸ ot thc conditiona¦ transaction data¦ascs
• \hi¦c Lc¦at uscs a jurc¦y vcrtica¦ transaction rcjrcscntation,
Sa` uscs a jurc¦y horizontal transaction representation
This dcmonstratcs that thc travcrsa¦ ordcr tor thc jrcﬁx trcc and
thc rcjrcscntation torm ot thc transaction data¦asc can ¦c com¦incd trcc¦y
• Thc data structurc uscd is a simj¦y array ot transactions
• Thc two conditiona¦ data¦ascs tor thc two su¦jro¦¦cms tormcd in cach stcj
arc crcatcd with a split step and a merge step
Luc to thcsc stcjs thc a¦¸orithm is ca¦¦cd Sj¦it and `cr¸c (Sa`)
Christian Borgelt Frequent Pattern Mining 118
SaM: Preprocessing the Transaction Database
1
a d
a c d e
b d
b c d g
b c f
a b d
b d e
b c d e
b c
a b d f
2
g 1
f 2
e 3
a !
c `
b S
d S
s
min
3
3
a d
e a c d
b d
c b d
c b
a b d
e b d
e c b d
c b
a b d
!
e a c d
e c b d
e b d
a b d
a b d
a d
c b d
c b
c b
b d
`
1 e a c d
1 e c b d
1 e b d
2 a b d
1 a d
1 c b d
2 c b
1 b d
1 Ori¸ina¦ transaction data¦asc
2 Ircqucncy ot individua¦ itcms
3 ltcms in transactions sortcd
asccndin¸¦y wrt thcir trcqucncy
! Transactions sortcd ¦cxico¸rajhica¦¦y
in dcsccndin¸ ordcr (comjarison ot
itcms invcrtcd wrt jrcccdin¸ stcj)
` Lata structurc uscd ¦y thc a¦¸orithm
Christian Borgelt Frequent Pattern Mining 119
SaM: Basic Operations
1 e a c d
1 e c b d
1 e b d
2 a b d
1 a d
1 c b d
2 c b
1 b d
1 a c d
1 c b d
1 b d
e
e
e
split
prefix e
2 a b d
1 a d
1 c b d
2 c b
1 b d
1 a c d
1 c b d
1 b d
1 a c d
2 a b d
1 a d
2 c b d
2 c b
2 b d
merge
prefix e
e removed
• Split Step: (on thc ¦ctt. tor ﬁrst su¦jro¦¦cm)
◦ `ovc a¦¦ transactions startin¸ with thc samc itcm to a ncw array
◦ lcmovc thc common ¦cadin¸ itcm (advancc jointcr into transaction)
• Merge Step: (on thc ri¸ht. tor sccond su¦jro¦¦cm)
◦ `cr¸c thc rcst ot thc transaction array and thc cojicd transactions
◦ Thc mcr¸c ojcration is simi¦ar to a mergesort jhasc
Christian Borgelt Frequent Pattern Mining 120
SaM: PseudoCode
function Sa` (a array ot transactions, (∗ conditiona¦ data¦asc to jroccss ∗)
p sct ot itcms, (∗ jrcﬁx ot thc conditiona¦ data¦asc a ∗)
s
min
int) (∗ minimum sujjort ot an itcm sct ∗)
var i itcm. (∗ ¦uﬀcr tor thc sj¦it itcm ∗)
b array ot transactions. (∗ sj¦it rcsu¦t ∗)
begin (∗ — sj¦it and mcr¸c rccursion — ∗)
while a is not cmjty do (∗ whi¦c thc data¦asc is not cmjty ∗)
i a0itcms0. (∗ ¸ct ¦cadin¸ itcm ot ﬁrst transaction ∗)
movc transactions startin¸ with i to b. (∗ sj¦it stcj ﬁrst su¦jro¦¦cm ∗)
mcr¸c b and thc rcst ot a into a. (∗ mcr¸c stcj sccond su¦jro¦¦cm ∗)
if s(i) ≥ s
min
then (∗ it thc sj¦it itcm is trcqucnt ∗)
p p ∪ ¦i¦. (∗ cxtcnd thc jrcﬁx itcm sct and ∗)
rcjort p with sujjort s(i). (∗ rcjort thc tound trcqucnt itcm sct ∗)
Sa`(b, p, s
min
). (∗ jroccss thc sj¦it rcsu¦t rccursivc¦y, ∗)
p p −¦i¦. (∗ thcn rcstorc thc ori¸ina¦ jrcﬁx ∗)
end.
end.
end. (∗ tunction Sa`() ∗)
Christian Borgelt Frequent Pattern Mining 121
SaM: PseudoCode — Split Step
var i itcm. (∗ ¦uﬀcr tor thc sj¦it itcm ∗)
s int. (∗ sujjort ot thc sj¦it itcm ∗)
b array ot transactions. (∗ sj¦it rcsu¦t ∗)
begin (∗ — sj¦it stcj ∗)
b cmjty. s 0. (∗ initia¦izc sj¦it rcsu¦t and itcm sujjort ∗)
i a0itcms0. (∗ ¸ct ¦cadin¸ itcm ot ﬁrst transaction ∗)
while a is not cmjty (∗ whi¦c data¦asc is not cmjty and ∗)
and a0itcms0 i do (∗ ncxt transaction starts with samc itcm ∗)
s s + a0w¸t. (∗ sum occurrcnccs (comjutc sujjort) ∗)
rcmovc i trom a0itcms. (∗ rcmovc sj¦it itcm trom transaction ∗)
if a0itcms is not cmjty (∗ it transaction has not ¦ccomc cmjty ∗)
then rcmovc a0 trom a and ajjcnd it to b.
else rcmovc a0 trom a. end. (∗ movc it to thc conditiona¦ data¦asc, ∗)
end. (∗ othcrwisc simj¦y rcmovc it ∗)
end. (∗ cmjty transactions arc c¦iminatcd ∗)
• `otc that thc sj¦it stcj a¦so dctcrmincs thc sujjort ot thc itcm i
Christian Borgelt Frequent Pattern Mining 122
SaM: PseudoCode — Merge Step
var c array ot transactions. (∗ ¦uﬀcr tor rcst ot sourcc array ∗)
begin (∗ — mcr¸c stcj ∗)
c a. a cmjty. (∗ initia¦izc thc outjut array ∗)
while b and c arc ¦oth not cmjty do (∗ mcr¸c sj¦it and rcst ot data¦asc ∗)
if c0itcms > b0itcms (∗ cojy ¦cx sma¦¦cr transaction trom c ∗)
then rcmovc c0 trom c and ajjcnd it to a.
else if c0itcms < b0itcms (∗ cojy ¦cx sma¦¦cr transaction trom b ∗)
then rcmovc b0 trom b and ajjcnd it to a.
else b0w¸t b0w¸t +c0w¸t. (∗ sum thc occurrcnccs,wci¸hts ∗)
rcmovc b0 trom b and ajjcnd it to a.
rcmovc c0 trom c. (∗ movc com¦incd transaction and ∗)
end. (∗ dc¦ctc thc othcr, cqua¦ transaction ∗)
end. (∗ kccj on¦y onc cojy jcr transaction ∗)
while c is not cmjty do (∗ cojy rcst ot transactions in c ∗)
rcmovc c0 trom c and ajjcnd it to a. end.
while b is not cmjty do (∗ cojy rcst ot transactions in b ∗)
rcmovc b0 trom b and ajjcnd it to a. end.
end. (∗ sccond rccursion cxccutcd ¦y ¦ooj ∗)
Christian Borgelt Frequent Pattern Mining 123
SaM: Optimization
• lt thc transaction data¦asc is sjarsc,
thc two transaction arrays to mcr¸c can su¦stantia¦¦y diﬀcr in sizc
• ln this casc Sa` can ¦ccomc tair¦y s¦ow,
¦ccausc thc mcr¸c stcj jroccsscs many morc transactions than thc sj¦it stcj
• lntuitivc cxj¦anation (cxtrcmc casc)
◦ Sujjosc mergesort a¦ways mcr¸cd a sin¸¦c c¦cmcnt
with thc rccursivc¦y sortcd rcst ot thc array (or ¦ist)
◦ This vcrsion ot mcr¸csort wou¦d ¦c cquiva¦cnt to insertion sort
◦ As a conscqucncc thc timc comj¦cxity worscns trom O(n¦o¸ n) to O(n
2
)
• lossi¦¦c ojtimization
◦ `odity thc mcr¸c stcj it thc arrays to mcr¸c diﬀcr si¸niﬁcant¦y in sizc
◦ ldca usc thc samc ojtimization as in binary search ¦ascd insertion sort
Christian Borgelt Frequent Pattern Mining 124
SaM: PseudoCode — Binary Search Based Merge
function mcr¸c (a, b array ot transactions) array ot transactions
var l, m, r int. (∗ ¦inary scarch varia¦¦cs ∗)
c array ot transactions. (∗ outjut transaction array ∗)
begin (∗ — ¦inary scarch ¦ascd mcr¸c — ∗)
c cmjty. (∗ initia¦izc thc outjut array ∗)
while a and b arc ¦oth not cmjty do (∗ mcr¸c thc two transaction arrays ∗)
l 0. r ¦cn¸th(a). (∗ initia¦izc thc ¦inary scarch ran¸c ∗)
while l < r do (∗ whi¦c thc scarch ran¸c is not cmjty ∗)
m ⌊
l+r
2
⌋. (∗ comjutc thc midd¦c indcx ∗)
if am < b0 (∗ comjarc thc transaction to inscrt ∗)
then l m + 1. else r m. (∗ and adajt thc ¦inary scarch ran¸c ∗)
end. (∗ accordin¸ to thc comjarison rcsu¦t ∗)
while l > 0 do (∗ whi¦c sti¦¦ ¦ctorc inscrtion josition ∗)
rcmovc a0 trom a and ajjcnd it to c.
l l −1. (∗ cojy ¦cx ¦ar¸cr transaction and ∗)
end. (∗ dccrcmcnt thc transaction countcr ∗)
. . .
Christian Borgelt Frequent Pattern Mining 125
SaM: PseudoCode — Binary Search Based Merge
. . .
rcmovc b0 trom b and ajjcnd it to c. (∗ cojy thc transaction to inscrt and ∗)
i ¦cn¸th(c) −1. (∗ ¸ct its indcx in thc outjut array ∗)
if a is not cmjty and a0itcms ciitcms
then ciw¸t ciw¸t +a0w¸t. (∗ it thcrc is a transaction in thc rcst ∗)
rcmovc a0 trom a. (∗ that is cqua¦ to thc onc ,ust cojicd, ∗)
end. (∗ thcn sum thc transaction wci¸hts ∗)
end. (∗ and rcmovc trans trom thc rcst ∗)
while a is not cmjty do (∗ cojy rcst ot transactions in a ∗)
rcmovc a0 trom a and ajjcnd it to c. end.
while b is not cmjty do (∗ cojy rcst ot transactions in b ∗)
rcmovc b0 trom b and ajjcnd it to c. end.
return c. (∗ rcturn thc mcr¸c rcsu¦t ∗)
end. (∗ tunction mcr¸c() ∗)
• Ajj¦yin¸ this mcr¸c jroccdurc it thc ¦cn¸th ratio ot thc transaction arrays
cxcccds 1o1 accc¦cratcs thc cxccution on sjarsc data scts
Christian Borgelt Frequent Pattern Mining 126
SaM: Optimization and External Storage
• Acccjtin¸ a s¦i¸ht¦y morc comj¦icatcd jroccssin¸ schcmc,
onc may work with double source buﬀering
◦ lnitia¦¦y, onc sourcc is thc injut data¦asc and thc othcr sourcc is cmjty
◦ A sj¦it rcsu¦t, which has to ¦c crcatcd ¦y movin¸ and mcr¸in¸ transactions
trom ¦oth sourccs, is a¦ways mcr¸cd to thc sma¦¦cr sourcc
◦ lt ¦oth sourccs havc ¦ccomc ¦ar¸c,
thcy may ¦c mcr¸cd in ordcr to cmjty onc sourcc
• `otc that Sa` can casi¦y ¦c imj¦cmcntcd to work on external storage
◦ ln jrincij¦c, thc transactions nccd not ¦c ¦oadcd into main mcmory
◦ Lvcn thc transaction array can casi¦y ¦c storcd on cxtcrna¦ stora¸c
or as a rc¦ationa¦ data¦asc ta¦¦c
◦ Thc tact that thc transaction array is jroccsscd ¦incar¦y
is advanta¸cous tor cxtcrna¦ stora¸c ojcrations
Christian Borgelt Frequent Pattern Mining 127
Summary SaM
Basic Processing Scheme
• Lcjthﬁrst travcrsa¦ ot thc jrcﬁx trcc
• Lata is rcjrcscntcd as an array ot transactions (jurc¦y horizonta¦ rcjrcscntation)
• Sujjort countin¸ is donc imj¦icit¦y in thc sj¦it stcj
Advantages
• \cry simj¦c data structurc and jroccssin¸ schcmc
• Lasy to imj¦cmcnt tor ojcration on cxtcrna¦ stora¸c , rc¦ationa¦ data¦ascs
Disadvantages
• Can ¦c s¦ow on sjarsc transaction data¦ascs duc to thc mcr¸c stcj
Software
• http://www.borgelt.net/sam.html
Christian Borgelt Frequent Pattern Mining 128
The RElim Algorithm
lccursivc L¦imination A¦¸orithm Lor¸c¦t 200`
Christian Borgelt Frequent Pattern Mining 129
Recursive Elimination: Basic Ideas
• Thc itcm scts arc chcckcd in lexicographic order
(depthﬁrst traversal ot thc jrcﬁx trcc)
• Stcj ¦y stcj c¦imination ot itcms trom thc transaction data¦asc
• lccursivc jroccssin¸ ot thc conditiona¦ transaction data¦ascs
• Avoids thc main jro¦¦cm ot thc Sa` a¦¸orithm
docs not usc a mcr¸c ojcration to ¸rouj transactions with thc samc ¦cadin¸ itcm
• lL¦im rathcr maintains one list of transactions per item,
thus cmj¦oyin¸ thc corc idca ot radix sort
Lowcvcr, on¦y transactions startin¸ with an itcm arc in thc corrcsjondin¸ ¦ist
• Attcr an itcm has ¦ccn jroccsscd, transactions arc rcassi¸ncd to othcr ¦ists
(¦ascd on thc ncxt itcm in thc transaction)
• lL¦im is insjircd ¦y thc IlGrowth a¦¸orithm (discusscd ¦atcr)
and c¦osc¦y rc¦atcd to thc Lminc a¦¸orithm (¦ut simj¦cr data structurc)
Christian Borgelt Frequent Pattern Mining 130
RElim: Preprocessing the Transaction Database
1
3
samc
as tor
Sa`
!
e a c d
e c b d
e b d
a b d
a b d
a d
c b d
c b
c b
b d
`
d
0
b
1
c
3
a
3
e
3
1 d 1 b d
2 b
2 b d
1 d
1 a c d
1 c b d
1 b d
1 Ori¸ina¦ transaction data¦asc
2 Ircqucncy ot individua¦ itcms
3 ltcms in transactions sortcd
asccndin¸¦y wrt thcir trcqucncy
! Transactions sortcd ¦cxico¸rajhica¦¦y
in dcsccndin¸ ordcr (comjarison ot
itcms invcrtcd wrt jrcccdin¸ stcj)
` Lata structurc uscd ¦y thc a¦¸orithm
(¦cadin¸ itcms imj¦icit in ¦ist)
Christian Borgelt Frequent Pattern Mining 131
RElim: Basic Operations
initial database
d
0
b
1
c
3
a
3
e
3
1 d 1 b d
2 b
2 b d
1 d
1 a c d
1 c b d
1 b d
3
e
a
c
b
prefix e
d
0
b
1
c
1
a
1
1 d 1 b d 1 c d
e eliminated
d
0
b
2
c
4
a
4
1 d
1 d
1 b d
1 b d
2 b
1 c d
2 b d
1 d
Thc ¦asic ojcrations ot thc lL¦im a¦¸orithm
Thc ri¸htmost ¦ist is travcrscd and rcassi¸ncd
oncc to an initia¦¦y cmjty ¦ist array (condi
tiona¦ data¦asc tor thc jrcﬁx e, scc toj ri¸ht)
and oncc to thc ori¸ina¦ ¦ist array (c¦iminatin¸
itcm e, scc ¦ottom ¦ctt) Thcsc two data¦ascs
arc thcn ¦oth jroccsscd rccursivc¦y
• `otc that attcr a simj¦c rcassi¸nmcnt thcrc may ¦c duj¦icatc ¦ist c¦cmcnts
Christian Borgelt Frequent Pattern Mining 132
RElim: PseudoCode
function lL¦im (a array ot transaction ¦ists, (∗ cond data¦asc to jroccss ∗)
p sct ot itcms, (∗ jrcﬁx ot thc conditiona¦ data¦asc a ∗)
s
min
int) int (∗ minimum sujjort ot an itcm sct ∗)
var i, k itcm. (∗ ¦uﬀcr tor thc currcnt itcm ∗)
s int. (∗ sujjort ot thc currcnt itcm ∗)
n int. (∗ num¦cr ot tound trcqucnt itcm scts ∗)
b array ot transaction ¦ists. (∗ conditiona¦ data¦asc tor currcnt itcm ∗)
t, u transaction ¦ist c¦cmcnt. (∗ to travcrsc thc transaction ¦ists ∗)
begin (∗ — rccursivc c¦imination — ∗)
n 0. (∗ initia¦izc thc num¦cr ot tound itcm scts ∗)
while a is not cmjty do (∗ whi¦c conditiona¦ data¦asc is not cmjty ∗)
i ¦ast itcm ot a. s aiw¸t. (∗ ¸ct thc ncxt itcm to jroccss ∗)
if s ≥ s
min
then (∗ it thc currcnt itcm is trcqucnt ∗)
p p ∪ ¦i¦. (∗ cxtcnd thc jrcﬁx itcm sct and ∗)
rcjort p with sujjort s. (∗ rcjort thc tound trcqucnt itcm sct ∗)
. . . (∗ crcatc conditiona¦ data¦asc tor i ∗)
p p −¦i¦. (∗ and jroccss it rccursivc¦y, ∗)
end. (∗ thcn rcstorc thc ori¸ina¦ jrcﬁx ∗)
Christian Borgelt Frequent Pattern Mining 133
RElim: PseudoCode
if s ≥ s
min
then (∗ it thc currcnt itcm is trcqucnt ∗)
. . . (∗ rcjort thc tound trcqucnt itcm sct ∗)
b array ot transaction ¦ists. (∗ crcatc an cmjty ¦ist array ∗)
t aihcad. (∗ ¸ct thc ¦ist associatcd with thc itcm ∗)
while t , ni¦ do (∗ whi¦c not at thc cnd ot thc ¦ist ∗)
u cojy ot t. t tsucc. (∗ cojy thc transaction ¦ist c¦cmcnt, ∗)
k uitcms0. (∗ ¸o to thc ncxt ¦ist c¦cmcnt, and ∗)
rcmovc k trom uitcms. (∗ rcmovc thc ¦cadin¸ itcm trom thc cojy ∗)
if uitcms is not cmjty (∗ add thc cojy to thc conditiona¦ data¦asc ∗)
then usucc bkhcad. bkhcad u. end.
bkw¸t bkw¸t +uw¸t. (∗ sum thc transaction wci¸ht ∗)
end. (∗ in thc ¦ist wci¸ht,transaction countcr ∗)
n n + 1 + lL¦im(b, p, s
min
). (∗ jroccss thc crcatcd data¦asc rccursivc¦y ∗)
. . . (∗ and sum thc tound trcqucnt itcm scts, ∗)
end. (∗ thcn rcstorc thc ori¸ina¦ itcm sct jrcﬁx ∗)
. . . (∗ ¸o on ¦y rcassi¸nin¸ ∗)
(∗ thc jroccsscd transactions ∗)
Christian Borgelt Frequent Pattern Mining 134
RElim: PseudoCode
. . .
t aihcad. (∗ ¸ct thc ¦ist associatcd with thc itcm ∗)
while t , ni¦ do (∗ whi¦c not at thc cnd ot thc ¦ist ∗)
u t. t tsucc. (∗ notc thc currcnt ¦ist c¦cmcnt, ∗)
k uitcms0. (∗ ¸o to thc ncxt ¦ist c¦cmcnt, and ∗)
rcmovc k trom uitcms. (∗ rcmovc thc ¦cadin¸ itcm trom currcnt ∗)
if uitcms is not cmjty (∗ rcassi¸n thc notcd ¦ist c¦cmcnt ∗)
then usucc akhcad. akhcad u. end.
akw¸t akw¸t +uw¸t. (∗ sum thc transaction wci¸ht ∗)
end. (∗ in thc ¦ist wci¸ht,transaction countcr ∗)
rcmovc ai trom a. (∗ rcmovc thc jroccsscd ¦ist ∗)
end.
return n. (∗ rcturn thc num¦cr ot trcqucnt itcm scts ∗)
end. (∗ tunction lL¦im() ∗)
• ln ordcr to rcmovc duj¦icatc c¦cmcnts, it is usua¦¦y advisa¦¦c
to sort and comjrcss thc ncxt transaction ¦ist ¦ctorc it is jroccsscd
Christian Borgelt Frequent Pattern Mining 135
Summary RElim
Basic Processing Scheme
• Lcjthﬁrst travcrsa¦ ot thc jrcﬁx trcc
• Lata is rcjrcscntcd as ¦ists ot transactions (onc jcr itcm)
• Sujjort countin¸ is imj¦icit in thc (rc)assi¸nmcnt stcj
Advantages
• Simj¦c data structurcs and jroccssin¸ schcmc
• Comjctitivc with thc tastcst a¦¸orithms dcsjitc this simj¦icity
Disadvantages
• lL¦im is usua¦¦y outjcrtormcd ¦y Il¸rowth (discusscd ncxt)
Software
• http://www.borgelt.net/relim.html
Christian Borgelt Frequent Pattern Mining 136
The FPGrowth Algorithm
Ircqucnt lattcrn Growth A¦¸orithm Lan, lci, and Yin 2000
Christian Borgelt Frequent Pattern Mining 137
FPGrowth: Basic Ideas
• IlGrowth mcans Frequent Pattern Growth
• Thc itcm scts arc chcckcd in lexicographic order
(depthﬁrst traversal ot thc jrcﬁx trcc)
• Stcj ¦y stcj c¦imination ot itcms trom thc transaction data¦asc
• lccursivc jroccssin¸ ot thc conditiona¦ transaction data¦ascs
• Thc transaction data¦asc is rcjrcscntcd as an FPtree
An Iltrcc is ¦asica¦¦y a preﬁx tree with additiona¦ structurc
nodcs ot this trcc that corrcsjond to thc samc itcm arc ¦inkcd
This combines a horizontal and a vertical database representation
• This data structurc is uscd to comjutc conditiona¦ data¦ascs cﬃcicnt¦y
A¦¦ transactions containin¸ a ¸ivcn itcm can casi¦y ¦c tound
¦y thc ¦inks ¦ctwccn thc nodcs corrcsjondin¸ to this itcm
Christian Borgelt Frequent Pattern Mining 138
FPGrowth: Preprocessing the Transaction Database
1
a d f
a c d e
b d
b c d
b c
a b d
b d e
b c e g
c d f
a b d
2
d S
b ¨
c `
a !
e 3
f 2
g 1
s
min
3
3
d a
d c a e
d b
d b c
b c
d b a
d b e
b c e
d c
d b a
!
d b
d b c
d b a
d b a
d b e
d c
d c a e
d a
b c
b c e
`
Iltrcc
(scc ncxt s¦idc)
1 Ori¸ina¦ transaction data¦asc
2 Ircqucncy ot individua¦ itcms
3 ltcms in transactions sortcd
dcsccndin¸¦y wrt thcir trcqucncy
and intrcqucnt itcms rcmovcd
! Transactions sortcd ¦cxico¸rajhica¦¦y
in asccndin¸ ordcr (comjarison ot
itcms is thc samc as in jrcccdin¸ stcj)
` Lata structurc uscd ¦y thc a¦¸orithm
(dctai¦s on ncxt s¦idc)
Christian Borgelt Frequent Pattern Mining 139
Transaction Representation: FPTree
• Lui¦d a frequent pattern tree (FPtree) trom thc transactions
(¦asica¦¦y a jrcﬁx trcc with links between the branches that ¦ink nodcs
with thc samc itcm and a header table tor thc rcsu¦tin¸ itcm ¦ists)
• Ircqucnt sin¸¦c itcm scts can ¦c rcad dircct¦y trom thc Iltrcc
Simple Example Database
1
a d f
a c d e
b d
b c d
b c
a b d
b d e
b c e g
c d f
a b d
!
d b
d b c
d b a
d b a
d b e
d c
d c a e
d a
b c
b c e
d S b ¨ c ` a !
d S
b `
b 2
c 1
c 2
c 2
a 2
a 1
a 1
e 3
e 1
e 1
e 1
frequent pattern tree
Christian Borgelt Frequent Pattern Mining 140
Transaction Representation: FPTree
• An Iltrcc com¦incs a horizonta¦ and a vcrtica¦ transaction rcjrcscntation
• Horizontal Representation: jrcﬁx trcc ot transactions
Vertical Representation: ¦inks ¦ctwccn thc jrcﬁx trcc ¦ranchcs
`otc thc jrcﬁx trcc is invcrtcd,
ic thcrc arc on¦y jarcnt jointcrs
Chi¦d jointcrs arc not nccdcd
duc to thc jroccssin¸ schcmc
(to ¦c discusscd)
ln jrincij¦c, a¦¦ nodcs rctcrrin¸
to thc samc itcm can ¦c storcd
in an array rathcr than a ¦ist
d S b ¨ c ` a !
d S
b `
b 2
c 1
c 2
c 2
a 2
a 1
a 1
e 3
e 1
e 1
e 1
frequent pattern tree
Christian Borgelt Frequent Pattern Mining 141
Recursive Processing
• Thc initia¦ Iltrcc is projected wrt thc itcm corrcsjondin¸ to
thc ri¸htmost ¦cvc¦ in thc trcc (¦ct this itcm ¦c i)
• This yic¦ds an Iltrcc ot thc conditional database
(data¦asc ot transactions containin¸ thc itcm i, ¦ut with this itcm rcmovcd
— it is imj¦icit in thc Iltrcc and rccordcd as a common jrcﬁx)
• Irom thc jro,cctcd Iltrcc thc trcqucnt itcm scts
containin¸ itcm i can ¦c rcad dircct¦y
• Thc rightmost level ot thc ori¸ina¦ (unjro,cctcd) Iltrcc is removed
(thc itcm i is rcmovcd trom thc data¦asc)
• Thc jro,cctcd Iltrcc is jroccsscd rccursivc¦y. thc itcm i is notcd as a jrcﬁx
that is to ¦c addcd in dccjcr ¦cvc¦s ot thc rccursion
• Attcrwards thc rcduccd ori¸ina¦ Iltrcc is turthcr jroccsscd
¦y workin¸ on thc ncxt ¦cvc¦ ¦cttwards
Christian Borgelt Frequent Pattern Mining 142
Projecting an FPTree
d S b ¨ c ` a !
d S
b `
b 2
c 1
c 2
c 2
a 2
a 1
a 1
e 3
e 1
e 1
e 1
d 2
b 1
b 1
c 1
c 1
a 1
↑
dctachcd jro,cction
← Iltrcc with attachcd jro,cction
d 2 b 2 c 2 a 1
d 2
b 1
b 1
c 1
c 1
a 1
• Ly travcrsin¸ thc nodc ¦ist tor thc ri¸htmost itcm,
a¦¦ transactions containin¸ this itcm can ¦c tound
• Thc Iltrcc ot thc conditiona¦ data¦asc tor this itcm is crcatcd
¦y cojyin¸ thc nodcs on thc jaths to thc root
Christian Borgelt Frequent Pattern Mining 143
Projecting an FPTree
• A simj¦cr, ¦ut usua¦¦y cqua¦¦y cﬃcicnt jro,cction schcmc
is to cxtract a jath to thc root as a (rcduccd) transaction
and to inscrt this transaction into a ncw Iltrcc
• Ior thc inscrtion into thc ncw trcc thcrc arc two ajjroachcs
◦ Ajart trom a jarcnt jointcr (which is nccdcd tor thc jath cxtraction),
cach nodc josscsscs a jointcr to its ﬁrst child and its right sibling
Thcsc jointcrs a¦¦ow to inscrt a ncw transaction tojdown
◦ lt thc initia¦ Iltrcc has ¦ccn ¦ui¦t trom a ¦cxico¸rajhica¦¦y sortcd
transaction data¦asc, thc travcrsa¦ ot thc itcm ¦ists yic¦ds thc
(rcduccd) transactions in ¦cxico¸rajhica¦ ordcr
This can ¦c cxj¦oitcd to inscrt a transaction usin¸ on¦y thc header table
• Ly jroccssin¸ an Iltrcc trom left to right (or trom top to bottom
wrt thc jrcﬁx trcc), thc jro,cction may cvcn rcusc thc a¦rcady jrcscnt nodcs
and thc a¦rcady jroccsscd jart ot thc hcadcr ta¦¦c (topdown fpgrowth)
ln this way thc a¦¸orithm can ¦c cxccutcd on a ﬁxcd amount ot mcmory
Christian Borgelt Frequent Pattern Mining 144
Reducing the Original FPTree
d S b ¨ c ` a !
d S
b `
b 2
c 1
c 2
c 2
a 2
a 1
a 1
e 3
e 1
e 1
e 1
d S b ¨ c ` a !
d S
b `
b 2
c 1
c 2
c 2
a 2
a 1
a 1
• Thc ori¸ina¦ Iltrcc is rcduccd ¦y rcmovin¸ thc ri¸htmost ¦cvc¦
• This yic¦ds thc conditiona¦ data¦asc tor itcm scts not containin¸ thc itcm
corrcsjondin¸ to thc ri¸htmost ¦cvc¦
Christian Borgelt Frequent Pattern Mining 145
FPgrowth: DivideandConquer
d S b ¨ c ` a !
d S
b `
b 2
c 1
c 2
c 2
a 2
a 1
a 1
e 3
e 1
e 1
e 1
d S b ¨ c ` a !
d S
b `
b 2
c 1
c 2
c 2
a 2
a 1
a 1
↑
Conditiona¦ data¦asc
with itcm e rcmovcd
(sccond su¦jro¦¦cm)
d 2 b 2 c 2 a 1
d 2
b 1
b 1
c 1
c 1
a 1 ← Conditiona¦ data¦asc tor jrcﬁx e
(ﬁrst su¦jro¦¦cm)
Christian Borgelt Frequent Pattern Mining 146
Pruning a Projected FPTree
• Trivial case: lt thc itcm corrcsjondin¸ to thc ri¸htmost ¦cvc¦ is intrcqucnt,
thc itcm and thc Iltrcc ¦cvc¦ arc rcmovcd without jro,cction
• More interesting case: An itcm corrcsjondin¸ to a midd¦c ¦cvc¦
is intrcqucnt, ¦ut an itcm on a ¦cvc¦ turthcr to thc ri¸ht is trcqucnt
Example FPTree with an intrcqucnt itcm on a midd¦c ¦cvc¦
a o b 1 c ! d 3
a o b 1 c 1
c 3
d 1
d 2
a o b 1 c ! d 3
a o c ! d 3
• Soca¦¦cd αjrunin¸ or Lonsai jrunin¸ ot a (jro,cctcd) Iltrcc
• lmj¦cmcntcd ¦y ¦ctttori¸ht ¦cvc¦wisc mcr¸in¸ ot nodcs with samc jarcnts
Christian Borgelt Frequent Pattern Mining 147
FPgrowth: Implementation Issues
• Chains:
lt an Iltrcc has ¦ccn rcduccd to a chain, no jro,cctions arc comjutcd anymorc
lathcr a¦¦ su¦scts ot thc sct ot itcms in thc chain arc tormcd and rcjortcd
• Rebuilding the FPtree:
An Iltrcc may ¦c jro,cctcd ¦y cxtractin¸ thc (rcduccd) transactions dcscri¦cd
¦y thc jaths to thc root and inscrtin¸ thcm into a ncw Iltrcc (scc a¦ovc)
This makcs it jossi¦¦c to chan¸c thc itcm ordcr, with thc to¦¦owin¸ advantages
◦ `o nccd tor α or Lonsai jrunin¸, sincc thc itcms can ¦c rcordcrcd
so that a¦¦ conditiona¦¦y trcqucnt itcms ajjcar on thc ¦ctt
◦ `o nccd tor jcrtcct cxtcnsion jrunin¸, ¦ccausc thc jcrtcct cxtcnsions can ¦c
movcd to thc ¦ctt and arc jroccsscd at thc cnd with thc chain ojtimization
Lowcvcr, thcrc arc a¦so disadvantages
◦ Lithcr thc Iltrcc has to ¦c travcrscd twicc or jair trcqucncics havc to ¦c
dctcrmincd to rcordcr thc itcms accordin¸ to thcir conditiona¦ trcqucncy
Christian Borgelt Frequent Pattern Mining 148
FPgrowth: Implementation Issues
• Thc initia¦ Iltrcc is ¦ui¦t trom an array¦ascd main mcmory rcjrcscntation
ot thc transaction data¦asc (c¦iminatcs thc nccd tor chi¦d jointcrs)
• This has thc disadvanta¸c that thc mcmory savin¸s ottcn rcsu¦tin¸
trom an Iltrcc rcjrcscntation cannot ¦c tu¦¦y cxj¦oitcd
• Lowcvcr, it has thc advanta¸c that no chi¦d and si¦¦in¸ jointcrs arc nccdcd
and thc transactions can ¦c inscrtcd in ¦cxico¸rajhic ordcr
• Lach Iltrcc nodc has a constant sizc ot 1o ¦ytcs (2 jointcrs, 2 intc¸crs)
A¦¦ocatin¸ thcsc throu¸h thc standard mcmory mana¸cmcnt is wastctu¦
(A¦¦ocatin¸ many sma¦¦ mcmory o¦,ccts is hi¸h¦y incﬃcicnt)
• So¦ution Thc nodcs arc a¦¦ocatcd in onc ¦ar¸c array jcr Iltrcc
• As a conscqucncc, cach Iltrcc rcsidcs in a sin¸¦c mcmory ¦¦ock
Thcrc is no a¦¦ocation and dca¦¦ocation ot individua¦ nodcs
(This may wastc somc mcmory, ¦ut is hi¸h¦y cﬃcicnt)
Christian Borgelt Frequent Pattern Mining 149
FPgrowth: Implementation Issues
• An Iltrcc can ¦c imj¦cmcntcd with on¦y two integer arrays lasz 200!
◦ onc array contains thc transaction countcrs (sujjort va¦ucs) and
◦ onc array contains thc jarcnt jointcrs (as thc indiccs ot array c¦cmcnts)
This rcduccs thc mcmory rcquircmcnts to S ¦ytcs jcr nodc
• Such a mcmory structurc has advantages
duc thc way in which modcrn jroccssors acccss thc main mcmory
Lincar mcmory acccsscs arc tastcr than random acccsscs
◦ `ain mcmory is or¸anizcd as a “ta¦¦c” with rows and co¦umns
◦ Iirst thc row is addrcsscd and thcn, attcr somc dc¦ay, thc co¦umn
◦ Acccsscs to diﬀcrcnt co¦umns in thc samc row can skij thc row addrcssin¸
• Lowcvcr, thcrc arc a¦so disadvantages
◦ lro¸rammin¸ jro,cction and α or Lonsai jrunin¸ ¦ccomcs morc comj¦cx,
¦ccausc ¦css structurc is avai¦a¦¦c
◦ lcordcrin¸ thc itcms is virtua¦¦y ru¦cd out
Christian Borgelt Frequent Pattern Mining 150
Summary FPGrowth
Basic Processing Scheme
• Transaction data¦asc is rcjrcscntcd as a trcqucnt jattcrn trcc
• An Iltrcc is jro,cctcd to o¦tain a conditiona¦ data¦asc
• lccursivc jroccssin¸ ot thc conditiona¦ data¦asc
Advantages
• Ottcn thc tastcst a¦¸orithm or amon¸ thc tastcst a¦¸orithms
Disadvantages
• `orc diﬃcu¦t to imj¦cmcnt than othcr ajjroachcs, comj¦cx data structurc
• An Iltrcc can nccd morc mcmory than a ¦ist or array ot transactions
Software
• http://www.borgelt.net/fpgrowth.html
Christian Borgelt Frequent Pattern Mining 151
Experimental Comparison
Christian Borgelt Frequent Pattern Mining 152
Experiments: Data Sets
• Chess
A data sct ¦istin¸ chcss cnd ¸amc jositions tor kin¸ vs kin¸ and rook
This data sct is jart ot thc ¹Cl machinc ¦carnin¸ rcjository
¨` itcms, 319o transactions
avcra¸c transaction sizc 3¨, dcnsity ≈ 0.`
• Census
A data sct dcrivcd trom an cxtract ot thc ¹S ccnsus ¦urcau data ot 199!,
which was jrcjroccsscd ¦y discrctizin¸ numcric attri¦utcs
This data sct is jart ot thc ¹Cl machinc ¦carnin¸ rcjository
13` itcms, !SS!2 transactions
avcra¸c transaction sizc 1!, dcnsity ≈ 0.1
Thc density ot a transaction data¦asc is thc avcra¸c traction ot a¦¦ itcms occurrin¸
jcr transaction dcnsity avcra¸c transaction sizc , num¦cr ot itcms
Christian Borgelt Frequent Pattern Mining 153
Experiments: Data Sets
• T10I4D100K
An artiﬁcia¦ data sct ¸cncratcd with lL`’s data ¸cncrator
Thc namc is tormcd trom thc jaramctcrs ¸ivcn to thc ¸cncrator
(tor cxamj¦c 100I 100000 transactions)
S¨0 itcms, 100000 transactions
avcra¸c transaction sizc ≈ 10.1, dcnsity ≈ 0.012
• BMSWebview1
A wc¦ c¦ick strcam trom a ¦c¸carc comjany that no ¦on¸cr cxists
lt has ¦ccn uscd in thc ILL cuj 2000 and is a joju¦ar ¦cnchmark
!9¨ itcms, `9o02 transactions
avcra¸c transaction sizc ≈ 2.`, dcnsity ≈ 0.00`
Thc density ot a transaction data¦asc is thc avcra¸c traction ot a¦¦ itcms occurrin¸
jcr transaction dcnsity avcra¸c transaction sizc , num¦cr ot itcms
Christian Borgelt Frequent Pattern Mining 154
Experiments: Programs and Test System
• A¦¦ jro¸rams arc my own imj¦cmcntations
A¦¦ usc thc samc codc tor rcadin¸ thc transaction data¦asc
and tor writin¸ thc tound trcqucnt itcm scts
Thcrctorc diﬀcrcnccs in sjccd can on¦y ¦c thc cﬀcct ot thc jroccssin¸ schcmcs
• Thcsc jro¸rams and thcir sourcc codc can ¦c tound on my wc¦ sitc
http://www.borgelt.net/fpm.html
◦ Ajriori http://www.borgelt.net/apriori.html
◦ Lc¦at http://www.borgelt.net/eclat.html
◦ IlGrowth http://www.borgelt.net/fpgrowth.html
◦ lL¦im http://www.borgelt.net/relim.html
◦ Sa` http://www.borgelt.net/sam.html
• Thc tcst systcm was an lL`,Lcnovo Xo0s ¦ajtoj
(lntc¦ Ccntrino Luo L2!00, 1o¨ GLz, 1 GL main mcmory)
runnin¸ SuSL Linux 103. jro¸rams wcrc comji¦cd with ¸cc !21
Christian Borgelt Frequent Pattern Mining 155
Experiments: Execution Times
1000 1200 1400 1600 1800 2000
1
0
1
2
apriori
eclat
fpgrowth
relim
sam
chess
0 5 10 15 20 25 30 35 40 45 50
0
1
apriori
eclat
fpgrowth
relim
sam
relim h
T10I4D100K
0 10 20 30 40 50 60 70 80 90 100
0
1
apriori
eclat
fpgrowth
relim
sam
census
33 34 35 36 37 38 39 40
0
1
apriori
eclat
fpgrowth
relim
sam
webview1
Lccima¦ ¦o¸arithm ot cxccution timc in scconds ovcr a¦so¦utc minimum sujjort
Christian Borgelt Frequent Pattern Mining 156
Reminder: Perfect Extensions
• Thc scarch can ¦c imjrovcd with soca¦¦cd perfect extension pruning
• Givcn an itcm sct I, an itcm a / ∈ I is ca¦¦cd a perfect extension ot I,
iﬀ I and I ∪¦a¦ havc thc samc sujjort (a¦¦ transactions containin¸ I contain a)
• lcrtcct cxtcnsions havc thc to¦¦owin¸ jrojcrtics
◦ lt thc itcm a is a jcrtcct cxtcnsion ot an itcm sct I,
thcn a is a¦so a jcrtcct cxtcnsion ot any itcm sct J ⊇ I (as ¦on¸ as a / ∈ J)
◦ lt I is a trcqucnt itcm sct and X is thc sct ot a¦¦ jcrtcct cxtcnsions ot I,
thcn a¦¦ scts I ∪ J with J ∈ 2
X
(whcrc 2
X
dcnotcs thc jowcr sct ot X)
arc a¦so trcqucnt and havc thc samc sujjort as I
• This can ¦c cxj¦oitcd ¦y co¦¦cctin¸ jcrtcct cxtcnsion itcms in thc rccursion,
in a third c¦cmcnt ot a su¦jro¦¦cm dcscrijtion S (D, P, X)
• Oncc idcntiﬁcd, jcrtcct cxtcnsion itcms arc no ¦on¸cr jroccsscd in thc rccursion,
¦ut arc on¦y uscd to ¸cncratc a¦¦ sujcrscts ot thc jrcﬁx havin¸ thc samc sujjort
Christian Borgelt Frequent Pattern Mining 157
Experiments: Perfect Extension Pruning
1000 1200 1400 1600 1800 2000
1
0
1
2
w/o pep
apriori
eclat
fpgrowth
chess
0 5 10 15 20 25 30 35 40 45 50
0
1
w/o pep
apriori
eclat
fpgrowth
T10I4D100K
0 10 20 30 40 50 60 70 80 90 100
0
1 w/o pep
apriori
eclat
fpgrowth
census
33 34 35 36 37 38 39 40
0
1
w/o pep
apriori
eclat
fpgrowth
webview1
Lccima¦ ¦o¸arithm ot cxccution timc in scconds ovcr a¦so¦utc minimum sujjort
Christian Borgelt Frequent Pattern Mining 158
Reducing the Output:
Closed and Maximal Item Sets
Christian Borgelt Frequent Pattern Mining 159
Maximal Item Sets
• Considcr thc sct ot maximal (frequent) item sets
M
T
(s
min
) ¦I ⊆ B [ s
T
(I) ≥ s
min
∧ ∀J ⊃ I s
T
(J) < s
min
¦.
That is An item set is maximal if it is frequent,
but none of its proper supersets is frequent.
• Sincc with this dcﬁnition wc know that
∀s
min
∀I ∈ F
T
(s
min
) I ∈ M
T
(s
min
) ∨ ∃J ⊃ I s
T
(J) ≥ s
min
it to¦¦ows (can casi¦y ¦c jrovcn ¦y succcssivc¦y cxtcndin¸ thc itcm sct I)
∀s
min
∀I ∈ F
T
(s
min
) ∃J ∈ M
T
(s
min
) I ⊆ J.
That is Every frequent item set has a maximal superset.
• Thcrctorc ∀s
min
F
T
(s
min
)
_
I∈M
T
(s
min
)
2
I
Christian Borgelt Frequent Pattern Mining 160
Mathematical Excursion: Maximal Elements
• Lct R ¦c a su¦sct ot a jartia¦¦y ordcrcd sct (S, ≤)
An c¦cmcnt x ∈ R is ca¦¦cd maximal or a maximal element ot R it
∀y ∈ R x ≤ y ⇒ x y.
• Thc notions minimal and minimal element arc dcﬁncd ana¦o¸ous¦y
• `axima¦ c¦cmcnts nccd not ¦c uniquc,
¦ccausc thcrc may ¦c c¦cmcnts x, y ∈ R with ncithcr x ≤ y nor y ≤ x
• lnﬁnitc jartia¦¦y ordcrcd scts nccd not josscss a maxima¦,minima¦ c¦cmcnt
• Lcrc wc considcr thc sct F
T
(s
min
) as a su¦sct ot thc jartia¦¦y ordcrcd sct (2
B
, ⊆)
Thc maximal (frequent) item sets arc thc maxima¦ c¦cmcnts ot F
T
(s
min
)
M
T
(s
min
) ¦I ∈ F
T
(s
min
) [ ∀J ∈ F
T
(s
min
) I ⊆ J ⇒ I J¦.
That is, no sujcrsct ot a maxima¦ (trcqucnt) itcm sct is trcqucnt
Christian Borgelt Frequent Pattern Mining 161
Maximal Item Sets: Example
transaction data¦asc
1 ¦a, d, e¦
2 ¦b, c, d¦
3 ¦a, c, e¦
! ¦a, c, d, e¦
` ¦a, e¦
o ¦a, c, d¦
¨ ¦b, c¦
S ¦a, c, d, e¦
9 ¦b, c, e¦
10 ¦a, d, e¦
trcqucnt itcm scts
0 itcms 1 itcm 2 itcms 3 itcms
∅ 10 ¦a¦ ¨ ¦a, c¦ ! ¦a, c, d¦ 3
¦b¦ 3 ¦a, d¦ ` ¦a, c, e¦ 3
¦c¦ ¨ ¦a, e¦ o ¦a, d, e¦ !
¦d¦ o ¦b, c¦ 3
¦e¦ ¨ ¦c, d¦ !
¦c, e¦ !
¦d, e¦ !
• Thc maxima¦ itcm scts arc
¦b, c¦, ¦a, c, d¦, ¦a, c, e¦, ¦a, d, e¦.
• Lvcry trcqucnt itcm sct is a su¦sct ot at ¦cast onc ot thcsc scts
Christian Borgelt Frequent Pattern Mining 162
Hasse Diagram and Maximal Item Sets
transaction data¦asc
1 ¦a, d, e¦
2 ¦b, c, d¦
3 ¦a, c, e¦
! ¦a, c, d, e¦
` ¦a, e¦
o ¦a, c, d¦
¨ ¦b, c¦
S ¦a, c, d, e¦
9 ¦b, c, e¦
10 ¦a, d, e¦
lcd ¦oxcs arc maxima¦
itcm scts, whitc ¦oxcs
intrcqucnt itcm scts
Lassc dia¸ram with maxima¦ itcm scts (s
min
3)
a b c d e
ab ac ad ae bc bd be cd ce de
abc abd abe acd ace ade bcd bce bde cde
abcd abce abde acde bcde
abcde
Christian Borgelt Frequent Pattern Mining 163
Limits of Maximal Item Sets
• Thc sct ot maxima¦ itcm scts cajturcs thc sct ot a¦¦ trcqucnt itcm scts,
¦ut thcn wc know at most thc sujjort ot thc maxima¦ itcm scts cxact¦y
• A¦out thc sujjort ot a nonmaxima¦ trcqucnt itcm sct wc on¦y know
∀s
min
∀I ∈ F
T
(s
min
) −M
T
(s
min
) s
T
(I) ≥ max
J∈M
T
(s
min
),J⊃I
s
T
(J).
This rc¦ation to¦¦ows immcdiatc¦y trom ∀I ∀J ⊇ I s
T
(I) ≥ s
T
(J),
that is, an itcm sct cannot havc a ¦owcr sujjort than any ot its sujcrscts
• `otc that wc havc ¸cncra¦¦y
∀s
min
∀I ∈ F
T
(s
min
) s
T
(I) ≥ max
J∈M
T
(s
min
),J⊇I
s
T
(J).
• Question: Can wc ﬁnd a su¦sct ot thc sct ot a¦¦ trcqucnt itcm scts,
which a¦so jrcscrvcs know¦cd¸c ot a¦¦ sujjort va¦ucs´
Christian Borgelt Frequent Pattern Mining 164
Closed Item Sets
• Considcr thc sct ot closed (frequent) item sets
C
T
(s
min
) ¦I ⊆ B [ s
T
(I) ≥ s
min
∧ ∀J ⊃ I s
T
(J) < s
T
(I)¦.
That is An item set is closed if it is frequent,
but none of its proper supersets has the same support.
• Sincc with this dcﬁnition wc know that
∀s
min
∀I ∈ F
T
(s
min
) I ∈ C
T
(s
min
) ∨ ∃J ⊃ I s
T
(J) s
T
(I)
it to¦¦ows (can casi¦y ¦c jrovcn ¦y succcssivc¦y cxtcndin¸ thc itcm sct I)
∀s
min
∀I ∈ F
T
(s
min
) ∃J ∈ C
T
(s
min
) I ⊆ J.
That is Every frequent item set has a closed superset.
• Thcrctorc ∀s
min
F
T
(s
min
)
_
I∈C
T
(s
min
)
2
I
Christian Borgelt Frequent Pattern Mining 165
Closed Item Sets
• Lowcvcr, not on¦y has cvcry trcqucnt itcm sct a c¦oscd sujcrsct,
¦ut it has a closed superset with the same support
∀s
min
∀I ∈ F
T
(s
min
) ∃J ⊇ I J ∈ C
T
(s
min
) ∧ s
T
(J) s
T
(I).
(lroot scc thc considcrations on thc ncxt s¦idc)
• Thc sct ot a¦¦ c¦oscd itcm scts jrcscrvcs know¦cd¸c ot a¦¦ sujjort va¦ucs
∀s
min
∀I ∈ F
T
(s
min
) s
T
(I) max
J∈C
T
(s
min
),J⊇I
s
T
(J).
• `otc that thc wcakcr statcmcnt
∀s
min
∀I ∈ F
T
(s
min
) s
T
(I) ≥ max
J∈C
T
(s
min
),J⊇I
s
T
(J)
to¦¦ows immcdiatc¦y trom ∀I ∀J ⊇ I s
T
(I) ≥ s
T
(J), that is,
an itcm sct cannot havc a ¦owcr sujjort than any ot its sujcrscts
Christian Borgelt Frequent Pattern Mining 166
Closed Item Sets
• Alternative characterization of closed item sets:
I is c¦oscd ⇔ s
T
(I) ≥ s
min
∧ I
k∈K
T
(I)
t
k
.
lcmindcr K
T
(I) ¦k ∈ ¦1, . . . , n¦ [ I ⊆ t
k
¦ is thc cover ot I wrt T
• This is dcrivcd as to¦¦ows sincc ∀k ∈ K
T
(I) I ⊆ t
k
, it is o¦vious that
∀s
min
∀I ∈ F
T
(s
min
) I ⊆
k∈K
T
(I)
t
k
,
lt I ⊂
k∈K
T
(I)
t
k
, it is not c¦oscd, sincc
k∈K
T
(I)
t
k
has thc samc sujjort
On thc othcr hand, no sujcrsct ot
k∈K
T
(I)
t
k
has thc covcr K
T
(I)
• `otc that thc a¦ovc charactcrization a¦¦ows us to construct tor any itcm sct
thc (uniquc¦y dctcrmincd) c¦oscd sujcrsct that has thc samc sujjort
Christian Borgelt Frequent Pattern Mining 167
Mathematical Excursion: Closure Operators
• A closure operator on a sct S is a tunction cl 2
S
→ 2
S
,
which satisﬁcs thc to¦¦owin¸ conditions ∀X, Y ⊆ S
◦ X ⊆ cl (X) (cl is extensive)
◦ X ⊆ Y ⇒ cl (X) ⊆ cl (Y ) (cl is increasing or monotone)
◦ cl (cl (X)) cl (X) (cl is idempotent)
• A sct R ⊆ S is ca¦¦cd closed it it is cqua¦ to its c¦osurc
R is c¦oscd ⇔ R cl (R).
• Thc closed (frequent) item sets arc induccd ¦y thc c¦osurc ojcrator
cl (I)
k∈K
T
(I)
t
k
.
rcstrictcd to thc sct ot trcqucnt itcm scts
C
T
(s
min
) ¦I ∈ F
T
(s
min
) [ I cl (I)¦
Christian Borgelt Frequent Pattern Mining 168
Mathematical Excursion: Galois Connections
• Lct (X, _
X
) and (Y, _
Y
) ¦c two jartia¦¦y ordcrcd scts
• A tunction jair (f
1
, f
2
) with f
1
X → Y and f
2
Y → X
is ca¦¦cd a (monotone) Galois connection iﬀ
◦ ∀A
1
, A
2
∈ X A
1
_ A
2
⇒ f
1
(A
1
) _ f
1
(A
2
),
◦ ∀B
1
, B
2
∈ Y B
1
_ B
2
⇒ f
2
(B
1
) _ f
2
(B
2
),
◦ ∀A ∈ X ∀B ∈ Y A _ f
2
(B) ⇔ B _ f
1
(A)
• A tunction jair (f
1
, f
2
) with f
1
X → Y and f
2
Y → X
is ca¦¦cd an antimonotone Galois connection iﬀ
◦ ∀A
1
, A
2
∈ X A
1
_ A
2
⇒ f
1
(A
1
) _ f
1
(A
2
),
◦ ∀B
1
, B
2
∈ Y B
1
_ B
2
⇒ f
2
(B
1
) _ f
2
(B
2
),
◦ ∀A ∈ X ∀B ∈ Y A _ f
2
(B) ⇔ B _ f
1
(A)
• ln a monotonc Ga¦ois conncction, ¦oth f
1
and f
2
arc monotonc,
in an antimonotonc Ga¦ois conncction, ¦oth f
1
and f
2
arc antimonotonc
Christian Borgelt Frequent Pattern Mining 169
Mathematical Excursion: Galois Connections
• Lct thc two scts X and Y ¦c jowcr scts ot somc scts U and V , rcsjcctivc¦y,
and ¦ct thc jartia¦ ordcrs ¦c thc su¦sct rc¦ations on thcsc jowcr scts, that is, ¦ct
(X, _
X
) (2
U
, ⊆) and (Y, _
Y
) (2
V
, ⊆).
• Thcn thc com¦ination f
1
◦ f
2
X → X ot thc tunctions ot a Ga¦ois conncction
is a closure operator (as wc¦¦ as thc com¦ination f
2
◦ f
1
Y → Y )
(i) ∀A ⊆ U A ⊆ f
2
(f
1
(A)) (a c¦osurc ojcrator is extensive)
◦ Sincc (f
1
, f
2
) is a Ga¦ois conncction, wc know
∀A ⊆ U ∀B ⊆ V A ⊆ f
2
(B) ⇔ B ⊆ f
1
(A).
◦ Choosc B f
1
(A)
∀A ⊆ U A ⊆ f
2
(f
1
(A)) ⇔ f
1
(A) ⊆ f
1
(A)
. ¸¸ .
truc
.
◦ Choosc A f
2
(B)
∀B ⊆ V f
2
(B) ⊆ f
2
(B)
. ¸¸ .
truc
⇔ B ⊆ f
1
(f
2
(B)).
Christian Borgelt Frequent Pattern Mining 170
Mathematical Excursion: Galois Connections
(ii) ∀A
1
, A
2
⊆ U A
1
⊆ A
2
⇒ f
2
(f
1
(A
1
)) ⊆ f
2
(f
1
(A
2
))
(a c¦osurc ojcrator is increasing or monotone)
◦ This jrojcrty to¦¦ows immcdiatc¦y trom thc tact that
thc tunctions f
1
and f
2
arc ¦oth (anti)monotonc
◦ lt f
1
and f
2
arc ¦oth monotonc, wc havc
∀A
1
, A
2
⊆ U A
1
⊆ A
2
⇒ ∀A
1
, A
2
⊆ U f
1
(A
1
) ⊆ f
1
(A
2
)
⇒ ∀A
1
, A
2
⊆ U f
2
(f
1
(A
1
)) ⊆ f
2
(f
1
(A
2
)).
◦ lt f
1
and f
2
arc ¦oth antimonotonc, wc havc
∀A
1
, A
2
⊆ U A
1
⊆ A
2
⇒ ∀A
1
, A
2
⊆ U f
1
(A
1
) ⊇ f
1
(A
2
)
⇒ ∀A
1
, A
2
⊆ U f
2
(f
1
(A
1
)) ⊆ f
2
(f
1
(A
2
)).
Christian Borgelt Frequent Pattern Mining 171
Mathematical Excursion: Galois Connections
(ii) ∀A ⊆ U f
2
(f
1
(f
2
(f
1
(A)))) f
2
(f
1
(A)) (a c¦osurc ojcrator is idempotent)
◦ Sincc ¦oth f
1
◦ f
2
and f
2
◦ f
1
arc cxtcnsivc (scc a¦ovc), wc know
∀A ⊆ V A ⊆ f
2
(f
1
(A)) ⊆ f
2
(f
1
(f
2
(f
1
(A))))
∀B ⊆ V B ⊆ f
1
(f
2
(B)) ⊆ f
1
(f
2
(f
1
(f
2
(B))))
◦ Choosin¸ B f
1
(A
′
) with A
′
⊆ U, wc o¦tain
∀A
′
⊆ U f
1
(A
′
) ⊆ f
1
(f
2
(f
1
(f
2
(f
1
(A
′
))))).
◦ Sincc (f
1
, f
2
) is a Ga¦ois conncction, wc know
∀A ⊆ U ∀B ⊆ V A ⊆ f
2
(B) ⇔ B ⊆ f
1
(A).
◦ Choosin¸ A f
2
(f
1
(f
2
(f
1
(A
′
)))) and B f
1
(A
′
), wc o¦tain
∀A
′
⊆ U f
2
(f
1
(f
2
(f
1
(A
′
)))) ⊆ f
2
(f
1
(A
′
))
⇔ f
1
(A
′
) ⊆ f
1
(f
2
(f
1
(f
2
(f
1
(A
′
)))))
. ¸¸ .
truc (scc a¦ovc)
.
Christian Borgelt Frequent Pattern Mining 172
Galois Connections in Frequent Item Set Mining
• Considcr thc jartia¦¦y ordcr scts (2
B
, ⊆) and (2
¦1,...,n¦
, ⊆)
Lct f
1
2
B
→ 2
¦1,...,n¦
, I → K
T
(I) ¦k ∈ ¦1, . . . , n¦ [ I ⊆ t
k
¦
and f
2
2
¦1,...,n¦
→ 2
B
, J →
j∈J
t
j
¦i ∈ B [ ∀j ∈ J i ∈ t
j
¦
• Thc tunction jair (f
1
, f
2
) is an antimonotone Galois connection
◦ ∀I
1
, I
2
∈ 2
B
I
1
⊆ I
2
⇒ f
1
(I
1
) K
T
(I
1
) ⊇ K
T
(I
2
) f
1
(I
2
),
◦ ∀J
1
, J
2
∈ 2
¦1,...,n¦
J
1
⊆ J
2
⇒ f
2
(J
1
)
k∈J
1
t
k
⊇
k∈J
2
t
k
f
2
(J
2
),
◦ ∀I ∈ 2
B
∀J ∈ 2
¦1,...,n¦
I ⊆ f
2
(J)
j∈J
t
j
⇔ J ⊆ f
1
(I) K
T
(I)
• As a conscqucncc f
1
◦ f
2
2
B
→ 2
B
, I →
k∈K
T
(I)
t
k
is a closure operator
Christian Borgelt Frequent Pattern Mining 173
Galois Connections in Frequent Item Set Mining
• Likcwisc f
2
◦ f
1
2
¦1,...,n¦
→ 2
¦1,...,n¦
, J → K
T
(
j∈J
t
j
)
is a¦so a closure operator
• Iurthcrmorc, it wc rcstrict our considcrations to thc rcsjcctivc scts
ot c¦oscd scts in ¦oth domains, that is, to thc scts
(
B
¦I ⊆ B [ I f
2
(f
1
(I))
k∈K
T
(I)
t
k
¦ and
(
T
¦J ⊆ ¦1, . . . , n¦ [ J f
1
(f
2
(J)) K
T
(
j∈J
t
j
)¦,
thcrc cxists a 1to1 relationship ¦ctwccn thcsc two scts,
which is dcscri¦cd ¦y thc Ga¦ois conncction
f
′
1
f
1
[
(
B
is a bijection with f
′−1
1
f
′
2
f
2
[
(
T
(This to¦¦ows immcdiatc¦y trom thc tacts that thc Ga¦ois conncction
dcscri¦cs c¦osurc ojcrators and that a c¦osurc ojcrator is idcmjotcnt)
• Thcrctorc ﬁndin¸ c¦oscd itcm scts with a ¸ivcn minimum support is cquiva¦cnt
to ﬁndin¸ c¦oscd scts ot transaction idcntiﬁcrs ot a ¸ivcn minimum size
Christian Borgelt Frequent Pattern Mining 174
Closed Item Sets: Example
transaction data¦asc
1 ¦a, d, e¦
2 ¦b, c, d¦
3 ¦a, c, e¦
! ¦a, c, d, e¦
` ¦a, e¦
o ¦a, c, d¦
¨ ¦b, c¦
S ¦a, c, d, e¦
9 ¦b, c, e¦
10 ¦a, d, e¦
trcqucnt itcm scts
0 itcms 1 itcm 2 itcms 3 itcms
∅ 10 ¦a¦ ¨ ¦a, c¦ ! ¦a, c, d¦ 3
¦b¦ 3 ¦a, d¦ ` ¦a, c, e¦ 3
¦c¦ ¨ ¦a, e¦ o ¦a, d, e¦ !
¦d¦ o ¦b, c¦ 3
¦e¦ ¨ ¦c, d¦ !
¦c, e¦ !
¦d, e¦ !
• A¦¦ trcqucnt itcm scts arc c¦oscd with thc cxccjtion ot ¦b¦ and ¦d, e¦
• ¦b¦ is a su¦sct ot ¦b, c¦, ¦oth havc a sujjort ot 3 ˆ 30/
¦d, e¦ is a su¦sct ot ¦a, d, e¦, ¦oth havc a sujjort ot ! ˆ !0/
Christian Borgelt Frequent Pattern Mining 175
Hasse diagram and Closed Item Sets
transaction data¦asc
1 ¦a, d, e¦
2 ¦b, c, d¦
3 ¦a, c, e¦
! ¦a, c, d, e¦
` ¦a, e¦
o ¦a, c, d¦
¨ ¦b, c¦
S ¦a, c, d, e¦
9 ¦b, c, e¦
10 ¦a, d, e¦
lcd ¦oxcs arc c¦oscd
itcm scts, whitc ¦oxcs
intrcqucnt itcm scts
Lassc dia¸ram with c¦oscd itcm scts (s
min
3)
a b c d e
ab ac ad ae bc bd be cd ce de
abc abd abe acd ace ade bcd bce bde cde
abcd abce abde acde bcde
abcde
Christian Borgelt Frequent Pattern Mining 176
Reminder: Perfect Extensions
• Thc scarch can ¦c imjrovcd with soca¦¦cd perfect extension pruning
• Givcn an itcm sct I, an itcm a / ∈ I is ca¦¦cd a perfect extension ot I,
iﬀ I and I ∪¦a¦ havc thc samc sujjort (a¦¦ transactions containin¸ I contain a)
• lcrtcct cxtcnsions havc thc to¦¦owin¸ jrojcrtics
◦ lt thc itcm a is a jcrtcct cxtcnsion ot an itcm sct I,
thcn a is a¦so a jcrtcct cxtcnsion ot any itcm sct J ⊇ I (as ¦on¸ as a / ∈ J)
◦ lt I is a trcqucnt itcm sct and X is thc sct ot a¦¦ jcrtcct cxtcnsions ot I,
thcn a¦¦ scts I ∪ J with J ∈ 2
X
(whcrc 2
X
dcnotcs thc jowcr sct ot X)
arc a¦so trcqucnt and havc thc samc sujjort as I
• This can ¦c cxj¦oitcd ¦y co¦¦cctin¸ jcrtcct cxtcnsion itcms in thc rccursion,
in a third c¦cmcnt ot a su¦jro¦¦cm dcscrijtion S (D, P, X)
• Oncc idcntiﬁcd, jcrtcct cxtcnsion itcms arc no ¦on¸cr jroccsscd in thc rccursion,
¦ut arc on¦y uscd to ¸cncratc a¦¦ sujcrscts ot thc jrcﬁx havin¸ thc samc sujjort
Christian Borgelt Frequent Pattern Mining 177
Closed Item Sets and Perfect Extensions
transaction data¦asc
1 ¦a, d, e¦
2 ¦b, c, d¦
3 ¦a, c, e¦
! ¦a, c, d, e¦
` ¦a, e¦
o ¦a, c, d¦
¨ ¦b, c¦
S ¦a, c, d, e¦
9 ¦b, c, e¦
10 ¦a, d, e¦
trcqucnt itcm scts
0 itcms 1 itcm 2 itcms 3 itcms
∅ 10 ¦a¦ ¨ ¦a, c¦ ! ¦a, c, d¦ 3
¦b¦ 3 ¦a, d¦ ` ¦a, c, e¦ 3
¦c¦ ¨ ¦a, e¦ o ¦a, d, e¦ !
¦d¦ o ¦b, c¦ 3
¦e¦ ¨ ¦c, d¦ !
¦c, e¦ !
¦d, e¦ !
• c is a jcrtcct cxtcnsion ot ¦b¦ as ¦b¦ and ¦b, c¦ ¦oth havc sujjort 3
• a is a jcrtcct cxtcnsion ot ¦d, e¦ as ¦d, e¦ and ¦a, d, e¦ ¦oth havc sujjort !
• `onc¦oscd itcm scts josscss at ¦cast onc jcrtcct cxtcnsion,
c¦oscd itcm scts do not josscss any jcrtcct cxtcnsions
Christian Borgelt Frequent Pattern Mining 178
Relation of Maximal and Closed Item Sets
empty set
item base
maxima¦ (trcqucnt) itcm scts
empty set
item base
c¦oscd (trcqucnt) itcm scts
• Thc sct ot c¦oscd itcm scts is thc union ot thc scts ot maxima¦ itcm scts
tor a¦¦ minimum sujjort va¦ucs at ¦cast as ¦ar¸c as s
min
C
T
(s
min
)
_
s∈¦s
min
,s
min
+1,...,n−1,n¦
M
T
(s)
Christian Borgelt Frequent Pattern Mining 179
Types of Frequent Item Sets: Summary
• Frequent Item Set
Any trcqucnt itcm sct (sujjort is hi¸hcr than thc minima¦ sujjort)
I trcqucnt ⇔ s
T
(I) ≥ s
min
• Closed (Frequent) Item Set
A trcqucnt itcm sct is ca¦¦cd closed it no sujcrsct has thc samc sujjort
I c¦oscd ⇔ s
T
(I) ≥ s
min
∧ ∀J ⊃ I s
T
(J) < s
T
(I)
• Maximal (Frequent) Item Set
A trcqucnt itcm sct is ca¦¦cd maximal it no sujcrsct is trcqucnt
I maxima¦ ⇔ s
T
(I) ≥ s
min
∧ ∀J ⊃ I s
T
(J) < s
min
• O¦vious rc¦ations ¦ctwccn thcsc tyjcs ot itcm scts
◦ A¦¦ maxima¦ itcm scts and a¦¦ c¦oscd itcm scts arc trcqucnt
◦ A¦¦ maxima¦ itcm scts arc c¦oscd
Christian Borgelt Frequent Pattern Mining 180
Types of Frequent Item Sets: Summary
0 itcms 1 itcm 2 itcms 3 itcms
∅
+
10 ¦a¦
+
¨ ¦a, c¦
+
! ¦a, c, d¦
+∗
3
¦b¦ 3 ¦a, d¦
+
` ¦a, c, e¦
+∗
3
¦c¦
+
¨ ¦a, e¦
+
o ¦a, d, e¦
+∗
!
¦d¦
+
o ¦b, c¦
+∗
3
¦e¦
+
¨ ¦c, d¦
+
!
¦c, e¦
+
!
¦d, e¦ !
• Frequent Item Set
Any trcqucnt itcm sct (sujjort is hi¸hcr than thc minima¦ sujjort)
• Closed (Frequent) Item Set (markcd with
+
)
A trcqucnt itcm sct is ca¦¦cd closed it no sujcrsct has thc samc sujjort
• Maximal (Frequent) Item Set (markcd with
∗
)
A trcqucnt itcm sct is ca¦¦cd maximal it no sujcrsct is trcqucnt
Christian Borgelt Frequent Pattern Mining 181
Experiments: Data Sets (Reminder)
• Chess
A data sct ¦istin¸ chcss cnd ¸amc jositions tor kin¸ vs kin¸ and rook
This data sct is jart ot thc ¹Cl machinc ¦carnin¸ rcjository
¨` itcms, 319o transactions
avcra¸c transaction sizc 3¨, dcnsity ≈ 0.`
• Census
A data sct dcrivcd trom an cxtract ot thc ¹S ccnsus ¦urcau data ot 199!,
which was jrcjroccsscd ¦y discrctizin¸ numcric attri¦utcs
This data sct is jart ot thc ¹Cl machinc ¦carnin¸ rcjository
13` itcms, !SS!2 transactions
avcra¸c transaction sizc 1!, dcnsity ≈ 0.1
Thc density ot a transaction data¦asc is thc avcra¸c traction ot a¦¦ itcms occurrin¸
jcr transaction dcnsity avcra¸c transaction sizc , num¦cr ot itcms
Christian Borgelt Frequent Pattern Mining 182
Experiments: Data Sets (Reminder)
• T10I4D100K
An artiﬁcia¦ data sct ¸cncratcd with lL`’s data ¸cncrator
Thc namc is tormcd trom thc jaramctcrs ¸ivcn to thc ¸cncrator
(tor cxamj¦c 100I 100000 transactions)
S¨0 itcms, 100000 transactions
avcra¸c transaction sizc ≈ 10.1, dcnsity ≈ 0.012
• BMSWebview1
A wc¦ c¦ick strcam trom a ¦c¸carc comjany that no ¦on¸cr cxists
lt has ¦ccn uscd in thc ILL cuj 2000 and is a joju¦ar ¦cnchmark
!9¨ itcms, `9o02 transactions
avcra¸c transaction sizc ≈ 2.`, dcnsity ≈ 0.00`
Thc density ot a transaction data¦asc is thc avcra¸c traction ot a¦¦ itcms occurrin¸
jcr transaction dcnsity avcra¸c transaction sizc , num¦cr ot itcms
Christian Borgelt Frequent Pattern Mining 183
Types of Frequent Item Sets: Experiments
1000 1200 1400 1600 1800 2000
4
5
6
7
frequent
closed
maximal
chess
0 5 10 15 20 25 30 35 40 45 50
4
5
6
frequent
closed
maximal
T10I4D100K
0 10 20 30 40 50 60 70 80 90 100
5
6
7 frequent
closed
maximal
census
33 34 35 36 37 38 39 40
4
5
6
7
8
frequent
closed
maximal
webview1
Lccima¦ ¦o¸arithm ot thc num¦cr ot itcm scts ovcr a¦so¦utc minimum sujjort
Christian Borgelt Frequent Pattern Mining 184
Reminder: Perfect Extension Pruning
1000 1200 1400 1600 1800 2000
1
0
1
2
w/o pep
apriori
eclat
fpgrowth
chess
0 5 10 15 20 25 30 35 40 45 50
0
1
w/o pep
apriori
eclat
fpgrowth
T10I4D100K
0 10 20 30 40 50 60 70 80 90 100
0
1 w/o pep
apriori
eclat
fpgrowth
census
33 34 35 36 37 38 39 40
0
1
w/o pep
apriori
eclat
fpgrowth
webview1
Lccima¦ ¦o¸arithm ot cxccution timc in scconds ovcr a¦so¦utc minimum sujjort
Christian Borgelt Frequent Pattern Mining 185
Searching for Closed and Maximal Item Sets
Christian Borgelt Frequent Pattern Mining 186
Searching for Closed Frequent Item Sets
• \c know that it suﬃccs to ﬁnd thc c¦oscd itcm scts to¸cthcr with thcir sujjort
• Thc charactcrization ot c¦oscd itcm scts ¦y
I c¦oscd ⇔ s
T
(I) ≥ s
min
∧ I
k∈K
T
(I)
t
k
su¸¸csts to ﬁnd thcm ¦y tormin¸ a¦¦ jossi¦¦c intcrscctions ot thc transactions
(with at ¦cast s
min
transactions) and chcckin¸ thcir sujjort
• Lowcvcr, ajjroachcs usin¸ this idca arc rarc¦y comjctitivc with othcr mcthods
• Sjccia¦ cascs in which thcy arc comjctitivc arc domains with tcw transactions and
vcry many itcms An cxamj¦c ot such a domain is ¸cnc cxjrcssion ana¦ysis
• lmj¦cmcntations ot intcrscction ajjroachcs can ¦c tound hcrc
http://www.borgelt.net/carpenter.html
http://www.borgelt.net/ista.html
Christian Borgelt Frequent Pattern Mining 187
Filtering Frequent Item Sets
• lt on¦y c¦oscd itcm scts or on¦y maxima¦ itcm scts arc to ¦c tound with itcm sct
cnumcration ajjroachcs, thc tound trcqucnt itcm scts havc to ¦c ﬁ¦tcrcd
• Somc usctu¦ notions tor ﬁ¦tcrin¸ and jrunin¸
◦ Thc head H ⊆ B ot a scarch trcc nodc is thc sct ot itcms on thc jath
¦cadin¸ to it lt is thc jrcﬁx ot thc conditiona¦ data¦asc tor this nodc
◦ Thc tail L ⊆ B ot a scarch trcc nodc is thc sct ot itcms that arc trcqucnt
in its conditiona¦ data¦asc Thcy arc thc jossi¦¦c cxtcnsions ot H
◦ `otc that ∀h ∈ H ∀l ∈ L h < l
◦ E ¦i ∈ B−H [ ∃h ∈ H h > i¦ is thc sct ot eliminated items
Thcsc itcms arc not considcrcd anymorc in thc corrcsjondin¸ su¦trcc
• `otc that thc itcms in thc tai¦ and thcir sujjort in thc conditiona¦ data¦asc
arc known, at ¦cast attcr thc scarch rcturns trom thc rccursivc jroccssin¸
Christian Borgelt Frequent Pattern Mining 188
Head, Tail and Eliminated Items
a b c d e
ab ac ad ae bc bd be cd ce de
abc abd abe acd ace ade bcd bce bde cde
abcd abce abde acde bcde
abcde
a
b c
d
b
c d c d d
c d d d
d
A (tu¦¦) jrcﬁx trcc tor thc ﬁvc itcms a, b, c, d, e
• Thc ¦¦uc ¦oxcs arc thc trcqucnt itcm scts
• Ior thc cncirc¦cd scarch trcc nodcs wc havc
rcd hcad H ¦b¦, tai¦ L ¦c¦, c¦iminatcd itcms E ¦a¦
¸rccn hcad H ¦a, c¦, tai¦ L ¦d, e¦, c¦iminatcd itcms E ¦b¦
Christian Borgelt Frequent Pattern Mining 189
Closed and Maximal Item Sets
• \hcn ﬁ¦tcrin¸ trcqucnt itcm scts tor c¦oscd and maxima¦ itcm scts
thc to¦¦owin¸ conditions arc casy and cﬃcicnt to chcck
◦ lt thc tai¦ ot a scarch trcc nodc is not cmjty,
its hcad is not a maxima¦ itcm sct
◦ lt an itcm in thc tai¦ ot a scarch trcc nodc has thc samc sujjort
as thc hcad, thc hcad is not a c¦oscd itcm sct
• Lowcvcr, thc invcrsc imj¦ications nccd not ho¦d
◦ lt thc tai¦ ot a scarch trcc nodc is cmjty,
its hcad is not ncccssari¦y a maxima¦ itcm sct
◦ lt no itcm in thc tai¦ ot a scarch trcc nodc has thc samc sujjort
as thc hcad, thc hcad is not ncccssari¦y a c¦oscd itcm sct
• Thc jro¦¦cm arc thc eliminated items,
which can sti¦¦ rcndcr thc hcad nonc¦oscd or nonmaxima¦
Christian Borgelt Frequent Pattern Mining 190
Closed and Maximal Item Sets
Check the Deﬁning Condition Directly:
• Closed Item Sets
Chcck whcthcr ∃a ∈ E K
T
(H) ⊆ K
T
(a)
or chcck whcthcr
k∈K
T
(H)
(t
k
−H) , ∅
lt cithcr is thc casc, H is not c¦oscd, othcrwisc it is
`otc that with thc ¦attcr condition, thc intcrscction can ¦c comjutcd transaction
¦y transaction lt can ¦c conc¦udcd that H is c¦oscd as soon as thc intcrscction
¦ccomcs cmjty
• Maximal Item Sets:
Chcck whcthcr ∃a ∈ E s
T
(H ∪ ¦a¦) ≥ s
min
lt this is thc casc, H is not maxima¦, othcrwisc it is
Christian Borgelt Frequent Pattern Mining 191
Closed and Maximal Item Sets
• Chcckin¸ thc dcﬁnin¸ condition dircct¦y is trivia¦ tor thc tai¦ itcms,
as thcir sujjort va¦ucs arc avai¦a¦¦c trom thc conditiona¦ transaction data¦ascs
• As a conscqucncc, a¦¦ itcm sct cnumcration ajjroachcs tor c¦oscd and
maxima¦ itcm scts chcck thc dcﬁnin¸ condition tor thc tai¦ itcms
• Lowcvcr, chcckin¸ thc dcﬁnin¸ condition can ¦c diﬃcu¦t tor thc c¦iminatcd itcms,
sincc additiona¦ data (¦cyond thc conditiona¦ transaction data¦asc) is nccdcd to
dctcrminc thcir occurrcnccs in thc transactions or thcir sujjort va¦ucs
• lt can dcjcnd on thc data¦asc structurc uscd whcthcr a chcck
ot thc dcﬁnin¸ condition is cﬃcicnt tor thc c¦iminatcd itcms or not
• As a conscqucncc, somc itcm sct cnumcration a¦¸orithms
do not chcck thc dcﬁnin¸ condition tor thc c¦iminatcd itcms,
¦ut rc¦y on a rcjository ot a¦rcady tound c¦oscd or maxima¦ itcm scts
• \ith such a rcjository it can ¦c chcckcd in an indircct way
whcthcr an itcm sct is c¦oscd or maxima¦
Christian Borgelt Frequent Pattern Mining 192
Checking the Eliminated Items: Repository
• Lach tound maxima¦ or c¦oscd itcm sct is storcd in a rcjository
(lrctcrrcd data structurc tor thc rcjository jrcﬁx trcc)
• lt is chcckcd whcthcr a sujcrsct ot thc hcad H with thc samc sujjort
has a¦rcady ¦ccn tound lt ycs, thc hcad H is ncithcr c¦oscd nor maxima¦
• Lvcn morc thc hcad H nccd not ¦c jroccsscd rccursivc¦y,
¦ccausc thc rccursion cannot yic¦d any c¦oscd or maxima¦ itcm scts
Thcrctorc thc currcnt su¦trcc ot thc scarch trcc can ¦c jruncd
• `otc that with a rcjository thc dcjthﬁrst scarch has to jrocccd trom ¦ctt to ri¸ht
◦ \c nccd thc rcjository to chcck tor jossi¦¦y cxistin¸ c¦oscd
or maxima¦ sujcrscts that contain onc or morc c¦iminatcd itcm(s)
◦ ltcm scts containin¸ c¦iminatcd itcms arc considcrcd on¦y
in scarch trcc ¦ranchcs to thc ¦ctt ot thc considcrcd nodc
◦ Thcrctorc thcsc ¦ranchcs must a¦rcady havc ¦ccn jroccsscd
in ordcr to cnsurc that jossi¦¦c sujcrscts havc a¦rcady ¦ccn rccordcd
Christian Borgelt Frequent Pattern Mining 193
Checking the Eliminated Items: Repository
a b c d e
ab ac ad ae bc bd be cd ce de
abc abd abe acd ace ade bcd bce bde cde
abcd abce abde acde bcde
abcde
a
b c
d
b
c d c d d
c d d d
d
A (tu¦¦) jrcﬁx trcc tor thc ﬁvc itcms a, b, c, d, e
• Sujjosc thc jrcﬁx trcc wou¦d ¦c travcrscd trom ri¸ht to ¦ctt
• Ior nonc ot thc trcqucnt itcm scts ¦d, e¦, ¦c, d¦ and ¦c, e¦ it cou¦d ¦c dctcrmincd
with thc hc¦j ot a rcjository that thcy arc not maxima¦, ¦ccausc thc maxima¦
itcm scts ¦a, c, d¦, ¦a, c, e¦, ¦a, d, e¦ havc not ¦ccn jroccsscd thcn
Christian Borgelt Frequent Pattern Mining 194
Checking the Eliminated Items: Repository
• lt a sujcrsct ot thc currcnt hcad H with thc samc sujjort
has a¦rcady ¦ccn tound, thc hcad H nccd not ¦c jroccsscd,
¦ccausc it cannot yic¦d any maxima¦ or c¦oscd itcm scts
• Thc rcason is that a tound jrojcr sujcrsct I ⊃ H with s
T
(I) s
T
(H)
contains at ¦cast onc itcm i ∈ I −H that is a jcrtcct cxtcnsion ot H
• Thc itcm i is an c¦iminatcd itcm, that is, i / ∈ L (itcm i is not in thc tai¦)
(lt i wcrc in L, thc sct I wou¦d not ¦c in thc rcjository a¦rcady)
• lt thc itcm i is a jcrtcct cxtcnsion ot thc hcad H,
it is a jcrtcct cxtcnsion ot a¦¦ sujcrscts J ⊇ H with i / ∈ J
• A¦¦ itcm scts cxj¦orcd trom thc scarch trcc nodc with hcad H and tai¦ L
arc su¦scts ot H ∪ L (¦ccausc on¦y thc itcms in L arc conditiona¦¦y trcqucnt)
• Conscqucnt¦y, thc itcm i is a jcrtcct cxtcnsion ot a¦¦ itcm scts cxj¦orcd trom thc
scarch trcc nodc with hcad H and tai¦ L, and thcrctorc nonc ot thcm can ¦c c¦oscd
Christian Borgelt Frequent Pattern Mining 195
Checking the Eliminated Items: Repository
• lt is usua¦¦y advanta¸cous to usc not ,ust a sin¸¦c, ¸¦o¦a¦ rcjository,
¦ut to crcatc conditiona¦ rcjositorics tor cach rccursivc ca¦¦,
which contain on¦y thc tound c¦oscd itcm scts that contain H
• \ith conditiona¦ rcjositorics thc chcck tor a known sujcrsct rcduccs
to thc chcck whcthcr thc conditiona¦ rcjository contains an itcm sct
with thc ncxt sj¦it itcm and thc samc sujjort as thc currcnt hcad
(`otc that thc chcck is cxccutcd ¦ctorc ¸oin¸ into rccursion,
that is, ¦ctorc constructin¸ thc cxtcndcd hcad ot a chi¦d nodc
lt thc chcck ﬁnds a sujcrsct, thc chi¦d nodc is jruncd)
• Thc conditiona¦ rcjositorics arc o¦taincd ¦y ¦asica¦¦y thc samc ojcration as
thc conditiona¦ transaction data¦ascs (jro,cctin¸,conditionin¸ on thc sj¦it itcm)
• A joju¦ar structurc tor thc rcjository is an Iltrcc,
¦ccausc it a¦¦ows tor simj¦c and cﬃcicnt jro,cction,conditionin¸
Lowcvcr, a simj¦c jrcﬁx trcc that is jro,cctcd tojdown may a¦so ¦c uscd
Christian Borgelt Frequent Pattern Mining 196
Closed and Maximal Item Sets: Pruning
• lt on¦y c¦oscd itcm scts or on¦y maxima¦ itcm scts arc to ¦c tound,
additiona¦ jrunin¸ ot thc scarch trcc ¦ccomcs jossi¦¦c
• Perfect Extension Pruning / Parent Equivalence Pruning (PEP)
◦ Givcn an itcm sct I, an itcm a / ∈ I is ca¦¦cd a perfect extension ot I,
iﬀ thc itcm scts I and I ∪ ¦a¦ havc thc samc sujjort s
T
(I) s
T
(I ∪ ¦a¦)
(that is, it a¦¦ transactions containin¸ I a¦so contain thc itcm a)
Thcn wc know ∀J ⊇ I s
T
(J ∪ ¦a¦) s
T
(J)
◦ As a conscqucncc, no sujcrsct J ⊇ I with a / ∈ J can ¦c c¦oscd
Lcncc a can ¦c addcd dircct¦y to thc jrcﬁx ot thc conditiona¦ data¦asc
• Lct X
T
(I) ¦a [ a / ∈ I ∧s
T
(I ∪¦a¦) s
T
(I)¦ ¦c thc sct ot a¦¦ jcrtcct cxtcnsion
itcms Thcn thc who¦c sct X
T
(I) can ¦c addcd to thc jrcﬁx
• lcrtcct cxtcnsion , jarcnt cquiva¦cncc jrunin¸ can ¦c ajj¦icd tor ¦oth c¦oscd and
maxima¦ itcm scts, sincc a¦¦ maxima¦ itcm scts arc c¦oscd
Christian Borgelt Frequent Pattern Mining 197
Head Union Tail Pruning
• lt on¦y maxima¦ itcm scts arc to ¦c tound, cvcn morc
additiona¦ jrunin¸ ot thc scarch trcc ¦ccomcs jossi¦¦c
• General Idea: A¦¦ trcqucnt itcm scts in thc su¦trcc rootcd at a nodc
with hcad H and tai¦ L arc su¦scts ot H ∪ L
• Maximal Item Set Contains Head ∪ Tail Pruning (MFIHUT)
◦ lt wc ﬁnd out that H ∪ L is a su¦sct ot an a¦rcady tound
maxima¦ itcm sct, thc who¦c su¦trcc can ¦c jruncd
◦ This jrunin¸ mcthod rcquircs a ¦ctt to ri¸ht travcrsa¦ ot thc jrcﬁx trcc
• Frequent Head ∪ Tail Pruning (FHUT)
◦ lt H ∪ L is not a su¦sct ot an a¦rcady tound maxima¦ itcm sct
and ¦y somc c¦cvcr mcans wc discovcr that H ∪ L is trcqucnt,
H ∪ L can immcdiatc¦y ¦c rccordcd as a maxima¦ itcm sct
Christian Borgelt Frequent Pattern Mining 198
Alternative Description of Closed Item Set Mining
• ln ordcr to avoid rcdundant scarch in thc jartia¦¦y ordcrcd sct (2
B
, ⊆),
wc assi¸ncd a uniquc jarcnt itcm sct to cach itcm sct (cxccjt thc cmjty sct)
• Ana¦o¸ous¦y, wc may structurc thc sct ot c¦oscd itcm scts
¦y assi¸nin¸ unique closed parent item sets ¹no et al. 2003
• Lct ≤ ¦c an itcm ordcr and ¦ct I ¦c a c¦oscd itcm sct with I ,
1≤k≤n
t
k
Lct i
∗
∈ I ¦c thc (uniquc¦y dctcrmincd) itcm satistyin¸
s
T
(¦i ∈ I [ i < i
∗
¦) > s
T
(I) and s
T
(¦i ∈ I [ i ≤ i
∗
¦) s
T
(I).
lntuitivc¦y, thc itcm i
∗
is thc ¸rcatcst itcm in I that is not a jcrtcct cxtcnsion
(A¦¦ itcms ¸rcatcr than i
∗
can ¦c rcmovcd without aﬀcctin¸ thc sujjort)
Lct I
∗
¦i ∈ I [ i < i
∗
¦ and X
T
(I) ¦i ∈ B −I [ s
T
(I ∪ ¦i¦) s
T
(I)¦
Thcn thc canonica¦ jarcnt p
C
(I) ot I is thc itcm sct
p
C
(I) I
∗
∪ ¦i ∈ X
T
(I
∗
) [ i > i
∗
¦.
lntuitivc¦y, to ﬁnd thc canonica¦ jarcnt ot thc itcm sct I, thc rcduccd itcm sct I
∗
is cnhanccd ¦y a¦¦ jcrtcct cxtcnsion itcms to¦¦owin¸ thc itcm i
∗
Christian Borgelt Frequent Pattern Mining 199
Alternative Description of Closed Item Set Mining
• `otc that
1≤k≤n
t
k
is thc sma¦¦cst c¦oscd itcm sct tor a ¸ivcn data¦asc T
• `otc a¦so that thc sct ¦i ∈ X
T
(I
∗
) [ i > i
∗
¦ nccd not contain a¦¦ itcms i > i
∗
,
¦ccausc a jcrtcct cxtcnsion ot I
∗
∪ ¦i
∗
¦ nccd not ¦c a jcrtcct cxtcnsion ot I
∗
,
sincc K
T
(I
∗
) ⊃ K
T
(I
∗
∪ ¦i
∗
¦)
• Ior thc rccursivc scarch, thc to¦¦owin¸ tormu¦ation is usctu¦
Lct I ⊆ B ¦c a c¦oscd itcm sct Thc canonical children ot I (that is,
thc c¦oscd itcm scts that havc I as thcir canonica¦ jarcnt) arc thc itcm scts
J I ∪ ¦i¦ ∪ ¦j ∈ X
T
(I ∪ ¦i¦) [ j > i¦
with ∀j ∈ I i > j and ¦j ∈ X
T
(I ∪ ¦i¦) [ j < i¦ X
T
(J) ∅
• Thc union with ¦j ∈ X
T
(I ∪ ¦i¦) [ j > i¦
rcjrcscnts jcrtcct cxtcnsion or jarcnt cquiva¦cncc jrunin¸
a¦¦ jcrtcct cxtcnsions in thc tai¦ ot I ∪ ¦i¦ arc immcdiatc¦y addcd
• Thc condition ¦j ∈ X
T
(I ∪ ¦i¦) [ j < i¦ ∅ cxjrcsscs
that thcrc must not ¦c any jcrtcct cxtcnsions amon¸ thc c¦iminatcd itcms
Christian Borgelt Frequent Pattern Mining 200
Additional Frequent Item Set Filtering
Christian Borgelt Frequent Pattern Mining 201
Additional Frequent Item Set Filtering
• General problem of frequent item set mining:
Thc num¦cr ot trcqucnt itcm scts, cvcn thc num¦cr ot c¦oscd or maxima¦ itcm
scts, can cxcccd thc num¦cr ot transactions in thc data¦asc ¦y tar
• Thcrctorc Additiona¦ ﬁ¦tcrin¸ is ncccssary to ﬁnd
thc ’‘rc¦cvant” or “intcrcstin¸” trcqucnt itcm scts
• Gcncra¦ idca Compare support to expectation.
◦ ltcm scts consistin¸ ot itcms that ajjcar trcqucnt¦y
arc ¦ikc¦y to havc a hi¸h sujjort
◦ Lowcvcr, this is not surjrisin¸
wc cxjcct this cvcn it thc occurrcncc ot thc itcms is indcjcndcnt
◦ Additiona¦ ﬁ¦tcrin¸ shou¦d rcmovc itcm scts with a sujjort
c¦osc to thc sujjort cxjcctcd trom an indcjcndcnt occurrcncc
Christian Borgelt Frequent Pattern Mining 202
Additional Frequent Item Set Filtering
Full Independence
• Lva¦uatc itcm scts with
̺
ﬁ
(I)
s
T
(I) n
[I[−1
a∈I
s
T
(¦a¦)
ˆ p
T
(I)
a∈I
ˆ p
T
(¦a¦)
.
an rcquirc a minimum va¦uc tor this mcasurc
(ˆ p
T
is thc jro¦a¦i¦ity cstimatc ¦ascd on T)
• Assumcs tu¦¦ indcjcndcncc ot thc itcms in ordcr
to torm an cxjcctation a¦out thc sujjort ot an itcm sct
• Advanta¸c Can ¦c comjutcd trom on¦y thc sujjort ot thc itcm sct
and thc sujjort va¦ucs ot thc individua¦ itcms
• Lisadvanta¸c lt somc itcm sct I scorcs hi¸h on this mcasurc,
thcn a¦¦ J ⊃ I arc a¦so ¦ikc¦y to scorc hi¸h,
cvcn it thc itcms in J −I arc indcjcndcnt ot I
Christian Borgelt Frequent Pattern Mining 203
Additional Frequent Item Set Filtering
Incremental Independence
• Lva¦uatc itcm scts with
̺
ii
(I) min
a∈I
n s
T
(I)
s
T
(I −¦a¦) s
T
(¦a¦)
min
a∈I
ˆ p
T
(I)
ˆ p
T
(I −¦a¦) ˆ p
T
(¦a¦)
.
an rcquirc a minimum va¦uc tor this mcasurc
(ˆ p
T
is thc jro¦a¦i¦ity cstimatc ¦ascd on T)
• Advanta¸c lt I contains indcjcndcnt itcms,
thc minimum cnsurcs a ¦ow va¦uc
• Lisadvanta¸cs \c nccd to know thc sujjort va¦ucs ot a¦¦ su¦scts I −¦a¦
lt thcrc cxist hi¸h scorin¸ indcjcndcnt su¦scts I
1
and I
2
with [I
1
[ > 1, [I
2
[ > 1, I
1
∩ I
2
∅ and I
1
∪ I
2
I,
thc itcm sct I sti¦¦ rcccivcs a hi¸h cva¦uation
Christian Borgelt Frequent Pattern Mining 204
Additional Frequent Item Set Filtering
Subset Independence
• Lva¦uatc itcm scts with
̺
si
(I) min
J⊂I,J,∅
n s
T
(I)
s
T
(I −J) s
T
(J)
min
J⊂I,J,∅
ˆ p
T
(I)
ˆ p
T
(I −J) ˆ p
T
(J)
.
an rcquirc a minimum va¦uc tor this mcasurc
(ˆ p
T
is thc jro¦a¦i¦ity cstimatc ¦ascd on T)
• Advanta¸c Lctccts a¦¦ cascs whcrc a dccomjosition is jossi¦¦c
and cva¦uatcs thcm with a ¦ow va¦uc
• Lisadvanta¸cs \c nccd to know thc sujjort va¦ucs ot a¦¦ jrojcr su¦scts J
• lmjrovcmcnt ¹sc incrcmcnta¦ indcjcndcncc and in thc minimum considcr
on¦y itcms ¦a¦ tor which I −¦a¦ has ¦ccn cva¦uatcd hi¸h
This cajturcs su¦sct indcjcndcncc “incrcmcnta¦¦y”
Christian Borgelt Frequent Pattern Mining 205
Summary Frequent Item Set Mining
• \ith a canonical form ot an itcm sct thc Lassc dia¸ram
can ¦c turncd into a much simj¦cr preﬁx tree
(→ dividcandconqucr schcmc usin¸ conditiona¦ data¦ascs)
• Item set enumeration a¦¸orithms diﬀcr in
◦ thc traversal order ot thc jrcﬁx trcc
(¦rcadthﬁrst,¦cvc¦wisc vcrsus dcjthﬁrst travcrsa¦)
◦ thc transaction representation
horizontal (itcm arrays) vcrsus vertical (transaction ¦ists)
vcrsus specialized data structures ¦ikc Iltrccs
◦ thc types of frequent item sets tound
frequent vcrsus closed vcrsus maximal item sets
(additiona¦ jrunin¸ mcthods tor c¦oscd and maxima¦ itcm scts)
• An a¦tcrnativc arc transaction set enumeration or intersection a¦¸orithms
• Additional ﬁltering is ncccssary to rcducc thc sizc ot thc outjut
Christian Borgelt Frequent Pattern Mining 206
Example Application:
Finding Neuron Assemblies in Neural Spike Data
Christian Borgelt Frequent Pattern Mining 207
Biological Background
Structure of a prototypical neuron
cc¦¦ corc
axon
myc¦in shcath
cc¦¦ ¦ody
(soma)
tcrmina¦ ¦outon
synajsis
dcndritcs
Christian Borgelt Frequent Pattern Mining 208
Biological Background
c _ A¦vin ` Lurt
c _ 1aco¦ \i¦son
Christian Borgelt Frequent Pattern Mining 209
Biological Background
(Very) simpliﬁed description of neural information processing
• Axon tcrmina¦ rc¦cascs chcmica¦s, ca¦¦cd neurotransmitters
• Thcsc act on thc mcm¦ranc ot thc rcccjtor dcndritc to chan¸c its jo¦arization
(Thc insidc is usua¦¦y ¨0m\ morc nc¸ativc than thc outsidc)
• Lccrcasc in jotcntia¦ diﬀcrcncc excitatory synajsc
lncrcasc in jotcntia¦ diﬀcrcncc inhibitory synajsc
• lt thcrc is cnou¸h nct cxcitatory injut, thc axon is dcjo¦arizcd
• Thc rcsu¦tin¸ action potential travc¦s a¦on¸ thc axon
(Sjccd dcjcnds on thc dc¸rcc to which thc axon is covcrcd with myc¦in)
• \hcn thc action jotcntia¦ rcachcs thc tcrmina¦ ¦outons,
it tri¸¸crs thc rc¦casc ot ncurotransmittcrs
Christian Borgelt Frequent Pattern Mining 210
Neuronal Action Potential
A schcmatic vicw ot an idca¦izcd action
jotcntia¦ i¦¦ustratcs its various jhascs as
thc action jotcntia¦ jasscs a joint on a
cc¦¦ mcm¦ranc
Actua¦ rccordin¸s ot action jotcntia¦s arc
ottcn distortcd comjarcd to thc schcmatic
vicw ¦ccausc ot variations in c¦cctrojhys
io¦o¸ica¦ tcchniqucs uscd to makc thc
rccordin¸
c _ cnwikijcdiaor¸
Christian Borgelt Frequent Pattern Mining 211
Higher Level Neural Processing
• Thc ¦ow¦cvc¦ mcchanisms ot ncura¦ intormation jroccssin¸ arc tair¦y wc¦¦
undcrstood (ncurotransmittcrs, cxcitation and inhi¦ition, action jotcntia¦)
• Thc hi¸h¦cvc¦ mcchanisms, howcvcr, arc a tojic ot currcnt rcscarch
Thcrc arc scvcra¦ comjctin¸ thcorics (scc thc to¦¦owin¸ s¦idcs)
how ncurons codc and transmit thc intormation thcy jroccss
• ¹j to tair¦y rcccnt¦y it was not jossi¦¦c to rccord thc sjikcs
ot cnou¸h ncurons in jara¦¦c¦ to dccidc ¦ctwccn thc diﬀcrcnt modc¦s
Lowcvcr, ncw mcasurcmcnt tcchniqucs ojcn uj thc jossi¦i¦ity
to rccord dozcns or cvcn uj to a hundrcd ncurons in jara¦¦c¦
• Currcnt¦y mcthods arc invcsti¸atcd ¦y which it wou¦d ¦c jossi¦¦c
to chcck thc va¦idity ot thc diﬀcrcnt codin¸ modc¦s
• Ircqucnt itcm sct minin¸, jrojcr¦y adajtcd, cou¦d jrovidc a mcthod
to tcst thc temporal coincidence hypothesis (scc ¦c¦ow)
Christian Borgelt Frequent Pattern Mining 212
Models of Neuronal Coding
c _ Zo¦t an `adasdy
Frequency Code Hypothesis
Shcrrin¸ton 190o, Lcc¦cs 19`¨, Lar¦ow 19¨2
`curons ¸cncratc diﬀcrcnt trcqucncy ot sjikc trains
as a rcsjonsc to diﬀcrcnt stimu¦us intcnsitics
Christian Borgelt Frequent Pattern Mining 213
Models of Neuronal Coding
c _ Zo¦t an `adasdy
Temporal Coincidence Hypothesis
Gray ct a¦ 1992, Sin¸cr 1993, 199!
Sjikc occurrcnccs arc modu¦atcd ¦y ¦oca¦ ﬁc¦d osci¦¦ation (¸amma)
Ti¸htcr coincidcncc ot sjikcs rccordcd trom diﬀcrcnt ncurons
rcjrcscnt hi¸hcr stimu¦us intcnsity
Christian Borgelt Frequent Pattern Mining 214
Models of Neuronal Coding
c _ Zo¦t an `adasdy
Delay Coding Hypothesis
Lojﬁc¦d 199`, Luzsaki and Chro¦ak 199`
Thc injut currcnt is convcrtcd to thc sjikc dc¦ay
`curon 1 which was stimu¦atcd stron¸cr rcachcd thc thrcsho¦d car¦icr
and initiatcd a sjikc sooncr than ncurons stimu¦atcd ¦css
Liﬀcrcnt dc¦ays ot thc sjikcs (d2d!) rcjrcscnt
rc¦ativc intcnsitics ot thc diﬀcrcnt stimu¦us
Christian Borgelt Frequent Pattern Mining 215
Models of Neuronal Coding
c _ Zo¦t an `adasdy
SpatioTemporal Code Hypothesis
`curons disj¦ay a causa¦ scqucncc ot sjikcs in rc¦ationshij to a stimu¦us conﬁ¸uration
Thc stron¸cr stimu¦us induccs sjikcs car¦icr and wi¦¦ initiatc sjikcs in thc othcr, con
ncctcd cc¦¦s in thc ordcr ot rc¦ativc thrcsho¦d and actua¦ dcjo¦arization Thc scqucncc
ot sjikc jroja¸ation is dctcrmincd ¦y thc sjatiotcmjora¦ conﬁ¸uration ot thc stimu¦us
as wc¦¦ as thc intrinsic conncctivity ot thc nctwork Sjikc scqucnccs coincidc with thc
¦oca¦ ﬁc¦d activity `otc that this modc¦ intc¸ratcs ¦oth thc tcmjora¦ coincidcncc and
thc dc¦ay codin¸ jrincij¦cs
Christian Borgelt Frequent Pattern Mining 216
Models of Neuronal Coding
c _ Zo¦t an `adasdy
Markovian Process of Frequency Modulation
Scidcrmann ct a¦ 199o
Stimu¦us intcnsitics arc convcrtcd to a scqucncc ot trcqucncy cnhanccmcnts and dccrc
mcnts in thc diﬀcrcnt ncurons Liﬀcrcnt stimu¦us conﬁ¸urations arc rcjrcscntcd ¦y
diﬀcrcnt `arkovian scqucnccs across scvcra¦ scconds
Christian Borgelt Frequent Pattern Mining 217
Finding Neuron Assemblies in Neuronal Spike Data
data c _ Son,a Gr un, llIL` Lrain Scicncc lnstitutc, Tokyo
• Lot disj¦ays ot (simu¦atcd) jara¦¦c¦ sjikc trains
vcrtica¦ ncurons (100)
horizonta¦ timc (10 scconds)
• ln onc ot thcsc dot disj¦ays, 20 ncurons arc ﬁrin¸ synchronous¦y
• \ithout jrojcr intc¦¦i¸cnt data ana¦ysis mcthods,
it is virtua¦¦y imjossi¦¦c to dctcct such synchronous ﬁrin¸
Christian Borgelt Frequent Pattern Mining 218
Finding Neuron Assemblies in Neural Spike Data
data c _ Son,a Gr un, llIL` Lrain Scicncc lnstitutc, Tokyo
• lt thc ncurons that ﬁrc to¸cthcr arc ¸roujcd to¸cthcr,
thc synchronous ﬁrin¸ ¦ccomcs casi¦y visi¦¦c
¦ctt cojy ot thc ri¸ht dia¸ram ot thc jrcvious s¦idc
ri¸ht samc data, ¦ut with rc¦cvant ncurons co¦¦cctcd at thc ¦ottom
• A synchronous¦y ﬁrin¸ sct ot ncurons is ca¦¦cd a neuron assembly
• Qucstion Low can wc ﬁnd out which ncurons to ¸rouj to¸cthcr´
Christian Borgelt Frequent Pattern Mining 219
Finding Neuron Assemblies in Neural Spike Data
A Frequent Item Set Mining Approach
• Thc ncurona¦ sjikc trains arc usua¦¦y codcd as jairs ot a ncuron id
and a sjikc timc, sortcd ¦y thc sjikc timc
• ln ordcr to makc trcqucnt itcm sct minin¸ ajj¦ica¦¦c, timc ¦ins arc tormcd
• Lach time bin ¸ivcs risc to onc transaction
lt contains thc sct ot neurons that ﬁrc in this timc ¦in (items)
• Ircqucnt itcm sct minin¸, jossi¦¦y rcstrictcd to maxima¦ itcm scts,
is thcn ajj¦icd with additiona¦ ﬁ¦tcrin¸ ot thc trcqucnt itcm scts
• Ior thc (simu¦atcd) cxamj¦c data sct such an ajjroach
dctccts thc ncuron asscm¦¦y jcrtcct¦y
80 54 88 28 93 83 39 29 50 24 40 30 32 11 82 69 22 60 5 4
(0.5400%/54, 105.1679)
Christian Borgelt Frequent Pattern Mining 220
Finding Neuron Assemblies in Neural Spike Data
Translation of Basic Notions
mathcmatica¦ jro¦¦cm markct ¦askct ana¦ysis sjikc train ana¦ysis
itcm jroduct ncuron
itcm ¦asc sct ot a¦¦ jroducts sct ot a¦¦ ncurons
— (transaction id) customcr timc ¦in
transaction sct ot jroducts sct ot ncurons
¦ou¸ht ¦y a customcr ﬁrin¸ in a timc ¦in
trcqucnt itcm sct sct ot jroducts sct ot ncurons
trcqucnt¦y ¦ou¸ht to¸cthcr trcqucnt¦y ﬁrin¸ to¸cthcr
• ln ¦oth cascs thc injut can ¦c rcjrcscntcd as a ¦inary matrix
(thc soca¦¦cd dot display in sjikc train ana¦ysis)
• `otc, howcvcr, that a dot disj¦ay is usua¦¦y rotatcd ¦y 90
o
usua¦¦y customcrs rctcr to rows, jroducts to co¦umns,
¦ut in a dot disj¦ay, rows arc ncurons, co¦umns arc timc ¦ins
Christian Borgelt Frequent Pattern Mining 221
Finding Neuron Assemblies in Neural Spike Data
Open Problems and Ongoing Work
• Thc trcqucnt itcm sct minin¸ ajjroach with additiona¦ ﬁ¦tcrin¸ works jcrtcct¦y
it thc ncurons ﬁrc in jcrtcct synchrony
Lowcvcr, it is not to ¦c cxjcctcd that rca¦ wor¦d data wi¦¦ ¦c so c¦can
lathcr thcrc wi¦¦ ¦c a considcra¦¦c amount ot tcmjora¦ jitter
Such ,ittcr, to¸cthcr with ¦innin¸ thc data, can makc it diﬃcu¦t to o¦scrvc
thc synchrony, sincc thc sjikcs may cnd uj in diﬀcrcnt ¦ins
• ln addition, it is to ¦c cxjcctcd that cach timc a ncuron asscm¦¦y is activatcd,
on¦y a su¦sct ot thc ncurons (¦ctwccn `0 and S0/) jarticijatc (simu¦ations show
that this cnou¸h to jroja¸atc such synchronous activity)
• ln thcsc cascs jostjroccssin¸ ot thc tound itcm scts is ncccssary
in ordcr to co¦¦cct a¦¦ ncurons ot an asscm¦¦y
• ln addition, jrojcr statistica¦ tcsts havc to ¦c dcvc¦ojcd
Christian Borgelt Frequent Pattern Mining 222
Finding Neuron Assemblies in Neural Spike Data
data c _ Son,a Gr un, llIL` Lrain Scicncc lnstitutc, Tokyo
• Loth dia¸rams show thc samc (simu¦atcd) data, ¦ut on thc ri¸ht
thc ncurons ot thc asscm¦¦y arc co¦¦cctcd at thc ¦ottom
• On¦y a¦out S0/ ot thc ncurons (random¦y choscn) jarticijatc in cach
synchronous ﬁrin¸ Lcncc thcrc is no trcqucnt itcm sct comjrisin¸ a¦¦ ot thcm
• lathcr a trcqucnt itcm sct minin¸ ajjroach ﬁnds a ¦ar¸c num¦cr
ot trcqucnt itcm scts with 12 to 1o ncurons
Christian Borgelt Frequent Pattern Mining 223
Association Rules
Christian Borgelt Frequent Pattern Mining 224
Association Rules: Basic Notions
• Ottcn tound jattcrns arc cxjrcsscd as association rules, tor cxamj¦c
If a customcr ¦uys bread and wine,
then shc,hc wi¦¦ jro¦a¦¦y a¦so ¦uy cheese
• Iorma¦¦y, wc considcr ru¦cs ot thc torm X → Y ,
with X, Y ⊆ A and X ∩ Y ∅
• Support of a Rule X → Y
Lithcr ς
T
(X → Y ) σ
T
(X ∪ Y ) (morc common ru¦c is corrcct)
Or ς
T
(X → Y ) σ
T
(X) (morc j¦ausi¦¦c ru¦c is ajj¦ica¦¦c)
• Conﬁdence of a Rule X → Y
c
T
(X → Y )
σ
T
(X ∪ Y )
σ
T
(X)
s
T
(X ∪ Y )
s
T
(X)
s
T
(I)
s
T
(X)
Thc conﬁdcncc can ¦c sccn as an cstimatc ot P(Y [ X)
Christian Borgelt Frequent Pattern Mining 225
Association Rules: Formal Deﬁnition
Given:
• a sct A ¦a
1
, . . . , a
m
¦ ot itcms,
• a vcctor T (t
1
, . . . , t
n
) ot transactions ovcr A,
• a rca¦ num¦cr ς
min
, 0 < ς
min
≤ 1, thc minimum support,
• a rca¦ num¦cr c
min
, 0 < c
min
≤ 1, thc minimum conﬁdence
Desired:
• thc sct ot a¦¦ association rules, that is, thc sct
! ¦R X → Y [ ς
T
(R) ≥ ς
min
∧ c
T
(R) ≥ c
min
¦.
General Procedure:
• Iind thc trcqucnt itcm scts
• Construct ru¦cs and ﬁ¦tcr thcm wrt ς
min
and c
min
Christian Borgelt Frequent Pattern Mining 226
Generating Association Rules
• \hich minimum sujjort has to ¦c uscd tor ﬁndin¸ thc trcqucnt itcm scts
dcjcnds on thc dcﬁnition ot thc sujjort ot a ru¦c
◦ lt ς
T
(X → Y ) σ
T
(X ∪ Y ),
thcn σ
min
ς
min
or cquiva¦cnt¦y s
min
⌈nς
min
⌉
◦ lt ς
T
(X → Y ) σ
T
(X),
thcn σ
min
ς
min
c
min
or cquiva¦cnt¦y s
min
⌈nς
min
c
min
⌉
• Attcr thc trcqucnt itcm scts havc ¦ccn tound,
thc ru¦c construction thcn travcrscs a¦¦ trcqucnt itcm scts I and
sj¦its thcm into dis,oint su¦scts X and Y (X ∩ Y ∅ and X ∪ Y I),
thus tormin¸ ru¦cs X → Y
◦ Ii¦tcrin¸ ru¦cs wrt conﬁdcncc is a¦ways ncccssary
◦ Ii¦tcrin¸ ru¦cs wrt sujjort is on¦y ncccssary it ς
T
(X → Y ) σ
T
(X)
Christian Borgelt Frequent Pattern Mining 227
Properties of the Conﬁdence
• Irom ∀I ∀J ⊆ I s
T
(I) ≤ s
T
(J) it o¦vious¦y to¦¦ows
∀X, Y ∀a ∈ X
s
T
(X ∪ Y )
s
T
(X)
≥
s
T
(X ∪ Y )
s
T
(X −¦a¦)
and thcrctorc
∀X, Y ∀a ∈ X c
T
(X → Y ) ≥ c
T
(X −¦a¦ → Y ∪ ¦a¦).
That is Moving an item from the antecedent to the consequent
cannot increase the conﬁdence of a rule.
• As an immcdiatc conscqucncc wc havc
∀X, Y ∀a ∈ X c
T
(X → Y ) < c
min
→ c
T
(X −¦a¦ → Y ∪ ¦a¦) < c
min
.
That is If a rule fails to meet the minimum conﬁdence,
no rules over the same item set and with
a larger consequent need to be considered.
Christian Borgelt Frequent Pattern Mining 228
Generating Association Rules
function ru¦cs (I). (∗ — ¸cncratc association ru¦cs ∗)
R ∅. (∗ initia¦izc thc sct ot ru¦cs ∗)
forall f ∈ F do begin (∗ travcrsc thc trcqucnt itcm scts ∗)
m 1. (∗ start with ru¦c hcads (conscqucnts) ∗)
H
m
i∈f
¦¦i¦¦. (∗ that contain on¦y onc itcm ∗)
repeat (∗ travcrsc ru¦c hcads ot incrcasin¸ sizc ∗)
forall h ∈ H
m
do (∗ travcrsc thc jossi¦¦c ru¦c hcads ∗)
if
s
T
(f)
s
T
(f−h)
≥ c
min
(∗ it thc conﬁdcncc is hi¸h cnou¸h, ∗)
then R R ∪ ¦(f −h) → h¦. (∗ add ru¦c to thc rcsu¦t ∗)
else H
m
H
m
−¦h¦. (∗ othcrwisc discard thc hcad ∗)
H
m+1
candidatcs(H
m
). (∗ crcatc hcads with onc itcm morc ∗)
m m + 1. (∗ incrcmcnt thc hcad itcm countcr ∗)
until H
m
∅ or m ≥ [f[. (∗ unti¦ thcrc arc no morc ru¦c hcads ∗)
end. (∗ or antcccdcnt wou¦d ¦ccomc cmjty ∗)
return R. (∗ rcturn thc ru¦cs tound ∗)
end. (∗ ru¦cs ∗)
Christian Borgelt Frequent Pattern Mining 229
Generating Association Rules
function candidatcs (F
k
) (∗ ¸cncratc candidatcs with k + 1 itcms ∗)
begin
E ∅. (∗ initia¦izc thc sct ot candidatcs ∗)
forall f
1
, f
2
∈ F
k
(∗ travcrsc a¦¦ jairs ot trcqucnt itcm scts ∗)
with f
1
¦a
1
, . . . , a
k−1
, a
k
¦ (∗ that diﬀcr on¦y in onc itcm and ∗)
and f
2
¦a
1
, . . . , a
k−1
, a
′
k
¦ (∗ arc in a ¦cxico¸rajhic ordcr ∗)
and a
k
< a
′
k
do begin (∗ (thc ordcr is ar¦itrary, ¦ut ﬁxcd) ∗)
f f
1
∪ f
2
¦a
1
, . . . , a
k−1
, a
k
, a
′
k
¦. (∗ union has k + 1 itcms ∗)
if ∀a ∈ f f −¦a¦ ∈ F
k
(∗ on¦y it a¦¦ su¦scts arc trcqucnt, ∗)
then E E ∪ ¦f¦. (∗ add thc ncw itcm sct to thc candidatcs ∗)
end. (∗ (othcrwisc it cannot ¦c trcqucnt) ∗)
return E. (∗ rcturn thc ¸cncratcd candidatcs ∗)
end (∗ candidatcs ∗)
Christian Borgelt Frequent Pattern Mining 230
Frequent Item Sets: Example
transaction data¦asc
1 ¦a, d, e¦
2 ¦b, c, d¦
3 ¦a, c, e¦
! ¦a, c, d, e¦
` ¦a, e¦
o ¦a, c, d¦
¨ ¦b, c¦
S ¦a, c, d, e¦
9 ¦c, b, e¦
10 ¦a, d, e¦
trcqucnt itcm scts
0 itcms 1 itcm 2 itcms 3 itcms
∅ 10 ¦a¦ ¨ ¦a, c¦ ! ¦a, c, d¦ 3
¦b¦ 3 ¦a, d¦ ` ¦a, c, e¦ 3
¦c¦ ¨ ¦a, e¦ o ¦a, d, e¦ !
¦d¦ o ¦b, c¦ 3
¦e¦ ¨ ¦c, d¦ !
¦c, e¦ !
¦d, e¦ !
• Thc minimum sujjort is s
min
3 or σ
min
0.3 30/ in this cxamj¦c
• Thcrc arc 2
`
32 jossi¦¦c itcm scts ovcr A ¦a, b, c, d, e¦
• Thcrc arc 1o trcqucnt itcm scts (¦ut on¦y 10 transactions)
Christian Borgelt Frequent Pattern Mining 231
Generating Association Rules
Example: I ¦a, c, e¦, X ¦c, e¦, Y ¦a¦
c
T
(c, e → a)
s
T
(¦a, c, e¦)
s
T
(¦c, e¦)
3
!
¨`/
Minimum conﬁdence: 80%
association sujjort ot sujjort ot conﬁdcncc
ru¦c a¦¦ itcms antcccdcnt
b → c 3 (30/) 3 (30/) 100/
d → a ` (`0/) o (o0/) S33/
e → a o (o0/) ¨ (¨0/) S`¨/
a → e o (o0/) ¨ (¨0/) S`¨/
d, e → a ! (!0/) ! (!0/) 100/
a, d → e ! (!0/) ` (`0/) S0/
Christian Borgelt Frequent Pattern Mining 232
Support of an Association Rule
The two rule support deﬁnitions are not equivalent:
transaction data¦asc
1 ¦a, c, e¦
2 ¦b, d¦
3 ¦b, c, d¦
! ¦a, e¦
` ¦a, b, c, d¦
o ¦c, e¦
¨ ¦a, b, d¦
S ¦a, c, d¦
two association ru¦cs
association sujjort ot sujjort ot conﬁdcncc
ru¦c a¦¦ itcms antcccdcnt
a → c 3 (3¨`/) ` (o2`/) o00/
b → d ! (`00/) ! (`00/) 1000/
Lct thc minimum conﬁdcncc ¦c c
min
o0/
• Ior ς
T
(R) σ(X ∪ Y ) and 3 < ς
min
≤ ! on¦y thc ru¦c b → d is ¸cncratcd,
¦ut not thc ru¦c a → c
• Ior ς
T
(R) σ(X) thcrc is no va¦uc ς
min
that ¸cncratcs on¦y thc ru¦c b → d,
¦ut not at thc samc timc a¦so thc ru¦c a → c
Christian Borgelt Frequent Pattern Mining 233
Rule Extraction from Preﬁx Tree
• lcstriction to ru¦cs with onc itcm in thc hcad,conscqucnt
• Lxj¦oit thc jrcﬁx trcc to ﬁnd thc sujjort ot thc ¦ody,antcccdcnt
• Travcrsc thc itcm sct trcc ¦rcadthﬁrst or dcjthﬁrst
• Ior cach nodc travcrsc thc jath to thc root and
¸cncratc and tcst onc ru¦c jcr nodc
root
hdnode
i j head
_
prev
j
body
samc
jath
_
¹
isnode
p
p
p
p
p
J
J
p
p
p
p
p
J
J
p
p
p
p
p
• Iirst ru¦c Gct thc sujjort ot thc ¦ody,
antcccdcnt trom thc jarcnt nodc
• `cxt ru¦cs Liscard thc hcad,consc
qucnt itcm trom thc downward jath
and to¦¦ow thc rcmainin¸ jath trom thc
currcnt nodc
Christian Borgelt Frequent Pattern Mining 234
Reminder: Preﬁx Tree
a b c d e
ab ac ad ae bc bd be cd ce de
abc abd abe acd ace ade bcd bce bde cde
abcd abce abde acde bcde
abcde
a
b c
d
b
c d c d d
c d d d
d
A (tu¦¦) jrcﬁx trcc tor thc ﬁvc itcms a, b, c, d, e
• Lascd on a ¸¦o¦a¦ ordcr ot thc itcms (which can ¦c ar¦itrary)
• Thc itcm scts countcd in a nodc consist ot
◦ a¦¦ itcms ¦a¦c¦in¸ thc cd¸cs to thc nodc (common jrcﬁx) and
◦ onc itcm to¦¦owin¸ thc ¦ast cd¸c ¦a¦c¦ in thc itcm ordcr
Christian Borgelt Frequent Pattern Mining 235
Additional Rule Filtering: Simple Measures
• Gcncra¦ idca Comjarc
ˆ
P
T
(Y [ X) c
T
(X → Y )
and
ˆ
P
T
(Y ) c
T
( ∅ → Y ) σ
T
(Y )
• (A¦so¦utc) conﬁdcncc diﬀcrcncc to jrior
d
T
(R) [c
T
(X → Y ) −σ
T
(Y )[
• Litt va¦uc
l
T
(R)
c
T
(X → Y )
σ
T
(Y )
• (A¦so¦utc) diﬀcrcncc ot ¦itt va¦uc to 1
q
T
(R)
¸
¸
¸
¸
¸
c
T
(X → Y )
σ
T
(Y )
−1
¸
¸
¸
¸
¸
• (A¦so¦utc) diﬀcrcncc ot ¦itt quoticnt to 1
r
T
(R)
¸
¸
¸
¸
¸
1 −min
_
c
T
(X → Y )
σ
T
(Y )
,
σ
T
(Y )
c
T
(X → Y )
_¸
¸
¸
¸
¸
Christian Borgelt Frequent Pattern Mining 236
Additional Rule Filtering: More Sophisticated Measures
• Considcr thc 2 2 contin¸cncy ta¦¦c or thc cstimatcd jro¦a¦i¦ity ta¦¦c
X ,⊆ t X ⊆ t
Y ,⊆ t n
00
n
01
n
0.
Y ⊆ t n
10
n
11
n
1.
n
.0
n
.1
n
..
X ,⊆ t X ⊆ t
Y ,⊆ t p
00
p
01
p
0.
Y ⊆ t p
10
p
11
p
1.
p
.0
p
.1
1
• n
..
is thc tota¦ num¦cr ot transactions
n
.1
is thc num¦cr ot transactions to which thc ru¦c is ajj¦ica¦¦c
n
11
is thc num¦cr ot transactions tor which thc ru¦c is corrcct
lt is p
ij
n
ij
n
..
, p
i.
n
i.
n
..
, p
.j
n
.j
n
..
tor i, j 1, 2
• Gcncra¦ idca ¹sc mcasurcs tor thc strcn¸th ot dcjcndcncc ot X and Y
• Thcrc is a ¦ar¸c num¦cr ot such mcasurcs ot dcjcndcncc
ori¸inatin¸ trom statistics, dccision trcc induction ctc
Christian Borgelt Frequent Pattern Mining 237
An Informationtheoretic Evaluation Measure
Information Gain (Iu¦¦¦ack and Lci¦¦cr 19`1, Quin¦an 19So)
Lascd on Shannon Lntrojy H −
n
i1
p
i
¦o¸
2
p
i
(Shannon 19!S)
I
¸ain
(X, Y ) H(Y ) − H(Y [X)
¸ .. ¸
−
k
Y
i1
p
i.
¦o¸
2
p
i.
−
¸ .. ¸
k
X
j1
p
.j
_
_
−
k
Y
i1
p
i[j
¦o¸
2
p
i[j
_
_
H(Y ) Lntrojy ot thc distri¦ution ot Y
H(Y [X) Expected entropy ot thc distri¦ution ot Y
it thc va¦uc ot thc X ¦ccomcs known
H(Y ) −H(Y [X) Lxjcctcd cntrojy rcduction or information gain
Christian Borgelt Frequent Pattern Mining 238
Interpretation of Shannon Entropy
• Lct S ¦s
1
, . . . , s
n
¦ ¦c a ﬁnitc sct ot a¦tcrnativcs
havin¸ jositivc jro¦a¦i¦itics P(s
i
), i 1, . . . , n, satistyin¸
n
i1
P(s
i
) 1
• Shannon Entropy:
H(S) −
n
i1
P(s
i
) ¦o¸
2
P(s
i
)
• lntuitivc¦y Expected number of yes/no questions that have
to be asked in order to determine the obtaining alternative.
◦ Sujjosc thcrc is an orac¦c, which knows thc o¦tainin¸ a¦tcrnativc,
¦ut rcsjonds on¦y it thc qucstion can ¦c answcrcd with “ycs” or “no”
◦ A ¦cttcr qucstion schcmc than askin¸ tor onc a¦tcrnativc attcr thc othcr
can casi¦y ¦c tound Lividc thc sct into two su¦scts ot a¦out cqua¦ sizc
◦ Ask tor containmcnt in an ar¦itrari¦y choscn su¦sct
◦ Ajj¦y this schcmc rccursivc¦y → num¦cr ot qucstions ¦oundcd ¦y ⌈¦o¸
2
n⌉
Christian Borgelt Frequent Pattern Mining 239
Question/Coding Schemes
P(s
1
) 0.10, P(s
2
) 0.1`, P(s
3
) 0.1o, P(s
!
) 0.19, P(s
`
) 0.!0
Shannon cntrojy −
i
P(s
i
) ¦o¸
2
P(s
i
) 2.1` ¦it,sym¦o¦
Linear Traversal
s
!
, s
`
s
3
, s
!
, s
`
s
2
, s
3
, s
!
, s
`
s
1
, s
2
, s
3
, s
!
, s
`
0.10 0.15 0.16 0.19 0.40
s
1
s
2
s
3
s
!
s
`
1 2 3 ! !
Codc ¦cn¸th 32! ¦it,sym¦o¦
Codc cﬃcicncy 0oo!
Equal Size Subsets
s
1
, s
2
, s
3
, s
!
, s
`
0.25 0.75
s
1
, s
2
s
3
, s
!
, s
`
0.59
s
!
, s
`
0.10 0.15 0.16 0.19 0.40
s
1
s
2
s
3
s
!
s
`
2 2 2 3 3
Codc ¦cn¸th 2`9 ¦it,sym¦o¦
Codc cﬃcicncy 0S30
Christian Borgelt Frequent Pattern Mining 240
Question/Coding Schemes
• Sj¦ittin¸ into su¦scts ot a¦out cqua¦ sizc can ¦cad to a ¦ad arran¸cmcnt
ot thc a¦tcrnativcs into su¦scts → hi¸h cxjcctcd num¦cr ot qucstions
• Good qucstion schcmcs takc thc jro¦a¦i¦ity ot thc a¦tcrnativcs into account
• ShannonFano Coding (19!S)
◦ Lui¦d thc qucstion,codin¸ schcmc tojdown
◦ Sort thc a¦tcrnativcs wrt thcir jro¦a¦i¦itics
◦ Sj¦it thc sct so that thc su¦scts havc a¦out cqua¦ probability
(sj¦its must rcsjcct thc jro¦a¦i¦ity ordcr ot thc a¦tcrnativcs)
• Huﬀman Coding (19`2)
◦ Lui¦d thc qucstion,codin¸ schcmc ¦ottomuj
◦ Start with onc c¦cmcnt scts
◦ A¦ways com¦inc thosc two scts that havc thc sma¦¦cst jro¦a¦i¦itics
Christian Borgelt Frequent Pattern Mining 241
Question/Coding Schemes
P(s
1
) 0.10, P(s
2
) 0.1`, P(s
3
) 0.1o, P(s
!
) 0.19, P(s
`
) 0.!0
Shannon cntrojy −
i
P(s
i
) ¦o¸
2
P(s
i
) 2.1` ¦it,sym¦o¦
Shannon–Fano Coding (19!S)
s
1
, s
2
, s
3
, s
!
, s
`
0.25
0.41
s
1
, s
2
s
1
, s
2
, s
3
0.59
s
!
, s
`
0.10 0.15 0.16 0.19 0.40
s
1
s
2
s
3
s
!
s
`
3 3 2 2 2
Codc ¦cn¸th 22` ¦it,sym¦o¦
Codc cﬃcicncy 09``
Huﬀman Coding (19`2)
s
1
, s
2
, s
3
, s
!
, s
`
0.60
s
1
, s
2
, s
3
, s
!
0.25 0.35
s
1
, s
2
s
3
, s
!
0.10 0.15 0.16 0.19 0.40
s
1
s
2
s
3
s
!
s
`
3 3 3 3 1
Codc ¦cn¸th 220 ¦it,sym¦o¦
Codc cﬃcicncy 09¨¨
Christian Borgelt Frequent Pattern Mining 242
Question/Coding Schemes
• lt can ¦c shown that Luﬀman codin¸ is ojtima¦
it wc havc to dctcrminc thc o¦tainin¸ a¦tcrnativc in a sin¸¦c instancc
(`o qucstion,codin¸ schcmc has a sma¦¦cr cxjcctcd num¦cr ot qucstions)
• On¦y it thc o¦tainin¸ a¦tcrnativc has to ¦c dctcrmincd in a scqucncc
ot (indcjcndcnt) situations, this schcmc can ¦c imjrovcd ujon
• ldca lroccss thc scqucncc not instancc ¦y instancc,
¦ut com¦inc two, thrcc or morc consccutivc instanccs and
ask dircct¦y tor thc o¦tainin¸ com¦ination ot a¦tcrnativcs
• A¦thou¸h this cn¦ar¸cs thc qucstion,codin¸ schcmc, thc cxjcctcd num¦cr
ot qucstions jcr idcntiﬁcation is rcduccd (¦ccausc cach intcrro¸ation
idcntiﬁcs thc o¦tainin¸ a¦tcrnativc tor scvcra¦ situations)
• Lowcvcr, thc cxjcctcd num¦cr ot qucstions jcr idcntiﬁcation
ot an o¦tainin¸ a¦tcrnativc cannot ¦c madc ar¦itrari¦y sma¦¦
Shannon showcd that thcrc is a ¦owcr ¦ound, namc¦y thc Shannon cntrojy
Christian Borgelt Frequent Pattern Mining 243
Interpretation of Shannon Entropy
P(s
1
)
1
2
, P(s
2
)
1
!
, P(s
3
)
1
S
, P(s
!
)
1
1o
, P(s
`
)
1
1o
Shannon cntrojy −
i
P(s
i
) ¦o¸
2
P(s
i
) 1.S¨` ¦it,sym¦o¦
lt thc jro¦a¦i¦ity distri¦ution a¦¦ows tor a
jcrtcct Luﬀman codc (codc cﬃcicncy 1),
thc Shannon cntrojy can casi¦y ¦c intcr
jrctcd as to¦¦ows
−
i
P(s
i
) ¦o¸
2
P(s
i
)
i
P(s
i
)
. ¸¸ .
occurrcncc
jro¦a¦i¦ity
¦o¸
2
1
P(s
i
)
. ¸¸ .
jath ¦cn¸th
in trcc
.
ln othcr words, it is thc cxjcctcd num¦cr
ot nccdcd ycs,no qucstions
Perfect Question Scheme
s
!
, s
`
s
3
, s
!
, s
`
s
2
, s
3
, s
!
, s
`
s
1
, s
2
, s
3
, s
!
, s
`
1
2
1
4
1
8
1
16
1
16
s
1
s
2
s
3
s
!
s
`
1 2 3 ! !
Codc ¦cn¸th 1S¨` ¦it,sym¦o¦
Codc cﬃcicncy 1
Christian Borgelt Frequent Pattern Mining 244
A Statistical Evaluation Measure
χ
2
Measure
• Comjarcs thc actua¦ ,oint distri¦ution
with a hypothetical independent distribution
• ¹scs a¦so¦utc comjarison
• Can ¦c intcrjrctcd as a diﬀcrcncc mcasurc
χ
2
(X, Y )
k
X
i1
k
Y
j1
n
..
(p
i.
p
.j
−p
ij
)
2
p
i.
p
.j
• Sidc rcmark lntormation ¸ain can a¦so ¦c intcrjrctcd as a diﬀcrcncc mcasurc
I
¸ain
(X, Y )
k
X
j1
k
Y
i1
p
ij
¦o¸
2
p
ij
p
i.
p
.j
Christian Borgelt Frequent Pattern Mining 245
A Statistical Evaluation Measure
χ
2
Measure
• Comjarcs thc actua¦ ,oint distri¦ution
with a hypothetical independent distribution
• ¹scs a¦so¦utc comjarison
• Can ¦c intcrjrctcd as a diﬀcrcncc mcasurc
χ
2
(X, Y )
k
X
i1
k
Y
j1
n
..
(p
i.
p
.j
−p
ij
)
2
p
i.
p
.j
• Ior k
X
k
Y
2 (as tor ru¦c cva¦uation) thc χ
2
mcasurc simj¦iﬁcs to
χ
2
(X, Y ) n
..
(p
1.
p
.1
−p
11
)
2
p
1.
(1 −p
1.
)p
.1
(1 −p
.1
)
n
..
(n
1.
n
.1
−n
..
n
11
)
2
n
1.
(n
..
−n
1.
)n
.1
(n
..
−n
.1
)
.
Christian Borgelt Frequent Pattern Mining 246
Examples from the Census Data
A¦¦ ru¦cs arc statcd as
consequent < antecedent (support%, confidence%, lift)
whcrc thc sujjort ot a ru¦c is thc sujjort ot thc antcccdcnt
Trivial/Obvious Rules
edu_num=13 < education=Bachelors (16.4, 100.0, 6.09)
sex=Male < relationship=Husband (40.4, 99.99, 1.50)
sex=Female < relationship=Wife (4.8, 99.9, 3.01)
Interesting Comparisons
marital=Nevermarried < age=young sex=Female (12.3, 80.8, 2.45)
marital=Nevermarried < age=young sex=Male (17.4, 69.9, 2.12)
salary>50K < occupation=Execmanagerial sex=Male (8.9, 57.3, 2.40)
salary>50K < occupation=Execmanagerial (12.5, 47.8, 2.00)
salary>50K < education=Masters (5.4, 54.9, 2.29)
hours=overtime < education=Masters (5.4, 41.0, 1.58)
Christian Borgelt Frequent Pattern Mining 247
Examples from the Census Data
salary>50K < education=Masters (5.4, 54.9, 2.29)
salary>50K < occupation=Execmanagerial (12.5, 47.8, 2.00)
salary>50K < relationship=Wife (4.8, 46.9, 1.96)
salary>50K < occupation=Profspecialty (12.6, 45.1, 1.89)
salary>50K < relationship=Husband (40.4, 44.9, 1.88)
salary>50K < marital=Marriedcivspouse (45.8, 44.6, 1.86)
salary>50K < education=Bachelors (16.4, 41.3, 1.73)
salary>50K < hours=overtime (26.0, 40.6, 1.70)
salary>50K < occupation=Execmanagerial hours=overtime
(5.5, 60.1, 2.51)
salary>50K < occupation=Profspecialty hours=overtime
(4.4, 57.3, 2.39)
salary>50K < education=Bachelors hours=overtime
(6.0, 54.8, 2.29)
Christian Borgelt Frequent Pattern Mining 248
Examples from the Census Data
salary>50K < occupation=Profspecialty marital=Marriedcivspouse
(6.5, 70.8, 2.96)
salary>50K < occupation=Execmanagerial marital=Marriedcivspouse
(7.4, 68.1, 2.85)
salary>50K < education=Bachelors marital=Marriedcivspouse
(8.5, 67.2, 2.81)
salary>50K < hours=overtime marital=Marriedcivspouse
(15.6, 56.4, 2.36)
marital=Marriedcivspouse < salary>50K (23.9, 85.4, 1.86)
Christian Borgelt Frequent Pattern Mining 249
Examples from the Census Data
hours=halftime < occupation=Otherservice age=young
(4.4, 37.2, 3.08)
hours=overtime < salary>50K (23.9, 44.0, 1.70)
hours=overtime < occupation=Execmanagerial (12.5, 43.8, 1.69)
hours=overtime < occupation=Execmanagerial salary>50K
(6.0, 55.1, 2.12)
hours=overtime < education=Masters (5.4, 40.9, 1.58)
education=Bachelors < occupation=Profspecialty (12.6, 36.2, 2.20)
education=Bachelors < occupation=Execmanagerial (12.5, 33.3, 2.03)
education=HSgrad < occupation=Transportmoving (4.8, 51.9, 1.61)
education=HSgrad < occupation=Machineopinspct (6.2, 50.7, 1.6)
Christian Borgelt Frequent Pattern Mining 250
Examples from the Census Data
occupation=Profspecialty < education=Masters (5.4, 49.0, 3.88)
occupation=Profspecialty < education=Bachelors sex=Female
(5.1, 34.7, 2.74)
occupation=Admclerical < education=Somecollege sex=Female
(8.6, 31.1, 2.71)
sex=Female < occupation=Admclerical (11.5, 67.2, 2.03)
sex=Female < occupation=Otherservice (10.1, 54.8, 1.65)
sex=Female < hours=halftime (12.1, 53.7, 1.62)
age=young < hours=halftime (12.1, 53.3, 1.79)
age=young < occupation=Handlerscleaners (4.2, 50.6, 1.70)
age=senior < workclass=Selfempnotinc (7.9, 31.1, 1.57)
Christian Borgelt Frequent Pattern Mining 251
Summary Association Rules
• Association Rule Induction is a Two Step Process
◦ Iind thc trcqucnt itcm scts (minimum sujjort)
◦ Iorm thc rc¦cvant association ru¦cs (minimum conﬁdcncc)
• Generating the Association Rules
◦ Iorm a¦¦ jossi¦¦c association ru¦cs trom thc trcqucnt itcm scts
◦ Ii¦tcr “intcrcstin¸” association ru¦cs
¦ascd on minimum sujjort and minimum conﬁdcncc
• Filtering the Association Rules
◦ Comjarc ru¦c conﬁdcncc and conscqucnt sujjort
◦ lntormation ¸ain
◦ χ
2
mcasurc
Christian Borgelt Frequent Pattern Mining 252
Mining More Complex Patterns
Christian Borgelt Frequent Pattern Mining 253
Mining More Complex Patterns
• Thc scarch schcmc in Ircqucnt Grajh,Trcc,Scqucncc minin¸ is thc samc,
namc¦y thc ¸cncra¦ schcmc ot scarchin¸ with a canonica¦ torm
• Frequent (Sub)Graph Mining comjriscs thc othcr arcas
◦ Trccs arc sjccia¦ ¸rajhs, namc¦y ¸rajhs that arc sin¸¦y conncctcd
◦ Scqucnccs can ¦c sccn as sjccia¦ trccs, namc¦y chains
(on¦y onc or two ¦ranchcs — dcjcndin¸ on thc choicc ot thc root)
• Frequent Sequence Mining and Frequent Tree Mining can cxj¦oit
◦ Sjccia¦izcd canonica¦ torms that a¦¦ow tor morc cﬃcicnt chccks
◦ Sjccia¦ data structurcs to rcjrcscnt thc data¦asc to minc,
so that sujjort countin¸ ¦ccomcs morc cﬃcicnt
• \c wi¦¦ trcat Frequent Graph Mining ﬁrst and
wi¦¦ discuss ojtimizations tor thc othcr arcas ¦atcr
Christian Borgelt Frequent Pattern Mining 254
Motivation:
Molecular Fragment Mining
Christian Borgelt Frequent Pattern Mining 255
Molecular Fragment Mining
• Motivation: Accelerating Drug Development
◦ lhascs ot dru¸ dcvc¦ojmcnt jrcc¦inica¦ and c¦inica¦
◦ Lata ¸athcrin¸ ¦y hi¸hthrou¸hjut scrccnin¸
¦ui¦din¸ mo¦ccu¦ar data¦ascs with activity intormation
◦ Accc¦cration jotcntia¦ ¦y intc¦¦i¸cnt data ana¦ysis
(quantitativc) structurcactivity rc¦ationshij discovcry
• Mining Molecular Databases
◦ Lxamj¦c data `Cl LTl Ll\ Antivira¦ Scrccn data sct
◦ Lcscrijtion ¦an¸ua¸cs tor mo¦ccu¦cs
S`lLLS, SL`, SLﬁ¦c,Cta¦ ctc
◦ Iindin¸ common mo¦ccu¦ar su¦structurcs
◦ Iindin¸ discriminativc mo¦ccu¦ar su¦structurcs
Christian Borgelt Frequent Pattern Mining 256
Accelerating Drug Development
• Lcvc¦ojin¸ a ncw dru¸ can takc 10 to 12 years
(trom thc choicc ot thc tar¸ct to thc introduction into thc markct)
• ln rcccnt ycars thc duration ot thc dru¸ dcvc¦ojmcnt jroccsscs increased
continuous¦y. at thc samc thc number ot su¦stanccs undcr dcvc¦ojmcnt
has gone down drastica¦¦y
• Luc to hi¸h invcstmcnts jharmaccutica¦ comjanics must sccurc thcir markct
josition and comjctitivcncss ¦y on¦y a few, highly successful drugs
• As a conscqucncc thc chanccs tor thc dcvc¦ojmcnt
ot dru¸s tor tar¸ct ¸roujs
◦ with rare diseases or
◦ with special diseases in developing countries
arc considcra¦¦y rcduccd
• A si¸niﬁcant reduction of the development time cou¦d miti¸atc this trcnd
or cvcn rcvcrsc it
(Source: Bundesministerium f¨ ur Bildung und Forschung, Germany)
Christian Borgelt Frequent Pattern Mining 257
Phases of Drug Development
• Discovery and Optimization of Candidate Substances
◦ Li¸hThrou¸hjut Scrccnin¸
◦ Lcad Liscovcry and Lcad Ojtimization
• Preclinical Test Series (tcsts with anima¦s, ca 3 ycars)
◦ Iundamcnta¦ tcst wrt cﬀcctivcncss and sidc cﬀccts
• Clinical Test Series (tcsts with humans, ca !–o ycars)
◦ lhasc 1 ca 30–S0 hca¦thy humans
Chcck tor sidc cﬀccts
◦ lhasc 2 ca 100–300 humans cxhi¦itin¸ thc symjtoms ot thc tar¸ct discasc
Chcck tor cﬀcctivcncss
◦ lhasc 3 uj to 3000 hca¦thy and i¦¦ humans at ¦cast 3 ycars
Lctai¦cd chcck ot cﬀcctivcncss and sidc cﬀccts
• Oﬃcial Acceptance as a Drug
Christian Borgelt Frequent Pattern Mining 258
Drug Development: Acceleration Potential
• Thc ¦cn¸th ot thc jrcc¦inica¦ and c¦inica¦ tcsts scrics can hard¦y ¦c rcduccd,
sincc thcy scrvc thc jurjosc to cnsurc thc satcty ot thc jaticnts
• Thcrctorc ajjroachcs to sjccd uj thc dcvc¦ojmcnt jroccss
usua¦¦y tar¸ct thc preclinical phase ¦ctorc thc anima¦ tcsts
• ln jarticu¦ar, it is tricd to imjrovc thc scarch tor ncw dru¸ candidatcs
(lead discovery) and thcir ojtimization (lead optimization)
Here Intelligent Data Analysis and Frequent Pattern Mining can help.
One possible approach:
• \ith hi¸hthrou¸hjut scrccnin¸ a vcry ¦ar¸c num¦cr ot su¦stanccs
is tcstcd automatica¦¦y and thcir activity is dctcrmincd
• Thc rcsu¦tin¸ mo¦ccu¦ar data¦ascs arc ana¦yzcd ¦y tryin¸
to ﬁnd common substructures ot activc su¦stanccs
Christian Borgelt Frequent Pattern Mining 259
HighThroughput Screening
On soca¦¦cd microplates jrotcins,cc¦¦s arc automatica¦¦y com¦incd with a ¦ar¸c
varicty ot chcmica¦ comjounds
c
_
w
w
w
.
m
a
t
r
i
x
t
e
c
h
c
o
r
p
.
c
o
m
w
w
w
.
e
l
i
s
a

t
e
k
.
c
o
m
w
w
w
.
t
h
e
r
m
o
.
c
o
m
w
w
w
.
a
r
r
a
y
i
t
.
c
o
m
Christian Borgelt Frequent Pattern Mining 260
HighThroughput Screening
Thc ﬁ¦¦cd microj¦atcs arc thcn cva¦uatcd in spectrometers
(wrt a¦sorjtion, ﬂuorcsccncc, ¦umincsccncc, jo¦arization ctc)
c _ www.moleculardevices.com www.biotek.com
Christian Borgelt Frequent Pattern Mining 261
HighThroughput Screening
Attcr thc mcasurcmcnt thc su¦stanccs arc c¦assiﬁcd as active or inactive
Figure c _ Christof Fattinger, HoﬀmannLaRoche, Basel
Ly ana¦yzin¸ thc rcsu¦ts onc trics
to undcrstand thc dcjcndcncc
¦ctwccn mo¦ccu¦ar structurc and
activity
QSAR —
Quantitativc StructurcActivity
lc¦ationshij `odc¦in¸
ln this arca a ¦ar¸c
num¦cr ot data minin¸
a¦¸orithms arc uscd
• tcaturc sc¦cction mcthods
• dccision trccs
• ncura¦ nctworks ctc
Christian Borgelt Frequent Pattern Mining 262
Example: NCI DTP HIV Antiviral Screen
• Amon¸ othcr data scts, thc `ationa¦ Canccr lnstitutc (`Cl) has madc
thc DTP HIV Antiviral Screen Data Set ju¦¦ic¦y avai¦a¦¦c
• A ¦ar¸c num¦cr ot chcmica¦ comjounds whcrc tcstcd
whcthcr thcy jrotcct human CL` cc¦¦s a¸ainst an Ll\1 intcction
• Su¦stanccs that jrovidcd `0/ jrotcction wcrc rctcstcd
• Su¦stanccs that rcjroduci¦¦y jrovidcd 100/ jrotcction
arc ¦istcd as “conﬁrmed active” (CA)
• Su¦stanccs that rcjroduci¦¦y jrovidcd at ¦cast `0/ jrotcction
arc ¦istcd as “moderately active” (CM)
• A¦¦ othcr su¦stanccs
arc ¦istcd as “conﬁrmed inactive” (CI)
• 32` CA, S¨¨ CM, 3` 9o9 CI (tota¦ 3¨ 1¨1 su¦stanccs)
Christian Borgelt Frequent Pattern Mining 263
Form of the Input Data
Lxccrjt trom thc `Cl LTl Ll\ Antivira¦ Scrccn data sct (S`lLLS tormat)
737, 0,CN(C)C1=[S+][Zn]2(S1)SC(=[S+]2)N(C)C
2018, 0,N#CC(=CC1=CC=CC=C1)C2=CC=CC=C2
19110,0,OC1=C2N=C(NC3=CC=CC=C3)SC2=NC=N1
20625,2,NC(=N)NC1=C(SSC2=C(NC(N)=N)C=CC=C2)C=CC=C1.OS(O)(=O)=O
22318,0,CCCCN(CCCC)C1=[S+][Cu]2(S1)SC(=[S+]2)N(CCCC)CCCC
24479,0,C[N+](C)(C)C1=CC2=C(NC3=CC=CC=C3S2)N=N1
50848,2,CC1=C2C=CC=CC2=N[C](CSC3=CC=CC=C3)[N+]1=O
51342,0,OC1=C2C=NC(=NC2=C(O)N=N1)NC3=CC=C(Cl)C=C3
55721,0,NC1=NC(=C(N=O)C(=N1)O)NC2=CC(=C(Cl)C=C2)Cl
55917,0,O=C(N1CCCC[CH]1C2=CC=CN=C2)C3=CC=CC=C3
64054,2,CC1=C(SC[C]2N=C3C=CC=CC3=C(C)[N+]2=O)C=CC=C1
64055,1,CC1=CC=CC(=C1)SC[C]2N=C3C=CC=CC3=C(C)[N+]2=O
64057,2,CC1=C2C=CC=CC2=N[C](CSC3=NC4=CC=CC=C4S3)[N+]1=O
66151,0,[O][N+](=O)C1=CC2=C(C=NN=C2C=C1)N3CC3
...
identiﬁcation number, activity (2: CA, 1: CM, 0: CI), molecule description in SMILES notation
Christian Borgelt Frequent Pattern Mining 264
Input Format: SMILES Notation and SLN
SMILES Notation: (zL Lay¦i¸ht, lnc)
c1:c:c(F):c:c2:c:1C1C(CC2)C2C(C)(CC1)C(O)CC2
SLN (SYBYL Line Notation): (Trijos, lnc)
C[1]H:CH:C(F):CH:C[8]:C:@1C[10]HCH(CH2CH2@8)C[20]HC(CH3)
(CH2CH2@10)CH(CH2CH2@20)OH
Represented Molecule:
Iu¦¦ lcjrcscntation
F O
C
C
C C
C
C
C
C
C
C
C
C C
C
C C
C
C
C C
C H
H
H
H
H H
HH
H
H
H H H H
H
H
H
H H
H
H
H H
Simj¦iﬁcd lcjrcscntation
O F
Christian Borgelt Frequent Pattern Mining 265
Input Format: Grammar for SMILES and SLN
Gcncra¦ ¸rammar tor (¦incar) mo¦ccu¦c dcscrijtions (S`lLLS and SL`)
`o¦ccu¦c Atom Lranch
Lranch ε
[ Lond Atom Lranch
[ Lond La¦c¦ Lranch
[ ( Lranch ) Lranch
Atom L¦cmcnt La¦c¦Lct
La¦c¦Lct ε
[ La¦c¦ La¦c¦Lct
¦¦ack nontcrmina¦ sym¦o¦s
¦¦uc tcrmina¦ sym¦o¦s
Thc dcﬁnitions ot thc nontcrmina¦s ”L¦cmcnt”, ”Lond”, and ”La¦c¦”
dcjcnd on thc choscn dcscrijtion ¦an¸ua¸c Ior S`lLLS it is
L¦cmcnt B [ C [ N [ O [ F [ [H] [ [He] [ [Li] [ [Be] [
Lond ε [  [ = [ # [ : [ .
La¦c¦ Li¸it [ % Li¸it Li¸it
Li¸it 0 [ 1 [ [ 9
Christian Borgelt Frequent Pattern Mining 266
Input Format: SDﬁle/Ctab
LAlanine (13C)
user initials, program, date/time etc.
comment
6 5 0 0 1 0 3 V2000
0.6622 0.5342 0.0000 C 0 0 2 0 0 0
0.6622 0.3000 0.0000 C 0 0 0 0 0 0
0.7207 2.0817 0.0000 C 1 0 0 0 0 0
1.8622 0.3695 0.0000 N 0 3 0 0 0 0
0.6220 1.8037 0.0000 O 0 0 0 0 0 0
1.9464 0.4244 0.0000 O 0 5 0 0 0 0
1 2 1 0 0 0
1 3 1 1 0 0
1 4 1 0 0 0
2 5 2 0 0 0
2 6 1 0 0 0
M END
> <value>
0.2
$$$$
O 5
C
2 O6
C
1
C 3 N4
SLﬁ¦c Structurcdata ﬁ¦c
Cta¦ Conncction ta¦¦c (¦incs !–1o)
c _ L¦scvicr Scicncc
Christian Borgelt Frequent Pattern Mining 267
Finding Common Molecular Substructures
N N N O
O
N
N
O
O
O
O
N
N
N
N N N O
O
N
N
O
O
O
N N N O
O
N
N
O
O
O
P
O
O
O
O
O
N N N O
O
N
N
O
O
O
O
O
O
O
N N N O
O
N
N
O
O
Some Molecules from the NCI HIV Database
Common Fragment
Christian Borgelt Frequent Pattern Mining 268
Finding Molecular Substructures
• Common Molecular Substructures
◦ Ana¦yzc on¦y thc activc mo¦ccu¦cs
◦ Iind mo¦ccu¦ar tra¸mcnts that ajjcar trcqucnt¦y in thc mo¦ccu¦cs
• Discriminative Molecular Substructures
◦ Ana¦yzc thc activc and thc inactivc mo¦ccu¦cs
◦ Iind mo¦ccu¦ar tra¸mcnts that ajjcar trcqucnt¦y in thc activc mo¦ccu¦cs
and on¦y rarc¦y in thc inactivc mo¦ccu¦cs
• Rationale in both cases
◦ Thc tound tra¸mcnts can ¸ivc hints which structura¦ jrojcrtics
arc rcsjonsi¦¦c tor thc activity ot a mo¦ccu¦c
◦ This can hc¦j to idcntity dru¸ candidatcs (soca¦¦cd pharmacophores)
and to ¸uidc tuturc scrccnin¸ cﬀorts
Christian Borgelt Frequent Pattern Mining 269
Frequent (Sub)Graph Mining
Christian Borgelt Frequent Pattern Mining 270
Frequent (Sub)Graph Mining: General Approach
• Iindin¸ trcqucnt itcm scts mcans to ﬁnd
sets of items that are contained in many transactions
• Iindin¸ trcqucnt su¦structurcs mcans to ﬁnd
graph fragments that are contained in many graphs
in a ¸ivcn data¦asc ot attri¦utcd ¸rajhs (uscr sjcciﬁcs minimum sujjort)
• Grajh structurc ot vcrticcs and cd¸cs has to ¦c takcn into account
⇒ Scarch jartia¦¦y ordcrcd sct ot ¸rajh structurcs instcad ot su¦scts
`ain jro¦¦cm How can we avoid redundant search?
• ¹sua¦¦y thc scarch is rcstrictcd to connected substructures
◦ Conncctcd su¦structurcs suﬃcc tor most ajj¦ications
◦ This rcstriction considcra¦¦y narrows thc scarch sjacc
Christian Borgelt Frequent Pattern Mining 271
Frequent (Sub)Graph Mining: Basic Notions
• Lct A ¦a
1
, . . . , a
m
¦ ¦c a sct ot attributes or labels
• A labeled or attributed graph is a trij¦c G (V, E, ℓ), whcrc
◦ V is thc sct ot vcrticcs,
◦ E ⊆ V V −¦(v, v) [ v ∈ V ¦ is thc sct ot cd¸cs, and
◦ ℓ V ∪ E → A assi¸ns ¦a¦c¦s trom thc sct A to vcrticcs and cd¸cs
`otc that G is undirected and simple and contains no loops
Lowcvcr, ¸rajhs without thcsc rcstrictions cou¦d ¦c hand¦cd as wc¦¦
`otc a¦so that scvcra¦ vcrticcs and cd¸cs may havc thc samc attri¦utc,¦a¦c¦
Lxamj¦c molecule representation
• Atom attri¦utcs atom tyjc (chcmica¦ c¦cmcnt), char¸c, aromatic rin¸ ﬂa¸
• Lond attri¦utcs ¦ond tyjc (sin¸¦c, dou¦¦c, trij¦c, aromatic)
Christian Borgelt Frequent Pattern Mining 272
Frequent (Sub)Graph Mining: Basic Notions
`otc that tor ¦a¦c¦cd ¸rajhs thc samc notions can ¦c uscd as tor norma¦ ¸rajhs
\ithout torma¦ dcﬁnition, wc wi¦¦ usc, tor cxamj¦c
• A vcrtcx v is incident to an cd¸c e, and thc cd¸c is incident to thc vcrtcx v,
iﬀ e (v, v
′
) or e (v
′
, v)
• Two diﬀcrcnt vcrticcs arc adjacent or connected
it thcy arc incidcnt to thc samc cd¸c
• A path is a scqucncc ot cd¸cs conncctin¸ two vcrticcs
lt is undcrstood that no cd¸c (and no vcrtcx) occurs twicc
• A ¸rajh is ca¦¦cd connected it thcrc cxists a jath ¦ctwccn any two vcrticcs
• A subgraph consists ot a su¦sct ot thc vcrticcs and a su¦sct ot thc cd¸cs
lt S is a (jrojcr) su¦¸rajh ot G wc writc S ⊆ G or S ⊂ G, rcsjcctivc¦y
• A connected component ot a ¸rajh is a su¦¸rajh that is conncctcd and
maxima¦ in thc scnsc that any ¦ar¸cr su¦¸rajh containin¸ it is not conncctcd
Christian Borgelt Frequent Pattern Mining 273
Frequent (Sub)Graph Mining: Basic Notions
`otc that tor ¦a¦c¦cd ¸rajhs thc samc notions can ¦c uscd as tor norma¦ ¸rajhs
\ithout torma¦ dcﬁnition, wc wi¦¦ usc, tor cxamj¦c
• A vcrtcx ot a ¸rajh is ca¦¦cd isolated it it is not incidcnt to any cd¸c
• A vcrtcx ot a ¸rajh is ca¦¦cd a leaf it it is incidcnt to cxact¦y onc cd¸c
• An cd¸c ot a ¸rajh is ca¦¦cd a bridge it rcmovin¸ it
incrcascs thc num¦cr ot conncctcd comjoncnts ot thc ¸rajh
`orc intuitivc¦y a ¦rid¸c is thc on¦y conncction ¦ctwccn two vcrticcs,
that is, thcrc is no othcr jath on which onc can rcach thc onc trom thc othcr
• An cd¸c ot a ¸rajh is ca¦¦cd a proper bridge
it it is a ¦rid¸c and not incidcnt to a ¦cat
ln othcr words an cd¸c is a jrojcr ¦rid¸c it rcmovin¸ it crcatcs an iso¦atcd vcrtcx
• A¦¦ othcr ¦rid¸cs arc ca¦¦cd leaf bridges
(¦ccausc thcy arc incidcnt to at ¦cast onc ¦cat)
Christian Borgelt Frequent Pattern Mining 274
Frequent (Sub)Graph Mining: Basic Notions
• Lct G (V
G
, E
G
, ℓ
G
) and S (V
S
, E
S
, ℓ
S
) ¦c two ¦a¦c¦cd ¸rajhs
A subgraph isomorphism ot S to G or an occurrence ot S in G
is an in,cctivc tunction f V
S
→ V
G
with
◦ ∀v ∈ V
S
ℓ
S
(v) ℓ
G
(f(v)) and
◦ ∀(u, v) ∈ E
S
(f(u), f(v)) ∈ E
G
∧ ℓ
S
((u, v)) ℓ
G
((f(u), f(v)))
That is, thc majjin¸ f jrcscrvcs thc conncction structurc and thc ¦a¦c¦s
lt such a majjin¸ f cxists, wc writc S ⊑ G
• `otc that thcrc may ¦c scvcra¦ ways to maj a ¦a¦c¦cd ¸rajh S to a ¦a¦c¦cd ¸rajh G
so that thc conncction structurc and thc vcrtcx and cd¸c ¦a¦c¦s arc jrcscrvcd
Ior cxamj¦c, G may josscss scvcra¦ su¦¸rajhs that arc isomorjhic to S
lt may cvcn ¦c that thc ¸rajh S can ¦c majjcd in scvcra¦ diﬀcrcnt ways to thc
samc su¦¸rajh ot G This is thc casc it thcrc cxists a su¦¸rajh isomorjhism ot S
to itsc¦t (a soca¦¦cd graph automorphism) that is not thc idcntity
Christian Borgelt Frequent Pattern Mining 275
Frequent (Sub)Graph Mining: Basic Notions
Lct S and G ¦c two ¦a¦c¦cd ¸rajhs
• S and G arc ca¦¦cd isomorphic, writtcn S ≡ G, iﬀ S ⊑ G and G ⊑ S
ln this casc a tunction f majjin¸ S to G is ca¦¦cd a graph isomorphism
A tunction f majjin¸ S to itsc¦t is ca¦¦cd a graph automorphism
• S is properly contained in G, writtcn S < G, iﬀ S ⊑ G and S ,≡ G
• lt S ⊑ G or S < G, thcn thcrc cxists a (jrojcr) su¦¸rajh G
′
ot G,
such that S and G
′
arc isomorjhic
This cxj¦ains thc tcrm “su¦¸rajh isomorjhism”
• Thc set of all connected subgraphs ot G is dcnotcd ¦y ((G)
lt is o¦vious that tor a¦¦ S ∈ ((G) S ⊑ G
Lowcvcr, thcrc arc (unconncctcd) ¸rajhs S with S ⊑ G that arc not in ((G)
Thc sct ot a¦¦ (conncctcd) su¦¸rajhs is ana¦o¸ous to thc jowcr sct ot a sct
Christian Borgelt Frequent Pattern Mining 276
Subgraph Isomorphism: Examples
G
S
1
S
2
N
N
O
O O
O
O
N
N
O
• A mo¦ccu¦c G that rcjrcscnts a ¸rajh in a data¦asc
and two ¸rajhs S
1
and S
2
that arc containcd in G
• Thc su¦¸rajh rc¦ationshij is torma¦¦y dcscri¦cd ¦y a majjin¸ f
ot thc vcrticcs ot onc ¸rajh to thc vcrticcs ot anothcr
G (V
G
, E
G
), S (V
S
, E
S
), f V
S
→ V
G
.
• This majjin¸ must jrcscrvc thc conncction structurc and thc ¦a¦c¦s
Christian Borgelt Frequent Pattern Mining 277
Subgraph Isomorphism: Examples
G
S
1
f
1
V
S
1
→ V
G
S
2
f
2
V
S
2
→ V
G
N
N
O
O O
O
O
N
N
O
• Thc majjin¸ must jrcscrvc thc conncction structurc
∀(u, v) ∈ E
S
(f(u), f(v)) ∈ E
G
.
• Thc majjin¸ must jrcscrvc vcrtcx and cd¸c ¦a¦c¦s
∀v ∈ V
S
ℓ
S
(v) ℓ
G
(f(v)), ∀(u, v) ∈ E
S
ℓ
S
((u, v)) ℓ
G
((f(u), f(v))).
Lcrc oxy¸cn must ¦c majjcd to oxy¸cn, sin¸¦c ¦onds to sin¸¦c ¦onds ctc
Christian Borgelt Frequent Pattern Mining 278
Subgraph Isomorphism: Examples
G
S
1
f
1
V
S
1
→ V
G
S
2
f
2
V
S
2
→ V
G
g
2
V
S
2
→ V
G
N
N
O
O O
O
O
N
N
O
• Thcrc may ¦c morc than onc jossi¦¦c majjin¸ , occurrcncc
(Thcrc arc cvcn thrcc morc occurrcnccs ot S
2
)
• Lowcvcr, wc arc currcnt¦y on¦y intcrcstcd in whcthcr thcrc cxists a majjin¸
(Thc num¦cr ot occurrcnccs wi¦¦ ¦ccomc imjortant
whcn wc considcr minin¸ trcqucnt (su¦)¸rajhs in a sin¸¦c ¸rajh)
• Tcstin¸ whcthcr a su¦¸rajh isomorjhism cxists ¦ctwccn ¸ivcn ¸rajhs S and G
is NPcomplete (that is, rcquircs cxjoncntia¦ timc un¦css l `l)
Christian Borgelt Frequent Pattern Mining 279
Subgraph Isomorphism: Examples
G
S
1
f
1
V
S
1
→ V
G
S
3
f
3
V
S
3
→ V
G
g
3
V
S
3
→ V
G
N
N
O
O O
O
O
N
N O
O
• A ¸rajh may ¦c majjcd to itsc¦t (automorphism)
• Trivia¦¦y, cvcry ¸rajh josscsscs thc idcntity as an automorjhism
(Lvcry ¸rajh can ¦c majjcd to itsc¦t ¦y majjin¸ cach nodc to itsc¦t)
• lt a ¸rajh (tra¸mcnt) josscsscs an automorjhism that is not thc idcntity
thcrc is morc than onc occurrcncc at the same location in anothcr ¸rajh
• Thc num¦cr ot occurrcnccs ot a ¸rajh (tra¸mcnt) in a ¸rajh can ¦c hu¸c
Christian Borgelt Frequent Pattern Mining 280
Frequent (Sub)Graph Mining: Basic Notions
Lct S ¦c a ¦a¦c¦cd ¸rajh and ( a vcctor ot ¦a¦c¦cd ¸rajhs
• A ¦a¦c¦cd ¸rajh G ∈ ( covers thc ¦a¦c¦cd ¸rajh S or
thc ¦a¦c¦cd ¸rajh S is contained in a ¦a¦c¦cd ¸rajh G ∈ ( iﬀ S ⊑ G
• Thc sct K
(
(S) ¦k∈¦1, . . . , n¦ [ S⊑G
k
¦ is ca¦¦cd thc cover ot S wrt (
Thc covcr ot a ¸rajh is thc indcx sct ot thc data¦asc ¸rajhs that covcr it
lt may a¦so ¦c dcﬁncd as a vcctor ot a¦¦ ¦a¦c¦cd ¸rajhs that covcr it
(which, howcvcr, is comj¦icatcd to writc in torma¦¦y corrcct way)
• Thc va¦uc s
(
(S) [K
(
(S)[ is ca¦¦cd thc (absolute) support ot S wrt (
Thc va¦uc σ
(
(S)
1
n
[K
(
(S)[ is ca¦¦cd thc relative support ot S wrt (
Thc sujjort ot S is thc num¦cr or traction ot ¦a¦c¦cd ¸rajhs that contain it
Somctimcs σ
(
(S) is a¦so ca¦¦cd thc (relative) frequency ot S wrt (
Christian Borgelt Frequent Pattern Mining 281
Frequent (Sub)Graph Mining: Formal Deﬁnition
Given:
• a sct A ¦a
1
, . . . , a
m
¦ ot attri¦utcs or ¦a¦c¦s,
• a vcctor ( (G
1
, . . . , G
n
) ot ¸rajhs with ¦a¦c¦s in A,
• a num¦cr s
min
∈ l`, 0 < s
min
≤ n, or (cquiva¦cnt¦y)
a num¦cr σ
min
∈ ll, 0 < σ
min
≤ 1, thc minimum support
Desired:
• thc sct ot frequent (sub)graphs or frequent fragments, that is,
thc sct F
(
(s
min
) ¦S [ s
(
(S) ≥ s
min
¦ or (cquiva¦cnt¦y)
thc sct Φ
(
(σ
min
) ¦S [ σ
(
(S) ≥ σ
min
¦
`otc that with thc rc¦ations s
min
⌈nσ
min
⌉ and σ
min
1
n
s
min
thc two vcrsions can casi¦y ¦c transtormcd into cach othcr
Christian Borgelt Frequent Pattern Mining 282
Frequent (Sub)Graphs: Example
cxamj¦c mo¦ccu¦cs
(¸rajh data¦asc)
S C N C
O
O S C N
F
O S C N
O
Thc num¦crs
¦c¦ow thc su¦¸rajhs
statc thcir sujjort
trcqucnt mo¦ccu¦ar tra¸mcnts (s
min
2)
∗ (cmjty ¸rajh)
3
S O C N
3 3 3 3
O S S C C O C N
2 3 2 3
O S C S C N S C O N C O
2 3 2 2
O S C
N
S C N
O
2 2
Christian Borgelt Frequent Pattern Mining 283
Properties of the Support of (Sub)Graphs
• A brute force approach that cnumcratcs a¦¦ jossi¦¦c (su¦)¸rajhs, dctcrmincs
thcir sujjort, and discards intrcqucnt (su¦)¸rajhs is usua¦¦y infeasible
Thc num¦cr ot jossi¦¦c (conncctcd) (su¦)¸rajhs,
¸rows vcry quick¦y with thc num¦cr ot vcrticcs and cd¸cs
• Idea: Considcr thc jrojcrtics ot thc sujjort, in jarticu¦ar
∀S ∀R ⊇ S K
(
(R) ⊆ K
(
(S).
This jrojcrty ho¦ds, ¦ccausc ∀G ∀S ∀R ⊇ S R ⊑ G → S ⊑ G
Lach additiona¦ cd¸c is anothcr condition a data¦asc ¸rajh has to satisty
Grajhs that do not satisty this condition arc rcmovcd trom thc covcr
• lt to¦¦ows ∀S ∀R ⊇ S s
(
(R) ≤ s
(
(S).
That is If a (sub)graph is extended, its support cannot increase.
Onc a¦so says that sujjort is antimonotone or downward closed
Christian Borgelt Frequent Pattern Mining 284
Properties of the Support of (Sub)Graphs
• Irom ∀S ∀R ⊇ S s
(
(R) ≤ s
(
(S) it to¦¦ows
∀s
min
∀S ∀R ⊇ S s
(
(S) < s
min
→ s
(
(R) < s
min
.
That is No supergraph of an infrequent (sub)graph can be frequent.
• This jrojcrty is ottcn rctcrrcd to as thc Apriori Property
lationa¦c Somctimcs wc can know a priori, that is, ¦ctorc chcckin¸ its sujjort
¦y acccssin¸ thc ¸ivcn ¸rajh data¦asc, that a (su¦)¸rajh cannot ¦c trcqucnt
• Ot coursc, thc contrajosition ot this imj¦ication a¦so ho¦ds
∀s
min
∀S ∀R ⊆ S s
(
(S) ≥ s
min
→ s
(
(R) ≥ s
min
.
That is All subgraphs of a frequent (sub)graph are frequent.
• This su¸¸csts a comjrcsscd rcjrcscntation ot thc sct ot trcqucnt (su¦)¸rajhs
Christian Borgelt Frequent Pattern Mining 285
Reminder: Partially Ordered Sets
• A partial order is a ¦inary rc¦ation ≤ ovcr a sct S which satisﬁcs ∀a, b, c ∈ S
◦ a ≤ a (rcﬂcxivity)
◦ a ≤ b ∧ b ≤ a ⇒ a b (antisymmctry)
◦ a ≤ b ∧ b ≤ c ⇒ a ≤ c (transitivity)
• A sct with a jartia¦ ordcr is ca¦¦cd a partially ordered set (or poset tor short)
• Lct a and b ¦c two distinct c¦cmcnts ot a jartia¦¦y ordcrcd sct (S, ≤)
◦ it a ≤ b or b ≤ a, thcn a and b arc ca¦¦cd comparable
◦ it ncithcr a ≤ b nor b ≤ a, thcn a and b arc ca¦¦cd incomparable
• lt a¦¦ jairs ot c¦cmcnts ot thc undcr¦yin¸ sct S arc comjara¦¦c,
thc ordcr ≤ is ca¦¦cd a total order or a linear order
• ln a tota¦ ordcr thc rcﬂcxivity axiom is rcj¦accd ¦y thc stron¸cr axiom
◦ a ≤ b ∨ b ≤ a (tota¦ity)
Christian Borgelt Frequent Pattern Mining 286
Properties of the Support of (Sub)Graphs
Monotonicity in Calculus and Analysis
• A tunction f ll → ll is ca¦¦cd monotonically nondecreasing
it ∀x, y x ≤ y ⇒ f(x) ≤ f(y)
• A tunction f ll → ll is ca¦¦cd monotonically nonincreasing
it ∀x, y x ≤ y ⇒ f(x) ≥ f(y)
Monotonicity in Order Theory
• Ordcr thcory is conccrncd with ar¦itrary jartia¦¦y ordcrcd scts
Thc tcrms increasing and decreasing arc avoidcd, ¦ccausc thcy ¦osc thcir jictoria¦
motivation as soon as scts arc considcrcd that arc not tota¦¦y ordcrcd
• A tunction f S
1
→ S
2
, whcrc S
1
and S
2
arc two jartia¦¦y ordcrcd scts, is ca¦¦cd
monotone or orderpreserving it ∀x, y ∈ S
1
x ≤ y ⇒ f(x) ≤ f(y)
• A tunction f S
1
→ S
2
, is ca¦¦cd
antimonotone or orderreversing it ∀x, y ∈ S
1
x ≤ y ⇒ f(x) ≥ f(y)
• ln this scnsc thc sujjort ot a (su¦)¸rajh is antimonotonc
Christian Borgelt Frequent Pattern Mining 287
Properties of Frequent (Sub)Graphs
• A su¦sct R ot a jartia¦¦y ordcrcd sct (S, ≤) is ca¦¦cd downward closed
it tor any c¦cmcnt ot thc sct a¦¦ sma¦¦cr c¦cmcnts arc a¦so in it
∀x ∈ R ∀y ∈ S y ≤ x ⇒ y ∈ R
ln this casc thc su¦sct R is a¦so ca¦¦cd a lower set
• Thc notions ot upward closed and upper set arc dcﬁncd ana¦o¸ous¦y
• Ior cvcry s
min
thc sct ot trcqucnt (su¦)¸rajhs F
(
(s
min
)
is downward c¦oscd wrt thc jartia¦ ordcr ⊑
∀S ∈ F
(
(s
min
) S ⊑ R ⇒ R ∈ F
(
(s
min
)
• Sincc thc sct ot trcqucnt (su¦)¸rajhs is induccd ¦y thc sujjort tunction,
thc notions ot up or downward closed arc transtcrrcd to thc sujjort tunction
Any sct ot (su¦)¸rajhs induccd ¦y a sujjort thrcsho¦d θ is uj or downward c¦oscd
F
(
(θ) ¦S [ s
(
(S) ≥ θ¦ is downward c¦oscd,
I
(
(θ) ¦S [ s
(
(S) < θ¦ is ujward c¦oscd
Christian Borgelt Frequent Pattern Mining 288
Types of Frequent (Sub)Graphs
Christian Borgelt Frequent Pattern Mining 289
Maximal (Sub)Graphs
• Considcr thc sct ot maximal (frequent) (sub)graphs / fragments
M
(
(s
min
) ¦S [ s
(
(S) ≥ s
min
∧ ∀R ⊃ S s
(
(R) < s
min
¦.
That is A (su¦)¸rajh is maxima¦ it it is trcqucnt,
¦ut nonc ot its jrojcr sujcr¸rajhs is trcqucnt
• Sincc with this dcﬁnition wc know that
∀s
min
∀S ∈ F
(
(s
min
) S ∈ M
(
(s
min
) ∨ ∃R ⊃ S s
(
(R) ≥ s
min
it to¦¦ows (can casi¦y ¦c jrovcn ¦y succcssivc¦y cxtcndin¸ thc ¸rajh S)
∀s
min
∀S ∈ F
(
(s
min
) ∃R ∈ M
(
(s
min
) S ⊆ R.
That is Every frequent (sub)graph has a maximal supergraph.
• Thcrctorc ∀s
min
F
(
(s
min
)
_
S∈M
(
(s
min
)
((S).
Christian Borgelt Frequent Pattern Mining 290
Reminder: Maximal Elements
• Lct R ¦c a su¦sct ot a jartia¦¦y ordcrcd sct (S, ≤)
An c¦cmcnt x ∈ R is ca¦¦cd maximal or a maximal element ot R it
∀y ∈ R x ≤ y ⇒ x y.
• Thc notions minimal and minimal element arc dcﬁncd ana¦o¸ous¦y
• `axima¦ c¦cmcnts nccd not ¦c uniquc,
¦ccausc thcrc may ¦c c¦cmcnts y ∈ R with ncithcr x ≤ y nor y ≤ x
• lnﬁnitc jartia¦¦y ordcrcd scts nccd not josscss a maxima¦ c¦cmcnt
• Lcrc wc considcr thc sct F
(
(s
min
) to¸cthcr with thc jartia¦ ordcr ⊑
Thc maximal (frequent) (sub)graphs arc thc maxima¦ c¦cmcnts ot F
(
(s
min
)
M
(
(s
min
) ¦S ∈ F
(
(s
min
) [ ∀R ∈ F
(
(s
min
) S ⊑ R ⇒ S ≡ R¦.
That is, no sujcr¸rajh ot a maxima¦ (trcqucnt) (su¦)¸rajh is trcqucnt
Christian Borgelt Frequent Pattern Mining 291
Maximal (Sub)Graphs: Example
cxamj¦c mo¦ccu¦cs
(¸rajh data¦asc)
S C N C
O
O S C N
F
O S C N
O
Thc num¦crs
¦c¦ow thc su¦¸rajhs
statc thcir sujjort
trcqucnt mo¦ccu¦ar tra¸mcnts (s
min
2)
∗ (cmjty ¸rajh)
3
S O C N
3 3 3 3
O S S C C O C N
2 3 2 3
O S C S C N S C O N C O
2 3 2 2
O S C
N
S C N
O
2 2
Christian Borgelt Frequent Pattern Mining 292
Limits of Maximal (Sub)Graphs
• Thc sct ot maxima¦ (su¦)¸rajhs cajturcs thc sct ot a¦¦ trcqucnt (su¦)¸rajhs,
¦ut thcn wc know on¦y thc sujjort ot thc maxima¦ (su¦)¸rajhs
• A¦out thc sujjort ot a nonmaxima¦ trcqucnt (su¦)¸rajhs wc on¦y know
∀s
min
∀S ∈ F
(
(s
min
) −M
(
(s
min
) s
(
(S) ≥ max
R∈M
(
(s
min
),R⊃S
s
(
(R).
This rc¦ation to¦¦ows immcdiatc¦y trom ∀S ∀R ⊇ S s
(
(S) ≥ s
(
(R),
that is, a (su¦)¸rajh cannot havc a ¦owcr sujjort than any ot its sujcr¸rajhs
• `otc that wc havc ¸cncra¦¦y
∀s
min
∀S ∈ F
(
(s
min
) s
(
(S) ≥ max
R∈M
(
(s
min
),R⊇S
s
(
(R).
• Question: Can wc ﬁnd a su¦sct ot thc sct ot a¦¦ trcqucnt (su¦)¸rajhs,
which a¦so jrcscrvcs know¦cd¸c ot a¦¦ sujjort va¦ucs´
Christian Borgelt Frequent Pattern Mining 293
Closed (Sub)Graphs
• Considcr thc sct ot closed (frequent) (sub)graphs / fragments
C
(
(s
min
) ¦S [ s
(
(S) ≥ s
min
∧ ∀R ⊃ S s
(
(R) < s
(
(S)¦.
That is A (su¦)¸rajh is c¦oscd it it is trcqucnt,
¦ut nonc ot its jrojcr sujcr¸rajhs has thc samc sujjort
• Sincc with this dcﬁnition wc know that
∀s
min
∀S ∈ F
(
(s
min
) S ∈ C
(
(s
min
) ∨ ∃R ⊃ S s
(
(R) s
(
(S)
it to¦¦ows (can casi¦y ¦c jrovcn ¦y succcssivc¦y cxtcndin¸ thc ¸rajh S)
∀s
min
∀S ∈ F
(
(s
min
) ∃R ∈ C
(
(s
min
) S ⊆ R.
That is Every frequent (sub)graph has a closed supergraph.
• Thcrctorc ∀s
min
F
(
(s
min
)
_
S∈C
(
(s
min
)
((S).
Christian Borgelt Frequent Pattern Mining 294
Closed (Sub)Graphs
• Lowcvcr, not on¦y has cvcry trcqucnt (su¦)¸rajh a c¦oscd sujcr¸rajh,
¦ut it has a closed supergraph with the same support
∀s
min
∀S ∈ F
(
(s
min
) ∃R ⊇ S R ∈ C
(
(s
min
) ∧ s
(
(R) s
(
(S).
(lroot considcr thc c¦osurc ojcrator that is dcﬁncd on thc to¦¦owin¸ s¦idcs)
`otc, howcvcr, that thc sujcr¸rajh nccd not ¦c uniquc — scc ¦c¦ow
• Thc sct ot a¦¦ c¦oscd (su¦)¸rajhs jrcscrvcs know¦cd¸c ot a¦¦ sujjort va¦ucs
∀s
min
∀S ∈ F
(
(s
min
) s
(
(S) max
R∈C
(
(s
min
),R⊇S
s
(
(R).
• `otc that thc wcakcr statcmcnt
∀s
min
∀S ∈ F
(
(s
min
) s
(
(S) ≥ max
R∈C
(
(s
min
),R⊇S
s
(
(R)
to¦¦ows immcdiatc¦y trom ∀S ∀R ⊇ S s
(
(S) ≥ s
(
(R), that is,
a (su¦)¸rajh cannot havc a ¦owcr sujjort than any ot its sujcr¸rajhs
Christian Borgelt Frequent Pattern Mining 295
Reminder: Closure Operators
• A closure operator on a sct S is a tunction cl 2
S
→ 2
S
,
which satisﬁcs thc to¦¦owin¸ conditions ∀X, Y ⊆ S
◦ X ⊆ cl (X) (cl is cxtcnsivc)
◦ X ⊆ Y ⇒ cl (X) ⊆ cl (Y ) (cl is incrcasin¸ or monotonc)
◦ cl (cl (X)) cl (X) (cl is idcmjotcnt)
• A sct R ⊆ S is ca¦¦cd closed it it is cqua¦ to its c¦osurc
R is c¦oscd ⇔ R cl (R)
• Thc closed (frequent) item sets arc induccd ¦y thc c¦osurc ojcrator
cl (I)
k∈K
T
(I)
t
k
.
rcstrictcd to thc sct ot trcqucnt itcm scts
C
T
(s
min
) ¦I ∈ F
T
(s
min
) [ I cl (I)¦
Christian Borgelt Frequent Pattern Mining 296
Closed (Sub)Graphs
• Question: ls thcrc a c¦osurc ojcrator that induccs thc c¦oscd (su¦)¸rajhs´
• At ﬁrst ¸¦ancc, it ajjcars natura¦ to transtcr thc ojcration
cl (I)
k∈K
T
(I)
t
k
¦y rcj¦acin¸ thc intcrscction with thc greatest common subgraph
• ¹ntortunatc¦y, this is not jossi¦¦c, ¦ccausc thc ¸rcatcst common su¦¸rajh
ot two (or morc) ¸rajhs nccd not ¦c uniquc¦y dcﬁncd
◦ Considcr thc two ¸rajhs (which arc actua¦¦y chains)
A−B −C and A −B −B −C.
◦ Thcrc arc two ¸rcatcst common su¦¸rajhs
A −B and B −C.
• As a conscqucncc, thc intcrscction ot a sct ot data¦asc ¸rajhs
can yic¦d a set of graphs instcad ot a sin¸¦c common ¸rajh
Christian Borgelt Frequent Pattern Mining 297
Reminder: Galois Connections
• Lct (X, _
X
) and (Y, _
Y
) ¦c two jartia¦¦y ordcrcd scts
• A tunction jair (f
1
, f
2
) with f
1
X → Y and f
2
Y → X
is ca¦¦cd a (monotone) Galois connection iﬀ
◦ ∀A
1
, A
2
∈ X A
1
_ A
2
⇒ f
1
(A
1
) _ f
1
(A
2
),
◦ ∀B
1
, B
2
∈ Y B
1
_ B
2
⇒ f
2
(B
1
) _ f
2
(B
2
),
◦ ∀A ∈ X ∀B ∈ Y A _ f
2
(B) ⇔ B _ f
1
(A)
• A tunction jair (f
1
, f
2
) with f
1
X → Y and f
2
Y → X
is ca¦¦cd an antimonotone Galois connection iﬀ
◦ ∀A
1
, A
2
∈ X A
1
_ A
2
⇒ f
1
(A
1
) _ f
1
(A
2
),
◦ ∀B
1
, B
2
∈ Y B
1
_ B
2
⇒ f
2
(B
1
) _ f
2
(B
2
),
◦ ∀A ∈ X ∀B ∈ Y A _ f
2
(B) ⇔ B _ f
1
(A)
• ln a monotonc Ga¦ois conncction, ¦oth f
1
and f
2
arc monotonc,
in an antimonotonc Ga¦ois conncction, ¦oth f
1
and f
2
arc antimonotonc
Christian Borgelt Frequent Pattern Mining 298
Reminder: Galois Connections
Galois Connections and Closure Operators
• Lct thc two scts X and Y ¦c jowcr scts ot somc scts U and V , rcsjcctivc¦y,
and ¦ct thc jartia¦ ordcrs ¦c thc su¦sct rc¦ations on thcsc jowcr scts, that is, ¦ct
(X, _
X
) (2
U
, ⊆) and (Y, _
Y
) (2
V
, ⊆).
• Thcn thc com¦ination f
1
◦ f
2
X → X ot thc tunctions ot a Ga¦ois conncction
is a closure operator (as wc¦¦ as thc com¦ination f
2
◦ f
1
Y → Y )
Galois Connections in Frequent Item Set Mining
• Considcr thc jartia¦¦y ordcr scts (2
B
, ⊆) and (2
¦1,...,n¦
, ⊆)
Lct f
1
2
B
→ 2
¦1,...,n¦
, I → K
T
(I) ¦k ∈ ¦1, . . . , n¦ [ I ⊆ t
k
¦
and f
2
2
¦1,...,n¦
→ 2
B
, J →
j∈J
t
j
¦i ∈ B [ ∀j ∈ J i ∈ t
j
¦
• Thc tunction jair (f
1
, f
2
) is an antimonotone Galois connection
Thcrctorc thc com¦ination f
1
◦ f
2
2
B
→ 2
B
is a closure operator
Christian Borgelt Frequent Pattern Mining 299
Galois Connections in Frequent (Sub)Graph Mining
• Lct ( (G
1
, . . . , G
n
) ¦c a vcctor ot data¦asc ¸rajhs
• Lct U ¦c thc sct ot a¦¦ su¦¸rajhs ot thc data¦asc ¸rajhs in (, that is,
U ¦S [ ∃i ∈ ¦1, . . . , n¦ S ⊑ G
i
¦
• Lct V ¦c thc indcx sct ot thc data¦asc ¸rajhs in (, that is
V ¦1, . . . , n¦ (sct ot ¸rajh idcntiﬁcrs)
• (2
U
, ⊆) and (2
V
, ⊑) arc jartia¦¦y ordcrcd scts Considcr thc tunction jair
f
1
2
U
→ 2
V
, I → ¦k ∈ U [ ∀S ∈ I S ⊑ G
k
¦. and
f
2
2
V
→ 2
U
J → ¦S ∈ V [ ∀k ∈ J S ⊑ G
k
¦,
• Thc jair (f
1
, f
2
) is a Ga¦ois conncction ot X (2
U
, ⊆) and Y (2
V
, ⊑)
◦ ∀A
1
, A
2
∈ 2
U
A
1
⊆ A
2
⇒ f
1
(A
1
) ⊇ f
1
(A
2
),
◦ ∀B
1
, B
2
∈ 2
V
B
1
⊆ B
2
⇒ f
2
(B
1
) ⊇ f
2
(B
2
),
◦ ∀A ∈ 2
U
∀B ∈ 2
V
A ⊆ f
2
(B) ⇔ B ⊆ f
1
(A)
Christian Borgelt Frequent Pattern Mining 300
Galois Connections in Frequent (Sub)Graph Mining
• Sincc thc tunction jair (f
1
, f
2
) is an (antimonotonc) Ga¦ois conncction,
f
2
◦ f
1
2
U
→ 2
U
is a closure operator
• This c¦osurc ojcrator can ¦c uscd to dcﬁnc thc c¦oscd (su¦)¸rajhs
A su¦¸rajh S is closed wrt a ¸rajh data¦asc ( iﬀ
S ∈ (f
2
◦ f
1
)(¦S¦) ∧ ,∃ G ∈ (f
2
◦ f
1
)(¦S¦) S < G.
• Thc ¸cncra¦ization to a Ga¦ois conncction takcs torma¦¦y carc ot thc jro¦¦cm
that thc ¸rcatcst common su¦¸rajh may not ¦c uniquc¦y dctcrmincd
• lntuitivc¦y, thc a¦ovc dcﬁnition simj¦y says that a su¦¸rajh S is c¦oscd iﬀ
◦ it is a common su¦¸rajh ot a¦¦ data¦asc ¸rajhs containin¸ it and
◦ no sujcr¸rajh ot it is a¦so a common su¦¸rajh ot thcsc ¸rajhs
That is, a su¦¸rajh S is c¦oscd it it is one ot thc ¸rcatcst common su¦¸rajhs
ot a¦¦ data¦asc ¸rajhs containin¸ it
• Thc Ga¦ois conncction is on¦y nccdcd to jrovc thc c¦osurc ojcrator jrojcrty
Christian Borgelt Frequent Pattern Mining 301
Closed (Sub)Graphs: Example
cxamj¦c mo¦ccu¦cs
(¸rajh data¦asc)
S C N C
O
O S C N
F
O S C N
O
Thc num¦crs
¦c¦ow thc su¦¸rajhs
statc thcir sujjort
trcqucnt mo¦ccu¦ar tra¸mcnts (s
min
2)
∗ (cmjty ¸rajh)
3
S
O
C N
3 3 3 3
O S S C C O C N
2 3 2 3
O S C
S C N
S C O N C O
2 3 2 2
O S C
N
S C N
O
2 2
Christian Borgelt Frequent Pattern Mining 302
Types of Frequent (Sub)Graphs
• Frequent (Sub)Graph
Any trcqucnt (su¦)¸rajh (sujjort is hi¸hcr than thc minima¦ sujjort)
I trcqucnt ⇔ s
(
(S) ≥ s
min
• Closed (Sub)Graph
A trcqucnt (su¦)¸rajh is ca¦¦cd closed it no sujcr¸rajh has thc samc sujjort
I c¦oscd ⇔ s
(
(S) ≥ s
min
∧ ∀R ⊃ S s
(
(R) < s
(
(S)
• Maximal (Sub)Graph
A trcqucnt (su¦)¸rajh is ca¦¦cd maximal it no sujcr¸rajh is trcqucnt
I maxima¦ ⇔ s
(
(S) ≥ s
min
∧ ∀R ⊃ S s
(
(R) < s
min
• O¦vious rc¦ations ¦ctwccn thcsc tyjcs ot (su¦)¸rajhs
◦ A¦¦ maxima¦ and a¦¦ c¦oscd (su¦)¸rajhs arc trcqucnt
◦ A¦¦ maxima¦ (su¦)¸rajhs arc c¦oscd
Christian Borgelt Frequent Pattern Mining 303
Searching for Frequent (Sub)Graphs
Christian Borgelt Frequent Pattern Mining 304
Partially Ordered Set of Subgraphs
Hasse diagram ranging from the empty graph to the database graphs.
• Thc su¦¸rajh (isomorjhism) rc¦ationshij dcﬁncs a jartia¦ ordcr on su¦¸rajhs
• Thc cmjty ¸rajh is (torma¦¦y) containcd in a¦¦ su¦¸rajhs
• Thcrc is usua¦¦y no (natura¦) uniquc ¦ar¸cst ¸rajh
cxamj¦c mo¦ccu¦cs
S C N C
O
O S C N
F
O S C N
O
*
F S O C N
F S O S S C C O C N
O S F F S C O S C S C N S C O O C N C N C
O S C
F F
S C N O S C
N
O S C
O
S C N
O
S C N
C O
C N C
O S C N
F
O S C N
O
S C N C
O
Christian Borgelt Frequent Pattern Mining 305
Frequent (Sub)Graphs
The frequent (sub)graphs form a partially ordered subset at the top.
• Thcrctorc thc jartia¦¦y ordcrcd sct shou¦d ¦c scarchcd tojdown
• Standard scarch stratc¸ics ¦rcadthﬁrst and dcjthﬁrst
• Lcjthﬁrst scarch is usua¦¦y jrctcra¦¦c, sincc thc scarch trcc can ¦c vcry widc
cxamj¦c mo¦ccu¦cs
S C N C
O
O S C N
F
O S C N
O
s
min
2
F
F S
O S F F S C C N C
O S C
F F
S C N O S C
O
S C N
C O
C N C
O S C N
F
O S C N
O
S C N C
O
*
S O C N
O S S C C O C N
O S C S C N S C O O C N
O S C
N
S C N
O
1
1
1 1 1
1 1 1 1 1
1 1 1
3
3
3
3
3
2 3 2 3
2 3 2 2
2 2
Christian Borgelt Frequent Pattern Mining 306
Closed and Maximal Frequent (Sub)Graphs
Partially ordered subset of frequent (sub)graphs.
• C¦oscd trcqucnt (su¦)¸rajhs arc cncirc¦cd
• Thcrc arc 1! trcqucnt (su¦)¸rajhs, ¦ut on¦y ! c¦oscd (su¦)¸rajhs
• Thc two c¦oscd (su¦)¸rajhs at thc ¦ottom arc a¦so maxima¦
cxamj¦c mo¦ccu¦cs
S C N C
O
O S C N
F
O S C N
O
*
S O C N
O S S C C O C N
O S C S C N S C O O C N
O S C N S C N
O
3
3 3 3
3
2 3 2 3
2
3
2 2
2 2
Christian Borgelt Frequent Pattern Mining 307
Basic Search Principle
• Grow (sub)graphs into the graphs of the given database.
◦ Start with a sin¸¦c vcrtcx (sccd vcrtcx)
◦ Add an cd¸c (and may¦c a vcrtcx) in cach stcj
◦ Lctcrminc thc sujjort and jrunc intrcqucnt (su¦)¸rajhs
• `ain jro¦¦cm A (sub)graph can be grown in several diﬀerent ways
S S C S C O S C N
O
O C O N C O S C N
O
C C N S C N S C N
O
C C N N C O S C N
O
ctc (S morc jossi¦i¦itics)
*
S O C N
S C C O C N
S C N S C O O C N
S C N
O
Christian Borgelt Frequent Pattern Mining 308
Reminder: Searching for Frequent Item Sets
• \c havc to scarch thc jartia¦¦y ordcrcd sct (2
B
, ⊆) , its Lassc dia¸ram
• Assi¸nin¸ uniquc jarcnts turns thc Lassc dia¸ram into a trcc
• Travcrsin¸ thc rcsu¦tin¸ trcc cxj¦orcs cach itcm sct cxact¦y oncc
Lassc dia¸ram and a jossi¦¦c trcc tor ﬁvc itcms
a b c d e
ab ac ad ae bc bd be cd ce de
abc abd abe acd ace ade bcd bce bde cde
abcd abce abde acde bcde
abcde
a b c d e
ab ac ad ae bc bd be cd ce de
abc abd abe acd ace ade bcd bce bde cde
abcd abce abde acde bcde
abcde
Christian Borgelt Frequent Pattern Mining 309
Searching for Frequent (Sub)Graphs
• \c havc to scarch thc jartia¦¦y ordcrcd sct ot (conncctcd) (su¦)¸rajhs
ran¸in¸ trom thc cmjty ¸rajh to thc data¦asc ¸rajhs
• Assi¸nin¸ uniquc jarcnts turns corrcsjondin¸ Lassc dia¸ram into a trcc
• Travcrsin¸ thc rcsu¦tin¸ trcc cxj¦orcs cach (su¦)¸rajh cxact¦y oncc
Su¦¸rajh Lassc dia¸ram and a jossi¦¦c trcc
*
F S O C N
F S O S S C C O C N
O S F F S C O S C S C N S C O O C N C N C
O S C
F F
S C N O S C
N
O S C
O
S C N
O
S C N
C O
C N C
O S C N
F
O S C N
O
S C N C
O
*
F S O C N
F S O S S C C O C N
O S F F S C O S C S C N S C O O C N C N C
O S C
F F
S C N O S C
N
O S C
O
S C N
O
S C N
C O
C N C
O S C N
F
O S C N
O
S C N C
O
Christian Borgelt Frequent Pattern Mining 310
Searching with Unique Parents
Principle of a Search Algorithm based on Unique Parents:
• Base Loop:
◦ Travcrsc a¦¦ jossi¦¦c vcrtcx attri¦utcs (thcir uniquc jarcnt is thc cmjty ¸rajh)
◦ lccursivc¦y jroccss a¦¦ vcrtcx attri¦utcs that arc trcqucnt
• Recursive Processing:
Ior a ¸ivcn trcqucnt (su¦)¸rajh S
◦ Gcncratc a¦¦ cxtcnsions R ot S ¦y an cd¸c or ¦y an cd¸c and a vcrtcx
(it thc vcrtcx is not yct in S) tor which S is thc choscn uniquc jarcnt
◦ Ior a¦¦ R it R is trcqucnt, jroccss R rccursivc¦y, othcrwisc discard R
• Questions:
◦ Low can wc torma¦¦y assi¸n uniquc jarcnts´
◦ (Low) Can wc makc surc that wc ¸cncratc on¦y thosc cxtcnsions
tor which thc (su¦)¸rajh that is cxtcndcd is thc choscn uniquc jarcnt´
Christian Borgelt Frequent Pattern Mining 311
Assigning Unique Parents
• Iorma¦¦y, thc sct ot a¦¦ possible parents ot a (conncctcd) (su¦)¸rajh S is
P(S) ¦R ∈ ((S) [ ,∃ U ∈ ((S) R ⊂ U ⊂ S¦.
ln othcr words, thc jossi¦¦c jarcnts ot S arc its maximal proper subgraphs
• Lach jossi¦¦c jarcnt contains cxact¦y one edge less than thc (su¦)¸rajh S
• lt wc can dcﬁnc an ordcr on thc cd¸cs ot thc (su¦)¸rajh S,
wc can casi¦y sin¸¦c out a uniquc jarcnt, thc canonical parent p
c
(S)
◦ Lct e
∗
¦c thc last edge in thc ordcr that is not a proper bridge
(ic cithcr a ¦cat ¦rid¸c or no ¦rid¸c)
◦ Thc canonical parent p
c
(S) is thc ¸rajh S without the edge e
∗
◦ lt e
∗
is a ¦cat ¦rid¸c, wc a¦so havc to rcmovc thc crcatcd iso¦atcd nodc
◦ lt e
∗
is thc on¦y cd¸c ot S, wc a¦so nccd an ordcr ot thc nodcs,
so that wc can dccidc which iso¦atcd nodc to rcmovc
◦ `otc it S is conncctcd, thcn p
c
(S) is conncctcd, as e
∗
is not a jrojcr ¦rid¸c
Christian Borgelt Frequent Pattern Mining 312
Assigning Unique Parents
• ln ordcr to dcﬁnc an ordcr ot thc cd¸cs ot a ¸ivcn (su¦)¸rajh,
wc wi¦¦ rc¦y on a canonical form ot (su¦)¸rajhs
• Canonica¦ torms tor ¸rajhs arc morc comj¦cx than canonica¦ torms tor itcm scts
(rcmindcr on ncxt s¦idc), ¦ccausc wc havc to codc thc conncction structurc
• A canonica¦ torm ot a (su¦)¸rajh is a sjccia¦ rcjrcscntation ot this (su¦)¸rajh
◦ Lach (su¦)¸rajh is dcscri¦cd ¦y a code word
◦ lt dcscri¦cs thc ¸rajh structurc and thc vcrtcx and cd¸c ¦a¦c¦s
(and thus imj¦icit¦y ordcrs thc cd¸cs and vcrticcs)
◦ Thc (su¦)¸rajh can ¦c rcconstructcd trom thc codc word
◦ Thcrc may ¦c mu¦tij¦c codc words that dcscri¦c thc samc (su¦)¸rajh
◦ Onc ot thc codc words is sin¸¦cd out as thc canonical code word
• Thcrc arc two main jrincij¦cs tor canonica¦ torms ot ¸rajhs
◦ spanning trees and ◦ adjacency matrices
Christian Borgelt Frequent Pattern Mining 313
Support Counting
Subgraph Isomorphism Tests
• Gcncratc cxtcnsions ¦ascd on ¸¦o¦a¦ intormation a¦out cd¸cs
◦ Co¦¦cct trij¦cs ot sourcc nodc ¦a¦c¦, cd¸c ¦a¦c¦, and dcstination nodc ¦a¦c¦
◦ Travcrsc thc (cxtcnda¦¦c) nodcs ot a ¸ivcn tra¸mcnt
and attach cd¸cs ¦ascd on thc co¦¦cctcd trij¦cs
• Travcrsc data¦asc ¸rajhs and tcst whcthcr ¸cncratcd cxtcnsion occurs
(Thc data¦asc ¸rajhs may ¦c rcstrictcd to thosc containin¸ thc jarcnt)
Maintain List of Occurrences
• Iind and rccord a¦¦ occurrcnccs ot sin¸¦c nodc ¸rajhs
• Chcck data¦asc ¸rajhs tor cxtcnsions ot known occurrcnccs
This immcdiatc¦y yic¦ds thc occurrcnccs ot thc cxtcndcd tra¸mcnts
• Lisadvanta¸c considcra¦¦c mcmory is nccdcd tor storin¸ thc occurrcnccs
• Advanta¸c tcwcr cxtcndcd tra¸mcnts and tastcr sujjort countin¸
Christian Borgelt Frequent Pattern Mining 314
Canonical Forms of Graphs
Christian Borgelt Frequent Pattern Mining 315
Reminder: Canonical Form for Item Sets
• An itcm sct is rcjrcscntcd ¦y a code word. cach ¦cttcr rcjrcscnts an itcm
Thc codc word is a word ovcr thc a¦jha¦ct A, thc sct ot a¦¦ itcms
• Thcrc arc k' jossi¦¦c codc words tor an itcm sct ot sizc k,
¦ccausc thc itcms may ¦c ¦istcd in any ordcr
• Ly introducin¸ an (ar¦itrary, ¦ut ﬁxcd) order of the items,
and ¦y comjarin¸ codc words ¦cxico¸rajhica¦¦y,
wc can dcﬁnc an ordcr on thcsc codc words
Lxamj¦c abc < bac < bca < cab tor thc itcm sct ¦a, b, c¦ and a < b < c
• Thc ¦cxico¸rajhica¦¦y sma¦¦cst codc word tor an itcm sct
is thc canonical code word
O¦vious¦y thc canonica¦ codc word ¦ists thc itcms in thc choscn, ﬁxcd ordcr
ln jrincij¦c, thc samc ¸cncra¦ idca can ¦c uscd tor ¸rajhs
Lowcvcr, a ¸¦o¦a¦ ordcr on thc vcrtcx and cd¸c attri¦utcs is not cnou¸h
Christian Borgelt Frequent Pattern Mining 316
Canonical Forms of Graphs: General Idea
• Construct a code word that uniquc¦y idcntiﬁcs an (attri¦utcd or ¦a¦c¦cd) ¸rajh
uj to automorjhisms (that is, symmctrics)
• Basic idea: Thc charactcrs ot thc codc word dcscri¦c thc cd¸cs ot thc ¸rajh
• Core problem: \crtcx and cd¸c attri¦utcs can casi¦y ¦c incorjoratcd into
a codc word, ¦ut how to dcscri¦c thc conncction structurc is not so o¦vious
• Thc vcrticcs ot thc ¸rajh must ¦c num¦crcd (cndowcd with uniquc ¦a¦c¦s),
¦ccausc wc nccd to sjccity thc vcrticcs that arc incidcnt to an cd¸c
(`otc vcrtcx ¦a¦c¦s nccd not ¦c uniquc — scvcra¦ nodcs may havc thc samc ¦a¦c¦)
• Lach jossi¦¦c num¦crin¸ ot thc vcrticcs ot thc ¸rajh yic¦ds a codc word,
which is thc concatcnation ot thc (sortcd) cd¸c dcscrijtions (“charactcrs”)
(`otc that thc ¸rajh can ¦c rcconstructcd trom such a codc word)
• Thc rcsu¦tin¸ ¦ist ot codc words is sortcd ¦cxico¸rajhica¦¦y
• Thc ¦cxico¸rajhica¦¦y sma¦¦cst codc word is thc canonical code word
(A¦tcrnativc¦y, onc may choosc thc ¦cxico¸rajhica¦¦y ¸rcatcst codc word)
Christian Borgelt Frequent Pattern Mining 317
Searching with Canonical Forms
• Lct S ¦c a (su¦)¸rajh and w
c
(S) its canonica¦ codc word
Lct e
∗
(S) ¦c thc ¦ast cd¸c in thc cd¸c ordcr induccd ¦y w
c
(S)
(ic thc ordcr in which thc cd¸cs arc dcscri¦cd) that is not a jrojcr ¦rid¸c
• General Recursive Processing with Canonical Forms:
Ior a ¸ivcn trcqucnt (su¦)¸rajh S
◦ Gcncratc a¦¦ cxtcnsions R ot S ¦y a sin¸¦c cd¸c or an cd¸c and a vcrtcx
(it onc vcrtcx incidcnt to thc cd¸c is not yct jart ot S)
◦ Iorm thc canonica¦ codc word w
c
(R) ot cach cxtcndcd (su¦)¸rajh R
◦ lt thc cd¸c e
∗
(R) as induccd ¦y w
c
(R) is thc cd¸c addcd to S to torm R
and R is trcqucnt, jroccss R rccursivc¦y, othcrwisc discard R
• Questions:
◦ Low can wc torma¦¦y dcﬁnc canonica¦ codc words´
◦ Lo wc havc to ¸cncratc a¦¦ jossi¦¦c cxtcnsions ot a trcqucnt (su¦)¸rajh´
Christian Borgelt Frequent Pattern Mining 318
Canonical Forms: Preﬁx Property
• Sujjosc thc canonica¦ torm josscsscs thc preﬁx property
Every preﬁx of a canonical code word is a canonical code word itself
⇒ Thc cd¸c e
∗
is a¦ways thc ¦ast dcscri¦cd cd¸c
⇒ Thc ¦on¸cst jrojcr jrcﬁx ot thc canonica¦ codc word ot a (su¦)¸rajh S
not on¦y dcscri¦cs thc canonica¦ jarcnt ot S, ¦ut is its canonica¦ codc word
• Thc ¸cncra¦ rccursivc jroccssin¸ schcmc with canonica¦ torms rcquircs
to construct thc canonical code word ot cach crcatcd (su¦)¸rajh
in ordcr to dccidc whcthcr it has to ¦c jroccsscd rccursivc¦y or not
⇒ \c know thc canonica¦ codc word ot any (su¦)¸rajh that is jroccsscd
• \ith this codc word wc know, duc to thc preﬁx property, thc canonica¦
codc words ot a¦¦ chi¦d (su¦)¸rajhs that havc ¦c cxj¦orcd in thc rccursion
with the exception of the last letter (that is, thc dcscrijtion ot thc addcd cd¸c)
⇒ \c on¦y havc to chcck whcthcr thc codc word that rcsu¦ts trom ajjcndin¸
thc dcscrijtion ot thc addcd cd¸c to thc ¸ivcn canonica¦ codc word is canonica¦
Christian Borgelt Frequent Pattern Mining 319
Searching with the Preﬁx Property
Principle of a Search Algorithm based on the Preﬁx Property:
• Base Loop:
◦ Travcrsc a¦¦ jossi¦¦c vcrtcx attri¦utcs, that is,
thc canonica¦ codc words ot sin¸¦c vcrtcx (su¦)¸rajhs
◦ lccursivc¦y jroccss cach codc word that dcscri¦cs a trcqucnt (su¦)¸rajh
• Recursive Processing:
Ior a ¸ivcn (canonica¦) codc word ot a trcqucnt (su¦)¸rajh
◦ Gcncratc a¦¦ jossi¦¦c cxtcnsions ¦y an cd¸c (and a may¦c a vcrtcx)
This is donc ¦y ajjcndin¸ thc cd¸c dcscrijtion to thc codc word
◦ Chcck whcthcr thc cxtcndcd codc word is thc canonical code word
ot thc (su¦)¸rajh dcscri¦cd ¦y thc cxtcndcd codc word
(and, ot coursc, whcthcr thc dcscri¦cd (su¦)¸rajh is trcqucnt)
lt it is, jroccss thc cxtcndcd codc word rccursivc¦y, othcrwisc discard it
Christian Borgelt Frequent Pattern Mining 320
The Preﬁx Property
• Advantages of the Preﬁx Property:
◦ Tcstin¸ whcthcr a ¸ivcn codc word is canonica¦ can ¦c simj¦cr,tastcr
than constructin¸ a canonica¦ codc word trom scratch
◦ Thc jrcﬁx jrojcrty usua¦¦y a¦¦ows us to casi¦y ﬁnd simj¦c ru¦cs
to restrict the extensions that nccd to ¦c ¸cncratcd
• Disadvantages of the Preﬁx Property:
◦ Onc has rcduccd trccdom in thc dcﬁnition ot a canonica¦ torm
This can makc it imjossi¦¦c to cxj¦oit ccrtain jrojcrtics ot a ¸rajh
that can hc¦j to construct a canonica¦ torm quick¦y
• ln thc to¦¦owin¸ wc considcr main¦y canonica¦ torms havin¸ thc jrcﬁx jrojcrty
• Lowcvcr, it wi¦¦ ¦c discusscd ¦atcr how additiona¦ ¸rajh jrojcrtics
can ¦c cxj¦oitcd to imjrovc thc construction ot a canonica¦ torm
it thc jrcﬁx jrojcrty is not madc a rcquircmcnt
Christian Borgelt Frequent Pattern Mining 321
Canonical Forms based on Spanning Trees
Christian Borgelt Frequent Pattern Mining 322
Spanning Trees
• A (¦a¦c¦cd) ¸rajh G is ca¦¦cd a tree iﬀ tor any jair ot vcrticcs in G
thcrc cxists exactly one path conncctin¸ thcm in G
• A spanning tree ot a (¦a¦c¦cd) conncctcd ¸rajh G is a su¦¸rajh S ot G that
◦ is a trcc and
◦ comjriscs a¦¦ vcrticcs ot G (that is, V
S
V
G
)
Lxamj¦cs ot sjannin¸ trccs
O
F N
N
O
O
O
F N
N
O
O
O
F N
N
O
O
O
F N
N
O
O
O
F N
N
O
O
• Thcrc arc 1 9 + ` ! o ` −1 29 jossi¦¦c sjannin¸ trccs tor this cxamj¦c,
¦ccausc ¦oth rin¸s havc to ¦c cut ojcn
Christian Borgelt Frequent Pattern Mining 323
Canonical Forms based on Spanning Trees
• A code word dcscri¦in¸ a ¸rajh can ¦c constructcd ¦y
◦ systcmatica¦¦y constructin¸ a spanning tree ot thc ¸rajh,
◦ numbering the vertices in thc ordcr in which thcy arc visitcd,
◦ dcscri¦in¸ cach cd¸c ¦y thc num¦crs ot thc vcrticcs it connccts,
thc cd¸c ¦a¦c¦, and thc ¦a¦c¦s ot thc incidcnt vcrticcs, and
◦ ¦istin¸ thc cd¸c dcscrijtions in thc ordcr in which thc cd¸cs arc visitcd
(Ld¸cs c¦osin¸ cyc¦cs may nccd sjccia¦ trcatmcnt)
• Thc most common ways ot constructin¸ a sjannin¸ trcc arc
◦ depthﬁrst search ⇒ ¸Sjan Yan and Lan 2002
◦ breadthﬁrst search ⇒ `oSS,`oIa Lor¸c¦t and Lcrtho¦d 2002
An a¦tcrnativc way is to visit a¦¦ chi¦drcn ot a vcrtcx ¦ctorc jrocccdin¸
in a dcjthﬁrst manncr (can ¦c sccn as a variant ot dcjthﬁrst scarch)
Othcr systcmatic scarch schcmcs arc, in jrincij¦c, a¦so ajj¦ica¦¦c
Christian Borgelt Frequent Pattern Mining 324
Canonical Forms based on Spanning Trees
• Lach startin¸ joint (choicc ot a root) and cach way to ¦ui¦d a sjannin¸ trcc
systcmatica¦¦y trom a ¸ivcn startin¸ joint yic¦ds a diﬀcrcnt codc word
O
F N
N
O
O
O
F N
N
O
O
O
F N
N
O
O
O
F N
N
O
O
O
F N
N
O
O
Thcrc arc 12 jossi¦¦c startin¸ joints and scvcra¦ ¦ranchin¸ joints
As a conscqucncc, thcrc arc scvcra¦ hundrcd jossi¦¦c codc words
• Thc ¦cxico¸rajhica¦¦y sma¦¦cst codc word is thc canonical code word
• Sincc thc cd¸cs arc ¦istcd in thc ordcr in which thcy arc visitcd durin¸ thc
sjannin¸ trcc construction, this canonica¦ torm has thc preﬁx property
lt a jrcﬁx ot a canonica¦ codc word wcrc not canonica¦, thcrc wou¦d ¦c
a startin¸ joint and a sjannin¸ trcc that yic¦d a sma¦¦cr codc word
(¹sc thc canonica¦ codc word ot thc jrcﬁx ¸rajh and ajjcnd thc missin¸ cd¸c)
Christian Borgelt Frequent Pattern Mining 325
Canonical Forms based on Spanning Trees
• An edge description consists ot
◦ thc indiccs ot thc sourcc and thc dcstination vcrtcx
(dcﬁnition thc sourcc ot an cd¸c is thc vcrtcx with thc sma¦¦cr indcx),
◦ thc attri¦utcs ot thc sourcc and thc dcstination vcrtcx,
◦ thc cd¸c attri¦utc
• Listin¸ thc cd¸cs in thc ordcr in which thcy arc visitcd can ottcn ¦c charactcrizcd
¦y a precedence order on thc dcscri¦in¸ c¦cmcnts ot an cd¸c
• Ordcr ot individua¦ c¦cmcnts (con,ccturcs, ¦ut sujjortcd ¦y cxjcrimcnts)
◦ \crtcx and cd¸c attri¦utcs shou¦d ¦c sortcd accordin¸ to thcir trcqucncy
◦ Asccndin¸ ordcr sccms to ¦c rccommcnda¦¦c tor thc vcrtcx attri¦utcs
• Simpliﬁcation: Thc sourcc attri¦utc is nccdcd on¦y tor thc ﬁrst cd¸c
and thus can ¦c sj¦it oﬀ trom thc ¦ist ot cd¸c dcscrijtions
Christian Borgelt Frequent Pattern Mining 326
Canonical Forms: Edge Sorting Criteria
• Precedence Order for Depthﬁrst Search:
◦ dcstination vcrtcx indcx (asccndin¸)
◦ sourcc vcrtcx indcx (dcsccndin¸) ←
◦ cd¸c attri¦utc (asccndin¸)
◦ dcstination vcrtcx attri¦utc (asccndin¸)
• Precedence Order for Breadthﬁrst Search:
◦ sourcc vcrtcx indcx (asccndin¸)
◦ cd¸c attri¦utc (asccndin¸)
◦ dcstination vcrtcx attri¦utc (asccndin¸)
◦ dcstination vcrtcx indcx (asccndin¸)
• Edges Closing Cycles:
Ld¸cs c¦osin¸ cyc¦cs may ¦c distin¸uishcd trom sjannin¸ trcc cd¸cs,
¸ivin¸ sjannin¸ trcc cd¸cs a¦so¦utc jrcccdcncc ovcr cd¸cs c¦osin¸ cyc¦cs
A¦tcrnativc Sort ¦ctwccn thc othcr cd¸cs ¦ascd on thc jrcccdcncc ru¦cs
Christian Borgelt Frequent Pattern Mining 327
Canonical Forms: Code Words
Irom thc dcscri¦cd jroccdurc thc to¦¦owin¸ codc words rcsu¦t
(rc¸u¦ar cxjrcssions with nontcrmina¦ sym¦o¦s)
• DepthFirst Search: a (i
d
i
s
b a)
m
• BreadthFirst Search: a (i
s
b a i
d
)
m
(or a (i
s
i
d
b a)
m
)
whcrc n thc num¦cr ot vcrticcs ot thc ¸rajh,
m thc num¦cr ot cd¸cs ot thc ¸rajh,
i
s
indcx ot thc sourcc vcrtcx ot an cd¸c, i
s
∈ ¦0, . . . , n −1¦,
i
d
indcx ot thc dcstination vcrtcx ot an cd¸c, i
d
∈ ¦0, . . . , n −1¦,
a thc attri¦utc ot a vcrtcx,
b thc attri¦utc ot an cd¸c
Thc ordcr ot thc c¦cmcnts dcscri¦in¸ an cd¸c rcﬂccts thc jrcccdcncc ordcr
That i
s
in thc dcjthﬁrst scarch cxjrcssion is undcr¦incd is mcant as a rcmindcr
that thc cd¸c dcscrijtions havc to ¦c sortcd dcsccndin¸¦y wrt this va¦uc
Christian Borgelt Frequent Pattern Mining 328
Canonical Forms: A Simple Example
O
N
S
O
O
cxamj¦c
mo¦ccu¦c
dcjthﬁrst
A
S
0
N
1
C
3
C
7
C
8
O
2
C
4
O
5
O
6
¦rcadthﬁrst
L
C
6
O
7
O
8
S
0
N
1
C
2
O
3
C
4
C
5
Order of Elements: S ≺ N ≺ O ≺ C Order of Bonds: ≺
Code Words:
A S 10N 21O 31C 43C 54O 64=O 73C 87C 80C
L S 0N1 0C2 1O3 1C4 2C5 4C5 4C6 6O7 6=O8
(lcmindcr in A thc cd¸cs arc sortcd descendingly wrt thc sccond cntry)
Christian Borgelt Frequent Pattern Mining 329
Checking for Canonical Form: Compare Preﬁxes
• Base Loop:
◦ Travcrsc a¦¦ vcrticcs with a ¦a¦c¦ no ¦css than thc currcnt root vcrtcx
(ﬁrst charactcr ot thc codc word. jossi¦¦c roots ot sjannin¸ trccs)
• Recursive Processing:
◦ Thc rccursivc jroccssin¸ constructs a¦tcrnativc sjannin¸ trccs and
comjarc thc codc words rcsu¦tin¸ trom it with thc codc word to chcck
◦ ln cach rccursion stcj onc cd¸c is addcd to thc sjannin¸ trcc and its dcscrijtion
is comjarcd to thc corrcsjondin¸ onc in thc codc word to chcck
◦ lt thc ncw cd¸c dcscrijtion is larger, thc cd¸c can ¦c skijjcd
(ncw codc word is ¦cxico¸rajhica¦¦y ¦ar¸cr)
◦ lt thc ncw cd¸c dcscrijtion is smaller, thc codc word is not canonica¦
(ncw codc word is ¦cxico¸rajhica¦¦y sma¦¦cr)
◦ lt thc ncw cd¸c dcscrijtion is equal, thc rcst ot thc codc word
is jroccsscd rccursivc¦y (codc word jrcﬁxcs arc cqua¦)
Christian Borgelt Frequent Pattern Mining 330
Checking for Canonical Form
function isCanonica¦ (w array ot int, G ¸rajh) ¦oo¦can.
var v vcrtcx. (∗ to travcrsc thc vcrticcs ot thc ¸rajh ∗)
e cd¸c. (∗ to travcrsc thc cd¸cs ot thc ¸rajh ∗)
x array ot vcrtcx. (∗ to co¦¦cct thc num¦crcd vcrticcs ∗)
begin
forall v ∈ G.V do v.i −1. (∗ c¦car thc vcrtcx indiccs ∗)
forall e ∈ G.E do e.i −1. (∗ c¦car thc cd¸c markcrs ∗)
forall v ∈ G.V do begin (∗ travcrsc thc jotcntia¦ root vcrticcs ∗)
if v.a < w0 then return ta¦sc. (∗ it v has a sma¦¦cr ¦a¦c¦, a¦ort ∗)
if v.a w0 then begin (∗ it v has thc samc ¦a¦c¦, chcck rcst ∗)
v.i 0. x0 v. (∗ num¦cr and rccord thc root vcrtcx ∗)
if not rcc(w, 1, x, 1, 0) (∗ chcck thc codc word rccursivc¦y and ∗)
then return ta¦sc. (∗ a¦ort it a sma¦¦cr codc word is tound ∗)
v.i −1. (∗ c¦car thc vcrtcx indcx a¸ain ∗)
end.
end.
return truc. (∗ thc codc word is canonica¦ ∗)
end (∗ isCanonica¦ ∗) (∗ tor a ¦rcadthﬁrst scarch sjannin¸ trcc ∗)
Christian Borgelt Frequent Pattern Mining 331
Checking for Canonical Form
function rcc (w array ot int, k int, x array ot vcrtcx, n int, i int) ¦oo¦can.
(∗ w codc word to ¦c tcstcd ∗)
(∗ k currcnt josition in codc word ∗)
(∗ x array ot a¦rcady ¦a¦c¦cd,num¦crcd vcrticcs ∗)
(∗ n num¦cr ot ¦a¦c¦cd,num¦crcd vcrticcs ∗)
(∗ i indcx ot ncxt cxtcnda¦¦c vcrtcx to chcck. i < n ∗)
var d vcrtcx. (∗ vcrtcx at thc othcr cnd ot an cd¸c ∗)
j int. (∗ indcx ot dcstination vcrtcx ∗)
u ¦oo¦can. (∗ ﬂa¸ tor unnum¦crcd dcstination vcrtcx ∗)
r ¦oo¦can. (∗ ¦uﬀcr tor a rccursion rcsu¦t ∗)
begin
if k ≥ ¦cn¸th(w) return truc. (∗ tu¦¦ codc word has ¦ccn ¸cncratcd ∗)
while i < wk do begin (∗ chcck whcthcr thcrc is an cd¸c with ∗)
forall e incidcnt to xi do (∗ a sourcc vcrtcx havin¸ a sma¦¦cr indcx ∗)
if e.i < 0 then return ta¦sc.
i i + 1. (∗ it thcrc is an unmarkcd cd¸c, a¦ort, ∗)
end. (∗ othcrwisc ¸o to thc ncxt vcrtcx ∗)
Christian Borgelt Frequent Pattern Mining 332
Checking for Canonical Form
forall e incidcnt to xi (in sortcd ordcr) do begin
if e.i < 0 then begin (∗ travcrsc thc unvisitcd incidcnt cd¸cs ∗)
if e.a < wk + 1 then return ta¦sc. (∗ chcck thc ∗)
if e.a > wk + 1 then return truc. (∗ cd¸c attri¦utc ∗)
d vcrtcx incidcnt to e othcr than xi.
if d.a < wk + 2 then return ta¦sc. (∗ chcck dcstination ∗)
if d.a > wk + 2 then return truc. (∗ vcrtcx attri¦utc ∗)
if d.i < 0 then j n else j d.i.
if j < wk + 3 then return ta¦sc. (∗ chcck dcstination vcrtcx indcx ∗)
 (∗ chcck rcst ot codc word rccursivc¦y, ∗)
(∗ ¦ccausc jrcﬁxcs arc cqua¦ ∗)
end.
end.
return truc. (∗ rcturn that no sma¦¦cr codc word ∗)
end (∗ rcc ∗) (∗ than w cou¦d ¦c tound ∗)
Christian Borgelt Frequent Pattern Mining 333
Checking for Canonical Form
forall e incidcnt to xi (in sortcd ordcr) do begin
if e.i < 0 then begin (∗ travcrsc thc unvisitcd incidcnt cd¸cs ∗)
 (∗ chcck thc currcnt cd¸c ∗)
if j wk + 3 then begin (∗ it cd¸c dcscrijtions arc cqua¦ ∗)
e.i 1. u d.i < 0. (∗ mark cd¸c and num¦cr vcrtcx ∗)
if u then begin d.i j. xn d. n n + 1. end
r rcc(w, k + !, x, n, i). (∗ chcck rccursivc¦y ∗)
if u then begin d.i −1. n n −1. end
e.i −1. (∗ unmark cd¸c (and vcrtcx) a¸ain ∗)
if not r then return ta¦sc.
end. (∗ cva¦uatc thc rccursion rcsu¦t ∗)
end. (∗ a¦ort it a sma¦¦cr codc word was tound ∗)
end.
return truc. (∗ rcturn that no sma¦¦cr codc word ∗)
end (∗ rcc ∗) (∗ than w cou¦d ¦c tound ∗)
Christian Borgelt Frequent Pattern Mining 334
Restricted Extensions
Christian Borgelt Frequent Pattern Mining 335
Canonical Forms: Restricted Extensions
Principle of the Search Algorithm up to now:
• Gcncratc a¦¦ jossi¦¦c cxtcnsions ot a ¸ivcn canonica¦ codc word
¦y thc dcscrijtion ot an cd¸c that cxtcnds thc dcscri¦cd (su¦)¸rajh
• Chcck whcthcr thc cxtcndcd codc word is canonica¦ (and thc (su¦)¸rajh trcqucnt)
lt it is, jroccss thc cxtcndcd codc word rccursivc¦y, othcrwisc discard it
Straightforward Improvement:
• Ior somc cxtcnsions ot a ¸ivcn canonica¦ codc word it is casy to scc
that thcy wi¦¦ not ¦c canonica¦ thcmsc¦vcs
• Thc trick is to chcck whcthcr a sjannin¸ trcc rooted at the same vertex
yic¦ds a codc word that is sma¦¦cr than thc crcatcd cxtcndcd codc word
• This immcdiatc¦y ru¦cs out cd¸cs attachcd to ccrtain vcrticcs in thc (su¦)¸rajh
(on¦y ccrtain vcrticcs arc extendable, that is, can ¦c incidcnt to a ncw cd¸c)
as wc¦¦ as ccrtain cd¸cs c¦osin¸ cyc¦cs
Christian Borgelt Frequent Pattern Mining 336
Canonical Forms: Restricted Extensions
DepthFirst Search: Rightmost Path Extension
• Extendable Vertices:
◦ On¦y vcrticcs on thc rightmost path ot thc sjannin¸ trcc may ¦c cxtcndcd
◦ lt thc sourcc vcrtcx ot thc ncw cd¸c is not a ¦cat, thc cd¸c dcscrijtion
must not jrcccdc thc dcscrijtion ot thc downward cd¸c on thc jath
(That is, thc cd¸c attri¦utc must ¦c no ¦css than thc cd¸c attri¦utc ot thc
downward cd¸c, and it it is cqua¦, thc attri¦utc ot its dcstination vcrtcx must
¦c no ¦css than thc attri¦utc ot thc downward cd¸c’s dcstination vcrtcx)
• Edges Closing Cycles:
◦ Ld¸cs c¦osin¸ cyc¦cs must start at an cxtcnda¦¦c vcrtcx
◦ Thcy must ¦cad to thc ri¸htmost ¦cat (vcrtcx at cnd ot ri¸htmost jath)
◦ Thc indcx ot thc sourcc vcrtcx must jrcccdc thc indcx ot thc sourcc vcrtcx
ot any cd¸c a¦rcady incidcnt to thc ri¸htmost ¦cat
Christian Borgelt Frequent Pattern Mining 337
Canonical Forms: Restricted Extensions
BreadthFirst Search: Maximum Source Extension
• Extendable Vertices:
◦ On¦y vcrticcs havin¸ an indcx no ¦css than thc maximum source index
ot cd¸cs that arc a¦rcady in thc (su¦)¸rajh may ¦c cxtcndcd
◦ lt thc sourcc ot thc ncw cd¸c is thc onc havin¸ thc maximum sourcc indcx,
it may ¦c cxtcndcd on¦y ¦y cd¸cs whosc dcscrijtions do not jrcccdc
thc dcscrijtion ot any downward cd¸c a¦rcady incidcnt to this vcrtcx
(That is, thc cd¸c attri¦utc must ¦c no ¦css, and it it is cqua¦,
thc attri¦utc ot thc dcstination vcrtcx must ¦c no ¦css)
• Edges Closing Cycles:
◦ Ld¸cs c¦osin¸ cyc¦cs must start at an cxtcnda¦¦c vcrtcx
◦ Thcy must ¦cad “torward”,
that is, to a vcrtcx havin¸ a ¦ar¸cr indcx than thc cxtcndcd vcrtcx
Christian Borgelt Frequent Pattern Mining 338
Restricted Extensions: A Simple Example
O
N
S
O
O
cxamj¦c
mo¦ccu¦c
dcjthﬁrst
A
S
0
N
1
C
3
C
7
C
8
O
2
C
4
O
5
O
6
¦rcadthﬁrst
L
C
6
O
7
O
8
S
0
N
1
C
2
O
3
C
4
C
5
Extendable Vertices:
A vcrticcs on thc ri¸htmost jath, that is, 0, 1, 3, ¨, S
L vcrticcs with an indcx no sma¦¦cr than thc maximum sourcc, that is, o, ¨, S
Edges Closing Cycles:
A nonc, ¦ccausc thc cxistin¸ cyc¦c cd¸c has thc sma¦¦cst jossi¦¦c sourcc
L thc cd¸c ¦ctwccn thc vcrticcs ¨ and S
Christian Borgelt Frequent Pattern Mining 339
Restricted Extensions: A Simple Example
O
N
S
O
O
cxamj¦c
mo¦ccu¦c
dcjthﬁrst
A
S
0
N
1
C
3
C
7
C
8
O
2
C
4
O
5
O
6
¦rcadthﬁrst
L
C
6
O
7
O
8
S
0
N
1
C
2
O
3
C
4
C
5
lt othcr vcrticcs arc cxtcndcd, a trcc with the same root yic¦ds a sma¦¦cr codc word
Example: attach a sin¸¦c ¦ond to a car¦on atom at thc ¦cttmost oxy¸cn atom
A S 10N 21O 31C 43C 54O 64=O 73C 87C 80C 92C
S 10N 21O 32C
L S 0N1 0C2 1O3 1C4 2C5 4C5 4C6 6O7 6=O8 3C9
S 0N1 0C2 1O3 1C4 2C5 3C6
Christian Borgelt Frequent Pattern Mining 340
Canonical Forms: Restricted Extensions
• Thc ru¦cs undcr¦yin¸ rcstrictcd cxtcnsions jrovidc a oncsidcd answcr
to thc qucstions whcthcr an cxtcnsion yic¦ds a canonica¦ codc word
• Depthﬁrst search canonical form
◦ lt thc cxtcnsion cd¸c is not a ri¸htmost jath cxtcnsion,
thcn thc rcsu¦tin¸ codc word is certainly not canonica¦
◦ lt thc cxtcnsion cd¸c is a ri¸htmost jath cxtcnsion,
thcn thc rcsu¦tin¸ codc word may or may not be canonica¦
• Breadthﬁrst search canonical form
◦ lt thc cxtcnsion cd¸c is not a maximum sourcc cxtcnsion,
thcn thc rcsu¦tin¸ codc word is certainly not canonica¦
◦ lt thc cxtcnsion cd¸c is a maximum sourcc cxtcnsion,
thcn thc rcsu¦tin¸ codc word may or may not be canonica¦
• As a conscqucncc, a canonical form test is sti¦¦ ncccssary
Christian Borgelt Frequent Pattern Mining 341
Example Search Tree
• Start with a sin¸¦c vcrtcx (sccd vcrtcx)
• Add an cd¸c (and may¦c a vcrtcx) in cach stcj (restricted extensions)
• Lctcrminc thc sujjort and jrunc intrcqucnt (su¦)¸rajhs
• Chcck tor canonica¦ torm and jrunc (su¦)¸rajhs with noncanonica¦ codc words
cxamj¦c mo¦ccu¦cs
S C N C
O
O S C N
F
O S C N
O
scarch trcc tor sccd S
S F
O S C
O
S C N C
O S C N
O
S C N C
O
S
S C S O
O S C S C N S C O
O S C N S C N
O
3
1
3
2
2 3
2
2 1
2
1
1 1
S ≺ F ≺ N ≺ C ≺ O  ≺ =
¦rcadthﬁrst scarch canonica¦ torm
Christian Borgelt Frequent Pattern Mining 342
Searching without a Seed Atom
*
S N O C
S C N C O C O C C C
S C C N C C O C C O C O O C C O C O C C C
S C C C S C C N
S C C C
N
S C C C O S C C C O
S C C C O
O
12 7 5
3
c¦ycin
N C
C
C
O
O
O
cystcin
N C
C
C
O
O
S
scrin
N C
C
C
O
O
O
¦rcadthﬁrst scarch canonica¦ torm S ≺ N ≺ O ≺ C  ≺ =
• Chcmica¦ c¦cmcnts jroccsscd on thc ¦ctt arc cxc¦udcd on thc ri¸ht
Christian Borgelt Frequent Pattern Mining 343
Comparison of Canonical Forms
(dcjthﬁrst vcrsus ¦rcadthﬁrst sjannin¸ trcc construction)
Christian Borgelt Frequent Pattern Mining 344
Canonical Forms: Comparison
DepthFirst vs. BreadthFirst Search Canonical Form
• \ith ¦rcadthﬁrst scarch canonica¦ torm thc cxtcnda¦¦c vcrticcs
arc much casicr to travcrsc, as thcy a¦ways havc consccutivc indiccs
Onc on¦y has to storc and ujdatc onc num¦cr, namc¦y thc indcx
ot thc maximum cd¸c sourcc, to dcscri¦c thc vcrtcx ran¸c
• A¦so thc chcck tor canonica¦ torm is s¦i¸ht¦y morc comj¦cx (to jro¸ram)
tor dcjthﬁrst scarch canonica¦ torm
• Thc two canonica¦ torms o¦vious¦y ¦cad to diﬀcrcnt ¦ranchin¸ tactors,
widths and dcjths ot thc scarch trcc
Lowcvcr, it is not immcdiatc¦y c¦car, which torm ¦cads to thc “¦cttcr”
(morc cﬃcicnt) structurc ot thc scarch trcc
• Thc cxjcrimcnta¦ rcsu¦ts rcjortcd in thc to¦¦owin¸ indicatc that it may dcjcnd
on thc data sct which canonica¦ torm jcrtorms ¦cttcr
Christian Borgelt Frequent Pattern Mining 345
Advantage for Maximum Source Extensions
Gcncratc a¦¦ su¦structurcs
(that contain nitro¸cn)
ot thc cxamj¦c mo¦ccu¦c
O
C
N
C
C
C O
lro¦¦cm Thc two ¦ranchcs cmanatin¸
trom thc nitro¸cn atom start idcntica¦¦y
Thus ri¸htmost jath cxtcnsions try
thc ri¸ht ¦ranch ovcr and ovcr a¸ain
Search Trees with N ≺ O ≺ C
`aximum Sourcc Lxtcnsion
li¸htmost lath Lxtcnsion
C
N
C
O
C
N
C
C
C
C
N
C
O
N
C
N
C
N
C
O
C
N
C
C
N
O
C
N
C
C
C
N
C
O
C
C
N
C
C
C
N
O
C
N
C
C
O
C
C
N
C
C
C
C
N
C
O
C
N
C
C
O
O
C
N
C
C
C O
C
C
N
C
C
O
C
N
C
C
C O
C
N
C
O
C
N
C
C
C
C
N
C
O
O
C
C
N
C
O
C
C
C
N
C
O
O
C
C
N
C
O
C
N
C
N
O
C
N
C
C
N
C
N
C
O
C
N
C
O
C
C
N
C
C
C
N
C
C
N
C
O
C
N
C
C
O
C
C
N
C O
C
C
N
C
C
C
C
N
C
O
C
N
C
C
O
O
C
N
C
C
C O
C
C
N
C
C
O
C
N
C
C
C O noncanonica¦ 3 noncanonica¦ o
Christian Borgelt Frequent Pattern Mining 346
Advantage for Rightmost Path Extensions
Gcncratc a¦¦ su¦structurcs
(that contain nitro¸cn)
ot thc cxamj¦c mo¦ccu¦c
(N ≺ C)
N
C
C
C
C
lro¦¦cm Thc rin¸ ot car¦on atoms
can ¦c c¦oscd ¦ctwccn any two ¦ranchcs
(thrcc ways ot ¦ui¦din¸ thc tra¸mcnt,
on¦y onc ot which is canonica¦)
Search Trees with N ≺ C
`aximum Sourcc Lxtcnsion
li¸htmost lath Lxtcnsion
N
C
C C
C
N
C
C
C
C
3
5
4
N
C
C
C
C
5
4
3
N
N
C
N
C
C
N
C
C C
N
C
C
C
N
C
C
C
C
N
C
C
C
N
C
C
C
C
N
C
C
C
C
3
4
5
N
C
C C
C
N
N
C
N
C
C
N
C
C
C
N
C
C C
N
C
C
C
N
C
C
C
C
N
C
C
C
C
N
C
C
C
C
5 noncanonica¦ 3 noncanonica¦ 1
Christian Borgelt Frequent Pattern Mining 347
Experiments: Data Sets
• Index Chemicus — Subset of 1993
◦ 1293 mo¦ccu¦cs , 3!!31 atoms , 3o`9! ¦onds
◦ Ircqucnt tra¸mcnts down to tair¦y ¦ow sujjort va¦ucs arc trccs (no rin¸s)
◦ `cdium num¦cr ot tra¸mcnts and c¦oscd tra¸mcnts
• Steroids
◦ 1¨ mo¦ccu¦cs , !01 atoms , !`o ¦onds
◦ A ¦ar¸c jart ot thc trcqucnt tra¸mcnts contain onc or morc rin¸s
◦ Lu¸c num¦cr ot tra¸mcnts, sti¦¦ ¦ar¸c num¦cr ot c¦oscd tra¸mcnts
Christian Borgelt Frequent Pattern Mining 348
Steroids Data Set
O
O
O Br
O F
O
O
O
O O
O
O O
O
O O
O
O
O O
O
O O
O
O
O
O
O O
O O
O
O
O
O O
O
O
O
O O
O
O
O O
O
O
O
N
O
O
N
Christian Borgelt Frequent Pattern Mining 349
Experiments: IC93 Data Set
3 3.5 4 4.5 5 5.5 6
5
10
15
20
time/seconds
breadthﬁrst
depthﬁrst
3 3.5 4 4.5 5 5.5 6
0
5
10
15
fragments/10
4
breadthﬁrst
depthﬁrst
processed
3 3.5 4 4.5 5 5.5 6
4
6
8
10
12
14
occurences/10
6
breadthﬁrst
depthﬁrst
Lxjcrimcnta¦ rcsu¦ts on thc lC93 data
Thc horizonta¦ axis shows thc minima¦
sujjort in jcrccnt Thc curvcs show thc
num¦cr ot ¸cncratcd and jroccsscd tra¸
mcnts (toj ¦ctt), num¦cr ot jroccsscd oc
currcnccs (toj ri¸ht), and thc cxccution
timc in scconds (¦ottom ¦ctt) tor thc two
canonica¦ torms,cxtcnsion stratc¸ics
Christian Borgelt Frequent Pattern Mining 350
Experiments: Steroids Data Set
2 3 4 5 6 7 8
10
15
20
25
30
35
time/seconds
breadthﬁrst
depthﬁrst
2 3 4 5 6 7 8
5
10
15
fragments/10
5
breadthﬁrst
depthﬁrst
processed
2 3 4 5 6 7 8
6
8
10
12
occurrences/10
6
breadthﬁrst
depthﬁrst
Lxjcrimcnta¦ rcsu¦ts on thc stcroids data
Thc horizonta¦ axis shows thc a¦so¦utc
minima¦ sujjort Thc curvcs show thc
num¦cr ot ¸cncratcd and jroccsscd tra¸
mcnts (toj ¦ctt), num¦cr ot jroccsscd oc
currcnccs (toj ri¸ht), and thc cxccution
timc in scconds (¦ottom ¦ctt) tor thc two
canonica¦ torms,cxtcnsion stratc¸ics
Christian Borgelt Frequent Pattern Mining 351
Equivalent Sibling Pruning
Christian Borgelt Frequent Pattern Mining 352
Alternative Test: Equivalent Siblings
• Basic Idea:
◦ lt thc (su¦)¸rajh to cxtcnd cxhi¦its a ccrtain symmctry, scvcra¦ cxtcnsions
may ¦c cquiva¦cnt (in thc scnsc that thcy dcscri¦c thc samc (su¦)¸rajh)
◦ At most onc ot thcsc si¦¦in¸ cxtcnsions can ¦c in canonica¦ torm, namc¦y
thc onc least restricting future extensions (¦cx sma¦¦cst codc word)
◦ ldcntity cquiva¦cnt si¦¦in¸s and kccj on¦y thc maxima¦¦y cxtcnda¦¦c onc
• Test Procedure for Equivalence:
◦ Gct any ¸rajh in which two si¦¦in¸ (su¦)¸rajhs to comjarc occur
(lt thcrc is no such ¸rajh, thc si¦¦in¸s arc not cquiva¦cnt)
◦ `ark any occurrcncc ot thc ﬁrst (su¦)¸rajh in thc ¸rajh
◦ Travcrsc a¦¦ occurrcnccs ot thc sccond (su¦)¸rajh in thc ¸rajh
and chcck whcthcr a¦¦ cd¸cs ot an occurrcncc arc markcd
lt thcrc is such an occurrcncc, thc two (su¦)¸rajhs arc cquiva¦cnt
Christian Borgelt Frequent Pattern Mining 353
Alternative Test: Equivalent Siblings
If siblings in the search tree are equivalent,
only the one with the least restrictions needs to be processed.
Example: `inin¸ jhcno¦, jcrcso¦, and catccho¦
O C
C C
C
C C
C O C
C C
C
C C
O
O C
C C
C
C C
Considcr cxtcnsions ot a o¦ond car¦on rin¸ (twc¦vc jossi¦¦c occurrcnccs)
O C
C C
C
C C
0
1 2
3
4 5
O C
C C
C
C C
1
2 3
4
5 0
O C
C C
C
C C
2
3 4
5
0 1
O C
C C
C
C C
1
0 5
4
3 2
On¦y thc (su¦)¸rajh that least restricts future extensions
(ic, that has thc ¦cxico¸rajhica¦¦y sma¦¦cst codc word) can ¦c in canonica¦ torm
¹sc dcjthﬁrst canonica¦ torm (ri¸htmost jath cxtcnsions) and C ≺ O
Christian Borgelt Frequent Pattern Mining 354
Alternative Test: Equivalent Siblings
• Test for Equivalent Siblings before Test for Canonical Form
◦ Travcrsc thc si¦¦in¸ cxtcnsions and comjarc cach jair
◦ Ot two cquiva¦cnt si¦¦in¸s rcmovc thc onc
that rcstricts tuturc cxtcnsions morc
• Advantages:
◦ ldcntiﬁcs somc codc words that arc noncanonica¦ in a simj¦c way
◦ Tcst ot two si¦¦in¸s is at most ¦incar in thc num¦cr ot cd¸cs
and at most ¦incar in thc num¦cr ot occurrcnccs
• Disadvantages:
◦ Locs not idcntity a¦¦ noncanonica¦ codc words,
thcrctorc a su¦scqucnt canonica¦ torm tcst is sti¦¦ nccdcd
◦ Comjarcs two si¦¦in¸ (su¦)¸rajhs,
thcrctorc it is quadratic in thc num¦cr ot si¦¦in¸s
Christian Borgelt Frequent Pattern Mining 355
Alternative Test: Equivalent Siblings
Thc cﬀcctivcncss ot cquiva¦cnt si¦¦in¸ jrunin¸ dcjcnds on thc canonica¦ torm
`inin¸ thc IC93 data with !/ minima¦ sujjort
dcjthﬁrst ¦rcadthﬁrst
cquiva¦cnt si¦¦in¸ jrunin¸ 1`o ( 19/) !19` (S3¨/)
canonica¦ torm jrunin¸ ¨9SS (9S1/) S1` (1o3/)
tota¦ jrunin¸ S1!! `010
(c¦oscd) (su¦)¸rajhs tound 2002 2002
`inin¸ thc steroids data with minima¦ sujjort o
dcjthﬁrst ¦rcadthﬁrst
cquiva¦cnt si¦¦in¸ jrunin¸ 1`32¨ ( ¨2/) 1`2`o2 (`!o/)
canonica¦ torm jrunin¸ 19¨!!9 (92S/) 12¨02o (!`!/)
tota¦ jrunin¸ 212¨¨o 2¨9`SS
(c¦oscd) (su¦)¸rajhs tound 1!20 1!20
Christian Borgelt Frequent Pattern Mining 356
Alternative Test: Equivalent Siblings
Observations:
• Lcjthﬁrst torm ¸cncratcs morc duj¦icatc (su¦)¸rajhs on thc lC93 data
and tcwcr duj¦icatc (su¦)¸rajhs on thc stcroids data (as sccn ¦ctorc)
• Thcrc arc on¦y vcry tcw cquiva¦cnt si¦¦in¸s with dcjthﬁrst torm
on ¦oth thc lC93 data and thc stcroids data
(Con,ccturc cquiva¦cnt si¦¦in¸s rcsu¦t trom “rotatcd” trcc ¦ranchcs,
which arc ¦css ¦ikc¦y to ¦c si¦¦in¸s with dcjthﬁrst torm)
• \ith ¦rcadthﬁrst scarch canonica¦ torm a ¦ar¸c jart ot thc (su¦)¸rajhs
that arc not ¸cncratcd in canonica¦ torm (with a canonica¦ codc word)
can ¦c ﬁ¦tcrcd out with cquiva¦cnt si¦¦in¸ jrunin¸
• On thc tcst lC93 data no diﬀcrcncc in sjccd cou¦d ¦c o¦scrvcd,
jrcsuma¦¦y ¦ccausc jrunin¸ takcs on¦y a sma¦¦ jart ot thc tota¦ timc
• On thc stcroids data, howcvcr, cquiva¦cnt si¦¦in¸ jrunin¸
yic¦ds a s¦i¸ht sjccduj tor ¦rcadthﬁrst torm (∼ `/)
Christian Borgelt Frequent Pattern Mining 357
Canonical Forms based on Adjacency Matrices
Christian Borgelt Frequent Pattern Mining 358
Adjacency Matrices
• A (norma¦, that is, un¦a¦c¦cd) ¸rajh can ¦c dcscri¦cd ¦y an adjacency matrix
◦ A ¸rajh G with n vcrticcs is dcscri¦cd ¦y an n n matrix A (a
ij
)
◦ Givcn a num¦crin¸ ot thc vcrticcs (trom 1 to n), cach vcrtcx is associatcd
with thc row and co¦umn corrcsjondin¸ to its num¦cr
◦ A matrix c¦cmcnt a
ij
is 1 it thcrc cxists an cd¸c ¦ctwccn thc vcrticcs
with num¦crs i and j and 0 othcrwisc
• Ad,accncy matriccs arc not uniquc
Liﬀcrcnt num¦crin¸s ot thc vcrticcs ¦cad to diﬀcrcnt ad,accncy matriccs
1 2
3
4
5
1
1
2
2
3
3
4
4
5
5
0 1 0 1 0
1 0 1 1 0
0 1 0 1 1
1 1 1 0 0
0 0 1 0 0
5 4
2
3
1
1
1
2
2
3
3
4
4
5
5
0 1 0 0 0
1 0 1 1 0
0 1 0 1 1
0 1 1 0 1
0 0 1 1 0
Christian Borgelt Frequent Pattern Mining 359
Extended Adjacency Matrices
• A ¦a¦c¦cd ¸rajh can ¦c dcscri¦cd ¦y an extended adjacency matrix
◦ lt thcrc is an cd¸c ¦ctwccn thc vcrticcs with num¦crs i and j
thc matrix c¦cmcnt a
ij
contains thc ¦a¦c¦ ot this cd¸c
and thc sjccia¦ ¦a¦c¦ (thc cmjty ¦a¦c¦) othcrwisc
◦ Thcrc is an additiona¦ co¦umn containin¸ thc vcrtcx ¦a¦c¦s
• Ot coursc, cxtcndcd ad,accncy matriccs arc a¦so not uniquc
O
N
S
O
O
C
C
C
C
4
2
1
3
6
5
7
8
9
1
1
2
2
3
3
4
4
5
5
6
6
7
7
8
8
9
9
S
N
C
O
C
C
C
O
O
O
N
S
O
O
C
C
C
C
7
2
5
6
4
1
3
8
9
1
1
2
2
3
3
4
4
5
5
6
6
7
7
8
8
9
9
C
N
C
C
S
C
O
O
O
Christian Borgelt Frequent Pattern Mining 360
From Adjacency Matrices to Code Words
• An (cxtcndcd) ad,accncy matrix can ¦c turncd into a code word
¦y simj¦y ¦istin¸ its c¦cmcnts row ¦y row
• Sincc tor undircctcd ¸rajhs thc ad,accncy matrix is ncccssari¦y symmctric,
it suﬃccs to ¦ist thc c¦cmcnts ot thc ujjcr (or ¦owcr) trian¸¦c
• Ior sjarsc ¸rajhs (tcw cd¸cs) ¦istin¸ on¦y co¦umn,¦a¦c¦ jairs can advanta¸cous,
¦ccausc this rcduccs thc codc word ¦cn¸th
O
N
S
O
O
C
C
C
C
4
2
1
3
6
5
7
8
9
lc¸u¦ar cxjrcssion
(nontcrmina¦s)
(a ( i
c
b )
∗
)
n
1
1
2
2
3
3
4
4
5
5
6
6
7
7
8
8
9
9
S
N
C
O
C
C
C
O
O
codc word
S 2  3 
N 4  5 
C 6 
O
C 6  7 
C
C 8  9 =
O
O
Christian Borgelt Frequent Pattern Mining 361
From Adjacency Matrices to Code Words
• \ith an (ar¦itrary, ¦ut ﬁxcd) ordcr on thc ¦a¦c¦ sct A (and dcﬁnin¸ that
intc¸cr num¦crs, which arc ordcrcd in thc usua¦ way, jrcccdc a¦¦ ¦a¦c¦s),
codc words can ¦c comjarcd ¦cxico¸rajhica¦¦y (S ≺ N ≺ O ≺ C .  ≺ =)
O
N
S
O
O
C
C
C
C
4
2
1
3
6
5
7
8
9
S 2  3  N 4  5  C 6  O C 6  7  C C 8  9 = O O
<
O
N
S
O
O
C
C
C
C
7
2
5
6
4
1
3
8
9
C 2  3  4  N 5  7  C 8  9 = C 6  S 6  C O O O
• As tor canonica¦ torms ¦ascd on sjannin¸ trccs, wc thcn dcﬁnc thc ¦cxico¸rajhica¦¦y
sma¦¦cst (or ¦ar¸cst) codc word as thc canonical code word
• `otc that ad,accncy matriccs a¦¦ow tor a much larger number of code words,
¦ccausc any num¦crin¸ ot thc vcrticcs is acccjta¦¦c
Ior canonica¦ torms ¦ascd on sjannin¸ trccs, thc vcrtcx num¦crin¸
must ¦c comjati¦¦c with a (sjcciﬁc) construction ot a sjannin¸ trcc
Christian Borgelt Frequent Pattern Mining 362
From Adjacency Matrices to Code Words
• Thcrc is a varicty ot othcr ways in which an ad,accncy matrix
may ¦c turncd into a codc word
O
N
S
O
O
C
C
C
C
4
2
1
3
6
5
7
8
9
1
1
2
2
3
3
4
4
5
5
6
6
7
7
8
8
9
9
S
N
C
O
C
C
C
O
O
¦owcr trian¸¦c
S
N 1 
C 1 
O 2 
C 2 
C 3  5 
C 5 
O 6  7 
O 7 =
co¦umnwisc
S N C O C C C O O
 1 
 1 
 2 
 2 
 3  5 
 5 
 7 
 7 =
(`otc that thc co¦umnwisc ¦istin¸ nccds a scjarator charactcr “”)
• Lowcvcr, thc rowwisc ¦istin¸ rcstrictcd to thc ujjcr trian¸¦c (as uscd ¦ctorc)
has thc advanta¸c that it has a jrojcrty ana¦o¸ous to thc preﬁx property
ln contrast to this, thc two torms shown a¦ovc do not havc this jrojcrty
Christian Borgelt Frequent Pattern Mining 363
Exploiting Vertex Signatures
Christian Borgelt Frequent Pattern Mining 364
Canonical Form and Vertex and Edge Labels
• \crtcx and cd¸c ¦a¦c¦s hc¦j considcra¦¦y to construct a canonica¦ codc word
or to chcck whcthcr a ¸ivcn codc word is canonica¦
Canonica¦ torm chcck or construction arc usua¦¦y (much) s¦owcr,morc diﬃcu¦t
tor un¦a¦c¦cd ¸rajhs or ¸rajhs with tcw diﬀcrcnt vcrtcx and cd¸c ¦a¦c¦s
• Thc rcason is that with vcrtcx and cd¸c ¦a¦c¦s constructcd codc word jrcﬁxcs
may a¦rcady a¦¦ow us to makc a dccision ¦ctwccn (scts ot) codc words
• lntuitivc cxj¦anation with an cxtrcmc cxamj¦c
Sujjosc that a¦¦ vcrticcs ot a ¸ivcn (su¦)¸rajh havc diﬀcrcnt ¦a¦c¦s Thcn
◦ Thc root,ﬁrst row vcrtcx is uniquc¦y dctcrmincd
it is thc vcrtcx with thc sma¦¦cst ¦a¦c¦ (wrt thc choscn ordcr)
◦ Thc ordcr ot cach vcrtcx’s nci¸h¦ors in thc canonica¦ torm is dctcrmincd
at ¦cast ¦y thc vcrtcx ¦a¦c¦s (¦ut may¦c a¦so ¦y thc cd¸c ¦a¦c¦s)
◦ As a conscqucncc, constructin¸ thc canonica¦ codc word is strai¸httorward
Christian Borgelt Frequent Pattern Mining 365
Canonical Form and Vertex and Edge Labels
• Thc comj¦cxity ot constructin¸ a canonica¦ codc word is causcd ¦y cqua¦ cd¸c and
vcrtcx ¦a¦c¦s, which makc it ncccssary to ajj¦y a backtracking a¦¸orithm
• Question: Can wc cxj¦oit ¸rajh jrojcrtics (that is, thc conncction structurc)
to distin¸uish vcrticcs,cd¸cs with thc samc ¦a¦c¦´
• Idea: Lcscri¦c how thc (su¦)¸rajh undcr considcration “¦ooks trom a vcrtcx”
This can ¦c achicvcd ¦y constructin¸ a “¦oca¦ codc word” (vertex signature)
◦ Start with thc ¦a¦c¦ ot thc vcrtcx
◦ lt thcrc is morc than onc vcrtcx with a ccrtain ¦a¦c¦,
add a (sortcd) ¦ist ot thc ¦a¦c¦s ot thc incidcnt cd¸cs
◦ lt thcrc is morc than onc vcrtcx with thc samc ¦ist,
add a (sortcd) ¦ist ot thc ¦ists ot thc ad,accnt vcrticcs
◦ Continuc with thc vcrticcs that arc two cd¸cs away and so on
Christian Borgelt Frequent Pattern Mining 366
Constructing Vertex Signatures
Thc jroccss ot constructin¸ vcrtcx si¸naturcs is ¦cst dcscri¦cd
as an iterative subdivision of equivalence classes
• Thc initia¦ si¸naturc ot cach vcrtcx is simj¦y its ¦a¦c¦
• Thc vcrtcx sct is sj¦it into cquiva¦cncc c¦asscs
¦ascd on thc initia¦ vcrtcx si¸naturc (that is, thc vcrtcx ¦a¦c¦s)
• Lquiva¦cncc c¦asscs with morc than onc vcrtcx arc thcn jroccsscd
¦y ajjcndin¸ thc (sortcd) ¦a¦c¦s ot thc incidcnt cd¸cs to thc vcrtcx si¸naturc
Thc vcrtcx sct is thcn rcjartitioncd ¦ascd on thc cxtcndcd vcrtcx si¸naturc
• ln a sccond stcj thc (sortcd) si¸naturcs ot thc ad,accnt vcrticcs arc ajjcndcd
• ln su¦scqucnt stcjs thcsc si¸naturcs ot ad,accnt vcrticcs arc rcj¦accd
¦y thc ujdatcd vcrtcx si¸naturcs
• Thc jroccss stojs whcn no rcj¦accmcnt sj¦its an cquiva¦cncc c¦ass
Christian Borgelt Frequent Pattern Mining 367
Constructing Vertex Signatures
O
N
S
O
O
C
C
C
C
4
2
1
3
6
5
7
8
9
vcrtcx si¸naturc
1 S
2 N
! O
S O
9 O
3 C
o C
` C
¨ C
Vertex Signatures, Step 1
• Thc initia¦ vcrtcx si¸naturcs
arc simj¦y thc vcrtcx ¦a¦c¦s
• Thcrc arc tour cquiva¦cncc c¦asscs
S, N, O, and C
• Thc cquiva¦cncc c¦asscs S and N
nccd not turthcr jroccssin¸,
¦ccausc thcy a¦rcady contain
on¦y a sin¸¦c vcrtcx
• Lowcvcr, thc vcrtcx si¸naturcs O and C
nccd to ¦c cxtcndcd in ordcr to sj¦it
thc corrcsjondin¸ cquiva¦cncc c¦asscs
Christian Borgelt Frequent Pattern Mining 368
Constructing Vertex Signatures
O
N
S
O
O
C
C
C
C
4
2
1
3
6
5
7
8
9
vcrtcx si¸naturc
1 S
2 N
! O 
S O 
9 O =
3 C 
o C 
` C 
¨ C =
Vertex Signatures, Step 2
• Thc vcrtcx si¸naturcs ot thc c¦asscs
that contain morc than onc vcrtcx arc
cxtcndcd ¦y thc sortcd ¦ist ot ¦a¦c¦s
ot thc incidcnt cd¸cs
• This distin¸uishcs thc thrcc oxy¸cn atoms,
¦ccausc two is incidcnt to a sin¸¦c ¦ond,
thc othcr to a dou¦¦c ¦ond
• lt a¦so distin¸uishcs most car¦on atoms,
¦ccausc thcy havc diﬀcrcnt scts
ot incidcnt cd¸cs
• On¦y thc si¸naturcs ot car¦ons 3 and o
and thc si¸naturcs ot oxy¸cns ! and 9
nccd to ¦c cxtcndcd turthcr
Christian Borgelt Frequent Pattern Mining 369
Constructing Vertex Signatures
O
N
S
O
O
C
C
C
C
4
2
1
3
6
5
7
8
9
vcrtcx si¸naturc
1 S
2 N
! O  N
S O  C =
9 O =
3 C  S C 
o C  C  C 
` C 
¨ C =
Vertex Signatures, Step 3
• Thc vcrtcx si¸naturcs ot car¦ons 3 and o
and ot oxy¸cns ! and 9 arc cxtcndcd
¦y thc sortcd ¦ist ot vcrtcx si¸naturcs
ot thc ad,accnt vcrticcs
• This distin¸uishcs thc two jairs
(car¦on 3 is ad,accnt to a su¦tur atom,
oxy¸cn ! is incidcnt to a nitro¸cn atom)
• As a rcsu¦t, a¦¦ cquiva¦cncc c¦asscs
contain on¦y a sin¸¦c vcrtcx and thus
wc o¦taincd a uniquc vcrtcx ¦a¦c¦in¸
• \ith this uniquc vcrtcx ¦a¦c¦in¸,
constructin¸ a canonica¦ codc word
¦ccomcs vcry simj¦c and cﬃcicnt
Christian Borgelt Frequent Pattern Mining 370
Elements of Vertex Signatures
• ¹sin¸ on¦y (sortcd) ¦ists ot ¦a¦c¦s ot incidcnt cd¸cs and ad,accnt vcrticcs
cannot a¦ways distin¸uish a¦¦ vcrticcs
Lxamj¦c Ior thc to¦¦owin¸ two (un¦a¦c¦cd) ¸rajhs such vcrtcx si¸naturcs
cannot sj¦it thc so¦c cquiva¦cncc c¦ass
• Thc cquiva¦cncc c¦ass can ¦c sj¦it tor thc ri¸ht ¸rajh, thou¸h, it thc num¦cr
ot ad,accnt vcrticcs that arc ad,accnt is incorjoratcd into thc vcrtcx si¸naturc
Thcrc is a¦so a ¦ar¸c varicty ot othcr ¸rajh jrojcrtics that may ¦c uscd
• Lowcvcr, tor ncithcr ¸rajh thc cquiva¦cncc c¦asscs can ¦c rcduccd to sin¸¦c vcrticcs
Ior thc ¦ctt ¸rajh it is not cvcn jossi¦¦c at a¦¦ to sj¦it thc cquiva¦cncc c¦ass
• Thc rcason is that ¦oth ¸rajhs josscss automorphisms othcr thcn thc idcntity
Christian Borgelt Frequent Pattern Mining 371
Automorphism Groups
• Lct F
auto
(G) ¦c thc sct ot a¦¦ automorphisms ot a (¦a¦c¦cd) ¸rajh G
Thc orbit ot a vcrtcx v ∈ V
G
wrt F
auto
(G) is thc sct
o(v) ¦u ∈ V
G
[ ∃f ∈ F
auto
(G) u f(v)¦.
`otc that wc havc a¦ways v ∈ o(v), ¦ccausc thc idcntity is a¦ways in F
auto
(G)
• Thc vcrticcs in an or¦it cannot jossi¦¦y ¦c distin¸uishcd ¦y vcrtcx si¸naturcs,
¦ccausc thc ¸rajh “¦ooks thc samc” trom a¦¦ ot thcm
• ln ordcr to dca¦ with or¦its, onc can cxj¦oit that thc automorjhisms F
auto
(G)
ot a ¸rajh G torm a group (thc automorphism group ot G)
◦ Lurin¸ thc construction ot a canonica¦ codc word,
dctcct automorjhisms (vcrtcx num¦crin¸s ¦cadin¸ to thc samc codc word)
◦ Irom tound automorjhisms, generators ot thc ¸rouj ot automorjhisms
can ¦c dcrivcd Thcsc ¸cncrators can thcn ¦c uscd to avoid cxj¦orin¸
imj¦icd automorjhisms, thus sjccdin¸ uj thc scarch `cIay 19S1
Christian Borgelt Frequent Pattern Mining 372
Canonical Form and Vertex Signatures
• Advantages of Vertex Signatures:
◦ \crticcs with thc samc ¦a¦c¦ can ¦c distin¸uishcd in a jrcjroccssin¸ stcj
◦ Constructin¸ canonica¦ codc words can thus ¦ccomc much casicr,tastcr,
¦ccausc thc ncccssary ¦acktrackin¸ can ottcn ¦c rcduccd considcra¦¦y
(Thc ¸ains arc usua¦¦y jarticu¦ar¦y ¦ar¸c tor ¸rajhs with tcw,no ¦a¦c¦s)
• Disadvantages of Vertex Signatures:
◦ \crtcx si¸naturcs can rctcr to thc ¸rajh as a who¦c
and thus may ¦c diﬀcrcnt tor su¦¸rajhs
(\crticcs with diﬀcrcnt si¸naturcs in a su¦¸rajh
may havc thc samc si¸naturc in a sujcr¸rajh and vicc vcrsa)
◦ As a conscqucncc it can ¦c diﬃcu¦t to cnsurc
that thc rcsu¦tin¸ canonica¦ torm has thc preﬁx property
ln such a casc onc may not ¦c a¦¦c to rcstrict (su¦)¸rajh cxtcnsions
or to usc thc simj¦iﬁcd scarch schcmc (on¦y codc word chccks)
Christian Borgelt Frequent Pattern Mining 373
Repository of Processed Fragments
Christian Borgelt Frequent Pattern Mining 374
Repository of Processed Fragments
• Canonical form pruning is thc jrcdominant mcthod
to avoid rcdundant scarch in trcqucnt (su¦)¸rajh minin¸
• Thc o¦vious a¦tcrnativc, a repository of processed (sub)graphs,
has rcccivcd tair¦y ¦itt¦c attcntion Lor¸c¦t and Iicd¦cr 200¨
◦ \hcncvcr a ncw (su¦)¸rajh is crcatcd, thc rcjository is acccsscd
◦ lt it contains thc (su¦)¸rajh, wc know that it has a¦rcady ¦ccn jroccsscd
and thcrctorc it can ¦c discardcd
◦ On¦y (su¦)¸rajhs that arc not containcd in thc rcjository arc cxtcndcd
and, ot coursc, inscrtcd into thc rcjository
• lt thc rcjository is ¦aid out as a hash ta¦¦c with a carctu¦¦y dcsi¸ncd
hash tunction, it is comjctitivc with canonica¦ torm jrunin¸
(ln somc cxjcrimcnts, thc rcjository¦ascd ajjroach
cou¦d outjcrtorm canonica¦ torm jrunin¸ ¦y 1`/)
Christian Borgelt Frequent Pattern Mining 375
Repository of Processed Fragments
• Lach (su¦)¸rajh shou¦d ¦c storcd usin¸ a minima¦ amount ot mcmory
(sincc thc num¦cr ot jroccsscd (su¦)¸rajhs is usua¦¦y hu¸c)
◦ Storc a (su¦)¸rajh ¦y ¦istin¸ thc cd¸cs ot onc occurrcncc
(`otc that tor conncctcd (su¦)¸rajhs thc cd¸cs a¦so idcntity a¦¦ vcrticcs)
• Thc containmcnt tcst has to ¦c madc as tast as jossi¦¦c
(sincc it wi¦¦ ¦c carricd out trcqucnt¦y)
◦ Try to avoid a tu¦¦ isomorjhism tcst with a hash ta¦¦c
Lmj¦oy a hash tunction that is comjutcd trom ¦oca¦ ¸rajh jrojcrtics
(Lasic idca com¦inc thc vcrtcx and cd¸c attri¦utcs and thc vcrtcx dc¸rccs)
◦ lt an isomorjhism tcst is ncccssary, do quick chccks ﬁrst
num¦cr ot vcrticcs, num¦cr ot cd¸cs, ﬁrst containin¸ data¦asc ¸rajh ctc
◦ Actua¦ isomorjhism tcst
mark storcd occurrcncc and chcck tor tu¦¦y markcd ncw occurrcncc
(ct thc jroccdurc ot cquiva¦cnt si¦¦in¸ jrunin¸)
Christian Borgelt Frequent Pattern Mining 376
Canonical Form Pruning versus Repository
• Advantage of Canonical Form Pruning
On¦y onc tcst (tor canonica¦ torm) is nccdcd in ordcr to dctcrminc
whcthcr a (su¦)¸rajh nccds to ¦c jroccsscd or not
• Disadvantage of Canonical Form Pruning
lt is most cost¦y tor thc (su¦)¸rajhs that arc crcatcd in canonica¦ torm
(→ s¦owcst tor tra¸mcnts that havc to ¦c jroccsscd)
• Advantage of Repositorybased Pruning
Ottcn a¦¦ows to dccidc vcry quick¦y that a (su¦)¸rajh has not ¦ccn jroccsscd
(→ tastcst tor tra¸mcnts that havc to ¦c jroccsscd)
• Disadvantages of Repositorybased Pruning
`u¦tij¦c isomorjhism tcsts may ¦c ncccssary tor a jroccsscd tra¸mcnt
`ccds tar morc mcmory than canonica¦ torm jrunin¸
A rcjository vcry diﬃcu¦t to usc in a jara¦¦c¦ a¦¸orithm
Christian Borgelt Frequent Pattern Mining 377
Canonical Form vs. Repository: Execution Times
2 2.5 3 3.5 4 4.5 5 5.5 6
20
40
60
80 time/seconds
canon. form
repository
2 2.5 3 3.5 4 4.5 5 5.5 6
20
40
60
80 time/seconds
canon. form
repository
• Lxjcrimcnta¦ rcsu¦ts on thc lC93 data sct,
scarch timc in scconds (vcrtica¦ axis) vcrsus
minimum sujjort in jcrccnt (horizonta¦ axis)
• Lctt maximum sourcc cxtcnsions
• li¸ht ri¸htmost jath cxtcnsions
Christian Borgelt Frequent Pattern Mining 378
Canonical Form vs. Repository: Numbers of (Sub)Graphs
2 2.5 3 3.5 4 4.5 5 5.5 6
0
20
40
60
80 subgraphs/10 000
generated
dupl. tests
processed
duplicates
2 2.5 3 3.5 4 4.5 5 5.5 6
0
20
40
60
80 subgraphs/10 000
generated
dupl. tests
processed
duplicates
• Lxjcrimcnta¦ rcsu¦ts on thc lC93 data sct,
num¦crs ot su¦¸rajhs uscd in thc scarch
• Lctt maximum sourcc cxtcnsions
• li¸ht ri¸htmost jath cxtcnsions
Christian Borgelt Frequent Pattern Mining 379
Repository Performance
2 2.5 3 3.5 4 4.5 5 5.5 6
0
20
40
60
80 subgraphs/10 000
generated
accesses
isom. tests
duplicates
2 2.5 3 3.5 4 4.5 5 5.5 6
0
20
40
60
80 subgraphs/10 000
generated
accesses
isom. tests
duplicates
• Lxjcrimcnta¦ rcsu¦ts on thc lC93 data sct,
jcrtormancc ot rcjository¦ascd jrunin¸
• Lctt maximum sourcc cxtcnsions
• li¸ht ri¸htmost jath cxtcnsions
Christian Borgelt Frequent Pattern Mining 380
Perfect Extension Pruning
Christian Borgelt Frequent Pattern Mining 381
Reminder: Perfect Extension Pruning for Item Sets
• lt on¦y c¦oscd itcm scts or on¦y maxima¦ itcm scts arc to ¦c tound,
additiona¦ jrunin¸ ot thc scarch trcc ¦ccomcs jossi¦¦c
• Sujjosc that durin¸ thc scarch wc discovcr that
s
T
(I ∪ ¦a¦) s
T
(I)
tor somc itcm sct I and somc itcm a / ∈ I (That is, I is not c¦oscd)
\c ca¦¦ thc itcm a a perfect extension ot I Thcn wc know
∀J ⊇ I s
T
(J ∪ ¦a¦) s
T
(J).
This can most casi¦y ¦c sccn ¦y considcrin¸ that K
T
(I) ⊆ K
T
(¦a¦)
and hcncc K
T
(J) ⊆ K
T
(¦a¦), sincc K
T
(J) ⊆ K
T
(I)
• As a conscqucncc, no sujcrsct J ⊇ I with a / ∈ J can ¦c c¦oscd
Lcncc a can ¦c addcd dircct¦y to thc jrcﬁx ot thc conditiona¦ data¦asc
Thc samc ¦asic idca can a¦so ¦c uscd tor ¸rajhs, ¦ut nccds modiﬁcations
Christian Borgelt Frequent Pattern Mining 382
Perfect Extensions
• An cxtcnsion ot a ¸rajh (tra¸mcnt) is ca¦¦cd perfect,
it it can ¦c ajj¦icd to a¦¦ ot its occurrcnccs in cxact¦y thc samc way
• Attention: lt may not ¦c cnou¸h to comjarc thc sujjort
and thc num¦cr ot occurrcnccs ot thc ¸rajh tra¸mcnt
(Lvcn thou¸h jcrtcct cxtcnsions must havc thc samc sujjort and
an intc¸cr mu¦tij¦c ot thc num¦cr ot occurrcnccs ot thc ¦asc tra¸mcnt)
O C S C
N
O C S C N
O
O
C S C
C S C N O C S C
2+2 embs.
1+1 embs. 1+3 embs.
`cithcr is a sin¸¦c ¦ond to nitro¸cn a jcrtcct cxtcnsion ot OCSC
nor is a sin¸¦c ¦ond to oxy¸cn a jcrtcct cxtcnsion ot NCSC
Lowcvcr, wc nccd that a jcrtcct cxtcnsion ot a ¸rajh tra¸mcnt
is a¦so a jcrtcct cxtcnsion ot any sujcr¸rajh ot this tra¸mcnt
• Consequence: lt may ¦c ncccssary to chcck whcthcr a¦¦ occurrcnccs
ot thc ¦asc tra¸mcnt ¦cad to thc samc num¦cr ot cxtcndcd occurrcnccs
Christian Borgelt Frequent Pattern Mining 383
Partial Perfect Extension Pruning
• Basic idea of perfect extension pruning:
Iirst ¸row a tra¸mcnt to thc ¦i¸¸cst common su¦structurc
• Partial perfect extension pruning: lt thc chi¦drcn ot a scarch trcc vcrtcx
arc ordcrcd ¦cxico¸rajhica¦¦y (wrt thcir codc word), no tra¸mcnt in a su¦trcc
to thc ri¸ht ot a jcrtcct cxtcnsion ¦ranch can ¦c c¦oscd Yan and Lan 2003
cxamj¦c mo¦ccu¦cs
S C N C
O
O S C N
F
O S C N
O
scarch trcc tor sccd S
S F S O
S C O
O S C
O
S C N C
O S C N
O
S C N C
O
S
S C
O S C S C N
O S C N S C N
O
3
1
3
2
2 3
2
2 1
2
1
1 1
S ≺ F ≺ N ≺ C ≺ O  ≺ =
¦rcadthﬁrst scarch canonica¦ torm
Christian Borgelt Frequent Pattern Mining 384
Full Perfect Extension Pruning
• Full perfect extension pruning: Lor¸c¦t and `cin¦ 200o
A¦so jrunc thc ¦ranchcs to thc ¦ctt ot thc jcrtcct cxtcnsion ¦ranch
• Problem: This jrunin¸ mcthod intcrtcrcs with canonica¦ torm jrunin¸,
¦ccausc thc cxtcnsions in thc ¦ctt si¦¦in¸s cannot ¦c rcjcatcd in thc jcrtcct
cxtcnsion ¦ranch (rcstrictcd cxtcnsions, “simj¦c ru¦cs” tor canonica¦ torm)
cxamj¦c mo¦ccu¦cs
S C N C
O
O S C N
F
O S C N
O
scarch trcc tor sccd S
S F S O
O S C S C O
S C N C
O S C N
O
S C N C
O
S
S C
S C N
O S C N S C N
O
3
1
3
2
2 3
2
2
2
1
1 1
S ≺ F ≺ N ≺ C ≺ O  ≺ =
¦rcadthﬁrst scarch canonica¦ torm
Christian Borgelt Frequent Pattern Mining 385
Code Word Reorganization
• Restricted extensions:
`ot a¦¦ cxtcnsions ot a tra¸mcnt arc a¦¦owcd ¦y thc canonica¦ torm
Somc can ¦c chcckcd ¦y simj¦c ru¦cs (ri¸htmost jath,max sourcc cxtcnsion)
• Consequence: ln ordcr to makc canonica¦ torm jrunin¸ and tu¦¦ jcrtcct
cxtcnsion jrunin¸ comjati¦¦c, thc rcstrictions on cxtcnsions must ¦c miti¸atcd
• Example:
Thc corc jro¦¦cm ot o¦tainin¸ thc scarch trcc on thc jrcvious s¦idc is
how wc can avoid that thc tra¸mcnt OSCN is jruncd as noncanonica¦
◦ Thc ¦rcadthﬁrst scarch canonica¦ codc word tor this tra¸mcnt is
S 0C1 0O2 1N3
◦ Lowcvcr, with thc scarch trcc on thc jrcvious s¦idc it is assi¸ncd
S 0C1 1N2 0O3
• Solution: Lcviatc trom ajjcndin¸ thc dcscrijtion ot a ncw cd¸c
A¦¦ow tor a (strict¦y ¦imitcd) codc word rcor¸anization
Christian Borgelt Frequent Pattern Mining 386
Code Word Reorganization
• ln ordcr to o¦tain a jrojcr codc, it must ¦c jossi¦¦c to shitt dcscrijtions
ot ncw cd¸cs jast dcscrijtions ot jcrtcct cxtcnsion cd¸cs in thc codc word
• Thc codc word ot a tra¸mcnt consists ot two jarts
◦ a preﬁx cndin¸ with thc ¦ast nonjcrtcct cxtcnsion cd¸c and
◦ a (jossi¦¦y cmjty) suﬃx ot jcrtcct cxtcnsion cd¸cs
• A ncw cd¸c dcscrijtion is usua¦¦y ajjcndcd at thc cnd ot thc codc word
This is sti¦¦ thc standard jroccdurc is thc suﬃx is cmjty
Lowcvcr, it thc suﬃx is not cmjty, thc dcscrijtion ot thc ncw cd¸c
may ¦c inscrtcd into thc suﬃx or cvcn movcd dircct¦y ¦ctorc thc suﬃx
(\hichcvcr jossi¦i¦ity yic¦ds thc ¦cxico¸rajhica¦¦y sma¦¦cst codc word)
• lathcr than to actua¦¦y shitt and modity cd¸c dcscrijtion,
it is tcchnica¦¦y casicr to rc¦ui¦d thc codc word trom thc tront
(ln jarticu¦ar, rcnum¦crin¸ thc vcrticcs is casicr)
Christian Borgelt Frequent Pattern Mining 387
Code Word Reorganization: Example
• Shift an cxtcnsion to thc jrojcr j¦acc and rcnum¦cr thc vcrticcs
1 Lasc tra¸mcnt SCN canonica¦ codc S 0C1 1N2
2 Lxtcnsion to OSCN (noncanonica¦') codc S 0C1 1N2 0O3
3 Shitt cxtcnsion (inva¦id) codc S 0C1 0O3 1N2
! lcnum¦cr vcrticcs canonica¦ codc S 0C1 0O2 1N3
• Rebuild thc codc word trom thc tront
◦ Thc root vcrtcx (hcrc thc su¦tur atom) is a¦ways in thc ﬁxcd jart
lt rcccivcs thc initia¦ vcrtcx indcx, that is, 0 (zcro)
◦ Comjarc two jossi¦¦c codc word jrcﬁxcs S 0O1 and S 0C1
Iix thc ¦attcr, sincc it is ¦cxico¸rajhica¦¦y sma¦¦cr
◦ Comjarc thc codc word jrcﬁxcs S 0C1 0O2 and S 0C1 1N2
Iix thc tormcr, sincc it is ¦cxico¸rajhica¦¦y sma¦¦cr
◦ Ajjcnd thc rcmainin¸ jcrtcct cxtcnsion cd¸c S 0C1 0O2 1N3
¦rcadthﬁrst scarch canonica¦ torm. S ≺ N ≺ C ≺ O.  ≺ =
Christian Borgelt Frequent Pattern Mining 388
Perfect Extensions: Problems with Cycles/Rings
cxamj¦c
mo¦ccu¦cs
scarch trcc tor sccd N
N O
C
C C
C
N O
C
C
C C
N
N O N C
C N O N O C N C C
C
N O C
C C
N O N O C
C
N C C
C
C C
N O C
C
N O C
C
O N
C C C
N O C
C C
C C
N O C
C C C C
C O N
• Problem: lcrtcct cxtcnsions in cyc¦cs may not a¦¦ow tor jrunin¸
• Consequence: Additional constraint Lor¸c¦t and `cin¦ 200o
lcrtcct cxtcnsions must ¦c ¦rid¸cs or cd¸cs c¦osin¸ a cyc¦c,rin¸
Christian Borgelt Frequent Pattern Mining 389
Experiments: IC93 without Ring Mining
2.5 3 3.5 4 4.5 5 5.5 6
4
6
8
10
12
14
occurrences/10
6
full
partial
none
2.5 3 3.5 4 4.5 5 5.5 6
5
10
15
20 fragments/10
4
full
partial
none
2.5 3 3.5 4 4.5 5 5.5 6
20
40
60
nodes/10
3
full
partial
none
Lxjcrimcnta¦ rcsu¦ts on thc lC93 data,
o¦taincd without rin¸ minin¸ (sin¸¦c
¦ond cxtcnsions) Thc horizonta¦ axis
shows thc minima¦ sujjort in jcrccnt
Thc curvcs show thc num¦cr ot ¸cncratcd
tra¸mcnts (toj ¦ctt), thc num¦cr ot jro
ccsscd occurrcnccs (¦ottom ¦ctt), and thc
num¦cr ot scarch trcc nodcs (toj ri¸ht)
tor thc thrcc diﬀcrcnt mcthods
Christian Borgelt Frequent Pattern Mining 390
Experiments: IC93 with Ring Mining
2 2.5 3 3.5 4
10
20
30
occurrences/10
5
full
partial
none
2 2.5 3 3.5 4
20
40
60
fragments/10
3
full
partial
none
2 2.5 3 3.5 4
0
5
10
15
20
nodes/10
3
full
partial
none
Lxjcrimcnta¦ rcsu¦ts on thc lC93 data,
o¦taincd with rin¸ minin¸ Thc hori
zonta¦ axis shows thc minima¦ sujjort
in jcrccnt Thc curvcs show thc num
¦cr ot ¸cncratcd tra¸mcnts (toj ¦ctt), thc
num¦cr ot jroccsscd occurrcnccs (¦ottom
¦ctt), and thc num¦cr ot scarch trcc nodcs
(toj ri¸ht) tor thc thrcc diﬀcrcnt mcth
ods
Christian Borgelt Frequent Pattern Mining 391
Extensions for Molecular Fragment Mining
Christian Borgelt Frequent Pattern Mining 392
Extensions of the Search Algorithm
• Rings Lotcr, Lor¸c¦t, and Lcrtho¦d 200!. Lor¸c¦t 200o
◦ lrcjroccssin¸ Iind rin¸s in thc mo¦ccu¦cs and mark thcm
◦ ln thc scarch jroccss Add a¦¦ atoms and ¦onds ot a rin¸ in onc stcj
◦ Considcra¦¦y imjrovcs cﬃcicncy and intcrjrcta¦i¦ity
• Carbon Chains `cin¦, Lor¸c¦t, and Lcrtho¦d 200!
◦ Add a car¦on chain in onc stcj, i¸norin¸ its ¦cn¸th
◦ Lxtcnsions ¦y a car¦on chain match rc¸ard¦css ot thc chain ¦cn¸th
• Wildcard Atoms Lotcr, Lor¸c¦t, and Lcrtho¦d 200!
◦ Lcﬁnc c¦asscs ot atoms that can ¦c sccn as cquiva¦cnt
◦ Com¦inc tra¸mcnt cxtcnsions with cquiva¦cnt atoms
◦ lntrcqucnt tra¸mcnts that diﬀcr on¦y in a tcw atoms
trom trcqucnt tra¸mcnts can ¦c tound
Christian Borgelt Frequent Pattern Mining 393
Ring Mining: Treat Rings as Units
• General Idea of Ring Mining
A rin¸ (cyc¦c) is cithcr containcd in a tra¸mcnt as a who¦c or not at a¦¦
• Filter Approaches
◦ (Su¦)¸rajhs,tra¸mcnts arc ¸rown cd¸c ¦y cd¸c (as ¦ctorc)
◦ Iound trcqucnt ¸rajh tra¸mcnts arc ﬁ¦tcrcd
Grajh tra¸mcnts with incomj¦ctc rin¸s arc discardcd
◦ Additiona¦ scarch trcc jrunin¸
lrunc su¦trccs that yic¦d on¦y tra¸mcnts with incomj¦ctc rin¸s
• Reordering Approach
◦ lt an cd¸c is addcd that is jart ot onc or morc rin¸s,
(onc ot) thc containin¸ rin¸(s) is addcd as a who¦c (a¦¦ ot its cd¸cs arc addcd)
◦ lncomjati¦i¦itics with canonica¦ torm jrunin¸ arc hand¦cd
¦y rcordcrin¸ codc words (simi¦ar to tu¦¦ jcrtcct cxtcnsion jrunin¸)
Christian Borgelt Frequent Pattern Mining 394
Ring Mining: Preprocessing
lin¸ minin¸ is simj¦cr attcr jrcjroccssin¸ thc rin¸s in thc ¸rajhs to ana¦yzc
Basic Preprocessing: (tor ﬁ¦tcr ajjroachcs)
• `ark a¦¦ cd¸cs ot rin¸s in a uscrsjcciﬁcd sizc ran¸c
(mo¦ccu¦ar tra¸mcnt minin¸ usua¦¦y rin¸s with ` – o vcrticcs,atoms)
• Tcchnica¦¦y, thcrc arc two rin¸ idcntiﬁcation jarts jcr cd¸c
◦ A markcr in thc cd¸c attri¦utc,
which tundamcnta¦¦y distin¸uishcs rin¸ cd¸cs trom nonrin¸ cd¸cs
◦ A sct ot ﬂa¸s idcntityin¸ thc diﬀcrcnt rin¸s an cd¸c is containcd in
(`otc that an cd¸c can ¦c jart ot scvcra¦ rin¸s)
Extended Preprocessing: (tor rcordcrin¸ ajjroach)
N
0
1
5
8 6
2
4
3
7 9
• `ark pseudorings, that is, rin¸s ot sma¦¦cr sizc than thc uscr sjcciﬁcd, ¦ut which
consist on¦y ot cd¸cs that arc jart ot rin¸s within thc uscrsjcciﬁcd sizc ran¸c
Christian Borgelt Frequent Pattern Mining 395
Filter Approaches: Open Rings
Idea of Open Ring Filtering:
lt wc rcquirc thc outjut to havc on¦y comj¦ctc rin¸s, wc havc to idcntity and
rcmovc tra¸mcnts with rin¸ cd¸cs that do not ¦c¦on¸ to any comj¦ctc rin¸
• lin¸ cd¸cs havc ¦ccn markcd in thc jrcjroccssin¸
⇒ lt is known which cd¸cs ot a ¸rown (su¦)¸rajh arc rin¸ cd¸cs
(in thc undcr¦yin¸ ¸rajhs ot thc data¦asc)
• Ajj¦y thc jrcjroccssin¸ jroccdurc to a ¸rown (su¦)¸rajh, ¦ut
◦ kccj thc markcr in thc cd¸c attri¦utc.
◦ on¦y sct thc ﬂa¸s that idcntity thc rin¸s an cd¸c is containcd in
• Chcck tor cd¸cs that havc a rin¸ markcr in thc cd¸c attri¦utc,
¦ut did not rcccivc any rin¸ ﬂa¸ whcn thc (su¦)¸rajh was rcjroccsscd
• lt such cd¸cs cxist, thc (su¦)¸rajh contains unc¦oscd,ojcn rin¸s,
so thc (su¦)¸rajh must not ¦c rcjortcd
Christian Borgelt Frequent Pattern Mining 396
Filter Approaches: Unclosable Rings
Idea of Unclosable Ring Filtering:
Grown (su¦)¸rajhs with ojcn rin¸s that cannot ¦c c¦oscd ¦y tuturc cxtcnsions
can ¦c jruncd trom thc scarch
• Canonica¦ torm jrunin¸ a¦¦ows to rcstrict thc jossi¦¦c cxtcnsions ot a tra¸mcnt
⇒ Luc to jrcvious cxtcnsions ccrtain vcrticcs ¦ccomc uncxtcnda¦¦c
⇒ Somc rin¸s cannot ¦c c¦oscd ¦y cxtcndin¸ a (su¦)¸rajh
• O¦vious¦y, a ncccssary (thou¸h not suﬃcicnt) condition tor a¦¦ rin¸s ¦cin¸ c¦oscd
is that cvcry vcrtcx has cithcr zcro or at ¦cast two incidcnt rin¸ cd¸cs
lt thcrc is a vcrtcx with on¦y onc incidcnt rin¸ cd¸c,
this cd¸c must ¦c jart ot an incomj¦ctc rin¸
• lt an uncxtcnda¦¦c vcrtcx ot a ¸rown (su¦)¸rajh has on¦y onc incidcnt rin¸ cd¸c,
this (su¦)¸rajh can ¦c jruncd trom thc scarch
(¦ccausc thcrc is an ojcn rin¸ that can ncvcr ¦c c¦oscd)
Christian Borgelt Frequent Pattern Mining 397
Reminder: Restricted Extensions
O
N
S
O
O
cxamj¦c
mo¦ccu¦c
dcjthﬁrst
A
S
0
N
1
C
3
C
7
C
8
O
2
C
4
O
5
O
6
¦rcadthﬁrst
L
C
6
O
7
O
8
S
0
N
1
C
2
O
3
C
4
C
5
Extendable Vertices:
A vcrticcs on thc ri¸htmost jath, that is, 0, 1, 3, ¨, S
L vcrticcs with an indcx no sma¦¦cr than thc maximum sourcc, that is, o, ¨, S
Edges Closing Cycles:
A nonc, ¦ccausc thc cxistin¸ cyc¦c cd¸c has thc sma¦¦cst jossi¦¦c sourcc
L thc cd¸c ¦ctwccn thc vcrticcs ¨ and S
Christian Borgelt Frequent Pattern Mining 398
Filter Approaches: Merging Ring Extensions
Idea of Merging Ring Extensions:
Thc jrcvious mcthods work on individua¦ cd¸cs and hcncc cannot a¦ways dctcct
it an cxtcnsion on¦y ¦cads to tra¸mcnts with comj¦ctc rin¸s that arc intrcqucnt
• Add a¦¦ cd¸cs ot a rin¸, thus distin¸uishin¸ cxtcnsions that
◦ start with thc samc individua¦ cd¸c, ¦ut
N O
C
C
C C
N O
C
C C
C
◦ ¦cad into rin¸s ot diﬀcrcnt sizc or diﬀcrcnt comjosition
• Lctcrminc thc sujjort ot thc ¸rown (su¦)¸rajhs and jrunc intrcqucnt oncs
• Trim and mcr¸c rin¸ cxtcnsions that sharc thc samc initia¦ cd¸c
Advantage of Merging Ring Extensions:
• A¦¦ cxtcnsions arc rcmovcd that ¦ccomc intrcqucnt whcn comj¦ctcd into rin¸s
• A¦¦ occurrcnccs arc rcmovcd that ¦cad to intrcqucnt (su¦)¸rajhs
oncc rin¸s arc comj¦ctcd
Christian Borgelt Frequent Pattern Mining 399
A Reordering Approach
• Drawback of Filtering:
(Su¦)¸rajhs arc sti¦¦ cxtcndcd cd¸c ¦y cd¸c ⇒ Ira¸mcnts ¸row tair¦y s¦ow¦y
• Better Approach:
◦ Add a¦¦ cd¸cs ot a rin¸ in onc stcj (\hcn a rin¸ cd¸c is addcd,
crcatc onc cxtcndcd (su¦)¸rajh tor cach rin¸ it is containcd in)
◦ lcordcr ccrtain cd¸cs in ordcr to comj¦y with canonica¦ torm jrunin¸
• Problems of a Reordering Approach:
◦ Onc must a¦¦ow tor inscrtions ¦ctwccn a¦rcady addcd rin¸ cd¸cs
(¦ccausc ¦ranchcs may jrcccdc rin¸ cd¸cs in thc canonica¦ torm)
◦ Onc must not commit too car¦y to an ordcr ot thc cd¸cs
(¦ccausc ¦ranchcs may inﬂucncc thc ordcr ot thc rin¸ cd¸cs)
◦ A¦¦ jossi¦¦c ordcrs ot (¦oca¦¦y) cquiva¦cnt cd¸cs must ¦c tricd,
¦ccausc any ot thcm may jroducc va¦id outjut
Christian Borgelt Frequent Pattern Mining 400
Problems of Reordering Approaches
One must not commit too early to an order of the edges.
l¦¦ustration cﬀccts ot attachin¸ a ¦ranch to an asymmctric rin¸ N ≺ O ≺ C,  ≺ =
N O O
0
2
4
5
3
1
N 0C1 0C2 1C3 2C4 3C5 4=C5
N O O
0
1
3
5
4
2
N 0C1 0C2 1C3 2C4 3=C5 4C5
N O O
0
2
5
6
3
1
4
N 0C1 0C2 1C3 2O4 2C5 3=C6 5C6
N O O
0
1
4
6
5
2
3
N 0C1 0C2 1O3 1C4 2C5 3C6 5=C6
• \rt a ¦rcadthﬁrst scarch canonica¦ torm, thc cd¸cs ot thc rin¸
can ¦c ordcrcd in two diﬀcrcnt ways (ujjcr two rows)
Thc ujjcr,¦ctt is thc canonica¦ torm ot thc jurc rin¸
• \ith an attachcd ¦ranch (c¦osc to thc root vcrtcx),
thc othcr ordcrin¸ ot thc rin¸ cd¸cs (¦owcr,ri¸ht) is thc canonica¦ torm
Christian Borgelt Frequent Pattern Mining 401
Keeping NonCanonical Fragments
Solution of the early commitment problem:
`aintain (and cxtcnd) ¦oth ordcrin¸s ot thc rin¸ cd¸cs and
a¦¦ow tor dcviations trom thc canonica¦ torm ¦cyond “ﬁxcd” cd¸cs
• Principle: kccj (and, conscqucnt¦y, a¦so cxtcnd) tra¸mcnts that arc not in
canonica¦ torm, ¦ut that cou¦d ¦ccomc canonica¦ oncc ¦ranchcs arc addcd
• `ccdcd a ru¦c which noncanonica¦ tra¸mcnts to kccj and which to discard
• ldca addin¸ a rin¸ can ¦c sccn as addin¸ its initia¦ cd¸c as in an cd¸c¦ycd¸c
jroccdurc, and somc additiona¦ cd¸cs, thc jositions ot which arc not yct ﬁxcd
• As a conscqucncc wc can sj¦it thc codc word into two jarts
◦ a ﬁxed preﬁx, which is a¦so ¦ui¦t ¦y an cd¸c¦ycd¸c jroccdurc, and
◦ a volatile suﬃx, which consists ot thc additiona¦ (rin¸) cd¸cs
Christian Borgelt Frequent Pattern Mining 402
Keeping NonCanonical Fragments
• Fixed preﬁx of a code word:
Thc jrcﬁx ot thc codc word uj to (and inc¦udin¸)
thc ¦ast cd¸c addcd in an cd¸c¦ycd¸c manncr
• Volatile suﬃx of a code word:
Thc suﬃx ot thc codc word attcr thc ¦ast cd¸c
addcd in an cd¸c¦ycd¸c manncr (and cxc¦udin¸ it)
• Rule for keeping noncanonical fragments:
If the current code word deviates from the canonical code word
in the ﬁxed part, the fragment is pruned, otherwise it is kept.
• Justiﬁcation of this rule:
◦ lt thc dcviation is in thc ﬁxcd jart, no ¦atcr addition ot cd¸cs
can havc any cﬀcct on it, sincc thc ﬁxcd jart wi¦¦ ncvcr ¦c chan¸cd
◦ lt, howcvcr, thc dcviation is in thc vo¦ati¦c jart, a ¦atcr cxtcnsion cd¸c
may ¦c inscrtcd in such a way that thc codc word ¦ccomcs canonica¦
Christian Borgelt Frequent Pattern Mining 403
Search Tree for an Asymmetric Ring with Branches
`aintain (and cxtcnd) ¦oth ordcrin¸s ot thc rin¸ cd¸cs and
a¦¦ow tor dcviations trom thc canonica¦ torm ¦cyond ﬁxed cd¸cs
N
N O O
0
2
4
5
3
1
N O O
0
1
3
5
4
2
N O O
0
2
5
6
4
1
3
N O O
0
2
5
6
3
1
4
N O O
0
1
3
6
5
2
4
N O O
0
1
4
6
5
2
3
N O O
0
2
6
7
4
1
3 5
N O O
0
1 2
3
4 6
5
7
Thc cd¸cs ot a ¸rown su¦¸rajh arc sj¦it into
• ﬁxed edges (cd¸cs that cou¦d havc ¦ccn addcd in an cd¸c¦ycd¸c manncr),
• volatile edges (cd¸cs that havc ¦ccn addcd with rin¸ cxtcnsions
and ¦ctorc,¦ctwccn which cd¸cs may ¦c inscrtcd)
Christian Borgelt Frequent Pattern Mining 404
Search Tree for an Asymmetric Ring with Branches
• Thc scarch constructs thc rin¸ with ¦oth jossi¦¦c num¦crin¸s ot thc vcrticcs
◦ Thc torm on thc ¦ctt is canonic, so it is kcjt
◦ ln thc tra¸mcnt on thc ri¸ht on¦y thc ﬁrst rin¸ ¦ond is ﬁxcd,
a¦¦ othcr ¦onds arc vo¦ati¦c
Sincc thc codc word tor this tra¸mcnt dcviatcs trom thc canonica¦ onc
on¦y at thc `th ¦ond, wc may not discard it
• On thc ncxt ¦cvc¦, thcrc arc two canonica¦ and two noncanonica¦ tra¸mcnts
Thc noncanonica¦ tra¸mcnts ¦oth diﬀcr in thc ﬁxcd jart,
which now consists ot thc ﬁrst thrcc ¦onds, and thus arc jruncd
• On thc third ¦cvc¦, thcrc is onc canonica¦ and onc noncanonica¦ tra¸mcnt
Thc noncanonica¦ tra¸mcnt diﬀcrs in thc vo¦ati¦c jart (thc ﬁrst tour ¦onds
arc ﬁxcd, ¦ut it dcviatcs trom thc canonica¦ codc word on¦y in thc ¨th ¦ond)
and thus may not ¦c jruncd trom thc scarch
Christian Borgelt Frequent Pattern Mining 405
Connected and Nested Rings
Connected and nested rings can josc jro¦¦cms, ¦ccausc in thc jrcscncc ot
equivalent edges thc ordcr ot thcsc cd¸cs cannot ¦c dctcrmincd ¦oca¦¦y
N
0
1
5
8 6
2
4
3
7 9
5
8 6
2
4
7
N
N
N
0
1
3 5 4
2
N
0
1
3 7 6
2
5
4
N
0
1
5 7 6
2
4
3
N
0
1
3
6 5
2
4
4
8 7
N
0
1
3
6 5
2
4
8
9 7
• Ld¸cs arc (¦oca¦¦y) equivalent it thcy start trom thc samc vcrtcx, havc thc samc
cd¸c attri¦utc, and ¦cad to vcrticcs with thc samc vcrtcx attri¦utc
• Lquiva¦cnt cd¸cs must ¦c spliced in a¦¦ ways, in which thc ordcr ot thc cd¸cs
a¦rcady in thc (su¦)¸rajh and thc ordcr ot thc ncw¦y addcd cd¸cs is jrcscrvcd
• lt is ncccssary to considcr pseudorings tor cxtcnsions,
¦ccausc othcrwisc not a¦¦ ordcrs ot cquiva¦cnt cd¸cs arc ¸cncratcd
Christian Borgelt Frequent Pattern Mining 406
Splicing Equivalent Edges
• ln jrincij¦c, all possible orders of equivalent edges havc to ¦c considcrcd,
¦ccausc any ot thcm may in thc cnd yic¦d thc canonica¦ torm
\c cannot (a¦ways) dccidc ¦oca¦¦y which is thc ri¸ht ordcr,
¦ccausc this may dcjcnd on cd¸cs addcd ¦atcr
• `cvcrthc¦css, wc may not rcordcr cquiva¦cnt cd¸cs trcc¦y,
as this wou¦d intcrtcrc with kccjin¸ ccrtain noncanonica¦ tra¸mcnts
Ly kccjin¸ somc noncanonica¦ tra¸mcnts wc a¦rcady considcr somc variants
ot ordcrs ot cquiva¦cnt cd¸cs Thcsc must not ¦c ¸cncratcd a¸ain
• Splicing rule for equivalent edges: (¦rcadthﬁrst scarch canonica¦ torm)
Thc ordcr ot thc cquiva¦cnt cd¸cs a¦rcady in thc tra¸mcnt must ¦c maintaincd,
and thc ordcr ot thc cquiva¦cnt ncw cd¸cs must ¦c maintaincd
Thc two scqucnccs ot cquiva¦cnt cd¸cs may ¦c mcr¸cd in a “zijjcr¦ikc” manncr,
sc¦cctin¸ thc ncxt cd¸c trom cithcr ¦ist, ¦ut jrcscrvin¸ thc ordcr in cach ¦ist
Christian Borgelt Frequent Pattern Mining 407
The Necessity of PseudoRings
Thc splicing rule cxj¦ains thc ncccssity ot pseudorings
\ithout jscudorin¸s it is imjossi¦¦c to achicvc canonica¦ torm in somc cascs
N
0
1
5
8 6
2
4
3
7 9
5
8 6
2
4
7
N
N
N
0
1
3 5 4
2
N
0
1
3 7 6
2
5
4
N
0
1
5 7 6
2
4
3
N
0
1
3
6 5
2
4
4
8 7
N
0
1
3
6 5
2
4
8
9 7
• lt wc cou¦d on¦y add thc `rin¸ and thc orin¸, ¦ut not thc 3rin¸,
thc ujward ¦ond trom thc atom num¦crcd 1 wou¦d a¦ways jrcccdc
at ¦cast onc ot thc othcr two ¦onds that arc cquiva¦cnt to it
(sincc thc ordcr ot cxistin¸ ¦onds must ¦c jrcscrvcd)
• Lowcvcr, in thc canonica¦ torm thc ujward ¦ond succccds ¦oth othcr ¦onds,
and this wc can achicvc on¦y ¦y addin¸ thc 3¦ond rin¸ ﬁrst
Christian Borgelt Frequent Pattern Mining 408
Splicing Equivalent Edges
• Thc considcrcd splicing rule is tor a ¦rcadthﬁrst scarch canonica¦ torm
ln this torm cquiva¦cnt cd¸cs arc ad,accnt in thc canonica¦ codc word
• ln a dcjthﬁrst scarch canonica¦ torm cquiva¦cnt cd¸cs
can ¦c tar ajart trom cach othcr in thc codc word
`cvcrthc¦css somc “sj¦icin¸” is ncccssary to jrojcr¦y trcat cquiva¦cnt cd¸cs
in this canonica¦ torm, cvcn thou¸h thc ru¦c is s¦i¸ht¦y simj¦cr
• Splicing rule for equivalent edges: (dcjthﬁrst scarch canonica¦ torm)
Thc ﬁrst ncw rin¸ cd¸c has to ¦c tricd in a¦¦ ¦ocations in thc vo¦ati¦c jart
ot thc codc word, whcrc cquiva¦cnt cd¸cs can ¦c tound
• Sincc wc cannot dccidc ¦oca¦¦y which ot thcsc cd¸cs shou¦d ¦c to¦¦owcd ﬁrst
whcn ¦ui¦din¸ thc sjannin¸ trcc, wc havc to try a¦¦ ot thcsc jossi¦i¦itics
in ordcr not to miss thc canonica¦ onc
Christian Borgelt Frequent Pattern Mining 409
Avoiding Duplicate Fragments
• Thc sj¦icin¸ ru¦cs sti¦¦ a¦¦ow that thc samc tra¸mcnt can ¦c rcachcd in thc samc
torm in diﬀcrcnt ways, namc¦y ¦y addin¸ (ncstcd) rin¸s in diﬀcrcnt ordcrs
lcason wc cannot a¦ways distin¸uish ¦ctwccn two diﬀcrcnt ordcrs
in which two rin¸s sharin¸ a vcrtcx arc addcd
• `ccdcd an augmented canonical form test
• Ideas undcr¦yin¸ such an au¸mcntcd tcst
◦ Thc rcquircmcnt ot comj¦ctc rin¸s introduccs dcjcndcnccs ¦ctwccn cd¸cs
Thc jrcscncc ot ccrtain cd¸cs enforces thc jrcscncc ot ccrtain othcr cd¸cs
◦ Thc samc codc word ot a tra¸mcnt is crcatcd scvcra¦ timcs,
¦ut cach timc with a diﬀerent ﬁxed part
Thc josition ot thc ﬁrst cd¸c ot a rin¸ cxtcnsion (attcr rcordcrin¸)
is thc cnd ot thc ﬁxcd jart ot thc (cxtcndcd) codc word
Christian Borgelt Frequent Pattern Mining 410
Ring Key Pruning
Dependences between Edges
• Thc rcquircmcnt ot comj¦ctc rin¸s introduccs dcjcndcnccs ¦ctwccn cd¸cs
(ldca considcr tormin¸ su¦tra¸mcnts with on¦y comj¦ctc rin¸s)
• A rin¸ cd¸c e
1
ot a tra¸mcnt enforces the presence ot anothcr rin¸ cd¸c e
2
iﬀ thc sct ot rin¸s containin¸ e
1
is a su¦sct ot thc sct ot rin¸s containin¸ e
2
◦ ln ordcr tor a rin¸ cd¸c to ¦c jrcscnt in a su¦tra¸mcnt,
at ¦cast onc ot thc rin¸s containin¸ it must ¦c jrcscnt
◦ lt a rin¸ cd¸c e
1
cntorccs a rin¸ cd¸c e
2
, it is not jossi¦¦c to torm
a su¦tra¸mcnt with on¦y comj¦ctc rin¸s that contains e
1
, ¦ut not e
2
◦ O¦vious¦y, cvcry rin¸ cd¸c cntorccs at ¦cast its own jrcscncc
◦ ln ordcr to cajturc a¦so nonrin¸ cd¸cs ¦y such a dcﬁnition,
wc dcﬁnc that a nonrin¸ cd¸c cntorccs on¦y its own jrcscncc
Christian Borgelt Frequent Pattern Mining 411
Ring Key Pruning
Example of Dependences between Edges
N 0 3
5
4 1
2
N 0 3
5
4 2
1
N 0 2
4
3 1
2
N 0 2
4
5 3
1
(A¦¦ cd¸c dcscrijtions rctcr to thc vcrtcx num¦crin¸ in thc tra¸mcnt on thc ¦ctt)
• ln thc tra¸mcnt on thc ¦ctt, any cd¸c in thc sct ¦(0, 3), (1, !), (3, `), (!, `)¦
cntorccs thc jrcscncc ot any othcr cd¸c in this sct, ¦ccausc a¦¦
ot thcsc cd¸cs arc containcd cxact¦y in thc `rin¸ and thc orin¸
• ln thc samc way, thc cd¸cs (0, 2) and (1, 2) cntorcc cach othcr,
¦ccausc ¦oth arc containcd cxact¦y in thc 3rin¸ and thc orin¸
• Thc cd¸c (0, 1), howcvcr, on¦y cntorccs itsc¦t and is cntorccd on¦y ¦y itsc¦t
• Thcrc arc no othcr cntorccmcnt rc¦ations ¦ctwccn cd¸cs
Christian Borgelt Frequent Pattern Mining 412
Ring Key Pruning
(Shortest) Ring Keys
• \c considcr jrcﬁxcs ot codc words that contain !k + 1 charactcrs,
k ∈ ¦0, 1, . . . , m¦, whcrc m is thc num¦cr ot cd¸cs ot thc tra¸mcnt
• A jrcﬁx v ot a codc word vw (whcthcr canonica¦ or not) is ca¦¦cd a ring key
iﬀ cach cd¸c dcscri¦cd in w is cntorccd ¦y at ¦cast onc cd¸c dcscri¦cd in v
• Thc jrcﬁx v is ca¦¦cd a shortest ring key ot vw iﬀ it is a rin¸ kcy
and thcrc is no shortcr jrcﬁx that is a rin¸ kcy tor vw
`otc Thc shortcst rin¸ kcy ot a codc word is uniquc¦y dcﬁncd,
¦ut dcjcnds, ot coursc, on thc considcrcd codc word
• ldca ot (Shortcst) lin¸ Icy lrunin¸
Liscard tra¸mcnts that arc tormcd with a codc word,
thc ﬁxcd jart ot which is not a shortcst rin¸ kcy
Christian Borgelt Frequent Pattern Mining 413
Ring Key Pruning
• Example ot (shortcst) rin¸ kcy(s)
N 0 3
5
4 1
2
Lrcadthﬁrst scarch (canonica¦) codc word
N 0C1 0C2 0C3 1C2 1C4 3C5 4C5
Ld¸cs e
1
e
2
e
3
e
!
e
`
e
o
e
¨
• N is o¦vious¦y not a rin¸ kcy, ¦ccausc it cntorccs no cd¸cs
• N 0C1 is not a rin¸ kcy, ¦ccausc it docs not cntorcc, tor cxamj¦c, e
2
or e
3
• N 0C1 0C2 is not a rin¸ kcy, ¦ccausc it docs not cntorcc, tor cxamj¦c, e
3
• N 0C1 0C2 0C3 is thc shortcst rin¸ kcy, ¦ccausc
e
!
(1, 2) is cntorccd ¦y e
2
(0, 2) and
e
`
(1, !), e
o
(3, `) and e
¨
(!, `) arc cntorccd ¦y e
3
(0, 3)
• Any ¦on¸cr jrcﬁx is a rin¸ kcy, ¦ut not a shortcst rin¸ kcy
Christian Borgelt Frequent Pattern Mining 414
Ring Key Pruning
• lt on¦y codc words with ﬁxcd jarts that arc shortcst rin¸ kcys arc cxtcndcd,
it suﬃccs to chcck whcthcr thc ﬁxcd jart is a rin¸ kcy
• Anchor lt a tra¸mcnt contains on¦y onc rin¸, thc ﬁrst rin¸ cd¸c cntorccs
thc othcr rin¸ cd¸cs and thus thc ﬁxcd jart is a shortcst rin¸ kcy
• lnduction stcj
◦ Lct vw ¦c a codc word with ﬁxcd jart v and vo¦ati¦c jart w,
tor which thc jrcﬁx v is a shortcst rin¸ kcy
◦ Lxtcndin¸ this codc word ¸cncra¦¦y transtorms it into a codc word vuxw
′
u dcscri¦cs cd¸cs ori¸ina¦¦y dcscri¦cd ¦y jarts ot w (u may ¦c cmjty),
x is thc dcscrijtion ot thc ﬁrst ncw cd¸c and
w
′
dcscri¦cs thc rcmainin¸ o¦d and ncw cd¸cs
◦ Thc codc word vuxw
′
cannot havc a shortcr rin¸ kcy than vux,
¦ccausc thc cd¸cs dcscri¦cd in vu do not cntorcc thc cd¸c dcscri¦cd ¦y x
Christian Borgelt Frequent Pattern Mining 415
Ring Key Pruning
Test Procedure of Ring Key Pruning
• Chcck tor cach vo¦ati¦c cd¸c whcthcr it is cntorccd ¦y at ¦cast onc ﬁxcd cd¸c
◦ `ark a¦¦ rin¸s in thc considcrcd tra¸mcnt (sct rin¸ ﬂa¸s)
◦ lcmovc a¦¦ rin¸s containin¸ a ¸ivcn vo¦ati¦c cd¸c e (c¦car rin¸ ﬂa¸s)
◦ lt ¦y this jroccdurc a ﬁxcd rin¸ cd¸c ¦ccomcs ﬂa¸¦css,
thc cd¸c e is cntorccd ¦y it, othcrwisc thc cd¸c e is not cntorccd
• Example:
N 0 2
4
3 1
2
N 0 3
5
4 1
2
N 0 3
5
4 1
2
◦ Lxtcndin¸ thc `rin¸ yic¦ds thc tra¸mcnt on thc ri¸ht in canonica¦ torm
with thc ﬁrst two cd¸cs (that is, e
1
(0, 1) and e
2
(0, 2)) ﬁxcd
◦ Thc jrcﬁx N 0C1 0C2 is not a rin¸ kcy (thc ¸rcy cd¸cs arc not cntorccd)
and hcncc thc tra¸mcnt is discardcd, cvcn thou¸h it is in canonica¦ torm
Christian Borgelt Frequent Pattern Mining 416
Search Tree for Nested Rings
N
N 0 3
5
4 2
1
N 0 2
4
3 1
2
N 0 2
4
5 3
1
N 0 3
5
4 1
2
N 0 2
5
4 1
3
N 0 3
5
4 1
2
N 0 3
5
4 1
2
N 0 2
4
5 3
1
N 0 3
5
4 2
1
N 0 3
5
4 1
2
N 0 3
5
4 1
2
N 0 2
5
4 1
3
N 0 3
5
4 1
2
N 0 1
4
5 3
2
N 0 1
4
5 2
3
N 0 3
5
4 1
2
N 0 3
5
4 2
1
N 0 2
4
5 3
1
N 0 3
5
4 1
2
N 0 1
4
5 2
3
N 0 1
4
5 3
2
also in
canonical
form
(so¦id tramc cxtcndcd and rcjortcd. dashcd tramc cxtcndcd, ¦ut not rcjortcd. no tramc jruncd)
• Thc tu¦¦ tra¸mcnt is ¸cncratcd twicc in cach torm (cvcn thc canonica¦)
• Augmented Canonical Form Test:
◦ Thc crcatcd codc words havc diﬀcrcnt ﬁxcd jarts
◦ Chcck whcthcr thc ﬁxcd jart is a shortcst rin¸ kcy
Christian Borgelt Frequent Pattern Mining 417
Search Tree for Nested Rings
• ln a¦¦ tra¸mcnts in thc ¦ottom row ot thc scarch trcc (tra¸mcnts with tramcs)
thc ﬁrst thrcc cd¸cs arc ﬁxcd, thc rcst is vo¦ati¦c
Thc jrcﬁx N 0C1 0C2 0C3 dcscri¦in¸ thcsc cd¸cs is a shortcst rin¸ kcy
Lcncc thcsc tra¸mcnts arc kcjt and jroccsscd
• ln thc row a¦ovc it (tra¸mcnts without tramcs),
on¦y thc ﬁrst two cd¸cs arc ﬁxcd, thc rcst is vo¦ati¦c
Thc jrcﬁx N 0C1 0C2 dcscri¦in¸ thcsc cd¸cs is not a rin¸ kcy
(Thc ¸rcy cd¸cs arc not cntorccd) Lcncc thcsc tra¸mcnts arc discardcd
• `otc that tor a¦¦ sin¸¦c rin¸ tra¸mcnts two ot thcir tour chi¦drcn arc kcjt,
cvcn thou¸h on¦y thc onc at thc ¦ctt ¦ottom is in canonica¦ torm
Thc rcason is that thc dcviation trom thc canonica¦ torm rcsidcs
in thc vo¦ati¦c jart ot thc tra¸mcnt
Ly attachin¸ additiona¦ rin¸s any ot thcsc tra¸mcnts may ¦ccomc canonica¦
Christian Borgelt Frequent Pattern Mining 418
Experiments: IC93
2 2.5 3 3.5 4 4.5 5
0
5
10
15
20
25
time/seconds
reorder
merge rings
close rings
2 2.5 3 3.5 4 4.5 5
0
2
4
6
8 fragments/10
4
reorder
merge rings
close rings
2 2.5 3 3.5 4 4.5 5
0
1
2
3
4
5 occurrences/10
6
reorder
merge rings
close rings
Lxjcrimcnta¦ rcsu¦ts on thc lC93
data Thc horizonta¦ axis shows thc
minima¦ sujjort in jcrccnt Thc
curvcs show thc num¦cr ot ¸cncratcd
tra¸mcnts (toj ¦ctt), thc num¦cr ot
jroccsscd occurrcnccs (toj ri¸ht), and
thc cxccution timc in scconds (¦ottom
¦ctt) tor thc thrcc diﬀcrcnt stratc¸ics
Christian Borgelt Frequent Pattern Mining 419
Experiments: NCI HIV Screening Database
0.5 1 1.5 2 2.5 3 3.5 4
0
50
100
150
time/seconds
reorder
merge rings
close rings
0.5 1 1.5 2 2.5 3 3.5 4
0
1
2
3
fragments/10
4
reorder
merge rings
close rings
0.5 1 1.5 2 2.5 3 3.5 4
0
2
4
6
8
occurrences/10
7
reorder
merge rings
close rings
Lxjcrimcnta¦ rcsu¦ts on thc Ll\ data
Thc horizonta¦ axis shows thc minima¦
sujjort in jcrccnt Thc curvcs show
thc num¦cr ot ¸cncratcd tra¸mcnts
(toj ¦ctt), thc num¦cr ot jroccsscd oc
currcnccs (toj ri¸ht), and thc cxccu
tion timc in scconds (¦ottom ¦ctt) tor
thc thrcc diﬀcrcnt stratc¸ics
Christian Borgelt Frequent Pattern Mining 420
Found Molecular Fragments
Christian Borgelt Frequent Pattern Mining 421
NCI DTP HIV Antiviral Screen: AZT
N N N O
O
N
N
O
O
O
O
N
N
N
N N N O
O
N
N
O
O
O
N N N O
O
N
N
O
O
O
P
O
O
O
O
O
N N N O
O
N
N
O
O
O
O
O
O
O
N N N O
O
N
N
O
O
Some Molecules from the NCI HIV Database
Common Fragment
Christian Borgelt Frequent Pattern Mining 422
NCI DTP HIV Antiviral Screen: Other Fragments
N
N
S
O
O
O
Ira¸mcnt 1
CA `23/
Cl,C` 00`/
N
N
S
O
O
O
Ira¸mcnt 2
CA !92/
Cl,C` 00¨/
N
N
O
Ira¸mcnt 3
CA `23/
Cl,C` 00S/
O N
O
P
O
O
Ira¸mcnt !
CA 9S`/
Cl,C` 00¨/
N
N
O
O O
O
O
Ira¸mcnt `
CA 101`/
Cl,C` 00!/
S
N Cl
Ira¸mcnt o
CA 9S`/
Cl,C` 000/
Christian Borgelt Frequent Pattern Mining 423
Experiments: Ring Extensions
Improved Interpretability
N
Ira¸mcnt 1
¦asic a¦¸orithm
trcq in CA 22¨¨/
N
Ira¸mcnt 2
with rin¸ cxtcnsions
trcq in CA 2000/
O
O
S
N
O
O
`SC ]oo¨9!S
N
S
N
O
`SC ]o9So01
Comjounds trom thc `Cl canccr data sct that contain Ira¸mcnt 1 ¦ut not 2
Christian Borgelt Frequent Pattern Mining 424
Experiments: Carbon Chains
• Tcchnica¦¦y Add a car¦on chain in onc stcj, i¸norin¸ its ¦cn¸th
• Lxtcnsion ¦y a car¦on chain match rc¸ard¦css ot thc chain ¦cn¸th
• Advanta¸c Ira¸mcnts can rcjrcscnt car¦on chains ot varyin¸ ¦cn¸th
Example from the NCI Cancer Dataset:
Ira¸mcnt with Chain
N
N
C*
trcq CA 1!S/
trcq Cl 013/
Actua¦ Structurcs
N
N
N
N
Christian Borgelt Frequent Pattern Mining 425
Experiments: Wildcard Atoms
• Lcﬁnc c¦asscs ot atoms that can ¦c considcrcd as cquiva¦cnt
• Com¦inc tra¸mcnt cxtcnsions with cquiva¦cnt atoms
• Advanta¸c lntrcqucnt tra¸mcnts that diﬀcr on¦y in a tcw atoms
trom trcqucnt tra¸mcnts can ¦c tound
Examples from the NCI HIV Dataset:
A
N
S
Cl
AO A`
CA ``/ 3¨/
Cl,C` 00/ 00/
N
Cl
S
B
LO LS
CA ``/ 001/
Cl,C` 00/ 00/
Christian Borgelt Frequent Pattern Mining 426
Summary Frequent (Sub)Graph Mining
• Ircqucnt (su¦)¸rajh minin¸ is c¦osc¦y rc¦atcd to trcqucnt itcm sct minin¸
Find frequent (sub)graphs instcad ot trcqucnt su¦scts
• A corc jro¦¦cm ot trcqucnt (su¦)¸rajh minin¸ is how to avoid rcdundant scarch
This jro¦¦cm is so¦vcd with thc hc¦j ot canonical forms of graphs
Liﬀcrcnt canonica¦ torms ¦cad to diﬀcrcnt ¦chavior ot thc scarch a¦¸orithm
• Thc rcstriction to closed fragments is a ¦oss¦css rcduction ot thc outjut
A¦¦ trcqucnt tra¸mcnts can ¦c rcconstructcd trom thc c¦oscd oncs
• A rcstriction to c¦oscd tra¸mcnts a¦¦ows tor additiona¦ jrunin¸ stratc¸ics
jartia¦ and tu¦¦ perfect extension pruning
• Lxtcnsions ot thc ¦asic a¦¸orithm (jarticu¦ar¦y usctu¦ tor mo¦ccu¦cs) inc¦udc
Ring Mining, (Carbon) Chain Mining, and Wildcard Vertices
• A Java implementation tor mo¦ccu¦ar tra¸mcnt minin¸ is avai¦a¦¦c at
http://www.borgelt.net/moss.html
Christian Borgelt Frequent Pattern Mining 427
Mining a Single Graph
Christian Borgelt Frequent Pattern Mining 428
Reminder: Basic Notions
• A labeled or attributed graph is a trij¦c G (V, E, ℓ), whcrc
◦ V is thc sct ot vcrticcs,
◦ E ⊆ V V −¦(v, v) [ v ∈ V ¦ is thc sct ot cd¸cs, and
◦ ℓ V ∪ E → A assi¸ns ¦a¦c¦s trom thc sct A to vcrticcs and cd¸cs
• Lct G (V
G
, E
G
, ℓ
G
) and S (V
S
, E
S
, ℓ
S
) ¦c two ¦a¦c¦cd ¸rajhs
A subgraph isomorphism ot S to G or an occurrence ot S in G
is an in,cctivc tunction f V
S
→ V
G
with
◦ ∀v ∈ V
S
ℓ
S
(v) ℓ
G
(f(v)) and
◦ ∀(u, v) ∈ E
S
(f(u), f(v)) ∈ E
G
∧ ℓ
S
((u, v)) ℓ
G
((f(u), f(v)))
That is, thc majjin¸ f jrcscrvcs thc conncction structurc and thc ¦a¦c¦s
Christian Borgelt Frequent Pattern Mining 429
AntiMonotonicity of Subgraph Support
`ost natura¦ dcﬁnition ot su¦¸rajh sujjort in a sin¸¦c ¸rajh scttin¸
number of occurrences (su¦¸rajh isomorjhisms)
Problem: Thc num¦cr ot occurrcnccs ot a su¦¸rajh is not antimonotone
Lxamj¦c
injut ¸rajh
s
G
(A) 1
A B B
su¦¸rajhs
A
s
G
(B−A−B) 2
A B B
2 1 3
occurrcnccs
B B A
A B B
1
A B B
1 2 3
3 2
But: Antimonotonicity is vita¦ tor thc cﬃcicncy ot trcqucnt su¦¸rajh minin¸
Question: Low shou¦d wc dcﬁnc su¦¸rajh sujjort in a sin¸¦c ¸rajh´
Christian Borgelt Frequent Pattern Mining 430
AntiMonotonicity of Subgraph Support
`ost natura¦ dcﬁnition ot su¦¸rajh sujjort in a sin¸¦c ¸rajh scttin¸
number of occurrences (su¦¸rajh isomorjhisms)
Problem: Thc num¦cr ot occurrcnccs ot a su¦¸rajh is not antimonotone
Lxamj¦c
injut ¸rajh
s
G
(A) 1
A B B
su¦¸rajhs
A
s
G
(A−B) 2
A B
s
G
(B−A−B) 2
A B B
2 1 3
occurrcnccs
B B A
B A B
B A B
A B B
1
A B B
1 2 3
3 2
But: Antimonotonicity is vita¦ tor thc cﬃcicncy ot trcqucnt su¦¸rajh minin¸
Question: Low shou¦d wc dcﬁnc su¦¸rajh sujjort in a sin¸¦c ¸rajh´
Christian Borgelt Frequent Pattern Mining 431
Relations between Occurrences
• Lct f
1
and f
2
two su¦¸rajh isomorjhisms ot S to G and
V
1
¦v ∈ V
G
[ ∃u ∈ V
S
v f
1
(u)¦ and
V
2
¦v ∈ V
G
[ ∃u ∈ V
S
v f
2
(u)¦
Thc two su¦¸rajh isomorjhisms f
1
and f
2
arc ca¦¦cd
◦ overlapping, writtcn f
1
◦◦f
2
, iﬀ V
1
∩ V
2
, ∅,
◦ equivalent, writtcn f
1
◦f
2
, iﬀ V
1
V
2
,
◦ identical, writtcn f
1
≡ f
2
, iﬀ ∀v ∈ V
S
f
1
(v) f
2
(v)
• `otc that idcntica¦ su¦¸rajh isomorjhisms arc cquiva¦cnt
and that cquiva¦cnt su¦¸rajh isomorjhisms arc ovcr¦ajjin¸
• Thcrc can ¦c nonidcntica¦, ¦ut cquiva¦cnt su¦¸rajh isomorjhisms,
namc¦y it S josscsscs an automorjhism that is not thc idcntity
Christian Borgelt Frequent Pattern Mining 432
Overlap Graphs of Occurrences
Lct G (V
G
, E
G
, ℓ
G
) and S (V
S
, E
S
, ℓ
S
) ¦c two ¦a¦c¦cd ¸rajhs and
¦ct V
O
¦c thc sct ot a¦¦ occurrcnccs (su¦¸rajh isomorjhisms) ot S in G
Thc overlap graph ot S wrt G is thc ¸rajh O (V
O
, E
O
),
which has thc sct V
O
ot occurrcnccs ot S in G as its vcrtcx sct
and thc cd¸c sct E
O
¦(f
1
, f
2
) [ f
1
, f
2
∈ V
O
∧ f
1
,≡ f
2
∧ f
1
◦◦f
2
¦
Example:
injut ¸rajh
B A B A B
su¦¸rajh
B A B
A B B A B
A B B A B
B A B A B
B A B A B
3 1 2
2 1 3
2 1 3
3 1 2
Christian Borgelt Frequent Pattern Mining 433
Maximum Independent Set Support
Lct G (V, E) ¦c an (undircctcd) ¸rajh with vcrtcx sct V
and cd¸c sct E ⊆ V V −¦(v, v) [ v ∈ V ¦
An independent vertex set ot G is a sct I ⊆ V with ∀u, v ∈ I (u, v) / ∈ E
I is a maximum independent vertex set iﬀ
• it is an indcjcndcnt vcrtcx sct and
• tor a¦¦ indcjcndcnt vcrtcx scts J ot G it is [I[ ≥ [J[
`otcs Iindin¸ a maximum indcjcndcnt vcrtcx sct is an `lcomj¦ctc jro¦¦cm
Lowcvcr, a ¸rccdy a¦¸orithm usua¦¦y ¸ivcs vcry ¸ood ajjroximations
Lct O (V
O
, E
O
) ¦c thc ovcr¦aj ¸rajh ot thc occurrcnccs
ot a ¦a¦c¦cd ¸rajh S (V
S
, E
S
, ℓ
S
) in a ¦a¦c¦cd ¸rajh G (V
G
, E
G
, ℓ
G
)
Thc maximum independent set support (or MISsupport tor short)
ot S wrt G is thc sizc ot a maximum indcjcndcnt vcrtcx sct ot O
Christian Borgelt Frequent Pattern Mining 434
Finding a Maximum Independent Set
• ¹nmark a¦¦ vcrticcs ot thc ovcr¦aj ¸rajh
• Exact Backtracking Algorithm
◦ Iind an unmarkcd vcrtcx with maximum dc¸rcc and try two jossi¦i¦itics
◦ Sc¦cct it tor thc `lS, that is, mark it as sc¦cctcd and
mark a¦¦ ot its nci¸h¦ors as cxc¦udcd
◦ Lxc¦udc it trom thc `lS, that is, mark it as cxc¦udcd
◦ lroccss thc rcst rccursivc¦y and rccord ¦cst so¦ution tound
• Heuristic Greedy Algorithm
◦ Sc¦cct a vcrtcx with thc minimum num¦cr ot unmarkcd nci¸h¦ors and
mark a¦¦ ot its nci¸h¦ors as cxc¦udcd
◦ lroccss thc rcst ot thc ¸rajh rccursivc¦y
• ln ¦oth a¦¸orithms vcrticcs with ¦css than two unmarkcd nci¸h¦ors
can ¦c sc¦cctcd and a¦¦ ot thcir nci¸h¦ors markcd as cxc¦udcd
Christian Borgelt Frequent Pattern Mining 435
AntiMonotonicity of MISSupport: Preliminaries
Lct G (V
G
, E
G
, ℓ
G
) and S (V
S
, E
S
, ℓ
S
) ¦c two ¦a¦c¦cd ¸rajhs
Lct T (V
T
, E
T
, ℓ
T
) a (noncmjty) jrojcr su¦¸rajh ot S
(that is, V
T
⊂ V
S
, E
T
(V
T
V
T
) ∩ E
S
, and ℓ
T
≡ ℓ
S
[
V
T
∪E
T
)
Lct f ¦c an occurrcncc ot S in G
An occurrcncc f
′
ot thc su¦¸rajh T is ca¦¦cd a Tancestor ot thc occurrcncc f
iﬀ f
′
≡ f[
V
T
, that is, it f
′
coincidcs with f on thc vcrtcx sct V
T
ot T
Observations:
Ior ¸ivcn G, S, T and f thc Tanccstor f
′
ot thc occurrcncc f is uniquc¦y dcﬁncd
Lct f
1
and f
2
¦c two (nonidcntica¦, ¦ut may¦c cquiva¦cnt) occurrcncc ot S in G
f
1
and f
2
ovcr¦aj it thcrc cxist ovcr¦ajjin¸ Tanccstors f
′
1
and f
′
2
ot thc occurrcnccs f
1
and f
2
, rcsjcctivc¦y
(`otc Thc invcrsc imj¦ication docs not ho¦d ¸cncra¦¦y)
Christian Borgelt Frequent Pattern Mining 436
AntiMonotonicity of MISSupport: Proof
Theorem: `lSsujjort is antimonotonc
Proof: \c havc to show that thc `lSsujjort ot a su¦¸rajh S wrt a ¸rajh G
cannot cxcccd thc `lSsujjort ot any (noncmjty) jrojcr su¦¸rajh T ot S
• Lct I ¦c an ar¦itrary indcjcndcnt vcrtcx sct ot thc ovcr¦aj ¸rajh O ot S wrt G
• Thc sct I induccs a su¦sct I
′
ot thc vcrticcs ot thc ovcr¦aj ¸rajh O
′
ot an (ar¦itrary, ¦ut ﬁxcd) su¦¸rajh T ot thc considcrcd su¦¸rajh S,
which consists ot thc (uniquc¦y dcﬁncd) Tanccstors ot thc vcrticcs in I
• lt is [I[ [I
′
[, ¦ccausc no two c¦cmcnts ot I can havc thc samc Tanccstor
• \ith simi¦ar ar¸umcnt I
′
is an indcjcndcnt vcrtcx sct ot thc ovcr¦aj ¸rajh O
′
• As a conscqucncc, sincc I is ar¦itrary, cvcry indcjcndcnt vcrtcx sct ot O
induccs an indcjcndcnt vcrtcx sct ot O
′
ot thc samc sizc
• Lcncc thc maximum indcjcndcnt vcrtcx sct ot O
′
must ¦c at ¦cast as ¦ar¸c as thc maximum indcjcndcnt vcrtcx sct ot O
Christian Borgelt Frequent Pattern Mining 437
Harmful and Harmless Overlaps of Occurrences
`ot a¦¦ ovcr¦ajs ot occurrcnccs arc harmtu¦
injut ¸rajh A B C A B C A
su¦¸rajh A B C A
occurrcnccs B C A A B C A
A B C A B C A
Lct G (V
G
, E
G
, ℓ
G
) and S (V
S
, E
S
, ℓ
S
) ¦c two ¦a¦c¦cd ¸rajhs and
¦ct f
1
and f
2
¦c two occurrcnccs (su¦¸rajh isomorjhisms) ot S to G
f
1
and f
2
arc ca¦¦cd harmfully overlapping, writtcn f
1
••f
2
, iﬀ
• thcy arc cquiva¦cnt or Iicd¦cr and Lor¸c¦t 200¨
• thcrc cxists a (noncmjty) jrojcr su¦¸rajh T ot S,
so that thc Tanccstors f
′
1
and f
′
2
ot f
1
and f
2
, rcsjcctivc¦y, arc cquiva¦cnt
Christian Borgelt Frequent Pattern Mining 438
Harmful Overlap Graphs and Subgraph Support
Lct G (V
G
, E
G
, ℓ
G
) and S (V
S
, E
S
, ℓ
S
) ¦c two ¦a¦c¦cd ¸rajhs and
¦ct V
H
¦c thc sct ot a¦¦ occurrcnccs (su¦¸rajh isomorjhisms) ot S in G
Thc harmful overlap graph ot S wrt G is thc ¸rajh H (V
H
, E
H
),
which has thc sct V
H
ot occurrcnccs ot S in G as its vcrtcx sct
and thc cd¸c sct E
H
¦(f
1
, f
2
) [ f
1
, f
2
∈ V
H
∧ f
1
,≡ f
2
∧ f
1
••f
2
¦
Lct H (V
H
, E
H
) ¦c thc harmtu¦ ovcr¦aj ¸rajh ot thc occurrcnccs
ot a ¦a¦c¦cd ¸rajh S (V
S
, E
S
, ℓ
S
) in a ¦a¦c¦cd ¸rajh G (V
G
, E
G
, ℓ
G
)
Thc harmful overlap support (or HOsupport tor short) ot thc ¸rajh S wrt G
is thc sizc ot a maximum indcjcndcnt vcrtcx sct ot H
Theorem: LOsujjort is antimonotonc
Proof: ldcntica¦ to jroot tor `lSsujjort
(Thc samc two o¦scrvations ho¦d, which wcrc a¦¦ that was nccdcd)
Christian Borgelt Frequent Pattern Mining 439
Harmful Overlap Graphs and Ancestor Relations
injut ¸rajh B A B A B
B B B A A A A B B B
B A B A B
B A B B A
B A B B A
B A B A B
A B B A B
A B B A B
B A B A B
B A B A B
3 1 2
2 1 3
2 1 3
3 1 2
Christian Borgelt Frequent Pattern Mining 440
Subgraph Support Computation
Chcckin¸ whcthcr two occurrcnccs ovcr¦aj is casy, ¦ut
How do we check whether two occurrences overlap harmfully?
Core ideas of the harmful overlap test:
• Try to construct a su¦¸rajh S
E
(V
E
, E
E
, ℓ
E
) that yic¦ds cquiva¦cnt anccstors
ot two ¸ivcn occurrcnccs f
1
and f
2
ot a ¸rajh S (V
S
, E
S
, ℓ
S
)
• Ior such a su¦¸rajh S
E
thc majjin¸ g V
E
→ V
E
with v → f
−1
2
(f
1
(v)),
whcrc f
−1
2
is thc invcrsc ot f
2
, must ¦c a ¦i,cctivc majjin¸
• `orc ¸cncra¦¦y, g must ¦c an automorphism ot S
E
,
that is, a su¦¸rajh isomorjhism ot S
E
to itsc¦t
• Lxj¦oit thc jrojcrtics ot automorjhism
to cxc¦udc vcrticcs trom thc ¸rajh S that cannot ¦c in V
E
Christian Borgelt Frequent Pattern Mining 441
Subgraph Support Computation
Input: Two (diﬀcrcnt) occurrcnccs f
1
and f
2
ot a ¦a¦c¦cd ¸rajh S (V
S
, E
S
, ℓ
S
)
in a ¦a¦c¦cd ¸rajh G (V
G
, E
G
, ℓ
G
)
Output: \hcthcr f
1
and f
2
ovcr¦aj harmtu¦¦y
1) Iorm thc scts V
1
¦v ∈ V
G
[ ∃u ∈ V
S
v f
1
(u)¦
and V
2
¦v ∈ V
G
[ ∃u ∈ V
S
v f
2
(u)¦
2) Iorm thc scts W
1
¦v ∈ V
S
[ f
1
(v) ∈ V
1
∩ V
2
¦
and W
2
¦v ∈ V
S
[ f
2
(v) ∈ V
1
∩ V
2
¦
3) lt V
E
W
1
∩ W
2
∅, rcturn false, othcrwisc rcturn true
• V
E
is thc vcrtcx sct ot a su¦¸rajh S
E
that induccs cquiva¦cnt anccstors
• Any vcrtcx v ∈ V
S
−V
E
cannot contri¦utc to such cquiva¦cnt anccstors
• Lcncc V
E
is a maxima¦ sct ot vcrticcs tor which g is a ¦i,cction
Christian Borgelt Frequent Pattern Mining 442
Restriction to Connected Subgraphs
Thc scarch tor trcqucnt su¦¸rajhs is usua¦¦y rcstrictcd to connected graphs
\c cannot conc¦udc that no cd¸c is nccdcd it thc su¦¸rajh S
E
is not conncctcd
thcrc may ¦c a conncctcd su¦¸rajh ot S
E
that induccs cquiva¦cnt anccstors
ot thc occurrcnccs f
1
and f
2
Lcncc wc havc to considcr su¦¸rajhs ot S
E
in this casc
Lowcvcr, chcckin¸ a¦¦ jossi¦¦c su¦¸rajhs is jrohi¦itivc¦y cost¦y
Comjutin¸ thc cd¸c sct E
E
ot thc su¦¸rajh S
E
1) Lct E
1
¦(v
1
, v
2
) ∈ E
G
[ ∃(u
1
, u
2
) ∈ E
S
(v
1
, v
2
) (f
1
(u
1
), f
1
(u
2
))¦
and E
2
¦(v
1
, v
2
) ∈ E
G
[ ∃(u
1
, u
2
) ∈ E
S
(v
1
, v
2
) (f
2
(u
1
), f
2
(u
2
))¦
2) Lct F
1
¦(v
1
, v
2
) ∈ E
S
[ (f
1
(v
1
), f
1
(v
2
)) ∈ E
1
∩ E
2
¦
and F
2
¦(v
1
, v
2
) ∈ E
S
[ (f
2
(v
1
), f
2
(v
2
)) ∈ E
1
∩ E
2
¦
3) Lct E
E
F
1
∩ F
2
Christian Borgelt Frequent Pattern Mining 443
Restriction to Connected Subgraphs
Lemma: Lct S
C
(V
C
, E
C
, ℓ
C
) ¦c an (ar¦itrary, ¦ut ﬁxcd) conncctcd comjoncnt
ot thc su¦¸rajh S
E
and ¦ct W ¦v ∈ V
C
[ g(v) ∈ V
C
¦
(rcmindcr ∀v ∈ V
E
g(v) f
−1
2
(f
1
(v)), g is an automorjhism ot S
E
)
Thcn it is cithcr W ∅ or W V
C
Proof: (¦y contradiction)
• Sujjosc that thcrc is a conncctcd comjoncnt S
C
with W , ∅ and W , V
C
• Choosc two vcrticcs v
1
∈ W and v
2
∈ V
C
−W
• v
1
and v
2
arc conncctcd ¦y a jath in S
C
, sincc S
C
is a conncctcd comjoncnt
On this jath thcrc must ¦c an cd¸c (v
a
, v
b
) with v
a
∈ W and v
b
∈ V
C
−W
• lt is (v
a
, v
b
) ∈ E
E
and thcrctorc (g(v
a
), g(v
b
)) ∈ E
E
(g is an automorjhism)
• Sincc g(v
a
) ∈ V
C
, it to¦¦ows g(v
b
) ∈ V
C
• Lowcvcr, this imj¦ics v
b
∈ W, contradictin¸ v
b
∈ V
C
−W
Christian Borgelt Frequent Pattern Mining 444
Further Optimization
Thc tcst can ¦c turthcr ojtimizcd ¦y thc to¦¦owin¸ simj¦c insi¸ht
• Two occurrcnccs f
1
and f
2
ovcr¦aj harmtu¦¦y it ∃v ∈ V
S
f
1
(v) f
2
(v),
¦ccausc thcn such a vcrtcx v a¦onc ¸ivcs risc to cquiva¦cnt anccstors
• This tcst can ¦c jcrtormcd vcry quick¦y, so it shou¦d ¦c thc ﬁrst stcj
• Additiona¦ advanta¸c
conncctcd comjoncnts consistin¸ ot iso¦atcd vcrticcs can ¦c nc¸¦cctcd attcrwards
A simj¦c cxamj¦c ot harmtu¦ ovcr¦aj without identical images
injut ¸rajh
B A A B
su¦¸rajh
A A B
occurrcnccs
B B A A B A A B
`otc that thc su¦¸rajh inducin¸ cquiva¦cnt anccstors can ¦c ar¦itrari¦y comj¦cx
cvcn it ∀v ∈ V
S
f
1
(v) , f
2
(v)
Christian Borgelt Frequent Pattern Mining 445
Final Procedure for Harmful Overlap Test
Input: Two (diﬀcrcnt) occurrcnccs f
1
and f
2
ot a ¦a¦c¦cd ¸rajh S (V
S
, E
S
, ℓ
S
)
in a ¦a¦c¦cd ¸rajh G (V
G
, E
G
, ℓ
G
)
Output: \hcthcr f
1
and f
2
ovcr¦aj harmtu¦¦y
1) lt ∃v ∈ S f
1
(v) f
2
(v), rcturn true
2) Iorm thc cd¸c sct E
E
ot thc su¦¸rajh S
E
(as dcscri¦cd a¦ovc) and
torm thc (rcduccd) vcrtcx sct V
′
E
¦v ∈ V
S
[ ∃u ∈ V
S
(v, u) ∈ E
E
¦
(`otc that V
′
E
docs not contain iso¦atcd vcrticcs)
3) Lct S
i
C
(V
i
C
, E
i
C
), 1 ≤ i ≤ n,
¦c thc conncctcd comjoncnts ot S
′
E
(V
′
E
, E
E
)
lt ∃i. 1 ≤ i ≤ n ∃v ∈ V
i
C
g(v) f
−1
2
(f
1
(v)) ∈ V
i
C
,
rcturn true, othcrwisc rcturn false
Christian Borgelt Frequent Pattern Mining 446
Alternative: Minimum Number of Vertex Images
Lct G (V
G
, E
G
, ℓ
G
) and S (V
S
, E
S
, ℓ
S
) ¦c two ¦a¦c¦cd ¸rajhs
and ¦ct F ¦c thc sct ot a¦¦ su¦¸rajh isomorjhisms ot S to G
Thcn thc minimum number of vertex images support
(or MNIsupport tor short) ot S wrt G is dcﬁncd as
min
v∈V
S
[¦u ∈ V
G
[ ∃f ∈ F f(v) u¦[.
Lrin¸mann and `i,sscn 200¨
Advantage:
• Can ¦c comjutcd much morc cﬃcicnt¦y than `lS or LOsujjort
(`o nccd to dctcrminc a maximum indcjcndcnt vcrtcx sct)
Disadvantage:
• Ottcn counts ¦oth ot two cquiva¦cnt occurrcnccs
(Iair¦y unintuitivc ¦chavior)
Lxamj¦c B A A B
Christian Borgelt Frequent Pattern Mining 447
Experimental Results
lndcx
Chcmicus
1993
200 250 300 350 400 450 500
0
100
200
300
400
500
600
number of subgraphs
MNIsupport
HOsupport
MISsupport
# graphs
Tic
Tac
Toc
win
120 140 160 180 200 220 240 260 280 300
0
50
100
150
200
250
300
number of subgraphs
MNIsupport
HOsupport
MISsupport
Christian Borgelt Frequent Pattern Mining 448
Summary
• Lcﬁnin¸ su¦¸rajh sujjort in thc sin¸¦c ¸rajh scttin¸
maximum independent vertex set ot an ovcr¦aj ¸rajh ot thc occurrcnccs
• MISsupport is antimonotone
lroot ¦ook at induccd indcjcndcnt vcrtcx scts tor su¦structurcs
• Lcﬁnition ot harmful overlap support ot a su¦¸rajh
cxistcncc ot cquiva¦cnt anccstor occurrcnccs
• Simj¦c jroccdurc tor tcstin¸ whcthcr two occurrcnccs ovcr¦aj harmtu¦¦y
• Harmful overlap support is antimonotone
• lcstriction to conncctcd su¦structurcs and ojtimizations
• A¦tcrnativc minimum number of vertex images
• Software: http://www.borgelt.net/moss.html
Christian Borgelt Frequent Pattern Mining 449
Frequent Sequence Mining
Christian Borgelt Frequent Pattern Mining 450
Frequent Sequence Mining
• Directed versus undirected sequences
◦ Tcmjora¦ scqucnccs, tor cxamj¦c, arc a¦ways dircctcd
◦ L`A scqucnccs can ¦c undircctcd (¦oth dircctions can ¦c rc¦cvant)
• Multiple sequences versus a single sequence
◦ `u¦tij¦c scqucnccs jurchascs with rc¦atc cards, wc¦ scrvcr acccss jrotoco¦s
◦ Sin¸¦c scqucncc a¦arms in tc¦ccommunication nctworks
• (Time) points versus time intervals
◦ loints L`A scqucnccs, a¦arms in tc¦ccommunication nctworks
◦ lntcrva¦s wcathcr data, movcmcnt ana¦ysis (sjorts mcdicinc)
◦ Iurthcr distinction onc o¦,cct jcr (timc) joint vcrsus mu¦tij¦c o¦,ccts
Christian Borgelt Frequent Pattern Mining 451
Frequent Sequence Mining
• Consecutive subsequences versus subsequences with gaps
◦ a c b a b c b a a¦ways counts as a su¦scqucncc abc
◦ a c b a b c b c may not a¦ways count as a su¦scqucncc abc
• Existence of an occurrence versus counting occurrences
◦ Com¦inatoria¦ countin¸ (a¦¦ occurrcnccs)
◦ `axima¦ num¦cr ot dis,oint occurrcnccs
◦ Tcmjora¦ sujjort (num¦cr ot timc window jositions)
◦ `inimum occurrcncc (sma¦¦cst intcrva¦)
• Relation between the objects in a sequence
◦ itcms on¦y jrcccdc and succccd
◦ ¦a¦c¦cd timc joints t
1
< t
2
, t
1
t
2
, and t
1
> t
2
◦ ¦a¦c¦cd timc intcrva¦s rc¦ations ¦ikc before, starts, overlaps, contains ctc
Christian Borgelt Frequent Pattern Mining 452
Frequent Sequence Mining
• Directed sequences arc casicr to hand¦c
◦ Thc (su¦)scqucncc itsc¦t can ¦c uscd as a codc word
◦ As thcrc is on¦y onc jossi¦¦c codc word jcr scqucncc (on¦y onc dircction),
this codc word is ncccssari¦y canonica¦
• Consecutive subsequences arc casicr to hand¦c
◦ Thcrc arc tcwcr occurrcnccs ot a ¸ivcn su¦scqucncc
◦ Ior cach occurrcncc thcrc is cxact¦y onc jossi¦¦c cxtcnsions
◦ This a¦¦ows tor sjccia¦izcd data structurcs (simi¦ar to an Iltrcc)
• Item sequences arc casicst to hand¦c
◦ Thcrc arc on¦y two jossi¦¦c rc¦ations and thus jattcrns arc simj¦c
◦ Othcr scqucnccs arc hand¦cd with statc machincs tor containmcnt tcsts
Christian Borgelt Frequent Pattern Mining 453
A Canonical Form for Undirected Sequences
• lt thc scqucnccs to minc arc not dircctcd, a su¦scqucncc can not ¦c uscd
as its own codc word, ¦ccausc it docs not havc thc preﬁx property
• Thc rcason is that an undircctcd scqucncc can ¦c rcad torward or ¦ackward,
which ¸ivcs risc to two jossi¦¦c codc words, thc sma¦¦cr (or thc ¦ar¸cr) ot which
may thcn ¦c dcﬁncd as thc canonical code word
• Lxamj¦cs (that thc jrcﬁx jrojcrty is vio¦atcd)
◦ Assumc that thc itcm ordcr is a < b < c . . . and
that thc ¦cxico¸rajhica¦¦y sma¦¦cr codc word is thc canonica¦ onc
◦ Thc scqucncc bab, which is canonica¦, has thc jrcﬁx ba,
¦ut thc canonica¦ torm ot thc scqucncc ba is rathcr ab
◦ Thc scqucncc cabd, which is canonica¦, has thc jrcﬁx cab,
¦ut thc canonica¦ torm ot thc scqucncc cab is rathcr bac
• As a conscqucncc, wc havc to ¦ook tor a diﬀcrcnt way ot tormin¸ codc words
(at ¦cast it wc want thc codc to havc thc jrcﬁx jrojcrty)
Christian Borgelt Frequent Pattern Mining 454
A Canonical Form for Undirected Sequences
• A (simj¦c) jossi¦i¦ity to torm canonica¦ codc words havin¸ thc jrcﬁx jrojcrty
is to hand¦c (su¦)scqucnccs ot even and odd length separately
ln addition, tormin¸ thc codc word is startcd in the middle
• Even length: Thc scqucncc a
m
a
m−1
. . . a
2
a
1
b
1
b
2
. . . b
m−1
b
m
is dcscri¦cd ¦y thc codc word a
1
b
1
a
2
b
2
. . . a
m−1
b
m−1
a
m
b
m
or ¦y thc codc word b
1
a
1
b
2
a
2
. . . b
m−1
a
m−1
b
m
a
m
• Odd length: Thc scqucncc a
m
a
m−1
. . . a
2
a
1
a
0
b
1
b
2
. . . b
m−1
b
m
is dcscri¦cd ¦y thc codc word a
0
a
1
b
1
a
2
b
2
. . . a
m−1
b
m−1
a
m
b
m
or ¦y thc codc word a
0
b
1
a
1
b
2
a
2
. . . b
m−1
a
m−1
b
m
a
m
• Thc ¦cxico¸rajhica¦¦y sma¦¦cr ot thc two codc words is thc canonical code word
• Such scqucnccs arc extended ¦y addin¸ a jair a
m+1
b
m+1
or b
m+1
a
m+1
,
that is, ¦y addin¸ onc itcm at thc tront and onc itcm at thc cnd
Christian Borgelt Frequent Pattern Mining 455
A Canonical Form for Undirected Sequences
Thc codc words dcﬁncd in this way c¦car¦y havc thc preﬁx property
• Sujjosc thc jrcﬁx jrojcrty wou¦d not ho¦d
Thcn thcrc cxists, without ¦oss ot ¸cncra¦ity, a canonica¦ codc word
w
m
a
1
b
1
a
2
b
2
. . . a
m−1
b
m−1
a
m
b
m
,
thc jrcﬁx w
m−1
ot which is not canonica¦, whcrc
w
m−1
a
1
b
1
a
2
b
2
. . . a
m−1
b
m−1
,
• As a conscqucncc, wc havc w
m
< v
m
, whcrc
v
m
b
1
a
1
b
2
a
2
. . . b
m−1
a
m−1
b
m
a
m
,
and v
m−1
< w
m−1
, whcrc
v
m−1
b
1
a
1
b
2
a
2
. . . b
m−1
a
m−1
.
• Lowcvcr, v
m−1
< w
m−1
imj¦ics v
m
< w
m
,
¦ccausc v
m−1
is a jrcﬁx ot v
m
and w
m−1
is a jrcﬁx ot w
m
,
¦ut v
m
< w
m
contradicts w
m
< v
m
Christian Borgelt Frequent Pattern Mining 456
A Canonical Form for Undirected Sequences
• Gcncratin¸ and comjarin¸ thc two jossi¦¦c codc words takcs linear time
Lowcvcr, this can ¦c imjrovcd ¦y maintainin¸ an additiona¦ jiccc ot intormation
• Ior cach scqucncc a symmetry ﬂag is comjutcd
s
m
m
i1
(a
i
b
i
)
• Thc symmctry ﬂa¸ can ¦c maintaincd in constant timc with
s
m+1
s
m
∧ (a
m+1
b
m+1
).
• Thc permissible extensions dcjcnd on thc symmctry ﬂa¸
◦ it s
m
truc, it must ¦c a
m+1
≤ b
m+1
◦ it s
m
ta¦sc, any rc¦ation ¦ctwccn a
m+1
and b
m+1
is acccjta¦¦c
• This ru¦c ¸uarantccs that cxact¦y thc canonica¦ cxtcnsions arc crcatcd
Ajj¦yin¸ this ru¦c to chcck a candidatc cxtcnsion takcs constant time
Christian Borgelt Frequent Pattern Mining 457
Sequences of Time Intervals
• A (¦a¦c¦cd or attri¦utcd) time interval is a trij¦c I (s, e, l),
whcrc s is thc start timc, e is thc cnd timc and l is thc associatcd ¦a¦c¦
• A time interval sequence is a sct ot (¦a¦c¦cd) timc intcrva¦s,
ot which wc assumc that thcy arc maxima¦ in thc scnsc that tor two intcrva¦s
I
1
(s
1
, e
1
, l
1
) and I
2
(s
2
, e
2
, l
2
) with l
1
l
2
wc havc cithcr e
1
< s
2
or e
2
< s
1
Othcrwisc thcy arc mcr¸cd into onc intcrva¦ I (min¦s
1
, s
2
¦, max¦e
1
, e
2
¦, l
1
)
• A time interval sequence database is a vcctor ot timc intcrva¦ scqucnccs
• Timc intcrva¦s can casi¦y ¦c ordcrcd as to¦¦ows
Lct I
1
(s
1
, e
1
, l
1
) and I
2
(s
2
, e
2
, l
2
) ¦c two timc intcrva¦s lt is I
1
≺ I
2
iﬀ
◦ s
1
< s
2
or
◦ s
1
s
2
and e
1
< e
2
or
◦ s
1
s
2
and e
1
e
2
and l
1
< l
2
Luc to thc assumjtion madc a¦ovc, at ¦cast thc third ojtion must ho¦d
Christian Borgelt Frequent Pattern Mining 458
Allen’s Interval Relations
Luc to thcir tcmjora¦ cxtcnsion, timc intcrva¦s a¦¦ow tor diﬀcrcnt rc¦ations
A common¦y uscd sct ot rc¦ations ¦ctwccn timc intcrva¦s arc
Allen’s interval relations A¦¦cn 19S3
A ¦ctorc B
A mccts B
A ovcr¦ajs B
A is ﬁnishcd ¦y B
A contains B
A is startcd ¦y B
A cqua¦s B
A
B
A
B
A
B
A
B
A
B
A
B
A
B
B attcr A
B is mct ¦y A
B is ovcr¦ajjcd ¦y A
B ﬁnishcs A
B durin¸ A
B starts A
B cqua¦s A
Christian Borgelt Frequent Pattern Mining 459
Temporal Interval Patterns
• A tcmjora¦ jattcrn must sjccity thc rc¦ations ¦ctwccn a¦¦ rctcrcnccd intcrva¦s
This can convcnicnt¦y ¦c donc with a matrix
A
B
C
A B C
A c o ¦
B io c m
C a im c
• Such a tcmjora¦ jattcrn matrix can a¦so ¦c intcrjrctcd as an ad,accncy matrix
ot a ¸rajh, which has thc intcrva¦ rc¦ationshijs as cd¸c ¦a¦c¦s
• Gcncra¦¦y, thc injut intcrva¦ scqucnccs may ¦c rcjrcscntcd as such ¸rajhs,
thus majjin¸ thc jro¦¦cm to trcqucnt (su¦)¸rajh minin¸
• Lowcvcr, thc rc¦ationshijs ¦ctwccn timc intcrva¦s arc constraincd
(tor cxamj¦c, “B attcr A” and “C attcr B” imj¦y “C attcr A”)
Thcsc constraints can ¦c cxj¦oitcd to o¦tain a simj¦cr canonica¦ torm
• ln thc canonical form, thc intcrva¦s arc assi¸ncd in incrcasin¸ timc ordcr
to thc rows and co¦umns ot thc tcmjora¦ jattcrn matrix Icmjc 200S
Christian Borgelt Frequent Pattern Mining 460
Support of Temporal Patterns
• Thc sujjort ot a tcmjora¦ jattcrn wrt a sin¸¦c scqucncc can ¦c dcﬁncd ¦y
◦ Com¦inatoria¦ countin¸ (a¦¦ occurrcnccs)
◦ `axima¦ num¦cr ot dis,oint occurrcnccs
◦ Tcmjora¦ sujjort (num¦cr ot timc window jositions)
◦ `inimum occurrcncc (sma¦¦cst intcrva¦)
• Lowcvcr, a¦¦ ot thcsc dcﬁnitions suﬀcr trom thc tact that such sujjort
is not antimonotone or downward closed
A
B B
Thc sujjort ot “A contains B” is 2,
¦ut thc sujjort ot “A” is on¦y 1
• `cvcrthc¦css an cxhaustivc jattcrn scarch can cnsurcd,
without havin¸ to a¦andon jrunin¸ with thc Apriori property
Thc rcasons is that with minimum occurrcncc countin¸ thc rc¦ationshij “contains”
is thc on¦y onc that can ¦cad to sujjort anoma¦ics ¦ikc thc onc shown a¦ovc
Christian Borgelt Frequent Pattern Mining 461
Weakly AntiMonotone / Downward Closed
• Lct T a jattcrn sjacc with a su¦jattcrn rc¦ationshij < and
¦ct s ¦c a tunction trom T to thc rca¦ num¦crs, s T → ll
Ior a jattcrn S ∈ T ¦ct P(S) ¦R [ R < S ∧ ,∃ Q R < Q < S¦
¦c thc sct ot a¦¦ parent patterns ot S
Thc tunction s on thc jattcrn sjacc T is ca¦¦cd
◦ strongly antimonotone or strongly downward closed iﬀ
∀S ∈ T ∀R ∈ P(S) s(R) ≥ s(S),
◦ weakly antimonotone or weakly downward closed iﬀ
∀S ∈ T ∃R ∈ P(S) s(R) ≥ s(S).
• Thc sujjort ot tcmjora¦ intcrva¦ jattcrns is wcak¦y antimonotonc
(at ¦cast) it it is comjutcd trom minima¦ occurrcnccs
• lt tcmjora¦ intcrva¦ jattcrns arc cxtcndcd backwards in time,
thc Apriori property can satc¦y ¦c uscd tor jrunin¸ Icmjc 200S
Christian Borgelt Frequent Pattern Mining 462
Summary Frequent Sequence Mining
• Scvcra¦ diﬀcrcnt types of frequent sequence mining can ¦c distin¸uishcd
◦ sin¸¦c and mu¦tij¦c scqucnccs, dircctcd and undircctcd scqucnccs
◦ itcms vcrsus (¦a¦c¦cd) intcrva¦s, sin¸¦c and mu¦tij¦c o¦,ccts jcr josition
◦ rc¦ations ¦ctwccn thc o¦,ccts, dcﬁnition ot jattcrn sujjort
• A¦¦ common tyjcs ot trcqucnt scqucncc minin¸ josscss canonica¦ torms
tor which canonical extension rules can ¦c tound
\ith thcsc ru¦cs it is jossi¦¦c to chcck in constant timc
whcthcr a jossi¦¦c cxtcnsion ¦cads to a rcsu¦t in canonica¦ torm
• A weakly antimonotone sujjort tunction can ¦c cnou¸h
to a¦¦ow jrunin¸ with thc Apriori property
Lowcvcr, in this casc it must ¦c madc surc that thc canonica¦ torm
assi¸ns an ajjrojriatc jarcnt jattcrn in ordcr to cnsurc an cxhaustivc scarch
Christian Borgelt Frequent Pattern Mining 463
Frequent Tree Mining
Christian Borgelt Frequent Pattern Mining 464
Frequent Tree Mining: Basic Notions
• lcmindcr A path is a scqucncc ot cd¸cs conncctin¸ two vcrticcs in a ¸rajh
• lcmindcr A (¦a¦c¦cd) ¸rajh G is ca¦¦cd a tree iﬀ tor any jair ot vcrticcs in G
thcrc cxists exactly one path conncctin¸ thcm in G
• A trcc is ca¦¦cd rooted it it has a distin¸uishcd vcrtcx, ca¦¦cd thc root
lootcd trccs arc ottcn sccn as dircctcd a¦¦ cd¸cs arc dircctcd away trom thc root
• lt a trcc is not rootcd (that is, it thcrc is no distin¸uishcd vcrtcx), it is ca¦¦cd free
• A trcc is ca¦¦cd ordered it tor cach vcrtcx
thcrc cxists an ordcr on its incidcnt cd¸cs
lt thc trcc is rooted, thc ordcr may ¦c dcﬁncd on thc out¸oin¸ cd¸cs on¦y
• Trccs ot whichcvcr tyjc arc much casicr to hand¦c trcqucnt (su¦)¸rajhs,
¦ccausc it is main¦y thc cyc¦cs (which may ¦c jrcscnt in a ¸cncra¦ ¸rajh)
that makc it diﬃcu¦t to construct thc canonica¦ codc word
Christian Borgelt Frequent Pattern Mining 465
Frequent Tree Mining: Basic Notions
• lcmindcr A path is a scqucncc ot cd¸cs conncctin¸ two vcrticcs in a ¸rajh
• Thc length of a path is thc num¦cr ot its cd¸cs
• Thc distance ¦ctwccn two vcrticcs ot a ¸rajh G
is thc ¦cn¸th ot a shortcst jath conncctin¸ thcm
`otc that in a trcc thcrc is cxact¦y onc jath conncctin¸ two vcrticcs,
which is thcn ncccssari¦y a¦so thc shortcst jath
• ln a rootcd trcc thc depth ot a vcrtcx is its distancc trom thc root vcrtcx
Thc root vcrtcx itsc¦t has dcjth 0
Thc depth ot a trcc is thc dcjth ot its dccjcst vcrtcx
• Thc diameter ot a ¸rajh is thc ¦ar¸cst distancc ¦ctwccn any two vcrticcs
• A diameter path ot a ¸rajh is a jath havin¸ a ¦cn¸th
that is thc diamctcr ot thc ¸rajh
Christian Borgelt Frequent Pattern Mining 466
Rooted Ordered Trees
• Ior rooted ordered trees codc words dcrivcd trom sjannin¸ trccs
can dircct¦y ¦c uscd thc sjannin¸ trcc is simj¦y thc trcc itsc¦t
• Lowcvcr, thc root ot thc sjannin¸ trcc is ﬁxed
it is simj¦y thc root ot thc rootcd ordcrcd trcc
• ln addition, thc order of the children ot cach vcrtcx is ﬁxed
it is simj¦y thc ¸ivcn ordcr ot thc out¸oin¸ cd¸cs
• As a conscqucncc, oncc a travcrsa¦ ordcr tor thc sjannin¸ trcc is ﬁxcd
(tor cxamj¦c, dcjthﬁrst or a ¦rcadthﬁrst travcrsa¦), thcrc is on¦y
one possible code word, which is ncccssari¦y thc canonica¦ codc word
• Thcrctorc rightmost path extension (tor a dcjthﬁrst travcrsa¦)
and maximum source extension (tor a ¦rcadthﬁrst travcrsa¦)
o¦vious¦y jrovidc a canonica¦ cxtcnsion ru¦c tor rootcd ordcrcd trccs
Thcrc is no nccd tor an cxj¦icit tcst tor canonica¦ torm
Christian Borgelt Frequent Pattern Mining 467
Rooted Unordered Trees
• Rooted unordered trees can most convcnicnt¦y ¦c dcscri¦cd ¦y
soca¦¦cd preorder code words
• lrcordcr codc words arc c¦osc¦y rc¦atcd to sjannin¸ trccs that arc constructcd
with a dcjthﬁrst scarch, ¦ccausc a jrcordcr travcrsa¦ is a dcjthﬁrst travcrsa¦
Lowcvcr, thcir sjccia¦ torm makcs it casicr to comjarc codc words tor su¦trccs
• Thc jrcordcr codc words wc considcr hcrc havc thc ¸cncra¦ torm
a ( d b a )
m
,
whcrc m is thc num¦cr ot cd¸cs ot thc trcc, m n −1,
n is thc num¦cr ot vcrticcs ot thc trcc,
a is a vcrtcx attri¦utc , ¦a¦c¦,
b is an cd¸c attri¦utc , ¦a¦c¦, and
d is thc dcjth ot thc sourcc vcrtcx ot an cd¸c
Thc sourcc vcrtcx ot an cd¸c is thc vcrtcx that is c¦oscr to thc root (sma¦¦cr dcjth)
Thc cd¸cs arc ¦istcd in thc ordcr in which thcy arc visitcd in a jrcordcr travcrsa¦
Christian Borgelt Frequent Pattern Mining 468
Rooted Unordered Trees
a
b b
a
b d b
a
b
c
Ior simj¦icity wc omit cd¸c ¦a¦c¦s
ln rootcd trccs cd¸c ¦a¦c¦s can a¦
ways ¦c com¦incd with thc dcs
tination vcrtcx ¦a¦c¦ (that is, thc
¦a¦c¦ ot thc vcrtcx that is tarthcr
away trom thc root)
• Thc a¦ovc rootcd unordcrcd trcc can ¦c dcscri¦cd ¦y thc codc word
a 0b 1d 1b 2b 2c 1a 0b 1a 1b
• `otc that thc codc word consists ot su¦strin¸s that dcscri¦c thc su¦trccs
¸ .. ¸
a 0
¸ .. ¸
b 1
¸..¸
d 1
¸ .. ¸
b 2
¸..¸
b 2
¸..¸
c 1
¸..¸
a 0
¸ .. ¸
b 1
¸..¸
a 1
¸..¸
b
Thc su¦trcc strin¸s arc scjaratcd ¦y a num¦cr statin¸ thc dcjth ot thc jarcnt
Christian Borgelt Frequent Pattern Mining 469
Rooted Unordered Trees
Lxchan¸in¸ codc words on thc samc ¦cvc¦ cxchan¸cs ¦ranchcs,su¦trccs
¸ .. ¸
a 0
¸ .. ¸
b 1
¸..¸
d 1
¸ .. ¸
b 2
¸..¸
b 2
¸..¸
c 1
¸..¸
a 0
¸ .. ¸
b 1
¸..¸
a 1
¸..¸
b
Ior cxamj¦c, in this codc word thc chi¦drcn ot thc root arc cxchan¸cd
¸ .. ¸
a 0
¸ .. ¸
b 1
¸..¸
a 1
¸..¸
b 0
¸ .. ¸
b 1
¸..¸
d 1
¸ .. ¸
b 2
¸..¸
b 2
¸..¸
c 1
¸..¸
a
a
b b
a
b d b
a
b
c
a
b b
d b
a a
b
b
c
Christian Borgelt Frequent Pattern Mining 470
Rooted Unordered Trees
• A¦¦ jossi¦¦c jrcordcr codc words can ¦c o¦taincd trom onc jrcordcr codc word
¦y cxchan¸in¸ su¦strin¸s ot thc codc word that dcscri¦c si¦¦in¸ su¦trccs
(This shows thc advanta¸c ot usin¸ thc vcrtcx dcjth rathcr than thc vcrtcx indcx
no rcnum¦crin¸ ot thc vcrticcs is ncccssary in such a cxchan¸c)
• Ly dcﬁnin¸ an (ar¦itrary, ¦ut ﬁxcd) ordcr on thc vcrtcx ¦a¦c¦s
and usin¸ thc standard ordcr ot thc intc¸cr num¦crs,
thc codc words can ¦c comjarcd ¦cxico¸rajhica¦¦y
(`otc that vcrtcx ¦a¦c¦s arc a¦ways comjarcd to vcrtcx ¦a¦c¦s
and intc¸crs to intc¸crs, ¦ccausc thcsc two c¦cmcnts a¦tcrnatc)
• Contrary to thc common dcﬁnition uscd in a¦¦ car¦icr cascs, wc dcﬁnc
thc ¦cxico¸rajhica¦¦y greatest codc word as thc canonical code word
• Thc canonica¦ codc word tor thc trcc on thc jrcvious s¦idcs is
a 0b 1d 1b 2c 2b 1a 0b 1b 1a
Christian Borgelt Frequent Pattern Mining 471
Rooted Unordered Trees
• ln ordcr to undcrstand thc corc jro¦¦cm ot o¦tainin¸ an cxtcnsion ru¦c
tor rootcd unordcrcd trccs, considcr thc to¦¦owin¸ trcc
a
b b
c c c c
d
c
d b d
c
d
• Thc canonica¦ codc word tor this trcc rcsu¦ts trom thc shown ordcr ot thc su¦trccs
a 0b 1c 2d 2c 1c 2d 2b 0b 1c 2d 2c 1c 2d
Any cxchan¸c ot su¦trccs ¦cads to a ¦cxico¸rajhica¦¦y smaller codc word
• Low can this trcc ¦c cxtcndcd ¦y addin¸ a chi¦d to thc ¸rcy vcrtcx´
That is, what ¦a¦c¦ may thc chi¦d vcrtcx havc it thc rcsu¦t is to ¦c canonica¦´
Christian Borgelt Frequent Pattern Mining 472
Rooted Unordered Trees
a
b b
c c c c
d
c
d b d
c
d
• ln thc ﬁrst j¦acc, wc o¦scrvc that thc chi¦d must not havc a ¦a¦c¦ succccdin¸ “d”,
¦ccausc othcrwisc cxchan¸in¸ thc ncw vcrtcx with thc othcr chi¦d
ot thc ¸rcy vcrtcx wou¦d yic¦d a ¦cxico¸rajhica¦¦y larger codc word
a 0b 1c 2d 2c 1c 2d 2b 0b 1c 2d 2c 1c 2d 2e
<
a 0b 1c 2d 2c 1c 2d 2b 0b 1c 2d 2c 1c 2e 2d
• Gcncra¦¦y, thc chi¦drcn ot a vcrtcx must ¦c sortcd dcsccndin¸¦y wrt thcir ¦a¦c¦s
Christian Borgelt Frequent Pattern Mining 473
Rooted Unordered Trees
a
b b
c c c c
d
c
d b d
c
d
• Sccond¦y, wc o¦scrvc that thc chi¦d must not havc a ¦a¦c¦ succccdin¸ “c”,
¦ccausc othcrwisc cxchan¸in¸ thc su¦trccs ot thc jarcnt ot thc ¸rcy vcrtcx
wou¦d yic¦d a ¦cxico¸rajhica¦¦y larger codc word
a 0b 1c 2d 2c 1c 2d 2b 0b 1c 2d 2c 1c 2d 2d
<
a 0b 1c 2d 2c 1c 2d 2b 0b 1c 2d 2d 1c 2d 2c
• Thc su¦trccs ot any vcrtcx must ¦c sortcd dcsccndin¸¦y wrt thcir codc words
Christian Borgelt Frequent Pattern Mining 474
Rooted Unordered Trees
a
b b
c c c c
d
c
d b d
c
d
• Third¦y, wc o¦scrvc that thc chi¦d must not havc a ¦a¦c¦ succccdin¸ “¦”,
¦ccausc othcrwisc cxchan¸in¸ thc su¦trccs ot thc root vcrtcx ot thc trcc
wou¦d yic¦d a ¦cxico¸rajhica¦¦y larger codc word
a 0b 1c 2d 2c 1c 2d 2b 0b 1c 2d 2c 1c 2d 2c
<
a 0b 1c 2d 2c 1c 2d 2c 0b 1c 2d 2c 1c 2d 2b
• Thc su¦trccs ot any vcrtcx must ¦c sortcd dcsccndin¸¦y wrt thcir codc words
Christian Borgelt Frequent Pattern Mining 475
Rooted Unordered Trees
• That a jossi¦¦c cxchan¸c ot su¦trccs at vcrticcs c¦oscr to thc root
ncvcr yic¦d ¦ooscr rcstrictions is no accidcnt
• Sujjosc a rootcd trcc is dcscri¦cd ¦y a canonica¦ codc word
a 0 b 1 w
1
1 w
2
0 b 1 w
3
1 w
!
.
Thcn wc know thc to¦¦owin¸ rc¦ationshijs ¦ctwccn su¦trcc codc words
◦ w
1
≥ w
2
and w
3
≥ w
!
, ¦ccausc othcrwisc an cxchan¸c ot su¦trccs at thc
nodcs ¦a¦c¦cd with “¦” wou¦d ¦cad to a ¦cxico¸rajhica¦¦y ¦ar¸cr codc word
◦ w
1
≥ w
3
, ¦ccausc othcrwisc an cxchan¸c ot su¦trccs at thc nodc ¦a¦c¦cd “a”
wou¦d ¦cad to a ¦cxico¸rajhica¦¦y ¦ar¸cr codc word
• On¦y it w
1
w
3
, thc codc words w
1
and w
3
do not a¦rcady dctcrminc thc ordcr
ot thc su¦trccs ot thc vcrtcx ¦a¦c¦cd with “a” ln this casc wc havc w
2
≥ w
!
Lowcvcr, thcn wc a¦so havc w
3
w
1
≥ w
2
,
showin¸ that w
2
jrovidcs no ¦ooscr rcstriction ot w
!
than w
3
Christian Borgelt Frequent Pattern Mining 476
Rooted Unordered Trees
As a conscqucncc, wc o¦tain thc to¦¦owin¸ simj¦c extension rule
• Lct w ¦c thc canonica¦ codc word ot thc rootcd trcc to cxtcnd and
¦ct d ¦c thc dcjth ot thc rootcd trcc (that is, thc dcjth ot thc dccjcst vcrtcx)
ln addition, ¦ct thc considcrcd cxtcnsion ¦c xa with x ∈ l`
0
and a a vcrtcx ¦a¦c¦
• Lct y ¦c thc sma¦¦cst intc¸cr tor which w has a suﬃx ot thc torm y w
1
w
2
y w
1
with y ∈ l`
0
and w
1
and w
2
strin¸s not containin¸ any y
′
≤ y (w
2
may ¦c cmjty)
lt w docs not josscss such a suﬃx, ¦ct y d (dcjth ot thc trcc)
• lt x > y, thc cxtcnsion is canonica¦ it and on¦y it xa ≤ w
2
• lt x ≤ y, chcck whcthcr w has a suﬃx xw
3
,
whcrc w
3
is a strin¸ not containin¸ any intc¸cr x
′
≤ x
lt w has such a suﬃx, thc cxtcndcd codc word is canonica¦ it and on¦y it a ≤ w
3
lt w docs not havc such a suﬃx, thc cxtcndcd codc word is a¦ways canonica¦
• \ith this cxtcnsion ru¦c no su¦scqucnt canonica¦ torm tcst is nccdcd
Christian Borgelt Frequent Pattern Mining 477
Rooted Unordered Trees
Thc discusscd cxtcnsion ru¦c is vcry cﬃcicnt
• Comjarin¸ thc c¦cmcnts ot thc cxtcnsion takcs constant time
(at most onc intc¸cr and onc ¦a¦c¦ nccd to ¦c comjarcd)
• Inow¦cd¸c ot thc strin¸s w
3
tor a¦¦ jossi¦¦c va¦ucs ot x (0 ≤ x < d)
can maintaincd in constant time
lt suﬃccs to rccord thc startin¸ joints ot thc su¦strin¸s
that dcscri¦c thc ri¸htmost su¦trcc on cach trcc ¦cvc¦
At most onc ot thcsc startin¸ joints can chan¸c with an cxtcnsion
• Inow¦cd¸c ot thc va¦uc ot y and thc two startin¸ joints ot thc strin¸ w
1
in w
can ¦c maintaincd in constant time
As ¦on¸ as no two si¦¦in¸ vcrticcs carry thc samc ¦a¦c¦, it is y d
lt a si¦¦in¸ with thc samc ¦a¦c¦ is addcd, y is sct to thc dcjth ot thc jarcnt
w
1
a occurs at thc josition ot thc w
3
tor y and at thc cxtcnsion vcrtcx ¦a¦c¦
lt a tuturc cxtcnsion diﬀcrs trom w
2
, it is y d, othcrwisc w
1
is cxtcndcd
Christian Borgelt Frequent Pattern Mining 478
Free Trees
• Free trees can ¦c hand¦cd ¦y com¦inin¸ thc idcas ot
how to hand¦c sequences and rooted unordered trees
• Simi¦ar to scqucnccs, trcc trccs ot cvcn and odd diamctcr arc trcatcd scjaratc¦y
• Gcncra¦ idcas tor a canonica¦ torm tor trcc trccs
◦ Even Diameter:
Thc vcrtcx in thc midd¦c ot a diamctcr jath is uniquc¦y dctcrmincd
This vcrtcx can ¦c uscd as thc root ot a rootcd trcc
◦ Odd Diameter:
Thc cd¸c in thc midd¦c ot a diamctcr jath is uniquc¦y dctcrmincd
lcmovin¸ this cd¸c sj¦its thc trcc trcc into two rootcd trccs
• lroccdurc tor ¸rowin¸ trcc trccs
◦ Iirst ¸row a diamctcr jath usin¸ thc canonica¦ torm tor scqucnccs
◦ Lxtcnd thc diamctcr jath into a trcc ¦y addin¸ ¦ranchcs
Christian Borgelt Frequent Pattern Mining 479
Free Trees
• `ain jro¦¦cm ot thc jroccdurc tor ¸rowin¸ trcc trccs
The initially grown diameter path must remain identiﬁable.
(Othcrwisc thc preﬁx property cannot ¦c ¸uarantccd)
• ln ordcr to so¦vc this jro¦¦cm it is cxj¦oitcd that in thc canonica¦ codc word tor a
rootcd unordcrcd trcc codc words dcscri¦in¸ jaths trom thc root to a ¦cat vcrtcx
arc ¦cxico¸rajhica¦¦y incrcasin¸ it thc jaths arc ¦istcd trom ¦ctt to ri¸ht
• Even Diameter:
Thc ori¸ina¦ diamctcr jath rcjrcscnts two jaths trom thc root to two ¦cavcs
To kccj thcm idcntiﬁa¦¦c, thcsc jaths must ¦c thc ¦cxico¸rajhica¦¦y sma¦¦cst
and thc ¦cxico¸rajhica¦¦y ¦ar¸cst jath ¦cadin¸ to this dcjth
• Odd Diameter:
Thc ori¸ina¦ diamctcr jath rcjrcscnts onc jath trom thc root to a ¦cat
in cach ot thc two rootcd trccs thc trcc trcc is sj¦it into
Thcsc jaths must ¦c thc ¦cxico¸rajhica¦¦y sma¦¦cst jaths ¦cadin¸ to this dcjth
Christian Borgelt Frequent Pattern Mining 480
Summary Frequent Tree Mining
• Rooted ordered trees
◦ Thc root is ﬁxcd and thc ordcr ot thc chi¦drcn ot cach vcrtcx is ﬁxcd
◦ Loth rightmost path extension and maximum source extension
o¦vious¦y jrovidc a canonica¦ cxtcnsion ru¦c tor rootcd ordcrcd trccs
• Rooted unordered trees
◦ Thc root is ﬁxcd, ¦ut thcrc is no ordcr ot thc chi¦drcn
◦ Thcrc cxists a canonica¦ cxtcnsion ru¦c ¦ascd on sortcd jrcordcr strin¸s
(constant timc tor ﬁndin¸ a¦¦owcd cxtcnsions) Luccio ct a¦ 2001, 200!
• Free trees
◦ `o nodc is ﬁxcd as thc root, thcrc is no ordcr on ad,accnt vcrticcs
◦ Thcrc cxists a canonica¦ cxtcnsion ru¦c ¦ascd on dcjth scqucnccs
(constant timc tor ﬁndin¸ a¦¦owcd cxtcnsions) `i,sscn and Iok 200!
Christian Borgelt Frequent Pattern Mining 481
Summary Frequent Pattern Mining
Christian Borgelt Frequent Pattern Mining 482
Summary Frequent Pattern Mining
• lossi¦¦c tyjcs ot jattcrns item sets, sequences, trees, and graphs
• A corc in¸rcdicnt ot thc scarch is a canonical form ot thc tyjc ot jattcrn
◦ lurjosc cnsurc that cach jossi¦¦c jattcrn is jroccsscd at most oncc
(Liscard noncanonica¦ codc words, jroccss on¦y canonica¦ oncs)
◦ lt is dcsira¦¦c that thc canonica¦ torm josscsscs thc preﬁx property
◦ Lxccjt tor ¸cncra¦ ¸rajhs thcrc cxist perfect extension rules
◦ Ior ¸cncra¦ ¸rajhs, restricted extensions a¦¦ow to rcducc
thc num¦cr ot actua¦ canonica¦ torm tcsts considcra¦¦y
• Ircqucnt jattcrn minin¸ a¦¸orithms jrunc with thc Apriori property
∀P ∀S ⊃ P s
T
(P) < s
min
→ s
T
(S) < s
min
.
That is No superpattern of an infrequent pattern is frequent.
• Additional ﬁltering is imjortant to sin¸¦c out thc rc¦cvant jattcrns
Christian Borgelt Frequent Pattern Mining 483
Software
Sottwarc tor trcqucnt jattcrn minin¸ can ¦c tound at
• my wc¦ sitc http://www.borgelt.net/fpm.html
◦ Ajriori http://www.borgelt.net/apriori.html
◦ Lc¦at http://www.borgelt.net/eclat.html
◦ IlGrowth http://www.borgelt.net/fpgrowth.html
◦ lL¦im http://www.borgelt.net/relim.html
◦ Sa` http://www.borgelt.net/sam.html
◦ `oSS http://www.borgelt.net/moss.html
• thc Ircqucnt ltcm Sct `inin¸ lmj¦cmcntations (Il`l) lcjository
http://fimi.cs.helsinki.fi/
This rcjository was sct uj with thc contri¦utions to thc Il`l workshojs in 2003
and 200!, whcrc cach su¦mission had to ¦c accomjanicd ¦y thc sourcc codc ot
an imj¦cmcntation Thc wc¦ sitc oﬀcrs a¦¦ sourcc codc, scvcra¦ data scts, and thc
rcsu¦ts ot thc comjctition
Christian Borgelt Frequent Pattern Mining 484
Frequent Item Set Mining: Basic Notions
• Let B = {i1, . . . , im} be a set of items. This set is called the item base. Items may be products, special equipment items, service options etc. • Any subset I ⊆ B is called an item set. An item set may be any set of products that can be bought (together). • Let T = (t1, . . . , tn) with ∀k, 1 ≤ k ≤ n : tk ⊆ B be a vector of transactions over B. This vector is called the transaction database. A transaction database can list, for example, the sets of products bought by the customers of a supermarket in a given period of time. Every transaction is an item set, but some item sets may not appear in T . Transactions need not be pairwise diﬀerent: it may be tj = tk for j = k. T may also be deﬁned as a bag or multiset of transactions. The set B may not be explicitely given, but only implicitly as B = n tk . k=1
Frequent Item Set Mining: Basic Notions
Let I ⊆ B be an item set and T a transaction database over B. • A transaction t ∈ T covers the item set I or the item set I is contained in a transaction t ∈ T iﬀ I ⊆ t.
• The set KT (I) = {k ∈ {1, . . . , n}  I ⊆ tk } is called the cover of I w.r.t. T . The cover of an item set is the index set of the transactions that cover it. It may also be deﬁned as a vector of all transactions that cover it (which, however, is complicated to write in a formally correct way). • The value sT (I) = KT (I) is called the (absolute) support of I w.r.t. T . 1 The value σT (I) = n KT (I) is called the relative support of I w.r.t. T . The support of I is the number or fraction of transactions that contain it. Sometimes σT (I) is also called the (relative) frequency of I w.r.t. T .
Christian Borgelt
Frequent Pattern Mining
5
Christian Borgelt
Frequent Pattern Mining
6
Frequent Item Set Mining: Basic Notions
Alternative Deﬁnition of Transactions • A transaction over an item base B is a tuple t = (tid, J), where ◦ tid is a unique transaction identiﬁer and ◦ J ⊆ B is an item set. • A transaction database T = {t1, . . . , tn} is a set of transactions. A simple set can be used, since transactions diﬀer at least in their identiﬁer. • A transaction t = (tid, J) covers an item set I iﬀ I ⊆ J. Given:
Frequent Item Set Mining: Formal Deﬁnition
• a set B = {i1, . . . , im} of items, the item base, • a vector T = (t1, . . . , tn) of transactions over B, the transaction database, • a number smin ∈ IN, 0 < smin ≤ n, a number σmin ∈ IR, 0 < σmin ≤ 1, Desired: • the set of frequent item sets, that is, the set FT (smin) = {I ⊆ B  sT (I) ≥ smin} or (equivalently) the set ΦT (σmin) = {I ⊆ B  σT (I) ≥ σmin}.
1 Note that with the relations smin = ⌈nσmin⌉ and σmin = n smin the two versions can easily be transformed into each other.
or (equivalently) the minimum support.
• The set KT (I) = {tid  ∃J ⊆ B : ∃t ∈ T : t = (tid, J) ∧ I ⊆ J} is the cover of I w.r.t. T . Remark: If the transaction database is deﬁned as a vector, there is an implicit transaction identiﬁer, namely the position of the transaction in the vector.
Christian Borgelt
Frequent Pattern Mining
7
Christian Borgelt
Frequent Pattern Mining
8
Frequent Item Sets: Example
transaction database 1: {a, d, e} 2: {b, c, d} 3: {a, c, e} 4: {a, c, d, e} 5: {a, e} 6: {a, c, d} 7: {b, c} 8: {a, c, d, e} 9: {b, c, e} 10: {a, d, e} frequent item sets 0 items 1 item ∅: 10 {a}: 7 {b}: 3 {c}: 7 {d}: 6 {e}: 7 2 items {a, c}: {a, d}: {a, e}: {b, c}: {c, d}: {c, e}: {d, e}: 3 items 4 {a, c, d}: 3 5 {a, c, e}: 3 6 {a, d, e}: 4 3 4 4 4
Searching for Frequent Item Sets
• The minimum support is smin = 3 or σmin = 0.3 = 30% in this example. • There are 25 = 32 possible item sets over B = {a, b, c, d, e}. • There are 16 frequent item sets (but only 10 transactions).
Christian Borgelt
Frequent Pattern Mining
9
Christian Borgelt
Frequent Pattern Mining
10
Properties of the Support of Item Sets
• A brute force approach that traverses all possible item sets, determines their support, and discards infrequent item sets is usually infeasible: The number of possible item sets grows exponentially with the number of items. A typical supermarket oﬀers thousands of diﬀerent products. • Idea: Consider the properties of the support, in particular: ∀I : ∀J ⊇ I : KT (J) ⊆ KT (I).
Properties of the Support of Item Sets
• From ∀I : ∀J ⊇ I : sT (J) ≤ sT (I) it follows immediately ∀smin : ∀I : ∀J ⊇ I : sT (I) < smin → sT (J) < smin.
That is: No superset of an infrequent item set can be frequent. • This property is often referred to as the Apriori Property. Rationale: Sometimes we can know a priori, that is, before checking its support by accessing the given transaction database, that an item set cannot be frequent. • Of course, the contraposition of this implication also holds: ∀smin : ∀I : ∀J ⊆ I : sT (I) ≥ smin → sT (J) ≥ smin.
This property holds, since ∀t : ∀I : ∀J ⊇ I : J ⊆ t → I ⊆ t. Each additional item is another condition a transaction has to satisfy. Transactions that do not satisfy this condition are removed from the cover. • It follows: ∀I : ∀J ⊇ I : sT (J) ≤ sT (I).
That is: All subsets of a frequent item set are frequent. • This suggests a compressed representation of the set of frequent item sets (which will be explored later: maximal and closed frequent item sets).
11 Christian Borgelt Frequent Pattern Mining 12
That is: If an item set is extended, its support cannot increase. One also says that support is antimonotone or downward closed.
Christian Borgelt Frequent Pattern Mining
the partially ordered set (2B . d. • A function f : IR → IR is called monotonically nonincreasing if ∀x.t. y ∈ S : x ≤S y ⇒ f (x) ≤R f (y). Monotonicity in Order Theory • Order theory is concerned with arbitrary partially ordered sets. c ∈ S: ◦ a≤a ◦ a≤b∧b≤a ⇒ a=b ◦ a≤b∧b≤c ⇒ a≤c (reﬂexivity) (antisymmetry) (transitivity) Properties of the Support of Item Sets Monotonicity in Calculus and Analysis • A function f : IR → IR is called monotonically nondecreasing if ∀x. all edges lead downwards. then a and b are called comparable. ◦ if neither a ≤ b nor b ≤ a. e}. is called antimonotone or orderreversing if ∀x. • G has the elements of S as nodes. • For every smin the set of frequent item sets FT (smin) is downward closed w. Christian Borgelt Frequent Pattern Mining abcde Hasse diagram of ({a.Reminder: Partially Ordered Sets • A partial order is a binary relation ≤ over a set S which satisﬁes ∀a. GT (θ) = {S ⊆ B  sT (S) < θ} is upward closed. Reminder: Partially Ordered Sets and Hasse Diagrams • A ﬁnite partially ordered set (S. c. • If all pairs of elements of the underlying set S are comparable.or downward closed. where 2B denotes the powerset of B: ∀X ∈ FT (smin) : ∀Y ⊆ B : Y ⊆ X ⇒ Y ∈ FT (smin). 15 Christian Borgelt ab ac ad ae bc bd be cd ce de abc abd abe acd ace ade bcd bce bde cde abcd abce abde acde bcde • Since the set of frequent item sets is induced by the support function. • A function f : S → R. is called if ∀x. ⊆). y ∈ S : x ≤S y ⇒ f (x) ≥R f (y). the graph can always be depicted such that all edges lead downwards. because they lose their pictorial motivation as soon as sets are considered that are not totally ordered. • The notions of upward closed and upper set are deﬁned analogously. ≤).) Frequent Pattern Mining 16 . the order ≤ is called a total order or a linear order.or downward closed are transferred to the support function: Any set of item sets induced by a support threshold θ is up. b. • The Hasse diagram of a total or linear order is a chain. FT (θ) = {S ⊆ B  sT (S) ≥ θ} is downward closed. y : x ≤ y ⇒ f (x) ≥ f (y). • In a total order the reﬂexivity axiom is replaced by the stronger axiom: ◦ a≤b∨b≤a Christian Borgelt (totality) Frequent Pattern Mining Properties of Frequent Item Sets • A subset R of a partially ordered set (S. which is called Hasse diagram. y : x ≤ y ⇒ f (x) ≤ f (y). • In this sense the support of an item set is antimonotone. • Since the graph is acyclic (there is no directed cycle). ⊆). then there is an edge from a to b.r. • Let a and b be two distinct elements of a partially ordered set (S. a ≤ b and not a = b) and there is no element between a and b (that is. (Edge directions are omitted. no c ∈ S with a < c < b). 13 Christian Borgelt Frequent Pattern Mining 14 • A set with a partial order is called a partially ordered set (or poset for short). The terms increasing and decreasing are avoided. ≤) can be depicted as a (directed) acyclic graph G. ≤) is called downward closed if for any element of the set all smaller elements are also in it: ∀x ∈ R : ∀y ∈ S : y≤x ⇒ y∈R In this case the subset R is also called a lower set. then a and b are called incomparable. where S and R are two partially ordered sets. monotone or orderpreserving • A function f : S → R. ◦ if a ≤ b or b ≤ a. the notions of up. The edges are selected according to: a b c d e If a and b are elements of S with a < b (that is. b.
e} {a. Christian Borgelt Frequent Pattern Mining 17 Searching for Frequent Item Sets Idea: Use the properties of the support to organize the search for all frequent item sets. • The structure of the partially ordered set (2B . Since these properties relate the support of an item set to the support of its subsets and supersets. • It improves over the brute force approach by exploiting the apriori property to skip item sets that cannot be frequent because they have an infrequent subset. ⇒ topdown search (from empty set/oneelement sets to larger sets) • Since a partially ordered set can conveniently be depicted by a Hasse diagram. that enumerates candidate item sets and checks their support. e} {a. c. b. c. In practice. Christian Borgelt Hasse diagram for ﬁve items {a. e} = B: a b c d e ab ac ad ae bc bd be cd ce de abc abd abe acd ace ade bcd bce bde cde abcd abce abde acde bcde abcde (2B . e} {b. d. d. c. it is reasonable to organize the search based on the structure of the partially ordered set (2B . ⊆). white boxes infrequent item sets. c. c. e} {a. e} {a. d. e} a b c d e The Apriori Algorithm ab ac ad ae bc bd be cd ce de [Agrawal and Srikant 1994] abc abd abe acd ace ade bcd bce bde cde Blue boxes are frequent item sets. abcd abce abde acde bcde abcde Christian Borgelt Frequent Pattern Mining 19 Christian Borgelt Frequent Pattern Mining 20 . d. ⊆) helps to identify those item sets that can be skipped due to the apriori property.Searching for Frequent Item Sets • The standard search procedure is an enumeration approach. ⊆). especially the apriori property: ∀I : ∀J ⊃ I : sT (I) < smin → sT (J) < smin. d} {b. • Note that the search may have to visit an exponential number of item sets. d. we will use such diagrams to illustrate the search. e} {b. c. at least if the minimum support is not chosen too low. the search times are often bearable. however. • The search space is the partially ordered set (2B . ⊆) Frequent Pattern Mining 18 Hasse Diagrams and Frequent Item Sets Hasse diagram with frequent item sets (smin = 3): transaction database 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: {a. c} {a. c. d} {a.
while Fk = ∅ do begin (∗ — Apriori algorithm ∗) (∗ initialize the item set size ∗) (∗ start with single element sets ∗) (∗ and determine the frequent ones ∗) (∗ while there are frequent item sets ∗) Ek+1 := candidates(Fk ). T. . Ek := i∈B {{i}}. . T. end (∗ prune ∗) (∗ initialize the set of frequent candidates ∗) (∗ traverse the candidates ∗) (∗ if a candidate is frequent. k F . smin) begin forall e ∈ E do sT (e) := 0. . ik−1. j=1 j (∗ increment the item counter ∗) (∗ return the frequent item sets ∗) end (∗ apriori ∗) Ej : candidate item sets of size j. determine their support. ik−1. • Form candidate item sets with three items (all pairs must be frequent). It is based on two main steps: candidate generation and pruning. forall f1. . smin). end (∗ candidates ∗) Christian Borgelt Frequent Pattern Mining 23 Christian Borgelt Frequent Pattern Mining 24 . forall e ∈ E do if sT (e) ≥ smin then F := F ∪ {e}. • Continue by forming candidate item sets with four. (∗ increment the support counter ∗) F := ∅. return := k + 1. . i′ }. . smin). ﬁve etc. (∗ and determine the frequent item sets ∗) k end.Searching for Frequent Item Sets One possible scheme for the search: • Determine the support of the one element item sets and discard the infrequent items. and discard the infrequent item sets. ∗) (∗ add the new item set to the candidates ∗) (∗ (otherwise it cannot be frequent) ∗) (∗ return the generated candidates ∗) if e ⊆ t (∗ if transaction contains the candidate. ik } The Apriori Algorithm 3 function prune (E. T. k (∗ if all subsets with k items are frequent. forall t ∈ T do forall e ∈ E do (∗ — generate candidates with k + 1 items ∗) (∗ initialize the set of candidates ∗) (∗ traverse all pairs of frequent item sets ∗) (∗ that diﬀer only in one item and ∗) (∗ are in a lexicographic order ∗) (∗ — prune infrequent candidates ∗) (∗ initialize the support counters ∗) (∗ of all candidates to be checked ∗) (∗ traverse the transactions ∗) (∗ traverse the candidates ∗) and ik < i′ do begin (∗ (the order is arbitrary. ik } ′ and f2 = {i1. This is the general scheme of the Apriori Algorithm. ∗) (∗ add it to the set of frequent item sets ∗) (∗ return the pruned set of candidates ∗) if ∀i ∈ f : f − {i} ∈ Fk then E := E ∪ {f }. items until no candidate item set is frequent. return E. . . but ﬁxed) ∗) k (∗ union has k + 1 items ∗) f := f1 ∪ f2 = {i1. • Form candidate item sets with two items (both items must be frequent). . T. ik−1. return F . smin) begin k := 1. f2 ∈ Fk with f1 = {i1. All enumeration algorithms are based on these steps in some form. ik . and discard the infrequent item sets. ∗) then sT (e) := sT (e) + 1. Christian Borgelt Fj : frequent item sets of size j. Fk := prune(Ek . . end. Christian Borgelt Frequent Pattern Mining 21 The Apriori Algorithm 1 function apriori (B. determine their support. . Frequent Pattern Mining 22 The Apriori Algorithm 2 function candidates (Fk ) begin E := ∅. . (∗ create candidates with one item more ∗) Fk+1 := prune(Ek+1.
the candidate generation step may carry out a lot of redundant work. with k items) can be generated in k! diﬀerent ways (on k! paths in the Hasse diagram). Christian Borgelt Frequent Pattern Mining 25 Christian Borgelt Frequent Pattern Mining 26 Searching for Frequent Item Sets • A core problem is that an item set of size k (that is. (For infrequent item sets the number may be smaller. • It is obvious that it suﬃces to consider each item set at most once in order to ﬁnd the frequent ones (infrequent item sets need not be generated at all). ⊆) / its Hasse diagram. • If we consider an item by item process of building an item set (which can be imagined as a levelwise traversal of the partial order). • Question: Can we reduce or even eliminate this redundant work? More generally: How can we make sure that any candidate item set is generated at most once? • Idea: Assign to each item set a unique parent item set. • Assigning unique parents turns the Hasse diagram into a tree. • Collecting the frequent item sets of size k in a set Fk has drawbacks: A frequent item set of size k + 1 can be formed in j= k(k + 1) 2 Improving the Candidate Generation possible ways. • Traversing the resulting tree explores each item set exactly once. from which this item set is to be generated. Christian Borgelt Frequent Pattern Mining 27 Christian Borgelt Searching for Frequent Item Sets • We have to search the partially ordered set (2B . since it suﬃces to generate each candidate item set once. because in principle the items may be added in any order. from which this item set is to be generated.) As a consequence. there are k possible ways of forming an item set of size k from item sets of size k − 1 by adding the remaining item. • Question: Can we reduce or even eliminate this variety? More generally: How can we make sure that any candidate item set is generated at most once? • Idea: Assign to each item set a unique parent item set.Searching for Frequent Item Sets • The Apriori algorithm searches the partial order topdown level by level. Hasse diagram and a possible tree for ﬁve items: a b c d e a b c d e ab ac ad ae bc bd be cd ce de ab ac ad ae bc bd be cd ce de abc abd abe acd ace ade bcd bce bde cde abc abd abe acd ace ade bcd bce bde cde abcd abce abde acde bcde abcd abce abde acde bcde abcde abcde Frequent Pattern Mining 28 .
by which sth is judged: This ﬁlm oﬀends against all the canons of good taste. ◦ Recursively process all oneelement item sets that are frequent. Christian Borgelt Frequent Pattern Mining 30 Canonical Forms The meaning of the word “canonical”: (source: Oxford Advanced Learner’s Dictionary — Encyclopedic Edition) canon / kæn n/ n 1 general rule. one must ﬁx one of them for a given application. . • Recursive Processing: For a given frequent item set I: ◦ Generate all extensions J of I by one item (that is. ◦ For all J: if J is frequent. In other words. J ⊃ I. the possible parents of I are its maximal proper subsets. the set of all possible parents of an item set I is P (I) = {J ⊂ I  ∃ K : J ⊂ K ⊂ I}. canonical /k n nIkl/ adj . • This canonical form will be used to assign unique parents to all item sets. • In order to single out one element of P (I). in order to lay the foundations for the study of frequent (sub)graph mining. process J recursively.t. and later standard representations of a graph. • Even though this approach is straightforward and simple. . i∈I where the maximum (or minimum) is taken w. . we reformulate it now in terms of a canonical form of an item set. we can simply deﬁne an (arbitrary. . • In the following we will deﬁne a standard representation of an item set. • Questions: ◦ How can we formally assign unique parents? ◦ How can we make sure that we generate only those extensions for which the item set that is extended is the chosen unique parent? Christian Borgelt Frequent Pattern Mining 29 Assigning Unique Parents • Formally. a tree etc. standard or principle. but ﬁxed) global order of the items: i1 < i2 < i3 < · · · < in . a sequence. However. J = I + 1) for which the item set I is the chosen unique parent. Christian Borgelt Frequent Pattern Mining 31 Christian Borgelt Frequent Pattern Mining 32 . Nevertheless there are often several possible choices for a canonical form. . 3 standard. .Searching with Unique Parents Principle of a Search Algorithm based on Unique Parents: • Base Loop: ◦ Traverse all oneelement item sets (their unique parent is the empty set). otherwise discard J. the canonical parent pc(I). . Then the canonical parent of an item set I can be deﬁned as the item set pc(I) = I − {max i} i∈I (or pc(I) = I − {min i}). a e e Canonical Forms of Item Sets • A canonical form of something is a standard representation of it. the chosen order of the items. accepted. • The canonical form must be unique (otherwise it could not be standard).r. .
A Canonical Form for Item Sets • An item set is represented by a code word. a∈I • General Recursive Processing with Canonical Forms: For a given frequent item set I: ◦ Generate all possible extensions J of I by one item (J ⊃ I. but ﬁxed) order of the items.r. and by comparing code words lexicographically w. ⇒ We know the canonical code word of every item set that is processed recursively. • There are k! possible code words for an item set of size k. b. b. J = I + 1). • By introducing an (arbitrary. alternatively. • Note that the preﬁx property immediately implies: Every preﬁx of a canonical code word is a canonical code word itself. However. Remark: These explanations may appear obfuscated. greatest) code word for an item set is deﬁned to be its canonical code word. process J recursively. this deﬁnition is equivalent to pc(I) = I − {max a}. we can deﬁne an order on these code words. • With this code word we know. • Example: Consider the item set I = {a. due to the preﬁx property. each letter represents an item. b. • The lexicographically smallest (or. Christian Borgelt Frequent Pattern Mining 33 Canonical Forms and Canonical Parents • Let I be an item set and wc(I) its canonical code word. Obviously the canonical code word lists the items in the chosen.) Christian Borgelt Frequent Pattern Mining 35 Christian Borgelt Searching with the Preﬁx Property The preﬁx property allows us to simplify the search scheme: • The general recursive processing scheme with canonical forms requires to construct the canonical code word of each created item set in order to decide whether it has to be processed recursively or not. The code word is a word over the alphabet B. d}. d. ⇒ We only have to check whether the code word that results from appending the added item to the given canonical code word is canonical or not. c} and a < b < c. since they are obviously equivalent. otherwise discard J. ◦ The longest proper preﬁx of abde is abd. Example: abc < bac < bca < cab etc. (In the following both statements are called the preﬁx property. the added item). the canonical code words of all child item sets that have to be explored in the recursion with the exception of the last letter (that is. because the items may be listed in any order. • Advantage: Checking whether a given code word is canonical can be simpler/faster than constructing a canonical code word from scratch. • Since the canonical code word of an item set lists its items in the chosen order. the view developed here will help us a lot when we turn to frequent (sub)graph mining. ◦ abd is the canonical code word of pc(I) = {a. for the item set {a.t. e}: ◦ The canonical code word of I is abde. since the core idea and the result are very simple. Frequent Pattern Mining 36 . ◦ Form the canonical code word wc(J) of each extended item set J. Christian Borgelt Frequent Pattern Mining 34 The Preﬁx Property • Note that the considered item set coding scheme has the preﬁx property: The longest proper preﬁx of the canonical code word of any item set is a canonical code word itself. ﬁxed order. this order. the set of all items. but also its canonical code word. ◦ For each J: if the last letter of wc(J) is the item added to I to form J and J is frequent. The canonical parent pc(I) of the item set I is the item set described by the longest proper preﬁx of the code word wc(I). ⇒ With the longest proper preﬁx of the canonical code word of an item set I we not only know the canonical parent of I.
Searching with the Preﬁx Property Principle of a Search Algorithm based on the Preﬁx Property: • Base Loop: ◦ Traverse all possible items. ◦ Check whether the extended code word is the canonical code word of the item set that is described by the extended code word (and. ◦ When v is constructed in the search. ◦ bca is not canonical and thus discarded. ◦ As a consequence. . • Recursive Processing: For a given (canonical) code word of a frequent item set: ◦ Generate all possible extensions by one item. . ik−1. ik−1. • Applied to the Apriori algorithm. its extensions are acdb and acde. • As a consequence. bcd and bce are canonical and therefore processed recursively. ik } and f2 = {i1. Then: ◦ There exist a canonical code word w and a preﬁx v of w. the canonical code word w can never be reached. c. we have a very simple canonical extension rule (that is. ∧ If the added item precedes any of the already present items in the chosen order. ◦ Recursively process each code word that describes a frequent item set. • Consider the recursive processing of the code word acd (this code word is canonical. 1 ≤ j < k : ij < ij+1. . Christian Borgelt Frequent Pattern Mining 37 Christian Borgelt Frequent Pattern Mining 38 Searching with the Preﬁx Property Exhaustive Search • The preﬁx property is a necessary condition for ensuring that all canonical code words can be constructed in the search by appending extensions (items) to visited canonical code words. d. e} and let us assume that we simply use the alphabetical order to deﬁne a canonical form (as before). otherwise discard it. the result is not in canonical form. process the extended code word recursively. the result is in canonical form. such that v is not a canonical code word. This is done by simply appending the item to the code word. Christian Borgelt Frequent Pattern Mining 40 . • Suppose the preﬁx property would not hold. • Consider the recursive processing of the code word bc: ◦ The extended code words are bca. . ⇒ If the added item succeeds all already present items in the chosen order. it is discarded. Christian Borgelt Frequent Pattern Mining 39 Searching with Canonical Forms Straightforward Improvement of the Extension Step: • The considered canonical form lists the items in the chosen item order. the canonical code words of all oneelement item sets. i′ } only if ik < i′ and ∀j. a rule that generates all children and only canonical code words). bcd and bce. because it is not canonical. of course. ⇒ The simpliﬁed search scheme can be exhaustive only if the preﬁx property holds. because its letters are in alphabetical order): ◦ Since acd contains neither b nor e. k k Note that it suﬃces to compare the last letters/items ik and i′ k if all frequent item sets are represented by canonical code words. this means that we generate candidates of size k + 1 by combining two frequent item sets f1 = {i1. Searching with the Preﬁx Property: Examples • Suppose the item base is B = {a. ◦ Forming w by repeatedly appending items must form v ﬁrst (otherwise the preﬁx would diﬀer). that is. . . If it is. ◦ The code word acdb is not canonical and thus it is discarded (because d > b — note that it suﬃces to compare the last two letters) ◦ The code word acde is canonical and therefore it is processed recursively. . b. . whether the described item set is frequent).
e. Christian Borgelt Frequent Pattern Mining 41 Christian Borgelt Canonical Parents and Preﬁx Trees • Item sets. • This search scheme generates each candidate item set at most once. where this item succeeds the last letter (item) of the given code word. Canonical parent tree/preﬁx tree and preﬁx tree with merged siblings for ﬁve items: a b c d e a a b b c d c d e e d d d e d e e e ab ac ad ae bc bd be cd ce de b b c c d c e d e c d c abc abd abe acd ace ade bcd bce bde cde c d d d e e e d e e d d e e abcd abce abde acde bcde d abcde e Frequent Pattern Mining 42 Canonical Parents and Preﬁx Trees a a ab b abc abd abe c d abcd abce abde d abcde ac c ad ae d ade bc c b b bd c c be d bde d e d cd d cde ce de Search Tree Pruning In applications the search tree tends to get very large. ◦ Recursively process each code word that describes a frequent item set. otherwise discard it. c. Christian Borgelt Frequent Pattern Mining 43 Christian Borgelt Frequent Pattern Mining 44 . ◦ If the item set described by the resulting extended code word is frequent. This is done by simply appending the item to the code word. • The item sets counted in a node consist of ◦ all items labeling the edges to the node (common preﬁx) and ◦ one item following the last edge label in the item order. • Support Based Pruning: ◦ No superset of an infrequent item set can be frequent. so pruning is needed.Searching with Canonical Forms Final Search Algorithm based on Canonical Forms: • Base Loop: ◦ Traverse all possible items. b. ◦ Idea: Sets with too many items can be diﬃcult to interpret. process the code word recursively. that is. (apriori property) ◦ No counters for item sets having an infrequent subset are needed. are siblings. • Size Based Pruning: ◦ Prune the tree if a certain depth (a certain size of the item sets) is reached. d. ◦ Explains the unbalanced structure of the full preﬁx tree. whose canonical code words share the same longest proper preﬁx. the canonical code words of all oneelement item sets. • This allows us to represent the canonical parent tree as a preﬁx tree or trie. • Recursive Processing: For a given (canonical) code word of a frequent item set: ◦ Generate all possible extensions by a single item. • Structural Pruning: ◦ Extensions based on canonical code words remove superﬂuous paths. • Based on a global order of the items (which can be arbitrary). because they have (by deﬁnition) the same canonical parent. acd ace d acde bcd bce d bcde A (full) preﬁx tree for the ﬁve items a.
The Order of the Items
• The structure of the (structurally pruned) preﬁx tree obviously depends on the chosen order of the items. • In principle, the order is arbitrary (that is, any order can be used). However, the number and the size of the nodes that are visited in the search diﬀers considerably depending on the order. As a consequence, the execution times of frequent item set mining algorithms can diﬀer considerably depending on the item order. • Which order of the items is best (leads to the fastest search) can depend on the frequent item set mining algorithm used. Advanced methods even adapt the order of the items during the search (that is, use diﬀerent, but “compatible” orders in diﬀerent branches). • Heuristics for choosing an item order are usually based on (conditional) independence assumptions.
The Order of the Items
Heuristics for Choosing the Item Order • Basic Idea: independence assumption It is plausible that frequent item sets consist of frequent items. ◦ Sort the items w.r.t. their support (frequency of occurrence). ◦ Sort descendingly: Preﬁx tree has fewer, but larger nodes. ◦ Sort ascendingly: Preﬁx tree has more, but smaller nodes. • Extension of this Idea: Sort items w.r.t. the sum of the sizes of the transactions that cover them. ◦ Idea: the sum of transaction sizes also captures implicitly the frequency of pairs, triplets etc. (though, of course, only to some degree). ◦ Empirical evidence: better performance than simple frequency sorting.
Christian Borgelt
Frequent Pattern Mining
45
Christian Borgelt
Frequent Pattern Mining
46
Searching the Preﬁx Tree
a a b b c d c d c c e d d e d e
b b c d e
c
d c d e
e d d d e d e e c e b b d c d c c e a d
a
b b
c
d c d e
e d d d e d e e e
e
c d c
c e d e e d d e
e
d d e
e
e d e e
d d e
e
Searching the Preﬁx Tree Levelwise
(Apriori Algorithm Revisited)
d e
d e
• Apriori ◦ Breadthﬁrst/levelwise search (item sets of same size). ◦ Subsets tests on transactions to ﬁnd the support of item sets. • Eclat ◦ Depthﬁrst search (item sets with same preﬁx). ◦ Intersection of transaction lists to ﬁnd the support of item sets.
Christian Borgelt
Frequent Pattern Mining
47
Christian Borgelt
Frequent Pattern Mining
48
Apriori: Basic Ideas
• The item sets are checked in the order of increasing size (breadthﬁrst/levelwise traversal of the preﬁx tree). • The canonical form of item sets and the induced preﬁx tree are used to ensure that each candidate item set is generated at most once. • The already generated levels are used to execute a priori pruning of the candidate item sets (using the apriori property). (a priori: before accessing the transaction database to determine the support) • Transactions are represented as simple arrays of items (socalled horizontal transaction representation, see also below). • The support of a candidate item set is computed by checking whether they are subsets of a transaction or by generating and ﬁnding subsets of a transaction. 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: {a, d, e} {b, c, d} {a, c, e} {a, c, d, e} {a, e} {a, c, d} {b, c} {a, c, d, e} {b, c, e} {a, d, e}
Apriori: Levelwise Search
a:7 b:3 c:7 d:6 e:7
• Example transaction database with 5 items and 10 transactions. • Minimum support: 30%, that is, at least 3 transactions must contain the item set. • All one item sets are frequent → full second level is needed.
Christian Borgelt
Frequent Pattern Mining
49
Christian Borgelt
Frequent Pattern Mining
50
Apriori: Levelwise Search
1: 2: 3: 4: 5: 6: 7: 8: 9: 10: {a, d, e} {b, c, d} {a, c, e} {a, c, d, e} {a, e} {a, c, d} {b, c} {a, c, d, e} {b, c, e} {a, d, e}
a:7 b:3 c:7 d:6 e:7 d c b b:0 c:4 d:5 e:6 c:3 d:1 e:1 d:4 e:4 a
Apriori: Levelwise Search
1: 2: 3: 4: 5: 6: 7: 8: 9: 10: {a, d, e} {b, c, d} {a, c, e} {a, c, d, e} {a, e} {a, c, d} {b, c} {a, c, d, e} {b, c, e} {a, d, e}
a:7 b:3 c:7 d:6 e:7 d c b b:0 c:4 d:5 e:6 c:3 d:1 e:1 d:4 e:4 a
e:4
e:4
• Determining the support of item sets: For each item set traverse the database and count the transactions that contain it (highly ineﬃcient). • Better: Traverse the tree for each transaction and ﬁnd the item sets it contains (eﬃcient: can be implemented as a simple doubly recursive procedure).
• Minimum support: 30%, that is, at least 3 transactions must contain the item set. • Infrequent item sets: {a, b}, {b, d}, {b, e}. • The subtrees starting at these item sets can be pruned.
Christian Borgelt
Frequent Pattern Mining
51
Christian Borgelt
Frequent Pattern Mining
52
Apriori: Levelwise Search
1: 2: 3: 4: 5: 6: 7: 8: 9: 10: {a, d, e} {b, c, d} {a, c, e} {a, c, d, e} {a, e} {a, c, d} {b, c} {a, c, d, e} {b, c, e} {a, d, e}
a:7 b:3 c:7 d:6 e:7 d c b b:0 c:4 d:5 e:6 c:3 d:1 e:1 d:4 e:4 c c d d d:? e:? e:? d:? e:? e:? a
Apriori: Levelwise Search
1: 2: 3: 4: 5: 6: 7: 8: 9: 10: {a, d, e} {b, c, d} {a, c, e} {a, c, d, e} {a, e} {a, c, d} {b, c} {a, c, d, e} {b, c, e} {a, d, e}
a:7 b:3 c:7 d:6 e:7 d c b b:0 c:4 d:5 e:6 c:3 d:1 e:1 d:4 e:4 c c d d d:? e:? e:? d:? e:? e:? a
e:4
e:4
• Generate candidate item sets with 3 items (parents must be frequent). • Before counting, check whether the candidates contain an infrequent item set. ◦ An item set with k items has k subsets of size k − 1. ◦ The parent item set is only one of these subsets.
• The item sets {b, c, d} and {b, c, e} can be pruned, because ◦ {b, c, d} contains the infrequent item set {b, d} and ◦ {b, c, e} contains the infrequent item set {b, e}.
Christian Borgelt
Frequent Pattern Mining
53
Christian Borgelt
Frequent Pattern Mining
54
Apriori: Levelwise Search
1: 2: 3: 4: 5: 6: 7: 8: 9: 10: {a, d, e} {b, c, d} {a, c, e} {a, c, d, e} {a, e} {a, c, d} {b, c} {a, c, d, e} {b, c, e} {a, d, e}
a:7 b:3 c:7 d:6 e:7 d c b b:0 c:4 d:5 e:6 c:3 d:1 e:1 d:4 e:4 c c d d d:3 e:3 e:4 d:? e:? e:2 a
Apriori: Levelwise Search
1: 2: 3: 4: 5: 6: 7: 8: 9: 10: {a, d, e} {b, c, d} {a, c, e} {a, c, d, e} {a, e} {a, c, d} {b, c} {a, c, d, e} {b, c, e} {a, d, e}
a:7 b:3 c:7 d:6 e:7 d c b b:0 c:4 d:5 e:6 c:3 d:1 e:1 d:4 e:4 c c d d d:3 e:3 e:4 d:? e:? e:2 a
e:4
e:4
• Only the remaining four item sets of size 3 are evaluated. • No other item sets of size 3 can be frequent.
• Minimum support: 30%, that is, at least 3 transactions must contain the item set. • Infrequent item set: {c, d, e}.
Christian Borgelt
Frequent Pattern Mining
55
Christian Borgelt
Frequent Pattern Mining
56
• Fourth access to the transaction database is not necessary. e} {b. Child Pointers: Hash Tables: Apriori: Node Organization 2 • Each node is a vector (array) of item/counter pairs (closed hashing). e} {a. e} a:7 b:3 c:7 d:6 e:7 d c b b:0 c:4 d:5 e:6 c:3 d:1 e:1 d:4 e:4 c c d d d:3 e:3 e:4 d:? e:? e:2 d e:? a e:4 e:4 • Generate candidate item sets with 4 items (parents must be frequent). • A binary search is necessary to ﬁnd the counter for an item. d} {b. c. check whether the candidates contain an infrequent item set. The order of the items cannot be exploited. • Disadvantage: Counter access is slower due to the binary search. e} a:7 b:3 c:7 d:6 e:7 d c b b:0 c:4 d:5 e:6 c:3 d:1 e:1 d:4 e:4 c c d d d:3 e:3 e:4 d:? e:? e:2 d e:? a Apriori: Levelwise Search 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: {a. e} can be pruned. c. e}. • An item is used as a direct index to ﬁnd the counter. c. d. c. d. ﬁll rate). • Advantage: Faster counter access than with binary search. c} {a. c. → It pays to represent the child pointers in a separate array. e} {a. c. • The index of a counter is computed from the item code. d. d. no unnecessary counters. e} {a. e} {a. c. d. e} {b. d. d} {b. d. c. c. d} {a. Direct Indexing: • Each node is a simple vector (array) of counters. d} {a. • Disadvantage: Memory usage can be high due to “gaps” in the index space. Christian Borgelt Frequent Pattern Mining 57 Christian Borgelt Frequent Pattern Mining 58 Apriori: Node Organization 1 Idea: Optimize the organization of the counters and the child pointers. • Disadvantage: Higher memory usage than sorted vectors (pairs. e} {a. e} {a. e} {b. • Fewer child pointers than counters are needed. • Advantage: Memory usage may be smaller. • The item set {a. e} {a. • Advantage: Counter access is extremely fast. • The deepest level of the item set tree does not need child pointers.Apriori: Levelwise Search 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: {a. Sorted Vectors: • Each node is a vector (array) of item/counter pairs. c. c. c. e} {a. c} {a. d. • Before counting. because it contains the infrequent item set {c. Christian Borgelt Frequent Pattern Mining 59 Christian Borgelt Frequent Pattern Mining 60 . • Consequence: No candidate item sets with four items. d. c. e} {b. • The sorted array of item/counter pairs can be reused for a binary search. d.
◦ Abort the recursion if the ﬁrst item is beyond the last one in the node.r. ◦ Sort descendingly: preﬁx tree has fewer nodes. ◦ Sort ascendingly: there are fewer and smaller index “gaps”. c. d. • Processing a transaction is a doubly recursive procedure. • Optimizations: ◦ Directly skip all items preceding the ﬁrst item in the node. ◦ Sort the items w. the sum of the sizes of the transactions that cover them. To process a transaction for a node of the item set tree: ◦ Go to the child corresponding to the ﬁrst item in the transaction and count the rest of the transaction recursively for that child. • Idea: It is plausible that frequent item sets consist of frequent items.t.Apriori: Item Coding • Items are coded as consecutive integers starting with 0 (needed for the direct indexing approach).t. ◦ Empirical evidence: sorting ascendingly is better. Christian Borgelt Frequent Pattern Mining 61 Apriori: Recursive Counting • The items in a transaction are sorted (ascending item codes). their frequency (group frequent items). ◦ Abort the recursion if a transaction is too short to reach the deepest level.r. Christian Borgelt Frequent Pattern Mining 62 Apriori: Recursive Counting transaction to count: {a. • The size and the number of the “gaps” in the index space depends on how the items are coded.) ◦ Discard the ﬁrst item of the transaction and process it recursively for the node itself. (In the currently deepest level of the tree we increment the counter corresponding to the item instead of going to the child node. ◦ Empirical evidence: better than simple item frequencies. e} a cde a:7 b:3 c:7 d:6 e:7 d c b b:0 c:4 d:5 e:6 c:3 d:1 e:1 d:4 e:4 c c d d d:0 e:0 e:0 d:? e:? e:0 a Apriori: Recursive Counting cde a:7 b:3 c:7 d:6 e:7 d c b b:0 c:4 d:5 e:6 c:3 d:1 e:1 d:4 e:4 c c d d d e d:1 e:1 e:0 d:? e:? e:0 de a processing: a processing: c e:4 e:4 current item set size: 3 processing: d e processing: a processing: c c de a cde a:7 b:3 c:7 d:6 e:7 d c b b:0 c:4 d:5 e:6 c:3 d:1 e:1 d:4 e:4 c c d d d:0 e:0 e:0 d:? e:? e:0 processing: a processing: d e:4 d e a cde a:7 b:3 c:7 d:6 e:7 d c b b:0 c:4 d:5 e:6 c:3 d:1 e:1 d:4 e:4 c c d d d:1 e:1 e:0 d:? e:? e:0 e:4 Christian Borgelt Frequent Pattern Mining 63 Christian Borgelt Frequent Pattern Mining 64 . • Extension: Sort items w.
• For each item the remaining suﬃx is processed in the corresponding child. processing is terminated. counters are incremented for each item in the transaction. • If the remaining transaction (suﬃx) is too short to reach the (currently) deepest level. Christian Borgelt Frequent Pattern Mining 67 Christian Borgelt Frequent Pattern Mining 68 .Apriori: Recursive Counting cde a:7 b:3 c:7 d:6 e:7 d c b b:0 c:4 d:5 e:6 c:3 d:1 e:1 d:4 e:4 c c d d e d:1 e:1 e:1 d:? e:? e:0 e a Apriori: Recursive Counting c de a:7 b:3 c:7 d:6 e:7 d c b b:0 c:4 d:5 e:6 c:3 d:1 e:1 d:4 e:4 c c d d d:1 e:1 e:1 d:? e:? e:0 a processing: a processing: d processing: e processing: c e:4 e:4 processing: a processing: e (skipped: too few items) cde a:7 b:3 c:7 d:6 e:7 d a e c b b:0 c:4 d:5 e:6 c:3 d:1 e:1 d:4 e:4 c c d d d:1 e:1 e:1 d:? e:? e:0 processing: c processing: d e:4 a de a:7 b:3 c:7 d:6 e:7 d c b b:0 c:4 d:5 e:6 c:3 d:1 e:1 d:4 e:4 e:4 c c d d d e d:1 e:1 e:1 d:? e:? e:0 Christian Borgelt Frequent Pattern Mining 65 Christian Borgelt Frequent Pattern Mining 66 Apriori: Recursive Counting de a:7 b:3 c:7 d:6 e:7 d c b b:0 c:4 d:5 e:6 c:3 d:1 e:1 d:4 e:4 e:4 c c d d e d:1 e:1 e:1 d:? e:? e:1 e a Apriori: Recursive Counting d e a:7 b:3 c:7 d:6 e:7 d c b b:0 c:4 d:5 e:6 c:3 d:1 e:1 d:4 e:4 c c d d d:1 e:1 e:1 d:? e:? e:1 a processing: c processing: d processing: e processing: d (skipped: too few items) e:4 processing: c processing: e (skipped: too few items) de a:7 b:3 c:7 d:6 e:7 d a c b b:0 c:4 d:5 e:6 c:3 d:1 e:1 d:4 e:4 c c e d d d:1 e:1 e:1 d:? e:? e:1 • Processing an item set in a node is easily implemented as a simple loop. e:4 • If the (currently) deepest tree level is reached.
e a. c. c. c. e b. • Candidates are formed by merging item sets that diﬀer in only one item. c. e a. e preﬁx tree representation a:7 b:3 c:4 d:2 e:1 c:3 d:3 e:1 e:2 d:1 e:1 e:2 • Items in transactions are sorted w. d b. but ﬁxed order). d b. c. Organization as a Preﬁx Tree: • The items in each transaction are sorted (arbitrary. Software • http://www. transactions are sorted lexicographically. d. c. d. e a. Advantages • “Perfect” pruning of infrequent candidate item sets (with infrequent subsets). d a. c. d.t. d.net/apriori. e a. Christian Borgelt Frequent Pattern Mining 70 Summary Apriori Basic Processing Scheme • Breadthﬁrst/levelwise traversal of the partially ordered set (2B . • Support counting can be done with a doubly recursive procedure. c a. FPgrowth and other algorithms) Christian Borgelt Frequent Pattern Mining 71 Christian Borgelt Frequent Pattern Mining 72 . c. ⊆). • Transactions with the same preﬁx are grouped together. e a. c. e b. some arbitrary order.html Searching the Preﬁx Tree DepthFirst (Eclat. c. • Advantage: a common preﬁx is processed only once. d. d a. • The transactions are stored in a simple list or array. c. d. e a.borgelt. • Support counting takes very long for large transactions.r. • Gains from this organization depend on how the items are coded: ◦ Common transaction preﬁxes are more likely if the items are sorted with descending frequency. e a. • Advantage: identical transaction preﬁxes are processed only once. ◦ However: an ascending order is better for the search and this dominates the execution time. c b. Disadvantages • Can require a lot of memory (since all frequent item sets are represented). d. e b. c. e a. d. e lexicographically sorted a.Apriori: Transaction Representation Direct Representation: • Each transaction is represented as an array of items. Christian Borgelt Frequent Pattern Mining 69 Apriori: Transactions as a Preﬁx Tree transaction database a. e a. then a preﬁx tree is constructed.
red : database with all transactions. then all frequent item sets that do not contain it.t. red : item sets not containing item a (but at least one other item).DepthFirst Search and Conditional Databases • A depthﬁrst search can also be seen as a divideandconquer scheme: First ﬁnd all frequent item sets that contain a chosen item. but not item a. ◦ Restrict the transaction database to those transactions that contain a. item b a a b b c • blue : item sets {a} and {a.t. Christian Borgelt Frequent Pattern Mining 76 . red : item sets containing item a (and at least one other item).r. red : database with transactions containing item a. green: item sets containing item b (and at least one other item). Larger item sets result from adding possible preﬁxes. Christian Borgelt Frequent Pattern Mining 73 DepthFirst Search and Conditional Databases d e d c cd ce de ab ac ad ae bc bd be b c c d d d cde abc abd abe acd ace ade bcd bce bde c d d d abcd abce abde acde bcde d abcde split into subproblems w. • General search procedure: ◦ Let the item order be a < b < c < · · ·. red : conditional database with all transactions. This is the conditional database for item sets without a. Christian Borgelt Frequent Pattern Mining 75 • blue : item set containing only item b. red : item sets containing neither item a nor b (but at least one other item). but not item a. but with item a removed. Recursively search this conditional database for frequent item sets and add the preﬁx a to all frequent item sets found in the recursion. green: item sets containing items a and b (and at least one other item). • green: database with transactions containing both items a and b.r. but with items a and b removed.r. ◦ Remove the item a from the transactions in the full transaction database. • green: conditional database with transactions containing item a. This is the conditional database for the preﬁx a. Recursively search this conditional database for frequent item sets. Christian Borgelt Frequent Pattern Mining 74 DepthFirst Search and Conditional Databases d e d c cd ce de ab ac ad ae bc bd be b c c d d d cde abc abd abe acd ace ade bcd bce bde c d d d abcd abce abde acde bcde d abcde split into subproblems w. green: item sets containing item a (and at least one other item). • green: database with transactions containing item b. but with item b removed. item b a a b b c DepthFirst Search and Conditional Databases d e d c cd ce de ab ac ad ae bc bd be b c c d d d cde abc abd abe acd ace ade bcd bce bde c d d d abcd abce abde acde bcde d abcde split into subproblems w. but not item b. item a a a b b c • blue : item set containing only item a. • With this scheme only frequent oneelement item sets have to be determined. b}.t.
then a is also a perfect extension of any item set J ⊇ I (as long as a ∈ J). an item a ∈ I is called a perfect extension of I w. {a. a (T. with which the recursion is started. Formal Description of the DivideandConquer Scheme A subproblem S0 = (T0. {a}) b¯ (Tab¯. ◦ A subproblem is processed by splitting it into smaller subproblems. T1 comprises all transactions in T0 that contain the item i. a divideandconquer scheme can be described as a set of (sub)problems. • All subproblems that occur in frequent item set mining can be deﬁned by ◦ a conditional transaction database and ◦ a preﬁx (of items). b}) c (Ta¯c.Formal Description of the DivideandConquer Scheme • Generally. {a}) © (Ta. ∅) ¯b¯ (Tabc.r. ∅) ¯b c ¡¡ ¡ ¡ ¡ ¡ ¡ ¡ e e e e e c ¯ c ¯ e e c ¯ e e (Tab¯. where B0 is the set of items occurring in T0. / This can most easily be seen by considering that KT (I) ⊆ KT ({a}) and hence KT (J) ⊆ KT ({a}). T . {b}) ¯ c ¡¡ ¡ ¡ ¡ ¡ ¡ ¡ e e e (Ta¯. ◦ If XT (I) is the set of all perfect extensions of an item set I w. ∅). c}) ¯ (Ta¯ . where P2 = P0. process S2 recursively. T (that is. {b. {a}) b c ¡¡ ¡ ¡ ¡ ¡ ¡ ¡ e e e (Tab. ∅) ¯ ¯ b © b d d d d d b d d d d d ¯ b • Let T be a transaction database over an item base B. {c}) ¯bc include an item (ﬁrst subproblem) • Branch to the right: exclude an item (second subproblem) (Items in the indices of the conditional transaction databases T have been removed from them. P1) with P1 = P0 ∪ {i}. {a. b}) c ¡¡ ¡ ¡ ¡ ¡ ¡ ¡ e e e e e (Ta¯. {a. c}) bc (Tabc. since KT (J) ⊆ KT (I). if XT (I) = {i ∈ B − I  sT (I ∪ {i}) = sT (I)}).t. but again with the item i removed (and empty transactions removed). then all sets I ∪ J with J ∈ 2XT (I) have the same support as I (where 2M denotes the power set of a set M ). P2). • In any case (that is. • Perfect extensions have the following properties: c ¯ ◦ If the item a is a perfect extension of an item set I. where T is the transaction database to mine and the preﬁx is empty.t. all subproblems are tuples S = (D. • The initial problem. / iﬀ the item sets I and I ∪ {a} have the same support: sT (I) = sT (I ∪ {a}) (that is. where D is a conditional transaction database and P ⊆ B is a preﬁx. ◦ If T1 is not empty. P ). regardless of whether sT0 (i) ≥ smin or not): ◦ Form the subproblem S2 = (T2. The preﬁx is a set of items that has to be added to all frequent item sets that are discovered in the conditional transaction database. but with the item i removed (and empty transactions removed). c}) • Branch to the left: (Ta¯ . Given an item set I. if all transactions containing the item set I also contain the item a). ◦ If T2 is not empty. ◦ Form the subproblem S1 = (T1. P0) is processed as follows: • Choose an item i ∈ B0.) Christian Borgelt Frequent Pattern Mining . • Formally. 79 Christian Borgelt Frequent Pattern Mining 80 (Tab. b.r. is S = (T. process S1 recursively. ◦ The initial (sub)problem is the actual problem to solve. {b}) ¯ c (Ta¯c. {a. ∅) z a ¯ (Ta. 77 Christian Borgelt Frequent Pattern Mining 78 Christian Borgelt Frequent Pattern Mining DivideandConquer Recursion Subproblem Tree $ $$ $$$ $$$ $$ $ W $ $ Perfect Extensions The search can easily be improved with socalled perfect extension pruning. which are then processed recursively. T2 comprises all transactions in T0 (whether they contain i or not). • If sT0 (i) ≥ smin (where sT0 (i) is the support of the item i in T0): ◦ Report the item set P0 ∪ {i} as frequent with the support sT0 (i).
• a is a perfect extension of {d. c. ◦ X is the set of perfect extension items. e}: 3 items 4 {a. e} 5: {a. they are removed from the conditional databases. Example: a a a a a (7) (4) (3) (3) (5) a d e (4) ae (6) (3) b bc (3) c (7) Frequent Pattern Mining c cd ce d cd ce d de e (4) (4) (6) (4) (7) 84 . d. c. Consequently. This technique is also known as hypercube decomposition. e} and {a. respectively. d. e} 4: {a. c} both have support 3. ◦ This scheme can speed up the output considerably. e}: {d. d. c. d} 7: {b. c. d. d}: {c. e} 2: {b. ◦ The reason is that generally P1 = P2 ∪ {i} and in this case T1 = T2. but are only used to generate all supersets of the preﬁx having the same support. d} 3: {a. e} as {d. the rest is already formatted in the string. c. • Once identiﬁed. 81 Christian Borgelt Frequent Pattern Mining 82 as {b} and {b. d.Perfect Extensions: Examples transaction database 1: {a. d}: 3 5 {a. item sets are reported in lexicographic order. P. ◦ Backtracking the search (return from recursion) removes an item from the preﬁx string. c}: {a. c. c} 8: {a. P0) is split into ◦ a subproblem S1 = (T1. • The divideandconquer scheme has basically the same structure as without perfect extension pruning. Christian Borgelt Frequent Pattern Mining 83 Christian Borgelt Reporting Frequent Item Sets • With the described divideandconquer scheme. the exact way in which perfect extensions are collected can depend on the speciﬁc algorithm used. ◦ a subproblem S2 = (T2. e} both have support 4. • This can be exploited for eﬃcient item set reporting: ◦ The preﬁx P is a string. in a third element of a subproblem description. d. c. c}: {c. • Suppose the item i is a perfect extension of the preﬁx P0. • Formally. e} • c is a perfect extension of {b} frequent item sets 0 items 1 item ∅: 10 {a}: 7 {b}: 3 {c}: 7 {d}: 6 {e}: 7 2 items {a. • There are no other perfect extensions in this example for a minimum support of smin = 3. ◦ Let F1 and F2 be the sets of frequent item sets that are reported when processing S1 and S2. e} 10: {a. e}: 3 6 {a. e} 9: {b. c. a subproblem is a triplet S = (T. However. P2) to ﬁnd all frequent item sets that do not contain the item i. e} 6: {a. e}: {b. ◦ Thus only one item needs to be formatted per reported frequent item set. X). e}: 4 3 4 4 4 Perfect Extension Pruning • Consider again the original divideandconquer scheme: A subproblem S0 = (T0. ◦ It is I ∪ {i} ∈ F1 ⇔ I ∈ F2 . which is extended when an item is added to P . where ◦ T is a conditional transaction database. P1) to ﬁnd all frequent item sets that contain an item i ∈ B0. ◦ P is the set of preﬁx items for T . d}: {a. perfect extension items are no longer processed in the recursion. Christian Borgelt Frequent Pattern Mining Perfect Extension Pruning • Perfect extensions can be exploited by collecting these items in the recursion. because all transaction in T0 contain item i (i is a perfect extension). • Therefore it suﬃces to solve one subproblem (namely S2) and to construct the solution of the other (S1) by adding item i.
. b}) ¯ (Ta¯e. {a}) © (Ta. {a. . 85 Christian Borgelt Frequent Pattern Mining 86 Christian Borgelt Frequent Pattern Mining Global and Local Item Order Local item orders have advantages and disadvantages: • Advantage ◦ In some data sets the order of the conditional item frequencies diﬀers considerably from the global order.Global and Local Item Order • Up to now we assumed that the item order is (globally) ﬁxed. Item Order: DivideandConquer Recursion Subproblem Tree $$$ $ $ W $ $ (T. e}) be (Tacf . ∅) ¯ ¯¯ (Tabd. and determined at the very beginning based on heuristics. ◦ The same heuristics used for determining a global item order suggest that the split item for a given subproblem should be selected from the (conditionally) most frequent item(s). ◦ The gains from the better item order may be lost again due to the more complex processing / conditioning scheme. • However. {a. b. {g}) ¯¯ • All local item orders start with a < . ◦ Not having a globally ﬁxed item order can make it more diﬃcult to determine conditional transaction databases w. ∅) ¯¯ g ¡¡ ¡ ¡ ¡ ¡ ¡ ¡ e e e e e ¯ d e ¯ e e (Tabd. Christian Borgelt Frequent Pattern Mining 87 Christian Borgelt Frequent Pattern Mining 88 Transaction Database Representation . b}) d ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ e e e e e (Ta¯.t. f }) ¯ (Tacg . {c. • All subproblems on the left share a < b < . {a. {a}) b¯ (Tacf . the item orders may diﬀer for every branch of the search tree. two subproblems must share the item order that is ﬁxed by the common part of their paths from the root (initial subproblem). {c}) ¯ f¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ e e e (Tac. d}) (Ta¯ . ◦ There is no need to choose the same item for splitting sibling subproblems (as a global item order would require us to do). {a}) b e ¡¡ ¡ ¡ ¡ ¡ ¡ ¡ e e e (Tac. .. the described divideandconquer scheme shows that a globally ﬁxed item order is more restrictive than necessary: ◦ The item used to split the current subproblem can be any item that occurs in the conditional transaction database of the subproblem. • Disadvantage ◦ The data structure of the conditional databases must allow us to determine conditional item frequencies quickly.r. ∅) a a $$$$$$ ¯ $ z (Ta. • As a consequence. {a. . ◦ Such data sets can sometimes be processed signiﬁcantly faster with local item orders (depending on the algorithm). . ◦ However. . All subproblems on the right share a < c < . split items (depending on the employed data structure).. ∅) ¯ ¯ b © b d d d d d c d d d d d c ¯ (Tab. {c}) ¯ ¯ e e ¯ f g ¯ (Tacg .
t. ◦ The transaction list of item a indicates the transactions that contain it. They diﬀer mainly in how they represent the conditional transaction databases. e Representation: List transactions for each item a b c d e 1 2 2 1 1 3 7 3 2 3 4 9 4 4 4 5 6 6 5 6 7 8 8 8 8 10 9 10 9 10 vertical representation a b 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 1 0 1 1 1 1 0 1 0 1 0 1 0 0 0 0 1 0 1 0 c d e 0 1 1 1 0 1 1 1 1 0 1 1 0 1 0 1 0 1 0 1 1 0 1 1 1 0 0 1 1 1 transaction database a. • Principle: equal preﬁxes of transactions are merged. d. d. e a. 91 Christian Borgelt Frequent Pattern Mining 92 horizontal representation matrix representation Christian Borgelt Frequent Pattern Mining . • However. d. d. c a. ◦ Note that the alternative preﬁx tree organization is still an essentially horizontal representation. e preﬁx tree representation a:7 b:3 c:4 d:2 e:1 c:3 d:3 e:1 e:2 d:1 e:1 e:2 • Note that a preﬁx tree representation is a compressed horizontal representation. e a. e b. c. c. e b. c. c. e a. e b. e a. d. c. d. • A combined representation is the frequent pattern tree (to be discussed later). which indicate the transactions that contain the item. c. • This is most eﬀective if the items are sorted descendingly w. since there are many algorithms that use a combination of the two forms of representing a database. c. d b. d. e a. their support. d. J ⊆ B : KT (I ∪ J) = KT (I) ∩ KT (J). e a. ◦ Generally. e a. c. the database is stored as a list (or array) of transactions. d. it represents its cover KT ({a}). • The alternative is a vertical transaction representation: ◦ For each item a transaction list is created. c. FPgrowth and several other frequent item set mining algorithms rely on the described basic divideandconquer scheme. c. ◦ Advantage: the transaction list for a pair of items can be computed by intersecting the transaction lists of the individual items. d a. e a. a vertical transaction representation can exploit ∀I. c b. d b. d.r. For each item a list (or array) of identiﬁers is stored. ◦ In a vertical representation. • The main approaches are horizontal and vertical representations: ◦ In a horizontal representation. c. e a. c. d a. 89 Christian Borgelt Frequent Pattern Mining 90 Christian Borgelt Frequent Pattern Mining Transaction Database Representation • Horizontal Representation: List items for each transaction • Vertical 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: a.Transaction Database Representation • Eclat. c. e b. c a. e b. e a. • Frequent item set mining algorithms also diﬀer in how they construct new conditional databases from a given one. d b. c. e a. d a. that is. e a. c. e a. this distinction is not pure. e Transaction Database Representation lexicographically sorted a. d. a database is represented by ﬁrst referring with a list (or array) to the diﬀerent items. Transaction Database Representation • The Apriori algorithm uses a horizontal transaction representation: each transaction is an array of the contained items. c. c. d. each of which is a list (or array) of the items contained in it. c.
c. • No subset tests and no subset generation are needed to compute the support. e} {a. • The search scheme is the same as the general scheme for searching with canonical forms having the preﬁx property and possessing a perfect extension rule (generate only canonical extensions). e} {b. If all computed support values were stored. Christian Borgelt Frequent Pattern Mining 93 Christian Borgelt Frequent Pattern Mining 94 Eclat: Subproblem Split a 7 1 3 4 5 6 8 10 b 3 2 7 9 c 7 2 3 4 6 7 8 9 c 7 2 3 4 6 7 8 9 d 6 1 2 4 6 8 10 e 7 1 3 4 5 8 9 10 b c d e 0 4 5 6 3 1 1 4 4 3 6 6 4 8 8 5 10 8 10 ↑ Conditional database for preﬁx a (1st subproblem) a b c d e 7 3 7 6 7 b c d e 0 4 5 6 Eclat: DepthFirst Search 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: {a. because it (usually) does not store the support of all visited item sets. c. e} {a. The support of item sets is rather determined by intersecting transaction lists. e} {b. c. e} a:7 b:3 c:7 d:6 e:7 b 3 2 7 9 d 6 1 2 4 6 8 10 e 7 1 3 ← Conditional 4 database 5 with item a 8 removed 9 (2nd subproblem) 10 b c d e 3 7 6 7 ↑ Conditional database for preﬁx a (1st subproblem) ← Conditional database with item a removed (2nd subproblem) • Form a transaction list for each item. d. Here: bit vector representation. • Eclat uses a purely vertical transaction representation. e} {a. c. Parthasarathy. Christian Borgelt Frequent Pattern Mining 95 Christian Borgelt Frequent Pattern Mining 96 . c. c} {a. The Eclat Algorithm [Zaki. it could be implemented in such a way that all support values needed for full a priori pruning were available. because it does not store the support of all explored item sets. e} {a. d. ∗ Note that Eclat cannot fully exploit the Apriori property. d. d} {b.∗ As a consequence it cannot fully exploit the Apriori property for pruning. not because it cannot know it. and Li 1997] • Eclat generates more candidate item sets than Apriori.Eclat: Basic Ideas • The item sets are checked in lexicographic order (depthﬁrst traversal of the preﬁx tree). Ogihara. ◦ grey: item is contained in transaction ◦ white: item is not contained in transaction • Transaction database is needed only once (for the single item transaction lists). d} {a. d. c.
d} {b. • With Apriori this item set could be pruned before counting. c. c. • The item set {a. c. c} with the transaction lists of the item sets {a. d. c. • Count the number of bits that are set (number of containing transactions). c. d} and {a. e} {a. Christian Borgelt Frequent Pattern Mining 100 . Christian Borgelt Frequent Pattern Mining 97 Christian Borgelt Frequent Pattern Mining 98 Eclat: DepthFirst Search 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: {a. d. e} {b. e} {a. c. c. c. c} {a. d. d. Christian Borgelt Frequent Pattern Mining 99 • Intersect the transaction lists for the item sets {a. d. e} {a. because it was known that {c. c. This yields the support of all item sets with the preﬁx ac. d. c. c. d. d} {b. c. c} {a. e} {a. c. e} {b.Eclat: DepthFirst Search 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: {a. e} {b. e} {a. c. d. d. b} is infrequent and can be pruned. e} {b. d. x ∈ {d. e} {a. e} a:7 b:3 c:7 d:6 e:7 a b:0 c:4 d:5 e:6 c d:3 e:3 d e:2 • Intersect the transaction list for the item set {a. d. e} {b. d. c. d} {a. c. e} a:7 b:3 c:7 d:6 e:7 a b:0 c:4 d:5 e:6 c d:3 e:3 Eclat: DepthFirst Search 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: {a. This yields the support of all item sets with the preﬁx a. e} a:7 b:3 c:7 d:6 e:7 a b:0 c:4 d:5 e:6 Eclat: DepthFirst Search 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: {a. e} {a. c. d. • Result: Transaction lists for the item sets {a. d. c. e} {b. e}. e}. • Count the number of bits that are set (number of containing transactions). c. e} {a. e} {a. e} is infrequent. e} {a. c. d. x}. d. e} {a. d} {b. d} and {a. d} {a. e} {a. c. d} {b. c. e} {a. c} {a. c. e} a:7 b:3 c:7 d:6 e:7 a b:0 c:4 d:5 e:6 • Intersect the transaction list for item a with the transaction lists of all other items (conditional database for item a). d} {a. • Result: Transaction list for the item set {a. e} {a. c} {a. e}. c. c. c. c. e} {b. e}. d} {a. d. d. c. e} {b. • All other item sets with the preﬁx a are frequent and are therefore kept and processed recursively. e} {a. e} {a. c.
c. e} {a. d. d. d} {b. e} {a. c. e}. c. c. e}. • The search backtracks to the second level of the search tree and intersect the transaction list for the item sets {a. • Since there is no transaction list left (and thus no intersection possible). c} {a. e} {a. e} {b. e} {a. c. c. d} {a. d} and {a. • Result: Transaction list for the item set {a. c} {a. e} a:7 b:3 c:7 d:6 e:7 a b:0 c:4 d:5 e:6 c d:3 e:3 d e:2 Eclat: DepthFirst Search 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: {a. the recursion is terminated and the search backtracks again. e} {a. d. c. • Result: Transaction lists for the item sets {b. c. e} {a. e}. c. e} {a. d. d. e} {a. d. the recursion is terminated and the search backtracks again. e} {a. d. e} {a. e} {a.Eclat: DepthFirst Search 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: {a. d. d. e} {b. c. Christian Borgelt Frequent Pattern Mining 101 Christian Borgelt Frequent Pattern Mining 102 Eclat: DepthFirst Search 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: {a. • Since there is only one transaction list left (and thus no intersection possible). c. d. e} is not frequent (support 2/20%) and therefore pruned. the recursion is terminated and the search backtracks. d} {a. e} {b. e} {b. d}. c. d. e} a:7 b:3 c:7 d:6 e:7 a b:0 c:4 d:5 e:6 c d:3 e:3 d e:2 d e:4 b c:3 d:1 e:1 • The search backtracks to the ﬁrst level of the search tree and intersect the transaction list for b with the transaction lists for c. d} {b. c. c} {a. d} {a. c} {a. c}. {b. c. c. d. and e. d} {b. c. d. c. e} {a. d. e} {b. e} a:7 b:3 c:7 d:6 e:7 a b:0 c:4 d:5 e:6 c d:3 e:3 d e:2 d e:4 • The item set {a. c. e} {b. e} {a. e} {b. d. d} {a. c. d. d. e} {a. c. e} {a. c. c. c. e} {b. e} {a. • Only one item set has suﬃcient support → prune all subtrees. d} {b. c. d. Christian Borgelt Frequent Pattern Mining 103 Christian Borgelt Frequent Pattern Mining 104 . • Since there is only one transaction list left (and thus no intersection possible). d. e} a:7 b:3 c:7 d:6 e:7 a b:0 c:4 d:5 e:6 c d:3 e:3 d e:2 d e:4 b c:3 d:1 e:1 Eclat: DepthFirst Search 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: {a. c. and {b.
d. d. c. c. c. d} {b. c. e} {a. c. c. c. c. e} {b. d. e} a:7 b:3 c:7 d:6 e:7 a b:0 c:4 d:5 e:6 c d:3 e:3 d e:2 d e:4 c b c:3 d:1 e:1 d d:4 e:4 d e:2 e:4 • The item set {c. e} a:7 b:3 c:7 d:6 e:7 a b:0 c:4 d:5 e:6 c d:3 e:3 d e:2 d e:4 c b c:3 d:1 e:1 d:4 e:4 d e:2 • Backtrack to the ﬁrst level of the search tree and intersect the transaction list for c with the transaction lists for d and e. e} {b. c. e} {a. the recursion is terminated and the search backtracks. c. d} {a. e} {b.Eclat: DepthFirst Search 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: {a. d} {b. c} {a. • Result: Transaction list for the item set {c. e} {a. d. e} a:7 b:3 c:7 d:6 e:7 a b:0 c:4 d:5 e:6 c d:3 e:3 d e:2 d e:4 c b c:3 d:1 e:1 d:4 e:4 Eclat: DepthFirst Search 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: {a. c} {a. • With this step the search is ﬁnished. e} {b. d. d. c. e} is not frequent (support 2/20%) and therefore pruned. d} {a. d. e} {b. c. c. c. c. e} {b. d. d. d} {a. e} {a. d} and {c. c. e}. c. d. d. c. d. e} {a. d. d} {b. Christian Borgelt Frequent Pattern Mining 107 Christian Borgelt Frequent Pattern Mining 108 . c} {a. c} {a. e} {a. • The search backtracks to the ﬁrst level of the search tree and intersect the transaction list for d with the transaction list for e. e} {b. e} {b. d. c. d} and {c. c. e} {a. e} {a. e}. e} {a. c. • Result: Transaction lists for the item sets {c. e} {a. e}. e}. e} {a. e} {a. • Result: Transaction list for the item set {d. d. • Since there is no transaction list left (and thus no intersection possible). e} {a. c. d. e} {a. Christian Borgelt Frequent Pattern Mining 105 Christian Borgelt Frequent Pattern Mining 106 Eclat: DepthFirst Search 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: {a. e} a:7 b:3 c:7 d:6 e:7 a b:0 c:4 d:5 e:6 c d:3 e:3 d e:2 d e:4 c b c:3 d:1 e:1 d:4 e:4 d e:2 Eclat: DepthFirst Search 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: {a. e} {a. • Intersect the transaction list for the item sets {c. d} {b. c. e} {a. d} {a. d. c. d.
e} a:7 b:3 c:7 d:6 e:7 a b:0 c:4 d:5 e:6 c d:3 e:3 d e:2 d e:4 c b c:3 d:1 e:1 d d:4 e:4 d e:2 e:4 • The found frequent item sets coincide. c. d. d} {a. d. e} {a. d. Christian Borgelt Frequent Pattern Mining 109 • Note that the item set {a. e} could be pruned by Apriori without computing its support.append(head(src1)). Eclat: Intersecting Transaction Lists function isect (src1. their frequency (individual or transaction size sum) leads to a better structure of the search tree. d. e} {b. a fundamental diﬀerence is that Eclat usually only writes found frequent item sets to an output ﬁle. e} {a. e} {b.Eclat: DepthFirst Search 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: {a. d. c.t. while Apriori keeps the whole search tree in main memory. e} {a. ◦ Sparse: lists of row indices of set bits (transaction lists). e} {a. e} {b. e} {a. • Normal and sparse representation of bit matrices: ◦ Normal: one memory bit per matrix bit. d} {b. d. d. c. (∗ created intersection ∗) while both src1 and src2 are not empty do begin if head(src1) < head(src2) (∗ skip transaction identiﬁers that are ∗) then src1 = tail(src1). e} is infrequent. c} {a. d. because the item set {c. c. (∗ unique to the ﬁrst source list ∗) elseif head(src1) > head(src2) (∗ skip transaction identiﬁers that are ∗) then src2 = tail(src2). d. c. c. (∗ function isect() ∗) 111 Christian Borgelt Frequent Pattern Mining 112 Christian Borgelt Frequent Pattern Mining . Christian Borgelt Frequent Pattern Mining 110 Eclat: Bit Matrices and Item Coding Bit Matrices • Represent transactions as a bit matrix: ◦ Each column corresponds to an item. d} {b. c. • However. (∗ remove the transferred transaction id ∗) end. with those found by the Apriori algorithm. zeros represented. c. e} {b. e} {a. end. c. • Which representation is preferable depends on the ratio of set bits to cleared bits. (∗ unique to the second source list ∗) else begin (∗ if transaction id is in both sources ∗) dst. (∗ from both source lists ∗) return dst. It is debatable whether the potential gains justify the memory requirement. d. e} {a. d} {a. e} {a. src2 = tail(src2). (∗ append it to the output list ∗) src1 = tail(src1).r. of course. c. c. c} {a. • The same can be achieved with Eclat if the depthﬁrst traversal of the preﬁx tree is carried out from right to left and computed support values are stored. src2 : tidlist) begin (∗ — intersect two transaction id lists ∗) var dst : tidlist. Item Coding • Sorting the items ascendingly w. c. ◦ Each row corresponds to a transaction. e} a:7 b:3 c:7 d:6 e:7 a b:0 c:4 d:5 e:6 c d:3 e:3 d e:2 d e:4 c b c:3 d:1 e:1 d d:4 e:4 d e:2 e:4 Eclat: DepthFirst Search 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: {a. c. (∗ return the created intersection ∗) end.
Eclat: Transaction Ranges transaction database a. d a. • Data is represented as lists of transaction identiﬁers (one per item). using diﬀsets speeds up the search considerably. 4 6 . c. b a. e a. 10 8 DT (a  I) contains the indices of the transactions that contain I but not a. c. b ∈ I. b a 1 . d a. 8 9 9 . 10 b Eclat: Diﬀerence sets (Diﬀsets) • In a conditional database. c. e a. all transaction lists are “ﬁltered” by the preﬁx: Only transactions contained in the transaction list for the preﬁx can be in the transaction lists of the conditional database. d c. 7 8 8 . e. • Support counting is done by intersecting lists of transaction identiﬁers. . d c. d c. b a. d a. e. e a. c. b a. c. • Exploit item frequencies and ensure subset relations between ranges from lower to higher frequecies.net/eclat. e. d. . c. e b. d. e a. b} ⊆ tk } = {k  I ⊆ tk ∧ a ∈ tk } −{k  I ⊆ tk ∧ a ∈ tk ∧ b ∈ tk } = {k  I ⊆ tk ∧ a ∈ tk ∧ b ∈ tk } / = {k  I ⊆ tk ∧ b ∈ tk } / −{k  I ⊆ tk ∧ b ∈ tk ∧ a ∈ tk } / / = {k  I ⊆ tk ∧ b ∈ tk } / −{k  I ⊆ tk ∧ a ∈ tk } / = ({k  I ⊆ tk } − {k  I ∪ {b} ⊆ tk }) −({k  I ⊆ tk } − {k  I ∪ {a} ⊆ tk }) = (KT (I) − KT (I ∪ {b}) −(KT (I) − KT (I ∪ {a}) = D(b  I) − D(a  I) Christian Borgelt Frequent Pattern Mining 115 Summary Eclat Basic Processing Scheme • Depthﬁrst traversal of the preﬁx tree. e a. d. 7 8 . Disadvantages • With a sparse transaction list representation (row indices) eclat is diﬃcult to execute for modern processors (branch prediction). . e. so that intersecting the lists is easy. d. e. c. e b. . • Usually (considerably) faster than Apriori. . • The support of direct supersets of I can now be computed as ∀I : ∀a ∈ I : / sT (I ∪ {a}) = sT (I) − DT (a  I). e. 4 e 1 .html Christian Borgelt Frequent Pattern Mining 116 . c. a = b : DT (b  I ∪ {a}) = DT (b  I) − DT (a  I) • For some transaction databases. c a. e a. . Software • http://www. e item frequencies a: b: c: d: e: 7 3 7 6 7 sorted by frequency a. e a. b c. . 9 9 10 . d c. 113 Christian Borgelt Frequent Pattern Mining 114 Eclat: Diﬀsets Proof of the Formula for the Next Level: DT (b  I ∪ {a}) = KT (I ∪ {a}) − KT (I ∪ {a. c. c. c. . b c. d a. 7 c 1 . e a. . d. e. e. • This suggests the idea to use diﬀsets to represent conditional databases: / ∀I : ∀a ∈ I : DT (a  I) = KT (I) − KT (I ∪ {a}) 5 . c. • The transaction lists can be compressed by combining consecutive transaction identiﬁers into ranges. d. Advantages • Depthﬁrst search reduces memory requirements. e. d lexicographically sorted 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: a. c. c. Christian Borgelt Frequent Pattern Mining The diﬀsets for the next level can be computed by / ∀I : ∀a.borgelt. . d a. e. d a. d b. 3 4 . 3 d 2 . b}) = {k  I ∪ {a} ⊆ tk } − {k  I ∪ {a.
This demonstrates that the traversal order for the preﬁx tree and the representation form of the transaction database can be combined freely. Christian Borgelt Frequent Pattern Mining 119 Christian Borgelt Frequent Pattern Mining 120 . • Recursive processing of the conditional transaction databases. 3. for ﬁrst subproblem) ◦ Move all transactions starting with the same item to a new array. Christian Borgelt Frequent Pattern Mining 117 Christian Borgelt Frequent Pattern Mining 118 SaM: Preprocessing the Transaction Database SaM: Basic Operations e removed 1 ad acde bd bcdg bcf abd bde bcde bc abdf 2 g: f: e: a: c: b: d: 1 2 3 4 5 8 8 smin = 3 3 ad eacd bd cbd cb abd ebd ecbd cb abd 4 eacd ecbd ebd abd abd ad cbd cb cb bd 5 1 1 1 2 1 1 2 1 e a c d e c b d e b d a b d a d c b d c b b d 1 1 1 2 1 1 2 1 e a c d e c b d e b d a b d a d c b d c b b d 1 split 2 prefix e 2 1 1 1 2 1 a b d a d c b d c b b d 2 2 2 a c d a b d a d c b d c b b d merge prefix e 1 1 1 a c d c b d b d 1 1 1 a c d c b d b d • Split Step: (on the left. SaM uses a purely horizontal transaction representation. 4.SaM: Basic Ideas • The item sets are checked in lexicographic order (depthﬁrst traversal of the preﬁx tree). Data structure used by the algorithm. • Merge Step: (on the right. 2. Transactions sorted lexicographically in descending order (comparison of items inverted w. • The data structure used is a simply array of transactions.r.t. for second subproblem) ◦ Merge the rest of the transaction array and the copied transactions. Items in transactions sorted ascendingly w. 5. Due to these steps the algorithm is called Split and Merge (SaM). preceding step). ◦ The merge operation is similar to a mergesort phase. 1. ◦ Remove the common leading item (advance pointer into transaction). • The two conditional databases for the two subproblems formed in each step are created with a split step and a merge step. their frequency.r. Original transaction database.t. The SaM Algorithm Split and Merge Algorithm [Borgelt 2008] • While Eclat uses a purely vertical transaction representation. • Step by step elimination of items from the transaction database. Frequency of individual items.
end. else b[0]. (∗ initialize split result and item support ∗) i := a[0]. begin while a is not empty do i := a[0]. • In this case SaM can become fairly slow. (∗ keep only one copy per transaction ∗) while c is not empty do (∗ copy rest of transactions in c ∗) remove c[0] from c and append it to a.wgt := b[0]. (∗ support of the split item ∗) b: array of transactions. (∗ initialize the output array ∗) while b and c are both not empty do (∗ merge split and rest of database ∗) if c[0]. (∗ buﬀer for rest of source array ∗) begin (∗ — merge step ∗) c := a. (∗ empty transactions are eliminated ∗) • Note that the split step also determines the support of the item i. s := 0. report p with support s(i). else if c[0].items < b[0]. Christian Borgelt Frequent Pattern Mining 124 . (∗ move it to the conditional database. the two transaction arrays to merge can substantially diﬀer in size. SaM(b.wgt +c[0]. ◦ This version of mergesort would be equivalent to insertion sort. end. smaller transaction from c ∗) then remove c[0] from c and append it to a.wgt. end. smin).items[0] = i do (∗ next transaction starts with same item ∗) s := s + a[0].items is not empty (∗ if transaction has not become empty ∗) then remove a[0] from a and append it to b. because the merge step processes many more transactions than the split step. 121 Christian Borgelt Frequent Pattern Mining 122 (∗ (∗ (∗ (∗ (∗ (∗ (∗ (∗ (∗ (∗ (∗ (∗ (∗ (∗ (∗ conditional database to process ∗) preﬁx of the conditional database a ∗) minimum support of an item set ∗) buﬀer for the split item ∗) split result ∗) — split and merge recursion — ∗) while the database is not empty ∗) get leading item of ﬁrst transaction ∗) split step: ﬁrst subproblem ∗) merge step: second subproblem ∗) if the split item is frequent: ∗) extend the preﬁx item set and ∗) report the found frequent item set ∗) process the split result recursively. end.SaM: PseudoCode function SaM (a: array of transactions. b: array of transactions. • Intuitive explanation (extreme case): ◦ Suppose mergesort always merged a single element with the recursively sorted rest of the array (or list). a := empty. (∗ get leading item of ﬁrst transaction ∗) while a is not empty (∗ while database is not empty and ∗) and a[0]. (∗ sum the occurrences/weights ∗) remove b[0] from b and append it to a. (∗ sum occurrences (compute support) ∗) remove i from a[0]. p. (∗ second recursion: executed by loop ∗) Christian Borgelt Frequent Pattern Mining 123 SaM: Optimization • If the transaction database is sparse. (∗ split result ∗) begin (∗ — split step ∗) b := empty. p: set of items.items (∗ copy lex. ∗) then restore the original preﬁx ∗) Frequent Pattern Mining SaM: PseudoCode — Merge Step var c: array of transactions.items > b[0]. smaller transaction from b ∗) then remove b[0] from b and append it to a. (∗ move combined transaction and ∗) (∗ delete the other. if s(i) ≥ smin then p := p ∪ {i}. (∗ function SaM() ∗) Christian Borgelt SaM: PseudoCode — Split Step var i: item. p := p − {i}. end. end. smin: int) var i: item. (∗ remove split item from transaction ∗) if a[0]. equal transaction: ∗) end. else remove a[0] from a. ∗) (∗ otherwise simply remove it: ∗) end.items. ◦ Idea: use the same optimization as in binary search based insertion sort. end. • Possible optimization: ◦ Modify the merge step if the arrays to merge diﬀer signiﬁcantly in size. ◦ As a consequence the time complexity worsens from O(n log n) to O(n2). while b is not empty do (∗ copy rest of transactions in b ∗) remove b[0] from b and append it to a. end. (∗ buﬀer for the split item ∗) s: int.items[0]. end.wgt.items[0]. merge b and the rest of a into a.items (∗ copy lex. remove c[0] from c. move transactions starting with i to b.
(∗ binary search variables ∗) c: array of transactions. (∗ and remove trans.items = c[i]. Software • http://www. Basic Processing Scheme Summary SaM • Depthﬁrst traversal of the preﬁx tree. (∗ that is equal to the one just copied. ◦ Even the transaction array can easily be stored on external storage or as a relational database table. • Easy to implement for operation on external storage / relational databases. b: array of transactions) : array of transactions var l.html Christian Borgelt Frequent Pattern Mining 127 Christian Borgelt Frequent Pattern Mining 128 . • Support counting is done implicitly in the split step. from the rest ∗) while a is not empty do (∗ copy rest of transactions in a ∗) remove a[0] from a and append it to c.. ◦ If both sources have become large. which has to be created by moving and merging transactions from both sources. end. SaM: PseudoCode — Binary Search Based Merge .net/sam.SaM: PseudoCode — Binary Search Based Merge function merge (a. they may be merged in order to empty one source. (∗ decrement the transaction counter ∗) . (∗ then sum the transaction weights ∗) end. ∗) end.borgelt.. • Data is represented as an array of transactions (purely horizontal representation). m. else r := m.wgt. (∗ get its index in the output array ∗) if a is not empty and a[0].wgt +a[0].. (∗ copy the transaction to insert and ∗) i := length(c) − 1. return c. Disadvantages • Can be slow on sparse transaction databases due to the merge step.. (∗ output transaction array ∗) begin (∗ — binary search based merge — ∗) c := empty. (∗ copy lex. ◦ A split result. l := l − 1. larger transaction and ∗) end. one source is the input database and the other source is empty. Advantages • Very simple data structure and processing scheme. r := length(a). remove b[0] from b and append it to c. while b is not empty do (∗ copy rest of transactions in b ∗) remove b[0] from b and append it to c. ◦ The fact that the transaction array is processed linearly is advantageous for external storage operations. (∗ if there is a transaction in the rest ∗) remove a[0] from a. (∗ according to the comparison result ∗) while l > 0 do (∗ while still before insertion position ∗) remove a[0] from a and append it to c. end. r: int. (∗ return the merge result ∗) end. 125 Christian Borgelt Frequent Pattern Mining 126 Christian Borgelt Frequent Pattern Mining SaM: Optimization and External Storage • Accepting a slightly more complicated processing scheme. (∗ initialize the binary search range ∗) while l < r do (∗ while the search range is not empty ∗) m := ⌊ l+r ⌋. • Note that SaM can easily be implemented to work on external storage: ◦ In principle.items then c[i]. is always merged to the smaller source. (∗ function merge() ∗) • Applying this merge procedure if the length ratio of the transaction arrays exceeds 16:1 accelerates the execution on sparse data sets. one may work with double source buﬀering: ◦ Initially. (∗ compute the middle index ∗) 2 if a[m] < b[0] (∗ compare the transaction to insert ∗) then l := m + 1. (∗ and adapt the binary search range ∗) end. (∗ initialize the output array ∗) while a and b are both not empty do (∗ merge the two transaction arrays ∗) l := 0.wgt = c[i]. the transactions need not be loaded into main memory.
5.r. Data structure used by the algorithm (leading items implicit in list). • Note that after a simple reassignment there may be duplicate list elements.Recursive Elimination: Basic Ideas • The item sets are checked in lexicographic order (depthﬁrst traversal of the preﬁx tree).t. Items in transactions sorted ascendingly w. thus employing the core idea of radix sort. transactions are reassigned to other lists (based on the next item in the transaction). see bottom left). preceding step). 4.r. • Step by step elimination of items from the transaction database. 3. • RElim is inspired by the FPGrowth algorithm (discussed later) and closely related to the Hmine algorithm (but simpler data structure) Christian Borgelt Frequent Pattern Mining 129 Christian Borgelt Frequent Pattern Mining 130 RElim: Preprocessing the Transaction Database RElim: Basic Operations d b c a e 0 1 3 3 3 d b c a 0 1 1 1 4 e 3 1 · · · a c d ecbd ebd same abd as for abd SaM ad cbd cb cb bd 5 d b c a e 0 1 3 3 3 initial database prefix e 1 d 1 2 b d b 2 1 b d d 1 1 1 a c d c b d b d 1 d 1 b d 1 c d 1 d 1 2 b d b 2 1 b d d 1 1 1 a c d c b d b d d b c a 0 2 4 4 e eliminated 1 d d 1 1 2 b d b d b 1 2 1 c d b d d 1. only transactions starting with an item are in the corresponding list. • Recursive processing of the conditional transaction databases. Transactions sorted lexicographically in descending order (comparison of items inverted w. Frequent Pattern Mining 131 1 The basic operations of the RElim algorithm. The RElim Algorithm Recursive Elimination Algorithm [Borgelt 2005] • Avoids the main problem of the SaM algorithm: does not use a merge operation to group transactions with the same leading item. The rightmost list is traversed and reassigned: once to an initially empty list array (conditional database for the preﬁx e. Original transaction database.t. However. • After an item has been processed. 2. These two databases are then both processed recursively. see top right) and once to the original list array (eliminating item e. Christian Borgelt Christian Borgelt Frequent Pattern Mining 132 . • RElim rather maintains one list of transactions per item. Frequency of individual items. their frequency.
return n. • Competitive with the fastest algorithms despite this simplicity. Software • http://www. (∗ go to the next list element. (∗ remove the leading item from the copy ∗) if u. (∗ get the list associated with the item ∗) while t = nil do (∗ while not at the end of the list ∗) u := copy of t. (∗ support of the current item ∗) n: int. (∗ initialize the number of found item sets ∗) while a is not empty do (∗ while conditional database is not empty ∗) i := last item of a. (∗ conditional database for current item ∗) t. (∗ create an empty list array ∗) t := a[i]. ∗) k := u.head = u. p := p − {i}. a[k]. (∗ go to the next list element.. (∗ remove the processed list ∗) end. (∗ function RElim() ∗) • In order to remove duplicate elements.succ. and ∗) remove k from u.head.wgt := b[k].head. Christian Borgelt Frequent Pattern Mining 135 Summary RElim Basic Processing Scheme • Depthﬁrst traversal of the preﬁx tree. report p with support s. (∗ sum the transaction weight ∗) end. (∗ return the number of frequent item sets ∗) end.wgt := a[k].succ. p.. a[k].borgelt.head.wgt. (∗ note the current list element.items[0]. smin). (∗ number of found frequent item sets ∗) b: array of transaction lists.items is not empty (∗ add the copy to the conditional database ∗) then u..items. (∗ preﬁx of the conditional database a ∗) (∗ minimum support of an item set ∗) smin: int) : int var i. and ∗) remove k from u..succ = b[k]. u: transaction list element.RElim: PseudoCode function RElim (a: array of transaction lists. (∗ and process it recursively..wgt.wgt. t := t.. b[k]. t := t. (∗ copy the transaction list element. • Support counting is implicit in the (re)assignment step. (∗ sum the transaction weight ∗) (∗ in the list weight/transaction counter ∗) end.items is not empty (∗ reassign the noted list element ∗) then u. (∗ then restore the original item set preﬁx ∗) (∗ go on by reassigning ∗) . ∗) end. (∗ get the list associated with the item ∗) while t = nil do (∗ while not at the end of the list ∗) u := t. Advantages • Simple data structures and processing scheme. t := a[i]. end. (∗ in the list weight/transaction counter ∗) remove a[i] from a.net/relim. ∗) . end. (∗ to traverse the transaction lists ∗) begin (∗ — recursive elimination — ∗) n := 0. it is usually advisable to sort and compress the next transaction list before it is processed. (∗ the processed transactions ∗) Christian Borgelt Frequent Pattern Mining 134 RElim: PseudoCode ..succ = a[k].html Christian Borgelt Frequent Pattern Mining 136 .items[0]. (∗ cond. (∗ buﬀer for the current item ∗) s: int. database to process ∗) p: set of items. end. (∗ then restore the original preﬁx ∗) Christian Borgelt Frequent Pattern Mining 133 RElim: PseudoCode if s ≥ smin then (∗ if the current item is frequent: ∗) . ∗) k := u. b[k]. (∗ report the found frequent item set ∗) b := array of transaction lists.items.. k: item. n := n + 1 + RElim(b...head.wgt +u.wgt +u. Disadvantages • RElim is usually outperformed by FPgrowth (discussed next). (∗ report the found frequent item set ∗) (∗ create conditional database for i ∗) . • Data is represented as lists of transactions (one per item). (∗ get the next item to process ∗) if s ≥ smin then (∗ if the current item is frequent: ∗) (∗ extend the preﬁx item set and ∗) p := p ∪ {i}. (∗ process the created database recursively ∗) (∗ and sum the found frequent item sets. s := a[i].head = u. (∗ remove the leading item from current ∗) if u.
and Yin 2000] • Recursive processing of the conditional transaction databases. • Frequent single item sets can be read directly from the FPtree. Data structure used by the algorithm (details on next slide). 1 adf acde bd bcd bc abd bde bceg cdf abd d: 8 b: 7 c: 5 c: 1 a: 4 e: 3 4 db dbc dba dba dbe dc dcae da bc bce b: 5 d: 8 c: 2 a: 2 e: 1 a: 1 a: 1 e: 1 b: 2 c: 2 e: 1 frequent pattern tree Frequent Pattern Mining 140 Frequent Pattern Mining 139 Christian Borgelt . This combines a horizontal and a vertical database representation. The FPGrowth Algorithm Frequent Pattern Growth Algorithm [Han.t. Items in transactions sorted descendingly w. Transactions sorted lexicographically in ascending order (comparison of items is the same as in preceding step). • The item sets are checked in lexicographic order (depthﬁrst traversal of the preﬁx tree). An FPtree is basically a preﬁx tree with additional structure: nodes of this tree that correspond to the same item are linked. Original transaction database. Frequency of individual items. Simple Example Database 1 adf acde bd bcd bc abd bde bceg cdf abd 2 d: b: c: a: e: f: g: 8 7 5 4 3 2 1 smin = 3 3 da dcae db dbc bc dba dbe bce dc dba 4 db dbc dba dba dbe dc dcae da bc bce 5 FPtree (see next slide) 1. • This data structure is used to compute conditional databases eﬃciently. • Step by step elimination of items from the transaction database. 5. Christian Borgelt 4. Christian Borgelt Frequent Pattern Mining 137 Christian Borgelt Frequent Pattern Mining 138 FPGrowth: Preprocessing the Transaction Database Transaction Representation: FPTree • Build a frequent pattern tree (FPtree) from the transactions (basically a preﬁx tree with links between the branches that link nodes with the same item and a header table for the resulting item lists).FPGrowth: Basic Ideas • FPGrowth means Frequent Pattern Growth. • The transaction database is represented as an FPtree. 3. 2.r. their frequency and infrequent items removed. Pei. All transactions containing a given item can easily be found by the links between the nodes corresponding to this item.
t. • This yields an FPtree of the conditional database (database of transactions containing the item i. • Horizontal Representation: preﬁx tree of transactions Vertical Representation: links between the preﬁx tree branches d: 8 b: 7 c: 5 c: 1 b: 5 d: 8 c: 2 a: 2 e: 1 a: 1 a: 1 b: 2 c: 2 e: 1 frequent pattern tree e: 1 a: 4 e: 3 Recursive Processing • The initial FPtree is projected w.r.e. • From the projected FPtree the frequent item sets containing item i can be read directly. In this way the algorithm can be executed on a ﬁxed amount of memory. In principle. • The projected FPtree is processed recursively. but with this item removed — it is implicit in the FPtree and recorded as a common preﬁx).r. the preﬁx tree). • By processing an FPtree from left to right (or from top to bottom w. • For the insertion into the new tree there are two approaches: ◦ Apart from a parent pointer (which is needed for the path extraction). but usually equally eﬃcient projection scheme is to extract a path to the root as a (reduced) transaction and to insert this transaction into a new FPtree.Transaction Representation: FPTree • An FPtree combines a horizontal and a vertical transaction representation. all nodes referring to the same item can be stored in an array rather than a list. • The FPtree of the conditional database for this item is created by copying the nodes on the paths to the root. all transactions containing this item can be found. the projection may even reuse the already present nodes and the already processed part of the header table (topdown fpgrowth).t. the item corresponding to the rightmost level in the tree (let this item be i). the traversal of the item lists yields the (reduced) transactions in lexicographical order. i. This can be exploited to insert a transaction using only the header table. 143 Christian Borgelt Frequent Pattern Mining 144 c: 1 b: 1 ↑ detached projection • By traversing the node list for the rightmost item. 141 Christian Borgelt Frequent Pattern Mining 142 Note: the preﬁx tree is inverted. • Afterwards the reduced original FPtree is further processed by working on the next level leftwards. Christian Borgelt Frequent Pattern Mining Projecting an FPTree d: 8 b: 7 c: 5 c: 1 b: 5 b: 1 d: 8 d: 2 b: 2 b: 1 c: 2 c: 1 a: 1 c: 2 c: 1 e: 1 ← FPtree with attached projection a: 1 a: 1 a: 2 e: 1 e: 1 d: 2 a: 4 e: 3 d: 2 b: 2 b: 1 c: 1 a: 1 c: 2 a: 1 Projecting an FPTree • A simpler. the item i is noted as a preﬁx that is to be added in deeper levels of the recursion. Christian Borgelt Frequent Pattern Mining . These pointers allow to insert a new transaction topdown. Child pointers are not needed due to the processing scheme (to be discussed). each node possesses a pointer to its ﬁrst child and its right sibling. • The rightmost level of the original (unprojected) FPtree is removed (the item i is removed from the database). ◦ If the initial FPtree has been built from a lexicographically sorted transaction database. there are only parent pointers.
Reducing the Original FPTree d: 8 b: 7 c: 5 c: 1 b: 5 d: 8 c: 2 a: 2 e: 1 a: 1 a: 1 b: 2 c: 2 e: 1 b: 2 c: 2 d: 2 • The original FPtree is reduced by removing the rightmost level. but an item on a level further to the right is frequent. • Implemented by lefttoright levelwise merging of nodes with same parents.or Bonsai pruning. Christian Borgelt Frequent Pattern Mining . with the following advantages: ◦ No need for α. Rather all subsets of the set of items in the chain are formed and reported. This makes it possible to change the item order. • More interesting case: An item corresponding to a middle level is infrequent. since the items can be reordered so that all conditionally frequent items appear on the left. there are also disadvantages: ◦ Either the FPtree has to be traversed twice or pair frequencies have to be determined to reorder the items according to their conditional frequency. the item and the FPtree level are removed without projection. However. • Rebuilding the FPtree: An FPtree may be projected by extracting the (reduced) transactions described by the paths to the root and inserting them into a new FPtree (see above). 147 Christian Borgelt Frequent Pattern Mining 148 • Socalled αpruning or Bonsai pruning of a (projected) FPtree. no projections are computed anymore. Example FPTree with an infrequent item on a middle level: a: 6 a: 6 b: 1 b: 1 c: 4 c: 1 c: 3 d: 3 d: 1 d: 2 a: 6 a: 6 b: 1 c: 4 c: 4 d: 3 d: 3 • Chains: FPgrowth: Implementation Issues If an FPtree has been reduced to a chain. • This yields the conditional database for item sets not containing the item corresponding to the rightmost level. because the perfect extensions can be moved to the left and are processed at the end with the chain optimization. d: 2 b: 1 Christian Borgelt Frequent Pattern Mining 145 Christian Borgelt FPgrowth: DivideandConquer c: 5 c: 1 a: 4 d: 8 b: 7 c: 5 c: 1 a: 2 c: 2 a: 1 a: 1 d: 8 b: 5 c: 2 a: 2 e: 1 a: 1 a: 1 b: 2 c: 2 e: 1 b: 2 c: 2 ↑ Conditional database with item e removed (second subproblem) ← Conditional database for preﬁx e (ﬁrst subproblem) Frequent Pattern Mining 146 a: 4 e: 3 d: 8 b: 7 a: 4 e: 3 d: 8 b: 7 c: 5 c: 1 a: 4 b: 5 e: 1 d: 8 b: 5 e: 1 d: 8 c: 2 a: 2 a: 1 a: 1 b: 2 b: 1 c: 2 a: 1 c: 1 c: 1 a: 1 Pruning a Projected FPTree • Trivial case: If the item corresponding to the rightmost level is infrequent. ◦ No need for perfect extension pruning.
but is highly eﬃcient.) • Solution: The nodes are allocated in one large array per FPtree.net/fpgrowth. Allocating these through the standard memory management is wasteful.) Christian Borgelt Frequent Pattern Mining 149 FPgrowth: Implementation Issues • An FPtree can be implemented with only two integer arrays [Rasz 2004]: ◦ one array contains the transaction counters (support values) and ◦ one array contains the parent pointers (as the indices of array elements). Disadvantages • More diﬃcult to implement than other approaches. Advantages • Often the fastest algorithm or among the fastest algorithms.or Bonsai pruning becomes more complex. • Each FPtree node has a constant size of 16 bytes (2 pointers. • Recursive processing of the conditional database. complex data structure. • However. There is no allocation and deallocation of individual nodes. after some delay. 2 integers).html Experimental Comparison Christian Borgelt Frequent Pattern Mining 151 Christian Borgelt Frequent Pattern Mining 152 .borgelt. (This may waste some memory. the column. ◦ First the row is addressed and then. ◦ Main memory is organized as a “table” with rows and columns. • An FPtree is projected to obtain a conditional database. it has the advantage that no child and sibling pointers are needed and the transactions can be inserted in lexicographic order. there are also disadvantages: ◦ Programming projection and α. • An FPtree can need more memory than a list or array of transactions. Software • http://www. • However. ◦ Accesses to diﬀerent columns in the same row can skip the row addressing. (Allocating many small memory objects is highly ineﬃcient. because less structure is available. each FPtree resides in a single memory block. This reduces the memory requirements to 8 bytes per node.FPgrowth: Implementation Issues • The initial FPtree is built from an arraybased main memory representation of the transaction database (eliminates the need for child pointers). ◦ Reordering the items is virtually ruled out. • This has the disadvantage that the memory savings often resulting from an FPtree representation cannot be fully exploited. • Such a memory structure has advantages due the way in which modern processors access the main memory: Linear memory accesses are faster than random accesses. • As a consequence. Christian Borgelt Frequent Pattern Mining 150 Summary FPGrowth Basic Processing Scheme • Transaction database is represented as a frequent pattern tree.
59602 transactions average transaction size: ≈ 2.borgelt. 100000 transactions average transaction size: ≈ 10.html http://www.1. density: ≈ 0.html 0 1 1000 1200 1400 1600 1800 2000 0 5 10 15 20 25 30 35 40 45 50 webview1 apriori eclat fpgrowth relim sam 1 census apriori eclat fpgrowth relim sam 1 0 0 0 10 20 30 40 50 60 70 80 90 100 33 34 35 36 37 38 39 40 • The test system was an IBM/Lenovo X60s laptop (Intel Centrino Duo L2400.005 The density of a transaction database is the average fraction of all items occurring per transaction: density = average transaction size / number of items Christian Borgelt Frequent Pattern Mining 153 Christian Borgelt Frequent Pattern Mining 154 Experiments: Programs and Test System • All programs are my own implementations. Christian Borgelt Frequent Pattern Mining 155 Christian Borgelt Frequent Pattern Mining 156 .Experiments: Data Sets • Chess A data set listing chess end game positions for king vs. 1. Therefore diﬀerences in speed can only be the eﬀect of the processing schemes.net/eclat.html ◦ ◦ ◦ ◦ ◦ Apriori Eclat FPGrowth RElim SaM http://www.html http://www. Linux 10. programs were compiled with gcc 4.html http://www.67 GHz. Decimal logarithm of execution time in seconds over absolute minimum support. density: ≈ 0. This data set is part of the UCI machine learning repository.S.net/fpm.net/fpgrowth.borgelt.borgelt.5 • Census A data set derived from an extract of the US census bureau data of 1994.net/relim. density: ≈ 0. 75 items.net/sam.2. density: ≈ 0.E.012 • BMSWebview1 A web click stream from a legcare company that no longer exists. 870 items. 0 Experiments: Execution Times chess apriori eclat fpgrowth relim sam T10I4D100K apriori eclat fpgrowth relim sam relim h 2 1 1 • These programs and their source code can be found on my web site: http://www.5. Experiments: Data Sets • T10I4D100K An artiﬁcial data set generated with IBM’s data generator. It has been used in the KDD cup 2000 and is a popular benchmark.borgelt.net/apriori.3. 135 items. king and rook.html http://www. 497 items.borgelt.borgelt.1. All use the same code for reading the transaction database and for writing the found frequent item sets. The name is formed from the parameters given to the generator (for example: 100K = 100000 transactions). 48842 transactions average transaction size: 14. 1 GB main memory) running S. This data set is part of the UCI machine learning repository. which was preprocessed by discretizing numeric attributes.u. 3196 transactions average transaction size: 37.1 The density of a transaction database is the average fraction of all items occurring per transaction: density = average transaction size / number of items.
then a is also a perfect extension of any item set J ⊇ I (as long as a ∈ J). Reducing the Output: Closed and Maximal Item Sets • Since with this deﬁnition we know that ∀smin : ∀I ∈ FT (smin) : I ∈ MT (smin) ∨ ∃J ⊃ I : sT (J) ≥ smin it follows (can easily be proven by successively extending the item set I) ∀smin : ∀I ∈ FT (smin) : ∃J ∈ MT (smin) : I ⊆ J. Christian Borgelt Frequent Pattern Mining 157 Experiments: Perfect Extension Pruning chess apriori eclat fpgrowth w/o pep T10I4D100K apriori eclat fpgrowth w/o pep 2 1 0 1 1000 1 0 1200 1400 1600 1800 2000 0 5 10 15 20 25 30 35 40 45 50 webview1 apriori eclat fpgrowth w/o pep 1 census apriori eclat fpgrowth w/o pep 1 0 0 0 10 20 30 40 50 60 70 80 90 100 33 34 35 36 37 38 39 40 Decimal logarithm of execution time in seconds over absolute minimum support. • Given an item set I. Christian Borgelt Frequent Pattern Mining 158 Maximal Item Sets • Consider the set of maximal (frequent) item sets: MT (smin) = {I ⊆ B  sT (I) ≥ smin ∧ ∀J ⊃ I : sT (J) < smin}. • Once identiﬁed. but are only used to generate all supersets of the preﬁx having the same support. then all sets I ∪ J with J ∈ 2X (where 2X denotes the power set of X) are also frequent and have the same support as I. an item a ∈ I is called a perfect extension of I. / iﬀ I and I ∪ {a} have the same support (all transactions containing I contain a).Reminder: Perfect Extensions • The search can be improved with socalled perfect extension pruning. That is: Every frequent item set has a maximal superset. • Perfect extensions have the following properties: ◦ If the item a is a perfect extension of an item set I. That is: An item set is maximal if it is frequent. perfect extension items are no longer processed in the recursion. in a third element of a subproblem description: S = (D. X). • This can be exploited by collecting perfect extension items in the recursion. but none of its proper supersets is frequent. / ◦ If I is a frequent item set and X is the set of all perfect extensions of I. • Therefore: ∀smin : FT (smin) = I∈MT (smin) 2I Christian Borgelt Frequent Pattern Mining 159 Christian Borgelt Frequent Pattern Mining 160 . P.
That is. c. c}: {c. e}: 3 6 {a. y ∈ R with neither x ≤ y nor y ≤ x. e} 4: {a. c}. e} 9: {b. an item set cannot have a lower support than any of its supersets. c. ab ac ad ae bc bd be This relation follows immediately from ∀I : ∀J ⊇ I : sT (I) ≥ sT (J). c. e} {a. c. c. d. d}: {a. d}: 3 5 {a. c. {a. e} {a. c. e} 6: {a. e}: {b. e} a b c d e ∀smin : ∀I ∈ FT (smin) − MT (smin) : cd ce de sT (I) ≥ J∈MT (smin). that is. c. e} • The maximal item sets are: {b. c. e}: 4 3 4 4 4 • Every frequent item set is a subset of at least one of these sets. • About the support of a nonmaximal frequent item set we only know: transaction database 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: {a. e} {b. d. no superset of a maximal (frequent) item set is frequent. e} {b. d. • Maximal elements need not be unique. Christian Borgelt Frequent Pattern Mining 161 Christian Borgelt Frequent Pattern Mining 162 Hasse Diagram and Maximal Item Sets Hasse diagram with maximal item sets (smin = 3): Limits of Maximal Item Sets • The set of maximal item sets captures the set of all frequent item sets. d}. d. d. e} 10: {a. c} {a. e}: 3 items 4 {a. c. d} {b. e} {a. d. d. which also preserves knowledge of all support values? Christian Borgelt Frequent Pattern Mining 163 Christian Borgelt Frequent Pattern Mining 164 . ≤). white boxes infrequent item sets. d} 3: {a.J⊃I max sT (J). • Inﬁnite partially ordered sets need not possess a maximal/minimal element. {a. • The notions minimal and minimal element are deﬁned analogously. c.Mathematical Excursion: Maximal Elements • Let R be a subset of a partially ordered set (S. e}. e} 5: {a. c. frequent item sets 0 items 1 item ∅: 10 {a}: 7 {b}: 3 {c}: 7 {d}: 6 {e}: 7 2 items {a. {a. e} {a. An element x ∈ R is called maximal or a maximal element of R if ∀y ∈ R : x ≤ y ⇒ x = y. • Note that we have generally ∀smin : ∀I ∈ FT (smin) : sT (I) ≥ J∈MT (smin). d. because there may be elements x. e}: {d. Maximal Item Sets: Example transaction database 1: {a. d}: {c. d} 7: {b. e}. c}: {a. c.J⊇I abc abd abe acd ace ade bcd bce bde cde max sT (J). abcd abce abde acde bcde abcde • Question: Can we ﬁnd a subset of the set of all frequent item sets. c. e} 2: {b. c. d. d} {a. Red boxes are maximal item sets. ⊆): The maximal (frequent) item sets are the maximal elements of FT (smin): MT (smin) = {I ∈ FT (smin)  ∀J ∈ FT (smin) : I ⊆ J ⇒ I = J}. • Here we consider the set FT (smin) as a subset of the partially ordered set (2B . c. c} 8: {a. but then we know at most the support of the maximal item sets exactly. d.
no superset of k∈KT (I) tk has the cover KT (I). (Proof: see the considerations on the next slide) • The set of all closed item sets preserves knowledge of all support values: ∀smin : ∀I ∈ FT (smin) : • Note that the weaker statement ∀smin : ∀I ∈ FT (smin) : sT (I) ≥ J∈CT (smin). . max sT (J) That is: Every frequent item set has a closed superset. it follows (can easily be proven by successively extending the item set I) ∀smin : ∀I ∈ FT (smin) : ∃J ∈ CT (smin) : I ⊆ J. n}  I ⊆ tk } is the cover of I w. which satisﬁes the following conditions ∀X.t.r. Christian Borgelt Frequent Pattern Mining 165 Christian Borgelt Frequent Pattern Mining 166 Closed Item Sets • Alternative characterization of closed item sets: I is closed ⇔ sT (I) ≥ smin ∧ I= k∈KT (I) Mathematical Excursion: Closure Operators • A closure operator on a set S is a function cl : 2S → 2S . Y ⊆ S: ◦ X ⊆ cl (X) ◦ X ⊆ Y ⇒ cl (X) ⊆ cl (Y ) ◦ cl (cl (X)) = cl (X) (cl is extensive) (cl is increasing or monotone) (cl is idempotent) tk .J⊇I max sT (J). • This is derived as follows: since ∀k ∈ KT (I) : I ⊆ tk . since k∈KT (I) tk has the same support. .J⊇I sT (I) = J∈CT (smin). • Note that the above characterization allows us to construct for any item set the (uniquely determined) closed superset that has the same support. • Therefore: ∀smin : FT (smin) = I∈CT (smin) 2I follows immediately from ∀I : ∀J ⊇ I : sT (I) ≥ sT (J). T . it is obvious that ∀smin : ∀I ∈ FT (smin) : I⊆ k∈KT (I) • A set R ⊆ S is called closed if it is equal to its closure: R is closed ⇔ R = cl (R). restricted to the set of frequent item sets: CT (smin) = {I ∈ FT (smin)  I = cl (I)} 167 Christian Borgelt Frequent Pattern Mining 168 Christian Borgelt Frequent Pattern Mining . That is: An item set is closed if it is frequent. If I ⊂ k∈KT (I) tk . but it has a closed superset with the same support: ∀smin : ∀I ∈ FT (smin) : ∃J ⊇ I : J ∈ CT (smin) ∧ sT (J) = sT (I). not only has every frequent item set a closed superset. • The closed (frequent) item sets are induced by the closure operator cl (I) = k∈KT (I) tk . • Since with this deﬁnition we know that ∀smin : ∀I ∈ FT (smin) : I ∈ CT (smin) ∨ ∃J ⊃ I : sT (J) = sT (I) Closed Item Sets • However. an item set cannot have a lower support than any of its supersets.Closed Item Sets • Consider the set of closed (frequent) item sets: CT (smin) = {I ⊆ B  sT (I) ≥ smin ∧ ∀J ⊃ I : sT (J) < sT (I)}. but none of its proper supersets has the same support. . Reminder: KT (I) = {k ∈ {1. tk . that is. . On the other hand. it is not closed.
=true • In a monotone Galois connection. ◦ Since (f1. in an antimonotone Galois connection. Christian Borgelt Frequent Pattern Mining 171 Christian Borgelt Mathematical Excursion: Galois Connections (ii) ∀A ⊆ U : f2(f1(f2(f1(A)))) = f2(f1(A)) (a closure operator is idempotent): ◦ Since both f1 ◦ f2 and f2 ◦ f1 are extensive (see above). ◦ If f1 and f2 are both monotone. we know ∀A ⊆ V : ∀B ⊆ V : A ⊆ f2(f1(A)) ⊆ f2(f1(f2(f1(A)))) B ⊆ f1(f2(B)) ⊆ f1 (f2(f1(f2(B)))) ◦ Choosing B = f1(A′) with A′ ⊆ U . that is. =true 170 A ⊆ f2(B) ⇔ B ⊆ f1(A).Mathematical Excursion: Galois Connections • Let (X. • A function pair (f1. ◦ Since (f1. Y ) be two partially ordered sets. we have ∀A1. f2) is a Galois connection. Y ) = (2V . A2 ∈ X : ◦ ∀B1. f1(A). f2) with f1 : X → Y and f2 : Y → X is called a (monotone) Galois connection iﬀ ◦ ∀A1. A2 ⊆ U : f1(A1) ⊇ f1(A2) ⇒ ∀A1. f1(A). =true Frequent Pattern Mining (see above) 172 . we know ∀A ⊆ U : ∀B ⊆ V : A ⊆ f2(B) ⇔ B ⊆ f1(A). B2 ∈ Y : ◦ ∀A ∈ X : ∀B ∈ Y : A1 B1 A A2 B2 f2(B) ⇒ ⇒ ⇔ f1(A1 ) f2(B1) B f1(A2 ). we have ∀A1. f2(B2). we know ∀A ⊆ U : ∀B ⊆ V : ◦ Choose B = f1(A): ∀A ⊆ U : ◦ Choose A = f2(B): ∀B ⊆ V : f2(B) ⊆ f2(B) ⇔ B ⊆ f1(f2(B)). f2(B2). we obtain ∀A′ ⊆ U : f2(f1(f2(f1(A′)))) ⊆ f2(f1(A′)) ⇔ f1(A′) ⊆ f1(f2(f1(f2(f1 (A′))))) . f2) is a Galois connection. A2 ⊆ U : A1 ⊆ A2 ⇒ ∀A1. • Then the combination f1 ◦ f2 : X → X of the functions of a Galois connection is a closure operator (as well as the combination f2 ◦ f1 : Y → Y ). B2 ∈ Y : ◦ ∀A ∈ X : ∀B ∈ Y : A1 B1 A A2 B2 f2(B) ⇒ ⇒ ⇔ f1(A1 ) f2(B1) B f1(A2 ). A2 ∈ X : ◦ ∀B1. A2 ⊆ U : f2(f1(A1 )) ⊆ f2(f1(A2)). and let the partial orders be the subset relations on these power sets. ⊆) and (Y. ◦ If f1 and f2 are both antimonotone. f2) with f1 : X → Y and f2 : Y → X is called an antimonotone Galois connection iﬀ ◦ ∀A1. A2 ⊆ U : A1 ⊆ A2 ⇒ f2(f1 (A1)) ⊆ f2(f1(A2 )) (a closure operator is increasing or monotone): ◦ This property follows immediately from the fact that the functions f1 and f2 are both (anti)monotone. A2 ⊆ U : A1 ⊆ A2 ⇒ ∀A1. ◦ Choosing A = f2(f1(f2(f1(A′)))) and B = f1(A′). we obtain ∀A′ ⊆ U : f1(A′) ⊆ f1(f2(f1(f2(f1(A′))))). Christian Borgelt Frequent Pattern Mining 169 Christian Borgelt Frequent Pattern Mining Mathematical Excursion: Galois Connections (ii) ∀A1. respectively. ⊆). let (X. (i) ∀A ⊆ U : A ⊆ f2 (f1(A)) (a closure operator is extensive): • A function pair (f1. A2 ⊆ U : f2(f1(A1 )) ⊆ f2(f1(A2)). Mathematical Excursion: Galois Connections • Let the two sets X and Y be power sets of some sets U and V . X ) and (Y. A ⊆ f2(f1(A)) ⇔ f1(A) ⊆ f1(A) . A2 ⊆ U : f1(A1) ⊆ f1(A2) ⇒ ∀A1. both f1 and f2 are antimonotone. X ) = (2U . both f1 and f2 are monotone.
e} 9: {b.n} → 2{1. = f2(J2). I → KT (I) = {k ∈ {1.) • Therefore ﬁnding closed item sets with a given minimum support is equivalent to ﬁnding closed sets of transaction identiﬁers of a given minimum size. e} a b c d e ab ac ad ae bc bd be cd ce de abc abd abe acd ace ade bcd bce bde cde • All frequent item sets are closed with the exception of {b} and {d. d. d} {b. e} {a. n}  J = f1(f2(J)) = KT ( j∈J tj )}.. e} 2: {b. that is. c. c} {a. 173 Christian Borgelt Frequent Pattern Mining 174 • As a consequence f1 ◦ f2 : 2B → 2B . d} 3: {a. ˆ Red boxes are closed item sets. J → KT ( j∈J tj ) is also a closure operator.. e} 10: {a. . c.. • The function pair (f1. there exists a 1to1 relationship between these two sets. e} 5: {a. c.Galois Connections in Frequent Item Set Mining • Consider the partially order sets (2B . white boxes infrequent item sets. c. c.. . Christian Borgelt Frequent Pattern Mining Closed Item Sets: Example transaction database 1: {a. d. c. (This follows immediately from the facts that the Galois connection describes closure operators and that a closure operator is idempotent.. I2 ∈ 2B : I1 ⊆ I2 J1 ⊆ J2 ⇒ f1(I1) = KT (I1) k∈J1 tk ⊇ KT (I2 ) ⊇ k∈J2 tk = f1(I2). J → j∈J tj = {i ∈ B  ∀j ∈ J : i ∈ tj }. d. ... e} 6: {a. d... e}. e} 4: {a... d} {a.. . ˆ abcde Christian Borgelt Frequent Pattern Mining 175 Christian Borgelt Frequent Pattern Mining 176 . e}... e} {b. .. c}: {a. c. I → k∈KT (I) tk is a closure operator.n}. to the sets CB = {I ⊆ B  I = f2(f1(I)) = k∈KT (I) tk } and CT = {J ⊆ {1. both have a support of 3 = 30%. • Furthermore. e}: {b. c}: {c. e}: 3 6 {a. d}: {a. d}: 3 5 {a. f2) is an antimonotone Galois connection: ◦ ∀I1. . . e} {a. d.. c}. if we restrict our considerations to the respective sets of closed sets in both domains.. and f2 : 2{1. Let f1 : 2B → 2{1. .. ⊆)... c.n}. e}: 3 items 4 {a. d.n}. d..n} : ⇒ f2(J1) = ◦ ∀I ∈ 2B : ∀J ∈ 2{1. d. d} 7: {b.. ⊆) and (2{1.n} : I ⊆ f2(J) = j∈J tj ⇔ J ⊆ f1(I) = KT (I). c} 8: {a. e} {a. e}: {d. c.. abcd abce abde acde bcde {d. c. J2 ∈ 2{1. e}: 4 3 4 4 4 Hasse diagram and Closed Item Sets Hasse diagram with closed item sets (smin = 3): transaction database 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: {a... c... d. which is described by the Galois connection: ′−1 ′ ′ f1 = f1 CB is a bijection with f1 = f2 = f2CT . e} {a. e} is a subset of {a. • {b} is a subset of {b. c.n} → 2B . c. c. n}  I ⊆ tk } Galois Connections in Frequent Item Set Mining • Likewise f2 ◦ f1 : 2{1. d. e} {b. d}: {c. e} frequent item sets 0 items 1 item ∅: 10 {a}: 7 {b}: 3 {c}: 7 {d}: 6 {e}: 7 2 items {a. ◦ ∀J1. both have a support of 4 = 40%..
• Once identiﬁed. closed item sets do not possess any perfect extensions. c}: {c. d}: 3 5 {a. d}: {a. c. e} 5: {a. d. e} • c is a perfect extension of {b} frequent item sets 0 items 1 item ∅: 10 {a}: 7 {b}: 3 {c}: 7 {d}: 6 {e}: 7 2 items {a. c. d} 7: {b. e}: {d. c. c. • Nonclosed item sets possess at least one perfect extension. • Given an item set I. item base item base maximal (frequent) item sets closed (frequent) item sets • The set of closed item sets is the union of the sets of maximal item sets for all minimum support values at least as large as smin: CT (smin) = s∈{smin.. / ◦ If I is a frequent item set and X is the set of all perfect extensions of I. ◦ All maximal item sets are closed. e}: {b. c. e} as {d. c}: {a. d. c. e}: 3 6 {a. d}: {c. e} 2: {b.. c. • This can be exploited by collecting perfect extension items in the recursion.n−1. c. / iﬀ I and I ∪ {a} have the same support (all transactions containing I contain a). d.. c} 8: {a. X). e}: 4 3 4 4 4 as {b} and {b. e} 9: {b. e} 4: {a. e} and {a. c} both have support 3. perfect extension items are no longer processed in the recursion. but are only used to generate all supersets of the preﬁx having the same support. d. e} 10: {a.n} Christian Borgelt Frequent Pattern Mining 179 MT (s) Christian Borgelt Frequent Pattern Mining 180 . P.smin+1. then all sets I ∪ J with J ∈ 2X (where 2X denotes the power set of X) are also frequent and have the same support as I. • a is a perfect extension of {d. Christian Borgelt Frequent Pattern Mining 177 Closed Item Sets and Perfect Extensions transaction database 1: {a. e} both have support 4. an item a ∈ I is called a perfect extension of I.. d. in a third element of a subproblem description: S = (D. Christian Borgelt Frequent Pattern Mining 178 Relation of Maximal and Closed Item Sets empty set empty set Types of Frequent Item Sets: Summary • Frequent Item Set Any frequent item set (support is higher than the minimal support): I frequent ⇔ sT (I) ≥ smin • Closed (Frequent) Item Set A frequent item set is called closed if no superset has the same support: I closed ⇔ sT (I) ≥ smin ∧ ∀J ⊃ I : sT (J) < sT (I) • Maximal (Frequent) Item Set A frequent item set is called maximal if no superset is frequent: I maximal ⇔ sT (I) ≥ smin ∧ ∀J ⊃ I : sT (J) < smin • Obvious relations between these types of item sets: ◦ All maximal item sets and all closed item sets are frequent. then a is also a perfect extension of any item set J ⊇ I (as long as a ∈ J). d} 3: {a. d. • Perfect extensions have the following properties: ◦ If the item a is a perfect extension of an item set I.Reminder: Perfect Extensions • The search can be improved with socalled perfect extension pruning. e}: 3 items 4 {a. e} 6: {a.
1 The density of a transaction database is the average fraction of all items occurring per transaction: density = average transaction size / number of items. e}+: {b. d}+: {a. • Maximal (Frequent) Item Set (marked with ∗) A frequent item set is called maximal if no superset is frequent.005 The density of a transaction database is the average fraction of all items occurring per transaction: density = average transaction size / number of items Types of Frequent Item Sets: Experiments chess frequent closed maximal T10I4D100K frequent closed maximal 7 6 5 4 6 5 4 1000 7 6 5 1200 1400 1600 1800 2000 8 7 6 5 4 0 5 10 15 20 25 30 35 40 45 50 webview1 frequent closed maximal census frequent closed maximal 0 10 20 30 40 50 60 70 80 90 100 33 34 35 36 37 38 39 40 Decimal logarithm of the number of item sets over absolute minimum support. 75 items. The name is formed from the parameters given to the generator (for example: 100K = 100000 transactions). Christian Borgelt Frequent Pattern Mining 181 Christian Borgelt Frequent Pattern Mining 182 Experiments: Data Sets (Reminder) • T10I4D100K An artiﬁcial data set generated with IBM’s data generator. d}+: {c. density: ≈ 0. king and rook. d}+∗: 3 5 {a. density: ≈ 0. e}+: {d. d. It has been used in the KDD cup 2000 and is a popular benchmark. density: ≈ 0. 0 items 1 item ∅+: 10 {a}+: {b}: {c}+: {d}+: {e}+: 7 3 7 6 7 2 items {a. e}+∗ : 3 6 {a. which was preprocessed by discretizing numeric attributes. This data set is part of the UCI machine learning repository. Christian Borgelt Frequent Pattern Mining 183 Christian Borgelt Frequent Pattern Mining 184 . 3196 transactions average transaction size: 37. 497 items. c. • Closed (Frequent) Item Set (marked with +) A frequent item set is called closed if no superset has the same support. density: ≈ 0. 100000 transactions average transaction size: ≈ 10. 870 items. c}+∗: {c. c}+: {a.1.Types of Frequent Item Sets: Summary Experiments: Data Sets (Reminder) • Chess A data set listing chess end game positions for king vs.5.012 • BMSWebview1 A web click stream from a legcare company that no longer exists. e}+∗: 4 3 4 4 4 • Frequent Item Set Any frequent item set (support is higher than the minimal support). c. This data set is part of the UCI machine learning repository. 135 items. e}: 3 items 4 {a. 59602 transactions average transaction size: ≈ 2. 48842 transactions average transaction size: 14.5 • Census A data set derived from an extract of the US census bureau data of 1994.
borgelt. • Special cases in which they are competitive are domains with few transactions and very many items. It is the preﬁx of the conditional database for this node. • The characterization of closed item sets by I closed ⇔ sT (I) ≥ smin ∧ I= k∈KT (I) Filtering Frequent Item Sets • If only closed item sets or only maximal item sets are to be found with item set enumeration approaches.Reminder: Perfect Extension Pruning chess apriori eclat fpgrowth w/o pep T10I4D100K apriori eclat fpgrowth w/o pep 2 1 0 1 1000 1 0 1200 1400 1600 1800 2000 0 5 10 15 20 25 30 35 40 45 50 webview1 apriori eclat fpgrowth w/o pep Searching for Closed and Maximal Item Sets 1 census apriori eclat fpgrowth w/o pep 1 0 0 0 10 20 30 40 50 60 70 80 90 100 33 34 35 36 37 38 39 40 Decimal logarithm of execution time in seconds over absolute minimum support.net/carpenter.html ◦ E = {i ∈ B −H  ∃h ∈ H : h > i} is the set of eliminated items. Christian Borgelt Frequent Pattern Mining 185 Christian Borgelt Frequent Pattern Mining 186 Searching for Closed Frequent Item Sets • We know that it suﬃces to ﬁnd the closed item sets together with their support. ◦ The tail L ⊆ B of a search tree node is the set of items that are frequent in its conditional database. They are the possible extensions of H. • Note that the items in the tail and their support in the conditional database are known. tk suggests to ﬁnd them by forming all possible intersections of the transactions (with at least smin transactions) and checking their support. • Some useful notions for ﬁltering and pruning: ◦ The head H ⊆ B of a search tree node is the set of items on the path leading to it.borgelt.net/ista.html http://www. Christian Borgelt Frequent Pattern Mining 187 Christian Borgelt Frequent Pattern Mining 188 . approaches using this idea are rarely competitive with other methods. An example of such a domain is gene expression analysis. ◦ Note that ∀h ∈ H : ∀l ∈ L : h < l. • However. at least after the search returns from the recursive processing. the found frequent item sets have to be ﬁltered. • Implementations of intersection approaches can be found here: http://www. These items are not considered anymore in the corresponding subtree.
If either is the case. the head is not a closed item set. Christian Borgelt Frequent Pattern Mining . • For the encircled search tree nodes we have: red: head H = {b}. H is not maximal. tail L = {c}. as their support values are available from the conditional transaction databases. ◦ If an item in the tail of a search tree node has the same support as the head. Tail and Eliminated Items d e d c cd ce de ab ac ad ae bc bd be b c c d d d cde abc abd abe acd ace ade bcd bce bde c d d d abcd abce abde acde bcde d abcde A (full) preﬁx tree for the ﬁve items a. • As a consequence. • However. its head is not necessarily a maximal item set. Note that with the latter condition. eliminated items E = {a} green: head H = {a. all item set enumeration approaches for closed and maximal item sets check the deﬁning condition for the tail items. • Maximal Item Sets: Check whether ∃a ∈ E : sT (H ∪ {a}) ≥ smin. which can still render the head nonclosed or nonmaximal. • It can depend on the database structure used whether a check of the deﬁning condition is eﬃcient for the eliminated items or not. otherwise it is. tail L = {d. but rely on a repository of already found closed or maximal item sets. eliminated items E = {b} • The problem are the eliminated items. e. the head is not necessarily a closed item set. It can be concluded that H is closed as soon as the intersection becomes empty. • With such a repository it can be checked in an indirect way whether an item set is closed or maximal. H is not closed. the inverse implications need not hold: ◦ If the tail of a search tree node is empty. • However. a a b b c Closed and Maximal Item Sets • When ﬁltering frequent item sets for closed and maximal item sets the following conditions are easy and eﬃcient to check: ◦ If the tail of a search tree node is not empty. 191 Christian Borgelt Frequent Pattern Mining 192 ∃a ∈ E : KT (H) ⊆ KT (a) (tk − H) = ∅. b. • The blue boxes are the frequent item sets. checking the deﬁning condition can be diﬃcult for the eliminated items. e}. some item set enumeration algorithms do not check the deﬁning condition for the eliminated items. otherwise it is. c. • As a consequence. since additional data (beyond the conditional transaction database) is needed to determine their occurrences in the transactions or their support values. d. If this is the case. c}.Head. ◦ If no item in the tail of a search tree node has the same support as the head. the intersection can be computed transaction by transaction. Christian Borgelt Frequent Pattern Mining 189 Christian Borgelt Frequent Pattern Mining 190 Closed and Maximal Item Sets Check the Deﬁning Condition Directly: • Closed Item Sets: Check whether or check whether k∈KT (H) Closed and Maximal Item Sets • Checking the deﬁning condition directly is trivial for the tail items. its head is not a maximal item set.
c. • All item sets explored from the search tree node with head H and tail L are subsets of H ∪ L (because only the items in L are conditionally frequent). c. ◦ Item sets containing eliminated items are considered only in search tree branches to the left of the considered node. d. e}. because the recursion cannot yield any closed or maximal item sets. the head H need not be processed. (Preferred data structure for the repository: preﬁx tree) • It is checked whether a superset of the head H with the same support has already been found. e} have not been processed then. e}. the item i is a perfect extension of all item sets explored from the search tree node with head H and tail L. Christian Borgelt Frequent Pattern Mining 195 Checking the Eliminated Items: Repository • It is usually advantageous to use not just a single. c. However. • A popular structure for the repository is an FPtree. d. Christian Borgelt Frequent Pattern Mining 196 . {a. • With conditional repositories the check for a known superset reduces to the check whether the conditional repository contains an item set with the next split item and the same support as the current head. (Note that the check is executed before going into recursion. but to create conditional repositories for each recursive call. because the maximal item sets {a. {c. the child node is pruned. e. / it is a perfect extension of all supersets J ⊇ H with i ∈ J. • The item i is an eliminated item. ◦ We need the repository to check for possibly existing closed or maximal supersets that contain one or more eliminated item(s). the set I would not be in the repository already. i ∈ L (item i is not in the tail). • The reason is that a found proper superset I ⊃ H with sT (I) = sT (H) contains at least one item i ∈ I − H that is a perfect extension of H.) • The conditional repositories are obtained by basically the same operation as the conditional transaction databases (projecting/conditioning on the split item). {a. If yes. e} it could be determined with the help of a repository that they are not maximal. because it allows for simple and eﬃcient projection/conditioning. Christian Borgelt Frequent Pattern Mining 193 Checking the Eliminated Items: Repository d e d c cd ce de ab ac ad ae bc bd be b c c d d d cde abc abd abe acd ace ade bcd bce bde c d d d abcd abce abde acde bcde d abcde A (full) preﬁx tree for the ﬁve items a. the head H is neither closed nor maximal. that is. • For none of the frequent item sets {d.) • If the item i is a perfect extension of the head H. ◦ Therefore these branches must already have been processed in order to ensure that possible supersets have already been recorded. Christian Borgelt Frequent Pattern Mining 194 Checking the Eliminated Items: Repository • If a superset of the current head H with the same support has already been found. a simple preﬁx tree that is projected topdown may also be used. • Note that with a repository the depthﬁrst search has to proceed from left to right. which contain only the found closed item sets that contain H. before constructing the extended head of a child node. that is. d} and {c. d}.Checking the Eliminated Items: Repository • Each found maximal or closed item set is stored in a repository. • Even more: the head H need not be processed recursively. global repository. • Consequently. b. / (If i were in L. If the check ﬁnds a superset. and therefore none of them can be closed. because it cannot yield any maximal or closed item sets. Therefore the current subtree of the search tree can be pruned. a a b b c • Suppose the preﬁx tree would be traversed from right to left.
• Perfect extension / parent equivalence pruning can be applied for both closed and maximal item sets. • The union with {j ∈ XT (I ∪ {i})  j > i} represents perfect extension or parent equivalence pruning: all perfect extensions in the tail of I ∪ {i} are immediately added. Intuitively. (All items greater than i∗ can be removed without aﬀecting the support. • Analogously. the whole subtree can be pruned. since KT (I∗) ⊃ KT (I∗ ∪ {i∗}). Let i∗ ∈ I be the (uniquely determined) item satisfying sT ({i ∈ I  i < i∗}) > sT (I) and sT ({i ∈ I  i ≤ i∗}) = sT (I). if all transactions containing I also contain the item a). even more additional pruning of the search tree becomes possible. ⊆).Closed and Maximal Item Sets: Pruning • If only closed item sets or only maximal item sets are to be found. ◦ This pruning method requires a left to right traversal of the preﬁx tree. Christian Borgelt Frequent Pattern Mining 197 Christian Borgelt Frequent Pattern Mining 198 Alternative Description of Closed Item Set Mining • In order to avoid redundant search in the partially ordered set (2B . we may structure the set of closed item sets by assigning unique closed parent item sets. an item a ∈ I is called a perfect extension of I. Intuitively. [Uno et al. • Perfect Extension Pruning / Parent Equivalence Pruning (PEP) ◦ Given an item set I. 2003] Alternative Description of Closed Item Set Mining • Note that 1≤k≤n tk is the smallest closed item set for a given database T . • For the recursive search. the following formulation is useful: Let I ⊆ B be a closed item set. Head Union Tail Pruning • If only maximal item sets are to be found. we assigned a unique parent item set to each item set (except the empty set). Then the whole set XT (I) can be added to the preﬁx. Then the canonical parent pC (I) of I is the item set pC (I) = I∗ ∪ {i ∈ XT (I∗)  i > i∗}. The canonical children of I (that is. • General Idea: All frequent item sets in the subtree rooted at a node with head H and tail L are subsets of H ∪ L. Christian Borgelt Frequent Pattern Mining . the closed item sets that have I as their canonical parent) are the item sets J = I ∪ {i} ∪ {j ∈ XT (I ∪ {i})  j > i} with ∀j ∈ I : i > j and {j ∈ XT (I ∪ {i})  j < i} = XT (J) = ∅. additional pruning of the search tree becomes possible. ◦ As a consequence. since all maximal item sets are closed. the reduced item set I∗ is enhanced by all perfect extension items following the item i∗. because a perfect extension of I∗ ∪ {i∗} need not be a perfect extension of I∗. / iﬀ the item sets I and I ∪ {a} have the same support: sT (I) = sT (I ∪ {a}) (that is. • Frequent Head ∪ Tail Pruning (FHUT) ◦ If H ∪ L is not a subset of an already found maximal item set and by some clever means we discover that H ∪ L is frequent. the item i∗ is the greatest item in I that is not a perfect extension.) Let I∗ = {i ∈ I  i < i∗} and XT (I) = {i ∈ B − I  sT (I ∪ {i}) = sT (I)}. • Note also that the set {i ∈ XT (I∗ )  i > i∗} need not contain all items i > i∗. H ∪ L can immediately be recorded as a maximal item set. • Maximal Item Set Contains Head ∪ Tail Pruning (MFIHUT) ◦ If we ﬁnd out that H ∪ L is a subset of an already found maximal item set. / Hence a can be added directly to the preﬁx of the conditional database. • Let XT (I) = {a  a ∈ I ∧ sT (I ∪ {a}) = sT (I)} be the set of all perfect extension / items. to ﬁnd the canonical parent of the item set I. Then we know: ∀J ⊇ I : sT (J ∪ {a}) = sT (J). no superset J ⊇ I with a ∈ J can be closed. 199 Christian Borgelt Frequent Pattern Mining 200 • Let ≤ be an item order and let I be a closed item set with I = 1≤k≤n tk . • The condition {j ∈ XT (I ∪ {i})  j < i} = ∅ expresses that there must not be any perfect extensions among the eliminated items.
even if the items in J − I are independent of I. ˆ a∈I pT ({a}) ̺ii(I) = min an require a minimum value for this measure. can exceed the number of transactions in the database by far. Additional Frequent Item Set Filtering • General idea: Compare support to expectation.Additional Frequent Item Set Filtering • General problem of frequent item set mining: The number of frequent item sets. then all J ⊃ I are also likely to score high.) p • Assumes full independence of the items in order to form an expectation about the support of an item set. I1 ∩ I2 = ∅ and I1 ∪ I2 = I. this is not surprising: we expect this even if the occurrence of the items is independent. ◦ Additional ﬁltering should remove item sets with a support close to the support expected from an independent occurrence. even the number of closed or maximal item sets. the minimum ensures a low value. • Therefore: Additional ﬁltering is necessary to ﬁnd the ’‘relevant” or “interesting” frequent item sets. the item set I still receives a high evaluation.) p • Advantage: If I contains independent items. (ˆT is the probability estimate based on T . Christian Borgelt Frequent Pattern Mining 203 Christian Borgelt Frequent Pattern Mining 204 . I2  > 1. If there exist high scoring independent subsets I1 and I2 with I1 > 1. sT (I − {a}) · sT ({a}) ˆ ˆ a∈I pT (I − {a}) · pT ({a}) a∈I = pT (I) ˆ . an require a minimum value for this measure. ◦ Item sets consisting of items that appear frequently are likely to have a high support. ◦ However. • Advantage: Can be computed from only the support of the item set and the support values of the individual items. • Disadvantages: We need to know the support values of all subsets I − {a}. (ˆT is the probability estimate based on T . Christian Borgelt Frequent Pattern Mining 201 Christian Borgelt Frequent Pattern Mining 202 Additional Frequent Item Set Filtering Full Independence • Evaluate item sets with ̺ﬁ(I) = sT (I) · nI−1 a∈I sT ({a}) Additional Frequent Item Set Filtering Incremental Independence • Evaluate item sets with pT (I) ˆ n sT (I) = min . • Disadvantage: If some item set I scores high on this measure.
Christian Borgelt Frequent Pattern Mining 205 Christian Borgelt Frequent Pattern Mining 206 Biological Background Structure of a prototypical neuron terminal bouton synapsis Example Application: Finding Neuron Assemblies in Neural Spike Data cell core axon myelin sheath cell body (soma) dendrites Christian Borgelt Frequent Pattern Mining 207 Christian Borgelt Frequent Pattern Mining 208 . (ˆT is the probability estimate based on T . This captures subset independence “incrementally”. • Disadvantages: We need to know the support values of all proper subsets J.Additional Frequent Item Set Filtering Subset Independence • Evaluate item sets with ̺si(I) = n sT (I) pT (I) ˆ = min . ˆ ˆ J⊂I.J=∅ pT (I − J) · pT (J) min Summary Frequent Item Set Mining • With a canonical form of an item set the Hasse diagram can be turned into a much simpler preﬁx tree (→ divideandconquer scheme using conditional databases). • Improvement: Use incremental independence and in the minimum consider only items {a} for which I − {a} has been evaluated high.J=∅ sT (I − J) · sT (J) J⊂I. • Additional ﬁltering is necessary to reduce the size of the output.) p • Advantage: Detects all cases where a decomposition is possible and evaluates them with a low value. • Item set enumeration algorithms diﬀer in: ◦ the traversal order of the preﬁx tree: (breadthﬁrst/levelwise versus depthﬁrst traversal) ◦ the transaction representation: horizontal (item arrays) versus vertical (transaction lists) versus specialized data structures like FPtrees ◦ the types of frequent item sets found: frequent versus closed versus maximal item sets (additional pruning methods for closed and maximal item sets) • An alternative are transaction set enumeration or intersection algorithms. an require a minimum value for this measure.
Christian Borgelt Frequent Pattern Mining 209 Christian Borgelt Frequent Pattern Mining 210 Neuronal Action Potential Higher Level Neural Processing • The lowlevel mechanisms of neural information processing are fairly well understood (neurotransmitters. c Alvin M. could provide a method to test the temporal coincidence hypothesis (see below). • Up to fairly recently it was not possible to record the spikes of enough neurons in parallel to decide between the diﬀerent models.) • Decrease in potential diﬀerence: excitatory synapse Increase in potential diﬀerence: inhibitory synapse • If there is enough net excitatory input. are a topic of current research. Christian Borgelt Frequent Pattern Mining 211 Christian Borgelt Frequent Pattern Mining 212 . the axon is depolarized. A schematic view of an idealized action potential illustrates its various phases as the action potential passes a point on a cell membrane. properly adapted. called neurotransmitters. Actual recordings of action potentials are often distorted compared to the schematic view because of variations in electrophysiological techniques used to make the recording. action potential). • Currently methods are investigated by which it would be possible to check the validity of the diﬀerent coding models.) c Jacob Wilson • When the action potential reaches the terminal boutons. excitation and inhibition. however.wikipedia. (Speed depends on the degree to which the axon is covered with myelin. c en. There are several competing theories (see the following slides) how neurons code and transmit the information they process. • These act on the membrane of the receptor dendrite to change its polarization. (The inside is usually 70mV more negative than the outside. Burt • The resulting action potential travels along the axon. new measurement techniques open up the possibility to record dozens or even up to a hundred neurons in parallel. it triggers the release of neurotransmitters. • The highlevel mechanisms. • Frequent item set mining.Biological Background Biological Background (Very) simpliﬁed description of neural information processing • Axon terminal releases chemicals.org However.
Neuron 1 which was stimulated stronger reached the threshold earlier and initiated a spike sooner than neurons stimulated less. Temporal Coincidence Hypothesis [Gray et al. The stronger stimulus induces spikes earlier and will initiate spikes in the other. Note that this model integrates both the temporal coincidence and the delay coding principles. Eccles 1957.Models of Neuronal Coding Models of Neuronal Coding c Zolt´n N´dasdy a a c Zolt´n N´dasdy a a Frequency Code Hypothesis [Sherrington 1906. 1992. Barlow 1972] Neurons generate diﬀerent frequency of spike trains as a response to diﬀerent stimulus intensities. Spike sequences coincide with the local ﬁeld activity. The sequence of spike propagation is determined by the spatiotemporal conﬁguration of the stimulus as well as the intrinsic connectivity of the network. connected cells in the order of relative threshold and actual depolarization. 1994] Spike occurrences are modulated by local ﬁeld oscillation (gamma). Christian Borgelt Frequent Pattern Mining 215 Christian Borgelt Frequent Pattern Mining 216 . SpatioTemporal Code Hypothesis Neurons display a causal sequence of spikes in relationship to a stimulus conﬁguration. Christian Borgelt Frequent Pattern Mining 213 Christian Borgelt Frequent Pattern Mining 214 Models of Neuronal Coding Models of Neuronal Coding c Zolt´n N´dasdy a a c Zolt´n N´dasdy a a Delay Coding Hypothesis [Hopﬁeld 1995. Diﬀerent delays of the spikes (d2d4) represent relative intensities of the diﬀerent stimulus. Singer 1993. Buzs´ki and Chrobak 1995] a The input current is converted to the spike delay. Tighter coincidence of spikes recorded from diﬀerent neurons represent higher stimulus intensity.
• Dot displays of (simulated) parallel spike trains. but with relevant neurons collected at the bottom. • Without proper intelligent data analysis methods. data c Sonja Gr¨n. vertical: neurons (100) horizontal: time (10 seconds) • In one of these dot displays. time bins are formed. • Question: How can we ﬁnd out which neurons to group together? Christian Borgelt Frequent Pattern Mining 219 Christian Borgelt Frequent Pattern Mining 220 .Models of Neuronal Coding Finding Neuron Assemblies in Neuronal Spike Data c Zolt´n N´dasdy a a data c Sonja Gr¨n. RIKEN Brain Science Institute. Tokyo u • Each time bin gives rise to one transaction. It contains the set of neurons that ﬁre in this time bin (items). Christian Borgelt Frequent Pattern Mining 217 Christian Borgelt Frequent Pattern Mining 218 Finding Neuron Assemblies in Neural Spike Data Finding Neuron Assemblies in Neural Spike Data A Frequent Item Set Mining Approach • The neuronal spike trains are usually coded as pairs of a neuron id and a spike time. Tokyo u Markovian Process of Frequency Modulation [Seidermann et al. the synchronous ﬁring becomes easily visible. 20 neurons are ﬁring synchronously. left: copy of the right diagram of the previous slide right: same data. • Frequent item set mining.5400%/54. 105. • For the (simulated) example data set such an approach detects the neuron assembly perfectly: 80 54 88 28 93 83 39 29 50 24 40 30 32 11 82 69 22 60 5 4 (0. Diﬀerent stimulus conﬁgurations are represented by diﬀerent Markovian sequences across several seconds. it is virtually impossible to detect such synchronous ﬁring. • A synchronously ﬁring set of neurons is called a neuron assembly. sorted by the spike time.1679) • If the neurons that ﬁre together are grouped together. possibly restricted to maximal item sets. • In order to make frequent item set mining applicable. 1996] Stimulus intensities are converted to a sequence of frequency enhancements and decrements in the diﬀerent neurons. RIKEN Brain Science Institute. is then applied with additional ﬁltering of the frequent item sets.
however. can make it diﬃcult to observe the synchrony. Christian Borgelt Frequent Pattern Mining 223 Christian Borgelt Frequent Pattern Mining 224 . only a subset of the neurons (between 50 and 80%) participate (simulations show that this enough to propagate such synchronous activity). Such jitter. but on the right the neurons of the assembly are collected at the bottom. columns are time bins. However. Tokyo u Association Rules • Both diagrams show the same (simulated) data. together with binning the data. proper statistical tests have to be developed. but in a dot display. • Only about 80% of the neurons (randomly chosen) participate in each synchronous ﬁring. • In addition. rows are neurons. that a dot display is usually rotated by 90o: usually customers refer to rows. Christian Borgelt Frequent Pattern Mining 221 Christian Borgelt Frequent Pattern Mining 222 Finding Neuron Assemblies in Neural Spike Data data c Sonja Gr¨n. • In both cases the input can be represented as a binary matrix (the socalled dot display in spike train analysis). it is not to be expected that real world data will be so clean. since the spikes may end up in diﬀerent bins. Hence there is no frequent item set comprising all of them. it is to be expected that each time a neuron assembly is activated. • Note. products to columns. • In these cases postprocessing of the found item sets is necessary in order to collect all neurons of an assembly.Finding Neuron Assemblies in Neural Spike Data Translation of Basic Notions mathematical problem market basket analysis item item base — (transaction id) transaction frequent item set product set of all products customer set of products bought by a customer set of products frequently bought together spike train analysis neuron set of all neurons time bin set of neurons ﬁring in a time bin set of neurons frequently ﬁring together Finding Neuron Assemblies in Neural Spike Data Open Problems and Ongoing Work • The frequent item set mining approach with additional ﬁltering works perfectly if the neurons ﬁre in perfect synchrony. Rather there will be a considerable amount of temporal jitter. • In addition. • Rather a frequent item set mining approach ﬁnds a large number of frequent item sets with 12 to 16 neurons. RIKEN Brain Science Institute.
• Formally. • Construct rules and ﬁlter them w. Generating Association Rules • Which minimum support has to be used for ﬁnding the frequent item sets depends on the deﬁnition of the support of a rule: ◦ If ςT (X → Y ) = σT (X ∪ Y ). General Procedure: • Find the frequent item sets. Y : ∀a ∈ X : and therefore ∀X. ςmin and cmin. then σmin = ςmincmin or equivalently smin = ⌈nςmincmin⌉. s (X ∪ Y ) sT (X ∪ Y ) ≥ T sT (X) sT (X − {a}) That is: Moving an item from the antecedent to the consequent cannot increase the conﬁdence of a rule. • a real number • a real number Desired: • the set of all association rules. . Christian Borgelt Frequent Pattern Mining 226 ςmin. Properties of the Conﬁdence • From ∀I : ∀J ⊆ I : sT (I) ≤ sT (J) it obviously follows ∀X. the set R = {R : X → Y  ςT (R) ≥ ςmin ∧ cT (R) ≥ cmin}. thus forming rules X → Y . we consider rules of the form X → Y . then she/he will probably also buy cheese. . Y : ∀a ∈ X : cT (X → Y ) ≥ cT (X − {a} → Y ∪ {a}).r. am} of items. . Christian Borgelt Frequent Pattern Mining 227 Christian Borgelt Frequent Pattern Mining 228 . ◦ If ςT (X → Y ) = σT (X).r. cmin. Y : ∀a ∈ X : cT (X → Y ) < cmin → cT (X − {a} → Y ∪ {a}) < cmin. . . with X. ◦ Filtering rules w. . .t.t. that is. • Support of a Rule X → Y : Either: ςT (X → Y ) = σT (X ∪ Y ) (more common: rule is correct) Or: ςT (X → Y ) = σT (X) (more plausible: rule is applicable) • Conﬁdence of a Rule X → Y : σ (X ∪ Y ) sT (X ∪ Y ) s (I) cT (X → Y ) = T = = T σT (X) sT (X) sT (X) The conﬁdence can be seen as an estimate of P (Y  X). • As an immediate consequence we have ∀X.t. the minimum conﬁdence. ◦ Filtering rules w. for example: If a customer buys bread and wine.r. the minimum support. • a vector T = (t1. That is: If a rule fails to meet the minimum conﬁdence. the rule construction then traverses all frequent item sets I and splits them into disjoint subsets X and Y (X ∩ Y = ∅ and X ∪ Y = I). • After the frequent item sets have been found.Association Rules: Basic Notions • Often found patterns are expressed as association rules. 0 < cmin ≤ 1. conﬁdence is always necessary. Y ⊆ A and X ∩ Y = ∅. . tn) of transactions over A. support is only necessary if ςT (X → Y ) = σT (X). no rules over the same item set and with a larger consequent need to be considered. Christian Borgelt Frequent Pattern Mining 225 Association Rules: Formal Deﬁnition Given: • a set A = {a1. then σmin = ςmin or equivalently smin = ⌈nςmin⌉. 0 < ςmin ≤ 1.
c. Hm := i∈f {{i}}. Y = {a}.7% a→e 6 (60%) 7 (70%) 85. d} 7: {b. d}: 3 5 {a. (∗ add rule to the result ∗) else Hm := Hm − {h}. c. f2 ∈ Fk (∗ generate candidates with k + 1 items ∗) (∗ initialize the set of candidates ∗) (∗ traverse all pairs of frequent item sets ∗) (∗ (∗ (∗ (∗ (∗ (∗ (∗ — generate association rules ∗) initialize the set of rules ∗) traverse the frequent item sets ∗) start with rule heads (consequents) ∗) that contain only one item ∗) traverse rule heads of increasing size ∗) traverse the possible rule heads ∗) T if s (f −h) ≥ cmin (∗ if the conﬁdence is high enough. end. (∗ otherwise discard the head ∗) Hm+1 := candidates(Hm). d. c. • There are 25 = 32 possible item sets over A = {a. c. d. c}: {c. d → e 4 (40%) 5 (50%) 80% • The minimum support is smin = 3 or σmin = 0. • There are 16 frequent item sets (but only 10 transactions). ak−1. ak } (∗ that diﬀer only in one item and ∗) and f2 = {a1.7% d. c} 8: {a. c. . e} 5: {a. repeat forall h ∈ Hm do s (f ) Generating Association Rules function candidates (Fk ) begin E := ∅. d} 3: {a. X = {c. (∗ union has k + 1 items ∗) k if ∀a ∈ f : f − {a} ∈ Fk then E := E ∪ {f }. d}: {c. end. ak−1. e} 2: {b. e} 9: {c. c. e}: {b. e} frequent item sets 0 items 1 item ∅: 10 {a}: 7 {b}: 3 {c}: 7 {d}: 6 {e}: 7 2 items {a. e}. . c}: {a. e}: 3 6 {a. . e}: 4 3 4 4 4 Generating Association Rules Example: I = {a. e → a) = T sT ({c. . e}: 3 items 4 {a. b. c.3 = 30% in this example. c. e} 4: {a. c.3% e→a 6 (60%) 7 (70%) 85. Christian Borgelt Frequent Pattern Mining 231 Christian Borgelt Frequent Pattern Mining 232 . ∗) T then R := R ∪ {[(f − h) → h]}. ∗) (∗ add the new item set to the candidates ∗) (∗ (otherwise it cannot be frequent) ∗) (∗ return the generated candidates ∗) Christian Borgelt Frequent Pattern Mining 229 Christian Borgelt Frequent Pattern Mining 230 Frequent Item Sets: Example transaction database 1: {a. but ﬁxed) ∗) k f := f1 ∪ f2 = {a1. e}. a′ }. . (∗ or antecedent would become empty ∗) return R. s ({a. (∗ rules ∗) with f1 = {a1. . e}: {d. d}: {a. . . ak . (∗ return the rules found ∗) end. e}. return E. d. . d. a′ } (∗ are in a lexicographic order ∗) k and ak < a′ do begin (∗ (the order is arbitrary. d. e → a 4 (40%) 4 (40%) 100% a. forall f ∈ F do begin m := 1. e}) 3 = = 75% cT (c. (∗ increment the head item counter ∗) (∗ until there are no more rule heads ∗) until Hm = ∅ or m ≥ f . e} 6: {a. e} 10: {a.Generating Association Rules function rules (F). d. . forall f1. R := ∅. end (∗ candidates ∗) (∗ only if all subsets are frequent. ak−1. . . e}) 4 Minimum conﬁdence: 80% association support of support of conﬁdence rule all items antecedent b→c 3 (30%) 3 (30%) 100% d→a 5 (50%) 6 (60%) 83. (∗ create heads with one item more ∗) m := m + 1. c. b.
d} two association rules association rule a→c b→d support of all items 3 (37. • Exploit the preﬁx tree to ﬁnd the support of the body/antecedent. but not at the same time also the rule a → c. c.0%) support of conﬁdence antecedent 5 (62. • For ςT (R) = σ(X) there is no value ςmin that generates only the rule b → d. e} 2: {b. e} 5: {a. σT (Y ) cT (X → Y ) Frequent Pattern Mining 236 Christian Borgelt Frequent Pattern Mining 235 Christian Borgelt . and PT (Y ) • (Absolute) conﬁdence diﬀerence to prior: dT (R) = cT (X → Y ) − σT (Y ) • Lift value: c (X → Y ) lT (R) = T σT (Y ) acd ace d acde bcd bce d bcde A (full) preﬁx tree for the ﬁve items a. • First rule: Get the support of the body/ antecedent from the parent node. c. • Traverse the item set tree breadthﬁrst or depthﬁrst. d} 8: {a.0% Rule Extraction from Preﬁx Tree • Restriction to rules with one item in the head/consequent. b. root pp p pp Let the minimum conﬁdence be cmin = 60%. c. d} 6: {c. c. b.5%) 4 (50. but not the rule a → c. d} 3: {b. c. • The item sets counted in a node consist of ◦ all items labeling the edges to the node (common preﬁx) and ◦ one item following the last edge label in the item order. • Next rules: Discard the head/consequent item from the downward path and follow the remaining path from the current node.Support of an Association Rule The two rule support deﬁnitions are not equivalent: transaction database 1: {a. • (Absolute) diﬀerence of lift quotient to 1: rT (R) = 1 − min cT (X → Y ) σT (Y ) . d} 4: {a. d.5%) 60. b. • For each node traverse the path to the root and generate and test one rule per node. e. • For ςT (R) = σ(X ∪ Y ) and 3 < ςmin ≤ 4 only the rule b → d is generated. • (Absolute) diﬀerence of lift value to 1: qT (R) = cT (X → Y ) −1 σT (Y ) • Based on a global order of the items (which can be arbitrary). head j E p pp ' pp hdnode J j i Jp pp pp prev Q same path J J body isnode Christian Borgelt Frequent Pattern Mining 233 Christian Borgelt Frequent Pattern Mining 234 Reminder: Preﬁx Tree a a ab b abc abd abe c d abcd abce abde d abcde ac c ad ae d ade bc c b b bd c c be d bde d e d cd d cde ce de Additional Rule Filtering: Simple Measures ˆ • General idea: Compare PT (Y  X) = cT (X → Y ) ˆ = cT ( ∅ → Y ) = σT (Y ).0% 4 (50. e} 7: {a.0%) 100.
s5 s3 . .0 n.15 bit/symbol Equal Size Subsets s1 . s3 . j = 1. s4 ..15 s2 2 s3 2 0. X⊆t X⊆t p00 p01 p0. n p. s3. n11 is the number of transactions for which the rule is correct.1 1 An Informationtheoretic Evaluation Measure Information Gain (Kullback and Leibler 1951. P (s5 ) = 0.19 s5 3 Code length: 3. ◦ Ask for containment in an arbitrarily chosen subset. decision tree induction etc. = 1. . . s4 . P (s4) = 0.. p. s4 .830 Christian Borgelt Frequent Pattern Mining 239 Christian Borgelt Frequent Pattern Mining 240 .40 s5 4 s1 2 0. − j=1 pi. n. ◦ A better question scheme than asking for one alternative after the other can easily be found: Divide the set into two subsets of about equal size.j = n. s5 s4. .75 s4 .Additional Rule Filtering: More Sophisticated Measures • Consider the 2 × 2 contingency table or the estimated probability table: X⊆t X⊆t n00 n01 n0. s5 s1.10. 2.59 bit/symbol Code eﬃciency: 0. s3 . s5 s1 1 0.25 P (si) log2 P (si) s1.. It is pij = nij . p. = ni. .19. Y ) = kY H(Y ) H(Y X) kY i=1 • n.j .1 n.j − pij log2 pij for i. satisfying • Shannon Entropy: H(S) = − i=1 n Question/Coding Schemes P (s1 ) = 0. s2 0. − P (s3) = 0.15. .10 0. ◦ Apply this scheme recursively → number of questions bounded by ⌈log2 n⌉.19 0. is the total number of transactions. ◦ Suppose there is an oracle. s4.10 • Intuitively: Expected number of yes/no questions that have to be asked in order to determine the obtaining alternative. H(Y ) H(Y X) H(Y ) − H(Y X) Entropy of the distribution of Y Expected entropy of the distribution of Y if the value of the X becomes known Expected entropy reduction or information gain • General idea: Use measures for the strength of dependence of X and Y . s5 0. . s2 . n. s3 . P (s2 ) = 0. .40 n P (s ) i i=1 Shannon entropy: Linear Traversal i P (si ) log2 P (si) = 2. p10 p11 p1. s4 .664 Code length: 2.40 s2 2 0.16.15 s3 3 0.16 s4 3 0.24 bit/symbol Code eﬃciency: 0. • There is a large number of such measures of dependence originating from statistics. .0 p. i = 1. but responds only if the question can be answered with “yes” or “no”. log2 pi.59 0. Quinlan 1986) n Y ⊆t Y ⊆t Y ⊆t Y ⊆t Based on Shannon Entropy H = − i=1 pi log2 pi − kX (Shannon 1948) Igain(X. Christian Borgelt Frequent Pattern Mining 237 Christian Borgelt Frequent Pattern Mining 238 Interpretation of Shannon Entropy • Let S = {s1 . sn } be a ﬁnite set of alternatives having positive probabilities P (si). n10 n11 n1. s5 s2 . which knows the obtaining alternative. . n.. n.1 is the number of transactions to which the rule is applicable. n = − i=1 pi. s5 0.16 s4 4 0. s2..
◦ Sort the alternatives w. Christian Borgelt Frequent Pattern Mining 243 Interpretation of Shannon Entropy 1 P (s1) = 2 .t.r.15 s2 3 0. • Although this enlarges the question/coding scheme.977 Christian Borgelt Frequent Pattern Mining 242 Question/Coding Schemes • It can be shown that Huﬀman coding is optimal if we have to determine the obtaining alternative in a single instance. • Idea: Process the sequence not instance by instance. s4. the Shannon entropy can easily be interpreted as follows: − i Perfect Question Scheme s1. their probabilities. s3 s1. s2 .25 bit/symbol Code eﬃciency: 0. • However. s2 . s4 .15.40 s5 2 s1 3 0. three or more consecutive instances and ask directly for the obtaining combination of alternatives.955 Code length: 2.875 bit/symbol Code eﬃciency: 1 Christian Borgelt Frequent Pattern Mining 244 . Question/Coding Schemes P (s2 ) = 0.Question/Coding Schemes • Splitting into subsets of about equal size can lead to a bad arrangement of the alternatives into subsets → high expected number of questions.25 0.16 s4 3 0. P (s5 ) = 0. but combine two. s4 . s5 s3. ◦ Always combine those two sets that have the smallest probabilities. s4 s1.35 0.19 0. • ShannonFano Coding (1948) P (s1 ) = 0. s3 . P (si) i occurrence path length in tree probability P (si) = s2 2 s3 3 s4 4 s5 4 In other words.10 0. s4 0. s3 . s3 . this scheme can be improved upon. s4 . the expected number of questions per identiﬁcation of an obtaining alternative cannot be made arbitrarily small.59 i P (si ) log2 P (si) = 2.16. Code length: 1.) • Only if the obtaining alternative has to be determined in a sequence of (independent) situations. − P (s3) = 0.15 bit/symbol (1952) (1948) Huﬀman Coding s1 . s2 0. s2.10 0.19 0.20 bit/symbol Code eﬃciency: 0. ◦ Start with one element sets. s3 . Shannon showed that there is a lower bound. namely the Shannon entropy. s5 s4 .16 s3 2 s4 2 0.40 s5 1 ◦ Build the question/coding scheme bottomup.15 s2 3 s3 3 0. P (s4) = 0. 1 1 1 P (s3) = 8 .19. P (s4 ) = 16 . s5 s1 1 1 2 1 4 1 8 1 16 1 16 P (si) log2 P (si) 1 · log2 . ◦ Split the set so that the subsets have about equal probability (splits must respect the probability order of the alternatives). it is the expected number of needed yes/no questions.60 ◦ Build the question/coding scheme topdown. s5 s1 . Christian Borgelt Frequent Pattern Mining 241 Code length: 2. P (s5 ) = 16 − i P (si) log2 P (si) = 1. s2 .10. s5 s2 . s2 .875 bit/symbol If the probability distribution allows for a perfect Huﬀman code (code eﬃciency 1). • Huﬀman Coding (1952) s4.40 Shannon entropy: Shannon–Fano Coding s1 . s2 s1 3 0. s3 . the expected number of questions per identiﬁcation is reduced (because each interrogation identiﬁes the obtaining alternative for several situations). s5 s3. P (s2) = Shannon entropy: 1 4. (No question/coding scheme has a smaller expected number of questions.41 0. • Good question schemes take the probability of the alternatives into account.25 0. s4 . s5 s1.
1. 47. p.0. 2. 2.8.p.8.73) hours=overtime (26.00) relationship=Wife (4. 57. 1.45) marital=Nevermarried <. 80.3.5.age=young sex=Female (12. 46.0.4.9.5.9. 2.1) n1. 1.occupation=Execmanagerial hours=overtime (5. 99.j n. 1. 1. 54.4.j Christian Borgelt Frequent Pattern Mining 245 Christian Borgelt Frequent Pattern Mining 246 Examples from the Census Data All rules are stated as consequent <.1(1 − p.1.89) relationship=Husband (40.8.j − pij )2 pi.29) occupation=Execmanagerial (12.p. − n1. i=1 j=1 (pi. i=1 j=1 (pi.1 − p11)2 = n. 100. 2.29) hours=overtime <. lift) where the support of a rule is the support of the antecedent.. Y ) = n. • Can be interpreted as a diﬀerence measure.8. 41.96) occupation=Profspecialty (12.j • Side remark: Information gain can also be interpreted as a diﬀerence measure. p1. • Can be interpreted as a diﬀerence measure.antecedent (support%.40) salary>50K <.p.(1 − p1.12) salary>50K <.86) education=Bachelors (16.education=Masters (5. 57.0. 99.58) Christian Borgelt Frequent Pattern Mining 247 Christian Borgelt Examples from the Census Data salary>50K salary>50K salary>50K salary>50K salary>50K salary>50K salary>50K salary>50K <<<<<<<<education=Masters (5.4.. 2. 2. 1.4.occupation=Execmanagerial sex=Male (8.09) sex=Male <.)n.education=Bachelors (16. 54. 44.6.5. Trivial/Obvious Rules edu_num=13 <. 2.51) salary>50K <. χ2(X.6.occupation=Execmanagerial (12.age=young sex=Male (17. Y ) = kX kY A Statistical Evaluation Measure χ2 Measure • Compares the actual joint distribution with a hypothetical independent distribution.)p.1) Igain(X.relationship=Wife (4.1.. 44.00) salary>50K <. (n1. • Uses absolute comparison. − n.4.A Statistical Evaluation Measure χ2 Measure • Compares the actual joint distribution with a hypothetical independent distribution. 54. kX kY • For kX = kY = 2 (as for rule evaluation) the χ2 measure simpliﬁes to χ2(X..4.8.8.8.education=Masters (5.9.0. 41.9.n11)2 (p1.29) Frequent Pattern Mining 248 . 45. 60.occupation=Profspecialty hours=overtime (4. confidence%.4. 1.6.n.99.50) sex=Female <. 2.3.3.88) marital=Marriedcivspouse (45.01) Interesting Comparisons marital=Nevermarried <.education=Bachelors hours=overtime (6.1(n.39) salary>50K <. 40. 3.1 − n.9.relationship=Husband (40. 6.(n.9. . 47. 2.9. 2..j − pij )2 pi.p.. • Uses absolute comparison.. χ2(X. 69. 1.4. Y ) = j=1 i=1 pij log2 pij pi. Y ) = kX kY n.4.p.70) salary>50K <.3.
occupation=Transportmoving education=HSgrad <.Examples from the Census Data salary>50K <.2. 2.86) Examples from the Census Data hours=halftime <. 1. 2.70) hours=overtime <. 43.2.occupation=Otherservice (10. 68.education=Masters (5.hours=overtime marital=Marriedcivspouse (15. 1. 1.9.2.9. 31. 3.65) sex=Female <.5. 2.1.1. 2. 54.occupation=Otherservice age=young (4. 1.4.salary>50K (23.2.occupation=Execmanagerial education=HSgrad <. 1.69) hours=overtime <.9. 70.8.03) sex=Female <. 2.5.6. 33.7.salary>50K (23.03) (4.9. 53.3.occupation=Profspecialty education=Bachelors <. 49. 2. 3.1. 44.1.88) occupation=Profspecialty <. 85.57) Summary Association Rules • Association Rule Induction is a Two Step Process ◦ Find the frequent item sets (minimum support).62) age=young <.81) salary>50K <.0.36) marital=Marriedcivspouse <. ◦ Information gain ◦ χ2 measure Christian Borgelt Frequent Pattern Mining 251 Christian Borgelt Frequent Pattern Mining 252 .5. 2.8.occupation=Execmanagerial marital=Marriedcivspouse (7. 2. 1.workclass=Selfempnotinc (7.education=Bachelors sex=Female (5.1. 40.4.6) Christian Borgelt Frequent Pattern Mining 249 Christian Borgelt Frequent Pattern Mining 250 Examples from the Census Data occupation=Profspecialty <.5. 55.1.1.85) salary>50K <. 2. ◦ Form the relevant association rules (minimum conﬁdence).96) salary>50K <.08) hours=overtime <.70) age=senior <.9. ◦ Filter “interesting” association rules based on minimum support and minimum conﬁdence.4.4. 1.4.2. 1. 67. 1.7.hours=halftime (12. 2.79) age=young <.1.occupation=Execmanagerial salary>50K (6.20) (12. 51. • Generating the Association Rules ◦ Form all possible association rules from the frequent item sets. • Filtering the Association Rules ◦ Compare rule conﬁdence and consequent support.61) (6.occupation=Machineopinspct (12.71) sex=Female <. 56.12) hours=overtime <.5. 36. 53.58) education=Bachelors <.education=Masters (5.occupation=Handlerscleaners (4.6. 1.hours=halftime (12.occupation=Profspecialty marital=Marriedcivspouse (6. 67.education=Bachelors marital=Marriedcivspouse (8. 50. 34.0.4. 1.2.education=Somecollege sex=Female (8. 37.74) occupation=Admclerical <.0.8.8. 50.3.6.occupation=Admclerical (11.7. 31.6.occupation=Execmanagerial (12.
so that support counting becomes more eﬃcient. SDﬁle/Ctab etc. Mining More Complex Patterns Christian Borgelt Frequent Pattern Mining 253 Christian Borgelt Frequent Pattern Mining 254 Molecular Fragment Mining • Motivation: Accelerating Drug Development ◦ Phases of drug development: preclinical and clinical ◦ Data gathering by highthroughput screening: building molecular databases with activity information Motivation: Molecular Fragment Mining ◦ Acceleration potential by intelligent data analysis: (quantitative) structureactivity relationship discovery • Mining Molecular Databases ◦ Example data: NCI DTP HIV Antiviral Screen data set ◦ Description languages for molecules: SMILES. • Frequent (Sub)Graph Mining comprises the other areas: ◦ Trees are special graphs. ◦ Finding common molecular substructures ◦ Finding discriminative molecular substructures Christian Borgelt Frequent Pattern Mining 255 Christian Borgelt Frequent Pattern Mining 256 . ◦ Special data structures to represent the database to mine. • Frequent Sequence Mining and Frequent Tree Mining can exploit: ◦ Specialized canonical forms that allow for more eﬃcient checks. namely graphs that are singly connected. SLN. ◦ Sequences can be seen as special trees. namely the general scheme of searching with a canonical form. namely chains (only one or two branches — depending on the choice of the root).Mining More Complex Patterns • The search scheme in Frequent Graph/Tree/Sequence mining is the same. • We will treat Frequent Graph Mining ﬁrst and will discuss optimizations for the other areas later.
• As a consequence the chances for the development of drugs for target groups ◦ with rare diseases or ◦ with special diseases in developing countries are considerably reduced. it is tried to improve the search for new drug candidates (lead discovery) and their optimization (lead optimization). ca. One possible approach: • With highthroughput screening a very large number of substances is tested automatically and their activity is determined.t. at the same the number of substances under development has gone down drastically. Germany) u Christian Borgelt Frequent Pattern Mining 257 Christian Borgelt Phases of Drug Development • Discovery and Optimization of Candidate Substances ◦ HighThroughput Screening ◦ Lead Discovery and Lead Optimization • Preclinical Test Series (tests with animals. • Therefore approaches to speed up the development process usually target the preclinical phase before the animal tests.com www. • Due to high investments pharmaceutical companies must secure their market position and competitiveness by only a few. • In particular. eﬀectiveness and side eﬀects • Clinical Test Series (tests with humans.com 260 Christian Borgelt Frequent Pattern Mining . 4–6 years) ◦ Phase 1: ca. highly successful drugs.com 259 Christian Borgelt Frequent Pattern Mining c www. ca. 30–80 healthy humans Check for side eﬀects ◦ Phase 2: ca. 3 years) ◦ Fundamental test w. since they serve the purpose to ensure the safety of the patients. • A signiﬁcant reduction of the development time could mitigate this trend or even reverse it.thermo. 100–300 humans exhibiting the symptoms of the target disease Check for eﬀectiveness ◦ Phase 3: up to 3000 healthy and ill humans at least 3 years Detailed check of eﬀectiveness and side eﬀects • Oﬃcial Acceptance as a Drug Frequent Pattern Mining 258 Drug Development: Acceleration Potential • The length of the preclinical and clinical tests series can hardly be reduced.Accelerating Drug Development • Developing a new drug can take 10 to 12 years (from the choice of the target to the introduction into the market). (Source: Bundesministerium f¨r Bildung und Forschung.r.elisatek. Here Intelligent Data Analysis and Frequent Pattern Mining can help.arrayit. • The resulting molecular databases are analyzed by trying to ﬁnd common substructures of active substances. • In recent years the duration of the drug development processes increased continuously.com www.matrixtechcorp. HighThroughput Screening On socalled microplates proteins/cells are automatically combined with a large variety of chemical compounds. www.
CC1=CC=CC(=C1)SC[C]2N=C3C=CC=CC3=C(C)[N+]2=O 64057.OS(O)(=O)=O 22318.NC(=N)NC1=C(SSC2=C(NC(N)=N)C=CC=C2)C=CC=C1. identiﬁcation number. • Substances that reproducibly provided 100% protection are listed as “conﬁrmed active” (CA).2. ﬂuorescence.2.0. Figure c Christof Fattinger.0. 1: CM. • A large number of chemical compounds where tested whether they protect human CEM cells against an HIV1 infection.biotek. • Substances that reproducibly provided at least 50% protection are listed as “moderately active” (CM). • Substances that provided 50% protection were retested.1. Christian Borgelt Form of the Input Data Excerpt from the NCI DTP HIV Antiviral Screen data set (SMILES format): 737. 0: CI).NC1=NC(=C(N=O)C(=N1)O)NC2=CC(=C(Cl)C=C2)Cl 55917.0.2.[O][N+](=O)C1=CC2=C(C=NN=C2C=C1)N3CC3 . molecule description in SMILES notation 877 CM. • 325 CA.CC1=C2C=CC=CC2=N[C](CSC3=CC=CC=C3)[N+]1=O 51342.com Christian Borgelt Frequent Pattern Mining 261 Christian Borgelt Frequent Pattern Mining 262 Example: NCI DTP HIV Antiviral Screen • Among other data sets. polarization etc).CC1=C(SC[C]2N=C3C=CC=CC3=C(C)[N+]2=O)C=CC=C1 64055.r. absorption..t.2.C[N+](C)(C)C1=CC2=C(NC3=CC=CC=C3S2)N=N1 50848. 35 969 CI (total: 37 171 substances) Frequent Pattern Mining 263 Christian Borgelt Frequent Pattern Mining 264 . By analyzing the results one tries to understand the dependence between molecular structure and activity.HighThroughput Screening The ﬁlled microplates are then evaluated in spectrometers (w.CN(C)C1=[S+][Zn]2(S1)SC(=[S+]2)N(C)C 2018. HighThroughput Screening After the measurement the substances are classiﬁed as active or inactive.0.CCCCN(CCCC)C1=[S+][Cu]2(S1)SC(=[S+]2)N(CCCC)CCCC 24479.OC1=C2N=C(NC3=CC=CC=C3)SC2=NC=N1 20625. • All other substances are listed as “conﬁrmed inactive” (CI).0. HoﬀmannLaRoche. luminescence. the National Cancer Institute (NCI) has made the DTP HIV Antiviral Screen Data Set publicly available. 0.moleculardevices.OC1=C2C=NC(=NC2=C(O)N=N1)NC3=CC=C(Cl)C=C3 55721.O=C(N1CCCC[CH]1C2=CC=CN=C2)C3=CC=CC=C3 64054. activity (2: CA.CC1=C2C=CC=CC2=N[C](CSC3=NC4=CC=CC=C4S3)[N+]1=O 66151.com www.0.. QSAR — Quantitative StructureActivity Relationship Modeling In this area a large number of data mining algorithms are used: • feature selection methods • decision trees • neural networks etc.0. Basel c www. 0.N#CC(=CC1=CC=CC=C1)C2=CC=CC=C2 19110.
9464 0. comment 6 5 0 0 1 0 0. .6220 1. Inc.0817 1.) C[1]H:CH:C(F):CH:C[8]:C:@1C[10]HCH(CH2CH2@8)C[20]HC(CH3) (CH2CH2@10)CH(CH2CH2@20)OH Represented Molecule: Full Representation H HHHH H H H H C C C H C C H H O C C C C C F C C C C H C C HHC C H H H H HH H H Christian Borgelt black: nonterminal symbols blue : terminal symbols Simpliﬁed Representation The deﬁnitions of the nonterminals ”Element”.8037 1.0000 0. and ”Label” depend on the chosen description language.2 $$$$ Finding Common Molecular Substructures Some Molecules from the NCI HIV Database date/time etc. program.0000 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 N N O O O P O O O O O 3C 1 5O 2 C C N4 O6 O O O N N N Common Fragment O O N SDﬁle: Structuredata ﬁle Ctab: Connection table (lines 4–16) c Elsevier Science O O O O O N O N N N O O N O N N N Christian Borgelt Frequent Pattern Mining 267 Christian Borgelt Frequent Pattern Mining 268 ..0000 0. Inc. Digit  % Digit Digit 0  1  .0000 0.5342 0. ”Bond”.  9 Frequent Pattern Mining 265 Christian Borgelt Frequent Pattern Mining 266 Input Format: SDﬁle/Ctab LAlanine (13C) user initials.) Input Format: Grammar for SMILES and SLN General grammar for (linear) molecule descriptions (SMILES and SLN): Molecule Branch ::= ::=    Atom ::= LabelDef ::=  Atom Branch ε Bond Atom Branch Bond Label Branch ( Branch ) Branch Element LabelDef ε Label LabelDef c1:c:c(F):c:c2:c:1C1C(CC2)C2C(C)(CC1)C(O)CC2 SLN (SYBYL Line Notation): (Tripos.0000 0. Daylight.B.4244 1 2 1 0 0 0 1 3 1 1 0 0 1 4 1 0 0 0 2 5 2 0 0 0 2 6 1 0 0 0 M END > <value> 0.6622 0.3000 0.7207 2.3695 0.8622 0. .Input Format: SMILES Notation and SLN SMILES Notation: (z. ε=#: .6622 0..0000 0. 3 V2000 C 0 0 C 0 0 C 1 0 N 0 3 O 0 0 O 0 5 N N N O N N O O N N N O N N N O O N O N O O N 0. For SMILES it is: F O Element Bond Label Digit ::= ::= ::= ::= B  C  N  O  F  [H]  [He]  [Li]  [Be]  .
. ◦ Connected substructures suﬃce for most applications. ◦ Find molecular fragments that appear frequently in the active molecules and only rarely in the inactive molecules. . charge. • Graph structure of vertices and edges has to be taken into account. ◦ This can help to identify drug candidates (socalled pharmacophores) and to guide future screening eﬀorts. ◦ This restriction considerably narrows the search space. graphs without these restrictions could be handled as well. Frequent (Sub)Graph Mining: Basic Notions • Let A = {a1. Note also that several vertices and edges may have the same attribute/label. . However. • Rationale in both cases: ◦ The found fragments can give hints which structural properties are responsible for the activity of a molecule. • Discriminative Molecular Substructures ◦ Analyze the active and the inactive molecules. • Finding frequent substructures means to ﬁnd graph fragments that are contained in many graphs in a given database of attributed graphs (user speciﬁes minimum support).Finding Molecular Substructures • Common Molecular Substructures ◦ Analyze only the active molecules. . triple. ⇒ Search partially ordered set of graph structures instead of subsets. • A labeled or attributed graph is a triple G = (V. ◦ E ⊆ V × V − {(v. where ◦ V is the set of vertices. aromatic ring ﬂag • Bond attributes: bond type (single. Main problem: How can we avoid redundant search? • Usually the search is restricted to connected substructures. Note that G is undirected and simple and contains no loops. ℓ). Example: molecule representation • Atom attributes: atom type (chemical element). v)  v ∈ V } is the set of edges. aromatic) 271 Christian Borgelt Frequent Pattern Mining 272 Christian Borgelt Frequent Pattern Mining . am} be a set of attributes or labels. E. ◦ Find molecular fragments that appear frequently in the molecules. double. and ◦ ℓ : V ∪ E → A assigns labels from the set A to vertices and edges. Frequent (Sub)Graph Mining Christian Borgelt Frequent Pattern Mining 269 Christian Borgelt Frequent Pattern Mining 270 Frequent (Sub)Graph Mining: General Approach • Finding frequent item sets means to ﬁnd sets of items that are contained in many transactions.
Frequent (Sub)Graph Mining: Basic Notions Note that for labeled graphs the same notions can be used as for normal graphs. we will use. • If S ⊑ G or S < G. It is understood that no edge (and no vertex) occurs twice. • All other bridges are called leaf bridges (because they are incident to at least one leaf). In other words: an edge is a proper bridge if removing it creates an isolated vertex. • Two diﬀerent vertices are adjacent or connected if they are incident to the same edge. ℓG) and S = (VS . The set of all (connected) subgraphs is analogous to the power set of a set. EG. • A graph is called connected if there exists a path between any two vertices. v). ES . • The set of all connected subgraphs of G is denoted by C(G). • Note that there may be several ways to map a labeled graph S to a labeled graph G so that the connection structure and the vertex and edge labels are preserved. • A connected component of a graph is a subgraph that is connected and maximal in the sense that any larger subgraph containing it is not connected. and the edge is incident to the vertex v. If such a mapping f exists. • An edge of a graph is called a proper bridge if it is a bridge and not incident to a leaf. v ′) or e = (v ′. v) ∈ ES : (f (u). written S ≡ G. 275 Christian Borgelt Frequent Pattern Mining 276 ◦ ∀(u. If S is a (proper) subgraph of G we write S ⊆ G or S ⊂ G. • A vertex of a graph is called a leaf if it is incident to exactly one edge. such that S and G′ are isomorphic. iﬀ S ⊑ G and S ≡ G. Christian Borgelt Frequent Pattern Mining . A subgraph isomorphism of S to G or an occurrence of S in G is an injective function f : VS → VG with ◦ ∀v ∈ VS : ℓS (v) = ℓG(f (v)) and ∧ ℓS ((u. This is the case if there exists a subgraph isomorphism of S to itself (a socalled graph automorphism) that is not the identity. It may even be that the graph S can be mapped in several diﬀerent ways to the same subgraph of G. v)) = ℓG((f (u). This explains the term “subgraph isomorphism”. 273 Christian Borgelt Frequent Pattern Mining 274 Christian Borgelt Frequent Pattern Mining Frequent (Sub)Graph Mining: Basic Notions • Let G = (VG. we will use. f (v))). written S < G. Without formal deﬁnition. Frequent (Sub)Graph Mining: Basic Notions Let S and G be two labeled graphs. we write S ⊑ G. • A subgraph consists of a subset of the vertices and a subset of the edges. • S is properly contained in G. For example. the mapping f preserves the connection structure and the labels. G may possess several subgraphs that are isomorphic to S. Without formal deﬁnition. there are (unconnected) graphs S with S ⊑ G that are not in C(G). • S and G are called isomorphic. • A path is a sequence of edges connecting two vertices. ℓS ) be two labeled graphs. then there exists a (proper) subgraph G′ of G. respectively. A function f mapping S to itself is called a graph automorphism. that is. iﬀ e = (v. It is obvious that for all S ∈ C(G) : S ⊑ G. • An edge of a graph is called a bridge if removing it increases the number of connected components of the graph. there is no other path on which one can reach the one from the other. iﬀ S ⊑ G and G ⊑ S. However.Frequent (Sub)Graph Mining: Basic Notions Note that for labeled graphs the same notions can be used as for normal graphs. In this case a function f mapping S to G is called a graph isomorphism. f (v)) ∈ EG That is. for example: • A vertex v is incident to an edge e. for example: • A vertex of a graph is called isolated if it is not incident to any edge. More intuitively: a bridge is the only connection between two vertices.
Subgraph Isomorphism: Examples
N N O N
G
Subgraph Isomorphism: Examples
N N O N
G
S1
O
S2
S1
O
S2 f2 : VS2 → VG
O N O O
f1 : VS1 → VG
O N O O
O
O
• A molecule G that represents a graph in a database and two graphs S1 and S2 that are contained in G. • The subgraph relationship is formally described by a mapping f of the vertices of one graph to the vertices of another: G = (VG, EG), S = (VS , ES ), f : VS → VG.
• The mapping must preserve the connection structure: ∀(u, v) ∈ ES : (f (u), f (v)) ∈ EG.
• The mapping must preserve vertex and edge labels: ∀v ∈ VS : ℓS (v) = ℓG(f (v)), ∀(u, v) ∈ ES : ℓS ((u, v)) = ℓG((f (u), f (v))).
• This mapping must preserve the connection structure and the labels.
Christian Borgelt Frequent Pattern Mining 277
Here: oxygen must be mapped to oxygen, single bonds to single bonds etc.
Christian Borgelt
Frequent Pattern Mining
278
Subgraph Isomorphism: Examples
N N O N
G
Subgraph Isomorphism: Examples
N N O N
G
S1
O
S2 f2 : VS2 → VG g2 : VS2 → VG
S1
O O
S3 f3 : VS3 → VG g3 : VS3 → VG
f1 : VS1 → VG
O N O O
f1 : VS1 → VG
O N O O
O
O
• There may be more than one possible mapping / occurrence. (There are even three more occurrences of S2.) • However, we are currently only interested in whether there exists a mapping. (The number of occurrences will become important when we consider mining frequent (sub)graphs in a single graph.) • Testing whether a subgraph isomorphism exists between given graphs S and G is NPcomplete (that is, requires exponential time unless P = NP).
Christian Borgelt Frequent Pattern Mining 279
• A graph may be mapped to itself (automorphism). • Trivially, every graph possesses the identity as an automorphism. (Every graph can be mapped to itself by mapping each node to itself.) • If a graph (fragment) possesses an automorphism that is not the identity there is more than one occurrence at the same location in another graph. • The number of occurrences of a graph (fragment) in a graph can be huge.
Christian Borgelt
Frequent Pattern Mining
280
Frequent (Sub)Graph Mining: Basic Notions
Let S be a labeled graph and G a vector of labeled graphs. • A labeled graph G ∈ G covers the labeled graph S or the labeled graph S is contained in a labeled graph G ∈ G iﬀ S ⊑ G.
Frequent (Sub)Graph Mining: Formal Deﬁnition
Given: • a set A = {a1, . . . , am} of attributes or labels, • a vector G = (G1, . . . , Gn) of graphs with labels in A, • a number smin ∈ IN, 0 < smin ≤ n, a number σmin ∈ IR, 0 < σmin ≤ 1, Desired: • the set of frequent (sub)graphs or frequent fragments, that is, or (equivalently) the minimum support.
• The set KG (S) = {k ∈ {1, . . . , n}  S ⊑ Gk } is called the cover of S w.r.t. G. The cover of a graph is the index set of the database graphs that cover it. It may also be deﬁned as a vector of all labeled graphs that cover it (which, however, is complicated to write in formally correct way). • The value sG (S) = KG (S) is called the (absolute) support of S w.r.t. G. 1 The value σG (S) = n KG (S) is called the relative support of S w.r.t. G. The support of S is the number or fraction of labeled graphs that contain it. Sometimes σG (S) is also called the (relative) frequency of S w.r.t. G.
the set FG (smin) = {S  sG (S) ≥ smin} or (equivalently) the set ΦG (σmin) = {S  σG (S) ≥ σmin}.
1 Note that with the relations smin = ⌈nσmin⌉ and σmin = n smin the two versions can easily be transformed into each other.
Christian Borgelt
Frequent Pattern Mining
281
Christian Borgelt
Frequent Pattern Mining
282
Frequent (Sub)Graphs: Example
example molecules (graph database)
S C N C O O S C N F O S C N O
Properties of the Support of (Sub)Graphs
• A brute force approach that enumerates all possible (sub)graphs, determines their support, and discards infrequent (sub)graphs is usually infeasible: The number of possible (connected) (sub)graphs, grows very quickly with the number of vertices and edges.
frequent molecular fragments (smin = 2) ∗ 3
S
(empty graph)
O
C
N
3
O S
3
S C
3
C O
3
C N
• Idea: Consider the properties of the support, in particular: ∀S : ∀R ⊇ S : KG (R) ⊆ KG (S).
2
O S C
3
S C N
2
S C O
3
N C O
This property holds, because ∀G : ∀S : ∀R ⊇ S : R ⊑ G → S ⊑ G. Each additional edge is another condition a database graph has to satisfy. Graphs that do not satisfy this condition are removed from the cover. • It follows: ∀S : ∀R ⊇ S : sG (R) ≤ sG (S).
2 The numbers below the subgraphs state their support.
O S C N
3
2
S C N O
2
2
2
That is: If a (sub)graph is extended, its support cannot increase. One also says that support is antimonotone or downward closed.
Christian Borgelt
Frequent Pattern Mining
283
Christian Borgelt
Frequent Pattern Mining
284
Properties of the Support of (Sub)Graphs
• From ∀S : ∀R ⊇ S : sG (R) ≤ sG (S) it follows ∀smin : ∀S : ∀R ⊇ S : sG (S) < smin → sG (R) < smin.
Reminder: Partially Ordered Sets
• A partial order is a binary relation ≤ over a set S which satisﬁes ∀a, b, c ∈ S: ◦ a≤a ◦ a≤b∧b≤a ⇒ a=b ◦ a≤b∧b≤c ⇒ a≤c (reﬂexivity) (antisymmetry) (transitivity)
That is: No supergraph of an infrequent (sub)graph can be frequent. • This property is often referred to as the Apriori Property. Rationale: Sometimes we can know a priori, that is, before checking its support by accessing the given graph database, that a (sub)graph cannot be frequent. • Of course, the contraposition of this implication also holds: ∀smin : ∀S : ∀R ⊆ S : sG (S) ≥ smin → sG (R) ≥ smin.
• A set with a partial order is called a partially ordered set (or poset for short). • Let a and b be two distinct elements of a partially ordered set (S, ≤). ◦ if a ≤ b or b ≤ a, then a and b are called comparable. ◦ if neither a ≤ b nor b ≤ a, then a and b are called incomparable. • If all pairs of elements of the underlying set S are comparable, the order ≤ is called a total order or a linear order. • In a total order the reﬂexivity axiom is replaced by the stronger axiom: ◦ a≤b∨b≤a
285 Christian Borgelt
That is: All subgraphs of a frequent (sub)graph are frequent. • This suggests a compressed representation of the set of frequent (sub)graphs.
(totality)
Frequent Pattern Mining 286
Christian Borgelt
Frequent Pattern Mining
Properties of the Support of (Sub)Graphs
Monotonicity in Calculus and Analysis • A function f : IR → IR is called monotonically nondecreasing if ∀x, y : x ≤ y ⇒ f (x) ≤ f (y). • A function f : IR → IR is called monotonically nonincreasing if ∀x, y : x ≤ y ⇒ f (x) ≥ f (y). Monotonicity in Order Theory • Order theory is concerned with arbitrary partially ordered sets. The terms increasing and decreasing are avoided, because they lose their pictorial motivation as soon as sets are considered that are not totally ordered. • A function f : S1 → S2, where S1 and S2 are two partially ordered sets, is called monotone or orderpreserving if ∀x, y ∈ S1 : x ≤ y ⇒ f (x) ≤ f (y). • A function f : S1 → S2, is called antimonotone or orderreversing if ∀x, y ∈ S1 : x ≤ y ⇒ f (x) ≥ f (y). • In this sense the support of a (sub)graph is antimonotone.
Christian Borgelt Frequent Pattern Mining 287
Properties of Frequent (Sub)Graphs
• A subset R of a partially ordered set (S, ≤) is called downward closed if for any element of the set all smaller elements are also in it: ∀x ∈ R : ∀y ∈ S : y≤x ⇒ y∈R In this case the subset R is also called a lower set. • The notions of upward closed and upper set are deﬁned analogously. • For every smin the set of frequent (sub)graphs FG (smin) is downward closed w.r.t. the partial order ⊑: ∀S ∈ FG (smin) : S ⊑ R ⇒ R ∈ FG (smin).
• Since the set of frequent (sub)graphs is induced by the support function, the notions of up or downward closed are transferred to the support function: Any set of (sub)graphs induced by a support threshold θ is up or downward closed. FG (θ) = {S  sG (S) ≥ θ} is downward closed, IG (θ) = {S  sG (S) < θ} is upward closed.
Christian Borgelt Frequent Pattern Mining 288
because there may be elements y ∈ R with neither x ≤ y nor y ≤ x. O S C N 3 2 S C N O 2 2 2 Christian Borgelt Frequent Pattern Mining 292 . That is. • The notions minimal and minimal element are deﬁned analogously. no supergraph of a maximal (frequent) (sub)graph is frequent. • Therefore: ∀smin : FG (smin) = S∈MG (smin) C(S). That is: Every frequent (sub)graph has a maximal supergraph. Christian Borgelt Frequent Pattern Mining 291 Maximal (Sub)Graphs: Example example molecules (graph database) S C N C O O S C N F O S C N O frequent molecular fragments (smin = 2) ∗ 3 S (empty graph) O C N 3 O S 3 S C 3 C O 3 C N 2 O S C 3 S C N 2 S C O 3 N C O 2 The numbers below the subgraphs state their support. ≤). That is: A (sub)graph is maximal if it is frequent. • Inﬁnite partially ordered sets need not possess a maximal element. An element x ∈ R is called maximal or a maximal element of R if ∀y ∈ R : x ≤ y ⇒ x = y. • Maximal elements need not be unique. • Here we consider the set FG (smin) together with the partial order ⊑: The maximal (frequent) (sub)graphs are the maximal elements of FG (smin): MG (smin) = {S ∈ FG (smin)  ∀R ∈ FG (smin) : S ⊑ R ⇒ S ≡ R}. but none of its proper supergraphs is frequent. Christian Borgelt Frequent Pattern Mining 289 Christian Borgelt Frequent Pattern Mining 290 Reminder: Maximal Elements • Let R be a subset of a partially ordered set (S.Maximal (Sub)Graphs • Consider the set of maximal (frequent) (sub)graphs / fragments: MG (smin) = {S  sG (S) ≥ smin ∧ ∀R ⊃ S : sG (R) < smin}. Types of Frequent (Sub)Graphs • Since with this deﬁnition we know that ∀smin : ∀S ∈ FG (smin) : S ∈ MG (smin) ∨ ∃R ⊃ S : sG (R) ≥ smin it follows (can easily be proven by successively extending the graph S) ∀smin : ∀S ∈ FG (smin) : ∃R ∈ MG (smin) : S ⊆ R.
follows immediately from ∀S : ∀R ⊇ S : sG (S) ≥ sG (R). max sG (R) cl (I) = k∈KT (I) tk . Christian Borgelt Frequent Pattern Mining 294 Closed (Sub)Graphs • However. That is: Every frequent (sub)graph has a closed supergraph. • Note that we have generally ∀smin : ∀S ∈ FG (smin) : sG (S) ≥ R∈MG (smin). but then we know only the support of the maximal (sub)graphs. Christian Borgelt Frequent Pattern Mining 295 restricted to the set of frequent item sets: CT (smin) = {I ∈ FT (smin)  I = cl (I)} Christian Borgelt Frequent Pattern Mining 296 .R⊃S Closed (Sub)Graphs • Consider the set of closed (frequent) (sub)graphs / fragments: CG (smin) = {S  sG (S) ≥ smin ∧ ∀R ⊃ S : sG (R) < sG (S)}. not only has every frequent (sub)graph a closed supergraph.R⊇S max sG (R). which satisﬁes the following conditions ∀X. Reminder: Closure Operators • A closure operator on a set S is a function cl : 2S → 2S .R⊇S • A set R ⊆ S is called closed if it is equal to its closure: R is closed ⇔ R = cl (R). • The closed (frequent) item sets are induced by the closure operator sG (S) = R∈CG (smin). a (sub)graph cannot have a lower support than any of its supergraphs. that the supergraph need not be unique — see below. • The set of all closed (sub)graphs preserves knowledge of all support values: ∀smin : ∀S ∈ FG (smin) : • Note that the weaker statement ∀smin : ∀S ∈ FG (smin) : sG (S) ≥ R∈CG (smin). • Since with this deﬁnition we know that ∀smin : ∀S ∈ FG (smin) : S ∈ CG (smin) ∨ ∃R ⊃ S : sG (R) = sG (S) max sG (R).) Note. but it has a closed supergraph with the same support: ∀smin : ∀S ∈ FG (smin) : ∃R ⊇ S : R ∈ CG (smin) ∧ sG (R) = sG (S). S ⊆ R. • About the support of a nonmaximal frequent (sub)graphs we only know: ∀smin : ∀S ∈ FG (smin) − MG (smin) : sG (S) ≥ R∈MG (smin).Limits of Maximal (Sub)Graphs • The set of maximal (sub)graphs captures the set of all frequent (sub)graphs. This relation follows immediately from ∀S : ∀R ⊇ S : sG (S) ≥ sG (R). Y ⊆ S: ◦ X ⊆ cl (X) ◦ X ⊆ Y ⇒ cl (X) ⊆ cl (Y ) ◦ cl (cl (X)) = cl (X) (cl is extensive) (cl is increasing or monotone) (cl is idempotent) (Proof: consider the closure operator that is deﬁned on the following slides. That is: A (sub)graph is closed if it is frequent. which also preserves knowledge of all support values? Christian Borgelt Frequent Pattern Mining 293 C(S). a (sub)graph cannot have a lower support than any of its supergraphs.R⊇S it follows (can easily be proven by successively extending the graph S) ∀smin : ∀S ∈ FG (smin) : ∃R ∈ CG (smin) : max sG (R). that is. but none of its proper supergraphs has the same support. • Therefore: ∀smin : FG (smin) = S∈CG (smin) • Question: Can we ﬁnd a subset of the set of all frequent (sub)graphs. that is. however.
this is not possible. ⊆) and (Y. B2 ∈ Y : ◦ ∀A ∈ X : ∀B ∈ Y : A1 B1 A A2 B2 f2(B) ⇒ ⇒ ⇔ f1(A1 ) f2(B1) B f1(A2 ). because the greatest common subgraph of two (or more) graphs need not be uniquely deﬁned. . • Let V be the index set of the database graphs in G. let (X. A2 ∈ X : ◦ ∀B1. . that is V = {1. Christian Borgelt Frequent Pattern Mining A1 ⊆ A2 B1 ⊆ B2 A ⊆ f2(B) ⇒ ⇒ ⇔ f1(A1 ) ⊇ f1(A2 ). . . A2 ∈ 2U : ◦ ∀B1. f1(A). • A function pair (f1. . . Christian Borgelt Frequent Pattern Mining 297 • In a monotone Galois connection. tk by replacing the intersection with the greatest common subgraph. U = {S  ∃i ∈ {1. • As a consequence. B ⊆ f1(A). 300 Frequent Pattern Mining . f1(A). ⊆) and (2{1. f2) is an antimonotone Galois connection. Y ) = (2V . ⊆) and (2V . . . .n}. J → {S ∈ V  ∀k ∈ J : S ⊑ Gk }. respectively. B2 ∈ 2V : ◦ ∀A ∈ 2U : ∀B ∈ 2V : 299 Christian Borgelt and f2 : 2{1. . X ) and (Y.. Galois Connections in Frequent Item Set Mining • Consider the partially order sets Let (2B . ◦ There are two greatest common subgraphs: A−B and B − C... f2(B2). • Unfortunately. n} (set of graph identiﬁers)... Therefore the combination f1 ◦ f2 : 2B → 2B is a closure operator.. that is. I → KT (I) = {k ∈ {1. Galois Connections in Frequent (Sub)Graph Mining • Let G = (G1. n}  I ⊆ tk } • The pair (f1. n} : S ⊑ Gi}. the intersection of a set of database graphs can yield a set of graphs instead of a single common graph.. in an antimonotone Galois connection. Gn) be a vector of database graphs. . and • Then the combination f1 ◦ f2 : X → X of the functions of a Galois connection is a closure operator (as well as the combination f2 ◦ f1 : Y → Y ). . f2 : 2V → 2U I → {k ∈ U  ∀S ∈ I : S ⊑ Gk }. A2 ∈ X : ◦ ∀B1. ◦ Consider the two graphs (which are actually chains): A−B−C and A − B − B − C.. it appears natural to transfer the operation cl (I) = k∈KT (I) Reminder: Galois Connections • Let (X. and let the partial orders be the subset relations on these power sets.Closed (Sub)Graphs • Question: Is there a closure operator that induces the closed (sub)graphs? • At ﬁrst glance. . ⊆). J → j∈J tj = {i ∈ B  ∀j ∈ J : i ∈ tj }. B2 ∈ Y : ◦ ∀A ∈ X : ∀B ∈ Y : A1 B1 A A2 B2 f2(B) ⇒ ⇒ ⇔ f1(A1 ) f2(B1) B f1(A2 ). f2(B1) ⊇ f2(B2).. . • A function pair (f1. ⊆) and Y = (2V . • The function pair (f1. f2) is a Galois connection of X = (2U . Consider the function pair f1 : 2U → 2V . . • Let U be the set of all subgraphs of the database graphs in G... Y ) be two partially ordered sets. .. f2) with f1 : X → Y and f2 : Y → X is called a (monotone) Galois connection iﬀ ◦ ∀A1. f1 : 2B → 2{1.n} → 2B . both f1 and f2 are monotone. • (2U . Christian Borgelt Frequent Pattern Mining 298 Reminder: Galois Connections Galois Connections and Closure Operators • Let the two sets X and Y be power sets of some sets U and V . f2) with f1 : X → Y and f2 : Y → X is called an antimonotone Galois connection iﬀ ◦ ∀A1. that is. both f1 and f2 are antimonotone. X ) = (2U . ⊆). f2(B2).n}. ⊑) are partially ordered sets. ⊑): ◦ ∀A1.
f2) is an (antimonotone) Galois connection. O S C N 3 2 S C N O 2 2 Christian Borgelt Frequent Pattern Mining 2 302 Types of Frequent (Sub)Graphs • Frequent (Sub)Graph Any frequent (sub)graph (support is higher than the minimal support): I frequent ⇔ sG (S) ≥ smin • Closed (Sub)Graph A frequent (sub)graph is called closed if no supergraph has the same support: I closed ⇔ sG (S) ≥ smin ∧ ∀R ⊃ S : sG (R) < sG (S) • Maximal (Sub)Graph A frequent (sub)graph is called maximal if no supergraph is frequent: I maximal ⇔ sG (S) ≥ smin ∧ ∀R ⊃ S : sG (R) < smin • Obvious relations between these types of (sub)graphs: ◦ All maximal and all closed (sub)graphs are frequent. f2 ◦ f1 : 2U → 2U is a closure operator. That is. Christian Borgelt Frequent Pattern Mining 303 Christian Borgelt Frequent Pattern Mining 304 Searching for Frequent (Sub)Graphs . a graph database G iﬀ S ∈ (f2 ◦ f1)({S}) ∧ ∃ G ∈ (f2 ◦ f1)({S}) : S < G. the above deﬁnition simply says that a subgraph S is closed iﬀ ◦ it is a common subgraph of all database graphs containing it and ◦ no supergraph of it is also a common subgraph of these graphs. ◦ All maximal (sub)graphs are closed.t. • Intuitively. • The Galois connection is only needed to prove the closure operator property.Galois Connections in Frequent (Sub)Graph Mining • Since the function pair (f1. Closed (Sub)Graphs: Example example molecules (graph database) S C N C O O S C N F O S C N O frequent molecular fragments (smin = 2) ∗ 3 S (empty graph) O C N 3 O S 3 S C 3 C O 3 C N • The generalization to a Galois connection takes formally care of the problem that the greatest common subgraph may not be uniquely determined. a subgraph S is closed if it is one of the greatest common subgraphs of all database graphs containing it. Christian Borgelt Frequent Pattern Mining 301 2 O S C 3 S C N 2 S C O 3 N C O 2 The numbers below the subgraphs state their support. • This closure operator can be used to deﬁne the closed (sub)graphs: A subgraph S is closed w.r.
example molecules: S *3 O 3 Basic Search Principle • Grow (sub)graphs into the graphs of the given database. but only 4 closed (sub)graphs. ◦ Determine the support and prune infrequent (sub)graphs. • Depthﬁrst search is usually preferable. example molecules: F S O S * O S C C C O N C N Frequent (Sub)Graphs The frequent (sub)graphs form a partially ordered subset at the top. • The empty graph is (formally) contained in all subgraphs. • There is usually no (natural) unique largest graph. • The subgraph (isomorphism) relationship deﬁnes a partial order on subgraphs. • Main problem: A (sub)graph can be grown in several diﬀerent ways. ◦ Add an edge (and maybe a vertex) in each step. C 3 N 3 · · · · S S C S C O S C N C O O S C N F O S C N O 3 S C N O S C N O S C N O S C N O * S O S C C C O N C N O S 2 S C 3 C O 2 C N 3 O C O N C O O S C 2 S C N 3 S C O 2 O C N 2 C C N S C N S C N S C O O C N O S C N 2 S C N O2 C C N N C O S C N O etc. • Standard search strategies: breadthﬁrst and depthﬁrst. (8 more possibilities) 307 Christian Borgelt Christian Borgelt Frequent Pattern Mining Frequent Pattern Mining 308 . since the search tree can be very wide. ◦ Start with a single vertex (seed vertex). example molecules: smin = 2 S C N C O C N C O S F F 1 *3 S 3 O 3 C 3 N 3 S C N C O O S F F S F S 1 O S 2 S C 3 C O 2 C N 3 F S C O S C S C N S C O O C N F S C 1 O S C 2 S C N 3 S C O 2 O C N 2 C N C 1 O S C N F O S C N O O S C F S C N F O S C N F O S C N O S C O O S C N O S C N O S C N C S C N C O C N C O O S C N F O S C N O 1 O S C F 1 S C N F 1 O S C N F 1 O S C 2 N O S C 1 O O S C N 1 O S C N O2 S C N 1 C S C N C O1 C N C O1 Christian Borgelt Frequent Pattern Mining 305 Christian Borgelt Frequent Pattern Mining 306 Closed and Maximal Frequent (Sub)Graphs Partially ordered subset of frequent (sub)graphs. • Therefore: the partially ordered set should be searched topdown. • Closed frequent (sub)graphs are encircled. • The two closed (sub)graphs at the bottom are also maximal.Partially Ordered Set of Subgraphs Hasse diagram ranging from the empty graph to the database graphs. • There are 14 frequent (sub)graphs.
then pc(S) is connected. ◦ Recursively process all vertex attributes that are frequent. Hasse diagram and a possible tree for ﬁve items: Searching for Frequent (Sub)Graphs • We have to search the partially ordered set of (connected) (sub)graphs ranging from the empty graph to the database graphs. otherwise discard R. • Assigning unique parents turns corresponding Hasse diagram into a tree. • Traversing the resulting tree explores each item set exactly once. • Each possible parent contains exactly one edge less than the (sub)graph S. • Traversing the resulting tree explores each (sub)graph exactly once. the canonical parent pc(S): ◦ Let e∗ be the last edge in the order that is not a proper bridge.Reminder: Searching for Frequent Item Sets • We have to search the partially ordered set (2B . as e∗ is not a proper bridge. we can easily single out a unique parent. either a leaf bridge or no bridge). so that we can decide which isolated node to remove.e. we also need an order of the nodes. ◦ Note: if S is connected. process R recursively. the possible parents of S are its maximal proper subgraphs. Christian Borgelt Frequent Pattern Mining 312 . • If we can deﬁne an order on the edges of the (sub)graph S. • Assigning unique parents turns the Hasse diagram into a tree. ⊆) / its Hasse diagram. ◦ If e∗ is a leaf bridge. we also have to remove the created isolated node. (i. • Recursive Processing: For a given frequent (sub)graph S: ◦ Generate all extensions R of S by an edge or by an edge and a vertex (if the vertex is not yet in S) for which S is the chosen unique parent. the set of all possible parents of a (connected) (sub)graph S is P (S) = {R ∈ C(S)  ∃ U ∈ C(S) : R ⊂ U ⊂ S}. ◦ For all R: if R is frequent. • Questions: ◦ How can we formally assign unique parents? ◦ (How) Can we make sure that we generate only those extensions for which the (sub)graph that is extended is the chosen unique parent? Christian Borgelt Frequent Pattern Mining 311 Assigning Unique Parents • Formally. ◦ If e∗ is the only edge of S. ◦ The canonical parent pc(S) is the graph S without the edge e∗. Subgraph Hasse diagram and a possible tree: a b c d e a b c d e * F S O S O S C C C O N C N F F S S O S * O S C C C O N C N ab ac ad ae bc bd be cd ce de ab ac ad ae bc bd be cd ce de F S abc abd abe acd ace ade bcd bce bde cde abc abd abe acd ace ade bcd bce bde cde O S F F S C O S C S C N S C O O C N C N C O S F F S C O S C S C N S C O O C N C N C abcd abce abde acde bcde abcd abce abde acde bcde O S C F S C N F O S C N F O S C N O S C O O S C N O S C N O S C N C S C N C O C N C O O S C F S C N F O S C N F O S C N O S C O O S C N O S C N O S C N C S C N C O C N C O abcde abcde Christian Borgelt Frequent Pattern Mining 309 Christian Borgelt Frequent Pattern Mining 310 Searching with Unique Parents Principle of a Search Algorithm based on Unique Parents: • Base Loop: ◦ Traverse all possible vertex attributes (their unique parent is the empty graph). In other words.
a global order on the vertex and edge attributes is not enough. The code word is a word over the alphabet A. we can deﬁne an order on these code words. • Check database graphs for extensions of known occurrences. and destination node label.) Maintain List of Occurrences • Find and record all occurrences of single node graphs. • Canonical forms for graphs are more complex than canonical forms for item sets (reminder on next slide).Assigning Unique Parents • In order to deﬁne an order of the edges of a given (sub)graph. • There are k! possible code words for an item set of size k. the set of all items. However. because the items may be listed in any order. we will rely on a canonical form of (sub)graphs. Canonical Forms of Graphs • By introducing an (arbitrary. ﬁxed order. because we have to code the connection structure. In principle. • There are two main principles for canonical forms of graphs: ◦ spanning trees Christian Borgelt Support Counting Subgraph Isomorphism Tests • Generate extensions based on global information about edges: ◦ Collect triples of source node label. ◦ One of the code words is singled out as the canonical code word. 313 Christian Borgelt Frequent Pattern Mining 314 and ◦ adjacency matrices. • The lexicographically smallest code word for an item set is the canonical code word. • Advantage: fewer extended fragments and faster support counting. Frequent Pattern Mining Reminder: Canonical Form for Item Sets • An item set is represented by a code word. ◦ The (sub)graph can be reconstructed from the code word. ◦ Each (sub)graph is described by a code word. Obviously the canonical code word lists the items in the chosen. (The database graphs may be restricted to those containing the parent. • Traverse database graphs and test whether generated extension occurs. the same general idea can be used for graphs. b. ◦ Traverse the (extendable) nodes of a given fragment and attach edges based on the collected triples. • A canonical form of a (sub)graph is a special representation of this (sub)graph. ◦ There may be multiple code words that describe the same (sub)graph. • Disadvantage: considerable memory is needed for storing the occurrences. This immediately yields the occurrences of the extended fragments. c} and a < b < c. Christian Borgelt Frequent Pattern Mining 315 Christian Borgelt Frequent Pattern Mining 316 . each letter represents an item. but ﬁxed) order of the items. ◦ It describes the graph structure and the vertex and edge labels (and thus implicitly orders the edges and vertices). edge label. Example: abc < bac < bca < cab for the item set {a. and by comparing code words lexicographically.
◦ Form the canonical code word wc(R) of each extended (sub)graph R. This is done by appending the edge description to the code word. Christian Borgelt Frequent Pattern Mining 319 Searching with the Preﬁx Property Principle of a Search Algorithm based on the Preﬁx Property: • Base Loop: ◦ Traverse all possible vertex attributes.) • The resulting list of code words is sorted lexicographically. ⇒ The longest proper preﬁx of the canonical code word of a (sub)graph S not only describes the canonical parent of S. ◦ If the edge e∗(R) as induced by wc(R) is the edge added to S to form R and R is frequent. Christian Borgelt Frequent Pattern Mining 320 .e. If it is. the description of the added edge). of course. which is the concatenation of the (sorted) edge descriptions (“characters”). the canonical code words of all child (sub)graphs that have be explored in the recursion with the exception of the last letter (that is. • Questions: ◦ How can we formally deﬁne canonical code words? ◦ Do we have to generate all possible extensions of a frequent (sub)graph? Christian Borgelt Frequent Pattern Mining 318 Canonical Forms: Preﬁx Property • Suppose the canonical form possesses the preﬁx property: Every preﬁx of a canonical code word is a canonical code word itself. • The general recursive processing scheme with canonical forms requires to construct the canonical code word of each created (sub)graph in order to decide whether it has to be processed recursively or not. otherwise discard it.Canonical Forms of Graphs: General Idea • Construct a code word that uniquely identiﬁes an (attributed or labeled) graph up to automorphisms (that is. process R recursively.) • Each possible numbering of the vertices of the graph yields a code word. symmetries). • The vertices of the graph must be numbered (endowed with unique labels). • Core problem: Vertex and edge attributes can easily be incorporated into a code word. • Recursive Processing: For a given (canonical) code word of a frequent (sub)graph: ◦ Generate all possible extensions by an edge (and a maybe a vertex). ⇒ We know the canonical code word of any (sub)graph that is processed. ◦ Check whether the extended code word is the canonical code word of the (sub)graph described by the extended code word (and. • General Recursive Processing with Canonical Forms: For a given frequent (sub)graph S: ◦ Generate all extensions R of S by a single edge or an edge and a vertex (if one vertex incident to the edge is not yet part of S). • Basic idea: The characters of the code word describe the edges of the graph. ⇒ We only have to check whether the code word that results from appending the description of the added edge to the given canonical code word is canonical. otherwise discard R. because we need to specify the vertices that are incident to an edge.) Christian Borgelt Frequent Pattern Mining 317 Searching with Canonical Forms • Let S be a (sub)graph and wc(S) its canonical code word. (Note: vertex labels need not be unique — several nodes may have the same label. Let e∗(S) be the last edge in the edge order induced by wc(S) (i. the order in which the edges are described) that is not a proper bridge. whether the described (sub)graph is frequent). one may choose the lexicographically greatest code word. (Alternatively. process the extended code word recursively. that is. • The lexicographically smallest code word is the canonical code word. the canonical code words of single vertex (sub)graphs. due to the preﬁx property. (Note that the graph can be reconstructed from such a code word. but how to describe the connection structure is not so obvious. ⇒ The edge e∗ is always the last described edge. but is its canonical code word. • With this code word we know. ◦ Recursively process each code word that describes a frequent (sub)graph.
and the labels of the incident vertices. 323 Christian Borgelt Frequent Pattern Mining 324 • There are 1 · 9 + 5 · 4 = 6 · 5 − 1 = 29 possible spanning trees for this example. ◦ describing each edge by the numbers of the vertices it connects. ◦ numbering the vertices in the order in which they are visited. Examples of spanning trees: O F O N N O F O O N N O F O O N N O F O O N N O F O O N N O Canonical Forms based on Spanning Trees • A code word describing a graph can be constructed by ◦ systematically constructing a spanning tree of the graph. it will be discussed later how additional graph properties can be exploited to improve the construction of a canonical form if the preﬁx property is not made a requirement. Canonical Forms based on Spanning Trees Christian Borgelt Frequent Pattern Mining 321 Christian Borgelt Frequent Pattern Mining 322 Spanning Trees • A (labeled) graph G is called a tree iﬀ for any pair of vertices in G there exists exactly one path connecting them in G. and ◦ listing the edge descriptions in the order in which the edges are visited. • Disadvantages of the Preﬁx Property: ◦ One has reduced freedom in the deﬁnition of a canonical form. • A spanning tree of a (labeled) connected graph G is a subgraph S of G that ◦ is a tree and ◦ comprises all vertices of G (that is. also applicable. the edge label. Christian Borgelt Frequent Pattern Mining . because both rings have to be cut open. ◦ The preﬁx property usually allows us to easily ﬁnd simple rules to restrict the extensions that need to be generated. in principle.) • The most common ways of constructing a spanning tree are: ◦ depthﬁrst search ⇒ gSpan [Yan and Han 2002] ◦ breadthﬁrst search ⇒ MoSS/MoFa [Borgelt and Berthold 2002] An alternative way is to visit all children of a vertex before proceeding in a depthﬁrst manner (can be seen as a variant of depthﬁrst search). VS = VG). (Edges closing cycles may need special treatment.The Preﬁx Property • Advantages of the Preﬁx Property: ◦ Testing whether a given code word is canonical can be simpler/faster than constructing a canonical code word from scratch. • However. This can make it impossible to exploit certain properties of a graph that can help to construct a canonical form quickly. • In the following we consider mainly canonical forms having the preﬁx property. Other systematic search schemes are.
this value. there would be a starting point and a spanning tree that yield a smaller code word. n − 1}. id ∈ {0. .Canonical Forms based on Spanning Trees • Each starting point (choice of a root) and each way to build a spanning tree systematically from a given starting point yields a diﬀerent code word. . but supported by experiments): ◦ Vertex and edge attributes should be sorted according to their frequency. is ∈ {0. the number of edges of the graph. • Since the edges are listed in the order in which they are visited during the spanning tree construction. . • Edges Closing Cycles: Edges closing cycles may be distinguished from spanning tree edges. . the attribute of an edge. index of the destination vertex of an edge. O F O N N O F O O N N O F O O N N O F O O N N O F O O N N O Canonical Forms based on Spanning Trees • An edge description consists of ◦ the indices of the source and the destination vertex (deﬁnition: the source of an edge is the vertex with the smaller index). That is in the depthﬁrst search expression is underlined is meant as a reminder that the edge descriptions have to be sorted descendingly w. 325 Christian Borgelt Frequent Pattern Mining 326 There are 12 possible starting points and several branching points.r. giving spanning tree edges absolute precedence over edges closing cycles. • Simpliﬁcation: The source attribute is needed only for the ﬁrst edge and thus can be split oﬀ from the list of edge descriptions. the attribute of a vertex. this canonical form has the preﬁx property: If a preﬁx of a canonical code word were not canonical. • Listing the edges in the order in which they are visited can often be characterized by a precedence order on the describing elements of an edge. . Christian Borgelt Frequent Pattern Mining 327 The order of the elements describing an edge reﬂects the precedence order. . . there are several hundred possible code words. (Use the canonical code word of the preﬁx graph and append the missing edge. index of the source vertex of an edge.t. • The lexicographically smallest code word is the canonical code word. n − 1}. ◦ the edge attribute.) Christian Borgelt Frequent Pattern Mining Canonical Forms: Edge Sorting Criteria • Precedence Order for Depthﬁrst Search: ◦ ◦ ◦ ◦ destination vertex index source vertex index edge attribute destination vertex attribute (ascending) (descending) (ascending) (ascending) ← Canonical Forms: Code Words From the described procedure the following code words result (regular expressions with nonterminal symbols): • DepthFirst Search: • BreadthFirst Search: where n m is id a b a (id is b a)m a (is b a id)m (or a (is id b a)m) • Precedence Order for Breadthﬁrst Search: ◦ ◦ ◦ ◦ source vertex index edge attribute destination vertex attribute destination vertex index (ascending) (ascending) (ascending) (ascending) the number of vertices of the graph. Alternative: Sort between the other edges based on the precedence rules. Christian Borgelt Frequent Pattern Mining 328 . ◦ Ascending order seems to be recommendable for the vertex attributes. ◦ the attributes of the source and the destination vertex. . • Order of individual elements (conjectures. As a consequence.
(∗ to traverse the edges of the graph ∗) x : array of vertex.r. (∗ the code word is canonical ∗) end (∗ isCanonical ∗) (∗ for a breadthﬁrst search spanning tree ∗) Christian Borgelt Frequent Pattern Mining 331 Checking for Canonical Form function rec (w: array of int.V do v. possible roots of spanning trees). ◦ If the new edge description is equal. (∗ w: code word to be tested ∗) (∗ k: current position in code word ∗) (∗ x: array of already labeled/numbered vertices ∗) (∗ n: number of labeled/numbered vertices ∗) (∗ i: index of next extendable vertex to check.. (∗ otherwise go to the next vertex ∗) . (∗ if v has a smaller label. ◦ If the new edge description is larger. x: array of vertex. (∗ ﬂag for unnumbered destination vertex ∗) r : boolean.V do begin (∗ traverse the potential root vertices ∗) if v.a = w[0] then begin (∗ if v has the same label. i < n ∗) var d : vertex. n: int.i := 0. the rest of the code word is processed recursively (code word preﬁxes are equal). 329 Christian Borgelt Frequent Pattern Mining 330 S 10N 21O 31C 43C 54O 64=O 73C 87C 80C S 0N1 0C2 1O3 1C4 2C5 4C5 4C6 6O7 6=O8 (Reminder: in A the edges are sorted descendingly w. 1. 0) then return false.i := −1. the second entry.i < 0 then return false.i := −1. the edge can be skipped (new code word is lexicographically larger).Canonical Forms: A Simple Example Checking for Canonical Form: Compare Preﬁxes 0 S N O O A 2 0 S 3 B S C C 4 2 • Base Loop: ◦ Traverse all vertices with a label no less than the current root vertex (ﬁrst character of the code word. return true. (∗ clear the edge markers ∗) forall v ∈ G. (∗ buﬀer for a recursion result ∗) begin (∗ full code word has been generated ∗) if k ≥ length(w) return true. (∗ if there is an unmarked edge. x[0] := v. (∗ clear the vertex indices ∗) forall e ∈ G. x. (∗ abort if a smaller code word is found ∗) v.. (∗ vertex at the other end of an edge ∗) j : int. ◦ If the new edge description is smaller. 1. i: int) : boolean. ∗) end. end.t. • Recursive Processing: 1 N C 3 1 N O O 4 O 6 C 5 example molecule C O 6 C 7 C O 8 O 5 C 8 O 7 depthﬁrst Order of Elements: S ≺ N ≺ O ≺ C Code Words: A: B: Order of Bonds: breadthﬁrst ≺ ◦ The recursive processing constructs alternative spanning trees and compare the code words resulting from it with the code word to check. abort. abort ∗) if v. ◦ In each recursion step one edge is added to the spanning tree and its description is compared to the corresponding one in the code word to check. check rest ∗) v.) Christian Borgelt Frequent Pattern Mining Checking for Canonical Form function isCanonical (w: array of int.E do e. var v : vertex. while i < w[k] do begin (∗ check whether there is an edge with ∗) forall e incident to x[i] do (∗ a source vertex having a smaller index ∗) if e. k : int. (∗ to collect the numbered vertices ∗) begin forall v ∈ G. G: graph) : boolean.a < w[0] then return false. (∗ index of destination vertex ∗) u : boolean. (∗ number and record the root vertex ∗) (∗ check the code word recursively and ∗) if not rec(w. (∗ to traverse the vertices of the graph ∗) e : edge. the code word is not canonical (new code word is lexicographically smaller).i := −1. i := i + 1. Christian Borgelt Frequent Pattern Mining 332 . (∗ clear the vertex index again ∗) end.
return true.i < 0 then begin (∗ traverse the unvisited incident edges ∗) [. Christian Borgelt Frequent Pattern Mining 335 Christian Borgelt Frequent Pattern Mining 336 . end.i.i < 0 then begin (∗ traverse the unvisited incident edges ∗) if e. end e. process the extended code word recursively... x. end.. forall e incident to x[i] (in sorted order) do begin if e.a > w[k + 2] then return true. n := n − 1. (∗ vertex attribute ∗) if d. (∗ edge attribute ∗) d := vertex incident to e other than x[i]. (∗ check destination ∗) if d. x[n] := d.i := −1. if j < w[k + 3] then return false.i := −1. (∗ check the ∗) if e. (∗ check recursively ∗) if u then begin d. that is.i < 0 then j := n else j := d.. If it is. end.i < 0..i := j.. (∗ mark edge and number vertex ∗) if u then begin d.a > w[k + 1] then return true. (∗ unmark edge (and vertex) again ∗) if not r then return false. (∗ check destination vertex index ∗) [. u := d.. end r := rec(w.] (∗ check rest of code word recursively. forall e incident to x[i] (in sorted order) do begin if e. end (∗ rec ∗) Christian Borgelt (∗ return that no smaller code word ∗) (∗ than w could be found ∗) Frequent Pattern Mining 333 if j = w[k + 3] then begin (∗ if edge descriptions are equal ∗) e..] (∗ check the current edge ∗) end. otherwise discard it. (∗ return that no smaller code word ∗) end (∗ rec ∗) (∗ than w could be found ∗) Christian Borgelt Frequent Pattern Mining 334 Canonical Forms: Restricted Extensions Principle of the Search Algorithm up to now: • Generate all possible extensions of a given canonical code word by the description of an edge that extends the described (sub)graph. n := n + 1. can be incident to a new edge) as well as certain edges closing cycles. Restricted Extensions Straightforward Improvement: • For some extensions of a given canonical code word it is easy to see that they will not be canonical themselves. • The trick is to check whether a spanning tree rooted at the same vertex yields a code word that is smaller than the created extended code word.a < w[k + 1] then return false. (∗ evaluate the recursion result: ∗) (∗ abort if a smaller code word was found ∗) end. • This immediately rules out edges attached to certain vertices in the (sub)graph (only certain vertices are extendable. ∗) (∗ because preﬁxes are equal ∗) Checking for Canonical Form . k + 4. • Check whether the extended code word is canonical (and the (sub)graph frequent). return true. i). if d. n.a < w[k + 2] then return false.i := 1.Checking for Canonical Form .
the attribute of its destination vertex must be no less than the attribute of the downward edge’s destination vertex. Canonical Forms: Restricted Extensions BreadthFirst Search: Maximum Source Extension • Extendable Vertices: ◦ Only vertices having an index no less than the maximum source index of edges that are already in the (sub)graph may be extended. breadthﬁrst depthﬁrst breadthﬁrst If other vertices are extended. 7. Edges Closing Cycles: A: none. because the existing cycle edge has the smallest possible source.) • Edges Closing Cycles: ◦ Edges closing cycles must start at an extendable vertex. that is. ◦ They must lead “forward”. to a vertex having a larger index than the extended vertex. a tree with the same root yields a smaller code word. ◦ If the source of the new edge is the one having the maximum source index. 3. 6. 8. 1. 0. that is. B: the edge between the vertices 7 and 8. (That is. the edge description must not precede the description of the downward edge on the path. 8.) • Edges Closing Cycles: ◦ Edges closing cycles must start at an extendable vertex. ◦ The index of the source vertex must precede the index of the source vertex of any edge already incident to the rightmost leaf. (That is. the edge attribute must be no less than the edge attribute of the downward edge. 7. and if it is equal. that is. ◦ If the source vertex of the new edge is not a leaf.Canonical Forms: Restricted Extensions DepthFirst Search: Rightmost Path Extension • Extendable Vertices: ◦ Only vertices on the rightmost path of the spanning tree may be extended. Christian Borgelt Frequent Pattern Mining 339 Christian Borgelt . ◦ They must lead to the rightmost leaf (vertex at end of rightmost path). Christian Borgelt Frequent Pattern Mining 337 Christian Borgelt Frequent Pattern Mining 338 Restricted Extensions: A Simple Example Restricted Extensions: A Simple Example S N O O A 2 0 S 3 B 0 S C C 4 2 1 N C 3 1 N O O 4 O 6 C 5 S N O O A 2 0 S 3 B 0 S C C 4 2 1 N C 3 1 N O O 4 O 6 C 5 example molecule C O 6 C 7 C O 8 O 5 C 8 O 7 example molecule C O 6 C 7 C O 8 O 5 C 8 O 7 depthﬁrst Extendable Vertices: A: vertices on the rightmost path. the edge attribute must be no less. it may be extended only by edges whose descriptions do not precede the description of any downward edge already incident to this vertex. and if it is equal. the attribute of the destination vertex must be no less. Example: A: B: attach a single bond to a carbon atom at the leftmost oxygen atom S 10N 21O 31C 43C 54O 64=O 73C 87C 80C 92C S 10N 21O 32C · · · S 0N1 0C2 1O3 1C4 2C5 4C5 4C6 6O7 6=O8 3C9 S 0N1 0C2 1O3 1C4 2C5 3C6 · · · Frequent Pattern Mining 340 B: vertices with an index no smaller than the maximum source.
Christian Borgelt Frequent Pattern Mining 341 Example Search Tree • Start with a single vertex (seed vertex). example molecules: S C N C O O S C N F O S C N O search tree for seed S: S 1F O S 2C O S C N2 O 1S C N O O1S C O 3 S3 S C3 S C N S C N O S 2O S C O 2 S C 1N C S C N 1C O 2 S≺F≺N≺C≺O ≺= breadthﬁrst search canonical form Frequent Pattern Mining Christian Borgelt 342 Searching without a Seed Atom breadthﬁrst search canonical form N * S≺N≺O≺C ≺= S O C S C N C O C O C C C S C C N C C 12 S C C N O C C 7 O C O O C C 5 O C O C C C Comparison of Canonical Forms (depthﬁrst versus breadthﬁrst spanning tree construction) S C C C S C C C N 3 S C C C O S C C C O S C C C O O O N C C O C O O N C C O C S O N C C O C O clycin cystein serin • Chemical elements processed on the left are excluded on the right. • Determine the support and prune infrequent (sub)graphs. a canonical form test is still necessary. • As a consequence. then the resulting code word may or may not be canonical. • Check for canonical form and prune (sub)graphs with noncanonical code words. then the resulting code word is certainly not canonical. Christian Borgelt Frequent Pattern Mining 343 Christian Borgelt Frequent Pattern Mining 344 . • Depthﬁrst search canonical form ◦ If the extension edge is not a rightmost path extension. then the resulting code word is certainly not canonical. • Breadthﬁrst search canonical form ◦ If the extension edge is not a maximum source extension.Canonical Forms: Restricted Extensions • The rules underlying restricted extensions provide a onesided answer to the questions whether an extension yields a canonical code word. ◦ If the extension edge is a maximum source extension. then the resulting code word may or may not be canonical. ◦ If the extension edge is a rightmost path extension. • Add an edge (and maybe a vertex) in each step (restricted extensions).
• The experimental results reported in the following indicate that it may depend on the data set which canonical form performs better. ◦ Medium number of fragments and closed fragments. widths and depths of the search tree. still large number of closed fragments. to describe the vertex range. namely the index of the maximum edge source. which form leads to the “better” (more eﬃcient) structure of the search tree.Canonical Forms: Comparison DepthFirst vs. Christian Borgelt Frequent Pattern Mining 345 N C O C C C N noncanonical: 3 O noncanonical: 6 346 Christian Borgelt Frequent Pattern Mining Advantage for Rightmost Path Extensions Generate all substructures (that contain nitrogen) of the example molecule: (N ≺ C) Experiments: Data Sets • Index Chemicus — Subset of 1993 ◦ 1293 molecules / 34431 atoms / 36594 bonds ◦ Frequent fragments down to fairly low support values are trees (no rings). only one of which is canonical). N≺O≺C Rightmost Path Extension: N N • The two canonical forms obviously lead to diﬀerent branching factors. C N C C C N C C C C C 3 C 3 N C C C 4 5 C 3 N C C C 5 4 C noncanonical: 3 N C C C 5 noncanonical: 1 347 Christian Borgelt Frequent Pattern Mining 348 Christian Borgelt Frequent Pattern Mining . However. N≺C Search Trees with Maximum Source Extension: N C N C C N C C C C N C C C 5 4 Rightmost Path Extension: N C C N N C N C N C C C N N C N C N C C C C ◦ 17 molecules / 401 atoms / 456 bonds ◦ A large part of the frequent fragments contain one or more rings. Advantage for Maximum Source Extensions Generate all substructures (that contain nitrogen) of the example molecule: Search Trees with Maximum Source Extension: N N C N C C O N C C O N C C O N C O O C N C O O C C C C C O C C C O N C C C C N C O O N C C O C O C C C O C C C C N C O C C C O N C C C N C O C C C C C N C C C O C C C O N C O N C O C C O N C C C N C C C N C C C O C O N C C C O C C C N C C C C C N C O O N C N C C C N C N C O N C C C N C C C O N C C C N C C N C C N C N C C O N C C C N C N C C N C N C N C O O C C C Problem: The two branches emanating from the nitrogen atom start identically. • Also the check for canonical form is slightly more complex (to program) for depthﬁrst search canonical form. it is not immediately clear. BreadthFirst Search Canonical Form • With breadthﬁrst search canonical form the extendable vertices are much easier to traverse. Thus rightmost path extensions try the right branch over and over again. • Steroids N C C C C Problem: The ring of carbon atoms can be closed between any two branches (three ways of building the fragment. C C C C C C C N C N C C C N C C C C C C C C N C ◦ Huge number of fragments. as they always have consecutive indices: One only has to store and update one number.
Steroids Data Set O O O O O O O O O O O Br O O N O O O F O O O O O O N O O O O O O O O O O Experiments: IC93 Data Set 15 10 5 0 3 3.5 5 5.5 6 3 3. The horizontal axis shows the absolute minimal support.5 6 Experimental results on the IC93 data. The curves show the number of generated and processed fragments (top left). Christian Borgelt Frequent Pattern Mining 351 Christian Borgelt Frequent Pattern Mining 352 .5 fragments/104 breadthﬁrst depthﬁrst processed 14 12 10 8 6 4 occurences/106 breadthﬁrst depthﬁrst 5 5. and the execution time in seconds (bottom left) for the two canonical forms/extension strategies.5 6 20 O O O O O O O O time/seconds breadthﬁrst depthﬁrst 15 10 O O O O O O O O O 5 3 3. The curves show the number of generated and processed fragments (top left). number of processed occurrences (top right). The horizontal axis shows the minimal support in percent. number of processed occurrences (top right). Christian Borgelt Frequent Pattern Mining 349 Christian Borgelt Frequent Pattern Mining 350 Experiments: Steroids Data Set 15 fragments/105 breadthﬁrst depthﬁrst processed 12 10 8 occurrences/106 breadthﬁrst depthﬁrst 10 5 6 2 35 30 25 20 15 10 2 3 4 5 6 7 8 2 3 4 5 6 7 8 Equivalent Sibling Pruning time/seconds breadthﬁrst depthﬁrst 3 4 5 6 7 8 Experimental results on the steroids data.5 4 4.5 4 4. and the execution time in seconds (bottom left) for the two canonical forms/extension strategies.5 5 5.5 4 4.
Alternative Test: Equivalent Siblings
• Basic Idea: ◦ If the (sub)graph to extend exhibits a certain symmetry, several extensions may be equivalent (in the sense that they describe the same (sub)graph). ◦ At most one of these sibling extensions can be in canonical form, namely the one least restricting future extensions (lex. smallest code word). ◦ Identify equivalent siblings and keep only the maximally extendable one. • Test Procedure for Equivalence: ◦ Get any graph in which two sibling (sub)graphs to compare occur. (If there is no such graph, the siblings are not equivalent.) ◦ Mark any occurrence of the ﬁrst (sub)graph in the graph. ◦ Traverse all occurrences of the second (sub)graph in the graph and check whether all edges of an occurrence are marked. If there is such an occurrence, the two (sub)graphs are equivalent.
Alternative Test: Equivalent Siblings
If siblings in the search tree are equivalent, only the one with the least restrictions needs to be processed. Example: Mining phenol, pcresol, and catechol.
C C O C C C C C C O C C C C C O C C O C C C C
Consider extensions of a 6bond carbon ring (twelve possible occurrences):
C C O C C3 0 C C
1 2 5 4
C C O C C4 1 C C
2 3
0
5
C C O C C5 2 C C
3 4
1
0
C C O C C4 1 C C
0 5
2
3
Only the (sub)graph that least restricts future extensions (i.e., that has the lexicographically smallest code word) can be in canonical form.
Use depthﬁrst canonical form (rightmost path extensions) and C ≺ O.
Christian Borgelt
Frequent Pattern Mining
353
Christian Borgelt
Frequent Pattern Mining
354
Alternative Test: Equivalent Siblings
• Test for Equivalent Siblings before Test for Canonical Form ◦ Traverse the sibling extensions and compare each pair. ◦ Of two equivalent siblings remove the one that restricts future extensions more. • Advantages: ◦ Identiﬁes some code words that are noncanonical in a simple way. ◦ Test of two siblings is at most linear in the number of edges and at most linear in the number of occurrences. • Disadvantages: ◦ Does not identify all noncanonical code words, therefore a subsequent canonical form test is still needed. ◦ Compares two sibling (sub)graphs, therefore it is quadratic in the number of siblings.
Christian Borgelt Frequent Pattern Mining 355 Christian Borgelt
Alternative Test: Equivalent Siblings
The eﬀectiveness of equivalent sibling pruning depends on the canonical form: Mining the IC93 data with 4% minimal support depthﬁrst breadthﬁrst 156 ( 1.9%) 4195 (83.7%) 7988 (98.1%) 815 (16.3%) 8144 5010 2002 2002
equivalent sibling pruning canonical form pruning total pruning (closed) (sub)graphs found
Mining the steroids data with minimal support 6 depthﬁrst equivalent sibling pruning 15327 ( 7.2%) canonical form pruning 197449 (92.8%) total pruning 212776 (closed) (sub)graphs found 1420 breadthﬁrst 152562 (54.6%) 127026 (45.4%) 279588 1420
Frequent Pattern Mining
356
Alternative Test: Equivalent Siblings
Observations: • Depthﬁrst form generates more duplicate (sub)graphs on the IC93 data and fewer duplicate (sub)graphs on the steroids data (as seen before). • There are only very few equivalent siblings with depthﬁrst form on both the IC93 data and the steroids data. (Conjecture: equivalent siblings result from “rotated” tree branches, which are less likely to be siblings with depthﬁrst form.) • With breadthﬁrst search canonical form a large part of the (sub)graphs that are not generated in canonical form (with a canonical code word) can be ﬁltered out with equivalent sibling pruning. • On the test IC93 data no diﬀerence in speed could be observed, presumably because pruning takes only a small part of the total time. • On the steroids data, however, equivalent sibling pruning yields a slight speedup for breadthﬁrst form (∼ 5%).
Christian Borgelt Frequent Pattern Mining 357 Christian Borgelt Frequent Pattern Mining 358
Canonical Forms based on Adjacency Matrices
Adjacency Matrices
• A (normal, that is, unlabeled) graph can be described by an adjacency matrix: ◦ A graph G with n vertices is described by an n × n matrix A = (aij ). ◦ Given a numbering of the vertices (from 1 to n), each vertex is associated with the row and column corresponding to its number. ◦ A matrix element aij is 1 if there exists an edge between the vertices with numbers i and j and 0 otherwise. • Adjacency matrices are not unique: Diﬀerent numberings of the vertices lead to diﬀerent adjacency matrices.
3
Extended Adjacency Matrices
• A labeled graph can be described by an extended adjacency matrix: ◦ If there is an edge between the vertices with numbers i and j the matrix element aij contains the label of this edge and the special label × (the empty label) otherwise. ◦ There is an additional column containing the vertex labels. • Of course, extended adjacency matrices are also not unique:
1 2 3 4 5 6 7 8 9
C
1 2 3 4 5 6 7 8 9
6
5 4 1 3 2
1 2 3 4 5
1 0 1 0 1 0
2 1 0 1 1 0
3 0 1 0 1 1
4 1 1 1 0 0
5 0 0 1 0 0
1 3 5 2 4
1 2 3 4 5
1 0 1 0 0 0
2 1 0 1 1 0
3 0 1 0 1 1
4 0 1 1 0 1
5 0 0 1 1 0
S C N C5 9 7C O O 4 O8
2
1
6
1 2 3 4 5 6 7 8 9
S N C O C C C O O
S C N C1 9 3C O O 7 O8
2
5
C
4
1 2 3 4 5 6 7 8 9
C N C C S C O O O
Christian Borgelt
Frequent Pattern Mining
359
Christian Borgelt
Frequent Pattern Mining
360
From Adjacency Matrices to Code Words
• An (extended) adjacency matrix can be turned into a code word by simply listing its elements row by row. • Since for undirected graphs the adjacency matrix is necessarily symmetric, it suﬃces to list the elements of the upper (or lower) triangle. • For sparse graphs (few edges) listing only column/label pairs can advantageous, because this reduces the code word length.
3
From Adjacency Matrices to Code Words
• With an (arbitrary, but ﬁxed) order on the label set A (and deﬁning that integer numbers, which are ordered in the usual way, precede all labels), code words can be compared lexicographically: (S ≺ N ≺ O ≺ C ;  ≺ =)
3 6
C 6 S23N45C6OC67CC89=OO S C 2N C5 < 9 7C O O C234N57C89=C6S6COOO 4 O8
1
C 4 S C 2N C1 9 3C O O 7 O8
5
C 6 S C 2N C5 9 7C O O 4 O8
1
1 2 3 4 5 6 7 8 9 1 2 3 4 5 6 7 8 9 S N C O C C C O O
Regular expression (nonterminals): (a ( ic b )∗ )n
code word: S 2 3N 4 5C 6O C 6 7C C 8 9= O O
• As for canonical forms based on spanning trees, we then deﬁne the lexicographically smallest (or largest) code word as the canonical code word. • Note that adjacency matrices allow for a much larger number of code words, because any numbering of the vertices is acceptable. For canonical forms based on spanning trees, the vertex numbering must be compatible with a (speciﬁc) construction of a spanning tree.
361 Christian Borgelt Frequent Pattern Mining 362
Christian Borgelt
Frequent Pattern Mining
From Adjacency Matrices to Code Words
• There is a variety of other ways in which an adjacency matrix may be turned into a code word:
1 2 3 4 5 6 7 8 9
3
S C N C5 9 7C O O 4 O8
2
1
C
6
1 2 3 4 5 6 7 8 9
S N C O C C C O O
lower triangle: S N 1C 1O 2C 2C 3 5C 5O 6 7O 7=
columnwise: SNCOCCCOO 11223 5577=
Exploiting Vertex Signatures
(Note that the columnwise listing needs a separator character “”.)
• However, the rowwise listing restricted to the upper triangle (as used before) has the advantage that it has a property analogous to the preﬁx property. In contrast to this, the two forms shown above do not have this property.
Christian Borgelt Frequent Pattern Mining 363 Christian Borgelt Frequent Pattern Mining 364
and C. ◦ As a consequence. which make it necessary to apply a backtracking algorithm. • Equivalence classes with more than one vertex are then processed by appending the (sorted) labels of the incident edges to the vertex signature. ◦ If there is more than one vertex with a certain label. ◦ Continue with the vertices that are two edges away and so on.t. Canonical Form and Vertex and Edge Labels • The complexity of constructing a canonical code word is caused by equal edge and vertex labels. the vertex signatures O and C need to be extended in order to split the corresponding equivalence classes. add a (sorted) list of the labels of the incident edges. • In a second step the (sorted) signatures of the adjacent vertices are appended. This can be achieved by constructing a “local code word” (vertex signature): ◦ Start with the label of the vertex. ◦ If there is more than one vertex with the same list. constructing the canonical code word is straightforward. the connection structure) to distinguish vertices/edges with the same label? • Idea: Describe how the (sub)graph under consideration “looks from a vertex”. Step 1 6 • The initial vertex signatures are simply the vertex labels. The vertex set is then repartitioned based on the extended vertex signature. 365 Christian Borgelt Frequent Pattern Mining 366 Christian Borgelt Frequent Pattern Mining Constructing Vertex Signatures The process of constructing vertex signatures is best described as an iterative subdivision of equivalence classes: • The initial signature of each vertex is simply its label. • However. • Intuitive explanation with an extreme example: Suppose that all vertices of a given (sub)graph have diﬀerent labels. • There are four equivalence classes: S. • The process stops when no replacement splits an equivalence class.Canonical Form and Vertex and Edge Labels • Vertex and edge labels help considerably to construct a canonical code word or to check whether a given code word is canonical: Canonical form check or construction are usually (much) slower/more diﬃcult for unlabeled graphs or graphs with few diﬀerent vertex and edge labels. Christian Borgelt Frequent Pattern Mining 367 Constructing Vertex Signatures 3 S C 2N C5 9 7C O O 4 O8 1 C Vertex Signatures. • The vertex set is split into equivalence classes based on the initial vertex signature (that is. • The reason is that with vertex and edge labels constructed code word preﬁxes may already allow us to make a decision between (sets of) code words. the vertex labels). vertex 1 2 4 8 9 3 6 5 7 signature S N O O O C C C C Christian Borgelt Frequent Pattern Mining 368 .r. O. because they already contain only a single vertex. ◦ The order of each vertex’s neighbors in the canonical form is determined at least by the vertex labels (but maybe also by the edge labels). add a (sorted) list of the lists of the adjacent vertices. • In subsequent steps these signatures of adjacent vertices are replaced by the updated vertex signatures. N. the chosen order). • Question: Can we exploit graph properties (that is. • The equivalence classes S and N need not further processing. Then: ◦ The root/ﬁrst row vertex is uniquely determined: it is the vertex with the smallest label (w.
because two is incident to a single bond. • With this unique vertex labeling.N O . Note that we have always v ∈ o(v). because the identity is always in Fauto(G). • As a result. • Only the signatures of carbons 3 and 6 and the signatures of oxygens 4 and 9 need to be extended further. [McKay 1981] Christian Borgelt Frequent Pattern Mining 372 .C . Step 2 6 1 3 C Vertex Signatures.r. oxygen 4 is incident to a nitrogen atom). though. Step 3 6 • The vertex signatures of the classes that contain more than one vertex are extended by the sorted list of labels of the incident edges.Constructing Vertex Signatures Constructing Vertex Signatures 3 S C 2N C5 9 7C O O 4 O8 1 C Vertex Signatures. because the graph “looks the same” from all of them. for neither graph the equivalence classes can be reduced to single vertices. the other to a double bond.C = O = C . generators of the group of automorphisms can be derived. • It also distinguishes most carbon atoms. • The reason is that both graphs possess automorphisms other then the identity. Fauto(G) is the set o(v) = {u ∈ VG  ∃f ∈ Fauto(G) : u = f (v)}. • This distinguishes the three oxygen atoms. detect automorphisms (vertex numberings leading to the same code word).t. • However. There is also a large variety of other graph properties that may be used. thus speeding up the search. These generators can then be used to avoid exploring implied automorphisms. because they have diﬀerent sets of incident edges. • This distinguishes the two pairs (carbon 3 is adjacent to a sulfur atom. Example: For the following two (unlabeled) graphs such vertex signatures cannot split the sole equivalence class: Automorphism Groups • Let Fauto(G) be the set of all automorphisms of a (labeled) graph G. constructing a canonical code word becomes very simple and eﬃcient. The orbit of a vertex v ∈ VG w. Frequent Pattern Mining 369 S C 2N C5 9 7C O O 4 O8 • The vertex signatures of carbons 3 and 6 and of oxygens 4 and 9 are extended by the sorted list of vertex signatures of the adjacent vertices. For the left graph it is not even possible at all to split the equivalence class. Frequent Pattern Mining 370 vertex 1 2 4 8 9 3 6 5 7 signature S N O O O = C C C C = vertex 1 2 4 8 9 3 6 5 7 signature S N O .C C C = Christian Borgelt Christian Borgelt Elements of Vertex Signatures • Using only (sorted) lists of labels of incident edges and adjacent vertices cannot always distinguish all vertices. ◦ From found automorphisms. if the number of adjacent vertices that are adjacent is incorporated into the vertex signature. • The equivalence class can be split for the right graph. all equivalence classes contain only a single vertex and thus we obtained a unique vertex labeling.S C C . one can exploit that the automorphisms Fauto(G) of a graph G form a group (the automorphism group of G): ◦ During the construction of a canonical code word. Christian Borgelt Frequent Pattern Mining 371 • In order to deal with orbits. • The vertices in an orbit cannot possibly be distinguished by vertex signatures.
(The gains are usually particularly large for graphs with few/no labels. (Basic idea: combine the vertex and edge attributes and the vertex degrees.) ◦ As a consequence it can be diﬃcult to ensure that the resulting canonical form has the preﬁx property. [Borgelt and Fiedler 2007] ◦ Whenever a new (sub)graph is created.Canonical Form and Vertex Signatures • Advantages of Vertex Signatures: ◦ Vertices with the same label can be distinguished in a preprocessing step.) Repository of Processed Fragments • Each (sub)graph should be stored using a minimal amount of memory (since the number of processed (sub)graphs is usually huge). the procedure of equivalent sibling pruning). ◦ Try to avoid a full isomorphism test with a hash table: Employ a hash function that is computed from local graph properties. (In some experiments. ◦ Constructing canonical code words can thus become much easier/faster. ◦ Store a (sub)graph by listing the edges of one occurrence. (Vertices with diﬀerent signatures in a subgraph may have the same signature in a supergraph and vice versa. • The obvious alternative. ◦ Actual isomorphism test: mark stored occurrence and check for fully marked new occurrence (cf. 375 Christian Borgelt Frequent Pattern Mining 376 Christian Borgelt Frequent Pattern Mining . the repository is accessed. of course. a repository of processed (sub)graphs. (Note that for connected (sub)graphs the edges also identify all vertices.) ◦ If an isomorphism test is necessary. inserted into the repository.) • Disadvantages of Vertex Signatures: ◦ Vertex signatures can refer to the graph as a whole and thus may be diﬀerent for subgraphs. ◦ If it contains the (sub)graph. has received fairly little attention. • If the repository is laid out as a hash table with a carefully designed hash function. the repositorybased approach could outperform canonical form pruning by 15%. because the necessary backtracking can often be reduced considerably. Christian Borgelt Frequent Pattern Mining 373 Christian Borgelt Frequent Pattern Mining 374 Repository of Processed Fragments Repository of Processed Fragments • Canonical form pruning is the predominant method to avoid redundant search in frequent (sub)graph mining. do quick checks ﬁrst: number of vertices. number of edges. ﬁrst containing database graph etc.) • The containment test has to be made as fast as possible (since it will be carried out frequently). we know that it has already been processed and therefore it can be discarded. ◦ Only (sub)graphs that are not contained in the repository are extended and. In such a case one may not be able to restrict (sub)graph extensions or to use the simpliﬁed search scheme (only code word checks). it is competitive with canonical form pruning.
(→ fastest for fragments that have to be processed) • Disadvantages of Repositorybased Pruning Multiple isomorphism tests may be necessary for a processed fragment. Repository: Execution Times 80 60 40 20 time/seconds canon.5 4 4.5 3 3.5 4 4. performance of repositorybased pruning. • Left: maximum source extensions • Right: rightmost path extensions • Experimental results on the IC93 data set.5 5 5. Needs far more memory than canonical form pruning. tests processed duplicates 80 60 40 20 0 subgraphs/10 000 generated dupl. tests processed duplicates 80 60 40 20 0 subgraphs/10 000 generated accesses isom.5 5 5. search time in seconds (vertical axis) versus minimum support in percent (horizontal axis).5 6 2 2.5 6 2 2.5 5 5.5 3 3. A repository very diﬃcult to use in a parallel algorithm.Canonical Form Pruning versus Repository • Advantage of Canonical Form Pruning Only one test (for canonical form) is needed in order to determine whether a (sub)graph needs to be processed or not. form repository 80 60 40 20 time/seconds canon. Repository: Numbers of (Sub)Graphs Repository Performance 80 60 40 20 0 2 2.5 5 5.5 4 4.5 6 • Experimental results on the IC93 data set.5 6 • Experimental results on the IC93 data set.5 6 2 2.5 5 5. tests duplicates 80 60 40 20 0 subgraphs/10 000 generated accesses isom.5 3 3.5 6 2 2.5 3 3. • Left: maximum source extensions • Right: rightmost path extensions Christian Borgelt Frequent Pattern Mining 377 Christian Borgelt Frequent Pattern Mining 378 Canonical Form vs.5 4 subgraphs/10 000 generated dupl.5 5 5. (→ slowest for fragments that have to be processed) • Advantage of Repositorybased Pruning Often allows to decide very quickly that a (sub)graph has not been processed.5 3 3.5 4 4. form repository 2 2.5 3 3. • Left: maximum source extensions • Right: rightmost path extensions Christian Borgelt Frequent Pattern Mining 379 Christian Borgelt Frequent Pattern Mining 380 .5 4 4. • Disadvantage of Canonical Form Pruning It is most costly for the (sub)graphs that are created in canonical form. Canonical Form vs. tests duplicates 4. numbers of subgraphs used in the search.
The same basic idea can also be used for graphs. I is not closed. Partial Perfect Extension Pruning • Basic idea of perfect extension pruning: First grow a fragment to the biggest common substructure. since KT (J) ⊆ KT (I). [Yan and Han 2003] example molecules: S C N C O O S C N F O S C N O search tree for seed S: S 1F O S 2C O S C N2 O 1S C N O O1S C O S3 S C3 S C N3 S C N 2 O S 2O S C O 2 O C S C 1+3 embs. their code word).t. no fragment in a subtree to the right of a perfect extension branch can be closed.) N O C S C O O C S C N O C S C C S C N 1+1 embs. • Consequence: It may be necessary to check whether all occurrences of the base fragment lead to the same number of extended occurrences. This can most easily be seen by considering that KT (I) ⊆ KT ({a}) and hence KT (J) ⊆ KT ({a}). Christian Borgelt Frequent Pattern Mining 383 S C 1N C S C N 1C O S≺F≺N≺C≺O ≺= breadthﬁrst search canonical form Frequent Pattern Mining 384 Christian Borgelt . • Suppose that during the search we discover that sT (I ∪ {a}) = sT (I) Perfect Extension Pruning for some item set I and some item a ∈ I.r. Hence a can be added directly to the preﬁx of the conditional database. However. / • As a consequence. we need that a perfect extension of a graph fragment is also a perfect extension of any supergraph of this fragment. • Attention: It may not be enough to compare the support and the number of occurrences of the graph fragment.Reminder: Perfect Extension Pruning for Item Sets • If only closed item sets or only maximal item sets are to be found. • Partial perfect extension pruning: If the children of a search tree vertex are ordered lexicographically (w. (That is. 2+2 embs. Neither is a single bond to nitrogen a perfect extension of OCSC nor is a single bond to oxygen a perfect extension of NCSC. (Even though perfect extensions must have the same support and an integer multiple of the number of occurrences of the base fragment. Then we know ∀J ⊇ I : sT (J ∪ {a}) = sT (J). no superset J ⊇ I with a ∈ J can be closed.) / We call the item a a perfect extension of I. Christian Borgelt Frequent Pattern Mining 381 Christian Borgelt Frequent Pattern Mining 382 Perfect Extensions • An extension of a graph (fragment) is called perfect. additional pruning of the search tree becomes possible. if it can be applied to all of its occurrences in exactly the same way. but needs modiﬁcations.
Fix the latter.≺ = Christian Borgelt Frequent Pattern Mining 388 . • Problem: This pruning method interferes with canonical form pruning. S ≺ N ≺ C ≺ O. Base fragment: SCN Extension to OSCN Shift extension Renumber vertices canonical code: S 0C1 (noncanonical!) code: S 0C1 (invalid) code: S 0C1 canonical code: S 0C1 1N2 1N2 0O3 0O2 0O3 1N2 1N3 • Rebuild the code word from the front: • A new edge description is usually appended at the end of the code word. Some can be checked by simple rules (rightmost path/max. 3. • The code word of a fragment consists of two parts: ◦ a preﬁx ending with the last nonperfect extension edge and ◦ a (possibly empty) suﬃx of perfect extension edges.Full Perfect Extension Pruning • Full perfect extension pruning: [Borgelt and Meinl 2006] Also prune the branches to the left of the perfect extension branch. (In particular. because the extensions in the left siblings cannot be repeated in the perfect extension branch (restricted extensions. This is still the standard procedure is the suﬃx is empty. “simple rules” for canonical form).) • Rather than to actually shift and modify edge description. ◦ Compare two possible code word preﬁxes: S 0O1 and S 0C1. with the search tree on the previous slide it is assigned S 0C1 1N2 0O3. 4. 2. since it is lexicographically smaller.) Christian Borgelt Frequent Pattern Mining 387 ◦ The root vertex (here the sulfur atom) is always in the ﬁxed part. the description of the new edge may be inserted into the suﬃx or even moved directly before the suﬃx. 385 Christian Borgelt Frequent Pattern Mining 386 S C 1N C S C N 1C O S≺F≺N≺C≺O ≺= breadthﬁrst search canonical form Frequent Pattern Mining Christian Borgelt Code Word Reorganization • In order to obtain a proper code. example molecules: S C N C O O S C N F O S C N O Code Word Reorganization • Restricted extensions: Not all extensions of a fragment are allowed by the canonical form. renumbering the vertices is easier. ◦ However. that is. the restrictions on extensions must be mitigated. 0 (zero). since it is lexicographically smaller. However. (Whichever possibility yields the lexicographically smallest code word. Code Word Reorganization: Example • Shift an extension to the proper place and renumber the vertices: 1. • Example: The core problem of obtaining the search tree on the previous slide is how we can avoid that the fragment OSCN is pruned as noncanonical: S 2O S C O 2 search tree for seed S: S 1F O S 2C O S C N2 O 1S C N O S3 S C3 S C N3 S C N 2 O ◦ The breadthﬁrst search canonical code word for this fragment is S 0C1 0O2 1N3. if the suﬃx is not empty. • Consequence: In order to make canonical form pruning and full perfect extension pruning compatible. • Solution: Deviate from appending the description of a new edge. Allow for a (strictly limited) code word reorganization. it must be possible to shift descriptions of new edges past descriptions of perfect extension edges in the code word. ◦ Compare the code word preﬁxes S 0C1 0O2 and S 0C1 1N2 Fix the former. . it is technically easier to rebuild the code word from the front. ◦ Append the remaining perfect extension edge: S 0C1 0O2 1N3 breadthﬁrst search canonical form. It receives the initial vertex index. source extension).
the number of processed occurrences (bottom left). Perfect extensions must be bridges or edges closing a cycle/ring. The horizontal axis shows the minimal support in percent.5 3 3.5 4 Extensions for Molecular Fragment Mining 30 full partial none 20 10 2 2. • Consequence: Additional constraint [Borgelt and Meinl 2006] 6 4 2.5 4 4.5 3 3. The curves show the number of generated fragments (top left).5 3 3.5 4 4.5 6 Experimental results on the IC93 data.5 3 3. obtained with ring mining. and the number of search tree nodes (top right) for the three diﬀerent methods.5 fragments/104 full partial none nodes/103 60 40 20 full partial none 5 5. The horizontal axis shows the minimal support in percent. obtained without ring mining (single bond extensions). Christian Borgelt Frequent Pattern Mining 389 Christian Borgelt Frequent Pattern Mining 390 Experiments: IC93 with Ring Mining fragments/103 nodes/103 60 40 full partial none 20 15 10 full partial none 20 5 0 2 2.5 4 Experimental results on the IC93 data.5 5 5.5 5 5. Christian Borgelt Frequent Pattern Mining 391 Christian Borgelt Frequent Pattern Mining 392 .5 4 4.5 6 occurrences/106 full partial none • Problem: Perfect extensions in cycles may not allow for pruning. The curves show the number of generated fragments (top left).5 3 3.5 14 12 10 8 3 3.Perfect Extensions: Problems with Cycles/Rings example molecules: C C C C N O C C C C N O N O C C C N O C C C C N O C C C N O C C C N O C C C N O C C C C Experiments: IC93 without Ring Mining search tree for seed N: N O C N O N O C C N O C C N N C N O C N O C C N C C N C C C 20 15 10 5 2. and the number of search tree nodes (top right) for the three diﬀerent methods.5 6 2.5 occurrences/105 4 2 2. the number of processed occurrences (bottom left).
• Apply the preprocessing procedure to a grown (sub)graph. • Filter Approaches: ◦ (Sub)graphs/fragments are grown edge by edge (as before). there are two ring identiﬁcation parts per edge: ◦ A marker in the edge attribute. Borgelt 2006] Ring Mining: Treat Rings as Units • General Idea of Ring Mining: A ring (cycle) is either contained in a fragment as a whole or not at all. 395 Christian Borgelt Frequent Pattern Mining 396 • Mark pseudorings.) 6 8 5 1 Extended Preprocessing: (for reordering approach) 2 N 0 • Check for edges that have a ring marker in the edge attribute. ◦ Considerably improves eﬃciency and interpretability. ◦ Infrequent fragments that diﬀer only in a few atoms from frequent fragments can be found. (Note that an edge can be part of several rings. 9 4 7 3 • Mark all edges of rings in a userspeciﬁed size range. and Berthold 2004] ◦ Add a carbon chain in one step. • If such edges exist. • Ring edges have been marked in the preprocessing. and Berthold 2004] ◦ Deﬁne classes of atoms that can be seen as equivalent. that is. and Berthold 2004. ◦ Extensions by a carbon chain match regardless of the chain length. so the (sub)graph must not be reported. but ◦ keep the marker in the edge attribute. • Carbon Chains [Meinl. ◦ A set of ﬂags identifying the diﬀerent rings an edge is contained in. ◦ Found frequent graph fragments are ﬁltered: Graph fragments with incomplete rings are discarded. 393 Christian Borgelt Frequent Pattern Mining 394 ◦ Preprocessing: Find rings in the molecules and mark them. • Wildcard Atoms [Hofer. ◦ Additional search tree pruning: Prune subtrees that yield only fragments with incomplete rings. rings of smaller size than the user speciﬁed. Christian Borgelt Frequent Pattern Mining . the (sub)graph contains unclosed/open rings. ◦ In the search process: Add all atoms and bonds of a ring in one step. (molecular fragment mining: usually rings with 5 – 6 vertices/atoms) • Technically. Borgelt. ◦ only set the ﬂags that identify the rings an edge is contained in. (one of) the containing ring(s) is added as a whole (all of its edges are added). • Reordering Approach ◦ If an edge is added that is part of one or more rings. ignoring its length. which fundamentally distinguishes ring edges from nonring edges. ◦ Incompatibilities with canonical form pruning are handled by reordering code words (similar to full perfect extension pruning). Christian Borgelt Frequent Pattern Mining Ring Mining: Preprocessing Ring mining is simpler after preprocessing the rings in the graphs to analyze: Basic Preprocessing: (for ﬁlter approaches) Filter Approaches: Open Rings Idea of Open Ring Filtering: If we require the output to have only complete rings. ⇒ It is known which edges of a grown (sub)graph are ring edges (in the underlying graphs of the database). but which consist only of edges that are part of rings within the userspeciﬁed size range. Borgelt. but did not receive any ring ﬂag when the (sub)graph was reprocessed. ◦ Combine fragment extensions with equivalent atoms. Borgelt.Extensions of the Search Algorithm • Rings [Hofer. we have to identify and remove fragments with ring edges that do not belong to any complete ring.
8. ◦ One must not commit too early to an order of the edges (because branches may inﬂuence the order of the ring edges). thus distinguishing extensions that ◦ start with the same individual edge. this (sub)graph can be pruned from the search (because there is an open ring that can never be closed). 397 Christian Borgelt Frequent Pattern Mining 398 Christian Borgelt Frequent Pattern Mining Filter Approaches: Merging Ring Extensions Idea of Merging Ring Extensions: The previous methods work on individual edges and hence cannot always detect if an extension only leads to fragments with complete rings that are infrequent. breadthﬁrst B: vertices with an index no smaller than the maximum source. because the existing cycle edge has the smallest possible source. C C C C N O C C C C N O A Reordering Approach • Drawback of Filtering: (Sub)graphs are still extended edge by edge. Christian Borgelt Frequent Pattern Mining 399 Christian Borgelt Frequent Pattern Mining 400 . because any of them may produce valid output. 8. • Canonical form pruning allows to restrict the possible extensions of a fragment. Edges Closing Cycles: A: none. this edge must be part of an incomplete ring. that is. • Add all edges of a ring. 1. B: the edge between the vertices 7 and 8. • Problems of a Reordering Approach: ◦ One must allow for insertions between already added ring edges (because branches may precede ring edges in the canonical form). S N O O O Reminder: Restricted Extensions A 2 0 S 3 B 0 S C C 4 2 1 N C 3 1 N O 4 O 6 C 5 example molecule C O 6 C 7 C O 8 O 5 C 8 O 7 depthﬁrst Extendable Vertices: A: vertices on the rightmost path. If there is a vertex with only one incident ring edge. • If an unextendable vertex of a grown (sub)graph has only one incident ring edge. Advantage of Merging Ring Extensions: • All extensions are removed that become infrequent when completed into rings. • Trim and merge ring extensions that share the same initial edge. • Determine the support of the grown (sub)graphs and prune infrequent ones. ◦ All possible orders of (locally) equivalent edges must be tried. • Better Approach: ◦ Add all edges of a ring in one step. ⇒ Some rings cannot be closed by extending a (sub)graph.) ◦ Reorder certain edges in order to comply with canonical form pruning. 3. create one extended (sub)graph for each ring it is contained in.Filter Approaches: Unclosable Rings Idea of Unclosable Ring Filtering: Grown (sub)graphs with open rings that cannot be closed by future extensions can be pruned from the search. ⇒ Fragments grow fairly slowly. but ◦ lead into rings of diﬀerent size or diﬀerent composition. • All occurrences are removed that lead to infrequent (sub)graphs once rings are completed. (When a ring edge is added. that is. 7. • Obviously. 0. 7. ⇒ Due to previous extensions certain vertices become unextendable. 6. a necessary (though not suﬃcient) condition for all rings being closed is that every vertex has either zero or at least two incident ring edges.
but that could become canonical once branches are added.≺ = 5 4 3 0 N 0C1 0C2 1C3 O 2C4 3C5 4=C5 1C3 2C4 3=C5 4C5 O O 1 N 6 2 N 0C1 0C2 2 N 6 1 O 3 5 N 0C1 0C2 1C3 2O4 2C5 3=C6 5C6 O 4 5 4 O 1 N 0 2 N 0C1 0C2 1O3 1C4 2C5 3C6 5=C6 O 2 N 0 1 O 3 • W. • With an attached branch (close to the root vertex). Christian Borgelt Frequent Pattern Mining 404 . • Rule for keeping noncanonical fragments: If the current code word deviates from the canonical code word in the ﬁxed part. which consists of the additional (ring) edges. • Needed: a rule which noncanonical fragments to keep and which to discard.t. • Justiﬁcation of this rule: ◦ If the deviation is in the ﬁxed part. • As a consequence we can split the code word into two parts: ◦ a ﬁxed preﬁx. the deviation is in the volatile part. The upper/left is the canonical form of the pure ring. 5 3 4 0 Keeping NonCanonical Fragments Solution of the early commitment problem: Maintain (and extend) both orderings of the ring edges and allow for deviations from the canonical form beyond “ﬁxed” edges. • Principle: keep (and. • Volatile suﬃx of a code word: The suﬃx of the code word after the last edge added in an edgebyedge manner (and excluding it). the fragment is pruned. a breadthﬁrst search canonical form. N ≺ O ≺ C. • Idea: adding a ring can be seen as adding its initial edge as in an edgebyedge procedure. the positions of which are not yet ﬁxed. Christian Borgelt Frequent Pattern Mining 403 Search Tree for an Asymmetric Ring with Branches Maintain (and extend) both orderings of the ring edges and allow for deviations from the canonical form beyond ﬁxed edges. no later addition of edges can have any eﬀect on it.r. Christian Borgelt Frequent Pattern Mining 401 Christian Borgelt Frequent Pattern Mining 402 Keeping NonCanonical Fragments • Fixed preﬁx of a code word: The preﬁx of the code word up to (and including) the last edge added in an edgebyedge manner. Illustration: eﬀects of attaching a branch to an asymmetric ring. 5 3 4 0 N 4 2 5 3 0 6 4 5 1 3 0 O 2 1 N O 3 1 6 5 5 0 4 4 6 3 2 0 O 1 2 N 1 O 5 2 6 4 0 3 O N O 4 3 7 6 1 O 5 N 2 O O N O 6 5 7 4 2 O 3 N 1 O O N 0 2 O O N 0 1 O The edges of a grown subgraph are split into • ﬁxed edges (edges that could have been added in an edgebyedge manner). a later extension edge may be inserted in such a way that the code word becomes canonical. the edges of the ring can be ordered in two diﬀerent ways (upper two rows). otherwise it is kept. the other ordering of the ring edges (lower/right) is the canonical form. which is also built by an edgebyedge procedure. however. • volatile edges (edges that have been added with ring extensions and before/between which edges may be inserted). and ◦ a volatile suﬃx. since the ﬁxed part will never be changed. consequently. also extend) fragments that are not in canonical form.Problems of Reordering Approaches One must not commit too early to an order of the edges. and some additional edges. ◦ If. .
◦ The form on the left is canonic. 6 2 6 2 8 5 1 5 9 4 7 4 5 3 1 6 2 6 2 7 3 4 1 5 3 1 4 5 5 2 5 2 6 3 1 3 1 7 4 7 4 8 4 9 8 N N 8 0 3 7 2 N 0 N 7 0 N 6 0 N 4 N 0 N 0 The order of the equivalent edges already in the fragment must be maintained. • However. • On the next level.Search Tree for an Asymmetric Ring with Branches • The search constructs the ring with both possible numberings of the vertices. and this we can achieve only by adding the 3bond ring ﬁrst. we may not reorder equivalent edges freely. and the order of the equivalent new edges must be maintained. all other bonds are volatile. there are two canonical and two noncanonical fragments. Since the code word for this fragment deviates from the canonical one only at the 5th bond. Christian Borgelt Frequent Pattern Mining 407 Christian Borgelt Frequent Pattern Mining 408 . The noncanonical fragment diﬀers in the volatile part (the ﬁrst four bonds are ﬁxed. The noncanonical fragments both diﬀer in the ﬁxed part. because any of them may in the end yield the canonical form. we may not discard it. in which the order of the edges already in the (sub)graph and the order of the newly added edges is preserved. The two sequences of equivalent edges may be merged in a “zipperlike” manner. as this would interfere with keeping certain noncanonical fragments: By keeping some noncanonical fragments we already consider some variants of orders of equivalent edges. and thus are pruned. in the canonical form the upward bond succeeds both other bonds. We cannot (always) decide locally which is the right order. These must not be generated again. • Splicing rule for equivalent edges: (breadthﬁrst search canonical form) The Necessity of PseudoRings The splicing rule explains the necessity of pseudorings: Without pseudorings it is impossible to achieve canonical form in some cases. Christian Borgelt Frequent Pattern Mining 405 Christian Borgelt Frequent Pattern Mining 406 Splicing Equivalent Edges • In principle. which now consists of the ﬁrst three bonds. because otherwise not all orders of equivalent edges are generated. there is one canonical and one noncanonical fragment. • On the third level. because this may depend on edges added later. ◦ In the fragment on the right only the ﬁrst ring bond is ﬁxed. but it deviates from the canonical code word only in the 7th bond) and thus may not be pruned from the search. but not the 3ring. • It is necessary to consider pseudorings for extensions. • Equivalent edges must be spliced in all ways. so it is kept. because in the presence of equivalent edges the order of these edges cannot be determined locally. selecting the next edge from either list. Connected and Nested Rings Connected and nested rings can pose problems. but preserving the order in each list. 6 2 6 2 8 5 1 5 9 4 7 4 5 3 1 6 2 6 2 7 3 4 1 5 3 1 4 5 5 2 5 2 6 3 1 3 1 7 4 7 4 8 4 9 8 N N 8 0 3 7 2 N 0 N 7 0 N 6 0 N 4 N 0 N 0 • Edges are (locally) equivalent if they start from the same vertex. • If we could only add the 5ring and the 6ring. the upward bond from the atom numbered 1 would always precede at least one of the other two bonds that are equivalent to it (since the order of existing bonds must be preserved). all possible orders of equivalent edges have to be considered. • Nevertheless. and lead to vertices with the same vertex attribute. have the same edge attribute.
Ring Key Pruning Example of Dependences between Edges 1 2 0N 3 4 5 1 0N 2 4 5 3 2 0N 2 1 3 4 1 0N 2 3 5 4 (All edge descriptions refer to the vertex numbering in the fragment on the left. ◦ Obviously. • Splicing rule for equivalent edges: (depthﬁrst search canonical form) Avoiding Duplicate Fragments • The splicing rules still allow that the same fragment can be reached in the same form in diﬀerent ways. at least one of the rings containing it must be present. ◦ If a ring edge e1 enforces a ring edge e2. • Since we cannot decide locally which of these edges should be followed ﬁrst when building the spanning tree. we deﬁne that a nonring edge enforces only its own presence. where equivalent edges can be found. it is not possible to form a subfragment with only complete rings that contains e1. only enforces itself and is enforced only by itself. ◦ The requirement of complete rings introduces dependences between edges: The presence of certain edges enforces the presence of certain other edges. any edge in the set {(0. Christian Borgelt Frequent Pattern Mining 409 Christian Borgelt Frequent Pattern Mining 410 Ring Key Pruning Dependences between Edges • The requirement of complete rings introduces dependences between edges. • Ideas underlying such an augmented test: The ﬁrst new ring edge has to be tried in all locations in the volatile part of the code word. we have to try all of these possibilities in order not to miss the canonical one.) • In the fragment on the left.) • A ring edge e1 of a fragment enforces the presence of another ring edge e2 iﬀ the set of rings containing e1 is a subset of the set of rings containing e2. however. 2) and (1. because all of these edges are contained exactly in the 5ring and the 6ring. because both are contained exactly in the 3ring and the 6ring. • The edge (0. 3). every ring edge enforces at least its own presence. (Idea: consider forming subfragments with only complete rings. In this form equivalent edges are adjacent in the canonical code word. even though the rule is slightly simpler. 1). but each time with a diﬀerent ﬁxed part: The position of the ﬁrst edge of a ring extension (after reordering) is the end of the ﬁxed part of the (extended) code word. 5)} enforces the presence of any other edge in this set. the edges (0. but not e2. 5). Christian Borgelt Frequent Pattern Mining 411 Christian Borgelt Frequent Pattern Mining 412 . ◦ In order to capture also nonring edges by such a deﬁnition. Reason: we cannot always distinguish between two diﬀerent orders in which two rings sharing a vertex are added. • Needed: an augmented canonical form test. • In the same way. (1. • In a depthﬁrst search canonical form equivalent edges can be far apart from each other in the code word. Nevertheless some “splicing” is necessary to properly treat equivalent edges in this canonical form. • There are no other enforcement relations between edges. ◦ The same code word of a fragment is created several times. 4).Splicing Equivalent Edges • The considered splicing rule is for a breadthﬁrst search canonical form. ◦ In order for a ring edge to be present in a subfragment. (4. 2) enforce each other. namely by adding (nested) rings in diﬀerent orders. (3.
• N 0C1 is not a ring key. 3). ◦ The code word vuxw ′ cannot have a shorter ring key than vux. Christian Borgelt Frequent Pattern Mining 415 Christian Borgelt Frequent Pattern Mining 416 . 2) is enforced by e2 = (0. e1 = (0. because e4 = (1. because it enforces no edges. • Idea of (Shortest) Ring Key Pruning: Discard fragments that are formed with a code word. . 1. on the considered code word. Ring Key Pruning • Example of (shortest) ring key(s): Breadthﬁrst search (canonical) code word: N 0C1 0C2 0C3 1C2 1C4 3C5 4C5 Edges: e1 e2 e3 e4 e5 e6 e7 • N is obviously not a ring key. 4). • Anchor: If a fragment contains only one ring. . ◦ If by this procedure a ﬁxed ring edge becomes ﬂagless. • N 0C1 0C2 is not a ring key. k ∈ {0. the ﬁrst ring edge enforces the other ring edges and thus the ﬁxed part is a shortest ring key. but depends. 1 2 0N 3 4 5 Christian Borgelt Frequent Pattern Mining 413 Christian Borgelt Frequent Pattern Mining 414 Ring Key Pruning • If only code words with ﬁxed parts that are shortest ring keys are extended. 1) and e2 = (0. Note: The shortest ring key of a code word is uniquely deﬁned. • Any longer preﬁx is a ring key. because it does not enforce. u describes edges originally described by parts of w (u may be empty). Ring Key Pruning Test Procedure of Ring Key Pruning • Check for each volatile edge whether it is enforced by at least one ﬁxed edge: ◦ Mark all rings in the considered fragment (set ring ﬂags).Ring Key Pruning (Shortest) Ring Keys • We consider preﬁxes of code words that contain 4k + 1 characters. • A preﬁx v of a code word vw (whether canonical or not) is called a ring key iﬀ each edge described in w is enforced by at least one edge described in v. but not a shortest ring key. e2 or e3. for example. ◦ Extending this code word generally transforms it into a code word vuxw ′. it suﬃces to check whether the ﬁxed part is a ring key. the edge e is enforced by it. otherwise the edge e is not enforced. 5) and e7 = (4. for example. • The preﬁx v is called a shortest ring key of vw iﬀ it is a ring key and there is no shorter preﬁx that is a ring key for vw. x is the description of the ﬁrst new edge and w ′ describes the remaining old and new edges. m}. • Example: 1 2 0N 2 3 4 2 0N 3 1 4 5 ◦ Extending the 5ring yields the fragment on the right in canonical form with the ﬁrst two edges (that is. the ﬁxed part of which is not a shortest ring key. ◦ The preﬁx N 0C1 0C2 is not a ring key (the grey edges are not enforced) and hence the fragment is discarded. where m is the number of edges of the fragment. because the edges described in vu do not enforce the edge described by x. even though it is in canonical form. because it does not enforce. for which the preﬁx v is a shortest ring key. ◦ Remove all rings containing a given volatile edge e (clear ring ﬂags). • Induction step: ◦ Let vw be a code word with ﬁxed part v and volatile part w. • N 0C1 0C2 0C3 is the shortest ring key. of course. 2) and e5 = (1. 2)) ﬁxed. . e3. 5) are enforced by e3 = (0. . e6 = (3.
and the execution time in seconds (bottom left) for the three diﬀerent strategies. ◦ Check whether the ﬁxed part is a shortest ring key. (solid frame: extended and reported. Christian Borgelt Frequent Pattern Mining 417 Christian Borgelt Frequent Pattern Mining 418 Experiments: IC93 Experiments: NCI HIV Screening Database 8 6 4 fragments/104 reorder merge rings close rings 5 4 3 2 occurrences/106 reorder merge rings close rings 3 2 fragments/104 reorder merge rings close rings 8 6 4 occurrences/107 reorder merge rings close rings 2 0 1 1 2 2.5 4 4.5 3 3. The horizontal axis shows the minimal support in percent. The curves show the number of generated fragments (top left). but not reported. By attaching additional rings any of these fragments may become canonical.5 3 3.5 3 3.5 5 0 0. The preﬁx N 0C1 0C2 describing these edges is not a ring key.5 4 25 20 15 10 5 0 2 2.Search Tree for Nested Rings N 2 1 0N 1 2 3 0N 1 2 0N 3 3 2 4 5 1 0N 3 4 5 2 1 0N 2 3 2 4 5 3 0N 2 4 5 3 1 3 4 5 5 4 2 0N 1 3 4 5 3 0N 1 1 4 5 2 0N 2 3 1 5 4 1 0N 2 also in canonical form Search Tree for Nested Rings • In all fragments in the bottom row of the search tree (fragments with frames) the ﬁrst three edges are ﬁxed.5 5 Experimental results on the IC93 data. no frame: pruned) • The full fragment is generated twice in each form (even the canonical). • Note that for all single ring fragments two of their four children are kept. • In the row above it (fragments without frames). the number of processed occurrences (top right).5 2 2. • Augmented Canonical Form Test: ◦ The created code words have diﬀerent ﬁxed parts.5 2 2. The horizontal axis shows the minimal support in percent.5 1 1.5 4 2 0 0. the number of processed occurrences (top right).5 5 0 2 2. Hence these fragments are kept and processed. 420 Christian Borgelt Christian Borgelt Frequent Pattern Mining .5 4 4. only the ﬁrst two edges are ﬁxed.5 1 1.5 4 Experimental results on the HIV data.5 3 3. Frequent Pattern Mining 419 150 100 50 0 time/seconds reorder merge rings close rings 0. 4 0N 2 1 2 5 2 3 0N 3 4 2 0N 1 3 1 5 4 4 5 5 4 1 2 0N 3 4 2 1 3 4 5 5 4 2 1 0N 3 3 5 1 2 4 1 3 5 The preﬁx N 0C1 0C2 0C3 describing these edges is a shortest ring key. the rest is volatile.5 3 3. and the execution time in seconds (bottom left) for the three diﬀerent strategies. The reason is that the deviation from the canonical form resides in the volatile part of the fragment.5 3 3.5 2 2. even though only the one at the left bottom is in canonical form.5 1 1. (The grey edges are not enforced.5 time/seconds reorder merge rings close rings 4 4. The curves show the number of generated fragments (top left). dashed frame: extended. the rest is volatile.) Hence these fragments are discarded.
NCI DTP HIV Antiviral Screen: AZT Some Molecules from the NCI HIV Database O N N N N O N N O O O N N N O N N N O O N N O O O O O O N N N N N N O O N O O N O N N O O O P O O N N N O O O Found Molecular Fragments O O Common Fragment O N O Christian Borgelt Frequent Pattern Mining 421 Christian Borgelt Frequent Pattern Mining 422 NCI DTP HIV Antiviral Screen: Other Fragments Experiments: Ring Extensions Improved Interpretability N N Fragment 1: CA: 5. Christian Borgelt Frequent Pattern Mining 423 Christian Borgelt Frequent Pattern Mining 424 .00% N S Cl O O S O N N O N S NSC #667948 NSC #698601 Compounds from the NCI cancer data set that contain Fragment 1 but not 2.23% CI/CM: 0.07% O O S O N N Fragment 3: CA: 5.00% Fragment 4: CA: 9.23% CI/CM: 0.85% CI/CM: 0. in CA: 22.08% O N N Fragment 1 basic algorithm freq.07% N O O O P O O Fragment 5: CA: 10.04% O N N O O O Fragment 6: CA: 9.92% CI/CM: 0.85% CI/CM: 0.05% O O S O N N Fragment 2: CA: 4.15% CI/CM: 0. in CA: 20.77% O Fragment 2 with ring extensions freq.
Diﬀerent canonical forms lead to diﬀerent behavior of the search algorithm. CI: 0. This problem is solved with the help of canonical forms of graphs.0% Cl A S N B=O B=S CA: 5.48% freq.net/moss.0% 0. (Carbon) Chain Mining. and Wildcard Vertices. ignoring its length.7% CI/CM: 0.5% 0.13% Cl Christian Borgelt Frequent Pattern Mining 425 Christian Borgelt Frequent Pattern Mining 426 Summary Frequent (Sub)Graph Mining • Frequent (sub)graph mining is closely related to frequent item set mining: Find frequent (sub)graphs instead of frequent subsets.borgelt. • Extensions of the basic algorithm (particularly useful for molecules) include: Ring Mining. • Advantage: Fragments can represent carbon chains of varying length. All frequent fragments can be reconstructed from the closed ones. • Combine fragment extensions with equivalent atoms. CA: 1. • A Java implementation for molecular fragment mining is available at: http://www. • Extension by a carbon chain match regardless of the chain length. • A restriction to closed fragments allows for additional pruning strategies: partial and full perfect extension pruning.5% 3.0% 0.html Christian Borgelt Frequent Pattern Mining 427 Christian Borgelt Frequent Pattern Mining 428 Mining a Single Graph .0% freq. • A core problem of frequent (sub)graph mining is how to avoid redundant search. Example from the NCI Cancer Dataset: Examples from the NCI HIV Dataset: Fragment with Chain N N C* Actual Structures N N N N S N B A=O A=N CA: 5.01% CI/CM: 0.Experiments: Carbon Chains • Technically: Add a carbon chain in one step. • The restriction to closed fragments is a lossless reduction of the output. Experiments: Wildcard Atoms • Deﬁne classes of atoms that can be considered as equivalent. • Advantage: Infrequent fragments that diﬀer only in a few atoms from frequent fragments can be found.
v) ∈ ES : (f (u). Example: input graph: subgraphs: occurrences: sG (A) = 1 sG(B −A−B) = 2 B A A B B B B B 2 2 3 A A A 1 1 1 B B B 3 3 2 B A ◦ ∀(u. EG. E. the mapping f preserves the connection structure and the labels. iﬀ V1 = V2. namely if S possesses an automorphism that is not the identity.Reminder: Basic Notions • A labeled or attributed graph is a triple G = (V. ◦ E ⊆ V × V − {(v. The two subgraph isomorphisms f1 and f2 are called ◦ overlapping. f (v))). Christian Borgelt Frequent Pattern Mining 429 Christian Borgelt Frequent Pattern Mining 430 AntiMonotonicity of Subgraph Support Most natural deﬁnition of subgraph support in a single graph setting: number of occurrences (subgraph isomorphisms). B A written f1 ≡ f2. iﬀ ∀v ∈ VS : f1(v) = f2(v). ℓG) and S = (VS . Example: input graph: subgraphs: occurrences: sG (A) = 1 sG(A−B) = 2 sG(B −A−B) = 2 Relations between Occurrences • Let f1 and f2 two subgraph isomorphisms of S to G and V1 = {v ∈ VG  ∃u ∈ VS : v = f1(u)} and V2 = {v ∈ VG  ∃u ∈ VS : v = f2(u)}. Problem: The number of occurrences of a subgraph is not antimonotone. written f1◦ 2. where ◦ V is the set of vertices. iﬀ V1 ∩ V2 = ∅. • There can be nonidentical. ◦ identical. ◦f B A A B B B B B A A A B B B B B 2 2 3 A A A 1 1 1 B B B 3 3 2 ◦ equivalent. Question: How should we deﬁne subgraph support in a single graph? Christian Borgelt Frequent Pattern Mining 431 Christian Borgelt Frequent Pattern Mining 432 . v)  v ∈ V } is the set of edges. ES . But: Antimonotonicity is vital for the eﬃciency of frequent subgraph mining. • Let G = (VG. Problem: The number of occurrences of a subgraph is not antimonotone. ℓS ) be two labeled graphs. written f1◦f2. f (v)) ∈ EG ∧ ℓS ((u. But: Antimonotonicity is vital for the eﬃciency of frequent subgraph mining. Question: How should we deﬁne subgraph support in a single graph? That is. • Note that identical subgraph isomorphisms are equivalent and that equivalent subgraph isomorphisms are overlapping. v)) = ℓG((f (u). A subgraph isomorphism of S to G or an occurrence of S in G is an injective function f : VS → VG with ◦ ∀v ∈ VS : ℓS (v) = ℓG(f (v)) and AntiMonotonicity of Subgraph Support Most natural deﬁnition of subgraph support in a single graph setting: number of occurrences (subgraph isomorphisms). but equivalent subgraph isomorphisms. and ◦ ℓ : V ∪ E → A assigns labels from the set A to vertices and edges. ℓ).
The overlap graph of S w. An independent vertex set of G is a set I ⊆ V with ∀u. Christian Borgelt Frequent Pattern Mining 433 Christian Borgelt Frequent Pattern Mining 434 Finding a Maximum Independent Set • Unmark all vertices of the overlap graph.t. v) ∈ E. Example: input graph: subgraph: 3 2 1 1 2 3 Maximum Independent Set Support Let G = (V. • Heuristic Greedy Algorithm ◦ Select a vertex with the minimum number of unmarked neighbors and mark all of its neighbors as excluded. EG. EG. if f ′ coincides with f on the vertex set VT of T . VT ⊂ VS . G is the graph O = (VO . and ℓT ≡ ℓS VT ∪ET ). ◦ Process the rest of the graph recursively.r. ℓG). that is. but maybe equivalent) occurrence of S in G. that is. f2 ∈ VO ∧ f1 ≡ f2 ∧ f1◦ 2}. However. v)  v ∈ V }. EO ) be the overlap graph of the occurrences of a labeled graph S = (VS . ℓS ) in a labeled graph G = (VG. f2)  f1. EO ). Let f1 and f2 be two (nonidentical. • Exact Backtracking Algorithm ◦ Find an unmarked vertex with maximum degree and try two possibilities: ◦ Select it for the MIS. ◦ Exclude it from the MIS. (Note: The inverse implication does not hold generally. G is the size of a maximum independent vertex set of O. S.t. ◦ Process the rest recursively and record best solution found. 2 3 1 1 3 2 B B A A B B A A B B B B A A B B A A B B Let O = (VO . ′ ′ f1 and f2 overlap if there exist overlapping T ancestors f1 and f2 of the occurrences f1 and f2. T and f the T ancestor f ′ of the occurrence f is uniquely deﬁned. / I is a maximum independent vertex set iﬀ • it is an independent vertex set and • for all independent vertex sets J of G it is I ≥ J. ℓG) and S = (VS . Observations: For given G. ET = (VT × VT ) ∩ ES . Let T = (VT . An occurrence f ′ of the subgraph T is called a Tancestor of the occurrence f iﬀ f ′ ≡ f VT . that is. B B A A B B A B Notes: Finding a maximum independent vertex set is an NPcomplete problem. E) be an (undirected) graph with vertex set V and edge set E ⊆ V × V − {(v. Let f be an occurrence of S in G. v ∈ I : (u. ES .r. Christian Borgelt Frequent Pattern Mining 435 AntiMonotonicity of MISSupport: Preliminaries Let G = (VG. ℓS ) be two labeled graphs.Overlap Graphs of Occurrences Let G = (VG. ET . • In both algorithms vertices with less than two unmarked neighbors can be selected and all of their neighbors marked as excluded. a greedy algorithm usually gives very good approximations. which has the set VO of occurrences of S in G as its vertex set ◦f and the edge set EO = {(f1.) Christian Borgelt Frequent Pattern Mining 436 . ES . ℓT ) a (nonempty) proper subgraph of S (that is. mark it as selected and mark all of its neighbors as excluded. ℓG) and S = (VS . EG. mark it as excluded. The maximum independent set support (or MISsupport for short) of S w. ES . ℓS ) be two labeled graphs and let VO be the set of all occurrences (subgraph isomorphisms) of S in G. respectively.
G is the size of a maximum independent vertex set of H. a graph G cannot exceed the MISsupport of any (nonempty) proper subgraph T of S.t.t. respectively. • As a consequence. • The set I induces a subset I ′ of the vertices of the overlap graph O ′ of an (arbitrary. • Hence the maximum independent vertex set of O ′ must be at least as large as the maximum independent vertex set of O. The harmful overlap support (or HOsupport for short) of the graph S w. EH ) be the harmful overlap graph of the occurrences of a labeled graph S = (VS . ℓS ) be two labeled graphs and let VH be the set of all occurrences (subgraph isomorphisms) of S in G. ℓG) and S = (VS . ′ ′ so that the T ancestors f1 and f2 of f1 and f2.t. • With similar argument: I ′ is an independent vertex set of the overlap graph O ′.r. The harmful overlap graph of S w.t. 3 2 A A 1 1 B B 2 3 A A B B B B A A B B 2 3 A A 1 1 B B 3 2 B Proof: Identical to proof for MISsupport. 437 Christian Borgelt Frequent Pattern Mining 438 Christian Borgelt Frequent Pattern Mining Harmful Overlap Graphs and Subgraph Support Let G = (VG.r.r. EG. which consists of the (uniquely deﬁned) T ancestors of the vertices in I. EG.) Christian Borgelt Frequent Pattern Mining 439 Christian Borgelt Frequent Pattern Mining 440 . are equivalent. written f1• 2. EH ). f2 ∈ VH ∧ f1 ≡ f2 ∧ f1• 2}. Proof: We have to show that the MISsupport of a subgraph S w. EG. ℓG) and S = (VS . ℓS ) be two labeled graphs and let f1 and f2 be two occurrences (subgraph isomorphisms) of S to G. ℓG). Harmful and Harmless Overlaps of Occurrences Not all overlaps of occurrences are harmful: input graph: subgraph: occurrences: A A A A B B B B C C C C A A A A B B C C A A B C A Let G = (VG. but ﬁxed) subgraph T of the considered subgraph S. •f Let H = (VH . iﬀ •f • they are equivalent or [Fiedler and Borgelt 2007] • there exists a (nonempty) proper subgraph T of S. because no two elements of I can have the same T ancestor. (The same two observations hold. ES . G. • It is I = I ′. G is the graph H = (VH . ES .r. ℓS ) in a labeled graph G = (VG. ES . since I is arbitrary. every independent vertex set of O induces an independent vertex set of O ′ of the same size. which were all that was needed. • Let I be an arbitrary independent vertex set of the overlap graph O of S w. f2 )  f1. f1 and f2 are called harmfully overlapping. Harmful Overlap Graphs and Ancestor Relations input graph: B A B A B A B B A B A B B A B B B A A B B A A B B B B A A B B A A B B B Theorem: HOsupport is antimonotone.AntiMonotonicity of MISSupport: Proof Theorem: MISsupport is antimonotone. which has the set VH of occurrences of S in G as its vertex set and the edge set EH = {(f1.
vb) with va ∈ W and vb ∈ VC − W . v2) ∈ ES  (f1(v1). • Since g(va) ∈ VC . • Hence VE is a maximal set of vertices for which g is a bijection. EC . g must be an automorphism of SE . Hence we have to consider subgraphs of SE in this case. v2) = (f2(u1). 3) Let EE = F1 ∩ F2. must be a bijective mapping. Proof: (by contradiction) • Suppose that there is a connected component SC with W = ∅ and W = VC . v2) ∈ EG  ∃(u1. u2) ∈ ES : (v1. a subgraph isomorphism of SE to itself. • VE is the vertex set of a subgraph SE that induces equivalent ancestors. −1 is the inverse of f . • Exploit the properties of automorphism to exclude vertices from the graph S that cannot be in VE . On this path there must be an edge (va. return false. v2) ∈ EG  ∃(u1. 1) Form the sets V1 = {v ∈ VG  ∃u ∈ VS : v = f1(u)} and V2 = {v ∈ VG  ∃u ∈ VS : v = f2(u)}. where f2 2 Subgraph Support Computation Input: Two (diﬀerent) occurrences f1 and f2 of a labeled graph S = (VS . otherwise return true. u2) ∈ ES : (v1. Restriction to Connected Subgraphs Lemma: Let SC = (VC . Christian Borgelt Frequent Pattern Mining Restriction to Connected Subgraphs The search for frequent subgraphs is usually restricted to connected graphs. ℓS ). g(vb)) ∈ EE (g is an automorphism). but: How do we check whether two occurrences overlap harmfully? Core ideas of the harmful overlap test: • Try to construct a subgraph SE = (VE . but ﬁxed) connected component of the subgraph SE and let W = {v ∈ VC  g(v) ∈ VC } −1 (reminder: ∀v ∈ VE : g(v) = f2 (f1(v)). this implies vb ∈ W . ℓC ) be an (arbitrary. ES . EG. Christian Borgelt Frequent Pattern Mining 443 Christian Borgelt Frequent Pattern Mining 444 . However. v2) = (f1(u1). −1 • For such a subgraph SE the mapping g : VE → VE with v → f2 (f1(v)). Computing the edge set EE of the subgraph SE : 1) Let E1 = {(v1. • However. EE .Subgraph Support Computation Checking whether two occurrences overlap is easy. contradicting vb ∈ VC − W . Output: Whether f1 and f2 overlap harmfully. ES . 441 Christian Borgelt Frequent Pattern Mining 442 • More generally. f1(v2)) ∈ E1 ∩ E2} and F2 = {(v1. • Choose two vertices v1 ∈ W and v2 ∈ VC − W . • Any vertex v ∈ VS − VE cannot contribute to such equivalent ancestors. • It is (va. ℓG). f2(u2))}. 2) Let F1 = {(v1. that is. • v1 and v2 are connected by a path in SC . vb) ∈ EE and therefore (g(va). v2) ∈ ES  (f2(v1). checking all possible subgraphs is prohibitively costly. f1(u2))} and E2 = {(v1. and 3) If VE = W1 ∩ W2 = ∅. since SC is a connected component. We cannot conclude that no edge is needed if the subgraph SE is not connected: there may be a connected subgraph of SE that induces equivalent ancestors of the occurrences f1 and f2. 2) Form the sets W1 = {v ∈ VS  f1 (v) ∈ V1 ∩ V2} W2 = {v ∈ VS  f2 (v) ∈ V1 ∩ V2}. ℓE ) that yields equivalent ancestors of two given occurrences f1 and f2 of a graph S = (VS . ℓS ) in a labeled graph G = (VG. f2(v2)) ∈ E1 ∩ E2}. it follows g(vb) ∈ VC . g is an automorphism of SE ) Then it is either W = ∅ or W = VC .
or HOsupport. (No need to determine a maximum independent vertex set. Then the minimum number of vertex images support (or MNIsupport for short) of S w. G is deﬁned as v∈VS Experimental Results min {u ∈ VG  ∃f ∈ F : f (v) = u}. u) ∈ EE }. • Additional advantage: connected components consisting of isolated vertices can be neglected afterwards. 2) Form the edge set EE of the subgraph SE (as described above) and ′ form the (reduced) vertex set VE = {v ∈ VS  ∃u ∈ VS : (v. because then such a vertex v alone gives rise to equivalent ancestors. ℓS ) in a labeled graph G = (VG. A simple example of harmful overlap without identical images: input graph: occurrences: Input: Final Procedure for Harmful Overlap Test Two (diﬀerent) occurrences f1 and f2 of a labeled graph S = (VS . ℓG) and S = (VS .t.) Example: Christian Borgelt Frequent Pattern Mining TicTacToe win B A A B 447 Christian Borgelt 300 250 200 150 100 50 0 number of subgraphs MNIsupport HOsupport MISsupport 120 140 160 180 200 220 240 260 280 300 448 Frequent Pattern Mining . ℓS ) be two labeled graphs and let F be the set of all subgraph isomorphisms of S to G. −1 i i If ∃i. 1) If ∃v ∈ S : f1(v) = f2(v). ES . EE ). return true. ′ ′ be the connected components of SE = (VE . so it should be the ﬁrst step. ′ does not contain isolated vertices. Christian Borgelt Frequent Pattern Mining 445 Christian Borgelt Frequent Pattern Mining 446 Alternative: Minimum Number of Vertex Images Let G = (VG. (Fairly unintuitive behavior. [Bringmann and Nijssen 2007] Index Chemicus 1993 600 500 400 300 200 100 0 number of subgraphs MNIsupport HOsupport MISsupport # graphs 200 250 300 350 400 450 500 Advantage: • Can be computed much more eﬃciently than MIS. B B A A A A B B subgraph: A B A A A B B Note that the subgraph inducing equivalent ancestors can be arbitrarily complex even if ∀v ∈ VS : f1(v) = f2(v). EG. ES . 1 ≤ i ≤ n. • This test can be performed very quickly.r.Further Optimization The test can be further optimized by the following simple insight: • Two occurrences f1 and f2 overlap harmfully if ∃v ∈ VS : f1(v) = f2(v). otherwise return false.) Disadvantage: • Often counts both of two equivalent occurrences. EG.) (Note that VE i i i 3) Let SC = (VC . Output: Whether f1 and f2 overlap harmfully. return true. ℓG). 1 ≤ i ≤ n : ∃v ∈ VC : g(v) = f2 (f1(v)) ∈ VC . EC ).
alarms in telecommunication networks. contains etc. Proof: look at induced independent vertex sets for substructures. t1 = t2. starts. • Multiple sequences versus a single sequence ◦ Multiple sequences: purchases with rebate cards. for example. are always directed. ◦ a c b a b c b c may not always count as a subsequence abc. web server access protocols. • Alternative: minimum number of vertex images.Summary • Deﬁning subgraph support in the single graph setting: maximum independent vertex set of an overlap graph of the occurrences. • MISsupport is antimonotone. and t1 > t2 ◦ labeled time intervals: relations like before. movement analysis (sports medicine). Christian Borgelt Frequent Pattern Mining 451 Christian Borgelt Frequent Pattern Mining 452 . ◦ DNA sequences can be undirected (both directions can be relevant). ◦ Further distinction: one object per (time) point versus multiple objects. • Harmful overlap support is antimonotone. • Restriction to connected substructures and optimizations. • Software: http://www. • Existence of an occurrence versus counting occurrences ◦ Combinatorial counting (all occurrences) ◦ Maximal number of disjoint occurrences ◦ Temporal support (number of time window positions) ◦ Minimum occurrence (smallest interval) • Relation between the objects in a sequence ◦ items: ◦ labeled time points: only precede and succeed t1 < t2. ◦ Intervals: weather data. Frequent Sequence Mining • Consecutive subsequences versus subsequences with gaps ◦ a c b a b c b a always counts as a subsequence abc. • (Time) points versus time intervals ◦ Points: DNA sequences.html Frequent Sequence Mining Christian Borgelt Frequent Pattern Mining 449 Christian Borgelt Frequent Pattern Mining 450 Frequent Sequence Mining • Directed versus undirected sequences ◦ Temporal sequences.net/moss.borgelt. overlaps. • Deﬁnition of harmful overlap support of a subgraph: existence of equivalent ancestor occurrences. • Simple procedure for testing whether two occurrences overlap harmfully. ◦ Single sequence: alarms in telecommunication networks.
where wm−1 = a1 b1 a2 b2 . bm−1 am−1 bm am. vm • The lexicographically smaller of the two code words is the canonical code word. without loss of generality. bm−1 am−1. In addition. . am−1 bm−1 am bm b1 a1 b2 a2 . the preﬁx wm−1 of which is not canonical. am−1 bm−1 am bm. but the canonical form of the sequence ba is rather ab. . . ◦ As there is only one possible code word per sequence (only one direction). a2 a1 a0 b1 b2 . by adding one item at the front and one item at the end. • As a consequence. which is canonical. . • The reason is that an undirected sequence can be read forward or backward. . . A Canonical Form for Undirected Sequences The code words deﬁned in this way clearly have the preﬁx property: • Suppose the preﬁx property would not hold. and vm−1 < wm−1. this code word is necessarily canonical. . which gives rise to two possible code words. . . Christian Borgelt Frequent Pattern Mining . has the preﬁx ba. we have wm < vm. • Consecutive subsequences are easier to handle: ◦ There are fewer occurrences of a given subsequence. . ◦ The sequence bab. because vm−1 is a preﬁx of vm and wm−1 is a preﬁx of wm. but vm < wm contradicts wm < vm. • As a consequence. . ◦ Other sequences are handled with state machines for containment tests. but the canonical form of the sequence cab is rather bac. . Christian Borgelt Frequent Pattern Mining 454 A Canonical Form for Undirected Sequences • A (simple) possibility to form canonical code words having the preﬁx property is to handle (sub)sequences of even and odd length separately. ◦ For each occurrence there is exactly one possible extensions. a2 a1 b1 b2 . . 455 Christian Borgelt Frequent Pattern Mining 456 • Odd length: The sequence am am−1 . am−1 bm−1. forming the code word is started in the middle. has the preﬁx cab. and that the lexicographically smaller code word is the canonical one. . that is. Then there exists. Christian Borgelt Frequent Pattern Mining 453 A Canonical Form for Undirected Sequences • If the sequences to mine are not directed. a canonical code word wm = a1 b1 a2 b2 . am−1 bm−1 am bm a0 b1 a1 b2 a2 . which is canonical. ◦ This allows for specialized data structures (similar to an FPtree). the smaller (or the larger) of which may then be deﬁned as the canonical code word. . because it does not have the preﬁx property. bm−1 bm is described by the code word or by the code word a1 b1 a2 b2 . vm−1 < wm−1 implies vm < wm. where vm−1 = b1 a1 b2 a2 . ◦ The sequence cabd. • Even length: The sequence am am−1 . . . • Examples (that the preﬁx property is violated): ◦ Assume that the item order is a < b < c . bm−1 bm is described by the code word or by the code word a0 a1 b1 a2 b2 . bm−1 am−1 bm am. we have to look for a diﬀerent way of forming code words (at least if we want the code to have the preﬁx property). • Item sequences are easiest to handle: ◦ There are only two possible relations and thus patterns are simple.Frequent Sequence Mining • Directed sequences are easier to handle: ◦ The (sub)sequence itself can be used as a code word. • Such sequences are extended by adding a pair am+1 bm+1 or bm+1 am+1. . bm−1 am−1 bm am. . • However. a subsequence can not be used as its own code word. . where = b1 a1 b2 a2 . . . . . . .
• A time interval sequence is a set of (labeled) time intervals. 457 Christian Borgelt Frequent Pattern Mining 458 sm = i=1 (ai = bi) • The symmetry ﬂag can be maintained in constant time with sm+1 = sm ∧ (am+1 = bm+1). • In the canonical form. However. l2) with l1 = l2 we have either e1 < s2 or e2 < s1 . • For each sequence a symmetry ﬂag is computed: m Sequences of Time Intervals • A (labeled or attributed) time interval is a triple I = (s. e. l2) be two time intervals. Otherwise they are merged into one interval I = (min{s1. • However. This can conveniently be done with a matrix: A B C A e o b A B C B io e m C a im e • Such a temporal pattern matrix can also be interpreted as an adjacency matrix of a graph. e1.A Canonical Form for Undirected Sequences • Generating and comparing the two possible code words takes linear time. These constraints can be exploited to obtain a simpler canonical form. e1. A before B A meets B A overlaps B A is ﬁnished by B A contains B A is started by B A equals B A A A A A B A B Frequent Pattern Mining Temporal Interval Patterns • A temporal pattern must specify the relations between all referenced intervals. any relation between am+1 and bm+1 is acceptable. thus mapping the problem to frequent (sub)graph mining. • Time intervals can easily be ordered as follows: Let I1 = (s1. • A time interval sequence database is a vector of time interval sequences. which has the interval relationships as edge labels. the input interval sequences may be represented as such graphs. l). it must be am+1 ≤ bm+1. • The permissible extensions depend on the symmetry ﬂag: ◦ if sm = true. time intervals allow for diﬀerent relations. • This rule guarantees that exactly the canonical extensions are created. e is the end time and l is the associated label. “B after A” and “C after B” imply “C after A”). e2. max{e1. s2}. l1) and I2 = (s2 . ◦ if sm = false. It is I1 ≺ I2 iﬀ ◦ s1 < s2 or ◦ s1 = s2 and e1 < e2 or ◦ s1 = s2 and e1 = e2 and l1 < l2. A commonly used set of relations between time intervals are Allen’s interval relations. l1). this can be improved by maintaining an additional piece of information. • Generally. l1) and I2 = (s2 . Christian Borgelt Frequent Pattern Mining Allen’s Interval Relations Due to their temporal extension. the relationships between time intervals are constrained (for example. of which we assume that they are maximal in the sense that for two intervals I1 = (s1. [Kempe 2008] 459 Christian Borgelt Frequent Pattern Mining 460 [Allen 1983] B after A B is met by A B is overlapped by A B ﬁnishes A B during A B starts A B equals A B B B B B A Christian Borgelt . e2}. Applying this rule to check a candidate extension takes constant time. the intervals are assigned in increasing time order to the rows and columns of the temporal pattern matrix. at least the third option must hold. Due to the assumption made above. where s is the start time. e2.
a single sequence can be deﬁned by: ◦ Combinatorial counting (all occurrences) ◦ Maximal number of disjoint occurrences ◦ Temporal support (number of time window positions) ◦ Minimum occurrence (smallest interval) • However. directed and undirected sequences ◦ items versus (labeled) intervals. all of these deﬁnitions suﬀer from the fact that such support is not antimonotone or downward closed : B A B Weakly AntiMonotone / Downward Closed • Let P a pattern space with a subpattern relationship < and let s be a function from P to the real numbers. Christian Borgelt Frequent Pattern Mining [Kempe 2008] 462 Summary Frequent Sequence Mining • Several diﬀerent types of frequent sequence mining can be distinguished: ◦ single and multiple sequences. The reasons is that with minimum occurrence counting the relationship “contains” is the only one that can lead to support anomalies like the one shown above. • Nevertheless an exhaustive pattern search can ensured. • A weakly antimonotone support function can be enough to allow pruning with the Apriori property. in this case it must be made sure that the canonical form assigns an appropriate parent pattern in order to ensure an exhaustive search. s : P → IR. The function s on the pattern space P is called ◦ strongly antimonotone or strongly downward closed iﬀ ∀S ∈ P : ∀R ∈ P (S) : s(R) ≥ s(S). but the support of “A” is only 1. single and multiple objects per position ◦ relations between the objects.Support of Temporal Patterns • The support of a temporal pattern w. Frequent Tree Mining Christian Borgelt Frequent Pattern Mining 463 Christian Borgelt Frequent Pattern Mining 464 . With these rules it is possible to check in constant time whether a possible extension leads to a result in canonical form. the Apriori property can safely be used for pruning. • If temporal interval patterns are extended backwards in time. The support of “A contains B” is 2. Christian Borgelt Frequent Pattern Mining 461 • The support of temporal interval patterns is weakly antimonotone (at least) if it is computed from minimal occurrences. without having to abandon pruning with the Apriori property. For a pattern S ∈ P let P (S) = {R  R < S ∧ ∃ Q : R < Q < S} be the set of all parent patterns of S. deﬁnition of pattern support • All common types of frequent sequence mining possess canonical forms for which canonical extension rules can be found. ◦ weakly antimonotone or weakly downward closed iﬀ ∀S ∈ P : ∃R ∈ P (S) : s(R) ≥ s(S).r. However.t.
the order of the children of each vertex is ﬁxed: it is simply the given order of the outgoing edges. • Reminder: A (labeled) graph G is called a tree iﬀ for any pair of vertices in G there exists exactly one path connecting them in G. their special form makes it easier to compare code words for subtrees. which is then necessarily also the shortest path. If the tree is rooted. and the depth of the source vertex of an edge. • Trees of whichever type are much easier to handle frequent (sub)graphs. where m n a b d is is is is is the number of edges of the tree. • The distance between two vertices of a graph G is the length of a shortest path connecting them. • A tree is called rooted if it has a distinguished vertex. • The preorder code words we consider here have the general form a ( d b a )m . which is necessarily the canonical code word. • However. an edge attribute / label. depthﬁrst or a breadthﬁrst traversal). The depth of a tree is the depth of its deepest vertex. called the root. the number of vertices of the tree. if there is no distinguished vertex). 467 Christian Borgelt Frequent Pattern Mining 468 Christian Borgelt Frequent Pattern Mining . The root vertex itself has depth 0. However. • Therefore rightmost path extension (for a depthﬁrst traversal) and maximum source extension (for a breadthﬁrst traversal) obviously provide a canonical extension rule for rooted ordered trees. the order may be deﬁned on the outgoing edges only. • A tree is called ordered if for each vertex there exists an order on its incident edges. Christian Borgelt Frequent Pattern Mining 465 • A diameter path of a graph is a path having a length that is the diameter of the graph. Note that in a tree there is exactly one path connecting two vertices. Rooted Unordered Trees • Rooted unordered trees can most conveniently be described by socalled preorder code words. • Preorder code words are closely related to spanning trees that are constructed with a depthﬁrst search. • In a rooted tree the depth of a vertex is its distance from the root vertex. • The length of a path is the number of its edges. once a traversal order for the spanning tree is ﬁxed (for example. because a preorder traversal is a depthﬁrst traversal. • In addition. The edges are listed in the order in which they are visited in a preorder traversal. The source vertex of an edge is the vertex that is closer to the root (smaller depth). a vertex attribute / label. Christian Borgelt Frequent Pattern Mining 466 Rooted Ordered Trees • For rooted ordered trees code words derived from spanning trees can directly be used: the spanning tree is simply the tree itself. • If a tree is not rooted (that is. m = n − 1. There is no need for an explicit test for canonical form.Frequent Tree Mining: Basic Notions • Reminder: A path is a sequence of edges connecting two vertices in a graph. because it is mainly the cycles (which may be present in a general graph) that make it diﬃcult to construct the canonical code word. • The diameter of a graph is the largest distance between any two vertices. Rooted trees are often seen as directed: all edges are directed away from the root. there is only one possible code word. • As a consequence. Frequent Tree Mining: Basic Notions • Reminder: A path is a sequence of edges connecting two vertices in a graph. the root of the spanning tree is ﬁxed: it is simply the root of the rooted ordered tree. it is called free.
Christian Borgelt Frequent Pattern Mining 469 Christian Borgelt d b b a a Rooted Unordered Trees • All possible preorder code words can be obtained from one preorder code word by exchanging substrings of the code word that describe sibling subtrees. the code words can be compared lexicographically. consider the following tree: a b b c c d b d c c d c • By deﬁning an (arbitrary. but ﬁxed) order on the vertex labels and using the standard order of the integer numbers. in this code word the children of the root are exchanged: • The above rooted unordered tree can be described by the code word a 0b 1d 1b 2b 2c 1a 0b 1a 1b • Note that the code word consists of substrings that describe the subtrees: b a 0 b 1 a 1 b 0 b 1 d 1 b 2 b 2 c 1 a a b a c Frequent Pattern Mining a b b a b d b b b c 470 a 0 b 1 d 1 b 2 b 2 c 1 a 0 b 1 a 1 b The subtree strings are separated by a number stating the depth of the parent. the label of the vertex that is farther away from the root). (This shows the advantage of using the vertex depth rather than the vertex index: no renumbering of the vertices is necessary in such a exchange. In rooted trees edge labels can always be combined with the destination vertex label (that is.Rooted Unordered Trees a b d b b c a a b b Rooted Unordered Trees Exchanging code words on the same level exchanges branches/subtrees. we deﬁne the lexicographically greatest code word as the canonical code word.) Rooted Unordered Trees • In order to understand the core problem of obtaining an extension rule for rooted unordered trees. • How can this tree be extended by adding a child to the grey vertex? That is. a 0 b 1 d 1 b 2 b 2 c 1 a 0 b 1 a 1 b For example. • The canonical code word for the tree on the previous slides is a 0b 1d 1b 2c 2b 1a 0b 1b 1a Christian Borgelt Frequent Pattern Mining 471 c d • The canonical code word for this tree results from the shown order of the subtrees: a 0b 1c 2d 2c 1c 2d 2b 0b 1c 2d 2c 1c 2d Any exchange of subtrees leads to a lexicographically smaller code word. (Note that vertex labels are always compared to vertex labels and integers to integers. because these two elements alternate. what label may the child vertex have if the result is to be canonical? Christian Borgelt Frequent Pattern Mining 472 .) • Contrary to the common deﬁnition used in all earlier cases. For simplicity we omit edge labels.
Christian Borgelt Frequent Pattern Mining 473 Christian Borgelt Frequent Pattern Mining 474 Rooted Unordered Trees a b c d c d c b d c c d b c Rooted Unordered Trees • That a possible exchange of subtrees at vertices closer to the root never yield looser restrictions is no accident. • Only if w1 = w3. we observe that the child must not have a label succeeding “c”. showing that w2 provides no looser restriction of w4 than w3. their code words. ◦ w1 ≥ w3. Christian Borgelt Frequent Pattern Mining .t.r. then we also have w3 = w1 ≥ w2. because otherwise exchanging the subtrees of the parent of the grey vertex would yield a lexicographically larger code word: a 0b 1c 2d 2c 1c 2d 2b 0b 1c 2d 2c 1c 2d 2d < a 0b 1c 2d 2c 1c 2d 2b 0b 1c 2d 2d 1c 2d 2c • The subtrees of any vertex must be sorted descendingly w. we observe that the child must not have a label succeeding “b”. the code words w1 and w3 do not already determine the order of the subtrees of the vertex labeled with “a”. However.t. In this case we have w2 ≥ w4. because otherwise an exchange of subtrees at the node labeled “a” would lead to a lexicographically larger code word.Rooted Unordered Trees a b c d c d c b d c c d b c Rooted Unordered Trees a b c d c d c b d c c d b c • In the ﬁrst place. because otherwise exchanging the new vertex with the other child of the grey vertex would yield a lexicographically larger code word: a 0b 1c 2d 2c 1c 2d 2b 0b 1c 2d 2c 1c 2d 2e < a 0b 1c 2d 2c 1c 2d 2b 0b 1c 2d 2c 1c 2e 2d • Generally. Then we know the following relationships between subtree code words: ◦ w1 ≥ w2 and w3 ≥ w4.r. their code words. • Secondly. • Suppose a rooted tree is described by a canonical code word a 0 b 1 w1 1 w2 0 b 1 w3 1 w4 . because otherwise exchanging the subtrees of the root vertex of the tree would yield a lexicographically larger code word: a 0b 1c 2d 2c 1c 2d 2b 0b 1c 2d 2c 1c 2d 2c < a 0b 1c 2d 2c 1c 2d 2c 0b 1c 2d 2c 1c 2d 2b • The subtrees of any vertex must be sorted descendingly w.t.r. their labels. 475 Christian Borgelt Frequent Pattern Mining 476 • Thirdly. because otherwise an exchange of subtrees at the nodes labeled with “b” would lead to a lexicographically larger code word. we observe that the child must not have a label succeeding “d”. the children of a vertex must be sorted descendingly w.
• Even Diameter: The original diameter path represents two paths from the root to two leaves. • If x ≤ y. ◦ Odd Diameter: The edge in the middle of a diameter path is uniquely determined. If a future extension diﬀers from w2. If w has such a suﬃx. • Odd Diameter: The original diameter path represents one path from the root to a leaf in each of the two rooted trees the free tree is split into. the extension is canonical if and only if xa ≤ w2. • If x > y. This vertex can be used as the root of a rooted tree. the extended code word is canonical if and only if a ≤ w3. we obtain the following simple extension rule: • Let w be the canonical code word of the rooted tree to extend and let d be the depth of the rooted tree (that is. • Knowledge of the value of y and the two starting points of the string w1 in w can be maintained in constant time: As long as no two sibling vertices carry the same label. Christian Borgelt Frequent Pattern Mining 479 Free Trees • Main problem of the procedure for growing free trees: The initially grown diameter path must remain identiﬁable. Christian Borgelt Frequent Pattern Mining 480 . where w3 is a string not containing any integer x′ ≤ x. let the considered extension be xa with x ∈ IN0 and a a vertex label. (Otherwise the preﬁx property cannot be guaranteed. let y = d (depth of the tree). • Knowledge of the strings w3 for all possible values of x (0 ≤ x < d) can maintained in constant time: It suﬃces to record the starting points of the substrings that describe the rightmost subtree on each tree level. check whether w has a suﬃx xw3. • Let y be the smallest integer for which w has a suﬃx of the form y w1w2 y w1 with y ∈ IN0 and w1 and w2 strings not containing any y ′ ≤ y (w2 may be empty). otherwise w1 is extended.Rooted Unordered Trees As a consequence. these paths must be the lexicographically smallest and the lexicographically largest path leading to this depth. If w does not possess such a suﬃx. Removing this edge splits the free tree into two rooted trees. free trees of even and odd diameter are treated separately. Christian Borgelt Frequent Pattern Mining 477 Rooted Unordered Trees The discussed extension rule is very eﬃcient: • Comparing the elements of the extension takes constant time (at most one integer and one label need to be compared). In addition. • With this extension rule no subsequent canonical form test is needed.) • In order to solve this problem it is exploited that in the canonical code word for a rooted unordered tree code words describing paths from the root to a leaf vertex are lexicographically increasing if the paths are listed from left to right. To keep them identiﬁable. • Procedure for growing free trees: ◦ First grow a diameter path using the canonical form for sequences. y is set to the depth of the parent. These paths must be the lexicographically smallest paths leading to this depth. ◦ Extend the diameter path into a tree by adding branches. • General ideas for a canonical form for free trees: ◦ Even Diameter: The vertex in the middle of a diameter path is uniquely determined. At most one of these starting points can change with an extension. it is y = d. • Similar to sequences. the depth of the deepest vertex). If a sibling with the same label is added. the extended code word is always canonical. If w does not have such a suﬃx. Christian Borgelt Frequent Pattern Mining 478 Free Trees • Free trees can be handled by combining the ideas of how to handle sequences and rooted unordered trees. it is y = d. w1 = a occurs at the position of the w3 for y and at the extension vertex label.
helsinki. • Additional ﬁltering is important to single out the relevant patterns. ◦ For general graphs.net/sam.borgelt. Software Software for frequent pattern mining can be found at • my web site: http://www.borgelt.html ◦ ◦ ◦ ◦ ◦ ◦ Apriori Eclat FPGrowth RElim SaM MoSS http://www. (Discard noncanonical code words. The web site oﬀers all source code. That is: No superpattern of an infrequent pattern is frequent. Christian Borgelt Frequent Pattern Mining 483 Christian Borgelt Frequent Pattern Mining 484 . restricted extensions allow to reduce the number of actual canonical form tests considerably.html http://www. there is no order on adjacent vertices. process only canonical ones.html http://www.cs.borgelt.borgelt.net/eclat. 2001. trees. [Luccio et al. several data sets.html • the Frequent Item Set Mining Implementations (FIMI) Repository http://fimi. • Rooted unordered trees ◦ The root is ﬁxed. • A core ingredient of the search is a canonical form of the type of pattern.fi/ This repository was set up with the contributions to the FIMI workshops in 2003 and 2004.borgelt. ◦ Both rightmost path extension and maximum source extension obviously provide a canonical extension rule for rooted ordered trees. and graphs.net/fpgrowth. • Frequent pattern mining algorithms prune with the Apriori property: ∀P : ∀S ⊃ P : sD (P ) < smin → sD (S) < smin.) ◦ It is desirable that the canonical form possesses the preﬁx property. and the results of the competition. sequences.borgelt.net/apriori.borgelt.net/fpm. ◦ Except for general graphs there exist perfect extension rules.Summary Frequent Tree Mining • Rooted ordered trees ◦ The root is ﬁxed and the order of the children of each vertex is ﬁxed.html http://www. ◦ There exists a canonical extension rule based on sorted preorder strings (constant time for ﬁnding allowed extensions).html http://www.net/relim. but there is no order of the children. 2004] • Free trees ◦ No node is ﬁxed as the root. where each submission had to be accompanied by the source code of an implementation. ◦ Purpose: ensure that each possible pattern is processed at most once. ◦ There exists a canonical extension rule based on depth sequences (constant time for ﬁnding allowed extensions) [Nijssen and Kok 2004] Summary Frequent Pattern Mining Christian Borgelt Frequent Pattern Mining 481 Christian Borgelt Frequent Pattern Mining 482 Summary Frequent Pattern Mining • Possible types of patterns: item sets.net/moss.html http://www.
This action might not be possible to undo. Are you sure you want to continue?
We've moved you to where you read on your other device.
Get the full title to continue reading from where you left off, or restart the preview.