
ECML PKDD 2011
September 5–9, 2011

The Minimum Code Length for Clustering Using the Gray Code

Mahito SUGIYAMA†,‡, Akihiro YAMAMOTO†
† Kyoto University
‡ JSPS Research Fellow
Contributions
1. The MCL (Minimum Code Length)
   – A new measure to score clustering results
   – Needed to distinguish each cluster under some fixed encoding scheme for real-valued variables
2. COOL (COding-Oriented cLustering)
   – A general clustering approach
   – Always finds the best clusters (i.e., the global optimal solution) minimizing the MCL, in O(nd)
   – Parameter tuning is not needed
3. G-COOL (COOL with the Gray code)
   – Achieves internal cohesion and external isolation
   – Finds arbitrarily shaped clusters
Demonstration (Synthetic Dataset)
[Figures: clustering results of G-COOL and K-means on a synthetic data set over [0, 1]²]
Results (Real datasets)
[Images: original geospatial images and their binary filterings, together with the clustering results of G-COOL and K-means — Delta, Dragon, Europe, Norway, Ganges]
Outline
0. Overview
1. Background and Our Strategy
2. MCL and Clustering
3. COOL Algorithm
4. G-COOL: COOL with the Gray Code
5. Experiments
6. Conclusion
Clustering Focusing on Compression
• The MDL approach [Kontkanen et al., 2005]
  – The data encoding has to be optimized
    ∘ All encoding schemes are (implicitly) considered
    ∘ Time complexity ⩾ O(n²)
• The Kolmogorov complexity approach [Cilibrasi and Vitányi, 2005]
  – Measures the distance between data points based on compression of finite sequences
    ∘ Difficult to apply to multivariate data
  – The actual clustering process is traditional agglomerative hierarchical clustering
    ∘ Time complexity ⩾ O(n²)
• Neither approach is suitable for massive data
Our Strategy
• Requirements:
  1. Fast: linear in the data size
  2. Robust to changes in input parameters
  3. Can find arbitrarily shaped clusters
• Solutions:
  1. Fix an encoding scheme for continuous variables
     – Motivated by Computable Analysis [Weihrauch, 2000]
  2. Clustering = discretizing real-valued data
     – Always finds the best results w.r.t. the MCL
  3. Use the Gray code for real numbers [Tsuiki, 2002]
     – Discretized data points overlap, and adjacent clusters are merged
MCL (Minimum Code Length)
• The MCL is the code length of the maximally compressed clusters under a fixed encoding scheme
• The MCL is calculated in O(nd) using radix sort
  – n and d are the number of data points and the dimension, respectively
Example: X = {0.1, 0.2, 0.8, 0.9},
  𝒫₁ = {{0.1, 0.2}, {0.8, 0.9}},
  𝒫₂ = {{0.1}, {0.2, 0.8}, {0.9}}
  – Use the binary encoding
  – Which partition is preferred?
Binary Encoding
[Figure: the binary expansion refines [0, 1] by halving at each position; e.g., 0.1 → 00011…, 0.2 → 00110…, 0.8 → 11001…, 0.9 → 11100…]
MCL with Binary Encoding
[Figure: points A–D on [0, 1] with the binary discretization levels]
  id  value
  A   0.1
  B   0.2
  C   0.8
  D   0.9
• For 𝒫₁ = {{A, B}, {C, D}}: at level 1, the prefix 0 discriminates {A, B} and the prefix 1 discriminates {C, D}, so MCL(𝒫₁) = 1 + 1 = 2
• For 𝒫₂ = {{A}, {B, C}, {D}}: level-3 prefixes (000, 001, 110, 111) are needed to separate B from A and C from D, so MCL(𝒫₂) = 3 · 4 = 12
• Hence 𝒫₁ is preferred
Definition of MCL
• Fix an embedding γ : ℝᵈ → Σ^ω (usually Σ = {0, 1})
• For p ∈ range(γ) and P ⊂ range(γ), define
  Φ(p | P) ≔ { w ∈ Σ* | p ∈ ↑w, and P ∩ ↑v = ∅ for all v such that |v| = |w| and p ∈ ↑v }
  – Each element of Φ(p | P) is a prefix that discriminates p from P
• For a partition 𝒫 = {C₁, …, C_K} of a data set X,
  MCL(𝒫) ≔ ∑_{i ∈ {1, …, K}} L_i(𝒫), where
  L_i(𝒫) ≔ min { |W| : γ(C_i) ⊆ ↑W and W ⊆ ⋃_{x ∈ C_i} Φ(γ(x) | γ(X ∖ C_i)) }
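To make the definition concrete, here is a minimal Python sketch of the MCL for one-dimensional data under the binary encoding. The helper names (`binary_code`, `shortest_prefix`, `mcl`) and the greedy choice of one shortest discriminating prefix per point are our own illustration, not the authors' implementation (which was in R):

```python
def binary_code(x, depth):
    """First `depth` bits of the binary expansion of x in [0, 1)."""
    bits = []
    for _ in range(depth):
        x *= 2
        b = int(x)
        bits.append(str(b))
        x -= b
    return "".join(bits)

def shortest_prefix(x, others, depth):
    """Shortest binary prefix of x whose cell contains no point of `others`."""
    code = binary_code(x, depth)
    for k in range(1, depth + 1):
        w = code[:k]
        if all(not binary_code(y, depth).startswith(w) for y in others):
            return w
    raise ValueError("increase depth: some points are not discriminated")

def mcl(partition, depth=20):
    """MCL of a partition of 1-D points in [0, 1) under the binary encoding."""
    X = [x for C in partition for x in C]
    total = 0
    for C in partition:
        others = [y for y in X if y not in C]
        W = {shortest_prefix(x, others, depth) for x in C}
        # a prefix subsumed by a shorter one already in W is redundant
        W = {w for w in W if not any(v != w and w.startswith(v) for v in W)}
        total += sum(len(w) for w in W)
    return total

mcl([[0.1, 0.2], [0.8, 0.9]])    # 𝒫₁ → 2
mcl([[0.1], [0.2, 0.8], [0.9]])  # 𝒫₂ → 12
```

On the earlier example this reproduces MCL(𝒫₁) = 2 and MCL(𝒫₂) = 12.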
Minimizing MCL and Clustering
Clustering under the MCL criterion is to find the global optimal solution that minimizes the MCL
  – Find 𝒫_op such that 𝒫_op ∈ argmin_{𝒫 ∈ 𝔓(X)_⩾K} MCL(𝒫),
    where 𝔓(X)_⩾K = { 𝒫 | 𝒫 is a partition of X with #𝒫 ⩾ K }
• We give the lower bound K on the number of clusters as an input parameter
  – 𝒫_op becomes the single set {X} without this assumption
Optimization by COOL
• COOL solves the optimization problem in O(nd)
  – n and d are the number of data points and the dimension, respectively
    ∘ The naive approach takes exponential time and space
  – The computation of the MCL becomes the clustering process itself, via discretization
• COOL is level-wise: it builds the level-k partition 𝒫_k for k = 1, 2, …, which satisfies the following condition:
  – For all x, y ∈ X, they are in the same cluster ⟺ v = w for some v ⊏ γ(x) and w ⊏ γ(y) with |v| = |w| = k
    ∘ Level-k partitions form a hierarchy
    ∘ For C ∈ 𝒫_k, there exists 𝒟 ⊆ 𝒫_{k+1} such that ⋃𝒟 = C
• For all C ∈ 𝒫_op, there exists k such that C ∈ 𝒫_k
COOL with Binary Encoding
[Figure: points A–E on [0, 1] with the binary discretization levels]
  id  value   Lv. 1  Lv. 2  Lv. 3
  A   0.113   0      00     00
  B   0.398   0      01     01
  C   0.526   1      10     100
  D   0.701   1      10     101
  E   0.796   1      11     11
• Level 1: clusters {A, B} and {C, D, E}; MCL = 1 + 1 = 2
• Level 2: clusters {A}, {B}, {C, D}, {E}; MCL = 2 · 4 = 8
• Level 3: all five points are separated; MCL = 12 (prefix lengths 2 + 2 + 3 + 3 + 2)
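The level-wise discretization can be sketched in a few lines of Python for the 1-D case: points sharing the same length-k binary prefix fall into the same cell. The helper `level_k_partition` is our own illustrative name; it reproduces the refinement the five points A–E undergo from level 1 to level 3:

```python
from collections import defaultdict

def binary_code(x, k):
    """First k bits of the binary expansion of x in [0, 1)."""
    bits = []
    for _ in range(k):
        x *= 2
        b = int(x)
        bits.append(str(b))
        x -= b
    return "".join(bits)

def level_k_partition(X, k):
    """Group points that share the same length-k binary prefix (one cell each)."""
    cells = defaultdict(list)
    for x in X:
        cells[binary_code(x, k)].append(x)
    return list(cells.values())

X = [0.113, 0.398, 0.526, 0.701, 0.796]  # the points A–E from the slide
sizes = [len(level_k_partition(X, k)) for k in (1, 2, 3)]  # → [2, 4, 5]
```

Each level refines the previous one, which is the hierarchy property COOL exploits.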
Noise Filtering by COOL
• Noise filtering is easily implemented in COOL
• Define 𝒫_⩾N ≔ {C ∈ 𝒫 | #C ⩾ N} for a partition 𝒫
  – A cluster C is regarded as noise if #C < N
• Example: given 𝒫 = {{0.1}, {0.4, 0.5, 0.6}, {0.9}},
  𝒫_⩾2 = {{0.4, 0.5, 0.6}}, and 0.1 and 0.9 are noise
• We give the lower bound N on the cluster size as an input parameter
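The filter 𝒫_⩾N is a one-line size test; a minimal sketch (the function name `filter_noise` is our own):

```python
def filter_noise(P, N):
    """Split a partition into clusters of size >= N and the remaining noise points."""
    clusters = [C for C in P if len(C) >= N]
    noise = [x for C in P if len(C) < N for x in C]
    return clusters, noise

clusters, noise = filter_noise([[0.1], [0.4, 0.5, 0.6], [0.9]], N=2)
# clusters == [[0.4, 0.5, 0.6]], noise == [0.1, 0.9]
```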
Algorithm of COOL
Input: a data set X, two lower bounds K and N
Output: the optimal partition 𝒫_op and noises

function COOL(X, K, N)
 1: Find partitions 𝒫₁^⩾N, …, 𝒫_m^⩾N such that ‖𝒫_{m−1}^⩾N‖ < K ⩽ ‖𝒫_m^⩾N‖
 2: (𝒫_op, MCL(𝒫_op)) ← FINDCLUSTERS(X, K, {𝒫₁^⩾N, …, 𝒫_m^⩾N})
 3: return (𝒫_op, X ∖ ⋃𝒫_op)

function FINDCLUSTERS(X, K, {𝒫₁, …, 𝒫_m})
 1: Find k such that ‖𝒫_{k−1}‖ < K and ‖𝒫_k‖ ⩾ K
 2: 𝒫_op ← 𝒫_k
 3: if K = 2 then return (𝒫_op, MCL(𝒫_op))
 4: for each C in 𝒫₁ ∪ … ∪ 𝒫_{k−1}
 5:   (𝒫, L) ← FINDCLUSTERS(X ∖ C, K − 1, {𝒫₁, …, 𝒫_k})
 6:   if MCL(𝒫 ∪ {C}) < MCL(𝒫_op) then 𝒫_op ← {C} ∪ 𝒫
 7: return (𝒫_op, MCL(𝒫_op))
Gray Code
• Real numbers in [0, 1] are encoded with 0, 1, and ⊥
  Binary: 0.1 → 00011…, 0.25 → 00111…
  Gray:   0.1 → 00010…, 0.25 → 0⊥100…
• Originally, an alternative binary encoding of the natural numbers
  – Especially important in applications converting between analog and digital information [Knuth, 2005]
The Gray code embedding is an injection γ_G that maps x ∈ [0, 1] to an infinite sequence p₀p₁p₂…, where
  – p_i ≔ 1 if 2^{−i}m − 2^{−(i+1)} < x < 2^{−i}m + 2^{−(i+1)} for an odd m,
    p_i ≔ 0 if the same holds for an even m, and
    p_i ≔ ⊥ if x = 2^{−i}m − 2^{−(i+1)} for some integer m
  – For a vector x = (x₁, …, x_d), γ_G(x) = p¹₁ … p^d₁ p¹₂ … p^d₂ …
Gray Code Embedding
[Figure: positions of the Gray-code digits on [0, 1]; e.g., 0.1 → 00010…, 0.25 → 0⊥100…, 0.8 → 10101…]
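The digit rule can be generated by iterating the tent map: each digit records which half of [0, 1] the current value lies in, with ⊥ exactly at the midpoint. A minimal sketch (the helper name `gray_code` is our own):

```python
def gray_code(x, depth):
    """First `depth` digits of the Gray-code expansion of x in [0, 1]."""
    digits = []
    for _ in range(depth):
        if x == 0.5:
            digits.append("⊥")  # x sits exactly on a cell boundary
        elif x > 0.5:
            digits.append("1")
        else:
            digits.append("0")
        x = 2 * x if x <= 0.5 else 2 * (1 - x)  # tent map
    return "".join(digits)

gray_code(0.1, 5)   # "00010"
gray_code(0.25, 5)  # "0⊥100"
gray_code(0.8, 5)   # "10101"
```

These match the examples on the slide.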
COOL with Gray Code (G-COOL)
[Figure: the same points A–E, now discretized with the Gray code; the cell ↑⊥1 = (0.25, 0.75) overlaps ↑0 and ↑1]
  id  value   Lv. 1    Lv. 2
  A   0.113   0        00
  B   0.398   0, ⊥1    01, ⊥10
  C   0.526   1, ⊥1    10, ⊥10
  D   0.701   1, ⊥1    10, 1⊥1
  E   0.796   1        11, 1⊥1
• Level 1: the overlapping cell ⊥1 glues B, C, and D to their neighbors, giving the clusters {A} and {B, C, D, E} with MCL = 1 · 2 = 2
• Level 2: B and C share the prefix ⊥10, and D and E share 1⊥1, so the clusters {A}, {B, C}, {D, E} are found with MCL = 2 · 3 = 6
• For contrast, the binary encoding at level 1 also attains MCL = 2, but it groups the points as {A, B} and {C, D, E}, splitting B from its close neighbor C
Theoretical Analysis of G-COOL
• Use the Gray code as the fixed encoding in COOL
  – It achieves internal cohesion and external isolation
• Theorem: For the level-k partition 𝒫_k, x, y ∈ X are in the same cluster if d_∞(x, y) < 2^{−(k+1)}
  – Thus x and y are in different clusters only if d_∞(x, y) ⩾ 2^{−(k+1)}
  – d_∞(x, y) = max_{i ∈ {1, …, d}} |x_i − y_i| (the L_∞ metric)
    ∘ Two adjacent intervals overlap, and they are agglomerated
• Corollary: In the optimal partition 𝒫_op, for all x ∈ C (C ∈ 𝒫_op), its nearest neighbor y ∈ C
  – y is a nearest neighbor of x ⟺ y ∈ argmin_{y ∈ X ∖ {x}} d_∞(x, y)
Demonstration of G-COOL
[Figures: clustering results on [0, 1]² of G-COOL and of COOL with the binary encoding]
Experimental Methods
• Analyze G-COOL empirically on synthetic and real datasets, compared to DBSCAN and K-means
  – Synthetic datasets were generated by the R package clusterGeneration [Qiu and Joe, 2006]
    ∘ n = 1,500 for each cluster and d = 3
  – Real datasets were geospatial images from Earth-as-Art
    ∘ Reduced to 200 × 200 pixels and translated into binary images
  – All data were normalized by min-max normalization
• G-COOL was implemented in R (version 2.12.1)
• Internal and external measures were used
  – Internal: MCL, connectivity, Silhouette width
  – External: adjusted Rand index
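Min-max normalization rescales each coordinate linearly onto [0, 1]; a one-function sketch (ours, not the authors' R code):

```python
def min_max_normalize(values):
    """Linearly rescale a list of values onto [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

min_max_normalize([2.0, 4.0, 10.0])  # → [0.0, 0.25, 1.0]
```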
Results (Synthetic datasets)
[Plots: MCL, connectivity, Silhouette width, runtime, and adjusted Rand index of G-COOL, DBSCAN, and K-means, as functions of the number of clusters (2–6) and of the noise parameter N (0–10); data show mean ± s.e.m.; each experiment was performed 20 times]
Results (Real datasets)
  Name    n       K   Running time (s)    MCL
                      GC      KM          GC     KM
  Delta   20748   4   1.138   0.012       4010   4922
  Dragon  29826   2   0.393   0.026       3906   7166
  Europe  17380   6   2.404   0.041       2320   12210
  Norway  22771   3   0.746   0.026       1820   6114
  Ganges  18019   6   0.393   0.026       2320   12326
  GC: G-COOL, KM: K-means
Conclusion
• Clustering and its evaluation are integrated in a coding-oriented manner
  – An effective solution to two essential problems: how to measure the goodness of results, and how to find good clusters
    ∘ No distance calculation and no assumed data distribution
• Key ideas:
  1. Fixing an encoding scheme for real-valued variables
     – Introduced the MCL, which focuses on compression of clusters
     – Formulated clustering with the MCL, and constructed COOL, which finds the global optimal solution in linear time
  2. The Gray code
     – Showed the efficiency and effectiveness of G-COOL both theoretically and experimentally
Appendix
Notation (/)
• A datum x ∈ ℝ
d
, a data set X ¬ }x
l
, …, x
n
}
– #X ls the number of elements ln X
– X ⧵ Y ls the relatlve complement of Y ln X
• Clusterlng ls partltlon of X lnto K subsets (clusters) C
l
, …, C
K
– C
i
≠ ∅and C
i
∩ C
j
¬ ∅
– we call ¬ }C
l
, …, C
K
} a partltlon of X
– (X) ¬ } ∣ ls a partltlon of X}
• The set of finlte and lnfinlte sequences over an alphabet Σ are
denoted by Σ

and Σ
ω
, resp.
– The length |w| ls the number of symbols other than ⊥
∘ |f w ¬ 11⊥100⊥⊥…, then |w| ¬ 5
– Por a set of sequences W, |W| ¬ ∑
w∈W
|w|
A-z/A-+(
Notation (/)
• An embeddlng of ℝ
d
ls an ln[ectlve functlon γ fromℝ
d
to Σ
ω
• Por p, q ∈ Σ
ω
, define p ⩽ q lf p
i
¬ q
i
for all i wlth p
i
≠ ⊥
– |ntultlvely, q ls more concrete than p
• Por w ∈ Σ

, we wrlte w ⊏ p lf w⊥
ω
⩽ p
– ↑w ¬ }p ∈ range(γ) ∣ w ⊏ p} for w ∈ Σ

– ↑W ¬ }p ∈ range(γ) ∣ w ⊏ p for some w ∈ W} for W ⊆ Σ

• The followlng monotonlclty holds
– γ
−l
(↑v) ⊆ γ
−l
(↑w) lff v⊥
ω
⩾ w⊥
ω
A-¸/A-+(
Optimization by COOL
• The optimal partition 𝒫_op can be constructed from the level-k partitions
  – For all C ∈ 𝒫_op, there exists k such that C ∈ 𝒫_k
• The level-k partitions have a hierarchical structure
  – For each C ∈ 𝒫_k we have ⋃𝒟 = C for some 𝒟 ⊆ 𝒫_{k+1}
  – COOL is similar to divisive hierarchical clustering
• COOL always outputs the global optimal partition 𝒫_op
• The time complexity is O(nd) in the best case and O(nd + K!) in the worst case
  – Usually K ≪ n holds, hence O(nd)
Clustering Process of COOL
[Figures: step-by-step clustering of a 2-D data set on [0, 1]²; with K = 2 the level-1 cells suffice, while with K = 5 COOL descends to level 2 and merges the occupied cells into the final clusters]
The Multi-Dimensional Gray Code
• Use the wrapping function φ(p¹, …, p^d) ≔ p¹₁ … p^d₁ p¹₂ … p^d₂ …
• Define the d-dimensional Gray code embedding γ_G^d : ℐ → Σ^ω_{⊥,d} by
  γ_G^d(x₁, …, x_d) ≔ φ(γ_G(x₁), …, γ_G(x_d))
• We omit the d of γ_G^d when it is clear from the context
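The wrapping function φ is a round-robin interleaving of the d digit sequences; a minimal sketch over finite prefixes (the helper name `wrap` is ours):

```python
def wrap(*codes):
    """φ: interleave d digit strings, taking the first digit of each, then the second, and so on."""
    return "".join("".join(column) for column in zip(*codes))

wrap("0101", "1100")  # digits p¹₁ p²₁ p¹₂ p²₂ … → "01110010"
```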
Internal Measures
• Connectivity [Handl et al., 2005]
  – Conn(𝒫) = ∑_{x ∈ X} ∑_{i=1}^{M} f(x, nn(x, i)) / i
    ∘ nn(x, i) is the i-th nearest neighbor of x; f(x, y) is 0 if x and y belong to the same cluster, and 1 otherwise
    ∘ M is an input parameter (we set it to 10)
  – Takes values in [0, ∞); should be minimized
• Silhouette width
  – The average of the Silhouette value S(x) over all x:
    S(x) = (b(x) − a(x)) / max(a(x), b(x))
    ∘ a(x) = (#C)^{−1} ∑_{y ∈ C} d(x, y) for x ∈ C
    ∘ b(x) = min_{D ∈ 𝒫 ∖ {C}} (#D)^{−1} ∑_{y ∈ D} d(x, y)
  – Takes values in [−1, 1]; should be maximized
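The Silhouette value follows directly from the two averages; a 1-D sketch using the formulas above (the function name is our own, and we use |x − y| as the distance d):

```python
def silhouette_value(x, C, P):
    """S(x) = (b(x) - a(x)) / max(a(x), b(x)) for x in cluster C of partition P (1-D points)."""
    a = sum(abs(x - y) for y in C) / len(C)          # mean distance within x's own cluster
    b = min(sum(abs(x - y) for y in D) / len(D)      # mean distance to the nearest other cluster
            for D in P if D is not C)
    return (b - a) / max(a, b)

P = [[0.0, 1.0], [10.0, 11.0]]
silhouette_value(0.0, P[0], P)  # (10.5 - 0.5) / 10.5 ≈ 0.952
```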
External Measures
• Adjusted Rand index
  – Let the result be 𝒫 = {C₁, …, C_K} and the correct partition be 𝒟 = {D₁, …, D_M}
  – Let n_ij ≔ #{x ∈ X | x ∈ C_i and x ∈ D_j}. Then the index is
    [ ∑_{i,j} C(n_ij, 2) − ( ∑_i C(#C_i, 2) · ∑_j C(#D_j, 2) ) / C(n, 2) ]
    / [ ½ ( ∑_i C(#C_i, 2) + ∑_j C(#D_j, 2) ) − ( ∑_i C(#C_i, 2) · ∑_j C(#D_j, 2) ) / C(n, 2) ],
    where C(a, 2) denotes the binomial coefficient "a choose 2"
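The formula above translates line by line into Python; a minimal sketch with partitions given as lists of point IDs (the function name is our own illustration):

```python
from math import comb

def adjusted_rand_index(P, D):
    """Adjusted Rand index between two partitions given as lists of point-ID lists."""
    n = sum(len(C) for C in P)
    index = sum(comb(len(set(Ci) & set(Dj)), 2) for Ci in P for Dj in D)
    sum_p = sum(comb(len(Ci), 2) for Ci in P)
    sum_d = sum(comb(len(Dj), 2) for Dj in D)
    expected = sum_p * sum_d / comb(n, 2)
    max_index = (sum_p + sum_d) / 2
    return (index - expected) / (max_index - expected)

adjusted_rand_index([[1, 2], [3, 4]], [[1, 2], [3, 4]])  # identical partitions → 1.0
```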
Discussion
• Results for synthetic datasets
  – Best performance under the internal measures
  – (Nearly) best performance under the external measure
  – G-COOL is efficient and effective
    ∘ DBSCAN is sensitive to its input parameters
  – The MCL works well as an internal measure
• Results for real datasets
  – Neither clearly good nor bad
    ∘ The original images contain no clear clusters
• G-COOL is a good clustering method
Related Work
• Partitional methods [Chaoji et al., 2009]
• Mass-based methods [Ting and Wells, 2010]
• Density-based methods (DBSCAN [Ester et al., 1996])
• Hierarchical clustering methods (CURE [Guha et al., 1998], CHAMELEON [Karypis et al., 1999])
• Grid-based methods (STING [Wang et al., 1997], WaveCluster [Sheikholeslami et al., 1998])
Future Work
• Speeding up by using tree structures such as BDDs
• Application to anomaly detection
• Theoretical analysis, in particular the relation to Computable Analysis
  – Admissibility is a key property
References
[Berkhin, 2006] P. Berkhin. A survey of clustering data mining techniques. Grouping Multidimensional Data, pages 25–71, 2006.
[Chaoji et al., 2009] V. Chaoji, M. A. Hasan, S. Salem, and M. J. Zaki. SPARCL: An effective and efficient algorithm for mining arbitrary shape-based clusters. Knowledge and Information Systems, 21(2):201–229, 2009.
[Cilibrasi and Vitányi, 2005] R. Cilibrasi and P. M. B. Vitányi. Clustering by compression. IEEE Transactions on Information Theory, 51(4):1523–1545, 2005.
[Ester et al., 1996] M. Ester, H. P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of KDD-96, pages 226–231, 1996.
[Guha et al., 1998] S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. Information Systems, 26(1):35–58, 1998.
[Handl et al., 2005] J. Handl, J. Knowles, and D. B. Kell. Computational cluster validation in post-genomic data analysis. Bioinformatics, 21(15):3201–3212, 2005.
[Karypis et al., 1999] G. Karypis, E.-H. Han, and V. Kumar. CHAMELEON: Hierarchical clustering using dynamic modeling. Computer, 32(8):68–75, 1999.
[Kontkanen and Myllymäki, 2008] P. Kontkanen and P. Myllymäki. An empirical comparison of NML clustering algorithms. In Proceedings of Information Theory and Statistical Learning, 2008.
[Kontkanen et al., 2005] P. Kontkanen, P. Myllymäki, W. Buntine, J. Rissanen, and H. Tirri. An MDL framework for data clustering. In P. Grünwald, I. J. Myung, and M. Pitt, editors, Advances in Minimum Description Length: Theory and Applications. MIT Press, 2005.
[Knuth, 2005] D. E. Knuth. The Art of Computer Programming, Volume 4, Fascicle 2: Generating All Tuples and Permutations. Addison-Wesley Professional, 2005.
[Qiu and Joe, 2006] W. Qiu and H. Joe. Generation of random clusters with specified degree of separation. Journal of Classification, 23:315–334, 2006.
[Sheikholeslami et al., 1998] G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution clustering approach for very large spatial databases. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 428–439, 1998.
[Ting and Wells, 2010] K. M. Ting and J. R. Wells. Multi-dimensional mass estimation and mass-based clustering. In Proceedings of the 10th IEEE International Conference on Data Mining, pages 511–520, 2010.
[Tsuiki, 2002] H. Tsuiki. Real number computation through Gray code embedding. Theoretical Computer Science, 284(2):467–485, 2002.
[Wang et al., 1997] W. Wang, J. Yang, and R. Muntz. STING: A statistical information grid approach to spatial data mining. In Proceedings of the 23rd International Conference on Very Large Data Bases, pages 186–195, 1997.
[Weihrauch, 2000] K. Weihrauch. Computable Analysis. Springer, 2000.