
ECML PKDD 2011
September 9, 2011
The Minimum Code Length for Clustering Using the Gray Code
Mahito SUGIYAMA, Akihiro YAMAMOTO
Kyoto University
JSPS Research Fellow
Contributions
1. The MCL (Minimum Code Length)
A new measure to score clustering results
The code length needed to distinguish each cluster under some fixed encoding scheme for real-valued variables
2. COOL (COding-Oriented cLustering)
A general clustering approach
Always finds the best clusters (i.e., the global optimal solution), which minimize the MCL, in O(nd)
Parameter tuning is not needed
3. G-COOL (COOL with the Gray code)
Achieves internal cohesion and external isolation
Finds arbitrarily shaped clusters
Demonstration (Synthetic Dataset)
[Figure: a synthetic dataset on the unit square and its clusterings by G-COOL and by K-means; both axes range from 0 to 1]
Results (Real datasets)
[Figure: original images and their binary-filtered versions, followed by the G-COOL and K-means clustering results, for the datasets Delta, Dragon, Europe, Norway, and Ganges]
Outline
0. Overview
1. Background and Our Strategy
2. MCL and Clustering
3. COOL Algorithm
4. G-COOL: COOL with the Gray Code
5. Experiments
6. Conclusion
Clustering Focusing on Compression
The MDL approach [Kontkanen et al., 2005]
Data encoding has to be optimized
All encoding schemes are (implicitly) considered
The time complexity is $O(n^2)$
The Kolmogorov complexity approach [Cilibrasi, 2005]
Measures the distance between data points based on compression of finite sequences
Difficult to apply to multivariate data
The actual clustering process is traditional agglomerative hierarchical clustering
The time complexity is $O(n^2)$
Both approaches are not suitable for massive data
Our Strategy
Requirements:
1. Fast, and linear in the data size
2. Robust to changes in input parameters
3. Can find arbitrarily shaped clusters
Solutions:
1. Fix an encoding scheme for continuous variables
Motivated by Computable Analysis [Weihrauch, 2000]
2. Clustering = discretizing real-valued data
Always finds the best results w.r.t. the MCL
3. Use the Gray code for real numbers [Tsuiki, 2002]
Discretized data points overlap, and adjacent clusters are merged
MCL (Minimum Code Length)
The MCL is the code length of the maximally compressed clusters under a fixed encoding scheme
The MCL is calculated in O(nd) by using radix sort
n and d are the number of data points and the dimension, resp.
Example: $X = \{0.1, 0.2, 0.8, 0.9\}$,
$\mathcal{C}_1 = \{\{0.1, 0.2\}, \{0.8, 0.9\}\}$
$\mathcal{C}_2 = \{\{0.1\}, \{0.2, 0.8\}, \{0.9\}\}$
Use binary encoding
Which is preferred?
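Binary Encoding
[Figure: binary encoding of the unit interval; the horizontal axis is the value in [0, 1] and the vertical axis is the bit position (0 to 4)]
0.1 → 00011...
0.2 → 00110...
0.8 → 11001...
0.9 → 11100...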
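The codes listed on this slide can be reproduced in a few lines. The following is a minimal sketch, not the authors' implementation: the function name `binary_prefix` is hypothetical, and values are assumed to lie in [0, 1].

```python
def binary_prefix(x: float, k: int) -> str:
    """Return the first k bits of the binary expansion of x in [0, 1]."""
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must lie in [0, 1]")
    if k < 1:
        raise ValueError("k must be at least 1")
    # floor(x * 2^k) is the index of the level-k cell that contains x.
    # x = 1.0 is clamped into the last cell (an assumption of this sketch).
    cell = min(int(x * 2**k), 2**k - 1)
    return format(cell, f"0{k}b")


for value in (0.1, 0.2, 0.8, 0.9):
    print(value, binary_prefix(value, 5))
```

Each additional bit halves the cell that contains the value, which is why a longer shared prefix means two values are closer together.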
MCL with Binary Encoding
id value
A 0.1
B 0.2
C 0.8
D 0.9
[Figure: points A, B, C, D on the unit interval with ticks at 0, 0.25, 0.5, 0.75, 1; the Lv. 1 cells are 0 and 1, the Lv. 2 cells are 00, 01, 10, 11, and the Lv. 3 cells in use are 000, 001, 110, 111]
For the partition {{0.1, 0.2}, {0.8, 0.9}}: MCL = 1 + 1 = 2
For the partition {{0.1}, {0.2, 0.8}, {0.9}}: MCL = 3 × 4 = 12
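Definition of MCL
Fix an embedding $\varphi : \mathbb{R}^d \to \Sigma^\omega$ ($\Sigma = \{0, 1\}$ usually)
For $p \in \mathrm{range}(\varphi)$ and $P \subset \mathrm{range}(\varphi)$, define
$$\Phi(p \mid P) := \{\, w \in \Sigma^* \mid p \in {\uparrow}w \text{ and } P \cap {\uparrow}v = \emptyset \text{ for all } v \in \Sigma^* \text{ such that } |v| = |w| \text{ and } p \in {\uparrow}v \,\}$$
Each element in $\Phi(p \mid P)$ is a prefix that discriminates $p$ from $P$
For a partition $\mathcal{C} = \{C_1, \dots, C_K\}$ of a data set $X$,
$$\mathrm{MCL}(\mathcal{C}) := \sum_{i \in \{1, \dots, K\}} L_i(\mathcal{C}), \quad \text{where}$$
$$L_i(\mathcal{C}) := \min \Big\{\, |W| \;\Big|\; \varphi(C_i) \subseteq {\uparrow}W \text{ and } W \subseteq \bigcup_{x \in C_i} \Phi(\varphi(x) \mid \varphi(X \setminus C_i)) \,\Big\}$$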
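For one-dimensional data under the binary encoding, the definition can be evaluated directly. The sketch below is a naive illustration, not the O(nd) radix-sort procedure mentioned in the talk; the names `binary_prefix` and `mcl_binary` are hypothetical. It relies on the observation that every point has exactly one binary prefix of each length, so taking the shortest separating prefix of each point and counting a shared prefix once should give the minimum in the definition.

```python
def binary_prefix(x, k):
    """First k bits of the binary expansion of x in [0, 1]."""
    return format(min(int(x * 2**k), 2**k - 1), f"0{k}b")


def mcl_binary(partition, max_level=32):
    """Naive MCL of a partition of 1-D points under the binary encoding."""
    total = 0
    for i, cluster in enumerate(partition):
        # All points that lie outside the current cluster.
        others = [y for j, c in enumerate(partition) if j != i for y in c]
        chosen = set()
        for x in cluster:
            # Shortest prefix of x that no outside point shares.
            # Prefixes start at length 1 (an assumption of this sketch).
            for k in range(1, max_level + 1):
                w = binary_prefix(x, k)
                if all(binary_prefix(y, k) != w for y in others):
                    chosen.add(w)
                    break
            else:
                raise ValueError(f"{x} is not separable within {max_level} bits")
        # A prefix shared by several points of the cluster is counted once.
        total += sum(len(w) for w in chosen)
    return total


print(mcl_binary([[0.1, 0.2], [0.8, 0.9]]))    # prints 2
print(mcl_binary([[0.1], [0.2, 0.8], [0.9]]))  # prints 12
```

The two values agree with the worked example for X = {0.1, 0.2, 0.8, 0.9}: the partition with the smaller MCL is the preferred one.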
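Minimizing MCL and Clustering
Clustering under the MCL criterion is to find the global optimal solution that minimizes the MCL
Find $\mathcal{C}_{\mathrm{op}}$ such that
$$\mathcal{C}_{\mathrm{op}} \in \operatorname*{argmin}_{\mathcal{C} \in \mathcal{C}(X)_{\geq K}} \mathrm{MCL}(\mathcal{C}),$$
where $\mathcal{C}(X)_{\geq K} = \{\, \mathcal{C} \text{ is a partition of } X \mid \#\mathcal{C} \geq K \,\}$
We give the lower bound K of the number of clusters as an input parameter
$\mathcal{C}_{\mathrm{op}}$ becomes the single set $\{X\}$ without this assumption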
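The optimization problem can be made concrete by exhaustive search on a tiny data set. This is only a sketch of the naive approach (exponential in the number of points), not COOL itself; all function names are hypothetical, the data are one-dimensional, and the binary encoding is assumed.

```python
def binary_prefix(x, k):
    """First k bits of the binary expansion of x in [0, 1]."""
    return format(min(int(x * 2**k), 2**k - 1), f"0{k}b")


def mcl_binary(partition, max_level=32):
    """Naive MCL: sum of the distinct shortest separating prefixes per cluster."""
    total = 0
    for i, cluster in enumerate(partition):
        others = [y for j, c in enumerate(partition) if j != i for y in c]
        chosen = set()
        for x in cluster:
            for k in range(1, max_level + 1):
                w = binary_prefix(x, k)
                if all(binary_prefix(y, k) != w for y in others):
                    chosen.add(w)
                    break
            else:
                raise ValueError(f"{x} is not separable within {max_level} bits")
        total += sum(len(w) for w in chosen)
    return total


def set_partitions(items):
    """Yield every partition of the list `items` as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in set_partitions(rest):
        # Put `first` into each existing block in turn ...
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        # ... or into a new block of its own.
        yield [[first]] + part


def naive_optimal_partition(data, K):
    """Partition with at least K clusters that minimizes the MCL."""
    candidates = (p for p in set_partitions(list(data)) if len(p) >= K)
    return min(candidates, key=mcl_binary)


best = naive_optimal_partition([0.1, 0.2, 0.8, 0.9], K=2)
print(sorted(sorted(c) for c in best), mcl_binary(best))
# prints [[0.1, 0.2], [0.8, 0.9]] 2
```

The number of candidate partitions grows as the Bell numbers, which is why an exhaustive search like this one is only usable for illustration.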
Optimization by COOL
COOL solves the optimization problem in O(nd)
n and d are the number of data points and the dimension, resp.
The naive approach takes exponential time and space
The computing process of the MCL becomes the clustering process itself via discretization
COOL is level-wise, and makes the level-k partition $\mathcal{C}^k$ for k = 1, 2, ..., which satisfies the following condition:
For all $x, y \in X$, they are in the same cluster if $v = w$ for some $v \sqsubseteq \varphi(x)$ and $w \sqsubseteq \varphi(y)$ with $|v| = |w| = k$
Level-k partitions form a hierarchy
For all $C \in \mathcal{C}^k$, there exists $\mathcal{D} \subseteq \mathcal{C}^{k+1}$ such that $\bigcup \mathcal{D} = C$
For all $C \in \mathcal{C}_{\mathrm{op}}$, there exists k such that $C \in \mathcal{C}^k$
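Under the binary encoding, a level-k partition can be built in one pass by hashing every point to its cell. The sketch below assumes that level k uses the first k bits of every coordinate, which is an interpretation and not something the slides state; `level_k_partition` and the sample values are hypothetical.

```python
from collections import defaultdict


def level_k_partition(points, k):
    """Group points of [0, 1]^d by the level-k binary cell that contains them."""
    cells = defaultdict(list)
    for p in points:
        # Assumption: the cell is given by the first k bits of each coordinate.
        key = tuple(min(int(c * 2**k), 2**k - 1) for c in p)
        cells[key].append(p)
    return list(cells.values())


points = [(0.1,), (0.4,), (0.55,), (0.7,), (0.8,)]  # illustrative values
for k in (1, 2, 3):
    print(k, level_k_partition(points, k))
```

Each level refines the previous one, so the partitions form the hierarchy described above, and one level costs a single pass over the n points and d coordinates.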
COOL with Binary Encoding
[Figure: points A to E on the unit interval with ticks at 0, 0.25, 0.5, 0.75, 1; the cells are 0 and 1 at Lv. 1, 00, 01, 10, 11 at Lv. 2, and 100, 101 at Lv. 3]
id  value  Lv. 1  Lv. 2  Lv. 3
A   0.11   0      00     00
B   0.398  0      01     01
C   0.526  1      10     100
D   0.701  1      10     101
E   0.796  1      11     11
MCL at Lv. 1: 1 + 1 = 2
MCL at Lv. 2: 2 × 4 = 8
MCL at Lv. 3: 6 + 6 = 12
Noise Filtering by COOL
Noise filtering is easily implemented in COOL
Define $\mathcal{C}_{\geq N} := \{\, C \in \mathcal{C} \mid \#C \geq N \,\}$ for a partition $\mathcal{C}$
Regard a cluster C as noise if #C < N
Example: Given $\mathcal{C} = \{\{0.1\}, \{0.4, 0.5, 0.6\}, \{0.9\}\}$, $\mathcal{C}_{\geq 2} = \{\{0.4, 0.5, 0.6\}\}$, and 0.1 and 0.9 are noise
We give the lower bound N of the cluster size as an input parameter
[Figure: the example points on the unit interval with N = 2]
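The filtering rule amounts to splitting a partition by cluster size. A minimal sketch follows, with the hypothetical name `filter_noise`, applied to the example from the slide.

```python
def filter_noise(partition, N):
    """Split a partition into clusters of size >= N and the remaining noise points."""
    kept = [cluster for cluster in partition if len(cluster) >= N]
    noise = [x for cluster in partition if len(cluster) < N for x in cluster]
    return kept, noise


kept, noise = filter_noise([[0.1], [0.4, 0.5, 0.6], [0.9]], N=2)
print(kept)   # prints [[0.4, 0.5, 0.6]]
print(noise)  # prints [0.1, 0.9]
```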
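Algorithm of COOL
Input: A data set $X$, two lower bounds $K$ and $N$
Output: The optimal partition $\mathcal{C}_{\mathrm{op}}$ and noises
function COOL($X$, $K$, $N$)
1: Find partitions $\mathcal{C}^1_{\geq N}, \dots, \mathcal{C}^m_{\geq N}$ such that $\#\mathcal{C}^{m-1}_{\geq N} < K \leq \#\mathcal{C}^m_{\geq N}$
2: $(\mathcal{C}_{\mathrm{op}}, \mathrm{MCL}(\mathcal{C}_{\mathrm{op}})) \leftarrow$ FINDCLUSTERS($X$, $K$, $\{\mathcal{C}^1_{\geq N}, \dots, \mathcal{C}^m_{\geq N}\}$)
3: return $(\mathcal{C}_{\mathrm{op}}, X \setminus \bigcup \mathcal{C}_{\mathrm{op}})$
function FINDCLUSTERS($X$, $K$, $\{\mathcal{C}^1, \dots, \mathcal{C}^m\}$)
1: Find $k$ such that $\#\mathcal{C}^{k-1} < K$ and $\#\mathcal{C}^k \geq K$
2: $\mathcal{C}_{\mathrm{op}} \leftarrow \mathcal{C}^k$
3: if $K = 2$ then return $(\mathcal{C}_{\mathrm{op}}, \mathrm{MCL}(\mathcal{C}_{\mathrm{op}}))$
4: for each $C$ in $\mathcal{C}^1 \cup \dots \cup \mathcal{C}^{k-1}$
5: $(\mathcal{C}, L) \leftarrow$ FINDCLUSTERS($X \setminus C$, $K - 1$, $\{\mathcal{C}^1, \dots, \mathcal{C}^k\}$)
6: if $\mathrm{MCL}(\mathcal{C} \cup \{C\}) < \mathrm{MCL}(\mathcal{C}_{\mathrm{op}})$ then $\mathcal{C}_{\mathrm{op}} \leftarrow \mathcal{C} \cup \{C\}$
7: return $(\mathcal{C}_{\mathrm{op}}, \mathrm{MCL}(\mathcal{C}_{\mathrm{op}}))$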
Gray Code
Real numbers in [0, 1] are encoded with 0, 1, and ⊥
Binary: 0.1 → 00011..., 0.25 → 00111...
Gray: 0.1 → 00010..., 0.25 → 0⊥100...
Originally, another binary encoding of natural numbers
Especially important in applications of conversion between analog and digital information [Knuth, 2005]
The Gray code embedding is an injection $\varphi_G$ that maps $x \in [0, 1]$ to an infinite sequence $p_0 p_1 p_2 \dots$, where
$p_i := 1$ if $2^{-i} m - 2^{-(i+1)} < x < 2^{-i} m + 2^{-(i+1)}$ for an odd $m$, $p_i := 0$ if the same holds for an even $m$, and $p_i := \bot$ if $x = 2^{-i} m - 2^{-(i+1)}$ for some integer $m$
For a vector $x = (x_1, \dots, x_d)$, $\varphi_G(x) = p^1_1 \dots p^d_1 \, p^1_2 \dots p^d_2 \dots$
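The definition of the symbols $p_i$ translates almost literally into code. The sketch below uses exact rational arithmetic so that the boundary case producing ⊥ is detected reliably; the names `gray_symbol` and `gray_prefix` are hypothetical, and passing decimal strings such as "0.1" is a convenience of this sketch.

```python
from fractions import Fraction
from math import floor

BOTTOM = "⊥"


def gray_symbol(x, i):
    """Symbol p_i (i = 0, 1, 2, ...) of the Gray code of x in [0, 1]."""
    # 2^-i m - 2^-(i+1) < x < 2^-i m + 2^-(i+1)  is equivalent to
    # m < 2^i x + 1/2 < m + 1, so m is the floor of t = 2^i x + 1/2.
    t = Fraction(x) * 2**i + Fraction(1, 2)
    if t.denominator == 1:
        # x sits exactly on the boundary 2^-i m - 2^-(i+1).
        return BOTTOM
    return "1" if floor(t) % 2 else "0"


def gray_prefix(x, length):
    """First `length` symbols of the Gray code of x."""
    return "".join(gray_symbol(x, i) for i in range(length))


for value in ("0.1", "0.25", "0.8"):
    print(value, gray_prefix(value, 5))
```

For 0.1, 0.25, and 0.8 this yields 00010, 0⊥100, and 10101, matching the codes shown in the talk.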
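Gray Code Embedding
[Figure: Gray code embedding of the unit interval; the horizontal axis is the value in [0, 1] and the vertical axis is the position (0 to 5)]
0.1 → 00010...
0.25 → 0⊥100...
0.8 → 10101...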
COOL with Gray Code (G-COOL)
[Figure: points A to E on the unit interval with ticks at 0, 0.25, 0.5, 0.75, 1; at each level the ordinary cells are shown together with the overlapping cells whose codes contain ⊥]
id  value  Lv. 1   Lv. 2
A   0.11   0       00
B   0.398  0, ⊥1   01, ⊥10
C   0.526  1, ⊥1   10, ⊥10
D   0.701  1, ⊥1   10, 1⊥1
E   0.796  1       11, 1⊥1
MCL at Lv. 1: 1 × 2 = 2
MCL at Lv. 2: 2 × 3 = 6
COOL with binary encoding, Lv. 1: MCL = 1 + 1 = 2
Theoretical Analysis of G-COOL
Use the Gray code as a fixed encoding in COOL
It achieves internal cohesion and external isolation
Theorem: For the level-k partition $\mathcal{C}^k$, $x, y \in X$ are in the same cluster if $d_\infty(x, y) < 2^{-(k+1)}$
Thus x, y are in different clusters only if $d_\infty(x, y) \geq 2^{-(k+1)}$
$d_\infty(x, y) = \max_{i \in \{1, \dots, d\}} |x_i - y_i|$ ($L_\infty$ metric)
Two adjacent intervals overlap and they are agglomerated
Corollary: In the optimal partition $\mathcal{C}_{\mathrm{op}}$, for all $x \in C$ ($C \in \mathcal{C}_{\mathrm{op}}$), its nearest neighbor $y \in C$
y is a nearest neighbor of x $\iff$ $y \in \operatorname*{argmin}_{y' \in X \setminus \{x\}} d_\infty(x, y')$
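The effect of overlapping intervals can be imitated without building Gray code strings. The sketch below is one reading of the slide, not the authors' algorithm: it assumes that, in each coordinate, the level-k cells are the dyadic intervals of width $2^{-k}$ together with copies shifted by $2^{-(k+1)}$ (the behavior of prefixes containing one ⊥), and it merges points that share a cell with a union-find structure. The name `overlapping_cell_partition` is hypothetical, and the enumeration of $2^d$ cells per point is only practical for small d.

```python
from itertools import product


def d_inf(x, y):
    """L-infinity distance between two points."""
    return max(abs(a - b) for a, b in zip(x, y))


def overlapping_cell_partition(points, k):
    """Merge points that share a base cell or a half-shifted cell at level k."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    owner = {}  # cell key -> index of the first point seen in that cell
    for idx, p in enumerate(points):
        choices = []
        for c in p:
            half = int(c * 2 ** (k + 1))  # index of the cell of width 2^-(k+1)
            choices.append((("base", half // 2), ("shifted", (half - 1) // 2)))
        for key in product(*choices):
            if key in owner:
                parent[find(idx)] = find(owner[key])
            else:
                owner[key] = idx

    clusters = {}
    for idx, p in enumerate(points):
        clusters.setdefault(find(idx), []).append(p)
    return list(clusters.values())


points = [(0.1,), (0.45,), (0.55,), (0.9,)]  # illustrative values
print(overlapping_cell_partition(points, 2))
# prints [[(0.1,)], [(0.45,), (0.55,)], [(0.9,)]]
```

With plain binary cells, 0.45 and 0.55 fall on opposite sides of 0.5 and are never grouped at any level; with the shifted cells they are merged because their distance 0.1 is below $2^{-3}$, in line with the theorem.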
Demonstration of G-COOL
[Figure: clustering of a dataset on the unit square by G-COOL and by COOL with the binary encoding; both axes range from 0 to 1]
Experimental Methods
Analyze G-COOL empirically with synthetic and real datasets, compared to DBSCAN and K-means
Synthetic datasets were generated by the R package clusterGeneration [Qiu and Joe, 2006]
n = 1,500 for each cluster and d = 3
Real datasets were geospatial images from Earth-as-Art, reduced to 200 × 200 pixels and translated into binary images
All data were normalized by min-max normalization
G-COOL was implemented in R (version 2.12.1)
Internal and external measures were used
Internal: MCL, connectivity, Silhouette width
External: adjusted Rand index
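Min-max normalization maps the data into the unit cube on which the encodings are defined. The sketch below assumes that the normalization is applied to each feature separately and that a constant feature is mapped to 0; neither detail is stated in the slides, and `min_max_normalize` is a hypothetical name.

```python
def min_max_normalize(rows):
    """Rescale every column of `rows` linearly onto [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        tuple(
            (v - lo) / (hi - lo) if hi > lo else 0.0  # constant column -> 0.0
            for v, lo, hi in zip(row, lows, highs)
        )
        for row in rows
    ]


print(min_max_normalize([(2.0, 10.0), (4.0, 30.0), (6.0, 20.0)]))
# prints [(0.0, 0.0), (0.5, 1.0), (1.0, 0.5)]
```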
Results (Synthetic datasets)
[Figure: MCL, connectivity, Silhouette width, runtime (s), and adjusted Rand index versus the number of clusters (2, 4, 6) for G-COOL, DBSCAN, and K-means; lower is better for MCL and connectivity, higher is better for Silhouette width and adjusted Rand index]
Data show mean ± s.e.m.
Each experiment was performed 20 times
Results (Synthetic datasets)
[Figure: MCL, connectivity, Silhouette width, runtime (s), and adjusted Rand index of G-COOL versus the noise parameter N (0 to 10); lower is better for MCL and connectivity, higher is better for Silhouette width and adjusted Rand index]
Data show mean ± s.e.m.
Each experiment was performed 20 times
Results (Real datasets)
Name    n            K            Running time (s)          MCL
                                  GC           KM           GC           KM
Delta   20748        4            (illegible)  0.012        4010         4922
Dragon  29826        2            (illegible)  0.026        (illegible)  7166
Europe  (illegible)  6            2.404        0.041        (illegible)  12210
Norway  22771        (illegible)  0.746        0.026        1820         6114
Ganges  18019        6            (illegible)  0.026        (illegible)  (illegible)
GC: G-COOL, KM: K-means
Conclusion
Integrate clustering and its evaluation in the coding-oriented manner
An effective solution for two essential problems: how to measure the goodness of results and how to find good clusters
No distance calculation and no data distribution
Key ideas:
1. Fixing an encoding scheme for real-valued variables
Introduced the MCL, focusing on compression of clusters
Formulated clustering with the MCL, and constructed COOL, which finds the global optimal solution in linear time
2. The Gray code
We showed the efficiency and effectiveness of G-COOL both theoretically and experimentally
Appendix
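Notation (1/2)
A datum $x \in \mathbb{R}^d$, a data set $X = \{x_1, \dots, x_n\}$
#X is the number of elements in X
$X \setminus Y$ is the relative complement of Y in X
Clustering is a partition of X into K subsets (clusters) $C_1, \dots, C_K$
$C_i \neq \emptyset$ and $C_i \cap C_j = \emptyset$ for $i \neq j$
We call $\mathcal{C} = \{C_1, \dots, C_K\}$ a partition of X
$\mathcal{C}(X) = \{\, \mathcal{C} \mid \mathcal{C} \text{ is a partition of } X \,\}$
The sets of finite and infinite sequences over an alphabet $\Sigma$ are denoted by $\Sigma^*$ and $\Sigma^\omega$, resp.
The length |w| is the number of symbols other than $\bot$
If the symbols of w other than $\bot$ are 11100, then |w| = 5
For a set of sequences W, $|W| = \sum_{w \in W} |w|$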
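Notation (2/2)
An embedding of $\mathbb{R}^d$ is an injective function $\varphi$ from $\mathbb{R}^d$ to $\Sigma^\omega$
For $p, q \in \Sigma^\omega$, define $p \leq q$ if $p_i = q_i$ for all $i$ with $p_i \neq \bot$
Intuitively, q is more concrete than p
For $w \in \Sigma^*$, we write $w \sqsubseteq p$ if $w\bot^\omega \leq p$
${\uparrow}w = \{\, p \in \mathrm{range}(\varphi) \mid w \sqsubseteq p \,\}$ for $w \in \Sigma^*$
${\uparrow}W = \{\, p \in \mathrm{range}(\varphi) \mid w \sqsubseteq p \text{ for some } w \in W \,\}$ for $W \subseteq \Sigma^*$
The following monotonicity holds: $\varphi^{-1}({\uparrow}v) \subseteq \varphi^{-1}({\uparrow}w)$ whenever $w$ is a prefix of $v$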
Optimization by COOL
The optimal partition $\mathcal{C}_{\mathrm{op}}$ can be constructed from the level-k partitions
For all $C \in \mathcal{C}_{\mathrm{op}}$, there exists k such that $C \in \mathcal{C}^k$
The level-k partitions have a hierarchical structure
For each $C \in \mathcal{C}^k$ we have $C \supseteq D$ for some $D \in \mathcal{C}^{k+1}$
COOL is similar to divisive hierarchical clustering
COOL always outputs the global optimal partition $\mathcal{C}_{\mathrm{op}}$
The time complexity is O(nd) (best) and O(nd + K!) (worst)
Usually $K \ll n$ holds, hence O(nd)
Clustering Process of COOL
[Figure: step-by-step clustering of a two-dimensional dataset on the unit square, first with K = 2 (Lv. 1) and then with K = 5 (Lv. 1 and Lv. 2), shown together with the resulting cluster hierarchy]
The Multi-Dimensional Gray Code
Use the wrapping function $(p^1, \dots, p^d) \mapsto p^1_1 \dots p^d_1 \, p^1_2 \dots p^d_2 \dots$
Define the d-dimensional Gray code embedding $\varphi^d_G$ by applying the wrapping function to $(\varphi_G(x_1), \dots, \varphi_G(x_d))$ for $(x_1, \dots, x_d)$
We abbreviate d of $\varphi^d_G$ if it is understood from the context
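The wrapping function simply interleaves the per-coordinate sequences symbol by symbol. A minimal sketch on finite prefixes follows; the name `wrap` is hypothetical, and equally long prefixes are assumed.

```python
def wrap(sequences):
    """Interleave equally long prefixes p^1, ..., p^d symbol by symbol."""
    lengths = {len(s) for s in sequences}
    if len(lengths) != 1:
        raise ValueError("all prefixes must have the same length")
    # zip(*sequences) yields (p^1_1, ..., p^d_1), then (p^1_2, ..., p^d_2), ...
    return "".join("".join(symbols) for symbols in zip(*sequences))


# Gray code prefixes of 0.1 and 0.8, combined into the code of (0.1, 0.8).
print(wrap(["00010", "10101"]))  # prints 0100011001
```

Interleaving keeps the symbols of the same level next to each other, so a prefix of the wrapped sequence refines all coordinates at the same rate.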
Internal Measures
Connectivity [Handl et al., 2005]
$\mathrm{Conn}(\mathcal{C}) = \sum_{x \in X} \sum_{i=1}^{M} f(x, \mathrm{nn}(x, i)) / i$
nn(x, i) is the i-th nearest neighbor of x; f(x, y) is 0 if x and y belong to the same cluster, and 1 otherwise
M is an input parameter (we set it to 10)
Takes values from 0 to $\infty$, should be minimized
Silhouette width
The average of the Silhouette value S(x) over all x
$S(x) = (b(x) - a(x)) / \max(b(x), a(x))$
$a(x) = |C|^{-1} \sum_{y \in C} d(x, y)$ ($x \in C$)
$b(x) = \min_{D \neq C} |D|^{-1} \sum_{y \in D} d(x, y)$
Takes values from −1 to 1, should be maximized
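Both internal measures can be computed directly from their formulas. The sketch below is a quadratic-time illustration with hypothetical function names; it uses the Euclidean distance for d(x, y), which the slides do not specify, and it follows the slide in averaging a(x) over the whole cluster including x itself.

```python
from math import dist


def connectivity(points, labels, M=10):
    """Sum over x of f(x, nn(x, i)) / i for the M nearest neighbours of x."""
    total = 0.0
    for i, x in enumerate(points):
        # Indices of all other points, nearest first.
        neighbours = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: dist(x, points[j]),
        )
        for rank, j in enumerate(neighbours[:M], start=1):
            if labels[j] != labels[i]:
                total += 1.0 / rank
    return total


def silhouette_width(points, labels):
    """Average of S(x) = (b(x) - a(x)) / max(b(x), a(x)) over all points."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")

    def mean_dist(x, members):
        return sum(dist(x, y) for y in members) / len(members)

    scores = []
    for x, lab in zip(points, labels):
        a = mean_dist(x, clusters[lab])
        b = min(mean_dist(x, m) for other, m in clusters.items() if other != lab)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


points = [(0.0,), (0.1,), (0.25,), (5.0,), (5.1,), (5.25,)]  # illustrative values
print(connectivity(points, [0, 0, 0, 1, 1, 1], M=2))  # prints 0.0
print(connectivity(points, [0, 0, 1, 1, 1, 1], M=2))  # prints 2.5
```

Assigning the point 0.25 to the wrong group raises the connectivity from 0 to 2.5 and lowers the Silhouette width, which is the behavior both measures are meant to capture.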
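External Measures
Adjusted Rand index
Let the result be $\{C_1, \dots, C_K\}$ and the correct partition be $\{D_1, \dots, D_M\}$
Suppose $n_{ij} := \#\{\, x \in X \mid x \in C_i, x \in D_j \,\}$. Then
$$\frac{\sum_{i,j} \binom{n_{ij}}{2} - \Big( \sum_i \binom{|C_i|}{2} \sum_j \binom{|D_j|}{2} \Big) \Big/ \binom{n}{2}}{2^{-1} \Big( \sum_i \binom{|C_i|}{2} + \sum_j \binom{|D_j|}{2} \Big) - \Big( \sum_i \binom{|C_i|}{2} \sum_j \binom{|D_j|}{2} \Big) \Big/ \binom{n}{2}}$$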
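The adjusted Rand index can be computed from two label sequences by counting pairs. The following sketch uses the hypothetical name `adjusted_rand_index`; returning 1.0 when the denominator vanishes (for example, when both partitions consist of a single cluster) is a convention chosen for this sketch.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(result, truth):
    """Adjusted Rand index between two labelings of the same n points."""
    n = len(result)
    if n != len(truth) or n < 2:
        raise ValueError("need two labelings of the same length n >= 2")
    # n_ij: number of points in cluster i of the result and class j of the truth.
    sum_ij = sum(comb(c, 2) for c in Counter(zip(result, truth)).values())
    sum_i = sum(comb(c, 2) for c in Counter(result).values())
    sum_j = sum(comb(c, 2) for c in Counter(truth).values())
    expected = sum_i * sum_j / comb(n, 2)
    maximum = (sum_i + sum_j) / 2
    if maximum == expected:
        return 1.0  # degenerate case, by the convention of this sketch
    return (sum_ij - expected) / (maximum - expected)


print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # prints 1.0
```

The index depends only on which points are grouped together, so renaming the cluster labels leaves it unchanged.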
Discussion
Results for synthetic datasets
Best performance under the internal measures
(Nearly) best performance under the internal measures
G-COOL is efficient and effective
DBSCAN is sensitive to input parameters
The MCL works well as an internal measure
Results for real datasets
Not good, and not bad
There are no clear clusters originally
G-COOL is a good clustering method
Related Work
Partitional methods [Chaoji et al., 2009]
Mass-based methods [Ting and Wells, 2010]
Density-based methods (DBSCAN [Ester et al., 1996])
Hierarchical clustering methods (CURE [Guha et al., 1998], CHAMELEON [Karypis et al., 1999])
Grid-based methods (STING [Wang et al., 1997], WaveCluster [Sheikholeslami et al., 1998])
Future Work
Speeding up by using tree structures such as BDDs
Application to anomaly detection
Theoretical analysis, in particular the relation with Computable Analysis
Admissibility is a key property
References
[Berkhin, 2006] P. Berkhin. A survey of clustering data mining techniques. Grouping Multidimensional Data, pages 25-71, 2006.
[Chaoji et al., 2009] V. Chaoji, M. A. Hasan, S. Salem, and M. J. Zaki. SPARCL: An effective and efficient algorithm for mining arbitrary shape-based clusters. Knowledge and Information Systems, 21(2):201-229, 2009.
[Cilibrasi and Vitanyi, 2005] R. Cilibrasi and P. M. B. Vitanyi. Clustering by compression. IEEE Transactions on Information Theory, 51(4):1523-1545, 2005.
[Ester et al., 1996] M. Ester, H. P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of KDD, 96, 226-231, 1996.
[Guha et al., 1998] S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. Information Systems, 26(1):35-58, 1998.
[Handl et al., 2005] J. Handl, J. Knowles, and D. B. Kell. Computational cluster validation in post-genomic data analysis. Bioinformatics, 21(15):3201, 2005.
[Karypis et al., 1999] G. Karypis, H. Eui-Hong, and V. Kumar. CHAMELEON: Hierarchical clustering using dynamic modeling. Computer, 32(8):68-75, 1999.
[Kontkanen and Myllymaki, 2008] P. Kontkanen and P. Myllymaki. An empirical comparison of NML clustering algorithms. In Proceedings of Information Theory and Statistical Learning, 2008.
[Kontkanen et al., 2005] P. Kontkanen, P. Myllymaki, W. Buntine, J. Rissanen, and H. Tirri. An MDL framework for data clustering. In P. Grunwald, I. J. Myung, and M. Pitt, editors, Advances in Minimum Description Length: Theory and Applications. MIT Press, 2005.
[Knuth, 2005] D. E. Knuth. The Art of Computer Programming, Volume 4, Fascicle 2: Generating All Tuples and Permutations. Addison-Wesley Professional, 2005.
[Qiu and Joe, 2006] W. Qiu and H. Joe. Generation of random clusters with specified degree of separation. Journal of Classification, 23:315-334, 2006.
[Sheikholeslami et al., 1998] G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution clustering approach for very large spatial databases. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 428-439, 1998.
[Ting and Wells, 2010] K. M. Ting and J. R. Wells. Multi-dimensional mass estimation and mass-based clustering. In Proceedings of 10th IEEE International Conference on Data Mining, pages 511-520, 2010.
[Tsuiki, 2002] Hideki Tsuiki. Real number computation through Gray code embedding. Theoretical Computer Science, 284(2):467-485, 2002.
[Wang et al., 1997] W. Wang, J. Yang, and R. Muntz. STING: A statistical information grid approach to spatial data mining. In Proceedings of the 23rd International Conference on Very Large Data Bases, pages 186-195, 1997.
[Weihrauch, 2000] K. Weihrauch. Computable Analysis. Springer, 2000.