
IEEE ICDM 2011

December 11-14, 2011


A Fast and Flexible Clustering
Algorithm Using Binary Discretization
Mahito SUGIYAMA, Akihiro YAMAMOTO

Kyoto University

JSPS Research Fellow


Main Results
1. BOOL (Binary cOding Oriented cLustering)
A clustering algorithm for multivariate data
to detect arbitrarily shaped clusters
Noise tolerant
Robust to changes in input parameters
2. BOOL is faster than K-means and the fastest among
the algorithms to detect arbitrarily shaped clusters
It is about two to three orders of magnitude faster than
two state-of-the-art algorithms [Chaoji et al. SDM 2011,
Chaoji et al. KAIS 2009] that can detect non-convex
clusters of arbitrary shapes
Evaluation with Typical Benchmarks
We used four synthetic databases (DS1 - DS4)
From the CLUTO website
http://glaros.dtc.umn.edu/gkhome/views/cluto/
Typical benchmarks for (spatial) clustering
CHAMELEON [Karypis et al., 1999], SPARCL [Chaoji et al., 2009],
ABACUS [Chaoji et al., 2011], and so on
[Figure: plots of DS1, DS2, DS3, and DS4]
Results for Benchmarks
Running time (in seconds)
In the table, n denotes the number of data points
Results for ABACUS and SPARCL are from [Chaoji et al., 2011]
Their source codes are not publicly available
Non-convex: BOOL, DBSCAN, ABACUS, SPARCL; Convex: K-means
     n      BOOL   DBSCAN  ABACUS  SPARCL  K-means
DS1  8000   0.004  9.99    1.7     1.8     0.008
DS2  8000   0.004  10.041  1.      1.      0.008
DS3  10000  0.010  1.82    1.9     2.      0.06
DS4  8000   0.00   9.947   1.7     1.8     0.018
Results (Original Data)
[Figure: original data for DS1, DS2, DS3, and DS4]
Results (BOOL)
[Figure: BOOL results for DS1, DS2, DS3, and DS4]
Results (K-means)
[Figure: K-means results for DS1, DS2, DS3, and DS4]
Evaluation with Natural Images
We used four natural images
From the Berkeley segmentation database and benchmark
http://www.eecs.berkeley.edu/Research/Projects/CS/vision/grouping/segbench/
481 × 321 in size, 154,401 pixels in total
The RGB (red-green-blue) values for each pixel were obtained
by pre-processing, same as [Chaoji et al., 2011]
[Figure: the four images Horse, Mushroom, Pyramid, and Road]
Results for Natural Images
Running time (in seconds)
In the table, n denotes the number of data points
Results for ABACUS and SPARCL are from [Chaoji et al., 2011]
Non-convex: BOOL, ABACUS, SPARCL; Convex: K-means
n BOOL ABACUS SPARCL K-means
Horse 154401 0.2 1.2 41.8 0.674
Mushroom 154401 0.761 29. 1.449
Pyramid 154401 0.187 11. 0.24
Road 154401 0.180 14.9 0.209
Results (Original Data)
[Figure: original images Horse, Mushroom, Pyramid, and Road]
Results (BOOL)
[Figure: BOOL results for Horse, Mushroom, Pyramid, and Road]
Results (K-means)
[Figure: K-means results for Horse, Mushroom, Pyramid, and Road]
Results for UCI Data
Running time (in seconds)
n d K BOOL K-means DBSCAN
ecoli 336 7 8 0.001 0.002 0.111
sonar 208 60 2 0.004 0.00 0.149
shuttle 14500 9 7 0.02 0.06
wdbc 569 30 2 0.004 0.004 0.222
wine 178 13 3 0.002 0.002 0.047
wine quality 4898 11 7 0.019 0.026 7.601
Results for UCI Data
Adjusted Rand index
n d K BOOL K-means DBSCAN
ecoli 336 7 8 0.74 0.499 0.122
sonar 208 60 2 0.01 0.0064 0.0006
shuttle 14500 9 7 0.761 0.142
wdbc 569 30 2 0.6806 0.4914 0.0
wine 178 13 3 0.468 0.47 0.2971
wine quality 4898 11 7 0.011 0.0099 0.014
Motivation
The K-means algorithm is widely used
Simple and efficient
The main drawback is its limited clustering ability:
Non-spherical clusters cannot be found
Many shape-based algorithms have been proposed
Can find clusters of arbitrary shape
Drawbacks:
1. Not scalable (time complexity is quadratic or cubic)
2. Not robust to input parameters
Key Strategy
Requirements:
1. Fast (linear in data size)
2. Robust
3. Find arbitrary shaped clusters
Key idea: Translate input data onto binary words using
binary discretization
A hierarchy of clusters is naturally produced by
increasing the accuracy of the discretization
Each cluster is constructed by
agglomerating adjacent smaller clusters
performed linearly by sorting binary representations
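As a rough illustration of the key idea, the sketch below encodes a value in [0, 1] as a binary word of length k. It assumes that level k splits [0, 1] into 2^k intervals of equal width; the function name `binary_word` is hypothetical and not taken from the slides. Under that assumption the level-(k-1) word is a prefix of the level-k word, which is one way to see why raising the level refines the clusters into a hierarchy.

```python
from math import ceil

def binary_word(value, k):
    """Encode a value in [0, 1] as the k-bit index of its level-k interval."""
    # Assumption: interval m covers (m / 2**k, (m + 1) / 2**k]; 0 goes to interval 0.
    index = 0 if value == 0 else ceil(value * 2 ** k) - 1
    return format(index, "0{}b".format(k))

for k in (1, 2, 3):
    print(k, binary_word(0.66, k))
```

Each extra level appends one bit, so two values that share a word at level k also share every shorter word.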
Clustering Process
id x y
a 0.14 0.31
b 0.66 0.71
c 0.72 0.86
d 0.48 0.89
e 0.74 0.48
Discretize Database at Level k
Discretization level: 1
id x y
a 0 0
b 1 1
c 1 1
d 0 1
e 1 0
Agglomerate Adjacent Clusters
Discretization level: 1
id x y
a 0 0
b 1 1
c 1 1
d 0 1
e 1 0
Go to Next Level
id x y
a 0.14 0.31
b 0.66 0.71
c 0.72 0.86
d 0.48 0.89
e 0.74 0.48
Discretize Database at Level k
Discretization level: 2
id x y
a 0 1
b 2 2
c 2 3
d 1 3
e 2 1
Agglomerate Adjacent Clusters
Discretization level: 2
id x y
a 0 1
b 2 2
c 2 3
d 1 3
e 2 1
Construct Clusters with Sorting
Discretization level: 2
id x y
a 0 1
b 2 2
c 2 3
d 1 3
e 2 1
Sort by y-axis
Discretization level: 2
id x y
a 0 1
e 2 1
b 2 2
d 1 3
c 2 3
Compare Each Point to the Next Point
Discretization level: 2
id x y
a 0 1
e 2 1
b 2 2
d 1 3
c 2 3
Sort by x-axis
Discretization level: 2
id x y
a 0 1
d 1 3
e 2 1
b 2 2
c 2 3
Compare Each Point to the Next Point
Discretization level: 2
id x y
a 0 1
d 1 3
e 2 1
b 2 2
c 2 3
Finish (... or go to the next level)
Discretization level: 2
id x y
a 0 1
d 1 3
e 2 1
b 2 2
c 2 3
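The walkthrough above can be replayed in a few lines. This is a minimal sketch, assuming equal-width intervals at each level and distance parameter l = 1; the helper names (`discretize`, `stable_sort`, `linked`) are hypothetical. The y value of point a is taken as 0.31, an assumption chosen to agree with its level-1 and level-2 codes.

```python
from math import ceil

# Coordinates as read from the "Clustering Process" slide; the y value of
# point a (0.31) is an assumption consistent with its level-1 and level-2 codes.
points = {"a": (0.14, 0.31), "b": (0.66, 0.71), "c": (0.72, 0.86),
          "d": (0.48, 0.89), "e": (0.74, 0.48)}

def discretize(point, k):
    """Map each coordinate in [0, 1] to the index of its level-k interval."""
    return tuple(0 if v == 0 else ceil(v * 2 ** k) - 1 for v in point)

level1 = {name: discretize(p, 1) for name, p in points.items()}
level2 = {name: discretize(p, 2) for name, p in points.items()}

def stable_sort(order, axis):
    """Sort point names by one level-2 coordinate; ties keep their order."""
    return sorted(order, key=lambda name: level2[name][axis])

start = stable_sort(stable_sort(list(points), 1), 0)  # by y, then by x
by_y = stable_sort(start, 1)
by_x = stable_sort(by_y, 0)

def linked(u, v):
    """l = 1: the codes differ in at most one coordinate, by at most 1."""
    return sum(abs(p - q) for p, q in zip(u, v)) <= 1

# Compare each point to the next one in both sorted orders.
pairs = sorted({tuple(sorted((s, t)))
                for order in (by_y, by_x)
                for s, t in zip(order, order[1:])
                if linked(level2[s], level2[t])})
print(level1, level2, by_y, by_x, pairs, sep="\n")
```

Under this reading, the linked pairs join b, c, d, and e into one cluster at level 2, while a stays on its own.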
Distance Parameter l
[Figure: panels for l = 1, l = 2, and l = 3]
Noise Filtering by BOOL
Noise filtering is important in clustering
Define $\mathcal{C}_{\ge N} = \{C \in \mathcal{C} \mid \#C \ge N\}$ for a partition $\mathcal{C}$
See a cluster $C$ as noises if $\#C < N$
Example: Given $\mathcal{C} = \{\{0.1\}, \{0.4, 0.5, 0.6\}, \{0.9\}\}$,
$\mathcal{C}_{\ge 2} = \{\{0.4, 0.5, 0.6\}\}$, and 0.1 and 0.9 are noises
We input the number (noise parameter) N as the lower
bound of the size of cluster in BOOL
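The filtering rule is small enough to state directly in code. The sketch below is an illustration only; the function name `filter_noise` and the choice to return the noise points as a flat list are assumptions.

```python
def filter_noise(partition, n_min):
    """Keep clusters with at least n_min points; return (clusters, noise points)."""
    kept = [c for c in partition if len(c) >= n_min]
    noise = [x for c in partition if len(c) < n_min for x in c]
    return kept, noise

# The example from the slide, with noise parameter N = 2.
kept, noise = filter_noise([{0.1}, {0.4, 0.5, 0.6}, {0.9}], 2)
print(kept, noise)
```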
Conclusion
We have developed a clustering algorithm BOOL
Fastest w.r.t. finding arbitrarily shaped clusters
Robust, highly scalable, and noise tolerant
The key to performance is discretization based on
the binary encoding and sorting of data points
Future work: BOOL is simple and can easily be extended
to other machine learning tasks
e.g., anomaly detection, semi-supervised learning
Appendix
Other Experiments
We evaluate BOOL experimentally to determine its scalability
and effectiveness for various types of databases
Randomly generated synthetic databases (d = 2)
Famous synthetic databases for spatial clustering (d = 2)
Real databases generated from natural images (d = 3)
Real databases from the UCI repository (d = 7, 9, 30, 60)
We used Mac OS X version 10.6 (2 × 2.26-GHz Quad-Core
Intel Xeon CPUs, 12 GB of memory)
BOOL was implemented in C, compiled with gcc 4.2.1
All experiments were performed in R version 2.12.2
All databases are pre-processed by min-max normalization
in BOOL
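Min-max normalization rescales every attribute to [0, 1], which is the domain the discretization expects. The sketch below shows one plain-Python way to do it; the function name and the handling of constant columns are assumptions, not details from the slides.

```python
def min_max_normalize(rows):
    """Rescale every column of a list of equal-length rows to [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    spans = [max(c) - min(c) for c in columns]
    # Assumption: a constant column (span 0) is mapped to 0.0.
    return [tuple((v - lo) / s if s else 0.0 for v, lo, s in zip(row, lows, spans))
            for row in rows]

normalized = min_max_normalize([(2, 10), (4, 30), (6, 20)])
print(normalized)
```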
Speed w.r.t. Data and Clusters
Randomly generated synthetic databases
K = 5 (left), n = 10000 (right)
[Figure: running time (s) against the number of data points n (left) and the number of clusters K (right) for BOOL, K-means, and DBSCAN]
Quality w.r.t. Data and Clusters
Randomly generated synthetic databases
K = 5 (left), n = 10000 (right)
[Figure: adjusted Rand index against the number of data points n (left) and the number of clusters K (right)]
Speed w.r.t. Input Parameters
Randomly generated synthetic databases
K = 5 and n = 10000
[Figure: running time (s) against the distance parameter l and against the noise parameter N]
Quality w.r.t. Input Parameters
Randomly generated synthetic databases
K = 5 and n = 10000
[Figure: adjusted Rand index against the distance parameter l and against the noise parameter N]
Notation
We treat a target database as a relation [Garcia-Molina et al., 2008]
The columns are called attributes
The domain of each attribute is $[0, 1]$
Attributes are composed of natural numbers $\{1, 2, \dots, d\}$
Each tuple corresponds to a data point in the $d$-dimensional
Euclidean space $\mathbb{R}^d$
$x_i$ is the value for attribute $i$ of $x$
For $M \subseteq \{1, 2, \dots, d\}$, $\pi_M(X)$ denotes a database that has
only the columns for attributes $M$ of $X$
Clustering is a partition of $X$ into $K$ subsets (clusters) $C_1, \dots, C_K$
$C_i \neq \emptyset$ and $C_i \cap C_j = \emptyset$ for $i \neq j$
We call $\mathcal{C} = \{C_1, \dots, C_K\}$ a partition of $X$
$\mathfrak{C}(X) = \{\mathcal{C} \mid \mathcal{C} \text{ is a partition of } X\}$
Discretization
For a data point $x$ in a database $X$, discretization at level $k$
is an operator $\Delta^k$ for $x$, where each value $x_i$ is mapped to
a natural number $m$ such that $x_i = 0$ implies $m = 0$, and
$x_i \neq 0$ implies
$$
m = \begin{cases}
0 & \text{if } 0 < x_i \le 2^{-k}, \\
1 & \text{if } 2^{-k} < x_i \le 2 \cdot 2^{-k}, \\
\vdots \\
2^k - 1 & \text{if } (2^k - 1) \cdot 2^{-k} < x_i \le 1.
\end{cases}
$$
We use the same operator $\Delta^k$ for discretization of a database $X$,
i.e., each data point $x$ in $X$ is discretized to $\Delta^k(x)$ in the database
$\Delta^k(X)$.
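Writing $\Delta^k$ for the level-k discretization operator (the symbol is an assumption here), and assuming all intervals have equal width $2^{-k}$ as in the clustering-process example, the case analysis collapses to a ceiling for nonzero values. A worked instance for $x_i = 0.66$:

```latex
\Delta^k(x_i) = \lceil 2^k x_i \rceil - 1 \quad (0 < x_i \le 1), \qquad
\Delta^1(0.66) = \lceil 1.32 \rceil - 1 = 1, \qquad
\Delta^2(0.66) = \lceil 2.64 \rceil - 1 = 2 .
```

Both values agree with the codes of point b in the level-1 and level-2 tables.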
Reachability
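Given a distance parameter $l \in \mathbb{N}$, a data point $x$ in $X$ is
reachable at level $k$ from a data point $y$ if there exists a
chain of points $z_1, z_2, \dots, z_p$ ($p \ge 2$) such that $z_1 = x$, $z_p = y$,
and the distance
$$
d_0(\Delta^k(z_i), \Delta^k(z_{i+1})) \le l \quad \text{and} \quad d_\infty(\Delta^k(z_i), \Delta^k(z_{i+1})) \le 1
$$
for all $i \in \{1, 2, \dots, p - 1\}$
The distance between $x$ and $y$ is defined by
$$
d_0(x, y) = \sum_{i=1}^{d} \delta(x_i, y_i), \quad \text{where } \delta(x_i, y_i) = \begin{cases} 0 & \text{if } x_i = y_i, \\ 1 & \text{if } x_i \neq y_i, \end{cases}
$$
$$
d_\infty(x, y) = \max_{i \in \{1, \dots, d\}} |x_i - y_i|
$$
in the $L_0$ and $L_\infty$ metrics, respectively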
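The two distances and the single-link test of a reachability chain are easy to write down. The sketch below operates on already discretized points (tuples of integers); the function names are hypothetical.

```python
def d0(u, v):
    """L0 distance: number of coordinates in which u and v differ."""
    return sum(1 for p, q in zip(u, v) if p != q)

def d_inf(u, v):
    """L-infinity distance: largest absolute coordinate difference."""
    return max(abs(p - q) for p, q in zip(u, v))

def directly_linked(u, v, l=1):
    """One link of a reachability chain between two discretized points."""
    return d0(u, v) <= l and d_inf(u, v) <= 1

print(directly_linked((2, 1), (2, 2)))       # same x, y differs by 1
print(directly_linked((2, 2), (1, 3)))       # both coordinates differ, l = 1
print(directly_linked((2, 2), (1, 3), l=2))  # diagonal neighbours, l = 2
```

Reachability is then the transitive closure of this link test over the points of the database.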
Pseudo-Code (1/3)
Input: Database $X$,
lower bound on number of clusters $K$,
noise parameter $N$, and
distance parameter $l$
Output: Partition $\mathcal{C}$
function BOOL($X$, $K$, $N$)
  $k \leftarrow 1$ // $k$ is level of discretization
  repeat
    $\mathcal{C} \leftarrow$ MAKEHIERARCHY($X$, $k$, $N$)
    $k \leftarrow k + 1$
  until $\#\mathcal{C} \ge K$
  output $\mathcal{C}$
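Pseudo-Code (2/3)
function MAKEHIERARCHY($X$, $k$, $N$)
  $\mathcal{C} \leftarrow \{\{x\} \mid x \in X\}$
  $h \leftarrow d$ // $d$ is number of attributes of $X$
  $X_D \leftarrow \Delta^k(X)$ // discretize $X$ at level $k$
  $X \leftarrow S_{\pi_1(X_D)} \circ S_{\pi_2(X_D)} \circ \dots \circ S_{\pi_d(X_D)}(X)$
  repeat
    $X \leftarrow S_{\pi_h(X_D)}(X)$
    $\mathcal{C} \leftarrow$ AGGL($X$, $\mathcal{C}$, $h$)
    $h \leftarrow h - 1$
  until $h = 0$
  $\mathcal{C} \leftarrow \{C \in \mathcal{C} \mid \#C \ge N\}$
  output $\mathcal{C}$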
Pseudo-Code (3/3)
function AGGL($X$, $\mathcal{C}$, $h$)
  for each object $x$ of $X$
    $y \leftarrow$ successive object of $x$
    if $d_0(\Delta^k(x), \Delta^k(y)) \le l$ and $d_\infty(\Delta^k(x), \Delta^k(y)) \le 1$ then
      delete $C \ni x$ and $D \ni y$ from $\mathcal{C}$, and add $C \cup D$
    end if
  end for
  output $\mathcal{C}$
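The three pseudo-code functions can be combined into a short runnable sketch. It is not the authors' C implementation: it assumes equal-width intervals, represents clusters with a union-find forest (a choice made here for the "delete C and D, add their union" step), and adds a hypothetical `max_level` cap so the outer loop always terminates. Clusters are returned as lists of row indices.

```python
from math import ceil

def discretize(point, k):
    """Level-k code of a point whose coordinates lie in [0, 1]."""
    return tuple(0 if v == 0 else ceil(v * 2 ** k) - 1 for v in point)

def linked(u, v, l):
    """Adjacency test used in AGGL: d0 <= l and d_inf <= 1."""
    d0 = sum(1 for p, q in zip(u, v) if p != q)
    d_inf = max(abs(p - q) for p, q in zip(u, v))
    return d0 <= l and d_inf <= 1

def make_hierarchy(X, k, N, l):
    codes = [discretize(x, k) for x in X]
    d = len(X[0])
    parent = list(range(len(X)))            # union-find: start with singletons

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    order = list(range(len(X)))
    for h in reversed(range(d)):            # initial sort by all attributes
        order.sort(key=lambda i: codes[i][h])
    for h in reversed(range(d)):            # h = d, ..., 1 in the pseudo-code
        order.sort(key=lambda i: codes[i][h])      # list.sort is stable
        for i, j in zip(order, order[1:]):         # AGGL: successive objects
            if linked(codes[i], codes[j], l):
                parent[find(i)] = find(j)
    clusters = {}
    for i in range(len(X)):
        clusters.setdefault(find(i), []).append(i)
    return [c for c in clusters.values() if len(c) >= N]   # noise filtering

def bool_clustering(X, K, N=1, l=1, max_level=20):
    k = 1
    while True:
        partition = make_hierarchy(X, k, N, l)
        if len(partition) >= K or k >= max_level:  # max_level: added safeguard
            return partition
        k += 1

X = [(0.14, 0.31), (0.66, 0.71), (0.72, 0.86), (0.48, 0.89), (0.74, 0.48)]
print(bool_clustering(X, K=2))
```

With the five example points (the y value 0.31 for the first point is an assumption), level 1 yields a single cluster, so the loop moves on to level 2 before stopping.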
Level-k partition and Sorting
BOOL constructs clusters through level-k partitions
($k = 1, 2, 3, \dots$)
A partition of a database $X$ is a level-k partition, denoted by $\mathcal{C}^k$,
if it satisfies the following condition:
For all pairs $x$ and $y$, the pair are in the same cluster iff $y$ is reachable
at level $k$ from $x$
Sorting of a database $X$ is defined as follows:
Let $Y$ be a key database s.t. $\#X = \#Y$ and $Y$ has only one of $X$'s
attributes
The expression $S_Y(X)$ is the database $X$ for which data points
are sorted in the order indicated by $Y$
Ties keep the original order of $X$
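The sorting operator with its tie rule corresponds to a stable sort on a separate key column. A minimal sketch, with the hypothetical name `sort_by_key` and rows given as (id, x, y) tuples:

```python
def sort_by_key(X, Y):
    """Reorder X by the key column Y; ties keep the original order of X."""
    order = sorted(range(len(X)), key=lambda i: Y[i])  # sorted() is stable
    return [X[i] for i in order]

X = [("a", 0, 1), ("b", 2, 2), ("c", 2, 3), ("d", 1, 3), ("e", 2, 1)]
Y = [row[2] for row in X]  # key database: the y column only
ids = [row[0] for row in sort_by_key(X, Y)]
print(ids)
```

Rows c and d share the key 3, so c stays ahead of d because it came first in X.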
Properties of Reachability
If the distance parameter $l = 1$, the condition is exactly the
same as
$$
d_1(\Delta^k(z_i), \Delta^k(z_{i+1})) \le 1
$$
$d_1$ is the Manhattan distance ($L_1$ metric)
The notion of reachability is symmetric
If a data point $x$ is reachable at level $k$ from $y$, then $y$ is
reachable from $x$
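The equivalence for l = 1 can be checked by brute force on a small grid. The sketch below compares the two conditions for every pair of cells in a 4 x 4 x 4 grid; the grid size is an arbitrary choice for the check.

```python
from itertools import product

def d0(u, v):
    return sum(1 for p, q in zip(u, v) if p != q)

def d_inf(u, v):
    return max(abs(p - q) for p, q in zip(u, v))

def d1(u, v):
    return sum(abs(p - q) for p, q in zip(u, v))

cells = list(product(range(4), repeat=3))
equivalent = all((d0(u, v) <= 1 and d_inf(u, v) <= 1) == (d1(u, v) <= 1)
                 for u in cells for v in cells)
symmetric = all((d1(u, v) <= 1) == (d1(v, u) <= 1) for u in cells for v in cells)
print(equivalent, symmetric)
```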
Hierarchy of Clusters
Level-k partitions have a hierarchical structure
For the level-$k$ and $k + 1$ partitions $\mathcal{C}^k$ and $\mathcal{C}^{k+1}$,
the following condition holds:
For every cluster $C \in \mathcal{C}^k$, there exists a set of clusters
$\mathcal{D} \subseteq \mathcal{C}^{k+1}$ such that $\bigcup \mathcal{D} = C$.
This is because, for two objects $x$ and $y$, if $d_0(\Delta^k(x), \Delta^k(y)) \le l$
and $d_\infty(\Delta^k(x), \Delta^k(y)) \le 1$ for some $k$, then the same holds
for all $k'$, with $k' \le k$.
BOOL can be viewed as a divisive hierarchical clustering
algorithm
Adjusted Rand Index
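Let the result be $\{C_1, \dots, C_K\}$ and the correct partition
be $\{D_1, \dots, D_M\}$
Suppose $n_{ij} = \#\{x \in X \mid x \in C_i, x \in D_j\}$. Then the adjusted Rand index is
$$
\frac{\sum_{i,j} \binom{n_{ij}}{2} - \left( \sum_i \binom{\#C_i}{2} \sum_j \binom{\#D_j}{2} \right) / \binom{n}{2}}
{2^{-1} \left( \sum_i \binom{\#C_i}{2} + \sum_j \binom{\#D_j}{2} \right) - \left( \sum_i \binom{\#C_i}{2} \sum_j \binom{\#D_j}{2} \right) / \binom{n}{2}}
$$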
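The index can be computed directly from the two partitions. The sketch below follows the standard adjusted Rand index formula with partitions given as lists of sets; it assumes at least two points and that the denominator is nonzero (it is zero, for example, when both partitions consist of a single cluster). `math.comb` requires Python 3.8 or later.

```python
from math import comb

def adjusted_rand_index(result, truth):
    """result, truth: lists of sets that partition the same points."""
    n = sum(len(c) for c in result)
    sum_ij = sum(comb(len(c & d), 2) for c in result for d in truth)
    sum_c = sum(comb(len(c), 2) for c in result)
    sum_d = sum(comb(len(d), 2) for d in truth)
    expected = sum_c * sum_d / comb(n, 2)
    return (sum_ij - expected) / (0.5 * (sum_c + sum_d) - expected)

same = adjusted_rand_index([{0, 1}, {2, 3}], [{0, 1}, {2, 3}])
crossed = adjusted_rand_index([{0, 1}, {2, 3}], [{0, 2}, {1, 3}])
print(same, crossed)
```

Identical partitions score 1, and partitions that agree less than expected by chance score below 0.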
Shape-Based Clustering Algorithms
Many shape-based algorithms have been proposed
See [Berkhin, 2006, Halkidi et al., 2001, Jain et al., 1999]
Partitional algorithms
ABACUS [Chaoji et al., 2011], SPARCL [Chaoji et al., 2009]
Mass-based algorithms
[Ting and Wells, 2010]
Density-based algorithms
DBSCAN [Ester et al., 1996]
Hierarchical clustering algorithms
CURE [Guha et al., 1998], CHAMELEON [Karypis et al., 1999]
Grid-based algorithms
STING [Wang et al., 1997]
References
[Berkhin, 2006] P. Berkhin. A survey of clustering data mining techniques. Grouping Multidimensional Data, pages 25-71, 2006.
[Chaoji et al., 2011] V. Chaoji, G. Li, H. Yildirim, and M. J. Zaki. ABACUS: Mining arbitrary shaped clusters from large datasets based on backbone identification. In Proceedings of SIAM International Conference on Data Mining, pages 295-306, 2011.
[Chaoji et al., 2009] V. Chaoji, M. A. Hasan, S. Salem, and M. J. Zaki. SPARCL: An effective and efficient algorithm for mining arbitrary shape-based clusters. Knowledge and Information Systems, 21(2):201-229, 2009.
[Ester et al., 1996] M. Ester, H. P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of KDD, 96, 226-231, 1996.
[Garcia-Molina et al., 2008] H. Garcia-Molina, J. D. Ullman, and J. Widom. Database systems: The complete book. Prentice Hall Press, 2008.
[Guha et al., 1998] S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. Information Systems, 26(1):35-58, 1998.
[Halkidi et al., 2001] M. Halkidi, Y. Batistakis, and M. Vazirgiannis. On clustering validation techniques. Journal of Intelligent Information Systems, 17(2):107-145, 2001.
[Hinneburg and Keim, 1998] A. Hinneburg and D. A. Keim. An efficient approach to clustering in large multimedia databases with noise. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, pages 58-65, 1998.
[Jain et al., 1999] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: A review. ACM Computing Surveys, 31(3):264-323, 1999.
[Karypis et al., 1999] G. Karypis, H. Eui-Hong, and V. Kumar. CHAMELEON: Hierarchical clustering using dynamic modeling. Computer, 32(8):68-75, 1999.
[Lin et al., 2003] J. Lin, E. Keogh, S. Lonardi, and B. Chiu. A symbolic representation of time series, with implications for streaming algorithms. In Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, 2003.
[Qiu and Joe, 2006] W. Qiu and H. Joe. Generation of random clusters with specified degree of separation. Journal of Classification, 23:315-334, 2006.
[Sheikholeslami et al., 1998] G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution clustering approach for very large spatial databases. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 428-439, 1998.
[Ting and Wells, 2010] K. M. Ting and J. R. Wells. Multi-dimensional mass estimation and mass-based clustering. In Proceedings of 10th IEEE International Conference on Data Mining, pages 511-520, 2010.
[Wang et al., 1997] W. Wang, J. Yang, and R. Muntz. STING: A statistical information grid approach to spatial data mining. In Proceedings of the 23rd International Conference on Very Large Data Bases, pages 186-195, 1997.