|LLL |CDM zo

++
December ++–+(, zo++
A Fast and Flexible Clustering
AlgorithmUsing Binary Discretization
Mahlto SUG|¥AMA
†,‡
,Aklhlro ¥AMAMOTO


Kyoto Unlverslty

1SPS Pesearch Pellow
+/zo
Main Results
+. BOOL (Bianry cOding Oriented cLustering)
• A clusterlng algorlthm for multlvarlate data
to detect arbltrarlly shaped clusters
• Nolse tolerant
• Pobust to changes ln lnput parameters
z. BOOL is faster than K-means and the fastest among
the algorithms to detect arbitrarily shaped clusters
• |t ls about two to three orders of magnltude faster than
two state-of-the-art algorlthms |Chao[l et al. SDMzo++,
Chao[l et al. KA|S zoop| that can detect non-convex
clusters of arbltrary shapes
z/zo
Evaluation with Typical Benchmarks
• we used four synthetlc databases (DS+ - DS()
– Prom the CLUTO webslte
∘ http://glaros.dtc.umn.edu/gkhome/views/cluto/
• Typlcal benchmarks for (spatlal) clusterlng
– CHAMLLLON
|
Karypls et al., +ppp
|
, SPAPCL
|
Chao[l et al., zoop
|
,
A8ACUS
|
Chao[l et al., zo++
|
, and so on
DS1 DS2
0 100 200 300 400 500 600
0
50
100
150
200
250
300
0 200 400 600 800
50
100
150
DS3 DS4
0 100 200 400 500 600 700 300
100
200
300
400
0 100 200 400 500 600 700 300
100
200
300
400
0
¸/zo
Results for Benchmarks
• Punnlng tlme (ln seconds)
– |n table, n denotes the number of data polnts
– Pesults for A8ACUS and SPAPCL are from
|
Chao[l et al., zo++
|
∘ Thelr source codes are not publlcly avallable
Non-convex Convex
n 8OOL D8SCAN A8ACUS SPAPCL K-means
DS 8ooo o.oo( p.p¸p +.) +.8 o.oo8
DS 8ooo o.oo( +o.o(+ +.¸ +.¸ o.oo8
DS +oooo o.o+o +¸.8¸z +.p z.¸ o.o¸6
DS 8ooo o.oo¸ p.p() +.) +.8 o.o+8
(/zo
Results (Original Data)
DS1 DS2
0 100 200 300 400 500 600
0
50
100
150
200
250
300
0 200 400 600 800
50
100
150
DS3 DS4
0 100 200 400 500 600 700 300
100
200
300
400
0 100 200 400 500 600 700 300
100
200
300
400
0
¸/zo
Results (BOOL)
DS1 DS2
0 100 200 300 400 500 600
0
50
100
150
200
250
300
0 200 400 600 800
50
100
150
DS3 DS4
0 100 200 400 500 600 700 300
100
200
300
400
0
0 100 200 400 500 600 700 300
100
200
300
400
6/zo
Results (K-means)
DS1 DS2
0 100 200 300 400 500 600
0
50
100
150
200
250
300
0 200 400 600 800
50
100
150
DS3 DS4
0 100 200 400 500 600 700 300
100
200
300
400
0 100 200 400 500 600 700 300
100
200
300
400
0
)/zo
Evaluation with Natural Images
• we used four natural lmages
– Prom the 8erkeley segmentatlon database and benchmark
∘ http://www.eecs.berkeley.edu/Research/Projects/
CS/vision/grouping/segbench/
• (8+ × ¸z+ ln slze, +¸(,(o+ plxels ln total
– The PG8 (red-green-blue) values for each plxel were obtalned
by pre-precesslng, same as
|
Chao[l et al., zo++
|
Horse Mushroom Pyramid Road
8/zo
Results for Natural Images
• Punnlng tlme (ln seconds)
– |n table, n denotes the number of data polnts
– Pesults for A8ACUS and SPAPCL are from
|
Chao[l et al., zo++
|
Non-convex Convex
n 8OOL A8ACUS SPAPCL Kmeans
Horse +¸((o+ o.z¸¸ ¸+.z (+.8 o.6)(
Mushroom +¸((o+ o.)6+ zp.¸ – +.((p
Pyramid +¸((o+ o.+8) ++.¸ – o.z¸(
Road +¸((o+ o.+8o +(.p – o.zop
p/zo
Results (Original Data)
Horse Mushroom
Pyramid Road
+o/zo
Results (BOOL)
Horse Mushroom
Pyramid Road
++/zo
Results (K-means)
Horse Mushroom
Pyramid Road
+z/zo
Results for UCI Data
• Punnlng tlme (ln seconds)
n d K 8OOL K-means D8SCAN
ecoli ¸¸6 ) 8 o.oo+ o.ooz o.+++
sonar zo8 6o z o.oo( o.oo¸ o.+(p
shuttle +(¸oo p ) o.oz¸ o.o6¸ –
wdbc ¸6p ¸o z o.oo( o.oo( o.zzz
wine +)8 +¸ ¸ o.ooz o.ooz o.o()
wine quality (8p8 ++ ) o.o+p o.oz6 ).6o+
+¸/zo
Results for UCI Data
• Ad[usted Pand lndex
n d K 8OOL K-means D8SCAN
ecoli ¸¸6 ) 8 o.¸)(¸ o.(¸pp o.+zz¸
sonar zo8 6o z o.o+¸¸ o.oo6( o.ooo6
shuttle +(¸oo p ) o.)6¸+ o.+(¸z –
wdbc ¸6p ¸o z o.68o6 o.(p+( o.¸¸¸o
wine +)8 +¸ ¸ o.(6¸8 o.¸¸() o.zp)+
wine quality (8p8 ++ ) o.o+¸+ o.oopp o.o+¸(
+(/zo
Motivation
• The K-means algorlthm ls wldely used
– Slmple and efficlent
– The maln drawback ls lts llmlted clusterlng ablllty:
Non-spherlcal clusters cannot be found
• Many shape-based algorlthms have been proposed
– Can find clusters of arbltrary shape
– Drawbacks:
+. Not scalable (tlme complexlty ls quadratlc or cublc)
z. Not robust to lnput parameters
+¸/zo
Key Strategy
• Pequlrements:
+. Past (llnear ln data slze)
z. Pobust
¸. Plnd arbltrary shaped clusters
• Key idea: Translate lnput data onto blnary words uslng
blnary dlscretlzatlon
– A hlerarchy of clusters ls naturally produced by
lncreaslng the accuracy of the dlscretlzatlon
• Lach cluster ls constructed by
agglomeratlng ad[acent smaller clusters
– performed llnearly by sortlng blnary representatlons
+6/zo
Clustering Process
ld x y
a o.+( o.¸+
b o.66 o.)+
c o.)z o.86
d o.(8 o.8p
e o.)( o.(8
+)/zo
Discretize Database at Level 
0 1
0
1
Dlscretlzatlon level: +
ld x y
a o.+( o.¸+
b o.66 o.)+
c o.)z o.86
d o.(8 o.8p
e o.)( o.(8
+)/zo
Discretize Database at Level 
0 1
0
1
Dlscretlzatlon level: +
ld x y
a 0 0
b 1 1
c 1 1
d 0 1
e 1 0
+)/zo
Discretize Database at Level 
0 1
0
1
Dlscretlzatlon level: +
ld x y
a 0 0
b 1 1
c 1 1
d 0 1
e 1 0
+)/zo
Discretize Database at Level 
0 1
0
1
Dlscretlzatlon level: +
ld x y
a 0 0
b 1 1
c 1 1
d 0 1
e 1 0
+)/zo
Discretize Database at Level 
0 1
0
1
Dlscretlzatlon level: +
ld x y
a 0 0
b 1 1
c 1 1
d 0 1
e 1 0
+)/zo
Discretize Database at Level 
0 1
0
1
Dlscretlzatlon level: +
ld x y
a 0 0
b 1 1
c 1 1
d 0 1
e 1 0
+)/zo
Agglomerate Adjacent Clusters
0 1
0
1
Dlscretlzatlon level: +
ld x y
a 0 0
b 1 1
c 1 1
d 0 1
e 1 0
+)/zo
Go to Next Level
ld x y
a o.+( o.¸+
b o.66 o.)+
c o.)z o.86
d o.(8 o.8p
e o.)( o.(8
+)/zo
Discretize Database at Level 
0 1 2 3
0
1
2
3
Dlscretlzatlon level: z
ld x y
a 0 1
b 2 2
c 2 3
d 1 3
e 2 1
+)/zo
Discretize Database at Level 
0 1 2 3
0
1
2
3
Dlscretlzatlon level: z
ld x y
a 0 1
b 2 2
c 2 3
d 1 3
e 2 1
+)/zo
Agglomerate Adjacent Clusters
0 1 2 3
0
1
2
3
Dlscretlzatlon level: z
ld x y
a 0 1
b 2 2
c 2 3
d 1 3
e 2 1
+)/zo
Construct Clusters with Sorting
0 1 2 3
0
1
2
3
Dlscretlzatlon level: z
ld x y
a 0 1
b 2 2
c 2 3
d 1 3
e 2 1
+)/zo
Sort by x-axis →Sort by y-axis
0 1 2 3
0
1
2
3
Dlscretlzatlon level: z
ld x y
a 0 1
e 2 1
b 2 2
d 1 3
c 2 3
+)/zo
Compare Each Point to the Next Point
0 1 2 3
0
1
2
3
Dlscretlzatlon level: z
ld x y
a 0 1
e 2 1
b 2 2
d 1 3
c 2 3
+)/zo
Sort by x-axis
0 1 2 3
0
1
2
3
Dlscretlzatlon level: z
ld x y
a 0 1
d 1 3
e 2 1
b 2 2
c 2 3
+)/zo
Compare Each Point to the Next Point
0 1 2 3
0
1
2
3
Dlscretlzatlon level: z
ld x y
a 0 1
d 1 3
e 2 1
b 2 2
c 2 3
+)/zo
Finish (... or go to the next level)
0 1 2 3
0
1
2
3
Dlscretlzatlon level: z
ld x y
a 0 1
d 1 3
e 2 1
b 2 2
c 2 3
+)/zo
Distance Parameter l
l = 1
l = 2
l = 3
+8/zo
Noise Filtering by BOOL
• Nolse filterlng ls lmportant ln clusterlng
• Define
⩾N
≔ {C ∈ ∣ #C ⩾ N} for a partltlon
– See a cluster C as nolses lf #C < N
• Lxample: Glven ¬ {{0.l}, {0.4, 0.5, 0.6}, {0.9}}

⩾2
¬ {{0.4, 0.5, 0.6}}, and 0.l and 0.9 are nolses
• we lnput the number (nolse parameter) N as the lower
bound of the slze of cluster ln 8OOL
+p/zo
Conclusion
• we have developed a clusterlng algorlthm 8OOL
– Pastest w.r.t findlng arbltrarlly shaped clusters
– Pobust, hlghly scalable, and nolse tolerant
• The key to performance ls dlscretlzatlon based on
the blnary encodlng and sortlng of data polnts
• Puture work: 8OOL ls slmple and can easlly be extended
to other machlne learnlng tasks
– e.g., anomaly detectlon, seml-supervlsed learnlng
zo/zo
Appendix
A-+/A-zo
Other Experiments
• we evaluate 8OOL experlmentally to determlne lts scala-
blllty and effectlveness for varlous types of databases
– Pandomly generated synthetlc databases (d ¬ 2)
– Pamous synthetlc databases for spatlal clusterlng (d ¬ 2)
– Peal databases generated from natural lmages (d ¬ 3)
– Peal databases from the UC| reposltory (d ¬ 7, 9, 30, 60)
• weusedMac OS Xverslon+o.6.¸(z×z.z6-GHz Quad-Core
|ntel Xeon CPUs, +z G8 of memory)
– 8OOL was lmplemented ln C, complled wlth gcc (.z.+
– All experlments were performed ln P verslon z.+z.z
• All databases are pre-processed by mln-max normallza-
tlon ln 8OOL
A-z/A-zo
Speed w.r.t.  Data and  Clusters
• Pandomly generated synthetlc databases
– K ¬ 5 (left), n ¬ l0000 (rlght)
5·l0' l0` l0" l0' l0"
Number of data polnts n
5·l0" 5·l0` l0´
P
u
n
n
l
n
g

t
l
m
e

(
s
)
l0"
l0´
l
l0"
l0´
35 l0 20 30 40
Number of clusters K
25 l5 5
l0`
l0´
l0'
l0
l
8OOL
K-means
D8SCAN
A-¸/A-zo
Quality w.r.t.  Data and  Clusters
• Pandomly generated synthetlc databases
– K ¬ 5 (left), n ¬ l0000 (rlght)
510 10 10 10 10
Number of data points n
510 510 10
A
d
j
u
s
t
e
d

R
a
n
d

i
n
d
e
x
0.9
0.92
0.96
1.00
0.98
0.94
35 10 20 30 40
Number of clusters K
25 15 5
0.4
0.5
0.7
1.0
0.9
0.6
0.8
A-(/A-zo
Speed w.r.t. Input Parameters
• Pandomly generated synthetlc databases
– K ¬ 5 and n ¬ l0000
10
0.020
0.015
0.010
0.005
0
2 4 6 8
Distance parameter l
R
u
n
n
i
n
g

t
i
m
e

(
s
)
10
0.020
0.015
0.010
0.005
0
2 4 6 8
Noise parameter N
0
A-¸/A-zo
Quality w.r.t. Input Parameters
• Pandomly generated synthetlc databases
– K ¬ 5 and n ¬ l0000
10
1.00
0.98
0.94
0.92
0.9
2 4 6 8
Distance parameter l
A
d
j
u
s
t
e
d

R
a
n
d

i
n
d
e
x
0.96
10
1.2
0.8
1.0
0.6
0.4
0.2
0
-0.2
2 4 6 8
Noise parameter N
0
A-6/A-zo
Notation
• we treat a target database as a relatlon
|
Garcla-Mollna et al., zoo8
|
– The columns are called attributes
– The domaln of each attrlbute ls |0, l|
– Attrlbutes are composed of natural numbers {l, 2, …, d}
∘ Lachtuple corresponds toa data polnt lnthe d-dlmenslonal
Luclldean space ℝ
d
– x
i
ls the value for attrlbute i of x
– Por M ⊆ {l, 2, …, d}, π
M
(X) denotes a database that has
only the columns for attrlbutes M of X
• Clusterlng ls partltlon of X lnto K subsets (clusters) C
l
, …, C
K
– C
i
≠ ∅and C
i
∩ C
j
¬ ∅
– we call ¬ {C
l
, …, C
K
} a partltlon of X
– (X) ¬ { ∣ ls a partltlon of X}
A-)/A-zo
Discretization
• Por a data polnt x ln a database X, dlscretlzatlon at level k
ls an operator Δ
k
for x, where each value x
i
ls mapped to
a natural number m such that x
i
¬ 0 lmplles m ¬ 0, and
x
i
≠ 0 lmplles
m ¬





0 lf 0 < x
i
⩽ 2
−k
,
1 lf 2
−k+l
< x
i
⩽ 2
−k+2
,
...
2
k
− l lf 2
−k+(k−l)
< x
i
⩽ l.
– we use the same operator Δ
k
for dlscretlzatlonof a database X,
i.e., each data polnt x ln X ls dlscretlzed to Δ
k
(x) ln the database
Δ
k
(X).
A-8/A-zo
Reachability
• Glven a dlstance parameter l ∈ ℕ, a data polnt x ln X ls
reachable at level k from a data polnt y lf there exlsts a
chaln of polnts z
l
, z
2
, …, z
p
(p ⩾ 2) such that z
l
¬ x, z
p
¬ y,
and the dlstance
d
0

k
(z
i
), Δ
k
(z
i+l
)) ⩽ l and d


k
(z
i
), Δ
k
(z
i+l
)) ⩽ l
for all i ∈ {l, 2, …, p − l}
– The dlstance between x and y ls defined by
d
0
(x, y) ¬
d

i¬l
δ(x
i
, y
i
), where δ ¬

0 lf x
i
¬ y
i
,
l lf x
i
≠ y
i
,
d

(x, y) ¬ max
i∈{l,…,d}
|x
i
− y
i
|
ln the L
0
and L

metrlcs, respectlvely
A-p/A-zo
Pseudo-Code (/)
Input: Database X,
lower bound on number of clusters K,
nolse parameter N, and
dlstance parameter l
Output: Partltlon
function 8OOL(X, K, N)
+: k ←l // k ls level of dlscretlzatlon
z: repeat
¸: ←MAKLH|LPAPCH¥(X, k, N)
(: k ←k + l
¸: until # ⩾ K
6: output
A-+o/A-zo
Pseudo-Code (/)
function MAKLH|LPAPCH¥(X, k, N)
+: ←{{x} ∣ x ∈ X}
z: h ←d // d ls number of attrlbutes of X
¸: X
D
←Δ
k
(X) // dlscretlze X at level k
(: X ←S
π
l
(X
D
)
∘ S
π
2
(X
D
)
∘ …∘ S
π
d
(X
D
)
(X)
¸: repeat
6: X ←S
π
h
(X
D
)
(X)
): ←AGGL(X, , h)
8: h ←h − l
p: until h ¬ 0
+o: ←{C ∈ ∣ #C ⩾ N}
++: output
A-++/A-zo
Pseudo-Code (/)
function AGGL(X, , h)
+: for each ob[ect x of X
z: y ←successlve ob[ect of x
¸: if d
0

k
(x), Δ
k
(y)) ⩽ l and d


k
(x), Δ
k
(y)) ⩽ l then
(: delete C ∋ x and D ∋ y from, and add C ∪ D
¸: end if
6: end for
): output
A-+z/A-zo
Level-k partition and Sorting
• 8OOL construct clusters through level-k partltlons
(k ¬ l, 2, 3, …)
– Apartltlonof a database X ls a level-k partltlon, denotedby
k
,
lf lt satlsfies the followlng condltlon:
Por all palrs x andy, the palr are lnthe same cluster lffy ls reach-
able at level k from x
• Sortlng of a database X ls defined as follows:
– Let Y be a key database s.t. #X ¬ #Y and Y has only one of X’s
attrlbutes
– The expresslon S
Y
(X) ls the database X for whlch data polnts
are sorted ln the order lndlcated by Y
∘ Tles keep the orlglnal order of X
A-+¸/A-zo
Properties of Reachability
• |f the dlstance parameter l ¬ l, the condltlon ls exactly the
same as
d
l

k
(z
i
), Δ
k
(z
i+l
)) ⩽ l
– d
l
ls the Manhattan dlstance (L
l
metrlc)
• The notlon of reachablllty ls symmetrlc
– |f a data polnt x ls reachable at level k from y, then y ls
reachable from x
A-+(/A-zo
Hierarchy of Clusters
• Level-k partltlons have a hlerarchlcal structure
• Por the level-k and k + l partltlons
k
and
k+l
,
the followlng condltlon holds:
– Por every cluster C ∈
k
, there exlsts a set of clusters

k+l
such that ⋃¬ C.
• Thls ls why, for two ob[ects x and y, lf d
0

k
(x), Δ
k
(y)) ⩽ l
and d


k
(x), Δ
k
(y)) ⩽ l for some k, then the same holds
for all k

, wlth k

⩽ k.
• 8OOL can be vlewed as a dlvlslve hlerarchlcal clusterlng
algorlthm
A-+¸/A-zo
Adjusted Rand Index
• Let the result be ¬ {C
l
, …, C
K
} andthe correct partltlon
be ¬ {D
l
, …, D
M
}
• Suppose n
ij
≔ ‖{x ∈ X ∣ x ∈ C
i
, x ∈ D
j
}‖. Then

i, j n
ij
C
2
− (∑
i ‖C
i

C
2

h ‖D
j

C
2
)/
n
C
2
2
−l
(∑
i ‖C
i

C
2
+ ∑
h ‖D
j

C
2
) − (∑
i ‖C
i

C
2

h ‖D
j

C
2
)/
n
C
2
A-+6/A-zo
Shape-Based Clustering Algorithms
• Many shape-based algorlthms have been proposed
– See
|
8erkhln, zoo6, Halkldl et al., zoo+, 1aln et al., +ppp
|
• Partltlonal algorlthms
– A8ACUS
|
Chao[l et al., zo++
|
, SPAPCL
|
Chao[l et al., zoop
|
• Mass-based algorlthms
|
Tlng and wells, zo+o
|
• Denslty-based algorlthms
– D8SCAN
|
Lster et al., +pp6
|
• Hlerarchlcal clusterlng algorlthms
– CUPL
|
Guha et al., +pp8
|
, CHAMLLLON
|
Karypls et al., +ppp
|
• Grld-based algorlthms
– ST|NG
|
wang et al., +pp)
|
A-+)/A-zo
References
|
8erkhln, zoo6
|
P. 8erkhln. A survey of clusterlng data mlnlng technlques.
Grouping Multidimensional Data, pages z¸–)+, zoo6.
|
Chao[l et al., zo++
|
v. Chao[l, G. Ll, H. ¥lldlrlm, and M. 1. Zakl. A8ACUS:
Mlnlng arbltrary shaped clusters from large datasets based on back-
bone ldentlficatlon. |n Proceedings of  SIAMInternational Confer-
ence on Data Mining, pages zp¸–¸o6, zo++.
|
Chao[l et al., zoop
|
v. Chao[l, M. A. Hasan, S. Salem, and M. 1. Zakl. SPAPCL:
Aneffectlve andefficlent algorlthmfor mlnlngarbltrary shape-based
clusters. Knowledge and Information Systems, z+(z):zo+–zzp, zoop.
|
Lster et al., +pp6
|
M. Lster, H. P. Krlegel, 1. Sander, and X. Xu. A denslty-
based algorlthm for dlscoverlng clusters ln large spatlal databases
wlth nolse. |n Proceedings of KDD, p6, zz6–z¸+, +pp6.
|
Garcla-Mollna et al., zoo8
|
H. Garcla-Mollna, 1. D. Ullman, and 1. wldom.
Database systems: The complete book. Prentlce Hall Press, zoo8.
|
Guha et al., +pp8
|
S. Guha, P. Pastogl, and K. Shlm. CUPL: An effi-
A-+8/A-zo
clent clusterlng algorlthm for large databases. Information Systems,
z6(+):¸¸–¸8, +pp8.
|
Halkldl et al., zoo+
|
M. Halkldl, ¥. 8atlstakls, and M. vazlrglannls. On clus-
terlng valldatlon technlques. Journal of Intelligent Information Sys-
tems, +)(z):+o)–+(¸, zoo+.
|
Hlnneburg and Kelm, +pp8
|
A. Hlnneburg and D. A. Kelm. An efficlent ap-
proach to clusterlng ln large multlmedla databases wlth nolse. |n
Proceedings of the Fourth International Conference on Knowledge Dis-
covery and Data Mining, pages ¸8–6¸, +pp8.
|
1aln et al., +ppp
|
A. K. 1aln, M. N. Murty, and P. 1. Plynn. Data clusterlng: A
revlew. ACMComputing Surveys, ¸+(¸):z6(–¸z¸, +ppp.
|
Karypls et al., +ppp
|
G. Karypls, H. Lul-Hong, and v. Kumar. CHAMLLLON:
Hlerarchlcal clusterlng uslng dynamlc modellng. Computer,
¸z(8):68–)¸, +ppp.
|
Lln et al., zoo¸
|
1. Lln, L. Keogh, S. Lonardl, and 8. Chlu. A symbollc repre-
sentatlon of tlme serles, wlth lmpllcatlons for streamlng algorlthms.
|n Proceedings of the th ACMSIGMODWorkshop on Research Issues in
Data Mining and Knowledge Discovery, pages +–++, zoo¸.
A-+p/A-zo
|
Qlu and 1oe, zoo6
|
w. Qlu and H. 1oe. Generatlon of randomclusters wlth
speclfied degree of separatlon. Journal of Classification, z¸:¸+¸–¸¸(,
zoo6.
|
Shelkholeslaml et al., +pp8
|
G. Shelkholeslaml, S. Chatter[ee, and
A. Zhang. waveCluster: A multl-resolutlon clusterlng approach for
very large spatlal databases. |n Proceedings of the th International
Conference on Very Large Data Bases, pages (z8–(¸p, +pp8.
|
Tlng and wells, zo+o
|
K. M. Tlng and 1. P. wells. Multl-dlmenslonal mass
estlmatlon and mass-based clusterlng. |n Proceedings of th IEEE In-
ternational Conference on Data Mining, pages ¸++ – ¸zo, zo+o.
|
wang et al., +pp)
|
w. wang, 1. ¥ang, and P. Muntz. ST|NG: A statlstlcal
lnformatlon grld approach to spatlal data mlnlng. |n Proceedings
of the rd International Conference on Very Large Data Bases, pages
+86–+p¸, +pp).
A-zo/A-zo