Professional Documents
Culture Documents
results in a trade-off between the clustering quality and the running time. For now, we deal with this
problem by introducing the increasing penalty factor f and define
4. Algorithm BKM+
The proposed algorithm is presented in Algorithm 1. We call it balanced k-means revisited (BKM+).
In each iteration of the algorithm, each data point is assigned to the cluster j which minimizes the
cost term ||xi − c j ||2 + pnow · n j . In this process, the data point is first partly removed from its old cluster,
see function RemovePoint, and afterward added to its new cluster and completely removed from its
old cluster, see function AddPoint. Directly after the computation of the cost term, the variable pnext,i ,
corresponding to min jnew ∈{1,...,k} {scale(iter, i, jnew )}, is computed using the function PenaltyNext, which
implements Equation 4. If this number is a candidate for pnext , its value is taken. After all data points are
assigned, the penalty term pnow is set for the following iteration by multiplying pnext by the increasing
penalty factor f . Afterward, the assignments of the data points start again. The algorithm terminates as
soon as the largest cluster is at most ∆nmax data points larger than the smallest cluster.
Table 2. Results for the SSE, NMI, and running time if a balanced clustering is required, n
denotes the size of the data set, d its dimension, and k refers to the number of clusters sought
in the data set.
Dataset (n,d,k) Algo SSE NMI Time (s)
S1 RKM 1.089e+13 0.948 0.11
(5000,2,15) BKM 1.100e+13 0.946 3364.98
BKM+ 1.097e+13 0.946 0.10
BKM++ 1.090e+13 0.947 0.11
S2 RKM 1.428e+13 0.921 0.12
(5000,2,15) BKM 1.431e+13 0.920 3504.10
BKM+ 1.456e+13 0.917 0.09
BKM++ 1.430e+13 0.922 0.10
S3 RKM 1.734e+13 0.796 0.14
(5000,2,15) BKM 1.736e+13 0.797 12074.14
BKM+ 1.736e+13 0.798 0.07
BKM++ 1.734e+13 0.797 0.08
S4 RKM 1.651e+13 0.729 0.16
(5000,2,15) BKM 1.652e+13 0.728 9621.18
BKM+ 1.652e+13 0.729 0.08
BKM++ 1.651e+13 0.730 0.09
1.4 6.83 6.51 6.35 6.69 6.94 7.08 8.70 1.4 6.24 6.18 6.19 6.19 6.23 6.16 6.19 1.4 6.22 6.20 6.22 6.27 6.30 6.24 6.22 8
7
1.2 5.39 5.28 5.33 5.26 5.14 5.20 6.56 1.2 5.23 5.33 5.31 5.27 5.16 5.19 5.04 1.2 5.39 5.50 5.42 5.45 5.31 5.39 5.23
1.1 4.71 4.37 4.34 4.34 4.36 4.34 4.88 1.1 4.81 4.65 4.67 4.66 4.58 4.61 4.46 1.1 5.04 4.96 5.05 4.97 4.90 4.94 4.82 6
MSE
1.05 4.16 4.06 4.04 4.03 4.05 4.01 4.05 1.05 4.27 4.25 4.25 4.29 4.18 4.14 4.10 1.05 4.49 4.63 4.54 4.48 4.55 4.30 4.40 5
1.01 4.12 3.99 3.95 3.93 3.93 3.92 3.93 1.01 4.01 4.06 3.99 3.98 3.97 3.96 3.94 1.01 4.22 4.23 4.21 4.16 4.17 4.09 4.06
1.001 4.10 3.99 3.94 3.94 3.90 3.90 3.93 1.001 3.96 3.96 3.96 3.95 3.95 3.94 3.93 1.001 4.12 4.09 4.17 4.04 4.09 4.04 3.98
4
1.0001 4.12 3.98 3.94 3.94 3.91 3.91 3.92 1.0001 3.95 4.01 3.93 3.97 3.93 3.93 3.92 1.0001 4.13 4.15 4.01 4.07 4.05 4.05 3.99
1
01
05
1
15
01
01
05
1
15
1
05
1
15
4
00
0.
0.
0.
0.
0.
0.
00
0.
0.
0.
0.
0.
0.
0
0.
0.
0.
0.
0.
0.
0.
0.
0.
Increasing Penalty Factor f
1.4 0.18 0.18 0.18 0.18 0.17 0.18 0.19 1.4 0.16 0.16 0.16 0.16 0.16 0.16 0.18 1.4 0.16 0.15 0.15 0.15 0.15 0.16 0.17 3.0
2.5
1.2 0.26 0.26 0.26 0.25 0.25 0.25 0.27 1.2 0.22 0.23 0.22 0.22 0.22 0.22 0.24 1.2 0.22 0.22 0.22 0.22 0.22 0.22 0.24 2.0
Time in sec
1.1 0.38 0.39 0.39 0.39 0.39 0.38 0.38 1.1 0.34 0.34 0.33 0.33 0.34 0.33 0.33 1.1 0.33 0.33 0.33 0.33 0.33 0.32 0.32 1.5
1.05 0.56 0.57 0.58 0.57 0.58 0.57 0.57 1.05 0.51 0.51 0.51 0.51 0.51 0.51 0.51 1.05 0.51 0.51 0.50 0.50 0.50 0.50 0.50 1.0
1.01 1.36 1.39 1.38 1.39 1.40 1.38 1.43 1.01 1.31 1.32 1.29 1.30 1.32 1.30 1.33 1.01 1.30 1.31 1.28 1.29 1.31 1.29 1.33
0.5
1.001 2.55 2.56 2.59 2.61 2.60 2.64 2.83 1.001 2.49 2.48 2.48 2.50 2.53 2.56 2.72 1.001 2.49 2.48 2.47 2.50 2.52 2.55 2.72
1.0001 2.84 2.91 2.91 2.92 2.92 2.95 3.25 1.0001 2.79 2.83 2.80 2.81 2.84 2.86 3.15 1.0001 2.78 2.83 2.79 2.81 2.83 2.85 3.14
1
01
05
1
15
1
01
05
1
15
1
05
1
15
4
00
0.
0.
0.
00
0.
0.
0.
00
0.
0.
0.
0.
0.
0.
0.
0.
0.
0.
0.
0.
0.
0.
0.
Figure 12. Effect of parameter selection on MSE and processing time for data set A3.