
Pattern Recognition Letters 29 (2008) 1385–1391


An efficient k′-means clustering algorithm


Krista Rizman Žalik *
University of Maribor, Faculty of Natural Sciences and Mathematics, Department of Mathematics and Computer Science, Koroška Cesta 160, 2000 Maribor, Slovenia
* Tel.: +386 02 229 38 21; fax: +386 02 251 81 80. E-mail address: krista.zalik@uni-mb.si

Article history: Received 29 March 2007; received in revised form 24 December 2007; available online 4 March 2008.
Communicated by L. Heutte

Keywords: Clustering analysis; k-Means; Cluster number; Cost-function; Rival penalized

Abstract: This paper introduces a k′-means algorithm that performs correct clustering without pre-assigning the exact number of clusters. This is achieved by minimizing a suggested cost-function. The cost-function extends the mean-square-error cost-function of k-means. The algorithm consists of two separate steps. The first is a pre-processing procedure that performs initial clustering and assigns at least one seed point to each cluster. During the second step, the seed-points are adjusted to minimize the cost-function. The algorithm automatically penalizes any possible winning chances for all rival seed-points in subsequent iterations. When the cost-function reaches a global minimum, the correct number of clusters is determined and the remaining seed points are located near the centres of actual clusters. The simulated experiments described in this paper confirm the good performance of the proposed algorithm.

1. Introduction

Clustering is a search for hidden patterns that may exist in datasets. It is a process of grouping data objects into disjoint clusters so that the data in each cluster are similar, yet different from the others. Clustering techniques are applied in many application areas such as data analysis, pattern recognition, image processing, and information retrieval.

k-Means is a typical clustering algorithm (MacQueen, 1967). It is attractive in practice, because it is simple and generally very fast. It partitions the input dataset into k clusters. Each cluster is represented by an adaptively-changing centroid (also called a cluster centre), starting from some initial values named seed-points. k-Means computes the squared distances between the inputs (also called input data points) and the centroids, and assigns each input to its nearest centroid. An algorithm for clustering N input data points x1, x2, . . . , xN into k disjoint subsets Ci, i = 1, . . . , k, each containing ni data points, 0 < ni < N, minimizes the following mean-square-error (MSE) cost-function:

J_{MSE} = \sum_{i=1}^{k} \sum_{x_t \in C_i} \| x_t - c_i \|^2    (1)

xt is a vector representing the t-th data point in cluster Ci and ci is the geometric centroid of cluster Ci. The algorithm thus minimizes an objective function, in this case a squared-error function, where ||xt − ci||² is a chosen distance measurement between the data point xt and the cluster centre ci.

The k-means algorithm assigns an input data point xt to the ith cluster if the cluster membership function I(xt, i) is 1:

I(x_t, i) = \begin{cases} 1 & \text{if } i = \arg\min_{j} \| x_t - c_j \|^2, \; j = 1, \ldots, k \\ 0 & \text{otherwise} \end{cases}    (2)

Here c1, c2, . . . , ck are the cluster centres, which are learned by the following steps:

Step 1: Initialize the k cluster centres c1, c2, . . . , ck by some initial values called seed-points, using random sampling.

For each input data point xt and all k clusters, repeat steps 2 and 3 until all centres converge.

Step 2: Calculate the cluster membership function I(xt, i) by Eq. (2) and decide the membership of each input data point in the one of the k clusters whose cluster centre is closest to that point.
Step 3: For all k cluster centres, set ci to be the centre of mass of all points in cluster Ci.
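To make the procedure above concrete, the following is a minimal sketch of the standard k-means loop (Eqs. (1) and (2), Steps 1–3) in Python with NumPy. The function name, the convergence tolerance and the seeding scheme are illustrative choices, not details fixed by the paper.

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, rng=None):
    """Plain k-means (MacQueen, 1967): minimizes the MSE cost-function of Eq. (1)."""
    rng = np.random.default_rng(rng)
    # Step 1: seed-points chosen by random sampling from the input dataset.
    centres = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(max_iter):
        # Step 2: membership function I(x_t, i) of Eq. (2) -- the nearest centre wins.
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Step 3: move each centre to the centre of mass of its cluster.
        new_centres = centres.copy()
        for i in range(k):
            members = X[labels == i]
            if len(members) > 0:
                new_centres[i] = members.mean(axis=0)
        if np.linalg.norm(new_centres - centres) < tol:
            centres = new_centres
            break
        centres = new_centres
    return centres, labels
```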
Although k-means has been widely used in data analysis, pattern recognition and image processing, it has three major limitations:

(1) The number of clusters must be previously known and fixed.
(2) The results of the k-means algorithm depend on the initial cluster centres (initial seed-points).
(3) The algorithm contains the dead-unit problem.

The major limitation of the k-means algorithm is that the number of clusters must be pre-determined and fixed. Selecting the appropriate number of clusters is critical. It requires a priori knowledge about the data or, in the worst case, guessing the number of clusters. When the input number of clusters (k) is equal to the real number of clusters (k′), the k-means algorithm correctly discovers all clusters, as shown in Fig. 1, where the cluster centres are marked by squares. Otherwise, it gives incorrect clustering results, as illustrated in Fig. 2a–c. When clustering real data, the number of clusters is unknown in advance and has to be estimated. Finding the correct number of clusters is usually performed over many clustering runs using different numbers of clusters.

The performance of the k-means algorithm depends on the initial cluster centres (initial seed-points). Furthermore, the final partition depends on the initial configuration. Some research has addressed this problem by proposing algorithms for computing initial cluster centres for k-means clustering (Khan and Ahmad, 2004; Redmond and Heneghan, 2007). Genetic algorithms have been developed for selecting centres in order to seed the popular k-means method for clustering (Laszlo and Mukherjee, 2007). Steinley and Brusco (2007) evaluated twelve procedures proposed in the literature for initializing k-means clustering and introduced recommendations for best practice. They recommended the method of multiple random starting-points for general use. In general, initial cluster centres are selected randomly. An assumption in their studies is that the number of clusters is known in advance. They conclude that even the best initialization strategy for cluster centres, combined with minimizing the mean-square-error cost-function, does not lead to the best dataset partition.

In the late 1980s, it was pointed out that the classical k-means algorithm has the so-called dead-unit or underutilization problem (Xu, 1993). A centre initialized far away from the input data points may never win in the process of assigning a data point to the nearest centre, and so it stays far away from the input data objects, becoming a dead-unit.

Over the last fifteen years, new advanced k-means algorithms have been developed that eliminate the dead-unit problem, for example, the frequency sensitive competitive learning (FSCL) algorithm (Ahalt et al., 1990). A typical strategy is to reduce the learning rates of frequent winners. Each cluster centre counts the number of times it wins the competition, and consequently reduces its learning rate. If a centre wins too often, it does not cooperate in the competition. FSCL solves the dead-unit problem and successfully identifies clusters, but only when the number of clusters is known in advance and appropriately preselected; otherwise, the algorithm performs badly.

Selecting the correct cluster number has been attempted in two ways. The first invokes heuristic approaches: the clustering algorithm is run many times with the number of clusters gradually increasing from a certain initial value to some threshold value that is difficult to set. The second is to formulate cluster number selection as choosing the component number in a finite mixture model. The earliest method for solving this model selection problem may be to choose the optimal number of clusters by Akaike's information criterion (AIC) or its extensions (Akaike, 1973; Bozdogan, 1987). Other criteria include Schwarz's Bayesian inference criterion (BIC) (Schwarz, 1978), the minimum message length criterion (MML) (Wallace and Dowe, 1999) and Bezdek's partition coefficients (PC) (Bezdek, 1981). As reported in Oliver et al. (1996), BIC and MML perform comparably and outperform the AIC and PC criteria. These existing criteria may overestimate or underestimate the cluster number, because of the difficulty of choosing an appropriate penalty function. Better results are obtained by a number selection criterion developed from the Ying-Yang machine (Xu, 1997), which, unfortunately, requires laborious computing.

To tackle the problem of selecting an appropriate number of clusters, the rival penalized competitive learning (RPCL) algorithm was proposed (Xu, 1993), which adds a new mechanism to FSCL. The basic idea is that, for each input data point, not only is the cluster centre of the winner cluster modified to adapt to the input data point, but the cluster centre of its rival cluster (the second winner) is also de-learned by a smaller learning rate. Many experiments have shown that RPCL can select the correct cluster number by driving extra cluster centres far away from the input dataset.

Fig. 1. A dataset with three clusters recognized by the k-means algorithm for k = 3.

Fig. 2. k-Means produces wrong clusters for k = 1 (a), k = 2 (b) and k = 4 (c) for the same dataset as in Fig. 1, which consists of three clusters; the black square denotes the location of the converged cluster centre.
Although the RPCL algorithm has had success in some applications, such as colour-image segmentation and image feature extraction, it is rather sensitive to the selection of the de-learning rate (Law and Cheung, 2003; Cheung, 2005; Ma and Cao, 2006). The RPCL algorithm was proposed heuristically. It has been shown that RPCL can be regarded as a fast approximate implementation of a special case of Bayesian Ying-Yang (BYY) harmony learning on a Gaussian mixture (Xu, 1997). Its ability to select the number of clusters is provided by the model selection ability of Bayesian Ying-Yang learning. There is still a lack of mathematical theory directly describing the correct convergence behaviour of RPCL, which selects the correct number of clusters while driving all other unnecessary cluster centres far away from the sample data.

This paper presents a new k′-means algorithm, which is an extension of k-means, without the three major drawbacks stated at the beginning of this section. The algorithm has a mechanism similar to RPCL in that it performs clustering without pre-determining the correct cluster number. The correct convergence of the suggested k′-means algorithm is investigated via a cost-function approach. A special cost-function is suggested, since the k-means cost-function (Eq. (1)) cannot be used for determining the number of clusters, because it decreases monotonically with any increase in cluster number. It is shown that, when the cost-function reaches a global minimum, the correct number of cluster centres converges onto the actual cluster centres, while all other initial centres are driven far away from the input dataset, and the corresponding clusters can be neglected, because they are empty.

Section 2 constructs the new cost-function. A rival penalized mechanism analysis of the proposed cost-function is presented in Section 3. Section 4 describes the k′-means algorithm for minimizing the proposed cost-function. Section 5 presents the experimental evaluation. The paper is summarized in Section 6.

2. The cost-function

The k-means algorithm minimizes the mean-square-error cost-function JMSE (Eq. (1)), which decreases monotonically with any increase of the cluster number. Such a function cannot be used for identifying the correct number of clusters, and cannot be used for the RPCL algorithm. This section introduces a new cost-function based on the following two characteristics:

(1) Areas with dense samples strongly attract centres, and
(2) Each cluster centre pushes all other cluster centres away in order to give maximal information about the patterns formed by the input data points. This makes it possible to move extra cluster centres away from the sample data. When a cluster centre is driven away from the sample data, the corresponding cluster can be neglected, because it is empty.

We want to obtain maximal information about the patterns formed by the input data points. The amount of information each cluster gives us about the dataset can be quantified. Discovering the ith cluster Ci, having ni elements, in a dataset with N elements gives us the amount of information I(Ci):

I(C_i) = | \log(n_i / N) |    (3)

This information is a measure of decreasing uncertainty about the dataset. The logarithm is selected for measuring information since it is additive when concatenating independent, unrelated amounts of information for a whole system, e.g. when it discovers a cluster. For a dataset with N elements forming k distinguishable clusters, the amount of information is I(C1) + I(C2) + · · · + I(Ck).

We have to maximize the amount of information and minimize the uncertainty about the system, JI (Eq. (4)):

J_I = -E \sum_{i=1}^{k} n_i \log_2(p(C_i)), \qquad \sum_{i=1}^{k} p(C_i) = 1, \; 0 \le p(C_i) \le 1, \; i = 1, \ldots, k    (4)

p(Ci) is the probability that an input data point is in cluster (subset) Ci. E is a constant and is just a choice of measurement units. E should be from the range of the point coordinates. The coordinate magnitude does not matter, because we only care about point distances. Setting the parameter E is discussed and experimentally verified in Sections 4 and 5.

In view of the above considerations, we were motivated to construct a cost-function composed of the mean-square-error JMSE and the information uncertainty JI:

J = J_I + J_{MSE}    (5)
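The following sketch (an illustration, not code from the paper) evaluates JMSE (Eq. (1)), JI (Eq. (4)) and their sum J (Eq. (5)) for a given partition, with p(Ci) estimated here as ni/N; scanning it over candidate cluster numbers reproduces the kind of comparison shown in Fig. 3.

```python
import numpy as np

def cost_J(X, labels, centres, E):
    """Evaluate J = J_I + J_MSE (Eq. (5)) for a partition of X.

    p(C_i) is estimated by the cluster proportion n_i / N; E is the
    unit-of-measurement constant discussed in Sections 2 and 4.
    """
    N = len(X)
    j_mse, j_i = 0.0, 0.0
    for i in range(len(centres)):
        members = X[labels == i]
        n_i = len(members)
        if n_i == 0:
            continue                                   # empty clusters contribute nothing
        j_mse += ((members - centres[i]) ** 2).sum()   # Eq. (1)
        j_i += -E * n_i * np.log2(n_i / N)             # Eq. (4)
    return j_i + j_mse
```

For example, running the kmeans sketch from Section 1 for k = 1, . . . , 9 and evaluating cost_J for each k gives a curve whose minimum indicates the actual cluster number, in the spirit of Fig. 3.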

Fig. 3. Dataset with 800 data objects clustered into four clusters and values of functions JI, JMSE and JI + JMSE for cluster number k = 1–9.

The data metric dm used for clustering, which minimizes the above cost-function (Eq. (5)), where cluster Ci has centre ci and xt is an input data point, is

dm(x_t, C_i) = \| x_t - c_i \|^2 - E \log_2(p(C_i)), \qquad \sum_{i=1}^{k} p(C_i) = 1, \; 0 \le p(C_i) \le 1, \; i = 1, \ldots, k    (6)

We assign an input data point xt to cluster Ci if the cluster membership function I(xt, i) of Eq. (7) is 1:

I(x_t, i) = \begin{cases} 1 & \text{if } i = \arg\min_{j} \, dm(x_t, C_j), \; j = 1, \ldots, k \\ 0 & \text{otherwise} \end{cases}    (7)

The input data point xt affects the cluster centre of cluster Ci. The winner's centre is modified so that it also takes the input data point xt into account, and the term E log2 p(Ci) in the data metric (Eq. (6)) is automatically decreased for each rival centre, because its p(Ci) decreases while the sum of all probabilities (p(Ci), i = 1, . . . , k) remains 1; the data metric dm therefore grows for the rival centres. The rival cluster centres are thus automatically penalized in the sense of their winning chance. Such penalization of the rival cluster centres can reduce their winning chance to zero. This rival penalized mechanism is described briefly in the next section.
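A short consistency check, not spelled out in the paper and relying on the sign convention adopted for Eq. (4) above: summing the data metric of Eq. (6) over all data points, grouped by the cluster each point is assigned to, recovers exactly the cost-function of Eq. (5), so minimizing dm point-by-point minimizes J.

\sum_{i=1}^{k} \sum_{x_t \in C_i} dm(x_t, C_i)
    = \sum_{i=1}^{k} \sum_{x_t \in C_i} \left( \| x_t - c_i \|^2 - E \log_2 p(C_i) \right)
    = J_{MSE} - E \sum_{i=1}^{k} n_i \log_2 p(C_i)
    = J_{MSE} + J_I = J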
The minimization of the information uncertainty JI allocates the proper number of clusters to the data points, while the minimization of JMSE makes clustering of the input data possible. The values of both functions, JMSE and JI, over nine values of the cluster number (k) for a dataset with a cardinality of 800 drawn from four Gaussian distributions are shown in Fig. 3. The nodes on the curves in Fig. 3 denote the global minimum values of the cost-functions JI and JMSE and of their sum J for the various cluster numbers (k). The global minimum of the sum of both functions corresponds to the number of actual clusters (k = k′).

3. The rival penalized mechanism

This section analyzes the rival penalized mechanism of the proposed metric in Eq. (6). Data assignment based on the data metric to the winner's cluster centre reduces JMSE and drives a group of cluster centres to converge onto the centres of actual clusters. The winner's centre is modified to also take the input data point xt into account. The second term in the data metric is automatically decreased for the rival centres. We show that such a penalization of rival cluster centres can reduce their winning chance to zero.

We consider a simple example of one Gaussian distribution forming one cluster. We set the input number of clusters to 2. The number of input data points is 200, the mean vector is (190, 90) and the standard variance is (0.5, 0.2). At the beginning (t = 0), the data is divided into two clusters with two cluster centres, as shown in Fig. 4a, where each cluster centre is indicated by a rectangle. We denote them as c0(0) and c1(0); t represents the number of iterations in which the data has been repeatedly scanned. The data metric (Eq. (6)) divides the cluster into two regions by a virtual separating line, as shown in Fig. 4a. Data points on the line are at the same distance from both cluster centres. In the next iteration, they are assigned to the cluster with more elements, which makes the second part of the proposed metric smaller (Eq. (7)). We suppose that the first cluster has fewer elements than the second, n0(0) < n1(0). During data scanning, if the centre c1(0) of the second cluster with more elements wins when adapting to the input data point xt, then it moves towards the first cluster centre c0(0) and, consequently, the separating line is moved towards the left, as shown in Fig. 4b. Region 1 of the first cluster becomes smaller, while region 2 of the second cluster expands towards the left. The same repeats throughout the next iterations for points that are near or on the separating line, until c1 gradually converges to the actual cluster centre through minimizing the data metric dm (Eq. (6)), and the centre c0 moves towards the cluster's boundary. The first (rival) cluster has fewer and fewer elements until the number of elements decreases to 0 and its competition chance reaches zero. From Eq. (6) we see that the data metric dm then becomes infinite (p(Ci) = 0). Cluster centre c0 becomes dead, without a chance to win again. When a cluster centre ci(t) is far away from the input data, it lies on one side of the input data and cannot be the winner for any new sample; the change of the cluster centre, Δci, points to the outside of the sample data. If every cluster centre moved away from the sample dataset, then the JMSE cost-function would grow larger and larger. This contradicts the fact that the algorithm decreases the function JMSE, and proves that some centres exist within the sample data.

Fig. 4. The clustering process of one Gaussian distribution with the input parameter (number of clusters) k = 2 after: (a) 10 iterations, (b) 15 iterations and (c) 20 iterations.

The analysis of multiple clusters is more complicated, because of the interactive effects among clusters. In Section 5, various datasets are tested to demonstrate the convergence behaviour of the data metric, which automatically penalizes the winning chance of all rival cluster centres in the subsequent iterations, while the winning cluster centres are moved towards the actual cluster centres.

4. k′-Means algorithm

It is clear from Section 3 that the proposed metric automatically penalizes all rival cluster centres in the competition to get a new point into the cluster. We propose a k′-means algorithm that minimizes the proposed cost-function and data metric. It has two phases. In the first phase, we allocate the k cluster centres in such a way that each actual cluster contains one or more cluster centres. We suppose the input number of cluster centres k is greater than the real number of clusters k′. In the second phase, all rival cluster centres in the same cluster are pushed out of the cluster, each then representing a cluster with no elements. The detailed k′-means algorithm, consisting of two completely separated phases, is suggested as follows.

For the first phase, we use the k-means algorithm as initial clustering to allocate the k cluster centres so that each actual cluster has at least one centre. We suppose that the input parameter, the number of clusters, is greater than the actual number of clusters that the data forms: k > k′.

Step 1: Randomly initialize the k cluster centres in the input dataset.
Step 2: Randomly pick a data point xt from the input dataset and, for j = 1, 2, . . . , k, calculate the cluster membership function I(xt, j) by Eq. (2). Every point is assigned to the cluster whose centroid is closest to that point.
Step 3: For all k cluster centres, set ci to be the centre of mass of all points in cluster Ci:

c_i = \frac{1}{|C_i|} \sum_{x_t \in C_i} x_t    (8)

Steps 2 and 3 are repeated until all cluster centres remain unchanged, or until they change by less than some threshold value. The stopping threshold value is usually selected to be very small. Another way to stop the algorithm is to set an upper limit on the number of iterations. At the end of the first phase of the algorithm, each cluster has at least one centre.

In the first phase, we do not include the extended cluster membership function described by Eq. (7), because the first phase aims to allocate the initial seed-points into some desired regions, rather than to make a precise estimate of the cluster number. This is achieved by the second phase, which repeats the following two steps until all cluster centres converge.

Step 1: Randomly pick a data point xt from the input dataset and, for j = 1, 2, . . . , k, calculate the cluster membership function I(xt, j) by Eq. (7). Every point is assigned to the cluster whose centroid is closest to that point, as defined by the cluster membership function I(xt, j).
Step 2: For all k cluster centres, set ci to be the centre of mass of all points in cluster Ci (Eq. (8)).

Steps 1 and 2 are repeated until all cluster centres remain unchanged for all input data points, or until they change by less than some threshold value. At the end, k′ clusters are discovered, where k′ is the number of actual clusters. The initial seed-points – cluster centres – converge towards the centroids of the input data clusters. All extra seed-points, the difference between k and k′, are driven away from the dataset.

The number of recognized clusters, k′, is implicitly defined by the parameter E (Eq. (6)). E is just a choice of measurement units. E should be from the range of the point coordinates. The coordinate magnitude does not matter, because we only care about point distances. However, experiments have shown that a wide interval exists for E within which a consistent number of actual clusters is discovered in the sample dataset. The heuristic for the parameter E is given in Eq. (9):

E \in [a, 3a], \qquad a = \operatorname{average}(r) + \operatorname{average}(d/2)    (9)

where r is the average radius of the clusters after the first phase of the algorithm and d is the smallest distance between two cluster centres that is greater than 3r. For stronger clustering, one can double the parameter E. If E is smaller than suggested, the algorithm cannot push the redundant cluster centres away from the input regions. On the other hand, if E is too large, the algorithm pushes almost all cluster centres away from the input data.
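A compact sketch of the second phase is given below, assuming the kmeans function from Section 1 is used as the first phase. The helper name kprime_means, the batch update of p(Ci) once per sweep (the paper updates memberships point-by-point), the empty-cluster handling and the convergence test are illustrative assumptions rather than details fixed by the paper.

```python
import numpy as np

def kprime_means(X, k, E, max_iter=100, tol=1e-6, rng=None):
    """Second phase of the k'-means algorithm: assignment by the data metric
    dm of Eq. (6), followed by centre-of-mass updates (Eq. (8))."""
    centres, labels = kmeans(X, k, rng=rng)          # phase 1: ordinary k-means
    N = len(X)
    for _ in range(max_iter):
        counts = np.bincount(labels, minlength=k)
        p = counts / N
        # Penalty term -E log2 p(C_i); empty clusters get +inf and can never win.
        penalty = np.full(k, np.inf)
        penalty[counts > 0] = -E * np.log2(p[counts > 0])
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = (d2 + penalty[None, :]).argmin(axis=1)      # Eq. (7) with dm of Eq. (6)
        # Centre-of-mass update, Eq. (8); empty (rival) clusters keep their centre.
        new_centres = centres.copy()
        for i in range(k):
            members = X[labels == i]
            if len(members) > 0:
                new_centres[i] = members.mean(axis=0)
        if np.linalg.norm(new_centres - centres) < tol:
            centres = new_centres
            break
        centres = new_centres
    kept = np.bincount(labels, minlength=k) > 0      # the k' non-empty clusters survive
    return centres, labels, kept
```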

5. Experimental results

Three simulated experiments were carried out to demonstrate the performance of the k′-means algorithm. The algorithm was also applied to the clustering of a real dataset. The stopping threshold value was set to 10⁻⁶.

5.1. Experiment 1

Experiment 1 used 470 points from a mixture of four Gaussian distributions. The detailed parameters of the input dataset are given in Table 1, where Ni, ci, ri and ai denote the number of samples, the mean vector, the standard variance, and the mixing proportion. The input number of clusters k was set to 10. Fig. 5a shows all 10 clusters and centres after the first phase of the algorithm. Each cluster has at least one seed point. After the second phase, only four seed-points remained, denoting the four cluster centres. As shown in Fig. 5b, the data forms four well-separated clusters. The parameters of the four recognized clusters are given in Table 2.

Table 1
Parameters of dataset 1, where the number of samples N = 470

Cluster number i | Ni  | ci         | ri         | ai
1                | 100 | (0.5, 0.5) | (0.1, 0.1) | 0.213
2                | 50  | (1, 1)     | (0.1, 0.1) | 0.106
3                | 160 | (1.5, 1.5) | (0.2, 0.1) | 0.25
4                | 160 | (1.4, 2.3) | (0.4, 0.2) | 0.34

Fig. 5. Clusters discovered (a) for k = 10 by the k-means algorithm and (b) by the suggested k′-means algorithm.
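The mixture in Table 1 is easy to reproduce; the snippet below is an illustration only (the paper gives no generation code), with an arbitrary random seed, ri read as a per-axis standard deviation, and E picked from the range of the point coordinates as Section 4 suggests.

```python
import numpy as np

rng = np.random.default_rng(0)
params = [  # (N_i, mean vector c_i, per-axis standard deviation r_i) from Table 1
    (100, (0.5, 0.5), (0.1, 0.1)),
    (50,  (1.0, 1.0), (0.1, 0.1)),
    (160, (1.5, 1.5), (0.2, 0.1)),
    (160, (1.4, 2.3), (0.4, 0.2)),
]
X = np.vstack([rng.normal(mean, std, size=(n, 2)) for n, mean, std in params])

centres, labels, kept = kprime_means(X, k=10, E=1.0, rng=0)
print("clusters kept:", int(kept.sum()))   # expected to settle near the four actual clusters
```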

Table 2
The four discovered clusters in Experiment 1

Cluster number i | Ni  | ci
1                | 100 | (0.496, 0.501)
2                | 50  | (0.993, 0.985)
3                | 167 | (1.483, 1.51)
4                | 155 | (1.356, 2.303)

5.2. Experiment 2

In Experiment 2, 800 data points were used, also from a mixture of four Gaussians. The three sets of data, S1, S2 and S3, were generated with different degrees of overlap among the clusters. The sets had different variances of the Gaussian distributions, and the number of points per cluster was controlled by the mixing proportions ai. The detailed parameters for these datasets are given in Table 3.

In sets S1 and S2, the data has a symmetric structure and each cluster has the same number of elements. For such datasets, when the clusters are separated to a certain degree, the algorithm usually converges correctly.

It can be observed from Fig. 6 that all three datasets resulted in correct convergence. The input number of cluster centres was set to 7. Four cluster centres were located around the centres of the four actual clusters, while the three remaining cluster centres were sent far away from the data. The results show that this algorithm can also discover clusters that are not well-separated, as in dataset S3.

Table 3
Parameters of the three datasets for Experiment 2

Dataset number | Cluster number (i) | Ni  | ci     | ri         | ai
S1             | 1                  | 200 | (1, 2) | (0.2, 0.2) | 0.25
               | 2                  | 200 | (2, 1) | (0.2, 0.2) | 0.25
               | 3                  | 200 | (3, 2) | (0.2, 0.2) | 0.25
               | 4                  | 200 | (2, 3) | (0.2, 0.2) | 0.25
S2             | 1                  | 200 | (1, 2) | (0.4, 0.4) | 0.250
               | 2                  | 200 | (2, 1) | (0.4, 0.4) | 0.250
               | 3                  | 200 | (3, 2) | (0.4, 0.4) | 0.250
               | 4                  | 200 | (2, 3) | (0.4, 0.4) | 0.250
S3             | 1                  | 400 | (1, 2) | (0.4, 0.4) | 0.364
               | 2                  | 400 | (2, 1) | (0.4, 0.4) | 0.364
               | 3                  | 150 | (3, 2) | (0.4, 0.4) | 0.136
               | 4                  | 150 | (2, 3) | (0.4, 0.4) | 0.136

Fig. 6. The three sets of input data used in Experiment 2 and the clusters discovered by the proposed k′-means algorithm.

5.3. Experiment 3

The k′-means method was compared to previous model selection criteria and Gaussian mixture estimation methods: MDL, AIC, BIC and MML. The comparison presented by Oliver et al. (1996) was used. The same mixture of three Gaussian components, with the mean of the first component being (0, 0), the second (2, √12), and the third (4, 0), was used. As the dataset in this experiment, 100 data points were generated from this distribution. The results of our method are given in Table 4 for four values of the standard deviation.

Table 4
Predicted number of components for different standard deviations

k        | σx = σy = 0.67 | σx = σy = 1 | σx = σy = 1.2 | σx = σy = 1.33
1        | 0              | 3           | 14            | 45
2        | 0              | 0           | 0             | 0
3 (true) | 99             | 97          | 86            | 55
4        | 1              | 0           | 0             | 0
5        | 0              | 0           | 0             | 0

The counts (e.g., 99 in the first column) indicate the number of times that the actual number of clusters (k = 3) was confirmed in 100 experiments repeated with different cluster centre initializations. The initial number of clusters had been set to 5.

If we compare the obtained results with the MML, AIC, PC, MDL and ICOMP criteria as presented by Oliver et al. (1996) for the three-component distribution, the k′-means algorithm gives considerably better results. The k′-means method confirms the actual (true) number of clusters in 100 experiments repeated with different initializations more frequently than the other criteria. When an incorrect number of clusters was obtained, k′-means predicted fewer clusters, whereas the AIC, PC, MDL and ICOMP criteria often predicted more clusters.

5.4. Experiment 4 with a real dataset

The k′-means algorithm was also applied to a real dataset. Clustering of the wine dataset (Blake and Merz, 1998) was performed, which is a typical real dataset for testing clustering (http://mlearn.ics.uci.edu/databases/wine/). The dataset consists of 178 samples of three types of wine. These data are the results of a chemical analysis of wines grown in the same region but derived from three different cultivars. The analysis determined the quantities of 13 constituents. The correct numbers of elements in the clusters are 48, 71 and 59.

These wine data were first regularized into the interval [0, 300] and then the k′-means algorithm was applied to solve the unsupervised clustering problem of the wine data by setting k = 6.

The k′-means algorithm detected three classes in the wine dataset with a clustering accuracy of 97.75% (there were four errors), which is a rather good result for unsupervised learning methods. This is the same result as obtained by the method of linear mixing kernels with an information minimization criterion (Roberts et al., 2000).

5.5. Discussion and experimental results

As shown by the experiments, k′-means can allocate the correct number of clusters at, or near, the actual cluster centres. Experiment 3 showed that the k′-means algorithm is insensitive to the initial values of the cluster centres and leads to good results. We also found, from Experiment 4 on a real dataset, that the algorithm works well in a high-dimensional space when the clusters are separated to the degree used in Experiment 2. Simulation experiments also showed that when the initial cluster centres are randomly selected from the input dataset, the dead-unit problem does not occur. The experiments further showed that, if two or more clusters overlap seriously, the algorithm regards them as one cluster, and this leads to an incorrect result. When clusters are elliptical, or of some other form, the algorithm can still detect the number of clusters, but the clustering is not as good. For the classification of elliptical clusters, the Mahalanobis distance gives better clustering than the Euclidean distance in the cost-function and data metric (Ma and Cao, 2006).

According to the analysis of the data metric and the simulation experiments, we claim that when the input parameter for the number of clusters, k, is not much larger than the actual number of clusters k′, the algorithm converges correctly. However, when k is much larger than k′, the number of discovered clusters is usually greater than k′.

With the following simulation we demonstrate that a large valid range of k exists for each dataset. On each of the three datasets from Experiment 2 we ran the algorithm 100 times for values k > k′. We increased k from k′ and computed the percentage of valid results. The upper boundary of the valid range for k is the largest integer k at which the valid percentage is larger than or equal to a certain threshold value; we chose 98%. The valid range for the first dataset S1 is 4–24, for the second it is 4–16 and for the third it is 4–9. The parameter E in the data metric has to be doubled for a greater number k.

6. Conclusions

A new clustering algorithm named k′-means has been presented, which performs correct clustering without predetermining the exact number of clusters k. It minimizes a cost-function defined as the sum of the mean-square-error and the information uncertainty, and its rival penalized mechanism has been shown. As the cost-function reduces to a global minimum, the algorithm separates out k′ cluster centres (k′ is the actual number of clusters) that converge towards the actual cluster centres. The other (k − k′) centres are moved far away from the dataset and never win the competition for any data sample. It has been demonstrated by experiments that this algorithm can efficiently determine the actual number of clusters in artificial and real datasets.

References

Ahalt, S.C., Krishnamurty, A.K., Chen, P., Melton, D.E., 1990. Competitive learning algorithms for vector quantization. Neural Networks 3, 277–291.
Akaike, H., 1973. Information theory and an extension of the maximum likelihood principle. In: Proc. 2nd Internat. Symp. on Information Theory, pp. 267–281.
Bezdek, J., 1981. Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, New York.
Blake, C.L., Merz, C.J., 1998. UCI Repository of machine learning databases. Dept. Inf. Comput. Sci., Univ. of California, Irvine [Online]. <http://mlearn.ics.uci.edu/MLRepository.html>.
Bozdogan, H., 1987. Model selection and Akaike's information criterion: the general theory and its analytical extensions. Psychometrika 52, 345–370.
Cheung, Y.M., 2005. On rival penalization controlled competitive learning for clustering with automatic cluster number selection. IEEE Trans. Knowledge Data Eng. 17, 1583–1588.
Khan, S., Ahmad, A., 2004. Cluster centre initialization algorithm for k-means clustering. Pattern Recognition Lett. 25, 1293–1302.
Laszlo, M., Mukherjee, S., 2007. A genetic algorithm that exchanges neighbouring centres for k-means clustering. Pattern Recognition Lett. 28, 2359–2366.
Law, L.T., Cheung, Y.M., 2003. Colour image segmentation using rival penalized controlled competitive learning. In: Proc. 2003 Internat. Joint Conf. on Neural Networks (IJCNN'2003), Portland, Oregon, USA, pp. 20–24.
Ma, J., Cao, B., 2006. The Mahalanobis distance based rival penalized competitive learning algorithm. Lect. Notes Comput. Sci. 3971, 442–447.
MacQueen, J.B., 1967. Some methods for classification and analysis of multivariate observations. In: Proc. 5th Berkeley Symp. on Math. Statist. Prob., vol. 1. University of California Press, Berkeley, pp. 281–297.
Oliver, J., Baxter, R., Wallace, C., 1996. Unsupervised learning using MML. In: Proc. 13th Internat. Conf. on Mach. Learn., pp. 364–372.
Redmond, S.J., Heneghan, C., 2007. A method for initializing the k-means clustering algorithm using kd-trees. Pattern Recognition Lett. 28, 965–973.
Roberts, S.J., Everson, R., Rezek, I., 2000. Maximum certainty data partitioning. Pattern Recognition 33, 833–839.
Schwarz, G., 1978. Estimating the dimension of a model. Ann. Statist. 6, 461–464.
Steinley, D., Brusco, M.J., 2007. Initializing k-means batch clustering: a critical evaluation of several techniques. J. Classif. 24, 99–121.
Wallace, C., Dowe, D., 1999. Minimum message length and Kolmogorov complexity. Comput. J. 42, 270–283.
Xu, L., 1993. Rival penalized competitive learning for clustering analysis, RBF net and curve detection. IEEE Trans. Neural Networks 4, 636–648.
Xu, L., 1997. Bayesian Ying-Yang machine, clustering and number of clusters. Pattern Recognition Lett. 18, 1167–1178.
