You are on page 1of 4

Novel Methods for Initial Seed Selection

in K-Means Clustering for Image Compression


K. Somasundaram
1
and M. Mary Shanthi Rani
2

Department of Computer Science & Applications,
Gandhigram Rural Institute, Deemed University, Tamil Nadu, India
e-mail:
1
ksomasundaram@hotmail.com,
2
shanthifrank@yahoo.com
Abstract!In this paper, we propose two methods to
construct the initial codebook for K-means clustering
based on covariance and spectral decomposition.
Experimental results with standard images show that the
proposed methods produces better quality reconstructed
images measured in terms of Peak Signal to Noise Ratio
(PSNR).
Keywords: Codebook, Variance, Spectral
Decomposition, Eigen value
I. INTRODUCTION
Clustering is an important partitioning technique
which organizes information in the form of clusters
such that patterns within a cluster are more similar to
each other than the patterns belonging to different
clusters [1]. Traditionally, clustering techniques are
broadly divided into hierarchical and partitioning.
Hierarchical algorithms build clusters hierarchically,
whereas partitioning algorithms determine all clusters at
once. The partitioning methods generally result in a set
of K clusters, each object belonging to one cluster. Each
cluster may be represented by a centroid or a cluster
representative which is some sort of summary
description of all the objects contained in a cluster.
Hierarchical algorithms can be agglomerative (bottom-
up) or divisive (top-down). Agglomerative algorithms
begin with each object as a separate cluster and merge
them successively into larger clusters until the
termination criteria is reached. Divisive algorithms
begin with the whole set and proceed to recursively
divide it into smaller clusters.
The Pairwise Nearest Neighbor (PNN) [2] method
belongs to the class of agglomerative clustering
methods. It generates hierarchical clustering using a
sequence of merge operations until the desired number
of clusters is obtained. The main drawback of the PNN
method is its slowness and the time complexity of even
the fastest implementation of the PNN method is lower
bounded by the number of data objects.
K-means [3] is one of the most popular partitioning
techniques which has great number of applications in
the fields of image and video compression [4],[5],
image segmentation [6], pattern recognition and data
mining[7]-[8]. The rest of the paper is organized as
follows: Section II gives an overview of K-means
clustering technique, Section III briefly describes the
method, Section IV presents the results and
performance analysis of the proposed method and
Section V concludes our work.
II. K-MEANS CLUSTERING
The K-means algorithm is a broadly used VQ
technique and was also graded as one of the top ten
algorithms in data mining [9]. This is iterative in nature
and generates a codebook from the training data using a
distortion measure appropriate for the given application
[3], [10].
It is simple and easy to implement and the
computation time mainly depends on the amount of
training data, codebook size, vector dimension, and
distortion measure for convergence. It clusters the given
objects, based on their attributes into K partitions. K-
means comprises of four steps: initialization,
classification, computational and convergence criteria.
There are two issues in creating a K-means
clustering model:
1. Determining the optimal number of clusters to
create
2. Determining the center of each cluster
Determining the number of clusters (K) is specific
to the problem domain. The overall quality of clustering
is the average distance from each data point to its
associated cluster center. Given the number of clusters
K, the second part of the problem is determining where
to place the center of each cluster. Often, points are
scattered and don't fall into easily recognizable groups.
The algorithm starts by partitioning the input points
into K initial sets, either at random or using some
heuristic data. It then calculates the centroid of each set
and constructs a new partition by associating each point
with the closest centroid. The centroid of each set is
recalculated and the algorithm is repeated by alternate
application of these two steps until convergence is
reached where the points no longer switch clusters or
the overall squared error is less than the convergence
threshold.
Although the K-means algorithm procedure will
always terminate, it does not necessarily find the most
optimal configuration, corresponding to the global
objective function minimum. The algorithm is also
significantly sensitive to the initial randomly selected
cluster centers. A simple approach to reduce this effect
is to make multiple runs of the algorithm with different
Novel Methods for Initial Seed Selection in K-Means Clustering for Image Compression 83
K classes and choose the best one according to a given
criterion.
An important component of any clustering
algorithm is the distance measure which is used to find
the similarity between data objects. Typically, the K-
means algorithm uses squared Euclidean distance
measure to determine the distance between an object
and its cluster centroid.
The conventional K-means algorithm for designing
a codebook C = { c
i
, i=1,", N} of size N from the
given number of M training vectors T = { x
m
,
m=1,",M} is given as follows :
Step 1: Initialization:
Iteration number n = 0; codebook at iteration n,
Cn = { cni ; i = 1,",N }; (1)
Let be the convergence threshold.
Step 2: Partitioning
Find the nearest-neighbour partition
V(cnj )={ xm

T: Q(xm)=cnj}, j =1,",N} (2)


where, Q denotes vector quantization operation
and is defined as follows:
Q(x) = cni if d(x, cnj ) d(x, cni ), i = 1,",N (3)
under some distance measure d(x, y),x, y

RK.
Step 3: Codebook Update
Update code vectors Cn = { cnj ; j = 1,"N } to
Cn+1 = {
1 n
j
c
; j = 1,",N } as
1 n
j
c
= C(V (cnj )), (4)
where, C(V (cnj )) is the centroid of the partition V
(cnj ) under the given distance measure d(x, y).
Step 4: Convergence Check
Stop if
1

n n
d d
n
d /
(5)
where, dn =
M / 1

)) ( , (
1
m
M
m
m
x Q x d



Otherwise, replace n by n+ 1 and go to Step 2.
Though K-means algorithm is the most common
and extensively used clustering algorithm due to its
superior scalability and efficiency, its increased time
complexity for convergence poses a serious concern.
This makes it necessary to develop robust strategies to
enable fast convergence of K-means.
A modified K-means algorithm (MODI) was
proposed by Lee et al. [11] which converges faster than
conventional K-means by using a scaled updating
scheme along the direction of the local gradient by a
step-size larger than that used by the centroid update of
the conventional K-means algorithm. While the use of a
scaled-update can accelerate the convergence, use of a
#fixed$ scaling for the entire range of iterations results
in the use of larger step sizes even at iterations closer to
convergence. Consequently, this increases the number
of iterations required for convergence and undesirably
high perturbations of the code vectors as well, to a
poorer local optimum. Kuldip and Ramasubramanian
(NEW) [12] proposed the use of a variable scale factor
which is a function of the iteration number. It offers
faster convergence than the modified K-means
algorithm with a fixed scale factor, without affecting
the optimality of the codebook.
Despite the efficiency of K-means algorithm, it has
some critical issues like apriori fixation of number of
clusters and random selection of initial seeds.
Inappropriate choice of initial seeds may yield poor
results in addition to increase in computation time for
convergence. This has stimulated the researchers to
focus their attention on the initialization step and many
novel initialization methods have been proposed.
The earliest method of initializing the K means
algorithm was done by Forgy in 1965 [13] who chose
points randomly. A pioneering work on seed
initialization was proposed by Ball and Hall (BH) [14].
A similar approach named as Simple Cluster Seeking
(SCS) was proposed by Tou and Gonzales [15]. The
SCS method chooses the first input vector as the first
seed and the rest of the seeds are selected, provided
they are %d& distance apart from all selected seeds. The
SCS and BH methods are sensitive to the parameter d
and the order of the inputs.
A novel initialization method suitable for large
datasets was proposed by Bradley and Fayyad (BF)
[16]. The main idea of their algorithm is to select %m&
subsamples from the dataset, apply the K-means on
each subsample independently, keep the final N centers
from each subsample and produce a set that contains
mN points. A new approach for optimizing K-means
clustering in aspects of accuracy and computation time
has been proposed by Ali and Kiyoki [17]. This
algorithm designates the initial centroids& positions in
the farthest accumulated distance between the data
points in the vector space analogous to choosing pillars&
locations as far as possible from each other within the
pressure distribution of the roof structures for a stable
building. Data points with maximum accumulated
distance among the data distribution are chosen as the
initial centroids.
Babu and Murty [18], proposed a method of near
optimal seed selection using genetic programming.
However, the problem with genetic algorithm is that the
results vary significantly with the choice of population
size, and crossover and mutation probabilities. Huang
and Harris [19] proposed the Direct Search Binary
Splitting (DSBS) method. This method is similar to the
Binary Splitting algorithm except that the splitting step
is enhanced through the use of Principle Component
Analysis (PCA).
Since vector quantization is a natural application of
K-means [20], the centroid index or cluster index and
the centroid are also referred to as #code index$ and
#code vector$ respectively. The result of K-means, the
set of centroids is referred to as the codebook which is
used to quantize vectors with minimum distortion.
84 National Conference on Signal and Image Processing (NCSIP-2012)
III. PROPOSED METHODS
In this paper, two methods Variant Block and
Eigen Block are proposed to construct the initial
codebook to achieve fast convergence, eventually
reducing the computational time of K-means clustering
technique for compressing images.
A. Variant Block Method
The first method is based on the idea that high
variant blocks stand a good choice of initial codebook
as they would maximize intercluster distance. Novelty
in this method is the number of clusters formed %K& is
derived based on the average variance of image blocks.
Nevertheless, the user needs to specify the upper limit
for %K& to control compression rate. The algorithm for
the proposed method is outlined as follows:
Step 1: Divide the input image into 4x4 blocks.
Step 2: Find the covariance of each image block.
Step 3: Find the average covariance and mark it
as threshold value.
Step 4: Find the total number of image blocks %n&
with variance greater than the threshold
value.
Step 5: Set K=n if K>n.
Step 6: Choose only %K& high variant image
blocks
Step 7: Convert each high variant block into 16-
element vector and keep them as initial
seeds.
Step 8: Apply K-means algorithm to construct the
final codebook
The proposed method is image specific as every
image has its own threshold value which is dependent
on the variance of its image blocks.
B. Eigen Block Method
The second method is based on the spectral
decomposition of an image. Clustering algorithms
depend upon the knowledge or acquisition of similarity
information i.e. the affinity between the points in
euclidean space. Spectral techniques, which make use
of information obtained from the eigenvectors and
eigen values of a matrix, have attracted the researchers&
attention with respect to clustering recently. Eigen value
is very important for many applications in science and
engineering such as solution to linear and non-linear
differential equations, boundary value problem, markov
chain, network analysis and population growth model
etc. Spectral clustering methods perform well for some
cases where classical methods (K-means, LBG) fail.
However, for very non-compact clusters, they also tend
to have problems. Hence, spectral decomposition is
incorporated merely as a method for determining the
block structure of the affinity matrix. Consequently, it is
advantageous for clustering techniques, if the affinity
matrix has a clear block structure.
In the proposed method, the significance of eigen
value is used to form the initial codebook. As blocks
having high eigen values are the principal components
of an image, the initial codebook is constructed is
constructed with such blocks. The following steps
describe the algorithm of the proposed eigen block
method.
Step 1: Divide the input image into 4x4 blocks.
Step 2: Form the affinity matrix of each block.
Step 3: Find the eigen values of all blocks.
Step 4: Choose %K& image blocks having high
eigen values.
Step 5: Convert each image block into 16-
element vector.
Step 6: Form the initial codebook with those K
16-element vectors having high eigen
values.
Step7: Apply K-means algorithm with the initial
centroids to construct the final codebook.
IV. RESULTS AND DISCUSSION
The performance of the proposed methods were
evaluated with several test images of size 256256 with
initial block size of 44 on a machine with core2 duo
processor at 2 GHZ using MATLAB. The quality of the
reconstructed image is measured in terms of Peak
Signal to Noise Ratio (PSNR) defined by
PSNR = 20 log
10


255
MSE
(6)
where, MSE is the mean-square error measuring the
deviation of the reconstructed image from the original
image. Q refers to the structural similarity index which
is an universal quality index that models any distortion
as a combination of three different factors - loss of
correlation, mean distortion, and variance distortion.
TABLE 1: PERFORMANCE COMPARISON OF THE PROPOSED METHODS
Test
Image
Method
Execution
Time (in sec.)
Bit Rate
(bpp)
PSNR
Structural
Similarity
Index (Q)
Lena
Variant
Block
Method
39.4 0.21 29.25 0.67
Eigen Block
Method
37.4 0.25 29.24 0.63
K-means 65.0 0.25 29.16 0.69
Boats
Variant
Block
method
24.5 0.25 28.20 0.60
Eigen Block
method
41.5 0.25 28.06 0.59
K-means 135.0 0.25 27.90 0.66
Barbara
Variant
Block
Method
32.8 0.23 25.33 0.68
Eigen Block
method
38.3 0.25 25.18 0.66
K-means 77.5 0.25 25.29 0.69
Novel Methods for Initial Seed Selection in K-Means Clustering for Image Compression 85
Bit rate is measured in bits per pixel (BPP).
Experimental results with test images Lena, Boats and
Barbara are given in Table 1. Table 1 clearly reveals
that both methods achieve fast convergence than K-
means with random initialization of seeds with
enhanced picture quality and bit rate. The execution
time is reduced to almost half of that of traditional K-
means thereby depicting the efficiency of the proposed
methods. Moreover, variant block method can be more
effective than similar methods as it is a variable bit rate
technique and image specific. It achieves variable bit
rate as the size of the codebook depends on the average
variance of the image which varies with different
images. Fig.1 shows the reconstructed images of Lena
using K-means with random initialization and the
proposed methods for visual comparison.


(a) (b)

(c) (d)
Fig. 1a): Original Lena image ; Reconstructed image using b) K-
means c) Variant Block method d) Eigen Block method
V. CONCLUSION
Both variant block and eigen block methods give
better image quality in terms of PSNR value than the
traditional K-means method. Moreover, in terms of
execution time, both of them outperform the K-means
method. Among the two, variant block method
performs better than eigen block method in terms of the
quality of the reconstructed image and execution time.
Both methods stand a good choice for online digital
image applications where execution time plays a vital
role for effective implementation.
REFERENCES
[1] Ankerst, M., M. Breunig, H.P. Kriegel and J. Sander, #OPTICS:
Ordering points to identify the clustering structure$, Proceedings
of ACM SIGMOD International Conference on Management of
Data Mining, ACM Press, Philadelphia, Pennsylvania, United
States, pp. 49-60, June 1999.
[2] K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft, #When
is %nearest neighbor& meaningful?$, Proceedings of the
International Conference on Database Theory, Jerusalem, pp.
217*235, January 1999.
[3] Linde Y., Buzo A., and Gray R.M., #An Algorithm for Vector
Quantizer Design$, IEEE Transactions on Communication, Vol.
28, pp. 84*95, 1980.
[4] N. Venkateswaran, and Y. V. Ramana Rao, #K-Means
Clustering Based Image Compression in Wavelet Domain$,
Information Technology Journal, Vol.6, pp. 148*153, 2007.
[5] Gersho A., and Gray R.M. #Vector Quantization and Signal
compression$, Kluwer Academic Publishers, New York, pp.
761, 1992.
[6] H.P. Ng, S.H. Ong, K.W.C. Foong, P.S. Goh, and W.L.
Nowinski, #Medical image segmentation using k-means
clustering and improved watershed algorithm$, IEEE Southwest
Symposium on Image Analysis and Interpretation, Denver,
pp.61-65, 2006.
[7] Duda, R.O. and P.E. Hart, Pattern Classification and Scene
Analysis. John Wiley Sons, New York, pp. 482, 1973.
[8] Jiang, D.J. Pei and A. Zhang, #An interactive approach to
mining gene expression data$, IEEE Transactions on
Knowledge and Data Engineering, Vol.17, pp. 1363-1380, 2005.
[9] XindongWu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh,
Qiang Yang,Hiroshi Motoda,Geoffrey J. McLachlan,Angus Ng,
Bing Liu,Philip S. Yu, Zhi-Hua Zhou, Michael Steinbach,David
J. Hand, Dan Steinberg, #Top 10 algorithms in data mining$,
Knowledge and Information Systems Journal, Vol.14, pp. 1-37,
2008.
[10] MacQueen, J.B., Some Method for Classification and Analysis
of Multivariate Observations, Proceedings of the Berkeley
Symposium on Mathematical Statistics and Probability,
(MSP&67), Berkeley, University of California Press, pp: 281-
297, 1967.
[11] D. Lee, S. Baek, and K. Sung, #Modified k-means algorithm for
vector quantizer design$, IEEE Signal Processing Letters, Vol.
4, pp. 2*4, 1997.
[12] Kuldip K. Paliwal and V. Ramasubramanian, Comments on
#Modified K-means Algorithm for Vector Quantizer Design$,
IEEE Transactions on Image Processing, Vol. 9, No. 11,
pp.1964-1967, 2000.
[13] E. Forgy, #Cluster analysis of multivariate data: efficiency vs
interpretability of classification$, Biometrics, Vol. 21, pp.768-
769, 1965.
[14] Ball G.H. and Hall D.J., #PROMENADE-an Online Pattern
Recognition System$, Stanford Research Institute Memo,
Stanford University, 1967.
[15] Tou, J. and R. Gonzales, Pattern Recognition Principles,
Addision-Wesley, Reading, MA., pp: 377, 1977
[16] Bradley, P.S. and U.M. Fayyad, # Refining initial points for K-
means clustering$, Proceedings of the 15th International
Conference on Machine Learning (ICML&98), ACM Press,
Morgan Kaufmann, San Francisco, pp. 91-99, 1998
[17] Ali Ridho Barakbah and Yasushi Kiyoki, #A New Approach for
Image Segmentation using Pillar-Kmeans Algorithm $, World
Academy of Science, Engineering and Technology Journal,
Vol.59, pp.23-28, 2009.
[18] G. P. Babu and M. N. Murty, #A near-optimal initial seed value
selection in K-means algorithm using a genetic algorithm$,
Pattern Recognition Letters, Vol. 14, pp. 763*769, 1993.
[19] C. Huang and R. Harris, #A comparison of several codebook
generation approaches$, IEEE Transactions on Image
Processing, Vol. 2 (1), pp. 108*112, 1993.
[20] J.S. Pan, Z.M. Lu and S.H. Sun, #An efficient encoding
algorithm for vector quantization based on subvector
technique$, IEEE Transactions on Image Processing, Vol 12,
No.3, pp. 265*270, 2003.