International Journal of Advance Foundation and Research in Computer (IJAFRC)

Volume 2, Issue 8, August - 2015. ISSN 2348 4853

A Comparative Study Of K-Means, K-Medoid And Enhanced K-Medoid Algorithms
1 A. Rajarajeswari, 2 R. Malathi Ravindran
1 Research Scholar Of Computer Science
2 Assistant Professor Of Computer Science,
NGM College, Pollachi.

ABSTRACT
Clustering plays a vital role in data mining research. Clustering is the process of
partitioning a set of data into meaningful subclasses called clusters. It helps users understand
the natural grouping in a data set. It is an unsupervised form of classification, which means that
it has no predefined classes. Data are grouped into clusters in such a way that data in the same
group are similar and data in different groups are dissimilar. It aims to maximize intra-class
similarity while minimizing inter-class similarity. Clustering is useful for obtaining interesting
patterns and structures from a large set of data. It can be applied in many areas, such as
marketing studies, DNA analysis, city planning, text mining, and web document classification.
Large data sets with many attributes make the task of clustering complex, and many methods have
been developed to deal with these problems. Several researchers have proposed techniques for
analyzing the performance of clustering algorithms in data mining, but not all of these
techniques give good results for the chosen data sets and algorithms. The choice of clustering
algorithm depends both on the type of data available and on the particular purpose and
application. Clustering analysis is one of the main analytical methods in data mining, and some
clustering algorithms suit only certain kinds of input data. In this paper, three well-known
partitioning-based methods, K-means, K-medoid and enhanced K-medoid, are compared. The study
explores the behavior of these three methods. Our experimental results have shown that enhanced
K-medoid performs better than K-means and K-medoid in terms of cluster quality and elapsed time.
Index Terms: Clustering, Classification, Partition Clustering, K-means, K-medoid, enhanced K-medoid.

I. INTRODUCTION

This study gives a comparative review of three partitioning-based clustering methods. Clustering
is the division of data objects into groups of similar objects. Such groups are called clusters.
Objects in the same cluster tend to be similar, while dissimilar objects fall into different
clusters. These clusters represent groups of data and provide simplification by representing
many data objects by fewer clusters. This helps to model data by its clusters.
Clustering is similar to classification in that data are grouped, although the groups are not
defined in advance. A cluster is therefore a collection of objects that are similar to one another
and dissimilar to the objects belonging to other clusters. There exist a large number of
clustering algorithms in the literature. The choice of clustering algorithm depends both on the
type of data available and on the particular purpose and application. Data clustering is an
important problem in a wide variety of areas, such as pattern recognition and bioinformatics.
Clustering is a data description method in data mining that groups the most similar data together.
The purpose is to organize a collection of data items into clusters, such that items within a
cluster are more similar to each
other than they are to items in other clusters. Clustering analysis is one of the main analytical
methods in data mining. K-means is the most popular partition-based clustering algorithm. However,
it can be computationally expensive, and the quality of the resulting clusters depends heavily on
the selection of the initial centroids and on the dimension of the data. Several methods have been
proposed in the literature for improving the performance of the K-means clustering algorithm. In
this research, the representative algorithms K-means, K-medoid and enhanced K-medoid were examined
and analyzed based on their basic approach. The best of them was identified on the basis of
performance.
Clustering methods are mainly suitable for investigating interrelationships between samples in
order to make a preliminary assessment of the sample structure. Clustering techniques are required
because it is very difficult for humans to intuitively understand data in a high-dimensional space.
II. PARTITION CLUSTERING
A partitioning method constructs k (k < n) clusters from a set of n data objects, where each
cluster is also known as a partition. It classifies the data into k groups while satisfying the
following conditions:

Each partition should have at least one object.

Each object should belong to exactly one group.

Given the number of partitions to be constructed (k), this type of clustering method creates an
initial partition. It then moves objects from one group to another using an iterative relocation
technique, searching for an optimal partition. A good partitioning is one in which the distance
between objects in the same partition is small (they are related to each other), whereas the
distance between objects of different partitions is large (they are very different from each
other).
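The two conditions above, together with the notion of a good partitioning, can be illustrated with a minimal Python sketch. The function names `is_valid_partition` and `mean_intra_inter`, the integer labels 0..k-1, the toy data and the use of Euclidean distance are assumptions made for illustration; they are not taken from the study itself.

```python
import math
from itertools import combinations


def is_valid_partition(labels, k):
    """Check the two partition conditions for integer labels 0..k-1.

    Storing one label per object already enforces "exactly one group";
    what remains is that every label is in range and no group is empty.
    """
    if any(not (0 <= lab < k) for lab in labels):
        return False
    return len(set(labels)) == k


def _mean(values):
    return sum(values) / len(values) if values else 0.0


def mean_intra_inter(points, labels):
    """Return (mean distance within groups, mean distance between groups)."""
    intra, inter = [], []
    for (p, lp), (q, lq) in combinations(zip(points, labels), 2):
        (intra if lp == lq else inter).append(math.dist(p, q))
    return _mean(intra), _mean(inter)


# Hypothetical toy data: two well separated groups of two points each.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
labels = [0, 0, 1, 1]
valid = is_valid_partition(labels, k=2)
intra, inter = mean_intra_inter(points, labels)
```

For a good partitioning in the sense described above, `intra` should be clearly smaller than `inter`.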
A. K-Means Clustering
This section describes the original K-means clustering algorithm. The idea is to classify a given
set of data into k disjoint clusters, where the value of k is fixed in advance. The algorithm
consists of two separate phases. The first phase defines k centroids, one for each cluster. The
second phase takes each point belonging to the given data set and associates it with the nearest
centroid. Euclidean distance is generally used to determine the distance between data points and
centroids. When all the points have been assigned to some cluster, the first pass is complete and
an initial grouping has been obtained. At this point the centroids are recalculated, as the
inclusion of new points may lead to a change in the cluster centroids. Once the k new centroids
are found, a new binding is created between the same data points and the nearest new centroid,
generating a loop. As a result of this loop, the k centroids may change their position step by
step. Eventually, a situation is reached in which the centroids do not move anymore. This is the
convergence criterion for clustering.
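The loop described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the study: the function name `k_means`, the `initial`, `max_iter` and `seed` parameters, the random choice of starting centroids and the handling of empty clusters are all assumptions.

```python
import math
import random


def k_means(points, k, initial=None, max_iter=100, seed=0):
    """Cluster `points` (tuples of numbers) into k groups.

    Returns (centroids, labels).
    """
    # Phase 1: define k centroids. Assumption: unless `initial` is given,
    # k distinct data points are drawn at random as starting centroids.
    centroids = list(initial) if initial else random.Random(seed).sample(points, k)
    labels = []
    for _ in range(max_iter):
        # Phase 2: bind every point to its nearest centroid (Euclidean distance).
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Recalculate each centroid as the mean of the points bound to it.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                # Assumption: an empty cluster keeps its previous centroid.
                new_centroids.append(centroids[j])
        # Convergence criterion: stop once no centroid moves.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


# Hypothetical toy data with two obvious groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = k_means(points, k=2, initial=[(0, 0), (0, 1)])
# labels -> [0, 0, 0, 1, 1, 1]
```

Even though both starting centroids lie in the same group here, the recalculation step pulls one of them towards the second group after the first pass, which shows how the centroids change position step by step.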
The process, which is called K-means, appears to give partitions that are reasonably efficient in
the sense of within-class variance, corroborated to some extent by mathematical analysis and
practical experience. Also, the K-means procedure is easily programmed and is computationally
economical, so that it is feasible to process very large samples on a digital computer. K-means is
one of the first algorithms a data analyst will use to investigate a new data set because it is
algorithmically simple, relatively robust and gives good enough answers over a wide variety of
data sets.
B. K-Medoid Clustering
K-medoid is a partition clustering algorithm that selects k clustering centers from the data
objects and establishes an initial partition by assigning each remaining object to its nearest
clustering center, before iterating and moving
clustering centers repeatedly until an optimum partition is reached. Because the value of k and
the initial clustering centers are chosen arbitrarily, its efficiency and accuracy can be low. The
improved K-medoid algorithm, which chooses the k clustering centers under constraint conditions
and needs only one iteration to obtain the clustering result, not only removes the randomness of
clustering center selection but also improves efficiency on complicated data, with the aim of
reaching a globally optimal result.
The K-means method uses a centroid to represent each cluster and is sensitive to outliers. This
means that a data object with an extremely large value may distort the distribution of the data.
The K-medoid method overcomes this problem by using medoids rather than centroids to represent the
clusters. A medoid is the most centrally located data object in a cluster. Here, k data objects
are selected randomly as medoids to represent the k clusters, and each remaining data object is
placed in the cluster whose medoid is nearest (or most similar) to it. After all data objects have
been processed, a new medoid that represents each cluster better is determined, and the entire
process is repeated. Again, all data objects are bound to the clusters based on the new medoids.
In each iteration the medoids may change their location step by step. This process continues until
no medoid moves. As a result, k clusters are found representing the set of n data objects.
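The iterative procedure in the paragraph above can be sketched as follows. This is a hedged illustration only: it follows the assign-then-reselect description given here rather than any particular published variant, and the function name `k_medoids`, its `initial`, `max_iter` and `seed` parameters, the use of Euclidean distance and the rule that "most centrally located" means "smallest total distance to the other members" are assumptions.

```python
import math
import random


def k_medoids(points, k, initial=None, max_iter=100, seed=0):
    """Cluster `points` (tuples of numbers) around k medoids.

    Returns (medoids, labels); every medoid is one of the input points.
    """
    # Assumption: unless `initial` is given, k data objects are picked at random.
    medoids = list(initial) if initial else random.Random(seed).sample(points, k)
    labels = []
    for _ in range(max_iter):
        # Place every data object in the cluster of its nearest medoid.
        labels = [min(range(k), key=lambda j: math.dist(p, medoids[j]))
                  for p in points]
        new_medoids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if not members:
                # Only possible with duplicate points; keep the old medoid.
                new_medoids.append(medoids[j])
                continue
            # New medoid: the member with the smallest total distance
            # to all other members of its cluster.
            new_medoids.append(
                min(members, key=lambda c: sum(math.dist(c, q) for q in members)))
        # Stop when no medoid moves.
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, labels


# Hypothetical toy data with two obvious groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
medoids, labels = k_medoids(points, k=2, initial=[(0, 0), (0, 1)])
# medoids -> [(0, 0), (10, 10)], labels -> [0, 0, 0, 1, 1, 1]
```

Unlike the centroids of K-means, the returned medoids are always actual data objects, which is what makes the representation less sensitive to extreme values.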
C. Enhanced K-Medoid Clustering
K-medoid has high pattern-matching accuracy and is not sensitive to dirty or abnormal data.
However, it requires a very large number of iterations and a lot of time. This reduces the
efficiency of clustering; therefore, it can only be used to handle small amounts of data. With the
development of information technology and the internet, the data held in databases is increasing
rapidly, not only in amount but also in complexity. In view of this, attention must be paid not
only to the efficiency of information acquisition but also to its accuracy. Under these
conditions, an enhanced K-medoid clustering algorithm is established that can meet the needs of
complex data sets.
The enhanced K-medoid algorithm treats every data point as a candidate medoid and calculates the
cost for each individual point. After the cost of all the data points has been calculated, the
number of clusters into which the original data are to be grouped is specified. Since K-medoid is
an unsupervised algorithm, the number of clusters must be supplied by the user. The medoids are
then selected as the data points that scored the lowest cost. For example, if there are ten
clusters, the ten points with the lowest cost are selected as medoids. In this way the algorithm
makes it possible to check all the data points as medoids. The Manhattan distance metric is used
to calculate the distance between the points of a cluster.
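One possible reading of this selection step is sketched below. The description does not define the cost precisely, so the sketch assumes that the cost of a point is its total Manhattan distance to all data points; the function names `manhattan` and `enhanced_k_medoids` and the final nearest-medoid assignment are likewise assumptions and should not be taken as the authors' implementation.

```python
def manhattan(p, q):
    """Manhattan (city-block) distance between two points."""
    return sum(abs(a - b) for a, b in zip(p, q))


def enhanced_k_medoids(points, k):
    """Single-pass medoid selection; returns (medoids, labels)."""
    # Treat every data point as a candidate medoid and compute its cost.
    # Assumption: cost = total Manhattan distance to all data points.
    costs = [sum(manhattan(p, q) for q in points) for p in points]
    # The k points with the lowest cost become the medoids
    # (ties are broken by input order because sorted() is stable).
    order = sorted(range(len(points)), key=lambda i: costs[i])
    medoids = [points[i] for i in order[:k]]
    # Assign each point to its nearest medoid; no further iteration.
    labels = [min(range(k), key=lambda j: manhattan(p, medoids[j]))
              for p in points]
    return medoids, labels


# Hypothetical one-dimensional example written as 2-D points.
points = [(0, 0), (1, 0), (2, 0), (10, 0)]
medoids, labels = enhanced_k_medoids(points, k=2)
# costs are 13, 11, 11, 27, so medoids -> [(1, 0), (2, 0)]
# labels -> [0, 0, 1, 1]
```

Under this literal reading the lowest-cost points tend to lie close together near the centre of the data, as the example shows; any additional constraint on how far apart the selected medoids must be is not stated in the text and is therefore not modelled here.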
III. CONCLUSION
From the above study, it can be concluded that partitioning-based clustering methods are suitable
for small to medium-sized data sets. K-means, K-medoid and enhanced K-medoid all require the
number of desired clusters, k, to be specified in advance. For all of these methods, the result
and the runtime depend on the initial partition. The advantage of K-means is its low computation
cost, while its drawback is sensitivity to noisy data and outliers. By comparison, K-medoid is not
sensitive to noisy data and outliers, but it has a high computation cost. Enhanced K-medoid,
compared with K-means and K-medoid, is much more efficient and is not sensitive to noisy data and
outliers. Its performance on large data sets is also competent.
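The difference in outlier sensitivity between a centroid and a medoid can be seen on a small, made-up cluster. The data and the helper names `centroid` and `medoid` below are hypothetical and serve only to illustrate the point; Euclidean distance is assumed.

```python
import math


def centroid(points):
    """Coordinate-wise mean of the points, as used by K-means."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def medoid(points):
    """The data point with the smallest total distance to all others."""
    return min(points, key=lambda c: sum(math.dist(c, q) for q in points))


# Four points forming a tight square, plus one extreme outlier.
cluster = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0), (100.0, 100.0)]

c = centroid(cluster)  # roughly (21.2, 21.2): dragged far from the square
m = medoid(cluster)    # (2.0, 2.0): still one of the points of the square
```

The single outlier moves the centroid well away from the bulk of the data, while the medoid remains an actual member of the dense group.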
2015, IJAFRC All Rights Reserved

www.ijafrc.org