
(IJCSIS) International Journal of Computer Science and Information Security, Vol. 9, No. 4, April 2011

An Efficient Constrained K-Means Clustering using Self Organizing Map

M. Sakthi¹ and Dr. Antony Selvadoss Thanamani²

¹Research Scholar, ²Associate Professor and Head, Department of Computer Science, NGM College, Pollachi, Tamilnadu.

Abstract--- The rapid worldwide increase in available data leads to difficulty in analyzing that data. Organizing data into interesting collections is one of the most basic forms of understanding and learning. Thus, a proper data mining approach is required to organize data for better understanding. Clustering is one of the standard approaches in the field of data mining. The main aim of this approach is to organize a dataset into a set of clusters, each consisting of similar data items as measured by some distance function. K-Means is the most widely used clustering algorithm because of its effectiveness and simplicity. When the dataset is large, however, K-Means tends to misclassify data points. To overcome this problem, constraints can be included in the algorithm; the resulting algorithm is called Constrained K-Means Clustering. The constraints used in this paper are the must-link constraint, the cannot-link constraint, the δ-constraint and the ε-constraint. For generating the must-link and cannot-link constraints, a Self Organizing Map (SOM) is used. The experimental results show that the proposed algorithm yields better classification than the standard K-Means clustering technique.

Keywords--- K-Means, Self Organizing Map (SOM), Constrained K-Means

I. INTRODUCTION

The growth of sensing and storage technology, and the drastic development of applications such as internet search, digital imaging, and video surveillance, have generated many high-volume, high-dimensional data sets. As the majority of the data are stored digitally in electronic media, they offer high potential for the development of automatic data analysis, classification, and retrieval approaches.

Clustering is one of the most popular approaches used for data analysis and classification. Cluster analysis is widely used in disciplines that involve analysis of multivariate data. A search through Google Scholar found 1,660 entries with the words "data clustering" that appeared in 2007 alone. This reflects the significance of clustering in data analysis; it would be very hard to list all the scientific fields and applications that have utilized clustering methods, let alone the thousands of existing techniques.

The main aim of data clustering is to identify the natural classification of a set of patterns, points, or objects. Webster defines cluster analysis as "a statistical classification method for discovering whether the individuals of a population fall into various groups by making quantitative comparisons of multiple characteristics". Another definition of clustering is: given a representation of n objects, determine K groups according to a measure of similarity, such that similarities among objects in the same group are high while similarities between objects in different groups are low.

The main advantages of using clustering algorithms are:
• Compactness of representation.
• Fast, incremental processing of new data points.
• Clear and fast identification of outliers.

The most widely used clustering technique is K-Means, because it is very simple to implement and effective in clustering. But K-Means lacks performance when large datasets are involved. This can be addressed by including constraints [8, 9] in the clustering algorithm; the resulting algorithm is called Constrained K-Means Clustering [7, 10]. The constraints used in this paper are the must-link constraint, the cannot-link constraint [14, 16], the δ-constraint and the ε-constraint. A Self Organizing Map (SOM) is used in this paper for generating the must-link and cannot-link constraints.

II. RELATED WORKS

Zhang Zhe et al. [1] proposed an improved K-Means clustering algorithm. The K-means algorithm [8] is extensively utilized in spatial clustering, where the mean value of each cluster centroid is taken as the heuristic information; the approach therefore has limitations such as sensitivity to the initial centroid and instability. The enhanced clustering algorithm refers to the best clustering centroid found during the optimization of the clustering centroid. This increases

the searching probability around the best centroid and enhances the robustness of the approach. Experiments were performed on two groups of representative datasets, and the observations clearly show that the improved K-means algorithm performs better in global searching and is less sensitive to the initial centroid.

Hai-xiang Guo et al. [2] put forth an improved genetic k-means algorithm for optimal clustering. The value of k must be known in advance in the traditional k-means approach, yet it is very tough to confirm the value of k accurately in advance. The authors proposed an improved genetic k-means clustering (IGKM) and built a fitness function defined as the product of three factors, maximization of which guarantees the formation of a small number of compact clusters with large separation between at least two clusters. Finally, experiments were conducted on two artificial and three real-life data sets that compare IGKM with traditional methods such as the k-means algorithm, a GA-based technique and the genetic k-means algorithm (GKM) by inter-cluster distance (ITD), inner-cluster distance (IND) and rate of separation exactness. The experimental observations make clear that IGKM reaches the optimal value of k with high accuracy.

Yanfeng Zhang et al. [3] proposed an agglomerative fuzzy K-means clustering method with automatic selection of the cluster number (NSS-AKmeans) for learning the optimal number of clusters and for providing significant clustering results. High-density areas can be detected by NSS-AKmeans, and from these centers the initial cluster centers can also be determined with a neighbor-sharing selection approach. An Agglomeration Energy (AE) factor is proposed in order to choose an initial cluster representing the global density relationship of objects. Moreover, in order to calculate the local neighbor-sharing relationship of objects, a Neighbors Sharing Factor (NSF) is used. An agglomerative fuzzy k-means clustering algorithm is then utilized to further merge these initial centers to reach the preferred number of clusters and create better clustering results. Experimental observations on several data sets have shown that the proposed approach is very effective in automatically identifying the true cluster number and in providing correct clustering results.

Xiaoyun Chen et al. [4] described GK-means, an efficient K-means clustering algorithm based on a grid. Clustering analysis is extensively used in several applications such as pattern recognition, data mining and statistics. The K-means approach, based on reducing a formal objective function, is most broadly used in research; but the number of clusters k must be specified by the user, it is difficult to choose effective initial centers, and the approach is very susceptible to noise data points. In this paper, the authors mainly focus on obtaining better initial centers to enhance the quality of k-means and to minimize the computational complexity of the k-means approach. The proposed GK-means integrates a grid structure and a spatial index with the k-means clustering approach. Theoretical analysis and experimental observations show that the proposed approach performs significantly better, with higher efficiency.

Trujillo et al. [5] proposed an approach combining K-means and semivariogram-based grid clustering. Clustering is widely used in various applications, including data mining, information retrieval, image segmentation, and data classification. A clustering technique for grouping data sets that are indexed in space is proposed in this paper. The approach mainly depends on the k-means clustering technique and on grid clustering. K-means clustering is the simplest and most widely used approach; its main disadvantage is that it is sensitive to the selection of the initial partition. Grid clustering is extensively used for grouping data that are indexed in space. The main aim of the proposed clustering approach is to eliminate the high sensitivity of the k-means clustering approach to the starting conditions by using the available spatial information. A semivariogram-based grid clustering technique is used in this approach; it utilizes the spatial correlation for obtaining the bin size. The authors combine this approach with a conventional k-means clustering technique, as the bins are constrained to regular blocks while the spatial distribution of objects is irregular. An effective initialization of the k-means is provided by the semivariogram. From the experimental results, it is clearly observed that the final partition preserves the spatial distribution of the objects.

Huang et al. [6] put forth automated variable weighting in k-means type clustering that can automatically estimate variable weights. A novel approach is introduced to the k-means algorithm to iteratively update variable weights depending on the current partition of the data, and a formula for weight calculation is also proposed. The convergence theorem of the new clustering algorithm is given in the paper. The variable weights produced by the approach estimate the significance of the variables in clustering and can be deployed for variable selection in various data mining applications where large and complex real data are often used. Experiments were conducted on both synthetic and real data, and the observations show that the proposed approach provides higher performance than traditional k-means type algorithms in recovering clusters in data.

III. METHODOLOGY

The methodology proposed for clustering the data is presented in this section. Initially, K-Means clustering is described. Then the constraint-based K-Means clustering is provided. Next, the constraints used in the Constrained K-Means algorithm are presented. For the generation of constraints like must-link and cannot-link, a Self Organizing Map is used in this paper.

K-Means Clustering

Given a set of data samples, a desired number of clusters, k, and a set of k initial starting points, the k-means clustering technique determines the desired number of distinct clusters and their centroids. A centroid is the point whose coordinates are obtained by averaging each coordinate (i.e., feature value) over the points allocated to the cluster. Formally, the k-means clustering algorithm proceeds as follows.

Step 1: Choose a number of desired clusters, k.

Step 2: Choose k starting points to be used as initial estimates of the cluster centroids. These are the initial starting values.

Step 3: Examine each point in the data set and assign it to the cluster whose centroid is nearest to it.

Step 4: When each point is assigned to a cluster, recalculate the new k centroids.

Step 5: Repeat steps 3 and 4 until no point changes its cluster assignment, or until a maximum number of passes through the data set is performed.
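As an illustration, the five steps above can be sketched in Python with NumPy. This is a minimal sketch, not the paper's implementation (which was written in MATLAB); the random initialization and all names are our own choices.

    import numpy as np

    def kmeans(X, k, max_passes=100, seed=0):
        # X is an (n, d) array of n data points; k is the desired
        # number of clusters (Step 1).
        rng = np.random.default_rng(seed)
        # Step 2: use k randomly chosen data points as initial centroids.
        centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
        labels = np.full(len(X), -1)
        for _ in range(max_passes):                    # Step 5: bounded passes
            # Step 3: assign every point to its nearest centroid.
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            new_labels = dists.argmin(axis=1)
            if np.array_equal(new_labels, labels):     # Step 5: stable, stop
                break
            labels = new_labels
            # Step 4: recompute each centroid as the mean of its members.
            for j in range(k):
                if np.any(labels == j):
                    centroids[j] = X[labels == j].mean(axis=0)
        return centroids, labels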
Constrained K-Means Clustering

Constrained K-Means Clustering [15] is similar to the standard K-Means clustering algorithm, except that the constraints must be satisfied while assigning the data points to clusters. The algorithm for Constrained K-Means Clustering is described below.

Step 1: Choose a number of desired clusters, k.

Step 2: Choose k starting points to be used as initial estimates of the cluster centroids. These are the initial starting values.

Step 3: Examine each point in the data set and assign it to the cluster whose centroid is nearest to it, but only when violate-constraints( ) returns false.

Step 4: When each point is assigned to a cluster, recalculate the new k centroids.

Step 5: Repeat steps 3 and 4 until no point changes its cluster assignment, or until a maximum number of passes through the data set is performed.

Function violate-constraints( )
    if must-link constraint not satisfied
        return true
    elseif cannot-link constraint not satisfied
        return true
    elseif δ-constraint not satisfied
        return true
    elseif ε-constraint not satisfied
        return true
    else
        return false

Constraints used for Constrained K-Means Clustering

The constraints [11, 12, 13] used for Constrained K-Means Clustering are:
• Must-link constraint
• Cannot-link constraint
• δ-constraint
• ε-constraint

Consider S = {s1, s2, …, sn} as a set of n data points that are to be separated into clusters. For any pair of points si and sj in S, the distance between them is denoted d(si, sj), with the symmetry property d(si, sj) = d(sj, si). The constraints are:

• Must-link constraint: two points si and sj (i ≠ j) in S have to be in the same cluster.
• Cannot-link constraint: two points si and sj (i ≠ j) in S must not be placed in the same cluster.
• δ-constraint: this constraint specifies a value δ > 0. Formally, for any pair of clusters Si and Sj (i ≠ j), and any pair of points sp and sq such that sp ∈ Si and sq ∈ Sj, d(sp, sq) ≥ δ.
• ε-constraint: this constraint specifies a value ε > 0, and the feasibility requirement is the following: for any cluster Si containing two or more points and for any point sp ∈ Si, there must be another point sq ∈ Si such that d(sp, sq) ≤ ε.
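To make the constraint check concrete, the pseudocode above can be fleshed out as follows. This is one plausible reading under the definitions just given, not the authors' code: must_link and cannot_link hold index pairs, labels[j] == -1 marks a point not yet assigned, and a point is placed in the nearest centroid for which no constraint is violated (a stricter variant, as in COP-KMEANS [10, 15], fails instead of trying the next-nearest centroid).

    import numpy as np

    def violates_constraints(i, c, labels, X, must_link, cannot_link, delta, eps):
        # Must-link: a partner of point i already sits in a different cluster.
        for a, b in must_link:
            if i in (a, b):
                other = b if i == a else a
                if labels[other] != -1 and labels[other] != c:
                    return True
        # Cannot-link: a forbidden partner of point i is already in cluster c.
        for a, b in cannot_link:
            if i in (a, b):
                other = b if i == a else a
                if labels[other] == c:
                    return True
        d = np.linalg.norm(X - X[i], axis=1)
        # δ-constraint: i may not come closer than delta to any point
        # already assigned to a different cluster.
        in_other = (labels != -1) & (labels != c)
        if np.any(in_other & (d < delta)):
            return True
        # ε-constraint: if c already has members, i needs an ε-neighbour in c.
        in_c = labels == c
        if np.any(in_c) and not np.any(in_c & (d <= eps)):
            return True
        return False

    def assign_points(X, centroids, must_link, cannot_link, delta, eps):
        # Step 3 of the constrained algorithm: nearest feasible centroid.
        labels = np.full(len(X), -1)
        for i in range(len(X)):
            for c in np.argsort(np.linalg.norm(centroids - X[i], axis=1)):
                if not violates_constraints(i, c, labels, X,
                                            must_link, cannot_link, delta, eps):
                    labels[i] = c
                    break
            # If no centroid is feasible, labels[i] stays -1; the paper
            # does not say how such points are handled.
        return labels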

The must-link and cannot-link constraints are determined with the help of an appropriate neural network; for this purpose, this paper uses the Self Organizing Map.

Self Organizing Map

The Self-Organizing Map (SOM) is a general type of neural network technique: a nonlinear regression method that can be used to determine relationships between inputs and outputs, or to categorize data so as to reveal previously unknown patterns or structures. It is an outstanding technique for the exploratory phase of data mining, and studies indicate that self-organizing maps are a feasible technique for categorizing large quantities of data. The SOM has established its place as an extensively used technique in data analysis and in the visualization of high-dimensional data. Among statistical techniques the SOM has no close counterpart, and it therefore offers a complementary view of the data. SOM is also the most extensively used technique in this group because it offers notable merits over the alternatives: ease of use, particularly for inexperienced users, and a highly intuitive display of the data projected onto a regular two-dimensional grid, as on a sheet of paper. The most important potential of the SOM is in exploratory data analysis, which differs from regular statistical data analysis in that there is no assumed set of hypotheses to be validated; instead, hypotheses are generated from the data in the data-driven exploratory stage and validated in the confirmatory stage. There are a few cases where the exploratory stage is adequate on its own, such as visualization of data with no additional quantitative statistical inference upon it. In practical data analysis problems the most common task is to identify dependencies among variables; here the SOM can be used to gain insight into the data and for the initial search for potential dependencies. In general the findings need to be validated with more conventional techniques, in order to assess the confidence of the conclusions and to discard those that are not statistically significant.

Initially the chosen parameters are normalized and the SOM network is initialized. The SOM is then trained to provide the maximum likelihood estimate, so that a particular sample can be associated with a particular node in the categorization layer. Self-organizing networks assume a topological structure among the cluster units. There are m cluster units, arranged in a one- or two-dimensional array, and the input signals are n-dimensional. Figure 1 shows the architecture of a self-organizing network, which consists of an input layer and a Kohonen (clustering) layer.

[Figure 1: Architecture of the self-organizing map]

Finally, the categorized data is obtained from the SOM. From this categorized data, the must-link and cannot-link constraints are derived: data points in the same cluster are taken as must-link pairs, and data points in different clusters are taken as cannot-link pairs. These constraints are used in the constraint-checking module of the constrained K-Means algorithm.
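The paper gives no code for this step, but the idea can be sketched with the third-party MiniSom package (an assumed stand-in: the paper's own SOM was built in MATLAB, and the grid size and training length below are illustrative, not from the paper). Points mapped to the same SOM node become must-link pairs, and points mapped to different nodes become cannot-link pairs; in practice the quadratic number of cannot-link pairs would be subsampled.

    import numpy as np
    from minisom import MiniSom   # pip install minisom

    def som_constraints(X, grid=(3, 3), iters=500):
        # Train a small SOM and read off the best-matching unit per point.
        som = MiniSom(grid[0], grid[1], X.shape[1], sigma=1.0, learning_rate=0.5)
        som.random_weights_init(X)
        som.train_random(X, iters)
        units = [som.winner(x) for x in X]
        # Same unit -> must-link pair; different units -> cannot-link pair.
        must_link, cannot_link = [], []
        for i in range(len(X)):
            for j in range(i + 1, len(X)):
                (must_link if units[i] == units[j] else cannot_link).append((i, j))
        return must_link, cannot_link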
IV. EXPERIMENTAL RESULTS

The proposed technique is evaluated on two benchmark datasets, the Iris and Wine datasets from the UCI Machine Learning Repository [17]. All algorithms are implemented with the same initial values and stopping conditions. The experiments are all performed on a GENX computer with a 2.6 GHz Core(TM)2 Duo processor using MATLAB version 7.5.

Experiment with Iris Dataset

The Iris flower data set (Fisher's Iris data set) is a multivariate data set comprising 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured on every sample: the length and the width of the sepal and of the petal, in centimeters. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other, and the data set is used as a typical test for many classification techniques. The proposed method is tested first on this Iris dataset, which has four continuous features and 150 instances: 50 per class.

To evaluate the efficiency of the proposed approach, the technique is compared with the existing K-Means algorithm using the Mean Square Error (MSE) of the centers, MSE = ||vc - vt||^2, where vc is the computed center and vt is the true center. The cluster centers found by the proposed K-Means are closer to the true centers than the centers found by the standard K-Means algorithm. The mean square errors of the three cluster centers for the two approaches are presented in Table I.
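Read this way, the center error is the squared Euclidean distance between a computed center vc and its true counterpart vt (the formula is garbled in the source, so this is a reconstruction). A small sketch of the evaluation, using scikit-learn's bundled copy of Iris as a stand-in for the UCI file, the kmeans sketch from Section III, and a greedy cluster-to-class matching of our own:

    import numpy as np
    from sklearn.datasets import load_iris   # stand-in for the UCI Iris file

    def center_mse(v_c, v_t):
        # MSE of a center, read as ||v_c - v_t||^2.
        return float(np.sum((np.asarray(v_c) - np.asarray(v_t)) ** 2))

    iris = load_iris()
    X, y = iris.data, iris.target
    true_centers = np.array([X[y == c].mean(axis=0) for c in range(3)])
    centroids, _ = kmeans(X, k=3)   # the sketch from Section III
    # Match each computed center to its nearest true center before scoring.
    errors = [min(center_mse(c, t) for t in true_centers) for c in centroids]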

The execution times of the proposed and the standard K-Means algorithms are shown in Figure 2.

TABLE I
MEAN SQUARE ERROR VALUES OBTAINED FOR THE THREE CLUSTERS IN THE IRIS DATASET

              K-Means    Proposed K-Means
  Cluster 1   0.3765     0.2007
  Cluster 2   0.4342     0.2564
  Cluster 3   0.3095     0.1943

[Figure 2: Execution Time for Iris Dataset — bar chart comparing K-Means with the proposed K-Means; y-axis: time in seconds, 0 to 1.2.]

Experiment with Wine Dataset

The wine dataset is the result of a chemical analysis of wines grown in the same region of Italy but derived from three different cultivars. The analysis established the quantities of 13 constituents found in each of the three types of wine. Classes 1, 2 and 3 have 59, 71 and 48 instances respectively, and there are 13 attributes in total.

The MSE values for the three clusters are presented in Table II. The execution times of the proposed and the standard K-Means algorithms are shown in Figure 3.

TABLE II
MEAN SQUARE ERROR VALUES OBTAINED FOR THE THREE CLUSTERS IN THE WINE DATASET

              K-Means    Proposed K-Means
  Cluster 1   0.4364     0.3094
  Cluster 2   0.5562     0.3572
  Cluster 3   0.2142     0.1843

[Figure 3: Execution Time for Wine Dataset — bar chart comparing K-Means with the proposed K-Means; y-axis: time in seconds, 0 to 1.2.]

From the experimental observations it can be seen that the proposed approach produces better clusters than the existing approach. The MSE value is greatly reduced for both datasets, which indicates the better accuracy of the proposed approach. The execution time is also reduced compared to the existing approach, again on both datasets.

V. CONCLUSION

The worldwide increase in data leads to a requirement for better analysis techniques and a better understanding of data. One of the most essential modes of understanding and learning is categorizing data into reasonable groups, which can be achieved by the well-known data mining technique called clustering. Clustering separates the given data into groups according to the similarity among the data points, and thereby helps in better understanding and analysis of vast data. One of the most widely used clustering algorithms is K-Means, because it is simple and efficient; but it lacks classification accuracy when large datasets are clustered, so K-Means clustering needs to be improved to suit all kinds of data. Hence the clustering technique called Constrained K-Means Clustering is introduced. The constraints used in this paper are the must-link constraint, the cannot-link constraint, the δ-constraint and the ε-constraint, with SOM used to generate the must-link and cannot-link constraints. The experimental results show that the proposed technique results in better classification and also takes less time for classification. In future, this work can be extended by using more suitable constraints in the Constrained K-Means Clustering technique.

REFERENCES
[1] Zhang Zhe, Zhang Junxi and Xue Huifeng, "Improved K-Means
Clustering Algorithm", Congress on Image and Signal Processing, Vol.
5, Pp. 169-172, 2008.
[2] Hai-xiang Guo, Ke-jun Zhu, Si-wei Gao and Ting Liu, "An Improved
Genetic k-means Algorithm for Optimal Clustering", Sixth IEEE
International Conference on Data Mining Workshops, Pp. 793-797,
2006.
[3] Yanfeng Zhang, Xiaofei Xu and Yunming Ye, "NSS-AKmeans: An
Agglomerative Fuzzy K-means clustering method with automatic
selection of cluster number", 2nd International Conference on Advanced
Computer Control, Vol. 2, Pp. 32-38, 2010.
[4] Xiaoyun Chen, Youli Su, Yi Chen and Guohua Liu, "GK-means: an
Efficient K-means Clustering Algorithm Based on Grid", International
Symposium on Computer Network and Multimedia Technology, Pp. 1-
4, 2009.
[5] Trujillo, M., Izquierdo, E., "Combining K-means and semivariogram-
based grid clustering", 47th International Symposium, Pp. 9-12, 2005.
[6] Huang, J.Z., Ng, M.K., Hongqiang Rong and Zichen Li, "Automated
variable weighting in k-means type clustering", IEEE Transactions on
Pattern Analysis and Machine Intelligence, Vol. 27, No. 5, Pp. 657-668,
2005.
[7] Yi Hong and Sam Kwong, “Learning Assignment Order of Instances for the Constrained K-means Clustering Algorithm”, IEEE Transactions on Systems, Man, and Cybernetics, Vol. 39, No. 2, April 2009.
[8] I. Davidson, M. Ester and S.S. Ravi, “Agglomerative hierarchical clustering with constraints: Theoretical and empirical results”, in Proc. of Principles of Knowledge Discovery from Databases (PKDD), 2005.
[9] Kiri Wagstaff, Sugato Basu and Ian Davidson, “When is constrained clustering beneficial, and why?”, National Conference on Artificial Intelligence, Boston, Massachusetts, 2006.
[10] Kiri Wagstaff, Claire Cardie, Seth Rogers and Stefan Schroedl, “Constrained K-means Clustering with Background Knowledge”, in Proc. of the Eighteenth International Conference on Machine Learning (ICML '01), 2001.
[11] I. Davidson, M. Ester and S.S. Ravi, “Efficient incremental constrained clustering”, in Thirteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Jose, California, USA, August 12-15, 2007.
[12] I. Davidson, M. Ester and S.S. Ravi, “Clustering with constraints: Feasibility issues and the K-means algorithm”, in Proc. SIAM SDM 2005, Newport Beach, USA.
[13] D. Klein, S.D. Kamvar and C.D. Manning, “From Instance-Level Constraints to Space-Level Constraints: Making the Most of Prior Knowledge in Data Clustering”, in Proc. 19th Intl. Conf. on Machine Learning (ICML 2002), Sydney, Australia, July 2002, Pp. 307-314.
[14] N. Nguyen and R. Caruana, “Improving classification with pairwise constraints: A margin-based approach”, in Proc. of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD '08).
[15] K. Wagstaff, C. Cardie, S. Rogers and S. Schroedl, “Constrained K-means clustering with background knowledge”, in Proc. of the 18th Int. Conf. on Machine Learning (ICML '01), Pp. 577-584.
[16] Y. Hu, J. Wang, N. Yu and X.-S. Hua, “Maximum Margin Clustering with Pairwise Constraints”, in Proc. of the Eighth IEEE International Conference on Data Mining (ICDM), Pp. 253-262, 2008.
[17] Merz C and Murphy P, UCI Repository of Machine Learning Databases, Available: ftp://ftp.ics.uci.edu/pub/machine-Learning-databases.
