An Efficient Constrained K-Means Clustering using Self Organizing Map
Abstract--- The rapid worldwide increase in available data makes analyzing those data difficult. Organizing data into interesting collections is one of the most basic forms of understanding and learning. Thus, a proper data mining approach is required to organize data for better understanding. Clustering is one of the standard approaches in the field of data mining. The main aim of this approach is to organize a dataset into a set of clusters, each consisting of similar data items as measured by some distance function. The K-Means algorithm is the most widely used clustering algorithm because of its effectiveness and simplicity. When the dataset is large, however, K-Means may misclassify data points. To overcome this problem, constraints can be included in the algorithm; the resulting algorithm is called Constrained K-Means Clustering. The constraints used in this paper are the must-link constraint, cannot-link constraint, δ-constraint and ε-constraint. A Self Organizing Map (SOM) is used in this paper to generate the must-link and cannot-link constraints. Experimental results show that the proposed algorithm yields better classification than the standard K-Means clustering technique.
Keywords--- K-Means, Self Organizing Map (SOM), Constrained K-Means

I. INTRODUCTION
The growth in sensing and storage technology and drastic developments in applications such as internet search, digital imaging, and video surveillance have generated many high-volume, high-dimensional data sets. As the majority of these data are stored digitally in electronic media, they offer high potential for the development of automatic data analysis, classification, and retrieval approaches.

Clustering is one of the most popular approaches used for data analysis and classification. Cluster analysis is widely used in disciplines that involve the analysis of multivariate data. A search through Google Scholar found 1,660 entries with the words "data clustering" that appeared in 2007 alone. This huge volume of work demonstrates the significance of clustering in data analysis. It is very difficult to list the different scientific fields and applications that have utilized clustering methods, as well as the thousands of existing techniques.

The main aim of data clustering is to identify the natural classification of a set of patterns, points, or objects. Webster defines cluster analysis as "a statistical classification method for discovering whether the individuals of a population fall into various groups by making quantitative comparisons of multiple characteristics". Another definition of clustering is: given a representation of n objects, determine K groups such that the similarities among objects in the same group are high while the similarities between objects in different groups are low.

The main advantages of using clustering algorithms are:

• Compactness of representation.
• Fast, incremental processing of new data points.
• Clear and fast identification of outliers.

The most widely used clustering technique is K-Means, because it is very simple to implement and effective in clustering. However, K-Means loses performance when a large dataset is involved. This can be addressed by including constraints [8, 9] in the clustering algorithm; the resulting method is called Constrained K-Means Clustering [7, 10]. The constraints used in this paper are the must-link constraint, cannot-link constraint [14, 16], δ-constraint and ε-constraint. A Self Organizing Map (SOM) is used in this paper to generate the must-link and cannot-link constraints.

II. RELATED WORKS
Zhang Zhe et al. [1] proposed an improved K-Means clustering algorithm. The K-means algorithm [8] is extensively utilized in spatial clustering. The mean value of each cluster centroid in this approach is taken as the heuristic information, so it has some limitations, such as sensitivity to the initial centroid and instability. The enhanced clustering algorithm refers to the best clustering centroid, which is searched for during the optimization of the clustering centroid. This increases
M. Sakthi¹ and Dr. Antony Thanamani²
¹Research Scholar, ²Associate Professor and Head, Department of Computer Science, NGM College, Pollachi, Tamilnadu

(IJCSIS) International Journal of Computer Science and Information Security, Vol. 9, No. 4, April 2011, http://sites.google.com/site/ijcsis/, ISSN 1947-5500

the searching probability around the best centroid and enhances the strength of the approach. The experiment is performed on two groups of representative datasets, and from the experimental observation it is clearly noted that the improved K-means algorithm performs better in global searching and is less sensitive to the initial centroid.

Hai-xiang Guo et al. [2] put forth an improved genetic k-means algorithm for optimal clustering. The value of k must be known in advance in the traditional k-means approach, and it is very tough to confirm the value of k accurately in advance. The authors proposed an enhanced genetic k-means clustering (IGKM) and built a fitness function defined as a product of three factors, maximization of which guarantees the formation of a small number of compact clusters with large separation between at least two clusters. Finally, experiments are conducted on two artificial and three real-life data sets that compare IGKM with other traditional methods such as the k-means algorithm, a GA-based technique and the genetic k-means algorithm (GKM) by inter-cluster distance (ITD), inner-cluster distance (IND) and rate of separation exactness. From the experimental observation, it is clear that IGKM reaches the optimal value of k with high accuracy.

Yanfeng Zhang et al. [3] proposed an agglomerative fuzzy K-means clustering method with automatic selection of the cluster number (NSS-AKmeans) for learning the optimal number of clusters and for providing significant clustering results. High-density areas can be detected by NSS-AKmeans, and from these centers the initial cluster centers can also be determined with a neighbor-sharing selection approach. An Agglomeration Energy (AE) factor is proposed in order to choose an initial cluster representing the global density relationship of objects. Moreover, in order to calculate the local neighbor-sharing relationship of objects, a Neighbors Sharing Factor (NSF) is used. An agglomerative fuzzy k-means clustering algorithm is then utilized to further merge these initial centers to reach the preferred number of clusters and create better clustering results. Experimental observations on several data sets have shown that the proposed approach is very effective in automatically identifying the true cluster number and also in providing correct clustering results.

Xiaoyun Chen et al. [4] described GK-means, an efficient K-means clustering algorithm based on a grid. Clustering analysis is extensively used in several applications such as pattern recognition, data mining and statistics. The k-means approach, based on minimizing a formal objective function, is most broadly used in research. However, the user must specify the number of clusters k, it is difficult to choose effective initial centers, and the method is very susceptible to noisy data points. In this paper, the authors mainly focus on choosing better initial centers to enhance the quality of k-means and to minimize the computational complexity of the k-means approach. The proposed GK-means integrates a grid structure and spatial index with the k-means clustering approach. Theoretical analysis and experimental observation show that the proposed approach performs significantly better, with higher efficiency.

Trujillo et al. [5] proposed a combined K-means and semivariogram-based grid clustering approach. Clustering is widely used in various applications, including data mining, information retrieval, image segmentation, and data classification. A clustering technique for grouping data sets that are indexed in space is proposed in this paper. This approach mainly depends on the k-means clustering technique and grid clustering. K-means clustering is the simplest and most widely used approach; its main disadvantage is that it is sensitive to the selection of the initial partition. Grid clustering is extensively used for grouping data that are indexed in space. The main aim of the proposed clustering approach is to eliminate the high sensitivity of the k-means clustering approach to the starting conditions by using the available spatial information. A semivariogram-based grid clustering technique is used in this approach; it utilizes the spatial correlation for obtaining the bin size. The authors combine this approach with a conventional k-means clustering technique, as the bins are constrained to regular blocks while the spatial distribution of objects is irregular. An effective initialization of the k-means is provided by the semivariogram. From the experimental results, it is clearly observed that the final partition preserves the spatial distribution of the objects.

Huang et al. [6] put forth automated variable weighting in k-means type clustering that can automatically estimate variable weights. A novel approach is introduced to the k-means algorithm to iteratively update variable weights depending on the current partition of the data, and a formula for weight calculation is also proposed in this paper. The convergence theorem of the new clustering algorithm is given in this paper. The variable weights produced by the approach estimate the significance of variables in clustering and can be deployed for variable selection in various data mining applications where large and complex real data are often used. Experiments are conducted on both synthetic and real data, and it is found from the experimental observations that the proposed approach provides higher performance when compared to traditional k-means type algorithms in recovering clusters in data.

III. METHODOLOGY

The methodology proposed for clustering the data is presented in this section. Initially, K-Means clustering is described. Then the constraint-based K-Means clustering is provided. Next, the constraints used in the Constrained K-Means algorithm are presented. For the generation of constraints like must-link and cannot-link, a Self Organizing Map is used in this paper.
K-Means Clustering
Given a set of data samples, a desired number of clusters k, and a set of k initial starting points, the k-means clustering technique determines the desired number of distinct clusters and their centroids. A centroid is defined as the point whose coordinates are obtained by averaging each of the coordinates (i.e., feature values) of the points assigned to the cluster. Formally, the k-means clustering algorithm proceeds as follows.
Step 1: Choose the number of desired clusters, k.
Step 2: Choose k starting points to be used as initial estimates of the cluster centroids. These are the initial starting values.
Step 3: Examine each point in the data set and assign it to the cluster whose centroid is nearest to it.
Step 4: When every point has been assigned to a cluster, recalculate the k centroids.
Step 5: Repeat steps 3 and 4 until no point changes its cluster assignment, or until a maximum number of passes through the data set is performed.
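The five steps above can be sketched in plain Python. This is a minimal illustration assuming Euclidean distance and caller-supplied initial centroids, not the authors' implementation:

```python
import math

def euclidean(a, b):
    # Distance function used to compare points and centroids (Step 3).
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, centroids, max_passes=100):
    # Steps 1-2 are the caller's choice of k and the initial centroids.
    assignment = [None] * len(points)
    for _ in range(max_passes):          # Step 5: bounded number of passes
        changed = False
        # Step 3: assign each point to the cluster with the nearest centroid.
        for i, p in enumerate(points):
            nearest = min(range(k), key=lambda c: euclidean(p, centroids[c]))
            if assignment[i] != nearest:
                assignment[i] = nearest
                changed = True
        # Step 4: recompute each centroid as the mean of its assigned points.
        for c in range(k):
            members = [p for i, p in enumerate(points) if assignment[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if not changed:                  # Step 5: stop when no point moves
            break
    return assignment, centroids

data = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 8.5)]
labels, centers = kmeans(data, 2, [[1.0, 1.0], [8.0, 8.0]])
```

On this toy data the two well-separated groups end up in separate clusters, with each centroid at the mean of its members.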
Constrained K-Means Clustering
Constrained K-Means Clustering [15] is similar to the standard K-Means clustering algorithm, with the exception that the constraints must be satisfied while assigning data points to clusters. The algorithm for Constrained K-Means Clustering is described below.
Step 1: Choose the number of desired clusters, k.
Step 2: Choose k starting points to be used as initial estimates of the cluster centroids. These are the initial starting values.
Step 3: Examine each point in the data set and assign it to the cluster whose centroid is nearest to it, only when violate-constraints( ) returns false.
Step 4: When every point has been assigned to a cluster, recalculate the k centroids.
Step 5: Repeat steps 3 and 4 until no point changes its cluster assignment, or until a maximum number of passes through the data set is performed.
Function violate-constraints( )
    if must-link constraint not satisfied
        return true
    elseif cannot-link constraint not satisfied
        return true
    elseif δ-constraint not satisfied
        return true
    elseif ε-constraint not satisfied
        return true
    else
        return false
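A minimal Python sketch of the violate-constraints( ) test follows. The function name mirrors the pseudocode above, but the pair-set encoding of must-link/cannot-link constraints and the dictionary of partially built clusters are illustrative assumptions, not the paper's implementation:

```python
def violate_constraints(point, cluster, clusters, must_link, cannot_link,
                        delta=None, epsilon=None, dist=None):
    # Return True if assigning `point` to `cluster` breaks any constraint.
    # `clusters` maps cluster id -> points assigned so far (tuples);
    # `must_link` / `cannot_link` are sets of frozenset pairs (assumed encoding).
    for cid, members in clusters.items():
        for other in members:
            pair = frozenset([point, other])
            # Must-link partner already sits in a different cluster.
            if pair in must_link and cid != cluster:
                return True
            # Cannot-link partner already sits in this cluster.
            if pair in cannot_link and cid == cluster:
                return True
    # δ-constraint: point must stay at least δ away from every other cluster.
    if delta is not None:
        for cid, members in clusters.items():
            if cid != cluster and any(dist(point, o) < delta for o in members):
                return True
    # ε-constraint: point needs some neighbour within ε in its own cluster.
    if epsilon is not None and clusters[cluster]:
        if all(dist(point, o) > epsilon for o in clusters[cluster]):
            return True
    return False

clusters = {0: [(0.0, 0.0)], 1: [(5.0, 5.0)]}
ml = {frozenset([(0.2, 0.1), (0.0, 0.0)])}
cl = set()
v_wrong = violate_constraints((0.2, 0.1), 1, clusters, ml, cl)  # must-link broken
v_ok = violate_constraints((0.2, 0.1), 0, clusters, ml, cl)
```

Step 3 of the constrained algorithm then assigns each point to the nearest centroid whose cluster passes this check.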
Constraints used for Constrained K-Means Clustering
The constraints [11, 12, 13] used for Constrained K-Means Clustering are:

• Must-link constraint
• Cannot-link constraint
• δ-constraint
• ε-constraint

Consider S = {s_1, s_2, …, s_n} as a set of n data points that are to be separated into clusters. For any pair of points s_i and s_j in S, the distance between them is represented by d(s_i, s_j), with the symmetry property that d(s_i, s_j) = d(s_j, s_i). The constraints are:

• Must-link constraint: indicates that two points s_i and s_j (i ≠ j) in S have to be in the same cluster.

• Cannot-link constraint: indicates that two points s_i and s_j (i ≠ j) in S must not be placed in the same cluster.

• δ-constraint: this constraint represents a value δ > 0. Formally, for any pair of clusters S_i and S_j (i ≠ j), and any pair of points s_p and s_q such that s_p ∈ S_i and s_q ∈ S_j, d(s_p, s_q) ≥ δ.

• ε-constraint: this constraint represents a value ε > 0, and the feasibility requirement is the following: for any cluster S_i containing two or more points, and for any point s_p ∈ S_i, there must be another point s_q ∈ S_i such that d(s_p, s_q) ≤ ε.
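As an illustration of the last two definitions, the checks below verify whether a finished partition satisfies given δ and ε values. The helper names are hypothetical, written only to restate the definitions in code:

```python
import math

def d(p, q):
    # Symmetric distance: d(p, q) == d(q, p).
    return math.dist(p, q)

def satisfies_delta(clusters, delta):
    # δ-constraint: any two points drawn from different clusters
    # must be at least delta apart.
    ids = list(clusters)
    for i in range(len(ids)):
        for j in range(i + 1, len(ids)):
            for p in clusters[ids[i]]:
                for q in clusters[ids[j]]:
                    if d(p, q) < delta:
                        return False
    return True

def satisfies_epsilon(clusters, epsilon):
    # ε-constraint: each point in a cluster of two or more points
    # must have some other point of that cluster within epsilon.
    for members in clusters.values():
        if len(members) < 2:
            continue
        for p in members:
            if all(d(p, q) > epsilon for q in members if q != p):
                return False
    return True

clusters = {0: [(0.0, 0.0), (1.0, 0.0)], 1: [(6.0, 0.0), (7.0, 0.0)]}
ok_delta = satisfies_delta(clusters, 4.0)   # smallest cross-cluster gap is 5.0
ok_eps = satisfies_epsilon(clusters, 1.5)   # each point's neighbour is 1.0 away
```

Tightening δ beyond the actual cross-cluster gap (here, above 5.0) makes the same partition infeasible, which is how the constraint prunes candidate assignments in Step 3.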