Enhancing K-Means Algorithm with Semi-Unsupervised Centroid Selection Method

R. Shanmugasundaram and Dr. S. Sukumaran

R. Shanmugasundaram, Associate Professor, Department of Computer Science, Erode Arts and Science College, Erode, India. Dr. S. Sukumaran, Associate Professor, Department of Computer Science, Erode Arts and Science College, Erode, India.

(IJCSIS) International Journal of Computer Science and Information Security, Vol. 8, No. 9, December 2010. ISSN 1947-5500.

Abstract—The k-means algorithm is one of the most frequently used clustering methods in data mining, owing to its performance in clustering massive data sets. The final clustering result of the k-means algorithm depends on the correctness of the initial centroids, which are selected randomly. The original k-means algorithm converges to a local minimum, not the global optimum. K-means clustering performance can be enhanced if good initial cluster centers are found. To find the initial cluster centers, a series of procedures is performed. The data in a cell are partitioned using a cutting plane that divides the cell into two smaller cells. The plane is perpendicular to the data axis with the highest variance and is intended to minimize the sum of squared errors of the two cells as much as possible while, at the same time, keeping the two cells as far apart as possible. Cells are partitioned one at a time until the number of cells equals the predefined number of clusters, K. The centers of the K cells become the initial cluster centers for K-means. In this paper, an efficient method for computing initial centroids is proposed: a Semi-Unsupervised Centroid Selection Method is used to compute the initial centroids. A gene dataset is used to evaluate the proposed approach of data clustering using these initial centroids. The experimental results illustrate that the proposed method is well suited to gene clustering applications.
Index Terms—Clustering algorithm, K-means algorithm, data partitioning, initial cluster centers, semi-unsupervised gene selection.
I. INTRODUCTION

CLUSTERING, or unsupervised classification, can be considered as the problem of partitioning a set of data objects into a predefined number of clusters [13]. The number of clusters may be established by means of a cluster validity criterion or specified by the user.
 
Clustering is broadly used in many applications, such as customer segmentation, classification, and trend analysis. For example, consider a retail database whose records contain the items purchased by customers. A clustering method could group the customers in such a way that customers with similar buying patterns are in the same cluster. Several real-world applications deal with high-dimensional data, which is always a challenge for clustering algorithms because manual processing is practically impossible. A high-quality computer-based clustering removes the unimportant features and replaces the original set with a smaller representative set of data objects.

K-means is a well-known prototype-based [14] partitioning clustering technique that attempts to find a user-specified
number of clusters (K), which are represented by their centroids. The K-means algorithm is as follows (a minimal code sketch of the loop appears at the end of this section):

1. Select initial centers of the K clusters. Repeat steps 2 and 3 until the cluster membership stabilizes.
2. Generate a new partition by assigning each data point to its closest cluster center.
3. Compute the new cluster centers as the centroids of the clusters.

Though K-means is simple and can be used for a wide variety of data types, it is quite sensitive to the initial positions of the cluster centers. The final cluster centroids may not be the optimal ones, as the algorithm can converge to locally optimal solutions. An empty cluster can be obtained if no points are allocated to that cluster during the assignment step. Therefore, it is important for K-means to have good initial cluster centers [15, 16]. In this paper, a Semi-Unsupervised Centroid Selection Method (SCSM) is presented. The organization of this paper is as follows. In the next section, the literature survey is presented. In Section III, the efficient semi-unsupervised centroid selection algorithm is presented. The experimental results are presented in Section IV. Section V concludes the paper.
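For reference, the three-step loop above can be written in a few lines. The following is a minimal NumPy sketch (not the authors' code); it uses random initial centers, which is exactly the sensitivity this paper sets out to address:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain K-means: random initial centers (step 1), nearest-center
    assignment (step 2), centroid update (step 3), until membership stabilizes."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)  # step 1
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # step 2: assign every point to its closest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # cluster membership has stabilized
        labels = new_labels
        # step 3: recompute each center as the centroid of its cluster
        for j in range(k):
            if np.any(labels == j):  # skip empty clusters
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels
```

Different random seeds can converge to different local optima, which motivates the deterministic centroid initialization studied in this paper.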
 
II. LITERATURE SURVEY
Clustering of statistical data has been studied from early times, and many advanced models as well as algorithms have been proposed. This section provides a view of the related research work in the field of clustering that may assist researchers.

Bradley and Fayyad [2] put forth a technique for refining the initial points for clustering algorithms, in particular the k-means clustering algorithm. They presented a fast and efficient algorithm for refining an initial starting point for a general class of clustering algorithms. Iterative techniques such as K-means and EM, which are sensitive to initial starting conditions, normally converge to a local minimum. They applied this iterative technique for refining the initial condition, which allows the algorithm to converge to a better local minimum. The refined initial point was used to evaluate the performance of the K-means algorithm in clustering the given data set. The results illustrated that the refinement run time is significantly lower than the time required to cluster the full database. In addition, the method is scalable and can be coupled with a scalable clustering algorithm to address large-scale clustering problems, especially in data mining.

Yang et al. [3] proposed an efficient data clustering algorithm. It is well known that the K-means (KM) algorithm is one of the most popular clustering techniques because it is
simple to implement and fast in most situations. However, the sensitivity of the KM algorithm to initialization makes it easily trapped in local optima. K-Harmonic Means (KHM) clustering resolves the initialization problem faced by the KM algorithm, but KHM also easily runs into local optima. The PSO algorithm is a global optimization technique. A hybrid data clustering algorithm based on PSO and KHM (PSOKHM) was proposed by Yang et al. in [3]. This hybrid algorithm utilizes the advantages of both algorithms: PSOKHM not only helps KHM clustering escape from local optima but also overcomes the slow convergence speed of the PSO algorithm. They conducted experiments comparing the hybrid algorithm with PSO and KHM clustering on seven data sets. The experimental results show that PSOKHM was clearly superior to the other two clustering algorithms.

Huang [4] put forth a technique that extends the applicability of the K-means algorithm to various data sets. Generally, the efficiency of the K-means algorithm in clustering data sets is high; the restriction on applying it to real-world data containing categorical values stems from the fact that it was mostly employed on numerical values. Huang presented two algorithms which extend the k-means algorithm to categorical domains and to domains with mixed numeric and categorical values. The k-modes algorithm uses a simple matching dissimilarity measure to deal with categorical objects, replaces the means of clusters with modes, and uses a frequency-based method to update modes in the clustering process to decrease the clustering cost function. The k-prototypes algorithm, through the definition of a combined dissimilarity measure, further integrates the k-means and k-modes algorithms to allow clustering of objects described by mixed numeric and categorical attributes. Experiments were conducted on the well-known soybean disease and credit approval data sets to demonstrate the clustering performance of the two algorithms.

Kluger [5] first proposed spectral biclustering for processing gene expression data, but Kluger's focus is mainly on unsupervised clustering, not on gene selection.

There are some existing works related to finding initialization centroids, for example the following procedure (a code sketch of steps 1-3 is given after the list):

1. Compute the mean (μj) and standard deviation (σj) of every jth attribute's values.
2. Compute the percentiles z1, z2, ..., zk corresponding to the areas under the normal curve from -∞ to (2s-1)/2k, s = 1, 2, ..., k (clusters).
3. Compute the attribute values xs = zs·σj + μj corresponding to these percentiles, using the mean and standard deviation of the attribute.
4. Run K-means to cluster the data based on the jth attribute's values, using the xs as initial centers, and assign cluster labels to every data item.
5. Repeat steps 3-4 for all l attributes.
6. For every data item t, create the string of class labels Pt = (P1, P2, ..., Pl), where Pj is the class label of t when the jth attribute's values are used for the step-4 clustering.
7. Merge the data items which have the same pattern string Pt, yielding K' clusters. The centroids of the K' clusters are computed. If K' > K, apply the Merge-DBMSDC (Density-Based Multi-Scale Data Condensation) algorithm [6] to merge these K' clusters into K clusters.
8. Find the centroids of the K clusters and use them as initial centers for clustering the original dataset using K-means.

Although the above initialization algorithms can help find good initial centers to some extent, they are quite complex, and some use the K-means algorithm as part of their procedure, which still needs the random method for cluster center initialization. The proposed approach for finding initial cluster centroids is presented in the following section.
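For concreteness, steps 1-3 of the procedure above amount to placing candidate centers at evenly spaced normal quantiles of one attribute. A minimal sketch, assuming SciPy is available for the inverse normal CDF (the function name `percentile_initial_centers` is mine, not from the cited work):

```python
import numpy as np
from scipy.stats import norm

def percentile_initial_centers(x, k):
    """Steps 1-3: initial centers for one attribute, placed at the normal
    quantiles (2s-1)/(2k), s = 1..k, of the fitted N(mu, sigma)."""
    mu, sigma = x.mean(), x.std()                    # step 1: mean and std dev
    areas = (2 * np.arange(1, k + 1) - 1) / (2 * k)  # areas (2s-1)/(2k)
    z = norm.ppf(areas)                              # step 2: percentile z-scores
    return z * sigma + mu                            # step 3: x_s = z_s*sigma + mu
```

Step 4 would then pass these values as the initial centers of a one-dimensional K-means run on that attribute.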
 
III. METHODOLOGY
 
3.1. Initial Cluster Centers Derived from Data Partitioning
The algorithm follows a novel approach that performs data partitioning along the data axis with the highest variance. The approach has been used successfully for color quantization [7]. The data partitioning tries to divide the data space into small cells or clusters where the intercluster distances are as large as possible and the intracluster distances are as small as possible.
Fig. 1. Ten data points in 2D, sorted by their X values, with an ordering number for each data point.
For instance, consider Fig. 1, where ten data points in a 2D data space are given. The goal is to partition the ten data points in Fig. 1 into two disjoint cells such that the sum of the total clustering errors of the two cells is minimal (see Fig. 2). Suppose a cutting plane perpendicular to the X-axis is used to partition the data. Let $C_1$ and $C_2$ be the first cell and the second cell, respectively, and let $\bar{c}_1$ and $\bar{c}_2$ be the cell centroids of the first cell and the second cell, respectively. The total clustering error of the first cell is thus computed by

$$E_1 = \sum_{c_i \in C_1} d(c_i, \bar{c}_1), \qquad (1)$$

and the total clustering error of the second cell is thus computed by

$$E_2 = \sum_{c_i \in C_2} d(c_i, \bar{c}_2), \qquad (2)$$
 
where $c_i$ is the $i$th data point in a cell. As a result, the sum of the total clustering errors of both cells is minimal (as shown in Fig. 2).
Fig. 2. Partitioning a cell of ten data points into two smaller cells; a solid line represents the intercluster distance and dashed lines represent the intracluster distances.

Fig. 3. Partitioning the ten data points into two smaller cells using m as the partitioning point. A solid line in the square represents the distance between the cell centroid and a data point in the cell, a dashed line represents the distance between m and a data point in each cell, and a solid-dashed line represents the distance between m and the data centroid in each cell.
The partition can be done using a cutting plane that passes through m. Thus

$$d(c_i, \bar{c}_1) \simeq d(c_i, c_m) - d(c_m, \bar{c}_1), \qquad d(c_i, \bar{c}_2) \simeq d(c_i, c_m) - d(c_m, \bar{c}_2) \qquad (3)$$

(as shown in Fig. 3). Thus

$$\sum_{c_i \in C_1} d(c_i, \bar{c}_1) \simeq \sum_{c_i \in C_1} d(c_i, c_m) - |C_1| \cdot d(c_m, \bar{c}_1),$$
$$\sum_{c_i \in C_2} d(c_i, \bar{c}_2) \simeq \sum_{c_i \in C_2} d(c_i, c_m) - |C_2| \cdot d(c_m, \bar{c}_2). \qquad (4)$$

Here m is called the partitioning data point, and $|C_1|$ and $|C_2|$ are the numbers of data points in clusters $C_1$ and $C_2$, respectively. The total clustering error of the first cell can be minimized by reducing the total discrepancy between all data in the first cell and m, which is computed by

$$\sum_{c_i \in C_1} d(c_i, c_m). \qquad (5)$$

The same argument also holds for the second cell: its total clustering error can be minimized by reducing the total discrepancy between all data in the second cell and m, which is computed by

$$\sum_{c_i \in C_2} d(c_i, c_m), \qquad (6)$$

where $d(c_i, c_m)$ is the distance between m and each data point in each cell. Therefore, the problem of minimizing the sum of the total clustering errors of both cells can be transformed into the problem of minimizing the sum of the total clustering errors of all data in the two cells with respect to m.

The relationship between the total clustering error and the partitioning point is illustrated in Fig. 4, where the horizontal axis represents the partitioning point, running from 1 to n (n is the total number of data points), and the vertical axis represents the total clustering error. When m = 0, the total clustering error of the second cell equals the total clustering error of all data points, while the total clustering error of the first cell is zero. On the other hand, when m = n, the total clustering error of the first cell equals the total clustering error of all data points, while the total clustering error of the second cell is zero.
Fig. 4. Graphs depicting the total clustering error: lines 1 and 2 represent the total clustering errors of the first cell and the second cell, respectively; line 3 represents the summation of the total clustering errors of the first and second cells.
The parabola-like curve shown in Fig. 4 (line 3) represents the summation of the total clustering errors of the first cell and the second cell. Note that the lowest point of this curve is the optimal partitioning point m: at this point, the summation of the total clustering errors of the first cell and the second cell is minimal.

Since the time complexity of locating the optimal point m is $O(n^2)$, the distances between adjacent data points along the X-axis are used instead to find an approximation of m in $O(n)$ time. Let $d_{i,i+1} = d(c_i, c_{i+1})$ be the squared Euclidean distance between adjacent data points along the X-axis. If i is in the first cell, then $d(c_i, c_m) \le \sum_{j=i}^{m-1} d_{j,j+1}$; on the other hand, if i is in the second cell, then $d(c_i, c_m) \le \sum_{j=m}^{i-1} d_{j,j+1}$ (as shown in Fig. 5).
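To make the partitioning step concrete, the following is a rough sketch of splitting one cell at the best partitioning point m, using the direct $O(n^2)$ search over candidates rather than the $O(n)$ adjacent-distance approximation described above (the function name `split_cell` is mine, not from the paper):

```python
import numpy as np

def split_cell(X):
    """Split one cell into two at the partitioning point m that minimizes
    the sum of distances from all points to m (direct O(n^2) search)."""
    axis = X.var(axis=0).argmax()        # data axis with the highest variance
    Xs = X[X[:, axis].argsort()]         # sort the cell's points along that axis
    candidates = range(1, len(Xs))       # avoid producing an empty cell
    # total distance from every point to each candidate partitioning point m
    costs = [np.linalg.norm(Xs - Xs[m], axis=1).sum() for m in candidates]
    m = 1 + int(np.argmin(costs))
    return Xs[:m], Xs[m:]                # the two smaller cells
```

The paper partitions cells one at a time until K cells are obtained; a natural selection rule (an assumption here, not stated in the excerpt) is to split the cell with the largest clustering error next, then take the K cell means as the initial centers for K-means.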