International Journal of Computational Intelligence and Information Security, August 2011 Vol. 2, No. 8
A survey: Performance improving of K-mean by Genetic Algorithm
Amit Dubey¹, Prof. Anurag Jain² and Dr. A.K. Sachan³

¹ Computer Science Department, Radharaman Institute of Science & Technology, Bhopal
² Computer Science Department, Radharaman Institute of Science & Technology, Bhopal
³ Computer Science Department, Radharaman Institute of Science & Technology, Bhopal
amit_23dubey@yahoo.co.in, Anurag.akjain@gmail.com, sachanak_12@yahoo.com
Abstract
This paper presents a new initialization technique for K-means clustering in which centroid selection is performed by a genetic algorithm. These centroids act as starting points for K-means. The paper is a survey of improved K-means using genetic algorithms. To measure cluster compactness, a within-cluster scatter criterion has been used.
Keywords:
K-Means, GAIK, genetic algorithm, IGA-FKKM, Entropy Weighting
I. Introduction
Clustering is the process of grouping data into groups having similar properties. It is widely used in many areas, including data mining, statistics, biology, and machine learning. Objects within a cluster have high similarity to one another but are dissimilar to the objects in other clusters [1]. These similarities are assessed based on the attribute values.
1.2 Types of clustering
1. Partition based: The partitioning method initially creates partitions. Then an iterative relocation technique is used to improve the partitioning by moving objects from one group to another.
2. Hierarchical: A hierarchical method creates a hierarchical decomposition of the given set of data objects.
3. Density based: The density-based approach continues growing a given cluster as long as the density, i.e. the number of objects or data points in the neighborhood, exceeds some threshold.
4. Grid based: Grid-based methods quantize the object space into a finite number of cells that form a grid structure.
5. Model based: The model-based approach hypothesizes a model for each of the clusters and finds the best fit of the data to the given model.
K-means is a partition-based clustering algorithm and one of the most popular methods used in data clustering due to its good computational performance [2]. However, it is well known that its result depends on the initialization process, which is generally done by random selection, so different runs of K-means on the same input data may produce different results. To improve the performance, a new initialization technique has been proposed.

Genetic algorithms (GAs) are based on the ideas of natural evolution. In general, a GA starts with an initial population, and then a new population is created based on the fitness values of the chromosomes. Fitness is the measure of how good the population is; a distance measure is the most common choice [3]. Then a process called crossover is applied to the new population by swapping substrings from selected chromosomes in order to produce new chromosomes. After that, a mutation process is applied to introduce randomization. This process continues until a termination condition is met.
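As a rough illustration of the loop described above, the following Python sketch runs a genetic algorithm on bit-string chromosomes. The function name run_ga, the parameter values, binary tournament selection, one-point crossover, bit-flip mutation, and the fixed generation count used as the termination condition are all assumptions made for this example; they are not details taken from [3].

```python
import random

def run_ga(fitness, length=16, pop_size=20, generations=50,
           crossover_rate=0.9, mutation_rate=0.02, seed=0):
    """Toy GA over bit strings; returns the fittest chromosome seen."""
    rng = random.Random(seed)
    # Initial population: random bit strings.
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)

    def tournament(pop):
        # Binary tournament: the fitter of two random individuals wins.
        a, b = rng.sample(pop, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):  # termination: fixed number of generations
        offspring = []
        while len(offspring) < pop_size:
            p1, p2 = tournament(population), tournament(population)
            c1, c2 = p1[:], p2[:]
            if rng.random() < crossover_rate:
                # Crossover: swap the substrings after a random cut point.
                cut = rng.randint(1, length - 1)
                c1 = p1[:cut] + p2[cut:]
                c2 = p2[:cut] + p1[cut:]
            for child in (c1, c2):
                # Mutation: flip each bit with a small probability.
                for i in range(length):
                    if rng.random() < mutation_rate:
                        child[i] = 1 - child[i]
                offspring.append(child)
        population = offspring[:pop_size]
        best = max(population + [best], key=fitness)
    return best
```

Calling run_ga(sum) maximizes the number of 1 bits in the chromosome. For clustering, the fitness would instead score a set of candidate cluster centers.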
II. K-Means For Clustering
K-Means is one of the most common algorithms used for clustering. The algorithm classifies pixels into a predefined number of clusters (assume k clusters). The idea is to choose random cluster centers, called centroids, one for each cluster. These centroids should preferably be as far as possible from each other, since the initial points affect the clustering process and its results. After that, each pixel is compared with all cluster centers through a distance measure and is assigned to the most similar cluster, that is, the nearest cluster center. When this assignment process is over, a new centroid is calculated for each cluster using the pixels in it: the mean of the coordinates of all the points in the cluster is set as the coordinates of the new center. Once these k new centroids are available, the assignment process starts over. This process is repeated until there is no change in the centroids. The algorithm aims at minimizing an objective function, which in this case is a squared error function as given by Eq. 1 [4].

$$E = \sum_{k=1}^{K} \sum_{x \in C_k} \sum_{a=1}^{A} \left( x_a - m_{k,a} \right)^2 \qquad (1)$$

In this formula K is the number of clusters, x represents a data point, C_k represents cluster k, m_k represents the mean of cluster k, and A is the total number of attributes for a data point; x_a and m_{k,a} denote the a-th attribute of x and of m_k. The K-means algorithm is expressed as follows [4].
Step 1: Choose k random points and set them as cluster centers.
Step 2: Assign each object to the cluster of the closest centroid.
Step 3: When all objects have been assigned, recalculate the positions of the centroids.
Step 4: Go back to Step 2 unless the centroids are no longer changing.

One drawback of K-means is that it is sensitive to the initially selected points, and so it does not always produce the same output. To avoid this problem, the algorithm may be run many times and the average over all runs taken, or at least the median value.
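The four steps above can be sketched in plain Python as follows. This is a minimal illustration, not the implementation from [4]: the names kmeans and squared_distance, the optional initial_centroids argument, the iteration cap, and the handling of empty clusters are assumptions introduced for the example.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance, summed over all attributes."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, initial_centroids=None, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 1: choose k random points as cluster centers, unless the
    # caller supplies starting centroids (for example, from a GA).
    centroids = [tuple(c) for c in (initial_centroids or rng.sample(points, k))]
    for _ in range(max_iter):
        # Step 2: assign each object to the cluster of its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda j: squared_distance(p, centroids[j]))
            clusters[nearest].append(p)
        # Step 3: recalculate each centroid as the mean of its cluster.
        new_centroids = []
        for j, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                new_centroids.append(tuple(
                    sum(m[d] for m in members) / len(members)
                    for d in range(dim)))
            else:
                # Assumption: an empty cluster keeps its old centroid.
                new_centroids.append(centroids[j])
        # Step 4: repeat from Step 2 unless the centroids stopped changing.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    # Squared error: each point's squared distance to its nearest centroid.
    sse = sum(min(squared_distance(p, c) for c in centroids) for p in points)
    return centroids, sse
```

Running kmeans with several different seed values and comparing the returned squared error is one simple way to observe the sensitivity to initial points described above.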
III. Initializing K-Means With GA
In the literature, a genetic algorithm has been used to initialize K-means; the approach is known as GA-initialized K-means (GAIK). The purpose of the GA is to optimize the performance of K-means, since that performance depends upon the initial centroid selection. The GA provides the initial cluster centroids, which act as starting points for K-means. To use a GA for clustering, an initial population of random clusters is generated. At each generation, each individual is evaluated and recombined with others on the basis of its fitness. New individuals are created using crossover and mutation.
A. Chromosome representation
The first step of a GA is the representation (or encoding) of chromosomes. The encoding may use binary, integer or real numbers, and different studies use different encoding schemes. Fig. 1 shows cluster centers encoded as chromosomes.
Figure 1: Encoding in genetic algorithms [5]
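One possible real-number encoding, sketched below in Python, concatenates the coordinates of the k cluster centers into a single flat chromosome. The figure itself is not reproduced here, so this layout and the helper names encode and decode are assumptions for illustration rather than the exact scheme of [5].

```python
def encode(centroids):
    """Flatten k cluster centers into one real-valued chromosome."""
    return [value for centroid in centroids for value in centroid]

def decode(chromosome, k):
    """Split a flat chromosome back into k cluster centers."""
    dim = len(chromosome) // k  # attributes per center
    return [tuple(chromosome[i * dim:(i + 1) * dim]) for i in range(k)]

# Two 2-dimensional centers become a chromosome of four genes.
chromosome = encode([(1.0, 2.0), (5.0, 6.0)])
centers = decode(chromosome, k=2)
```

With this layout, crossover and mutation can operate directly on the list of genes, and decode recovers the centers before fitness is evaluated.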
B. Fitness evaluation
A fitness function is needed to evaluate the fitness of chromosomes. The fitness function should return a real value. Eq. 2 is used for fitness evaluation.
(2)
Cluster C_k, which makes it similar to the k-means algorithm [6].
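Since the abstract mentions a within-cluster scatter criterion for cluster compactness, a fitness function of that kind might look like the Python sketch below. This is an assumption-based illustration and not a reconstruction of Eq. 2: the function names and the choice of 1 / (1 + scatter) as the returned real value are hypothetical.

```python
def within_cluster_scatter(centroids, points):
    """Sum of squared distances from each point to its nearest centroid."""
    return sum(
        min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids)
        for p in points
    )

def fitness(centroids, points):
    """Higher is better: compact clusterings give a value closer to 1.

    Assumption: scatter is mapped to a fitness with 1 / (1 + scatter).
    """
    return 1.0 / (1.0 + within_cluster_scatter(centroids, points))

points = [(0, 0), (0, 2), (10, 0), (10, 2)]
good = [(0, 1), (10, 1)]   # one center per natural group
bad = [(0, 0), (0, 2)]     # both centers in the same group
```

Here fitness(good, points) is larger than fitness(bad, points), so selection would favor the chromosome whose centers sit inside both groups.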
 
C. Selection
Selecting chromosomes for the production of a new generation is called selection. Selection is done on the basis of the fitness value: the best-fitted chromosomes are selected for crossover. There are a variety of selection procedures, such as uniform selection, roulette wheel selection, and tournament selection.
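Two of the selection procedures named above can be sketched in Python as follows. The function names, the explicit rng argument, and the default tournament size are assumptions for this example; the roulette wheel version also assumes that all fitness values are non-negative.

```python
import random

def roulette_wheel_select(population, fitnesses, rng):
    """Pick one individual with probability proportional to its fitness."""
    total = sum(fitnesses)
    threshold = rng.uniform(0, total)
    running = 0.0
    for individual, fit in zip(population, fitnesses):
        running += fit
        if running > threshold:
            return individual
    return population[-1]  # guard against floating-point round-off

def tournament_select(population, fitnesses, rng, size=2):
    """Pick the fittest of `size` randomly chosen individuals."""
    contenders = rng.sample(range(len(population)), size)
    winner = max(contenders, key=lambda i: fitnesses[i])
    return population[winner]
```

Roulette wheel selection keeps some chance for weaker chromosomes in proportion to their fitness, while tournament selection only needs fitness comparisons, so it also works when fitness values are negative or on very different scales.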
 
D. Crossover
The purpose of crossover is to create two new individual chromosomes from two existing chromosomes selected from the current population. Typical crossover operators are one-point crossover, two-point crossover, cycle crossover and uniform crossover. Fig. 2 shows the generation of new chromosomes through the crossover process.
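As a small illustration of the first two operators, the Python sketch below applies one-point and two-point crossover to list-based chromosomes. The function names are hypothetical, and the cut positions are passed in explicitly to keep the example deterministic; in a GA they would normally be drawn at random.

```python
def one_point_crossover(parent_a, parent_b, cut):
    """Swap the tails of two parents after position `cut`."""
    child_a = parent_a[:cut] + parent_b[cut:]
    child_b = parent_b[:cut] + parent_a[cut:]
    return child_a, child_b

def two_point_crossover(parent_a, parent_b, lo, hi):
    """Swap the middle segment [lo, hi) between two parents."""
    child_a = parent_a[:lo] + parent_b[lo:hi] + parent_a[hi:]
    child_b = parent_b[:lo] + parent_a[lo:hi] + parent_b[hi:]
    return child_a, child_b

a = [1, 2, 3, 4]
b = [5, 6, 7, 8]
one_point = one_point_crossover(a, b, cut=2)
two_point = two_point_crossover(a, b, lo=1, hi=3)
```

Both operators return two children of the same length as the parents, and the parents themselves are left unchanged.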