Fahim et al. / J Zhejiang Univ SCIENCE A 2006 7(10):1626-1633

**Journal of Zhejiang University SCIENCE A**

ISSN 1009-3095 (Print); ISSN 1862-1775 (Online)

www.zju.edu.cn/jzus; www.springerlink.com

E-mail: jzus@zju.edu.cn

An efficient enhanced k-means clustering algorithm

**FAHIM A.M.†1, SALEM A.M.2, TORKEY F.A.3, RAMADAN M.A.4**

(1Department of Mathematics, Faculty of Education, Suez Canal University, Suez City, Egypt)

(2Department of Computer Science, Faculty of Computers & Information, Ain Shams University, Cairo City, Egypt)

(3Department of Computer Science, Faculty of Computers & Information, Minufiya University, Shbeen El Koom City, Egypt)

(4Department of Mathematics, Faculty of Science, Minufiya University, Shbeen El Koom City, Egypt)

†E-mail: ahmmedfahim@yahoo.com

Received Mar. 15, 2006; revision accepted May 11, 2006

Abstract: In k-means clustering, we are given a set of n data points in d-dimensional space ℝ^d and an integer k, and the problem is to determine a set of k points in ℝ^d, called centers, so as to minimize the mean squared distance from each data point to its nearest center. In this paper, we present a simple and efficient clustering algorithm based on the k-means algorithm, which we call the enhanced k-means algorithm. This algorithm is easy to implement, requiring only a simple data structure to keep some information in each iteration to be used in the next iteration. Our experimental results demonstrate that our scheme can improve the computational speed of the k-means algorithm by reducing the total number of distance calculations and the overall time of computation.

**Key words:** Clustering algorithms, Cluster analysis, k-means algorithm, Data analysis

doi:10.1631/jzus.2006.A1626   Document code: A   CLC number: TP301.6

**INTRODUCTION**

The huge amount of data collected and stored in databases increases the need for effective analysis methods to use the information contained implicitly there. One of the primary data analysis tasks is cluster analysis, intended to help a user understand the natural grouping or structure in a dataset. Therefore, the development of improved clustering algorithms has received much attention. The goal of a clustering algorithm is to group the objects of a database into a set of meaningful subclasses (Ankerst et al., 1999).

Clustering is the process of partitioning or grouping a given set of patterns into disjoint clusters, such that patterns in the same cluster are alike while patterns belonging to two different clusters differ. Clustering has been a widely studied problem in a variety of application domains, including data mining and knowledge discovery (Fayyad et al., 1996), data compression and vector quantization (Gersho and Gray, 1992), pattern recognition and pattern classification (Duda and Hart, 1973), neural networks, artificial intelligence, and statistics.

Existing clustering algorithms can be broadly classified into hierarchical and partitioning clustering algorithms (Jain and Dubes, 1988). Hierarchical algorithms decompose a database D of n objects into several levels of nested partitionings (clusterings), represented by a dendrogram, i.e., a tree that iteratively splits D into smaller subsets until each subset consists of only one object. There are two types of hierarchical algorithms: an agglomerative algorithm builds the tree from the leaf nodes up, whereas a divisive algorithm builds it from the top down. Partitioning algorithms construct a single partition of a database D of n objects into a set of k clusters, such that the objects in a cluster are more similar to each other than to objects in different clusters.

The Single-Link method is a commonly used hierarchical clustering method (Sibson, 1973). Starting by placing every object in a unique cluster, at every step the two closest clusters are merged, until all points are in one cluster. Another approach to hierarchical clustering is the algorithm CURE proposed in (Guha et al., 1998). This algorithm stops the creation of a cluster hierarchy if a level consists of k clusters, where k is one of several input parameters. It utilizes multiple representative points to evaluate the distance between clusters, thereby adjusting well to arbitrarily shaped clusters and avoiding the chain effect.

Optimization-based partitioning algorithms typically represent clusters by a prototype, and objects are assigned to the cluster represented by the most similar prototype. An iterative control strategy is used to optimize the whole clustering such that the average or squared distances of objects to their prototypes are minimized. These clustering algorithms are effective in determining a good clustering if the clusters are of convex shape and similar size and density, and if their number k can be reasonably estimated. Depending on the kind of prototypes, one can distinguish k-means, k-modes and k-medoid algorithms.

In the k-means algorithm (MacQueen, 1967), the prototype, called the center, is the mean value of all objects belonging to a cluster. The k-modes algorithm (Huang, 1997) extends the k-means paradigm to categorical domains. For k-medoid algorithms (Kaufman and Rousseeuw, 1990), the prototype, called the "medoid", is the most centrally located object of a cluster. The algorithm CLARANS, introduced in (Ng and Han, 1994), is an improved k-medoid type algorithm that restricts the huge search space by using two additional user-supplied parameters. It is significantly more efficient than the well-known k-medoid algorithms PAM and CLARA, presented in (Kaufman and Rousseeuw, 1990).

Density-based approaches, which apply a local cluster criterion, are very popular for database mining. Clusters are regions in the data space where the objects are dense, separated by regions of low object density (noise). These regions may have an arbitrary shape. A density-based clustering method is presented in (Ester et al., 1996). The basic idea of the algorithm DBSCAN is that, for each point of a cluster, the neighborhood of a given radius (ε) has to contain at least a minimum number of points (MinPts), where ε and MinPts are input parameters. Another density-based approach is WaveCluster (Sheikholeslami et al., 1998), which applies a wavelet transform to the feature space and can detect arbitrarily shaped clusters at different scales. In (Hinneburg and Keim, 1998) the density-based algorithm DenClue is proposed. This algorithm uses a grid but is very efficient, because it only keeps information about grid cells that actually contain data points, and manages these cells in a tree-based access structure. This algorithm generalizes some other clustering approaches, which, however, results in a large number of input parameters.

The density- and grid-based clustering technique CLIQUE (Agrawal et al., 1998) has also been proposed for mining in high-dimensional data spaces. Its input parameters are the size of the grid and a global density threshold for clusters. The major difference from all other clustering approaches is that this method also detects subspaces of the highest dimensionality in which high-density clusters exist.

Another approach to clustering is the BIRCH method (Zhang et al., 1996), which constructs a CF-tree, a hierarchical data structure designed for a multiphase clustering method. First, the database is scanned to build an initial in-memory CF-tree. Second, an arbitrary clustering algorithm can be used to cluster the leaf nodes of the CF-tree.

Among clustering formulations that are based on minimizing a formal objective function, perhaps the most widely used and studied is k-means clustering. Given a set of n data points in real d-dimensional space ℝ^d and an integer k, the problem is to determine a set of k points in ℝ^d, called centers, so as to minimize the mean squared distance from each data point to its nearest center.

Recently, iterated local search (ILS) was proposed in (Merz, 2003). This algorithm tries to find a near-optimal solution for the criterion of the k-means algorithm. It applies the k-means algorithm, tries to swap a randomly chosen center with a randomly chosen point, compares the current solution with the previous one, and keeps the better solution. This process is repeated a fixed number of times.

The k-means method has been shown to be effective in producing good clustering results for many practical applications. However, the k-means algorithm requires time proportional to the product of the number of patterns and the number of clusters per iteration, which can be computationally expensive, especially for large datasets. We propose an efficient implementation of the k-means method. Our algorithm produces the same clustering

results as that of the k-means algorithm, and has significantly better performance than the k-means algorithm.

The rest of this paper is organized as follows. We review the k-means algorithm in Section 2, present the proposed algorithm in Section 3, and describe the datasets used to evaluate the algorithm in Section 4, where we also present the experimental results; we conclude in Section 5.

**k-MEANS CLUSTERING**

In this section, we briefly describe the k-means algorithm. Given a dataset of n data points x1, x2, …, xn such that each data point is in ℝ^d, the problem of finding the minimum-variance clustering of the dataset into k clusters is that of finding k points {mj} (j=1, 2, …, k) in ℝ^d such that

(1/n) ∑_{i=1}^{n} min_j d²(xi, mj)    (1)

is minimized, where d(xi, mj) denotes the Euclidean distance between xi and mj. The points {mj} (j=1, 2, …, k) are known as cluster centroids. The problem in Eq.(1) is to find k cluster centroids such that the average squared Euclidean distance (mean squared error, MSE) between a data point and its nearest cluster centroid is minimized.

The k-means algorithm provides an easy-to-implement approximate solution to Eq.(1). The reasons for the popularity of k-means are its ease and simplicity of implementation, scalability, speed of convergence and adaptability to sparse data.

The k-means algorithm can be thought of as a gradient descent procedure, which begins at some starting cluster centroids and iteratively updates these centroids to decrease the objective function in Eq.(1). k-means always converges to a local minimum; the particular local minimum found depends on the starting cluster centroids. The problem of finding the global minimum is NP-complete. The k-means algorithm updates cluster centroids until a local minimum is found. Fig.1 shows the k-means algorithm.

Before the k-means algorithm converges, distance and centroid calculations are done while the loops are executed a number of times, say l, where the positive integer l is known as the number of k-means iterations. The precise value of l varies depending on the initial starting cluster centroids, even on the same dataset. So the computational time complexity of the algorithm is O(nkl), where n is the total number of objects in the dataset, k is the required number of clusters and l is the number of iterations (k≤n, l≤n).

    1  MSE=large number;
    2  Select initial cluster centroids {mj}, j=1..k;
    3  Do
    4    OldMSE=MSE;
    5    MSE1=0;
    6    For j=1 to k
    7      mj=0; nj=0;
    8    endfor
    9    For i=1 to n
    10     For j=1 to k
    11       Compute squared Euclidean distance d²(xi, mj);
    12     endfor
    13     Find the closest centroid mj to xi;
    14     mj=mj+xi; nj=nj+1;
    15     MSE1=MSE1+d²(xi, mj);
    16   endfor
    17   For j=1 to k
    18     nj=max(nj, 1); mj=mj/nj;
    19   endfor
    20   MSE=MSE1;
       while (MSE<OldMSE)

Fig.1 k-means algorithm

**AN EFFICIENT ENHANCED k-MEANS CLUSTERING**

In this section, we introduce the proposed idea that makes k-means more efficient, especially for datasets containing a large number of clusters. In each iteration, the k-means algorithm computes the distances between each data point and all centers, which is computationally very expensive for huge datasets. Why not benefit from the previous iteration of the k-means algorithm? For each data point, we can keep the distance to its nearest cluster. In the next iteration, we compute the distance to the previous nearest cluster. If the new distance is less than or equal to the previous distance, the point stays in its cluster, and there is no need to compute its distances to the other cluster centers. This saves the time required to compute distances to the other k−1 cluster centers.

The proposed idea comes from the fact that the k-means algorithm discovers spherically shaped clusters whose center is the gravity center of the points in the cluster; this center moves as new points are added to or removed from the cluster. This motion makes the center closer to some points and farther from others. The points that become closer to the center stay in that cluster, so there is no need to find their distances to the other cluster centers. The points that move farther from the center may change cluster, so only for these points are the distances to the other cluster centers calculated, and each is assigned to its nearest center. Fig.2 explains the idea.

Fig.2a represents the dataset points and the initial 3 centroids. Fig.2b shows the distribution of the points over the initial 3 centroids, and the new centroids for the next iteration. Fig.2c shows the final clusters and their centroids.

Fig.2 Some points remain in their cluster because the center becomes closer to them. (a) Initial centroids of a dataset; (b) Recalculating the positions of the centroids; (c) Final positions of the centroids

When we examine Fig.2b, we note that in Clusters 1 and 2 most points become closer to their new center; only one point in Cluster 1 and 2 points in Cluster 2 will be redistributed (their distances to all centroids must be computed). The final clusters are presented in Fig.2c. Based on this idea, the proposed algorithm saves a lot of time.

In the proposed method, we write two functions. The first is the basic function of the k-means algorithm: it finds the nearest center for each data point by computing the distances to the k centers, and for each data point it keeps the distance to the nearest center.

This first function, called distance(), is shown in Fig.3. It is similar to the procedure in Fig.1, with the addition of a simple data structure that keeps the distance between each point and its nearest cluster.

    Function distance()
    //assign each point to its nearest cluster
    1  For i=1 to n
    2    For j=1 to k
    3      Compute squared Euclidean distance d²(xi, mj);
    4    endfor
    5    Find the closest centroid mj to xi;
    6    mj=mj+xi; nj=nj+1;
    7    MSE=MSE+d²(xi, mj);
    8    Clusterid[i]=number of the closest centroid;
    9    Pointdis[i]=Euclidean distance to the closest centroid;
    10 endfor
    11 For j=1 to k
    12   mj=mj/nj;
    13 endfor

Fig.3 The first function used in our implementation of the k-means algorithm

In Line 3 the function finds the distance between point number i and each of the k centroids. Line 5 searches for the closest centroid to point number i, say centroid number j. Line 6 adds point number i to cluster number j and increases the count of points in cluster j by one. Lines 8 and 9 enable us to execute the proposed idea: these two lines keep the number of the closest cluster and the distance to it. Line 12 recalculates the centroids.

The other function, shown in Fig.4, plays the same role as the function in Fig.3 and is called distance_new().
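The function of Fig.3 can be sketched in Python as follows (our own illustrative code, not the authors' implementation; the arrays `cluster_id` and `point_dis` mirror the paper's Clusterid and Pointdis, and the line-number comments refer to Fig.3):

```python
import numpy as np

def distance(points, centroids, cluster_id, point_dis):
    """One full assignment pass (Fig.3): assign every point to its
    nearest centroid, recording the centroid id and the distance to it."""
    k = len(centroids)
    sums = np.zeros_like(centroids)
    counts = np.zeros(k, dtype=int)
    mse = 0.0
    for i, x in enumerate(points):
        # Lines 2-4: squared distances from point i to all k centroids
        d2 = ((centroids - x) ** 2).sum(axis=1)
        j = int(d2.argmin())              # Line 5: closest centroid
        sums[j] += x; counts[j] += 1      # Line 6: add point to cluster j
        mse += d2[j]                      # Line 7: accumulate squared error
        cluster_id[i] = j                 # Line 8: remember closest cluster
        point_dis[i] = np.sqrt(d2[j])     # Line 9: remember distance to it
    # Lines 11-13: centroid recalculation (empty clusters left unchanged)
    for j in range(k):
        if counts[j] > 0:
            centroids[j] = sums[j] / counts[j]
    return mse
```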

Line 1 of distance_new() finds the distance between the current point i and the new cluster center assigned to it in the previous iteration. If the computed distance is smaller than or equal to the distance to the old center, the point stays in the cluster it was assigned to in the previous iteration, and there is no need to compute the distances to the other k−1 centers. Lines 3~5 are executed if the computed distance is larger than the distance to the old center, because the point may then change its cluster; Line 4 computes the distance between the current point and all k centers. Line 6 searches for the closest center, Line 7 assigns the current point to the closest cluster and increases the count of points in that cluster by one, and Line 8 updates the mean squared error. Lines 9 and 10 keep the id of the cluster the current point is assigned to, and its distance to that cluster, to be used in the next call of the function (i.e., the next iteration). The information kept in Lines 9 and 10 allows the function to reduce the distance calculations required to assign each point to its closest cluster, which makes it faster than the function distance() in Fig.3.

    Function distance_new()
    //assign each point to its nearest cluster
    1  For i=1 to n
         Compute squared Euclidean distance d²(xi, Clusterid[i]);
         If (d²(xi, Clusterid[i])<=Pointdis[i])
           Point stays in its cluster;
    2  Else
    3    For j=1 to k
    4      Compute squared Euclidean distance d²(xi, mj);
    5    endfor
    6    Find the closest centroid mj to xi;
    7    mj=mj+xi; nj=nj+1;
    8    MSE=MSE+d²(xi, mj);
    9    Clusterid[i]=number of the closest centroid;
    10   Pointdis[i]=Euclidean distance to the closest centroid;
    11 endfor
    12 For j=1 to k
    13   mj=mj/nj;
    14 endfor

Fig.4 The second function used in our implementation of the k-means algorithm

Our implementation of the k-means algorithm is based upon the two functions described in Figs.3 and 4. We have two implementations of the k-means algorithm. In the first implementation, the two functions are executed alternately a number of times (first distance(), then distance_new(), i.e., the two functions overlap). This implementation will be referred to as the "overlapped k-means" algorithm.

In the second implementation, the function distance() is executed two times, while the function distance_new() is executed for the remainder of the iterations. This implementation will be referred to as the "enhanced k-means" algorithm.

**Complexity**

As discussed before, the k-means algorithm converges to a local minimum. Before it converges, the centroids are computed a number of times, and all points are assigned to their nearest centroids, i.e., there is a complete redistribution of points according to the new centroids. This takes O(nkl), where n is the number of points, k is the number of clusters and l is the number of iterations.

In the proposed enhanced k-means algorithm, obtaining the initial clusters requires O(nk). After that, some points remain in their cluster while the others move to another cluster. If a point stays in its cluster this requires O(1), otherwise it requires O(k). If we suppose that half of the points move from their clusters, one iteration requires O(nk/2); since the algorithm converges to a local minimum, the number of points that move from their clusters decreases in each iteration. So we expect the total cost to be nk ∑_{i=1}^{l} (1/i). Even for a large number of iterations, nk ∑_{i=1}^{l} (1/i) is much less than nkl. So the cost of the enhanced k-means algorithm is approximately O(nk), not O(nkl).

**EXPERIMENTAL RESULTS**

We have evaluated our algorithm on several different real datasets and on synthetic datasets, and have compared our results with those of the k-means algorithm in terms of total execution time and quality of clusters. We also compared with the CLARA algorithm. Our experimental results were obtained on a PC with an 800 MHz CPU, 128 MB RAM and 256 kB cache.
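Before turning to the measurements, the whole enhanced scheme described above can be put together in one self-contained sketch (our own illustrative Python, with all names ours; for brevity it runs the full pass of Fig.3 once rather than the two times used in the enhanced implementation):

```python
import numpy as np

def assign_all(points, centroids, cluster_id, point_dis):
    """Full pass (Fig.3): n*k distance computations, recording each
    point's nearest centroid id and the distance to it."""
    d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    cluster_id[:] = d2.argmin(axis=1)
    point_dis[:] = np.sqrt(d2[np.arange(len(points)), cluster_id])
    return (point_dis ** 2).sum()

def update_centroids(points, centroids, cluster_id):
    """Recompute each centroid as the mean of its assigned points."""
    for j in range(len(centroids)):
        members = points[cluster_id == j]
        if len(members) > 0:
            centroids[j] = members.mean(axis=0)

def enhanced_kmeans(points, centroids):
    """Enhanced k-means driver: one full pass, then the cheap recheck of
    Fig.4 until the MSE stops decreasing."""
    n = len(points)
    cluster_id = np.zeros(n, dtype=int)
    point_dis = np.zeros(n)
    mse = assign_all(points, centroids, cluster_id, point_dis)
    update_centroids(points, centroids, cluster_id)
    while True:
        old_mse, new_mse = mse, 0.0
        for i, x in enumerate(points):
            # Line 1 of Fig.4: distance to the previously nearest centroid
            j = cluster_id[i]
            d = np.sqrt(((centroids[j] - x) ** 2).sum())
            if d > point_dis[i]:
                # Lines 3-6: point drifted away, do the full k-way search
                d2 = ((centroids - x) ** 2).sum(axis=1)
                j = int(d2.argmin())
                d = np.sqrt(d2[j])
            cluster_id[i], point_dis[i] = j, d
            new_mse += d * d
        update_centroids(points, centroids, cluster_id)
        mse = new_mse
        if mse >= old_mse:
            break
    return centroids, cluster_id
```

Each point that stays put costs O(1) in the recheck loop, which is what drives the total cost toward nk ∑ (1/i) rather than nkl.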

**Real datasets**

In this section, we give a brief description of the datasets used in our algorithm evaluation. Table 1 shows some characteristics of the datasets.

Table 1 Characteristics of the datasets

| Dataset | Number of records | Number of attributes |
|---------|-------------------|----------------------|
| Letters | 20000             | 16                   |
| Abalone | 4177              | 7                    |
| Wind    | 6574              | 15                   |

1. Letter Image dataset

This dataset represents images of English capital letters. Each image is a black-and-white rectangular pixel display of one of the 26 capital letters of the English alphabet. The character images were based on 20 different fonts, and each letter within these 20 fonts was randomly distorted to produce a file of 20000 unique stimuli. Each stimulus was converted into 16 primitive numerical attributes (statistical moments and edge counts), which were then scaled to fit into a range of integer values from 0 through 15.

Fig.5 shows that the enhanced k-means algorithm produces the final clusters in the shortest period of time. The overlapping curves in Fig.6 show that the three implementations of the k-means algorithm produce clusters of approximately the same quality.

2. Abalone dataset

This dataset represents physical measurements of abalone (a sea organism). Each abalone is described by 8 attributes.

Fig.7 shows that the enhanced k-means algorithm is the most efficient algorithm. The overlapping curves in Fig.8 show that the final clusters are approximately of the same quality.

3. Wind dataset

This dataset represents measurements of wind from 1/1/1961 to 31/12/1978 (Figs.9 and 10). These wind observations are described by 15 attributes.

Fig.5 Execution time (Letter Image dataset)
Fig.6 Quality of clusters (Letter Image dataset)
Fig.7 Execution time (Abalone dataset)
Fig.8 Quality of clusters (Abalone dataset)

**Synthetic datasets**

Our generated datasets are semi-spherically shaped, in two dimensions. This dataset was generated

randomly using Visual Basic code. Fig.11 shows a sample of them.

Fig.12 shows that the enhanced k-means algorithm is the most efficient algorithm in all cases of our synthetic datasets.

In the following series of experiments, we compared the three implementations of the k-means algorithm with the CLARA algorithm. Fig.13 presents the performance of these algorithms and reflects that CLARA is not suitable for large datasets, especially when they contain a large number of clusters. This is because CLARA is based on a sample whose size is proportional to the number of clusters.

When we compare the execution times of the three implementations of the k-means algorithm with that of the CLARA algorithm, we may be unable to distinguish between the different implementations of the k-means algorithm; this is because CLARA takes a very long time to produce its results, while the k-means algorithm takes a very short time.

Fig.9 Execution time (Wind dataset)
Fig.10 Quality of clusters (Wind dataset)
Fig.11 Random sample of the synthetic dataset
Fig.12 Execution time for different synthetic datasets

**CONCLUSION**

In this paper we presented a simple idea to enhance the efficiency of k-means clustering. Our experimental results demonstrated that both of our schemes can improve the execution time of the k-means algorithm, with no loss of clustering quality in most cases. From our results we conclude that the second proposed implementation of the k-means algorithm (i.e., the enhanced k-means algorithm) is the best one.

**References**

Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P., 1998. Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications. Proc. ACM SIGMOD Int. Conf. on Management of Data. Seattle, WA, p.94-105.

Ankerst, M., Breunig, M., Kriegel, H.P., Sander, J., 1999. OPTICS: Ordering Points to Identify the Clustering Structure. Proc. ACM SIGMOD Int. Conf. on Management of Data, p.49-60.
**

Fig.13 Execution time. (a) Letters dataset; (b) Abalone dataset; (c) Wind dataset; (d) Synthetic dataset

Duda, R.O., Hart, P.E., 1973. Pattern Classification and Scene Analysis. John Wiley & Sons, New York.

Ester, M., Kriegel, H.P., Sander, J., Xu, X., 1996. A Density-based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining. AAAI Press, Portland, OR, p.226-231.

Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R., 1996. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press.

Gersho, A., Gray, R.M., 1992. Vector Quantization and Signal Compression. Kluwer Academic, Boston.

Guha, S., Rastogi, R., Shim, K., 1998. CURE: An Efficient Clustering Algorithm for Large Databases. Proc. ACM SIGMOD Int. Conf. on Management of Data. Seattle, WA, p.73-84.

Hinneburg, A., Keim, D., 1998. An Efficient Approach to Clustering in Large Multimedia Databases with Noise. Proc. 4th Int. Conf. on Knowledge Discovery and Data Mining. New York City, NY.

Huang, Z., 1997. A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining. Proc. SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery. Tech. Report 97-07, Dept. of CS, UBC.

Jain, A.K., Dubes, R.C., 1988. Algorithms for Clustering Data. Prentice-Hall Inc.

Kaufman, L., Rousseeuw, P.J., 1990. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons.

MacQueen, J., 1967. Some Methods for Classification and Analysis of Multivariate Observations. 5th Berkeley Symp. Math. Statist. Prob., 1:281-297.

Merz, P., 2003. An Iterated Local Search Approach for Minimum Sum of Squares Clustering. IDA 2003, p.286-296.

Ng, R.T., Han, J., 1994. Efficient and Effective Clustering Methods for Spatial Data Mining. Proc. 20th Int. Conf. on Very Large Data Bases. Morgan Kaufmann Publishers, San Francisco, CA, p.144-155.

Sheikholeslami, G., Chatterjee, S., Zhang, A., 1998. WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases. Proc. 24th Int. Conf. on Very Large Data Bases. New York, p.428-439.

Sibson, R., 1973. SLINK: an optimally efficient algorithm for the single-link cluster method. The Computer Journal, 16(1):30-34. [doi:10.1093/comjnl/16.1.30]

Zhang, T., Ramakrishnan, R., Livny, M., 1996. BIRCH: An Efficient Data Clustering Method for Very Large Databases. Proc. ACM SIGMOD Int. Conf. on Management of Data. ACM Press, New York, p.103-114.
