clustering methods in data mining, due to its performance in
clustering massive data sets. The final clustering result of the k-means
clustering algorithm depends on the quality of the initial
centroids, which are selected randomly. The original k-means
algorithm converges to a local minimum, not the global optimum. The
k-means clustering performance can be enhanced if good initial cluster
centers are found. To find the initial cluster centers, a series of
procedures is performed. Data in a cell is partitioned using a cutting
plane that divides the cell into two smaller cells. The plane is perpendicular
to the data axis with the highest variance and is intended to minimize
the sum of squared errors of the two cells as much as possible, while at
the same time keeping the two cells as far apart as possible. Cells are
partitioned one at a time until the number of cells equals the
predefined number of clusters, K. The centers of the K cells become
the initial cluster centers for K-means. In this paper, an efficient
method for computing initial centroids is proposed. A Semi-
Unsupervised Centroid Selection Method is used to compute the
initial centroids. A gene data set is used to evaluate the proposed
approach of data clustering using initial centroids. The experimental
results illustrate that the proposed method is well suited to
gene clustering applications.
Index Terms— Clustering algorithm, Kmeans algorithm, Data
partitioning, initial cluster centers, semi-unsupervised gene selection.
I. INTRODUCTION
CLUSTERING, or unsupervised classification, can be
considered as a problem where the aim is to
partition a set of data objects into a predefined number of
clusters [13]. The number of clusters may be established by
means of a cluster validity criterion or specified by the user.
Clustering is broadly used in many applications,
such as customer segmentation, classification, and trend
analysis. For example, consider a retail database whose records
contain the items purchased by customers. A clustering method
could group the customers in such a way that customers with
similar buying patterns are in the same cluster. Several real-world
applications deal with high-dimensional data. Such data is
always a challenge for clustering algorithms, because
manual processing is practically impossible. A high-quality
computer-based clustering removes the unimportant features
and replaces the original set by a smaller representative set of
data objects.
K-means is a well-known prototype-based [14], partitioning
clustering technique that attempts to find a user-specified
R. Shanmugasundram, Associate Professor, Department of Computer
Science, Erode Arts & Science College, Erode, India.
Dr. S. Sukumaran, Associate Professor, Department of Computer Science,
Erode Arts and Science College, Erode, India.
number of clusters (K), which are represented by their
centroids.
The K-means algorithm is as follows:
1. Select initial centers of the K clusters. Repeat steps 2
through 3 until the cluster membership stabilizes.
2. Generate a new partition by assigning each data point to its
closest cluster center.
3. Compute new cluster centers as centroids of the clusters.
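The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the `max_iter` guard and the choice to keep the old center when a cluster becomes empty are assumptions made for the example.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Centroid (coordinate-wise mean) of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def kmeans(data, initial_centers, max_iter=100):
    """Plain K-means: alternate assignment and centroid update."""
    centers = [tuple(c) for c in initial_centers]  # step 1: given initial centers
    labels = None
    for _ in range(max_iter):  # max_iter is an assumed safety limit
        # Step 2: assign every point to its closest center.
        new_labels = [
            min(range(len(centers)), key=lambda k: squared_distance(p, centers[k]))
            for p in data
        ]
        if new_labels == labels:  # cluster membership has stabilized
            break
        labels = new_labels
        # Step 3: recompute each center as the centroid of its members.
        for k in range(len(centers)):
            members = [p for p, lab in zip(data, labels) if lab == k]
            if members:  # assumption: an empty cluster keeps its old center
                centers[k] = mean_point(members)
    return centers, labels
```

Because the loop only ever moves to the nearest local optimum, the quality of `initial_centers` decides the outcome, which is the motivation for the initialization methods discussed in the rest of the paper.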
Though K-means is simple and can be used for a wide
variety of data types, it is quite sensitive to the initial positions of
the cluster centers. The final cluster centroids may not be the optimal
ones, as the algorithm can converge to locally optimal solutions.
An empty cluster can result if no points are allocated to
the cluster during the assignment step. Therefore, it is
important for K-means to have good initial cluster centers [15,
16]. In this paper a Semi-Unsupervised Centroid Selection Method
(SCSM) is presented. The organization of this paper is as
follows. In the next section, the literature survey is presented.
In Section III, the efficient semi-unsupervised centroid selection
algorithm is presented. The experimental results are
presented in Section IV. Section V concludes the paper.
II. LITERATURE SURVEY
Clustering of statistical data has long been studied,
and many advanced models and algorithms have been
proposed. This section provides an overview of
related research work in the field of clustering that may assist
researchers.
Bradley and Fayyad in [2] put forth a technique for
refining initial points for clustering algorithms, in particular the
K-means clustering algorithm. They presented a fast and efficient
algorithm for refining an initial starting point for a general
class of clustering algorithms. Iterative techniques such as
K-means and EM, which are used in most clustering algorithms,
are sensitive to initial starting conditions and normally
converge to one of many local minima. They implemented an
iterative technique for refining the initial condition, which
allows the algorithm to converge to a better local minimum.
The refined initial point is used to evaluate the
performance of the K-means algorithm in clustering the given
data set. The results illustrated that the refinement run time is
significantly lower than the time required to cluster the full
database. In addition, the method is scalable and can be
coupled with a scalable clustering algorithm to address
large-scale clustering problems, especially in data
mining.
Yang et al. in [3] proposed an efficient data clustering
algorithm. It is well known that the K-means (KM) algorithm is
one of the most popular clustering techniques because it is
Enhancing K-Means Algorithm with
Semi-Unsupervised Centroid Selection Method
R. Shanmugasundaram and Dr. S. Sukumaran
(IJCSIS) International Journal of Computer Science and Information Security,
Vol. 8, No. 9, December 2010
337 http://sites.google.com/site/ijcsis/
ISSN 1947-5500
easy to implement and works fast in most
situations. But the sensitivity of the KM algorithm to initialization
makes it easily trapped in local optima. K-Harmonic Means
(KHM) clustering resolves the problem of initialization faced
by the KM algorithm. Even then, KHM also easily runs into local
optima. The PSO algorithm is a global optimization technique. A
hybrid data clustering algorithm based on PSO and KHM
(PSOKHM) was proposed by Yang et al. in [3]. This hybrid
data clustering algorithm utilizes the advantages of both
algorithms. The PSOKHM algorithm therefore not only helps
KHM clustering escape from local optima but also
overcomes the slow convergence speed of the
PSO algorithm. They conducted experiments to compare the
hybrid data clustering algorithm with PSO and KHM
clustering on seven different data sets. The results of the
experiments show that PSOKHM was superior to the
other two clustering algorithms.
Huang in [4] put forth a technique that extends the
applicability of the K-means algorithm to various data sets.
Generally, the efficiency of the K-means algorithm in clustering
data sets is high. Its use on real-world data containing
categorical values is restricted because it was mostly
employed on numerical values. The author presented two algorithms
which extend the k-means algorithm to categorical domains
and domains with mixed numeric and categorical values. The
k-modes algorithm uses a simple matching dissimilarity
measure to deal with categorical objects, replaces the means of
clusters with modes, and uses a frequency-based method to
update modes in the clustering process to decrease the
clustering cost function. The k-prototypes algorithm, through the
definition of a combined dissimilarity measure, further
integrates the k-means and k-modes algorithms to allow for
clustering objects described by mixed numeric and categorical
attributes. The experiments were conducted on the well-known
soybean disease and credit approval data sets to demonstrate
the clustering performance of the two algorithms.
Kluger et al. [5] first proposed spectral biclustering for
processing gene expression data. But their focus is mainly
on unsupervised clustering, not on gene selection.
There is also existing work on finding initial centroids,
for example the following procedure.
1. Compute the mean ($\mu_j$) and standard deviation ($\sigma_j$) of the
jth attribute values.
2. Compute the percentiles $z_1, z_2, \ldots, z_k$ corresponding to an area
under the normal curve from $-\infty$ of $(2s-1)/2k$, $s = 1, 2, \ldots, k$
(clusters).
3. Compute the attribute values $x_s = z_s \sigma_j + \mu_j$ corresponding to
these percentiles using the mean and standard deviation of the
attribute.
4. Perform K-means to cluster the data based on the jth attribute
values using $x_s$ as initial centers and assign cluster labels to
every data item.
5. Repeat steps 3-4 for all attributes (l).
6. For every data item t create the string of class labels
$P_t = (P_1, P_2, \ldots, P_l)$, where $P_j$ is the class label of t when using
the jth attribute values for the step 4 clustering.
7. Merge the data items which have the same pattern string
$P_t$, yielding K′ clusters. The centroids of the K′ clusters are
computed. If K′ > K, apply the Merge-DBMSDC (Density-based
Multi Scale Data Condensation) algorithm [6] to merge these
K′ clusters into K clusters.
8. Find the centroids of the K clusters and use them as
initial centers for clustering the original data set using
K-means.
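Steps 1 to 3 of this procedure amount to placing the k initial centers of one attribute at evenly spaced quantiles of a fitted normal distribution. A minimal sketch follows; the function name `percentile_centers` and the use of the population standard deviation are assumptions, since the text does not say which estimator is meant.

```python
from statistics import NormalDist, mean, pstdev


def percentile_centers(values, k):
    """Initial 1-D centers x_s = z_s * sigma + mu for s = 1..k.

    z_s is the standard-normal quantile whose left-tail area is (2s - 1) / (2k).
    """
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    std_normal = NormalDist()  # mean 0, standard deviation 1
    centers = []
    for s in range(1, k + 1):
        z = std_normal.inv_cdf((2 * s - 1) / (2 * k))
        centers.append(z * sigma + mu)
    return centers
```

For k = 3 the areas are 1/6, 1/2 and 5/6, so the middle center coincides with the attribute mean and the outer two lie symmetrically around it.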
Although the initialization algorithms mentioned above can help
find good initial centers to some extent, they are quite
complex, and some use the K-means algorithm as part of their
procedure, which still requires random
cluster center initialization. The proposed approach for finding
initial cluster centroids is presented in the following section.
III. METHODOLOGY
3.1. Initial Cluster Centers Derived from Data
Partitioning
The algorithm follows a novel approach that performs data
partitioning along the data axis with the highest variance. The
approach has been used successfully for color quantization [7].
The data partitioning tries to divide the data space into small cells
or clusters where inter-cluster distances are as large as possible
and intra-cluster distances are as small as possible.
Fig. 1 Diagram of ten data points in 2D, sorted by its X value, with an
ordering number for each data point
For instance, consider Fig. 1. Suppose ten data points in 2D
data space are given.
The goal is to partition the ten data points in Fig. 1 into two
disjoint cells where the sum of the total clustering errors of the
two cells is minimal (see Fig. 2). Suppose a cutting plane
perpendicular to the X-axis is used to partition the data. Let
$C_1$ and $C_2$ be the first cell and the second cell, respectively, and
let $\bar{c}_1$ and $\bar{c}_2$ be the cell centroids of the first cell and the second
cell, respectively. The total clustering error of the first cell is
thus computed by:
$$\sum_{c_i \in C_1} d(c_i, \bar{c}_1) \tag{1}$$
and the total clustering error of the second cell is thus
computed by:
$$\sum_{c_i \in C_2} d(c_i, \bar{c}_2) \tag{2}$$
where $c_i$ is the ith data point in a cell. As a result, the sum of the total
clustering errors of both cells is minimal (as shown in Fig. 2).
Fig. 2 Diagram of partitioning a cell of ten data points into two
smaller cells; a solid line represents the inter-cluster distance and dashed
lines represent the intra-cluster distance
Fig. 3 Illustration of partitioning the ten data points into two smaller
cells using m as a partitioning point. A solid line in the square
represents the distance between the cell centroid and a data point in the cell, a
dashed line represents the distance between m and the data in each cell, and
a solid dash line represents the distance between m and the data
centroid in each cell
The partition could be done using a cutting plane that passes
through m. Thus
$$d(c_i, \bar{c}_1) \le d(c_i, c_m) + d(\bar{c}_1, c_m) \quad \text{and} \quad d(c_i, \bar{c}_2) \le d(c_i, c_m) + d(\bar{c}_2, c_m) \tag{3}$$
(as shown in Fig. 3). Thus
$$\sum_{c_i \in C_1} d(c_i, \bar{c}_1) \le \sum_{c_i \in C_1} d(c_i, c_m) + d(\bar{c}_1, c_m) \cdot |C_1|$$
$$\sum_{c_i \in C_2} d(c_i, \bar{c}_2) \le \sum_{c_i \in C_2} d(c_i, c_m) + d(\bar{c}_2, c_m) \cdot |C_2| \tag{4}$$
m is called the partitioning data point, and $|C_1|$ and $|C_2|$
are the numbers of data points in clusters $C_1$ and $C_2$,
respectively. The total clustering error of the first cell can be
minimized by reducing the total discrepancies between all data
in the first cell and m, which is computed by:
$$\sum_{c_i \in C_1} d(c_i, c_m) \tag{5}$$
The same argument is also true for the second cell. The total
clustering error of the second cell can be minimized by reducing
the total discrepancies between all data in the second cell and m,
which is computed by:
$$\sum_{c_i \in C_2} d(c_i, c_m) \tag{6}$$
where $d(c_i, c_m)$ is the distance between m and each data point in
each cell. Therefore the problem of minimizing the sum of the total
clustering errors of both cells can be transformed into the
problem of minimizing the sum of the total clustering error of all
data in the two cells to m.
The relationship between the total clustering error and the
partitioning point is illustrated in Fig. 4, where the
horizontal axis represents the partitioning point, which runs from
1 to n, where n is the total number of data points, and the
vertical axis represents the total clustering error. When m=0,
the total clustering error of the second cell equals the total
clustering error of all data points, while the total clustering
error of the first cell is zero. On the other hand, when m=n, the
total clustering error of the first cell equals the total
clustering error of all data points, while the total clustering
error of the second cell is zero.
Fig. 4 Graphs depict the total clustering error, lines 1 and 2 represent
the total clustering error of the first cell and second cell, respectively,
Line 3 represents a summation of the total clustering errors of the
first and the second cells
The parabolic curve shown in Fig. 4 represents the summation of
the total clustering errors of the first cell and the second cell,
drawn as the dashed line 3. Note that the lowest point of
the parabolic curve is the optimal partitioning point (m). At this
point, the summation of the total clustering errors of the first cell
and the second cell is minimal.
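The error curve described above can be traced by brute force: sort the points along the chosen axis, try every partitioning point m, and sum the two cell errors of equations (1) and (2). The sketch below does this with Euclidean distance; the function names and the `axis` parameter are illustrative assumptions.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def total_clustering_error(points):
    """Sum of Euclidean distances from each point to the cell centroid."""
    if not points:
        return 0.0
    c = centroid(points)
    return sum(math.dist(p, c) for p in points)


def best_split(points, axis=0):
    """Try every partitioning point m along `axis`; return the best one.

    Returns (m, error, ordered), where the first cell is ordered[:m]
    and the second cell is ordered[m:].
    """
    ordered = sorted(points, key=lambda p: p[axis])
    best_m, best_err = None, float("inf")
    for m in range(1, len(ordered)):  # both cells stay non-empty
        err = (total_clustering_error(ordered[:m])
               + total_clustering_error(ordered[m:]))
        if err < best_err:
            best_m, best_err = m, err
    return best_m, best_err, ordered
```

Each candidate m costs O(n) to evaluate, so the full scan is quadratic in the number of points. That cost is what the adjacent-distance approximation in the next paragraphs is meant to avoid.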
Since the time complexity of locating the optimal point m is
$O(n^2)$, the distances between adjacent data points along the
X-axis are used to find an approximation of the point m in $O(n)$ time.
Let $D_j = d(c_j, c_{j+1})^2$ be the squared Euclidean distance between
adjacent data points along the X-axis.
If i is in the first cell then $d(c_m, c_i) \le \sum_{j=i}^{m} D_j$. On the other
hand, if i is in the second cell then $d(c_m, c_i) \le \sum_{j=m}^{i} D_j$ (as
shown in Fig. 5).
Fig. 5 Illustration of ten data points; a solid line represents the
distance between adjacent data along the X-axis and a dashed line
represents the distance between m and any data point
The task of approximating the optimal point (m) in 2D is
thus replaced by finding m on a one-dimensional line, as shown
in Fig. 6.
Fig. 6 Illustration of the ten data points on a one-dimensional line and
the relevant $D_j$
The point (m) is therefore a centroid on the one-dimensional
line (as shown in Fig. 6), which yields
$$\sum_{i=1}^{m-1} d(c_m, c_i) = \sum_{i=m}^{n} d(c_m, c_i) \tag{7}$$
Let $dsum_i = \sum_{j=1}^{i} D_j$; then centroidDist can be computed as
$$centroidDist = \frac{\sum_{i=1}^{n} dsum_i}{n} \tag{8}$$
It is possible to choose either the X-axis or the Y-axis as the
principal axis for data partitioning. However, the data axis with
the highest variance is chosen as the principal axis for
data partitioning. The reason is to make the distance
between the centers of the two cells as large as possible while
the sum of the total clustering errors of the two cells is reduced
from that of the original cell. To partition the given data into k
cells, the procedure starts with a cell containing all the given data and
partitions it into two cells. The next cell to be partitioned is
then the one that yields the largest reduction of the
total clustering error (or Delta clustering error). This is
defined as the total clustering error of the original cell minus the
sum of the total clustering errors of its two sub-cells.
This is done so that every time a partition of a
cell is performed, the partition helps reduce the sum of the
total clustering errors of all cells as much as possible.
The partitioning algorithm can now be used to partition a
given set of data into k cells. The centers of the cells can then
be used as good initial cluster centers for the K-means
algorithm. The steps of the initial centroid
predicting algorithm are as follows.
1. Let cell c contain the entire data set.
2. Sort all data in cell c in ascending order on each
attribute value and link the data by a linked list for each attribute.
3. Compute the variance of each attribute of cell c. Choose the
attribute axis with the highest variance as the principal axis for
partitioning.
4. Compute the squared Euclidean distances between adjacent
data along the data axis with the highest variance,
$D_j = d(c_j, c_{j+1})^2$, and compute $dsum_i = \sum_{j=1}^{i} D_j$.
5. Compute the centroid distance of cell c:
$$centroidDist = \frac{\sum_{i=1}^{n} dsum_i}{n}$$
where $dsum_i$ is the summation of distances between the
adjacent data.
6. Divide cell c into two smaller cells. The partition
boundary is the plane perpendicular to the principal axis that
passes through a point m whose $dsum_i$ approximately equals
centroidDist. The sorted linked lists of cell c are scanned
and divided into two for the two smaller cells accordingly.
7. Calculate the Delta clustering error for c as the total
clustering error before partition minus the total clustering error of
its two sub-cells, and insert the cell into an empty max heap
with the Delta clustering error as the key.
8. Delete the max cell from the max heap and assign it as the
current cell.
9. For each of the two sub-cells of c that is not empty,
perform steps 3-7 on the sub-cell.
10. Repeat steps 8-9 until the number of cells (size of the
heap) reaches K.
11. Use the centroids of the cells in the max heap as the initial cluster
centers for K-means clustering.
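The eleven steps can be condensed into the following Python sketch. It is one reading of the procedure under stated assumptions, not the authors' code: cells are re-sorted instead of keeping linked lists, $dsum$ of the first sorted point is taken as 0, the split index m is the first point whose $dsum$ reaches centroidDist, and the clustering error is the sum of Euclidean distances to the cell centroid. All function names are hypothetical.

```python
import heapq
import math
from itertools import count


def centroid(points):
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def clustering_error(points):
    c = centroid(points)
    return sum(math.dist(p, c) for p in points)


def variance(values):
    mu = sum(values) / len(values)
    return sum((v - mu) ** 2 for v in values) / len(values)


def split_cell(points):
    """Steps 3-6: split along the highest-variance axis near centroidDist."""
    dims = len(points[0])
    axis = max(range(dims), key=lambda a: variance([p[a] for p in points]))
    ordered = sorted(points, key=lambda p: p[axis])
    dsum = [0.0]  # assumption: cumulative squared distance starts at 0
    for a, b in zip(ordered, ordered[1:]):
        dsum.append(dsum[-1] + math.dist(a, b) ** 2)
    centroid_dist = sum(dsum) / len(dsum)
    m = next(i for i, v in enumerate(dsum) if v >= centroid_dist)
    m = min(max(m, 1), len(ordered) - 1)  # keep both sub-cells non-empty
    return ordered[:m], ordered[m:]


def make_entry(points, tie):
    """Step 7: heap entry keyed by the negated Delta clustering error."""
    if len(points) < 2:  # a single point cannot be split; pop it last
        return (math.inf, next(tie), points, None)
    left, right = split_cell(points)
    delta = (clustering_error(points)
             - clustering_error(left) - clustering_error(right))
    return (-delta, next(tie), points, (left, right))


def initial_centers(data, k):
    """Steps 1 and 8-11: split the best cell until k cells exist."""
    tie = count()  # tie-breaker so the heap never compares point lists
    heap = [make_entry(list(data), tie)]
    while len(heap) < k:
        neg_delta, _, points, split = heapq.heappop(heap)
        if split is None:  # nothing left that can be split
            heapq.heappush(heap, (neg_delta, next(tie), points, None))
            break
        for sub in split:
            heapq.heappush(heap, make_entry(sub, tie))
    return [centroid(points) for _, _, points, _ in heap]
```

Python's `heapq` is a min-heap, so the Delta clustering error is stored negated to obtain max-heap behaviour. The returned centers can be passed directly to any K-means routine as its starting points.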
The algorithms presented above for finding the initial
centroids do not provide sufficiently good results. Thus an efficient
method is proposed for obtaining the initial cluster centroids.
The proposed approach is well suited to clustering gene
data sets, so the proposed method is explained in terms of
genes.
3.2. Proposed Methodology
The proposed method is the Semi-Unsupervised Centroid
Selection method. The proposed algorithm finds the initial
cluster centroids for a microarray gene data set. The steps
involved in this procedure are as follows.
Spectral biclustering [10-12] can be carried out in the
following three steps: data normalization, bistochastization,
and seeded region growing clustering. The raw data in many
cancer gene-expression data sets can be arranged in one matrix.
In this matrix, denoted by A, the rows and columns represent the
genes and the different conditions (e.g., different patients),
respectively. Then the data normalization is performed as
follows. Take the logarithm of the expression data. Carry out
five to ten cycles of subtracting either the mean or median of
the rows (genes) and columns (conditions) and then perform
five to ten cycles of row-column normalization. Since gene
expression microarray experiments can generate data sets with
multiple missing values, the k-nearest neighbor (KNN)
algorithm is used to fill those missing values.
Define $A_{i\cdot} = (1/n) \sum_{j=1}^{n} A_{ij}$ to be the average of the ith row,
$A_{\cdot j} = (1/m) \sum_{i=1}^{m} A_{ij}$ to be the average of the jth column, and
$A_{\cdot\cdot} = (1/mn) \sum_{i=1}^{m} \sum_{j=1}^{n} A_{ij}$ to be the average of the whole
matrix, where m is the number of genes and n the number of
conditions.
Bistochastization may be done as follows. First, a matrix of
interactions $K = (K_{ij})$ is defined by $K_{ij} = A_{ij} - A_{i\cdot} - A_{\cdot j} + A_{\cdot\cdot}$.
Then the singular value decomposition (SVD) of the matrix K
is computed as $K = U \Lambda V^T$, where $\Lambda$ is a diagonal
matrix of the same dimension as K with nonnegative
diagonal elements in decreasing order, and U and V are $m \times m$
and $n \times n$ orthonormal column matrices. The jth column of the
matrix V is denoted by $\bar{v}_j$. A scatter plot of the
experimental conditions on the two best class-partitioning
eigenvectors $\bar{v}_1$ and $\bar{v}_2$ is then obtained. The $\bar{v}_1$ and $\bar{v}_2$ are often
chosen as the eigenvectors corresponding to the largest and the
second largest eigenvalues, respectively. The main reason is
that they can capture most of the variance in the data and
provide the optimal partition of the different experimental
conditions. In general, an s-dimensional scatter plot can be
obtained by using the eigenvectors $\bar{v}_1, \bar{v}_2, \ldots, \bar{v}_s$ (with the largest
eigenvalues).
Define $P = [\bar{v}_1, \bar{v}_2, \ldots, \bar{v}_s]$, which has a dimension of $n \times s$.
The rows of matrix P stand for different conditions, which will
be clustered using seeded region growing (SRG). Seeded region growing clustering is
carried out as follows. It begins with some seeds (the initial state
of the clusters). At each step of the algorithm, all as-yet
unallocated samples that border at least one
of the regions are considered. Among them, the one sample that has the
minimum difference from its adjoining cluster is allocated to
its most similar adjoining cluster. With the result of clustering,
the distinct types of cancer data can be predicted with very
high accuracy. In the next section, this clustering result is
used to select the best gene combinations, which serve as the
best initial centroids.
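The interaction matrix and its leading right singular vector can be illustrated without a linear-algebra library. The sketch below builds K by double centering and then uses power iteration on $K^T K$, which is a swapped-in stand-in for a full SVD and recovers only $\bar{v}_1$; the function names, the iteration count and the start vector are assumptions.

```python
import math


def interaction_matrix(A):
    """K_ij = A_ij - (row mean i) - (column mean j) + (overall mean)."""
    m, n = len(A), len(A[0])  # m genes (rows), n conditions (columns)
    row_mean = [sum(row) / n for row in A]
    col_mean = [sum(A[i][j] for i in range(m)) / m for j in range(n)]
    total_mean = sum(row_mean) / m
    return [[A[i][j] - row_mean[i] - col_mean[j] + total_mean
             for j in range(n)] for i in range(m)]


def leading_right_singular_vector(K, iterations=200):
    """Power iteration on K^T K; approximates the first column of V."""
    m, n = len(K), len(K[0])
    # Rows of K sum to zero, so a constant start vector would be mapped
    # to zero. Start from a non-constant vector instead.
    v = [float(j + 1) for j in range(n)]
    for _ in range(iterations):
        u = [sum(K[i][j] * v[j] for j in range(n)) for i in range(m)]  # u = K v
        w = [sum(K[i][j] * u[i] for i in range(m)) for j in range(n)]  # w = K^T u
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # K is (numerically) zero; return v unchanged
            break
        v = [x / norm for x in w]
    return v
```

In practice a library SVD routine would return all columns of V at once. The sketch is only meant to make the double-centering step and the meaning of $\bar{v}_1$ concrete, and the sign of the returned vector is arbitrary.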
3.2.1. Semi-Unsupervised Centroid Selection (SCSM)
The proposed semi-unsupervised centroid selection method
includes two steps: gene ranking and gene combination
selection.
As stated above, the best class-partitioning eigenvectors are
obtained. Now these eigenvectors $\bar{v}_1, \bar{v}_2, \ldots, \bar{v}_s$ are used to
rank and preselect genes.
The proposed semi-unsupervised centroid selection method
is based on the following two assumptions.
• The genes which are most relevant to the cancer
should capture most of the variance in the data.
• Since $\bar{v}_1, \bar{v}_2, \ldots, \bar{v}_s$ may reveal most of the variance in
the data, the genes “similar” to $\bar{v}_1, \bar{v}_2, \ldots, \bar{v}_s$ should be relevant
to the cancer.
The gene ranking and preselecting process can be
summarized as follows. After defining the ith gene
profile $\bar{g}_i = (a_{i1}, a_{i2}, \ldots, a_{in})$, the cosine measure is used to
compute the correlation (similarity) between each gene profile
(e.g., $\bar{g}_i$) and the eigenvectors (e.g., $\bar{v}_j$, $j = 1, 2, \ldots, s$) as
$$R_{i,j} = \frac{(\bar{g}_i)^T \bar{v}_j}{\|\bar{g}_i\|_2 \cdot \|\bar{v}_j\|_2}, \quad i = 1, 2, \ldots, m, \quad j = 1, 2, \ldots, s \tag{9}$$
where $\|\cdot\|_2$ denotes the vector 2-norm. As seen from (9), a
large absolute value of $R_{i,j}$ indicates a strong correlation (similarity)
between the ith gene and the jth eigenvector. Therefore, genes can be
ranked by the absolute correlation values $|R_{i,j}|$ for each
eigenvector. For the jth eigenvector, the top l genes can be
preselected, denoted by $G_j$, according to the corresponding
$|R_{i,j}|$ value for $j = 1, 2, \ldots, s$. The value l can be empirically
determined. Thus, for each eigenvector of $\bar{v}_1, \ldots, \bar{v}_s$, a set of
genes with the largest values of the cosine measure is obtained,
which are taken as the initial cluster centroids in the proposed
clustering technique.
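The ranking step can be written down directly from the cosine measure. The sketch below is illustrative only: `preselect_genes` and its argument names are assumptions, gene profiles and eigenvectors are plain sequences of equal length, and a zero vector is given similarity 0 by convention.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def preselect_genes(profiles, eigenvectors, l):
    """For each eigenvector, return the indices of the l genes with largest |R|.

    profiles: list of gene profiles, one per gene (each of length n).
    eigenvectors: list of s vectors (each of length n).
    Returns a list of s index lists, playing the role of G_1 .. G_s.
    """
    selected = []
    for v in eigenvectors:
        scores = [abs(cosine(g, v)) for g in profiles]
        ranked = sorted(range(len(profiles)),
                        key=lambda i: scores[i], reverse=True)
        selected.append(ranked[:l])
    return selected
```

Taking the absolute value means that genes strongly anti-correlated with an eigenvector rank as highly as strongly correlated ones, which matches the use of $|R_{i,j}|$ in the text.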
IV. EXPERIMENTAL RESULTS
The proposed SCSM method is evaluated on two
microarray data sets: the lymphoma data set and the liver
cancer data set.
TABLE I
GENE IDS (CLIDS) AND GENE NAMES IN THE TWO MICROARRAY DATA SETS
Data set     | Gene ID/CLID | Gene Name                                                            | Gene Rank G1 | Gene Rank G2
Lymphoma     | GENE1622X    | *CD63 antigen (melanoma 1 antigen); Clone=769861                     | 3            | /
Lymphoma     | GENE2328X    | *FGR tyrosine kinase; Clone=728609                                   | /            | 3
Lymphoma     | GENE3343X    | *mosaic protein LR11=hybrid; Receptor gp250 precursor; Clone=1352833 | /            | 4
Liver Cancer | IMAGE:301122 | 116682 ECM1 extracellular matrix protein 1 Hs.81071 N79484           | 7            | /
The lymphoma microarray data has three subtypes of
cancer, i.e., CLL, FL, and DLCL. The data set is obtained from
[8]. When applying the proposed method to this data set, the
clustering result with the two best partition eigenvectors is
obtained. As seen from the cluster results, the three classes are
correctly divided. Then two sets of l=20 genes are selected
according to $|R_{i,1}|$ and $|R_{i,2}|$, respectively. (Here s is set to
two.) From the two sets of 20 genes each, the two-gene
combinations that best divide the lymphoma
data are chosen. Two pairs of genes have been found: 1) Gene 1622X and
Gene 2328X, and 2) Gene 1622X and Gene 3343X, which
perfectly divide the lymphoma data. Since the results are
similar to each other, only the result of one group is shown.
The gene IDs and gene names of the selected genes in the
lymphoma data set are shown in Table I, where the group and
the rank of the genes are also shown.
TABLE II
COMPARISON OF GENERALIZATION ABILITY
Data set     | Method          | Number of genes selected | Test Rate (%) | (p1, p2)
Lymphoma     | k-means         | 4026                     | 100±0         | (0, 0.9937)
Lymphoma     | Existing Method | 81                       | 100±0         | (0, 0.9937)
Lymphoma     | SCSM            | 2±0                      | 99.92±0.37    | (NA, NA)
Liver Cancer | k-means         | 1648                     | 98.10±0.11    | (0, 0.9973)
Liver Cancer | Existing Method | 23                       | 98.10±0.11    | (0, 0.9973)
Liver Cancer | SCSM            | 1±0                      | 98.70±0.08    | (NA, NA)
The method is applied to the liver cancer data with two
classes, i.e., non-tumor liver and HCC. The liver cancer data is
obtained from [9]. The clustering result with the two best
partition eigenvectors is obtained. From the results it can be
seen that three samples are misclassified and the
clustering accuracy is 98.1%. Actually, s can be set to 1 so that the
scatter plot is on a single axis. Then the top 20 genes with the
largest $|R_{i,1}|$ are selected. Among the top 20 genes, one gene is found
that can divide the liver cancer data well, with an accuracy of
98.7%. The result and the name of the selected gene in the liver
cancer data set are shown in Table I.
4.1. Comparison with results
The paired t-test method is used to show the statistical
difference between our results and other published results. In
general, given two paired sets $X_i$ and $Y_i$ of n measured values, the
paired t-test can be employed to compute a p-value between
them and determine whether they differ from each other in a
statistically significant way, under the assumption that the
paired differences are independent and identically normally
distributed. The underlying t statistic is defined as follows:
$$t = (\bar{X} - \bar{Y}) \sqrt{\frac{n(n-1)}{\sum_{i=1}^{n} (\hat{X}_i - \hat{Y}_i)^2}}$$
where $\hat{X}_i = X_i - \bar{X}$, $\hat{Y}_i = Y_i - \bar{Y}$, and $\bar{X}$, $\bar{Y}$ are the mean
values of $X_i$ and $Y_i$, respectively. The p-value is obtained from t using
Student's t distribution with n − 1 degrees of freedom. Hence, all p ∈ [0,1], with a high
p-value indicating statistically insignificant differences and a
low p-value indicating statistically significant differences
between $X_i$ and $Y_i$.
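The paired statistic above can be computed with a few lines of Python. This sketch returns only the t statistic; converting it to a p-value needs the cumulative distribution of Student's t with n − 1 degrees of freedom, which the standard library does not provide, so that step is left out. The function name is an assumption.

```python
import math


def paired_t_statistic(x, y):
    """t = (mean(x) - mean(y)) * sqrt(n (n - 1) / sum((x_hat_i - y_hat_i)^2))."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must be paired and contain at least 2 values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squared differences of the centered values.
    ss = sum(((xi - mean_x) - (yi - mean_y)) ** 2 for xi, yi in zip(x, y))
    if ss == 0:
        raise ValueError("paired differences are constant; t is undefined")
    return (mean_x - mean_y) * math.sqrt(n * (n - 1) / ss)
```

A large absolute value of the statistic corresponds to a small p-value, that is, to a statistically significant difference between the two paired sets.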
The order of the cancer subtypes is shuffled and the
experiments are carried out 20 times for each data set. Each time the
same gene selection result is obtained for each data set, but with
slightly different classification accuracies. The p-values for
both the numbers of genes and the classification accuracies are
calculated for both data sets in Table II, which shows that the
differences between the numbers of genes used in our method
and other methods are statistically significant, whereas the
differences in classification accuracy between the
proposed method and other methods are not statistically
significant.
Figure 7: Comparison of classification accuracy among the proposed
and existing technique for two different datasets.
Figure 7 shows that the DPDA K-Means algorithm,
with initial cluster centers derived from data partitioning
along the data axis with the highest variance,
produces results with lower accuracy than the
proposed clustering with SCSM. The classification accuracy
of the proposed method is higher than that of the existing
method. The results also show that the proposed method is
suitable only for gene clustering; when the proposed
method is used to cluster other data, it produces a lower
accuracy.
Figure 8 shows the comparison of clustering time
between the DPDA K-Means algorithm, with initial cluster
centers derived from data partitioning along the data axis
with the highest variance, and the proposed clustering
with SCSM.
[Figure 7 plot: accuracy (%), axis from 65 to 90, per dataset (Lymphoma, Liver Cancer) for DPDA (existing) and SCSM (proposed).]
Figure 8: Comparison of classification time among the proposed and
existing technique for two different datasets.
From the graph it can be seen that the proposed
method takes slightly more time to cluster the gene data than
the existing method. Even though the clustering time is longer,
the clustering accuracy is very high. Thus the proposed system
can be used for gene clustering.
V. CONCLUSION
The most commonly used efficient clustering technique is
K-means clustering. Initial starting points computed
randomly by K-means often cause the clustering results
to reach only a local optimum. To overcome this disadvantage, a
new technique is proposed. The Semi-Unsupervised Centroid
Selection method is used with the present clustering approach
in the proposed system to compute the initial centroids for the
K-means algorithm. The experiments for the proposed
approach are conducted on microarray gene data. The
data sets used are the lymphoma and the liver cancer data sets. The
accuracy of the proposed approach is compared with the
existing technique called DPDA. The results are obtained
and tabulated. It is clearly observed from the results that the
proposed approach shows significantly better accuracy. On the
lymphoma data set, the accuracy of the proposed approach is
about 87%, whereas the accuracy of the DPDA approach is much lower,
at 75%. Similarly, for the liver cancer data set, the accuracy
of the proposed approach is about 81%, which is also higher
than that of the existing approach. Moreover, the time taken for
classification by the proposed approach is more or less similar
to that of the DPDA approach. The time taken for classification by
the proposed approach on the lymphoma and liver cancer data sets
is 115 and 130 seconds, respectively, which is almost similar
to the existing approach. Thus the proposed approach provides
the best classification accuracy within a short time interval.
REFERENCES
[1] Guangsheng Feng, Huiqiang Wang, Qian Zhao, and Ying Liang, “A
Novel Clustering Algorithm for Prefix-Coded Data Stream Based upon
Median-Tree,” IEEE International Conference on Internet Computing in
Science and Engineering, ICICSE '08, pp. 79-84, 2008.
[2] P. S. Bradley and U. M. Fayyad, “Refining Initial Points for K-Means
Clustering,” ACM, Proceedings of the 15th International Conference on
Machine Learning, pp. 91-99, 1998.
[3] F. Yang, T. Sun, and C. Zhang, “An efficient hybrid data clustering
method based on K-harmonic means and Particle Swarm Optimization,”
An International Journal on Expert Systems with Applications, vol. 36,
no. 6, pp. 9847-9852, 2009.
[4] Zhexue Huang, “Extensions to the k-Means Algorithm for Clustering
Large Data Sets with Categorical Values,” Journal on Data Mining and
Knowledge Discovery, Springer, vol. 2, no. 3, pp. 283-304, 1998.
[5] Y. Kluger, R. Basri, J. T. Chang, and M. Gerstein, “Spectral biclustering
of microarray cancer data: co-clustering genes and conditions,” Genome
Res., vol. 13, pp. 703-716, 2003.
[6] P. Fränti and J. Kivijärvi, “Randomised Local Search Algorithm for the
Clustering Problem,” Pattern Analysis and Applications, vol. 3,
no. 4, pp. 358-369, 2000.
[7] M. Halkidi, Y. Batistakis, and M. Vazirgiannis, “Cluster Validity
Methods: part I,” in Proceedings of the ACM SIGMOD International
Conference on Management of Data, vol. 31, no. 2, pp. 40-45,
June 2002.
[8] A. A. Alizadeh, M. B. Eisen, R. E. Davis, C. Ma, I. S. Lossos, A. Rosenwald,
J. C. Boldrick, H. Sabet, T. Tran, X. Yu, et al., “Distinct types of diffuse
large B-cell lymphoma identified by gene expression profiling.”
[9] Z. Q. Hong and J. Y. Yang, “Optimal Discriminant Plane for a Small
Number of Samples and Design Method of Classifier on the Plane,”
Pattern Recognition, vol. 24, no. 4, pp. 317-324, 1991.
[10] Manjunath Aradhya, Francesco Masulli, and Stefano Rovetta,
“Biclustering of Microarray Data based on Modular Singular Value
Decomposition,” Proceedings of CIBB 2009.
[11] Liu Wei, “A Parallel Algorithm for Gene Expressing Data
Biclustering,” Journal of Computers, vol. 3, no. 10, October 2008.
[12] Kenneth Bryan, Pádraig Cunningham, and Nadia Bolshakova,
“Biclustering of Expression Data Using Simulated Annealing.” This
research was sponsored by Science Foundation Ireland under grant
number SFI-02/IN1/I111.
[13] A. K. Jain, M. N. Murty, and P. J. Flynn, “Data Clustering: A Review,”
ACM Computing Surveys, vol. 31, no. 3, September 1999.
[14] Shai Ben-David, David Pal, and Hans Ulrich Simon, “Stability of
k-Means Clustering.”
[15] Madhu Yedla, Srinivasa Rao Pathakota, and T. M. Srinivasa, “Enhancing
K-means Clustering Algorithm with Improved Initial Center,” vol. 1,
pp. 121-125, 2010.
[16] A. M. Fahim, A. M. Salem, F. A. Torkey, and M. A. Ramadan, “An
Efficient enhanced k-means clustering algorithm,” Journal of Zhejiang
University, 10(7): 1626-1633, 2006.
[Figure 8 plot: time in seconds, axis from 80 to 130, per dataset (Lymphoma, Liver Cancer) for DPDA (existing) and SCSM (proposed).]