
International Journal of Computer Trends and Technology (IJCTT), Volume 4, Issue 5, May 2013

ISSN: 2231-2803 http://www.ijcttjournal.org Page 1465



Cluster Validity Indices for Gene Expression Data
Sitansu Mohanty¹, Kaberi Das², Debahuti Mishra³ & Ruchi Ranjan⁴

¹,²,³ Institute of Technical Education and Research, SOA University, Jagamara, Khandagiri, Bhubaneswar, Odisha, India
⁴ Tata Consultancy Services, Chandaka Industrial Estate, Bhubaneswar, Odisha, India


Abstract: Clustering is widely used to discover patterns and groups in data, but the quality of the clusters obtained by different clustering techniques needs to be validated. The need for validation follows from the fact that clustering is unsupervised learning: no labels are available to confirm the result. Predicting the correct number of clusters is therefore a hurdle, which can be addressed with cluster validity indices that assess the validity of clusters. Hierarchical clustering is one of the most widely used clustering techniques; it produces clusters as gene subsets by measuring the intra-cluster and inter-cluster distances of the data points. k-means clustering is another effective technique, which requires the number of clusters to be specified. Our research focuses on evaluating the validity of clusters through different validity indices. In this paper, we apply two validity indices, the Silhouette index and Dunn's index, and compare the results obtained for each clustering technique.

Keywords: Clustering, k-means clustering, hierarchical clustering, cluster validity indices

I. INTRODUCTION

Data mining is broadly used to make sense of data in every sphere of life. Its techniques are generally used to search for patterns in data and to determine the relations between elements of a data set. Clustering is one such technique: an unsupervised method in which a data set is partitioned into groups (clusters) based on data similarity [1]. The main task of clustering lies in finding the intrinsic groups in a set of unlabeled data. Clustering algorithms can be applied in various fields, mainly marketing, biology, libraries, insurance, city planning and earthquake studies [2]. The major areas where clustering techniques have been studied are statistics, machine learning and data mining.

Clustering methods are mainly classified into five approaches: partitioning algorithms, hierarchical algorithms, density-based, grid-based and model-based methods. Partitioning algorithms construct various partitions and then evaluate them by some criterion. Hierarchical algorithms create a hierarchical decomposition of the data set using some criterion; they follow either an agglomerative approach or a divisive approach [3][10]. Density-based methods rely on connectivity and density functions. Grid-based methods rely on a multiple-level granularity structure. Model-based methods hypothesize a model for each cluster and seek the best fit of the data to that model. In this paper, we have implemented partition-based and hierarchical clustering to group data into clusters. In the partitioning method, the dataset D is divided into k clusters: given k, we look for a partition into k clusters that optimizes the chosen partitioning criterion. k-means and k-medoids are heuristic partitioning algorithms. In k-means each cluster is represented by the centre of the cluster, whereas in k-medoids it is represented by one of the objects in the cluster [4-6]. Hierarchical clustering is generally used when different levels of clustering are needed or when the exact number of clusters is not known in advance. A hierarchical clustering technique outputs a hierarchy of clusters of the data [7]. Because clustering is an unsupervised process, clustering results must be validated through experimental analysis to establish the validity of each cluster. The rest of the paper is organized as follows: Section II deals with related work on cluster validation methods; Section III describes the preliminary concepts, which include clustering techniques and cluster validity indices; Section IV shows the proposed model; Section V illustrates the experimental results; and Section VI gives the conclusion and future work.

II. BACKGROUND STUDY

Mahesh Visvanathan et al. [1] proposed a cluster validity
tool that calculates three indices to show the clustering quality
and to optimize the process of selecting an appropriate
clustering algorithm and number of clusters.

Sanghoun Oh et al. [2] derived a new evolutionary method for the cluster validation index (CVI), called eCVI. Finding the number of clusters in a data set is an important task that CVIs address; to discover the optimal number of clusters, each CVI employs two kinds of measures, i.e. inter-cluster and intra-cluster distances.

Juanying Xie and Shuai Jiang [4] addressed a shortcoming of the global k-means algorithm, namely its heavy computational load, with a variant that takes less execution time.

Hui Xiong et al. [5] provided a formal and organized study of the effect of skewed data distributions on k-means clustering. In particular, they showed that k-means tends to produce clusters of relatively uniform size, even if the input data have varied true cluster sizes.


Danyang Cao and Bingru Yang [6] presented an improved k-medoids clustering algorithm based on the CF-Tree. The algorithm builds on the clustering features of the BIRCH algorithm and enhances the quality and scalability of clustering.

M. Srinivas and C. Krishna Mohan [7] proposed a hybrid clustering method, the leaders complete linkage algorithm (LcL), that merges the advantages of hierarchical and incremental clustering. The algorithm runs in linear time on the given data sets.

III. PRELIMINARY CONCEPTS

A. k-means Clustering
The prime goal of clustering is to group a set of objects and find out whether there is any relationship between those objects. Clustering methods are of two types, partitional and hierarchical. k-means is the best-known technique for non-hierarchical clustering. It iteratively finds k centroids and assigns each sample to the nearest centroid, where the coordinates of each centroid are the mean of the coordinates of the objects in the cluster [6-7].
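
The assign-then-update iteration described above can be sketched as follows. This is a minimal illustration and not the MATLAB implementation used in the experiments; the function name `kmeans`, the squared Euclidean distance, the random choice of initial centroids and the `iters` and `seed` parameters are all assumptions made for the example.

```python
import random


def kmeans(points, k, iters=100, seed=0):
    """Assign each point to its nearest centroid, then recompute the means."""
    rng = random.Random(seed)
    # Assumption: the initial centroids are k distinct points drawn at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iters):
        # Assignment step: index of the nearest centroid for every point.
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:  # no assignment changed: converged
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids
```

Calling `kmeans(points, 2)` on a list of coordinate tuples returns one label per point together with the final centroids. The result can depend on the initial centroids, which is why the seed is exposed as a parameter.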

B. Hierarchical Clustering

Hierarchical clustering algorithms output not just a set of clusters but a hierarchy of clusters. Hierarchical clustering is generally used when different levels of clustering are needed. There are various hierarchical methods, but all fall into two major groups: bottom-up and top-down methods. In bottom-up methods, we start with N clusters, each containing a single pattern. The two nearest clusters, according to a predefined criterion, are then merged to form a new cluster [9]. This procedure is applied iteratively to the resulting clusters until a single cluster containing all patterns is obtained. In top-down methods, we start with one cluster that includes all patterns and search for the best possible partition of that cluster into two groups. The straightforward method is to consider all possible partitions of the original cluster into two sets and select the optimum according to a predefined criterion [7][12].
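
The bottom-up procedure can be illustrated with a short sketch. The text leaves the merging criterion open, so single linkage with Euclidean distance is assumed here purely for illustration, and merging stops once a requested number of clusters remains; the names `agglomerative` and `n_clusters` are hypothetical.

```python
import math


def agglomerative(points, n_clusters):
    """Bottom-up clustering; returns clusters as lists of point indices."""
    # Start with N clusters, each containing a single pattern.
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Assumed criterion (single linkage): distance of the closest pair.
        return min(math.dist(points[p], points[q]) for p in a for q in b)

    while len(clusters) > n_clusters:
        # Find the two nearest clusters under the chosen criterion.
        _, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        # Merge cluster j into cluster i.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Recording the merges instead of stopping at `n_clusters` would give the full hierarchy. This naive version recomputes every pairwise linkage at each step, so it is only suitable for small data sets.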

C. Cluster Validity Indices
Clustering is widely used to discover underlying patterns and groups in data, and the quality of the clusters generated by the numerous clustering algorithms needs to be validated. The need for cluster validation follows from the fact that clustering is an unsupervised learning process: predicting the correct number of clusters is a hurdle, which can be cleared by using cluster validity indices to assess the quality of the clusters [1][9][11].

Silhouette Index

To calculate the silhouette index, we evaluate the distances between objects within and across clusters. Consider an object i that has been assigned to cluster A, and let C be any other cluster obtained in the clustering. For cluster A we compute a(i), the average dissimilarity of i to all other objects of A. For each cluster C different from A, d(i,C) is the average dissimilarity of i to all objects of C. The minimum of d(i,C) over all such C is taken to be b(i), and the silhouette value s(i) of object i is given in (1).
$$ s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} \qquad (1) $$
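
As a hedged illustration of (1), the sketch below computes s(i) for every object from a list of points and cluster labels. Euclidean distance is assumed as the dissimilarity, an object alone in its cluster is given s(i) = 0, and the overall index is taken as the mean of s(i); these are common conventions rather than choices stated in the text, and the function names are hypothetical. At least two clusters are required.

```python
import math


def silhouette_values(points, labels):
    """Return s(i) for every object, following equation (1)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    values = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:  # assumption: s(i) = 0 for a singleton cluster
            values.append(0.0)
            continue
        # a(i): average dissimilarity of i to the other objects of its cluster.
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b(i): smallest average dissimilarity d(i, C) to any other cluster C.
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        values.append((b - a) / max(a, b))
    return values


def silhouette_index(points, labels):
    """Assumed overall index: the mean of s(i) over all objects."""
    vals = silhouette_values(points, labels)
    return sum(vals) / len(vals)
```

Values of s(i) close to 1 indicate that object i lies much closer to its own cluster than to the nearest other cluster, while negative values suggest it may have been assigned to the wrong cluster.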

Dunn's Index
Dunn's index attempts to identify compact and well-separated clusters. For two clusters c_i and c_j, a dissimilarity function d(c_i, c_j) is first calculated [1][8]. It is defined in (2).

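$$ d(c_i, c_j) = \min_{x \in c_i,\; y \in c_j} d(x, y) \qquad (2) $$

The diameter of a cluster c is then calculated as in (3):

$$ \operatorname{diam}(c) = \max_{x,\, y \in c} d(x, y) \qquad (3) $$

Dunn's index is then given by (4):

$$ D_{nc} = \min_{i=1,\dots,nc} \left\{ \min_{j=i+1,\dots,nc} \left( \frac{d(c_i, c_j)}{\max_{k=1,\dots,nc} \operatorname{diam}(c_k)} \right) \right\} \qquad (4) $$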
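
A minimal sketch of (2) to (4) follows, assuming Euclidean distance for d(x, y); the function name `dunn_index` is hypothetical. Because the denominator in (4) does not depend on i or j, the index reduces to the smallest between-cluster distance divided by the largest cluster diameter. At least two clusters are required, and at least one cluster must hold two distinct points so that the largest diameter is non-zero.

```python
import math
from itertools import combinations


def dunn_index(points, labels):
    """Dunn's index: smallest cluster separation over largest diameter."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    groups = list(clusters.values())
    # Equation (3): diameter of each cluster; keep the largest one.
    max_diam = max(
        max((math.dist(x, y) for x, y in combinations(g, 2)), default=0.0)
        for g in groups
    )
    # Equation (2): distance between two clusters; keep the smallest one.
    min_sep = min(
        min(math.dist(x, y) for x in ci for y in cj)
        for ci, cj in combinations(groups, 2)
    )
    # Equation (4): ratio of the two quantities.
    return min_sep / max_diam
```

Larger values indicate clusters that are compact relative to the gaps between them. Since the index uses a minimum and a maximum over all point pairs, a single outlier can change it considerably.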

IV. SCHEMATIC REPRESENTATION OF PROPOSED MODEL

Three gene expression datasets have been collected from the UCI repository [13], and our proposed model has been evaluated on them. First, each dataset is normalized, and then principal component analysis (PCA) is applied to reduce its dimensionality. After reduction, the k-means and hierarchical clustering techniques are applied to find the clusters. Finally, the cluster validity indices, the Silhouette index and Dunn's index, are applied to the resulting clusters to identify the optimum clustering.











Fig. 1 Proposed model
Fig. 1 shows the schematic representation of our cluster
validity model.

V. EXPERIMENTAL EVALUATION AND RESULT ANALYSIS

For the evaluation of the proposed work, we have taken three datasets from the UCI repository [13]. The breast cancer dataset consists of 98 instances and 26 attributes, the hepatitis dataset of 155 instances and 20 attributes, and the dermatology dataset of 366 instances and 34 attributes. The experiments were carried out using MATLAB 2010. Our experimental work consists of the following steps:

Step 1: Normalization of Datasets: The datasets are multivariate and their attributes are real-valued. Each dataset has been normalized by applying the min-max normalization technique. Fig. 2 shows the normalization of the hepatitis dataset, Fig. 3 that of the breast cancer dataset and Fig. 4 that of the dermatology dataset.
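
Min-max normalization rescales each attribute linearly using its observed minimum and maximum. The sketch below assumes the target range [0, 1] and maps a constant attribute to 0; neither choice is stated in the text, and the function name is hypothetical.

```python
def min_max_normalize(rows):
    """Rescale every column of a row-major table to the range [0, 1]."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [
            # Assumption: a constant column (hi == lo) is mapped to 0.0.
            (v - lo) / (hi - lo) if hi > lo else 0.0
            for v, lo, hi in zip(row, lows, highs)
        ]
        for row in rows
    ]
```

Each row is an instance and each column an attribute, so after normalization attributes measured on different scales contribute comparably to the distance computations used in clustering.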



Fig. 2. Normalization of hepatitis dataset

Fig. 3. Normalization of breast cancer dataset

Fig. 4. Normalization of dermatology dataset

Step 2: Feature Reduction using PCA: Here we apply PCA to reduce each dataset so that only the relevant genes are retained. Fig. 5 shows the reduced form of the hepatitis dataset, which is reduced from 155*20 to 8*2; Fig. 6 shows the reduced form of the breast cancer dataset, which is reduced from 98*26 to 26*3; and Fig. 7 shows the reduced form of the dermatology dataset, which is reduced from 366*34 to 366*11.
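
For readers unfamiliar with PCA, the following sketch shows the generic technique: centre the data, form the covariance matrix, extract the leading eigenvectors and project the data onto them. It is not the MATLAB routine used in the experiments; the power-iteration-with-deflation scheme, the function name `pca` and the `iters` and `seed` parameters are assumptions chosen to keep the example free of external libraries. In this generic form the number of rows (instances) is unchanged and only the number of columns is reduced.

```python
import math
import random


def pca(rows, n_components, iters=500, seed=0):
    """Return (scores, components) for the top principal components."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d) of the centred data.
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    rng = random.Random(seed)
    components = []
    for _comp in range(n_components):
        # Power iteration from a random start vector.
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for _step in range(iters):
            w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:  # no variance left to explain
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue of the unit vector v.
        eigval = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d))
                     for a in range(d))
        components.append(v)
        # Deflation: remove the found direction before the next component.
        cov = [[cov[a][b] - eigval * v[a] * v[b] for b in range(d)]
               for a in range(d)]
    # Project the centred data onto the components.
    scores = [[sum(c[j] * r[j] for j in range(d)) for c in components]
              for r in centered]
    return scores, components
```

The sign of each component is arbitrary, so two runs may return scores that differ by a factor of -1 per column without changing the distances between projected points.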




Fig. 5. Reduction of hepatitis dataset
[Fig. 1 block labels: Gene Expression Dataset; Min-Max Normalization; PCA for Data Reduction; Clustering Techniques (k-means clustering, Hierarchical clustering); Cluster Validity Indices (Silhouette Index, Dunn's Index); Comparison of Result]



Fig. 6. Reduction of breast cancer dataset

Fig. 7. Reduction of dermatology dataset

TABLE I
DESCRIPTION OF PCA REDUCED DATASETS

Dataset          Before reduction    After reduction
Hepatitis        155*20              8*2
Breast cancer    98*26               26*3
Dermatology      366*34              366*11

Step 3: Clustering
A. Hierarchical Clustering: Hierarchical clustering has been applied to the three datasets. Fig. 8 shows the clusters of the hepatitis dataset, Fig. 9 those of the breast cancer dataset and Fig. 10 those of the dermatology dataset after applying the hierarchical clustering technique. The data have been clustered based on distance, so that the data points are separated and arranged into different clusters.














Fig. 8. Hierarchical clusters of hepatitis dataset

Fig. 9. Hierarchical clusters of breast cancer dataset

Fig. 10. Hierarchical clusters of dermatology dataset


B. k-means Clustering: k-means clustering has been applied to the three datasets. Fig. 11 shows the clusters of the hepatitis dataset, Fig. 12 those of the breast cancer dataset and Fig. 13 those of the dermatology dataset after applying the k-means technique. The data have been clustered based on distance, so that the data points are separated and arranged into different clusters.





Fig. 11. k-means clusters of hepatitis dataset

Fig. 12. k-means clusters of breast cancer dataset

Fig. 13. k-means clusters of dermatology dataset

Step 4: Cluster Validation: After applying the two clustering techniques (k-means and hierarchical) to the three datasets (hepatitis, breast cancer and dermatology), the validity indices are applied to the result of each clustering technique. The aim of the validity indices is to provide adequate evidence of which clustering is better. Hence the two validity indices (Silhouette index and Dunn's index) have been applied to each clustered dataset, and the results are compared in Table II.

The experimental results show that the Silhouette index gives better results than Dunn's index on the different datasets for both the k-means and hierarchical clustering techniques.

TABLE II
RESULT COMPARISON OF VALIDITY INDICES BY USING
HIERARCHICAL AND K-MEANS CLUSTERING TECHNIQUE


VI. CONCLUSION

In our study, the performance of Dunn's index and the Silhouette index in the validation of gene clustering was investigated on gene expression datasets. The datasets were clustered by hierarchical and k-means clustering, which produced different clusters. Our work gives results on a quality basis, not depending on the number of clusters. Our research has successfully applied validity indices using knowledge-driven methods to estimate the quality of the clusters. Finally, the comparison of the results of the two validity indices used in this paper shows that the Silhouette index proves to be the better method for cluster validation with both clustering techniques studied. In future, other validity indices can be applied to a number of clustering techniques to provide more valuable information about cluster validation.

REFERENCES

[1] Mahesh Visvanathan, Adagarla B. Srinivas, Gerald H. Lushington, Peter Smith, Cluster Validation: An Integrative Method for Cluster Analysis, IEEE, pp. 238-242, 2009.

[2] Sanghoun Oh, Chang Wook Ahn, Moongu Jeon, An Evolutionary Cluster Validation Index, IEEE, pp. 83-88, 2008.

[3] Morteza Jalalat-evakilkandi, Abdolreza Mirzaei, A New Hierarchical-Clustering Combination Scheme Based on Scatter Matrices and Nearest Neighbour Criterion, 5th International Symposium on Telecommunications (IEEE), pp. 904-908, 2010.

TABLE II

Clustering technique   Dataset         Validity index     Result
Hierarchical           Hepatitis       Silhouette Index   70.6%
                                       Dunn's Index       63.2%
                       Breast Cancer   Silhouette Index   66.8%
                                       Dunn's Index       61.8%
                       Hepatitis       Silhouette Index   70.6%
                                       Dunn's Index       63.2%
k-means                Hepatitis       Silhouette Index   68.2%
                                       Dunn's Index       61.7%
                       Breast Cancer   Silhouette Index   62.0%
                                       Dunn's Index       59.7%
                       Hepatitis       Silhouette Index   65.4%
                                       Dunn's Index       61.3%

[4] Juanying Xie, Shuai Jiang, A simple and fast algorithm for global k-means clustering, 2nd International Workshop on Education Technology and Computer Science (IEEE), pp. 36-40, 2010.

[5] Hui Xiong, Junjie Wu, Jian Chen, k-Means Clustering Versus Validation Measures: A Data-Distribution Perspective, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 39, no. 2, pp. 318-331, April 2009.

[6] Danyang Cao, Bingru Yang, An improved k-medoids clustering algorithm, IEEE, vol. 3, pp. 132-135, 2010.

[7] M. Srinivas, C. Krishna Mohan, Efficient Clustering Approach using Incremental and Hierarchical Clustering Methods, IEEE, 2010.

[8] Chunmei Yang, Xuehong Zhao, Ning Li, Yan Wang, Arguing the Validation of Dunn's Index in Gene Clustering, IEEE, 2009.

[9] Susmita Datta, Somnath Datta, Comparisons and validation of statistical clustering techniques for microarray gene expression data, Bioinformatics, vol. 19, no. 4, pp. 459-466, 2003.

[10] Li Zheng, Tao Li, Chris Ding, Hierarchical Ensemble Clustering, IEEE International Conference on Data Mining, IEEE, pp. 1199-1204, 2010.

[11] Rui Fa, Asoke K. Nandi, Li-Yun Gong, Clustering analysis for Gene Expression Data: A Methodology Review, Proceedings of the 5th International Symposium on Communications, Control and Signal Processing, IEEE, 2012.

[12] Gabriela Serban, Alina Campan, A New Core-Based Method For Hierarchical Incremental Clustering, Proceedings of the Seventh International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC'05), IEEE.

[13] http://www.ics.uci.edu/~mlearn/MLRepository.html, 2011.