
International Journal of Computer Trends and Technology (IJCTT) volume 6 number 4 Dec 2013

ISSN: 2231-2803 http://www.ijcttjournal.org Page214



A Survey on Clustering Approaches for Gene Expression Patterns

Irene Maria¹, Mathew Kurian²

¹ Department of Computer Science and Engineering, Karunya University, India
² Department of Computer Science and Engineering, Karunya University, India

ABSTRACT Clustering is the process of organizing objects into
groups whose members are similar in some way and different from
the members of other groups. It is an efficient data mining
technique used in many fields. Clustering and classification
complement each other and can be combined, as in traditional
pattern recognition methods, to obtain better solutions. With
advances in microarray technology, cluster analysis of genes has
become possible and has many applications, providing insight into
the structural, functional and organisational aspects of gene
data sets. Traditional, hierarchical, density-based and
evolutionary clustering algorithms for gene expression data are
discussed. The evolutionary approaches include clustering based
on genetic algorithms. Finally, the new challenges that
gene-based clustering presents, owing to the special
characteristics of gene expression data and the particular
requirements of the biological domain, are listed.

Index Terms: Clustering, Fuzzy Partitioning, Gene
Expression, Genetic Algorithm, Microarray.
1. Introduction
Clustering requires a clear and unambiguous decision about the
clusters to be formed. A few clustering algorithms that focus on
categorical data have been developed, but in most cases the
measures that define the clusters may not be appropriate.
Interest in clustering methods is increasing in pattern
recognition, image processing and information retrieval, and also
in fields such as biology, geology and marketing.
1.1 Clusters and Clustering
Clustering is an important real-world problem and a typical
example of unsupervised classification. It is the process of
grouping data objects into a set of classes. These classes,
called clusters, contain entities that are highly similar to one
another and dissimilar to the entities in other clusters.
Clustering can therefore be used to find rules for classifying
objects. Detecting clusters of diverse shapes and sizes is a
fundamental difficulty for every clustering algorithm. Even with
a suitable clustering criterion, discovering the majority of the
clusters present in the data is a difficult goal when exploring
patterns, and it becomes harder still when little is known about
how the data are organized.
1.2 Clustering and Classification
Clustering and classification are both very important in
traditional pattern recognition, and they complement each other.
Clustering can improve the generalization of classification,
while class information can improve the accuracy of the
clustering solutions [1]. Many algorithms have been developed to
combine the advantages of both learning methods. These approaches
first optimize the clustering criterion and then optimize the
classification criterion based on the clustering solution
obtained.


1.3 Applications of clustering gene expression data
Microarray technologies have been developed to monitor the
expression levels of genes. Clustering techniques prove helpful
by giving insight into features such as gene function, gene
structure and cellular processes [2]. Genes with similar
expression patterns, called co-expressed genes, can be clustered
together and are likely to have similar cellular functions.
Moreover, co-expressed genes in the same cluster are likely to be
involved in the same cellular processes. Inferring regulation
through the clustering of gene expression data also provides
information about the mechanism of the transcriptional regulatory
network. Traditional clustering algorithms include hierarchical,
partitioning and density-based methods.



2. Hierarchical clustering
Hierarchical clustering generates a hierarchical series of nested
clusters that can be represented graphically by a tree, termed a
dendrogram. A hierarchical clustering method starts by defining a
distance between two data points in the gene expression data. The
two closest data points are grouped, and the process continues
until a cluster tree has been built. One way to evaluate whether
the obtained clusters are stable is to go back to the original
data set and check whether the same clusters are found again.
Hierarchical clustering identifies sets of correlated genes that
behave similarly across the samples, but it can produce thousands
of clusters in a tree-like structure that is difficult to
understand and explore.
Hierarchical clustering can be further divided into agglomerative
and divisive methods, based on how the dendrogram is formed [3],
[4]. Agglomerative algorithms follow a bottom-up approach: they
repeatedly merge groups of data until some pre-defined threshold
is reached. Here, linkage is the criterion used to determine the
distance between two clusters. Single linkage is the minimum
distance between two objects, one from each cluster; complete
linkage is the maximum distance between two such objects; and
average linkage is the mean distance over every pair of objects
from the two clusters. In agglomerative hierarchical clustering,
every object begins in its own cluster. The pair of objects with
the highest similarity is then merged into the same cluster.
Merging continues until all objects belong to a single cluster,
which can be viewed as a complete graph in which each node is
related to every other node.
Divisive hierarchical clustering is the opposite of agglomerative
hierarchical clustering and uses a top-down approach. It
recursively divides the data until some pre-defined threshold is
reached; that is, the algorithm divides the complete graph into
smaller components. The result is a dendrogram whose branches are
the clusters and which also provides information about the
similarity between the clusters.
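The three linkage criteria and the bottom-up merging loop can be illustrated with a minimal Python sketch. It assumes Euclidean distance between expression profiles and stops when a requested number of clusters remains; the function names and the stopping rule are illustrative choices, not taken from the cited works.

```python
import math

def pairwise(cluster_a, cluster_b):
    """All Euclidean distances between one object of each cluster."""
    return [math.dist(p, q) for p in cluster_a for q in cluster_b]

def single_linkage(cluster_a, cluster_b):
    # Distance of the closest pair of objects.
    return min(pairwise(cluster_a, cluster_b))

def complete_linkage(cluster_a, cluster_b):
    # Distance of the farthest pair of objects.
    return max(pairwise(cluster_a, cluster_b))

def average_linkage(cluster_a, cluster_b):
    # Mean distance over every pair of objects.
    d = pairwise(cluster_a, cluster_b)
    return sum(d) / len(d)

def agglomerate(points, linkage, n_clusters):
    """Bottom-up merging: every object starts as its own cluster."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Recording the order of the merges, rather than stopping at a fixed number of clusters, would give the information needed to draw the dendrogram.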
CURE [5] is an agglomerative hierarchical clustering algorithm
that begins by choosing a constant number of well-scattered
points from a cluster. These points are used to capture the shape
and size of the cluster. In the next step, the selected points
are shrunk toward the centroid of the cluster by a predetermined
fraction. Other agglomerative hierarchical clustering algorithms
rely on links rather than distances to measure the proximity
between a pair of data points before merging. Some agglomerative
hierarchical clustering algorithms, such as CHAMELEON [6], use a
graph partitioning algorithm to partition the data based on a
nearest-neighbor approach and then use an agglomerative
hierarchical clustering algorithm to combine the sub-clusters and
find the real clusters.
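The two CURE steps described above, choosing scattered representative points and shrinking them toward the centroid, might be sketched as follows. The farthest-point selection rule and the parameter names `c` and `alpha` are assumptions made for illustration; the sketch covers a single cluster only, not the full algorithm of [5].

```python
def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def scattered_points(cluster, c):
    """Pick up to c well-scattered points (points must be tuples).

    Assumed rule: start with the point farthest from the centroid, then
    repeatedly add the point farthest from those already chosen.
    """
    center = centroid(cluster)
    chosen = [max(cluster, key=lambda p: sq_dist(p, center))]
    while len(chosen) < min(c, len(set(cluster))):
        remaining = [p for p in cluster if p not in chosen]
        chosen.append(
            max(remaining, key=lambda p: min(sq_dist(p, q) for q in chosen))
        )
    return chosen

def shrink(points, center, alpha):
    """Move each point a fraction alpha of the way toward the centroid."""
    return [tuple(x + alpha * (m - x) for x, m in zip(p, center))
            for p in points]
```

With `alpha = 0` the representatives stay on the cluster boundary; with `alpha = 1` they all collapse onto the centroid.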
Hierarchical clustering methods are popular because of the way
they present cluster results, and they are widely preferred by
biologists. The approach offers built-in flexibility regarding
the level of granularity and easily handles any form of
similarity or distance. Its effectiveness [3] depends on:
i. The appropriateness of the validity measure used.
ii. The need to incorporate expression values and other types of
biological domain knowledge while exploring the clusters.
iii. The ability to be applied to high-dimensional numeric data.
iv. The representation of nested cluster structure, which is
clear, although intersecting clusters are represented
unsatisfactorily.
2. Partitional clustering
This clustering technique partitions the available data points
into different clusters based on a single center criterion.
2.1 Centroid models
In many clustering algorithms, the dissimilarity between points
in a dataset is computed using proximity measures such as
Euclidean distance, Mahalanobis distance and Pearson correlation.
Some algorithms optimize validity measures such as the
compactness of the clusters, their separation, or both.
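As a small illustration, two of the proximity measures named above can be written in a few lines of Python. Turning the Pearson correlation into a dissimilarity as 1 - r is a common convention and is assumed here; Mahalanobis distance is omitted because it additionally needs a covariance matrix.

```python
import math

def euclidean_distance(x, y):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_correlation(x, y):
    """Pearson correlation coefficient r of two profiles.

    Undefined (division by zero) if either profile is constant.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def pearson_distance(x, y):
    """Assumed convention: 0 for identical shape, 2 for opposite shape."""
    return 1.0 - pearson_correlation(x, y)
```

Two genes whose profiles have the same shape but different magnitude, for example (1, 2, 3, 4) and (10, 20, 30, 40), are far apart in Euclidean terms but have a Pearson distance of approximately zero, which is why the choice of measure matters for co-expression analysis.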
2.1.1 K-means
K-means [7] is a simple and fast centroid-based clustering
algorithm that divides the data into a pre-defined number of
clusters in order to optimize a predefined criterion. The K-means
algorithm calculates the center of a cluster as the mean of the
feature vectors in that cluster [3]. To find the optimal number
of clusters, users run the algorithm repeatedly with different
values of k and compare the clustering results. For a large gene
expression dataset consisting of thousands of genes, this
extensive process may not be practical. In addition, gene
expression data typically contain a large amount of noise, and
the k-means algorithm forces every gene into a cluster, which may
produce biologically irrelevant clusters. It is also difficult
for k-means to detect clusters of arbitrary shape and structure.
Despite these difficulties, the k-means algorithm is
frequently used to cluster gene expression data because of its
simplicity, and it provides baseline results against which newly
developed clustering algorithms can be compared. The k-means
algorithm and its variants are widely used for clustering gene
expression data.
The algorithm is generally ineffective for categorical datasets,
which lack an inherent distance measure. Extensions of K-means
that use cluster medoids or cluster modes instead of cluster
means have also been used. All these algorithms use a single
function for partitioning the data. Where no natural ordering of
entities exists, algorithms such as K-means and fuzzy C-means
fail.
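The assign-and-update loop of K-means can be sketched in Python as follows. The random choice of initial centers, the fixed seed and the iteration cap are assumptions of this sketch, and Euclidean distance is used throughout.

```python
import math
import random

def kmeans(points, k, n_iter=100, seed=0):
    """Minimal K-means for a list of numeric tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumed: random initial centers
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    return centers, labels
```

Every point receives exactly one label, which shows why noisy genes are forced into some cluster, and `k` must be supplied by the caller, which is why the algorithm is typically rerun for several values of `k`.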
2.1.2 K-medoids
K-medoids is a variation of the K-means clustering method in
which each cluster is represented by its medoid, the most
centrally located point of the cluster. In each iteration, all
remaining data points are assigned to clusters based on the newly
determined medoids. Early k-medoid methods are the algorithms PAM
and CLARA [8]. PAM (Partitioning Around Medoids) uses
dissimilarity values and an iterative approach to determine a
representative object, called the medoid, for each cluster. Once
the medoids have been found, each non-selected object is grouped
with the medoid to which it is most similar. The quality of a
clustering is measured by the average dissimilarity between an
object and the medoid of its cluster.
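A simplified, swap-based search in the spirit of PAM is sketched below. It assumes a naive initialization (the first k objects) and accepts any swap that lowers the total dissimilarity; the build phase and the bookkeeping of the original PAM [8] are not reproduced. The `dist` argument is any user-supplied dissimilarity function.

```python
def total_cost(points, medoids, dist):
    """Sum of dissimilarities between each object and its nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def pam(points, k, dist):
    """Greedy medoid swapping; returns medoids and average dissimilarity."""
    medoids = list(points[:k])  # assumed: naive initialization
    best = total_cost(points, medoids, dist)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for candidate in points:
                if candidate in medoids:
                    continue
                # Try replacing medoid i with a non-selected object.
                trial = medoids[:i] + [candidate] + medoids[i + 1:]
                cost = total_cost(points, trial, dist)
                if cost < best:
                    medoids, best, improved = trial, cost, True
    return medoids, best / len(points)
```

Because medoids are always actual objects of the dataset, only a dissimilarity function is required, not a way to average objects.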
CLARA follows the same principle as PAM, except that instead of
finding representative objects for the entire dataset, it draws a
sample of the dataset and applies PAM to this sample. It then
classifies the remaining objects using the usual partitioning
principles. CLARANS [9] performs a randomized search of a graph
to find the medoids that represent the clusters. The algorithm
takes maxneighbor and numlocal as input, selects a random node
and checks a sample of the neighbors of that node. If a better
neighbor is found, it moves to that neighbor and continues the
process; if no better neighbor is found before the maxneighbor
limit is reached, it declares the current node a local minimum
and starts a new search for another local minimum. After
collecting the specified number of local minima, numlocal, the
algorithm returns the best of them as the medoids of the
clusters.
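The randomized search of CLARANS can be sketched as follows, treating each node of the graph as a set of k medoids and each neighbor as a node that differs in exactly one medoid. The parameter names follow the text; the default values, the seed and the cost function (total dissimilarity to the nearest medoid) are assumptions of this sketch.

```python
import random

def clarans(points, k, dist, numlocal=2, maxneighbor=20, seed=0):
    """Randomized medoid search; returns the best node and its cost."""
    rng = random.Random(seed)

    def cost(medoids):
        return sum(min(dist(p, m) for m in medoids) for p in points)

    best_node, best_cost = None, float("inf")
    for _ in range(numlocal):
        current = rng.sample(points, k)  # random starting node
        current_cost = cost(current)
        tried = 0
        while tried < maxneighbor:
            # A neighbor differs from the current node in one medoid.
            i = rng.randrange(k)
            candidate = rng.choice([p for p in points if p not in current])
            neighbor = current[:i] + [candidate] + current[i + 1:]
            neighbor_cost = cost(neighbor)
            if neighbor_cost < current_cost:
                # Move to the better neighbor and restart the count.
                current, current_cost, tried = neighbor, neighbor_cost, 0
            else:
                tried += 1
        # The current node is taken as a local minimum.
        if current_cost < best_cost:
            best_node, best_cost = current, current_cost
    return best_node, best_cost
```

Larger values of `maxneighbor` make each local search more thorough at the price of more cost evaluations.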
2.1.3 K-modes
K-modes [18] works in a similar way to K-medoids, except that
modes are used instead of medoids. The method is related to the
fuzzy K-modes algorithm. The steps are similar to those of
K-medoids except for the assignment of cluster centers. K-modes
is a fast extension of the k-means algorithm for handling
categorical data: it uses a different similarity measure and,
unlike k-means, a frequency-based method to update the modes.
However, the k-modes algorithm can be applied to numeric data
only after converting them into categorical data, which may lead
to information loss and subsequently degrade the clustering
result.
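A minimal sketch of the K-modes idea is given below. It assumes the simple matching dissimilarity (the number of attributes on which two records differ) and takes the initial modes from the caller; the records would be expression profiles already discretized into categories such as "up", "down" and "flat".

```python
from collections import Counter

def mismatches(a, b):
    """Simple matching dissimilarity: number of differing attributes."""
    return sum(x != y for x, y in zip(a, b))

def update_mode(members):
    """Frequency-based update: most frequent category per attribute."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*members))

def kmodes(records, modes, n_iter=20):
    """Minimal K-modes for a list of categorical tuples."""
    modes = list(modes)
    labels = []
    for _ in range(n_iter):
        # Assign each record to the mode it differs from least.
        labels = [
            min(range(len(modes)), key=lambda c: mismatches(r, modes[c]))
            for r in records
        ]
        new_modes = []
        for c in range(len(modes)):
            members = [r for r, lab in zip(records, labels) if lab == c]
            new_modes.append(update_mode(members) if members else modes[c])
        if new_modes == modes:
            break  # modes stopped changing
        modes = new_modes
    return modes, labels
```

The discretization step that produces the categorical records is where the information loss mentioned above occurs.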
All these traditional partitional clustering techniques use
greedy search to optimize the compactness of the clusters. These
approaches suffer from the problem of local optima, and only a
single cluster validity index is optimized. In the context of
gene expression clustering, partitioning-based clustering
algorithms can find separate clusters [3], but they suffer from
the following problems:
i. The number of clusters is often not known a priori.
ii. The validity measures used by most of the above techniques
are inadequate because of the large size of gene data.
iii. Apart from disjoint cluster patterns, gene data often show
evidence of intersecting and embedded cluster patterns, which
these techniques usually do not accommodate.
3. DBSCAN
DBSCAN is a density-based clustering algorithm. It grows regions
of sufficiently high density into clusters and discovers clusters
of arbitrary shape in spatial databases. It defines a cluster as
a maximal set of density-connected points. A cluster continues to
grow as long as the density, that is, the number of objects or
data points in the neighborhood, exceeds a particular threshold.
DBSCAN can therefore be used effectively to filter out noise and
discover clusters of arbitrary shape. Statistically significant
patterns can be derived from dense regions, which can then be
used to identify genes of interest and eliminate others.
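The growth of a cluster from dense regions can be sketched as follows. The parameter names `eps` (neighborhood radius) and `min_pts` (density threshold) follow common usage and are assumptions here, as is the brute-force neighborhood search, which is adequate only for small data sets.

```python
import math

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within eps of point i (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN; returns one label per point, NOISE for outliers."""
    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later become a border point
            continue
        labels[i] = cluster  # i is a core point: start a new cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point, not expanded further
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point
        cluster += 1
    return labels
```

Points that never fall inside a dense neighborhood keep the `NOISE` label, which is how the algorithm filters out noisy genes instead of forcing them into a cluster.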
4. Soft Computing Techniques
4.1 Fuzzy partitioning
In classical fuzzy clustering, fuzziness is expressed as the
degree of membership, in the range [0, 1], with which each
element belongs to the different classes. In this approach, the
fuzziness of a clustering reflects the level of detail at which
the properties of the classified elements are examined. Gene
expression data analysis sometimes encounters intersecting gene
patterns, in which case a crisp or hard clustering may not yield
a good result. With fuzzy clustering, however, a gene can belong
to several clusters with certain degrees of membership.
4.1.1 Fuzzy C-means
Fuzzy C-means (FCM) [10] is a widely used technique that applies
the principles of fuzzy set theory to evolve a partition matrix.
The FCM algorithm starts with K randomly selected initial cluster
centers and, at every iteration, computes the fuzzy membership of
each gene in every cluster. The main advantage of this algorithm
is that it covers both the determination of the initial clusters
and cluster validity, which are fundamental stages of a
clustering process. It can also give a more sensitive analysis
than the DBSCAN algorithm, since it uses the principles of fuzzy
theory. Genetic algorithms (GAs) can be combined with fuzzy
clustering to produce better clusters. In GA-based fuzzy
clustering, the coordinates of the cluster centers are encoded as
real numbers.
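The alternating membership and center updates of FCM can be sketched as follows, using the standard update formulas with fuzzifier m (m = 2 is assumed as the default). In this sketch the initial centers are passed in by the caller rather than drawn at random, and a fixed number of iterations replaces a convergence test.

```python
import math

def memberships(points, centers, m=2.0):
    """Partition matrix: u[i][j] is the membership of point i in cluster j."""
    u = []
    for p in points:
        d = [math.dist(p, c) for c in centers]
        if 0.0 in d:
            # Point coincides with a center: full membership there.
            row = [1.0 if x == 0.0 else 0.0 for x in d]
            total = sum(row)
            row = [r / total for r in row]
        else:
            power = 2.0 / (m - 1.0)
            row = [
                1.0 / sum((d[i] / d[j]) ** power for j in range(len(d)))
                for i in range(len(d))
            ]
        u.append(row)
    return u

def update_centers(points, u, m=2.0):
    """Each center is the mean of all points weighted by u ** m."""
    dim = len(points[0])
    centers = []
    for c in range(len(u[0])):
        w = [row[c] ** m for row in u]
        total = sum(w)
        centers.append(tuple(
            sum(wi * p[k] for wi, p in zip(w, points)) / total
            for k in range(dim)
        ))
    return centers

def fuzzy_c_means(points, centers, m=2.0, n_iter=50):
    for _ in range(n_iter):
        u = memberships(points, centers, m)
        centers = update_centers(points, u, m)
    return centers, memberships(points, centers, m)
```

Each row of the returned matrix sums to one, so a gene lying between two groups receives intermediate memberships in both instead of being forced into one of them.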
5. Gene Expression Patterns and Clustering
Advances in microarray technology make it possible to study the
expression levels of genes. This technology is applied in various
fields, including the medical diagnosis of diseases. Microarray
data can be analyzed by clustering [11]. Most existing clustering
algorithms, however, are sensitive to the set of features
considered in the clustering process.
5.1.1 Single objective clustering
Clustering is generally unsupervised: no external knowledge is
available during the clustering process. In single-objective
clustering, the criterion may be based on the similarity or
dissimilarity of the data items [12]. The clusters formed by such
a criterion may not be correct, because the true clusters can
take other, more complex structures, and the clustering can fail
if the chosen objective function is not appropriate.
5.1.2 Multiobjective Clustering
In multiobjective clustering, several clustering algorithms are
used along with different objective functions. The result may
contain not only the clusters but also the specific objective
function that contributed to the formation of each cluster [13].
The problem of choosing a single objective function, as in
single-objective clustering, is alleviated because this approach
combines different objective functions.

6. Evolutionary methods
A genetic algorithm (GA) is a search technique used to find exact
or approximate solutions to optimization and search problems.
Genetic algorithms are categorized as global search techniques
[14]. They are a particular class of evolutionary algorithms that
use techniques inspired by evolutionary biology, such as
inheritance, mutation, selection and crossover. A
single-objective genetic algorithm optimizes a single objective
function, which may be used with different clustering algorithms.
In this clustering technique, each algorithm works with its own
objective function, so that each resulting cluster solution is
associated with an objective function different from those of the
other solutions.
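How selection, crossover and mutation can drive a single-objective clustering search is sketched below. Each chromosome is assumed to encode k cluster centers as real-valued tuples, and the fitness is assumed to be the inverse of one plus the total distance of the points to their nearest center. The population size, mutation rate, Gaussian mutation step and elitism rule are illustrative choices, not the settings of [14].

```python
import math
import random

def fitness(centers, points):
    """Assumed objective: higher when points lie close to their centers."""
    total = sum(min(math.dist(p, c) for c in centers) for p in points)
    return 1.0 / (1.0 + total)

def ga_cluster(points, k, pop_size=20, generations=50,
               mutation_rate=0.1, seed=0):
    """Minimal GA over sets of k centers; pop_size is assumed even."""
    rng = random.Random(seed)
    # Initial population: each chromosome is k randomly chosen points.
    population = [rng.sample(points, k) for _ in range(pop_size)]
    best = max(population, key=lambda ch: fitness(ch, points))
    for _ in range(generations):
        scores = [fitness(ch, points) for ch in population]
        # Selection: fitness-proportional (roulette wheel).
        parents = rng.choices(population, weights=scores, k=pop_size)
        children = []
        for a, b in zip(parents[0::2], parents[1::2]):
            # Crossover: single cut point, children inherit from both.
            cut = rng.randrange(1, k) if k > 1 else 0
            children.append(a[:cut] + b[cut:])
            children.append(b[:cut] + a[cut:])
        # Mutation: perturb a center with small Gaussian noise.
        for child in children:
            for i in range(k):
                if rng.random() < mutation_rate:
                    child[i] = tuple(x + rng.gauss(0.0, 0.5)
                                     for x in child[i])
        # Elitism: never lose the best chromosome found so far.
        champion = max(children, key=lambda ch: fitness(ch, points))
        if fitness(champion, points) > fitness(best, points):
            best = champion
        else:
            children[0] = best
        population = children
    return best, fitness(best, points)
```

A multiobjective variant would replace the single `fitness` value with several objective values per chromosome and keep the non-dominated solutions instead of one champion.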
7. Challenges of gene clustering
Owing to the special characteristics of gene expression data and
the particular requirements of the biological domain, gene-based
clustering presents several new challenges [3][2].
First, cluster analysis is typically the first step in data
mining and knowledge discovery. The purpose of
clustering gene expression data is to reveal the
natural data structures and gain some initial insights
regarding data distribution. Therefore, a good
clustering algorithm should depend as little as
possible on prior knowledge, which is usually not
available before cluster analysis.
Second, the effectiveness of a clustering technique is highly
influenced by the proximity measure it uses. Choosing or finding
an appropriate proximity measure is a challenging task.
Third, gene expression data often contain a large amount of noise
and many missing values, owing to the complex procedures of
microarray experiments. Clustering algorithms for gene expression
data should therefore be capable of extracting useful information
from noisy and incomplete data.
Fourth, algorithms for gene-based clustering should be able to
handle effectively the situation in which gene expression data
are highly connected and clusters are highly intersected with
each other or even embedded in one another.
Finally, the associations between clusters and the relationships
between genes within the same cluster are of great importance. A
clustering algorithm that not only partitions the data set but
also provides a graphical representation of the cluster structure
is therefore more useful. Clustering algorithms should also be
efficient, so that they scale with the increasing size and
dimensionality of datasets.


Conclusion
Although gene expression clustering has been done by applying
k-means, hierarchical clustering and DBSCAN, the desired features
of a clustering method include minimal user input, the ability to
find arbitrarily shaped clusters, robustness to noise and the
ability to handle high-dimensional data. If outliers or noise are
a problem and hierarchical clustering is to be used, a separate
outlier detection technique is needed. Fuzzy C-means and GA-based
approaches support overlapping clusters of co-regulated genes but
need user-specified parameters. A clustering algorithm's
capability to cluster biological data sets depends on desirable
features such as speed, a minimal number of input parameters, and
robustness to noise and outliers. Although the principles of many
available clustering algorithms satisfy these requirements, in
most cases they do not provide accurate solutions and cannot be
applied directly to the clustering of biological data.
In this survey, an attempt has been made to provide a
comprehensive and precise overview of various clustering
approaches in the context of pattern identification and
recognition in gene expression data. The effectiveness of a
clustering technique is highly influenced by the choice of
algorithm and the criterion it uses. A short list of clustering
approaches available for numeric data is provided. Each algorithm
is analyzed and its shortcomings are enumerated. Finally, the
challenges faced by the clustering schemes available for gene
expression data are discussed.
Acknowledgement
I am pleased to acknowledge my indebtedness to my guide, Mr. Mathew
Kurian, M.E., Assistant Professor, Department of Computer Science and
Engineering, for his invaluable support, advice, encouragement and
feedback.
References
[1] Weiling Cai, Songcan Chen and Daoqiang Zhang, A Multiobjective
Simultaneous Learning Framework for Clustering and Classification,
IEEE, 2010.
[2] Daxin Jiang, Chun Tang and Aidong Zhang (Department of Computer
Science and Engineering, State University of New York at Buffalo),
Cluster Analysis for Gene Expression Data: A Survey, IEEE
Transactions on Knowledge and Data Engineering, vol. 16, no. 11,
pp. 1370-1386, November 2004.
[3] Sajid Nagi, Dhruba K. Bhattacharyya and Jugal K. Kalita, Gene
Expression Data Clustering Analysis: A Survey, 2011.
[4] Harun Pirim, Burak Eksioglu, Andy D. Perkins and Cetin Yuceer,
Clustering of High Throughput Gene Expression Data, Computers and
Operations Research, 39(12):3046-3061, 2012.
[5] S. Guha, R. Rastogi and K. Shim, CURE: An Efficient Clustering
Algorithm for Large Datasets, ACM SIGMOD Conf., 1998.
[6] G. Karypis, E. H. Han and V. Kumar, CHAMELEON: A Hierarchical
Clustering Algorithm Using Dynamic Modeling, Computer, vol. 32,
no. 8, pp. 68-75, 1999.
[7] J. MacQueen, Some Methods for Classification and Analysis of
Multivariate Observations, in Proc. of the 5th Berkeley Symposium
on Mathematical Statistics and Probability, pp. 281-297, 1967.
[8] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An
Introduction to Cluster Analysis, John Wiley & Sons, 1990.
[9] R. T. Ng and J. Han, Efficient and Effective Clustering Methods
for Spatial Data Mining, in VLDB '94, pp. 144-155, 1994.
[10] Gozde Ulutagay, An Overview of Fuzzy & Crisp Clustering
Algorithms, Izmir University, Department of Industrial Engineering,
New Bulgarian University Lecture 3, May 31, 2012.
[11] Anirban Mukhopadhyay, Ujjwal Maulik and Sanghamitra
Bandyopadhyay, Simultaneous Informative Gene Selection and
Clustering through Multiobjective Optimization, IEEE, 2010.
[12] J. Handl and J. Knowles, An Evolutionary Approach to
Multiobjective Clustering, IEEE Trans., 2007.
[13] Martin H. C. Law, Alexander P. Topchy and Anil K. Jain,
Multiobjective Data Clustering, IEEE Computer Society Conference on
Computer Vision and Pattern Recognition, 2004.
[14] Ujjwal Maulik and Sanghamitra Bandyopadhyay, Genetic
Algorithm-Based Clustering Technique, 1999.
[15] A. K. Jain and R. C. Dubes, Data Clustering: A Review, 1999.
[16] A. Mukhopadhyay, U. Maulik and S. Bandyopadhyay,
Multi-objective Genetic Algorithm Based Fuzzy Clustering of
Categorical Attributes, IEEE, 2009.
[17] Ujjwal Maulik, Analysis of Gene Microarray Data in a Soft
Computing Framework, Applied Soft Computing, vol. 11, no. 6,
pp. 4152-4160, September 2011.
[18] Z. Huang, A Fast Clustering Algorithm to Cluster Very Large
Categorical Data Sets in Data Mining, SIGMOD Workshop on Research
Issues on Data Mining and Knowledge Discovery, May 1997.