
The Journal of Supercomputing (2020) 76:5679–5693

https://doi.org/10.1007/s11227-019-02953-z

A novel clustering algorithm by clubbing GHFCM and GWO for microarray gene data

P. Edwin Dhas1 · B. Sankara Gomathi2

Published online: 17 July 2019


© Springer Science+Business Media, LLC, part of Springer Nature 2019

Abstract
The advancement of data mining technology presents a way to examine and ana-
lyse the medical databases. Microarray data help in analysing the gene expressions,
and the process of clustering helps in categorizing the data into organized groups.
Grouping similar gene expressions paves the way for effective analysis, and the
relationship between the expressions can be figured out. Recognizing the benefits
of clustering, this work intends to present a clustering algorithm by combining gen-
eralized hierarchical fuzzy C means (GHFCM) and grey wolf optimization (GWO)
algorithms. The GWO algorithm is utilized for selecting the initial clustering point,
and the GHFCM algorithm is employed for clustering the microarray gene data. The
performance of the proposed clustering algorithm is tested with respect to preci-
sion, recall, F-measure and time consumption, and the results are compared with the
existing approaches. The performance of the proposed work is satisfactory with bet-
ter F-measure rates and minimal time consumption.

Keywords Microarray gene data · Bio-inspired algorithm · Clustering algorithm

1 Introduction

Deoxyribonucleic acid (DNA) holds the genetic information, which plays a vital
role in the development, action and reproductive nature of the living organisms.
A gene contains a series of DNAs, which could represent the protein structure. A
gene expression is defined as the transcription of gene information to ribonucleic
acid (RNA) and translation of expression into proteins [1]. These gene expressions
can render beneficial information about the condition of the individual, and
the analysis of gene expressions is possible with the help of microarray
technology [2].

* P. Edwin Dhas
edwindhas.au@gmail.com
1 Department of Computer Science and Engineering, Jayaraj Annapackiam CSI College
of Engineering, Nazareth, India
2 Department of Electronics and Instrumentation Engineering, National Engineering College,
Kovilpatti, India

Due to the advancement of medical science, several intelligent biological
applications are developed on the basis of microarray technology. In most of the
cases, the microarray gene data are utilized to analyse the cancerous growths of
cells, which makes it sensible for figuring out the diagnostic plans [3]. Though
this process is highly beneficial, it involves more complexity due to the require-
ment of voluminous gene set to carry out the analysis. However, it is extremely
difficult to analyse the huge set of genes and this issue is addressed by microarray
gene data clustering.
The clustering operation helps in gene organization by clubbing the genes with
identical gene expression. Basically, the clustering operation groups or clusters the
related items in a single class or cluster. Hence, the items present in a class tend
to share more similarity rather than the items in different classes. When it comes
to microarray data, the genes with more related expressions come under a cluster,
which makes the data organized and helps in easy decision-making process.
The process of gene analysis focuses on two important activities such as clus-
tering and classification. The clustering activity helps in grouping the related data,
while the classification activity intends to differentiate between the data items. Both
these analytical processes pave the way for better organization and decision-making.
Understanding the merits of microarray gene data analytics, this research work aims
to propose a clustering algorithm, which can support the healthcare professional in
analysing the related group of gene expressions.
The purpose of this work is achieved by incorporating four significant phases,
and they are microarray gene data acquisition, pre-processing, gene data selection
and clustering. Data acquisition is the initial phase that is responsible for collecting
the input data from the laboratories or standard benchmark datasets. The pre-pro-
cessing phase is concerned with the data cleansing and removing redundant infor-
mation from the gene data [4].
Gene selection is the activity that focuses more on selecting the potential gene
expression for future analysis. This idea helps in reducing the time and memory con-
straints, while minimizing the computational complexity as well. All these prelimi-
nary steps are necessary to achieve better clustering outcomes. The related items are
grouped together under a single cluster for effective data analysis.
This work acquires the input microarray gene data from the standard benchmark
datasets, which deal with different types of cancer. The acquired input data are then
pre-processed to remove redundant information and noisy data. The pre-processed
data are then treated with the gene selection, which aims to select the potential gene
from the voluminous gene set. This work employs information gain ratio (IGR) for
reducing the dimensionality of the dataset.
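
The gene selection step can be illustrated with a minimal Python sketch. It assumes that each gene is discretised into equal-width bins before the information gain ratio is computed. The bin count, the top-k cut-off and the function names `entropy`, `gain_ratio` and `select_top_genes` are illustrative assumptions and are not specified in this work.

```python
import numpy as np

def entropy(values):
    """Shannon entropy (base 2) of a vector of discrete values."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def gain_ratio(feature, labels, bins=2):
    """Information gain ratio of one gene with respect to the class labels.

    Assumption: the gene is discretised into `bins` equal-width intervals.
    """
    edges = np.histogram_bin_edges(feature, bins=bins)
    codes = np.digitize(feature, edges[1:-1])      # bin index of every sample
    h_cond = 0.0
    for c in np.unique(codes):
        mask = codes == c
        h_cond += mask.mean() * entropy(labels[mask])
    split_info = entropy(codes)
    if split_info == 0:                            # gene falls into one bin only
        return 0.0
    return (entropy(labels) - h_cond) / split_info

def select_top_genes(X, labels, k):
    """Indices of the k genes (columns of X) with the highest gain ratio."""
    scores = np.array([gain_ratio(X[:, j], labels) for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:k]

# Toy data: 4 samples, 2 genes; only the first gene separates the classes.
X = np.array([[0.1, 0.1], [0.2, 0.9], [0.9, 0.2], [1.0, 1.0]])
y = np.array([0, 0, 1, 1])
top = select_top_genes(X, y, k=1)
```

In this toy example the first gene separates the two classes perfectly and is therefore ranked first. On real microarray data the discretisation strategy would need to be chosen with care.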
The dimensionality-reduced data are then clustered by the proposed algorithm,
which is based on the combination of generalized hierarchical FCM (GHFCM) and
grey wolf optimization (GWO) algorithms. Finally, the performance of the work is
analysed with the help of standard performance metrics such as precision, recall,
F-measure, Rand index and time consumption. The contributions of this work
are listed as follows:


• This work proposes a new clustering algorithm based on the combination of
GHFCM and GWO, which achieves better clustering outcomes with reasonable
performance.
• The process of data pre-processing and gene selection helps in reducing the time
consumption, computational complexity and memory overhead as well.
• The proposed work is tested over different microarray gene datasets for
grouping related cancerous gene expressions.

The rest of this article is organized as follows. The related literature on
microarray gene data clustering is reviewed in Sect. 2. The proposed clustering
algorithm for microarray gene data is presented in Sect. 3, and the performance
of the proposed algorithm is evaluated in Sect. 4. The conclusions of this article
are summarized in Sect. 5.

2 Review of literature

This section reviews and studies the related state-of-the-art literature concerning the
microarray gene data clustering.
A scalable and robust fuzzy-weighted clustering based on MapReduce with
application to microarray gene expression is presented in [5]. This work measures
the functional relationship between the genes. The similarity between the genes is
measured by the combination of ordered weighted averaging and Spearman’s cor-
relation coefficient.
In [6], an appliance for effective microarray gene clustering is proposed for gene
expression dataset with the help of graphical processing unit (GPU). This work
employs k-means algorithm with simulated annealing (SA) algorithm over the GPU
with compute unified device architecture (CUDA). The authors report improved
performance.
The gene ontology is incorporated into fuzzy relational clustering of microarray
gene data in [7]. This work utilizes GOSlim to find the exact number of clusters and
membership values. These values are updated with the help of gene ontology, and
this work is applied on two different yeast expression datasets.
In [8], a case study is presented to analyse the performance of different cluster-
ing techniques over microarray gene data. This work considers five different cluster-
ing approaches such as hybrid swarm-based clustering (HSC), k-means, partition-
ing around medoids (PAM), vector quantization (VQ) and AGglomerative NESting
(AGNES). These clustering approaches are applied over five different microarray
datasets.
A hybrid cuckoo search-based bi-clustering algorithm is proposed for microarray
gene data in [9]. This work employs shuffled cuckoo search over Nelder–Mead. The
performance of this work is tested over four different datasets, and the results are
compared with the analogous approaches.
An effective cancer subtyping technique is proposed for microarray gene
expression data in [10]. This work subtypes the cancer by clustering the data and
detecting the density peaks, through which the tumour tissues are differentiated
from the normal tissues. The performance of the work is tested on several
real-time datasets and compared with the existing approaches.
In [11], a robust gene clustering algorithm is proposed on the basis of clonal
selection in multi-objective optimization framework. This work is based on the
behaviour of immune system of genes for clustering, and the number of clusters
may differ. In order to increase the efficiency of the clustering performance,
better validity indexes are employed. The algorithm converges quickly to the
optimal solutions by means of a population update mechanism, which is applied
repeatedly on the less dominant solutions of a specific iteration.
A fuzzy mixed-prototype clustering algorithm is proposed for microarray data
analysis in [12]. This work integrates the prototypes of spherical and hyper-pla-
nar cluster types. The performance of this work is tested on yeast and leukaemia
datasets. In [13], an algorithm, namely trioCuckoo, is proposed for triclustering
microarray gene expression data. This work extracts the co-expressed genes from
the samples for a period of time with different encoding formats. This work is
applied over two different datasets, and the performance of the work is compared
with the particle swarm optimization (PSO) algorithm.
In [14], a Web tool is presented for visualizing the clusters of multivariate data
with the help of principal component analysis (PCA) and heatmap. The Web tool
is named as ClustVis, which accepts the input data in a simple spreadsheet. The
PCA and heatmap plots are formed, which can be downloaded by the user in the
required format. A hybrid feature selection technique is presented on the basis
of correlation coefficient and PSO for microarray gene data in [15]. This work
focuses on feature selection, and the classification is performed on three different
datasets. The performance of this work proves that the selected features help in
reducing the time consumption, while increasing the accuracy.
In [16], a semi-supervised clustering algorithm is proposed for gene expression
data based on multi-objective optimization framework. This work utilizes fuzzy C
means (FCM) clustering algorithm, and multiple cluster centres are computed.
Five objective functions are computed, and the proposed approach is applied over
five different datasets. A microarray gene retrieval system based on local fisher
discriminant analysis (LFDA) and support vector machine (SVM) is proposed in
[17]. This work minimizes the dimensionality with the help of LFDA and SVM is
employed as the classifier. The performance of the work is compared by varying
the classifiers.
In [18], an ensemble classification-based microarray gene retrieval system is pro-
posed. This work intends to classify between the normal and cancerous samples by
employing feature selection and ensemble classification. This work utilizes informa-
tion gain ratio (IGR) for feature dimensionality reduction, and ensemble classifier is
formed by taking k-nearest neighbour (k-NN), SVM and extreme learning machine
(ELM). The performance of the work is compared against three different datasets.
A microarray gene expression analytic system based on fuzzy logic is proposed
in [19]. This work employs fuzzy logic to investigate the gene data based on the
degree of membership rather than applying Boolean logic. In [20], a big data-driven
distributed density-based hesitant fuzzy clustering is presented by using Apache
Spark. The similarity between the gene expressions is computed by fuzzy-weighted
similarity measurement. The Apache Spark computational model is employed for
achieving soft clustering. This method is claimed to be scalable.
In [21], an automatic microarray image segmentation approach is proposed with
the help of clustering algorithm. This work pre-processes the images and initial-
izes the cluster centres based on data-driven approaches. Different sets of features
such as intensity, spatial and shape features are extracted, and the optimal features
are selected. The performance of the work is compared with FCM and k-means
algorithm.
A clustering algorithm called AGGLO-Hi based on proximity measures is proposed
for gene expression microarray data in [22]. This work considers proximity
measures such as Euclidean, Manhattan, Chebyshev and cosine similarity distances.
Finally, the AGGLO-Hi algorithm is analysed for the cluster quality. An application of
clustering analysis in brain gene data is proposed on the basis of deep learning in
[23]. In this work, the input data are pre-processed by the random forest approach
and the clustering model is built by deep belief network (DBN) and FCM algorithm.
In [24], a microarray data analytic system is proposed by employing multiple fea-
ture data clustering algorithms. This work considers microarray images and employs
multi-feature clustering algorithms that help in the segmentation of the microarray
images. The efficiency of the approaches is analysed in terms of accuracy.
Inspired by these works, this article presents a clustering algorithm for microar-
ray gene data and the proposed algorithm is discussed in the following section.

3 Proposed microarray gene clustering algorithm

This section proposes a hybrid clustering algorithm for microarray gene data by
combining GHFCM and GWO algorithms. Initially, the overview of the work is
summarized followed by the description of the algorithm.

3.1 Overview of the work

Clustering is an evergreen research area of data mining, which aims to combine
related data items in a group under a label or a class. Hence, the data items in a class
share more similarity than the data items in another class. However, better
clusters are quite difficult to achieve, as the computation of similarity between the
data items is the most important step. There are numerous challenges in front of
the clustering process, and some of the significant challenges are scalability,
robustness and accuracy.
As the microarray gene data are highly voluminous, the clustering algorithm must
be scalable such that the algorithm can process the growing data. The clustering
algorithm has to deal with outlier data effectively, which enhances its robustness.
Finally, accuracy is the most important measure that determines the correctness
of the clustering algorithm. Understanding these challenges, this work presents
a hybrid algorithm based on GHFCM and GWO algorithms. The performance
of the proposed algorithm is analysed by comparing it against the existing
clustering algorithms. The following section describes the proposed clustering
algorithm.

3.2 GHFCM

This section presents the significant details of GHFCM [25], which is an enhanced
version of FCM. The standard FCM algorithm utilizes Euclidean distance as the
similarity measure, which makes it inappropriate for several cases. The
GHFCM algorithm combines hierarchical FCM (HFCM) and generalized FCM (GFCM),
as discussed in [26, 27], respectively. The GHFCM algorithm is reported to be
more effective than the traditional FCM algorithm, as it is scalable
and approaches the data in a better way. The objective function of the GHFCM is
denoted by the following equation:
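$$\mathrm{OBF} = \sum_{i=1}^{I} \sum_{a=1}^{A} \sum_{b=1}^{B} x_{ia}^{g}\, y_{iab}^{h} \sum_{d \in NH_i} l_d\, m_{dab}. \tag{1}$$

The above-presented equation is rewritten as

$$\mathrm{OBF} = \sum_{i=1}^{I} \sum_{a=1}^{A} \sum_{b=1}^{B} x_{ia}^{g}\, y_{iab}^{h}\, l_d \left\lVert k_c - \mu_{ab} \right\rVert^{2}. \tag{2}$$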

In the above equation, $i = \{1, 2, \ldots, I\}$ indexes the dataset, which possesses
$I$ data samples. $A$ is the total number of clusters, and $B$ is the total number of
subclasses. The membership degree of $k_i$ in the $a$th cluster is denoted by $x_{ia}$,
and $g$ is the weight exponent of the fuzzy membership function $x_{ia}$. $y_{iab}$ is
the sub-membership, and the memberships fulfil the conditions
$\sum_{a=1}^{A} x_{ia} = 1$ and $\sum_{b=1}^{B} y_{iab} = 1$. $l_d$ is the weighting
factor that manages the impact of the distance between the corresponding data point
and the centre point. $NH_i$ is the neighbourhood of the $i$th data item. $k_c$ is the
closest data point, and $\mu_{ab}$ is the centroid of the cluster. $m_{dab}$ is the
sub-distance function, which is computed by Euclidean distance. The following
equations present the computation of $x_{ia}$, $y_{iab}$ and $\mu_{ab}$:

$$x_{ia} = \frac{\left( \sum_{b=1}^{B} \sum_{d \in NH_i} l_d\, y_{iab}^{h}\, m_{dab} \right)^{\frac{1}{1-g}}}{\sum_{p=1}^{A} \left( \sum_{b=1}^{B} \sum_{d \in NH_i} l_d\, y_{ipb}^{h}\, m_{dab} \right)^{\frac{1}{1-g}}} \tag{3}$$

$$y_{iab} = \frac{\left( \sum_{d \in NH_i} l_d\, x_{ia}^{g}\, m_{dab} \right)^{\frac{1}{1-h}}}{\sum_{p=1}^{B} \left( \sum_{d \in NH_i} l_d\, x_{ia}^{g}\, m_{dab} \right)^{\frac{1}{1-h}}}. \tag{4}$$
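
The membership updates of Eqs. (3) and (4) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' MATLAB implementation: the neighbourhood-weighted sub-distances are assumed to be precomputed in an array `D` with `D[i, a, b]` equal to the sum over `d` in NH_i of `l_d * m_dab`, the denominators are read as normalisations over the clusters and subclasses, respectively, and the sub-membership update uses the freshly updated cluster memberships. The function name `update_memberships` is hypothetical.

```python
import numpy as np

def update_memberships(D, y, g=2.0, h=2.0, eps=1e-12):
    """One pass of the membership updates sketched from Eqs. (3) and (4).

    D : (I, A, B) neighbourhood-weighted sub-distances (assumed precomputed)
    y : (I, A, B) current sub-memberships
    g, h : fuzzy exponents of the membership and the sub-membership (> 1)
    """
    D = D + eps                                   # guard against zero distances
    # Eq. (3): aggregate over subclasses, then normalise over the clusters.
    s = (y ** h * D).sum(axis=2)                  # shape (I, A)
    num_x = s ** (1.0 / (1.0 - g))
    x_new = num_x / num_x.sum(axis=1, keepdims=True)
    # Eq. (4): normalise over the subclasses of each cluster.
    t = (x_new[:, :, None] ** g) * D              # shape (I, A, B)
    num_y = t ** (1.0 / (1.0 - h))
    y_new = num_y / num_y.sum(axis=2, keepdims=True)
    return x_new, y_new

# Toy usage: 4 samples, 2 clusters, 3 subclasses, uniform initial sub-memberships.
rng = np.random.default_rng(0)
D = rng.random((4, 2, 3)) + 0.1
y0 = np.full((4, 2, 3), 1.0 / 3.0)
x1, y1 = update_memberships(D, y0)
```

Both outputs satisfy the normalisation conditions stated above: each row of `x1` sums to one over the clusters, and each `y1[i, a, :]` sums to one over the subclasses.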

The centre point of the cluster $\mu_{ab}$ is computed by the following equation:

$$\mu_{ab} = \frac{\sum_{i=1}^{I} \sum_{d \in NH_i} x_{ia}^{g}\, y_{iab}^{h}\, k_c}{\sum_{i=1}^{I} x_{ia}^{g}\, y_{iab}^{h}}. \tag{5}$$

Hence, the local weighted generalized mean with respect to the spatial and cluster
information is calculated. The changed membership along with the sub-membership
is given by

$$x_{ia} = \frac{\sum_{d \in NH_i} l_d\, x_{da}}{\sum_{p=1}^{A} \sum_{d \in NH_i} l_d\, x_{dp}} \tag{6}$$

$$y_{iab} = \frac{\sum_{d \in NH_i} l_d\, y_{dab}}{\sum_{p=1}^{B} \sum_{d \in NH_i} l_d\, y_{dap}}. \tag{7}$$
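
The neighbourhood averaging of Eq. (6) can be sketched as below. The sketch assumes that each sample carries one weight `l_d` and that the neighbourhood of every sample is given as a list of sample indices; both the data layout and the function name `smooth_memberships` are illustrative assumptions. Eq. (7) has the same form and could be handled by applying the same function to each slice `y[:, a, :]`.

```python
import numpy as np

def smooth_memberships(x, neighbours, weights):
    """Neighbourhood-weighted membership of Eq. (6).

    x          : (I, A) memberships of every sample
    neighbours : list of index lists, neighbours[i] plays the role of NH_i
    weights    : (I,) weighting factor l_d of every sample
    """
    out = np.empty_like(x, dtype=float)
    for i, nh in enumerate(neighbours):
        # Numerator of Eq. (6): weighted sum of the neighbours' memberships.
        num = (weights[nh][:, None] * x[nh]).sum(axis=0)
        # Denominator: the same quantity summed over all clusters p.
        out[i] = num / num.sum()
    return out

# Toy usage: three samples on a line, each with its adjacent samples as neighbours.
x = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
nh = [[0, 1], [0, 1, 2], [1, 2]]
smoothed = smooth_memberships(x, nh, np.ones(3))
```

With unit weights, the middle sample, whose neighbours mostly belong to the first cluster, is pulled towards that cluster.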

The GHFCM works by employing the fuzzy objective function that considers
hierarchical distance function and spatial constraints. This idea helps in enhancing
the quality of the clustering algorithm. Irrespective of the better performance of the
GHFCM algorithm, the performance of the algorithm can still be geared up with the
help of the bio-inspired algorithm by choosing the optimal cluster centre point. This
work chooses the point with the help of the GWO algorithm, which is explained as
follows.

3.3 GWO algorithm

The bio-inspired optimization algorithms are popular in the current research era,
owing to their efficiency, simplicity and ability to handle the issue of local
minima. The GWO algorithm, discussed in [28], imitates the hunting behaviour of
grey wolves. The basic concept of this algorithm is presented as follows. Usually,
the wolves live in groups and follow a standard hierarchy.
Every group of wolves has a leader, which can be either male or female, and
the leader animal is denoted as alpha (α). The leader wolf is responsible for
making the final decisions, and all the other animals of the group follow the commands
of the leader wolf. The second place of this hierarchy is termed beta (β), which
comprises the fellow wolves that help the leader wolf in making decisions. Omega (ω)
occupies the lowest place in the hierarchy, and these wolves simply need to
obey the commands of the more dominant wolves. A wolf that is not an alpha, beta
or omega is called delta (δ). The delta wolves submit to the alpha and beta wolves
but have more dominance than the omega wolves, so they form the third level of
the hierarchy.
The main reason for the choice of the GWO algorithm is that, unlike GWO, most of
the meta-heuristic algorithms do not have any controlling authority. In
addition to this, the GWO algorithm requires minimal parameters to be set and is easy
to implement. As far as hunting is concerned, the GWO algorithm works in three
important phases, namely prey encircling, hunting and prey attacking. The pseudo-code
of GWO is presented below.

GWO Pseudo-code

Initialize the population with wolves Wi (i = 1, 2, …, n);
Initialize the vectors v, V, C;
Compute the fitness value of all the hunting wolves
and frame hierarchy;
Wα = Leader wolf;
Wβ = Secondary wolf;
Wδ = Third ranking wolf;
Iteration_count = 1;
Do
    For i = 1 to Wolves' group size
        Relocate the current wolf;
    End for;
    Compute the fitness value of all the hunting wolves;
    Upgrade Wα, Wβ, Wδ and v, V, C;
    Iteration_count = Iteration_count + 1;
While termination condition is not met;
Return Wα;
End;

Hence, the standard pseudo-code for GWO is presented in this section. From this
pseudo-code, the clarity and simplicity of GWO can be observed. Taking these
merits into account, this work utilizes the GWO for selecting the optimal centre point of
a cluster. In the above-presented pseudo-code, $W_i$ is the $i$th wolf of a population
of $n$ wolves, and the vectors $\vec{V}$ and $\vec{C}$ are computed by

$$\vec{V} = 2\vec{v} \cdot \mathrm{rand}_1 - \vec{v} \tag{8}$$

$$\vec{C} = 2 \cdot \mathrm{rand}_2. \tag{9}$$

The GHFCM is employed for clustering the microarray gene expression. The
proposed algorithm is presented as follows:

Proposed Algorithm

Input: Microarray gene dataset
Output: Gene expression clusters
Begin
Initialize the population with wolves Wi (i = 1, 2, …, n);
Initialize the vectors v, V, C;
Compute the fitness value of all the hunting wolves by
Eq. (10) and frame hierarchy;
Declare Wα as Leader wolf;
Declare Wβ as Secondary wolf;
Declare Wδ as Third ranking wolf;
Iteration_count = 1;
Do
    For each wolf
        Relocate the current wolf by Eq. (18);
        Employ GHFCM for clustering;
    End for;
    Compute the fitness value of all the hunting wolves;
    Upgrade Wα, Wβ, Wδ and v, V, C;
    Employ GHFCM for clustering;
    Iteration_count = Iteration_count + 1;
While termination condition is not met;
Return Wα and save the solution;
End;

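The fitness value for the wolves is computed by the following equations:

$$\vec{F} = \left| \vec{C} \cdot \vec{W}_p(t) - \vec{W}(t) \right| \tag{10}$$

$$\vec{W}(t+1) = \vec{W}_p(t) - \vec{V} \cdot \vec{F}. \tag{11}$$

The leader (α), secondary (β) and third-ranking (δ) wolves are determined by the
following equations:

$$\vec{F}_{\alpha} = \left| \vec{C}_1 \cdot \vec{W}_{\alpha} - \vec{W} \right| \tag{12}$$

$$\vec{F}_{\beta} = \left| \vec{C}_2 \cdot \vec{W}_{\beta} - \vec{W} \right| \tag{13}$$

$$\vec{F}_{\delta} = \left| \vec{C}_3 \cdot \vec{W}_{\delta} - \vec{W} \right|. \tag{14}$$

The positions $\vec{W}_1$, $\vec{W}_2$ and $\vec{W}_3$ are represented as follows:

$$\vec{W}_1 = \vec{W}_{\alpha} - \vec{V}_1 \cdot (\vec{F}_{\alpha}) \tag{15}$$

$$\vec{W}_2 = \vec{W}_{\beta} - \vec{V}_2 \cdot (\vec{F}_{\beta}) \tag{16}$$

$$\vec{W}_3 = \vec{W}_{\delta} - \vec{V}_3 \cdot (\vec{F}_{\delta}). \tag{17}$$

The process of wolf relocation is carried out by

$$\vec{W}(t+1) = \frac{\vec{W}_1 + \vec{W}_2 + \vec{W}_3}{3}. \tag{18}$$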
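
The relocation rule of Eqs. (8), (9) and (12) to (18) can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the authors' MATLAB code: the control value `v` is assumed to decrease linearly from 2 to 0 over the iterations, as in the original GWO description [28], the search space is assumed to be a box `[lo, hi]`, and the best position found so far is tracked separately for the return value. The function name `gwo_minimise` and its parameters are hypothetical.

```python
import numpy as np

def gwo_minimise(fitness, dim, n_wolves=50, n_iter=100, lo=-1.0, hi=1.0, seed=0):
    """Minimise `fitness` with a basic grey wolf optimiser (needs n_wolves >= 3)."""
    rng = np.random.default_rng(seed)
    wolves = rng.uniform(lo, hi, size=(n_wolves, dim))
    scores = np.array([fitness(w) for w in wolves])
    best = int(np.argmin(scores))
    best_pos, best_score = wolves[best].copy(), float(scores[best])
    for t in range(n_iter):
        # Frame the hierarchy: the three fittest wolves act as alpha, beta, delta.
        leaders = wolves[np.argsort(scores)[:3]].copy()
        v = 2.0 * (1.0 - t / n_iter)                    # assumed linear decay 2 -> 0
        moved = np.zeros_like(wolves)
        for leader in leaders:
            V = 2.0 * v * rng.random(wolves.shape) - v  # Eq. (8)
            C = 2.0 * rng.random(wolves.shape)          # Eq. (9)
            F = np.abs(C * leader - wolves)             # Eqs. (12)-(14)
            moved += leader - V * F                     # Eqs. (15)-(17)
        wolves = np.clip(moved / 3.0, lo, hi)           # Eq. (18), kept inside the box
        scores = np.array([fitness(w) for w in wolves])
        best = int(np.argmin(scores))
        if scores[best] < best_score:                   # remember the best wolf so far
            best_pos, best_score = wolves[best].copy(), float(scores[best])
    return best_pos, best_score

def sphere(w):
    """Toy objective used only for illustration."""
    return float(np.sum(w ** 2))

position, score = gwo_minimise(sphere, dim=3)
```

The defaults of 50 wolves and 100 iterations mirror the parameter settings reported in Sect. 4. The sphere function is only a stand-in objective for demonstration.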


The values of Wα, Wβ and Wδ are then updated, and this process continues until
the termination condition is reached. Until then, the location is changed in the next
iteration by Eq. (18). In this way, the microarray gene expressions are clustered, and
the performance of the proposed clustering algorithm is analysed in the following
section.
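
One way to connect the two components is to let every wolf encode a full set of candidate cluster centres and to score it by a clustering objective. The sketch below illustrates this idea with the standard FCM objective as a simplified stand-in for the GHFCM objective of Eq. (1). The encoding, the use of the FCM objective as a score and the function name `fcm_objective` are assumptions made for illustration; this work states only that GWO selects the initial centre points and that the fitness follows Eq. (10).

```python
import numpy as np

def fcm_objective(flat_centres, data, n_clusters, g=2.0, eps=1e-12):
    """Standard FCM objective for the centres encoded in one wolf position.

    flat_centres : 1-D vector of length n_clusters * n_features (one wolf)
    data         : (n_samples, n_features) expression matrix
    """
    centres = flat_centres.reshape(n_clusters, data.shape[1])
    # Squared Euclidean distance of every sample to every centre.
    d2 = ((data[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2) + eps
    # Optimal fuzzy memberships for fixed centres (rows sum to one).
    inv = d2 ** (-1.0 / (g - 1.0))
    x = inv / inv.sum(axis=1, keepdims=True)
    return float((x ** g * d2).sum())

# Toy usage: two well-separated groups of samples.
data = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
good = fcm_objective(np.array([0.0, 0.5, 10.0, 10.5]), data, n_clusters=2)
bad = fcm_objective(np.array([5.0, 5.0, 5.0, 5.5]), data, n_clusters=2)
```

A function of this kind could be handed to any optimiser that minimises a scalar score, and the best position found would then serve as the initial centres for the clustering stage.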

4 Results and discussion

The proposed GHFCM-GWO algorithm is implemented in MATLAB on a
stand-alone computer with an Intel i7 processor and 8 GB RAM. The performance of
the proposed algorithm is evaluated on two prominent datasets, namely acute
lymphoblastic leukaemia–acute myeloid leukaemia (ALL-AML) and colon tumour [29].
The ALL-AML dataset possesses 3571 genes and 72 instances. The colon tumour
gene dataset possesses 2000 genes and 60 instances.
Both these datasets involve two clusters.
The initial parameter settings of the proposed algorithm are presented as follows.
The initial population of the algorithm is fixed as 50, and the termination condition
is set as 100 iterations. The results attained by the proposed algorithm are analysed
by the performance metrics, and the results are presented as follows.
The performance of this work is evaluated by means of standard performance
measures such as precision (P), recall (R), F-measure (F) and Rand index [30].
The formulae for computing the above-stated performance metrics are presented as
follows:

$$P(x, y) = \frac{Q_{xy}}{Q_y} \tag{19}$$

$$R(x, y) = \frac{Q_{xy}}{Q_x} \tag{20}$$

$$F(x, y) = \frac{2 \cdot R(x, y) \cdot P(x, y)}{P(x, y) + R(x, y)}. \tag{21}$$

In the above equations, $P(x, y)$ denotes the probability of an entity in cluster $x$ to
be a part of class $y$. $Q_{xy}$ is the total number of entities of class $y$ in cluster
$x$, and $Q_x$ is the total number of entities in class $x$. $R(x, y)$ is the recall rate
of cluster $x$ by considering the class $y$. Here, $Q_y$ is the total count of entities in
class $y$. The Rand index compares the pairs of entities being
present in the cluster. The Rand index takes values from 0 to 1. The value one
indicates that the pairs are relevant to each other, which is computed by considering
the actual place of an entity with the ground truth cluster. In case of perfect
placement of an entity, the value is set as 1.

$$RI = \frac{\text{Total acceptance}}{\text{Total acceptance} + \text{Total rejections}}. \tag{22}$$
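
The F-measure of Eq. (21) and a pair-counting form of the Rand index can be sketched as follows. The sketch assumes that the total acceptance in Eq. (22) corresponds to the pairs of entities on which the clustering and the ground truth agree, and that the total rejections correspond to the pairs on which they disagree, which is the usual definition of the Rand index. The function names `f_measure` and `rand_index` are hypothetical.

```python
from itertools import combinations

def f_measure(precision, recall):
    """Harmonic mean of precision and recall, Eq. (21)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * recall * precision / (precision + recall)

def rand_index(truth, predicted):
    """Fraction of entity pairs treated alike by the ground truth and the clustering."""
    agree = total = 0
    for i, j in combinations(range(len(truth)), 2):
        same_truth = truth[i] == truth[j]
        same_pred = predicted[i] == predicted[j]
        agree += same_truth == same_pred      # both together or both apart
        total += 1
    return agree / total

# Precision and recall of Cl-1 in Table 1.
f_cl1 = f_measure(1.0, 0.94)
# A clustering that matches the ground truth up to a relabelling of the clusters.
ri = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
```

For the precision and recall reported for Cl-1 in Table 1, the function reproduces the tabulated F-measure of 0.969 after rounding. The Rand index is insensitive to how the clusters are labelled, so the relabelled clustering in the example scores 1.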


Table 1  Experimental results on ALL-AML dataset

ALL-AML       Part 1          Part 2          Part 3          Part 4
              Cl-1    Cl-2    Cl-3    Cl-4    Cl-5    Cl-6    Cl-7    Cl-8
Precision     1.0     0.89    0.93    1       0.72    1       0.87    0.89
Recall        0.94    1.0     1.0     0.86    0.92    0.38    0.91    0.78
F-measure     0.969   0.941   0.963   0.924   0.807   0.550   0.889   0.831
Rand index    0.68    0.71    0.74    0.78    0.87    0.88    0.69    0.72
Time (s)      1.8     2.4     2.1     1.7     1.4     0.98    1.1     0.97

Table 2  Experimental results on colon dataset

Colon dataset Dataset 1       Dataset 2
              Cl-1    Cl-2    Cl-3    Cl-4
Precision     0.57    1       0.61    0.54
Recall        1       0.06    0.6     0.54
F-measure     0.726   0.113   0.6.4   0.54
Rand index    0.96    0.93    0.61    0.59
Time (s)      2.2     2.3     2.4     2.1

The total acceptance denotes the acceptance of an entity inside a cluster, and the
rejections indicate the denied permission for an entity to be a part of a cluster. The
experimentation is carried out by dividing the ALL-AML dataset into four datasets
for easy processing. The experimental results attained by the proposed approach
are presented in Tables 1 and 2. In the second round of performance comparison,
the performance of the proposed approach is compared with the existing techniques
such as fuzzy clustering [7], cuckoo search algorithm [9] and semi-supervised clus-
tering [16] in terms of precision, recall, F-measure and time consumption (Figs. 1,
2).
Tables 1 and 2 present the outcome of the proposed clustering algorithm
employed over ALL-AML and colon datasets. The efficiency of the proposed algo-
rithm is shown with the help of precision, recall, F-measure, Rand index and time
consumption. In order to study the performance of the proposed clustering algo-
rithm, the analysis is carried out by segregating the input data into several groups
and the results are noted. These results are tabulated in Tables 1 and 2. The perfor-
mance of the proposed algorithm is satisfactory, and the following part compares
the proposed algorithm with the analogous approaches such as fuzzy clustering [7],
cuckoo search algorithm (CSA) [9] and semi-supervised clustering [16] with respect
to precision, recall, F-measure and time consumption.
From the precision and recall rate analysis, it is observed that the proposed
clustering algorithm performs better than the existing approaches. The
proposed clustering algorithm shows precision and recall rates of 98.6 and 95.3 per
cent, respectively. The second best performing clustering algorithm is the cuckoo
search-based clustering algorithm, which attains precision and recall rates of 92
and 84 per cent, respectively. The F-measure rate of the proposed clustering
algorithm is examined next.

Fig. 1  Precision rate analysis (y-axis: precision values (%); x-axis: techniques Fuzzy [7], Semisupervised [16], CSA [9], GHFCM+GWO)

Fig. 2  Recall rate analysis (y-axis: recall values (%); x-axis: techniques Fuzzy [7], Semisupervised [16], CSA [9], GHFCM+GWO)

Figure 3 shows the F-measure rate of the proposed approach and the comparative
clustering algorithms. The F-measure rate of a system depends
on the precision and recall rates. Hence, the proposed
approach attains better F-measure rates than the existing approaches. The time
consumption of the algorithms is presented in the following graph.


Fig. 3  F-measure rate analysis (y-axis: F-measure values (%); x-axis: techniques Fuzzy [7], Semisupervised [16], CSA [9], GHFCM+GWO)

The time consumption of the proposed approach is compared with that of the
existing approaches, and the results are shown in Fig. 4. The time consumption of
the proposed approach is minimal, as the parameters involved in GWO are very
limited and GWO has a leadership hierarchy, which is absent in most of the
meta-heuristic algorithms. Hence, the proposed clustering algorithm shows better
performance in terms of standard performance measures such as precision, recall,
F-measure and time consumption.

Fig. 4  Time consumption analysis (y-axis: time (sec); x-axis: techniques Fuzzy [7], Semisupervised [16], CSA [9], GHFCM+GWO)


5 Conclusion

This work proposes a clustering algorithm for microarray gene data by combining
GHFCM and GWO algorithms together. The main objective of this algorithm is to
enhance the quality of clusters in a reasonable period of time. Though microarray
data are widely used, their sheer volume makes them difficult to process.
Taking this as a challenge, this work presents a clustering algorithm that analyses the
relationship between the gene expressions and this study supports in forming some
useful pattern. The proposed clustering algorithm is analysed for precision, recall,
F-measure and time consumption. The attained results of the proposed approach
are compared with the existing approaches, and the proposed work outperforms the
existing clustering algorithms. In future, this work is planned to be extended so that
multiple class clustering problems can be tackled and the characteristics of certain
other meta-heuristic algorithms are planned to be investigated.

References
1. Ambroise C, McLachlan G (2002) Selection bias in gene extraction on the basis of microarray gene-
expression data. Proc Natl Acad Sci 99(10):6562–6566
2. Breitling R, Armengaud P, Amtmann A, Herzyk P (2004) Rank products: a simple, yet powerful,
new method to detect differentially regulated genes in replicated microarray experiments. FEBS Lett
573(1–3):83–92
3. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M et al (1999) Molecular classifica-
tion of cancer: class discovery and class prediction by gene expression monitoring. Science
286(5439):531–538
4. Baldi P, Hatfield GW (2002) DNA microarrays and gene expression: from experiments to data anal-
ysis and modeling. Cambridge University Press, Cambridge
5. Hosseini B, Kiani K (2018) FWCMR: a scalable and robust fuzzy weighted clustering based on
MapReduce with application to microarray gene expression. Expert Syst Appl 91:198–210
6. Saveetha V, Sophia S, Vijayakumar PDR (2018) Appliance of effective clustering technique for gene
expression datasets using GPU. Cluster Comput 1–8
7. Paul AK, Shill PC (2018) Incorporating gene ontology into fuzzy relational clustering of microarray
gene expression data. Biosystems 163:1–10
8. Dash R, Misra BB (2018) Performance analysis of clustering techniques over microarray data: a
case study. Physica A 493:162–176
9. Balamurugan R, Natarajan AM, Premalatha K (2018) A new hybrid cuckoo search algorithm for
biclustering of microarray gene-expression data. Appl Artif Intell 32(7–8):644–659
10. Mehmood R, El-Ashram S, Bie R, Sun Y (2018) Effective cancer subtyping by employing density
peaks clustering by using gene expression microarray. Pers Ubiquit Comput 22(3):615–619
11. Zareizadeh Z, Helfroush MS, Rahideh A, Kazemi K (2018) A robust gene clustering algorithm
based on clonal selection in multiobjective optimization framework. Expert Syst Appl 113:301–314
12. Liu J, Pham TD, Yan H, Liang Z (2018) Fuzzy mixed-prototype clustering algorithm for microarray
data analysis. Neurocomputing 276:42–54
13. Swathypriyadharsini P, Premalatha K (2018) TrioCuckoo: a multi objective cuckoo search algo-
rithm for triclustering microarray gene expression data. J Inf Sci Eng 34(6):1617–1631
14. Metsalu T, Vilo J (2015) ClustVis: a web tool for visualizing clustering of multivariate data using
principal component analysis and heatmap. Nucleic Acids Res 43(W1):W566–W570
15. Chinnaswamy A, Srinivasan R (2016) Hybrid feature selection using correlation coefficient and par-
ticle swarm optimization on microarray gene expression data. In: Chinnaswamy A, Srinivasan R
(eds) Innovations in bio-inspired computing and applications. Springer, Cham, pp 229–239


16. Alok AK, Saha S, Ekbal A (2017) Semi-supervised clustering for gene-expression data in multiob-
jective optimization framework. Int J Mach Learn Cybern 8(2):421–439
17. Scaria T, Christopher T (2018) Microarray gene retrieval system based on LFDA and SVM. Int J
Intell Syst Appl 10(1):9
18. Scaria T, Christopher T (2018) Ensemble classification based microarray gene retrieval system.
ICTACT J Soft Comput 9(1):1813–1819
19. Khanna D, Choudhury T, Sabitha AS, Nhu NG (2019) Microarray gene expression analysis using
fuzzy logic (MGA-FL). In: Abraham A, Dutta P, Mandal J, Bhattacharya A, Dutta S (eds) Emerging
technologies in data mining and information security. Springer, Singapore, pp 169–180
20. Hosseini B, Kiani K (2019) A big data driven distributed density based hesitant fuzzy cluster-
ing using Apache spark with application to gene expression microarray. Eng Appl Artif Intell
79:100–113
21. Shao G, Li D, Zhang J, Yang J, Shangguan Y (2019) Automatic microarray image segmentation
with clustering-based algorithms. PLoS ONE 14(1):e0210075
22. Kavitha E, Tamilarasan R (2019) AGGLO-Hi clustering algorithm for gene expression micro array
data using proximity measures. Multimed Tools Appl 1–15
23. Suo Y, Liu T, Jia X, Yu F (2019) Application of clustering analysis in brain gene data based on deep
learning. IEEE Access 7:2947–2956
24. SivaLakshmi B, Rao NN (2019) Microarray analysis using multiple feature data clustering algo-
rithms. In: Satapathy S, Bhateja V, Das S (eds) Smart intelligent computing and applications.
Springer, Singapore, pp 469–476
25. Zheng Y, Jeon B, Xu D, Wu QMJ, Zhang H (2015) Image segmentation by generalized hierarchical
fuzzy C-means algorithm. J Intell Fuzzy Syst 28:961–973
26. Pedrycz A, Reformat M (2006) Hierarchical FCM in a stepwise discovery of structure in data. Soft
Comput 10(3):244–256
27. Karayiannis NB (1996) Generalized fuzzy c-means algorithms. In: Proceedings of the Fifth IEEE
International Conference on Fuzzy Systems, 1996, vol 2. IEEE
28. Mirjalili S, Mirjalili SM, Lewis A (2014) Grey wolf optimizer. Adv Eng Softw 69:46–61
29. http://csse.szu.edu.cn/staff/zhuzx/Datasets.html
30. Freyhult E, Landfors M, Önskog J, Hvidsten TR, Rydén P (2010) Challenges in microarray class
discovery: a comprehensive examination of normalization, gene selection and clustering. BMC Bio-
inform 11:503

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published
maps and institutional affiliations.
