
Neural Computing and Applications

https://doi.org/10.1007/s00521-020-05375-8

ORIGINAL ARTICLE

A novel binary gaining–sharing knowledge-based optimization algorithm for feature selection
Prachi Agrawal1 • Talari Ganesh1 • Ali Wagdy Mohamed2,3

Received: 19 June 2020 / Accepted: 18 September 2020
© Springer-Verlag London Ltd., part of Springer Nature 2020

1 Department of Mathematics & Scientific Computing, National Institute of Technology Hamirpur, Himachal Pradesh 177005, India
2 Operations Research Department, Faculty of Graduate Studies for Statistical Research, Cairo University, Giza 12613, Egypt
3 Wireless Intelligent Networks Center (WINC), School of Engineering and Applied Sciences, Nile University, Giza, Egypt

Corresponding author: Ali Wagdy Mohamed (awagdy@nu.edu.eg); Prachi Agrawal (prachi@nith.ac.in); Talari Ganesh (drganesh@nith.ac.in)

Abstract
Obtaining the optimal set of features is one of the most challenging and prominent problems in machine learning. Very few human-related metaheuristic algorithms have been developed to solve this type of problem. This motivated us to check the performance of the recently developed gaining–sharing knowledge-based optimization algorithm (GSK), which is based on the concept of humans gaining and sharing knowledge throughout their lifespan. It depends on two stages: the beginners–intermediate gaining and sharing stage and the intermediate–experts gaining and sharing stage. In this study, two approaches are proposed to solve feature selection problems: FS-BGSK, a novel binary version of the GSK algorithm that relies on these two stages with knowledge factor 1, and FS-pBGSK, in which a population reduction technique is employed on the BGSK algorithm to enhance the exploration and exploitation quality of FS-BGSK. The proposed approaches are checked on twenty-two feature selection benchmark datasets from the UCI repository that contain small-, medium- and large-dimensional datasets. The obtained results are compared with seven state-of-the-art metaheuristic algorithms: binary differential evolution, binary particle swarm optimization, binary bat algorithm, binary grey wolf optimizer, binary ant lion optimizer, binary dragonfly algorithm and binary salp swarm algorithm. It is concluded that FS-pBGSK and FS-BGSK outperform the other algorithms in terms of accuracy, convergence and robustness on most of the datasets.

Keywords Gaining–sharing knowledge-based optimization algorithm · Feature selection · Classification · K-NN classifier

1 Introduction

In machine learning, feature selection plays an important role: it deals with inappropriate, redundant or unnecessary features that deteriorate the performance of a classifier. It has several real-world applications in different fields such as biology, medicine or industry [6], multi-view learning [62], representational learning, document modelling [63], supervised learning [61] and intrusion detection systems [15]. The main purpose of feature (attribute) selection is to find an optimal subset of features from the original dataset, which improves the performance (accuracy) and saves resources (memory and CPU time). If a dataset contains $n$ attributes, a total of $2^n$ subsets of attributes would have to be evaluated. Hence, to solve the feature selection problem, filter methods and wrapper methods are commonly used. The filter methods depend on the data properties and do not involve any learning algorithm, whereas in the wrapper method the involvement of a learning algorithm (e.g. a classifier) is necessary.


Wrapper methods interact directly with the features themselves, which gives them higher accuracy. Moreover, wrapper methods are computationally more expensive than filter methods, but the results obtained by these methods are more accurate, and their performance typically depends on the employed learning algorithm.

The feature selection problem is considered an NP-complete combinatorial optimization problem in which an optimal subset of features is selected from the original dataset. The main motive of the feature selection problem is to reduce the dimension of the original dataset with maximum accuracy and a minimum number of features, which implies that these two contrary objectives can be treated as a multi-objective optimization problem. Therefore, metaheuristic algorithms are suitable for solving this type of problem, and they have shown superior performance.

Over the last three decades, metaheuristic algorithms have proved their efficiency and ability in solving high-dimensional optimization problems. These algorithms are based on randomly generated initial solutions and successfully find the optimal solutions of complex optimization problems [66]. They can be categorized into four categories, viz. evolution-based algorithms, swarm-based algorithms, physics-based algorithms and human-related algorithms. There are several algorithms in each category, and they have also been applied to many real-world applications in different fields of engineering and science [38]. These algorithms have achieved success in solving feature selection problems. A detailed literature review of algorithms for the feature selection problem is presented in Table 1.

Besides these algorithms, various hybrid algorithms have been proposed for the feature selection problem. Wan et al. [53] introduced a modified ant colony optimization algorithm combined with a genetic algorithm for the feature selection problem. Mafarja and Mirjalili [35] designed a hybrid metaheuristic algorithm in which the simulated annealing algorithm is inserted into the whale optimization algorithm to find the optimal feature subset. Tawhid and Dsouza [52] used the bat algorithm for exploring the search space and enhanced particle swarm optimization for converging to the optimal solution in the feature selection process. Yan et al. [55] proposed a binary coral reefs optimization algorithm involving simulated annealing for feature selection in biomedical datasets. For the regression task, Zhang et al. [67] proposed a support vector regression model with a chaotic krill herd algorithm and empirical mode decomposition.

Table 1 shows that in each category there are various algorithms which have been used for feature selection. In the human-related category, very few algorithms have been applied to feature selection problems. Therefore, this motivates us to use the recently developed human behaviour-based algorithm, viz. the gaining–sharing knowledge-based optimization (GSK) algorithm [39], in the feature selection task. Mohamed et al. proposed the GSK algorithm to solve optimization problems in continuous space. The basic concept of GSK is how humans reciprocate knowledge over their lifespan. The two main pillars of GSK are the beginners–intermediate (junior) gaining and sharing stage and the intermediate–experts (senior) gaining and sharing stage. In both stages, persons gain knowledge from their related networks and share their acquired knowledge with others to amplify their skills.

Although GSK is a population-based, real-valued continuous optimization algorithm, to check the performance of GSK on the feature selection problem a binary-coded GSK algorithm is proposed. It is the first binary variant of GSK to solve the feature selection problem (FS-BGSK). FS-BGSK has two main requisites: a binary beginners–intermediate (junior) gaining and sharing stage and a binary intermediate–experts (senior) gaining and sharing stage. These two requisites enable FS-BGSK to explore the search space and intensify the exploitation tendency efficiently and effectively. From the literature, the following observations on choosing the population size can be made: the population size may be different for every problem [8], it can be based on the dimension of the problem [40], and it may be varied or fixed throughout the optimization process according to the problem [5, 21]. To get rid of these conditions, Mohamed et al. [37] proposed an adaptive guided differential evolution algorithm with a population size reduction technique which reduces the population size gradually. Thus, choosing the appropriate population size is an arduous task. Hence, to overcome this difficulty and to enhance the performance of FS-BGSK, a population reduction technique is applied to FS-BGSK, named FS-pBGSK. The proposed approaches are employed on twenty-two benchmark feature selection datasets, and the obtained results are compared with seven other feature selection algorithms, that is, the binary-coded differential evolution algorithm (BDE) [27], binary-coded particle swarm optimization algorithm (BPSO) [9], binary bat algorithm (BBA) [42], binary grey wolf optimizer (bGWO) [17], binary ant lion optimization (BALO) [16], binary dragonfly algorithm (BDA) [35] and binary salp swarm algorithm (BSSA) [19].

The organization of the paper is as follows: Sect. 2 describes the fundamentals of the feature selection problem and the GSK algorithm, Sect. 3 presents the proposed algorithms and Sect. 4 describes the proposed algorithm for the feature selection problem. The experimental results and discussion are shown in Sect. 5, which is followed by concluding remarks and future work in Sect. 6.


Table 1 Algorithms used for the feature selection problem in machine learning (Algorithm: Instructions (Author and references))

Evolution-based algorithms
GA: Presented a genetic algorithm for feature selection problem and shown the full validation on datasets (Leardi [32])
Tabu: Preferred tabu search algorithm and used pattern classifier to select the best features in feature selection problems (Zhang and Sun [59])
GP: Presented genetic programming to select the best features and build the classifier using selected features (Muni et al. [41])
FSC: A feature selection and construction method is proposed using grammatical evolution algorithm, by mapping of original attributes to the artificial attributes with different classifiers (Gavrilis et al. [23])
BDE: Proposed a new algorithm, discrete binary differential evolution, and solved feature selection problem using support vector machine (SVM), C&R trees and RBF network (He et al. [27])
ICA: The imperialist competitive algorithm has been used for selecting the optimal features of different rice, which is based on the bulk of sample images (Rad et al. [45])
HBSA: Hybrid backtracking search algorithm has been proposed and used with extreme learning machine to predict the wind speed (Zhang et al. [58])

Swarm-based algorithms
ACOSVM: Presented ant colony optimization with support vector machine for face recognition (Yan and Yua [56])
AFSA: Introduced artificial fish swarm algorithm with neural network classifier in the feature selection process (Zhang et al. [60])
MA: Proposed hybrid filter and wrapper methods with memetic algorithms in classification (Zhu et al. [68])
PSO: Introduced a particle swarm optimization with multi-objectives for feature selection problem (Xue et al. [54])
BBA: Binary-coded bat algorithm has been proposed with optimum-path forest classifier to get the optimal set of features (Nakamura et al. [42])
ABC: Employed artificial bee colony algorithm to feature selection problem with the perturbation parameter as a search mechanism for new features (Schiezaro and Pedrini [49])
BCS: Proposed a binary version of the cuckoo search algorithm for theft detection in power distribution systems (Rodrigues et al. [47])
MAKHA: Based on chicken movement, a hybrid monkey algorithm with krill herd algorithm is presented to obtain the optimal feature subsets (Hafez et al. [24])
CSO: Proposed chicken swarm optimization for finding the optimal feature subsets in feature selection problem (Hafez et al. [26])
bGWO: Two approaches of binary grey wolf optimizer are proposed and hired for the classification (Emary et al. [17])
ISFLA: Presented with an improved version of shuffled frog leaping algorithm and applied to feature selection in biomedical datasets (Hu et al. [29])
ICSO: Proposed improved cat swarm optimization algorithm and used for big data classification (Lin et al. [33])
MFO: Proposed a new moth flame algorithm which based on the motion of moths to solve feature selection problems (Zawbaa et al. [57])
FOA–SVM: Used fruit fly algorithm with SVM as a classifier in medical diagnosis (Shen et al. [50])
BALO: Based on two new approaches, the ant lion optimizer (based on the hunting process of ant lions) is employed on the feature selection domain (Emary et al. [16])
Rc-BBFA: Presented a binary form of firefly algorithm as return-cost-based firefly algorithm and applied to feature selection problems (Zhang et al. [65])
BDO: Binary version of dragonfly algorithm is presented for getting the maximum classification accuracy (Mafarja et al. [35])
WOA–SA: A hybrid approach of whale optimization algorithm and simulated annealing algorithm has been proposed and applied to feature selection problems (Mafarja and Mirjalili [36])
BCSA: To solve the feature selection problem, V-shaped transfer function is used to develop the binary crow search algorithm (Souza et al. [13])
BSSA: A crossover operator is employed on salp swarm algorithm with V- and S-shaped transfer function, and then, it is applied to feature selection problem (Faris et al. [18])
BMNABC: Binary multi-neighbourhood artificial bee colony algorithm is proposed in which a new probability function is imposed and used for classification (Beheshti [4])
PSO–K-NN: Proposed particle swarm optimization with K-NN classifier to perform the face recognition (Sasirekha and Thangavel [48])
BGWO: Binary grey wolf optimizer with elite-based crossover has been proposed and used with different classifiers such as SVM, K-NN, decision trees and naive Bayes for Arabic text classification (Chantar et al. [7])
ACOISA: For SVM speed optimization, ant colony optimization instance selection algorithm has been proposed, which is based on boundary detection and boundary instance selection (Akinyelu et al. [1])
NBGOA: A new binary variant of grasshopper optimization algorithm (based on behaviours of grasshoppers) is proposed for finding out the small size subset of features (Hichem et al. [28])
GOA + SVM: Grasshopper optimization approach has been used with SVM and applied to biomedical datasets to select optimal features (Ibrahim et al. [30])
CHHO: Proposed chaotic sequence-based Harris hawks optimizer for data clustering (Singh [51])

Physics-based algorithms
HS: The parameter adaption scheme is developed for harmony search and applied to feature selection domain (Diao and Shen [14])
IBGSA: An improved version of gravitational search algorithm with transfer function is developed to find out optimal feature subsets (Rashedi and Nezamabadi-pour [46])
WC-RSAR: Water cycle algorithm for attribute reduction has been proposed and applied to rough set theory (Jabbar and Zaindin [31])
SCA: Presented a sine–cosine optimization algorithm to extract the best feature selection from the original feature datasets (Hafez et al. [25])
MVO–SVM: Proposed a robust technique based on the multi-verse optimizer for selecting the features and optimizing the parameters of SVM (Faris et al. [19])
BBHA: Developed a binary black hole algorithm and applied to biological datasets in feature selection problems (Pashaei and Aydin [44])

Human-related algorithms
FS-BTLBO: Developed teaching learning-based optimization algorithm for Wisconsin diagnosis breast cancer with different classifiers (Allam and Nandhini [3])
BBSOFS: Proposed brain storm optimization algorithm with individual clustering technology and two individual updating mechanisms for feature selection problems (Zhang et al. [64])

2 Preliminaries

This section describes some fundamentals of the feature selection problem and the GSK algorithm.

2.1 Feature selection

A dataset comprises instances (rows), features (columns) and classes, and the problem is to classify the data or to categorize unseen data into their corresponding class. Problems of this kind are known as feature selection problems. The foremost motive is to select the optimal subset of the dataset, which improves the classification accuracy or minimizes the error rate. In the original dataset, some of the features are of no use, irrelevant or redundant, which affects the performance of the classifier in the negative direction [34]. Therefore, to enhance the performance of the classifier and to reduce the size of a dataset, feature selection problems are formulated.

Assume $A$ is the original dataset with $m$ instances and $d$ features, and let $D \subseteq \mathbb{R}^N$ be the set that contains all $d$ features. The working mechanism of feature selection is to select an optimal feature subset from $D$ such that the objective function $f(X)$ is optimized. We assume a binary encoding approach to represent the solution:

$X = (x_{i1}, x_{i2}, \ldots, x_{id}), \quad i = 1, 2, \ldots, m, \quad x_{ij} \in \{0, 1\}$

In the solution, $x_{ij} = 0$ means the $j$th feature is not selected, and $x_{ij} = 1$ implies that the $j$th feature is selected. The mathematical formulation of the feature selection problem is:

$\max \text{ or } \min \; f(X)$   (1)

s.t.

$X = (x_1, x_2, \ldots, x_d), \quad x_j \in \{0, 1\}$
$1 \le |X| \le |D|$
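To make the wrapper formulation above concrete, the following is a minimal sketch of how a candidate binary feature vector can be scored, assuming scikit-learn's KNeighborsClassifier as the evaluator (as used later in this paper); the function name, the toy data and the train/validation split are illustrative only, not the authors' implementation.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def error_rate(mask, X_train, y_train, X_valid, y_valid, k=5):
    """Score one binary feature mask with a K-NN wrapper evaluation.

    mask : 1-D 0/1 array of length d; a 1 keeps the corresponding feature.
    Returns the classification error rate on the validation split
    (an empty mask is treated as the worst possible solution).
    """
    selected = np.flatnonzero(mask)
    if selected.size == 0:          # no feature selected: invalid subset
        return 1.0
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train[:, selected], y_train)
    accuracy = knn.score(X_valid[:, selected], y_valid)
    return 1.0 - accuracy

# toy usage with random data (stand-in for a UCI dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 10))
y = (X[:, 0] + X[:, 3] > 0).astype(int)   # only features 0 and 3 matter here
mask = np.array([1, 0, 0, 1, 0, 0, 0, 0, 0, 0])
print(error_rate(mask, X[:80], y[:80], X[80:], y[80:]))
```

Any binary optimizer, including the BGSK variants proposed below, can plug such an evaluation in as the objective $f(X)$ of Eq. (1).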


2.2 K-nearest neighbour classifier (K-NN)

The K-NN classifier is a nonparametric method used for classification. It is a supervised learning algorithm that classifies an unknown instance by calculating the distance between the given unknown instance and its nearest K neighbours. The K-NN classifier is a very commonly used and simple method to classify features, and it is easy to understand and implement. The most commonly used distance measure is the Euclidean distance, given as

$|X_1 - X_2| = \left[ \sum_{i=1}^{d} (x_{1i} - x_{2i})^2 \right]^{1/2}$   (2)

where $|\cdot|$ denotes the distance between $X_1$ and $X_2$, which are points in $d$ dimensions.

2.3 GSK algorithm for continuous variables

An optimization problem is formulated as

$\min f(X), \quad X = [x_1, x_2, \ldots, x_d], \quad x_k \in [L_k, U_k], \; k = 1, 2, \ldots, d$

where $f$ denotes the objective function, $X = [x_1, x_2, \ldots, x_d]$ are the decision variables, $L_k, U_k$ are the lower and upper bounds of the decision variables, respectively, and $d$ represents the dimension of individuals.

In recent years, a novel human-based optimization algorithm, the gaining–sharing knowledge-based optimization algorithm (GSK) [39], has been developed. It follows the concept of gaining and sharing knowledge throughout the human lifetime. GSK mainly relies on two important stages:

(i) Beginners–intermediate gaining and sharing stage (junior or early middle stage).
(ii) Intermediate–experts gaining and sharing stage (senior or middle later stage).

In the early middle stage, or beginners–intermediate gaining and sharing stage, it is not possible to acquire knowledge from social media or friends. An individual gains knowledge from known persons such as family members, relatives or neighbours. Due to lack of experience, these people want to share their thoughts or gained knowledge with other people who may or may not be from their networks, and they do not have enough experience to differentiate others into good or bad categories.

Contrarily, in the middle later stage, or intermediate–experts gaining–sharing stage, individuals gain knowledge from their large networks such as social media friends and colleagues. These people have much experience and a great ability to categorize people into good or bad classes. Thus, they can share their knowledge or skills with the most suitable persons so that they can enhance their skills. The GSK process described above can be formulated mathematically in the following steps:

Step 1: In the first step, the number of persons is assumed (the population size NP). Let $x_t \; (t = 1, 2, \ldots, NP)$ be the individuals of a population, $x_t = (x_{t1}, x_{t2}, \ldots, x_{td})$, where $d$ is the number of branches of knowledge assigned to an individual, and let $f_t \; (t = 1, 2, \ldots, NP)$ be the corresponding objective function values.

To obtain a starting solution for the optimization problem, the initial population is created randomly within the boundary constraints as

$x^0_{tk} = L_k + \text{rand}_k \cdot (U_k - L_k)$   (3)

where $\text{rand}_k$ denotes a uniformly distributed random number in the range 0 to 1.

Step 2: The dimensions of each stage are computed through the following formulas:

$d_{\text{junior}} = d \times \left( \dfrac{\text{Gen}_{\max} - G}{\text{Gen}_{\max}} \right)^{K}$   (4)

$d_{\text{senior}} = d - d_{\text{junior}}$   (5)

where $K (> 0)$ denotes the knowledge rate that governs the experience rate, $d_{\text{junior}}$ and $d_{\text{senior}}$ represent the dimensions for the junior and senior stage, respectively, $\text{Gen}_{\max}$ is the maximum number of generations, and $G$ denotes the generation number.

Step 3: Beginners–intermediate gaining–sharing stage: During this stage, early-aged people gain knowledge from their small networks and share their views with other people who may or may not belong to their group. The individuals are updated as follows:

(i) According to the objective function values, the individuals are arranged in ascending order as $x_{\text{best}}, \ldots, x_{t-1}, x_t, x_{t+1}, \ldots, x_{\text{worst}}$.

(ii) For every $x_t \; (t = 1, 2, \ldots, NP)$, select the nearest better individual $(x_{t-1})$ and the nearest worse individual $(x_{t+1})$ to gain knowledge, and also select a random individual $(x_R)$ to share knowledge. The individuals are then updated using the pseudocode presented in Algorithm 1, in which $k_f (> 0)$ is the knowledge factor.
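The two scheduling quantities of Eqs. (4)-(5) and the initialization of Eq. (3) are easy to reproduce; the short sketch below follows those formulas directly, with illustrative variable names and an illustrative default knowledge rate, and is not the authors' code.

```python
import numpy as np

def initial_population(NP, d, lower, upper, rng):
    """Random starting population inside the box constraints, Eq. (3)."""
    return lower + rng.random((NP, d)) * (upper - lower)

def stage_dimensions(d, G, Gen_max, K=10.0):
    """Split the d dimensions between the junior and senior stages.

    Follows Eqs. (4)-(5): the junior share shrinks from d towards 0
    as the generation counter G approaches Gen_max. K > 0 is the
    knowledge rate (problem-dependent; 10 is only an illustrative value).
    """
    d_junior = int(round(d * ((Gen_max - G) / Gen_max) ** K))
    d_senior = d - d_junior
    return d_junior, d_senior

rng = np.random.default_rng(1)
pop = initial_population(NP=20, d=30, lower=-5.0, upper=5.0, rng=rng)
for G in (0, 50, 100):
    print(G, stage_dimensions(d=30, G=G, Gen_max=100))
```

Early in the run almost all dimensions are updated by the junior stage; towards the end the senior stage takes over, which is exactly the behaviour the two stages are meant to encode.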


Step 4: Intermediate–experts gaining–sharing stage: This stage comprises the impact and effect of other people (good or bad) on an individual. The update of the individual is determined as follows:

(i) The individuals are classified into three categories (best, middle and worst) after sorting them in ascending order (based on the objective function values): best individuals $= 100p\% \; (x_{\text{pbest}})$, middle individuals $= d - 2 \times 100p\% \; (x_{\text{middle}})$ and worst individuals $= 100p\% \; (x_{\text{pworst}})$.

(ii) For every individual $x_t$, choose two random vectors from the top and bottom $100p\%$ individuals for the gaining part, and a third one (a middle individual) for the sharing part, where $p \in [0, 1]$ is the percentage of the best and worst classes. The new individual is then updated through the pseudocode presented in Algorithm 2.

3 Proposed methodology

The two approaches of the GSK algorithm are presented in the following subsections.

3.1 Binary gaining–sharing knowledge-based optimization algorithm

To solve problems in binary space, a novel binary gaining–sharing knowledge-based optimization algorithm (BGSK) is proposed. In BGSK, the initialization, the dimensions of the stages and the working mechanism of both stages (the beginners–intermediate and intermediate–experts gaining–sharing stages) are introduced over binary space, and the remaining algorithm stays the same as before. The working mechanism of BGSK is presented in the following subsections.

3.2 Binary initialization

The initial population is obtained in GSK using Eq. (3); in the binary initialization, it is generated using Eq. (6):

$x^0_{tk} = \text{round}(\text{rand}(0, 1))$   (6)

where the round operator converts the decimal number into the nearest binary number.
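As a small illustration of Eq. (6), the binary population can be generated by rounding uniform draws, as in the hedged sketch below (names are illustrative, not the authors' code).

```python
import numpy as np

def binary_initial_population(NP, d, rng):
    """Binary initialization of Eq. (6): round uniform [0, 1) draws to 0/1."""
    return np.round(rng.random((NP, d))).astype(int)

rng = np.random.default_rng(2)
print(binary_initial_population(NP=4, d=8, rng=rng))
```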



3.3 Evaluate the dimensions of stages

Before proceeding further, the dimensions of each stage are computed through the following formulas:

$d_{\text{junior}} = d \times \left( 1 - \dfrac{\text{NFE}}{\text{MaxNFE}} \right)^{K}$   (7)

$d_{\text{senior}} = d - d_{\text{junior}}$   (8)

where $K (> 0)$ denotes the knowledge rate, which is randomly generated, NFE denotes the current number of function evaluations and MaxNFE represents the maximum number of function evaluations.

3.4 Binary beginners–intermediate (junior) gaining and sharing stage

The binary beginners–intermediate gaining and sharing stage is based on the original GSK with $k_f = 1$. The individuals were updated in the original GSK using the pseudocode of Algorithm 1, which contains two cases. These two cases are defined for the binary stage as follows:

Case 1 When $f(x_R) < f(x_t)$: There are three different vectors $(x_{t-1}, x_{t+1}, x_R)$, which can take only two values (0 and 1). Therefore, a total of $2^3$ combinations are possible, which are listed in Table 2. These eight combinations are categorized into two subcases [(a) and (b)], and each subcase has four combinations. The results of every possible combination are presented in Table 2.

Subcase (a) If $x_{t-1}$ is equal to $x_{t+1}$, the result is equal to $x_R$.

Subcase (b) When $x_{t-1}$ is not equal to $x_{t+1}$, the result is the same as $x_{t-1}$, taking $-1$ as 0 and 2 as 1.

The mathematical formulation of Case 1 is as follows:

$x^{\text{new}}_{tk} = \begin{cases} x_R, & \text{if } x_{t-1} = x_{t+1} \\ x_{t-1}, & \text{if } x_{t-1} \ne x_{t+1} \end{cases}$   (9)

Case 2 When $f(x_R) \ge f(x_t)$: There are four different vectors $(x_{t-1}, x_t, x_{t+1}, x_R)$ that take only two values (0 and 1). Thus, a total of $2^4 = 16$ combinations are possible, which are presented in Table 3. The 16 combinations are divided into two subcases [(c) and (d)], in which (c) and (d) have four and twelve combinations, respectively.

Subcase (c) If $x_{t-1}$ is not equal to $x_{t+1}$, but $x_{t+1}$ is equal to $x_R$, the result is equal to $x_{t-1}$.

Subcase (d) If any of the conditions $x_{t-1} = x_{t+1} \ne x_R$, $x_{t-1} \ne x_{t+1} \ne x_R$ or $x_{t-1} = x_{t+1} = x_R$ arises, the result is equal to $x_t$, considering $-1$ and $-2$ as 0, and 2 and 3 as 1.

The mathematical formulation of Case 2 is:

$x^{\text{new}}_{tk} = \begin{cases} x_{t-1}, & \text{if } x_{t-1} \ne x_{t+1} = x_R \\ x_t, & \text{otherwise} \end{cases}$   (10)

Table 2 Results of the binary beginners–intermediate gaining and sharing stage, Case 1, with $k_f = 1$ (columns: $x_{t-1}$, $x_{t+1}$, $x_R$, Result, Modified result)

Subcase (a)
0 0 0 | 0 | 0
0 0 1 | 1 | 1
1 1 0 | 0 | 0
1 1 1 | 1 | 1
Subcase (b)
1 0 0 | 1 | 1
1 0 1 | 2 | 1
0 1 0 | -1 | 0
0 1 1 | 0 | 0

Table 3 Results of the binary beginners–intermediate gaining and sharing stage, Case 2, with $k_f = 1$ (columns: $x_{t-1}$, $x_t$, $x_{t+1}$, $x_R$, Result, Modified result)

Subcase (c)
1 1 0 0 | 3 | 1
1 0 0 0 | 1 | 1
0 1 1 1 | 0 | 0
0 0 1 1 | -2 | 0
Subcase (d)
0 0 0 0 | 0 | 0
0 1 0 0 | 2 | 1
0 0 1 0 | -1 | 0
0 0 0 1 | -1 | 0
1 0 1 0 | 0 | 0
1 0 0 1 | 0 | 0
0 1 1 0 | 1 | 1
0 1 0 1 | 1 | 1
1 1 1 0 | 2 | 1
1 0 1 1 | -1 | 0
1 1 0 1 | 2 | 1
1 1 1 1 | 1 | 1
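Since the modified results in Tables 2-3 collapse to simple per-dimension rules, they can be written compactly in code. The sketch below is a hedged, illustrative rendering of Eqs. (9)-(10) and, for comparison, of the senior-stage rule formalized in Sect. 3.5 below (Eqs. (11)-(12)); function and argument names are the author's of this sketch, not of the original pseudocode.

```python
def junior_update(x_prev, x_t, x_next, x_r, f_r_better):
    """Binary junior gaining-sharing rule for one dimension (kf = 1).

    x_prev, x_next : bits of the nearest better/worse neighbours (x_{t-1}, x_{t+1})
    x_r            : bit of the randomly chosen sharing individual
    f_r_better     : True when f(x_R) < f(x_t) -> Case 1, else Case 2
    """
    if f_r_better:                                   # Case 1, Eq. (9)
        return x_r if x_prev == x_next else x_prev
    # Case 2, Eq. (10)
    return x_prev if (x_prev != x_next and x_next == x_r) else x_t

def senior_update(x_pbest, x_middle, x_pworst, x_t, f_mid_better):
    """Binary senior gaining-sharing rule for one dimension (kf = 1), Eqs. (11)-(12)."""
    if f_mid_better:                                 # Case 1, Eq. (11)
        return x_middle if x_pbest == x_pworst else x_pbest
    # Case 2, Eq. (12)
    return x_pbest if (x_pbest != x_pworst and x_pworst == x_middle) else x_t

# reproduce one row of Table 2: x_{t-1}=1, x_{t+1}=0, x_R=1 -> modified result 1
print(junior_update(x_prev=1, x_t=0, x_next=0, x_r=1, f_r_better=True))
```

Every row of Tables 2-5 can be checked against these two functions in the same way.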
3.5 Binary intermediate–experts (senior) gaining and sharing stage

The working mechanism of the binary intermediate–experts gaining and sharing stage is the same as that of the binary junior gaining and sharing stage, with $k_f = 1$. The individuals are updated in the original senior gaining–sharing stage using the pseudocode of Algorithm 2, which contains two cases.


The two cases are further modified for the binary intermediate–experts gaining–sharing stage in the following manner:

Case 1 When $f(x_{\text{middle}}) < f(x_t)$: It contains three different vectors $(x_{\text{pbest}}, x_{\text{middle}}, x_{\text{pworst}})$, which assume only binary values (0 and 1); thus, a total of eight combinations are possible to update the individuals. These eight combinations are classified into two subcases [(a) and (b)], and each subcase contains four different combinations. Table 4 presents the obtained results of this case.

Subcase (a) If $x_{\text{pbest}}$ is equal to $x_{\text{pworst}}$, the result is equal to $x_{\text{middle}}$.

Subcase (b) On the other hand, if $x_{\text{pbest}}$ is not equal to $x_{\text{pworst}}$, the result is equal to $x_{\text{pbest}}$, with $-1$ and 2 mapped to their nearest binary values (0 and 1, respectively).

Case 1 is mathematically formulated in the following way:

$x^{\text{new}}_{tk} = \begin{cases} x_{\text{middle}}, & \text{if } x_{\text{pbest}} = x_{\text{pworst}} \\ x_{\text{pbest}}, & \text{if } x_{\text{pbest}} \ne x_{\text{pworst}} \end{cases}$   (11)

Case 2 When $f(x_{\text{middle}}) \ge f(x_t)$: It consists of four different binary vectors $(x_{\text{pbest}}, x_{\text{middle}}, x_{\text{pworst}}, x_t)$, and with the values of each vector a total of sixteen combinations arise. The sixteen combinations are also divided into two subcases [(c) and (d)], which contain four and twelve combinations, respectively. The subcases are explained in detail in Table 5.

Subcase (c) When $x_{\text{pbest}}$ is not equal to $x_{\text{pworst}}$, and $x_{\text{pworst}}$ is equal to $x_{\text{middle}}$, the obtained result is equal to $x_{\text{pbest}}$.

Subcase (d) If any case other than (c) arises, the obtained result is equal to $x_t$, taking $-2$ and $-1$ as 0, and 2 and 3 as 1.

The mathematical formulation of Case 2 is given as

$x^{\text{new}}_{tk} = \begin{cases} x_{\text{pbest}}, & \text{if } x_{\text{pbest}} \ne x_{\text{pworst}} = x_{\text{middle}} \\ x_t, & \text{otherwise} \end{cases}$   (12)

The pseudocode for BGSK is shown in Algorithm 3.

Table 4 Results of the binary intermediate–experts gaining and sharing stage, Case 1, with $k_f = 1$ (columns: $x_{\text{pbest}}$, $x_{\text{pworst}}$, $x_{\text{middle}}$, Result, Modified result)

Subcase (a)
0 0 0 | 0 | 0
0 0 1 | 1 | 1
1 1 0 | 0 | 0
1 1 1 | 1 | 1
Subcase (b)
1 0 0 | 1 | 1
1 0 1 | 2 | 1
0 1 0 | -1 | 0
0 1 1 | 0 | 0

3.6 Population reduction on BGSK (pBGSK)

As the population size is one of the most important parameters of an optimization algorithm, it cannot be fixed throughout the optimization process. To explore the solutions of the optimization problem, the population size must first be large, and to refine the quality of the solutions and enhance the performance of the algorithm, a decrement in the population size is then required.

Mohamed et al. [37] used a nonlinear population reduction formula for a differential evolution algorithm to solve global numerical optimization problems. Based on this formula, we use the following framework to reduce the population size gradually:

"
  Table 5 Results of binary intermediate–experts gaining and sharing
NPGþ1 ¼ round NPmin  NPmax stage of Case 2 with kf ¼ 1

! # ð13Þ xpbest xt xpworst xmiddle Results Modified results


NFE
 þ NPmax Subcase (c) 1 1 0 0 3 1
Max NFE
1 0 0 0 1 1
where NPGþ1 denotes the modified (new) population size in 0 1 1 1 0 0
the next generation, NPmin and NPmax are the minimum and 0 0 1 1 -2 0
maximum population size, respectively, NFE is current Subcase (d) 0 0 0 0 0 0
number of function evaluation and Max NFE is the 0 1 0 0 2 1
assumed maximum number of function evaluations. Taking 0 0 1 0 -1 0
into consideration that NPmin is assumed as 12, we need at 0 0 0 1 -1 0
least 2 elements in best and worst partitions. The main 1 0 1 0 0 0
advantage of applying population reduction technique to 1 0 0 1 0 0
BGSK is to discard the infeasible or worst solutions from 0 1 1 0 1 1
the initial phase without influencing the exploration capa- 0 1 0 1 1 1
bility. In the later stage, it emphasizes the exploitation 1 1 1 0 2 1
tendency by deleting the worst solutions from the search 1 0 1 1 -1 0
space. 1 1 0 1 2 1
Note: In this study, population size reduction technique 1 1 1 1 1 1
is combined with proposed BGSK which is named as
pBGSK and the pseudocode for pBGSK is shown in
Algorithm 4.

4 Binary GSK for the feature selection problem

To find the optimal subset of features, two wrapper approaches, FS-BGSK and FS-pBGSK, are proposed, in which BGSK and pBGSK act as the searching algorithm and the K-NN classifier is considered as the evaluator. In this study, a feature subset is denoted as a binary vector: if an entry is 1, the corresponding feature is selected, otherwise it is not. The optimization problem is solved based on the goodness of the feature subset, which depends on two conditions:

(i) To maximize the classification accuracy.
(ii) To minimize the number of features.

The above two criteria must hold simultaneously; therefore, the multi-objective function is formulated as a single objective:

$\min Z(\text{fitness}) = c_1 E_{\text{Rate}} + c_2 N_{\text{features}}$   (14)

where $E_{\text{Rate}}$ represents the classification error rate (the complement of the classification accuracy), which is calculated using the K-NN classifier,

$N_{\text{features}} = \dfrac{\text{Number of selected features}}{\text{Total number of features}}$

and $c_1$ and $c_2$ are two parameters corresponding to the two criteria, with $c_2 = 1 - c_1$ and $c_1 \in [0, 1]$. In this study, $c_1$ is taken according to [2, 16]. An overview of the proposed approaches for the feature selection problem is drawn in Fig. 1.
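The two ingredients that FS-pBGSK adds on top of the wrapper evaluation, the weighted fitness of Eq. (14) and the population-size schedule of Eq. (13), can be sketched as follows. This is a hedged illustration, not the authors' code: `error_rate` stands for any wrapper evaluation such as the K-NN one sketched in Sect. 2.1, and the default values ($c_1 = 0.99$, $NP_{\min} = 12$, $NP_{\max} = 100$, MaxNFE = 20,000) are taken from the text and Table 7.

```python
import numpy as np

def fitness(mask, error_rate, c1=0.99):
    """Weighted single-objective fitness of Eq. (14).

    error_rate : classification error of the wrapper evaluation for this mask.
    The second term is the fraction of selected features, weighted by c2 = 1 - c1.
    """
    n_ratio = np.count_nonzero(mask) / mask.size
    return c1 * error_rate + (1.0 - c1) * n_ratio

def next_population_size(nfe, max_nfe, np_min=12, np_max=100):
    """Population size for the next generation, Eq. (13): shrinks from np_max to np_min."""
    return int(round((np_min - np_max) * (nfe / max_nfe) + np_max))

mask = np.array([1, 0, 1, 1, 0, 0, 0, 0, 0, 0])
print(fitness(mask, error_rate=0.12))
for nfe in (0, 10_000, 20_000):
    print(nfe, next_population_size(nfe, max_nfe=20_000))
```

At each generation the worst-ranked individuals are dropped until the population matches the new size, which is how pBGSK trades exploration for exploitation as the budget is consumed.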


Fig. 1 Overview of the proposed approaches for feature selection and classification (figure)

5 Experimental results and discussion

This section presents a detailed description of the datasets, the parameter settings, the evaluation criteria and the numerical results with their statistical comparison.

5.1 Data description

The proposed approaches are tested on 22 benchmark datasets taken from the UCI repository [20]. The considered datasets are divided into three categories (small, medium and large scale) based on their dimensions. The data contain various numbers of instances, ranging from 32 to 5000, and different numbers of features, ranging from 9 to 856. An equal portion of each dataset is taken for the three sets, training, testing and validation, in a cross-validation manner. The K-NN classifier is used to produce the optimal feature subset with $K = 5$.

The benchmark datasets of the feature selection problem are categorized in the following manner:

S: Small-dimensional sets ($S_1$–$S_8$)
M: Medium-dimensional sets ($M_1$–$M_9$)
L: Large-dimensional sets ($L_1$–$L_5$)

The representation and the detailed description of each dataset are presented in Table 6.

Table 6 Description of datasets (Label, Name, No. of instances, No. of features)

Small-dimensional datasets (S): $d \le 20$
S1 Tic-Tac-Toe 958 9
S2 Breast cancer 699 10
S3 Heart 270 13
S4 Wine 179 13
S5 House Vote 435 16
S6 Zoo 101 16
S7 Lymphography 148 18
S8 Hepatitis 155 19

Medium-dimensional datasets (M): $21 \le d \le 90$
M1 Waveform 5000 21
M2 German 1000 24
M3 Wdbc 569 30
M4 Ionosphere 351 34
M5 Dermatology 366 34
M6 Soybean 47 34
M7 Lung cancer 32 56
M8 Spambase 4601 57
M9 Sonar 208 60

Large-dimensional datasets (L): $d \ge 100$
L1 Hillvalley 606 100
L2 Clean 1 476 165
L3 Semeion 1593 265
L4 Arrhythmia 452 279
L5 CNAE 1080 856

5.2 Benchmark algorithms and parameter settings

To check and compare the performance of the proposed approaches on the feature selection datasets, seven state-of-the-art algorithms have been considered: the binary differential evolution algorithm (BDE) [27], binary particle swarm optimization algorithm (BPSO) [9], binary bat algorithm (BBA) [42], binary grey wolf optimizer (bGWO) [17], binary ant lion optimization (BALO) [16], binary dragonfly algorithm (BDA) [35] and binary salp swarm algorithm (BSSA) [19]. The algorithms were run on a personal computer with an Intel Core i5 @ 2.50 GHz and 8 GB RAM in MATLAB R2015a under the same conditions, and the obtained results were recorded throughout the process. The parameter values for each algorithm are presented in Table 7.


Table 7 Parameter values for the different optimizers (Parameter: Value)

r, pulse rate for BBA: 0.9
A, loudness for BBA: 0.5
Qmin, minimum frequency for BBA: 0
Qmax, maximum frequency for BBA: 2
Cr, crossover probability for BDE: 0.95
c1, cognitive factor: 2
c2, social factor: 2
Wmax, maximum bound on inertia weight: 0.6
Wmin, minimum bound on inertia weight: 0.2
f, food attraction weight for BDA: 2 * rand
p, probability in FS-pBGSK and FS-BGSK: 0.1
kr, knowledge ratio in FS-pBGSK and FS-BGSK: 0.95
K for cross-validation: 5
Dimension: number of features in the dataset
NP, population size (dimension ≤ 20): 50
NP, population size (dimension > 20): 100
MaxNFE (dimension ≤ 20): 5000
MaxNFE (dimension > 20): 20,000
W, number of runs (dimension ≤ 20): 25
W, number of runs (dimension > 20): 10
c1, parameter in fitness function: 0.99
c2, parameter in fitness function: 0.01

5.3 Evaluation criteria

To evaluate the performance of each algorithm, assume that each algorithm was run W times and the results recorded. The comparison among the algorithms is then carried out by calculating the following measures, in which the fitness function is denoted by Z.

(a) Average fitness function value: the mean of the fitness values ($\text{avg}_Z$) over W runs, formulated by Eq. (15):

$\text{avg}_Z = \dfrac{1}{W} \sum_{i=1}^{W} Z_i$   (15)

where $Z_i$ is the optimal fitness value in the $i$th run.

(b) Average classification accuracy: the average of the optimal classifier accuracies obtained over W runs. If the optimal accuracy in the $i$th run is denoted $\text{Acc}_i$, the average classification accuracy $\text{avg}_{\text{Acc}}$ is computed through Eq. (16):

$\text{avg}_{\text{Acc}} = \dfrac{1}{W} \sum_{i=1}^{W} \text{Acc}_i$   (16)

(c) Average feature selection size: over W runs, the average selection size ($\text{avg}_{\text{feature}}$) is obtained from the length of the optimal solution relative to the total number of features (D). It is determined by Eq. (17):

$\text{avg}_{\text{feature}} = \dfrac{1}{W} \sum_{i=1}^{W} \dfrac{\text{length}(x)_i}{|D|}$   (17)

where $\text{length}(x)_i$ is the length of the selected feature subset in the $i$th run and $|D|$ is the total number of features.

(d) Standard deviation of fitness values: it presents the robustness and variation of the obtained fitness values of each algorithm over W runs and tells how much the optimal solutions deviate from the mean. A smaller standard deviation indicates that the algorithm converges to the same solution, whereas a larger value indicates more random solutions. The standard deviation ($\text{std}_Z$) is computed using Eq. (18):

$\text{std}_Z = \left[ \dfrac{1}{W-1} \sum_{i=1}^{W} \left( Z_i - \text{avg}_Z \right)^2 \right]^{1/2}$   (18)

where $Z_i$ is the optimal fitness value in the $i$th run and $\text{avg}_Z$ is the mean fitness value calculated in Eq. (15).

(e) Average computation time: the computational time (in seconds) taken by each algorithm over W runs. The average computation time $\text{avg}^{o}_{\text{Time}}$ for the $o$th optimizer is calculated using Eq. (19):

$\text{avg}^{o}_{\text{Time}} = \dfrac{1}{W} \sum_{i=1}^{W} (\text{Time})^{o}_{i}$   (19)

where $(\text{Time})^{o}_{i}$ denotes the time used by the $o$th algorithm in the $i$th run.

(f) Statistical tests: to investigate the performance of the algorithms and the quality of the solutions statistically, two nonparametric statistical tests are conducted: the Friedman test and the multi-problem Wilcoxon signed-rank test [22].

• Friedman test: it ranks all algorithms over all datasets, which indicates the performance of the algorithms. The null hypothesis states that "there is no significant difference among the algorithms", whereas the alternative hypothesis is that "the algorithms are significantly different over all datasets". The final decision is made on the obtained p value; if the obtained p value is less than 0.05 (the significance level), the null hypothesis is rejected.

• Multi-problem Wilcoxon signed-rank test: it is used to check the performance between pairs of algorithms over all datasets, based on the mean results of the solutions. In this test, $S^+$ denotes the sum of ranks over all datasets on which the first algorithm in the row performs better than the second one, and $S^-$ denotes the opposite of $S^+$; a larger rank sum indicates a larger discrepancy between the algorithms. The null hypothesis is "the algorithms are not significantly different with respect to mean results" and the alternative hypothesis is "the algorithms are significantly different with respect to mean results". In the results of this test, three notations are used: plus (+), the results of the first algorithm are significantly better than those of the second one; minus (−), the results of the first algorithm are significantly worse than those of the second one; approximate (≈), there is no significant difference between the two algorithms. The decision is taken on the p value; the null hypothesis is rejected if the obtained p value is less than the assumed significance level (5%).

5.4 Numerical results and discussion

The numerical results obtained by all optimizers are presented in Tables 8, 9, 10, 11 and 12, evaluated by the criteria described in the previous section.
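For completeness, the run-level aggregation of Eqs. (15)-(18) can be reproduced with a few lines of array arithmetic; the sketch below uses illustrative names and toy numbers only and is not the authors' evaluation script.

```python
import numpy as np

def summarize_runs(fitness_values, accuracies, subset_sizes, total_features):
    """Aggregate W independent runs as in Eqs. (15)-(18).

    fitness_values, accuracies, subset_sizes : 1-D sequences of length W.
    Returns the average fitness, average accuracy, average selected-feature
    ratio and the sample standard deviation of the fitness values (ddof=1,
    matching the 1/(W-1) factor of Eq. (18)).
    """
    fitness_values = np.asarray(fitness_values, dtype=float)
    return {
        "avg_Z": fitness_values.mean(),                                     # Eq. (15)
        "avg_Acc": np.mean(accuracies),                                     # Eq. (16)
        "avg_feature": np.mean(np.asarray(subset_sizes) / total_features),  # Eq. (17)
        "std_Z": fitness_values.std(ddof=1),                                # Eq. (18)
    }

# toy example for W = 5 runs on a dataset with 30 features
print(summarize_runs(
    fitness_values=[0.051, 0.049, 0.050, 0.048, 0.052],
    accuracies=[0.95, 0.95, 0.95, 0.96, 0.94],
    subset_sizes=[4, 5, 4, 4, 5],
    total_features=30,
))
```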

Table 8 Average fitness function values from the different optimizers


Datasets BALO BBA BDE bGWO BPSO BDA BSSA FS-BGSK FS-pBGSK
S1 0.2067 0.2240 0.2098 0.2082 0.2067 0.2022 0.2021 0.2097 0.1974
S2 0.0330 0.0373 0.0346 0.0314 0.0330 0.0278 0.0291 0.0299 0.0322
S3 0.1404 0.1827 0.1569 0.1562 0.1401 0.1482 0.1493 0.1498 0.1455
S4 0.0407 0.0616 0.0489 0.0519 0.0395 0.0446 0.0466 0.0409 0.0404
S5 0.0442 0.0901 0.0570 0.0624 0.0476 0.0471 0.0607 0.0438 0.0440
S6 0.0495 0.0766 0.0558 0.0559 0.0476 0.0498 0.0515 0.0521 0.0423
S7 0.4928 0.5410 0.5184 0.5243 0.4935 0.5014 0.5123 0.5037 0.4894
S8 0.2429 0.3257 0.2938 0.2953 0.2526 0.2533 0.2828 0.2690 0.2400
M1 0.1699 0.1891 0.1728 0.1736 0.1720 0.1719 0.1761 0.1701 0.1684
M2 0.2382 0.2761 0.2534 0.2596 0.2456 0.2449 0.2458 0.2462 0.2372
M3 0.0494 0.0601 0.0562 0.0539 0.0510 0.0514 0.0491 0.0515 0.0480
M4 0.0864 0.1431 0.1330 0.1281 0.1098 0.0790 0.1101 0.0945 0.0762
M5 0.0085 0.0372 0.0169 0.0206 0.0116 0.0111 0.0164 0.0108 0.0074
M6 0.0006 0.0157 0.0027 0.0031 0.0019 0.0378 0.0402 0.0009 0.0006
M7 0.1124 0.1527 0.1154 0.1215 0.1140 0.0637 0.0671 0.1128 0.0813
M8 0.0731 0.1382 0.0775 0.0800 0.0741 0.0710 0.0839 0.0754 0.0698
M9 0.1011 0.1910 0.1500 0.1441 0.1378 0.0977 0.1293 0.1182 0.0923
L1 0.3992 0.4643 0.4409 0.4415 0.4405 0.4445 0.4682 0.4071 0.3861
L2 0.0701 0.1453 0.0984 0.0900 0.1079 0.0581 0.1767 0.0565 0.0530
L3 0.0200 0.0472 0.0279 0.0257 0.0271 0.0184 0.0283 0.0134 0.0134
L4 0.2957 0.3621 0.3283 0.3257 0.3236 0.2893 0.3377 0.2871 0.2820
L5 0.0064 0.0074 0.0092 0.0086 0.0083 0.0078 0.0083 0.0057 0.0046


Table 9 Average accuracy obtained by the different algorithms


Datasets BALO BBA BDE bGWO BPSO BDA BSSA FS-BGSK FS-pBGSK
S1 0.7989 0.6792 0.7968 0.7978 0.7954 0.8041 0.8041 0.7989 0.8082
S2 0.9714 0.8074 0.9707 0.9736 0.9714 0.9768 0.9758 0.9751 0.9726
S3 0.8619 0.6717 0.8462 0.8471 0.8622 0.8542 0.8533 0.8524 0.8566
S4 0.9622 0.7853 0.9556 0.9520 0.9636 0.9587 0.9569 0.9622 0.9627
S5 0.9567 0.7985 0.9448 0.9396 0.9530 0.9536 0.9415 0.9567 0.9563
S6 0.9537 0.9624 0.9624 0.9490 0.9561 0.9537 0.9529 0.9513 0.9616
S7 0.5060 0.3613 0.4825 0.4763 0.5057 0.4973 0.4876 0.4952 0.5094
S8 0.7574 0.5703 0.7082 0.7062 0.7477 0.7472 0.7185 0.7313 0.7610
M1 0.8362 0.7543 0.8341 0.8334 0.8338 0.8336 0.8284 0.8358 0.8373
M2 0.7638 0.6742 0.7508 0.7444 0.7563 0.7620 0.7480 0.7557 0.7646
M3 0.9512 0.9158 0.9477 0.9495 0.9509 0.9314 0.9044 0.9495 0.9526
M4 0.9148 0.8778 0.8722 0.8767 0.8920 0.9135 0.8938 0.9074 0.9250
M5 0.9951 0.8847 0.9891 0.9858 0.9934 0.9923 0.9891 0.9934 0.9962
M6 1.0000 0.8708 1.0000 1.0000 1.0000 0.9625 0.9625 1.0000 1.0000
M7 0.8875 0.7500 0.8875 0.8813 0.8875 0.9375 0.9375 0.8875 0.9188
M8 0.9308 0.8655 0.9292 0.9270 0.9298 0.9322 0.9209 0.9284 0.9338
M9 0.9010 0.7240 0.8548 0.8606 0.8654 0.9038 0.8750 0.8846 0.9096
L1 0.6007 0.4071 0.5614 0.5604 0.5597 0.5545 0.5330 0.5927 0.6139
L2 0.9336 0.8008 0.9084 0.9164 0.8958 0.9454 0.8277 0.9475 0.9508
L3 0.9843 0.9576 0.9793 0.9812 0.9773 0.9856 0.9774 0.9905 0.9905
L4 0.7053 0.5881 0.6761 0.6783 0.6779 0.7112 0.6652 0.7137 0.7190
L5 0.9965 0.9970 0.9954 0.9956 0.9957 0.9944 0.9963 0.9961 0.9972

Table 10 Average selected feature ratio to the total number of features


Datasets BALO BBA BDE bGWO BPSO BDA BSSA FS-BGSK FS-pBGSK
S1 0.76 0.46 0.86 0.81 0.76 0.83 0.82 0.72 0.75
S2 0.47 0.46 0.57 0.53 0.47 0.49 0.51 0.52 0.50
S3 0.37 0.39 0.47 0.49 0.37 0.387 0.41 0.37 0.35
S4 0.33 0.43 0.49 0.44 0.34 0.36 0.40 0.35 0.34
S5 0.13 0.40 0.23 0.26 0.11 0.11 0.28 0.10 0.08
S6 0.37 0.48 0.54 0.40 0.41 0.39 0.49 0.39 0.42
S7 0.37 0.51 0.60 0.58 0.41 0.37 0.50 0.40 0.38
S8 0.27 0.39 0.49 0.44 0.28 0.30 0.41 0.30 0.34
M1 0.77 0.79 0.86 0.87 0.75 0.73 0.79 0.76 0.73
M2 0.43 0.43 0.67 0.66 0.43 0.43 0.56 0.43 0.42
M3 0.11 0.44 0.44 0.38 0.24 0.13 0.40 0.14 0.11
M4 0.20 0.45 0.64 0.60 0.30 0.21 0.49 0.28 0.19
M5 0.37 0.47 0.61 0.65 0.51 0.36 0.56 0.43 0.36
M6 0.06 0.46 0.28 0.31 0.19 0.07 0.32 0.09 0.06
M7 0.11 0.39 0.40 0.39 0.13 0.10 0.46 0.14 0.09
M8 0.46 0.51 0.74 0.78 0.46 0.38 0.57 0.45 0.43
M9 0.31 0.47 0.62 0.61 0.46 0.30 0.57 0.39 0.29
L1 0.39 0.44 0.67 0.63 0.46 0.41 0.58 0.39 0.38
L2 0.44 0.43 0.78 0.72 0.47 0.47 0.62 0.45 0.43
L3 0.44 0.53 0.74 0.71 0.47 0.41 0.59 0.39 0.39
L4 0.40 0.47 0.76 0.72 0.47 0.39 0.62 0.36 0.38
L5 0.29 0.31 0.46 0.42 0.41 0.23 0.46 0.18 0.18


Table 11 Standard deviation of the fitness values of all optimizers


Datasets BALO BBA BDE bGWO BPSO BDA BSSA FS-BGSK FS-pBGSK
S1 0.0104 0.0128 0.0118 0.0123 0.0104 0.0158 0.0124 0.0104 0.0137
S2 0.0064 0.0068 0.0064 0.0067 0.0080 0.0070 0.0072 0.0075 0.0063
S3 0.0174 0.0263 0.0197 0.0239 0.0195 0.0167 0.0188 0.0205 0.0169
S4 0.0186 0.0213 0.0206 0.0223 0.0186 0.0163 0.0163 0.0178 0.0185
S5 0.0099 0.0245 0.0207 0.0231 0.0133 0.0096 0.0124 0.0098 0.0093
S6 0.0322 0.0400 0.0451 0.0374 0.0344 0.0408 0.0444 0.0338 0.0319
S7 0.0346 0.0333 0.0382 0.0344 0.0334 0.0313 0.0334 0.0358 0.0261
S8 0.0292 0.0276 0.0379 0.0330 0.0301 0.0291 0.0238 0.0328 0.0274
M1 0.0048 0.0402 0.0045 0.0063 0.0055 0.0052 0.0066 0.0060 0.0036
M2 0.0104 0.0100 0.0122 0.0112 0.0114 0.0101 0.0135 0.0099 0.0056
M3 0.0102 0.0111 0.0119 0.0106 0.0101 0.0088 0.0089 0.0099 0.0082
M4 0.0217 0.0232 0.0160 0.0198 0.0207 0.0142 0.0217 0.0173 0.0078
M5 0.0054 0.0141 0.0079 0.0085 0.0063 0.0040 0.0057 0.0043 0.0053
M6 0.0001 0.0400 0.0012 0.0008 0.0002 0.1174 0.1173 0.0003 0.0001
M7 0.1743 0.1678 0.1736 0.1731 0.1741 0.1419 0.1529 0.1740 0.1304
M8 0.0089 0.0145 0.0064 0.0053 0.0059 0.0032 0.0025 0.0049 0.0012
M9 0.0322 0.0310 0.0310 0.0380 0.0340 0.0165 0.0310 0.0189 0.0162
L1 0.0294 0.0341 0.0332 0.0377 0.0327 0.0297 0.0312 0.0293 0.0280
L2 0.0086 0.0086 0.0112 0.0096 0.0113 0.0073 0.0089 0.0152 0.0060
L3 0.0044 0.0074 0.0051 0.0051 0.0056 0.0044 0.0053 0.0050 0.0038
L4 0.0352 0.0301 0.0309 0.0321 0.0302 0.0258 0.0167 0.0287 0.0254
L5 0.0010 0.0015 0.0016 0.0012 0.0013 0.0011 0.0020 0.0016 0.0007

Table 12 Average computational time for all different algorithms


Datasets BALO BBA BDE bGWO BPSO BDA BSSA FS-BGSK FS-pBGSK

S1 43.83 39.02 48.71 47.50 42.61 39.37 44.50 44.08 44.92


S2 32.34 31.14 34.45 55.53 33.34 31.10 48.18 32.41 47.14
S3 24.44 24.37 24.61 25.32 24.60 38.43 36.07 24.14 24.52
S4 27.22 28.99 26.37 26.15 27.87 26.51 26.21 25.67 26.51
S5 26.13 26.04 28.63 26.13 25.51 36.70 37.14 27.99 27.98
S6 28.20 28.40 27.73 29.33 27.03 24.82 23.33 28.11 27.16
S7 34.58 35.29 35.06 35.04 35.79 52.65 50.81 33.00 33.92
S8 27.51 27.70 27.53 28.58 27.73 22.70 21.64 26.67 27.04
M1 551.32 414.41 707.53 680.16 542.52 936.22 1031.86 617.20 615.41
M2 41.51 40.38 46.84 48.58 42.10 45.23 46.28 40.27 41.44
M3 29.70 31.20 31.67 32.45 31.37 192.35 211.90 26.79 27.29
M4 25.60 26.02 27.88 29.48 27.75 161.98 150.49 24.30 24.43
M5 28.46 30.79 27.16 29.32 29.72 156.66 166.34 25.84 25.78
M6 20.79 21.43 20.81 22.68 21.86 100.35 90.38 20.44 20.71
M7 39.32 40.55 39.40 44.73 41.67 104.73 92.29 38.08 38.92
M8 3407.74 3473.64 6213.21 5272.60 5347.35 4125.51 4667.61 3441.45 3496.28
M9 40.44 43.32 43.09 48.11 44.76 114.04 116.17 24.20 24.07
L1 189.48 196.07 231.08 240.18 201.24 216.02 290.55 172.18 178.32
L2 194.10 203.04 275.52 300.18 216.79 232.26 315.22 201.69 184.02
L3 2687.08 2907.62 3844.61 5094.56 2263.87 2043.30 3588.19 1772.33 1767.84
L4 385.33 447.09 451.64 588.51 425.95 591.59 803.22 267.05 274.54
L5 3891.63 4171.80 4704.51 3598.61 3425.73 4042.71 6454.84 2073.50 1778.36


Fig. 2 Convergence graphs for the small-dimensional datasets $S_1$–$S_8$ (figure)

The best results are shown in bold text. Table 8 presents the statistical mean of the fitness values of all algorithms. It indicates that the FS-pBGSK algorithm obtains the best average fitness values on 77% of the datasets. It shows the best results for the high-dimensional datasets ($L_1$–$L_5$); for example, on $L_5$ the obtained worst fitness value (0.005524) is much better than the average fitness values of the other algorithms. The reason is that the population size decreases linearly with the number of function evaluations, and both binary stages enable the FS-pBGSK algorithm to explore and exploit the search space and find the optimal solutions. On datasets $S_3$ and $S_4$, the BPSO algorithm shows the best results among all optimizers.

Moreover, the convergence graphs of all algorithms for the small-, medium- and large-dimensional datasets are drawn in Figs. 2, 3 and 4, respectively. The convergence graphs illustrate the performance of the algorithms. From the figures, it is noticed that both FS-pBGSK and FS-BGSK converge to the best solution on 82% of the datasets as compared to the other algorithms.


Fig. 3 Convergence graphs for the medium-dimensional datasets $M_1$–$M_9$ (figure)


Fig. 4 Convergence graphs for the high-dimensional datasets $L_1$–$L_5$ (figure)

Although the state-of-the-art algorithms converge faster than FS-pBGSK and FS-BGSK, they either converge prematurely or stagnate at an early stage of the optimization process. Thus, it can be concluded that both FS-pBGSK and FS-BGSK are able to balance the two contradictory aspects of exploration capability and exploitation tendency.

Table 9 shows the second evaluation criterion, i.e. the average classification accuracy of all algorithms. Out of 22 datasets, FS-pBGSK performs better than the other algorithms on 16 datasets. Precisely, on 14 datasets the FS-pBGSK algorithm has more than 90% accuracy in finding the optimal feature subset. It also obtains high accuracy mostly on the medium- and high-dimensional datasets ($M_1$–$M_9$ and $L_1$–$L_5$). BDA comes in second place, finding the best accuracy on 2 datasets. The main cause of the high accuracy obtained by FS-pBGSK is that it does not use any fixed population size: it gradually decreases the population size, so the FS-pBGSK algorithm does not get trapped in local optima and presents the best solutions.

For minimizing the fitness values, the second objective is to minimize the number of selected features. Table 10 outlines the ratio of selected features obtained by all comparative optimizers and the proposed approaches. It shows that the FS-pBGSK algorithm selects the minimal number of features as compared to the other algorithms on 16 datasets. BALO comes in second place, showing the best results on 4 small-dimensional datasets, and BBA presents its best results on only two datasets. This indicates that the FS-pBGSK algorithm has the ability to find the optimal solutions of feature selection problems.

To prove the robustness of the proposed approaches, Table 11 presents the standard deviation of the fitness values of all algorithms. On 82% of the datasets, the standard deviation provided by FS-pBGSK is smaller than that of the other algorithms; the other algorithms show more disparity between their fitness values. Therefore, the FS-pBGSK algorithm converges to the same solution over the different runs, which proves its robustness.

The average computational time (in seconds) taken by the different algorithms is shown in Table 12, which

demonstrates that FS-BGSK takes very little running time. Especially on the large-dimensional datasets, the algorithms consume a lot of time. In contrast, FS-pBGSK and FS-BGSK take very little; e.g. on $L_5$, BDE takes 4704.51 s, whereas FS-pBGSK consumes only 1778.36 s for 10 different runs.

5.5 Statistical results

To show the comparison statistically, Table 13 summarizes the results obtained by the Friedman test, in which the p value is shown in bold. According to the Friedman test, Table 13 lists the ranks, and the computed p value (0.00) is less than the assumed significance level (0.05) for all datasets. This implies that FS-pBGSK performs significantly better than the other comparative algorithms. The best rank is given to FS-pBGSK according to its mean rank, followed by BALO, BDA, FS-BGSK, BPSO, BSSA, bGWO, BDE and BBA.

Table 13 Results of the Friedman test (Algorithm, Mean rank, Ranking)

FS-pBGSK 1.50 1
BALO 3.02 2
BDA 3.68 3
FS-BGSK 3.93 4
BPSO 4.43 5
BSSA 5.93 6
bGWO 6.91 7
BDE 7.00 8
BBA 8.59 9
p value 0.00

To check the performance of FS-pBGSK, the multi-problem Wilcoxon signed-rank test was performed, and the obtained results are shown in Table 14. We observe that FS-pBGSK obtains a higher $S^+$ than $S^-$ in all comparisons. To be more precise, and according to the test at $\alpha = 0.05$, FS-pBGSK performs significantly better than the other mentioned algorithms. Thus, from Table 14, FS-pBGSK is inferior, equal or superior to the other optimizers in 0, 0 and 132 out of 132 cases, respectively, which represents the superiority of FS-pBGSK in 100% of the cases. Also, the obtained p value is less than the assumed significance level, which implies that the null hypothesis has to be rejected. Hence, the FS-pBGSK algorithm proves that it is significantly different from the other optimizers.

Table 14 Results of the Wilcoxon signed-rank test, FS-pBGSK versus the listed algorithm (columns: $S^+$, $S^-$, p value, +, −, ≈, Decision)

BALO 219 12 0.00 20 1 1 +
BBA 253 0 0.00 22 0 0 +
BDE 253 0 0.00 22 0 0 +
FS-BGSK 244 9 0.00 20 0 2 +
bGWO 252 1 0.00 21 0 1 +
BPSO 239 14 0.00 20 0 2 +
BDA 223 30 0.002 20 0 2 +
BSSA 224 7 0.00 19 1 2 +

From the above discussion and results, it can be concluded that the proposed FS-pBGSK algorithm has better searching quality, efficiency and robustness for solving feature selection problems. The FS-pBGSK algorithm shows promising results for all datasets and proves its superiority over the state-of-the-art algorithms. Moreover, the proposed binary beginners–intermediate and intermediate–experts stages keep the balance between the two main components of the algorithm, that is, the exploration and exploitation abilities, and the population reduction rule helps to delete the worst solutions from the search space of FS-pBGSK. The proposed approaches of the GSK algorithm can also be applied to different types of problems, such as compressed sensing problems [11], unit commitment problems [43] and analytic loss minimization in power engineering [12], and to obtain the solution of Caputo fractional differential equations [10].

6 Conclusions and future work

This paper develops a binary version of the recently proposed gaining–sharing knowledge-based optimization algorithm and applies it to feature selection problems. The binary version of GSK mainly depends on the beginners–intermediate and intermediate–experts stages, which enable the algorithm to investigate the search space and exploit the tendency of finding the best solutions. Moreover, to enhance the performance of BGSK and to avoid having to choose an appropriate population size, a population reduction technique is employed on BGSK, which saves the algorithm from trapping into local optima and from premature convergence. The proposed approaches were applied on 22 benchmark feature selection datasets, and the obtained results were compared with state-of-the-art algorithms which have already shown their results on the feature selection problem. The comparison is based on evaluation criteria such as average fitness value, standard deviation, average accuracy and average feature selection size, and the proposed approaches present
123
Neural Computing and Applications

significantly promising results as compared to other algo- 12. Dassios IK (2019) Analytic loss minimization: theoretical
rithms. Precisely, FS-pBGSK algorithm shows their best framework of a second order optimization method. Symmetry
11(2):136
results in 80% datasets and also obtains the best ranking 13. De Souza RCT, dos Santos Coelho L, De Macedo CA, Pierezan J
from statistical tests. Besides, FS-pBGSK is very simple, (2018) A v-shaped binary crow search algorithm for feature
easy to understand and implement in many languages. selection. In: 2018 IEEE congress on evolutionary computation
For the future work, the proposed algorithm can be (CEC), pp 1–8. IEEE
14. Diao R, Shen Q (2012) Feature selection with harmony search.
enhanced using chaotic maps for feature selection problem IEEE Trans Syst Man Cybern Part B42(6):1509–1523
and can also be applied to multi-dimensional knapsack 15. Ding S, Zhang N, Zhang X, Wu F (2017) Twin support vector
problems. Moreover, the interested researcher can take machine: theory, algorithm and applications. Neural Comput
benefit from the proposed approaches in solving various Appl 28(11):3119–3130
16. Emary E, Zawbaa HM, Hassanien AE (2016) Binary ant lion
real-world problems. approaches for feature selection. Neurocomputing 213:54–65
17. Emary E, Zawbaa HM, Hassanien AE (2016) Binary grey wolf
Acknowledgements The authors would like to acknowledge the edi- optimization approaches for feature selection. Neurocomputing
tors and anonymous reviewers for providing their valuable comments 172:371–381
and suggestions. The authors would also like to thank the Ministry of 18. Faris H, Hassonah MA, Ala’M AZ, Mirjalili S, Aljarah I (2018)
Human Resource Development (MHRD Govt. of India) for providing A multi-verse optimizer approach for feature selection and opti-
the financial assistantship to the first author. mizing svm parameters based on a robust system architecture.
Neural Comput Appl 30(8):2355–2369
Compliance with ethical standards 19. Faris H, Mafarja MM, Heidari AA, Aljarah I, Ala’M AZ, Mir-
Conflict of interest The authors declare that they have no conflict of interest.
Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.