https://doi.org/10.1007/s00521-020-05375-8
ORIGINAL ARTICLE
Abstract
Obtaining the optimal set of features in feature selection problems is among the most challenging and prominent problems in machine learning. Very few human-related metaheuristic algorithms have been developed to solve this type of problem. This motivated us to evaluate the performance of the recently developed gaining–sharing knowledge-based optimization algorithm (GSK), which is based on the concept of humans gaining and sharing knowledge throughout their lifespan. It depends on two stages: a beginners–intermediate gaining and sharing stage and an intermediate–experts gaining and sharing stage. In this study, two approaches are proposed to solve feature selection problems: FS-BGSK, a novel binary version of the GSK algorithm that relies on these two stages with knowledge factor 1, and FS-pBGSK, which employs a population reduction technique on FS-BGSK to enhance its exploration and exploitation quality. The proposed approaches are evaluated on twenty-two feature selection benchmark datasets from the UCI repository, comprising small-, medium- and large-dimensional datasets. The obtained results are compared with seven state-of-the-art metaheuristic algorithms: binary differential evolution, the binary particle swarm optimization algorithm, the binary bat algorithm, the binary grey wolf optimizer, the binary ant lion optimizer, the binary dragonfly algorithm and the binary salp swarm algorithm. The results show that FS-pBGSK and FS-BGSK outperform the other algorithms in terms of accuracy, convergence and robustness on most of the datasets.
Keywords: Gaining–sharing knowledge-based optimization algorithm · Feature selection · Classification · K-NN classifier
1 Introduction
Neural Computing and Applications
classifier) is necessary. Wrapper methods interact directly with the features themselves, which yields higher accuracy. Moreover, wrapper methods are computationally more expensive than filter methods, but the results obtained by these methods are more accurate, and their performance typically depends on the employed learning algorithm.

The feature selection problem is considered an NP-complete combinatorial optimization problem in which an optimal subset of features is selected from the original dataset. The main motive of the feature selection problem is to reduce the dimension of the original dataset with maximum accuracy and the minimum number of features, which implies that these two contrary objectives can be treated as a multi-objective optimization problem. Metaheuristic algorithms are suitable for solving this type of problem and show superior performance.

Over the last three decades, metaheuristic algorithms have proved their efficiency and ability in solving high-dimensional optimization problems. These algorithms start from randomly generated initial solutions and successfully find the optimal solutions of complex optimization problems [66]. They can be categorized into four classes, viz. evolution-based algorithms, swarm-based algorithms, physics-based algorithms and human-related algorithms. There are several algorithms in each category, and they have also been applied to many real-world applications in different fields of engineering and science [38]. These algorithms have achieved success in solving feature selection problems. A detailed literature review of algorithms for the feature selection problem is presented in Table 1.

Besides these algorithms, various hybrid algorithms have been proposed for the feature selection problem. Wan et al. [53] introduced a modified ant colony optimization algorithm combined with a genetic algorithm for the feature selection problem. Mafarja and Mirjalili [35] designed a hybrid metaheuristic algorithm in which simulated annealing is inserted into the whale optimization algorithm to find the optimal feature subset. Tawhid and Dsouza [52] used the bat algorithm for exploring the search space and an enhanced particle swarm optimization for converging to the optimal solution in the feature selection process. Yan et al. [55] proposed a binary coral reefs optimization algorithm involving simulated annealing for feature selection in biomedical datasets. For the regression task, Zhang et al. [67] proposed a support vector regression model with a chaotic krill herd algorithm and empirical mode decomposition.

Table 1 shows that in each category there are various algorithms that have been used for feature selection. Among human-related algorithms, very few have been applied to feature selection problems. Therefore, this motivates us to use a recently developed human behaviour-based algorithm, viz. the gaining–sharing knowledge-based optimization (GSK) algorithm [39], for the feature selection task. Mohamed et al. proposed the GSK algorithm to solve optimization problems in continuous space. The basic concept of GSK is how humans reciprocate knowledge over their lifespan. The two main pillars of GSK are the beginners–intermediate (junior) gaining and sharing stage and the intermediate–experts (senior) gaining and sharing stage. In both stages, persons gain knowledge from their related networks and share their acquired knowledge with others to amplify their skills.

Although GSK is a population-based, real-valued continuous optimization algorithm, to check its performance on the feature selection problem, a binary-coded GSK algorithm is proposed here. It is the first binary variant of GSK to solve the feature selection problem (FS-BGSK). FS-BGSK has two main requisites: a binary beginners–intermediate (junior) gaining and sharing stage and a binary intermediate–experts (senior) gaining and sharing stage. These two requisites enable FS-BGSK to explore the search space and intensify the exploitation tendency efficiently and effectively. From the literature, the following observations on choosing the population size can be made: the population size may be different for every problem [8], it can be based on the dimension of the problem [40], and it may be varied or fixed throughout the optimization process according to the problem [5, 21]. To get rid of these conditions, Mohamed et al. [37] proposed an adaptive guided differential evolution algorithm with a population size reduction technique that reduces the population size gradually. Thus, choosing an appropriate population size is an arduous task. Hence, to overcome this difficulty and to enhance the performance of FS-BGSK, a population reduction technique is applied to FS-BGSK, named FS-pBGSK. The proposed approaches are employed on twenty-two benchmark feature selection datasets, and the obtained results are compared with seven other feature selection algorithms: the binary-coded differential evolution algorithm (BDE) [27], binary-coded particle swarm optimization algorithm (BPSO) [9], binary bat algorithm (BBA) [42], binary grey wolf optimizer (bGWO) [17], binary ant lion optimization (BALO) [16], binary dragonfly algorithm (BDA) [35] and binary salp swarm algorithm (BSSA) [19].

The organization of the paper is as follows: Sect. 2 describes the fundamentals of the feature selection problem and the GSK algorithm, Sect. 3 presents the proposed algorithms, and Sect. 4 describes the proposed algorithm for the feature selection problem. The experimental results and discussion are shown in Sect. 5, which is followed by concluding remarks and future work in Sect. 6.
Table 1
Algorithm: Description (Author and references)

Evolution-based algorithms
  GA: Presented a genetic algorithm for the feature selection problem and showed full validation on datasets (Leardi [32])
  Tabu: Preferred the tabu search algorithm and used a pattern classifier to select the best features in feature selection problems (Zhang and Sun [59])
  GP: Presented genetic programming to select the best features and build the classifier using the selected features (Muni et al. [41])
  FSC: A feature selection and construction method using a grammatical evolution algorithm, mapping the original attributes to artificial attributes with different classifiers (Gavrilis et al. [23])
  BDE: Proposed a new algorithm, discrete binary differential evolution, and solved the feature selection problem using support vector machine (SVM), C&R trees and an RBF network (He et al. [27])
  ICA: The imperialist competitive algorithm used for selecting the optimal features of different rice varieties, based on bulk sample images (Rad et al. [45])
  HBSA: A hybrid backtracking search algorithm used with an extreme learning machine to predict wind speed (Zhang et al. [58])

Swarm-based algorithms
  ACOSVM: Presented ant colony optimization with a support vector machine for face recognition (Yan and Yua [56])
  AFSA: Introduced an artificial fish swarm algorithm with a neural network classifier in the feature selection process (Zhang et al. [60])
  MA: Proposed hybrid filter and wrapper methods with memetic algorithms in classification (Zhu et al. [68])
  PSO: Introduced a multi-objective particle swarm optimization for the feature selection problem (Xue et al. [54])
  BBA: A binary-coded bat algorithm with an optimum-path forest classifier to obtain the optimal set of features (Nakamura et al. [42])
  ABC: Employed an artificial bee colony algorithm for the feature selection problem with a perturbation parameter as a search mechanism for new features (Schiezaro and Pedrini [49])
  BCS: Proposed a binary version of the cuckoo search algorithm for theft detection in power distribution systems (Rodrigues et al. [47])
  MAKHA: Based on chicken movement, a hybrid monkey algorithm with the krill herd algorithm to obtain the optimal feature subsets (Hafez et al. [24])
  CSO: Proposed chicken swarm optimization for finding the optimal feature subsets in the feature selection problem (Hafez et al. [26])
  bGWO: Two approaches of the binary grey wolf optimizer, proposed and employed for classification (Emary et al. [17])
  ISFLA: An improved version of the shuffled frog leaping algorithm, applied to feature selection in biomedical datasets (Hu et al. [29])
  ICSO: Proposed an improved cat swarm optimization algorithm, used for big data classification (Lin et al. [33])
  MFO: Proposed a new moth flame algorithm, based on the motion of moths, to solve feature selection problems (Zawbaa et al. [57])
  FOA–SVM: Used the fruit fly algorithm with SVM as a classifier in medical diagnosis (Shen et al. [50])
  BALO: Based on two new approaches, the ant lion optimizer (based on the hunting process of ant lions) employed in the feature selection domain (Emary et al. [16])
  Rc-BBFA: Presented a binary form of the firefly algorithm, the return-cost-based firefly algorithm, applied to feature selection problems (Zhang et al. [65])
  BDO: A binary version of the dragonfly algorithm, presented for obtaining the maximum classification accuracy (Mafarja et al. [35])
  WOA–SA: A hybrid approach of the whale optimization algorithm and simulated annealing, proposed and applied to feature selection problems (Mafarja and Mirjalili [36])
  BCSA: A V-shaped transfer function used to develop the binary crow search algorithm for the feature selection problem (Souza et al. [13])
  BSSA: A crossover operator employed on the salp swarm algorithm with V- and S-shaped transfer functions, then applied to the feature selection problem (Faris et al. [18])
  BMNABC: A binary multi-neighbourhood artificial bee colony algorithm with a new probability function, used for classification (Beheshti [4])
  PSO–K-NN: Proposed particle swarm optimization with a K-NN classifier to perform face recognition (Sasirekha and Thangavel [48])
  BGWO: A binary grey wolf optimizer with elite-based crossover, used with different classifiers such as SVM, K-NN, decision trees and naive Bayes for Arabic text classification (Chantar et al. [7])
  ACOISA: For SVM speed optimization, an ant colony optimization instance selection algorithm based on boundary detection and boundary instance selection (Akinyelu et al. [1])
  NBGOA: A new binary variant of the grasshopper optimization algorithm (based on the behaviour of grasshoppers) for finding a small-sized subset of features (Hichem et al. [28])
  GOA + SVM: The grasshopper optimization approach used with SVM and applied to biomedical datasets to select optimal features (Ibrahim et al. [30])
  CHHO: Proposed a chaotic sequence-based Harris hawks optimizer for data clustering (Singh [51])

Physics-based algorithms
  HS: A parameter adaptation scheme developed for harmony search and applied to the feature selection domain (Diao and Shen [14])
  IBGSA: An improved version of the gravitational search algorithm with a transfer function, developed to find optimal feature subsets (Rashedi and Nezamabadi-pour [46])
  WC-RSAR: A water cycle algorithm for attribute reduction, proposed and applied with rough set theory (Jabbar and Zaindin [31])
  SCA: Presented a sine–cosine optimization algorithm to extract the best feature subset from the original feature datasets (Hafez et al. [25])
  MVO–SVM: A robust technique based on the multi-verse optimizer for selecting features and optimizing the parameters of SVM (Faris et al. [19])
  BBHA: Developed a binary black hole algorithm and applied it to biological datasets in feature selection problems (Pashaei and Aydin [44])

Human-related algorithms
  FS-BTLBO: Developed a teaching-learning-based optimization algorithm for Wisconsin diagnosis breast cancer data with different classifiers (Allam and Nandhini [3])
  BBSOFS: Proposed a brain storm optimization algorithm with individual clustering technology and two individual updating mechanisms for feature selection problems (Zhang et al. [64])
2.2 K-Nearest neighbour classifier (K-NN)

The K-NN classifier is a nonparametric method used for classification. It is a supervised learning algorithm that classifies unknown instances by calculating the distance between a given unknown instance and its nearest K neighbours. The K-NN classifier is a very commonly used and simple method to classify features, and it is easy to understand and implement. The most commonly used distance measure is the Euclidean distance, given as

    |X_1 − X_2| = [ Σ_{i=1}^{d} (x_{1i} − x_{2i})² ]^{1/2}   (2)

where |·| denotes the distance between X_1 and X_2, which are points in d dimensions.

2.3 GSK algorithm for continuous variables

An optimization problem is formulated as

    min f(X),  X = [x_1, x_2, …, x_d],  x_k ∈ [L_k, U_k],  k = 1, 2, …, d

where f denotes the objective function; X = [x_1, x_2, …, x_d] are the decision variables; L_k, U_k are the lower and upper bounds of the decision variables, respectively; and d represents the dimension of individuals.

In recent years, a novel human-based optimization algorithm, the gaining–sharing knowledge-based optimization algorithm (GSK) [39], has been developed. It follows the concept of gaining and sharing knowledge throughout the human lifetime. GSK mainly relies on two important stages:

(i) Beginners–intermediate gaining and sharing stage (junior or early middle stage).
(ii) Intermediate–experts gaining and sharing stage (senior or middle later stage).

In the early middle stage, or beginners–intermediate gaining and sharing stage, it is not possible to acquire knowledge from social media or friends. An individual gains knowledge from known persons such as family members, relatives or neighbours. Due to lack of experience, these people want to share their thoughts or gained knowledge with other people, who may or may not be from their networks, and they do not have much experience to differentiate others into good or bad categories.

Contrarily, in the middle later stage, or intermediate–experts gaining–sharing stage, individuals gain knowledge from their large networks, such as social media friends and colleagues. These people have much experience and a great ability to categorize people into good or bad classes. Thus, they can share their knowledge or skills with the most suitable persons so that they can enhance their skills. The process of GSK, as mentioned earlier, can be formulated mathematically in the following steps:

Step 1: In the first step, the number of persons (population size NP) is assumed. Let x_t (t = 1, 2, …, NP) be the individuals of a population, x_t = (x_{t1}, x_{t2}, …, x_{td}), where d is the branch of knowledge assigned to an individual, and let f_t (t = 1, 2, …, NP) be the corresponding objective function values.

To obtain a starting solution for the optimization problem, the initial population is created randomly within the boundary constraints as

    x⁰_{tk} = L_k + rand_k (U_k − L_k)   (3)

where rand_k denotes a uniformly distributed random number in the range [0, 1].

Step 2: First, the dimensions of each stage are computed through the following formulas:

    d_junior = d · ((Gen_max − G) / Gen_max)^K   (4)
    d_senior = d − d_junior   (5)

where K (> 0) denotes the knowledge rate that governs the experience rate, d_junior and d_senior represent the dimensions for the junior and senior stage, respectively, Gen_max is the maximum number of generations, and G denotes the generation number.

Step 3: Beginners–intermediate gaining–sharing stage: During this stage, early-aged people gain knowledge from their small networks and share their views with other people who may or may not belong to their group. Thus, individuals are updated as follows:

(i) According to the objective function values, the individuals are arranged in ascending order as x_best, …, x_{t−1}, x_t, x_{t+1}, …, x_worst.

(ii) For every x_t (t = 1, 2, …, NP), select the nearest better (x_{t−1}) and worse (x_{t+1}) individuals to gain knowledge, and also select a random individual (x_R) to share knowledge. The pseudocode to update the individuals is presented in Algorithm 1, in which kf (> 0) is the knowledge factor.
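The setup in Steps 1 and 2 can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation; the function and variable names are our own:

```python
import random

def init_population(NP, lower, upper, seed=0):
    """Step 1 (Eq. 3): random initial population within the bounds."""
    rng = random.Random(seed)
    d = len(lower)
    return [[lower[k] + rng.random() * (upper[k] - lower[k]) for k in range(d)]
            for _ in range(NP)]

def stage_dimensions(d, G, Gen_max, K):
    """Step 2 (Eqs. 4-5): split the d dimensions between the junior and
    senior stages; the junior share shrinks as the generations pass."""
    d_junior = d * ((Gen_max - G) / Gen_max) ** K
    return d_junior, d - d_junior
```

With K = 1, the junior dimension decays linearly from d at G = 0 to 0 at G = Gen_max, so early generations are dominated by the junior stage and later ones by the senior stage.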
3 Proposed methodology

The two approaches of the GSK algorithm are presented in the following subsections.

where the round operator is used to convert the decimal number into the nearest binary number.
3.3 Evaluate the dimensions of stages

Before proceeding further, the dimensions of each stage are computed through the following formulas:

    d_junior = d · (1 − NFE / MaxNFE)^K   (7)
    d_senior = d − d_junior   (8)

where K (> 0) denotes the knowledge rate, which is randomly generated, NFE denotes the number of function evaluations, and MaxNFE represents the maximum number of function evaluations.

3.4 Binary beginners–intermediate (junior) gaining and sharing step

The binary beginners–intermediate gaining and sharing step is based on the original GSK with kf = 1. The individuals were updated in the original GSK using the pseudocode (Algorithm 1), which contains two cases. These two cases are defined for the binary stage as follows:

Case 1: When f(x_R) < f(x_t): There are three different vectors (x_{t−1}, x_{t+1}, x_R), which can take only two values (0 and 1). Therefore, a total of 2³ = 8 combinations are possible, which are listed in Table 2. These eight combinations are categorized into two subcases [(a) and (b)], and each subcase has four combinations. The results of every possible combination are presented in Table 2.

Subcase (a): If x_{t−1} is equal to x_{t+1}, the result is equal to x_R.

Subcase (b): When x_{t−1} is not equal to x_{t+1}, the result is the same as x_{t−1}, taking −1 as 0 and 2 as 1.

The mathematical formulation of Case 1 is as follows:

    x_new_tk = { x_R,      if x_{t−1} = x_{t+1}
               { x_{t−1},  if x_{t−1} ≠ x_{t+1}     (9)

Case 2: When f(x_R) ≥ f(x_t): There are four different vectors (x_{t−1}, x_t, x_{t+1}, x_R) that take only two values (0 and 1). Thus, a total of 2⁴ = 16 combinations are possible, which are presented in Table 3. The 16 combinations are divided into two subcases [(c) and (d)], in which (c) and (d) have four and twelve combinations, respectively.

Subcase (c): If x_{t−1} is not equal to x_{t+1}, but x_{t+1} is equal to x_R, the result is equal to x_{t−1}.

Subcase (d): If any of the conditions x_{t−1} = x_{t+1} ≠ x_R, x_{t−1} ≠ x_{t+1} ≠ x_R or x_{t−1} = x_{t+1} = x_R arises, the result is equal to x_t, considering −1 and −2 as 0, and 2 and 3 as 1.

The mathematical formulation of Case 2 is:

    x_new_tk = { x_{t−1},  if x_{t−1} ≠ x_{t+1} = x_R
               { x_t,      otherwise                 (10)

Table 2 Results of the binary beginners–intermediate gaining and sharing stage of Case 1 with kf = 1

               x_{t−1}  x_{t+1}  x_R   Result  Modified result
Subcase (a)      0        0       0      0          0
                 0        0       1      1          1
                 1        1       0      0          0
                 1        1       1      1          1
Subcase (b)      1        0       0      1          1
                 1        0       1      2          1
                 0        1       0     −1          0
                 0        1       1      0          0

Table 3 Results of the binary beginners–intermediate gaining and sharing stage of Case 2 with kf = 1

               x_{t−1}  x_t  x_{t+1}  x_R   Result  Modified result
Subcase (c)      1       1      0      0       3          1
                 1       0      0      0       1          1
                 0       1      1      1       0          0
                 0       0      1      1      −2          0
Subcase (d)      0       0      0      0       0          0
                 0       1      0      0       2          1
                 0       0      1      0      −1          0
                 0       0      0      1      −1          0
                 1       0      1      0       0          0
                 1       0      0      1       0          0
                 0       1      1      0       1          1
                 0       1      0      1       1          1
                 1       1      1      0       2          1
                 1       0      1      1      −1          0
                 1       1      0      1       2          1
                 1       1      1      1       1          1

3.5 Binary intermediate–experts (senior) gaining and sharing stage

The working mechanism of the binary intermediate–experts gaining and sharing stage is the same as that of the binary junior gaining and sharing stage, with kf = 1. The individuals are updated in the original senior gaining–sharing stage using the pseudocode (Algorithm 2), which contains two cases.
The two cases are further modified for the binary intermediate–experts gaining–sharing stage in the following manner:

Case 1: When f(x_middle) < f(x_t): It contains three different vectors (x_pbest, x_middle, x_pworst), which assume only binary values (0 and 1); thus, a total of eight combinations are possible to update the individuals. These eight combinations are classified into two subcases [(a) and (b)], and each subcase contains four different combinations. Table 4 presents the obtained results of this case.

Subcase (a): If x_pbest is equal to x_pworst, the result is equal to x_middle.

Subcase (b): On the other hand, if x_pbest is not equal to x_pworst, the result is equal to x_pbest, with −1 and 2 mapped to their nearest binary values (0 and 1, respectively).

Case 1 is mathematically formulated in the following way:

    x_new_tk = { x_middle,  if x_pbest = x_pworst
               { x_pbest,   if x_pbest ≠ x_pworst     (11)

Case 2: When f(x_middle) ≥ f(x_t): It consists of four different binary vectors (x_pbest, x_middle, x_pworst, x_t), and with the values of each vector, a total of sixteen combinations arise. The sixteen combinations are also divided into two subcases [(c) and (d)], which contain four and twelve combinations, respectively. The subcases are explained in detail in Table 5.

Subcase (c): When x_pbest is not equal to x_pworst, and x_pworst is equal to x_middle, the obtained result is equal to x_pbest.

Subcase (d): If any case other than (c) arises, the obtained result is equal to x_t, taking −2 and −1 as 0, and 2 and 3 as 1.

The mathematical formulation of Case 2 is given as

    x_new_tk = { x_pbest,  if x_pbest ≠ x_pworst = x_middle
               { x_t,      otherwise                   (12)

The pseudocode for BGSK is shown in Algorithm 3.
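The junior and senior binary updates share the same selection rule and differ only in which individuals play the three roles. A sketch with our own (hypothetical) names follows; the raw real-valued update is the original GSK rule from [39], and clamping its result into {0, 1} reproduces the "Modified result" columns of Tables 2 and 3:

```python
def binary_update(x_better, x_cur, x_worse, x_partner, partner_is_fitter):
    """Shared binary gaining-sharing rule with kf = 1 (Eqs. 9-12).

    Junior stage: (x_better, x_worse, x_partner) = (x_{t-1}, x_{t+1}, x_R).
    Senior stage: (x_better, x_worse, x_partner) = (x_pbest, x_pworst, x_middle).
    """
    if partner_is_fitter:  # Case 1: f(partner) < f(x_t)
        return x_partner if x_better == x_worse else x_better
    # Case 2: the partner is no fitter than x_t
    return x_better if (x_better != x_worse and x_worse == x_partner) else x_cur

def raw_update(x_better, x_cur, x_worse, x_partner, partner_is_fitter, kf=1):
    """Real-valued GSK update from [39]; with binary inputs, clamping the
    result into {0, 1} yields the 'Modified result' of Tables 2 and 3."""
    if partner_is_fitter:
        return x_cur + kf * ((x_better - x_worse) + (x_partner - x_cur))
    return x_cur + kf * ((x_better - x_worse) + (x_cur - x_partner))
```

Exhaustively enumerating all binary inputs in both cases confirms that the case analysis above agrees with the clamped real-valued update for every row of Tables 2 and 3.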
"
Table 5 Results of binary intermediate–experts gaining and sharing
NPGþ1 ¼ round NPmin NPmax stage of Case 2 with kf ¼ 1
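The population reduction rule that FS-pBGSK applies can be sketched as a linear shrink schedule. This assumes the standard linear reduction used in the population-size-reduction technique of [37]; the function name is ours:

```python
def next_population_size(NP_max, NP_min, nfe, max_nfe):
    """Population size for the next generation: shrinks linearly from
    NP_max at the start of the run to NP_min once all of the allotted
    function evaluations have been consumed."""
    return round((NP_min - NP_max) * nfe / max_nfe + NP_max)
```

The worst-ranked individuals are then discarded so the population matches the new size, which is how FS-pBGSK removes poor solutions from the search.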
4 Binary GSK for the feature selection problem

To find the optimal subset of features, two wrapper approaches, FS-BGSK and FS-pBGSK, are proposed, in which BGSK and pBGSK are considered as the searching algorithms and the K-NN classifier as the evaluator. In this study, a feature subset is denoted as a binary vector: if a bit is 1, the corresponding feature is selected; otherwise it is not. The optimization problem is solved based on the goodness of a feature subset, which depends on two conditions:

(i) To maximize the classification accuracy.
(ii) To minimize the number of features.

The above two criteria must hold simultaneously; therefore, the multi-objective function is formulated as a single objective:

    min Z(fitness) = c1 · ERate + c2 · Nfeatures   (14)

where ERate represents the classification error rate (the complement of the classification accuracy), which is calculated using the K-NN classifier, and

    Nfeatures = Number of selected features / Total number of features
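The wrapper evaluation can be sketched as follows: a bare-bones K-NN classifier (Euclidean distance, Eq. 2) supplies ERate, and Eq. (14) blends it with the feature ratio. The weights c1 = 0.99 and c2 = 0.01 are illustrative placeholders, not values taken from this paper:

```python
import math

def knn_error_rate(train, train_labels, test, test_labels, k=1):
    """ERate: misclassification rate of a minimal K-NN classifier."""
    errors = 0
    for point, true_label in zip(test, test_labels):
        # Eq. (2): Euclidean distance from the test point to every training point
        neighbours = sorted((math.dist(point, q), lab)
                            for q, lab in zip(train, train_labels))
        votes = [lab for _, lab in neighbours[:k]]
        errors += max(set(votes), key=votes.count) != true_label
    return errors / len(test)

def fitness(mask, train, train_labels, test, test_labels, c1=0.99, c2=0.01):
    """Eq. (14): Z = c1 * ERate + c2 * (selected features / total features)."""
    cols = [i for i, bit in enumerate(mask) if bit]  # indices of selected features
    sub = lambda rows: [[row[i] for i in cols] for row in rows]
    e_rate = knn_error_rate(sub(train), train_labels, sub(test), test_labels)
    return c1 * e_rate + c2 * len(cols) / len(mask)
```

A candidate binary vector produced by BGSK or pBGSK is scored by `fitness`, so a subset that classifies just as well with fewer features obtains a strictly lower Z.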
5.3 Evaluation criteria

To evaluate the performance of each algorithm, assume that each algorithm was run W times and the results were recorded. The comparison among the algorithms is done by calculating the following measures, in which the fitness function is denoted by Z.

(a) Average fitness function value: the mean of the fitness values (avg_Z) over W runs, formulated in Eq. (15):

    avg_Z = (1/W) Σ_{i=1}^{W} Z_i   (15)

where Z_i is the optimal fitness value in the ith run.

(b) Average classification accuracy: the average of the obtained optimal accuracies of the classifier over W runs. If the optimal accuracy in the ith run is denoted Acc_i, the average classification accuracy avg_Acc is computed through Eq. (16):

    avg_Acc = (1/W) Σ_{i=1}^{W} Acc_i   (16)

(c) Average feature selection size: over W runs, the average selection size (avg_feature) is obtained from the length of the optimal solution relative to the total number of features D, determined by Eq. (17):

    avg_feature = (1/W) Σ_{i=1}^{W} length(x)_i / |D|   (17)

where length(x)_i is the length of the selected feature subset in the ith run and |D| is the total number of features.

(d) Standard deviation of fitness values: it presents the robustness and variation of the obtained fitness values of each algorithm over W runs, telling how much the optimal solutions deviate from the mean. A smaller standard deviation indicates that the algorithm converges to the same solution, whereas a larger value indicates more random solutions. The standard deviation (std_Z) is computed using Eq. (18):

    std_Z = [ (1/(W−1)) Σ_{i=1}^{W} (Z_i − avg_Z)² ]^{1/2}   (18)

where Z_i is the optimal fitness value in the ith run and avg_Z is the mean fitness value calculated in Eq. (15).

(e) Average computation time: the computational time (in seconds) taken by each algorithm over W runs. The average computation time avg_o^Time of the oth optimizer is calculated using Eq. (19):

    avg_o^Time = (1/W) Σ_{i=1}^{W} (Time)_{oi}   (19)

where (Time)_{oi} denotes the time used by the oth algorithm in the ith run.

(f) Statistical tests: to investigate the performance of the algorithms and the quality of the solutions statistically, two nonparametric statistical tests are conducted: the Friedman test and the multi-problem Wilcoxon signed-rank test [22].

• Friedman test: it ranks all algorithms over all datasets, which indicates the performance of the algorithms. The null hypothesis states that "there is no significant difference among the algorithms", whereas the alternative hypothesis is that "the algorithms are significantly different over all datasets". The final decision is made on the obtained p value; if the p value is less than 0.05 (the significance level), the null hypothesis is rejected.

• Multi-problem Wilcoxon signed-rank test: it is used to compare pairs of algorithms over all datasets based on the mean results of the solutions. In this test, S+ denotes the sum of ranks over all datasets on which the first algorithm of a row performs better than the second, and S− denotes the opposite of S+; a larger rank sum indicates a larger discrepancy between the algorithms. The null hypothesis is "the algorithms are not significantly different with respect to mean results" and the alternative hypothesis is "the algorithms are significantly different with respect to mean results". In the results of this test, three notations are used: plus (+), the results of the first algorithm are significantly better than those of the second; minus (−), the results of the first algorithm are significantly worse than those of the second; approximate (≈), there is no significant difference between the two algorithms. The decision is taken on the p value; the null hypothesis is rejected if the obtained p value is less than the assumed significance level (5%).

5.4 Numerical results and discussion

The numerical results obtained by all optimizers are presented in Tables 8, 9, 10, 11 and 12, evaluated by the
criteria that are described in the previous section. The best results are shown in bold text.

Table 8 presents the statistical mean of the fitness values of all algorithms. It indicates that the FS-pBGSK algorithm obtains the best average fitness values on 77% of the datasets. It shows the best results for the high-dimensional datasets (L1–L5); for example, on L5 its worst obtained fitness value (0.005524) is much better than the average fitness values of the other algorithms. The reason is that the population size decreases linearly with the number of function evaluations, and both binary stages enable the FS-pBGSK algorithm to explore and exploit the search space and find the optimal solutions. On datasets S3 and S4, the BPSO algorithm shows the best results among all optimizers.

Moreover, the convergence curves of all algorithms for the small-, medium- and large-dimensional datasets are drawn in Figs. 2, 3 and 4, respectively, and illustrate the performance of the algorithms. From the figures, it is noticed that both the FS-pBGSK and FS-BGSK algorithms converge to the best solution in
82% datasets as compared to other algorithms. Although For minimizing the fitness values, the second objective
the state-of-the-art algorithms converge faster than FS-pBGSK and FS-BGSK, they either converge prematurely or stagnate at an early stage of the optimization process. Thus, both FS-pBGSK and FS-BGSK are able to balance the two contradictory aspects of exploration capability and exploitation tendency.

Table 9 reports the second evaluation criterion, the average classification accuracy of all algorithms. Out of 22 datasets, FS-pBGSK performs better than the other algorithms on 16. Precisely, on 14 datasets the FS-pBGSK algorithm achieves more than 90% accuracy in finding the optimal feature subset, and it obtains high accuracy mostly on the medium- and high-dimensional datasets (M1-M9 and L1-L5). BDA comes second, finding the best accuracy on 2 datasets. The main reason for the high accuracy of FS-pBGSK is that it does not assume a fixed population size: it gradually decreases the population size, so the algorithm does not get trapped in local optima and presents the best solutions.

A further objective of the fitness function is to minimize the number of selected features. Table 10 reports the ratio of selected features obtained by all comparative optimizers and the proposed approaches. It shows that the FS-pBGSK algorithm selects the minimal number of features, compared with the other algorithms, on 16 datasets. BALO comes second, showing the best results on 4 small-dimensional datasets, and BBA presents its best results on only two datasets. This indicates that the FS-pBGSK algorithm has the ability to find the optimal solutions of feature selection problems.

To demonstrate the robustness of the proposed approaches, Table 11 presents the standard deviation of the fitness values of all algorithms. In 82% of the datasets, the standard deviation of FS-pBGSK is lower than that of the other algorithms, which show more disparity among their fitness values. Therefore, FS-pBGSK converges to the same solution over different runs, which proves its robustness.

The average computational time (in seconds) taken by the different algorithms is reported in Table 12.
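The evaluation criteria above (classification accuracy and selected-feature ratio) are typically combined in wrapper-based feature selection into a single fitness of the form f = α · error + (1 − α) · |S|/|F|, evaluated with a k-NN classifier as in this study. The following is a minimal sketch under stated assumptions: the weight α = 0.99, the synthetic dataset and the leave-one-out 5-NN evaluation are illustrative choices, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy dataset: 2 informative features out of 10, binary labels (illustrative).
n, d = 120, 10
X = rng.normal(size=(n, d))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

def knn_accuracy(Xs, y, k=5):
    """Leave-one-out accuracy of a plain k-NN classifier."""
    dist = np.linalg.norm(Xs[:, None, :] - Xs[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)          # exclude the query point itself
    idx = np.argsort(dist, axis=1)[:, :k]   # indices of the k nearest neighbours
    votes = y[idx].mean(axis=1) > 0.5       # majority vote (k odd, so no ties)
    return (votes.astype(int) == y).mean()

def fitness(mask, X, y, alpha=0.99):
    """Smaller is better: weighted error plus selected-feature ratio."""
    if not mask.any():                      # an empty feature subset is invalid
        return 1.0
    err = 1.0 - knn_accuracy(X[:, mask], y)
    return alpha * err + (1 - alpha) * mask.sum() / mask.size

good = np.zeros(d, dtype=bool)
good[:2] = True                             # only the two informative features
full = np.ones(d, dtype=bool)
print(f"fitness(informative pair) = {fitness(good, X, y):.4f}")
print(f"fitness(all features)     = {fitness(full, X, y):.4f}")
```

A binary optimizer such as FS-BGSK would evolve the boolean mask, so both the error term and the feature-ratio term are minimized at once.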
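The statistical comparisons reported in Tables 13 and 14 rest on two standard non-parametric tests: the Friedman test over per-dataset ranks and the multi-problem Wilcoxon signed-rank test. A brief sketch with SciPy on toy per-dataset accuracies follows; the numbers are illustrative, not the paper's results.

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata, wilcoxon

# Rows = datasets, columns = three competing optimizers (toy accuracies).
acc = np.array([
    [0.95, 0.91, 0.90],
    [0.97, 0.93, 0.94],
    [0.92, 0.90, 0.88],
    [0.99, 0.95, 0.93],
    [0.94, 0.92, 0.91],
    [0.96, 0.90, 0.92],
])

# Friedman test: do the mean ranks of the algorithms differ significantly?
stat, p = friedmanchisquare(*acc.T)                # one sample per algorithm
mean_ranks = rankdata(-acc, axis=1).mean(axis=0)   # rank 1 = most accurate
print(f"Friedman chi2 = {stat:.3f}, p = {p:.4f}, mean ranks = {mean_ranks}")

# Wilcoxon signed-rank test: algorithm 0 vs algorithm 1, paired per dataset
# (the multi-problem setting used for Table 14).
w_stat, w_p = wilcoxon(acc[:, 0], acc[:, 1])
print(f"Wilcoxon statistic = {w_stat}, p = {w_p:.4f}")
```

A Friedman p value below 0.05 rejects the null hypothesis that all algorithms perform equally, after which pairwise Wilcoxon tests identify which differences are significant, mirroring the two-step procedure used in the paper.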
Table 12 demonstrates that FS-BGSK takes much less running time. On the large-dimensional datasets especially, the algorithms consume a lot of time, whereas FS-pBGSK and FS-BGSK take very little: on L5, BDE takes 4704.51 s, while FS-pBGSK consumes only 1778.36 s for 10 different runs.

5.5 Statistical results

To compare the algorithms statistically, Table 13 summarizes the results of the Friedman test together with the computed p value. The computed p value (0.00) is less than the assumed significance level (0.05) for all datasets, which implies that FS-pBGSK performs significantly better than the other comparative algorithms. By mean rank, the best rank is given to FS-pBGSK, followed by BALO, BDA, FS-BGSK, BPSO, BSSA, bGWO, BDE and BBA.

Table 13 Results of Friedman test

Algorithms   Mean rank   Ranking
FS-pBGSK     1.50        1
BALO         3.02        2
BDA          3.68        3
FS-BGSK      3.93        4
BPSO         4.43        5
BSSA         5.93        6
bGWO         6.91        7
BDE          7.00        8
BBA          8.59        9
p value      0.00

To check the performance of FS-pBGSK further, the multi-problem Wilcoxon signed-rank test was performed, and the obtained results are shown in Table 14. We observe that FS-pBGSK obtains a higher S+ than S- in all comparisons. To be more precise, according to the test at a significance level of 0.05, FS-pBGSK performs significantly better than the other mentioned algorithms. Thus, from Table 14, FS-pBGSK is inferior, equal or superior to the other optimizers in 0, 0 and 132 out of 132 cases, respectively, i.e. FS-pBGSK is superior in 100% of the cases. Also, every p value is less than the assumed significance level, which implies that the null hypothesis has to be rejected. Hence, FS-pBGSK proves that it is significantly different from the other optimizers.

Table 14 Results of Wilcoxon signed-rank test (FS-pBGSK vs the listed algorithm)

Algorithms   S+    S-   p value   +    =    -   Decision
BALO         219   12   0.00      20   1    1   +
BBA          253   0    0.00      22   0    0   +
BDE          253   0    0.00      22   0    0   +
FS-BGSK      244   9    0.00      20   0    2   +
bGWO         252   1    0.00      21   0    1   +
BPSO         239   14   0.00      20   0    2   +
BDA          223   30   0.002     20   0    2   +
BSSA         224   7    0.00      19   1    2   +

From the above discussion and results, it can be concluded that the proposed FS-pBGSK algorithm has better searching quality, efficiency and robustness for solving feature selection problems. FS-pBGSK shows promising results on all datasets and proves its superiority over the state-of-the-art algorithms. Moreover, the proposed binary beginners–intermediate and intermediate–experts stages keep the balance between the two main components of such algorithms, namely exploration and exploitation, and the population reduction rule helps to delete the worst solutions from the search space of FS-pBGSK. The proposed approaches of the GSK algorithm can also be applied to different types of problems such as compressed sensing problems [11], unit commitment problems [43], analytic loss minimization in power engineering [12] and the solution of Caputo fractional differential equations [10].

6 Conclusions and future work

This paper develops a binary version of the recently proposed gaining–sharing knowledge-based optimization algorithm and applies it to feature selection problems. The binary version of GSK mainly depends on the beginners–intermediate and intermediate–experts stages, which enable the algorithm to investigate the search space and to exploit the tendency of finding the best solutions. Moreover, to enhance the performance of BGSK and to avoid choosing an appropriate population size, a population reduction technique is employed in BGSK, which saves the algorithm from trapping into local optima and from premature convergence. The proposed approaches were applied to 22 benchmark feature selection datasets, and the obtained results were compared with state-of-the-art algorithms that have already reported results on the feature selection problem. The comparison is based on evaluation criteria such as average fitness value, standard deviation, average accuracy and average feature selection size.
The proposed approaches present significantly promising results compared with the other algorithms. Precisely, the FS-pBGSK algorithm shows the best results on 80% of the datasets and also obtains the best ranking in the statistical tests. Besides, FS-pBGSK is very simple and easy to understand and implement in many languages.

For future work, the proposed algorithm can be enhanced using chaotic maps for the feature selection problem and can also be applied to multi-dimensional knapsack problems. Moreover, interested researchers can benefit from the proposed approaches in solving various real-world problems.

Acknowledgements The authors would like to acknowledge the editors and anonymous reviewers for providing their valuable comments and suggestions. The authors would also like to thank the Ministry of Human Resource Development (MHRD, Govt. of India) for providing the financial assistantship to the first author.

Compliance with ethical standards

Conflict of interest The authors declare that they have no conflict of interest.

References

1. Akinyelu AA, Ezugwu AE, Adewumi AO (2019) Ant colony optimization edge selection for support vector machine speed optimization. Neural Comput Appl 1–33
2. Al-Madi N, Faris H, Mirjalili S (2019) Binary multi-verse optimization algorithm for global optimization and discrete problems. Int J Mach Learn Cybern 10(12):3445–3465
3. Allam M, Nandhini M (2018) Optimal feature selection using binary teaching learning based optimization algorithm. J King Saud Univ Comput Inform Sci. https://doi.org/10.1016/j.jksuci.2018.12.001
4. Beheshti Z (2018) BMNABC: binary multi-neighborhood artificial bee colony for high-dimensional discrete optimization problems. Cybern Syst 49(7–8):452–474
5. Brest J, Maučec MS (2011) Self-adaptive differential evolution algorithm using population size reduction and three strategies. Soft Comput 15(11):2157–2174
6. Chandrashekar G, Sahin F (2014) A survey on feature selection methods. Comput Electr Eng 40(1):16–28
7. Chantar H, Mafarja M, Alsawalqah H, Heidari AA, Aljarah I, Faris H (2019) Feature selection using binary grey wolf optimizer with elite-based crossover for Arabic text classification. Neural Comput Appl. https://doi.org/10.1007/s00521-019-04368-6
8. Cheng J, Zhang G, Neri F (2013) Enhancing distributed differential evolution with multicultural migration for global numerical optimization. Inf Sci 247:72–93
9. Chuang LY, Chang HW, Tu CJ, Yang CH (2008) Improved binary PSO for feature selection using gene expression data. Comput Biol Chem 32(1):29–38
10. Dassios I, Baleanu D (2018) Optimal solutions for singular linear systems of Caputo fractional differential equations. Math Methods Appl Sci. https://doi.org/10.1002/mma.5410
11. Dassios I, Fountoulakis K, Gondzio J (2015) A preconditioner for a primal-dual Newton conjugate gradient method for compressed sensing problems. SIAM J Sci Comput 37(6):A2783–A2812
12. Dassios IK (2019) Analytic loss minimization: theoretical framework of a second order optimization method. Symmetry 11(2):136
13. De Souza RCT, dos Santos Coelho L, De Macedo CA, Pierezan J (2018) A v-shaped binary crow search algorithm for feature selection. In: 2018 IEEE congress on evolutionary computation (CEC), pp 1–8. IEEE
14. Diao R, Shen Q (2012) Feature selection with harmony search. IEEE Trans Syst Man Cybern Part B 42(6):1509–1523
15. Ding S, Zhang N, Zhang X, Wu F (2017) Twin support vector machine: theory, algorithm and applications. Neural Comput Appl 28(11):3119–3130
16. Emary E, Zawbaa HM, Hassanien AE (2016) Binary ant lion approaches for feature selection. Neurocomputing 213:54–65
17. Emary E, Zawbaa HM, Hassanien AE (2016) Binary grey wolf optimization approaches for feature selection. Neurocomputing 172:371–381
18. Faris H, Hassonah MA, Ala'M AZ, Mirjalili S, Aljarah I (2018) A multi-verse optimizer approach for feature selection and optimizing SVM parameters based on a robust system architecture. Neural Comput Appl 30(8):2355–2369
19. Faris H, Mafarja MM, Heidari AA, Aljarah I, Ala'M AZ, Mirjalili S, Fujita H (2018) An efficient binary salp swarm algorithm with crossover scheme for feature selection problems. Knowl-Based Syst 154:43–67
20. Frank A, Asuncion A et al (2011) UCI machine learning repository. http://archive.ics.uci.edu/ml
21. Gao WF, Yen GG, Liu SY (2014) A dual-population differential evolution with coevolution for constrained optimization. IEEE Trans Cybern 45(5):1108–1121
22. García S, Molina D, Lozano M, Herrera F (2009) A study on the use of non-parametric tests for analyzing the evolutionary algorithms' behaviour: a case study on the CEC'2005 special session on real parameter optimization. J Heurist 15(6):617
23. Gavrilis D, Tsoulos IG, Dermatas E (2008) Selecting and constructing features using grammatical evolution. Pattern Recogn Lett 29(9):1358–1365
24. Hafez AI, Hassanien AE, Zawbaa HM, Emary E (2015) Hybrid monkey algorithm with krill herd algorithm optimization for feature selection. In: 2015 11th international computer engineering conference (ICENCO), pp 273–277. IEEE
25. Hafez AI, Zawbaa HM, Emary E, Hassanien AE (2016) Sine cosine optimization algorithm for feature selection. In: 2016 international symposium on innovations in intelligent systems and applications (INISTA), pp 1–5. IEEE
26. Hafez AI, Zawbaa HM, Emary E, Mahmoud HA, Hassanien AE (2015) An innovative approach for feature selection based on chicken swarm optimization. In: 2015 7th international conference of soft computing and pattern recognition (SoCPaR), pp 19–24. IEEE
27. He X, Zhang Q, Sun N, Dong Y (2009) Feature selection with discrete binary differential evolution. In: 2009 international conference on artificial intelligence and computational intelligence, vol 4, pp 327–330. IEEE
28. Hichem H, Elkamel M, Rafik M, Mesaaoud MT, Ouahiba C (2019) A new binary grasshopper optimization algorithm for feature selection problem. J King Saud Univ Comput Inform Sci
29. Hu B, Dai Y, Su Y, Moore P, Zhang X, Mao C, Chen J, Xu L (2016) Feature selection for optimized high-dimensional biomedical data using an improved shuffled frog leaping algorithm. IEEE/ACM Trans Comput Biol Bioinform 15(6):1765–1773
30. Ibrahim HT, Mazher WJ, Ucan ON, Bayat O (2019) A grasshopper optimizer approach for feature selection and optimizing SVM parameters utilizing real biomedical data sets. Neural Comput Appl 31(10):5965–5974
31. Jabbar A, Zainudin S (2014) Water cycle algorithm for attribute reduction problems in rough set theory. J Theor Appl Inform Technol 61(1):107–117
32. Leardi R (1994) Application of a genetic algorithm to feature selection under full validation conditions and to outlier detection. J Chemom 8(1):65–79
33. Lin KC, Zhang KY, Huang YH, Hung JC, Yen N (2016) Feature selection based on an improved cat swarm optimization algorithm for big data classification. J Supercomput 72(8):3210–3221
34. Liu H, Motoda H (1998) Feature extraction, construction and selection: a data mining perspective, vol 453. Springer, Berlin
35. Mafarja MM, Eleyan D, Jaber I, Hammouri A, Mirjalili S (2017) Binary dragonfly algorithm for feature selection. In: 2017 international conference on new trends in computing sciences (ICTCS), pp 12–17. IEEE
36. Mafarja MM, Mirjalili S (2017) Hybrid whale optimization algorithm with simulated annealing for feature selection. Neurocomputing 260:302–312
37. Mohamed AK, Mohamed AW, Elfeky EZ, Saleh M (2018) Enhancing AGDE algorithm using population size reduction for global numerical optimization. In: International conference on advanced machine learning technologies and applications, pp 62–72. Springer
38. Mohamed AW (2016) A new modified binary differential evolution algorithm and its applications. Appl Math Inform Sci 10(5):1965–1969
39. Mohamed AW, Hadi AA, Mohamed AK (2020) Gaining-sharing knowledge based algorithm for solving optimization problems: a novel nature-inspired algorithm. Int J Mach Learn Cybern 11:1501–1529
40. Mohamed AW, Sabry HZ (2012) Constrained optimization based on modified differential evolution algorithm. Inf Sci 194:171–208
41. Muni DP, Pal NR, Das J (2006) Genetic programming for simultaneous feature selection and classifier design. IEEE Trans Syst Man Cybern Part B 36(1):106–117
42. Nakamura RY, Pereira LA, Costa KA, Rodrigues D, Papa JP, Yang XS (2012) BBA: a binary bat algorithm for feature selection. In: 2012 25th SIBGRAPI conference on graphics, patterns and images, pp 291–297. IEEE
43. Panwar LK, Reddy S, Verma A, Panigrahi BK, Kumar R (2018) Binary grey wolf optimizer for large scale unit commitment problem. Swarm Evolut Comput 38:251–266
44. Pashaei E, Aydin N (2017) Binary black hole algorithm for feature selection and classification on biological data. Appl Soft Comput 56:94–106
45. Rad SM, Tab FA, Mollazade K (2012) Application of imperialist competitive algorithm for feature selection: a case study on bulk rice classification. Int J Comput Appl 40(16):41–48
46. Rashedi E, Nezamabadi-pour H (2014) Feature subset selection using improved binary gravitational search algorithm. J Intell Fuzzy Syst 26(3):1211–1221
47. Rodrigues D, Pereira LA, Almeida T, Papa JP, Souza A, Ramos CC, Yang XS (2013) BCS: a binary cuckoo search algorithm for feature selection. In: 2013 IEEE international symposium on circuits and systems (ISCAS), pp 465–468. IEEE
48. Sasirekha K, Thangavel K (2019) Optimization of k-nearest neighbor using particle swarm optimization for face recognition. Neural Comput Appl 31(11):7935–7944
49. Schiezaro M, Pedrini H (2013) Data feature selection based on artificial bee colony algorithm. EURASIP J Image Video Process 2013(1):47
50. Shen L, Chen H, Yu Z, Kang W, Zhang B, Li H, Yang B, Liu D (2016) Evolving support vector machines using fruit fly optimization for medical data classification. Knowl-Based Syst 96:61–75
51. Singh T (2020) A chaotic sequence-guided Harris Hawks optimizer for data clustering. Neural Comput Appl. https://doi.org/10.1007/s00521-020-04951-2
52. Tawhid MA, Dsouza KB (2018) Hybrid binary bat enhanced particle swarm optimization algorithm for solving feature selection problems. Appl Comput Inform. https://doi.org/10.1016/j.aci.2018.04.001
53. Wan Y, Wang M, Ye Z, Lai X (2016) A feature selection method based on modified binary coded ant colony optimization algorithm. Appl Soft Comput 49:248–258
54. Xue B, Zhang M, Browne WN (2012) Particle swarm optimization for feature selection in classification: a multi-objective approach. IEEE Trans Cybern 43(6):1656–1671
55. Yan C, Ma J, Luo H, Patel A (2019) Hybrid binary coral reefs optimization algorithm with simulated annealing for feature selection in high-dimensional biomedical datasets. Chemometr Intell Lab Syst 184:102–111
56. Yan Z, Yuan C (2004) Ant colony optimization for feature selection in face recognition. In: International conference on biometric authentication, pp 221–226. Springer, Berlin
57. Zawbaa HM, Emary E, Parv B, Sharawi M (2016) Feature selection approach based on moth-flame optimization algorithm. In: 2016 IEEE congress on evolutionary computation (CEC), pp 4612–4617. IEEE
58. Zhang C, Zhou J, Li C, Fu W, Peng T (2017) A compound structure of ELM based on feature selection and parameter optimization using hybrid backtracking search algorithm for wind speed forecasting. Energy Convers Manag 143:360–376
59. Zhang H, Sun G (2002) Feature selection using tabu search method. Pattern Recogn 35(3):701–711
60. Zhang M, Shao C, Li F, Gan Y, Sun J (2006) Evolving neural network classifiers and feature subset using artificial fish swarm. In: 2006 international conference on mechatronics and automation, pp 1598–1602. IEEE
61. Zhang N, Ding S (2017) Unsupervised and semi-supervised extreme learning machine with wavelet kernel for high dimensional data. Memet Comput 9(2):129–139
62. Zhang N, Ding S, Sun T, Liao H, Wang L, Shi Z (2020) Multi-view RBM with posterior consistency and domain adaptation. Inf Sci 516:142–157
63. Zhang N, Ding S, Zhang J, Xue Y (2018) An overview on restricted Boltzmann machines. Neurocomputing 275:1186–1199
64. Zhang WQ, Zhang Y, Peng C (2019) Brain storm optimization for feature selection using new individual clustering and updating mechanism. Appl Intell 49(12):4294–4302
65. Zhang Y, Song XF, Gong DW (2017) A return-cost-based binary firefly algorithm for feature selection. Inform Sci 418:561–574
66. Zhang Z, Ding S, Jia W (2019) A hybrid optimization algorithm based on cuckoo search and differential evolution for solving constrained engineering problems. Eng Appl Artif Intell 85:254–268
67. Zhang Z, Ding S, Sun Y (2020) A support vector regression model hybridized with chaotic krill herd algorithm and empirical mode decomposition for regression task. Neurocomputing. https://doi.org/10.1016/j.neucom.2020.05.075
68. Zhu Z, Ong YS, Dash M (2007) Wrapper-filter feature selection algorithm using a memetic framework. IEEE Trans Syst Man Cybern Part B Cybern 37(1):70–76

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.