An embedded feature selection method based on generalized classifier neural network for cancer classification
Keywords: Embedded feature selection; Generalized classifier neural network; Explainable model

The selection of relevant genes plays a vital role in classifying high-dimensional microarray gene expression data. Sparse group Lasso and its variants have been employed for gene selection to capture the interactions of genes within a group. Most of the embedded methods are linear sparse learning models that fail to capture the non-linear interactions. Additionally, very little attention has been given to solving multi-class problems. The existing methods create overlapping groups, which further increases dimensionality. The paper proposes a neural network-based embedded feature selection method that can represent the non-linear relationship. In an effort toward an explainable model, a generalized classifier neural network (GCNN) is adopted as the model for the proposed embedded feature selection. GCNN has a well-defined architecture in terms of the number of layers and neurons within each layer. Each layer has a distinct functionality, eliminating the obscure nature of most neural networks. The paper proposes a feature selection approach called Weighted GCNN (WGCNN) that embeds feature weighting as a part of training the neural network. Since gene expression data comprises a large number of features, a statistically guided dropout is implemented at the input layer to avoid overfitting of the model. The proposed method works for binary as well as multi-class classification problems. Experimental validation is carried out on seven microarray datasets with three learning models, and the method is compared with six state-of-the-art methods that are popularly employed for feature selection. WGCNN performs well in terms of the F1 score and the number of features selected.
∗ Corresponding author.
E-mail address: akshata.naik@nitgoa.ac.in (A.K. Naik).
https://doi.org/10.1016/j.compbiomed.2023.107677
Received 2 June 2022; Received in revised form 26 October 2023; Accepted 6 November 2023
Available online 8 November 2023
0010-4825/© 2023 Elsevier Ltd. All rights reserved.
A.K. Naik and V. Kuppili Computers in Biology and Medicine 168 (2024) 107677
with an optimization technique outputs a hypothesis that is used in feature selection. In this approach, the dataset is initially divided into training and testing sets. The training set is utilized by an optimization algorithm to learn an accurate set of parameters for a model. The learned parameters are used to determine the importance of features. The testing set is then used to evaluate the selected set of features.

Most of the embedded methods incorporate regression as a constraint in existing learning models to achieve a sparse solution. Many popular embedded methods utilize $L_1$, $L_2$, $L_{2,1}$ norm or regularized sparse multinomial logistic regression with penalty [4]. Sparse regularizers try to learn a model by minimizing the fitting errors and simultaneously reducing the coefficients to zero or near zero. The output is a model along with the set of selected features. Sparse regularizers are adopted for grouped gene selection by constructing weights for gene groups and genes based on the actual gene expression values. Most of these methods do not capture the gene interaction information, thereby leading to a selection of biologically unrelated genes. Wang and Li [5] further designed a gene and group weight calculation method based on Joint Mutual Information (JMI) that considers biological relations. The Weighted General Group Lasso (WGGL) in [5] works only for two-class problems. The existing embedded methods are based on the following learning models: SVM, linear regression, and decision trees [6,7].

An Artificial Neural Network (ANN), also known as a neural network, is a mathematical model inspired by how biological nervous systems, such as the brain, process information. A neural network is an interconnected group of simulated neurons that processes information for computation using a connectionist approach. Neural networks can easily model simple and complex relationships that are difficult to capture in linear or additive models. They can also be used to identify patterns and clusters in data [8–10]. A comprehensive review of earlier research using deep learning and reinforcement learning for breast cancer detection and categorization is conducted by Nusrat et al. [11]. The research has also examined the publicly accessible datasets for various imaging modalities. Through a learning process, an ANN can be designed for a specific application, such as data classification and pattern categorization. Given these characteristics, the paper proposes an embedded feature selection method based on a neural network model.

Although neural networks like deep learning are widely and successfully applied for classification, the obscure nature of the model provides little insight into how the classification is performed [12]. Binjun et al. [13] created a model for the use of neural network algorithms in cancer diagnosis and treatment. A CNN (Convolutional Neural Network) based saliency detection network is presented to address the problem of colorectal polyp region extraction [14]. Majid and Fardin [15] present a lightweight, effective model for the categorization of breast cancer histopathology images using knowledge distillation. A multi-region radiomics strategy with multimodal ultrasound is suggested by Zhou et al. [16] for artificially intelligent diagnosis of aggressive and benign breast tumors. The growing applications of machine learning models to decision-making, in turn, have led to a need for explainable learning models. The stakeholders and decision-makers tend to trust a model that is transparent about how the decisions are arrived at. Dinh et al. [12] state that feature selection is one of the prime steps towards explainability of models. We propose a feature selection method that not only accounts for the non-linear interaction of features using a neural network but is also explainable at the same time. The proposed method is based on GCNN [17] which, unlike most ANN and deep learning models, avoids parameters about the architecture settings. GCNN provides clear guidelines on the number of layers and neurons in each layer, thus making it a suitable candidate for an explainable model. The contributions of the paper are summarized as follows:

• An embedded feature selection method, WGCNN, is proposed that captures the non-linear relationship efficiently using a neural network.
• To make the model explainable, the proposed method embeds feature selection into GCNN, which has a clear interpretation of the number and functionality of layers.
• A statistical method for guided dropout at the input layer is adopted to counter the overfitting of the model.
• The proposed method works efficiently for binary and multi-class classification problems and does not require any special handling for the multi-class scenario.
• Experimental validation has been carried out on microarray gene expression data along with statistical testing and comparative analysis with popular feature selection methods.

1.1. Motivation

Microarray technology finds its application in the health sector to study and investigate gene expression levels to diagnose cancer. The gene expression data consists of many irrelevant and unimportant genes for cancer diagnosis. Feature selection thus helps in choosing a relevant set of genes for cancer classification. Embedded feature selection has a distinctive advantage over filter and wrapper methods in terms of model interaction and faster computation. One prominent technique employed in an embedded approach is sparse learning-based models that eliminate features with a lower score or weight. Linear sparse learning models are inefficient in capturing the non-linear interaction of features. Therefore, the paper proposes a sparse learning model based on GCNN that can model a non-linear relationship. The weights for features are learned while training the GCNN model. The choice of GCNN is to make the model more explainable. The proposed approach is called WGCNN as it includes the learning of weights in GCNN.

The remainder of the paper is organized as follows. Section 2 discusses related works found in the literature. The problem statement
nomial logistic regression respectively using Bayesian regularization. From a biological point of view, the cancer-causing genes usually interact with each other in a group. However, these methods do not account for the group interaction of the genes.

The group Lasso [25] and logistic group lasso [26] select relevant features in the group, thus making them a suitable approach for gene selection. Simon et al. [27] designed SGL, which, unlike group lasso, could achieve a selection of both sparse groups and sparsity within groups. An improved SGL was designed in [28] that utilized data-dependent weights for feature selection. MSGL was applied for multi-class problems in [29].

The efficiency of the group lasso and its extensions for microarray gene selection relies heavily on the division of the groups. WGCNA has been applied to divide the genes based on the correlation. SGL gives an equal weight to all the genes, thus the relative importance of genes is ignored. The group weight is determined based on the number of genes within a group, which may not work well when the group sizes vary a lot.

Wang et al. [5] developed a WGGL framework that computes the gene weight using JMI. The framework has been applied for two-class cancer classification and gene selection. JMI cannot be applied to continuous variables. The continuous variables need to be discretized, thus leading to two disadvantages. Firstly, a different choice of discretization method may lead to different results. Secondly, an additional computational overhead is incurred for the discretization step.

MSGL is designed in [29] for multi-class classification. Although the authors have made provisions for gene and group weights, no guidelines for weight assignments have been provided. A non-weighted approach, MROGL, is applied to three-class cancer classification [30]. MROGL generates groups within each class using WGCNA. The groups are overlapping in nature. Thus, for a 3-class problem with $p$ genes, the method would lead to a $3p$ dimension. Since the microarray data are high-dimensional in nature, with an increase in the number of classes, the computational burden will increase if overlapping groups are used.

Throughout the years, numerous techniques in the statistics literature have been developed and adapted to neural networks. Li et al. [31] proposed inserting a sparse one-to-one linear layer between the input layer and the first hidden layer of a neural network and performing $L_1$ regularization on the weights of this additional layer. Similar concepts are applied to different types of networks and learning contexts [32–34]. Standard lasso is not ideal for neural networks because a feature can only be dropped if all of its connections have been shrunk to zero simultaneously, an objective that the lasso does not actively pursue.

Zhao et al. [35], Scardapane et al. [36], and Zhang et al. [37] address this issue by employing group lasso and its variants for selecting deep neural network features. Feng and Simon [38] discuss fitting a neural network with a Sparse Group Lasso penalty on the first-layer input weights and a sparsity penalty on subsequent layers. Dinh et al. [12] addressed the problem of feature selection for analytic deep networks. The authors report that the adaptive group lasso selection procedure with group lasso as the base estimator is selection-consistent.

The paper presents an embedded feature selection method, based on a sparse learning model that is constructed using GCNN and dropout regularization.

The discrimination functions $f_j$ for the WGCNN are founded on a radial basis function and are as given in Eq. (2), where $T_j$ refers to the training set belonging to class $j$.

$$f_j(x_i) = h\left(e^{-\frac{\lVert W^{T}(x_i - T_j)\rVert_2}{2\sigma^2}}\right) \tag{2}$$

The training phase of WGCNN learns the weight parameter through a gradient descent optimization algorithm. The weight determines the importance of a feature in the discrimination function and hence, in turn, for classification.

4. Weighted generalized classifier neural network

GCNN is a radial basis function-based classification neural network [17]. It comprises five layers, namely, input, pattern, summation, normalization, and output layer. The research work proposes a Weighted GCNN that embeds feature weighting as a part of training the neural network. The weights are then utilized for selecting the subset of features that help in classification problems. Fig. 2 depicts the architecture of WGCNN.

The input layer is the first layer, and it is in charge of feeding data $X \in \mathbb{R}^d$ into the neural network. The number of neurons in this layer is equal to the total number of features, $d$, in the dataset. The pattern layer comprises neurons each of which represents a different training pattern. This layer has the same number of neurons as the total number of training examples, $m$. The pattern layer computes the squared weighted Euclidean distance with the input as indicated in Eq. (3), where $T_j \in \mathbb{R}^d$ denotes the $j$th training pattern and $W \in \mathbb{R}^d$ is the weight vector.

$$dist(j) = \left\lVert W^{T}(X - T_j) \right\rVert_2, \quad 1 \le j \le m \tag{3}$$

Processing of the output from the pattern layer to the output layer follows a procedure similar to that in [17]. The pattern layer output is generated by utilizing a radial basis activation function as in Eq. (4). Every training pattern has an associated class vector $y$ to determine the class it belongs to. The computation of $y$ is as given in Eq. (5). The values 0.9 and 0.1 are chosen to avoid the stuck neuron problem during learning.

$$R(j) = e^{\left(\frac{-dist(j)}{2\sigma^2}\right)}, \quad 1 \le j \le m \tag{4}$$

$$y(j, i) = \begin{cases} 0.9 & \text{if } T_j \in i\text{th class}, \quad 1 \le i \le k \\ 0.1 & \text{otherwise}, \quad 1 \le j \le m \end{cases} \tag{5}$$

The summation layer has $k + 1$ neurons, where $k$ represents the total number of classes. To differentiate between the classes, a divergent term, computed using the exponential of the difference between $y$ and $y_{max}$, is used as in Eq. (6).

$$d(j, i) = e^{(y(j,i) - y_{max})} * y(j, i) \tag{6}$$

The value of $y_{max}$ is initialized to the maximum value of $y(j, i)$, i.e. 0.9, and is updated with every iteration during learning. The output of the pattern layer is multiplied with the respective divergent term and summed up at the summation layer. The output of each neuron in the summation layer, denoted by $S(i)$, is computed as given in Eq. (7).

$$S(i) = \sum_{j=1}^{m} R(j) * d(j, i), \quad 1 \le i \le k \tag{7}$$
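As a rough illustration, the computations in Eqs. (3)–(7) can be sketched in Python. This is a minimal sketch under stated assumptions, not the authors' MATLAB implementation: the function name `summation_layer` and its arguments are hypothetical, $W^{T}(X - T_j)$ is read as an element-wise weighting of the feature differences, and the distance is squared as the text describes.

```python
import math


def summation_layer(x, patterns, labels, w, sigma, n_classes, y_max=0.9):
    """Pattern- and summation-layer outputs for one input x, Eqs. (3)-(7).

    Hypothetical helper: patterns is a list of m training vectors, labels
    holds their class indices (0 .. n_classes - 1), w is the feature weight
    vector. Assumes element-wise weighting and a squared distance.
    """
    m = len(patterns)
    # Eq. (3): squared weighted Euclidean distance to every training pattern.
    dist = [sum((wi * (xi - ti)) ** 2 for wi, xi, ti in zip(w, x, t))
            for t in patterns]
    # Eq. (4): radial basis activation of the pattern layer.
    r = [math.exp(-dj / (2.0 * sigma ** 2)) for dj in dist]
    # Eq. (5): 0.9 for the pattern's own class, 0.1 otherwise.
    y = [[0.9 if labels[j] == i else 0.1 for i in range(n_classes)]
         for j in range(m)]
    # Eq. (6): divergent term.
    d = [[math.exp(y[j][i] - y_max) * y[j][i] for i in range(n_classes)]
         for j in range(m)]
    # Eq. (7): one summation neuron per class.
    s = [sum(r[j] * d[j][i] for j in range(m)) for i in range(n_classes)]
    return dist, r, s


# Toy input that coincides with the first training pattern (class 0).
dist, r, s = summation_layer(
    x=[0.0, 0.0],
    patterns=[[0.0, 0.0], [1.0, 1.0]],
    labels=[0, 1],
    w=[1.0, 1.0],
    sigma=1.0,
    n_classes=2,
)
print(dist, r, s)
```

Because the toy input matches a class-0 pattern, the class-0 summation neuron receives the larger value.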
The extra neuron is used for normalization of the output of neurons in the summation layer and its output is denoted by $D$.

$$D = \sum_{j=1}^{m} R(j) \tag{8}$$

The output of the summation layer is normalized in the normalization layer using the $D$ value. The output of the normalization layer is calculated as in Eq. (9).

$$N(i) = \frac{S(i)}{D}, \quad 1 \le i \le k \tag{9}$$

The largest value of the normalization layer neurons is utilized for classification. The last layer computes the winner class according to Eq. (10), where $o$ denotes the maximum value among all the output neurons of the normalization layer and $id$ is the winning class index.

$$[o, id] = max(N) \tag{10}$$

The GCNN learns the smoothing parameter $\sigma$ during the training phase. The method proposes to adopt gradient descent-based optimization for learning the weight parameter $W$ along with $\sigma$. The weights are optimized so that the features contributing to the discrimination of classes are assigned higher values. The non-discriminating features, which are not relevant for classification, are assigned minimum weights. Feature selection is then performed based on the weight values.

During training, for each input, the squared error $E$ is calculated as in Eq. (11), where $y(l, id)$ represents the value for the $l$th training input for the $id$th class and $N_{id}$ is the winner class value.

$$E = (y(l, id) - N_{id})^2 \tag{11}$$

According to [39], the first derivative of $E$ with respect to $\sigma$ is given in Eqs. (12)–(15).

$$\frac{\partial E}{\partial \sigma} = -2 * [(y(l, id) - N_{id})] * \frac{\partial N_{id}}{\partial \sigma} \tag{12}$$

$$\frac{\partial N_{id}}{\partial \sigma} = \frac{A(id) - B(id)}{\sigma^3 * D} \tag{13}$$

$$A(id) = \sum_{j=1}^{m} d(j, id) * R(j) * dist(j) \tag{14}$$

$$B(id) = \left[\sum_{j=1}^{m} R(j) * dist(j)\right] * N_{id} \tag{15}$$

The $\sigma$ value is then updated as in Eq. (16), where $\eta_\sigma$ is the learning rate for updating $\sigma$.

$$\sigma_{new} = \sigma_{old} - \eta_\sigma * \frac{\partial E}{\partial \sigma} \tag{16}$$

WGCNN comprises the weight vector $W = (w_1, w_2, \ldots, w_d)$. Each $w_i$ weights the difference in the respective features of the input and training sample. To minimize the classification error, the weights must be such that they discriminate the classes correctly. Similarly, the partial derivative of $E$ with respect to $w_i$ is given in Eqs. (17)–(19).

$$\frac{\partial E}{\partial w_i} = \frac{2 * [(y(l, id) - N_{id})] * w_i}{\sigma^2} * \left[\frac{U(id) - V(id)}{D}\right] \tag{17}$$
$$U(id) = \sum_{j=1}^{m} d(j, id) * R(j) * (x_i - t_{ij}) \tag{18}$$

$$V(id) = N_{id} * \left[\sum_{j=1}^{m} R(j) * (x_i - t_{ij})\right] \tag{19}$$

The weights are then updated as in Eq. (20), where $\eta_w$ is the learning rate for the weights.

$$w_i = w_{i(old)} - \eta_w * \frac{\partial E}{\partial w_i} \tag{20}$$

The algorithm for training of WGCNN is given in Algorithm 1.

Algorithm 1: Training of WGCNN
Input: training data, epoch, $\eta_\sigma$, $\eta_w$
Output: $\sigma$, $W$
1 while iteration ≤ epoch do
2 for each training data $T_j$ do
3 $dist(j) = \lVert W^{T}(X - T_j) \rVert_2$
4 $R(j) = e^{\left(\frac{-dist(j)}{2\sigma^2}\right)}$
5 for each class $i$ do
6 $d(j, i) = e^{(y(j,i) - y_{max})} * y(j, i)$
7 $S(i) = \sum_{j=1}^{m} R(j) * d(j, i)$
8 $D = \sum_{j=1}^{m} R(j)$
9 $N(i) = \frac{S(i)}{D}$
10 $[o, id] = max(N)$
11 $N_{max}(iteration) = N_{id}$
12 $E = (y(l, id) - N_{id})^2$
13 $y_{max} = max(N_{max})$
14 $\sigma_{new} = \sigma_{old} - \eta_\sigma * \frac{\partial E}{\partial \sigma}$
15 $w_i = w_{i(old)} - \eta_w * \frac{\partial E}{\partial w_i}$

The microarray gene expression data comprises far more features than samples. Thus, the WGCNN is prone to overfitting. To tackle the overfitting problem, we adopt dropout regularization at the input layer. Dropout refers to the removal of a neural network's hidden and/or visible units. Dropout is a regularization approach that approximates training a large number of neural networks with varied topologies in parallel. During training, a certain number of layer outputs are ignored or dropped at random. This causes the layer to appear and behave as if it were a layer with a different number of nodes and connections than the previous layer. WGCNN adopts dropout regularization at the input layer. The reason is that it helps generate a sparse feature input to the neural network. Instead of random dropout, statistically directed dropout is used to accomplish the feature selection objective. For each neuron in the input layer, which represents a feature, the variance across the $k$ classes is computed. The higher the variance, the better the feature is at discriminating among the classes. The variance value is used for filtering out the neurons during dropout. The dropout retains the neurons with the highest variance among the classes. Fig. 3 shows the flowchart for the proposed method. The blocks indicated in green and blue represent the processing steps involved in the dropout and training of WGCNN, respectively. The entire process can be summarized in the following steps.

1. Initialize the parameters $w$, $\alpha$, $\sigma$ and $Max_{itr}$.
2. Each record in the training set undergoes the following steps:
   (a) Drop features based on variance across the classes. Higher variance indicates that the features can differentiate among the classes.
   (b) The output of dropout is passed to the pattern layer for computation using Eqs. (3)–(5).
   (c) The summation layer computes $d$, $S$, and $D$ using Eqs. (6)–(8).
   (d) The normalization layer computes $N$ using Eq. (9).
   (e) Update $\sigma$ and $w$.
3. Go to step 2 if the number of iterations ≤ $Max_{itr}$; otherwise go to step 4.
4. Sort $w$ in decreasing order and choose the features with higher weights.

5. Experimental results and analysis

The performance of the WGCNN model is evaluated on synthetic data and seven microarray gene expression datasets. The weight and feature relevance relationship is explored using synthetic data. This section provides the experimental study on synthetic data, the dataset information employed in experiments, and describes in detail the findings of comparison with the state-of-the-art feature selection methods. The performance of selected genes on other classification models is analyzed to determine whether their performance is biased toward the embedded approach. The experiments for the proposed method are implemented in MATLAB on a 3.5 GHz i7 processor with 8 GB RAM. Feature miner [40] is used to obtain feature subsets for the existing information theoretic-based and sparse learning-based feature selection methods.

5.1. Synthetic data

The WGCNN learns weights for each of the features. The relevant features have higher weights. In order to validate this, a study on synthetic data is carried out. The synthetic dataset is generated with 50 samples and two features $f_1$ and $f_2$ for a binary classification as in [41]. Out of the two features, only the first feature, $f_1$, is relevant to the class. Initially, equal weights are assigned to all features. During the training phase, WGCNN learns weights with the objective of minimizing the error. The higher-weight features are then selected as the relevant features.

Ideally, the function learned by a classification model should exhibit a strong correlation with the class labels. The experiment is conducted to verify the relevance of the weights in the radial basis function with the class. For the synthetic data, the relevant feature $f_1$, when given a higher weight than the irrelevant feature $f_2$, must exhibit a higher correlation with the class labels.

Weights $w_1$, $w_2$ are assigned to $f_1$ and $f_2$ respectively. The effect of varying the weights on the correlation of $e^{\left(-\frac{(w_1 f_1 + w_2 f_2)}{2}\right)}$ (radial basis function of weighted input with $\sigma = 1$) with the class labels is depicted in Fig. 4. The $x$ and $y$ axes represent the features $f_1$ and $f_2$ respectively. The first plot shows an equal weight (1,1) for both features, and the observed distance correlation is 0.4971. Assigning a higher weight to the irrelevant feature $f_2$ and eliminating the relevant feature $f_1$ shows a lower correlation of 0.2355. The output value of the radial basis function is utilized for classification; therefore, discovering a correlation between it and the class label is important. It is observed in the plots that when the weights for the relevant features are increased, the correlation with the class label is improved. Thus, the idea of choosing the features with higher weights for a better classification performance is justified.

5.2. Microarray gene expression dataset

Experiments are performed on real-world datasets that include seven high-dimensional microarray gene datasets [42]. These are benchmark datasets that include binary and multi-class data. The Leukemia, Leukemia-4c, Leukemia-3c, and CNS datasets consist of a total of 7129 genes, and the Prostrate, Small-Round-Blue-Cell Tumor (SRBCT) and Brain datasets consist of 10 509, 2308, and 1070 genes respectively. The CNS dataset contains 2 classes with 21 survivor samples and 39 non-survivor samples. The Leukemia dataset comprises 2 classes with
Fig. 4. Effect of varying weights of relevant and irrelevant features on classification for the synthetic data.
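The experiment behind Fig. 4 can be approximated with a small script. The sketch below is illustrative only: the data are a hypothetical, deterministic stand-in for the synthetic dataset of [41] (so it will not reproduce the reported correlations), and the distance-correlation routine is a plain sample estimator written for this example, with all function names being assumptions.

```python
import math


def rbf_of_weighted_input(f1, f2, w1, w2):
    """e^(-(w1*f1 + w2*f2)/2): the Section 5.1 expression with sigma = 1."""
    return [math.exp(-(w1 * a + w2 * b) / 2.0) for a, b in zip(f1, f2)]


def _centered_distances(v):
    """Double-centered matrix of pairwise absolute differences."""
    n = len(v)
    a = [[abs(v[i] - v[j]) for j in range(n)] for i in range(n)]
    row = [sum(r) / n for r in a]
    grand = sum(row) / n
    return [[a[i][j] - row[i] - row[j] + grand for j in range(n)]
            for i in range(n)]


def distance_correlation(u, v):
    """Sample distance correlation between two 1-D sequences."""
    n = len(u)
    a, b = _centered_distances(u), _centered_distances(v)
    dcov2 = sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / n ** 2
    dvar2_u = sum(x * x for row in a for x in row) / n ** 2
    dvar2_v = sum(x * x for row in b for x in row) / n ** 2
    denom = math.sqrt(dvar2_u * dvar2_v)
    if denom == 0.0:
        return 0.0
    # Clamp tiny negative rounding errors before taking the square root.
    return math.sqrt(max(dcov2, 0.0) / denom)


# Hypothetical data: 50 samples, balanced binary labels.
labels = [0] * 25 + [1] * 25
# f1 is relevant: its range depends on the class (plus a small jitter).
f1 = [2.0 * c + (k % 5) / 10.0 for k, c in enumerate(labels)]
# f2 is irrelevant: both classes share exactly the same set of values.
f2 = [(k % 25) / 25.0 for k in range(50)]

for w1, w2 in [(1.0, 1.0), (1.0, 0.0), (0.0, 1.0)]:
    g = rbf_of_weighted_input(f1, f2, w1, w2)
    print((w1, w2), round(distance_correlation(g, labels), 4))
```

On this toy data, weighting only the relevant feature gives a much stronger correlation with the labels than weighting only the irrelevant one, which mirrors the qualitative trend reported for Fig. 4.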
Fig. 5. Comparison of the influence of feature selection approaches on classification for various datasets (a) CNS dataset (b) Leukemia dataset (c) Leukemia4c dataset (d) Leukemia3c
(e) Prostrate dataset (f) SRBCT (g) Brain dataset.
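The curves in Fig. 5 plot the F1-score against the number of selected features. A minimal sketch of the two building blocks of such an evaluation follows, assuming features are ranked by their learned weights and that the F1-score is macro-averaged over classes. The averaging scheme and the helper names `top_k_features` and `macro_f1` are assumptions, since this section does not state them.

```python
def top_k_features(weights, k):
    """Indices of the k largest weights, highest first."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return order[:k]


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1-scores."""
    scores = []
    for c in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical learned weights for five features and one sample row.
weights = [0.10, 0.90, 0.50, 0.05, 0.70]
selected = top_k_features(weights, 3)
sample = [11.0, 12.0, 13.0, 14.0, 15.0]
reduced = [sample[i] for i in selected]  # the sample restricted to the subset
print(selected, reduced)
print(macro_f1([0, 0, 1, 1], [0, 1, 1, 1]))
```

In an evaluation like the one summarized in Fig. 5, `k` would range from 1 to 50 and the reduced data would be passed to kNN, SVM, or NB before scoring.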
The average F1-scores obtained by varying the number of features picked from 1 to 50 are shown in Tables 2, 4 and 6. The outcomes show the performance of each learning model. The average F1-score across all datasets is shown in each table's last row. It can be observed that the method performs the best on 3 datasets. On the prostrate dataset, the difference between WGCNN and the best F1-score is 0.0042. Comparing the average F1-scores on the SVM model, the proposed approach performs the best with an average F1-score of 0.8531. The second-best-performing method is IGFS with an F1-score of 0.8061. Table 6 indicates the results on the NB approach. WGCNN performs the best on Leukemia,
Table 2
Average F1-score obtained by kNN on real-world datasets using various feature selection approaches.
Proposed CMIM DISR ICAP IGFS MRMR ll_l21
Prostrate 0.9912 0.9882 0.9992 0.9933 0.9883 0.9956 1.0000
Leukemia4c 0.9753 0.9694 0.9647 0.9694 0.9694 0.9681 0.9680
Brain 0.9124 0.9666 0.9674 0.9493 0.9494 0.9822 0.7923
Leukemia 0.9988 0.9996 1.0000 0.9996 0.9984 1.0000 1.0000
CNS 1.0000 1.0000 1.0000 1.0000 0.9851 1.0000 0.9982
Leukemia3c 1.0000 1.0000 0.9984 1.0000 1.0000 1.0000 1.0000
SRBCT 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 0.9970
Average 0.9825 0.9891 0.9900 0.9874 0.9844 0.9923 0.9651
Table 3
p-values derived using statistical testing (one-tailed t-test) for kNN.
Proposed CMIM DISR ICAP IGFS MRMR ll_l21
Prostrate 2.82E−08 6.61E−13 2.22E−02 1.04E−14 1.76E−07 2.27E−08 –
Leukemia4c – 5.05E−05 1.60E−07 5.05E−05 4.38E−05 1.47E−04 3.90E−05
Brain 7.32E−16 8.71E−05 3.20E−07 9.97E−10 2.81E−12 – 3.57E−10
Leukemia 6.38E−03 7.97E−02 – 7.97E−02 1.82E−03 – –
CNS – – – – 1.28E−07 – 2.22E−02
Leukemia3c – – 1.61E−01 – – – –
SRBCT – – – – – – 4.16E−02
Table 4
Average F1-score obtained by SVM on real-world datasets using various feature selection approaches.
Proposed CMIM DISR ICAP IGFS MRMR ll_l21
Prostrate 0.8996 0.5097 0.8908 0.9038 0.8903 0.8960 0.6527
Leukemia4c 0.8445 0.6093 0.6078 0.6093 0.7537 0.5528 0.6429
Brain 0.8853 0.9829 0.9655 0.9759 0.9380 0.9129 0.8482
Leukemia 0.9527 0.7833 0.8014 0.7833 0.8639 0.7751 0.7978
CNS 0.5400 0.3497 0.3796 0.3377 0.3324 0.3066 0.3596
Leukemia3c 0.9216 0.9347 0.9482 0.9175 0.9031 0.9611 0.4549
SRBCT 0.9283 0.9725 0.9866 0.9747 0.9615 0.9816 0.6208
Average 0.8531 0.7346 0.7971 0.7860 0.8061 0.7694 0.6253
Table 5
p-values derived using statistical testing (one-tailed t-test) for SVM.
Proposed CMIM DISR ICAP IGFS MRMR ll_l21
Prostrate 1.34E−01 1.96E−48 2.03E−02 – 2.32E−02 5.37E−03 2.43E−19
Leukemia4c – 3.91E−30 6.46E−26 3.91E−30 3.16E−13 5.20E−33 5.88E−23
Brain 2.81E−13 – 9.92E−04 6.58E−02 1.66E−12 1.99E−16 3.34E−16
Leukemia – 1.45E−34 6.16E−32 1.45E−34 3.73E−14 4.66E−39 1.75E−33
CNS – 2.99E−02 1.48E−01 1.87E−02 4.08E−04 8.49E−05 6.45E−02
Leukemia3c 1.73E−05 1.06E−13 2.43E−04 5.40E−24 1.76E−18 – 1.29E−51
SRBCT 1.20E−09 3.98E−04 – 1.36E−04 3.92E−06 7.26E−02 5.03E−22
CNS, and Leukemia3c datasets. MRMR demonstrates the highest F1-score of 0.8772 on the prostrate dataset. WGCNN performs closely with an F1-score of 0.8676.

A statistical one-tailed t-test is used to compare the performance of all the feature selection approaches. The null hypothesis is that the means of the two methods are equal. According to the alternative hypothesis, the mean of the technique is higher than that of the method being compared against. The statistical significance is calculated at a significance level of 0.05. The $p$-values obtained when comparing the best technique to the other methods are shown in Tables 3, 5 and 7. The best-performing method in terms of the highest average F1-score is indicated with '–'. According to statistical tests, the performance difference between WGCNN and the best-performing method on NB and SVM is insignificant. Similarly for NB, WGCNN performs at par with IGFS on the leukemia4c dataset.

Fig. 6 compares all the methods in terms of the highest F1-score achieved and the corresponding Number of Features Selected (NFS) for kNN, SVM, and NB. Fig. 6(a) depicts the performance on the CNS dataset. The proposed method on SVM achieves the highest F1-score of 0.7131 with an NFS of 12. The second-best-performing method achieves an F1-score of 0.5882 with an NFS of 32. The highest F1-score for NB is achieved by MRMR with an NFS of 44. WGCNN performs second best with an F1-score of 0.6667 and an NFS of 7. Moving on to the Leukemia dataset, in Fig. 6(b), for kNN, SVM, and NB, WGCNN depicts a better highest F1-score with the least NFS. Fig. 6(c) presents the comparison for the Leukemia-4c dataset. WGCNN shows the highest F1-score of 0.9302 with 34 NFS, followed by IGFS with an F1-score of 0.8732 with 48 NFS. In Fig. 6(d), MRMR, DISR, and WGCNN achieve F1-scores of 0.9886, 0.9886, and 0.9712 with NFS of 20, 41, and 26 respectively. In Fig. 6(e), IGFS depicts the highest F1-score of 0.9417 with 6 NFS on SVM, while WGCNN's highest F1-score is 0.9231 with 14 NFS. On NB, WGCNN achieves the highest F1-score of 0.9091 but with the least NFS in comparison to DISR, IGFS, ICAP, and MRMR. In the SRBCT dataset, Fig. 6(f), the proposed method achieves the same highest F1-score, but with a higher NFS. It is observed that MRMR performs well on the SRBCT dataset. The MRMR method, in addition to selecting relevant features, also eliminates redundant features. The proposed method does not incorporate the elimination of redundant features. Thus, in the presence of redundant features, the NFS is higher in the proposed approach. CMIM attains better performance in terms of the highest F1-score and NFS for the brain dataset, Fig. 6(g).

Table 8 shows a comparison of performance with and without feature selection. The performance of the three models without feature selection and the highest F1-score obtained after applying WGCNN are tabulated. The table shows an improvement in performance after applying the proposed feature selection method on all the datasets except for Leukemia4c. WGCNN, however, shows a large reduction in
Fig. 6. Comparison of highest average F1-score and Number of Features Selected (NFS) for various datasets (a) CNS dataset (b) Leukemia dataset (c) Leukemia4c dataset (d)
Leukemia3c (e) Prostrate dataset (f) SRBCT (g) Brain dataset.
Table 6
Average F1-score obtained by NB on real-world datasets using various feature selection approaches.
Proposed CMIM DISR ICAP IGFS MRMR ll_l21
Prostrate 0.8676 0.4139 0.8757 0.8746 0.8754 0.8772 0.3999
Leukemia4c 0.7731 0.6416 0.6196 0.6416 0.7862 0.5044 0.6110
Brain 0.8812 0.9649 0.9644 0.9578 0.9667 0.8902 0.7698
Leukemia 0.9782 0.6787 0.8291 0.6787 0.8474 0.8223 0.7263
CNS 0.6308 0.4245 0.4816 0.4245 0.4201 0.4948 0.4747
Leukemia3c 0.9357 0.9181 0.8654 0.9181 0.8892 0.9079 0.5550
SRBCT 0.9113 0.9503 0.9660 0.9488 0.9634 0.9670 0.7162
Average 0.8540 0.7131 0.8003 0.7777 0.8212 0.7805 0.6076
Table 7
p-values derived using statistical testing (one-tailed t-test) for NB.
Proposed CMIM DISR ICAP IGFS MRMR ll_l21
Prostrate 6.89E−02 7.91E−41 3.89E−01 2.47E−01 3.02E−01 – 5.75E−43
Leukemia4c 8.32E−02 1.77E−18 4.61E−21 1.77E−18 – 1.03E−25 6.37E−25
Brain 2.02E−12 4.45E−01 1.61E−01 2.49E−01 – 1.25E−14 4.53E−18
Leukemia – 9.64E−38 1.08E−20 9.64E−38 4.38E−17 5.87E−33 1.93E−22
CNS – 1.92E−18 4.05E−19 1.92E−18 4.48E−23 2.41E−08 3.77E−11
Leukemia3c – 4.04E−02 1.55E−11 4.04E−02 2.70E−13 1.10E−02 6.62E−40
SRBCT 9.34E−05 3.15E−03 2.56E−01 9.99E−04 2.23E−01 – 1.17E−12
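The p-values in Tables 3, 5 and 7 come from one-tailed t-tests at the 0.05 level. The exact variant of the test is not stated here, so the following is only a sketch assuming Welch's unequal-variance test, with the upper-tail probability obtained by numerically integrating the Student t density. The F1 values are made up for illustration, and the function names are hypothetical.

```python
import math


def welch_t(a, b):
    """Welch t statistic and degrees of freedom for H1: mean(a) > mean(b)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    sa, sb = va / na, vb / nb  # squared standard errors (must not both be 0)
    t = (ma - mb) / math.sqrt(sa + sb)
    df = (sa + sb) ** 2 / (sa ** 2 / (na - 1) + sb ** 2 / (nb - 1))
    return t, df


def t_pdf(x, df):
    """Density of the Student t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log(1 + x * x / df))


def upper_tail_p(t, df, steps=2000):
    """P(T > t) via Simpson's rule on [0, |t|]; steps must be even."""
    if t == 0:
        return 0.5
    h = abs(t) / steps
    total = t_pdf(0.0, df) + t_pdf(abs(t), df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    tail = min(max(0.5 - total * h / 3, 0.0), 0.5)
    return tail if t > 0 else 1.0 - tail


# Made-up F1-scores of two methods over repeated runs (illustration only).
best = [0.95, 0.96, 0.97, 0.96, 0.95]
other = [0.80, 0.82, 0.81, 0.79, 0.80]
t_stat, dof = welch_t(best, other)
p_value = upper_tail_p(t_stat, dof)
print(t_stat, dof, p_value, p_value < 0.05)
```

A p-value below 0.05 rejects the null hypothesis of equal means in favour of the first method having the higher mean; a larger value corresponds to the "insignificant" differences discussed in the text.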
Table 8
Comparison between highest F1-score using WGCNN and all features.
            SVM             NB              kNN
Dataset     All     WGCNN   All     WGCNN   All     WGCNN
Prostrate   0.9216  0.9231  0.3571  0.9091  0.8021  1.0000
Leukemia4c  0.9409  0.9302  0.8819  0.8154  0.7359  0.9861
Brain       0.9630  1.0000  1.0000  1.0000  0.3556  1.0000
Leukemia    0.9409  0.9691  0.9677  1.0000  0.9108  1.0000
CNS         0.4177  0.7131  0.6154  0.6667  0.5296  1.0000
Leukemia3c  0.9512  0.9712  0.7082  0.9744  0.7797  1.0000
SRBCT       0.9734  0.9911  0.9706  1.0000  0.8383  1.0000
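The comparison in Table 8 can be summarized per learning model with a short script. The sketch below transcribes the table values and counts the datasets on which the WGCNN-selected subset matches or exceeds the F1-score obtained with all features. The name `summarize` and the choice to count ties as "comparable" are assumptions of this example, not part of the paper.

```python
# F1-scores transcribed from Table 8 as (all features, WGCNN-selected features).
TABLE_8 = {
    "SVM": {
        "Prostrate": (0.9216, 0.9231), "Leukemia4c": (0.9409, 0.9302),
        "Brain": (0.9630, 1.0000), "Leukemia": (0.9409, 0.9691),
        "CNS": (0.4177, 0.7131), "Leukemia3c": (0.9512, 0.9712),
        "SRBCT": (0.9734, 0.9911),
    },
    "NB": {
        "Prostrate": (0.3571, 0.9091), "Leukemia4c": (0.8819, 0.8154),
        "Brain": (1.0000, 1.0000), "Leukemia": (0.9677, 1.0000),
        "CNS": (0.6154, 0.6667), "Leukemia3c": (0.7082, 0.9744),
        "SRBCT": (0.9706, 1.0000),
    },
    "kNN": {
        "Prostrate": (0.8021, 1.0000), "Leukemia4c": (0.7359, 0.9861),
        "Brain": (0.3556, 1.0000), "Leukemia": (0.9108, 1.0000),
        "CNS": (0.5296, 1.0000), "Leukemia3c": (0.7797, 1.0000),
        "SRBCT": (0.8383, 1.0000),
    },
}


def summarize(scores):
    """Return (#datasets where WGCNN >= all features, mean F1 change)."""
    diffs = [selected - full for full, selected in scores.values()]
    wins = sum(d >= 0 for d in diffs)  # ties count as comparable
    return wins, sum(diffs) / len(diffs)


for model, scores in TABLE_8.items():
    wins, change = summarize(scores)
    print(f"{model}: WGCNN >= all features on {wins}/{len(scores)} datasets, "
          f"mean F1 change {change:+.4f}")
```

On these values, Leukemia4c is the only dataset where the selected subset scores lower than the full feature set, and only for SVM and NB.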
feature set size from 7129 to 34 and 20 for Leukemia4c on SVM and NB, respectively.

5.2.2. Discussion
A comparative analysis of six feature selection methods on seven different microarray datasets is performed, with the F1-score chosen to evaluate performance. The plot of F1-score against the number of features, varied from 1 to 50, in Fig. 5 indicates that the proposed method performs best on the majority of the datasets. However, it does not achieve the best performance on the Brain dataset. On further analysis, it is found that this dataset has fewer instances, which leads to smaller training data. The proposed method classifies by separating the classes using the radial basis function. This is possible when each class has enough training samples to learn the parameters.
The proposed method falls under the category of embedded feature selection. Features selected by an embedded approach can fail to generalize to other learning models. We therefore validate the features selected by the proposed method on three learning models, namely SVM, NB, and kNN. The comparison is further supported by statistical testing. The method performs well on the majority of the datasets with all three learning models. Thus, the selected features generalize well to other learning models.
The next analysis compares the number of features selected and the performance together. For every dataset, performance on kNN, SVM, and NB is evaluated. It is observed that the proposed method performs best in either one or both metrics. The comparison between all features and the selected features indicates that better or comparable performance can be achieved even with 2% of the original feature set size.

6. Conclusion
The paper proposes WGCNN for feature selection and applies it to microarray gene expression datasets. The proposed feature selection method is based on GCNN. The learning phase uses gradient descent to learn 𝜎 and the weights simultaneously. The learned weights are then used to determine the importance of features in classification. The method is applied to seven microarray datasets. The F1-score is measured for the selected subset on SVM, NB, and kNN. WGCNN performs well in comparison to six popular feature selection methods. WGCNN is a step towards an explainable model for feature selection using a neural network. The results on the Brain dataset indicate that WGCNN relies on the training phase, and hence a lack of training samples can lead to a degradation in performance.
Cancer-causing genes usually interact in groups. Group feature selection selects relevant features in groups, making it a suitable approach for gene selection. As part of future work, we can explore group feature selection using WGCNN. The selected set of genes can further be verified using biological domain knowledge. Another future direction is the selection of non-redundant features; the redundancy of features selected by WGCNN has not been explored in this work.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References [24] B. Krishnapuram, L. Carin, M.A. Figueiredo, A.J. Hartemink, Sparse multinomial
logistic regression: Fast algorithms and generalization bounds, IEEE Trans. Pat-
[1] J. Li, K. Cheng, S. Wang, F. Morstatter, R.P. Trevino, J. Tang, H. Liu, Feature tern Anal. Mach. Intell. 27 (2005) 957–968, http://dx.doi.org/10.1109/TPAMI.
selection: A data perspective, ACM Comput. Surv. 50 (6) (2017) 94:1–94:45, 2005.127.
http://dx.doi.org/10.1145/3136625. [25] M. Yuan, Y. Lin, Model selection and estimation in regression with grouped
[2] J.C. Ang, A. Mirzal, H. Haron, H.N.A. Hamed, Supervised, unsupervised, and variables, J. R. Stat. Soc. Ser. B Stat. Methodol. 68 (2006) 49–67, http://dx.doi.
semi-supervised feature selection: A review on gene selection, IEEE/ACM Trans. org/10.1111/j.1467-9868.2005.00532.x.
Comput. Biol. Bioinform. 13 (2016) 971–989, http://dx.doi.org/10.1109/TCBB. [26] L. Meier, S. Van De Geer, P. Bühlmann, The group lasso for logistic regression,
2015.2478454. J. R. Stat. Soc. Ser. B Stat. Methodol. 70 (1) (2008) 53–71.
[3] V. Bolón-Canedo, N. Sánchez-Maroño, A. Alonso-Betanzos, J.M. Benítez, F. [27] N. Simon, J. Friedman, T. Hastie, R. Tibshirani, A sparse-group lasso, J. Comput.
Herrera, A review of microarray datasets and applied feature selection methods, Graph. Statist. 22 (2013) 231–245, http://dx.doi.org/10.1080/10618600.2012.
Inform. Sci. 282 (2014) 111–135, http://dx.doi.org/10.1016/J.INS.2014.05.042. 681250.
[4] J. Gui, Z. Sun, S. Ji, D. Tao, T. Tan, Feature selection based on structured [28] K. Fang, X. Wang, S. Zhang, J. Zhu, S. Ma, Bi-level variable selection via
sparsity: A comprehensive study, IEEE Trans. Neural Netw. Learn. Syst. 28 (7) adaptive sparse group lasso, J. Stat. Comput. Simul. 85 (13) (2014) 2750–2760,
(2016) 1490–1507. http://dx.doi.org/10.1080/00949655.2014.938241.
[5] Y. Wang, X. Li, R. Ruiz, Weighted general group lasso for gene selection in [29] M. Vincent, N.R. Hansen, Sparse group lasso and high dimensional multinomial
cancer classification, IEEE Trans. Cybern. 49 (2019) 2860–2873, http://dx.doi. classification, Comput. Statist. Data Anal. 71 (2014) 771–786, http://dx.doi.org/
org/10.1109/TCYB.2018.2829811. 10.1016/j.csda.2013.06.004.
[6] K.Y. Aram, S.S. Lam, M.T. Khasawneh, Linear cost-sensitive max-margin [30] J. Li, Y. Wang, T. Jiang, H. Xiao, X. Song, Grouped gene selection and multi-
embedded feature selection for SVM, Expert Syst. Appl. 197 (2022) 116683. classification of acute leukemia via new regularized multinomial regression, Gene
[7] H. Liu, M. Zhou, Q. Liu, An embedded feature selection method for imbalanced 667 (2018) 18–24, http://dx.doi.org/10.1016/j.gene.2018.05.012.
data classification, IEEE/CAA J. Autom. Sin. 6 (3) (2019) 703–715, http://dx. [31] Y. Li, C.-Y. Chen, W.W. Wasserman, Deep feature selection: theory and appli-
doi.org/10.1109/JAS.2019.1911447. cation to identify enhancers and promoters, J. Comput. Biol. 23 (5) (2016)
[8] D. Zhang, S. Lou, The application research of neural network and BP algorithm 322–336.
in stock price pattern classification and prediction, Future Gener. Comput. Syst. [32] J. Liu, S. Ji, J. Ye, Multi-task feature learning via efficient l2, l1-norm
115 (2021) 872–879. minimization, 2012.
[9] G.-G. Wang, M. Lu, Y.-Q. Dong, X.-J. Zhao, Self-adaptive extreme learning [33] S. Ainsworth, N. Foti, A.K. Lee, E. Fox, Interpretable VAEs for nonlinear group
machine, Neural Comput. Appl. 27 (2) (2016) 291–303. factor analysis, 2018, arXiv:1802.06765.
[10] Z. Cui, F. Xue, X. Cai, Y. Cao, G.-g. Wang, J. Chen, Detection of malicious [34] I. Lemhadri, F. Ruan, L. Abraham, R. Tibshirani, Lassonet: A neural network
code variants based on deep learning, IEEE Trans. Ind. Inform. 14 (7) (2018) with feature sparsity, J. Mach. Learn. Res. 22 (1) (2021) 5633–5661.
3187–3196. [35] L. Zhao, Q. Hu, W. Wang, Heterogeneous feature selection with multi-modal
[11] N.M. ud din, R.A. Dar, M. Rasool, A. Assad, Breast cancer detection using deep neural networks and sparse group lasso, IEEE Trans. Multimed. 17 (11)
deep learning: Datasets, methods, and challenges ahead, Comput. Biol. Med. 149 (2015) 1936–1948.
(2022) 106073, http://dx.doi.org/10.1016/j.compbiomed.2022.106073. [36] S. Scardapane, D. Comminiello, A. Hussain, A. Uncini, Group sparse
[12] V.C. Dinh, L.S. Ho, Consistent feature selection for analytic deep neural networks, regularization for deep neural networks, Neurocomputing 241 (2017) 81–89.
Adv. Neural Inf. Process. Syst. 33 (2020) 2420–2431. [37] H. Zhang, J. Wang, Z. Sun, J.M. Zurada, N.R. Pal, Feature selection for neural
[13] B. He, W. Hu, K. Zhang, S. Yuan, X. Han, C. Su, J. Zhao, G. Wang, G. Wang, L. networks using group lasso regularization, IEEE Trans. Knowl. Data Eng. 32 (4)
Zhang, Image segmentation algorithm of lung cancer based on neural network (2019) 659–673.
model, Expert Syst. 39 (3) (2022) e12822. [38] J. Feng, N. Simon, Sparse-input neural networks for high-dimensional
[14] K. Hu, L. Zhao, S. Feng, S. Zhang, Q. Zhou, X. Gao, Y. Guo, Colorectal nonparametric regression and classification, 2017, arXiv:1711.07592.
polyp region extraction using saliency detection network with neutrosophic [39] T. Masters, W. Land, A new training algorithm for the general regression neural
enhancement, Comput. Biol. Med. 147 (2022) 105760. network, in: Proceedings of IEEE International Conference on Systems, Man, and
[15] M. Sepahvand, F. Abdali-Mohammadi, Joint learning method with teacher– Cybernetics. Computational Cybernetics and Simulation, Vol. 3, IEEE, Orlando,
student knowledge distillation for on-device breast cancer image classifica- FL, USA, 1997, pp. 1990–1994.
tion, Comput. Biol. Med. 155 (2023) 106476, http://dx.doi.org/10.1016/j. [40] K. Cheng, J. Li, H. Liu, FeatureMiner: A tool for interactive feature selection,
compbiomed.2022.106476. in: Proceedings of International Conference on Information and Knowledge
[16] Z. Xu, Y. Wang, M. Chen, Q. Zhang, Multi-region radiomics for artificially Management, Association for Computing Machinery, New York, United States,
intelligent diagnosis of breast cancer using multimodal ultrasound, Comput. 2016, pp. 2445–2448, http://dx.doi.org/10.1145/2983323.2983329.
Biol. Med. 149 (2022) 105920, http://dx.doi.org/10.1016/j.compbiomed.2022. [41] I. Kamkar, S.K. Gupta, D. Phung, S. Venkatesh, Stable feature selection with sup-
105920. port vector machines, in: Australasian Joint Conference on Artificial Intelligence,
[17] B.M. Ozyildirim, M. Avci, Generalized classifier neural network, Neural Netw. Springer, Canberra, Australia, 2015, pp. 298–308.
39 (2013) 18–26, http://dx.doi.org/10.1016/J.NEUNET.2012.12.001. [42] Z. Zhu, Y.-S. Ong, M. Dash, Markov blanket-embedded genetic algorithm for
[18] R. Tibshirani, Regression shrinkage and selection via the lasso, J. R. Stat. Soc. gene selection, Pattern Recognit. 40 (11) (2007) 3236–3248, http://dx.doi.org/
Ser. B Stat. Methodol. 58 (1996) 267–288, http://dx.doi.org/10.1111/j.2517- 10.1016/J.PATCOG.2007.02.007.
6161.1996.tb02080.x. [43] M. Vidal-Naquet, S. Ullman, Object recognition with informative features and lin-
[19] H. Zou, T. Hastie, Regularization and variable selection via the elastic net, J. R. ear classification, in: Proceedings 9th IEEE International Conference on Computer
Stat. Soc. Ser. B Stat. Methodol. 67 (2) (2005) 301–320, http://dx.doi.org/10. Vision, Vol. 3, Nice, France, 2003, pp. 281–288.
1111/J.1467-9868.2005.00503.X. [44] A. Jakulin, A.I. Bratko, Machine learning based on attribute interactions PhD
[20] O. Arslan, Weighted LAD-LASSO method for robust parameter estimation dissertation, 2005.
and variable selection in regression, Comput. Statist. Data Anal. 56 (2012) [45] F. Fleuret, Fast binary feature selection with conditional mutual information, J.
1952–1965, http://dx.doi.org/10.1016/j.csda.2011.11.022. Mach. Learn. Res. 5 (9) (2004) 1531–1555.
[21] W. Yang, Y. Gao, Y. Shi, L. Cao, MRM-lasso: A sparse multiview feature selection [46] D.D. Lewis, Feature selection and feature extraction for text categorization, in:
method via low-rank analysis, IEEE Trans. Neural Netw. Learn. Syst. 26 (11) Proceedings of the Workshop on Speech and Natural Language, Harriman New
(2015) 2801–2815, http://dx.doi.org/10.1109/TNNLS.2015.2396937. York, 1992, pp. 212–217.
[22] Y. Zheng, C.K. Keong, A feature subset selection method based on high- [47] C. Ding, H. Peng, Minimum redundancy feature selection for microarray gene
dimensional mutual information, Entropy 13 (2011) 860–901, http://dx.doi.org/ expression data, J. Bioinform. Comput. Biol. 3 (2011) 185–205, http://dx.doi.
10.3390/E13040860, 2011, Vol. 13, Pages 860-901. org/10.1142/S0219720005001004.
[23] G.C. Cawley, N.L. Talbot, Gene selection in cancer classification using sparse
logistic regression with Bayesian regularization, Bioinformatics 22 (2006)
2348–2355, http://dx.doi.org/10.1093/bioinformatics/btl386.