
Computers in Biology and Medicine 168 (2024) 107677


An embedded feature selection method based on generalized classifier neural network for cancer classification

Akshata K. Naik ∗, Venkatanareshbabu Kuppili
Department of Computer Science and Engineering, National Institute of Technology, Farmagudi, Ponda, Goa, India

ARTICLE INFO

Keywords: Embedded feature selection; Generalized classifier neural network; Explainable model

ABSTRACT

The selection of relevant genes plays a vital role in classifying high-dimensional microarray gene expression data. Sparse group Lasso and its variants have been employed for gene selection to capture the interactions of genes within a group. Most embedded methods are linear sparse learning models that fail to capture non-linear interactions, and little attention has been given to multi-class problems. The existing methods create overlapping groups, which further increases dimensionality. This paper proposes a neural network-based embedded feature selection method that can represent non-linear relationships. In an effort toward an explainable model, a generalized classifier neural network (GCNN) is adopted as the model for the proposed embedded feature selection. GCNN has a well-defined architecture in terms of the number of layers and the number of neurons within each layer, and each layer has a distinct functionality, eliminating the obscure nature of most neural networks. The paper proposes a feature selection approach called Weighted GCNN (WGCNN) that embeds feature weighting as a part of training the neural network. Since gene expression data comprises a large number of features, a statistically guided dropout is applied at the input layer to avoid overfitting. The proposed method works for binary as well as multi-class classification problems. Experimental validation is carried out on seven microarray datasets with three learning models and compared against six state-of-the-art feature selection methods. WGCNN performs well in terms of both the F1 score and the number of features selected.

1. Introduction

The rapid growth of data in domains such as bioinformatics, social media, and healthcare has given rise to challenges in managing the data efficiently. The number of input variables needs to be reduced to lower the computational cost of modeling and, in some situations, to increase the model's performance. Hence, dimensionality reduction techniques have gained the attention of researchers lately. Dimensionality reduction techniques can be further classified into feature selection and feature extraction. Feature selection, unlike feature extraction, retains the original meaning of the features and is hence preferred in genetic analysis in bioinformatics [1].

Gene expression data is high-dimensional data with a large number of genes, which are referred to as features in machine learning models. Although gene expression data is characterized by high dimensionality, it includes irrelevant, redundant, and noisy features that are unimportant in disease diagnosis. Hence, feature selection plays a key role in avoiding the overfitting of models and improving the accuracy of the cancer classification model [2].

Feature selection methods are classified into four categories based on their approach: filter, wrapper, embedded, and hybrid [3]. Filter algorithms select features using individual characteristics of the features. The wrapper approach uses a specific machine learning algorithm, often with an evolutionary search strategy, to select a feature subset. Filter methods are characterized by very fast computation, whereas wrapper approaches achieve better accuracy at a higher computational cost. Due to their speed, filter-based methods are preferred over wrapper methods in domains with large datasets. However, the chosen features may underperform on the model because the filter approach does not involve interaction with the learning model. The embedded method traces an in-between path, combining the advantages of both filter and wrapper. A hybrid method is a combination of two or more of the previously mentioned approaches. Therefore, the paper focuses on embedded feature selection.

Embedded feature selection is a machine learning model with a built-in feature selection mechanism, such as ID3 and C4.5, or a regularization model with an objective function that minimizes fitting errors while forcing the coefficients to be small or exactly zero. Features with coefficients close to zero are then eliminated. The overview of an embedded approach is depicted in Fig. 1.

∗ Corresponding author.
E-mail address: akshata.naik@nitgoa.ac.in (A.K. Naik).

https://doi.org/10.1016/j.compbiomed.2023.107677
Received 2 June 2022; Received in revised form 26 October 2023; Accepted 6 November 2023
Available online 8 November 2023

Fig. 1. Embedded feature selection approach.
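To make the coefficient-based elimination concrete, the following minimal sketch (our illustration of the general embedded idea, not the WGCNN method proposed in this paper; the data, regularization strength, and threshold are hypothetical) trains an L1-regularized linear classifier and keeps only the features whose coefficients survive the shrinkage:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 500))                       # 60 samples, 500 "genes"
y = (X[:, 0] - 2.0 * X[:, 1] > 0).astype(int)        # only two informative features

# L1 penalty shrinks irrelevant coefficients toward zero during training,
# so selection is embedded in the fitting step itself.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
selected = np.flatnonzero(np.abs(clf.coef_[0]) > 1e-6)
print("features kept:", selected)                    # typically includes columns 0 and 1
```

WGCNN follows the same embedded principle but replaces the linear model in this sketch with a GCNN whose per-feature weights are learned during training.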

The machine learning module along with an optimization technique outputs a hypothesis that is used in feature selection. In this approach, the dataset is initially divided into training and testing sets. The training set is utilized by an optimization algorithm to learn an accurate set of parameters for a model. The learned parameters are used to determine the importance of features. The testing set is then used to evaluate the selected set of features.

Most of the embedded methods incorporate regression as a constraint in existing learning models to achieve a sparse solution. Many popular embedded methods utilize the L1, L2, or L2,1 norm, or regularized sparse multinomial logistic regression with a penalty [4]. Sparse regularizers try to learn a model by minimizing the fitting errors while simultaneously shrinking the coefficients to zero or near zero; the output is a model along with the set of selected features. Sparse regularizers have been adopted for grouped gene selection by constructing weights for gene groups and genes based on the actual gene expression values. Most of these methods do not capture gene interaction information, thereby leading to the selection of biologically unrelated genes. Wang and Li [5] further designed a gene and group weight calculation method based on Joint Mutual Information (JMI) that considers biological relations; however, the Weighted General Group Lasso (WGGL) in [5] works only for two-class problems. The existing embedded methods are based on learning models such as SVM, linear regression, and decision trees [6,7].

An Artificial Neural Network (ANN), also known as a neural network, is a mathematical model inspired by how biological nervous systems, such as the brain, process information. A neural network is an interconnected group of simulated neurons that processes information using a connectionist approach. Neural networks can easily model simple and complex relationships that are difficult to capture in linear or additive models. They can also be used to identify patterns and clusters in data [8–10]. A comprehensive review of earlier research using deep learning and reinforcement learning for breast cancer detection and categorization is conducted by Nusrat et al. [11]; that work also examines the publicly accessible datasets for various imaging modalities. Through a learning process, an ANN can be designed for a specific application, such as data classification and pattern categorization. Given these characteristics, the paper proposes an embedded feature selection method based on a neural network model.

Although neural networks such as deep learning models are widely and successfully applied for classification, the obscure nature of the model provides little insight into how the classification is performed [12]. Binjun et al. [13] created a model for the use of neural network algorithms in cancer diagnosis and treatment. A Convolutional Neural Network (CNN) based saliency detection network is presented to address the problem of colorectal polyp region extraction [14]. Majid and Fardin [15] present a lightweight, effective model for the categorization of breast cancer histopathology images using knowledge distillation. A multi-region radiomics strategy with multimodal ultrasound is suggested by Zhou et al. [16] for artificially intelligent diagnosis of aggressive and benign breast tumors. The growing application of machine learning models to decision-making has, in turn, led to a need for explainable learning models. Stakeholders and decision-makers tend to trust a model that is transparent about how its decisions are arrived at. Dinh et al. [12] state that feature selection is one of the prime steps towards the explainability of models. We propose a feature selection method that not only accounts for the non-linear interaction of features using a neural network but is also explainable. The proposed method is based on GCNN [17], which, unlike most ANN and deep learning models, avoids free parameters in the architecture settings. GCNN provides clear guidelines on the number of layers and the number of neurons in each layer, thus making it a suitable candidate for an explainable model. The contributions of the paper are summarized as follows:

• An embedded feature selection method, WGCNN, is proposed that captures non-linear relationships efficiently using a neural network.
• To make the model explainable, the proposed method embeds feature selection into GCNN, which has a clear interpretation of the number and functionality of its layers.
• A statistical method for guided dropout at the input layer is adopted to counter overfitting of the model.
• The proposed method works efficiently for binary and multi-class classification problems and does not require any special handling for the multi-class scenario.
• Experimental validation has been carried out on microarray gene expression data along with statistical testing and comparative analysis with popular feature selection methods.

1.1. Motivation

Microarray technology finds its application in the health sector to study and investigate gene expression levels to diagnose cancer. Gene expression data contains many genes that are irrelevant or unimportant for cancer diagnosis; feature selection thus helps in choosing a relevant set of genes for cancer classification. Embedded feature selection has a distinctive advantage over filter and wrapper approaches in terms of model interaction and faster computation. One prominent technique employed in the embedded approach is sparse learning-based models that eliminate features with a lower score or weight. Linear sparse learning models are inefficient in capturing the non-linear interaction of features. Therefore, the paper proposes a sparse learning model based on GCNN that can model a non-linear relationship. The weights for the features are learned while training the GCNN model, and GCNN is chosen to make the model more explainable. The proposed approach is called WGCNN as it includes the learning of weights in GCNN.

The remainder of the paper is organized as follows. Section 2 discusses related work found in the literature. The problem statement is presented in Section 3. The proposed model is explained in detail in Section 4. The experimental findings and the analysis are discussed in Section 5. Concluding remarks are given in Section 6.

2. Related work

Sparse regularizers have been widely used in gene selection methods. LASSO and its extensions using an L1 norm penalty in regression have been employed in [18–22]. Cawley et al. [23] and Krishnapuram et al. [24] designed sparse logistic regression and sparse multinomial logistic regression, respectively, using Bayesian regularization. From a biological point of view, cancer-causing genes usually interact with each other in groups. However, these methods do not account for the group interaction of the genes.

The group lasso [25] and logistic group lasso [26] select relevant features in groups, thus making them suitable approaches for gene selection. Simon et al. [27] designed the Sparse Group Lasso (SGL), which achieves both group sparsity and sparsity within groups, unlike the group lasso. An improvement of SGL was designed in [28] that utilized data-dependent weights for feature selection. The Multinomial Sparse Group Lasso (MSGL) was applied to multi-class problems in [29].

The efficiency of the group lasso and its extensions for microarray gene selection relies heavily on the division of the groups. WGCNA has been applied to divide the genes based on correlation. SGL gives equal weightage to all the genes, so the relative importance of genes is ignored. The group weight is determined based on the number of genes within a group, which may not work well when the group sizes vary widely.

Wang et al. [5] developed the WGGL framework, which computes the gene weight using JMI. The framework has been applied to two-class cancer classification and gene selection. JMI cannot be applied to continuous variables; the continuous variables need to be discretized, which leads to two disadvantages. Firstly, a different choice of discretization method may lead to different results. Secondly, an additional computational overhead is incurred for the discretization step.

MSGL is designed in [29] for multi-class classification. Although the authors have made provisions for gene and group weights, no guidelines for weight assignment are provided. A non-weighted approach, MROGL, is applied to three-class cancer classification [30]. MROGL generates groups within each class using WGCNA, and the groups are overlapping in nature. Thus, for a 3-class problem with p genes, the method leads to a 3p-dimensional representation. Since microarray data are of high dimension, the computational burden increases with the number of classes if overlapping groups are used.

Throughout the years, numerous techniques in the statistics literature have been developed and adapted to neural networks. Li et al. [31] proposed inserting a sparse one-to-one linear layer between the input layer and the first hidden layer of a neural network and performing L1 regularization on the weights of this additional layer. Similar concepts are applied to different types of networks and learning contexts [32–34]. The standard lasso is not ideal for neural networks because a feature can only be dropped if all of its connections have been shrunk to zero simultaneously, an objective that the lasso does not actively pursue. Zhao et al. [35], Scardapane et al. [36], and Zhang et al. [37] address this issue by employing the group lasso and its variants for selecting deep neural network features. Feng and Simon [38] discuss fitting a neural network with a Sparse Group Lasso penalty on the first-layer input weights and a sparsity penalty on subsequent layers. Dinh et al. [12] addressed the problem of feature selection for analytic deep networks and report that the adaptive group lasso selection procedure with the group lasso as the base estimator is selection-consistent. The paper presents an embedded feature selection method based on a sparse learning model that is constructed using GCNN and dropout regularization.

3. Problem statement

Consider a cancer classification problem with k classes, n samples, and p features (genes). Given a dataset (X, Y) = {(x_i, y_i) | i = 1, ..., n}, where x_i ∈ R^p is the input vector and y_i ∈ {1, 2, ..., k} is the response, the multi-class classification problem can be formulated as learning a decision function z to classify an unseen record x_i based on a discrimination function f_j for each class:

z = argmax_{j=1,2,...,k} f_j(x_i)    (1)

The discrimination functions f_j for the WGCNN are founded on a radial basis function and are given in Eq. (2), where T_j refers to the training set belonging to class j:

f_j(x_i) = h( e^(-||W^T (x_i - T_j)||_2 / (2σ^2)) )    (2)

The training phase of WGCNN learns the weight parameter through a gradient descent optimization algorithm. The weight determines the importance of a feature in the discrimination function and hence, in turn, for classification.

4. Weighted generalized classifier neural network

GCNN is a radial basis function-based classification neural network [17]. It comprises five layers, namely the input, pattern, summation, normalization, and output layers. This work proposes a Weighted GCNN that embeds feature weighting as a part of training the neural network. The weights are then utilized for selecting the subset of features that help in the classification problem. Fig. 2 depicts the architecture of WGCNN.

The input layer is the first layer, and it is in charge of feeding data X ∈ R^d into the neural network. The number of neurons in this layer is equal to the total number of features, d, in the dataset. The pattern layer comprises neurons, each of which represents a different training pattern; this layer has the same number of neurons as the total number of training examples, m. The pattern layer computes the squared weighted Euclidean distance to the input as indicated in Eq. (3), where T_j ∈ R^d denotes the jth training pattern and W ∈ R^d is the weight vector:

dist(j) = ||W^T (X - T_j)||_2,  1 ≤ j ≤ m    (3)

Processing of the output from the pattern layer to the output layer follows a procedure similar to that in [17]. The pattern layer output is generated by a radial basis activation function as in Eq. (4). Every training pattern has an associated class vector y that determines the class it belongs to; the computation of y is given in Eq. (5). The values 0.9 and 0.1 are chosen to avoid the stuck-neuron problem during learning.

R(j) = e^(-dist(j) / (2σ^2)),  1 ≤ j ≤ m    (4)

y(j, i) = 0.9 if T_j belongs to the ith class (1 ≤ i ≤ k), and 0.1 otherwise (1 ≤ j ≤ m)    (5)

The summation layer has k + 1 neurons, where k is the total number of classes. To differentiate between the classes, a divergent term, computed using the exponential of the difference between y and y_max, is used as in Eq. (6):

d(j, i) = e^(y(j,i) - y_max) * y(j, i)    (6)

The value of y_max is initialized to the maximum value of y(j, i), i.e. 0.9, and is updated in every iteration during learning. The output of the pattern layer is multiplied by the respective divergent term and summed at the summation layer. The output of each neuron in the summation layer, denoted by S(i), is computed as given in Eq. (7):

S(i) = Σ_{j=1}^{m} R(j) * d(j, i),  1 ≤ i ≤ k    (7)

Fig. 2. WGCNN architecture.

Fig. 3. Flowchart for the proposed method.
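The layer-by-layer computation in Eqs. (3)–(7) can be summarized numerically. The following is a minimal NumPy sketch (our own illustration with hypothetical toy data, not the authors' MATLAB code) covering the pattern-layer distances and activations, the class targets, the divergent term, and the summation-layer outputs; the normalization and winner selection described next follow directly:

```python
import numpy as np

# Toy setup (hypothetical): m training patterns T with labels, d features,
# one input x, per-feature weights w, and smoothing parameter sigma.
rng = np.random.default_rng(1)
m, d, k = 6, 4, 2
T = rng.normal(size=(m, d))            # training patterns (pattern layer)
labels = np.array([0, 0, 1, 1, 0, 1])  # class index of each pattern
x = rng.normal(size=d)                 # one input record
w = np.ones(d)                         # feature weights W, learned in WGCNN
sigma, y_max = 1.0, 0.9

# Eq. (3): weighted Euclidean distance between the input and each pattern
dist = np.linalg.norm(w * (x - T), axis=1)

# Eq. (4): radial basis activation of the pattern layer
R = np.exp(-dist / (2.0 * sigma ** 2))

# Eq. (5): class targets, 0.9 for the pattern's own class and 0.1 otherwise
y = np.where(labels[:, None] == np.arange(k)[None, :], 0.9, 0.1)  # shape (m, k)

# Eq. (6): divergent term;  Eq. (7): summation-layer outputs
div = np.exp(y - y_max) * y
S = R @ div                            # S(i) = sum_j R(j) * d(j, i)

# Normalization and winner class (Eqs. (8)-(10), described below)
D = R.sum()
N = S / D
print("winner class:", int(np.argmax(N)))
```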

The extra neuron is used for normalization of the outputs of the neurons in the summation layer, and its output is denoted by D:

D = Σ_{j=1}^{m} R(j)    (8)

The output of the summation layer is normalized in the normalization layer using the value of D. The output of the normalization layer is calculated as in Eq. (9):

N(i) = S(i) / D,  1 ≤ i ≤ k    (9)

The largest value among the normalization layer neurons is utilized for classification. The last layer computes the winner class according to Eq. (10), where o denotes the maximum value among all the output neurons of the normalization layer and id is the winning class index:

[o, id] = max(N)    (10)

The GCNN learns the smoothing parameter σ during the training phase. The proposed method adopts gradient descent-based optimization for learning the weight parameter W along with σ. The weights are optimized so that the features contributing to the discrimination of the classes are assigned higher values, while the non-discriminating features, which are not relevant for classification, are assigned minimum weights. Feature selection is then performed based on the weight values.

During training, for each input, the squared error E is calculated as in Eq. (11), where y(l, id) is the target value of the lth training input for the idth class and N_id is the winner-class output:

E = (y(l, id) - N_id)^2    (11)

According to [39], the first derivative of E with respect to σ is given in Eqs. (12)–(15):

∂E/∂σ = -2 * (y(l, id) - N_id) * ∂N_id/∂σ    (12)

∂N_id/∂σ = (A(id) - B(id)) / (σ^3 * D)    (13)

A(id) = Σ_{j=1}^{m} d(j, id) * R(j) * dist(j)    (14)

B(id) = [Σ_{j=1}^{m} R(j) * dist(j)] * N_id    (15)

The value of σ is then updated as in Eq. (16), where η_σ is the learning rate for updating σ:

σ_new = σ_old - η_σ * ∂E/∂σ    (16)

WGCNN comprises the weight vector W = (w_1, w_2, ..., w_d). Each w_i weights the difference in the respective feature between the input and a training sample. To minimize the classification error, the weights have to be such that they discriminate the classes correctly. Similarly, the partial derivative of E with respect to w_i is given in Eqs. (17)–(19):

∂E/∂w_i = (2 * (y(l, id) - N_id) * w_i / σ^2) * (U(id) - V(id)) / D    (17)
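Before the terms U(id) and V(id) used in the weight update are defined (Eqs. (18)–(19) below), the σ update of Eqs. (12)–(16) can be illustrated with a short NumPy sketch. This is our own illustration, not the authors' MATLAB implementation; the variables follow the toy quantities of the earlier forward-pass sketch:

```python
import numpy as np

def sigma_step(R, dist, div, y_target, N, sigma, lr_sigma):
    """One gradient-descent update of the smoothing parameter sigma,
    following Eqs. (11)-(16). R, dist: pattern-layer quantities (length m);
    div: divergent terms, shape (m, k); y_target: 0.9/0.1 targets of the
    current training input, length k; N: normalization-layer outputs, length k."""
    D = R.sum()
    win = int(np.argmax(N))                  # winning class index, Eq. (10)
    err = y_target[win] - N[win]             # (y(l, id) - N_id)

    A = np.sum(div[:, win] * R * dist)       # Eq. (14)
    B = np.sum(R * dist) * N[win]            # Eq. (15)
    dN_dsigma = (A - B) / (sigma ** 3 * D)   # Eq. (13)
    dE_dsigma = -2.0 * err * dN_dsigma       # Eq. (12)
    return sigma - lr_sigma * dE_dsigma      # Eq. (16)
```

The per-feature weight update of Eqs. (17)–(20) follows the same pattern once U(id) and V(id) are computed.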
U(id) = Σ_{j=1}^{m} d(j, id) * R(j) * (x_i - t_ij)    (18)

V(id) = N_id * [Σ_{j=1}^{m} R(j) * (x_i - t_ij)]    (19)

The weights are then updated as in Eq. (20), where η_w is the learning rate for the weights:

w_i = w_i(old) - η_w * ∂E/∂w_i    (20)

The algorithm for training WGCNN is given in Algorithm 1.

Algorithm 1: Training of WGCNN
Input: training data, epoch, η_σ, η_w
Output: σ, W
1   while iteration ≤ epoch do
2       for each training data T_j do
3           dist(j) = ||W^T (X - T_j)||_2
4           R(j) = e^(-dist(j) / (2σ^2))
5           for each class i do
6               d(j, i) = e^(y(j,i) - y_max) * y(j, i)
7               S(i) = Σ_{j=1}^{m} R(j) * d(j, i)
8       D = Σ_{j=1}^{m} R(j)
9       N(i) = S(i) / D
10      [o, id] = max(N)
11      N_max(iteration) = N_id
12      E = (y(l, id) - N_id)^2
13      y_max = max(N_max)
14      σ_new = σ_old - η_σ * ∂E/∂σ
15      w_i = w_i(old) - η_w * ∂E/∂w_i

The microarray gene expression data comprises far more features than samples; thus, WGCNN is prone to overfitting. To tackle the overfitting problem, we adopt dropout regularization at the input layer. Dropout refers to the removal of a neural network's hidden and/or visible units; it is a regularization approach for training a large number of neural networks in parallel with varied topologies. During training, a certain number of layer outputs are ignored or dropped at random, which causes the layer to appear and behave as if it were a layer with a different number of nodes and connections. WGCNN adopts dropout regularization at the input layer because it helps generate a sparse feature input to the neural network. Instead of random dropout, statistically directed dropout is used to accomplish the feature selection objective. For each neuron in the input layer, which represents a feature, the variance across the k classes is computed; the higher the variance, the better the feature discriminates among the classes. The variance value is used for filtering out neurons during dropout, and the dropout retains the neurons with the highest variance among the classes. Fig. 3 shows the flowchart for the proposed method; the blocks indicated in green and blue represent the processing steps involved in the dropout and the training of WGCNN, respectively. The entire process can be summarized in the following steps.

1. Initialize the parameters w, α, σ and Max_itr.
2. Each record in the training set undergoes the following steps:
   (a) Drop features based on variance across the classes; higher variance indicates that the feature can differentiate among the classes.
   (b) The output of the dropout is passed to the pattern layer for computation using Eqs. (3)–(5).
   (c) The summation layer computes d, S, and D using Eqs. (6)–(8).
   (d) The normalization layer computes N using Eq. (9).
   (e) Update σ and w.
3. Go to step 2 if the number of iterations ≤ Max_itr; otherwise go to step 4.
4. Sort w in decreasing order and choose the features with the highest weights.

5. Experimental results and analysis

The performance of the WGCNN model is evaluated on synthetic data and seven microarray gene expression datasets. The relationship between the learned weights and feature relevance is explored using the synthetic data. This section presents the experimental study on synthetic data, the dataset information employed in the experiments, and a detailed comparison with state-of-the-art feature selection methods. The performance of the selected genes on other classification models is also analyzed to determine whether their performance is biased toward the embedded approach. The experiments for the proposed method are implemented in MATLAB on an i7 processor at 3.5 GHz with 8 GB RAM. FeatureMiner [40] is used to obtain feature subsets for the existing information theoretic-based and sparse learning-based feature selection methods.

5.1. Synthetic data

The WGCNN learns a weight for each feature, and the relevant features receive higher weights. In order to validate this, a study on synthetic data is carried out. The synthetic dataset is generated with 50 samples and two features f1 and f2 for a binary classification problem, as in [41]. Of the two features, only the first feature, f1, is relevant to the class. Initially, equal weights are assigned to all features. During the training phase, WGCNN learns the weights with the objective of minimizing the error, and the higher-weight features are then selected as the relevant features.

Ideally, the function learned by a classification model should exhibit a strong correlation with the class labels. The experiment is conducted to verify the relevance of the weights in the radial basis function with respect to the class. For the synthetic data, the relevant feature f1, when given a higher weight than the irrelevant feature f2, must exhibit a higher correlation with the class labels.

Weights w1 and w2 are assigned to f1 and f2, respectively. The effect of varying the weights on the correlation of e^(-(w1 f1 + w2 f2)/2) (the radial basis function of the weighted input with σ = 1) with the class labels is depicted in Fig. 4. The x and y axes represent the features f1 and f2, respectively. The first plot shows equal weights (1, 1) for both features, and the observed distance correlation is 0.4971. Assigning a higher weight to the irrelevant feature f2 and eliminating the relevant feature f1 gives a lower correlation of 0.2355. Since the output value of the radial basis function is utilized for classification, its correlation with the class label is important. It is observed in the plots that when the weights of the relevant features are increased, the correlation with the class label improves. Thus, the idea of choosing the features with higher weights for better classification performance is justified.
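A minimal reproduction of this check is sketched below. It is an assumption-laden illustration: the synthetic generator is made up, and ordinary Pearson correlation is used in place of the distance correlation reported for Fig. 4:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
labels = rng.integers(0, 2, size=n)        # binary class labels
f1 = labels + 0.3 * rng.normal(size=n)     # relevant feature (assumed generator)
f2 = rng.normal(size=n)                    # irrelevant feature

def rbf_output(w1, w2):
    # radial basis function of the weighted input with sigma = 1
    return np.exp(-(w1 * f1 + w2 * f2) / 2.0)

for w1, w2 in [(1.0, 1.0), (0.0, 1.0), (1.0, 0.0), (2.0, 0.5)]:
    out = rbf_output(w1, w2)
    # Pearson correlation with the class label as a simple stand-in
    # for the distance correlation used in the paper
    r = np.corrcoef(out, labels)[0, 1]
    print(f"w1={w1}, w2={w2}: corr with label = {r:+.3f}")
```

Up-weighting the relevant feature f1 should increase the (absolute) correlation of the RBF output with the labels, mirroring the trend reported in Fig. 4.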


Fig. 4. Effect of varying weights of relevant and irrelevant features on classification for the synthetic data.

5.2. Microarray gene expression dataset

Experiments are performed on real-world datasets that include seven high-dimensional microarray gene expression datasets [42]. These are benchmark datasets that include binary and multi-class data. The Leukemia, Leukemia-4c, Leukemia-3c, and CNS datasets each consist of 7129 genes, and the Prostrate, Small-Round-Blue-Cell Tumor (SRBCT), and Brain datasets consist of 10 509, 2308, and 1070 genes, respectively. The CNS dataset contains 2 classes with 21 survivor samples and 39 non-survivor samples. The Leukemia dataset comprises 2 classes with 47 samples belonging to Acute Lymphoblastic Leukemia (ALL) and 25 samples belonging to Acute Myeloid Leukemia (AML). The Leukemia-4c dataset comprises 4 classes with 38 ALL B-Cell samples, 21 Bone Marrow (BM) samples, 4 Peripheral Blood (PB) samples, and 9 ALL T-Cell samples. The Leukemia-3c dataset comprises 7129 genes with 25 AML samples, 38 B-Cell ALL samples, and 9 T-Cell ALL samples. The Prostrate dataset contains 52 samples in the normal class and 52 samples in the tumor class. The SRBCT dataset contains 2308 genes with 29 samples in the Ewing's sarcoma class, while the Burkitt's lymphoma, neuroblastoma, and rhabdomyosarcoma classes have 11, 18, and 25 samples, respectively. The Brain dataset contains two classes, each comprising 14 samples. The description of these datasets is given in Table 1.

Table 1
Gene microarray dataset.
Sr. No.  Datasets     Features  Instances  Classes
1        CNS          7129      60         2
2        Leukemia     7129      72         2
3        Leukemia-4c  7129      72         4
4        Leukemia-3c  7129      72         3
5        Prostrate    10 509    102        2
6        SRBCT        2308      83         4
7        Brain        1070      28         2

5.2.1. Experimental results

We compare the proposed feature selection approach to other current feature selection methods using NB and SVM classification with 3-fold cross-validation; 3-fold cross-validation is used to ensure that all the classes are represented in every fold. The average F1-scores of the three classifiers for seven methods, namely WGCNN, Conditional Mutual Information Maximization (CMIM) [43], Interaction Capping (ICAP) [44], Double Input Symmetrical Relevance (DISR) [45], Information Gain Feature Selection (IGFS) [46], Minimum Redundancy Maximum Relevance (MRMR) [47] and l2,1 norm sparse regularization (ll_l21) [34], versus the first fifty selected features are displayed in Fig. 5.

The F1-score is the harmonic mean of precision and recall and provides a balanced summary of model performance. The calculation for binary classes is straightforward, as given in Eq. (21):

F1-score = 2 * precision * recall / (precision + recall)    (21)

In multi-class classification, the F1 score for each class is computed using a One-vs-Rest approach rather than a single overall F1 score as in binary classification. The precision, recall, and F1-score are calculated separately for each class in this One-vs-Rest setting. Instead of reporting multiple per-class F1-scores, the macro average is computed to obtain a single number that describes overall performance: the macro-averaged F1 score is the arithmetic mean of the per-class F1-scores.

IGFS is a feature selection approach that selects features based on their association with the class labels; the redundancy of the features is not taken into consideration. CMIM, ICAP, and DISR use non-linear combinations of Shannon's information terms. MRMR, a pattern classification approach created by Peng et al., measures both relevance and redundancy using mutual information. ll_l21 performs supervised sparse feature selection via the l2,1 norm. For the existing feature selection approaches, FeatureMiner [40] is used to obtain the feature subsets.

Fig. 5(a) shows the results for the CNS dataset, where WGCNN is found to have the best overall performance. The results for the Leukemia dataset are shown in Fig. 5(b); WGCNN performs the best throughout, with the highest F1-score of 0.9897. Fig. 5(c) displays the findings for the Leukemia-4c dataset, with WGCNN achieving a better performance up to a feature subset size of 25. IGFS performs equally well with an increased feature subset size: the highest average F1-score of 0.9033 is reported for IGFS at a feature size of 35, while WGCNN achieves its highest F1-score of 0.9013 with a feature set of size 34. Fig. 5(d) depicts the results obtained on the Leukemia-3c dataset, where WGCNN performs the best from a feature size of 14 onwards.

On the Prostrate dataset, Fig. 5(e), all methods except CMIM and ll_l21 perform competitively. IGFS, DISR, and MRMR interchangeably perform the best for feature subset sizes from 1 to 35, and WGCNN consistently performs second best with a marginal difference from the best-performing method. MRMR performs the best on the SRBCT dataset, Fig. 5(f); WGCNN performs at par with MRMR with a larger feature set size of 37. Fig. 5(g) depicts the Brain dataset output for the top 50 features selected. DISR performs the best on this dataset, while the performance of WGCNN is not commendable. This is because the dataset comprises a total of 28 instances, leading to very few training samples, and the performance of the proposed method is largely dependent on the training phase of the model. The embedded method ll_l21 also does not perform well on this dataset; WGCNN performs better in comparison to ll_l21.


Fig. 5. Comparison of the influence of feature selection approaches on classification for various datasets (a) CNS dataset (b) Leukemia dataset (c) Leukemia4c dataset (d) Leukemia3c
(e) Prostrate dataset (f) SRBCT (g) Brain dataset.

The average F1-scores obtained by varying the number of features selected from 1 to 50 are shown in Tables 2, 4 and 6. The outcomes show the performance of each learning model, and the average F1-score over all datasets is shown in the last row of each table. It can be observed that the method performs the best on 3 datasets. On the Prostrate dataset, the difference between WGCNN and the best F1-score is 0.0042. Comparing the average F1-scores on the SVM model, the proposed approach performs the best with an average F1-score of 0.8531; the second best performing method is IGFS with an F1-score of 0.8061. Table 6 presents the results for the NB approach, where WGCNN performs the best on the Leukemia, CNS, and Leukemia-3c datasets. MRMR demonstrates the highest F1-score of 0.8772 on the Prostrate dataset, with WGCNN performing closely with an F1-score of 0.8676.
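For reference, the macro-averaged F1 score described in Section 5.2.1 (One-vs-Rest precision and recall per class, Eq. (21) applied per class, then an arithmetic mean) can be computed as in the following sketch with hypothetical labels:

```python
import numpy as np

def macro_f1(y_true, y_pred, k):
    """Macro-averaged F1: per-class One-vs-Rest precision/recall, then the
    arithmetic mean of the per-class F1-scores."""
    f1s = []
    for c in range(k):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        precision = tp / (tp + fp) if tp + fp > 0 else 0.0
        recall = tp / (tp + fn) if tp + fn > 0 else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
        f1s.append(f1)
    return float(np.mean(f1s))

# Hypothetical example with k = 3 classes
y_true = np.array([0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0, 2])
print(macro_f1(y_true, y_pred, k=3))
```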

Table 2
Average F1-score obtained by kNN on real-world datasets using various feature selection approaches.
Proposed CMIM DISR ICAP IGFS MRMR ll_l21
Prostrate 0.9912 0.9882 0.9992 0.9933 0.9883 0.9956 1.0000
Leukemia4c 0.9753 0.9694 0.9647 0.9694 0.9694 0.9681 0.9680
Brain 0.9124 0.9666 0.9674 0.9493 0.9494 0.9822 0.7923
Leukemia 0.9988 0.9996 1.0000 0.9996 0.9984 1.0000 1.0000
CNS 1.0000 1.0000 1.0000 1.0000 0.9851 1.0000 0.9982
Leukemia3c 1.0000 1.0000 0.9984 1.0000 1.0000 1.0000 1.0000
SRBCT 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000 0.9970
Average 0.9825 0.9891 0.9900 0.9874 0.9844 0.9923 0.9651

Table 3
p values derived using statistical testing (one-tailed t-test) kNN.
Proposed CMIM DISR ICAP IGFS MRMR ll_l21
Prostrate 2.82E−08 6.61E−13 2.22E−02 1.04E−14 1.76E−07 2.27E−08 –
Leukemia4c – 5.05E−05 1.60E−07 5.05E−05 4.38E−05 1.47E−04 3.90E−05
Brain 7.32E−16 8.71E−05 3.20E−07 9.97E−10 2.81E−12 – 3.57E−10
Leukemia 6.38E−03 7.97E−02 – 7.97E−02 1.82E−03 – –
CNS – – – – 1.28E−07 – 2.22E−02
Leukemia3c – – 1.61E−01 – – – –
SRBCT – – – – – – 4.16E−02

Table 4
Average F1-score obtained by SVM on real-world datasets using various feature selection approaches.
Proposed CMIM DISR ICAP IGFS MRMR ll_l21
Prostrate 0.8996 0.5097 0.8908 0.9038 0.8903 0.8960 0.6527
Leukemia4c 0.8445 0.6093 0.6078 0.6093 0.7537 0.5528 0.6429
Brain 0.8853 0.9829 0.9655 0.9759 0.9380 0.9129 0.8482
Leukemia 0.9527 0.7833 0.8014 0.7833 0.8639 0.7751 0.7978
CNS 0.5400 0.3497 0.3796 0.3377 0.3324 0.3066 0.3596
Leukemia3c 0.9216 0.9347 0.9482 0.9175 0.9031 0.9611 0.4549
SRBCT 0.9283 0.9725 0.9866 0.9747 0.9615 0.9816 0.6208
Average 0.8531 0.7346 0.7971 0.7860 0.8061 0.7694 0.6253

Table 5
p values derived using statistical testing (one-tailed t-test) SVM.
Proposed CMIM DISR ICAP IGFS MRMR ll_l21
Prostrate 1.34E−01 1.96E−48 2.03E−02 – 2.32E−02 5.37E−03 2.43E−19
Leukemia4c – 3.91E−30 6.46E−26 3.91E−30 3.16E−13 5.20E−33 5.88E−23
Brain 2.81E−13 – 9.92E−04 6.58E−02 1.66E−12 1.99E−16 3.34E−16
Leukemia – 1.45E−34 6.16E−32 1.45E−34 3.73E−14 4.66E−39 1.75E−33
CNS – 2.99E−02 1.48E−01 1.87E−02 4.08E−04 8.49E−05 6.45E−02
Leukemia3c 1.73E−05 1.06E−13 2.43E−04 5.40E−24 1.76E−18 – 1.29E−51
SRBCT 1.20E−09 3.98E−04 – 1.36E−04 3.92E−06 7.26E−02 5.03E−22

A statistical one-tailed t-test is used to compare the performance of the feature selection approaches. The null hypothesis is that the means of the two methods are equal; under the alternative hypothesis, the mean of the best technique is higher than that of the method being compared against. Statistical significance is assessed at a significance level of 0.05. The p-values obtained when comparing the best technique to the other methods are shown in Tables 3, 5 and 7, where the best-performing method in terms of the highest average F1-score is indicated with '–'. According to the statistical tests, the performance difference between WGCNN and the best-performing method on NB and SVM is insignificant. Similarly, for NB, WGCNN performs at par with IGFS on the Leukemia-4c dataset.

Fig. 6 compares all the methods in terms of the highest F1-score achieved and the corresponding Number of Features Selected (NFS) for kNN, SVM, and NB. Fig. 6(a) depicts the performance on the CNS dataset. The proposed method on SVM achieves the highest F1-score of 0.7131 with an NFS of 12, whereas the second-best-performing method achieves an F1-score of 0.5882 with an NFS of 32. The highest F1-score for NB is obtained by MRMR with an NFS of 44; WGCNN performs second best with an F1-score of 0.6667 and an NFS of 7. Moving on to the Leukemia dataset in Fig. 6(b), for kNN, SVM, and NB, WGCNN attains a better highest F1-score with the least NFS. Fig. 6(c) presents the comparison for the Leukemia-4c dataset: WGCNN shows the highest F1-score of 0.9302 with 34 NFS, followed by IGFS with an F1-score of 0.8732 with 48 NFS. In Fig. 6(d), MRMR, DISR, and WGCNN achieve F1-scores of 0.9886, 0.9886, and 0.9712 with NFS of 20, 41, and 26, respectively. In Fig. 6(e), IGFS attains the highest F1-score of 0.9417 with 6 NFS on SVM, while WGCNN achieves its highest F1-score of 0.9231 with 14 NFS; on NB, WGCNN achieves the highest F1-score of 0.9091 with the least NFS in comparison to DISR, IGFS, ICAP, and MRMR. In the SRBCT dataset, Fig. 6(f), the proposed method achieves the same highest F1-score but with a higher NFS. It is observed that MRMR performs well on the SRBCT dataset; the MRMR method, in addition to selecting relevant features, also eliminates redundant features, whereas the proposed method does not incorporate the elimination of redundant features. Thus, in the presence of redundant features, the NFS is higher for the proposed approach. CMIM attains better performance in terms of the highest F1-score and NFS for the Brain dataset, Fig. 6(g).

Table 8 shows the comparison in performance with and without feature selection. The performance of the three models without feature selection and the highest F1-score obtained after applying WGCNN are tabulated. The table shows an improvement in performance after applying the proposed feature selection method on all the datasets except Leukemia-4c. WGCNN, however, shows a large reduction in feature set size, from 7129 to 34 and 20 for Leukemia-4c on SVM and NB, respectively.
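The one-tailed t-test behind Tables 3, 5 and 7 can be reproduced with a sketch like the following (our illustration with made-up score vectors; the exact test variant and pairing used by the authors are not specified beyond "one-tailed t-test"):

```python
import numpy as np
from scipy import stats

# Hypothetical F1-scores over 50 feature-subset sizes for the best method
# and one competitor on a single dataset.
rng = np.random.default_rng(42)
best = 0.95 + 0.02 * rng.normal(size=50)
other = 0.93 + 0.02 * rng.normal(size=50)

# H0: equal means; H1: the mean of `best` is greater than the mean of `other`.
t_stat, p_value = stats.ttest_ind(best, other, alternative="greater")
print(f"t = {t_stat:.3f}, one-tailed p = {p_value:.2e}")
```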

Fig. 6. Comparison of highest average F1-score and Number of Features Selected (NFS) for various datasets (a) CNS dataset (b) Leukemia dataset (c) Leukemia4c dataset (d) Leukemia3c (e) Prostrate dataset (f) SRBCT (g) Brain dataset.


Table 6
Average F1-score obtained by NB on real-world datasets using various feature selection approaches.
Proposed CMIM DISR ICAP IGFS MRMR ll_l21
Prostrate 0.8676 0.4139 0.8757 0.8746 0.8754 0.8772 0.3999
Leukemia4c 0.7731 0.6416 0.6196 0.6416 0.7862 0.5044 0.6110
Brain 0.8812 0.9649 0.9644 0.9578 0.9667 0.8902 0.7698
Leukemia 0.9782 0.6787 0.8291 0.6787 0.8474 0.8223 0.7263
CNS 0.6308 0.4245 0.4816 0.4245 0.4201 0.4948 0.4747
Leukemia3c 0.9357 0.9181 0.8654 0.9181 0.8892 0.9079 0.5550
SRBCT 0.9113 0.9503 0.9660 0.9488 0.9634 0.9670 0.7162
Average 0.8540 0.7131 0.8003 0.7777 0.8212 0.7805 0.6076

Table 7
p values derived using statistical testing (one-tailed t-test) NB.
Proposed CMIM DISR ICAP IGFS MRMR ll_l21
Prostrate 6.89E−02 7.91E−41 3.89E−01 2.47E−01 3.02E−01 – 5.75E−43
Leukemia4c 8.32E−02 1.77E−18 4.61E−21 1.77E−18 – 1.03E−25 6.37E−25
Brain 2.02E−12 4.45E−01 1.61E−01 2.49E−01 – 1.25E−14 4.53E−18
Leukemia – 9.64E−38 1.08E−20 9.64E−38 4.38E−17 5.87E−33 1.93E−22
CNS – 1.92E−18 4.05E−19 1.92E−18 4.48E−23 2.41E−08 3.77E−11
Leukemia3c – 4.04E−02 1.55E−11 4.04E−02 2.70E−13 1.10E−02 6.62E−40
SRBCT 9.34E−05 3.15E−03 2.56E−01 9.99E−04 2.23E−01 – 1.17E−12

Table 8
Comparison between highest F1-score using WGCNN and all features.
SVM NB kNN
All WGCNN All WGCNN All WGCNN
Prostrate 0.9216 0.9231 0.3571 0.9091 0.8021 1.0000
Leukemia4c 0.9409 0.9302 0.8819 0.8154 0.7359 0.9861
Brain 0.9630 1.0000 1.0000 1.0000 0.3556 1.0000
Leukemia 0.9409 0.9691 0.9677 1.0000 0.9108 1.0000
CNS 0.4177 0.7131 0.6154 0.6667 0.5296 1.0000
Leukemia3c 0.9512 0.9712 0.7082 0.9744 0.7797 1.0000
SRBCT 0.9734 0.9911 0.9706 1.0000 0.8383 1.0000

5.2.2. Discussion

A comparative analysis of six feature selection methods on seven different microarray datasets is performed, with the F1-score chosen for evaluating performance. The plots of the number of features, varying from 1 to 50, versus the F1-score in Fig. 5 indicate that the proposed method performs the best on the majority of the datasets. However, this superior performance does not occur on the Brain dataset. Upon further analysis, it is found that this dataset has few instances, leading to small training data. The proposed method classifies by separating the classes using the radial basis function, which is possible only when each class has enough training samples to learn the parameters.

The proposed method falls under the category of embedded feature selection. Features selected by an embedded approach may fail to generalize to other learning models. We therefore validate the features selected by the proposed method on three learning models, namely SVM, NB, and kNN, and the comparison is further supported by statistical testing. The method performs well on the majority of the datasets on all three learning models; thus, the selected features generalize well to the other learning models.

The next analysis compares the number of features selected and the performance together. For every dataset, the performance on kNN, SVM, and NB is evaluated, and it is observed that the proposed method performs best on one or both metrics. The comparison between all features and the selected features indicates that better or comparable performance can be achieved with as little as 2% of the original feature set size.

6. Conclusion

The paper proposes WGCNN for feature selection and applies it to microarray gene expression datasets. The proposed feature selection method is based on GCNN. The learning phase uses gradient descent to learn σ and the weights simultaneously, and the learned weights are then utilized to determine the importance of features for classification. The method is applied to seven microarray datasets, and the F1-score is measured for the selected subsets on SVM, NB, and kNN. WGCNN performs well in comparison to six popular feature selection methods. WGCNN is a step towards an explainable model for feature selection using a neural network. The results on the Brain dataset indicate that WGCNN relies on the training phase, and hence a lack of training samples can lead to a degradation in performance.

Cancer-causing genes usually interact in groups, and group feature selection selects relevant features in groups, making it a suitable approach for gene selection. As part of future work, we can explore group feature selection using WGCNN. The selected set of genes can further be verified using biological domain knowledge. Another future direction is the selection of non-redundant features, as the redundancy of the features selected by WGCNN has not been explored in this work.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.


References

[1] J. Li, K. Cheng, S. Wang, F. Morstatter, R.P. Trevino, J. Tang, H. Liu, Feature selection: A data perspective, ACM Comput. Surv. 50 (6) (2017) 94:1–94:45, http://dx.doi.org/10.1145/3136625.
[2] J.C. Ang, A. Mirzal, H. Haron, H.N.A. Hamed, Supervised, unsupervised, and semi-supervised feature selection: A review on gene selection, IEEE/ACM Trans. Comput. Biol. Bioinform. 13 (2016) 971–989, http://dx.doi.org/10.1109/TCBB.2015.2478454.
[3] V. Bolón-Canedo, N. Sánchez-Maroño, A. Alonso-Betanzos, J.M. Benítez, F. Herrera, A review of microarray datasets and applied feature selection methods, Inform. Sci. 282 (2014) 111–135, http://dx.doi.org/10.1016/J.INS.2014.05.042.
[4] J. Gui, Z. Sun, S. Ji, D. Tao, T. Tan, Feature selection based on structured sparsity: A comprehensive study, IEEE Trans. Neural Netw. Learn. Syst. 28 (7) (2016) 1490–1507.
[5] Y. Wang, X. Li, R. Ruiz, Weighted general group lasso for gene selection in cancer classification, IEEE Trans. Cybern. 49 (2019) 2860–2873, http://dx.doi.org/10.1109/TCYB.2018.2829811.
[6] K.Y. Aram, S.S. Lam, M.T. Khasawneh, Linear cost-sensitive max-margin embedded feature selection for SVM, Expert Syst. Appl. 197 (2022) 116683.
[7] H. Liu, M. Zhou, Q. Liu, An embedded feature selection method for imbalanced data classification, IEEE/CAA J. Autom. Sin. 6 (3) (2019) 703–715, http://dx.doi.org/10.1109/JAS.2019.1911447.
[8] D. Zhang, S. Lou, The application research of neural network and BP algorithm in stock price pattern classification and prediction, Future Gener. Comput. Syst. 115 (2021) 872–879.
[9] G.-G. Wang, M. Lu, Y.-Q. Dong, X.-J. Zhao, Self-adaptive extreme learning machine, Neural Comput. Appl. 27 (2) (2016) 291–303.
[10] Z. Cui, F. Xue, X. Cai, Y. Cao, G.-g. Wang, J. Chen, Detection of malicious code variants based on deep learning, IEEE Trans. Ind. Inform. 14 (7) (2018) 3187–3196.
[11] N.M. ud din, R.A. Dar, M. Rasool, A. Assad, Breast cancer detection using deep learning: Datasets, methods, and challenges ahead, Comput. Biol. Med. 149 (2022) 106073, http://dx.doi.org/10.1016/j.compbiomed.2022.106073.
[12] V.C. Dinh, L.S. Ho, Consistent feature selection for analytic deep neural networks, Adv. Neural Inf. Process. Syst. 33 (2020) 2420–2431.
[13] B. He, W. Hu, K. Zhang, S. Yuan, X. Han, C. Su, J. Zhao, G. Wang, G. Wang, L. Zhang, Image segmentation algorithm of lung cancer based on neural network model, Expert Syst. 39 (3) (2022) e12822.
[14] K. Hu, L. Zhao, S. Feng, S. Zhang, Q. Zhou, X. Gao, Y. Guo, Colorectal polyp region extraction using saliency detection network with neutrosophic enhancement, Comput. Biol. Med. 147 (2022) 105760.
[15] M. Sepahvand, F. Abdali-Mohammadi, Joint learning method with teacher–student knowledge distillation for on-device breast cancer image classification, Comput. Biol. Med. 155 (2023) 106476, http://dx.doi.org/10.1016/j.compbiomed.2022.106476.
[16] Z. Xu, Y. Wang, M. Chen, Q. Zhang, Multi-region radiomics for artificially intelligent diagnosis of breast cancer using multimodal ultrasound, Comput. Biol. Med. 149 (2022) 105920, http://dx.doi.org/10.1016/j.compbiomed.2022.105920.
[17] B.M. Ozyildirim, M. Avci, Generalized classifier neural network, Neural Netw. 39 (2013) 18–26, http://dx.doi.org/10.1016/J.NEUNET.2012.12.001.
[18] R. Tibshirani, Regression shrinkage and selection via the lasso, J. R. Stat. Soc. Ser. B Stat. Methodol. 58 (1996) 267–288, http://dx.doi.org/10.1111/j.2517-6161.1996.tb02080.x.
[19] H. Zou, T. Hastie, Regularization and variable selection via the elastic net, J. R. Stat. Soc. Ser. B Stat. Methodol. 67 (2) (2005) 301–320, http://dx.doi.org/10.1111/J.1467-9868.2005.00503.X.
[20] O. Arslan, Weighted LAD-LASSO method for robust parameter estimation and variable selection in regression, Comput. Statist. Data Anal. 56 (2012) 1952–1965, http://dx.doi.org/10.1016/j.csda.2011.11.022.
[21] W. Yang, Y. Gao, Y. Shi, L. Cao, MRM-lasso: A sparse multiview feature selection method via low-rank analysis, IEEE Trans. Neural Netw. Learn. Syst. 26 (11) (2015) 2801–2815, http://dx.doi.org/10.1109/TNNLS.2015.2396937.
[22] Y. Zheng, C.K. Keong, A feature subset selection method based on high-dimensional mutual information, Entropy 13 (2011) 860–901, http://dx.doi.org/10.3390/E13040860.
[23] G.C. Cawley, N.L. Talbot, Gene selection in cancer classification using sparse logistic regression with Bayesian regularization, Bioinformatics 22 (2006) 2348–2355, http://dx.doi.org/10.1093/bioinformatics/btl386.
[24] B. Krishnapuram, L. Carin, M.A. Figueiredo, A.J. Hartemink, Sparse multinomial logistic regression: Fast algorithms and generalization bounds, IEEE Trans. Pattern Anal. Mach. Intell. 27 (2005) 957–968, http://dx.doi.org/10.1109/TPAMI.2005.127.
[25] M. Yuan, Y. Lin, Model selection and estimation in regression with grouped variables, J. R. Stat. Soc. Ser. B Stat. Methodol. 68 (2006) 49–67, http://dx.doi.org/10.1111/j.1467-9868.2005.00532.x.
[26] L. Meier, S. Van De Geer, P. Bühlmann, The group lasso for logistic regression, J. R. Stat. Soc. Ser. B Stat. Methodol. 70 (1) (2008) 53–71.
[27] N. Simon, J. Friedman, T. Hastie, R. Tibshirani, A sparse-group lasso, J. Comput. Graph. Statist. 22 (2013) 231–245, http://dx.doi.org/10.1080/10618600.2012.681250.
[28] K. Fang, X. Wang, S. Zhang, J. Zhu, S. Ma, Bi-level variable selection via adaptive sparse group lasso, J. Stat. Comput. Simul. 85 (13) (2014) 2750–2760, http://dx.doi.org/10.1080/00949655.2014.938241.
[29] M. Vincent, N.R. Hansen, Sparse group lasso and high dimensional multinomial classification, Comput. Statist. Data Anal. 71 (2014) 771–786, http://dx.doi.org/10.1016/j.csda.2013.06.004.
[30] J. Li, Y. Wang, T. Jiang, H. Xiao, X. Song, Grouped gene selection and multi-classification of acute leukemia via new regularized multinomial regression, Gene 667 (2018) 18–24, http://dx.doi.org/10.1016/j.gene.2018.05.012.
[31] Y. Li, C.-Y. Chen, W.W. Wasserman, Deep feature selection: Theory and application to identify enhancers and promoters, J. Comput. Biol. 23 (5) (2016) 322–336.
[32] J. Liu, S. Ji, J. Ye, Multi-task feature learning via efficient l2,1-norm minimization, 2012.
[33] S. Ainsworth, N. Foti, A.K. Lee, E. Fox, Interpretable VAEs for nonlinear group factor analysis, 2018, arXiv:1802.06765.
[34] I. Lemhadri, F. Ruan, L. Abraham, R. Tibshirani, LassoNet: A neural network with feature sparsity, J. Mach. Learn. Res. 22 (1) (2021) 5633–5661.
[35] L. Zhao, Q. Hu, W. Wang, Heterogeneous feature selection with multi-modal deep neural networks and sparse group lasso, IEEE Trans. Multimed. 17 (11) (2015) 1936–1948.
[36] S. Scardapane, D. Comminiello, A. Hussain, A. Uncini, Group sparse regularization for deep neural networks, Neurocomputing 241 (2017) 81–89.
[37] H. Zhang, J. Wang, Z. Sun, J.M. Zurada, N.R. Pal, Feature selection for neural networks using group lasso regularization, IEEE Trans. Knowl. Data Eng. 32 (4) (2019) 659–673.
[38] J. Feng, N. Simon, Sparse-input neural networks for high-dimensional nonparametric regression and classification, 2017, arXiv:1711.07592.
[39] T. Masters, W. Land, A new training algorithm for the general regression neural network, in: Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics. Computational Cybernetics and Simulation, Vol. 3, IEEE, Orlando, FL, USA, 1997, pp. 1990–1994.
[40] K. Cheng, J. Li, H. Liu, FeatureMiner: A tool for interactive feature selection, in: Proceedings of the International Conference on Information and Knowledge Management, Association for Computing Machinery, New York, United States, 2016, pp. 2445–2448, http://dx.doi.org/10.1145/2983323.2983329.
[41] I. Kamkar, S.K. Gupta, D. Phung, S. Venkatesh, Stable feature selection with support vector machines, in: Australasian Joint Conference on Artificial Intelligence, Springer, Canberra, Australia, 2015, pp. 298–308.
[42] Z. Zhu, Y.-S. Ong, M. Dash, Markov blanket-embedded genetic algorithm for gene selection, Pattern Recognit. 40 (11) (2007) 3236–3248, http://dx.doi.org/10.1016/J.PATCOG.2007.02.007.
[43] M. Vidal-Naquet, S. Ullman, Object recognition with informative features and linear classification, in: Proceedings of the 9th IEEE International Conference on Computer Vision, Vol. 3, Nice, France, 2003, pp. 281–288.
[44] A. Jakulin, A.I. Bratko, Machine learning based on attribute interactions, Ph.D. dissertation, 2005.
[45] F. Fleuret, Fast binary feature selection with conditional mutual information, J. Mach. Learn. Res. 5 (9) (2004) 1531–1555.
[46] D.D. Lewis, Feature selection and feature extraction for text categorization, in: Proceedings of the Workshop on Speech and Natural Language, Harriman, New York, 1992, pp. 212–217.
[47] C. Ding, H. Peng, Minimum redundancy feature selection for microarray gene expression data, J. Bioinform. Comput. Biol. 3 (2011) 185–205, http://dx.doi.org/10.1142/S0219720005001004.
