Computational Biology and Chemistry 89 (2020) 107368


L2,1-Extreme Learning Machine: An Efficient Robust Classifier for Tumor Classification

Liang-Rui Ren a, Ying-Lian Gao b, Jin-Xing Liu a,*, Rong Zhu a, Xiang-Zhen Kong a
a School of Information Science and Engineering, Qufu Normal University, Rizhao, China
b Library of Qufu Normal University, Qufu Normal University, Rizhao, China
* Corresponding author. E-mail addresses: renliangrui@126.com (L.-R. Ren), yinliangao@126.com (Y.-L. Gao), sdcavell@qfnu.edu.cn (J.-X. Liu), zhurongsd@126.com (R. Zhu), kongxzhen@163.com (X.-Z. Kong).

A R T I C L E  I N F O

Keywords:
Extreme Learning Machine
L2,1-norm
Robust
Single-cell RNA Sequencing
Supervised Learning

A B S T R A C T

With the development of cancer research, various gene expression datasets containing cancer information show an explosive growth trend. In addition, owing to the continuous maturation of single-cell RNA sequencing (scRNA-seq) technology, the protein information and pedigree information of single cells are also being mined continuously. How to classify such high-dimensional data correctly is a technical problem. In recent years, the Extreme Learning Machine (ELM) has been widely used in supervised and unsupervised learning. However, the traditional ELM does not consider the robustness of the method. To improve the robustness of ELM, this paper proposes a novel ELM method based on the L2,1-norm, named the L2,1-Extreme Learning Machine (L2,1-ELM). The method introduces the L2,1-norm into the loss function to improve robustness and to minimize the influence of noise and outliers. Firstly, we evaluate the new method on five UCI datasets; the experimental results show that our method achieves competitive results. Next, the novel method is applied to the classification of cancer samples and single-cell RNA sequencing datasets. The experimental results on The Cancer Genome Atlas (TCGA) datasets and scRNA-seq datasets show that ELM and its variants have great potential in the classification of cancer samples.

1. Introduction

Cancer is a general term for a series of diseases characterized by the uncontrolled division and spread of some of the body's cells. Some cancers may form tumors. There is no doubt that cancer has become one of the greatest threats to human life in the 21st century. Various studies on cancer are still under way, and the classification of cancer is one of them. At the same time, in order to dig deeper into the disease information carried by cancer cells, single-cell RNA sequencing technology is maturing [1]. The development of single-cell RNA sequencing technology has led to the emergence of many new research methods and has also produced a large amount of research data [2–5]. Therefore, the classification of single-cell datasets has become a hot topic.

Extreme Learning Machine (ELM), a method for single-hidden-layer feed-forward neural networks (SLFNs), was proposed in [6]. Different from traditional neural-network-based methods, ELM only updates the output weights that connect the hidden nodes and the output nodes, while the remaining parameters, i.e., the input weights and biases, are randomly generated and fixed during training [7,8]. The output weights are the least-squares solution obtained by minimizing the squared loss function, and the nonlinear mapping functions in ELM can be any nonlinear piecewise continuous functions, such as the sigmoid function G(a, b, x) = 1/(1 + exp(-(a·x + b))). Compared with traditional methods such as gradient descent, ELM has faster learning speed, requires less human intervention and shows better generalization performance [9]. Owing to these good characteristics, ELM and its variants have been widely used. In [10], Deng et al. combined ELM with sub-band kurtosis to evaluate image quality and showed that the trained model assesses noisy images better than existing blind noise assessment methods. In [11], Jin et al. proposed a sparse Bayesian ELM (SBELM) by combining ELM with sparse Bayesian learning and achieved competitive experimental results on a public dataset from BCI Competition IV IIb. Combined with L1-norm regularization, a sparse L1-ELM [12] was proposed and applied to anomaly detection in traffic; the experimental results show that L1-ELM not only has good generalization ability, but also has a fast learning speed.


In recent years, researchers have also been working to apply ELM to a variety of medical diagnostic problems. In [13], Liu et al. proposed a hybrid method integrating maximum relevance minimum redundancy (MRMR) and ELM and applied it to the detection of erythemato-squamous diseases; experimental results show that their method can achieve an accuracy of 98.55%. Hu et al. [14] introduced ELM to the diagnosis of paraquat poisoning. At the same time, their feature selection experiment effectively selected the features that had the greatest impact on the classification results and further improved the classification performance of ELM. In [15], Wang et al. proposed an ensemble-based fuzzy weighted extreme learning machine (EN-FWELM) for gene expression classification. In particular, the introduction of fuzzy sample membership and a balance factor greatly improved the ability of EN-FWELM to deal with multi-class imbalance problems. In [16], Eshtay et al. proposed to use the competitive swarm optimizer (CSO) to optimize the input weights and hidden neuron values of ELM. Their goal is to improve the generalization performance of the classifier and to generate a more compact network by reducing the number of hidden layer neurons. The experimental results on 15 medical classification datasets show that their method has good generalization performance and high stability when the number of hidden neurons is small. Moreover, ELM plays an important role in many other medical diagnostic problems [17–20].

In these studies, both parameter optimization problems and feature selection problems can be solved effectively. However, when the dataset is contaminated with noise and outliers, it is worth asking whether the generalization ability of ELM can be maintained. Therefore, how to improve the robustness of ELM to noise and outliers has become a hot topic in current research [21,22]. Aiming at improving the robustness of ELM, in this study the L2,1-norm is introduced into the loss function and a novel method named L2,1-Extreme Learning Machine is proposed. To our knowledge, owing to their remarkable robustness to noise and outliers and their ability to perform feature selection, a variety of algorithms based on the L2,1-norm have been proposed. Kong et al. [30] proposed a Robust Nonnegative Matrix Factorization (RNMF) using an L2,1-norm loss function; their method incurs almost the same computational cost as standard NMF but handles noise and outliers better. In [23,24], Yu et al. proposed a Robust Hyper-graph Regularized Nonnegative Matrix Factorization (RHNMF) by using the L2,1-norm when estimating the residual of conventional NMF; extensive experimental results on multi-view gene expression data show that their method outperforms other state-of-the-art methods. Jiang et al. [25] proposed an L2,1-norm minimization ELM which can generate group sparsity and reduce network complexity as much as possible; their method outperforms other comparison methods in multi-label text classification. To get a more compact hidden layer, Zhou et al. [26] introduced the L2,1-norm into ELM as a regularization constraint, which makes the hidden layer output weight matrix sparser.
Driven by these works, this paper presents a robust ELM model named the L2,1-Extreme Learning Machine (L2,1-ELM). It integrates the L2,1-norm into the loss function of the traditional ELM, thus enabling ELM to reduce the negative impact of noise and outliers. The main contributions of this paper are as follows:

(i) The L2,1-norm based robust loss function is integrated into the mathematical model of the original ELM, rather than the traditional L2-norm. In this way, the robustness of the method is improved, and the negative effects of noise and outliers are reduced.

(ii) To solve the L2,1-norm optimization problem, an iterative optimization algorithm is proposed. We also report the computational complexity and convergence of our method.

(iii) We first test our method on 5 benchmark datasets. The experimental results show that L2,1-ELM has very good classification ability and good generalization ability. In addition, L2,1-ELM is applied to cancer sample classification problems, in which 5 integrated cancer gene expression datasets and 3 single-cell RNA sequencing datasets are used. The classification results show that our method is competitive in classification problems.

The rest of the paper is organized as follows. In Section 2, we briefly review the relevant content of ELM [6], C-loss based ELM (CELM) [27] and L2,1-Regularization ELM (L2,1-RELM) [28], which are compared with our method. In Section 3, the method based on the L2,1-norm is proposed for classification tasks. In Section 4, the experimental results are presented and analyzed. Section 5 concludes the paper and points out future work.

2. Related Works

In this section, we briefly introduce the main contents of ELM, L2,1-RELM and CELM, which are compared with L2,1-ELM.

2.1. L2,1-norm

Ding et al. proposed the L2,1-norm in [29]. The L2,1-norm of a matrix X can be written as

\|X\|_{2,1} = \sum_{i=1}^{m} \|x^{i}\|_{2} = \sum_{i=1}^{m} \sqrt{\sum_{j=1}^{n} x_{ij}^{2}},

where x^i denotes the i-th row of X. That is, the L2,1-norm first computes the L2-norm of each row of the matrix X and then takes the L1-norm of the resulting vector of row norms [30].
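For illustration, the definition above can be written as a few lines of Python/NumPy (our own sketch, not code accompanying the paper):

```python
import numpy as np

def l21_norm(X):
    """L2,1-norm of a matrix: L1-norm of the vector of row-wise L2-norms."""
    row_norms = np.sqrt(np.sum(X ** 2, axis=1))  # L2-norm of each row x^i
    return np.sum(row_norms)                      # sum (L1-norm) over the rows

# Example: rows (3, 4) and (0, 5) have L2-norms 5 and 5, so the L2,1-norm is 10.
X = np.array([[3.0, 4.0],
              [0.0, 5.0]])
print(l21_norm(X))  # 10.0
```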


2.2. ELM

Huang et al. proposed the ELM algorithm in [6]. The output function of ELM is

f(x_i) = h(x_i)\beta, \quad i = 1, \ldots, N,   (1)

where h(x_i) is the hidden-layer output for the i-th sample [31], and β is the output weight matrix connecting the hidden layer and the output layer.

ELM aims to reach better generalization performance by achieving both the smallest training error and the smallest norm of the output weights [32], which can be expressed as follows:

\min_{\beta} \; \frac{1}{2}\|\beta\|^{2} + \frac{C}{2}\sum_{i=1}^{N}\|\xi_i\|^{2}, \quad \text{s.t. } h(x_i)\beta = t_i^{T} - \xi_i^{T}, \; i = 1, \ldots, N.   (2)

In Eq. (2), the first term is a regularization item, ξ_i represents the error vector of the i-th training sample, and C is a balance parameter. We denote by N the number of training samples and by L the number of hidden nodes. By solving Eq. (2) we can obtain the solution of the output weight β:

\beta = \left(I + C H^{T} H\right)^{-1} C H^{T} T,   (3)

where I is an identity matrix of dimension L. When N > L, β can be computed by Eq. (3). When N ≤ L, β is calculated by

\beta = H^{T}\left(I + C H H^{T}\right)^{-1} C T,   (4)

where I is an identity matrix of dimension N.
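As a concrete reference for Eqs. (1)–(4), a minimal NumPy sketch of ELM training and prediction is given below. The sigmoid hidden layer, the one-hot target matrix T and all function names are our own illustrative assumptions, not code from [6]:

```python
import numpy as np

def elm_train(X, T, L=500, C=1.0, rng=np.random.default_rng(0)):
    """Basic ELM: random hidden layer + regularized least-squares output weights."""
    d = X.shape[1]
    W = rng.uniform(-1, 1, size=(d, L))      # random input weights, fixed after init
    b = rng.uniform(-1, 1, size=L)           # random biases, fixed after init
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))   # sigmoid hidden-layer output, N x L
    N = X.shape[0]
    if N > L:                                # Eq. (3)
        beta = np.linalg.solve(np.eye(L) + C * H.T @ H, C * H.T @ T)
    else:                                    # Eq. (4)
        beta = H.T @ np.linalg.solve(np.eye(N) + C * H @ H.T, C * T)
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return (H @ beta).argmax(axis=1)         # Eq. (1); predicted class per sample
```

Only β is computed in closed form; the input weights W and biases b keep their random initial values, which is what gives ELM its training speed.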

2.3. L2,1-RELM

Zhou et al. [28] introduced the L2,1-norm as a regularization term into the traditional ELM and obtained a novel ELM with row sparsity. In their paper, a random Fourier mapping is used as the activation function; in our experiments, we use the sigmoid function instead. The objective function of L2,1-RELM is

\min_{\beta} \; \frac{1}{2}\|\beta\|_{2,1} + \frac{C}{2}\sum_{i=1}^{N}\|\xi_i\|^{2}, \quad \text{s.t. } h(x_i)\beta = y_i^{T} - \xi_i^{T}, \; i = 1, \ldots, N.   (5)

By substituting the constraint into Eq. (5), we get

F = \frac{1}{2}\mathrm{Tr}\left(\beta^{T} D \beta\right) + \frac{C}{2}\|Y - H\beta\|^{2}.   (6)

By calculating the gradient with respect to β and setting it to zero, the solution of β is

\beta = \left(D + C H^{T} H\right)^{-1} C H^{T} Y,   (7)

where D is a diagonal matrix whose entries can be expressed as

D_{ii} = \frac{1}{2\|\beta^{i}\|_{2}},   (8)

and β^i represents the i-th row of β. Eq. (7) is suitable for the situation that N > L. When N ≤ L, β is calculated by

\beta = D^{-1} H^{T}\left(I + C H D^{-1} H^{T}\right)^{-1} C Y,   (9)

where I is an identity matrix of dimension N.
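For completeness, the reweighted closed form of Eqs. (7)–(8) can be sketched as follows (our own illustration for the N > L case; the initialization, the small constant added for numerical safety and the fixed iteration count are assumptions):

```python
import numpy as np

def l21_relm_beta(H, Y, C=1.0, eps=1e-6, n_iter=30):
    """Row-sparse output weights of L2,1-RELM via the reweighting of Eq. (8), N > L case."""
    N, L = H.shape
    beta = np.linalg.solve(np.eye(L) + C * H.T @ H, C * H.T @ Y)   # plain ELM start
    for _ in range(n_iter):
        d = 1.0 / (2.0 * np.sqrt(np.sum(beta ** 2, axis=1)) + eps)  # Eq. (8): 1/(2*||beta^i||_2)
        D = np.diag(d)
        beta = np.linalg.solve(D + C * H.T @ H, C * H.T @ Y)        # Eq. (7)
    return beta
```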
2.4. CELM

Zhao et al. proposed a robust CELM by combining ELM with the correntropy-induced loss function (C-loss) [27]. The objective function of CELM can be expressed as

\min_{\beta} \; \frac{\lambda}{2}\|\beta\|^{2} + \sum_{i=1}^{N}\left(1 - \exp\left(-\frac{e_i^{2}}{2\sigma^{2}}\right)\right), \quad \text{s.t. } h(x_i)\beta = y_i^{T} - e_i^{T}, \; i = 1, \ldots, N,   (10)

where σ is the kernel bandwidth. According to the conjugate gradient method [33] and an alternating optimization algorithm, they obtained the solution of β^{n+1}:

\beta^{n+1} = H^{T}\Omega\left(H H^{T}\Omega + \lambda' I\right)^{-1} Y, \quad N < L,
\beta^{n+1} = \left(H^{T}\Omega H + \lambda' I\right)^{-1} H^{T}\Omega Y, \quad N \geq L,   (11)

where λ' = λσ and Ω = diag(−v_1^{n+1}, …, −v_N^{n+1}), with

v_i^{n+1} = -\exp\left(-\frac{\left(y_i - f^{n}(x_i)\right)^{2}}{2\sigma^{2}}\right) < 0, \quad i = 1, \ldots, N.   (12)
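A minimal sketch of this alternating update (Eqs. (11)–(12)) for the N ≥ L branch might look as follows; the initialization and stopping rule are our own assumptions rather than the authors' implementation:

```python
import numpy as np

def celm_weights(H, Y, lam_prime=1.0, sigma=1.0, n_iter=20):
    """C-loss ELM output weights via the reweighting in Eqs. (11)-(12), N >= L case."""
    N, L = H.shape
    beta = np.linalg.solve(H.T @ H + lam_prime * np.eye(L), H.T @ Y)       # ridge start
    for _ in range(n_iter):
        resid = Y - H @ beta                                               # y_i - f(x_i), per row
        v = -np.exp(-np.sum(resid ** 2, axis=1) / (2 * sigma ** 2))        # Eq. (12), v_i < 0
        Omega = np.diag(-v)                                                # Omega = diag(-v_i)
        beta = np.linalg.solve(H.T @ Omega @ H + lam_prime * np.eye(L),
                               H.T @ Omega @ Y)                            # Eq. (11), N >= L branch
    return beta
```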
3. L2,1-Extreme Learning Machine (L2,1-ELM)

In this section, we describe the novel method named L2,1-ELM in detail. The method is obtained by introducing the L2,1-norm into the loss function of ELM. Compared with the traditional Frobenius norm (F-norm), the L2,1-norm can improve the robustness of the method.

3.1. The objective function of L2,1-ELM

Consider a supervised learning task {X, T} = {x_i, t_i}_{i=1}^{N}, where X = {x_1, …, x_N}^T represents the data matrix with N samples and T = {t_1, t_2, …, t_N}^T is the label matrix which indicates the class of each sample, with t_i = {ω_1, …, ω_m}^T. When the L2,1-norm constraint is imposed on the loss function, the objective function of L2,1-ELM is as follows:

\min_{\beta} \; \frac{1}{2}\|\beta\|^{2} + \frac{C}{2}\sum_{i=1}^{N}\|\xi_i\|_{2,1}, \quad \text{s.t. } h(x_i)\beta = y_i^{T} - \xi_i^{T}, \; i = 1, \ldots, N,   (13)

where the first term in Eq. (13) is a regularization term used to control model complexity and is bounded by the L2-norm, ξ_i represents the error vector of the i-th training sample, and C is a balance parameter defined by the user.


3.2. The optimization of L2,1-ELM

To solve the L2,1-norm optimization problem, we can follow the rule in [23]:

\|B\|_{2,1} = \mathrm{Tr}\left(B^{T} D_2 B\right),   (14)

where D_2 is a diagonal matrix with the i-th diagonal element d_{ii} = 1/(2\|B^{i}\|_{2}), and B^i represents the i-th row of B.

By substituting Eq. (14) into Eq. (13), we can obtain the objective function

F = \frac{1}{2}\|\beta\|^{2} + \frac{C}{2}\mathrm{Tr}\left((Y - H\beta)^{T} D_2 (Y - H\beta)\right),   (15)

where

D_2 = \frac{1}{2\|Y - H\beta\|_{2}},   (16)

with diagonal elements d_{2,ii} = 1\big/\left(2\sqrt{\left(y_i - h(x_i)\beta\right)\left(y_i - h(x_i)\beta\right)^{T} + \upsilon}\right). Here υ is a regularization constant that is small enough and avoids the degenerate case in which the output weight matrix β becomes a zero matrix.

According to Eq. (15), the gradient with respect to β can be computed as

\frac{\partial F}{\partial \beta} = \beta - C H^{T} D_2 (Y - H\beta).   (17)

Setting Eq. (17) equal to zero, we have

\beta = \left(I_L + C H^{T} D_2 H\right)^{-1} C H^{T} D_2 Y,   (18)

where I_L is an identity matrix of dimension L. Eq. (18) is applicable to the case where the number of hidden-layer nodes is less than the number of training samples. If the number of nodes is larger than the number of training samples, we can follow the solution in [7] and obtain β as

\beta = H^{T}\left(I_N + C D_2 H H^{T}\right)^{-1} C D_2 Y,   (19)

where I_N is an identity matrix of dimension N. Finally, we get two analytical solutions for β:

\beta = \begin{cases} \left(I_L + C H^{T} D_2 H\right)^{-1} C H^{T} D_2 Y, & N \geq L, \\ H^{T}\left(I_N + C D_2 H H^{T}\right)^{-1} C D_2 Y, & N < L. \end{cases}   (20)

To obtain a more precise solution of β, we design an efficient iterative adjustment algorithm, which alternates between updating D_2 by Eq. (16) and updating β by Eq. (20). The algorithm of L2,1-ELM is summarized as Algorithm 1.
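Although Algorithm 1 itself is not reproduced here, the iteration it describes (recompute D_2 from the current residual by Eq. (16), then update β by Eq. (20), until β stabilizes) can be sketched as follows. This is our own NumPy illustration under the assumption of one-hot targets Y; it is not the authors' released code:

```python
import numpy as np

def l21_elm_beta(H, Y, C=1.0, eps=1e-6, n_iter=30, tol=1e-5):
    """Iteratively reweighted output weights of L2,1-ELM (Eqs. (16)-(20))."""
    N, L = H.shape
    beta = np.zeros((L, Y.shape[1]))
    for _ in range(n_iter):
        resid = Y - H @ beta
        # Eq. (16): d_ii = 1 / (2 * sqrt(||y_i - h(x_i)beta||^2 + upsilon))
        d = 1.0 / (2.0 * np.sqrt(np.sum(resid ** 2, axis=1) + eps))
        D2 = np.diag(d)
        if N >= L:   # Eq. (20), first branch
            beta_new = np.linalg.solve(np.eye(L) + C * H.T @ D2 @ H, C * H.T @ D2 @ Y)
        else:        # Eq. (20), second branch
            beta_new = H.T @ np.linalg.solve(np.eye(N) + C * D2 @ H @ H.T, C * D2 @ Y)
        if np.linalg.norm(beta_new - beta) < tol:   # stop when beta has stabilized
            beta = beta_new
            break
        beta = beta_new
    return beta
```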
3.3. Computational Complexity Analysis

In this section, we analyze the computational complexity of L2,1-ELM. When N ≥ L, we have to compute H^T D_2 H, H^T D_2 Y, D_2 and (I_L + C H^T D_2 H)^{-1}. The computational time complexity of H^T D_2 H is O(L^2 N), while it needs O(LNm) and O(Nm) to compute H^T D_2 Y and D_2, respectively, where m is the number of output nodes. The computational time complexity of inverting (I_L + C H^T D_2 H) is O(L^3), and it takes O(L^2 m) to multiply (I_L + C H^T D_2 H)^{-1} with H^T D_2 Y. Hence, the total computational complexity per iteration is O(L^2 N) + O(LNm) + O(Nm) + O(L^3) + O(L^2 m) = O(L^2 N). If we assume that the method converges after K iterations, the overall computational time complexity is K · O(L^2 N).

3.4. Robustness analysis

Compared with the Frobenius norm, the L2,1-norm abandons the commonly used squared loss, reduces the influence of noise and outliers on the model, and improves the robustness of the method. To evaluate the robustness of the method, we conduct two groups of comparative experiments. Following the rules in [34], in the first group of experiments we randomly generate 800 artificial Gaussian data points with mean ϖ1 = [−2, −2] and covariance matrix Σ1 = [1 0; 0 1] as class 1, and another 800 artificial Gaussian data points with mean ϖ2 = [2, 2] and covariance matrix Σ2 = [1 0; 0 1] as class 2. ELM, L2,1-RELM and L2,1-ELM are trained on this dataset respectively. Fig. 1(a) shows the classification decision boundaries of these methods. It is clear that class 1 and class 2 can be clearly separated by the three lines, and the classification boundary of every method is very clear. Then, 20 noisy points are randomly generated and added to this training dataset. In fact, these 20 noise points belong to class 2, but they are confused with class 1. ELM, L2,1-RELM and L2,1-ELM are trained on this new dataset again. Fig. 1(b) shows the new classification decision boundaries. It is obvious that, due to the influence of the noise points, the classification boundaries of ELM and L2,1-RELM change obviously and even show a tendency to bend. However, the classification boundary of L2,1-ELM is almost unchanged. To further verify the robustness of L2,1-ELM, we randomly generate another Gaussian artificial training dataset and mix 50 noise points into it. Fig. 2(a) shows that ELM, L2,1-RELM and L2,1-ELM can separate class 1 and class 2 easily. But after adding the noise points, as shown in Fig. 2(b), the classification boundaries of ELM and L2,1-RELM are severely distorted, while the classification boundary of L2,1-ELM is almost unaffected. Hence, compared with the other methods, L2,1-ELM has better robustness to noise.

Fig. 1. Classification decision boundary diagram with 20 noise points.
Fig. 2. Classification decision boundary diagram with 50 noise points.
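The data generation behind Figs. 1 and 2 can be reproduced roughly as follows (a hedged sketch; the exact noise-injection procedure of [34] may differ in detail):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two Gaussian classes: means (-2, -2) and (2, 2), identity covariance, 800 points each.
class1 = rng.multivariate_normal([-2, -2], np.eye(2), size=800)
class2 = rng.multivariate_normal([2, 2], np.eye(2), size=800)

# Noise points: samples labelled as class 2 but placed inside the class 1 region.
n_noise = 20   # 50 in the second experiment
noise = rng.multivariate_normal([-2, -2], np.eye(2), size=n_noise)

X = np.vstack([class1, class2, noise])
y = np.hstack([np.zeros(800), np.ones(800), np.ones(n_noise)])  # noise keeps class-2 labels
```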

3.5. Convergence analysis

Due to the introduction of the L2,1-norm, the output weight matrix β needs to be adjusted iteratively. Therefore, L2,1-ELM needs K iterations to converge. Fig. 3 shows the convergence of L2,1-ELM on different datasets. Taking the benchmark datasets Balance and COIL20 as examples, we plot the convergence of L2,1-ELM: the X-axis represents the number of iterations, and the Y-axis represents the error generated at each iteration. From Fig. 3, it is clear that our method has a fast convergence rate and achieves convergence after about 15 iterations. This also shows that our method can be applied to practical problems.

Fig. 3. Convergence curve on (a) Balance; (b) COIL20.

4. Experiments

4.1. Classification experiments on benchmark datasets

Here, 5 benchmark datasets are selected to test our method. We also compare the experimental results with other methods: LS-SVM [35], DNN [36], ELM [6], L2,1-RELM [28] and CELM [27]. To select the optimal parameters and obtain reasonable results, 10-fold cross validation is applied in the experiments. Each dataset is normalized by the Min-Max normalization method; the purpose of this is to scale the data matrix so that the data in each row are distributed between -1 and 1, and to minimize the negative impact of abnormal data [37,38].

All algorithms are implemented using MATLAB R2016a on a 3.60 GHz computer with 64.0 GB of memory. The details of every benchmark dataset are listed in Table 1.

Table 1
Details of benchmark datasets.

Datasets   Classes   Samples   Training   Testing   Features
Iris       3         150       105        45        4
Balance    3         625       500        125       4
Ecoli      8         336       179        157       7
COIL20     20        1440      1152       288       1024
USPST      10        2007      1646       401       256
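As a side note, the Min-Max scaling to [-1, 1] used here can be written, for example, as the following sketch (the paper's implementation is in MATLAB; this is only an equivalent formulation, applied per row as described above):

```python
import numpy as np

def min_max_scale(X, lo=-1.0, hi=1.0, axis=1):
    """Linearly rescale X into [lo, hi] along the given axis (axis=1: per row)."""
    x_min = X.min(axis=axis, keepdims=True)
    x_max = X.max(axis=axis, keepdims=True)
    span = np.where(x_max > x_min, x_max - x_min, 1.0)   # guard against constant rows
    return lo + (hi - lo) * (X - x_min) / span
```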


4.1.1. Settings

In the experiments, L2,1-ELM has three important steps. The first step is to calculate the output matrix H of the hidden layer, where the sigmoid function is used to map the input data to the feature space. The second step is to use the iterative algorithm to calculate the output weight matrix β. The third step is to calculate the classification accuracy. To ensure fairness and reliability of the experimental results, in the process of evaluating each method the range of the trade-off parameters C and λ is (10^-4, …, 10^5), and the input weights and biases are randomly generated in (-1, 1). In addition, the number of output nodes of every ELM-based method is equal to the number of classes of each dataset. In particular, there are two parameters that need to be adjusted: the trade-off parameter C and the number of hidden nodes L. Taking Iris and COIL20 as examples, we analyze the sensitivity of L2,1-ELM to these two parameters in Fig. 4.

From Fig. 4, we can draw the conclusion that L2,1-ELM is not sensitive to the number of hidden nodes; that is to say, with the increase of the number of hidden nodes, the classification accuracy shows no obvious change. Hence, during the experiments, we fix the number of hidden layer nodes to 500 and adopt 10-fold cross validation to select the best parameter C.

Fig. 4. Parametric sensitivity on the benchmark datasets (a) Iris; (b) COIL20.
the normal samples of each cancer, leave only the diseased samples, and
4.1.2. Results and Discussion

To ensure the reliability of the experimental results, we run each set of experiments 20 times. In addition, the paired t-test is used as a statistical tool to evaluate whether there is a significant statistical difference between two groups of results; we use a t-test with 95% confidence level to evaluate whether our method is significantly different from the other comparison methods [39]. Table 2 lists the average classification results of each method on the benchmark datasets and Table 3 lists the average training time of every method. The best results are shown in bold.
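The significance test described above amounts to a paired t-test over the 20 repeated runs of two methods on the same dataset; a minimal SciPy sketch with hypothetical accuracy values is:

```python
import numpy as np
from scipy import stats

# Accuracies of two methods over the same 20 repeated runs (hypothetical numbers).
acc_l21_elm = np.array([0.978, 0.975, 0.981, 0.977, 0.979] * 4)
acc_elm     = np.array([0.969, 0.972, 0.970, 0.973, 0.968] * 4)

t_stat, p_value = stats.ttest_rel(acc_l21_elm, acc_elm)   # paired (dependent) t-test
significant = p_value < 0.05                              # 95% confidence level
print(t_stat, p_value, significant)
```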


Table 2
The classification accuracy of each method on benchmark datasets (± var). ※ indicates that L2,1-ELM is significantly superior to the corresponding method (t-test with 95% confidence level).

Datasets   LSSVM            DNN              ELM              L2,1-RELM        CELM             L2,1-ELM
Iris       0.9333 ± 0.42※   0.9595 ± 0.98※   0.9717 ± 0.05    0.9553 ± 0.95※   0.9333 ± 0.02※   0.9778 ± 1.12
Balance    0.9408 ± 0.87    0.8880 ± 0.88※   0.8736 ± 0.27※   0.9001 ± 1.67※   0.9036 ± 0.38※   0.9280 ± 0.85
Ecoli      0.8089 ± 0.29※   0.7777 ± 1.47※   0.8407 ± 0.05※   0.8257 ± 0.87※   0.8185 ± 0.25※   0.8471 ± 0.57
COIL20     0.9731 ± 0.44    0.7465 ± 1.02※   0.9826 ± 1.57    0.9615 ± 0.69※   0.9814 ± 1.12    0.9826 ± 0.96
USPST      0.9665 ± 0.13    0.9102 ± 0.52※   0.9451 ± 0.86    0.9401 ± 0.30※   0.9476 ± 1.19※   0.9252 ± 1.43

Table 3
The training time (in seconds) of each method on benchmark datasets.

Datasets   LSSVM     DNN       ELM      L2,1-RELM   CELM      L2,1-ELM
Iris       0.1911    0.9831    0.1462   0.4956      0.5053    0.5308
Balance    0.3118    2.6297    0.0277   1.3118      8.1469    1.2322
Ecoli      0.8849    1.0415    0.0469   1.5634      12.6543   1.2322
COIL20     41.5134   11.8373   0.1891   1.7069      24.1492   1.8943
USPST      42.1369   11.8607   0.1125   2.6837      11.5406   2.5006

From Table 2, it is obvious that our method obtains better results on most datasets. Even compared with the deep learning model DNN, our method produces competitive results. In terms of training time, because ELM does not need any iteration, it can always train a neural network model in the shortest time, and the training time of L2,1-RELM is close to that of our method. Because of the need to calculate the kernel function, the training time of CELM is longer than that of the other methods, and DNN has the longest training time of all. In general, L2,1-ELM achieves satisfactory classification results without a large amount of training time.

4.2. Classification experiments on cancer samples

To identify different subtypes of cancer cells and provide more accurate information for targeted cancer therapy, the classification of cancer samples is a very important piece of basic work. Thanks to the success of The Cancer Genome Atlas (TCGA) program, we have abundant resources to explore the mechanisms of cancer occurrence and development. In this section, we evaluate the classification ability of L2,1-ELM on five integrated TCGA datasets and three scRNA-seq datasets. We introduce our experimental process in detail in the following.

Firstly, we download the level 3 standardized HT-Seq-Counts data files from the TCGA database [40]. As the largest cancer genetic information database in the world, TCGA contains a large amount of valuable information, and the various studies conducted around the TCGA database will certainly be beneficial to disease treatment, medical diagnosis and other aspects [41,42]. In our experiment, the data are mainly based on seven types of cancer: Colon adenocarcinoma (COAD), Esophageal carcinoma (ESCA), Head and Neck squamous cell carcinoma (HNSC), Lung adenocarcinoma (LUAD), Lung squamous cell carcinoma (LUSC), Ovarian serous cystadenocarcinoma (OV) and Pancreatic adenocarcinoma (PAAD).

To explore the differences between different cancers and to increase the sample size, following the integration method in [37,42], we remove the normal samples of each cancer, keep only the diseased samples, and combine them together to form new integrated datasets. Finally, we obtain five integrated datasets: COAD_ESCA (D1), ESCA_PAAD (D2), COAD_ESCA_PAAD (D3), LUAD_LUSC_PRAD (D4) and COAD_ESCA_HNSC_PAAD (D5).

We also evaluate our method on three scRNA-seq datasets downloaded from the GENE EXPRESSION OMNIBUS (GEO) database, including Melanoma [43,44], Multiple myeloma (MM) [45] and Hepatocellular carcinoma (HCC) [3]. With the continuous maturation of single-cell research technology, a large number of single-cell datasets have been generated, and research on such datasets may have a great impact on cell genetics, transcription, targeted therapy of diseases and other aspects. Hence, as ELM is a widely used classifier, it is meaningful to study its role on single-cell RNA sequencing data. Table 4 and Table 5 list the information of every dataset used in the experiments, respectively.

Table 4
Details of the integrated TCGA datasets used in the experiments.

Datasets   Classes   Samples   Training   Testing   Features
D1         2         445       267        178       20502
D2         2         359       215        144       20502
D3         3         621       373        248       20502
D4         3         1529      917        612       24991
D5         4         1019      611        408       20502

Table 5
Details of the scRNA-seq datasets used in the experiments.

Datasets    Classes   Samples   Training   Testing   Features
HCC         3         138       83         55        19525
Melanoma    2         4513      2707       1806      23686
MM          4         597       358        239       23398
4.2.1. Settings

Each dataset is normalized to between -1 and 1 using the Min-Max normalization method. In addition, the range of the trade-off parameter C is (10^-4, …, 10^5), and the input weights and biases are randomly generated in (-1, 1). Besides, taking D1 and HCC as examples, Fig. 5 describes the sensitivity of L2,1-ELM to different C and L.

In the experiments on the TCGA datasets and scRNA-seq datasets, it is clear that L2,1-ELM is not very sensitive to the number of hidden nodes. Therefore, 800 hidden nodes are selected for every ELM-based method, and 10-fold cross validation is used to select the best parameter C.

Fig. 5. Parametric sensitivity curve on real datasets (a) D1; (b) HCC.
and the 10-fold cross validation is used to select the best parameter C.

4.2.2. Results and Discussion

A comparison is made between L2,1-ELM and the other methods, namely LSSVM, DNN, ELM, L2,1-RELM and CELM. On each dataset, each method was run 20 times, and the classification accuracies reported in Table 6 and Table 7, examined by a t-test with 95% confidence level, are the averages of the 20 results. The best classification results are shown in bold.

Table 6
The classification accuracy of each method on different TCGA datasets (± var). ※ indicates that L2,1-ELM is significantly superior to the corresponding method (t-test with 95% confidence level).

Datasets   LSSVM            DNN              ELM              L2,1-RELM        CELM             L2,1-ELM
D1         0.9342 ± 0.02※   0.9565 ± 0.14※   0.9827 ± 2.10    0.9802 ± 0.00    0.9472 ± 1.96※   0.9891 ± 1.24
D2         0.9382 ± 1.25※   0.9775 ± 0.24※   0.9629 ± 2.00※   0.9787 ± 0.00    0.9551 ± 1.11※   0.9921 ± 0.86
D3         0.9548 ± 2.01※   0.9355 ± 0.30※   0.9662 ± 1.48※   0.9432 ± 0.02※   0.9634 ± 2.48※   0.9766 ± 0.67
D4         0.9639 ± 0.35※   0.9974 ± 1.18    0.9693 ± 0.08※   0.9898 ± 0.00    0.9895 ± 0.67    0.9953 ± 0.20
D5         0.8169 ± 1.62※   0.9567 ± 0.95※   0.9533 ± 1.19※   0.9445 ± 0.00※   0.9490 ± 2.62※   0.9715 ± 1.38

Table 7
The classification accuracy of each method on different scRNA-seq datasets (± var). ※ indicates that L2,1-ELM is significantly superior to the corresponding method (t-test with 95% confidence level).

Datasets    LSSVM            DNN             ELM              L2,1-RELM        CELM             L2,1-ELM
HCC         0.9565 ± 1.11※   0.9806 ± 0.62   0.9794 ± 0.00    0.9588 ± 0.00※   0.9794 ± 2.08    0.9853 ± 0.00
Melanoma    0.9538 ± 0.87※   0.9929 ± 0.27   0.9663 ± 1.01    0.9586 ± 0.00※   0.9494 ± 0.96※   0.9738 ± 0.62
MM          0.8558 ± 0.44※   0.8611 ± 0.44   0.8329 ± 0.00※   0.8363 ± 0.00※   0.8501 ± 0.50※   0.8611 ± 0.00

As can be seen from Table 6, L2,1-ELM achieves better performance on most integrated TCGA datasets. The results show that our method is suitable for the classification of cancer samples. From Table 7,
a conclusion can be drawn that our method performs better than the other methods on most scRNA-seq datasets. This shows that our method can be used as a potential method for the classification of cancer samples, and that ELM and its variants have great development space in the identification of cancer subtypes.

In addition, to evaluate the computational complexity and network complexity of our method, we list the training time of every method in Table 8 and Table 9. The shortest time is shown in bold.

Table 8
The training time (in seconds) of each method on integrated TCGA datasets.

Datasets   LSSVM     DNN       ELM      L2,1-RELM   CELM     L2,1-ELM
D1         0.4564    11.9377   0.0956   1.2968      0.4237   0.2533
D2         0.2637    12.2405   0.0873   0.6153      0.4881   0.1287
D3         4.4377    37.6238   0.3220   2.1490      0.6688   1.1253
D4         17.8219   84.9308   0.7086   1.0397      1.6086   1.1632
D5         7.5739    52.4661   0.9237   1.0145      1.4360   1.3514

Table 9
The training time (in seconds) of each method on scRNA-seq datasets.

Datasets    LSSVM     DNN        ELM      L2,1-RELM   CELM      L2,1-ELM
HCC         1.0415    1.5673     0.0116   0.4380      0.8310    0.3163
Melanoma    20.1518   155.2209   10.517   25.6442     27.4858   13.0675
MM          3.9776    68.9377    0.1114   0.9022      2.0294    0.7655

Table 8 lists the training time of each method on the different integrated TCGA datasets, while Table 9 lists the training time of each method on the scRNA-seq datasets. Compared with ELM, L2,1-ELM needs more training time, and this is particularly evident in the experiments on the scRNA-seq datasets: whereas ELM does not need to adjust the output weights iteratively, L2,1-ELM has to update the output weights iteratively until the method converges. But compared with the other methods, our method requires less training time and runs faster. Generally speaking, our method not only inherits the high accuracy of ELM, but also largely maintains the rapidity of ELM.
5. Conclusions

In this paper, a new method named L2,1-ELM is proposed and applied to tumor sample classification and single-cell RNA-sequencing sample classification. L2,1-ELM is obtained by introducing the L2,1-norm into the loss function. The introduction of the L2,1-norm effectively improves the robustness of the method and reduces the influence of noise and outliers on the classification results. Moreover, L2,1-ELM is compared with other commonly used methods and obtains comparatively competitive results. A conclusion can be drawn from the results that L2,1-ELM is more stable and efficient than the other methods.

It should be noted that we only chose the sigmoid function as the activation function. In future work, we plan to try different activation functions to improve the performance of the algorithm. In addition, the randomness of the input weights and biases also has a certain influence on the classification results; whether the particle swarm optimization algorithm and its variants can solve this problem is also a focus of our research. Of course, we will also work on applying our new algorithm to a variety of bioinformatics problems.

Authors' contributions

Liang-Rui Ren and Jin-Xing Liu designed the method and carried out the experiments; Ying-Lian Gao and Rong Zhu analyzed the results of the experiments; Liang-Rui Ren and Xiang-Zhen Kong drafted and revised the manuscript. All authors have approved the final manuscript.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Informed consent

Informed consent was not required as no human or animal subjects were involved.

Funding

This work was supported in part by grants from the National Science Foundation of China, Nos. 61872220 and 61572284.

Declaration of Competing Interest

The authors declare that they have no conflict of interest.

Appendix A. Supplementary data

Supplementary material related to this article can be found, in the online version, at doi:https://doi.org/10.1016/j.compbiolchem.2020.107368.

References

[1] Stuart, T., Satija, R., 2019. Integrative single-cell analysis. Nature Reviews Genetics 20, 5.
[2] Alquicira-Hernandez, J., Nguyen, Q., Powell, J.E., 2018. scPred: Cell type prediction at single-cell resolution. bioRxiv, 369538.
[3] Zheng, H., Pomyen, Y., Hernandez, M.O., Li, C., Livak, F., Tang, W., Dang, H., Greten, T.F., Davis, J.L., Zhao, Y., 2018. Single-cell analysis reveals cancer stem cell heterogeneity in hepatocellular carcinoma. Hepatology 68 (1), 127–140.
[4] Zheng, G.X., Terry, J.M., Belgrader, P., Ryvkin, P., Bent, Z.W., Wilson, R., Ziraldo, S.B., Wheeler, T.D., McDermott, G.P., Zhu, J., 2017. Massively parallel digital transcriptional profiling of single cells. Nature Communications 8, 14049.
[5] Jaitin, D.A., Kenigsberg, E., Keren-Shaul, H., Elefant, N., Paul, F., Zaretsky, I., Mildner, A., Cohen, N., Jung, S., Tanay, A., 2014. Massively parallel single-cell RNA-seq for marker-free decomposition of tissues into cell types. Science 343 (6172), 776–779.
[6] Huang, G.B., Zhou, H., Ding, X., Zhang, R., 2012. Extreme learning machine for regression and multiclass classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B 42 (2), 513–529.
[7] Huang, G., Song, S., Gupta, J.N., Wu, C., 2014. Semi-supervised and unsupervised extreme learning machines. IEEE Transactions on Cybernetics 44 (12), 2405–2417.
[8] Huang, G.-B., Zhu, Q.-Y., Siew, C.-K., 2004. Extreme learning machine: a new learning scheme of feedforward neural networks. Neural Networks 2, 985–990.
[9] Huang, G.B., Wang, D.H., Lan, Y., 2011. Extreme learning machines: a survey. International Journal of Machine Learning & Cybernetics 2 (2), 107–122.
[10] Deng, C., Wang, S., Bovik, A.C., Huang, G.-B., Zhao, B., 2019. Blind noisy image quality assessment using sub-band kurtosis. IEEE Transactions on Cybernetics.
[11] Jin, Z., Zhou, G., Gao, D., Zhang, Y., 2018. EEG classification using sparse Bayesian extreme learning machine for brain–computer interface. Neural Computing and Applications, 1–9.
[12] Wang, Y., Li, D., Du, Y., Pan, Z., 2015. Anomaly detection in traffic using L1-norm minimization extreme learning machine. Neurocomputing 149, 415–425.
[13] Liu, T., Hu, L., Ma, C., Wang, Z.-Y., Chen, H.-L., 2015. A fast approach for detection of erythemato-squamous diseases based on extreme learning machine with maximum relevance minimum redundancy feature selection. International Journal of Systems Science 46 (5), 919–931.
[14] Hu, L., Hong, G., Ma, J., Wang, X., Chen, H., 2015. An efficient machine learning approach for diagnosis of paraquat-poisoned patients. Computers in Biology and Medicine 59, 116–124.
[15] Wang, Y., Wang, A., Ai, Q., Sun, H., 2019. Ensemble based fuzzy weighted extreme learning machine for gene expression classification. Applied Intelligence 49 (3), 1161–1171.
[16] Eshtay, M., Faris, H., Obeid, N., 2018. Improving extreme learning machine by competitive swarm optimization and its application for medical diagnosis problems. Expert Systems with Applications 104, 134–152.
[17] Zhao, X., Zhang, X., Cai, Z., Tian, X., Wang, X., Huang, Y., Chen, H., Hu, L., 2019. Chaos enhanced grey wolf optimization wrapped ELM for diagnosis of paraquat-poisoned patients. Computational Biology and Chemistry 78, 481–490.
[18] Li, S., Jiang, H., Pang, W., 2017. Joint multiple fully connected convolutional neural network with extreme learning machine for hepatocellular carcinoma nuclei grading. Computers in Biology and Medicine 84, 156–167.
[19] Wang, M., Chen, H., Yang, B., Zhao, X., Hu, L., Cai, Z., Huang, H., Tong, C., 2017. Toward an optimal kernel extreme learning machine using a chaotic moth-flame optimization strategy with applications in medical diagnoses. Neurocomputing 267, 69–84.

[20] Xia, J., Chen, H., Li, Q., Zhou, M., Chen, L., Cai, Z., Fang, Y., Zhou, H., 2017. Ultrasound-based differentiation of malignant and benign thyroid nodules: An extreme learning machine approach. Computer Methods and Programs in Biomedicine 147, 37–49.
[21] Chen, B., Wang, X., Lu, N., Wang, S., Cao, J., Qin, J., 2018. Mixture correntropy for robust learning. Pattern Recognition 79, 318–327.
[22] da Silva, B.L.S., Inaba, F.K., Salles, E.O.T., Ciarelli, P.M., 2020. Outlier robust extreme machine learning for multi-target regression. Expert Systems with Applications 140, 112877.
[23] Yu, N., Gao, Y.-L., Liu, J.-X., Wang, J., Shang, J., 2019. Robust hypergraph regularized non-negative matrix factorization for sample clustering and feature selection in multi-view gene expression data. Human Genomics 13 (1), 46.
[24] Yu, N., Gao, Y.-L., Liu, J.-X., Wang, J., Shang, J., 2018. Hypergraph regularized NMF by L2,1-norm for clustering and co-abnormal expression genes selection. In: 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE, pp. 578–582.
[25] Jiang, M., Pan, Z., Li, N., 2017. Multi-label text categorization using L21-norm minimization extreme learning machine. Neurocomputing 261, 4–10.
[26] Zhou, S., Liu, X., Liu, Q., Wang, S., Zhu, C., Yin, J., 2016. Random Fourier extreme learning machine with ℓ2,1-norm regularization. Neurocomputing 174, 143–153.
[27] Zhao, Y.-P., Tan, J.-F., Wang, J.-J., Yang, Z., 2019. C-loss based extreme learning machine for estimating power of small-scale turbojet engine. Aerospace Science and Technology 89, 407–419.
[28] Zhou, S., Liu, X., Liu, Q., Wang, S., Zhu, C., Yin, J., 2016. Random Fourier extreme learning machine with L2,1-norm regularization. Neurocomputing 174, 143–153.
[29] Ding, C., Zhou, D., He, X., Zha, H., 2006. Rotational invariant L1-norm principal component analysis for robust subspace factorization. In: Proceedings of the 23rd International Conference on Machine Learning. ACM, pp. 281–288.
[30] Kong, D., Ding, C., Huang, H., 2011. Robust nonnegative matrix factorization using L21-norm. In: Proceedings of the 20th ACM International Conference on Information and Knowledge Management, pp. 673–682.
[31] Zong, W., Huang, G.-B., Chen, Y., 2013. Weighted extreme learning machine for imbalance learning. Neurocomputing 101, 229–242.
[32] Huang, G., Huang, G.B., Song, S., You, K., 2015. Trends in extreme learning machines: A review. Neural Networks 61, 32–48.
[33] Boyd, S., Vandenberghe, L., 2004. Convex Optimization. Cambridge University Press.
[34] Li, R., Wang, X., Lei, L., Song, Y., 2018. L2,1-norm based loss function and regularization extreme learning machine. IEEE Access 7, 6575–6586.
[35] Zhang, T., Chen, W., Li, M., 2019. Classification of inter-ictal and ictal EEGs using multi-basis MODWPT, dimensionality reduction algorithms and LS-SVM: A comparative study. Biomedical Signal Processing and Control 47, 240–251.
[36] Elola, A., Aramendi, E., Irusta, U., Picón, A., Alonso, E., Owens, P., Idris, A., 2019. Deep neural networks for ECG-based pulse detection during out-of-hospital cardiac arrest. Entropy 21 (3), 305.
[37] Jiao, C.-N., Gao, Y.-L., Yu, N., Liu, J.-X., Qi, L.-Y., 2020. Hyper-graph regularized constrained NMF for selecting differentially expressed genes and tumor classification. IEEE Journal of Biomedical and Health Informatics.
[38] Patro, S.G.K., Sahu, K.K., 2015. Normalization: A preprocessing stage. International Advanced Research Journal in Science, Engineering and Technology, 20–22.
[39] Ke, J., Gong, C., Liu, T., Zhao, L., Yang, J., Tao, D., 2020. Laplacian Welsch regularization for robust semisupervised learning. IEEE Transactions on Cybernetics.
[40] Guo, Y., Sheng, Q., Li, J., Ye, F., Samuels, D.C., Shyr, Y., 2013. Large scale comparison of gene expression levels by microarrays and RNAseq using TCGA data. PLoS One 8 (8), e71462.
[41] Lu, Y., Gao, Y.-L., Liu, J.-X., Wen, C.-G., Wang, Y.-X., Yu, J., 2016. Characteristic gene selection via L2,1-norm sparse principal component analysis. In: 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE, pp. 1828–1833.
[42] Hao, Y.-J., Gao, Y.-L., Hou, M.-X., Dai, L.-Y., Liu, J.-X., 2019. Hypergraph regularized discriminative nonnegative matrix factorization on sample classification and co-differentially expressed gene selection. Complexity 2019.
[43] Tirosh, I., Izar, B., Prakadan, S.M., Wadsworth, M.H., Treacy, D., Trombetta, J.J., Rotem, A., Rodman, C., Lian, C., Murphy, G., 2016. Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq. Science 352 (6282), 189–196.
[44] Hayashi, T., 2016. Single-cell RNA-seq reveals melanoma transcriptional heterogeneity. Cancer Discovery 6 (6), 570.
[45] Jang, J.S., Li, Y., Mitra, A.K., Bi, L., Abyzov, A., van Wijnen, A.J., Baughn, L.B., Van Ness, B., Rajkumar, V., Kumar, S., 2019. Molecular signatures of multiple myeloma progression through single cell RNA-seq. Blood Cancer Journal 9 (1), 2.
