
Knowledge-Based Systems 187 (2020) 104806


Twin minimax probability extreme learning machine for pattern recognition✩

Jun Ma a, Liming Yang b,∗, Yakun Wen a, Qun Sun c

a College of Information and Electrical Engineering, China Agricultural University, Beijing, China
b College of Science, China Agricultural University, Beijing, China
c College of Agriculture and Biotechnology, China Agricultural University, Beijing, China

highlights

• The twin minimax probability extreme learning machine (TMPELM) is proposed.
• TMPELM learns two non-parallel hyperplanes in the feature space for final classification.
• TMPELM utilizes the geometric information and statistical information of the samples.
• Numerical results show that TMPELM has better generalization performance.

Article info

Article history:
Received 30 September 2018
Received in revised form 16 June 2019
Accepted 19 June 2019
Available online 21 June 2019

Keywords:
Minimax probability machine
Twin extreme learning machine
Nonparallel hyperplanes
Second-order cone programming
Pattern recognition

Abstract

Minimax probability machine (MPM) is an excellent discriminant classifier based on prior knowledge. It can directly estimate a probability accuracy bound by minimizing the maximum probability of misclassification. However, the traditional MPM learns only one hyperplane to separate different classes in the feature space, and it may bear a heavy computational burden during the training process because it needs to address a large-scale second-order cone programming (SOCP)-type problem. In this work, we propose a novel twin minimax probability extreme learning machine (TMPELM) for pattern classification. TMPELM indirectly determines the separating hyperplane by solving a pair of smaller-sized SOCP-type problems that generate a pair of non-parallel hyperplanes. Specifically, for each hyperplane, TMPELM attempts to maximize the probability of correct classification for the sample points of one class, that is, to minimize the worst-case (maximum) probability of misclassifying that class, while staying far away from the other class. TMPELM first utilizes the random feature mapping mechanism to construct the feature space, and then two nonparallel separating hyperplanes are learned for the final classification. The proposed TMPELM not only utilizes the geometric information of the samples, but also takes advantage of the statistical information (mean and covariance of the samples). Moreover, we extend the linear TMPELM model to a nonlinear model by exploiting kernelization techniques. We also analyze the computational complexity of TMPELM. Experimental results on both NIR spectroscopy datasets and benchmark datasets demonstrate the effectiveness of TMPELM.

© 2019 Elsevier B.V. All rights reserved.

✩ No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.knosys.2019.06.014.
∗ Corresponding author. E-mail address: cauyanglm@163.com (L. Yang).

1. Introduction

The minimax probability machine (MPM) learning strategy introduced by Lanckriet et al. [1] is a popular and efficient pattern classification and regression method in machine learning fields. MPM [1] attempts to control misclassification probabilities in the worst-case scenario, that is, under all possible choices of class-conditional densities with a given mean and covariance matrix, to minimize the worst-case (maximum) probability of misclassification of future data points. Due to its outstanding performance, MPM has been widely used in many fields, such as computer vision [2], engineering technology [3,4], agriculture [5], novelty detection [6], and so on.

Recently, many improved versions of MPM have been proposed from different perspectives. Representative work can be briefly reviewed as follows. In [7], Thomas and Gregory proposed MPM regression (MPMR) by estimating mean and covariance matrix statistics of the regression data. Huang et al. [8] proposed a distribution-free Bayes optimal classifier, called the minimum
error MPM (MEMPM), which removes the equality constraint on the classification error rates from the MPM. Since the MPM only considers the prior probability distribution of each class with a given mean and covariance matrix, the structural information of the data cannot be effectively utilized. In response to this problem, Gu et al. [9] proposed the structural MPM (SMPM) by combining finite mixture models with the traditional MPM, which remedies the shortcoming that MPM does not make good use of data structure information. To effectively deal with imbalanced data, Huang et al. [10] proposed the biased MPM (BMPM). BMPM first estimates the mean and covariance of the data, then constructs the classification boundary by directly controlling the lower limit of the actual precision, and thereby realizes effective processing of imbalanced data. Traditional dimension reduction methods focus on maximizing the overall discrimination between all classes, which may be easily affected by outliers. To overcome this disadvantage, Song et al. [11] proposed the dimension reduction minimum error MPM (DR-MEMPM) for multi-class dimension reduction. Cousins et al. [12] proposed a high-probability MPM (HP-MPM) for pattern recognition. Although the above improved versions of MPM have achieved good results, they are all supervised learning methods. In order to improve the performance of MPM in a semi-supervised learning environment, Yoshiyama et al. [13] proposed the Laplacian minimax probability machine (Lap-MPM).

Single-layer forward networks (SLFNs) have been intensively studied during the past several decades. One of the most successful algorithms for training SLFNs is the support vector machine (SVM) [14,15], which is a maximal margin classifier derived under the framework of structural risk minimization (SRM). To expand the applicability of SVM, the twin support vector machine (TSVM) was proposed by Jayadeva et al. [16], which aims at deriving two nonparallel separating hyperplanes. TSVM solves a pair of smaller-sized QPPs instead of solving a large-scale QPP like the traditional SVM. This makes TSVM faster to train than traditional SVMs. Some extensions of TSVM include the ν-TSVM [17], the twin bounded support vector machine (TBSVM) [18], the Laplacian TSVM (Lap-TSVM) [19], the robust TSVM (RTSVM) [20], the structural TSVM (STSVM) [21], and the structural twin parametric-margin SVM (STPMSVM) [22].

As a new and efficient SLFN training algorithm, the extreme learning machine (ELM) was proposed by Huang et al. [23,24]. In contrast to traditional neural networks, the main advantages of ELM are that it runs fast with a globally optimal solution and is easy to implement. Its hidden nodes and input weights are randomly generated and the output weights are expressed analytically. Recently, many researchers have done in-depth research on ELM. For example, Horata et al. [25] proposed a robust ELM (RELM) to overcome the sensitivity of ELM to outliers. Lu et al. [26] applied RELM to indoor positioning. Wang et al. [27] proposed a self-adaptive ELM (SaELM), which can select the best number of hidden-layer neurons to construct the optimal network. Zhang et al. [28] proposed a memetic-algorithm-based ELM (M-ELM), which embeds a local search strategy into a global optimization framework to obtain optimal network parameters. Inspired by TSVM, Wan et al. [29] proposed the twin extreme learning machine (TELM) for supervised classification. TELM aims to learn two nonparallel separating hyperplanes in the ELM feature space; it trains two smaller-scale QPPs simultaneously and inherits the advantages of ELM and TSVM. Huang et al. [30] extended ELM to semi-supervised learning by introducing manifold regularization, and proposed the semi-supervised ELM (SS-ELM). Subsequently, Pei et al. [31] proposed a robust semi-supervised learning algorithm, termed robust SS-ELM (RSS-ELM), to overcome the sensitivity of the classical SS-ELM to outliers.

In this paper, inspired by the ideas of MPM and TELM, we propose a new nonparallel plane classifier, termed the twin minimax probability extreme learning machine (TMPELM). For each hyperplane, TMPELM attempts to maximize the probability of correctly classifying one class of samples, i.e., to minimize the worst-case (maximum) probability of misclassification of that class, while the distance to the other class is as large as possible. An important feature of TMPELM is that the worst-case bound on the probability of misclassification of future data is always obtained explicitly. In addition, TMPELM indirectly determines the separating hyperplane through a pair of nonparallel hyperplanes obtained by solving two smaller-sized SOCP-type problems. Similar to TELM, TMPELM also first utilizes the random feature mapping mechanism to construct the feature space, and then two nonparallel separating hyperplanes are learned for the final classification. The proposed TMPELM not only utilizes the geometric information of the data, but also takes advantage of the statistical information (mean and covariance of the samples). Our proposed TMPELM can be solved via a pair of smaller-scale SOCP problems, and its time complexity is similar to that of the twin extreme learning machine (TELM). Based on the linear model of TMPELM, we also propose a nonlinear TMPELM model by exploiting kernelization techniques. The experimental results on both NIR spectroscopy datasets and benchmark datasets demonstrate the effectiveness of TMPELM.

The rest of this paper is organized as follows. In Section 2, we give a brief review of MPM and TELM. In Section 3, we describe the details of the proposed TMPELM and give its computational complexity. The comparison of TMPELM with other algorithms is given in Section 4. All the experimental results are shown in Section 5. The conclusions and discussions of this paper are given in Section 6.

2. Related work

As mentioned above, in order to propose an improved version of MPM, we briefly review MPM in Section 2.1 and then introduce TELM in Section 2.2.

2.1. Minimax probability machine

Let x and y denote random vectors with means and covariance matrices given by (µ1, Σ1) and (µ2, Σ2), respectively, with x, y, µ1, µ2 ∈ Rn and Σ1, Σ2 ∈ Rn×n. As a binary classification model, the minimax probability machine minimizes the upper bound of the misclassification probability. We attempt to obtain a decision hyperplane H(a, b) = {z | aᵀz = b}, which separates the two classes of points with maximal probability with respect to all distributions. The original model of the MPM can be expressed as follows:

$$\max_{\alpha,\,a,\,b}\ \alpha \quad \text{s.t.}\quad \inf_{x\sim(\mu_1,\Sigma_1)} \Pr\{a^T x \ge b\} \ge \alpha,\qquad \inf_{y\sim(\mu_2,\Sigma_2)} \Pr\{a^T y \le b\} \ge \alpha \tag{1}$$

where the notation x ∼ (µ1, Σ1) refers to the class of distributions that have the prescribed mean µ1 and covariance Σ1, but are otherwise arbitrary; likewise for y. α represents the lower bound of the accuracy for future data, namely, the worst-case accuracy. By using the Chebyshev inequality [32], the optimization problem can be rewritten as:

$$\max_{\alpha,\,a,\,b}\ \alpha \quad \text{s.t.}\quad -b + a^T\mu_1 \ge k(\alpha)\sqrt{a^T\Sigma_1 a},\qquad b - a^T\mu_2 \ge k(\alpha)\sqrt{a^T\Sigma_2 a} \tag{2}$$
where $k(\alpha) = \sqrt{\alpha/(1-\alpha)}$. Since k(α) is a monotonically increasing function of α, we thus have:

$$\max_{k,\,a}\ k \quad \text{s.t.}\quad a^T(\mu_1-\mu_2) \ge k\left(\sqrt{a^T\Sigma_1 a} + \sqrt{a^T\Sigma_2 a}\right). \tag{3}$$

The above optimization problem (3) can be equivalently written as

$$\min_{a}\ \sqrt{a^T\Sigma_1 a} + \sqrt{a^T\Sigma_2 a} \quad \text{s.t.}\quad a^T(\mu_1-\mu_2) = 1. \tag{4}$$

It is easy to see that the above optimization problem (4) is a convex optimization problem that can be solved by SOCP solvers, e.g., SeDuMi [33]. For the nonlinear case and more details, please refer to [1].
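To make the review above concrete, the following is a minimal sketch (in Python with the CVXPY modeling library, not the MATLAB/SeDuMi implementation used in this paper) of solving problem (4) and recovering the threshold b and the worst-case accuracy α; the inputs mu1, mu2, S1 and S2 are assumed to be sample estimates of the class means and covariances.

```python
import numpy as np
import cvxpy as cp

def mpm_fit(mu1, mu2, S1, S2):
    """Sketch of MPM training via problem (4): minimize the sum of the two
    covariance-weighted norms subject to a^T (mu1 - mu2) = 1."""
    n = mu1.shape[0]
    a = cp.Variable(n)
    # Cholesky factors so that ||L^T a|| = sqrt(a^T S a).
    L1 = np.linalg.cholesky(S1 + 1e-8 * np.eye(n))
    L2 = np.linalg.cholesky(S2 + 1e-8 * np.eye(n))
    objective = cp.Minimize(cp.norm(L1.T @ a) + cp.norm(L2.T @ a))
    problem = cp.Problem(objective, [a @ (mu1 - mu2) == 1])
    problem.solve()
    a_val = a.value
    # Optimal k of (3) is the reciprocal of the optimal value of (4);
    # the worst-case accuracy follows from k(alpha) = sqrt(alpha/(1-alpha)).
    k = 1.0 / problem.value
    alpha = k**2 / (1.0 + k**2)
    # Threshold b placed where the first constraint of (2) is tight;
    # a new point x is then assigned to class 1 if a^T x >= b.
    b = a_val @ mu1 - k * np.sqrt(a_val @ S1 @ a_val)
    return a_val, b, alpha
```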
2.2. Twin extreme learning machine

Consider a binary classification problem with the training data Tl = {(x1, y1), ..., (xl, yl)}, where xi ∈ Rn, yi ∈ Y = {1, −1}, i = 1, ..., l. Tl contains m1 positive-class samples and m2 negative-class samples, where l = m1 + m2. Assume that A ∈ R^{m1×n} collects all positive-class samples and B ∈ R^{m2×n} collects all negative-class samples.

We assume that G(ai, bi, x) is a linear continuous function. For convenience, we use matrices H1 and H2 to represent the hidden-layer outputs of the samples belonging to the positive class and the negative class, respectively. Thus, we have

$$H_1=\begin{bmatrix} h_1(x_1) & \cdots & h_L(x_1)\\ \vdots & \ddots & \vdots\\ h_1(x_{m_1}) & \cdots & h_L(x_{m_1}) \end{bmatrix},\qquad H_2=\begin{bmatrix} h_1(x_1) & \cdots & h_L(x_1)\\ \vdots & \ddots & \vdots\\ h_1(x_{m_2}) & \cdots & h_L(x_{m_2}) \end{bmatrix}$$

where hi(x) = G(ai, bi, x) = ai · x + bi, i = 1, ..., L.
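For illustration, the short sketch below builds the hidden-layer output matrices H1 and H2 with randomly generated input weights and biases; the sigmoid activation matches the one used in the experiments of Section 5 (for the linear G above one would simply use X @ W + b), and the data matrices are placeholders.

```python
import numpy as np

def elm_hidden_output(X, W, b):
    """Hidden-layer output matrix: row i is h(x_i) = [G(a_1,b_1,x_i), ..., G(a_L,b_L,x_i)]
    with the sigmoid activation G(a, b, x) = 1 / (1 + exp(-(a.x + b)))."""
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))

rng = np.random.default_rng(0)
n, L = 10, 100                        # input dimension and number of hidden nodes
A = rng.normal(size=(40, n))          # positive-class samples (placeholder data)
B = rng.normal(size=(60, n))          # negative-class samples (placeholder data)
W = rng.uniform(-1, 1, size=(n, L))   # random input weights a_i (fixed, never trained)
b = rng.uniform(-1, 1, size=L)        # random biases b_i
H1 = elm_hidden_output(A, W, b)       # m1 x L
H2 = elm_hidden_output(B, W, b)       # m2 x L
```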
Similar to TSVM, TELM [29] also determines two nonparallel hyperplanes in the ELM feature space:

$$f_1(x) = w_1^T h(x) = 0, \tag{5}$$

and

$$f_2(x) = w_2^T h(x) = 0. \tag{6}$$

For each hyperplane, TELM minimizes the distance to one class while keeping the distance to the other class as large as possible. This leads to the following two quadratic programming problems (QPPs):

$$\min_{w_1,\,\xi}\ \tfrac{1}{2}\|H_1 w_1\|^2 + C_1 e_2^T \xi \quad \text{s.t.}\quad -H_2 w_1 + \xi \ge e_2,\ \ \xi \ge 0 \tag{7}$$

and

$$\min_{w_2,\,\eta}\ \tfrac{1}{2}\|H_2 w_2\|^2 + C_2 e_1^T \eta \quad \text{s.t.}\quad H_1 w_2 + \eta \ge e_1,\ \ \eta \ge 0 \tag{8}$$

where ξ and η are slack vectors, 0 is a zero vector, C1, C2 > 0 are regularization parameters, and e1 ∈ R^{m1} and e2 ∈ R^{m2} are vectors of ones.

According to optimization theory, we can obtain the dual problems of TELM as

$$\min_{\alpha}\ \tfrac{1}{2}\alpha^T H_2 (H_1^T H_1 + \epsilon I)^{-1} H_2^T \alpha - e_2^T \alpha \quad \text{s.t.}\quad 0 \le \alpha \le C_1 e_2 \tag{9}$$

and

$$\min_{\theta}\ \tfrac{1}{2}\theta^T H_1 (H_2^T H_2 + \epsilon I)^{-1} H_1^T \theta - e_1^T \theta \quad \text{s.t.}\quad 0 \le \theta \le C_2 e_1 \tag{10}$$

where α and θ are the vectors of Lagrange multipliers, ϵI is a regularization term, ϵ is a small positive scalar and I is an identity matrix. For the nonlinear case and more details, please refer to [29].
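As a rough illustration of how the box-constrained dual (9) can be handled numerically (the paper solves the QPPs with MATLAB's quadratic programming toolbox), the sketch below applies a projected quasi-Newton solver from SciPy; the recovery of w1 from the dual variables follows the standard TSVM-style relation w1 = −(H1ᵀH1 + ϵI)⁻¹H2ᵀα, which is assumed here rather than quoted from the excerpt above.

```python
import numpy as np
from scipy.optimize import minimize

def telm_dual_w1(H1, H2, C1, eps=1e-4):
    """Solve dual (9): min 0.5*a^T Q a - e^T a with 0 <= a <= C1, where
    Q = H2 (H1^T H1 + eps I)^(-1) H2^T, then recover the primal weight w1."""
    L = H1.shape[1]
    M = np.linalg.inv(H1.T @ H1 + eps * np.eye(L))
    Q = H2 @ M @ H2.T
    e = np.ones(H2.shape[0])
    fun = lambda a: 0.5 * a @ Q @ a - e @ a      # convex quadratic objective
    jac = lambda a: Q @ a - e                    # its gradient
    res = minimize(fun, x0=np.zeros_like(e), jac=jac, method="L-BFGS-B",
                   bounds=[(0.0, C1)] * len(e))
    alpha = res.x
    w1 = -M @ H2.T @ alpha                       # positive-hyperplane weight vector
    return w1, alpha
```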
3. Twin minimax probability extreme learning machines

In this section, we elaborate the formulation of the twin minimax probability extreme learning machine (TMPELM) for binary classification. Then, we extend the proposed TMPELM to the nonlinear case and describe it in detail. Finally, we give the computational complexity of the proposed TMPELM.

3.1. Linear TMPELM

Although the MPM described above shows superior generalization performance, it is usually obtained by solving a large-scale second-order cone programming (SOCP) problem. As the number of samples increases, the computational cost grows, which greatly limits its applicability. The motivation of this paper is to construct a superior classifier that has better generalization performance and relatively lower computational complexity. Therefore, based on the spirit of MPM and TELM, we propose the twin minimax probability extreme learning machine (TMPELM) for pattern recognition. Similar to TELM, TMPELM also seeks two non-parallel separating hyperplanes in the ELM feature space. For each hyperplane, TMPELM attempts to maximize the probability of correctly classifying one class of samples, i.e., to minimize the worst-case (maximum) probability of misclassification of that class, while the distance to the other class is as large as possible.

For the linear case, TMPELM first utilizes the random feature mapping mechanism to construct the feature space, and then two nonparallel separating hyperplanes are constructed for the final classification. Therefore, we obtain two non-parallel TMPELM classification hyperplanes

$$f_1(x) = w_1 \cdot h(x) = 0 \tag{11}$$

and

$$f_2(x) = w_2 \cdot h(x) = 0 \tag{12}$$

that attempt to separate the positive and negative class points, respectively. In this paper, we call f1(x) and f2(x) the positive hyperplane and the negative hyperplane, respectively. Similar to TELM, we utilize the positive and negative hyperplanes (11) and (12) to separate future sample points. In general, the two hyperplanes are nonparallel. An intuitive geometric illustration is shown in Fig. 1.

Fig. 1. Illustration of the TMPELM. The blue ◦ represents the positive point and the red + denotes the negative point.

Next, we elaborate the formulation of TMPELM in the linear case for binary classification. In order to obtain the positive and negative hyperplanes, we consider the following pair of constrained optimization problems:

$$\max_{\alpha,\,w_1}\ \alpha \quad \text{s.t.}\quad \inf_{x_1\sim(\mu_1,\Sigma_1)} \Pr\{w_1^T h(x_1) > 0\} > \alpha,\qquad -H_2 w_1 > e \tag{13}$$

and

$$\max_{\theta,\,w_2}\ \theta \quad \text{s.t.}\quad \inf_{x_2\sim(\mu_2,\Sigma_2)} \Pr\{w_2^T h(x_2) < 0\} > \theta,\qquad H_1 w_2 > e \tag{14}$$

where the notation x1 ∼ (µ1, Σ1) refers to the class of distributions that have the prescribed mean µ1 and covariance Σ1, but are otherwise arbitrary; likewise for x2. Here w1, w2 ∈ Rn and e is a vector of ones of appropriate dimension.

To better understand the proposed TMPELM, we first describe the optimization problem (13). TMPELM minimizes the probability that future samples will be misclassified in the worst case, provides a lower bound on the classification accuracy, and controls the probability of misclassification so as to maximize the classification accuracy. That is, the objective function of (13) provides the worst-case bound on the probability of misclassification of positive sample points. The first constraint of (13) requires that positive training samples lie above the positive hyperplane with probability greater than α. In other words, the probability-based constraint in (13) is a soft constraint that maximizes α to place the positive class samples as far as possible above the classification hyperplane f1(x) = w1 · h(x). The second constraint in (13) places the negative points below the hyperplane f1(x) = w1 · h(x). Similar explanations hold for problem (14).

Next, we convert the optimization problems (13) and (14) into convex formulations. As shown in (13) and (14), the constraints of these optimization problems require computing infimums over all possible distributions with means µ1 and µ2 and covariance matrices Σ1 and Σ2. However, it is not easy to compute these infimums directly. Thus, we introduce Lemma 1 (proved in [32]), which eliminates the computation of the infimums.

Lemma 1 ([32]). Given a ≠ 0 and b, such that aᵀµ ≥ b and α ∈ [0, 1), the condition

$$\inf_{x\sim(\mu,\Sigma_x)} \Pr(a^T x \le b) \ge \alpha$$

holds if and only if $b - a^T\mu \ge k(\alpha)\sqrt{a^T\Sigma_x a}$ with $k(\alpha) = \sqrt{\alpha/(1-\alpha)}$.

Therefore, by using Lemma 1 (applied to the first constraint of (13) with a = −w1 and b = 0, so that it becomes w1ᵀµ1 ≥ k(α)√(w1ᵀΣ1w1), and analogously for (14)), we can transform the TMPELM optimization problems (13) and (14) as follows:

$$\max_{\alpha,\,w_1}\ \alpha \quad \text{s.t.}\quad w_1^T\mu_1 > \rho(\alpha)\sqrt{w_1^T\Sigma_1 w_1},\qquad -H_2 w_1 > e \tag{15}$$

and

$$\max_{\theta,\,w_2}\ \theta \quad \text{s.t.}\quad -w_2^T\mu_2 > \eta(\theta)\sqrt{w_2^T\Sigma_2 w_2},\qquad H_1 w_2 > e \tag{16}$$

where $\rho(\alpha) = \sqrt{\alpha/(1-\alpha)}$ and $\eta(\theta) = \sqrt{\theta/(1-\theta)}$. It is obvious that ρ(α) and η(θ) are monotonically increasing functions of α and θ, respectively. So (15) and (16) are equivalent to

$$\max_{w_1}\ \frac{w_1^T\mu_1}{\sqrt{w_1^T\Sigma_1 w_1}} \quad \text{s.t.}\quad -H_2 w_1 > e,\qquad w_1^T\mu_1 > 0 \tag{17}$$

and

$$\max_{w_2}\ \frac{-w_2^T\mu_2}{\sqrt{w_2^T\Sigma_2 w_2}} \quad \text{s.t.}\quad H_1 w_2 > e,\qquad -w_2^T\mu_2 > 0. \tag{18}$$

Let $t_1 = \frac{1}{w_1^T\mu_1}$ and $w_1 \leftarrow \frac{w_1}{w_1^T\mu_1}$; then the optimization problem (17) can be rewritten as

$$\min_{w_1,\,t_1}\ \sqrt{w_1^T\Sigma_1 w_1} \quad \text{s.t.}\quad -H_2 w_1 > e t_1,\qquad w_1^T\mu_1 > 0. \tag{19}$$

In a similar way, the optimization problem (18) can be rewritten as

$$\min_{w_2,\,t_2}\ \sqrt{w_2^T\Sigma_2 w_2} \quad \text{s.t.}\quad H_1 w_2 > e t_2,\qquad -w_2^T\mu_2 > 0 \tag{20}$$

where $t_2 = -\frac{1}{w_2^T\mu_2}$ and $w_2 \leftarrow -\frac{w_2}{w_2^T\mu_2}$.

Introducing $\sqrt{w_1^T\Sigma_1 w_1} \le s_1$ and $\sqrt{w_2^T\Sigma_2 w_2} \le s_2$, we can convert (19) and (20) into the following optimization problems:

$$\min_{s_1,\,t_1}\ s_1 \quad \text{s.t.}\quad \sqrt{w_1^T\Sigma_1 w_1} \le s_1,\qquad -H_2 w_1 > e t_1,\qquad w_1^T\mu_1 > 0 \tag{21}$$

and

$$\min_{s_2,\,t_2}\ s_2 \quad \text{s.t.}\quad \sqrt{w_2^T\Sigma_2 w_2} \le s_2,\qquad H_1 w_2 > e t_2,\qquad -w_2^T\mu_2 > 0. \tag{22}$$

It is obvious that the above optimization problems (21) and (22) can be solved by SOCP solvers, e.g., SeDuMi [33]. Based on the above discussion, the TMPELM algorithm is summarized as Algorithm 1.

Algorithm 1 Linear TMPELM Algorithm
Input: a training set Tl = {xi, yi}, i = 1, ..., l, where xi ∈ Rn, yi ∈ {−1, +1}; an activation function G(x); and the number of hidden nodes L.
Output: the decision function of linear TMPELM.
Step 1: Initiate an ELM network with L hidden nodes using random input weights ai and biases bi.
Step 2: Construct the input matrices A and B, then calculate their hidden-layer output matrices H1 and H2, respectively.
Step 3: Calculate µ1, µ2, Σ1 and Σ2, respectively.
Step 4: Calculate t1 and t2 by solving (21) and (22).
Step 5: From t1 = 1/(w1ᵀµ1) and t2 = −1/(w2ᵀµ2) we can get w1 and w2.
Step 6: Calculate the perpendicular distance of a data point x from the separating hyperplanes using (23), then assign x to the positive or negative class.

A new sample point x ∈ Rn is assigned a positive or negative label by comparing the following perpendicular distance measurements between it and the two hyperplanes, i.e.,

$$f(x) = \arg\min_{r=1,2} d_r(x) = \arg\min_{r=1,2} \left| w_r^T h(x) \right| \tag{23}$$

where | · | denotes the perpendicular distance of the sample point x from the hyperplane wrᵀh(x) = 0.
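To give a concrete feel for the SOCP-type problem, the following sketch solves a CVXPY version of (21) for the positive hyperplane ((22) is symmetric) and applies the decision rule (23). It is only a sketch under two simplifying assumptions that the formulation above does not state explicitly: the normalization w1ᵀµ1 = 1 implied by the substitution w1 ← w1/(w1ᵀµ1), and a fixed value for t1 (the paper treats t1 as a variable determined by that substitution).

```python
import numpy as np
import cvxpy as cp

def tmpelm_positive_plane(H1, H2, t1=1.0, ridge=1e-8):
    """Sketch of problem (21): minimize sqrt(w^T Sigma1 w) subject to
    -H2 w >= t1 * e, with the normalization w^T mu1 = 1 (assumed here)."""
    mu1 = H1.mean(axis=0)                                  # class mean in feature space
    S1 = np.cov(H1, rowvar=False) + ridge * np.eye(H1.shape[1])
    L1 = np.linalg.cholesky(S1)                            # so ||L1^T w|| = sqrt(w^T S1 w)
    w = cp.Variable(H1.shape[1])
    constraints = [-(H2 @ w) >= t1 * np.ones(H2.shape[0]),
                   w @ mu1 == 1]
    cp.Problem(cp.Minimize(cp.norm(L1.T @ w)), constraints).solve()
    return w.value

def tmpelm_predict(x_hidden, w1, w2):
    """Decision rule (23): assign x to the class whose hyperplane is closer."""
    d_pos, d_neg = abs(x_hidden @ w1), abs(x_hidden @ w2)
    return +1 if d_pos <= d_neg else -1
```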
3.2. Nonlinear TMPELM

According to Hilbert space theory, w1,2 can be expressed as $w_{1,2} = \sum_{i=1}^{l} (\nu_{1,2})_i\, h(x_i) = H^T \nu_{1,2}$. We consider the following two kernel-generated separating surfaces:

$$K(x^T, C^T)\,\nu_1 = 0 \tag{24}$$

and

$$K(x^T, C^T)\,\nu_2 = 0 \tag{25}$$

where $C^T = [A^T\ B^T]$ and K is an ELM kernel function. We can calculate the ELM kernel function as

$$K(x_i, x_j) = h(x_i)\cdot h(x_j) = [G(a_1,b_1,x_i),\ldots,G(a_L,b_L,x_i)]\cdot[G(a_1,b_1,x_j),\ldots,G(a_L,b_L,x_j)]. \tag{26}$$

In order to obtain the hyperplanes (24) and (25), we consider the following pair of optimization problems:

$$\max_{\alpha,\,\nu_1}\ \alpha \quad \text{s.t.}\quad \inf_{x_1\sim(\mu_1,\Sigma_1)} \Pr\{K_{ELM}(A, C^T)\nu_1 > 0\} > \alpha,\qquad -K_{ELM}(B, C^T)\nu_1 \ge e \tag{27}$$

and

$$\max_{\theta,\,\nu_2}\ \theta \quad \text{s.t.}\quad \inf_{x_2\sim(\mu_2,\Sigma_2)} \Pr\{K_{ELM}(B, C^T)\nu_2 < 0\} > \theta,\qquad K_{ELM}(A, C^T)\nu_2 > e. \tag{28}$$

Similar to the linear case, we can use Lemma 1 to obtain the following optimization problems:

$$\max_{\alpha,\,\nu_1}\ \alpha \quad \text{s.t.}\quad \nu_1^T\mu_1 > \sqrt{\tfrac{\alpha}{1-\alpha}}\,\sqrt{\nu_1^T\Sigma_1\nu_1},\qquad -K_{ELM}(B, C^T)\nu_1 \ge e \tag{29}$$

and

$$\max_{\theta,\,\nu_2}\ \theta \quad \text{s.t.}\quad -\nu_2^T\mu_2 > \sqrt{\tfrac{\theta}{1-\theta}}\,\sqrt{\nu_2^T\Sigma_2\nu_2},\qquad K_{ELM}(A, C^T)\nu_2 > e. \tag{30}$$

Let $t_1 = \frac{1}{\nu_1^T\mu_1}$ and $w_1 = \frac{\nu_1}{\nu_1^T\mu_1}$; then the optimization problem (29) can be rewritten as

$$\min_{w_1,\,t_1}\ \sqrt{w_1^T\Sigma_1 w_1} \quad \text{s.t.}\quad -K(B, C^T) w_1 > e t_1,\qquad w_1^T\mu_1 > 0. \tag{31}$$

In a similar way, the optimization problem (30) can be rewritten as

$$\min_{w_2,\,t_2}\ \sqrt{w_2^T\Sigma_2 w_2} \quad \text{s.t.}\quad K(A, C^T) w_2 > e t_2,\qquad -w_2^T\mu_2 > 0 \tag{32}$$

where $t_2 = -\frac{1}{\nu_2^T\mu_2}$ and $w_2 = -\frac{\nu_2}{\nu_2^T\mu_2}$.

Introducing $\sqrt{w_1^T\Sigma_1 w_1} \le q_1$ and $\sqrt{w_2^T\Sigma_2 w_2} \le q_2$, problems (31) and (32) can be converted into the following optimization problems:

$$\min_{q_1,\,t_1}\ q_1 \quad \text{s.t.}\quad \sqrt{w_1^T\Sigma_1 w_1} \le q_1,\qquad -K(B, C^T) w_1 > e t_1,\qquad w_1^T\mu_1 > 0 \tag{33}$$

and

$$\min_{q_2,\,t_2}\ q_2 \quad \text{s.t.}\quad \sqrt{w_2^T\Sigma_2 w_2} \le q_2,\qquad K(A, C^T) w_2 > e t_2,\qquad -w_2^T\mu_2 > 0. \tag{34}$$

Similar to the linear case, the above optimization problems can be solved by SOCP solvers, such as SeDuMi [33]. Once the solutions ν1 and ν2 of problems (27) and (28) are obtained, a new data point x is assigned to the positive class or the negative class in a manner similar to the linear case. Based on the above discussion, the nonlinear TMPELM algorithm is summarized as Algorithm 2.

Algorithm 2 Nonlinear TMPELM Algorithm
Input: a training set Tl = {xi, yi}, i = 1, ..., l, where xi ∈ Rn, yi ∈ {−1, +1}; an activation function G(x); and the number of hidden nodes L.
Output: the decision function of nonlinear TMPELM.
Step 1: Initiate an ELM network with L hidden nodes using random input weights ai and biases bi.
Step 2: Construct the input matrices A and B, then calculate KELM(A, C) and KELM(B, C) by (26).
Step 3: Calculate µ1, µ2, Σ1 and Σ2, respectively.
Step 4: Calculate t1 and t2 by solving (33) and (34).
Step 5: From t1 = 1/(ν1ᵀµ1) and t2 = −1/(ν2ᵀµ2) we can get ν1 and ν2.
Step 6: Calculate the perpendicular distance of a data point x from the separating hyperplanes using (23), then assign x to the positive or negative class.

3.3. Computational complexity

For MPM, general-purpose solvers such as SeDuMi [33] can handle the problem efficiently. This solver uses interior-point methods for SOCP, which yields a worst-case complexity of O(n³). Of course, the first and second moments of x, y must be estimated beforehand, using for example the sample moment plug-in estimates x̂, ŷ, Σ̂x, Σ̂y for x̄, ȳ, Σx and Σy, respectively. This brings the total complexity to O(n³ + Nn²), where N is the number of data points and n is the dimension of the samples. TMPELM has a similar complexity of O(2n³ + Nn²). Although the time complexity is relatively large, TMPELM can still be solved in polynomial time.

4. Comparison with other algorithms

In this section, we compare the proposed TMPELM with other related algorithms (MPM [1], SMPM [9], TELM [29] and TSVM [16]), and further discuss the differences between the proposed method and these related algorithms in terms of modeling ideas.

4.1. TMPELM vs. MPM

From the above optimization problems (13) and (14), we can see that the proposed TMPELM is based on MPM. However, there are three differences between them. Firstly, MPM directly minimizes an upper bound of the future misclassification rate in a worst-case setting, that is, under all possible choices of class-conditional distributions with a given mean and covariance matrix, which does not efficiently exploit the geometric information of the data samples. TMPELM seizes the statistical and geometric information of the data samples simultaneously and constructs a pair of nonparallel hyperplanes such that each hyperplane separates one of the two classes of samples as well as possible while remaining distant from the other class. Secondly, the bias b is not required in the TMPELM optimization constraints, so the two nonparallel separating hyperplanes pass through the origin in the TMPELM feature space. In addition, different from the traditional MPM, the random feature mapping is adopted by TMPELM and h(x) is known to users.

4.2. TMPELM vs. TELM

As mentioned earlier, the purpose of both TMPELM and TELM is to construct a pair of non-parallel hyperplanes. There is no need for a bias b in either TELM or TMPELM, and the random feature mappings h(x) used by TELM and TMPELM are known.

However, there are some differences between the two classifiers. First, the constructions of the optimization problems of TMPELM and TELM are totally different, including the objective functions and the constraints. Second, TMPELM determines the final classification hyperplanes by solving a pair of smaller-sized SOCPs, while TELM does so by optimizing a pair of smaller-sized QPPs.

4.3. TMPELM vs. TSVM

TSVM generates two nonparallel separating hyperplanes so that each hyperplane tends to reach the smallest distance to one of the two classes and tries to be far away from the other class. The bias b is required in the primal objective functions of TSVM, so the two hyperplanes in TSVM generally do not pass through the origin in the TSVM feature space.

However, it is worth noting that TMPELM is similar to TSVM in that it finds two nonparallel separating hyperplanes in the TMPELM feature space. But the two hyperplanes pass through the origin in the TMPELM feature space. The main reason is that the bias b is not required in the primal objective functions of TMPELM.

5. Experiments

In this section, we verify the effectiveness of the proposed TMPELM by a series of experiments based on near-infrared spectral datasets and UCI datasets. After the experimental setup is described in Section 5.1, we carefully analyze the classification performance of our proposed TMPELM on different datasets in Sections 5.2 and 5.3.
5.1. Experiment setup

To evaluate the performance of our proposed TMPELM, we systematically compare it with four other state-of-the-art methods, including the classic MPM1 [1], SMPM2 [9], TELM [29] and TSVM3 [16]. In order to measure the actual classification performance of all algorithms, the traditional accuracy index (ACC) is used. Additionally, to more specifically distinguish the performance of different algorithms, the well-established F1 score is also investigated in our experiments. These values are defined as:

$$ACC = \frac{TP + TN}{TP + FN + TN + FP} \tag{35}$$

$$F_1 = \frac{2 \times TP}{2 \times TP + FP + FN} \tag{36}$$

where TP and TN denote true positives and true negatives, and FN and FP denote false negatives and false positives, respectively. The higher the values of ACC and the F1-measure, the better the model is. To compare the computational time of all employed algorithms, we also recorded their running time, including both training and testing, on all involved datasets.
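A small helper computing (35) and (36) from the confusion counts is sketched below; labels are assumed to be +1 for the positive class and −1 for the negative class.

```python
import numpy as np

def acc_f1(y_true, y_pred):
    """ACC and F1 score of the positive class, as defined in (35) and (36)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == -1) & (y_pred == -1))
    fp = np.sum((y_true == -1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == -1))
    acc = (tp + tn) / (tp + fn + tn + fp)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return acc, f1
```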
In order to test the performance of TMPELM, we test the TMPELM algorithm with different kernels, such as the linear kernel, Gaussian kernel and sigmoid kernel. The width σ in the Gaussian kernel, K(xi, xj) = exp(−∥xi − xj∥²/2σ²), is set to the average distance among all data instances. The activation function 1/(1 + exp(−(w · x + b))) (with w, b randomly generated) is used in our experiments. All parameters are chosen by the grid search method. It is well known that grid search optimizes the performance of a model by traversing given combinations of parameters: the possible values of the various parameters are enumerated and combined, and all possible combinations are listed to generate a ''grid''. Each combination is then used for algorithm training and the performance is evaluated using k-fold cross-validation. In our experiments, the parameters of all algorithms are set as follows: the penalty parameters C1, C2 are selected from {2^i | i = −8, −7, ..., 7, 8}, the RBF kernel parameter σ is selected from {10^i | i = −5, −4, −3, ..., 3, 4, 5}, and the number of hidden-layer nodes L is selected from {100, 200, 300, 500, 1000, 2000, 3000, 5000}. We combine the possible values of the penalty parameters, the RBF kernel parameter and the hidden-layer node parameter, and list all possible combinations to generate a ''grid''. Each combination is then used to train the specified algorithm (SMPM, TELM, TSVM, MPM or TMPELM) and the performance is evaluated using 10-fold cross-validation. Finally, we screen out the optimal combination of parameters for MPM, SMPM, TSVM, TELM and TMPELM.
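The grid search described above can be organized as in the following sketch; train_and_score is a placeholder (not part of the paper) that trains one of the compared classifiers with a given parameter combination and returns its validation ACC.

```python
import numpy as np
from sklearn.model_selection import KFold, ParameterGrid

param_grid = {
    "C1": [2.0**i for i in range(-8, 9)],
    "C2": [2.0**i for i in range(-8, 9)],
    "sigma": [10.0**i for i in range(-5, 6)],
    "L": [100, 200, 300, 500, 1000, 2000, 3000, 5000],
}

def grid_search_10fold(X, y, train_and_score):
    """Return the parameter combination with the best mean 10-fold CV accuracy.
    `train_and_score(X_tr, y_tr, X_va, y_va, params)` is user-supplied."""
    best_acc, best_params = -np.inf, None
    cv = KFold(n_splits=10, shuffle=True, random_state=0)
    for params in ParameterGrid(param_grid):
        scores = [train_and_score(X[tr], y[tr], X[va], y[va], params)
                  for tr, va in cv.split(X)]
        if np.mean(scores) > best_acc:
            best_acc, best_params = np.mean(scores), params
    return best_params, best_acc
```

In practice the full grid is very large, so the loop is usually restricted to the parameters that a given algorithm actually uses.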
To check the validity of the proposed TMPELM, we conduct numerical simulations on various datasets, including benchmark datasets from the UCI Repository4 and five groups of near-infrared spectral datasets. We perform ten-fold cross-validation on all considered datasets. In other words, each dataset is split randomly into ten subsets, and one of those subsets is reserved as a test set; this process is repeated ten times, and the average of the ten test results is used as the performance measure. To obtain more objective experimental results, we normalized all datasets that participated in the experiments to the interval [0, 1]. We wrote our own Matlab scripts for TELM, MPM and TMPELM. For fair comparison, we exploit the Matlab quadratic programming (QP) toolbox to solve all the QP problems in the relevant algorithms. All the methods are implemented in MATLAB 2014a on Windows 10, running on a PC with an Intel(R) Core(TM) i5-4590 processor (3.40 GHz) and 8 GB of RAM.

1 http://www.cs.colorado.edu/grudic/software-codes.
2 https://sites.google.com/site/jsgubin/our-software-codes.
3 http://www.optimal-group.org/software-codes.
4 http://archive.ics.uci.edu/ml/datasets.html.

5.2. Experiment on near-infrared spectral datasets

In order to verify the actual classification performance of the proposed TMPELM, we first performed a series of numerical experiments using the near-infrared spectral datasets, which are described as follows. Licorice is a traditional Chinese herbal medicine. We utilize 244 licorice seeds, including 122 hard seeds and 122 soft seeds, in our experiment. Near-infrared (NIR) spectroscopic datasets of licorice seeds were obtained via an MPA spectrometer. The ''NongDa108'' corn hybrid seeds and ''mother178'' seeds used in the experiments were harvested in Beijing, China, in 2008. A total of 240 seed samples were selected in this experiment, 120 from hybrid seeds and 120 from mother seeds. Near-infrared (NIR) spectroscopic datasets of corn seeds were obtained via an MPA spectrometer. The NIR spectral range of 4000–10,000 cm−1 is recorded with a resolution of 4 cm−1. The initial spectra are digitized by OPUS 5.5 software. Experiments are carried out in five different spectral ranges: 4000–6000 cm−1, 8000–9000 cm−1, 8000–10,000 cm−1, 9000–10,000 cm−1 and 4000–10,000 cm−1. The corresponding sample regions are denoted as Regions A, B, C, D and E, respectively. Their information is summarized in Table 1.

Table 1
Description of near-infrared spectral datasets.
Regions   Spectral range (cm−1)  Number of samples  Number of variables  Positive:negative
Region A  4000–6000    240  518   1:1
Region B  8000–9000    240  260   1:1
Region C  8000–10000   240  518   1:1
Region D  9000–10000   240  260   1:1
Region E  4000–10000   240  1555  1:1

We run the five algorithms on the near-infrared spectral datasets, each performing a ten-fold cross-validation looped ten times. All experimental results based on the optimal parameters are presented in Tables 2 and 3. Here, CPU running time includes both training and testing time, and classification accuracy is reported in terms of ACC means and standard deviations. The analysis of the performances of all tested algorithms is as follows.

(1) The TMPELM algorithm generally performs well on most of the involved datasets. It achieves the best average ACC and the best overall performance versus the other classification methods.

(2) As indicated in Tables 2 and 3, TMPELM generally achieves the highest F1 scores in both the positive and negative classes. On one dataset, if an algorithm achieves the highest F1 scores in both the positive and negative classes, it certainly achieves the best ACC. This is the reason why our TMPELM outperforms the others.

(3) As shown in Tables 2 and 3, the performance of TMPELM is better than that of MPM on the five NIR spectral datasets. Furthermore, the classification performance of SMPM is also better than that of MPM.

(4) From Tables 2 and 3, we can observe that the runtime of the proposed TMPELM has no advantage over the other four algorithms on the five NIR spectral datasets.

Through the above analysis of the experimental results, we reach an objective conclusion: the proposed TMPELM can construct a reasonable classifier by combining the advantages of TELM and MPM.

Table 2
Performance comparison of SMPM, TELM, TSVM, MPM and TMPELM with linear kernel.
SMPM TELM TSVM MPM TMPELM
Datasets ACC± S(%) ACC± S(%) ACC±S(%) ACC± S(%) ACC± S(%)
F1 F1 F1 F1 F1
Times (s) Times (s) Times (s) Times (s) Times (s)
Region A 91.67 ± 2.34 75.06 ± 4.63 86.02 ± 1.39 45.83 ± 8.82 96.04 ± 1.15
(240 × 518) 92.00 74.89 83.45 52.64 96.17
91.442 96.779 59.102 65.183 74.472

Region B 85.42 ± 3.07 87.50 ± 2.41 86.02 ± 3.47 55.41 ± 11.26 85.01 ± 3.71
(240 × 260) 85.60 87.32 84.93 56.66 86.70
9.213 8.736 5.903 7.965 4.653

Region C 98.09 ± 0.93 100 ± 0 91.45 ± 1.14 53.65 ± 11.27 89.79 ± 2.31
(240 × 518) 97.90 100 89.89 56.25 90.12
45.191 37.281 17.907 57.332 28.178

Region D 82.34 ± 2.17 63.71 ± 5.33 89.18 ± 3.87 51.87 ± 13.23 90.07 ± 2.28
(240 × 260) 81.82 64.09 85.65 58.03 91.37
19.238 29.335 14.758 9.702 11.109

Region E 83.64 ± 2.14 96.31 ± 1.30 85.06 ± 3.73 44.37 ± 15.36 97.84 ± 0.83
(240 × 1555) 83.01 95.56 83.47 61.08 97.84
46.292 53.337 31.642 23.501 27.076

Table 3
Performance comparison of SMPM, TELM, TSVM, MPM and TMPELM with RBF kernel.
SMPM TELM TSVM MPM TMPELM
Datasets ACC± S(%) ACC± S(%) ACC±S(%) ACC± S(%) ACC± S(%)
F1 F1 F1 F1 F1
Times (s) Times (s) Times (s) Times (s) Times (s)
Region A 92.01 ± 1.72 75.58 ± 3.37 87.11 ± 1.31 51.55 ± 6.62 97.17 ± 0.95
(240 × 518) 93.24 75.39 86.75 58.94 96.96
117.247 187.653 97.749 101.886 116.028

Region B 85.78 ± 3.38 88.03 ± 1.87 87.11 ± 2.19 58.67 ± 9.89 85.89 ± 3.65
(240 × 260) 85.92 87.85 85.76 59.93 87.23
23.765 17.861 11.403 19.138 9.895

Region C 97.50 ± 1.13 100 ± 0 90.10 ± 2.43 49.37 ± 13.78 88.12 ± 3.93
(240 × 518) 97.65 100 87.94 51.47 89.45
19.218 15.372 9.521 29.613 17.201

Region D 81.25 ± 4.16 62.50 ± 7.82 88.14 ± 5.53 50.41 ± 12.38 89.79 ± 3.34
(240 × 260) 81.21 63.48 84.03 57.13 90.92
8.196 12.074 6.29 3.87 4.02

Region E 83.33 ± 2.26 95.83 ± 1.78 84.42 ± 3.01 42.29 ± 18.89 97.08 ± 1.13
(240 × 1555) 82.76 94.64 81.19 55.35 97.08
29.804 36.644 19.724 13.849 15.301

5.3. Experiments on UCI datasets

In the second part, to further test the classification performance of the proposed TMPELM and the other related algorithms, we run them on several publicly available UCI datasets, including the Australian, German, Breast Cancer, Wisconsin diagnostic breast cancer (WDBC), Spam, Pima Indian, QSAR, Banknote and Diabetes datasets. Details of the nine UCI datasets are given in Table 4. These datasets are commonly used in testing machine learning algorithms. Note that all samples are normalized such that the continuous features lie in the range [0, 1] before learning. In our experiments, we use the linear and RBF kernels. We perform ten-fold cross-validation on each considered dataset, where the data is randomly split into ten subsets and one of the subsets is reserved as the testing set. This process is repeated ten times, and the average of the results is used as the performance measure. The outcomes of these algorithms, reported in terms of ACC means and standard deviations, are listed in Tables 5 and 6. All experimental results presented in Tables 5 and 6 are based on the optimal parameters. The analysis of all experimental results is as follows.

Table 4
Description of the UCI datasets.
Datasets       Number of samples  Dimensionality  Positive:negative
Breast Cancer  699   9   1:1.90
Australian     690   14  1:0.80
Diabetes       1151  19  1:1.13
Spam           4601  57  1:1.54
German         1000  24  1:2.33
WDBC           569   30  1:1.68
QSAR           1055  41  1:1.96
Banknote       1372  4   1:1.25
Pima           768   8   1:1.87

From Tables 5 and 6, we can see that the proposed TMPELM achieves good classification results in both the linear and nonlinear cases. In addition, we also find that the classification performance of all algorithms improves in the nonlinear case compared to the linear case. We can also find that the classification performance of our proposed TMPELM is better than that of MPM on all datasets.

Table 5
Performance comparison of SMPM, TELM, TSVM, MPM and TMPELM with linear kernel.
SMPM TELM TSVM MPM TMPELM
Datasets ACC± S(%) ACC± S(%) ACC±S(%) ACC± S(%) ACC± S(%)
Times (s) Times (s) Times (s) Times (s) Times (s)
Australian 84.35 ± 2.17 83.59 ± 1.34 83.65 ± 2.23 84.18 ± 3.12 88.75 ± 1.06
(690 × 14) 3.243 1.109 19.281 2.36 1.268

German 69.67 ± 4.57 72.35 ± 2.61 83.78 ± 1.12 72.09 ± 2.73 85.83 ± 1.54
(1000 × 24) 4.482 3.274 35.532 6.325 4.963

Breast Cancer 91.87 ± 2.01 93.85 ± 0.93 96.79 ± 0.67 95.11 ± 1.11 96.08 ± 1.03
(699 × 9) 3.752 0.819 29.422 2.13 1.39

WDBC 86.65 ± 0.85 94.81 ± 0.63 95.35 ± 1.64 96.74 ± 0.56 97.38 ± 0.16
(569 × 30) 4.740 0.732 17.479 0.253 0.235

Spam 87.28 ± 1.15 86.75 ± 1.31 87.79 ± 0.89 86.21 ± 1.17 89.85 ± 1.05
(4601 × 57) 90.561 134.285 283.457 101.679 97.894

Pima 65.35 ± 4.45 74.84 ± 2.43 76.21 ± 3.01 76.78 ± 1.41 76.89 ± 0.52
(768 × 8) 2.381 1.607 5.573 2.819 2.117

QSAR 78.56 ± 1.16 84.65 ± 1.72 86.87 ± 2.11 86.24 ± 1.49 88.31 ± 1.78
(1055 × 41) 3.158 4.671 23.873 7.778 4.233

Banknote 88.59 ± 2.01 86.79 ± 1.83 87.09 ± 1.33 86.85 ± 1.45 89.75 ± 0.94
(1372 × 4) 6.614 8.338 29.756 8.019 7.817

Diabetes 67.17 ± 2.13 53.84 ± 2.87 59.45 ± 3.39 60.07 ± 1.83 61.39 ± 1.25
(1151 × 19) 2.735 7.883 11.171 6.582 7.391

Table 6
Performance comparison of SMPM, TELM, TSVM, MPM and TMPELM with RBF kernel.
SMPM TELM TSVM MPM TMPELM
Datasets ACC± S(%) ACC± S(%) ACC±S(%) ACC± S(%) ACC± S(%)
Times (s) Times (s) Times (s) Times (s) Times (s)
Australian 88.09 ± 1.23 85.17 ± 1.53 86.09 ± 1.11 85.85 ± 2.27 90.34 ± 0.33
(690 × 14) 6.867 4.23 36.01 6.78 4.28

German 72.00 ± 2.06 75.80 ± 2.32 84.61 ± 1.58 73.50 ± 2.46 86.98 ± 1.46
(1000 × 24) 11.103 6.67 45.57 8.79 6.49

Breast Cancer 93.91 ± 1.19 95.47 ± 0.85 97.95 ± 0.50 96.83 ± 1.06 97.63 ± 0.71
(699 × 9) 5.926 1.46 41.09 4.251 3.378

WDBC 88.04 ± 0.56 96.49 ± 0.72 97.02 ± 0.82 97.68 ± 0.31 99.17 ± 0.11
(569 × 30) 6.315 1.23 28.81 1.27 1.16

Spam 89.71 ± 1.45 86.83 ± 1.27 88.25 ± 1.19 87.57 ± 1.22 90.20 ± 0.89
(4601 × 57) 187.902 223.45 682.89 257.09 211.17

Pima 66.18 ± 1.37 75.66 ± 1.33 77.09 ± 1.76 77.24 ± 1.02 77.58 ± 0.79
(768 × 8) 6.684 4.53 15.61 6.37 4.49

QSAR 79.13 ± 2.13 85.25 ± 2.08 87.93 ± 1.81 87.56 ± 2.01 89.50 ± 1.60
(1055 × 41) 10.865 13.37 63.05 18.65 10.72

Banknote 90.56 ± 1.21 87.30 ± 1.37 87.88 ± 1.16 87.54 ± 1.21 91.02 ± 0.69
(1372 × 4) 14.253 22.12 56.03 18.07 21.36

Diabetes 68.35 ± 0.96 55.80 ± 0.64 61.32 ± 0.70 61.55 ± 0.61 63.92 ± 0.53
(1151 × 19) 9.964 21.18 41.68 16.56 19.47

The classification accuracy of the proposed TMPELM in both the linear and nonlinear cases is lower than that of TSVM on the Breast Cancer dataset. Similarly, the classification accuracy of the proposed TMPELM is lower than that of SMPM on the Diabetes dataset. We can further find that the classification performance of the proposed TMPELM is superior to the other comparison algorithms in most cases. However, on all datasets, the CPU runtime of our proposed TMPELM does not have significant advantages over SMPM, TELM, MPM and TSVM, and in some cases it is slower than these algorithms.

We selected eight UCI datasets and performed ten independent experiments, each of which was based on the optimal parameters. All experimental results are presented in box-plot form in Fig. 2, which shows the ACC box-plots of MPM, SMPM, TELM, TSVM and TMPELM with the RBF kernel on eight UCI datasets. The x-axis shows the different classifiers, including TELM, TSVM, MPM, SMPM and TMPELM; the y-axis is the ACC on each UCI dataset. Each box-plot displays the five-number summary of ACC: the minimum, first quartile, median, third quartile and maximum. We draw a box from the first quartile to the third quartile, a vertical line goes through the box at the median, and the whiskers go from each quartile to the minimum or maximum.
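A box-plot of this kind can be produced directly from the per-run ACC values, as in the illustrative snippet below; results_by_classifier is a placeholder dictionary mapping each classifier name to its ACC values on one dataset.

```python
import matplotlib.pyplot as plt

def acc_boxplot(results_by_classifier, dataset_name):
    """Draw the five-number-summary box-plot of ACC for each classifier."""
    names = list(results_by_classifier)
    data = [results_by_classifier[name] for name in names]
    fig, ax = plt.subplots()
    ax.boxplot(data, labels=names, whis=(0, 100))  # whiskers at the min and max
    ax.set_xlabel("Classifier")
    ax.set_ylabel("ACC")
    ax.set_title(dataset_name)
    return fig
```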
Fig. 2. Performance comparison of ACC of TELM, TSVM, MPM, SMPM and TMPELM on eight UCI datasets.

From Fig. 2, we can observe that the performance of TMPELM on the eight UCI datasets is better than that of MPM, SMPM, TSVM and TELM. Intuitively, compared to MPM, SMPM, TSVM and TELM, our proposed method TMPELM has higher accuracy and relatively stable performance. In addition, we can see that the overall performance of SMPM is better than that of MPM in most cases. This is because MPM only considers the prior probability distribution of each class with a given mean and covariance matrix, and thus cannot effectively utilize the structural information of the data, whereas SMPM uses two finite mixture models to capture the structural information of the data. Therefore, it is reasonable for SMPM to have better performance than MPM.

Further, in order to assess our proposed method TMPELM intuitively, we evaluate its performance from the perspective of multi-objective optimization [34]. We use the training performance and the stability of TMPELM as two objective functions. From Tables 2 and 3, we can see that the proposed TMPELM has better training performance and stability than the other four algorithms on the Region A and Region B datasets. However, the training performance and stability of TMPELM on the Region C, Region D and Region E datasets are not superior to those of the other algorithms. The reason for this phenomenon is that the near-infrared spectral datasets have the characteristics of high dimensionality and few samples. As can be seen from Tables 5 and 6, the training performance and stability of the proposed TMPELM are good on the vast majority of UCI datasets. In addition, we can see from Fig. 2 that the proposed TMPELM has relatively stable performance and good training performance. In summary, we come to the conclusion that the proposed method TMPELM does improve performance from the perspective of multi-objective optimization.

After the above analysis, we can clearly see that our proposed algorithm achieves better generalization performance than the other four algorithms when running time is not taken into account.

6. Discussion and conclusion

In this paper, we propose a new binary classification method, termed the twin minimax probability extreme learning machine (TMPELM), which extends MPM to a classifier with two nonparallel separating hyperplanes. Similar to TELM, TMPELM also constructs a network by using the random feature mapping mechanism, and then learns a pair of nonparallel hyperplanes in the random feature space by solving two smaller-sized SOCPs. For each hyperplane, TMPELM maximizes the probability of correctly classifying the sample points of one class while staying away from the other class. TMPELM seizes the statistical and geometric information of the samples simultaneously. We further proposed a nonlinear TMPELM model by exploiting kernelization techniques. In addition, TMPELM incorporates the idea of TELM into the basic framework of MPM; thus TMPELM inherits the advantages of TELM and MPM. Compared to other algorithms, the proposed TMPELM has fewer parameters in the linear and nonlinear cases, which makes our method more effective. Many experiments were then designed to compare our proposed algorithm with other related algorithms.

As demonstrated by the experimental results, TMPELM has better classification accuracy than SMPM, TELM, MPM and TSVM on the majority of datasets. We can also find that the classification performance of TMPELM is better than that of MPM on all datasets. However, on all datasets, the runtime of our proposed TMPELM does not have significant advantages over SMPM, TELM and TSVM. Further, we found that on most datasets, the runtime of TMPELM is better than that of MPM. Although TMPELM can be solved effectively, the running time is not fast enough, especially when the training data size (or feature size) is large. The experimental results demonstrate that TMPELM will greatly expand the application of MPM in pattern classification and provide new, deeper insight into MPM.

Although SOCP solvers can effectively solve TMPELM, the running time is not fast enough. Therefore, we hope to find a more effective way to solve TMPELM. In the future, we plan to investigate heuristic algorithms [34–36] for solving TMPELM. Certainly, how to extend our TMPELM to multi-class classification, regression and semi-supervised learning problems is also worth studying.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (No. 11471010) and the Chinese Universities Scientific Fund.

Ethical approval. This paper does not contain any studies with human participants or animals performed by any of the authors.

Informed consent. Informed consent was obtained from all individual participants included in the study.

References

[1] G.R.G. Lanckriet, L.E. Ghaoui, C. Bhattacharyya, M.I. Jordan, Minimax probability machine, Adv. Neural Inf. Process. Syst. 37 (1) (2001) 192–200.
[2] J.K.C. Ng, Y. Zhong, S. Yang, A comparative study of minimax probability machine-based approaches for face recognition, Pattern Recognit. Lett. 28 (15) (2007) 1995–2002.
[3] Q. Li, Y. Sun, Y. Yu, C. Wang, T. Ma, Short-term photovoltaic power forecasting for photovoltaic power station based on EWT-KMPMR, Trans. Chin. Soc. Agric. Eng. 33 (20) (2017) 265–273.
[4] B. Jiang, Z. Guo, Q. Zhu, G. Huang, Dynamic minimax probability machine-based approach for fault diagnosis using pairwise discriminate analysis, IEEE Trans. Control Syst. Technol. 27 (2) (2017) 1–8.
[5] L. Yang, Y. Gao, Q. Sun, A new minimax probabilistic approach and its application in recognition the purity of hybrid seeds, CMES Comput. Model. Eng. Sci. 104 (6) (2015) 493–506.
[6] G.R.G. Lanckriet, L.E. Ghaoui, M.I. Jordan, Robust novelty detection with single-class MPM, in: International Conference on Neural Information Processing Systems, MIT Press, 2002, pp. 905–912.
[7] T. Strohmann, G.Z. Grudic, A formulation for minimax probability machine regression, Adv. Neural Inf. Process. Syst. 76 (2003) 9–776.
[8] K. Huang, H. Yang, I. King, M.R. Lyu, L. Chan, The minimum error minimax probability machine, J. Mach. Learn. Res. 5 (2004) 1253–1286.
[9] B. Gu, X. Sun, V.S. Sheng, Structural minimax probability machine, IEEE Trans. Neural Netw. Learn. Syst. 28 (7) (2017) 1646–1656.
[10] K. Huang, H. Yang, I. King, M.R. Lyu, Learning classifiers from imbalanced data based on biased minimax probability machine, in: Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2004), IEEE, 2004.
[11] S. Song, Y. Gong, Y. Zhang, G. Huang, G.B. Huang, Dimension reduction by minimum error minimax probability machine, IEEE Trans. Syst. Man Cybern. Syst. 47 (1) (2017) 58–69.
[12] S. Cousins, J. Shawe-Taylor, High-probability minimax probability machines, Mach. Learn. 106 (6) (2017) 1–24.
[13] K. Yoshiyama, A. Sakurai, Laplacian minimax probability machine, Pattern Recognit. Lett. 37 (37) (2014) 192–200.
[14] V. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, 1995.
[15] V. Vapnik, Statistical Learning Theory, Wiley, New York, 1998.
[16] Jayadeva, R. Khemchandani, S. Chandra, Twin support vector machines for pattern classification, IEEE Trans. Pattern Anal. Mach. Intell. 29 (5) (2007) 905–910.
[17] X. Peng, A ν-twin support vector machine (ν-TSVM) classifier and its geometric algorithms, Inform. Sci. 180 (20) (2010) 3863–3875.
[18] Y.H. Shao, C.H. Zhang, X.B. Wang, N.Y. Deng, Improvements on twin support vector machines, IEEE Trans. Neural Netw. 22 (6) (2011) 962–968.
[19] Z. Qi, Y. Tian, Y. Shi, Laplacian twin support vector machine for semi-supervised classification, Neural Netw. 35 (11) (2012) 46–53.
[20] Z. Qi, Y. Tian, Y. Shi, Robust twin support vector machine for pattern classification, Pattern Recognit. 46 (1) (2013) 305–316.
[21] Z. Qi, Y. Tian, Y. Shi, Structural twin support vector machine for classification, Knowl.-Based Syst. 43 (2) (2013) 74–81.
[22] X. Peng, Y. Wang, D. Xu, Structural twin parametric-margin support vector machine for binary classification, Knowl.-Based Syst. 49 (49) (2013) 63–72.
[23] G.B. Huang, H. Zhou, X. Ding, Z. Rui, Extreme learning machine for regression and multiclass classification, IEEE Trans. Syst. Man Cybern. B 42 (2) (2012) 513–529.
[24] H. Ya-Lan, C. Liang, A nonlinear hybrid wind speed forecasting model using LSTM network, hysteretic ELM and differential evolution algorithm, Energy Convers. Manage. 173 (2018) 123–142.
[25] P. Horata, S. Chiewchanwattana, K. Sunat, Robust extreme learning machine, Neurocomputing 102 (2) (2013) 31–44.
[26] X. Lu, Z. Han, H. Zhou, L. Xie, G.B. Huang, Robust extreme learning machine with its application to indoor positioning, IEEE Trans. Cybern. 46 (1) (2016) 194.
[27] G.G. Wang, M. Lu, Y.Q. Dong, X.J. Zhao, Self-adaptive extreme learning machine, Neural Comput. Appl. 27 (2) (2016) 291–303.
[28] Y. Zhang, J. Wu, Z. Cai, P. Zhang, L. Chen, Memetic extreme learning machine, Pattern Recognit. 58 (C) (2016) 135–148.
[29] Y. Wan, S. Song, G. Huang, S. Li, Twin extreme learning machines for pattern classification, Neurocomputing 260 (2017) 235–244.
[30] G. Huang, S. Song, J.N.D. Gupta, C. Wu, Semi-supervised and unsupervised extreme learning machines, IEEE Trans. Cybern. 44 (12) (2014) 2405–2417.
[31] H. Pei, K. Wang, Q. Lin, P. Zhong, Robust semi-supervised extreme learning machine, Knowl.-Based Syst. 159 (2018) 203–220.
[32] A.W. Marshall, I. Olkin, Multivariate Chebyshev inequalities, Ann. Math. Stat. 31 (4) (1960) 1001–1014.
[33] J.F. Sturm, Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones, Optim. Methods Softw. 11 (1–4) (1999) 625–653.
[34] G.Q. Zeng, J. Chen, Y.X. Dai, L.M. Li, C.W. Zheng, M.R. Chen, Design of fractional order PID controller for automatic regulator voltage system based on multi-objective extremal optimization, Neurocomputing 160 (2015) 173–184.
[35] L. Kangdi, Z. Wuneng, Z. Guoqiang, D. Wei, Design of PID controller based on a self-adaptive state-space predictive functional control using extremal optimization method, J. Franklin Inst. B 355 (5) (2018) 2197–2220.
[36] L. Kangdi, Z. Wuneng, Z. Guoqiang, Z. Yiyuan, Constrained population extremal optimization-based robust load frequency control of multi-area interconnected power system, Int. J. Electr. Power Energy Syst. 105 (2019) 249–271.
You might also like