Jun Ma, Liming Yang, Yakun Wen, Qun Sun
Knowledge-Based Systems
journal homepage: www.elsevier.com/locate/knosys
Article info

Article history: Received 30 September 2018; Received in revised form 16 June 2019; Accepted 19 June 2019; Available online 21 June 2019.

Keywords: Minimax probability machine; Twin extreme learning machine; Nonparallel hyperplanes; Second-order cone programming; Pattern recognition

Abstract

Minimax probability machine (MPM) is an excellent discriminant classifier based on prior knowledge. It can directly estimate a probability accuracy bound by minimizing the maximum probability of misclassification. However, the traditional MPM learns only one hyperplane to separate different classes in the feature space, and it may bear a heavy computational burden during training because it needs to address a large-scale second-order cone programming (SOCP)-type problem. In this work, we propose a novel twin minimax probability extreme learning machine (TMPELM) for pattern classification. TMPELM indirectly determines the separating hyperplane by solving a pair of smaller-sized SOCP-type problems that generate a pair of nonparallel hyperplanes. Specifically, for each hyperplane, TMPELM attempts to maximize the probability of correct classification for one class of sample points, that is, to minimize the worst-case (maximum) probability of misclassification of that class, while staying far away from the other class. TMPELM first utilizes the random feature mapping mechanism to construct the feature space, and then two nonparallel separating hyperplanes are learned for the final classification. The proposed TMPELM not only utilizes the geometric information of the samples, but also takes advantage of the statistical information (the mean and covariance of the samples). Moreover, we extend the linear TMPELM model to a nonlinear model by exploiting kernelization techniques, and we analyze the computational complexity of TMPELM. Experimental results on both NIR spectroscopy datasets and benchmark datasets demonstrate the effectiveness of TMPELM.

© 2019 Elsevier B.V. All rights reserved.
1. Introduction

Minimax probability machine (MPM), a learning strategy introduced by Lanckriet et al. [1], is a popular and efficient pattern classification and regression method in machine learning. MPM [1] attempts to control misclassification probabilities in the worst-case scenario: under all possible choices of class-conditional densities with a given mean and covariance matrix, it minimizes the worst-case (maximum) probability of misclassification of future data points. Due to its outstanding performance, MPM has been widely used in many fields, such as computer vision [2], engineering technology [3,4], agriculture [5] and novelty detection [6].

✩ No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have an impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.knosys.2019.06.014.
∗ Corresponding author. E-mail address: cauyanglm@163.com (L. Yang).
https://doi.org/10.1016/j.knosys.2019.06.014
0950-7051/© 2019 Elsevier B.V. All rights reserved.

Recently, many improved versions of MPM have been proposed from different perspectives. Representative work can be briefly reviewed as follows. In [7], Strohmann and Grudic proposed MPM regression (MPMR), which estimates the mean and covariance matrix statistics of the regression data. Huang et al. [8] proposed a distribution-free Bayes optimal classifier, called the minimum
error MPM (MEMPM), which removes the equality constraint on the classification error rates from the MPM. Since the MPM only considers the prior probability distribution of each class with a given mean and covariance matrix, the structural information of the data cannot be effectively utilized. In response to this problem, Gu et al. [9] proposed the structural MPM (SMPM), which combines finite mixture models with the traditional MPM and thereby remedies MPM's failure to exploit the structural information of the data. To deal effectively with imbalanced data, Huang et al. [10] proposed the biased MPM (BMPM). BMPM first estimates the mean and covariance of the data, then constructs the classification boundary by directly controlling the lower bound of the real accuracy, and thus handles imbalanced data effectively. Traditional dimension reduction methods focus on maximizing the overall discrimination between all classes, which makes them easily affected by outliers. To overcome this disadvantage, Song et al. [11] proposed the dimension reduction minimum error MPM (DR-MEMPM) for multi-class dimension reduction. Cousins and Shawe-Taylor [12] proposed the high-probability MPM (HP-MPM) for pattern recognition. Although the above improved versions of MPM have achieved good results, they are all supervised learning methods. To improve the performance of MPM in a semi-supervised setting, Yoshiyama et al. [13] proposed the Laplacian minimax probability machine (Lap-MPM).

J. Ma, L. Yang, Y. Wen et al. / Knowledge-Based Systems 187 (2020) 104806

Single-layer feedforward networks (SLFNs) have been intensively studied during the past several decades. One of the most successful algorithms for training SLFNs is the support vector machine (SVM) [14,15], a maximal margin classifier derived under the framework of structural risk minimization (SRM). To expand the applicability of SVM, the twin support vector machine (TSVM) was proposed by Jayadeva et al. [16]; it derives two nonparallel separating hyperplanes by solving a pair of smaller-sized quadratic programming problems (QPPs) instead of one large-scale QPP as in traditional SVM, which gives TSVM a faster learning speed. Extensions of TSVM include the ν-TSVM [17], the twin bounded support vector machine (TBSVM) [18], the Laplacian TSVM (Lap-TSVM) [19], the robust TSVM (RTSVM) [20], the structural TSVM (STSVM) [21], and the structural twin parametric-margin SVM (STPMSVM) [22].

As a new and efficient training algorithm for SLFNs, the extreme learning machine (ELM) was proposed by Huang et al. [23,24]. In contrast to traditional neural networks, the main advantages of ELM are that it runs fast, yields a globally optimal solution, and is easy to implement: its hidden nodes and input weights are randomly generated, and the output weights are expressed analytically. Recently, many researchers have studied ELM in depth. For example, Horata et al. [25] proposed a robust ELM (RELM) to overcome the sensitivity of ELM to outliers, and Lu et al. [26] applied RELM to indoor positioning. Wang et al. [27] proposed a self-adaptive ELM (SaELM) that selects the best number of hidden-layer neurons to construct the optimal network. Zhang et al. [28] proposed a memetic-algorithm-based ELM (M-ELM), which embeds a local search strategy into a global optimization framework to obtain optimal network parameters. Inspired by TSVM, Wan et al. [29] proposed the twin extreme learning machine (TELM) for supervised classification. TELM learns two nonparallel separating hyperplanes in the ELM feature space; it trains two smaller-scale QPPs simultaneously and inherits the advantages of ELM and TSVM. Huang et al. [30] extended ELM to semi-supervised learning by introducing manifold regularization, proposing the semi-supervised ELM (SS-ELM). Subsequently, Pei et al. [31] proposed the robust SS-ELM (RSS-ELM) to overcome the sensitivity of the classical SS-ELM to outliers.

In this paper, inspired by the ideas of MPM and TELM, we propose a new nonparallel-plane classifier, termed the twin minimax probability extreme learning machine (TMPELM). For each hyperplane, TMPELM attempts to maximize the probability of correctly classifying the samples of one class, i.e., to minimize the worst-case (maximum) probability of misclassification of that class, while keeping the distance to the other class as large as possible. An important feature of TMPELM is that the worst-case bound on the probability of misclassification of future data is always obtained explicitly. In addition, TMPELM indirectly determines the separating hyperplane through a pair of nonparallel hyperplanes obtained by solving two smaller-sized SOCP-type problems. Similar to TELM, TMPELM first utilizes the random feature mapping mechanism to construct the feature space, and then learns two nonparallel separating hyperplanes for the final classification. The proposed TMPELM not only utilizes the geometric information of the data, but also takes advantage of the statistical information (the mean and covariance of the samples). TMPELM can be solved via a pair of smaller-scale SOCP problems, and its computational complexity is similar to that of TELM. Based on the linear TMPELM model, we also propose a nonlinear TMPELM model by exploiting kernelization techniques. Experimental results on both NIR spectroscopy datasets and benchmark datasets demonstrate the effectiveness of TMPELM.

The rest of this paper is organized as follows. Section 2 gives a brief review of MPM and TELM. Section 3 describes the details of the proposed TMPELM and gives its computational complexity. Section 4 compares TMPELM with other algorithms. Section 5 presents the experimental results. Section 6 concludes the paper.

2. Related work

As mentioned above, in order to propose an improved version of MPM, we briefly review MPM in Section 2.1 and then introduce TELM in Section 2.2.

2.1. Minimax probability machine

Let x and y denote random vectors with means and covariance matrices given by (µ1, Σ1) and (µ2, Σ2), respectively, with x, y, µ1, µ2 ∈ Rⁿ and Σ1, Σ2 ∈ Rⁿˣⁿ. As a binary classification model, the minimax probability machine minimizes the upper bound of the misclassification probability. We attempt to obtain a decision hyperplane H(a, b) = {z | aᵀz = b} that separates the two classes of points with maximal probability with respect to all distributions. The original MPM model can be expressed as follows:

max_{α, a, b}  α
s.t.  inf_{x∼(µ1,Σ1)} Pr{aᵀx ≥ b} ≥ α,   (1)
      inf_{y∼(µ2,Σ2)} Pr{aᵀy ≤ b} ≥ α,

where the notation x ∼ (µ1, Σ1) refers to a class distribution that has the prescribed mean µ1 and covariance Σ1 but is otherwise arbitrary; likewise for y. Here α represents the lower bound of the accuracy on future data, namely the worst-case accuracy. By using the multivariate Chebyshev inequality [32], the optimization problem can be rewritten as:

max_{α, a, b}  α
s.t.  −b + aᵀµ1 ≥ k(α)√(aᵀΣ1a),   (2)
      b − aᵀµ2 ≥ k(α)√(aᵀΣ2a),
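The meaning of the bound in (2) can be checked numerically: whenever −b + aᵀµ1 ≥ k(α)√(aᵀΣ1a) with k(α) = √(α/(1 − α)), every distribution with mean µ1 and covariance Σ1 satisfies Pr{aᵀx ≥ b} ≥ α — in particular a Gaussian. A minimal sketch (the direction a, the moments, and α below are illustrative assumptions, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
mu1 = np.array([2.0, 1.0])
Sigma1 = np.array([[0.5, 0.1],
                   [0.1, 0.3]])

a = np.array([1.0, 1.0])            # fixed projection direction (illustrative)
alpha = 0.9
k = np.sqrt(alpha / (1.0 - alpha))  # k(alpha) = sqrt(alpha / (1 - alpha))

# Pick b so the Chebyshev-based constraint in (2) holds with equality:
#   -b + a' mu1 = k(alpha) * sqrt(a' Sigma1 a)
b = a @ mu1 - k * np.sqrt(a @ Sigma1 @ a)

# The bound holds for *every* distribution with these first two moments,
# so a Gaussian with (mu1, Sigma1) must satisfy Pr{a'x >= b} >= alpha.
x = rng.multivariate_normal(mu1, Sigma1, size=200_000)
p = np.mean(x @ a >= b)
print(p >= alpha)  # True
```

For a Gaussian the achieved probability is far above the distribution-free level α = 0.9, which is exactly what "worst-case" means: constraint (2) guarantees α even for the least favorable distribution with the given moments.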
where k(α) = √(α/(1 − α)). Since k(α) is a monotone increasing function of α, we have:

max_{k, a, b}  k
s.t.  aᵀ(µ1 − µ2) ≥ k(√(aᵀΣ1a) + √(aᵀΣ2a)).   (3)

The optimization problem (3) can be equivalently written as

min_a  √(aᵀΣ1a) + √(aᵀΣ2a)
s.t.  aᵀ(µ1 − µ2) = 1.   (4)

Problem (4) is a convex optimization problem that can be solved by SOCP solvers, e.g., SeDuMi [33]. For the nonlinear case and more details, please refer to [1].

2.2. Twin extreme learning machine

Consider a binary classification problem with training data Tl = {(x1, y1), . . . , (xl, yl)} ∈ (Rⁿ × Y)ˡ, where xi ∈ Rⁿ, yi ∈ Y = {1, −1}, i = 1, . . . , l. Tl contains m1 positive and m2 negative samples, where l = m1 + m2. Assume that A ∈ R^{m1×n} collects all positive class samples and B ∈ R^{m2×n} collects all negative class samples.

We assume that G(ai, bi, x) is a linear continuous function. For convenience, we use the matrices H1 and H2 to represent the hidden-layer outputs of the samples belonging to the positive and negative classes, respectively:

H1 = [ h1(x1) · · · hL(x1) ; . . . ; h1(xm1) · · · hL(xm1) ],
H2 = [ h1(x1) · · · hL(x1) ; . . . ; h1(xm2) · · · hL(xm2) ],

where hi(x) = G(ai, bi, x) = ai · x + bi, i = 1, . . . , L.

Similar to TSVM, TELM [29] determines two nonparallel hyperplanes in the TELM feature space:

f1(x) = w1ᵀh(x) = 0,   (5)

and

f2(x) = w2ᵀh(x) = 0.   (6)

For each hyperplane, TELM minimizes the distance to one class while keeping the distance to the other class as large as possible. This leads to the following two quadratic programming problems (QPPs):

min_{w1, ξ}  (1/2)‖H1w1‖² + C1 e2ᵀξ
s.t.  −H2w1 + ξ ≥ e2,   (7)
      ξ ≥ 0,

and

min_{w2, η}  (1/2)‖H2w2‖² + C2 e1ᵀη
s.t.  H1w2 + η ≥ e1,   (8)
      η ≥ 0,

where ξ and η are slack vectors, 0 is a zero vector, C1, C2 > 0 are regularization parameters, and e1 ∈ R^{m1} and e2 ∈ R^{m2} are vectors of ones.

According to optimization theory, we can obtain the dual problems of TELM:

min_α  (1/2) αᵀH2(H1ᵀH1 + ϵI)⁻¹H2ᵀα − e2ᵀα
s.t.  0 ≤ α ≤ c1 e2,   (9)

and

min_θ  (1/2) θᵀH1(H2ᵀH2 + ϵI)⁻¹H1ᵀθ − e1ᵀθ
s.t.  0 ≤ θ ≤ c2 e1,   (10)

where α and θ are the vectors of Lagrange multipliers, ϵI is a regularization term, ϵ is a sufficiently small positive scalar, and I is an identity matrix. For the nonlinear case and more details, please refer to [29].

3. Twin minimax probability extreme learning machines

In this section, we elaborate the formulation of the twin minimax probability extreme learning machine (TMPELM) for binary classification. We then extend the proposed TMPELM to the nonlinear case and describe it in detail. Finally, we give the computational complexity of the proposed TMPELM.

3.1. Linear TMPELM

Although the MPM described above shows superior generalization performance, it is usually obtained by solving a large-scale second-order cone programming (SOCP) problem. As the number of samples increases, so does the computational cost, which greatly limits its applicability. The motivation of this paper is to construct a superior classifier with better generalization performance and relatively lower computational complexity. Therefore, based on the spirit of MPM and TELM, we propose the twin minimax probability extreme learning machine (TMPELM) for pattern recognition. Similar to TELM, TMPELM seeks two nonparallel separating hyperplanes in the TELM feature space. For each hyperplane, TMPELM attempts to maximize the probability of correctly classifying one class of samples, i.e., to minimize the worst-case (maximum) probability of misclassification of that class, while keeping the distance to the other class as large as possible.

For the linear case, TMPELM first utilizes the random feature mapping mechanism to construct the feature space, and then two nonparallel separating hyperplanes are constructed for the final classification. We thus obtain two nonparallel TMPELM classification hyperplanes in Rⁿ:

f1(x) = w1 · h(x) = 0   (11)

and

f2(x) = w2 · h(x) = 0,   (12)

which attempt to fit the positive and negative class points, respectively. In this paper, we call f1(x) and f2(x) the positive hyperplane and the negative hyperplane, respectively. Similar to TELM, we utilize the positive and negative hyperplanes (11) and (12) to separate future sample points. In general, the two hyperplanes are nonparallel. An intuitive geometric illustration is shown in Fig. 1.

Next, we elaborate the formulation of TMPELM in the linear case. In order to obtain the positive and negative hyperplanes, we consider the following pair of constrained optimization problems:

max_{α, w1}  α
s.t.  inf_{x1∼(µ1,Σ1)} Pr{w1ᵀh(x1) > 0} > α,   (13)
      −H2w1 > e.
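The box-constrained TELM duals (9)–(10) can be solved with a simple projected-gradient loop. The sketch below handles (9); the random matrices are illustrative stand-ins for hidden-layer outputs, and the recovery of the primal weights as w1 = −(H1ᵀH1 + ϵI)⁻¹H2ᵀα is the standard TSVM-style relation assumed here rather than restated from the excerpt:

```python
import numpy as np

rng = np.random.default_rng(1)
m1, m2, L = 40, 35, 20
H1 = rng.normal(size=(m1, L))   # hidden-layer outputs, positive class (illustrative)
H2 = rng.normal(size=(m2, L))   # hidden-layer outputs, negative class (illustrative)
c1, eps = 1.0, 1e-3

# Dual (9): min_a 0.5 a'Qa - e2'a  s.t.  0 <= a <= c1,
# with Q = H2 (H1'H1 + eps*I)^{-1} H2'.
Minv = np.linalg.inv(H1.T @ H1 + eps * np.eye(L))
Q = H2 @ Minv @ H2.T
e2 = np.ones(m2)

alpha = np.zeros(m2)
step = 1.0 / np.linalg.norm(Q, 2)   # 1 / Lipschitz constant of the gradient
for _ in range(2000):               # projected gradient descent on the box
    alpha = np.clip(alpha - step * (Q @ alpha - e2), 0.0, c1)

# TSVM-style recovery of the positive-class weights (assumed, see above).
w1 = -Minv @ H2.T @ alpha
```

The same loop with the roles of H1 and H2 swapped solves (10). In practice any off-the-shelf QP solver serves the same purpose.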
Fig. 1. Illustration of the TMPELM. The blue ◦ represents the positive point and the red + denotes the negative point.
Lemma 1. Given a ≠ 0, b, and a random vector x with mean µ and covariance Σx, the condition

inf_{x∼(µ,Σx)} Pr{aᵀx ≤ b} ≥ α

holds if and only if b − aᵀµ ≥ k(α)√(aᵀΣxa), with k(α) = √(α/(1 − α)).

Applying Lemma 1, the problem for the positive hyperplane can be rewritten as

min_{w1, t1}  √(w1ᵀΣ1w1)
s.t.  −H2w1 > e t1,   (19)
      w1ᵀµ1 > 0.
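Problem (19) is an SOCP. With the scale fixed by w1ᵀµ1 = 1 (so that t1 = 1 — an assumption mirroring the substitution t1 = 1/(ν1ᵀµ1) used later for (31)), it can be approximated with a general-purpose NLP solver. A minimal sketch on synthetic ELM features, where all data and parameters are illustrative:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
m1, m2, L = 50, 45, 10
H1 = rng.normal(loc=1.0, size=(m1, L))    # positive-class ELM features (illustrative)
H2 = rng.normal(loc=-3.0, size=(m2, L))   # negative-class ELM features (illustrative)
mu1 = H1.mean(axis=0)
Sigma1 = np.cov(H1.T) + 1e-6 * np.eye(L)  # small ridge keeps Sigma1 positive definite

# Problem (19) with the scale fixed by w' mu1 = 1 (so t1 = 1):
#   min_w sqrt(w' Sigma1 w)   s.t.   -H2 w >= 1 (elementwise),  w' mu1 = 1.
x0 = np.linalg.pinv(np.vstack([H2, mu1[None, :]])) @ np.r_[-np.ones(m2), 1.0]
res = minimize(
    fun=lambda w: np.sqrt(w @ Sigma1 @ w),
    x0=x0,
    constraints=[
        {"type": "ineq", "fun": lambda w: -H2 @ w - 1.0},  # -H2 w - 1 >= 0
        {"type": "eq",   "fun": lambda w: w @ mu1 - 1.0},
    ],
    method="SLSQP",
    options={"maxiter": 300, "ftol": 1e-12},
)
w1 = res.x
```

A dedicated SOCP solver (e.g., SeDuMi, cited for MPM in Section 2.1) would be the proper tool; SLSQP is used here only because it ships with SciPy.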
In a similar way, the optimization problem (18) can be rewritten as

min_{w2, t2}  √(w2ᵀΣ2w2)
s.t.  H1w2 > e t2,   (20)
      −w2ᵀµ2 > 0,

where t2 = −1/(w2ᵀµ2), with w2 rescaled by the same factor.

Based on the above discussion, the TMPELM algorithm is summarized as Algorithm 1.

Algorithm 1 Linear TMPELM
Input: a training set Tl = {(xi, yi)}, i = 1, . . . , l, where xi ∈ Rⁿ and yi ∈ {−1, +1}; an activation function G(x); and the number of hidden nodes L.
Output: the decision function of linear TMPELM.
Step 1: Initiate an ELM network with L hidden nodes using random input weights ai and biases bi.
Step 2: Construct the input matrices A and B, then calculate their hidden-layer output matrices H1 and H2, respectively.
Step 3: Calculate µ1, µ2, Σ1 and Σ2.
Step 4: Calculate t1 and t2 by (21) and (22).
Step 5: From t1 = 1/(w1ᵀµ1) and t2 = −1/(w2ᵀµ2), obtain w1 and w2.
Step 6: Calculate the perpendicular distance of a data point x from the separating hyperplanes using (23), then assign x to the positive or negative class.

A new sample point x ∈ Rⁿ is assigned to the positive or negative class by comparing its perpendicular distances to the two hyperplanes, i.e.,

f(x) = arg min_{r=1,2} dr(x) = arg min_{r=1,2} |wrᵀh(x)|,   (23)

where | · | gives the perpendicular distance of the sample point x from the hyperplane wrᵀh(x) = 0.

3.2. Nonlinear TMPELM

According to Hilbert space theory, w1,2 can be expressed as w1,2 = Σ_{i=1}^{l} (ν1,2)i h(xi) = Hᵀν1,2. We consider the following two kernel-generated separating surfaces:

K(xᵀ, Cᵀ)ν1 = 0   (24)

and its negative-class counterpart K(xᵀ, Cᵀ)ν2 = 0. The worst-case problem for the negative hyperplane reads

max_{θ, ν2}  θ
s.t.  inf_{x2∼(µ2,Σ2)} Pr{KELM(B, Cᵀ)ν2 < 0} > θ,   (28)
      KELM(A, Cᵀ)ν2 > e.

Similar to the linear case, we can use Lemma 1 to obtain the following optimization problems:

max_{α, ν1}  α
s.t.  ν1ᵀµ1 > √(α/(1 − α)) √(ν1ᵀΣ1ν1),   (29)
      −KELM(B, Cᵀ)ν1 ≥ e,

and

max_{θ, ν2}  θ
s.t.  −ν2ᵀµ2 > √(θ/(1 − θ)) √(ν2ᵀΣ2ν2),   (30)
      KELM(A, Cᵀ)ν2 > e.

Let t1 = 1/(ν1ᵀµ1) and w1 = ν1/(ν1ᵀµ1); then the optimization problem (29) can be rewritten as

min_{w1, t1}  √(w1ᵀΣ1w1)
s.t.  −K(B, Cᵀ)w1 > e t1,   (31)
      w1ᵀµ1 > 0.

In a similar way, the optimization problem (30) can be rewritten as

min_{w2, t2}  √(w2ᵀΣ2w2)
s.t.  K(A, Cᵀ)w2 > e t2,   (32)
      −w2ᵀµ2 > 0,

where t2 = −1/(ν2ᵀµ2) and w2 = −ν2/(ν2ᵀµ2).
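Steps 1–3 and the decision rule (23) of Algorithm 1 can be sketched in a few lines. Everything below is illustrative: the sigmoid activation, the synthetic data, and in particular the stand-in weights w1, w2, which come from a simple GEPSVM-style ratio criterion only so that rule (23) has concrete inputs — TMPELM itself obtains them from the SOCPs (19)–(20).

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(3)
n, L = 2, 30                                  # input dimension, hidden nodes

# Step 1: random input weights a_i and biases b_i (fixed, never trained).
Aw = rng.uniform(-1.0, 1.0, size=(L, n))
bw = rng.uniform(-1.0, 1.0, size=L)

def h(X):
    """ELM random feature map; a sigmoid activation G is assumed here."""
    return 1.0 / (1.0 + np.exp(-(X @ Aw.T + bw)))

# Step 2: hidden-layer output matrices of the two classes (synthetic data).
Xpos = rng.normal([2.0, 2.0], 0.6, size=(60, n))
Xneg = rng.normal([-2.0, -2.0], 0.6, size=(60, n))
H1, H2 = h(Xpos), h(Xneg)

# Step 3: class means and covariances in the feature space.
mu1, mu2 = H1.mean(axis=0), H2.mean(axis=0)
S1, S2 = np.cov(H1.T), np.cov(H2.T)

# Stand-in hyperplane weights via a GEPSVM-style ratio criterion (NOT the
# TMPELM SOCP solution): w1 keeps |H1 w| small relative to |H2 w|, and
# vice versa for w2.
G1 = H1.T @ H1 + 1e-6 * np.eye(L)
G2 = H2.T @ H2 + 1e-6 * np.eye(L)
w1 = eigh(G1, G2)[1][:, 0]
w2 = eigh(G2, G1)[1][:, 0]

def predict(X):
    """Rule (23): assign x to the class whose hyperplane is nearer."""
    Hx = h(X)
    return np.where(np.abs(Hx @ w1) <= np.abs(Hx @ w2), 1, -1)

acc = 0.5 * (np.mean(predict(Xpos) == 1) + np.mean(predict(Xneg) == -1))
```

The point of the sketch is the pipeline shape — random map, per-class moments, nearest-hyperplane assignment — not the particular weights.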
Let √(w1ᵀΣ1w1) ≤ q1 and √(w2ᵀΣ2w2) ≤ q2; then (31) and (32) can be converted to epigraph-form optimization problems, the first of which is

min_{q1, t1}  q1
s.t.  √(w1ᵀΣ1w1) ≤ q1, . . .   (33)

4. Comparison with other algorithms

In this section, we compare the proposed TMPELM with other related algorithms (MPM [1], SMPM [9], TELM [29] and TSVM [16]) and discuss how the proposed method differs from them in its modeling ideas.
Table 2
Performance comparison of SMPM, TELM, TSVM, MPM and TMPELM with linear kernel. For each dataset, the three stacked rows per method give ACC±S (%), F1 (%), and training time (s).
Datasets          SMPM            TELM            TSVM            MPM             TMPELM
Region A 91.67 ± 2.34 75.06 ± 4.63 86.02 ± 1.39 45.83 ± 8.82 96.04 ± 1.15
(240 × 518) 92.00 74.89 83.45 52.64 96.17
91.442 96.779 59.102 65.183 74.472
Region B 85.42 ± 3.07 87.50 ± 2.41 86.02 ± 3.47 55.41 ± 11.26 85.01 ± 3.71
(240 × 260) 85.60 87.32 84.93 56.66 86.70
9.213 8.736 5.903 7.965 4.653
Region C 98.09 ± 0.93 100 ± 0 91.45 ± 1.14 53.65 ± 11.27 89.79 ± 2.31
(240 × 518) 97.90 100 89.89 56.25 90.12
45.191 37.281 17.907 57.332 28.178
Region D 82.34 ± 2.17 63.71 ± 5.33 89.18 ± 3.87 51.87 ± 13.23 90.07 ± 2.28
(240 × 260) 81.82 64.09 85.65 58.03 91.37
19.238 29.335 14.758 9.702 11.109
Region E 83.64 ± 2.14 96.31 ± 1.30 85.06 ± 3.73 44.37 ± 15.36 97.84 ± 0.83
(240 × 1555) 83.01 95.56 83.47 61.08 97.84
46.292 53.337 31.642 23.501 27.076
Table 3
Performance comparison of SMPM, TELM, TSVM, MPM and TMPELM with RBF kernel. For each dataset, the three stacked rows per method give ACC±S (%), F1 (%), and training time (s).
Datasets          SMPM            TELM            TSVM            MPM             TMPELM
Region A 92.01 ± 1.72 75.58 ± 3.37 87.11 ± 1.31 51.55 ± 6.62 97.17 ± 0.95
(240 × 518) 93.24 75.39 86.75 58.94 96.96
117.247 187.653 97.749 101.886 116.028
Region B 85.78 ± 3.38 88.03 ± 1.87 87.11 ± 2.19 58.67 ± 9.89 85.89 ± 3.65
(240 × 260) 85.92 87.85 85.76 59.93 87.23
23.765 17.861 11.403 19.138 9.895
Region C 97.50 ± 1.13 100 ± 0 90.10 ± 2.43 49.37 ± 13.78 88.12 ± 3.93
(240 × 518) 97.65 100 87.94 51.47 89.45
19.218 15.372 9.521 29.613 17.201
Region D 81.25 ± 4.16 62.50 ± 7.82 88.14 ± 5.53 50.41 ± 12.38 89.79 ± 3.34
(240 × 260) 81.21 63.48 84.03 57.13 90.92
8.196 12.074 6.29 3.87 4.02
Region E 83.33 ± 2.26 95.83 ± 1.78 84.42 ± 3.01 42.29 ± 18.89 97.08 ± 1.13
(240 × 1555) 82.76 94.64 81.19 55.35 97.08
29.804 36.644 19.724 13.849 15.301
Table 5
Performance comparison of SMPM, TELM, TSVM, MPM and TMPELM with linear kernel. For each dataset, the two stacked rows per method give ACC±S (%) and training time (s).
Datasets          SMPM            TELM            TSVM            MPM             TMPELM
Australian 84.35 ± 2.17 83.59 ± 1.34 83.65 ± 2.23 84.18 ± 3.12 88.75 ± 1.06
(690 × 14) 3.243 1.109 19.281 2.36 1.268
German 69.67 ± 4.57 72.35 ± 2.61 83.78 ± 1.12 72.09 ± 2.73 85.83 ± 1.54
(1000 × 24) 4.482 3.274 35.532 6.325 4.963
Breast Cancer 91.87 ± 2.01 93.85 ± 0.93 96.79 ± 0.67 95.11 ± 1.11 96.08 ± 1.03
(699 × 9) 3.752 0.819 29.422 2.13 1.39
WDBC 86.65 ± 0.85 94.81 ± 0.63 95.35 ± 1.64 96.74 ± 0.56 97.38 ± 0.16
(569 × 30) 4.740 0.732 17.479 0.253 0.235
Spam 87.28 ± 1.15 86.75 ± 1.31 87.79 ± 0.89 86.21 ± 1.17 89.85 ± 1.05
(4601 × 57) 90.561 134.285 283.457 101.679 97.894
Pima 65.35 ± 4.45 74.84 ± 2.43 76.21 ± 3.01 76.78 ± 1.41 76.89 ± 0.52
(768 × 8) 2.381 1.607 5.573 2.819 2.117
QSAR 78.56 ± 1.16 84.65 ± 1.72 86.87 ± 2.11 86.24 ± 1.49 88.31 ± 1.78
(1055 × 41) 3.158 4.671 23.873 7.778 4.233
Banknote 88.59 ± 2.01 86.79 ± 1.83 87.09 ± 1.33 86.85 ± 1.45 89.75 ± 0.94
(1372 × 4) 6.614 8.338 29.756 8.019 7.817
Diabetes 67.17 ± 2.13 53.84 ± 2.87 59.45 ± 3.39 60.07 ± 1.83 61.39 ± 1.25
(1151 × 19) 2.735 7.883 11.171 6.582 7.391
Table 6
Performance comparison of SMPM, TELM, TSVM, MPM and TMPELM with RBF kernel. For each dataset, the two stacked rows per method give ACC±S (%) and training time (s).
Datasets          SMPM            TELM            TSVM            MPM             TMPELM
Australian 88.09 ± 1.23 85.17 ± 1.53 86.09 ± 1.11 85.85 ± 2.27 90.34 ± 0.33
(690 × 14) 6.867 4.23 36.01 6.78 4.28
German 72.00 ± 2.06 75.80 ± 2.32 84.61 ± 1.58 73.50 ± 2.46 86.98 ± 1.46
(1000 × 24) 11.103 6.67 45.57 8.79 6.49
Breast Cancer 93.91 ± 1.19 95.47 ± 0.85 97.95 ± 0.50 96.83 ± 1.06 97.63 ± 0.71
(699 × 9) 5.926 1.46 41.09 4.251 3.378
WDBC 88.04 ± 0.56 96.49 ± 0.72 97.02 ± 0.82 97.68 ± 0.31 99.17 ± 0.11
(569 × 30) 6.315 1.23 28.81 1.27 1.16
Spam 89.71 ± 1.45 86.83 ± 1.27 88.25 ± 1.19 87.57 ± 1.22 90.20 ± 0.89
(4601 × 57) 187.902 223.45 682.89 257.09 211.17
Pima 66.18 ± 1.37 75.66 ± 1.33 77.09 ± 1.76 77.24 ± 1.02 77.58 ± 0.79
(768 × 8) 6.684 4.53 15.61 6.37 4.49
QSAR 79.13 ± 2.13 85.25 ± 2.08 87.93 ± 1.81 87.56 ± 2.01 89.50 ± 1.60
(1055 × 41) 10.865 13.37 63.05 18.65 10.72
Banknote 90.56 ± 1.21 87.30 ± 1.37 87.88 ± 1.16 87.54 ± 1.21 91.02 ± 0.69
(1372 × 4) 14.253 22.12 56.03 18.07 21.36
Diabetes 68.35 ± 0.96 55.80 ± 0.64 61.32 ± 0.70 61.55 ± 0.61 63.92 ± 0.53
(1151 × 19) 9.964 21.18 41.68 16.56 19.47
on the Breast Cancer dataset. Similarly, the classification accuracy of the proposed TMPELM is lower than that of SMPM on the Diabetes dataset. We can further find that the classification performance of the proposed TMPELM is superior to the other comparison algorithms in most cases. However, the CPU runtime of our proposed TMPELM does not have significant advantages over SMPM, TELM, MPM and TSVM on all datasets, and on some it is lower than the runtime of these algorithms.

We selected eight UCI datasets and performed ten independent experiments, each based on optimal parameters. All experimental results are presented in box-plot form in Fig. 2, which shows the ACC box-plots of MPM, SMPM, TELM, TSVM and TMPELM with RBF kernel on the eight UCI datasets. The x-axis lists the different classifiers (TELM, TSVM, MPM, SMPM and TMPELM); the y-axis is the ACC over the UCI datasets. Each box-plot displays the five-number summary of ACC: the minimum, first quartile, median, third quartile and maximum. The box spans the first to the third quartile, a vertical line marks the median, and the whiskers extend from each quartile to the minimum or maximum.

From Fig. 2, we can observe that the performance of TMPELM on the eight UCI datasets is better than that of MPM, SMPM, TSVM and TELM. Intuitively, compared to MPM, SMPM, TSVM and TELM, our proposed TMPELM has higher accuracy and relatively stable performance. In addition, we can see that the overall performance of SMPM is better than that of MPM in most cases. This is because MPM only considers the prior probability distribution of each class with a given mean and covariance matrix, which cannot effectively utilize the structural information of the data, whereas SMPM uses two finite mixture models to capture the
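The box-plot construction and the ACC±S table entries described above come down to two NumPy calls; a sketch with made-up accuracies from ten runs:

```python
import numpy as np

# Accuracies of one classifier over ten independent runs (made-up numbers).
acc = np.array([0.86, 0.88, 0.85, 0.90, 0.87,
                0.89, 0.86, 0.88, 0.91, 0.87])

# Table entries: mean accuracy +/- sample standard deviation ("ACC +/- S").
mean, std = acc.mean(), acc.std(ddof=1)

# Five-number summary drawn by each box-plot: min, Q1, median, Q3, max.
five = np.percentile(acc, [0, 25, 50, 75, 100])

print(f"ACC = {100 * mean:.2f} +/- {100 * std:.2f} (%)")
print("five-number summary:", five)
```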
Fig. 2. Performance comparison of ACC of TELM, TSVM, MPM, SMPM and TMPELM on eight UCI datasets.
structural information of the data. Therefore, it is reasonable for SMPM to have better performance than MPM.

Further, in order to assess our proposed TMPELM intuitively, we evaluate its performance from the perspective of multi-objective optimization [34]. We use the training performance and the stability of TMPELM as two objective functions. From Tables 2 and 3, we can see that the proposed TMPELM has better training performance and stability than the other four algorithms on the Region A and Region B datasets. However, its training performance and stability on the Region C, Region D and Region E datasets are not superior to the other algorithms. The reason for this phenomenon is that the near-infrared spectral datasets are high-dimensional with few samples. As can be seen from Tables 5 and 6, the training performance and stability of the proposed TMPELM are good on the vast majority of the UCI datasets. In addition, we can see from Fig. 2 that the proposed TMPELM has relatively stable performance and good training performance. In summary, we conclude that the proposed TMPELM does improve performance from the perspective of multi-objective optimization.

After the above analysis, we can clearly see that, computation time aside, our proposed algorithm achieves better generalization performance than the other four algorithms.

6. Discussion and conclusion

In this paper, we propose a new binary classification method, termed the twin minimax probability extreme learning machine (TMPELM), which extends MPM to a classifier with two nonparallel separating hyperplanes. Similar to TELM, TMPELM constructs a network by using the random feature mapping mechanism, and then learns a pair of nonparallel hyperplanes in the random feature space by solving two smaller-sized SOCPs. For each hyperplane, TMPELM maximizes the probability of correctly classifying the sample points of one class while staying away from the other class, so it seizes the statistical and geometric information of the samples simultaneously. We further proposed a nonlinear TMPELM model by exploiting kernelization techniques. In addition, TMPELM incorporates the idea of TELM into the basic framework of MPM, so it inherits the advantages of TELM and MPM. Compared to other algorithms, the proposed TMPELM has fewer parameters in the linear and nonlinear cases, which makes our method more effective. We then designed many experiments to compare our proposed algorithm with other related algorithms.

As demonstrated by the experimental results, TMPELM has better classification accuracy than SMPM, TELM, MPM and TSVM on the majority of datasets, and its classification performance is better than that of MPM on all datasets. However, the runtime of our proposed TMPELM does not have significant advantages over SMPM, TELM and TSVM, although on most datasets it is better than that of MPM. Although TMPELM can be solved effectively, its running time is not fast enough, especially when the training data size (or feature size) is large. The experimental results demonstrate that TMPELM can greatly expand the application of MPM in pattern classification and provide new, deeper insight into MPM.

Although SOCP solvers can solve TMPELM effectively, the running time is not fast enough; therefore, we hope to find a more efficient way to solve TMPELM. In the future, we plan to investigate heuristic algorithms [34–36] for solving TMPELM. Certainly, how to extend TMPELM to multi-class classification, regression and semi-supervised learning problems is also worth studying.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (No. 11471010) and the Chinese Universities Scientific Fund.

Ethical approval. This paper does not contain any studies with human participants or animals performed by any of the authors.

Informed consent. Informed consent was obtained from all individual participants included in the study.

References

[1] G.R.G. Lanckriet, L.E. Ghaoui, C. Bhattacharyya, M.I. Jordan, Minimax probability machine, Adv. Neural Inf. Process. Syst. 37 (1) (2001) 192–200.
[2] J.K.C. Ng, Y. Zhong, S. Yang, A comparative study of minimax probability machine-based approaches for face recognition, Pattern Recognit. Lett. 28 (15) (2007) 1995–2002.
[3] Q. Li, Y. Sun, Y. Yu, C. Wang, T. Ma, Short-term photovoltaic power forecasting for photovoltaic power station based on EWT-KMPMR, Trans. Chin. Soc. Agric. Eng. 33 (20) (2017) 265–273.
[4] B. Jiang, Z. Guo, Q. Zhu, G. Huang, Dynamic minimax probability machine-based approach for fault diagnosis using pairwise discriminate analysis, IEEE Trans. Control Syst. Technol. 27 (2) (2017) 1–8.
[5] L. Yang, Y. Gao, Q. Sun, A new minimax probabilistic approach and its application in recognition the purity of hybrid seeds, CMES Comput. Model. Eng. Sci. 104 (6) (2015) 493–506.
[6] G.R.G. Lanckriet, L.E. Ghaoui, M.I. Jordan, Robust novelty detection with single-class MPM, in: International Conference on Neural Information Processing Systems, MIT Press, 2002, pp. 905–912.
[7] T. Strohmann, G.Z. Grudic, A formulation for minimax probability machine regression, Adv. Neural Inf. Process. Syst. (2003) 769–776.
[8] K. Huang, H. Yang, I. King, M.R. Lyu, L. Chan, The minimum error minimax probability machine, J. Mach. Learn. Res. 5 (2004) 1253–1286.
[9] B. Gu, X. Sun, V.S. Sheng, Structural minimax probability machine, IEEE Trans. Neural Netw. Learn. Syst. 28 (7) (2017) 1646–1656.
[10] K. Huang, H. Yang, I. King, M.R. Lyu, Learning classifiers from imbalanced data based on biased minimax probability machine, in: Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2004), IEEE, 2004.
[11] S. Song, Y. Gong, Y. Zhang, G. Huang, G.B. Huang, Dimension reduction by minimum error minimax probability machine, IEEE Trans. Syst. Man Cybern. Syst. 47 (1) (2017) 58–69.
[12] S. Cousins, J. Shawe-Taylor, High-probability minimax probability machines, Mach. Learn. 106 (6) (2017) 1–24.
[13] K. Yoshiyama, A. Sakurai, Laplacian minimax probability machine, Pattern Recognit. Lett. 37 (2014) 192–200.
[14] V. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, 1995.
[15] V. Vapnik, Statistical Learning Theory, Wiley, New York, 1998.
[16] Jayadeva, R. Khemchandani, S. Chandra, Twin support vector machines for pattern classification, IEEE Trans. Pattern Anal. Mach. Intell. 29 (5) (2007) 905–910.
[17] X. Peng, A ν-twin support vector machine (ν-TSVM) classifier and its geometric algorithms, Inform. Sci. 180 (20) (2010) 3863–3875.
[18] Y.H. Shao, C.H. Zhang, X.B. Wang, N.Y. Deng, Improvements on twin support vector machines, IEEE Trans. Neural Netw. 22 (6) (2011) 962–968.
[19] Z. Qi, Y. Tian, Y. Shi, Laplacian twin support vector machine for semi-supervised classification, Neural Netw. 35 (2012) 46–53.
[20] Z. Qi, Y. Tian, Y. Shi, Robust twin support vector machine for pattern classification, Pattern Recognit. 46 (1) (2013) 305–316.
[21] Z. Qi, Y. Tian, Y. Shi, Structural twin support vector machine for classification, Knowl.-Based Syst. 43 (2) (2013) 74–81.
[22] X. Peng, Y. Wang, D. Xu, Structural twin parametric-margin support vector machine for binary classification, Knowl.-Based Syst. 49 (2013) 63–72.
[23] G.B. Huang, H. Zhou, X. Ding, Z. Rui, Extreme learning machine for regression and multiclass classification, IEEE Trans. Syst. Man Cybern. B 42 (2) (2012) 513–529.
[24] Y.-L. Hu, L. Chen, A nonlinear hybrid wind speed forecasting model using LSTM network, hysteretic ELM and differential evolution algorithm, Energy Convers. Manage. 173 (2018) 123–142.
[25] P. Horata, S. Chiewchanwattana, K. Sunat, Robust extreme learning machine, Neurocomputing 102 (2013) 31–44.
[26] X. Lu, Z. Han, H. Zhou, L. Xie, G.B. Huang, Robust extreme learning machine with its application to indoor positioning, IEEE Trans. Cybern. 46 (1) (2016) 194.
[27] G.G. Wang, M. Lu, Y.Q. Dong, X.J. Zhao, Self-adaptive extreme learning machine, Neural Comput. Appl. 27 (2) (2016) 291–303.
[28] Y. Zhang, J. Wu, Z. Cai, P. Zhang, L. Chen, Memetic extreme learning machine, Pattern Recognit. 58 (C) (2016) 135–148.
[29] Y. Wan, S. Song, G. Huang, S. Li, Twin extreme learning machines for pattern classification, Neurocomputing 260 (2017) 235–244.
[30] G. Huang, S. Song, J.N.D. Gupta, C. Wu, Semi-supervised and unsupervised extreme learning machines, IEEE Trans. Cybern. 44 (12) (2014) 2405–2417.
[31] H. Pei, K. Wang, Q. Lin, P. Zhong, Robust semi-supervised extreme learning machine, Knowl.-Based Syst. 159 (2018) 203–220.
[32] A.W. Marshall, I. Olkin, Multivariate Chebyshev inequalities, Ann. Math. Stat. 31 (4) (1960) 1001–1014.
[33] J.F. Sturm, Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones, Optim. Methods Softw. 11 (1–4) (1999) 625–653.
[34] G.Q. Zeng, J. Chen, Y.X. Dai, L.M. Li, C.W. Zheng, M.R. Chen, Design of fractional order PID controller for automatic regulator voltage system based on multi-objective extremal optimization, Neurocomputing 160 (2015) 173–184.
[35] K. Lu, W. Zhou, G. Zeng, W. Du, Design of PID controller based on a self-adaptive state-space predictive functional control using extremal optimization method, J. Franklin Inst. B 355 (5) (2018) 2197–2220.
[36] K. Lu, W. Zhou, G. Zeng, Y. Zheng, Constrained population extremal optimization-based robust load frequency control of multi-area interconnected power system, Int. J. Electr. Power Energy Syst. 105 (2019) 249–271.