
Knowledge-Based Systems 196 (2020) 105703


A novel twin minimax probability machine for classification and regression✩

Jun Ma a,∗, Jumei Shen a,b,c

a School of Mathematics and Information Sciences, North Minzu University, Yinchuan Ningxia 750021, PR China
b The Key Laboratory of Intelligent Information and Big Data Processing of NingXia Province, North Minzu University, Yinchuan Ningxia 750021, PR China
c Health Big Data Research Institute of North Minzu University, Yinchuan Ningxia 750021, PR China

✩ No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.knosys.2020.105703.
∗ Corresponding author. E-mail addresses: jun_ma1990@yeah.net (J. Ma), shenjumei2839@163.com (J. Shen).

Article history: Received 14 November 2019; Received in revised form 22 February 2020; Accepted 26 February 2020; Available online 28 February 2020.

Keywords: Minimax probability machine; Classification; Regression; Non-parallel hyperplane; Second-order cone programming.

Abstract

As an excellent machine learning tool, the minimax probability machine (MPM) has been widely used in many fields. However, MPM does not include a regularization term for the construction of the separating hyperplane, and it needs to solve a large-scale second-order cone programming (SOCP) problem in the solution process, which greatly limits its development and application. In this paper, to improve the performance of MPM, we propose a novel binary classification method called twin minimax probability machine classification (TMPMC). TMPMC constructs two non-parallel hyperplanes for the final classification by solving two smaller SOCPs. For each hyperplane, TMPMC attempts to minimize the worst-case (maximum) probability that a class of samples is misclassified while being as far away as possible from the other class. Additionally, we extend TMPMC to the regression problem and propose a new regularized twin minimax probability machine regression (TMPMR). Intuitively, the idea of TMPMR is to convert the regression problem into a classification problem. Both TMPMC and TMPMR avoid assumptions on the distribution of the class-conditional density. Finally, we extend the linear models of TMPMC and TMPMR to the nonlinear case. Experimental results on several datasets show that TMPMC and TMPMR are competitive in terms of generalization performance compared to other algorithms.

© 2020 Elsevier B.V. All rights reserved.

1. Introduction

With the development of data mining and machine learning, classification and regression have received attention and research in many fields. Recently, Lanckriet et al. [1] proposed an excellent learning algorithm for pattern classification, namely the minimax probability machine (MPM). MPM minimizes the worst-case classification error rate with respect to any choice of class-conditional distributions that satisfy the given moment constraints (means and covariance matrices of the samples); that is, MPM minimizes the upper bound of the misclassification probability. Due to its excellent performance, MPM has been widely used in many fields, such as computer vision [2], engineering technology [3,4], agriculture [5] and novelty detection [6].

In recent years, many researchers have proposed improved versions of MPM from different perspectives [6–13]. Representative work can be briefly reviewed as follows. In [7], by estimating the mean and covariance matrix statistics of the regression data, Thomas and Gregory proposed MPM regression (MPMR). At the same time, Huang et al. [8] proposed the minimum error MPM (MEMPM) by removing the equality constraint of MPM. In addition, to overcome the fact that MPM cannot effectively utilize the structural information of the data, Gu et al. [9] proposed the structural MPM (SMPM) by combining finite mixture models with MPM. SMPM not only inherits the advantages of MPM, but also makes better use of the structural information of the data. Although the above MPM-based methods have achieved good results, they tend to fail in a semi-supervised learning environment. Facing this problem, the Laplacian minimax probability machine (Lap-MPM) was proposed by Yoshiyama et al. [13].

In general, MPM-based learning methods are formulated as second-order cone programming problems (SOCPs). SOCP, as an excellent optimization method, has received much attention in many areas [14–16]. Recently, SOCP-based support vector machine (SVM) [17,18] improvement schemes have been proposed by many scholars [19–24]. These consider all possible choices of class-conditional densities with a given mean and covariance matrix, i.e., a worst-case setting.
Despite the great success of MPM-based learning methods, MPM still has defects and deficiencies. It is well known that regularization is often used to avoid ill-posed problems; it is also used to fit training samples well while reducing the risk of over-fitting [25]. However, MPM does not include a regularization term for the construction of the separating hyperplane. In response to this problem, Saketha Nath and Bhattacharyya [26] proposed a regularized alternative to MPM, in which the L2-norm is minimized. However, this method uses a fixed value for the misclassification rate instead of treating it as an optimization variable. Further, in [27], Sebastián et al. proposed a regularized minimax probability machine for classification. This inclusion reduces the risk of obtaining ill-posed estimators, stabilizing the problem and therefore improving the generalization performance. In addition, MPM needs to solve a large-scale SOCP in the solution process, which increases its computational burden and invisibly limits the application of MPM in practice. To overcome this problem, Xu et al. [28] combined the idea of twin support vector machines (TSVMs) [29–31] with MPM and proposed the twin minimax probability machine (TWMPM), which inherits the advantages of both TSVM and MPM. On the one hand, TWMPM generates two hyperplanes to improve the performance of MPM. On the other hand, TWMPM avoids the distributional assumption on the class-conditional density. Experimental results show that in most cases TWMPM improves the performance of MPM. Subsequently, Ma et al. [32] proposed the twin minimax probability extreme learning machine (TMPELM) for pattern classification. TMPELM not only utilizes the geometric information of the samples, but also takes advantage of their statistical information (mean and covariance). Experimental results on both NIR spectroscopy datasets and benchmark datasets demonstrate the effectiveness of TMPELM.

Inspired by this research, this paper proposes a new MPM method called the regularized twin minimax probability machine classification (TMPMC). TMPMC constructs two non-parallel hyperplanes by solving two smaller SOCPs, instead of the single large-scale SOCP of MPM. For each hyperplane, TMPMC attempts to maximize the probability that a class of samples is correctly classified, i.e., to minimize the worst-case (maximum) probability that this class of samples is misclassified, while keeping the distance to the other class as large as possible. Intuitively, TMPMC can directly estimate bounds on the classification accuracy by minimizing the maximum probability of misclassification, while avoiding the distributional assumption on the class-conditional density. Additionally, TMPMC not only utilizes the geometric information of the samples, but also utilizes their statistical information (mean and covariance). Moreover, we extend TMPMC to the regression problem and propose the regularized twin minimax probability machine regression (TMPMR). TMPMR first transforms the regression problem into a binary classification problem, so that the solution of this classification problem directly solves the original regression problem. TMPMR only assumes that the mean and covariance matrices of the distribution generating the regression data are known, avoiding the distributional hypothesis on the class-conditional density. Finally, we extend the linear models of TMPMC and TMPMR to nonlinear models by using kernel techniques. Experimental results on multiple datasets demonstrate the effectiveness of TMPMC and TMPMR.

Specifically, the main contributions of this paper are summarized as follows:

(1) A novel classification method, TMPMC, is proposed based on MPM and TSVM. TMPMC not only utilizes the geometric information of the data, but also takes advantage of the statistical information (mean and covariance of the samples).

(2) A new regression framework based on TMPMC is presented, namely TMPMR. The core idea of TMPMR is to transform the regression problem into a classification problem.

(3) TMPMC and TMPMR construct two non-parallel hyperplanes for the final classification and regression by solving two smaller SOCPs, respectively. Both TMPMC and TMPMR avoid assumptions on the distribution of the conditional density.

(4) The experimental results show that TMPMC and TMPMR greatly expand the application of MPM to pattern classification and regression problems, and provide a new direction for MPM research.

The rest of this paper is organized as follows. In Section 2, we give a brief review of MPM and TSVM. In Sections 3 and 4, we describe the details of the proposed TMPMC and TMPMR, respectively. In Section 5, we present experimental results of the proposed TMPMC and TMPMR on various datasets. Finally, the conclusions and discussion are given in Section 6.

2. Related work

In this section, we briefly introduce MPM and TSVM.

2.1. MPM

MPM, an excellent machine learning algorithm, was proposed by Lanckriet et al. [1]. The core idea of MPM is to minimize the probability of worst-case misclassification; that is, MPM minimizes the upper bound of the misclassification probability with respect to any choice of class-conditional distributions that satisfy the given moment constraints (means and covariance matrices of the samples).

For given random vectors x and y, MPM attempts to determine a hyperplane

H(a, b) = {z | a^T z = b, a ∈ R^n \ {0}, z ∈ R^n, b ∈ R}    (1)

as shown in Fig. 1, where x ∼ (µ1, Σ1) and y ∼ (µ2, Σ2), with x, y, µ1, µ2 ∈ R^n and Σ1, Σ2 ∈ R^{n×n}.

Fig. 1. Decision hyperplane calculated by the MPM.

The original MPM can be written as follows:

max_{α, a, b}  α
s.t.  inf_{x∼(µ1,Σ1)} Pr{a^T x ≥ b} ≥ α,    (2)
      inf_{y∼(µ2,Σ2)} Pr{a^T y ≤ b} ≥ α

where x ∼ (µ1, Σ1) denotes the class of distributions that have the prescribed mean µ1 and covariance Σ1 but are otherwise arbitrary, and likewise for y. α represents the lower bound of the accuracy on test samples (the worst-case accuracy).
Using the multivariate Chebyshev inequality [33], problem (2) can be rewritten as:

max_{α, a, b}  α
s.t.  −b + a^T µ1 ≥ k(α) √(a^T Σ1 a),    (3)
      b − a^T µ2 ≥ k(α) √(a^T Σ2 a)

where k(α) = √(α / (1 − α)). Through a series of algebraic operations we then obtain the following optimization problem:

max_k  k
s.t.  a^T (µ1 − µ2) ≥ k ( √(a^T Σ1 a) + √(a^T Σ2 a) ).    (4)

Further, (4) is equivalent to the following optimization problem:

min_a  √(a^T Σ1 a) + √(a^T Σ2 a)
s.t.  a^T (µ1 − µ2) = 1    (5)

Here, we notice that the above optimization problem (5) can be cast as an SOCP problem, which can be solved effectively by interior point algorithms, e.g., SeDuMi [34]. For the nonlinear case and more details, please refer to [1].
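To make the MPM training step concrete, the following is a minimal sketch of solving problem (5) with the cvxpy modelling library in Python; this is an assumption for illustration only (the experiments in this paper use SeDuMi under MATLAB), and the helper name fit_mpm and the small diagonal jitter are not from the paper.

```python
import numpy as np
import cvxpy as cp

def fit_mpm(X_pos, X_neg):
    """Illustrative MPM training: solve problem (5) and recover (a, b, alpha)."""
    mu1, mu2 = X_pos.mean(axis=0), X_neg.mean(axis=0)
    S1 = np.cov(X_pos, rowvar=False)
    S2 = np.cov(X_neg, rowvar=False)
    n = S1.shape[0]
    L1 = np.linalg.cholesky(S1 + 1e-8 * np.eye(n))   # S1 = L1 L1^T
    L2 = np.linalg.cholesky(S2 + 1e-8 * np.eye(n))   # S2 = L2 L2^T

    a = cp.Variable(n)
    # min sqrt(a^T S1 a) + sqrt(a^T S2 a)  s.t.  a^T (mu1 - mu2) = 1
    objective = cp.Minimize(cp.norm(L1.T @ a) + cp.norm(L2.T @ a))
    problem = cp.Problem(objective, [a @ (mu1 - mu2) == 1])
    problem.solve()

    a_val = a.value
    kappa = 1.0 / problem.value                      # k* = 1 / (sqrt(.) + sqrt(.))
    alpha = kappa ** 2 / (1.0 + kappa ** 2)          # invert k(alpha) = sqrt(alpha / (1 - alpha))
    b = a_val @ mu1 - kappa * np.sqrt(a_val @ S1 @ a_val)  # threshold between the two classes
    return a_val, b, alpha
```

A new point x would then be assigned to the first class when a^T x ≥ b, with alpha as the worst-case accuracy bound.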
fication algorithm, namely twin regularized twin minimax prob-
ability machine classification (for short TMPMC). Assume x and
2.2. TSVM
y represent random vectors with means and covariance matrices
given by (µ1 , Σ1 ) and (µ2 , Σ2 ) respectively, with x, y , µ1 , µ2 ∈
SVM has been applied to many fields as an excellent machine Rn , Σ1 , Σ2 ∈ Rn×n . Let us denote by m1 and m2 the number of
learning tool. However, one of the main challenges of traditional elements of the positive and negative class, respectively. Also, we
SVMs is that they require a lot of training time to solve a large denote by A ∈m1 ×n the data matrix for the positive class (i.e. for
QPP problems. Therefore, this greatly limits the development of yi = +1), and by B ∈m2 ×n the data matrix for the negative class
SVM in practical applications. To solve this problem, twin support (i.e. yi = −1). Thus ,we have
vector machine (TSVM) was proposed by Jayadeva et al. [29].
µ1 = 1
AT e1

TSVM is very similar in form to SVM, except that it aims to ⎪

⎪ m1
generate two non-parallel planes by solving a pair of smaller


⎨µ2 = BT e2
1


QPPs, making each plane closer to one class and as far as possible ⎪
m2
from the other. This results in TSVM having lower computational (12)
complexity than SVM. Σ1 = 1
(AT − µ1 eT1 )T (AT − µ1 eT1 )


⎪ m1
Given training data set Tl = {(x1 , y1 ), . . . , (xl , yl )} ∈ (Rn , Y )l ,



where xi ∈ Rn , yi ∈ Y = {1, −1}, i = 1, . . . , l. Tl contains


Σ2 = 1
(BT − µ2 eT2 )T (BT − µ2 eT2 )

m2
m1 positive class and m2 negative class, where l = m1 + m2 .
Let A ∈ Rm1 ×n denote all positive class samples and B ∈ Rm2 ×n where e1 ∈ Rm1 and e2 ∈ Rm2 are all-ones vectors.
denote all negative class samples. Then, we have: Similar to TSVM, the purpose of the TMPMC is also to find
two non-parallel hyperplanes trying to separate the positive and
f1 (x) = wT1 x + b1 = 0, (6) negative data points in the feature space
and H(w1 , b1 ) = {x|wT1 x + b1 + 1 = 0} (13)

f2 (x) = w T
2x + b2 = 0. (7) and

where w1 ∈ Rn , w2 ∈ Rn , b1 ∈ R and b2 ∈ R. H(w2 , b2 ) = {x|wT2 x + b2 − 1 = 0} (14)


Therefore, the original TSVM formula can be expressed as For each hyperplane, TMPMC attempts to maximize the proba-
follows: bility of a correct class of sample classification, ie, to minimize
1 the worst case (maximum) probability of misclassification of a
min ∥ Aw1 + e1 b1 ∥2 +C1 eT2 ξ class of samples while the distance to the other class is as large as
w1 ,b1 2
(8) possible. An intuitive geometric interpretation is shown in Fig. 2.
s.t. − (Bw1 + e2 b1 ) + ξ ≥ e2
ξ≥0 In order to get (13) and (14), we have the following a pair of
primal TMPMC optimization problems:
and
1
1 max α − ∥w1 ∥2
min ∥ Bw2 + e2 b2 ∥2 +C2 eT1 η α,w1 ,b1 2
w2 ,b2 2
(9) s.t. inf Pr {wT1 x1 + b1 ≥ 1} ≥ α (15)
s.t. Aw2 + e2 b1 + η ≥ e1 x1 ∼(µ1 ,Σ1 )

η≥0 − (Bw1 + eb1 ) > e


3. Regularized twin minimax probability machine classification

3.1. Linear TMPMC

It is well known that MPM, as a learning algorithm, has been widely used in the fields of machine learning and data mining [2,4–13]. In this section, we propose a new MPM classification algorithm, namely the regularized twin minimax probability machine classification (TMPMC for short). Assume x and y represent random vectors with means and covariance matrices given by (µ1, Σ1) and (µ2, Σ2), respectively, with x, y, µ1, µ2 ∈ R^n and Σ1, Σ2 ∈ R^{n×n}. Let us denote by m1 and m2 the numbers of elements of the positive and negative class, respectively. Also, we denote by A ∈ R^{m1×n} the data matrix of the positive class (i.e., for yi = +1) and by B ∈ R^{m2×n} the data matrix of the negative class (i.e., for yi = −1). Thus, we have

µ1 = (1/m1) A^T e1,
µ2 = (1/m2) B^T e2,
Σ1 = (1/m1)(A^T − µ1 e1^T)(A^T − µ1 e1^T)^T,
Σ2 = (1/m2)(B^T − µ2 e2^T)(B^T − µ2 e2^T)^T,    (12)

where e1 ∈ R^{m1} and e2 ∈ R^{m2} are all-ones vectors.

Similar to TSVM, the purpose of TMPMC is also to find two non-parallel hyperplanes that try to separate the positive and negative data points in the feature space:

H(w1, b1) = {x | w1^T x + b1 + 1 = 0}    (13)

and

H(w2, b2) = {x | w2^T x + b2 − 1 = 0}.    (14)

For each hyperplane, TMPMC attempts to maximize the probability that one class of samples is correctly classified, i.e., to minimize the worst-case (maximum) probability that this class of samples is misclassified, while the distance to the other class is as large as possible. An intuitive geometric interpretation is shown in Fig. 2.

Fig. 2. An intuitive geometric interpretation of TMPMC.

In order to obtain (13) and (14), we consider the following pair of primal TMPMC optimization problems:

max_{α, w1, b1}  α − (1/2) ∥w1∥²
s.t.  inf_{x1∼(µ1,Σ1)} Pr{w1^T x1 + b1 ≥ 1} ≥ α,    (15)
      −(Bw1 + e b1) > e

and

max_{β, w2, b2}  β − (1/2) ∥w2∥²
s.t.  inf_{x2∼(µ2,Σ2)} Pr{w2^T x2 + b2 ≤ −1} ≥ β,    (16)
      Aw2 + e b2 > e

where the notation x1 ∼ (µ1, Σ1) refers to the class of distributions that have prescribed mean µ1 and covariance Σ1, and likewise for x2. Note that w1, w2 ∈ R^n and b1, b2 ∈ R, e is a vector of ones of appropriate dimension, and A and B are the matrices consisting of the positive and negative sample points, respectively.

Remark 2. To facilitate understanding of the proposed TMPMC, we describe the optimization problem (15); the same explanation applies to (16). TMPMC minimizes the worst-case probability that future samples are misclassified and thereby provides a bound on the classification accuracy; that is, TMPMC controls the probability of misclassification so as to maximize the classification accuracy, while also minimizing the regularization term used to construct the separating hyperplane. In the objective function of (15), TMPMC bounds the worst-case probability that the positive sample points are misclassified while minimizing the regularization term. The first constraint of (15) requires the positive samples to lie above the hyperplane (13) with probability greater than α; intuitively, this probability-based constraint is a soft constraint that maximizes α, placing as many positive class samples as possible on the correct side of the classification hyperplane (13). The second constraint of (15) forces the negative points to lie below the hyperplane (14).

Theorem 1 (Multivariate Chebyshev Inequality, see [33]). Let x be an n-dimensional random variable with mean and covariance (µ, Σ), where Σ is a positive semi-definite symmetric matrix. Given a ∈ R^n \ {0} such that a^T µ + b ≥ 0, and α ∈ [0, 1], the condition

inf_{x∼(µ,Σ)} Pr{a^T x + b ≥ 0} ≥ α    (17)

holds if and only if a^T µ + b ≥ κ(α) √(a^T Σ a), where κ(α) = √(α / (1 − α)).

Based on Theorem 1, problems (15) and (16) can be transformed into the following optimization problems:

max_{α, w1, b1}  α − (1/2) ∥w1∥²
s.t.  w1^T µ1 + b1 ≥ κ1(α) √(w1^T Σ1 w1),    (18)
      −(Bw1 + e b1) > e,
      w1^T µ1 + b1 ≥ 1

and

max_{β, w2, b2}  β − (1/2) ∥w2∥²
s.t.  −w2^T µ2 − b2 ≥ κ2(β) √(w2^T Σ2 w2),    (19)
      Aw2 + e b2 > e,
      −w2^T µ2 − b2 ≥ 1

where κ1(α) = √(α / (1 − α)) and κ2(β) = √(β / (1 − β)). It is obvious that κ1(α) and κ2(β) are monotonically increasing functions of α and β, respectively.

By introducing two new variables r1, r2 and the two constraints ∥w1∥ ≤ r1 and √(w1^T Σ1 w1) ≤ r2, the formulation (18) can be cast as the following optimization problem:

min_{w1, b1, r1, r2}  r2 + (1/2) r1
s.t.  ∥w1∥ ≤ r1,
      √(w1^T Σ1 w1) ≤ r2,    (20)
      −(Bw1 + e b1) > e,
      w1^T µ1 + b1 ≥ 1

Similarly to (18), (19) can be rewritten as

min_{w2, b2, q1, q2}  q2 + (1/2) q1
s.t.  ∥w2∥ ≤ q1,
      √(w2^T Σ2 w2) ≤ q2,    (21)
      Aw2 + e b2 > e,
      −w2^T µ2 − b2 ≥ 1

Remark 3. Note that the optimization problems (20) and (21) are SOCPs. An SOCP problem is a convex optimization problem with a linear objective function and second-order cone (SOC) constraints [14–16]. It is obvious that (20) and (21) can be solved by SOCP solvers, e.g., SeDuMi (http://sedumi.ie.lehigh.edu/) [34].

Once the vectors w1 and w2 are obtained, a new sample point x ∈ R^n is assigned to the positive or negative class by comparing the following perpendicular distance measurements between it and the two hyperplanes, i.e.,

f(x) = sign( (w1^T x + b1 + 1)/∥w1∥ + (w2^T x + b2 − 1)/∥w2∥ ).    (22)
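To illustrate how the linear TMPMC pair (20)–(21) and the decision rule (22) could be wired together, here is a minimal Python/cvxpy sketch; the strict inequalities of (20)–(21) are written as non-strict constraints with margin 1 for numerical convenience, and the helper names (_solve_plane, fit_tmpmc, predict_tmpmc) are hypothetical, not from the paper.

```python
import numpy as np
import cvxpy as cp

def _solve_plane(M_other, mu, Sigma, sign):
    """Solve one SOCP of the form (20)/(21); sign=+1 gives problem (20), sign=-1 gives (21)."""
    n = len(mu)
    L = np.linalg.cholesky(Sigma + 1e-8 * np.eye(n))   # so ||L^T w|| = sqrt(w^T Sigma w)
    w, b = cp.Variable(n), cp.Variable()
    r1, r2 = cp.Variable(), cp.Variable()
    constraints = [cp.norm(w) <= r1,                   # ||w|| <= r1
                   cp.norm(L.T @ w) <= r2,             # sqrt(w^T Sigma w) <= r2
                   sign * (M_other @ w + b) <= -1,     # push the other class to the far side
                   sign * (mu @ w + b) >= 1]           # keep the own-class mean on the near side
    cp.Problem(cp.Minimize(r2 + 0.5 * r1), constraints).solve()
    return w.value, b.value

def fit_tmpmc(A, B):
    """Train the linear TMPMC pair (20)-(21) on positive samples A and negative samples B."""
    mu1, mu2 = A.mean(axis=0), B.mean(axis=0)
    S1, S2 = np.cov(A, rowvar=False), np.cov(B, rowvar=False)
    w1, b1 = _solve_plane(B, mu1, S1, sign=+1)         # problem (20)
    w2, b2 = _solve_plane(A, mu2, S2, sign=-1)         # problem (21)
    return w1, b1, w2, b2

def predict_tmpmc(X, w1, b1, w2, b2):
    """Decision rule (22): sum of the signed distances to the two shifted hyperplanes."""
    d = (X @ w1 + b1 + 1) / np.linalg.norm(w1) + (X @ w2 + b2 - 1) / np.linalg.norm(w2)
    return np.sign(d)
```

Each call to _solve_plane handles one smaller SOCP, which is the source of the computational advantage over solving a single large MPM problem.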
3.2. Nonlinear TMPMC

According to Hilbert space theory, w_{1,2} can be expressed as w_{1,2} = Σ_{i=1}^{l} ν_{1,2} φ(xi). We consider the following two kernel-generated separating surfaces:

K(x^T, C^T) ν1 + b1 = 0    (23)

K(x^T, C^T) ν2 + b2 = 0    (24)

where C^T = [A B]^T and K is an appropriately chosen kernel.

In order to obtain the hyperplanes (23) and (24), we consider the following pair of optimization problems:

max_{α, ν1, b1}  α − (1/2) ∥ν1∥²
s.t.  inf_{x1∼(µ̂1,Σ̂1)} Pr{K(A, C^T) ν1 + b1 > 1} ≥ α,    (25)
      −(K(B, C^T) ν1 + e b1) ≥ e

and

max_{β, ν2, b2}  β − (1/2) ∥ν2∥²
s.t.  inf_{x2∼(µ̂2,Σ̂2)} Pr{K(B, C^T) ν2 + b2 < −1} ≥ β,    (26)
      K(A, C^T) ν2 + e b2 > e

where

µ̂1 = (1/m1) Aφ(x)^T e1,
µ̂2 = (1/m2) Bφ(x)^T e2,
Σ̂1 = (1/m1)(Aφ(x)^T − µ̂1 e1^T)(Aφ(x)^T − µ̂1 e1^T)^T,
Σ̂2 = (1/m2)(Bφ(x)^T − µ̂2 e2^T)(Bφ(x)^T − µ̂2 e2^T)^T,    (27)

and e1 ∈ R^{m1} and e2 ∈ R^{m2} are all-ones vectors.

Similar to the linear case, we have

max_{α, ν1, b1}  α − (1/2) ∥ν1∥²
s.t.  K(A, C^T) µ̂1 + b1 − 1 > κ1 √(ν1^T Σ̂1 ν1),    (28)
      −(K(B, C^T) ν1 + e b1) ≥ e,
      K(A, C^T) µ̂1 + b1 ≥ 1

and

max_{β, ν2, b2}  β − (1/2) ∥ν2∥²
s.t.  K(B, C^T) µ̂2 + b2 + 1 > κ2 √(ν2^T Σ̂2 ν2),    (29)
      K(A, C^T) ν2 + e b2 > e,
      K(B, C^T) µ̂2 + b2 ≤ −1

where κ1 = √(α / (1 − α)) and κ2 = √(β / (1 − β)).

Introducing the variables s1, z1 and s2, z2, we obtain the following two optimization problems through algebraic operations:

min_{s1, z1, ν1, b1}  s1 + (1/2) z1
s.t.  ∥ν1∥ ≤ z1,
      √(ν1^T Σ̂1 ν1) ≤ s1,    (30)
      −(K(B, C^T) ν1 + e b1) ≥ e,
      K(A, C^T) µ̂1 + b1 ≥ 1

and

min_{s2, z2, ν2, b2}  s2 + (1/2) z2
s.t.  ∥ν2∥ ≤ z2,
      √(ν2^T Σ̂2 ν2) ≤ s2,    (31)
      K(A, C^T) ν2 + e b2 > e,
      K(B, C^T) µ̂2 + b2 ≤ −1

We show in Fig. 3 how the TMPMC works.

Fig. 3. TMPMC algorithm framework.
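The kernel blocks K(A, C^T), K(B, C^T) and the empirical moments µ̂, Σ̂ appearing in (25)–(31) can be estimated directly from the training data; a small sketch of that preprocessing step is given below, assuming an RBF kernel (as used later in the experiments). The helper names rbf_kernel and empirical_kernel_moments are illustrative.

```python
import numpy as np

def rbf_kernel(X, C, sigma):
    """K[i, j] = exp(-||x_i - c_j||^2 / (2 sigma^2))."""
    sq = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma ** 2))

def empirical_kernel_moments(A, B, sigma):
    """Kernel blocks and empirical means/covariances used by the nonlinear TMPMC."""
    C = np.vstack([A, B])                  # C^T = [A B]^T
    KA = rbf_kernel(A, C, sigma)           # kernel representation of the positive samples
    KB = rbf_kernel(B, C, sigma)           # kernel representation of the negative samples
    mu1_hat, mu2_hat = KA.mean(axis=0), KB.mean(axis=0)
    S1_hat = (KA - mu1_hat).T @ (KA - mu1_hat) / A.shape[0]
    S2_hat = (KB - mu2_hat).T @ (KB - mu2_hat) / B.shape[0]
    return KA, KB, mu1_hat, mu2_hat, S1_hat, S2_hat
```

With these quantities in hand, the two SOCPs (30)–(31) have exactly the same structure as their linear counterparts (20)–(21), with ν1, ν2 playing the role of w1, w2.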
3.3. Discussions

In this section, we analyze the differences between the modeling ideas of the MPM and TMPMC algorithms as well as their computational complexity.

3.3.1. Difference of modeling ideas

It is well known that the core idea of MPM [1] is to minimize the worst-case classification error rate; that is, MPM directly maximizes the lower bound of the probability of correct classification of the samples, with respect to any choice of class-conditional distributions that satisfy the given moment constraints. On the one hand, MPM is built on constraints involving only the first and second moments of each class-conditional distribution. On the other hand, MPM needs to solve a large-scale SOCP problem in the solution process, which increases its computational burden.

In contrast, TMPMC constructs two non-parallel hyperplanes by solving two smaller SOCPs rather than a single large-scale SOCP. For each hyperplane, TMPMC attempts to maximize the probability that a class of samples is correctly classified, i.e., to minimize the worst-case (maximum) probability that this class of samples is misclassified, while the distance to the other class is as large as possible. Additionally, TMPMC can directly estimate bounds on the classification accuracy by minimizing the maximum probability of misclassification, while avoiding the distributional assumption on the class-conditional density.

3.3.2. Computational complexity

Before analyzing the computational complexity of the proposed TMPMC, we first analyze the computational complexity of MPM. Usually MPM can be reduced to an SOCP problem and solved with SeDuMi [34]. It is well known that SeDuMi can effectively handle SOCP problems by the interior point method and has a computational complexity of O(n³) in the worst case.
Before solving the SOCP, we need to estimate µ1, µ2, Σ1 and Σ2. Therefore, the total computational complexity of MPM is approximately O(n³ + Nn²), where N is the number of data points and n is the dimension of the samples. Next, we analyze the computational complexity of TMPMC. When solving TMPMC, we solve a pair of smaller-scale SOCPs. Assuming that the number of data samples in each subproblem is roughly equal to N/2, the computational complexity of TMPMC is approximately O(n³ + (N/2)n²), which is lower than that of the traditional MPM. The differences and relationship between MPM and TMPMC are summarized in Table 1.

Table 1
The difference between MPM and TMPMC.

                            MPM                                   TMPMC
Classification hyperplane   A single classification hyperplane    A pair of non-parallel classification hyperplanes
Optimization task           One large SOCP                        Two smaller-sized SOCPs
Computational complexity    O(n³ + Nn²)                           O(n³ + (N/2)n²)

4. Regularized twin minimax probability machine regression

In this section, we detail the proposed regression method, called regularized twin minimax probability machine regression (TMPMR).

4.1. Linear TMPMR

Let the samples to be trained be represented by a set of N column vectors xi ∈ R^n. Also let X = [x1^T; x2^T; ...; xN^T] and let Y = [y1; y2; ...; yN] denote the response vector of the samples, where yi ∈ R. Let us consider the regression problem on the data set C = [X Y]. The aim of this problem is to construct an ϵ-insensitive regressor. In order to solve this problem by classification, we add and subtract ϵ to the y component of each point, and obtain the following up- and down-shifted data sets, respectively:

(up-shifted set)    A = [X  Y + eϵ]
(down-shifted set)  B = [X  Y − eϵ]    (32)

where e is a vector of ones of appropriate dimension. We next assign the labels +1 and −1 to every point of A and B, respectively, and thus form a classification problem in R^{n+1} with 2N points. Then the problem of constructing an ϵ-insensitive regressor is equivalent to linearly separating the two data sets. In [7], Strohmann and Grudic have shown that, for a given regression training set, a regressor y = w^T x + b is an ϵ-insensitive regressor if and only if the sets A and B lie on different sides of the (n+1)-dimensional hyperplane w^T x − y + b = 0. Now we build the following two hyperplanes, w̃1^T x + η1 y + b1 = 0 and w̃2^T x + η2 y + b2 = 0, by solving

max_{α, w̃1, b1}  α − (1/2) ∥w̃1∥²
s.t.  inf_{x1∼(µ1,Σ1)} Pr{w̃1^T x + η1 (y + ϵ) + b1 ≥ 0} ≥ α,    (33)
      −(X w̃1 + η1 (Y − ϵ) + b1) ≥ e

and

max_{β, w̃2, b2}  β − (1/2) ∥w̃2∥²
s.t.  inf_{x2∼(µ2,Σ2)} Pr{w̃2^T x + η2 (y − ϵ) + b2 ≤ 0} ≥ β,    (34)
      X w̃2 + η2 (Y + ϵ) + b2 ≥ e

where µ1^T = (1/N) Σ_{i=1}^N A_{i·}, Σ1 = (1/N)(A − e µ1^T)^T (A − e µ1^T), µ2^T = (1/N) Σ_{i=1}^N B_{i·} and Σ2 = (1/N)(B − e µ2^T)^T (B − e µ2^T).

Remark 4. It is easy to see that (33) and (34) are similar in form to (15) and (16). TMPMR is similar in spirit to TMPMC because it also derives a pair of non-parallel planes around the data points. However, there are some differences in nature. First, the goals of TMPMR and TMPMC are different: TMPMR aims to find a suitable regression function, whereas TMPMC builds a classifier. Second, TMPMC learns two non-parallel hyperplanes, making each plane closer to one of the two classes and as far away from the other as possible, whereas the TMPMR pair finds the ϵ-insensitive up-bound and down-bound functions for the final regressor.

Let w1 = [w̃1; η1] and w2 = [w̃2; η2]. Based on Theorem 1, we can get

max_{α, w1, b1}  α − (1/2) ∥w1∥²
s.t.  w1^T µ1 + b1 ≥ k1 √(w1^T Σ1 w1),    (35)
      −(Bw1 + e b1) ≥ e

and

max_{β, w2, b2}  β − (1/2) ∥w2∥²
s.t.  −w2^T µ2 − b2 ≥ k2 √(w2^T Σ2 w2),    (36)
      Aw2 + e b2 ≥ e

where k1 = √(α / (1 − α)) and k2 = √(β / (1 − β)) are increasing functions of α and β, respectively.

By introducing two new variables t1, t2 and the two constraints ∥w1∥ ≤ t1 and √(w1^T Σ1 w1) ≤ t2, the formulation (35) can be cast as the following optimization problem:

min_{w1, b1, t1, t2}  t2 + (1/2) t1
s.t.  ∥w1∥ ≤ t1,
      √(w1^T Σ1 w1) ≤ t2,    (37)
      −(Bw1 + e b1) > e,
      w1^T µ1 + b1 ≥ 1

Similarly to (35), (36) can be rewritten as

min_{w2, b2, q1, q2}  q2 + (1/2) q1
s.t.  ∥w2∥ ≤ q1,
      √(w2^T Σ2 w2) ≤ q2,    (38)
      Aw2 + e b2 > e,
      −w2^T µ2 − b2 ≥ 1
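The construction in (32) and the SOCP pair (37)–(38) suggest a simple training pipeline: shift the targets by ±ϵ, append the response as an extra feature, and reuse a TMPMC-style SOCP solver on the shifted sets. The sketch below illustrates this idea; it reuses the hypothetical _solve_plane helper from the earlier classification sketch, and the averaging in predict_tmpmr follows the spirit of the nonlinear regressor (46) rather than reproducing the authors' code.

```python
import numpy as np

def fit_tmpmr(X, y, eps):
    """Illustrative linear TMPMR: build the shifted sets (32) and solve the SOCP pair (37)-(38)."""
    A = np.hstack([X, (y + eps).reshape(-1, 1)])    # up-shifted set, labelled +1
    B = np.hstack([X, (y - eps).reshape(-1, 1)])    # down-shifted set, labelled -1
    mu1, mu2 = A.mean(axis=0), B.mean(axis=0)
    S1 = (A - mu1).T @ (A - mu1) / A.shape[0]
    S2 = (B - mu2).T @ (B - mu2) / B.shape[0]
    w1, b1 = _solve_plane(B, mu1, S1, sign=+1)      # problem (37)
    w2, b2 = _solve_plane(A, mu2, S2, sign=-1)      # problem (38)
    return (w1[:-1], w1[-1], b1), (w2[:-1], w2[-1], b2)

def predict_tmpmr(X, plane1, plane2):
    """Average the two shifted functions, in the spirit of Eq. (46)."""
    (w1, eta1, b1), (w2, eta2, b2) = plane1, plane2
    f1 = -(X @ w1 + b1) / eta1
    f2 = -(X @ w2 + b2) / eta2
    return 0.5 * (f1 + f2)
```

Here the last component of each solution vector plays the role of η, which is generally nonzero as discussed below.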
4.2. Nonlinear TMPMR

To generalize the above linear TMPMR to the nonlinear case, we consider the kernel-generated surfaces K(x^T, X^T) w̃1 + η1 y + b1 = 0 and K(x^T, X^T) w̃2 + η2 y + b2 = 0, and the regression problem on the data set Cφ = [K(X, X^T) Y]. The aim is again to construct an ϵ-insensitive regressor. Similarly, we add and subtract ϵ to the y component of each point and obtain the following up- and down-shifted data sets, respectively:

(up-shifted set)    Aφ = [K(X, X^T)  Y + eϵ]
(down-shifted set)  Bφ = [K(X, X^T)  Y − eϵ]    (39)

We define

µφ1^T = (1/N) Σ_{i=1}^N Aφ_{i·},   Σφ1 = (1/N)(Aφ − e µ1^T)^T (Aφ − e µ1^T),
µφ2^T = (1/N) Σ_{i=1}^N Bφ_{i·},   Σφ2 = (1/N)(Bφ − e µ2^T)^T (Bφ − e µ2^T).

We obtain the fitting function by solving the following two classification problems:

max_{w̃1, b1}  α − (1/2) ∥w̃1∥²
s.t.  Pr{K(x^T, X^T) w̃1 + η1 (y + ϵ) + b1 ≥ 0} ≥ α,    (40)
      −(K(X, X^T) w̃1 + η1 (Y − ϵ) + b1) ≥ e

and

max_{w̃2, b2}  β − (1/2) ∥w̃2∥²
s.t.  Pr{K(x^T, X^T) w̃2 + η2 (y − ϵ) + b2 ≤ 0} ≥ β,    (41)
      K(X, X^T) w̃2 + η2 (Y + ϵ) + b2 ≥ e

Let w1 = [w̃1; η1] and w2 = [w̃2; η2]. Similar to the linear case, we have

max_{α, w1, b1}  α − (1/2) ∥w1∥²
s.t.  w1^T µφ1 + b1 ≥ κ1 √(w1^T Σφ1 w1),    (42)
      −(Bφ w1 + e b1) ≥ e

and

max_{β, w2, b2}  β − (1/2) ∥w2∥²
s.t.  −w2^T µφ2 − b2 ≥ κ2 √(w2^T Σφ2 w2),    (43)
      Aφ w2 + e b2 ≥ e

where κ1 = √(α / (1 − α)) and κ2 = √(β / (1 − β)).

Introducing the four variables s1, s2, z1 and z2 into (42) and (43), respectively, we can get the following two optimization problems through algebraic operations:

min_{s1, z1, w1, b1}  s1 + (1/2) z1
s.t.  ∥w1∥ ≤ z1,
      √(w1^T Σφ1 w1) ≤ s1,    (44)
      −(Bφ w1 + e b1) ≥ e,
      w1^T µφ1 + b1 ≥ 1

and

min_{s2, z2, w2, b2}  s2 + (1/2) z2
s.t.  ∥w2∥ ≤ z2,
      √(w2^T Σφ2 w2) ≤ s2,    (45)
      Aφ w2 + e b2 ≥ e,
      −w2^T µφ2 − b2 ≥ 1

It is obvious that the above optimization problems (44) and (45) can be solved by SOCP solvers, e.g., SeDuMi [34].

By solving (44) and (45), we can obtain the optimal values w̃1*, η1*, w̃2*, η2*. Note that, as discussed in [7], we generally have η1* ≠ 0 and η2* ≠ 0. Then we have a pair of shifted functions f1(x) = −(1/η1*)(K(x^T, X^T) w̃1* + b1*) and f2(x) = −(1/η2*)(K(x^T, X^T) w̃2* + b2*). Therefore, we obtain the regressor as

f(x) = (1/2)(f1(x) + f2(x)) = −(1/2) K(x^T, X^T)( w̃1*/η1* + w̃2*/η2* ) − (1/2)( b1*/η1* + b2*/η2* ).    (46)

Remark 5. Since TMPMR also solves a pair of smaller SOCPs, it has the same computational complexity as TMPMC. For details, please refer to Section 3.3.2. In addition, we show in Fig. 4 how the TMPMR works.

Fig. 4. TMPMR algorithm framework.

5. Experiments

To appraise the performance of the proposed TMPMC and TMPMR, we conduct a sequence of numerical simulations on diverse datasets. We perform ten-fold cross validation on all datasets involved. Namely, each dataset is randomly divided into ten subsets; nine of them are used for training and the remaining one for testing. This procedure is repeated ten times, and the mean of the ten test results serves as the performance measure. All algorithms are implemented in MATLAB 2014a on Windows 10, running on a PC with an Intel(R) Core(TM) i7-8700 processor (3.20 GHz) and 16 GB of RAM.

5.1. Experimental results of the classification algorithms

In this section, we first perform a series of numerical experiments on nine UCI datasets to verify the performance of the proposed TMPMC. Second, we perform a statistical analysis of the performance of the algorithms involved on the nine UCI datasets.

5.1.1. Experiment on UCI datasets

To evaluate the performance of the proposed TMPMC, we systematically compare it with other state-of-the-art methods, including TSVM (http://www.optimal-group.org/Resources/Code/TWSVC.html) [29], TWMPM [28], MPM [1] and TMPELM [32]. To measure the classification performance of all algorithms, the traditional accuracy index (ACC) is used, defined as

ACC = (TP + TN) / (TP + FN + TN + FP)    (47)

where TP and TN denote true positives and true negatives, and FN and FP denote false negatives and false positives, respectively. Time denotes the total training time.

To check the validity of the proposed TMPMC, we conduct numerical simulations on various datasets, including nine benchmark datasets from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets.html) (see Table 2).
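As a concrete illustration of the evaluation protocol (ten-fold cross-validation with accuracy as in Eq. (47)), a minimal sketch is given below; the classifier interface (fit / predict) and the use of scikit-learn's KFold are assumptions for illustration, not part of the original MATLAB experiments.

```python
import numpy as np
from sklearn.model_selection import KFold

def cross_validated_accuracy(model, X, y, n_splits=10, seed=0):
    """Ten-fold CV: train on nine folds, test on the held-out fold, average ACC (Eq. (47))."""
    accs = []
    for train_idx, test_idx in KFold(n_splits, shuffle=True, random_state=seed).split(X):
        model.fit(X[train_idx], y[train_idx])
        y_pred = model.predict(X[test_idx])
        accs.append(np.mean(y_pred == y[test_idx]))   # equals (TP+TN)/(TP+FN+TN+FP)
    return np.mean(accs), np.std(accs)
```

The mean and standard deviation returned by this routine correspond to the ACC ± S values reported in the tables below.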

Fig. 5. Learning efficiency of TMPMC, TWMPM, TSVM, MPM and TMPELM in nonlinear cases.

Table 2
Description of the UCI datasets.

Datasets       Number of samples   Dimensionality   Positive:negative
Breast cancer  699                 9                1:1.90
Australian     690                 14               1:0.80
Diabetes       1151                19               1:1.13
Spam           4601                57               1:1.54
German         1000                24               1:2.33
WDBC           569                 30               1:1.68
QSAR           1055                41               1:1.96
Banknote       1372                4                1:1.25
Pima           768                 8                1:1.87

In our experiments, we use the linear kernel and the Radial Basis Function (RBF) kernel K(xi, xj) = exp{−∥xi − xj∥² / (2σ²)}. All parameters are selected by the grid search method. As is well known, grid search optimizes the performance of a model by traversing a given combination of parameters. In our experiments, the parameters of all algorithms are set as follows: the penalty parameters C1, C2 are selected from {2^i | i = −6, −5, ..., 5, 6}, and the RBF kernel parameter σ is selected from {2^i | i = −5, −4, −3, ..., 3, 4, 5}. In addition, we perform ten-fold cross validation on each considered dataset, where the data is randomly split into ten subsets and one subset is reserved as the testing set. This process is repeated ten times, and the average of the results is used as the performance measure. The outcomes of these algorithms, reported in terms of ACC means and standard deviations, are listed in Tables 3 and 4. All experimental results presented in Tables 3 and 4 are based on the optimal parameters. The analysis of the experimental results is as follows.

From Tables 3 and 4, we can see that the proposed TMPMC achieves good classification accuracy in both the linear and nonlinear cases relative to the other four algorithms. In addition, we find that the classification accuracy of all algorithms involved is further improved in the nonlinear case compared to the linear case. Further, we can also observe that the classification accuracy of the proposed TMPMC is superior to MPM and TWMPM on all data sets. In both the linear and nonlinear cases, however, the classification accuracy of the proposed TMPMC is lower than that of TSVM on the Breast cancer data set.

In order to test the learning efficiency of the proposed TMPMC, we performed experiments on the nine UCI data sets; the experimental results, based on the optimal parameters, are presented in Fig. 5. Fig. 5 shows the learning time of MPM, TWMPM, TSVM, TMPMC and TMPELM with the RBF kernel on the UCI datasets, where the x-axis denotes the different datasets and the y-axis is the time (s) of each method. From Fig. 5, we can see that TSVM has a better learning time on all data sets than the other four algorithms. We can also observe that in most cases the proposed TMPMC is better than MPM, while it has no obvious advantage compared with TWMPM and TMPELM.

5.1.2. Statistical analysis

To compare the significant differences between the five algorithms, we use statistical tests in both the linear and nonlinear cases. The Friedman test with the corresponding post hoc test [33] is a simple, safe, and robust nonparametric test that is commonly used for statistical comparison. The average ranks of the five algorithms on all considered data sets are shown in Table 5.

First, we compare the five algorithms on the nine UCI data sets in the linear case. Let k and N be the numbers of the involved algorithms and the employed datasets, respectively; clearly k = 5 and N = 9. Based on the average ranks displayed in Table 5, we can calculate the Friedman statistic

χF² = (12N / (k(k + 1))) [ Σ_j Rj² − k(k + 1)²/4 ] = 33.22,

which is distributed according to the χ² distribution with (k − 1) degrees of freedom, where Rj is the average rank of the jth algorithm, and

FF = ((N − 1) χF²) / (N(k − 1) − χF²) = 95.60,

which is distributed according to the F-distribution with (k − 1) and (k − 1)(N − 1) degrees of freedom.
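For reference, the Friedman statistic, its F approximation and the Nemenyi critical difference used in this analysis can be reproduced with a few lines of Python; the average ranks below are those reported later in Table 5 for the linear case, and the value 2.728 of q at the 0.05 level for five algorithms is taken from standard Nemenyi tables.

```python
import numpy as np

def friedman_nemenyi(avg_ranks, n_datasets, q_alpha=2.728):
    """Friedman chi-square, its F approximation, and the Nemenyi critical difference."""
    k = len(avg_ranks)                 # number of algorithms
    N = n_datasets                     # number of datasets
    chi2_f = 12.0 * N / (k * (k + 1)) * (np.sum(np.square(avg_ranks)) - k * (k + 1) ** 2 / 4.0)
    f_f = (N - 1) * chi2_f / (N * (k - 1) - chi2_f)
    cd = q_alpha * np.sqrt(k * (k + 1) / (6.0 * N))
    return chi2_f, f_f, cd

# Linear-case average ranks (TMPMC, TWMPM, TSVM, MPM, TMPELM) from Table 5
print(friedman_nemenyi([1.33, 4.67, 2.22, 4.56, 2.22], n_datasets=9))
# -> approximately (33.22, 95.60, 2.03)
```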

Table 3
Performance comparison of TMPMC, TWMPM, TSVM, MPM and TMPELM with linear kernel.
TMPMC TWMPM TSVM MPM TMPELM
Datasets ACC± S(%) ACC± S(%) ACC±S(%) ACC± S(%) ACC± S(%)
Times (s) Times (s) Times (s) Times (s) Times (s)
Australian 88.93 ± 2.25 86.74 ± 2.31 87.84 ± 2.23 86.18 ± 3.12 88.74 ± 2.02
German 86.24 ± 3.27 83.61 ± 3.22 85.98 ± 3.12 72.09 ± 2.73 85.83 ± 1.54
Breast cancer 96.47 ± 2.16 94.38 ± 2.13 96.56 ± 2.13 95.11 ± 1.11 96.08 ± 1.03
WDBC 96.88 ± 1.11 95.67 ± 1.27 96.77 ± 1.32 96.74 ± 0.56 97.38 ± 0.16
Spam 89.98 ± 1.16 87.77 ± 1.41 88.79 ± 1.29 86.21 ± 1.17 89.85 ± 1.05
Pima 77.09 ± 2.25 73.14 ± 2.23 76.56 ± 2.01 76.78 ± 1.41 76.89 ± 0.52
QSAR 88.75 ± 1.56 85.68 ± 1.52 87.44 ± 1.21 86.24 ± 1.49 88.31 ± 1.78
Banknote 89.16 ± 1.71 84.81 ± 1.83 88.99 ± 1.33 86.85 ± 1.45 89.75 ± 0.94
Diabetes 63.22 ± 1.16 61.09 ± 1.18 62.38 ± 1.21 60.07 ± 1.83 61.39 ± 1.25

Table 4
Performance comparison of TMPMC, TWMPM, TSVM, MPM and TMPELM with RBF kernel.
TMPMC TWMPM TSVM MPM TMPELM
Datasets ACC± S(%) ACC± S(%) ACC±S(%) ACC± S(%) ACC± S(%)
Times (s) Times (s) Times (s) Times (s) Times (s)
Australian 90.18 ± 1.32 87.17 ± 1.27 88.43 ± 1.34 86.85 ± 2.27 90.34 ± 0.33
German 88.16 ± 2.11 85.39 ± 2.16 86.77 ± 1.18 78.48 ± 2.16 86.98 ± 1.46
Breast cancer 97.47 ± 1.33 96.62 ± 1.25 98.07 ± 1.26 96.83 ± 1.06 97.63 ± 0.71
WDBC 99.18 ± 0.65 96.39 ± 0.71 97.84 ± 0.68 97.68 ± 0.31 99.17 ± 0.11
Spam 90.12 ± 1.33 88.43 ± 1.28 89.34 ± 1.29 87.57 ± 1.22 90.20 ± 1.11
Pima 78.11 ± 1.35 75.63 ± 1.29 77.71 ± 1.36 77.24 ± 1.02 77.95 ± 0.79
QSAR 89.61 ± 2.45 86.18 ± 2.48 88.13 ± 2.38 87.56 ± 2.01 89.50 ± 1.60
Banknote 91.51 ± 1.21 87.32 ± 1.17 89.78 ± 1.19 87.54 ± 1.21 91.02 ± 0.69
Diabetes 66.27 ± 0.76 62.18 ± 0.74 63.37 ± 0.78 61.55 ± 0.61 63.92 ± 0.53

Thus, for α = 0.05, we obtain the critical value F0.05(4, 32) = 2.92. Since FF > Fα, we reject the null hypothesis.

Next, the Nemenyi post hoc test [33] is exploited to further compare the five algorithms in pairs. Based on the Studentized range statistic divided by √2, we have q(α=0.05) = 2.728 and the critical difference (CD)

CD = q(α=0.05) √(k(k + 1) / (6N)) = 2.728 × √(5(5 + 1) / (6 × 9)) = 2.03.

Thus, if the average ranks of two algorithms differ by at least CD, their performance is significantly different. From Table 5, we can derive the differences between the proposed TMPMC and the other algorithms in the linear case as follows:

Ξ(TWMPM − TMPMC) = 4.67 − 1.33 = 3.34 > 2.03
Ξ(TSVM − TMPMC) = 2.22 − 1.33 = 0.89 < 2.03
Ξ(MPM − TMPMC) = 4.56 − 1.33 = 3.23 > 2.03
Ξ(TMPELM − TMPMC) = 2.22 − 1.33 = 0.89 < 2.03

where Ξ(·) denotes the difference between two algorithms. Based on the above results, we can conclude that the performance of the proposed TMPMC is significantly better than that of TWMPM and MPM, while TMPMC, TSVM and TMPELM show no significant difference.

Similarly to the linear case, based on the average ranks reported in Table 5 for the nonlinear case, we get:

Ξ(TWMPM − TMPMC) = 4.78 − 1.22 = 3.56 > 2.03
Ξ(TSVM − TMPMC) = 2.56 − 1.22 = 1.34 < 2.03
Ξ(MPM − TMPMC) = 4.33 − 1.22 = 3.11 > 2.03
Ξ(TMPELM − TMPMC) = 2.11 − 1.22 = 0.89 < 2.03

Table 5
Average ranks of five algorithms in both linear and nonlinear cases.

Case       TMPMC   TWMPM   TSVM   MPM    TMPELM
Linear     1.33    4.67    2.22   4.56   2.22
Nonlinear  1.22    4.78    2.56   4.33   2.11

We observe that the performance of the proposed TMPMC is significantly better than that of TWMPM and MPM, with an obvious difference between them, while no significant differences are found between TMPMC, TSVM and TMPELM.

Through the above analysis of the experimental results, we can draw the objective conclusion that the proposed TMPMC improves on MPM in terms of classification performance and learning efficiency.

5.2. Experimental results of the regression algorithms

In this section, we first give the experimental setup in Section 5.2.1. Secondly, in Section 5.2.2, we perform experiments on artificial datasets. Finally, we select nine UCI datasets and perform experiments on them in Section 5.2.3.

5.2.1. Experimental setup

In order to effectively evaluate the proposed TMPMR, we selected three excellent algorithms, MPMR [7], SVR [36] and TSVR [35], for comparative experiments. The Radial Basis Function (RBF) kernel K(xi, xj) = exp{−∥xi − xj∥² / (2δ²)} is used in our experiments.

To evaluate the performance of the proposed algorithm, the evaluation criteria are specified before presenting the experimental results. The total number of testing samples is denoted by N, yi denotes the real value of a sample point xi, ŷi denotes the predicted value of xi, and ȳ is the mean of y1, y2, ..., yN. We use the following criteria for algorithm comparison.

(1) MAE: Mean absolute error, a popular measurement of the deviation between the real and predicted values, defined as

MAE = (1/N) Σ_{i=1}^N |yi − ŷi|.

(2) RMSE: Root mean squared error, defined as

RMSE = √( (1/N) Σ_{i=1}^N (yi − ŷi)² ).

(3) SSE/SST:

SSE/SST = Σ_{i=1}^N (yi − ŷi)² / Σ_{i=1}^N (yi − ȳ)²

(4) SSR/SSE:

SSR/SSE = Σ_{i=1}^N (ŷi − ȳ)² / Σ_{i=1}^N (yi − ŷi)²

Here, SSE is the sum of the squared errors on the test set, defined as SSE = Σ_{i=1}^N (yi − ŷi)². SSE indicates the fitting precision: the smaller SSE is, the better the fit. However, a value of SSE that is too small may indicate over-fitting of the regressor. SST is the total squared deviation of the test samples, defined as SST = Σ_{i=1}^N (yi − ȳ)². SSR is the total squared deviation that can be explained by the estimator, defined as SSR = Σ_{i=1}^N (ŷi − ȳ)². SSR reflects the explanatory ability of the regressor: the larger SSR is, the more statistical information it captures from the testing samples.

Table 6
Comparisons of our TMPMR with the corresponding TSVR, SVR and MPMR on artificial datasets.
Data (size) Models MAE RMSE
SVR 0.00620298 ± 0.00242762 0.00725217 ± 0.00269158
Type A MPMR 0.00735951 ± 0.00219177 0.00843847 ± 0.00364198
TSVR 0.00603797 ± 0.00376251 0.00692903 ± 0.00570901
TMPMR 0.00497492 ± 0.00236007 0.00583626 ± 0.00405734
SVR 0.0163501 ± 0.00542138 0.0194081 ± 0.00595115
Type B MPMR 0.0153401 ± 0.00660441 0.0188546 ± 0.00114481
TSVR 0.0125476 ± 0.00259486 0.0174104 ± 0.00316339
TMPMR 0.012276 ± 0.00329066 0.0160635 ± 0.00402451
SVR 0.0361966 ± 0.0229129 0.0380207 ± 0.0219081
Type C TSVR 0.014189 ± 0.00746775 0.0160624 ± 0.00659519
MPMR 0.025479 ± 0.0525161 0.0672296 ± 0.145493
TMPMR 0.0140129 ± 0.0222036 0.0172439 ± 0.0274847
SVR 0.0405993 ± 0.0206293 0.0448267 ± 0.0196752
Type D TSVR 0.02144 ± 0.00629705 0.0266254 ± 0.0065397
MPMR 0.0256028 ± 0.0526975 0.0672499 ± 0.145498
TMPMR 0.0194954 ± 0.00491478 0.0240573 ± 0.00587822
Data (size) Models SSE/SST SSR/SST
SVR 3.11833 ± 1.74766 4.11833 ± 1.74766
Type A MPMR 5.67050 ± 1.67078 2.67051 ± 1.07078
TSVR 3.85178 ± 1.21885 4.15772 ± 1.10728
TMPMR 2.11059 ± 1.85603 2.31958 ± 1.04152
SVR 0.951802 ± 0.26782 1.95180 ± 0.267826
Type B MPMR 0.648239 ± 0.354220 1.48239 ± 0.175422
TSVR 0.532819 ± 0.295008 1.28714 ± 0.512702
TMPMR 0.458593 ± 0.29199 1.26581 ± 0.611772
SVR 23.0195 ± 30.058 24.0195 ± 30.058
Type C TSVR 2.52339 ± 2.7628 2.55488 ± 2.605
MPMR 6.59029 ± 6.69421 6.17339 ± 7.07098
TMPMR 16.5949 ± 48.3551 14.9915 ± 45.3217
SVR 4.53641 ± 5.14677 5.53641 ± 5.14677
Type D TSVR 0.840627 ± 0.615775 1.56124 ± 0.578731
MPMR 10.7712 ± 12.4511 9.79583 ± 12.507
TMPMR 0.626133 ± 0.422497 1.11503 ± 0.349133
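For reference, the four criteria defined in Section 5.2.1 can be computed directly from the true and predicted test targets; the short sketch below is a straightforward NumPy implementation of those definitions (the function name regression_criteria is illustrative).

```python
import numpy as np

def regression_criteria(y_true, y_pred):
    """MAE, RMSE, SSE/SST and SSR/SSE as defined in Section 5.2.1."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    y_bar = y_true.mean()
    sse = np.sum((y_true - y_pred) ** 2)      # fitting precision
    sst = np.sum((y_true - y_bar) ** 2)       # total deviation of the test samples
    ssr = np.sum((y_pred - y_bar) ** 2)       # deviation explained by the estimator
    mae = np.mean(np.abs(y_true - y_pred))
    rmse = np.sqrt(sse / len(y_true))
    return {"MAE": mae, "RMSE": rmse, "SSE/SST": sse / sst, "SSR/SSE": ssr / sse}
```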

It is well known that the performance of an algorithm depends to a large extent on the choice of parameters. We use the grid search method to select the best parameters. The parameter selection range for all algorithms in our experiments is {2^i | i = −10, −9, ..., 10}, except for the parameter ε. The parameter ε is selected from the range {2^τ (max{y} − min{y}) | τ = −10, −9, ..., 0}, where max{y} and min{y} are the maximum and minimum values of the output.

For all experiments, we used ten-fold cross-validation to evaluate the performance of the algorithms. That is, the data set is randomly divided into ten subsets; one of these subsets is used as the test set and the others as the training set, and the process is repeated ten times. The average of the ten test results is used as the final measure of the performance of each algorithm.

5.2.2. Experiments on artificial datasets

In this section, we use four 2D artificial datasets to illustrate the performance of the regression function generated by the TMPMR method. To verify the prediction performance of the proposed TMPMR, we generate the data sets from the following functions:

Function A: f1(x) = sin(x)/x if x ≠ 0, and 1 otherwise;

Function B: f2(x) = |(x − 1)/4| + |sin(π(1 + (x − 1)/4))| + 1.

In order to effectively reflect the performance of our algorithm TMPMR, we added various types of noise to the training samples, including Gaussian noise with zero mean and uniformly distributed noise.

Fig. 6. Predictions of SVR, TSVR, MPMR and TMPMR on Function A and Function B with different types of noise.

Specifically, we construct the following training samples:

Type A: g1(xi) = f1(xi) + ξi,  −4π ≤ xi ≤ 4π,  ξi ∼ U(0, 0.3);

Type B: g2(xi) = f1(xi) + ξi,  −4π ≤ xi ≤ 4π,  ξi ∼ N(0, 0.4²);

Type C: g3(xi) = f2(xi) + ξi,  −10 ≤ xi ≤ 10,  ξi ∼ U(0, 0.3);

Type D: g4(xi) = f2(xi) + ξi,  −10 ≤ xi ≤ 10,  ξi ∼ N(0, 0.4²).
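A minimal sketch of generating these four synthetic training sets is shown below; the sample count n_samples is an assumption (the paper does not state it), and numpy.random is used in place of whatever generator the original MATLAB experiments employed.

```python
import numpy as np

def sinc_like(x):                      # Function A: sin(x)/x, with value 1 at x = 0
    return np.where(x != 0, np.sin(x) / np.where(x == 0, 1, x), 1.0)

def func_b(x):                         # Function B
    return np.abs((x - 1) / 4) + np.abs(np.sin(np.pi * (1 + (x - 1) / 4))) + 1

def make_dataset(noise_type, n_samples=200, seed=0):
    rng = np.random.default_rng(seed)
    if noise_type in ("A", "B"):       # Types A and B use Function A on [-4*pi, 4*pi]
        x, f = rng.uniform(-4 * np.pi, 4 * np.pi, n_samples), sinc_like
    else:                              # Types C and D use Function B on [-10, 10]
        x, f = rng.uniform(-10, 10, n_samples), func_b
    noise = (rng.uniform(0, 0.3, n_samples) if noise_type in ("A", "C")
             else rng.normal(0, 0.4, n_samples))
    return x, f(x) + noise
```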
To obtain more objective experimental results, we normalized all datasets that participated in the experiments to the interval [0, 1], to reduce the differences between the characteristics of different samples. We performed a series of experiments with SVR, TSVR, MPMR and TMPMR on these four data sets. All experimental results, based on the optimal parameters, are presented in Fig. 6 and Table 6. Fig. 6 shows the fitting curves of the four algorithms on the four data sets. From Fig. 6 we can see that the proposed TMPMR is better than MPMR on all four data sets; however, over-fitting occurred on (c) Type C and (d) Type D. In order to make a fair comparison of these methods, we performed ten-fold cross-validation on these four artificial data sets. The experimental results are listed in Table 6, and we can see that our method is the best one. By analyzing the experimental results on the artificial datasets, we conclude that the proposed TMPMR is an effective regression algorithm.

5.2.3. Experiments on UCI datasets

For further evaluation, we test nine benchmark datasets: triazines, SlumpTest1, SlumpTest2, ENB2012data1, ENB2012data2, forestfires, Realestatevaluation, housing and concrete. The description of these data sets is listed in Table 7. We also normalize the samples to the interval [0, 1]. The results of the experiments are shown in Table 8, which reports the best kernel-based performance of all the algorithms considered on all data sets. For each performance metric and data set, the best result among all methods is highlighted in bold.

Table 7
UCI datasets description used in this article.

Datasets             Instances   Number of features (n)
Triazines            186         60
SlumpTest1           103         8
SlumpTest2           103         8
ENB2012data1         768         9
ENB2012data2         768         9
Forestfires          517         13
Realestatevaluation  414         5
Housing              506         13
Concrete             1030        8

Table 8
Comparisons of our TMPMR with the corresponding TSVR, SVR and MPMR on UCI datasets.
Datasets Models MAE RMSE SSE/SST SSR/SST
SVR 0.146895 ± 0.0397562 0.203239 ± 0.0521754 0.203909 ± 0.168131 1.16813 ± 0.203909
Triazines MPMR 0.0525751 ± 0.0427031 0.0713404 ± 0.0760383 0.433355 ± 0.305047 0.335508 ± 0.13252
(186 × 60) TSVR 0.131017 ± 0.0257415 0.176428 ± 0.0345463 0.637579 ± 0.221582 0.905581 ± 0.172685
TMPMR 0.134052 ± 0.0363837 0.176024 ± 0.0410261 0.766489 ± 0.185604 0.885894 ± 0.161877
SVR 0.220492 ± 0.120201 0.295072 ± 0.146288 0.338068 ± 0.448773 1.33807 ± 0.448773
SlumpTest1 MPMR 0.227384 ± 0.072021 0.352311 ± 0.0954211 0.343214 ± 0.423519 1.28942 ± 0.424672
(103 × 8) TSVR 0.191024 ± 0.0550345 0.203804 ± 0.0560071 0.961774 ± 0.756135 1.14869 ± 0.837252
TMPMR 0.180285 ± 0.0664819 0.224541 ± 0.0756505 0.848556 ± 0.792155 1.02184 ± 0.746856
SVR 0.243782 ± 0.091179 0.293561 ± 0.0934142 0.151107 ± 0.16873 1.15111 ± 0.16873
SlumpTest2 MPMR 0.229834 ± 0.0745111 0.299831 ± 0.0853133 0.142341 ± 0.17421 1.17414 ± 0.19231
(103 × 8) TSVR 0.1777 ± 0.0380878 0.185394 ± 0.040416 0.779847 ± 0.473819 0.750636 ± 0.568823;
TMPMR 0.174099 ± 0.0492334 0.21177 ± 0.0536715 0.821702 ± 0.454409 0.764794 ± 0.590482
SVR 0.0248683 ± 0.0211346 0.274372 ± 0.0240412 0.100522 ± 0.0858023 1.10052 ± 0.0858023
ENB2012data1 MPMR 0.0275231 ± 0.0421313 0.300132 ± 0.009834 0.116331 ± 0.0321434 1.12253 ± 0.0634103
(768 × 9) TSVR 0.0304985 ± 0.0072031 0.0353269 ± 0.00874308 0.990818 ± 0.0508249 0.0213302 ± 0.011548
TMPMR 0.0246604 ± 0.00996666 0.0306202 ± 0.01068 0.989421 ± 0.0332569 0.0148942 ± 0.00932672
SVR 0.0492312 ± 0.00841321 0.0594231 ± 0.00942621 0.9052412 ± 0.0956121 0.0713982 ± 0.0170832
ENB2012data2 MPMR 0.0574241 ± 0.0543241 0.0841311 ± 0.0594241 1.015341 ± 0.185231 0.202381 ± 0.269151
(768 × 9) TSVR 0.0460823 ± 0.00626864 0.0553824 ± 0.00897876 0.973679 ± 0.0570776 0.0556928 ± 0.0170832
TMPMR 0.0533367 ± 0.0444224 0.0718074 ± 0.0585417 1.00908 ± 0.11876 0.134447 ± 0.244459
SVR 0.0191807 ± 0.0125547 0.0392715 ± 0.0467163 Inf ± NaN Inf ± NaN
Forestfires MPMR 0.01903131 ± 0.0135413 0.0394624 ± 0.0406723 Inf ± NaN Inf ± NaN
(517 × 13) TSVR 0.0180373 ± 0.0132726 0.0365336 ± 0.0442525 Inf ± NaN Inf ± NaN
TMPMR 0.0175797 ± 0.0166177 0.0424634 ± 0.0473164 Inf ± NaN Inf ± NaN
SVR 0.0973117 ± 0.0102466 0.122805 ± 0.0147101 0.0132246 ± 0.016427 1.01322 ± 0.016427
Realestatevaluation MPMR 0.0741241 ± 0.01352167 0.113911 ± 0.0257291 0.01465952 ± 0.010351 1.05316 ± 0.0103511
(414 × 5) TSVR 0.0494648 ± 0.00476858 0.0689683 ± 0.0180032 0.706661 ± 0.154039 0.349311 ± 0.101601
TMPMR 0.0611162 ± 0.00687473 0.0948109 ± 0.0164103 1.19282 ± 0.37668 0.627383 ± 0.210086
SVR 0.157454 ± 0.0650776 0.200587 ± 0.0783326 1.19882 ± 1.70521 2.19882 ± 1.70521
Housing MPMR 0.121617 ± 0.0865784 0.187604 ± 0.162911 2.14764 ± 1.48696 2.14326 ± 0.485899
(506 × 13) TSVR 0.0822281 ± 0.0249698 0.110546 ± 0.0404735 1.42734 ± 0.770699 0.763571 ± 0.587579
TMPMR 0.0730728 ± 0.0307942 0.0987962 ± 0.048531 1.28852 ± 0.86076 0.660019 ± 0.660031
SVR 0.172719 ± 0.0609181 0.206708 ± 0.0621078 0.419547 ± 0.465175 1.41955 ± 0.465175
Concrete MPMR 0.109651 ± 0.085816 0.220831 ± 0.272817 1.728057 ± 0.276271 7.73645 ± 1.76073
(1030 × 8) TSVR 0.146974 ± 0.085484 0.225938 ± 0.27412 5.44621 ± 1.28512 5.08667 ± 13.8394
TMPMR 0.0831612 ± 0.0336246 0.110034 ± 0.041841 0.952391 ± 0.298453 0.484685 ± 0.527929

According to the numerical experimental results, we can see that the proposed TMPMR algorithm ranks better than the other three algorithms on six of the nine UCI benchmark data sets. In addition, our algorithm has comparable performance on the remaining three data sets relative to SVR, MPMR and TSVR. The reason for this is that our algorithm can exploit the distribution information (mean and covariance) of the samples. Further, we also found that TSVR and TMPMR have similar performance on multiple datasets, with TSVR even better than our TMPMR on ENB2012data2 and Realestatevaluation. At the same time, MPMR performs better on only one data set, and SVR also shows good performance, but it does not achieve the best performance on these nine UCI data sets.

6. Conclusions

In this paper, we propose a new classification method based on MPM and TSVM by introducing regularization terms, namely the regularized twin minimax probability machine classification (TMPMC). TMPMC not only inherits the advantages of MPM and TSVM, but also improves the classification performance of MPM. Like TSVM, TMPMC constructs a pair of non-parallel hyperplanes for the final classification. For each hyperplane, TMPMC attempts to minimize the worst-case (maximum) probability that a class of samples is misclassified while staying as far as possible from the other class. Compared with MPM, the proposed TMPMC has a relatively low computational burden, because TMPMC only needs to solve a pair of smaller-scale SOCPs whereas MPM needs to solve one larger-scale SOCP. An important advantage of TMPMC is that it can directly estimate bounds on the probability of accuracy by minimizing the maximum probability of misclassification, while avoiding the assumption of a distribution of the conditional density. Additionally, we extend TMPMC to the regression problem and propose the regularized twin minimax probability machine regression (TMPMR). TMPMR starts by posing regression as a binary classification problem, such that a solution to this single classification problem directly solves the original regression problem.
Finally, we extend the linear models of TMPMC and TMPMR to nonlinear models by using kernel techniques. Experimental results on synthetic data sets and benchmark data sets demonstrate the effectiveness of TMPMC and TMPMR.

The experimental results show that TMPMC has better classification accuracy on most datasets than MPM, TSVM, TWMPM and TMPELM. At the same time, we can also find that the classification performance of TMPMC is better than that of MPM on all data sets. However, TMPMC does not have a significant advantage over TSVM in learning time on all data sets. In addition, the regression experiments show that TMPMR achieves a good fitting effect compared with the other regression algorithms, which fully proves the effectiveness of TMPMR. Overall, the experimental results show that TMPMC and TMPMR greatly expand the application of MPM to pattern classification and regression problems, and provide a new direction for MPM research.

Although SOCP solvers can effectively solve TMPMC and TMPMR, there is no advantage in learning time relative to SVM-based algorithms. Therefore, in the future we hope to find a more effective way to solve TMPMC and TMPMR. Of course, the extension of TMPMC to data dimensionality reduction is also worth studying. In addition, imbalanced learning has become a new research topic in machine learning in recent years [37,38]. In future work, we will focus on the application of TMPMC to imbalanced learning, especially class-imbalanced learning tasks such as multi-class classification, online anomaly detection, abnormal activity recognition and semi-supervised classification.

CRediT authorship contribution statement

Jun Ma: Writing - review & editing. Jumei Shen: Conceptualization, Formal analysis, Methodology, Software, Data curation, Writing - original draft, Visualization, Investigation, Supervision, Validation, Project administration, Funding acquisition.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (Nos. 11601012, 11661002, 61562001), the Scientific Research Funds of North Minzu University (No. 2018XYZSX07), the First-Class Disciplines Foundation of Ningxia (No. NXYLXK2017B09), and the Advanced Intelligent Perception and Control Technology Innovative Team of NingXia.

Ethical approval

This paper does not contain any studies with human participants or animals performed by any of the authors.

Informed consent

Informed consent was obtained from all individual participants included in the study.

References

[1] G.R.G. Lanckriet, L.E. Ghaoui, C. Bhattacharyya, M.I. Jordan, Minimax probability machine, Adv. Neural Inf. Process. Syst. 37 (1) (2001) 192–200.
[2] J.K.C. Ng, Y. Zhong, S. Yang, A comparative study of minimax probability machine-based approaches for face recognition, Pattern Recognit. Lett. 28 (15) (2007) 1995–2002.
[3] Z. Deng, S. Wang, F.L. Chung, A minimax probabilistic approach to feature transformation for multi-class data, Appl. Soft Comput. 13 (1) (2013) 116–127.
[4] B. Jiang, Z. Guo, Q. Zhu, G. Huang, Dynamic minimax probability machine-based approach for fault diagnosis using pairwise discriminate analysis, IEEE Trans. Control Syst. Technol. 27 (2) (2017) 1–8.
[5] L. Yang, Y. Gao, Q. Sun, A new minimax probabilistic approach and its application in recognition the purity of hybrid seeds, CMES Comput. Model. Eng. Sci. 104 (6) (2015) 493–506.
[6] G.R.G. Lanckriet, L.E. Ghaoui, M.I. Jordan, Robust novelty detection with single-class MPM, in: International Conference on Neural Information Processing Systems, MIT Press, 2002, pp. 905–912.
[7] T. Strohmann, G.Z. Grudic, A formulation for minimax probability machine regression, Adv. Neural Inf. Process. Syst. (2003) 769–776.
[8] K. Huang, H. Yang, I. King, M.R. Lyu, L. Chan, The minimum error minimax probability machine, J. Mach. Learn. Res. 5 (2004) 1253–1286.
[9] B. Gu, X. Sun, V.S. Sheng, Structural minimax probability machine, IEEE Trans. Neural Netw. Learn. Syst. 28 (7) (2017) 1646–1656.
[10] K. Huang, H. Yang, I. King, M.R. Lyu, Learning classifiers from imbalanced data based on biased minimax probability machine, in: Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2004), IEEE, 2004.
[11] S. Song, Y. Gong, Y. Zhang, G. Huang, G.B. Huang, Dimension reduction by minimum error minimax probability machine, IEEE Trans. Syst. Man Cybern. Syst. 47 (1) (2017) 58–69.
[12] S. Cousins, J. Shawe-Taylor, High-probability minimax probability machines, Mach. Learn. 106 (6) (2017) 1–24.
[13] K. Yoshiyama, A. Sakurai, Laplacian minimax probability machine, Pattern Recognit. Lett. 37 (2014) 192–200.
[14] F. Alizadeh, D. Goldfarb, Second-order cone programming, Math. Program. 95 (2003) 3–51.
[15] F. Alvarez, J. López, H. Ramírez C., Interior proximal algorithm with variable metric for second-order cone programming: applications to structural optimization and support vector machines, Optim. Methods Softw. 25 (6) (2010) 859–881.
[16] C. Bhattacharyya, Second order cone programming formulations for feature selection, J. Mach. Learn. Res. 5 (2004) 1417–1433.
[17] C. Cortes, V. Vapnik, Support-vector networks, Mach. Learn. 20 (1995) 273–297.
[18] V. Vapnik, Statistical Learning Theory, John Wiley and Sons, 1998.
[19] M. Carrasco, J. López, S. Maldonado, A second-order cone programming formulation for nonparallel hyperplane support vector machine, Expert Syst. Appl. (2016).
[20] C. Bhattacharyya, Robust classification of noisy data using second order cone programming approach, in: International Conference on Intelligent Sensing and Information Processing, IEEE, 2004.
[21] S. Maldonado, J. López, Imbalanced data classification using second-order cone programming support vector machines, Pattern Recognit. 47 (5) (2014) 2070–2079.
[22] S. Maldonado, J. López, M. Carrasco, A second-order cone programming formulation for twin support vector machines, Appl. Intell. 45 (2) (2016) 265–276.
[23] J. López, S. Maldonado, Robust twin support vector regression via second-order cone programming, Knowl.-Based Syst. (2018).
[24] S. Maldonado, J. López, Ellipsoidal support vector regression based on second-order cone programming, Neurocomputing (2018).
[25] Q. Xiao, J. Dai, J. Luo, H. Fujita, Multi-view manifold regularized learning-based method for prioritizing candidate disease miRNAs, Knowl.-Based Syst. 175 (2019) 118–129.
[26] J. Saketha Nath, C. Bhattacharyya, Maximum margin classifiers with specified false positive and false negative error rates, in: Proceedings of the SIAM International Conference on Data Mining, 2007.
[27] S. Maldonado, M. Carrasco, J. López, Regularized minimax probability machine, Knowl.-Based Syst. 177 (2019) 127–135.
[28] Z.J. Xu, J.Q. Zhang, H.Y. Wang, Twin minimax probability machine for handwritten digit recognition, Int. J. Hybrid Inf. Technol. 8 (2) (2015) 31–40.
[29] Jayadeva, R. Khemchandani, S. Chandra, Twin support vector machines for pattern classification, IEEE Trans. Pattern Anal. Mach. Intell. 29 (5) (2007) 905–910.
[30] Y.H. Shao, C.H. Zhang, X.B. Wang, N.Y. Deng, Improvements on twin support vector machines, IEEE Trans. Neural Netw. 22 (6) (2011) 962–968.
[31] J. Zhao, Y. Xu, H. Fujita, An improved non-parallel universum support vector machine and its safe sample screening rule, Knowl.-Based Syst. 170 (2019) 79–88.
[32] J. Ma, L. Yang, Y. Wen, et al., Twin minimax probability extreme learning machine for pattern recognition, Knowl.-Based Syst. (2019) http://dx.doi.org/10.1016/j.knosys.2019.06.014.
[33] A.W. Marshall, I. Olkin, Multivariate Chebyshev inequalities, Ann. Math. Stat. 31 (4) (1960) 1001–1014.
[34] J.F. Sturm, Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones, Optim. Methods Softw. 11 (1–4) (1999) 625–653.
[35] X. Peng, TSVR: an efficient twin support vector machine for regression, Neural Netw. 23 (3) (2010) 365–372.
[36] H. Drucker, C. Burges, L. Kaufman, A. Smola, V. Vapnik, Support vector regression machines, in: Advances in Neural Information Processing Systems (NIPS), Vol. 9, MIT Press, 1997, pp. 155–161.
[37] J. Bi, C. Zhang, An empirical comparison on state-of-the-art multi-class imbalance learning algorithms and a new diversified ensemble learning scheme, Knowl.-Based Syst. 158 (2018) 81–93.
[38] C. Zhang, C. Liu, X. Zhang, G. Almpanidis, An up-to-date comparison of state-of-the-art classification algorithms, Expert Syst. Appl. 82 (2017) 128–150.
