EURO J Comput Optim (2015) 3:197–261
DOI 10.1007/s13675-015-0041-z

ORIGINAL PAPER

Global resolution of the support vector machine regression parameters selection problem with LPCC

Yu-Ching Lee¹ · Jong-Shi Pang² · John E. Mitchell³

Received: 1 August 2013 / Accepted: 26 June 2015 / Published online: 15 July 2015
© EURO - The Association of European Operational Research Societies 2015

Abstract Support vector machine regression is a robust data-fitting method that minimizes the sum of the deducted residuals of the regression, and is thus less sensitive to changes of the data near the regression hyperplane. Two design parameters, the insensitive tube size (εe) and the weight assigned to the regression error trading off the normed support vector (Ce), are selected by the user to gain better forecasts. The global training and validation parameter selection procedure for support vector machine regression can be formulated as a bi-level optimization model, which is equivalently reformulated as a linear program with linear complementarity constraints (LPCC). We propose a rectangle search global optimization algorithm to solve this LPCC. The algorithm exhausts the invariancy regions on the parameter plane (the (Ce, εe)-plane) without explicitly identifying the edges of the regions. The algorithm is tested on synthetic and real-world support vector machine regression problems with up to hundreds of data points, and its efficiency is compared with that of several other approaches. The global optimal parameters obtained are an important benchmark for every other selection of parameters.

Keywords Support vector machine regression · Parameter selection · Global optimal parameter · Mathematical program with complementarity constraints · Global optimization algorithm

Mathematics Subject Classification 90C26

✉ Yu-Ching Lee
ylee77@illinois.edu

1 University of Illinois at Urbana-Champaign, Champaign, USA
2 University of Southern California, Los Angeles, USA
3 Rensselaer Polytechnic Institute, Troy, USA

1 Introduction

The support vector machine (SVM) method was originally developed as a tool for data classification by Vapnik in 1964, and its use has been extended to regression since 1997 (Vapnik et al. 1997). The method has drawn much attention in the past 20 years because of the good prediction accuracy it provides in practical applications of data mining and machine learning. SVM regression, or SVR, has two design parameters that significantly affect its performance: (1) the size of the insensitivity zone, and (2) the regularization parameter that provides a trade-off between the absolute residual and the separation of the data. These design parameters of the SVR are commonly selected by employing a grid search with training and validation techniques. In a grid search, the box-constrained feasible region on the parameter plane is partitioned into a grid whose intersection points correspond to pairs of parameters. These pairs form a pool of candidates. Given a pre-determined partition of the data set, the SVR model with fixed parameters is trained on one or several sets of the training data and validated on the validation data. The parameters that lead to the least prediction error on the validation data are then the best choice in the pool.
The parameters selected by searching the grid, however, carry no guarantee of global optimality within the entire feasible region of parameters. A bi-level optimization formulation has been shown to resolve this shortcoming in the preceding work of Kunapuli (2008) and Yu (2011). The best parameter that can be found by the training and validation technique is simply the global optimum of a bi-level optimization model, whose two levels are referred to as the upper level and lower level in our later discussion.
In the bi-level parameter selection problem of the SVM regression, the lower-level
optimization problem is an SVM regression model that determines a mapping featur-
ing the normal vector (ws ) and the bias (bs ), given the size of the insensitivity zone (εe )
and the regularization parameter (Ce ). The SVM regression model is a strictly convex
quadratic problem, and its Karush–Kuhn–Tucker (KKT) condition is necessary and
sufficient for optimality. The lower-level problem, thus, can be equivalently reformu-
lated by the KKT condition as a linear complementarity problem (LCP). Making use
of this LCP reformulation, the semismooth method (Ferris and Munson 2004), succes-
sive overrelaxation method (Mangasarian and Musicant 1998), and the interior-point
method (Ferris and Munson 2002) are algorithms that have been studied for solving
SVM problems with large numbers of data points, and have yielded some good results.
(Problems with up to 60 million data points and 34 features are solved in Ferris and
Munson (2002, 2004). Problems with up to 10 million data points and 32 features
are solved in Mangasarian and Musicant (1998).) On the other hand, the upper-level
problem is an absolute deviation regression model subject to the lower-level optimization problem and the box constraints on its design parameters. There do exist algorithms for solving the general nonlinear bi-level optimization problem to global optimality, including the αBB-type method in Gumus and Floudas (2001), the branch-and-bound-type method in Bard and Moore (1990), and other methods reviewed in Floudas and Gounaris (2009). Nevertheless, algorithmic efficiency in obtaining a global


optimal solution to medium- and large-scale nonlinear bi-level problems remains a challenge.
This work is part of a recent series of research on selecting and validating the parameters of the SVR, which reformulates the bi-level model as a bi-parametric linear complementarity constrained program. Previous works in this series
include the studies presented in Kunapuli et al. (2006, 2008), Kunapuli (2008), Jing
et al. (2008), Yu (2011), Lee et al. (2014). A multi-fold cross-validated SVR parameter
selection model containing a “feature selection” scheme is considered in Kunapuli et al.
(2006, 2008), Kunapuli (2008). The benchmark for parameter selection in their work
is the performance of the selected SVR model on the “hold-out” set of testing data
points rather than the quality of the solution to the bi-level program. The numerical
results in Kunapuli et al. (2006) have shown that the bi-level programming approach is
resistant to overfitting. (Similar results about reducing overfitting when employing the
cross-validation technique can be seen in Cawley and Talbot (2010), Ng (1997), Arlot
and Celisse (2010).) In Yu (2011), a two-stage branch-and-cut algorithm is proposed for solving the linear program with complementarity constraints (LPCC) to global optimality. It is concluded there that if the lower bound of the objective value can be pushed closer to the upper bound in the preprocessing stage, then the global solution is identified efficiently in the second stage. The optimization algorithms proposed in
Jing et al. (2008), Lee et al. (2014) are designed for the general and the bi-parametric
forms of the LPCC, respectively.
The parameter selection of the kernel function, which introduces one or more additional parameters depending on the type of kernel, is also an important issue for the SVM. None of the above papers, including this work, has extended the bi-level optimal parameter selection scheme to the SVR with a kernel function. We refer
the interested readers to Sathiya Keerthi and Lin (2003), Carrizosa et al. (2014) for the
heuristic methods, Schittkowski (2005) for a mathematical programming approach,
and the approaches surveyed in Carrizosa and Morales (2013) about selecting the
Kernel parameters alone.
We develop algorithms that accommodate the structure of the SVM regression parameter selection problem. The main algorithm we propose searches for the optimal design parameters in the feasible region of the (Ce, εe)-plane. To do this, the concept of an “invariancy region” for the lower-level problem is crucial. An invariancy region is a region in the parameter
for the inner level problem is crucial. An invariancy region is a region in the parameter
space where the basis remains unchanged. Invariancy regions are convex, and they
partition the whole feasible region without overlapping. The searching and partitioning
scheme on (Ce , εe )-plane initiates the search from a fixed point (Ce , εe ), with which a
lower-level SVR problem with fixed parameters is solved, and proceeds by identifying
the invariancy interval along one chosen direction. A queue of the rectangular (Ce , εe )-
areas, which result from the partitioning, is maintained and searched one by one. For
each rectangular area, we either conclude that all the invariancy regions inside the
area have been examined and remove the rectangle from the queue, or we partition the
rectangular areas horizontally and/or vertically, and add the new rectangular subareas
to the queue while removing the original one. The algorithm terminates when all the rectangular areas in the queue have been eliminated. The solution obtained from this algorithm can be verified to be globally optimal. Different from other methods (Ghaffari-Hadigheh


et al. 2010; Columbano et al. 2009; Tondel et al. 2003) which also perform the search on
the parameter plane, the boundaries of invariancy regions are not explicitly identified
in our algorithm. Although revisiting a previously examined invariancy region is not
avoided, the effort of finding and storing the boundaries of invariancy regions is saved.
Since the algorithm involves exploring the invariancy regions, we expect the solution time to be proportional to the number of feasible invariancy regions. This conjecture is confirmed in the numerical experiments.
We propose a second way to solve the parameter training and validation for SVM
regression as an improved integer program (IP). The linear complementarity con-
strained program can be formulated as such an IP via the big-M technique. The valid
values of big-M for this specific application are derived from finding the upper bound
on the multipliers of the lower-level problem of SVM. Moreover, we propose a pro-
cedure to further tighten the upper bound on the multipliers. The tightened multiplier
upper bounds can reduce the feasible regions for the IP and has enabled us to improve
the running time for solving the IP by CPLEX (IBM ILOG CPLEX Optimizer 2010).
The IP that employs the improved values of big-M is what we refer to as the improved IP. The solutions produced by the improved IP and the (Ce, εe)-rectangle search mentioned in the previous paragraph are both globally optimal. However, the improved IP loses its efficiency, or even the ability to solve a problem at all, when the problem size exceeds some threshold. By monitoring the process of branch
and bound, we notice that the gap between the lower and upper bounds reduces very
slowly because the lower bound of the objective is hardly improved, whereas the upper
bound of the objective is usually tight.
The contributions of this paper can be seen from the mathematics, algorithms, and
applications points of view. The first contribution is in defining mathematically the
global optimum for the training and validation SVM regression parameter selection
problem, bridging the area of mathematical programming with machine learning. Such
a global optimal parameter and its corresponding regression residual can serve as a
benchmark for other parameters selected by heuristics, such as grid search. The second
contribution is in the development of the two approaches proposed to solve the problem
to global optimum: the (Ce , εe )-rectangle search algorithm that aims to take advantage
of existing efficient methods of solving the lower-level problem and an improved IP
model that relies on an existing IP solver. The algorithms are tested on instances with
single- and multi-fold training and validation data sets, including those generated by
us and those from the real world. A significant number of numerical experiments are
presented to uncover the strength and limit of each algorithm. We show that the level
of difficulty of the instances input to the (Ce, εe)-rectangle search algorithm can be classified by a four-quadrant diagram. Comparing the convergence times of instances in different quadrants is meaningless because they belong to different scales of difficulty. The running time of the instances in the same quadrant is proportional to the size of the data. The third contribution is in providing a way to evaluate other parameter
selection algorithms. We compare the global solution with the solution produced by a non-global commercial optimization solver. Although substantially more time is
needed by our algorithm (while the solver can produce a non-global optimal solution
in seconds), the solution quality is guaranteed.


So far, our methodology can only deal with 2-parameter instances with small training data sets on a single computer. Future research that integrates the methodology with computational techniques should have a good chance of reducing the convergence time and solving problems with larger training data sets. A straightforward extension of the methodology to instances with more than 2 parameters, however, is not possible.
The remaining part of this paper is organized as follows. Section 2 introduces the SVM classification and regression models and further derives the SVM regression parameter selection problem as an LPCC. Section 3 introduces the (Ce, εe)-rectangle
search algorithm, including the semismooth method for solving the lower-level prob-
lem of this bi-level parameter selection model, our definition of the invariancy region,
and the recording method of the geometrical allocation of data points. We will explain
how the algorithm searches for the invariancy regions on chosen sides of a rectangle, thus reducing the effort to that of searching for invariancy “intervals”, and how the algorithm verifies whether an input rectangular region on the (Ce, εe)-plane belongs to one or two invariancy regions. Section 4 describes the big-M tightening algorithm
that we employed to form the modified IP. Section 5 displays the numerical results
of the (Ce , εe )-rectangle search algorithm and the modified IP. These global optimal
solutions are compared with the local optimal solution obtained by KNITRO (Byrd
et al. 2006). The performance of the algorithms and the difficulty of the instances are
depicted and analyzed. Section 6 summarizes the paper and concludes the findings
from this research.

2 Models and properties

In Sect. 2.1, we review the parameters used in the soft margin SVM classification, hard
margin SVM classification, and SVM regression. The soft margin SVM classification
involves one parameter Ce , the hard margin SVM classification does not need any
parameters, and the SVM regression involves two parameters (Ce , εe ). In Sect. 2.2,
we introduce the global training and validation technique in selecting the parameters
for the SVM regression specifically. This technique can be formulated as an LPCC.
It can be seen that the multi-fold cross-validation method is a special case of our
formulation. Following this formulation, the meaning of the “optimal parameter” in
our framework is clarified.

2.1 The parameters in the SVM type of problems

The SVM model for classification is the original version of every SVM-related study. The SVM classifier is a hyperplane $w_s^T x_d^j + b_s = y_{d_j}$,¹ which classifies the data points


$(x_d^j, y_{d_j})$, $\forall j \in D$, where $x_d^j \in \mathbb{R}^K$, $y_{d_j} \in \{+1, -1\}$, and $D$ denotes the set of training data. The dimension of $x_d^j$ represents the number of features or characteristics that describe the data point $j \in D$, and the value of the variable $y_{d_j}$ indicates the group to which the data point $j$ belongs. A data point $j$ is said to be misclassified by the hyperplane $w_s^T x_d^j + b_s = y_{d_j}$ if the predicted $y_{d_j}$ is $+1$ yet the true $y_{d_j}$ is $-1$, or vice versa. A tube area around the target hyperplane $w_s^T x_d^j + b_s = 0$ is defined by the two parallel hyperplanes $w_s^T x_d^j + b_s = +1$ and $w_s^T x_d^j + b_s = -1$. In the basic setting of the SVM classification, the size of the tube (estimated by $2/\|w_s\|$) is maximized.

¹ The first-order subscript represents the role of the mathematical expression—subscript $d$: data; subscript $s$: support vector machine regression; subscript $e$: exogenous parameter to the support vector machine regression.
In a soft margin SVM classification, the desired hyperplane $(w_s, b_s)$ is the optimizer of the following program:
$$
\begin{array}{rl}
\displaystyle\min_{w_s,\,b_s,\,\xi} & C_e \displaystyle\sum_{j \in D} \xi_j + \frac{1}{2} w_s^T w_s \\[2mm]
\text{subject to} & w_s^T x_d^j + b_s \ge 1 - \xi_j, \quad \forall y_{d_j} = +1, \\[1mm]
& w_s^T x_d^j + b_s \le -1 + \xi_j, \quad \forall y_{d_j} = -1, \\[1mm]
\text{and} & \xi_j \ge 0,
\end{array}
\tag{1}
$$

where $C_e$ is a parameter trading off the two terms in the objective function, and $\xi$ is the vector of slacks. The hyperplane $(w_s, b_s)$ minimizes the sum of the distances between the misclassified observations and the closer bound of the tube, and also maximizes the margin size, $2/\|w_s\|$, between $w_s^T x_d^j + b_s = +1$ and $w_s^T x_d^j + b_s = -1$.
In contrast to the soft margin SVM, the hard margin SVM does not allow any misclassification. To formulate the hard margin SVM, the model in (1) is revised by removing the $C_e \sum_{j \in D} \xi_j$ term from the objective function and removing the $-\xi_j$ and $+\xi_j$ terms from the two constraints, respectively. In Kecman (2005), the hard margin version is referred to as the “linear maximal margin classifier for linearly separable data”, and the soft margin version as the “linear soft margin classifier for overlapping classes.”
Extended from the classification, the SVM regression identifies a hyperplane $y_{d_j} = w_s^T x_d^j + b_s$, where $y_{d_j}$ is a real-valued dependent variable. Besides the $C_e$ parameter, the other parameter, $\varepsilon_e$, defines the size of the tube in which the residual is neglected. Given $C_e$ and $\varepsilon_e$, the desired hyperplane minimizes the following problem:
$$
\min_{w_s,\,b_s}\ \left\{\, C_e \sum_{j=1}^{n} \max\big( |w_s^T x_d^j + b_s - y_{d_j}| - \varepsilon_e,\ 0 \big) + \frac{1}{2} w_s^T w_s \,\right\}.
\tag{2}
$$

The $\frac{1}{2} w_s^T w_s$ term in the objective of (2) is directly borrowed from its usage in (1). This term is also called a regularization term because it imposes strong convexity and forces the optimal $w_s$ to be unique. The other term in the objective is the (least) absolute residual outside the $\varepsilon_e$-insensitive tube, i.e., the tube area defined by the two parallel hyperplanes $w_s^T x_d^j + b_s - y_{d_j} = +\varepsilon_e$ and $w_s^T x_d^j + b_s - y_{d_j} = -\varepsilon_e$.
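To make the role of the two parameters concrete, the following is a minimal Python sketch that minimizes the lower-level objective (2) for a fixed pair (Ce, εe). The function name fit_svr and the use of a general-purpose derivative-free solver are assumptions for illustration only; they are not the dedicated methods discussed in Sect. 3.1.

```python
import numpy as np
from scipy.optimize import minimize

def fit_svr(X, y, C_e, eps_e):
    """Minimize the epsilon-insensitive objective (2) for a fixed (C_e, eps_e).

    X : (n, K) array of training features, y : (n,) array of targets.
    Returns the regression coefficients w_s and the bias b_s.
    A plain numerical sketch, not the semismooth method of Sect. 3.1.
    """
    n, K = X.shape

    def objective(theta):
        w, b = theta[:K], theta[K]
        residual = np.abs(X @ w + b - y)            # |w^T x_d^j + b - y_dj|
        hinge = np.maximum(residual - eps_e, 0.0)   # epsilon-insensitive part
        return C_e * hinge.sum() + 0.5 * w @ w      # objective (2)

    theta0 = np.zeros(K + 1)
    res = minimize(objective, theta0, method="Powell")  # derivative-free
    return res.x[:K], res.x[K]

# toy usage: one feature, ten synthetic points
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(10, 1))
y = 2.0 * X[:, 0] + 0.1 * rng.standard_normal(10)
w_s, b_s = fit_svr(X, y, C_e=1.0, eps_e=0.1)
print(w_s, b_s)
```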


2.2 LPCC formulation of the parameter training and validation

We consider $F$ folds of training and validation data. That is to say, we divide the training observations $(x_d^j, y_{d_j})$, $j = 1, \ldots, n$, into $F$ groups and the validation observations $(x_d^j, y_{d_j})$, $j = n+1, \ldots, n+m$, into another $F$ groups. The partition of the data set is assumed to be pre-determined and unchangeable.
Given any $(C_e, \varepsilon_e)$ pair, a hyperplane $(w_s^f, b_s^f)$ minimizing the objective function of the SVR is obtained by solving the model (2) for each fold of the training data. Using this trained hyperplane, the regression residuals of the validation data in each fold are then calculated. A small cumulative regression error on the validation data is desired, and this hope might be achieved by choosing a different $(C_e, \varepsilon_e)$ pair and repeating the aforementioned process. The procedure of sequentially choosing a $(C_e, \varepsilon_e)$ pair, solving for the optimal hyperplane on the training data, and computing the regression errors on the validation data is what we call the training and validation technique. Such a parameter selection process is nontrivial since there is no guarantee that a $C_e$ or $\varepsilon_e$ with a large value always produces a smaller regression error than that resulting from smaller parameter values.
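For comparison with the global approach developed in this paper, the conventional training-and-validation loop described above can be written as a plain grid search. The helper fit_svr is assumed to be a lower-level solver such as the sketch in Sect. 2.1, and the candidate grids are arbitrary.

```python
import numpy as np

def grid_search_svr(fit_svr, X_train, y_train, X_val, y_val, C_grid, eps_grid):
    """Try every (C_e, eps_e) pair on the grid and keep the pair with the
    smallest validation absolute error.  `fit_svr` is assumed to solve the
    lower-level problem (2) for fixed parameters and return (w_s, b_s)."""
    best = None
    for C_e in C_grid:
        for eps_e in eps_grid:
            w_s, b_s = fit_svr(X_train, y_train, C_e, eps_e)
            val_err = np.abs(X_val @ w_s + b_s - y_val).sum()  # validation objective
            if best is None or val_err < best[0]:
                best = (val_err, C_e, eps_e, w_s, b_s)
    return best

# example grids: C_grid = np.logspace(-2, 2, 9); eps_grid = np.linspace(0.0, 1.0, 11)
```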
Is there really a “best choice” of the parameters? We say that a set of parameters is “good” if it defines a model that forecasts the future with small errors. Since the future is not known, it is impossible to find the parameter that will yield a precise prediction. It should be clearly noted that the optimal parameters studied in this paper are in fact the best merely under the framework of training and validation with a fixed number of features and a fixed partition of the validation and training data sets.
The observations are arbitrarily labeled in each subset. We denote the first and the last index of the observations in the $f$th training data set as $\mathrm{front}^f$ and $\mathrm{end}^f$, respectively, and there are $n_f$ observations in the $f$th training data set. Similarly, the first and the last index of the observations in the $f$th validation data set are denoted as $\mathrm{front}_v^f$ and $\mathrm{end}_v^f$, respectively, and there are $m_f$ observations in the $f$th validation data set.
The training and validation process of the SVR parameter selection is formulated as follows:
$$
\begin{array}{cl}
\displaystyle\min_{C_e,\,\varepsilon_e,\,w_s^1,b_s^1,\,\ldots,\,w_s^F,b_s^F} & \displaystyle\sum_{f=1}^{F}\ \sum_{i=\mathrm{front}_v^f}^{\mathrm{end}_v^f} \big|\, (w_s^f)^T x_d^i + b_s^f - y_{d_i} \big| \\[2mm]
\text{subject to} & 0 \le \underline{C} \le C_e \le \overline{C}, \qquad 0 \le \underline{\varepsilon} \le \varepsilon_e \le \overline{\varepsilon}, \\[1mm]
\text{and } \forall f = 1,\ldots,F: & (w_s^f, b_s^f) \in \text{the solution set of the following problem} \\[1mm]
& \displaystyle\min_{w_s^f,\,b_s^f}\ \left\{ C_e \sum_{j=\mathrm{front}^f}^{\mathrm{end}^f} \max\big( |(w_s^f)^T x_d^j + b_s^f - y_{d_j}| - \varepsilon_e,\ 0 \big) + \frac{1}{2} (w_s^f)^T w_s^f \right\}.
\end{array}
\tag{3}
$$

It is known that the KKT conditions of the lower-level problems in (3) are sufficient for optimality because the lower-level problems are convex. The lower-level problems can thus be equivalently replaced by their KKT conditions, which are expressed as $F$ folds of the linear complementarity problem $\mathrm{LCP}^f_{\mathrm{SVR}}$ as follows:

$$
\forall f = 1,\ldots,F:\qquad
\mathrm{LCP}^f_{\mathrm{SVR}} :=
\left\{
\begin{array}{l}
w_s^f - \displaystyle\sum_{j=\mathrm{front}^f}^{\mathrm{end}^f} (\beta_j - \alpha_j)\, x_d^j = 0, \\[2mm]
0 = \displaystyle\sum_{j=\mathrm{front}^f}^{\mathrm{end}^f} \alpha_j - \sum_{j=\mathrm{front}^f}^{\mathrm{end}^f} \beta_j \ \perp\ b_s^f \ \text{free}, \\[2mm]
\forall j = \mathrm{front}^f,\ldots,\mathrm{end}^f: \\[1mm]
\quad 0 \le e_{s_j} + \varepsilon_e - (x_d^j)^T w_s^f - b_s^f + y_{d_j} \ \perp\ \alpha_j \ge 0, \\[1mm]
\quad 0 \le e_{s_j} + \varepsilon_e + (x_d^j)^T w_s^f + b_s^f - y_{d_j} \ \perp\ \beta_j \ge 0, \\[1mm]
\quad 0 \le C_e - \alpha_j - \beta_j \ \perp\ e_{s_j} \ge 0.
\end{array}
\right.
\tag{4}
$$
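A candidate solution of one fold can be checked against (4) directly. The sketch below, with illustrative array names, evaluates the stationarity and complementarity residuals, all of which should vanish at a solution.

```python
import numpy as np

def kkt_residuals(X, y, C_e, eps_e, w, b, alpha, beta, e):
    """Residuals of the LCP (4) for one training fold.

    Each complementarity 0 <= u  _|_  v >= 0 is scored componentwise by
    max(-u, -v, |u*v|); all returned values should be ~0 at a solution."""
    s1 = e + eps_e - X @ w - b + y          # slack paired with alpha
    s2 = e + eps_e + X @ w + b - y          # slack paired with beta
    s3 = C_e - alpha - beta                 # slack paired with e

    def comp(u, v):
        return np.maximum.reduce([-u, -v, np.abs(u * v)])

    return {
        "stationarity_w": np.abs(w - X.T @ (beta - alpha)).max(),
        "stationarity_b": abs(alpha.sum() - beta.sum()),
        "comp_alpha": comp(s1, alpha).max(),
        "comp_beta":  comp(s2, beta).max(),
        "comp_e":     comp(s3, e).max(),
    }
```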

Furthermore, the upper-level problem in (3) can be linearized by introducing the variables $p_i$, $\forall i = n+1, \ldots, n+m$. This standard linearization trick leads to the following LPCC:
$$
\begin{array}{cl}
\displaystyle\min_{\substack{C_e,\,\varepsilon_e,\,w_s^1,b_s^1,\ldots,w_s^F,b_s^F,\\ p_i,\,\alpha_j,\,\beta_j,\,e_{s_j}}} & \displaystyle\sum_{f=1}^{F}\ \sum_{i=\mathrm{front}_v^f}^{\mathrm{end}_v^f} p_i \\[3mm]
\text{subject to} & 0 \le \underline{C} \le C_e \le \overline{C}, \qquad 0 \le \underline{\varepsilon} \le \varepsilon_e \le \overline{\varepsilon}, \\[1mm]
& (x_d^i)^T w_s^1 + b_s^1 - y_{d_i} \le p_i, \quad \forall i = n+1, \ldots, n+m_1, \\[1mm]
& -(x_d^i)^T w_s^1 - b_s^1 + y_{d_i} \le p_i, \quad \forall i = n+1, \ldots, n+m_1, \\[1mm]
& (x_d^i)^T w_s^2 + b_s^2 - y_{d_i} \le p_i, \quad \forall i = n+m_1+1, \ldots, n+m_1+m_2, \\[1mm]
& -(x_d^i)^T w_s^2 - b_s^2 + y_{d_i} \le p_i, \quad \forall i = n+m_1+1, \ldots, n+m_1+m_2, \\[1mm]
& \qquad\vdots \\[1mm]
& (x_d^i)^T w_s^F + b_s^F - y_{d_i} \le p_i, \quad \forall i = \mathrm{front}_v^F, \ldots, \mathrm{end}_v^F, \\[1mm]
& -(x_d^i)^T w_s^F - b_s^F + y_{d_i} \le p_i, \quad \forall i = \mathrm{front}_v^F, \ldots, \mathrm{end}_v^F, \\[1mm]
& \text{and the constraints in (4).}
\end{array}
\tag{5}
$$

The optimal parameter is now well defined as the global optimal solution to the
LPCCs. Note that the F-fold cross-validation approach for the parameter selection is
a special case of the formulation (5) when the observations are repeatedly contained
in each subset of training and validation data. See the F-fold cross-validation LPCC
formulation in Kunapuli et al. (2008), Kunapuli (2008).


2.3 Non-uniqueness of bs and the selection of bs

While the optimal $w_s$ solution to problem (2) is unique, owing to the strict convexity imposed by the 2-norm regularization term $\frac{1}{2} w_s^T w_s$, the optimal $b_s$ solution of the SVR problem is not unique (see Burges and Crisp 1999 for exceptions). This fact implies that the direct use of the values $(w_s^f, b_s^f)$ obtained by solving the linear complementarity program (4) does not necessarily yield the minimum value of the upper-level objective
$$
\sum_{f=1}^{F}\ \sum_{i=\mathrm{front}_v^f}^{\mathrm{end}_v^f} \big|\,(w_s^f)^T x_d^i + b_s^f - y_{d_i}\big|.
\tag{6}
$$
The set of optimal $b_s^f$'s, which is an interval denoted by $[b^f_{\min},\, b^f_{\max}]$, should be obtained by the following procedure.
Given the unique optimal $w_s^f$ values and any single optimal value of $b_s^f$, the lower-level objective value $V^f$ corresponding to a fixed $(C_e, \varepsilon_e)$ is computed by
$$
V^f = C_e \sum_{j=\mathrm{front}^f}^{\mathrm{end}^f} \max\big( |(w_s^f)^T x_d^j + b_s^f - y_{d_j}| - \varepsilon_e,\ 0 \big) + \frac{1}{2} (w_s^f)^T w_s^f
$$
for each fold $f$. Then, solving one maximizing model and one minimizing model for each fold $f = 1, \ldots, F$:
$$
\begin{array}{rl}
b^f_{\max}\,/\,b^f_{\min} \;=\; \max\,/\min\limits_{b_s^f,\, a_j} & b_s^f \\[2mm]
\text{subject to} & C_e \displaystyle\sum_{j=\mathrm{front}^f}^{\mathrm{end}^f} a_j \le -\frac{1}{2} (w_s^f)^T w_s^f + V^f, \\[2mm]
& b_s^f - a_j \le -(w_s^f)^T x_d^j + y_{d_j} + \varepsilon_e, \quad \forall j = \mathrm{front}^f, \ldots, \mathrm{end}^f, \\[1mm]
& -b_s^f - a_j \le (w_s^f)^T x_d^j - y_{d_j} + \varepsilon_e, \quad \forall j = \mathrm{front}^f, \ldots, \mathrm{end}^f, \\[1mm]
& a_j \ge 0, \quad \forall j = \mathrm{front}^f, \ldots, \mathrm{end}^f,
\end{array}
$$
will give us the optimal intervals of $b_s^f$: $[b^f_{\min},\, b^f_{\max}]$.
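A minimal sketch of this bounding step for a single fold, using scipy.optimize.linprog, is shown below. The function name b_interval and the argument V_f (the lower-level objective value defined above) are notational assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def b_interval(X, y, C_e, eps_e, w, V_f):
    """Compute [b_min^f, b_max^f] by the two LPs above (one fold).

    Variable order is [b, a_1, ..., a_n]; b is free, a_j >= 0."""
    n = len(y)
    fitted = X @ w                                   # (w_s^f)^T x_d^j for all j

    # shared inequality system:
    #   C_e * sum(a) <= -0.5 w'w + V_f,
    #   b - a_j <= -w'x_j + y_j + eps,   -b - a_j <= w'x_j - y_j + eps
    A_ub = np.vstack([
        np.concatenate(([0.0], np.full(n, C_e))),
        np.hstack([np.ones((n, 1)), -np.eye(n)]),
        np.hstack([-np.ones((n, 1)), -np.eye(n)]),
    ])
    b_ub = np.concatenate([
        [V_f - 0.5 * w @ w],
        y - fitted + eps_e,
        fitted - y + eps_e,
    ])
    bounds = [(None, None)] + [(0.0, None)] * n

    c = np.zeros(n + 1)
    c[0] = 1.0                                       # objective: the bias b
    lo = linprog(c=c,  A_ub=A_ub, b_ub=b_ub, bounds=bounds)   # minimize b
    hi = linprog(c=-c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)   # maximize b
    return lo.x[0], hi.x[0]
```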
For an end user, selecting a $b_s \in [b_{\min},\, b_{\max}]$ to be used in prediction is of equal importance with selecting the parameters to train the SVM. The simplest selection is the $b_s$ value produced by the software, or the midpoint of $b_{\min}$ and $b_{\max}$. Sometimes, one can select $b_s \in [b_{\min},\, b_{\max}]$ to adjust the number of false positives and false negatives, as mentioned in Scholkopf and Smola (2001). If data are at hand, then following the previous setting of the validation data and the optimal $w_s^f$, the optimal $b_s^f$ that minimizes the validation absolute error (6) can be easily obtained by solving a linear mathematical program:


$$
\begin{array}{cl}
\displaystyle\min_{b_s^1, b_s^2, \ldots, b_s^F} & \displaystyle\sum_{f=1}^{F}\ \sum_{i=\mathrm{front}_v^f}^{\mathrm{end}_v^f} p_i \\[3mm]
\text{subject to} & b^f_{\min} \le b_s^f \le b^f_{\max}, \quad \forall f = 1, \ldots, F, \\[1mm]
& (x_d^i)^T w_s^1 + b_s^1 - y_{d_i} \le p_i, \quad \forall i = n+1, \ldots, n+m_1, \\[1mm]
& -(x_d^i)^T w_s^1 - b_s^1 + y_{d_i} \le p_i, \quad \forall i = n+1, \ldots, n+m_1, \\[1mm]
& (x_d^i)^T w_s^2 + b_s^2 - y_{d_i} \le p_i, \quad \forall i = n+m_1+1, \ldots, n+m_1+m_2, \\[1mm]
& -(x_d^i)^T w_s^2 - b_s^2 + y_{d_i} \le p_i, \quad \forall i = n+m_1+1, \ldots, n+m_1+m_2, \\[1mm]
& \qquad\vdots \\[1mm]
& (x_d^i)^T w_s^F + b_s^F - y_{d_i} \le p_i, \quad \forall i = \mathrm{front}_v^F, \ldots, \mathrm{end}_v^F, \\[1mm]
& -(x_d^i)^T w_s^F - b_s^F + y_{d_i} \le p_i, \quad \forall i = \mathrm{front}_v^F, \ldots, \mathrm{end}_v^F.
\end{array}
\tag{7}
$$
Note that the $b_s^f$ obtained by solving (7) will be the same as the global solution $b_s^f$ produced by the bi-level model (3) when the intervals $[b^f_{\min}, b^f_{\max}]$ and $w_s^f$ in (7) are obtained at the global optimal parameters $C_e$ and $\varepsilon_e$. In other words, the bi-level formulation implicitly produces an optimal $b_s = \sum_{f=1}^{F} b_s^f / F$ and $w_s = \sum_{f=1}^{F} w_s^f$ simultaneously with the global parameters. The analysis of this $(w_s, b_s)$ has not yet been covered in this work.

3 (Ce , εe )-rectangle search algorithm

We demonstrate a global optimization algorithm that solves the program (5). At every iteration, we fix the design parameters at different values to solve the lower-level problem, and with this lower-level solution, an upper bound on the outer-level problem can be obtained. The algorithm is named rectangle search because the search over the parameter values proceeds along the boundaries of rectangular areas to obtain information for termination or for further area partitioning.
For any fixed values of (Ce, εe), we can solve the lower-level problem (4), an LCP, by existing methodologies, such as the semismooth method. The active and inactive constraints of the solved LCP provide us with a set of linear equalities and inequalities. We call this set of linear equalities and inequalities a piece. Replacing the complementarity constraints in model (5) by the linear constraints (the piece) restricts the feasible Ce and εe to a smaller but convex region. This region is called an invariancy region on the (Ce, εe)-plane. Because of the “invariancy”, it suffices to find the local best (Ce, εe) pair and a local minimum of (5) by solving a linear program (the restricted linear program) subject to the piece resulting from the LCP fixed at an arbitrary point within the same convex region. Then, we search for the next (Ce, εe) (outside the current invariancy region) at which the lower-level problem is again fixed and solved. The above procedure repeats until the algorithm exhausts all the piecewise-convex feasible regions² and achieves global optimality.

² By exhausting all the invariancy regions, we mean that for every invariancy region, at least one (Ce, εe) point within the region is chosen and the associated restricted linear program is solved.


The specialty of our algorithm lies in the search technique for the next $(C_e, \varepsilon_e)$. The rectangle search scheme we propose decomposes the entire feasible region into small rectangles. Searching along the boundary of these rectangles reduces the effort of identifying the geometric locations of the invariancy regions to that of identifying the geometric locations of invariancy intervals on a given line. Consider the initial $[\underline{C}, \overline{C}] \times [\underline{\varepsilon}, \overline{\varepsilon}]$ rectangular area on the two-dimensional $(C_e, \varepsilon_e)$-plane. We first fix the values of $(C_e, \varepsilon_e)$ at one vertex of the rectangular area and solve the lower-level program there. Along one side of the boundary, we can easily identify the endpoints of the invariancy interval to which this $(C_e, \varepsilon_e)$ point belongs. Then, we find a new $(C_e, \varepsilon_e)$ point on the same side of the boundary but outside the current invariancy interval, and solve another lower-level problem with this new $(C_e, \varepsilon_e)$. All the invariancy intervals along the four sides of the boundary are identified with this repeated procedure.
For each invariancy region, there is a corresponding allocation of the SVR data points in the feature space. We call one specific data point allocation a grouping. Recording groupings is equivalent to recording the invariancy regions that have been found. Based on the grouping information associated with the invariancy intervals along the boundary, one can either conclude that all the invariancy regions contained in this rectangular area have been examined or that further area partitioning is required. Throughout the algorithm, we maintain a queue of rectangular areas to be examined. A rectangular area can be removed from the queue if (1) the four corner points of the $(C_e, \varepsilon_e)$-rectangular area have the same grouping vector, or (2) the $(C_e, \varepsilon_e)$-rectangular area is bisected into two regions by a straight line, each belonging to one invariancy region. These two sufficient conditions are called the 1st and the 2nd stage of the algorithm, respectively.
To summarize, this algorithm contains the following 7 key procedural tasks:
1. Solve the lower-level problem with a fixed (Ce, εe) by known methodologies, such as the semismooth method (Ferris and Munson 2004).
2. Replace the complementarity constraints by linear constraints (a piece), which restricts the feasible region to a smaller but convex region (an invariancy region).
3. Solve for the local best (Ce, εe) and the corresponding minimum within the invariancy region. This can be done by solving a linear program.
4. Search for the next (Ce, εe) (outside the current invariancy region) at which the lower-level problem is fixed and solved. We propose the rectangle search scheme to search along the boundary of a rectangle, so that the search for an invariancy region is reduced to the search for an invariancy interval.
5. Partition the initial $[\underline{C}, \overline{C}] \times [\underline{\varepsilon}, \overline{\varepsilon}]$ area into small rectangular regions at chosen points and maintain a queue of rectangular areas to be examined.
6. Maintain a list of visited invariancy regions by recording their corresponding data allocations in space (groupings).
7. Eliminate a rectangular area from the queue if (1) the four corner points of the (Ce, εe)-rectangular area have the same grouping vector, or (2) the (Ce, εe)-rectangular area is bisected into two regions by a straight line, each belonging to one invariancy region.
Task 1 is discussed in Sect. 3.1; the linear constraints (piece) and the convex region (invariancy region) mentioned in task 2 are defined in Sect. 3.2; the allocation of data points in the feature space (grouping) mentioned in task 6 and the degeneracy issue are also discussed in Sect. 3.2; the linear program mentioned in task 3 is introduced in Sect. 3.3; the search strategy along the boundary of a rectangle mentioned in task 4 can be seen in Sect. 3.4; the partitioning, recording, and eliminating steps mentioned in tasks 5–7 are shown in Sects. 3.5 and 3.6. A queue-driven sketch of the resulting outer loop follows.
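The sketch below is an outline only; the callbacks corner_groupings, boundary_scan, split_rectangle, and solve_restricted_lp are placeholders standing in for tasks 1–7 and are assumptions of this sketch, not routines defined by the paper.

```python
from collections import deque

def rectangle_search(C_lo, C_hi, eps_lo, eps_hi,
                     corner_groupings, boundary_scan, split_rectangle,
                     solve_restricted_lp):
    """Queue-driven outline of the (C_e, eps_e)-rectangle search.

    corner_groupings(rect)   -> grouping vectors (hashable) at the four corners,
    boundary_scan(rect)      -> object with .groupings, .best_value, .best_point,
                                and .bisected_by_line (elimination rule 2),
    split_rectangle(rect, i) -> sub-rectangles to enqueue (task 5),
    solve_restricted_lp(G)   -> (objective value, (C_e, eps_e)) for a piece (task 3).
    """
    queue = deque([(C_lo, C_hi, eps_lo, eps_hi)])
    visited_groupings = set()              # task 6: record visited invariancy regions
    best_value, best_point = float("inf"), None

    while queue:                           # terminate when every rectangle is removed
        rect = queue.popleft()
        corners = corner_groupings(rect)
        if len(set(corners)) == 1:         # elimination rule (1): one invariancy region
            value, point = solve_restricted_lp(corners[0])
            if value < best_value:
                best_value, best_point = value, point
            visited_groupings.add(corners[0])
            continue
        info = boundary_scan(rect)         # tasks 1-4 along the four sides
        visited_groupings.update(info.groupings)
        if info.best_value < best_value:
            best_value, best_point = info.best_value, info.best_point
        if info.bisected_by_line:          # elimination rule (2)
            continue
        queue.extend(split_rectangle(rect, info))   # further partitioning
    return best_value, best_point
```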

3.1 Lower-level problem with fixed (Ce , εe )

The lower-level problem with a fixed $(C_e, \varepsilon_e)$ is the SVM regression model for each fold of training data. We have shown that the lower-level problem is equivalent to a collection of LCPs. To solve these LCPs, we employ the semismooth method (Luca et al. 1996), which involves the use of the semismooth Fischer–Burmeister function. Consider the general complementarity problem:
$$
\begin{array}{ll}
0 \le U_i(a) \ \perp\ a_i \ge 0, & \forall i \in I, \\
0 = U_i(a) \ \perp\ a_i:\ \text{free}, & \forall i \in E,
\end{array}
\tag{8}
$$
where $I$ and $E$ denote the non-overlapping sets of indices for inequalities and equalities, respectively. The Fischer–Burmeister function for the LCP (8) is defined as
$$
\phi(a_i, U_i(a)) := a_i + U_i(a) - \sqrt{a_i^2 + U_i(a)^2}.
$$

The equality $\phi(a_i, U_i(a)) = 0$ holds if and only if $0 \le U_i(a) \perp a_i \ge 0$. In the semismooth method, the merit function used is of the following form:
$$
\Phi(a) := \begin{bmatrix}
\vdots \\
\phi(a_i, U_i(a)), \ \forall i \in I \\
\vdots \\
U_i(a), \ \forall i \in E \\
\vdots
\end{bmatrix}.
$$

It has been proven in many research papers, such as Facchinei and Soares (1997), that the merit function $\Phi(a)$ has some desirable properties, including that $\Phi(a)$ is a semismooth function and that $g(a) := \frac{1}{2}\|\Phi(a)\|_2^2$ is continuously differentiable. Most importantly, for any $H \in \partial_B \Phi(a)$, where $\partial_B \Phi(a)$ represents the B-subdifferential of $\Phi(a)$, we have $\nabla g(a) = H^T \Phi(a)$. These properties hold under the continuous differentiability of $U(a)$, which is satisfied in the application of SVR. Thus, by solving the equation $g(a) = 0$, a solution to the complementarity problem (8) is found. Theoretical foundations of the semismooth method can be found in Luca et al. (1996), Billups (1995), Ferris and Munson (2004). The damped Newton method (Facchinei and Soares 1997) embedded in our algorithm to solve the lower-level problems of the SVR is given in Appendix A.
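A generic, self-contained sketch of this merit-function approach is given below. It uses a finite-difference Jacobian in place of an element of the B-subdifferential and is only an illustration of the damped Newton idea behind Algorithm 12; all names are assumptions.

```python
import numpy as np

def solve_semismooth(U, a0, I, E, tol=1e-8, max_iter=100):
    """Damped-Newton sketch on the Fischer-Burmeister merit function.

    U(a) returns the vector of functions U_i(a); indices in I (integer list)
    are complementarity rows, indices in E are plain equation rows."""
    def fb(u, v):
        return u + v - np.sqrt(u * u + v * v)

    def Phi(a):
        Ua = U(a)
        out = np.empty_like(Ua)
        out[I] = fb(a[I], Ua[I])       # complementarity rows
        out[E] = Ua[E]                 # equation rows
        return out

    def g(a):
        r = Phi(a)
        return 0.5 * r @ r

    a = np.asarray(a0, dtype=float)
    for _ in range(max_iter):
        r = Phi(a)
        if np.linalg.norm(r) < tol:
            break
        # forward-difference Jacobian of Phi (stand-in for a B-subdifferential element)
        h = 1e-7
        J = np.column_stack([(Phi(a + h * e) - r) / h for e in np.eye(a.size)])
        d = np.linalg.lstsq(J, -r, rcond=None)[0]
        t, g0 = 1.0, g(a)
        while g(a + t * d) > g0 - 1e-4 * t * (r @ r) and t > 1e-10:
            t *= 0.5                   # Armijo-type backtracking (the damping)
        a = a + t * d
    return a
```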


In the context of the SVR, the merit function $\Phi(a^f) \in \mathbb{R}^{3n_f+1}$ is of the form shown in (9), provided a fixed pair of parameters $(C_e, \varepsilon_e)$. Following Theorem 13, the computation of the B-subdifferential $H$ requires the matrix $U'(a^f)$ of the following form:
$$
U'(a^f) = \begin{bmatrix}
X^T X & -X^T X & I_{n_f \times n_f} & -1_{n_f \times 1} \\
-X^T X & X^T X & I_{n_f \times n_f} & 1_{n_f \times 1} \\
-I_{n_f \times n_f} & -I_{n_f \times n_f} & 0_{n_f \times n_f} & 0_{n_f \times 1} \\
1_{1 \times n_f} & -1_{1 \times n_f} & 0_{1 \times n_f} & 0
\end{bmatrix},
$$
where $X^T X$ is an $n_f \times n_f$ matrix comprising the elements $(x_d^i)^T x_d^j$ for all $i = \mathrm{front}^f, \ldots, \mathrm{end}^f$ and $j = \mathrm{front}^f, \ldots, \mathrm{end}^f$; $I_{n_f \times n_f}$ is the identity matrix in $\mathbb{R}^{n_f \times n_f}$; $1_{n_f \times 1}$ is the matrix in $\mathbb{R}^{n_f \times 1}$ with all entries equal to 1; $0_{n_f \times 1}$ is the matrix in $\mathbb{R}^{n_f \times 1}$ with all entries equal to 0; and the variable vector $a^f \in \mathbb{R}^{3n_f+1}$ is written as
$$
a^f = [\,\alpha_{\mathrm{front}^f} \cdots \alpha_{\mathrm{end}^f} \mid \beta_{\mathrm{front}^f} \cdots \beta_{\mathrm{end}^f} \mid e_{s_{\mathrm{front}^f}} \cdots e_{s_{\mathrm{end}^f}} \mid b_s^f\,]^T.
$$
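The block structure of $U'(a^f)$ can be assembled directly from the Gram matrix of the training features of one fold; a small numpy sketch (with an assumed helper name) follows.

```python
import numpy as np

def assemble_U_prime(X):
    """Assemble the (3 n_f + 1) x (3 n_f + 1) matrix U'(a^f) displayed above
    from the training feature matrix X of one fold (rows are x_d^j)."""
    n_f = X.shape[0]
    G = X @ X.T                       # Gram matrix with entries (x_d^i)^T x_d^j
    I = np.eye(n_f)
    Z = np.zeros((n_f, n_f))
    ones = np.ones((n_f, 1))
    zeros = np.zeros((n_f, 1))
    return np.block([
        [ G, -G,  I, -ones],
        [-G,  G,  I,  ones],
        [-I, -I,  Z,  zeros],
        [np.ones((1, n_f)), -np.ones((1, n_f)), np.zeros((1, n_f)), np.zeros((1, 1))],
    ])
```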

Now suppose $\{a^k\}$ is a sequence generated by the damped Newton method (Algorithm 12) and $\{a^k\} \to a^*$, where $a^*$ is the final solution to the system $\Phi(a) = 0$. It is known that if $k$ is sufficiently large, the search direction $d^k$ is always chosen as the Newton step computed in (31) rather than the steepest descent step as in (33). Meanwhile, if $k$ is sufficiently large, $t^k$ is always chosen to be 1 (Luca et al. 1996). The descent condition in (32) ensures that the semismooth method converges globally, i.e., any initial point, not necessarily close to the solution, can lead to convergence.
The semismooth method is neither the only nor the guaranteed best way to solve the lower-level problem with a fixed $(C_e, \varepsilon_e)$. The successive overrelaxation method (Mangasarian and Musicant 1998) and the interior-point method (Ferris and Munson 2002)³ applied to the SVM might be substitutes, but the comparison is not within the scope of this work.

³ According to the numerical experiments provided in Ferris and Munson (2004), the semismooth method outperforms the interior-point method (Ferris and Munson 2002) specifically in solving large-scale SVM classification problems.

$$
\Phi(a^f) =
\left[
\begin{array}{c}
\alpha_j + \Big( e_{s_j} + \varepsilon_e - (x_d^j)^T \sum\limits_{i=\mathrm{front}^f}^{\mathrm{end}^f} (\beta_i - \alpha_i) x_d^i - b_s^f + y_{d_j} \Big)
- \sqrt{ \alpha_j^2 + \Big( e_{s_j} + \varepsilon_e - (x_d^j)^T \sum\limits_{i=\mathrm{front}^f}^{\mathrm{end}^f} (\beta_i - \alpha_i) x_d^i - b_s^f + y_{d_j} \Big)^2 },\quad j = \mathrm{front}^f, \ldots, \mathrm{end}^f \\[4mm]
\beta_j + \Big( e_{s_j} + \varepsilon_e + (x_d^j)^T \sum\limits_{i=\mathrm{front}^f}^{\mathrm{end}^f} (\beta_i - \alpha_i) x_d^i + b_s^f - y_{d_j} \Big)
- \sqrt{ \beta_j^2 + \Big( e_{s_j} + \varepsilon_e + (x_d^j)^T \sum\limits_{i=\mathrm{front}^f}^{\mathrm{end}^f} (\beta_i - \alpha_i) x_d^i + b_s^f - y_{d_j} \Big)^2 },\quad j = \mathrm{front}^f, \ldots, \mathrm{end}^f \\[4mm]
e_{s_j} + \big( C_e - \alpha_j - \beta_j \big) - \sqrt{ e_{s_j}^2 + \big( C_e - \alpha_j - \beta_j \big)^2 },\quad j = \mathrm{front}^f, \ldots, \mathrm{end}^f \\[4mm]
\sum\limits_{j=\mathrm{front}^f}^{\mathrm{end}^f} \alpha_j - \sum\limits_{j=\mathrm{front}^f}^{\mathrm{end}^f} \beta_j
\end{array}
\right]
\tag{9}
$$

3.2 Piece of the complementarity and data point allocation (grouping) in space

Consider the $\mathrm{LCP}^f_{\mathrm{SVR}}$ in (4). We define the binary variables $z_j$, $z'_j$ and $\eta_j$ for each $j = \mathrm{front}^f, \ldots, \mathrm{end}^f$ as
$$
z_j = \begin{cases} 1, & \text{if } e_{s_j} + \varepsilon_e - (x_d^j)^T w_s^f - b_s^f + y_{d_j} = 0, \\ 0, & \text{if } \alpha_j = 0, \end{cases}
$$
$$
z'_j = \begin{cases} 1, & \text{if } e_{s_j} + \varepsilon_e + (x_d^j)^T w_s^f + b_s^f - y_{d_j} = 0, \\ 0, & \text{if } \beta_j = 0, \end{cases}
$$
$$
\eta_j = \begin{cases} 1, & \text{if } e_{s_j} = 0, \\ 0, & \text{if } C_e - \alpha_j - \beta_j = 0. \end{cases}
$$

Provided large numbers $\theta_{1j}, \theta_{2j}, \theta_{3j}, \theta_{4j}, \theta_{5j}$, and $\theta_{6j}$, an equivalent formulation of (4) can be written as:
$$
\forall f = 1, \ldots, F:\qquad
\left\{
\begin{array}{l}
w_s^f - \displaystyle\sum_{j=\mathrm{front}^f}^{\mathrm{end}^f} (\beta_j - \alpha_j)\, x_d^j = 0, \\[2mm]
\displaystyle\sum_{j=\mathrm{front}^f}^{\mathrm{end}^f} \alpha_j - \sum_{j=\mathrm{front}^f}^{\mathrm{end}^f} \beta_j = 0, \\[2mm]
\forall j = \mathrm{front}^f, \ldots, \mathrm{end}^f: \\[1mm]
\quad 0 \le \alpha_j \le \theta_{1j} \cdot z_j, \\[1mm]
\quad 0 \le e_{s_j} + \varepsilon_e - (x_d^j)^T w_s^f - b_s^f + y_{d_j} \le \theta_{2j} \cdot (1 - z_j), \\[1mm]
\quad 0 \le \beta_j \le \theta_{3j} \cdot z'_j, \\[1mm]
\quad 0 \le e_{s_j} + \varepsilon_e + (x_d^j)^T w_s^f + b_s^f - y_{d_j} \le \theta_{4j} \cdot (1 - z'_j), \\[1mm]
\quad 0 \le C_e - \alpha_j - \beta_j \le \theta_{5j} \cdot \eta_j, \\[1mm]
\quad 0 \le e_{s_j} \le \theta_{6j} \cdot (1 - \eta_j).
\end{array}
\right.
\tag{10}
$$

The values of $[z_j, z'_j, \eta_j]$, $\forall j = \mathrm{front}^f, \ldots, \mathrm{end}^f$, have an important geometrical meaning. As Fig. 1 shows, the allocation of a single data point $(x_d^j, y_{d_j})$ in the space of an SVR system has 5 possible geometrical locations: below the lower hyperplane, above the upper hyperplane, on the lower hyperplane, on the upper hyperplane, and inside the tube. The index sets $A_1^f, A_2^f, A_3^f, A_4^f$, and $A_5^f$ can be defined by the binary solutions:
$$
\begin{aligned}
A_1^f &:= \{\, j \in \{\mathrm{front}^f, \ldots, \mathrm{end}^f\} \mid (z_j, z'_j, \eta_j) = (1, 0, 0) \,\}, \\
A_2^f &:= \{\, j \in \{\mathrm{front}^f, \ldots, \mathrm{end}^f\} \mid (z_j, z'_j, \eta_j) = (0, 1, 0) \,\}, \\
A_3^f &:= \{\, j \in \{\mathrm{front}^f, \ldots, \mathrm{end}^f\} \mid (z_j, z'_j, \eta_j) = (1, 0, 1) \,\}, \\
A_4^f &:= \{\, j \in \{\mathrm{front}^f, \ldots, \mathrm{end}^f\} \mid (z_j, z'_j, \eta_j) = (0, 1, 1) \,\}, \text{ and} \\
A_5^f &:= \{\, j \in \{\mathrm{front}^f, \ldots, \mathrm{end}^f\} \mid (z_j, z'_j, \eta_j) = (0, 0, 1) \,\},
\end{aligned}
$$

Fig. 1 Within the SVR context, a single data point can be labeled by its allocation in space

such that $A_1^f \cup A_2^f \cup A_3^f \cup A_4^f \cup A_5^f = \{\mathrm{front}^f, \ldots, \mathrm{end}^f\}$. Since a point cannot be allocated on both hyperplanes or on both sides, two natural cuts are derived:
$$
z_j + z'_j \le 1, \quad \forall j = \mathrm{front}^f, \ldots, \mathrm{end}^f,\ f = 1, \ldots, F,
$$
$$
z_j + z'_j + \eta_j \ge 1, \quad \forall j = \mathrm{front}^f, \ldots, \mathrm{end}^f,\ f = 1, \ldots, F.
$$

We define grouping and piece in the following.

Definition 1 A grouping $G$ corresponding to a $(C_e, \varepsilon_e)$-pair is a vector of integers whose $j$th entry captures the spatial allocation of the $j$th observation of the training data. Let $(w_s^f, b_s^f)$, $\forall f$, be the hyperplanes produced by applying Algorithm 12 to the lower-level problem fixed at $(C_e, \varepsilon_e)$. If $(x_d^j, y_{d_j})$ is below the lower hyperplane,⁴ $G_j = 1$. If $(x_d^j, y_{d_j})$ is above the upper hyperplane,⁵ $G_j = 2$. If $(x_d^j, y_{d_j})$ is on the lower hyperplane, $G_j = 3$. If $(x_d^j, y_{d_j})$ is on the upper hyperplane, $G_j = 4$. If $(x_d^j, y_{d_j})$ is inside the $\varepsilon_e$-tube, $G_j = 5$.

Definition 2 A piece of the $\mathrm{LCP}^f_{\mathrm{SVR}}$ is a set of linear equality and inequality constraints that result from fixing the binary variables $[z_j, z'_j, \eta_j]$ at one of the following values:
$$[1,0,0], \quad [0,1,0], \quad [1,0,1], \quad [0,1,1], \quad \text{or} \quad [0,0,1]$$
for each $j$, in model (10).

A grouping vector has dimension $n$. There are at most $5^n$ possible grouping vectors for $n$ training data points, regardless of the choices of $C_e$ and $\varepsilon_e$. Given a grouping,

⁴ Lower hyperplane: $y_{d_j} = (x_d^j)^T w_s^f + b_s^f - \varepsilon_e$.
⁵ Upper hyperplane: $y_{d_j} = (x_d^j)^T w_s^f + b_s^f + \varepsilon_e$.


the corresponding piece is expressed as the collection of the linear equalities and inequalities:
$$
w_s^f + \sum_{j=\mathrm{front}^f}^{\mathrm{end}^f} \alpha_j x_d^j - \sum_{j=\mathrm{front}^f}^{\mathrm{end}^f} \beta_j x_d^j = 0, \qquad
\sum_{j=\mathrm{front}^f}^{\mathrm{end}^f} \alpha_j - \sum_{j=\mathrm{front}^f}^{\mathrm{end}^f} \beta_j = 0,
\tag{11}
$$
$$
\left.\begin{array}{l} (x_d^j)^T w_s^f + b_s^f - y_{d_j} - \varepsilon_e \ge 0, \\ \alpha_j = C_e, \ \beta_j = 0, \end{array}\right\} \quad \forall j \in A_1^f,
\tag{12}
$$
$$
\left.\begin{array}{l} y_{d_j} - (x_d^j)^T w_s^f - b_s^f - \varepsilon_e \ge 0, \\ \alpha_j = 0, \ \beta_j = C_e, \end{array}\right\} \quad \forall j \in A_2^f,
\tag{13}
$$
$$
\left.\begin{array}{l} C_e \ge \alpha_j \ge 0, \ \beta_j = 0, \\ y_{d_j} = (x_d^j)^T w_s^f + b_s^f - \varepsilon_e, \end{array}\right\} \quad \forall j \in A_3^f,
\tag{14}
$$
$$
\left.\begin{array}{l} \alpha_j = 0, \ C_e \ge \beta_j \ge 0, \\ y_{d_j} = (x_d^j)^T w_s^f + b_s^f + \varepsilon_e, \end{array}\right\} \quad \forall j \in A_4^f,
\tag{15}
$$
$$
\left.\begin{array}{l} \varepsilon_e - (x_d^j)^T w_s^f - b_s^f + y_{d_j} \ge 0, \\ \varepsilon_e - y_{d_j} + (x_d^j)^T w_s^f + b_s^f \ge 0, \\ \alpha_j = 0, \ \beta_j = 0, \end{array}\right\} \quad \forall j \in A_5^f.
\tag{16}
$$
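Membership in a piece is easy to test numerically: given a grouping vector and a candidate solution for one fold, the sketch below checks the conditions (11)–(16). The function and argument names are illustrative assumptions.

```python
import numpy as np

def in_piece(G, X, y, C_e, eps_e, w, b, alpha, beta, tol=1e-8):
    """Check whether a candidate point satisfies the piece (11)-(16) defined by
    the grouping vector G (entries 1..5), for a single fold.  A True answer
    means (C_e, eps_e) lies in the invariancy region associated with G."""
    r = X @ w + b - y                       # signed residual of each point
    ok = (np.abs(w - X.T @ (beta - alpha)) <= tol).all()          # (11)
    ok &= abs(alpha.sum() - beta.sum()) <= tol                    # (11)
    for j, g in enumerate(G):
        if g == 1:      # below the lower hyperplane, (12)
            ok &= r[j] - eps_e >= -tol and abs(alpha[j] - C_e) <= tol and abs(beta[j]) <= tol
        elif g == 2:    # above the upper hyperplane, (13)
            ok &= -r[j] - eps_e >= -tol and abs(alpha[j]) <= tol and abs(beta[j] - C_e) <= tol
        elif g == 3:    # on the lower hyperplane, (14)
            ok &= -tol <= alpha[j] <= C_e + tol and abs(beta[j]) <= tol and abs(r[j] - eps_e) <= tol
        elif g == 4:    # on the upper hyperplane, (15)
            ok &= abs(alpha[j]) <= tol and -tol <= beta[j] <= C_e + tol and abs(r[j] + eps_e) <= tol
        else:           # g == 5, inside the tube, (16)
            ok &= abs(r[j]) <= eps_e + tol and abs(alpha[j]) <= tol and abs(beta[j]) <= tol
    return bool(ok)
```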

Thus, by recording the grouping vector, the piece can be monitored and recovered on the fly. We follow the algorithm below to transform the solution of the lower-level LCP with a fixed $(C_e, \varepsilon_e)$ into a grouping vector.
Algorithm 3 Transform the Solutions $(\alpha, \beta, e_s, w_s^f, b_s^f)$ to a Grouping Vector. (Use this algorithm when the solution to the $\mathrm{LCP}^f_{\mathrm{SVR}}$ is not degenerate and $b_s^f$ is unique.)

Given solutions $\alpha, \beta, e_s, w_s^f$ and $b_s^f$, declare GroupingV as a vector of length $n$ and let $A_1^f, A_2^f, A_3^f, A_4^f$ and $A_5^f = \emptyset$.
for $f = 1, \ldots, F$, $j = \mathrm{front}^f, \ldots, \mathrm{end}^f$
  if $\alpha_j > e_{s_j} + \varepsilon_e - (x_d^j)^T w_s^f - b_s^f + y_{d_j}$ then
    if $e_{s_j} > C_e - \alpha_j - \beta_j$ then
      GroupingV$_j$ = 1, and $A_1^f \leftarrow A_1^f \cup \{j\}$.
    else
      GroupingV$_j$ = 3, and $A_3^f \leftarrow A_3^f \cup \{j\}$.
    end if
  else if $e_{s_j} + \varepsilon_e + (x_d^j)^T w_s^f + b_s^f - y_{d_j} > \beta_j$ then
    GroupingV$_j$ = 5, and $A_5^f \leftarrow A_5^f \cup \{j\}$.
  else
    if $C_e - \alpha_j - \beta_j > e_{s_j}$ then
      GroupingV$_j$ = 4, and $A_4^f \leftarrow A_4^f \cup \{j\}$.
    else
      GroupingV$_j$ = 2, and $A_2^f \leftarrow A_2^f \cup \{j\}$.
    end if
  end if
end for
Return GroupingV.
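A compact Python rendering of Algorithm 3 for a single fold is given below; the array names are assumptions, and the comparisons follow the complementarity slacks exactly as in the pseudocode.

```python
import numpy as np

def grouping_vector(X, y, C_e, eps_e, w, b, alpha, beta, e_s):
    """Python rendering of Algorithm 3 for one fold: map a non-degenerate
    solution of the lower-level LCP to a grouping vector with entries 1..5."""
    n = len(y)
    G = np.zeros(n, dtype=int)
    s1 = e_s + eps_e - X @ w - b + y        # slack paired with alpha_j
    s2 = e_s + eps_e + X @ w + b - y        # slack paired with beta_j
    s3 = C_e - alpha - beta                 # slack paired with e_sj
    for j in range(n):
        if alpha[j] > s1[j]:                # alpha_j active: on/below lower hyperplane
            G[j] = 1 if e_s[j] > s3[j] else 3
        elif s2[j] > beta[j]:               # both multipliers inactive: inside the tube
            G[j] = 5
        else:                               # beta_j active: on/above upper hyperplane
            G[j] = 4 if s3[j] > e_s[j] else 2
    return G
```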
The solution to the $\mathrm{LCP}^f_{\mathrm{SVR}}$, however, is tricky. The tricky parts are the non-uniqueness of the $b_s$ solution and the degeneracy of the complementarities. One of the degenerate cases occurs when
$$
\text{case 1:}\quad \{\, y_{d_{j'}} = (x_d^{j'})^T w_s + b_s - \varepsilon_e,\ \alpha_{j'} = 0,\ \beta_{j'} = 0 \,\}
$$
at some index $j'$. This case implies that the $j'$th data point in the space could either be on the lower hyperplane or inside the tube, and the index $j'$ could be contained in either $A_3$ or $A_5$. The degeneracy of the following complementarity
$$
0 \le e_{s_{j'}} + \varepsilon_e - (x_d^{j'})^T w_s^f - b_s^f + y_{d_{j'}} \ \perp\ \alpha_{j'} \ge 0
$$
can result in the solutions of case 1. It is not hard to deduce that there are a total of four possible cases in which a data point is eligible for two geometric locations. They are:
$$
\text{case 2:}\quad \{\, y_{d_{j'}} = (x_d^{j'})^T w_s + b_s - \varepsilon_e,\ \alpha_{j'} = C_e,\ \beta_{j'} = 0 \,\} \ \Rightarrow\ j' \in A_3 \text{ or } A_1,
$$
$$
\text{case 3:}\quad \{\, y_{d_{j'}} = (x_d^{j'})^T w_s + b_s + \varepsilon_e,\ \alpha_{j'} = 0,\ \beta_{j'} = 0 \,\} \ \Rightarrow\ j' \in A_4 \text{ or } A_5, \text{ and}
$$
$$
\text{case 4:}\quad \{\, y_{d_{j'}} = (x_d^{j'})^T w_s + b_s + \varepsilon_e,\ \alpha_{j'} = 0,\ \beta_{j'} = C_e \,\} \ \Rightarrow\ j' \in A_4 \text{ or } A_2.
$$
For case 2, the complementarities $0 \le e_{s_{j'}} + \varepsilon_e + (x_d^{j'})^T w_s^f + b_s^f - y_{d_{j'}} \perp \beta_{j'} \ge 0$ and $0 \le C_e - \alpha_{j'} - \beta_{j'} \perp e_{s_{j'}} \ge 0$ are degenerate; for case 3, the complementarity $0 \le e_{s_{j'}} + \varepsilon_e + (x_d^{j'})^T w_s^f + b_s^f - y_{d_{j'}} \perp \beta_{j'} \ge 0$ is degenerate; for case 4, the complementarities $0 \le e_{s_{j'}} + \varepsilon_e - (x_d^{j'})^T w_s^f - b_s^f + y_{d_{j'}} \perp \alpha_{j'} \ge 0$ and $0 \le C_e - \alpha_{j'} - \beta_{j'} \perp e_{s_{j'}} \ge 0$ are degenerate.
The following algorithm generates a set GS containing all the groupings associated with a $(C_e, \varepsilon_e)$ pair when the solution to the $\mathrm{LCP}^f_{\mathrm{SVR}}$ is degenerate or the solution $b_s^f$ is not unique but an interval $[b^f_{\min}, b^f_{\max}]$.
Algorithm 4 Transform the Solutions $(\alpha, \beta, e_s, w_s^f, [b^f_{\min},\, b^f_{\max}])$ to a Set of Grouping Vectors. (Use this algorithm when the solution to the $\mathrm{LCP}^f_{\mathrm{SVR}}$ is degenerate or $b_s^f$ is not unique.)

Given solutions $\alpha, \beta, e_s, w_s^f$ and $b_s^f \in [b^f_{\min}, b^f_{\max}]$, declare $V$ as a vector of length $n$ with all entries 0 and initialize a set GS ← $\{V\}$.
for $f = 1, \ldots, F$ and $i = 1, 2$
  Let $b_s^f = b^f_{\min}$ when $i = 1$. Let $b_s^f = b^f_{\max}$ when $i = 2$. Initialize $\mathrm{GS}_i \leftarrow \{\mathrm{GS}\}$.
  for $j = \mathrm{front}^f, \ldots, \mathrm{end}^f$
    switch on the degeneracy case $s$ of observation $j$ do
      case 1
        Duplicate every vector in $\mathrm{GS}_i$, so that for an arbitrary vector in $\mathrm{GS}_i$ one can find another identical vector in $\mathrm{GS}_i$. For each pair of identical vectors in $\mathrm{GS}_i$, let the $j$th element of one vector be 3 (= PossibleGroupingOne) and the $j$th element of the other vector be 5 (= PossibleGroupingTwo).
      end case
      case 2
        As in case 1, but with PossibleGroupingOne = 1 and PossibleGroupingTwo = 3.
      end case
      case 3
        As in case 1, but with PossibleGroupingOne = 3 and PossibleGroupingTwo = 4.
      end case
      case 4
        As in case 1, but with PossibleGroupingOne = 2 and PossibleGroupingTwo = 4.
      end case
      case 5 (none of the above cases; no degeneracy)
        for every vector in the set $\mathrm{GS}_i$
          Apply Algorithm 3, steps 3–18.
        end for
      end case
    end switch
  end for
  Let GS = $\mathrm{GS}_1 \cup \mathrm{GS}_2$.
end for
Return GS.

Below we define the invariancy region in the context of this work and show that
the invariancy region is convex.
Definition 5 An invariancy region IR is a region on the parameter space, i.e., the
(Ce , εe )-plane, such that the grouping vector induced by every (Ce , εe ) ∈ IR is the
same.
Theorem 6 Consider the following process (1)–(3). (1) Solve the $\mathrm{LCP}^f_{\mathrm{SVR}}$ by Algorithm 12 with the variables $(C_e, \varepsilon_e)$ fixed at $(\bar{C}_e, \bar{\varepsilon}_e)$. (2) If the solution is non-degenerate, transform the solution to a grouping vector $G$ by Algorithm 3. If the solution is degenerate, let $G$ be any one member of the set GS obtained by Algorithm 4. (3) Use the grouping $G$ to form the index sets $A_i$, $\forall i = 1, \ldots, 5$, and write down the piece $P$ as expressed in (11)–(16). Let the invariancy region IR be the set of $(C_e, \varepsilon_e)$-pairs whose corresponding grouping vectors equal $G$. Then IR is a convex set.

Proof Without loss of generality, let $F = 1$ and drop the superscripts $f$ in the notation of the variables. Let $(\bar{C}_e^{(1)}, \bar{\varepsilon}_e^{(1)}) \in \mathrm{IR}$ and $(\bar{C}_e^{(2)}, \bar{\varepsilon}_e^{(2)}) \in \mathrm{IR}$. Let the solutions to $\mathrm{LCP}_{\mathrm{SVR}}$ with $(C_e, \varepsilon_e)$ fixed at $(\bar{C}_e^{(1)}, \bar{\varepsilon}_e^{(1)})$ and $(\bar{C}_e^{(2)}, \bar{\varepsilon}_e^{(2)})$ be $\{\alpha_j^{(1)}, \beta_j^{(1)}, e_j^{(1)}, w_s^{(1)}, b_s^{(1)}\}$ and $\{\alpha_j^{(2)}, \beta_j^{(2)}, e_j^{(2)}, w_s^{(2)}, b_s^{(2)}\}$, respectively. Assume that they give the same grouping, i.e., $A_i^{(1)} = A_i^{(2)}$, $\forall i = 1, \ldots, 5$, and that the corresponding pieces $P$ are given as (11)–(16). For any arbitrary $\lambda \in (0, 1)$, consider $(C_e^{(3)}, \varepsilon_e^{(3)}) = (\lambda \bar{C}_e^{(1)} + (1-\lambda)\bar{C}_e^{(2)},\ \lambda \bar{\varepsilon}_e^{(1)} + (1-\lambda)\bar{\varepsilon}_e^{(2)})$. Since $(\bar{C}_e^{(1)}, \bar{\varepsilon}_e^{(1)})$ and $(\bar{C}_e^{(2)}, \bar{\varepsilon}_e^{(2)})$ produce the same groupings, we claim that the solution $\{\alpha_j^{(3)}, \beta_j^{(3)}, e_j^{(3)}, w_s^{(3)}, b_s^{(3)}\}$ equals $\{\lambda \alpha_j^{(1)} + (1-\lambda)\alpha_j^{(2)},\ \lambda \beta_j^{(1)} + (1-\lambda)\beta_j^{(2)},\ \lambda e_j^{(1)} + (1-\lambda)e_j^{(2)},\ \lambda w_s^{(1)} + (1-\lambda)w_s^{(2)},\ \lambda b_s^{(1)} + (1-\lambda)b_s^{(2)}\}$ because the following system is satisfied by it:
$$
w_s^{(3)} - \sum_{j=1}^{n} (\beta_j^{(3)} - \alpha_j^{(3)})\, x_d^j = 0, \qquad
\sum_{j=1}^{n} \alpha_j^{(3)} - \sum_{j=1}^{n} \beta_j^{(3)} = 0,
$$
$$
\forall j = 1, \ldots, n:\quad
\begin{cases}
0 \le e_{s_j}^{(3)} + [\lambda \bar{\varepsilon}_e^{(1)} + (1-\lambda)\bar{\varepsilon}_e^{(2)}] - (x_d^j)^T w_s^{(3)} - b_s^{(3)} + y_{d_j} \ \perp\ \alpha_j^{(3)} \ge 0, \\[1mm]
0 \le e_{s_j}^{(3)} + [\lambda \bar{\varepsilon}_e^{(1)} + (1-\lambda)\bar{\varepsilon}_e^{(2)}] + (x_d^j)^T w_s^{(3)} + b_s^{(3)} - y_{d_j} \ \perp\ \beta_j^{(3)} \ge 0, \\[1mm]
0 \le [\lambda \bar{C}_e^{(1)} + (1-\lambda)\bar{C}_e^{(2)}] - \alpha_j^{(3)} - \beta_j^{(3)} \ \perp\ e_{s_j}^{(3)} \ge 0.
\end{cases}
$$
The grouping for $\{\alpha_j^{(3)}, \beta_j^{(3)}, e_j^{(3)}, w_s^{(3)}, b_s^{(3)}\}$ is again the same. So $\{C_e^{(3)}, \varepsilon_e^{(3)}, \alpha_j^{(3)}, \beta_j^{(3)}, e_j^{(3)}, w_s^{(3)}, b_s^{(3)}\}$ is feasible to $P$, and IR is convex. □


The following property directly results from Theorem 6, which is one of the suffi-
cient conditions we use in the algorithm to claim that all the invariancy regions inside
a rectangular area has been found.

Corollary 7 Given a rectangle6 on the (Ce , εe )-plane, if its four-corner points pro-
duce the same vector of grouping, the whole rectangle all produce the same vector of
the grouping.

Proof The invariancy regions IR are convex. 




3.3 Restricted LP, reduced restricted LP, and restricted QCP

In this section, we introduce three types of restricted programs: the restricted linear program, the reduced restricted linear program, and the restricted quadratically constrained program. The restricted linear program and the reduced restricted linear program are called “restricted” because the feasible set of these two programs is restricted to the invariancy region of some fixed values of $(C_e, \varepsilon_e)$. The restricted quadratically constrained program, on the other hand, is even more restricted because its feasible region is confined by a single $(C_e, \varepsilon_e)$-pair.
A restricted linear program RLP of the LPCC (5) is obtained by replacing the $\mathrm{LCP}^f_{\mathrm{SVR}}$ with one of its pieces. Since the inputs of the RLP are the index sets $A_i^f$ for all $i = 1, \ldots, 5$ and $f = 1, \ldots, F$, there are $\prod_{f=1}^{F} 5^{n_f}$ many RLPs, defined as follows:

$\mathrm{RLP}(A_i^f \mid i = 1, \ldots, 5,\ f = 1, \ldots, F)$:
$$
\begin{array}{cl}
\displaystyle\min_{C_e,\,\varepsilon_e,\,w_s^f,\,b_s^f,\,p_i,\,\alpha_j,\,\beta_j} & \displaystyle\sum_{f=1}^{F}\ \sum_{i=\mathrm{front}_v^f}^{\mathrm{end}_v^f} p_i \\[3mm]
\text{subject to} & 0 \le \underline{C} \le C_e \le \overline{C}, \qquad 0 \le \underline{\varepsilon} \le \varepsilon_e \le \overline{\varepsilon}, \\[1mm]
\text{and } \forall f = 1, \ldots, F: & \\[1mm]
& (x_d^i)^T w_s^f + b_s^f - y_{d_i} \le p_i, \quad \forall i = \mathrm{front}_v^f, \ldots, \mathrm{end}_v^f, \\[1mm]
& -(x_d^i)^T w_s^f - b_s^f + y_{d_i} \le p_i, \quad \forall i = \mathrm{front}_v^f, \ldots, \mathrm{end}_v^f, \\[1mm]
& w_s^f + \displaystyle\sum_{j=\mathrm{front}^f}^{\mathrm{end}^f} \alpha_j x_d^j - \sum_{j=\mathrm{front}^f}^{\mathrm{end}^f} \beta_j x_d^j = 0, \\[2mm]
& \displaystyle\sum_{j=\mathrm{front}^f}^{\mathrm{end}^f} \alpha_j - \sum_{j=\mathrm{front}^f}^{\mathrm{end}^f} \beta_j = 0, \\[2mm]
& \left.\begin{array}{l} (x_d^j)^T w_s^f + b_s^f - y_{d_j} - \varepsilon_e \ge 0, \\ \alpha_j = C_e, \ \beta_j = 0, \end{array}\right\} \ \forall j \in A_1^f, \\[2mm]
& \left.\begin{array}{l} y_{d_j} - (x_d^j)^T w_s^f - b_s^f - \varepsilon_e \ge 0, \\ \alpha_j = 0, \ \beta_j = C_e, \end{array}\right\} \ \forall j \in A_2^f, \\[2mm]
& \left.\begin{array}{l} C_e \ge \alpha_j \ge 0, \ \beta_j = 0, \\ y_{d_j} = (x_d^j)^T w_s^f + b_s^f - \varepsilon_e, \end{array}\right\} \ \forall j \in A_3^f, \\[2mm]
& \left.\begin{array}{l} \alpha_j = 0, \ C_e \ge \beta_j \ge 0, \\ y_{d_j} = (x_d^j)^T w_s^f + b_s^f + \varepsilon_e, \end{array}\right\} \ \forall j \in A_4^f, \\[2mm]
& \left.\begin{array}{l} \varepsilon_e - (x_d^j)^T w_s^f - b_s^f + y_{d_j} \ge 0, \\ \varepsilon_e - y_{d_j} + (x_d^j)^T w_s^f + b_s^f \ge 0, \\ \alpha_j = 0, \ \beta_j = 0, \end{array}\right\} \ \forall j \in A_5^f.
\end{array}
\tag{17}
$$
If we look closely at model (17), $\alpha_j$ for $j \in A_1^f$ and $\beta_j$ for $j \in A_2^f$ can be replaced by the single variable $C_e$. The variables $\alpha_j$, $\forall j \in A_2^f, A_4^f, A_5^f$, and $\beta_j$, $\forall j \in A_1^f, A_3^f, A_5^f$, can be eliminated. Thus, we obtain a reduced restricted linear program RRLP as follows:


$\mathrm{RRLP}(A_i^f \mid i = 1, \ldots, 5,\ f = 1, \ldots, F)$:
$$
\begin{array}{cl}
\displaystyle\min_{C_e,\,\varepsilon_e,\,w_s^f,\,b_s^f,\,p_i,\,\alpha_{j \in A_3^f},\,\beta_{j \in A_4^f}} & \displaystyle\sum_{f=1}^{F}\ \sum_{i=\mathrm{front}_v^f}^{\mathrm{end}_v^f} p_i \\[3mm]
\text{subject to} & 0 \le \underline{C} \le C_e \le \overline{C}, \qquad 0 \le \underline{\varepsilon} \le \varepsilon_e \le \overline{\varepsilon}, \\[1mm]
\text{and } \forall f = 1, \ldots, F: & \\[1mm]
& (x_d^i)^T w_s^f + b_s^f - y_{d_i} \le p_i, \quad \forall i = \mathrm{front}_v^f, \ldots, \mathrm{end}_v^f, \\[1mm]
& -(x_d^i)^T w_s^f - b_s^f + y_{d_i} \le p_i, \quad \forall i = \mathrm{front}_v^f, \ldots, \mathrm{end}_v^f, \\[1mm]
& w_s^f + C_e \displaystyle\sum_{j \in A_1^f} x_d^j - C_e \sum_{j \in A_2^f} x_d^j + \sum_{j \in A_3^f} \alpha_j x_d^j - \sum_{j \in A_4^f} \beta_j x_d^j = 0, \\[2mm]
& |A_1^f|\, C_e - |A_2^f|\, C_e + \displaystyle\sum_{j \in A_3^f} \alpha_j - \sum_{j \in A_4^f} \beta_j = 0, \\[2mm]
& (x_d^j)^T w_s^f + b_s^f - y_{d_j} - \varepsilon_e \ge 0, \quad \forall j \in A_1^f, \\[1mm]
& y_{d_j} - (x_d^j)^T w_s^f - b_s^f - \varepsilon_e \ge 0, \quad \forall j \in A_2^f, \\[1mm]
& \left.\begin{array}{l} C_e \ge \alpha_j \ge 0, \ \beta_j = 0, \\ y_{d_j} = (x_d^j)^T w_s^f + b_s^f - \varepsilon_e, \end{array}\right\} \ \forall j \in A_3^f, \\[2mm]
& \left.\begin{array}{l} \alpha_j = 0, \ C_e \ge \beta_j \ge 0, \\ y_{d_j} = (x_d^j)^T w_s^f + b_s^f + \varepsilon_e, \end{array}\right\} \ \forall j \in A_4^f, \\[2mm]
& \left.\begin{array}{l} \varepsilon_e - (x_d^j)^T w_s^f - b_s^f + y_{d_j} \ge 0, \\ \varepsilon_e - y_{d_j} + (x_d^j)^T w_s^f + b_s^f \ge 0, \end{array}\right\} \ \forall j \in A_5^f,
\end{array}
$$
where $|A_i^f|$ denotes the cardinality of the set $A_i^f$.
In addition to the two linear restricted programs, a fixed $(C_e, \varepsilon_e)$ pair allows us to formulate a restricted convex quadratically constrained program RQCP as follows:


$\mathrm{RQCP}(C_e^{\mathrm{fix}}, \varepsilon_e^{\mathrm{fix}})$:
$$
\begin{array}{cl}
\displaystyle\min_{w_s^f,\,b_s^f,\,p_i,\,\alpha_j,\,\beta_j} & \displaystyle\sum_{f=1}^{F}\ \sum_{i=\mathrm{front}_v^f}^{\mathrm{end}_v^f} p_i \\[3mm]
\text{subject to} & \forall f = 1, \ldots, F: \\[1mm]
& (x_d^i)^T w_s^f + b_s^f - y_{d_i} \le p_i, \quad \forall i = \mathrm{front}_v^f, \ldots, \mathrm{end}_v^f, \\[1mm]
& -(x_d^i)^T w_s^f - b_s^f + y_{d_i} \le p_i, \quad \forall i = \mathrm{front}_v^f, \ldots, \mathrm{end}_v^f, \\[1mm]
& w_s^f + \displaystyle\sum_{j=\mathrm{front}^f}^{\mathrm{end}^f} \alpha_j x_d^j - \sum_{j=\mathrm{front}^f}^{\mathrm{end}^f} \beta_j x_d^j = 0, \\[2mm]
& \displaystyle\sum_{j=\mathrm{front}^f}^{\mathrm{end}^f} \alpha_j - \sum_{j=\mathrm{front}^f}^{\mathrm{end}^f} \beta_j = 0, \\[2mm]
& \Big(\displaystyle\sum_{j=\mathrm{front}^f}^{\mathrm{end}^f} e_{s_j}\Big)\, C_e^{\mathrm{fix}} + \Big(\sum_{j=\mathrm{front}^f}^{\mathrm{end}^f} (\alpha_j + \beta_j)\Big)\, \varepsilon_e^{\mathrm{fix}} \\[2mm]
& \quad + \Big(\displaystyle\sum_{j=\mathrm{front}^f}^{\mathrm{end}^f} (\beta_j - \alpha_j)\, x_d^j\Big)^T \Big(\sum_{i=\mathrm{front}^f}^{\mathrm{end}^f} (\beta_i - \alpha_i)\, x_d^i\Big) + \sum_{j=\mathrm{front}^f}^{\mathrm{end}^f} (\alpha_j - \beta_j)\, y_{d_j} = 0,
\end{array}
\tag{18}
$$
where the convex quadratic constraints are the aggregation of the complementarities. We postpone the derivation of the convex quadratic constraints of model (18) to Sect. 4.
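The quadratic equality in (18) is the aggregation of the three complementarity products of (4): forcing their nonnegative sum to zero enforces each complementarity individually on the feasible set. The short numerical check below illustrates this identity for arbitrary multipliers satisfying the two linear equations of (4); all data and names are synthetic assumptions, and the full derivation is the one postponed to Sect. 4.

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 6, 2
X = rng.standard_normal((n, K)); y = rng.standard_normal(n)
C_e, eps_e = 1.5, 0.2
alpha, beta, e_s = rng.uniform(0, 1, n), rng.uniform(0, 1, n), rng.uniform(0, 1, n)
alpha -= (alpha.sum() - beta.sum()) / n        # enforce sum(alpha) == sum(beta)
w = X.T @ (beta - alpha)                       # stationarity of (4): w = sum (beta-alpha) x
b = rng.standard_normal()

# sum of the three complementarity products in (4)
s1 = e_s + eps_e - X @ w - b + y
s2 = e_s + eps_e + X @ w + b - y
s3 = C_e - alpha - beta
lhs = alpha @ s1 + beta @ s2 + e_s @ s3

# aggregated quadratic expression appearing in (18)
rhs = (e_s.sum() * C_e + (alpha + beta).sum() * eps_e
       + w @ w + (alpha - beta) @ y)

print(abs(lhs - rhs))   # ~0 up to round-off
```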
The objective values obtained from solving the RLP, RRLP and RQCP are all upper bounds on the optimal value of the original problem (5). The grouping vectors obtained from the three models are identical. Thus, for the purpose of obtaining the grouping and the piece, solving the quadratic program RQCP can be a substitute for solving the linear complementarity program $\mathrm{LCP}^f_{\mathrm{SVR}}$. Based on our numerical experiments, solving an RQCP is more time-consuming than solving an $\mathrm{LCP}^f_{\mathrm{SVR}}$ plus an RLP. We only use the RQCP when the $\mathrm{LCP}^f_{\mathrm{SVR}}$ or the RLP fails to be solved.

3.4 Invariancy interval along a chosen line

Identifying the invariancy interval on a line is not as complicated as identifying the invariancy region of a point $(C_e, \varepsilon_e)$ (see the methods in Ghaffari-Hadigheh et al. 2010; Bemporad et al. 2002). The line passing through a point $(C_e, \varepsilon_e) = (\widehat{C}, \widehat{\varepsilon})$ is either a vertical line expressed as
$$
C_e = \widehat{C},
$$
or it is of the form
$$
\varepsilon_e = L_a C_e + L_b,
$$
where $L_a$ and $L_b$ are the slope and intercept, respectively, such that the line passes through $(\widehat{C}, \widehat{\varepsilon})$. The invariancy interval $[(\widehat{C}_1, \widehat{\varepsilon}_1), (\widehat{C}_2, \widehat{\varepsilon}_2)]$ can be obtained by solving the following four linear optimization problems:
$$
\begin{array}{rl}
C_{\max} \backslash C_{\min} = \max \backslash \min & C_e \\
\text{subject to} & \varepsilon_e = L_a C_e + L_b\ \ (\text{or } C_e = \widehat{C}), \\
& \text{and the constraints in (17)},
\end{array}
\tag{19}
$$
and
$$
\begin{array}{rl}
\varepsilon_{\max} \backslash \varepsilon_{\min} = \max \backslash \min & \varepsilon_e \\
\text{subject to} & \varepsilon_e = L_a C_e + L_b\ \ (\text{or } C_e = \widehat{C}), \\
& \text{and the constraints in (17)}.
\end{array}
\tag{20}
$$
The solution to (19) and (20) is a line segment $[(\widehat{C}_1, \widehat{\varepsilon}_1), (\widehat{C}_2, \widehat{\varepsilon}_2)]$ that belongs to one of the following four cases:
(i) On $C_e = \widehat{C}$ (a vertical line): $\widehat{C}_1 = \widehat{C}$, $\widehat{\varepsilon}_1 = \varepsilon_{\max}$, $\widehat{C}_2 = \widehat{C}$, and $\widehat{\varepsilon}_2 = \varepsilon_{\min}$.
(ii) On $\varepsilon_e = L_a C_e + L_b$ with $L_a = 0$, $L_b = \widehat{\varepsilon}$ (a horizontal line): $\widehat{C}_1 = C_{\max}$, $\widehat{\varepsilon}_1 = \widehat{\varepsilon}$, $\widehat{C}_2 = C_{\min}$, and $\widehat{\varepsilon}_2 = \widehat{\varepsilon}$.
(iii) On $\varepsilon_e = L_a C_e + L_b$ with $L_a$ positive: $\widehat{C}_1 = C_{\max}$, $\widehat{\varepsilon}_1 = \varepsilon_{\max}$, $\widehat{C}_2 = C_{\min}$, and $\widehat{\varepsilon}_2 = \varepsilon_{\min}$.
(iv) On $\varepsilon_e = L_a C_e + L_b$ with $L_a$ negative: $\widehat{C}_1 = C_{\max}$, $\widehat{\varepsilon}_1 = \varepsilon_{\min}$, $\widehat{C}_2 = C_{\min}$, and $\widehat{\varepsilon}_2 = \varepsilon_{\max}$.
We propose a procedure to find all the groupings and invariancy intervals along the boundaries of a given rectangle $\{\overline{C}, \underline{C}, \overline{\varepsilon}, \underline{\varepsilon}\}$. When searching along a boundary line, either $\varepsilon_e$ or $C_e$ is fixed at the corresponding value. The procedure starts by solving for the grouping (or the set of groupings, in the case of degeneracy) at a vertex of the boundary and finding the invariancy interval $[C_{\min}, C_{\max}]$ or $[\varepsilon_{\min}, \varepsilon_{\max}]$. When there is more than one grouping, we first compute the invariancy intervals for each of the possible groupings. The interval with the farthest endpoint from the starting point is then chosen to represent the invariancy interval of the starting point, and its corresponding grouping is chosen to represent the grouping of the starting point. After obtaining the endpoint of the interval, to continue the search for a new grouping vector, we select a new starting point that is outside of, and deviates by a very small amount from, the endpoint of the current interval. The deviation needs to be small enough to make sure that no groupings are missed and that the invariancy interval containing the new starting point is adjacent to the current interval. During the procedure, the numbers of invariancy intervals on each side of the boundary are recorded by countTop, countLeft, countBottom, and countRight. At the end of the procedure, a set of grouping vectors GroupingVFound and the least objective value LeastUpperBound are obtained. It should be noted that in the situation of having degeneracy at a point, all the groupings found during the process should be recorded in the set GroupingVFound, but only the invariancy intervals chosen to represent a point should be counted in countTop, countLeft, countBottom, and countRight.


Algorithm 8 Identifying Groupings on Boundaries of a Rectangle.

Step 0. Input.
  Set the parameter deviation (0.0001, for example).
  Initialize countTop, countLeft, countBottom, and countRight to 0.
  Initialize LeastUpperBound at any valid upper bound.
  Initialize the set GroupingVFound.
  Exogenously find the sets of grouping vectors at the four corners $(\bar{C}, \bar{\varepsilon})$, $(\underline{C}, \underline{\varepsilon})$, $(\bar{C}, \underline{\varepsilon})$, and $(\underline{C}, \bar{\varepsilon})$ and name them GroupingVSet$^{uu}$, GroupingVSet$^{ll}$, GroupingVSet$^{ul}$, and GroupingVSet$^{lu}$, respectively.
Step 1. Search on the horizontal line $\varepsilon_e = \bar{\varepsilon}$.
  Initialize the set GroupingVSet$^{top}$ = GroupingVSet$^{uu}$.
  1a: For every piece corresponding to a member of GroupingVSet$^{top}$, solve (19) subject to $\varepsilon_e = \bar{\varepsilon}$ to obtain the invariancy intervals. Let $[C_{\min}, C_{\max}]$ be the largest interval among them. Do the Add-In procedure (explained later) when repeating. Increase countTop by 1.
  1b: Solve $\mathrm{LCP}^f_{\mathrm{SVR}}$ at $(C_{\min}, \bar{\varepsilon})$ and obtain the set of grouping vectors.
  1c: Replace GroupingVSet$^{top}$ by the set of grouping vectors obtained in 1b. If any members of GroupingVSet$^{top}$ are not in the set GroupingVFound, add them to the latter set.
  1d: If the objective value of RLP is smaller than LeastUpperBound, update LeastUpperBound.
  1e: If $C_{\min}$ is greater than $\underline{C}$, let newStarting = $(C_{\min} - deviation, \bar{\varepsilon})$.
  1f: Solve $\mathrm{LCP}^f_{\mathrm{SVR}}$ at newStarting and obtain the set of grouping vectors. Do as in 1c–1d.
  1g: Repeat 1a–1d until $C_{\min} = \underline{C}$.
Steps 2–4. Search on the other three sides of the boundary. See Appendix B.
Step 5. Output. Return countTop, countLeft, countBottom, and countRight. Return the set GroupingVFound and the value LeastUpperBound. □
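For readers who prefer the pseudocode in executable form, the following is a minimal Python sketch of Step 1 of Algorithm 8, under the assumption that solve_lcp_svr, grouping_of, solve_rlp, and invariancy_interval_on_line are hypothetical wrappers for the LCP/LP solves described in the text; they are placeholders, not the authors' code.

```python
def search_top_boundary(c_lo, c_hi, eps_bar, deviation=1e-4):
    """Sketch of the search along the top boundary eps_e = eps_bar."""
    groupings_found, least_upper_bound = set(), float("inf")
    count_top = 0
    point = (c_hi, eps_bar)                                   # corner (C_bar, eps_bar)
    while True:
        groupings = grouping_of(solve_lcp_svr(point))         # steps 1b / 1f
        for g in groupings:
            groupings_found.add(g)                            # step 1c
            least_upper_bound = min(least_upper_bound, solve_rlp(g))  # step 1d
        # step 1a: among the pieces at this point, keep the interval whose far
        # endpoint (the smaller Ce value) reaches farthest to the left
        c_min, c_max = min((invariancy_interval_on_line(g, eps_bar)
                            for g in groupings), key=lambda iv: iv[0])
        count_top += 1
        if c_min <= c_lo:                                     # step 1g
            return groupings_found, least_upper_bound, count_top
        point = (c_min - deviation, eps_bar)                  # step 1e
```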


We have noticed from the experiments that the invariancy intervals obtained at Steps 1a, 2a, 3a, and 4a are sometimes problematic, possibly because of arithmetic errors. An interval $I_B$ is said to be appropriately located adjacent to the previous interval $I_A$ if
$$ C_{\max}^B \le C_{\min}^A \quad \text{and} \quad C_{\min}^B < C_{\max}^B. \qquad (21) $$

Figure 2 illustrates an invariancy interval $I_B = [C_{\min}^B, C_{\max}^B]$ that is located appropriately, given the location of its adjacent interval $I_A = [C_{\min}^A, C_{\max}^A]$. Figure 3 shows two examples of problematic intervals on a horizontal boundary. In these examples, the second intervals ($I_B$) are not properly adjacent to the first intervals ($I_A$). In the case of overlapping, the locations of the intervals are not appropriate because $C_{\max}^B > C_{\min}^B > C_{\min}^A$; in the case of repeating, we get $C_{\max}^B = C_{\min}^B$. The overlapping case is usually due to arithmetic errors and can cause an endless loop in Algorithm 8, while the repeating case can be a natural result of a single-valued invariancy interval.

Fig. 2 Appropriate invariancy interval $I_B = [C_{\min}^B, C_{\max}^B]$ subsequent and adjacent to the interval $I_A$ on a line with fixed $\varepsilon_e$. newStarting is set at $(C_{\min}^A - deviation, \bar{\varepsilon})$ if searching on the line $\varepsilon_e = \bar{\varepsilon}$ at step 1e of Algorithm 8 (or at steps 2e, 3e, and 4e if searching on the other line functions of the boundary)

Fig. 3 Problematic invariancy intervals $I_B = [C_{\min}^B, C_{\max}^B]$ subsequent to the interval $I_A$ on a line with fixed $\varepsilon_e$
We thus propose a small add-in to screen out and modify problematic intervals that violate the conditions in (21), thereby enforcing the forward search. The following add-in procedure should be invoked in Algorithm 8 after the second repetition of step 1a and before 1b. The add-ins used between steps 2a and 2b, 3a and 3b, and 4a and 4b are similar.

Add-In: Enforce the Forward Searching.

0. Input: The invariancy interval obtained from Step 1a of Algorithm 8: $[C_{\min}, C_{\max}]$. The previous$^7$ invariancy interval $[C_{\min}^{adj}, C_{\max}^{adj}]$. Parameter perturbation (say 0.001). Parameter counter = 1. All parameters and containers used in Algorithm 8.
1. Check: If $C_{\max} \le C_{\min}^{adj}$ and $C_{\min} < C_{\max}$, this is an appropriate invariancy interval. Stop.
2. Perturb: Let newStarting = $(C_{\min}^{adj} - deviation - counter \times perturbation, \bar{\varepsilon})$. Solve $\mathrm{LCP}^f_{\mathrm{SVR}}$ at newStarting and obtain the set of grouping vectors. Do the same as in steps 1c and 1d of Algorithm 8.
3. Solve for a new interval: Solve (19) subject to $\varepsilon_e = \bar{\varepsilon}$ and the piece from GroupingVSet$^{top}$, and obtain the invariancy interval $[C_{\min}, C_{\max}]$. Increase counter by 1. Go to 1. □

7 If there is no previous interval, there is no need to run the add-in.
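A minimal sketch of the add-in logic, assuming the same hypothetical helpers as in the previous sketch, is as follows.

```python
def enforce_forward_search(interval, prev_interval, eps_bar,
                           deviation=1e-4, perturbation=1e-3):
    """Repeat the perturbation step until condition (21) holds."""
    c_min, c_max = interval
    c_min_adj, _ = prev_interval
    counter = 1
    while not (c_max <= c_min_adj and c_min < c_max):          # check (21)
        new_start = (c_min_adj - deviation - counter * perturbation, eps_bar)
        groupings = grouping_of(solve_lcp_svr(new_start))       # step 2
        piece = next(iter(groupings))
        c_min, c_max = invariancy_interval_on_line(piece, eps_bar)  # step 3
        counter += 1
    return c_min, c_max
```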


Figure 4 provides one example of using Algorithm 8 with the add-in to identify the invariancy intervals along the four boundaries: the top boundary ($\varepsilon_e = \bar{\varepsilon} = 1$ in this example), the left-hand-side boundary ($C_e = \underline{C} = 1$), the bottom boundary ($\varepsilon_e = \underline{\varepsilon} = 0.1$), and the right-hand-side boundary ($C_e = \bar{C} = 10$). In Fig. 4, all appropriate and problematic intervals are shown sequentially. The parameter deviation is set at 0.0001 and perturbation is set at 0.001. This specific instance (35_35_5_2) has two folds, and each fold has 35 training and 35 validation data points with 5 features. Among the intervals obtained, intervals #12, #20, #25, #26, #27, #28, #30, #32, #33, #35, #36, #38, #39, and #41 are problematic intervals at which the add-in is applied to enforce the forward search. The problematic intervals are not counted in the number of invariancy intervals on each side of the boundary. Therefore, at the end of the algorithm, we obtain countTop, countLeft, countBottom, and countRight of 2, 13, 2, and 11, respectively.

Fig. 4 An example of searching along the boundaries of a rectangle and identifying the invariancy intervals using Algorithm 8 with the add-in

3.5 Main algorithm

The $(C_e, \varepsilon_e)$-rectangle search algorithm explores the possible groupings and the corresponding objective values of the restricted linear programs for each area $[\underline{C}, \bar{C}] \times [\underline{\varepsilon}, \bar{\varepsilon}]$ in the queue AreasToBeSearched. We only examine, remove, and partition areas that are rectangular. The areas are kept rectangular mainly for the convenience of partitioning and the ease of searching for invariancy intervals as described in Sect. 3.4. The search for groupings and invariancy intervals starts at the four vertices and proceeds along the boundaries of the rectangles. We do not search in the interior of the rectangles but infer the interior conditions from the information on the invariancy intervals gathered along the boundary. If we cannot conclude that all groupings in a rectangle have been realized, we partition the rectangle into smaller rectangles at the midpoints$^8$ of the invariancy intervals. The process proceeds to sequentially search the rectangular areas decomposed from the initial box.

In the 1st stage of the algorithm, we try to explore as many groupings as possible and eliminate the areas containing only one grouping. By Corollary 3.2, a rectangular area is guaranteed to have only one grouping when the same grouping vector is obtained at its four vertices. A rectangular area with this property requires no further partitioning and is eliminated (in step 2d of Algorithm 9) from the list/queue AreasToBeSearched. The total area of the eliminated rectangles is recorded in AreaRealizedInThe1stStage.

When the number of groupings is not greater than 2 on any side of the boundary of a rectangle, the 2nd stage of the algorithm is activated for that rectangle. The condition that initializes the 2nd stage is stated in step 4 of Algorithm 9. For a rectangle passed to the 2nd stage, there are two possible outcomes. One, the rectangle is eliminated from AreasToBeSearched permanently because no other invariancy regions are in the interior (explained in Sect. 3.6). Two, the rectangle is partitioned into smaller rectangles. By updating the queue AreasToBeSearched, these new rectangles are passed back to the 1st stage. At the end of the algorithm (consisting of the 1st and 2nd stages), a least upper bound (LeastUpperBound) for the parameter training and validation SVR model is obtained.

8 Partitioning at the endpoints is also a theoretically valid strategy. We choose to partition at the midpoints rather than the endpoints to avoid the loss of information due to arithmetic imprecision at the dividing line. The drawback is that most of the rectangles are not eliminated in the 1st stage but have to be passed to the 2nd stage.

Algorithm 9 $(C_e, \varepsilon_e)$-Rectangle Search Algorithm

Step 0. Declaration.
  Initialize the tuple Area1 := $\{\bar{C}, \underline{C}, \bar{\varepsilon}, \underline{\varepsilon}\}$.
  Initialize the list of areas AreasToBeSearched = {Area1}.
  Initialize the set GroupingVFound = ∅.
  Initialize LeastUpperBound = 0.
  Initialize AreaRealizedInThe1stStage = 0.
  Initialize the parameter insensitive (say 0.00001).
Step 1. Get the first entry of the list AreasToBeSearched.
  if AreasToBeSearched is empty then
    Terminate the algorithm.
  else
    Let the first entry of the list AreasToBeSearched be Area$^{current}$.
    Let $\bar{C}$ = Area$^{current}$(1), $\underline{C}$ = Area$^{current}$(2), $\bar{\varepsilon}$ = Area$^{current}$(3), and $\underline{\varepsilon}$ = Area$^{current}$(4).
    (Note: Area$^{current}$(k) denotes the kth entry of the tuple Area$^{current}$.)
  end if
Step 2. Find the sets of groupings at the four vertices of Area$^{current}$.
  2a: For f = 1, ..., F, solve $\mathrm{LCP}^f_{\mathrm{SVR}}$ with $(C_e, \varepsilon_e)$ fixed at $(\bar{C}, \bar{\varepsilon})$, $(\underline{C}, \underline{\varepsilon})$, $(\bar{C}, \underline{\varepsilon})$, and $(\underline{C}, \bar{\varepsilon})$ using Algorithm 12.
  2b: Use Algorithm 4 to transform the solutions into the sets of grouping vectors GroupingVSet$^{uu}$, GroupingVSet$^{ll}$, GroupingVSet$^{ul}$, and GroupingVSet$^{lu}$, respectively.
  2c: Check whether the members of GroupingVSet$^{uu}$, GroupingVSet$^{ll}$, GroupingVSet$^{ul}$, and GroupingVSet$^{lu}$ are in the set GroupingVFound. If not, add them to the set.
  2d:
  if there exist grouping vectors GroupingV$^{uu}$ ∈ GroupingVSet$^{uu}$, GroupingV$^{ll}$ ∈ GroupingVSet$^{ll}$, GroupingV$^{ul}$ ∈ GroupingVSet$^{ul}$, and GroupingV$^{lu}$ ∈ GroupingVSet$^{lu}$ such that
    GroupingV$^{uu}$ = GroupingV$^{ll}$ = GroupingV$^{ul}$ = GroupingV$^{lu}$ (the 1st stage)
  then
    Solve the corresponding RLP and let the objective value be $V_{UB}$.
    Let LeastUpperBound ← $V_{UB}$ if $V_{UB}$ < LeastUpperBound.
    AreaRealizedInThe1stStage = AreaRealizedInThe1stStage + $(\bar{C} - \underline{C}) \times (\bar{\varepsilon} - \underline{\varepsilon})$.
    Go to Step 1 after eliminating Area$^{current}$ from AreasToBeSearched.
  else
    Continue to Step 3.
  end if
Step 3. Find groupings along the four boundaries of Area$^{current}$.
  if $\bar{C} - \underline{C}$ < insensitive then
    Apply Algorithm 8 but skip steps 1, 3, and 4.
    Go to Step 1 after eliminating Area$^{current}$ from the list AreasToBeSearched.
  else if $\bar{\varepsilon} - \underline{\varepsilon}$ < insensitive then
    Apply Algorithm 8 but skip steps 2, 3, and 4.
    Go to Step 1 after eliminating Area$^{current}$ from AreasToBeSearched.
  else
    Apply Algorithm 8. Continue to Step 4.
  end if
Step 4. Check countTop, countLeft, countBottom, and countRight.
  if countTop ≤ 2 and countLeft ≤ 2 and countBottom ≤ 2 and countRight ≤ 2 then
    Pass Area$^{current}$ to Algorithm 10 (the 2nd stage).
    Go to Step 1 after eliminating Area$^{current}$ from the list AreasToBeSearched.
  else
    Continue to Step 5.
  end if
Step 5. Partition the area into small rectangles.
  5a: Let countHorizontalBreakPoint = max(countLeft, countRight). If countHorizontalBreakPoint = countLeft, collect the $\varepsilon_e$ values of the midpoints of the invariancy intervals along the left-hand-side boundary in the set HorizontalBreakPointsSet. Otherwise, collect the $\varepsilon_e$ values of the midpoints from the right-hand-side boundary.
  5b: Let countVerticalBreakPoint = max(countTop, countBottom). If countVerticalBreakPoint = countTop, collect the $C_e$ values of the midpoints of the invariancy intervals along the top boundary in the set VerticalBreakPointsSet. Otherwise, collect the $C_e$ values of the midpoints from the bottom boundary.
  5c: Cut the rectangular area at each $\varepsilon_e$ ∈ HorizontalBreakPointsSet and each $C_e$ ∈ VerticalBreakPointsSet.
  5d: Eliminate Area$^{current}$ and add a total of (countHorizontalBreakPoint + 1) × (countVerticalBreakPoint + 1) areas to the front of the list AreasToBeSearched.
  5e: Go to Step 1 after partitioning.
Output: LeastUpperBound (global optimum), GroupingVFound, AreaRealizedInThe1stStage. □
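The overall control flow of Algorithm 9 can be summarized by the following Python sketch; first_stage_same_grouping, search_boundaries, second_stage, and partition_at_midpoints are hypothetical placeholders for Steps 2–5, not the authors' implementation.

```python
from collections import deque

def rectangle_search(c_lo, c_hi, eps_lo, eps_hi):
    """Sketch of the (Ce, eps_e)-rectangle search outer loop."""
    areas = deque([(c_hi, c_lo, eps_hi, eps_lo)])            # the initial box
    least_upper_bound = float("inf")
    groupings_found = set()
    while areas:
        area = areas.popleft()
        # 1st stage: a single grouping at all four vertices realizes the area.
        same, bound, groupings = first_stage_same_grouping(area)
        groupings_found |= set(groupings)
        if same:
            least_upper_bound = min(least_upper_bound, bound)
            continue
        counts, boundary_info = search_boundaries(area)      # Algorithm 8
        if all(c <= 2 for c in counts):
            areas.extend(second_stage(area, boundary_info))  # Algorithm 10
        else:
            # add the sub-rectangles to the front of the queue (step 5d)
            areas.extendleft(partition_at_midpoints(area, boundary_info))
    return least_upper_bound, groupings_found
```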


We do not explicitly find the edges of invariancy regions in the algorithm. Instead,
the allocation of the invariancy intervals along the boundaries of each rectangular area
is monitored. Maintaining the search area as a rectangle is especially convenient in
partitioning and eliminating, but the drawback is the possibility of revisiting the same
invariancy region many times. One can compare this method with the partitioning
technique in Bemporad et al. (2002) and the graph identifying technique in Ghaffari-
Hadigheh et al. (2010).


3.6 The 2nd stage: identify the non-vertical and non-horizontal boundary of invariancy region

Examining vertices, which is the criterion in the 1st stage, is quick and convenient for removing some areas from AreasToBeSearched. However, the convexity of the invariancy region implies that a boundary of the region is not a curve but is not necessarily a vertical or a horizontal line. The 1st stage alone is not enough to clear the queue AreasToBeSearched. The 2nd stage of the algorithm is proposed to resolve this issue. Note that any rectangular area is in one of the following three statuses. 1: belonging to one invariancy region, 2: belonging to two invariancy regions, and 3: belonging to more than two invariancy regions. The 1st stage of the algorithm tackles areas of status 1, and the 2nd stage of the algorithm aims to tackle areas of status 2. If a rectangle is of status 3, it is decomposed again.
Given that the number of groupings is no more than 2 on any side of the boundary, we want to see whether a single straight line splits the input rectangular area into two invariancy regions, thus concluding the realization of the area. There are a total of six possible cases, as shown in Fig. 5. In the figure, a node denotes where the task of finding the grouping vector and solving the restricted linear program is done. An arrow denotes the task of solving for the invariancy interval given the grouping vector. In Case 1, the dividing line passes through the top and left-hand-side boundaries; in Case 2, through the top and bottom boundaries; in Case 3, through the top and right-hand-side boundaries; in Case 4, through the left-hand-side and right-hand-side boundaries; in Case 5, through the left-hand-side and bottom boundaries; and in Case 6, through the bottom and right-hand-side boundaries.
To verify whether a rectangular area fits one of the six cases, we first check the number of invariancy intervals on each side of the boundary, denoted countTop, countLeft, countBottom, and countRight in the 2nd stage of the algorithm. We then compare the vectors of the groupings obtained. The following algorithm describes the details of verifying the six cases and the way exceptions are handled.

Algorithm 10 The 2nd Stage Subroutine


Step 0: Input an Area := $\{\bar{C}, \underline{C}, \bar{\varepsilon}, \underline{\varepsilon}\}$ with no more than two invariancy intervals on each side of the boundary. Set the parameter insensitive to the same value as in Algorithm 9.
Step 1: Do Algorithm 8 to get countTop, countLeft, countBottom, and countRight. During the process of Algorithm 8, additionally maintain four lists of grouping vectors whose entries are the grouping vectors identified on the top boundary, left-hand-side boundary, bottom boundary, and right-hand-side boundary, respectively:

GroupingVList$^{Top}$, GroupingVList$^{Left}$, GroupingVList$^{Bottom}$, and GroupingVList$^{Right}$.

The entries in the lists are in the order in which they are obtained. We denote the first entry of a list by "(1)" adjacent to the name of the list and the second entry, if it exists, by "(2)".


Fig. 5 The six cases in which the rectangular area is bisected by a straight line into two invariancy regions
A1 and A2 on the (Ce , εe )-plane

Step 2: Find the one of the following cases that fits the reality.

1: countTop = countLeft = 2 and countBottom = countRight = 1
2: countTop = countBottom = 2 and countLeft = countRight = 1
3: countTop = countRight = 2 and countLeft = countBottom = 1
4: countTop = countBottom = 1 and countLeft = countRight = 2
5: countTop = countRight = 1 and countLeft = countBottom = 2
6: countTop = countLeft = 1 and countBottom = countRight = 2
7: None of the above.

switch s do
  case 1
    if GroupingVList$^{Top}$(1) = GroupingVList$^{Left}$(2) and GroupingVList$^{Top}$(2) = GroupingVList$^{Left}$(1) then
      The Area satisfies Case 1. Terminate the 2nd stage.
    else
      Go to Step 3, Exception 2.
    end if
  end case
  case 2–6
    See Appendix C.
  end case
  case 7
    Go to Step 3, Exception 1.
  end case
end switch
Step 3.
Exception 1: more than two groupings.
  Partition at the middle of the rectangular area.
  Update AreasToBeSearched of Algorithm 9 by removing Area from it and adding the four resulting small areas to it.
Exception 2: unknown situation.
  Let the randomly picked interior point be $(C^r, \varepsilon^r)$. Partition the Area into the following four areas:
  $\{\bar{C} - insensitive,\; C^r + insensitive,\; \bar{\varepsilon} - insensitive,\; \varepsilon^r + insensitive\}$,
  $\{\bar{C} - insensitive,\; C^r + insensitive,\; \varepsilon^r - insensitive,\; \underline{\varepsilon} + insensitive\}$,
  $\{C^r - insensitive,\; \underline{C} + insensitive,\; \bar{\varepsilon} - insensitive,\; \varepsilon^r + insensitive\}$,
  $\{C^r - insensitive,\; \underline{C} + insensitive,\; \varepsilon^r - insensitive,\; \underline{\varepsilon} + insensitive\}$.
  Update AreasToBeSearched of Algorithm 9 by removing Area from it and adding the four resulting small areas to it. □
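The case check in Step 2 amounts to matching the four interval counts against six patterns and then comparing grouping lists. A minimal sketch, with illustrative names only and with only Case 1 spelled out, follows.

```python
# counts is (countTop, countLeft, countBottom, countRight); the grouping lists
# are those maintained in Step 1.  All names here are placeholders.
CASES = {
    (2, 2, 1, 1): 1, (2, 1, 2, 1): 2, (2, 1, 1, 2): 3,
    (1, 2, 1, 2): 4, (1, 2, 2, 1): 5, (1, 1, 2, 2): 6,
}

def second_stage_case(counts, top_list, left_list):
    case = CASES.get(tuple(counts), 7)          # 7: none of the six patterns
    if case == 7:
        return "exception_1"                    # more than two groupings
    if case == 1:
        # Case 1: the dividing line crosses the top and left-hand-side
        # boundaries, so both boundaries see the same pair of groupings,
        # in opposite order.
        if top_list[0] == left_list[1] and top_list[1] == left_list[0]:
            return "case_1_realized"
        return "exception_2"                    # unknown situation
    return "see_appendix_C"                     # cases 2-6 handled analogously
```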

An example of Exception 1 in step 3 of Algorithm 10 is shown in Fig. 6. Such an area activates the 2nd stage at Step 4 of the 1st stage, but is then sent to Exception 1 in step 3 via item 7 in Step 2 of the 2nd stage of the algorithm. We can see that there are in fact more than two groupings, and thus more than two non-vertical and non-horizontal boundaries inside the area. This area is split into four small rectangles. Exception 2 of step 3 captures the unknown cases resulting from either the failure to solve LCPs and LPs or arithmetic errors. For the areas sent to Exception 1, a conclusion about the realization of groupings can be made after further partitioning; but the areas sent to Exception 2 can, in the worst case, only be partitioned and cropped until they are too thin to contain any unrevealed groupings.

Fig. 6 Example for Exception 1 in Step 3 of the 2nd stage. There are three invariancy regions, A1, A2 and A3

Fig. 7 $(C_e, \varepsilon_e)$-rectangle search algorithm illustration
An overview of running the whole algorithm on the $(C_e, \varepsilon_e)$-plane is shown in Fig. 7. A node denotes a $(C_e, \varepsilon_e)$ point at which the $\mathrm{LCP}^f_{\mathrm{SVR}}$ is solved; the number marked next to a node, if it exists, denotes the objective value of the associated restricted linear program; an arrow in the second figure denotes the direction of the search for the invariancy interval, associated with the starting $(C_e, \varepsilon_e)$ point of that arrow. The 1st and 2nd stages can both be seen in this illustration, but not all details of Algorithms 9 and 10 are shown in the overview.


4 Integer program with the big-numbers tightening procedure

In this section, we propose a technique for finding and tightening the valid big numbers $\theta_{1j}, \theta_{2j}, \theta_{3j}, \theta_{4j}, \theta_{5j}$, and $\theta_{6j}$ that are used in the formulation of (10). The direct effect of tightening the big numbers $\theta_{ij}$, $i = 1, \ldots, 6$, is that the feasible region defined by (10) shrinks. We use these tightened big numbers to form a binary-integer program that can be solved with any commercial IP solver. This integer program with the tightened big numbers is an alternative way to solve the bi-level program (3).
The valid values of $\theta_{1j}$, $\theta_{3j}$, and $\theta_{5j}$ are not related to $e_{sj}$ and are all bounded above by $\bar{C}$. The following optimization programs, with the choice of the objective functions $\alpha_j$, $\beta_j$, and $C_e - \alpha_j - \beta_j$, solve for the valid $\theta_{1j}$, $\theta_{3j}$, and $\theta_{5j}$, respectively.

$$
\begin{array}{ll}
\displaystyle\max_{\substack{C_e,\, \varepsilon_e,\, w_s^1,\, b_s^1,\, w_s^2,\, b_s^2,\, \ldots,\, w_s^F,\, b_s^F,\\ p_i,\, \alpha_j,\, \beta_j,\, e_{sj}}} & \alpha_j \;\backslash\; \beta_j \;\backslash\; C_e - \alpha_j - \beta_j \\[10pt]
\text{subject to} & 0 \le \underline{C} \le C_e \le \bar{C}, \\
& 0 \le \underline{\varepsilon} \le \varepsilon_e \le \bar{\varepsilon}, \\
& (x_d^i)^T w_s^1 + b_s^1 - y_d^i \le p_i, \quad \forall i = n+1, \ldots, n+m_1, \\
& -(x_d^i)^T w_s^1 - b_s^1 + y_d^i \le p_i, \quad \forall i = n+1, \ldots, n+m_1, \\
& (x_d^i)^T w_s^2 + b_s^2 - y_d^i \le p_i, \quad \forall i = n+m_1+1, \ldots, n+m_1+m_2, \\
& -(x_d^i)^T w_s^2 - b_s^2 + y_d^i \le p_i, \quad \forall i = n+m_1+1, \ldots, n+m_1+m_2, \\
& \qquad \vdots \qquad\qquad\qquad\qquad\qquad\qquad\qquad (22)\\
& (x_d^i)^T w_s^F + b_s^F - y_d^i \le p_i, \quad \forall i = \mathrm{front}_v^F, \ldots, \mathrm{end}_v^F, \\
& -(x_d^i)^T w_s^F - b_s^F + y_d^i \le p_i, \quad \forall i = \mathrm{front}_v^F, \ldots, \mathrm{end}_v^F, \\[4pt]
& obj_{LB} \le \displaystyle\sum_{f=1}^{F} \sum_{i=\mathrm{front}_v^f}^{\mathrm{end}_v^f} p_i \le obj_{UB}, \\[8pt]
& \displaystyle\sum_{j=\mathrm{front}^f}^{\mathrm{end}^f} \alpha_j - \sum_{j=\mathrm{front}^f}^{\mathrm{end}^f} \beta_j = 0, \quad \forall f = 1, \ldots, F, \\[8pt]
\text{and} & \Bigg( \displaystyle\sum_{j=\mathrm{front}^f}^{\mathrm{end}^f} e_{sj} \Bigg) C_e + \Bigg( \displaystyle\sum_{j=\mathrm{front}^f}^{\mathrm{end}^f} (\alpha_j + \beta_j) \Bigg) \varepsilon_e
+ \Bigg( \displaystyle\sum_{j=\mathrm{front}^f}^{\mathrm{end}^f} (\beta_j - \alpha_j) x_d^j \Bigg)^T \Bigg( \displaystyle\sum_{i=\mathrm{front}^f}^{\mathrm{end}^f} (\beta_i - \alpha_i) x_d^i \Bigg) \qquad (23)\\[8pt]
& \quad + \displaystyle\sum_{j=\mathrm{front}^f}^{\mathrm{end}^f} (-\alpha_j + \beta_j) b_s^f + \sum_{j=\mathrm{front}^f}^{\mathrm{end}^f} (\alpha_j - \beta_j) y_d^j = 0, \quad \forall f = 1, \ldots, F.
\end{array}
$$

The last $F$ equalities of the above model are the sums of zeros over the $F$ folds, obtained by summing the products of the two sides of the $\perp$ sign. As a result, we get the quadratic term $\big(\sum_{j=\mathrm{front}^f}^{\mathrm{end}^f} (\beta_j - \alpha_j) x_d^j\big)^T \big(\sum_{i=\mathrm{front}^f}^{\mathrm{end}^f} (\beta_i - \alpha_i) x_d^i\big)$ in each equality. Meanwhile, the term $\sum_{j=\mathrm{front}^f}^{\mathrm{end}^f} (-\alpha_j + \beta_j) b_s^f$ in (23) cancels out because $\sum_{j=\mathrm{front}^f}^{\mathrm{end}^f} (\alpha_j - \beta_j) = 0$ for all $f = 1, \ldots, F$. The two nonconvex terms $\big(\sum_{j=\mathrm{front}^f}^{\mathrm{end}^f} e_{sj}\big) C_e$ and $\big(\sum_{j=\mathrm{front}^f}^{\mathrm{end}^f} (\alpha_j + \beta_j)\big) \varepsilon_e$ that remain in Eq. (23) are approximated by Taylor-expansion-type linear expressions. The quadratic relaxation of (23) is as follows:

$$
\begin{array}{ll}
xh + \left[ \begin{array}{c} C_e \\ \varepsilon_e \end{array} \right] = 0, & \\[8pt]
zh^f + \left[ \begin{array}{c} \displaystyle\sum_{j=\mathrm{front}^f}^{\mathrm{end}^f} e_{sj} \\[6pt] \displaystyle\sum_{j=\mathrm{front}^f}^{\mathrm{end}^f} (\alpha_j + \beta_j) \end{array} \right] = 0, & \forall f = 1, \ldots, F, \\[16pt]
0 \le \overline{xh} \circ zh^f + \underline{zh}^f \circ xh - \overline{xh} \circ \underline{zh}^f + v^f, & \forall f = 1, \ldots, F, \\[4pt]
0 \le \underline{xh} \circ zh^f + \overline{zh}^f \circ xh - \underline{xh} \circ \overline{zh}^f + v^f, & \forall f = 1, \ldots, F, \qquad (24)\\[4pt]
\overline{xh} \circ zh^f + \overline{zh}^f \circ xh - \overline{xh} \circ \overline{zh}^f + v^f \le 0, & \forall f = 1, \ldots, F, \\[4pt]
\underline{xh} \circ zh^f + \underline{zh}^f \circ xh - \underline{xh} \circ \underline{zh}^f + v^f \le 0, & \forall f = 1, \ldots, F, \\[8pt]
v^f(1) + v^f(2) + \Bigg( \displaystyle\sum_{j=\mathrm{front}^f}^{\mathrm{end}^f} (\beta_j - \alpha_j) x_d^j \Bigg)^T \Bigg( \displaystyle\sum_{i=\mathrm{front}^f}^{\mathrm{end}^f} (\beta_i - \alpha_i) x_d^i \Bigg) + \displaystyle\sum_{j=\mathrm{front}^f}^{\mathrm{end}^f} (\alpha_j - \beta_j) y_d^j \le 0, & \forall f = 1, \ldots, F,
\end{array}
$$
where $xh \in \mathbb{R}^2$, $zh^f \in \mathbb{R}^2$, $v^f \in \mathbb{R}^2$, and $v^f(1)$ and $v^f(2)$ are the first and second entries of $v^f$, respectively. $\overline{xh}$, $\underline{xh}$, $\overline{zh}^f$, and $\underline{zh}^f$ are the upper and lower bounds of $xh$ and $zh^f$.

We set $\overline{xh}$, $\underline{xh}$, $\overline{zh}^f$, and $\underline{zh}^f$ at the following values:
$$
\overline{xh} = [-\underline{C};\, -\underline{\varepsilon}], \qquad
\underline{xh} = [-\bar{C};\, -\bar{\varepsilon}], \qquad
\overline{zh}^f = [0;\, 0], \qquad
\underline{zh}^f = \Bigg[ -\Big(\text{upper bound of } \textstyle\sum_{j=\mathrm{front}^f}^{\mathrm{end}^f} e_{sj}\Big);\; -\Big(\text{upper bound of } \textstyle\sum_{j=\mathrm{front}^f}^{\mathrm{end}^f} (\alpha_j + \beta_j)\Big) \Bigg].
$$
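For a single bilinear component, the inequalities in (24) are of the McCormick-envelope type. The following generic numeric check illustrates the idea for one component $v = -xz$ with box bounds on $x$ and $z$; it is an illustration of the relaxation technique under the standard McCormick bound placement, not the authors' implementation.

```python
def mccormick_holds(x, z, x_lo, x_up, z_lo, z_up, tol=1e-9):
    """Return True when the four envelope inequalities hold for v = -x*z."""
    v = -x * z
    return all([
        x_up * z + z_lo * x - x_up * z_lo + v >= -tol,   # v >= -(overestimator)
        x_lo * z + z_up * x - x_lo * z_up + v >= -tol,
        x_up * z + z_up * x - x_up * z_up + v <= tol,    # v <= -(underestimator)
        x_lo * z + z_lo * x - x_lo * z_lo + v <= tol,
    ])
```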


Note that the second entry of $\underline{zh}^f$ is the direct result of solving the model (22) subject to (24). The main idea of our big-numbers tightening procedure is that a refinement of the bounds $\underline{zh}^f$ is made whenever the objective values of (22) are improved.

On the other hand, the valid values of $\theta_{2j}$, $\theta_{4j}$, and $\theta_{6j}$ are all related to $e_{sj}$. By definition,
$$
e_{sj} \ge (x_d^j)^T w_s^f + b_s^f - y_d^j - \varepsilon_e, \qquad
e_{sj} \ge -(x_d^j)^T w_s^f - b_s^f + y_d^j - \varepsilon_e, \qquad
e_{sj} \ge 0.
$$

This implies that $\theta_{6j}$ is the larger objective value of the following two optimization problems:
$$
\begin{array}{ll}
\displaystyle\max_{\text{variables in (22)},\; xh,\, zh^f,\, v^f} & (x_d^j)^T w_s^f + b_s^f - y_d^j - \varepsilon_e \\
\text{subject to} & \text{constraints in (22) and (24)},
\end{array} \qquad (25)
$$
and
$$
\begin{array}{ll}
\displaystyle\max_{\text{variables in (22)},\; xh,\, zh^f,\, v^f} & -(x_d^j)^T w_s^f - b_s^f + y_d^j - \varepsilon_e \\
\text{subject to} & \text{constraints in (22) and (24)}.
\end{array} \qquad (26)
$$
Let the larger of the objective values of (25) and (26) be $\bar{e}_{sj}$; then a valid $\theta_{2j}$ can be obtained by solving
$$
\begin{array}{ll}
\displaystyle\max_{\text{variables in (22)},\; xh,\, zh^f,\, v^f} & \bar{e}_{sj} + \varepsilon_e - (x_d^j)^T w_s^f - b_s^f + y_d^j \\
\text{subject to} & \text{constraints in (22) and (24)}.
\end{array} \qquad (27)
$$
Similarly, a valid $\theta_{4j}$ can be obtained by solving
$$
\begin{array}{ll}
\displaystyle\max_{\text{variables in (22)},\; xh,\, zh^f,\, v^f} & \bar{e}_{sj} + \varepsilon_e + (x_d^j)^T w_s^f + b_s^f - y_d^j \\
\text{subject to} & \text{constraints in (22) and (24)}.
\end{array} \qquad (28)
$$

The big numbers $\theta_{1j}$, $\theta_{2j}$, $\theta_{3j}$, $\theta_{4j}$, $\theta_{5j}$, and $\theta_{6j}$ define the following integer program:


$$
\begin{array}{ll}
\displaystyle\min_{\substack{C_e,\, \varepsilon_e,\, w_s^1,\, b_s^1,\, w_s^2,\, b_s^2,\, \ldots,\, w_s^F,\, b_s^F,\\ p_i,\, \alpha_j,\, \beta_j,\, e_{sj}}} & \displaystyle\sum_{f=1}^{F} \sum_{i=\mathrm{front}_v^f}^{\mathrm{end}_v^f} p_i \\[12pt]
\text{subject to} & 0 \le \underline{C} \le C_e \le \bar{C}, \\
& 0 \le \underline{\varepsilon} \le \varepsilon_e \le \bar{\varepsilon}, \\
& (x_d^i)^T w_s^1 + b_s^1 - y_d^i \le p_i, \quad \forall i = n+1, \ldots, n+m_1, \\
& -(x_d^i)^T w_s^1 - b_s^1 + y_d^i \le p_i, \quad \forall i = n+1, \ldots, n+m_1, \\
& (x_d^i)^T w_s^2 + b_s^2 - y_d^i \le p_i, \quad \forall i = n+m_1+1, \ldots, n+m_1+m_2, \\
& -(x_d^i)^T w_s^2 - b_s^2 + y_d^i \le p_i, \quad \forall i = n+m_1+1, \ldots, n+m_1+m_2, \\
& \qquad \vdots \\
& (x_d^i)^T w_s^F + b_s^F - y_d^i \le p_i, \quad \forall i = \mathrm{front}_v^F, \ldots, \mathrm{end}_v^F, \\
& -(x_d^i)^T w_s^F - b_s^F + y_d^i \le p_i, \quad \forall i = \mathrm{front}_v^F, \ldots, \mathrm{end}_v^F, \\[6pt]
& \text{for all } f = 1, \ldots, F: \\[4pt]
& \left\{
\begin{array}{l}
w_s^f - \displaystyle\sum_{j=\mathrm{front}^f}^{\mathrm{end}^f} (\beta_j - \alpha_j) x_d^j = 0, \\[8pt]
\displaystyle\sum_{j=\mathrm{front}^f}^{\mathrm{end}^f} \alpha_j - \sum_{j=\mathrm{front}^f}^{\mathrm{end}^f} \beta_j = 0, \\[8pt]
\forall j = \mathrm{front}^f, \ldots, \mathrm{end}^f: \\[4pt]
\quad \left\{
\begin{array}{l}
0 \le \alpha_j \le \theta_{1j} \cdot z_j, \\
0 \le e_{sj} + \varepsilon_e - (x_d^j)^T w_s^f - b_s^f + y_d^j \le \theta_{2j} \cdot (1 - z_j), \\
0 \le \beta_j \le \theta_{3j} \cdot z'_j, \\
0 \le e_{sj} + \varepsilon_e + (x_d^j)^T w_s^f + b_s^f - y_d^j \le \theta_{4j} \cdot (1 - z'_j), \\
0 \le C_e - \alpha_j - \beta_j \le \theta_{5j} \cdot \eta_j, \\
0 \le e_{sj} \le \theta_{6j} \cdot (1 - \eta_j).
\end{array}
\right.
\end{array}
\right.
\end{array} \qquad (29)
$$
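The big-M constraints in (29) encode each complementarity pair through a binary choice. The following sketch checks, for a single index j, whether a candidate point satisfies those constraints; the argument names (including the slack expressions upper_slack and lower_slack, which stand for the second and fourth left-hand sides above) are illustrative only.

```python
def bigM_feasible(alpha, beta, e_s, upper_slack, lower_slack, Ce,
                  theta, z, z2, eta, tol=1e-9):
    """Check the six big-M constraints of (29) for one index j."""
    t1, t2, t3, t4, t5, t6 = theta
    return all([
        0 - tol <= alpha <= t1 * z + tol,              # alpha active only if z = 1
        0 - tol <= upper_slack <= t2 * (1 - z) + tol,  # its slack only if z = 0
        0 - tol <= beta <= t3 * z2 + tol,
        0 - tol <= lower_slack <= t4 * (1 - z2) + tol,
        0 - tol <= Ce - alpha - beta <= t5 * eta + tol,
        0 - tol <= e_s <= t6 * (1 - eta) + tol,
    ])
```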
We suggest using the following big-numbers tightening procedure.
Algorithm 11 Obtaining the Tightened Big Numbers.
Step 0: Initialization.
  Set $obj_{LB}$ and $obj_{UB}$ at exogenous valid lower and upper bounds of $\sum_{f=1}^F \sum_{i=\mathrm{front}_v^f}^{\mathrm{end}_v^f} p_i$, respectively.
Step 1: Obtaining initial big numbers by solving the linear program.
  for $j = \mathrm{front}^f, \ldots, \mathrm{end}^f$, $f = 1, \ldots, F$
    1a: Solve (22) for all choices of the objective functions without constraints (24). Let the optimal objective values for $\alpha_j$, $\beta_j$, and $C_e - \alpha_j - \beta_j$ be $\theta_{1j}$, $\theta_{3j}$, and $\theta_{5j}$, respectively.
    1b: Solve (25) and (26) without constraints (24). Let the optimal objective values be $\bar{e}_{sj}^{(1)}$ and $\bar{e}_{sj}^{(2)}$, respectively. Then $\bar{e}_{sj} = \max(\bar{e}_{sj}^{(1)}, \bar{e}_{sj}^{(2)})$ and $\theta_{6j} = \bar{e}_{sj}$.
    1c: Solve (27) and (28) without constraints (24). Let the optimal objective values be $\theta_{2j}$ and $\theta_{4j}$, respectively.
  end for
Step 2: Solving for the improved lower bound.
  Solve for $\max \sum_{f=1}^F \sum_{i=\mathrm{front}_v^f}^{\mathrm{end}_v^f} p_i$ subject to the constraints in (22) and (24), where $\underline{zh}^f = [-\sum_{j=\mathrm{front}^f}^{\mathrm{end}^f} \theta_{6j};\; -\sum_{j=\mathrm{front}^f}^{\mathrm{end}^f} (\theta_{1j} + \theta_{3j})]$. Let the objective value be $obj_{LB}^c$.
  if $obj_{LB}^c > obj_{LB}$ then
    Let $obj_{LB} \leftarrow obj_{LB}^c$ and go to Step 1.
  else
    Go to Step 3.
  end if
Step 3: Solving for the tightened big numbers.
  for $j = \mathrm{front}^f, \ldots, \mathrm{end}^f$, $f = 1, \ldots, F$
    3a: Solve (22) for all choices of the objective functions with constraints (24). Let the optimal objective values for $\alpha_j$, $\beta_j$, and $C_e - \alpha_j - \beta_j$ be $\tilde{\theta}_{1j}$, $\tilde{\theta}_{3j}$, and $\tilde{\theta}_{5j}$, respectively.
    3b: Solve (25) and (26) with constraints (24). Let the optimal objective values be $\bar{e}_{sj}^{(1)}$ and $\bar{e}_{sj}^{(2)}$, respectively. Then $\bar{e}_{sj} = \max(\bar{e}_{sj}^{(1)}, \bar{e}_{sj}^{(2)})$ and $\tilde{\theta}_{6j} = \bar{e}_{sj}$.
    3c: Solve (27) and (28) with constraints (24). Let the optimal objective values be $\tilde{\theta}_{2j}$ and $\tilde{\theta}_{4j}$, respectively.
    if any of the $\tilde{\theta}_{ij}$, $i = 1, \ldots, 6$, is smaller than $\theta_{ij}$ then
      $\theta_{ij} \leftarrow \tilde{\theta}_{ij}$, $\forall i = 1, \ldots, 6$. Update $\underline{zh}^f$.
    end if
  end for
Step 4: Output.
  $\theta_{1j}$, $\theta_{2j}$, $\theta_{3j}$, $\theta_{4j}$, $\theta_{5j}$, and $\theta_{6j}$, which define the modified integer program (29). □
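A compact view of the control flow of Algorithm 11, with the LP/QP solves hidden behind hypothetical helpers (initial_big_numbers, improved_lower_bound, tightened_big_numbers), is sketched below.

```python
def tighten_big_numbers(indices, obj_lb, obj_ub):
    """Sketch of Algorithm 11: Steps 1-2 loop, then the per-index tightening."""
    theta = {j: initial_big_numbers(j, obj_lb, obj_ub) for j in indices}   # Step 1
    while True:
        obj_lb_c = improved_lower_bound(theta, obj_lb, obj_ub)             # Step 2
        if obj_lb_c <= obj_lb:
            break
        obj_lb = obj_lb_c
        theta = {j: initial_big_numbers(j, obj_lb, obj_ub) for j in indices}
    for j in indices:                                                      # Step 3
        candidate = tightened_big_numbers(j, theta, obj_lb, obj_ub)
        if any(c < t for c, t in zip(candidate, theta[j])):
            theta[j] = candidate
    return theta                                                           # Step 4
```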

The use of the tightened big numbers significantly improves the running time for the instances that are originally solvable and also allows more instances to be solved. However, the level of tightening that can be reached by this procedure is limited due to the effect of aggregation. The approximation of the aggregated complementarities becomes looser as the number of complementarity constraints increases. When the size of the training data is above some threshold, the tightened values $\theta_{1j}, \theta_{2j}, \theta_{3j}, \theta_{4j}, \theta_{5j}$, and $\theta_{6j}$ resulting from Algorithm 11 are not small enough, so the IP (29) cannot be solved by the solver. In this situation, we lack good big numbers and a good lower bound of the upper-level objective function.
Now suppose a global solution set is known: $(\alpha^*_j, \beta^*_j, e^*_{sj}, \varepsilon_e^*, C_e^*, w_s^{*f}, b_s^{*f})$, and let
$$
\begin{array}{l}
\theta^*_{1j} = \theta^*_{2j} = \max\big(\alpha^*_j,\; e^*_{sj} + \varepsilon_e^* - (x_d^j)^T w_s^{*f} - b_s^{*f} + y_d^j\big), \\[4pt]
\theta^*_{3j} = \theta^*_{4j} = \max\big(\beta^*_j,\; e^*_{sj} + \varepsilon_e^* + (x_d^j)^T w_s^{*f} + b_s^{*f} - y_d^j\big), \\[4pt]
\theta^*_{5j} = \theta^*_{6j} = \max\big(e^*_{sj},\; C_e^* - \alpha^*_j - \beta^*_j\big). \qquad (30)
\end{array}
$$
We say that the integer program defined by the big numbers $\theta_{1j} = \theta^*_{1j}$, $\theta_{2j} = \theta^*_{2j}$, $\theta_{3j} = \theta^*_{3j}$, $\theta_{4j} = \theta^*_{4j}$, $\theta_{5j} = \theta^*_{5j}$, and $\theta_{6j} = \theta^*_{6j}$ contains at least one of the global optimal solutions. Although some portions of the feasible region are cut off, the values in (30) provide a benchmark for the tightness of the big numbers $\theta$.
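A small sketch of how the benchmark values in (30) would be computed from a known global solution of one fold is given below; the argument names are illustrative only.

```python
import numpy as np

def benchmark_theta(alpha, beta, e_s, eps_e, C_e, w, b, x, y):
    """Compute theta1*=theta2*, theta3*=theta4*, theta5*=theta6* as in (30)."""
    upper = e_s + eps_e - x @ w - b + y      # slack paired with alpha in (29)
    lower = e_s + eps_e + x @ w + b - y      # slack paired with beta in (29)
    t12 = max(alpha, upper)
    t34 = max(beta, lower)
    t56 = max(e_s, C_e - alpha - beta)
    return t12, t34, t56
```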
Table 1 gives examples of the averaged tightened values $\theta_1, \theta_2, \theta_3, \theta_4, \theta_5$, and $\theta_6$ obtained from Algorithm 11. The instances in these examples have 30 features, with $C_e \in [1, 10]$ and $\varepsilon_e \in [0.1, 0.5]$. Compared with the benchmark values $\theta_1^*$, $\theta_3^*$, and $\theta_5^*$, these big numbers are still very large even after tightening, especially when the number of training data points exceeds 30.

5 Numerical experiments

In this section we provide the numerical experiments for solving the SVM regression
parameters selection. The following themes are covered:
• Data sources: synthetic data where (xd , yd ) are random values in [0, 10] and real-
world data where xd are the indicators and yd represents the diseases.
• Number of folds for the training data: onefold, twofolds, threefolds, or fivefolds.
• Number of folds for the validation data: the same as the number of folds for the
training data, or 1 single fold.
• Number of features: 5 features or 30 features. (25 features for the real-world data.)
• Size of training data set: 5, 10, 15,…, 95, 100; 105, 120, 135, or 150. (40, 60, or
100 for the real-world data.)
• Algorithms for the global optimal solution: the (Ce , εe )-rectangle search algorithm,
or the improved integer programs solved by CPLEX.
• Algorithms for the local optimal solution: an intermediate solution obtained from the (Ce, εe)-rectangle search algorithm, or a solution produced by KNITRO, which is a sequential quadratic programming (SQP) nonlinear programming (NLP) solver that can be used for mathematical programs with equilibrium constraints (MPEC).
• Parameters employed for the testing data set: globally optimized parameter and
grid-searched parameter.
The experiments are run on a machine with an Intel i7-2600k CPU, 16 GB of memory, and Windows 7, except those that use KNITRO on NEOS. Both the (Ce, εe)-rectangle search algorithm described in Sect. 3 and the improved integer program described in Sect. 4 are implemented in C++.

5.1 Synthetic data

In Tables 2, 3 and 4, the results of the parameter selection problem for synthetic data
with onefold, twofolds and threefolds of training data are shown, respectively. Here,

Table 1 Comparison of the averaged values θi, i = 1, ..., 6, obtained by Algorithm 11 and the optimal values θi*, i = 1, ..., 6

# of training data | θ1 | θ2 | θ3 | θ4 | θ5 | θ6 | θ1* (= θ2*) | θ3* (= θ4*) | θ5* (= θ6*)
5 | 0.126235 | 1.081324 | 0.126234 | 1.081114 | 9.989524 | 0.049280 | 0.086908 | 0.126908 | 0.986183
10 | 0.161111 | 1.113099 | 0.161130 | 1.112452 | 9.996173 | 0.063365 | 0.275497 | 0.275497 | 0.991417
15 | 0.239673 | 1.176334 | 0.239672 | 1.177155 | 9.996574 | 0.094693 | 0.468006 | 0.539704 | 0.992290
20 | 0.515332 | 1.383488 | 0.515283 | 1.386732 | 9.995492 | 0.207011 | 0.341058 | 0.536199 | 0.986745
25 | 1.333268 | 2.023181 | 1.333353 | 2.018388 | 9.992558 | 0.540339 | 0.534622 | 0.491325 | 0.974053
30 | 5.066198 | 9.006363 | 5.063483 | 9.061301 | 9.957015 | 4.378865 | 0.561134 | 0.625488 | 0.813378
35 | 8.862497 | 88.76880 | 8.795676 | 88.94464 | 9.998000 | 45.24857 | n/a | n/a | n/a
40 | 9.910701 | 127.2467 | 9.886909 | 127.2871 | 9.998719 | 64.88660 | n/a | n/a | n/a

Each observation comprises 30 features

Table 2 Result of the training and validating SVR on the synthetic data with onefold of training data and onefold of validation data that were solved by the (Ce , εe )-rectangle
search algorithm

Instance profile | Optimal solution | Algorithm statistics at termination or interruption
Data | # of comp. | Global? | Least upper bound | Ce | εe | Convergence time (s) | # of groupings | # of objective values | # of rectangles

5_5_5_1 15 Yes 15.6326 10 0.5 4.85 3 2 5


10_10_5_1 30 Yes 39.6893 8.76829 0.38201 3090.51 8 2 534
15_15_5_1 45 Yes 36.2791 10 0.14950 5849.94 13 4 2184
20_20_5_1 60 Yes 60.0738 10 0.44891 850.40 7 4 7
25_25_5_1 75 Yes 68.2918 10 0.1 2628.77 9 5 124
30_30_5_1 90 Yes 76.0628 10 0.26901 829.86 6 4 7
35_35_5_1 105 Yes 91.9837 10 0.5 19133.20 16 11 704
40_40_5_1 120 Yes 103.696 5.13124 0.5 23682.12 18 9 573
45_45_5_1 135 Yes 116.924 2.20279 0.5 13586.20 19 10 364
50_50_5_1 150 Yes 125.983 1 0.5 25676.30 27 16 620
55_55_5_1 165 Yes 137.630 1 0.5 43118.30 44 28 3783
60_60_5_1 180 Yes 151.469 10 0.5 25609.00 39 24 857

65_65_5_1 195 Yes 166.002 1 0.4063 36377.70 45 26 976


70_70_5_1 210 Yes 182.654 1 0.23630 103535.00 42 20 1053
75_75_5_1 225 Yes 200.019 1 0.32186 49803.40 26 16 516
80_80_5_1 240 Yes 201.423 1 0.19269 27998.70 41 20 292
85_85_5_1 255 Yes 208.301 10 0.39157 74622.50 47 23 733
90_90_5_1 270 Fail1 n/a n/a n/a n/a n/a n/a n/a
95_95_5_1 285 Fail1 229.582 1 0.5 n/a 23 12 7
100_100_5_1 300 Fail1 239.402 10 0.5 n/a 15 10 3
Table 2 continued

Instance profile | Optimal solution | Algorithm statistics at termination or interruption
Data | # of comp. | Global? | Least upper bound | Ce | εe | Convergence time (s) | # of groupings | # of objective values | # of rectangles

5_5_30_1 15 Yes 6.55739 10 0.5 1.29 1 1 1


10_10_30_1 30 Yes 18.5152 10 0.5 7.99 2 1 1
15_15_30_1 45 Yes 28.1909 10 0.5 34.36 3 2 5
20_20_30_1 60 Yes 59.0814 10 0.5 79.15 4 3 6
25_25_30_1 75 Yes 89.6480 10 0.5 87.95 7 6 9
30_30_30_1 90 Yes 162.201 10 0.5 832.38 7 6 228
35_35_30_1 105 No2 209.622 1 0.5 n/a 258 232 722
40_40_30_1 120 Yes 191.939 1 0.5 149511.00 92 74 4447
45_45_30_1 135 No2 178.813 1 0.5 n/a 101 79 341
50_50_30_1 150 No2 170.815 1 0.5 n/a 164 116 240
Global resolution of the support vector machine regression...

55_55_30_1 165 No2 185.846 1 0.5 n/a 65 44 153


60_60_30_1 180 No2 218.191 1 0.5 n/a 83 71 222
65_65_30_1 195 No2 232.715 10 0.5 n/a 46 30 97
Author's personal copy

70_70_30_1 210 No2 261.759 1 0.5 n/a 99 72 115


75_75_30_1 225 No2 290.898 1 0.46158 n/a 38 24 88
80_80_30_1 240 No2 296.855 1 0.1 n/a 92 61 91
85_85_30_1 255 No2 295.585 1 0.43787 n/a 41 25 62
90_90_30_1 270 No2 291.292 1 0.1 n/a 55 33 70
95_95_30_1 285 No2 290.273 1.43635 0.35099 n/a 72 54 72
100_100_30_1 300 No2 304.353 1.58101 0.33420 n/a 75 40 19

Ce ∈ [1 , 10] and εe ∈ [0.1 , 0.5]

Table 3 Result of the training and validating SVR on the synthetic data with twofolds of training data and twofolds of validation data that were solved by the (Ce, εe)-rectangle search algorithm

Instance profile | Optimal solution | Algorithm statistics at termination or interruption
Data | # of comp. | Global? | Least upper bound | Ce | εe | Convergence time (s) | # of groupings | # of objective values | # of rectangles

5_5_5_2 30 Yes 31.7797 10 0.5 13.77 3 2 5


10_10_5_2 60 Yes 70.7773 1 0.13787 5328.00 10 5 576
15_15_5_2 90 Yes 84.0852 10 0.1 29095.50 28 17 2359
20_20_5_2 120 Yes 111.454 10 0.44891 18673.61 20 11 380
25_25_5_2 150 Yes 136.731 7.21495 0.14678 88873.05 56 32 1863
30_30_5_2 180 Yes 155.316 1 0.26901 2925.18 21 12 15
35_35_5_2 210 Yes 183.513 1 0.42584 41847.62 30 17 894
40_40_5_2 240 Yes 203.311 4.45153 0.5 111367.00 44 24 1481
45_45_5_2 270 Yes 227.785 3.80531 0.5 131920.00 72 47 3157
50_50_5_2 300 Yes 255.149 1 0.5 33351.00 44 23 385
5_5_30_2 30 Yes 20.3554 1 0.48409 9.91 2 1 1
10_10_30_2 60 Yes 48.9086 10 0.5 66.79 3 2 5

15_15_30_2 90 Yes 79.7037 1 0.49640 170.23 6 5 8


20_20_30_2 120 Yes 132.146 1 0.5 343.00 9 8 11
25_25_30_2 150 Yes 213.516 1 0.5 216.58 9 8 10
30_30_30_2 180 Yes 394.917 1 0.5 36911.21 32 30 2721
35_35_30_2 210 No3 405.604 1 0.5 >29926.30 544 519 1772
40_40_30_2 240 No2 415.477 1 0.5 n/a 204 174 158
45_45_30_2 270 No2 411.858 1 0.5 n/a 281 229 180
50_50_30_2 300 No2 414.801 1 0.5 n/a 195 132 155

Ce ∈ [1 , 10] and εe ∈ [0.1 , 0.5]


Table 4 Result of the training and validating SVR on the synthetic data with threefolds of training data and threefolds of validation data that were solved by the (Ce , εe )-
rectangle search algorithm

Instance profile | Optimal solution | Algorithm statistics at termination or interruption
Data | # of comp. | Global? | Least upper bound | Ce | εe | Convergence time (s) | # of groupings | # of objective values | # of rectangles

5_5_5_3 45 Yes 49.2105 1 0.12443 30.61 4 3 6


10_10_5_3 90 Yes 99.2702 1.25255 0.1 34219.21 29 22 1469
15_15_5_3 135 Yes 132.017 1 0.12097 109288.00 65 47 5031
20_20_5_3 180 Yes 176.307 1 0.43310 43128.63 32 17 721
25_25_5_3 225 Yes 235.656 1 0.37009 71215.80 125 78 5411
30_30_5_3 270 Yes 259.713 1 0.1 40633.40 68 45 472
35_35_5_3 315 Yes 298.358 1 0.5 80964.40 90 50 1610
40_40_5_3 360 Yes 334.663 3.79575 0.15225 145298.00 63 40 1739
45_45_5_3 405 Yes 364.376 3.80531 0.5 507302.00 176 104 5602
50_50_5_3 450 Yes 407.516 1 0.40875 34460.60 55 29 430

5_5_30_3 45 Yes 41.4245 1 0.48409 14.81 2 1 1


10_10_30_3 90 Yes 97.3750 1 0.5 119.42 4 3 6

15_15_30_3 135 Yes 159.109 10 0.5 306.89 8 7 10


20_20_30_3 180 Yes 210.289 10 0.5 613.85 13 12 15
25_25_30_3 225 Yes 323.267 1 0.5 459.90 11 10 13
30_30_30_3 270 Yes 536.511 1 0.5 85320.60 48 44 3540
35_35_30_3 315 No2 660.533 1 0.5 n/a 351 303 228
40_40_30_3 360 No2 645.140 1 0.5 n/a 376 302 137
45_45_30_3 405 No2 621.336 1 0.5 n/a 298 232 117
50_50_30_3 450 No2 632.321 1 0.5 n/a 196 123 96


Ce ∈ [1, 10] and εe ∈ [0.1, 0.5]



the number of folds for the validation data is the same as that for the training data.
The first column is a self-explanatory name for each instance. For example, instance
5_6_30_1 means that there are 5 data points in each fold of the validation data, 6 data
points in each fold of the training data, 30 features for each data point, and a total of
onefold for the training data. The second column is the number of complementarity
constraints. The 3rd to 10th columns record the solution obtained at convergence (if the algorithm converges) or at termination (if the algorithm is interrupted due to an error or after a long waiting time). The possible entries in the 3rd column are Yes, fail1,
No2 , or No3 . “Yes” denotes the case where the global optimum is obtained, and the Ce
and εe values in the 5th and 6th columns are the optimal solution. “fail1 ” denotes the
case where errors occur either in the 1st or 2nd stage of the algorithm so the running
process is forced to stop. “No2 ” denotes the case where the instance cannot be solved
within a limit of time. In this case, the Ce and εe recorded in the 5th and 6th columns are
solutions for which the least valid upper bound is obtained. The time limit imposed
on these instances is 8000 s for each stage. Note that we impose the time limit on some instances only when we know the instance is unlikely to be solved in a reasonable running time, or when we have already encountered a long waiting time on a simpler instance or one of equal size. "No3" is similar to "No2" except that the running process was interrupted
before converging at about 8000 s. The 7th column records the time in seconds to
obtain the global optimum. If the algorithm doesn’t converge, we mark it by “n/a”.
The 8th column records the number of groupings found during the search. The 9th
column records the number of different objective values ever obtained from solving
RLP or RQCP. If the instance is solved to global optimum, the values in the 8th and
9th column are the ultimate realizations of possible groupings and objective values,
respectively; otherwise, they only show the intermediate understandings. The number
of objective values of RLP or RQCP is at most the number of possible groupings.
The values in the 9th column can be less than those in the 8th column because it is
possible for two different groupings to lead to the same objective. The 10th column
records the number of rectangular areas being processed in the entire algorithm.
For the instances with 5 features shown in Table 2, the (Ce , εe )-rectangle search
algorithm can solve the onefold-5-features problems with up to 85 training and 85
validation data points to global optimum. The unsuccessful runs for instances with
more than 80 data points are due to the failure in solving the restricted linear program
RLP and the restricted quadratic constrained program RQCP9 at the vertices of some
rectangular areas. This is indeed a problematic issue for Algorithms 9 and 10 when
there is any unsolved RLP (and RQCP) during the process of searching and elim-
inating. The constraint set (17) of RLP is used in models (19) and (20) to obtain the invariancy intervals. Without successfully solving RLP and RQCP, an invariancy interval cannot be identified, and the search then falls back to inefficient partitioning at random points. In other words, an accurate lower-level solution that leads to a feasible RLP is
critical for controlling the search step of this algorithm.
As shown in Tables 3 and 4, the (Ce, εe)-rectangle search algorithm solves the twofold- and threefold-5-feature problems to global optimum without trouble. From the 7th

9 Recall that we solve a RQCP when RLP or LCP fail to be solved.


Fig. 8 The four quadrants of difficulty that are separated by the number of training data points each fold
and the features of a data point of an instance

and 8th columns of the tables, we can see that the time spent to reach convergence is roughly linear in the number of groupings discovered. The number of feasible groupings that an instance can carry is determined by two factors: the number of training data points and the number of features.
For instances with 30 features, as shown in Tables 2, 3 and 4, the search algorithm
can only solve up to 30 data points for each fold with onefold, twofolds, or threefolds
of training data, except a single instance 40_40_30_1. The intermediate number of
groupings for these unsolved instances is already very large compared with those
instances of the same size of training data but fewer features.
We notice that the relationship between the number of features and the number of data points affects the difficulty of solving an instance by the (Ce, εe)-rectangle search algorithm. If the number of training data points is less than the number of features, the problem is easier. This observation can be related to the geometric analysis in which we aim to identify a (k−1)-dimensional hyperplane in a k-dimensional space to fit n training data points by least-squares linear regression. Consider the fact that k properly located data points determine a hyperplane in k dimensions. If n < k, there is more than one hyperplane containing the n data points. If n = k, given non-collinearity and other conditions on the relationship of the points, there is only one hyperplane containing the n data points. If n > k, a hyperplane is determined under the least-squares rule, yet there will be at least n − k points outside the hyperplane. Therefore, when n < k, there is more freedom in determining a hyperplane with no residuals.
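A toy least-squares experiment, assuming random data and a fit through the origin, illustrates the n-versus-k observation: the residual is (generically) zero whenever n ≤ k.

```python
import numpy as np

# Arbitrary random data; the fit is y ~ X w through the origin, so the data
# fail to pin down a unique zero-residual hyperplane whenever n <= k.
rng = np.random.default_rng(0)
k = 5
for n in (3, 5, 8):                                  # n < k, n = k, n > k
    X = rng.uniform(0, 10, size=(n, k))
    y = rng.uniform(0, 10, size=n)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(n, k, round(float(np.linalg.norm(X @ w - y)), 8))
```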
To analyze the difficulty level of an instance for the (Ce, εe)-rectangle search algorithm, we propose to separate the instances into four quadrants, I, II, III, and IV, which categorize the instances into four levels of difficulty: the easiest, easy, hard, and the hardest, as shown in Fig. 8. The separating horizontal and vertical axes represent the number of training data points and the number of features, respectively, and the origin denotes an instance with an equal number of training data points and features. Within each quadrant, the running time of the instances is locally proportional to the number of training data points and folds. From Fig. 9, for instances belonging to the same

difficulty level, the running time is roughly linear in the number of training data points and the number of folds.

Fig. 9 Running time of the instances for the four quadrants separately (panels: quadrant II (easy), quadrant I (the hardest), quadrant III (the easiest), quadrant IV (hard); vertical axes: running time in seconds; horizontal axes: number of training data points per fold or number of folds)
The global optimal results shown in Tables 2, 3, and 4 contain the successful runs for the easy, the easiest, and hard instances, but the hardest instances remain unsolved using the (Ce, εe)-rectangle search algorithm.

5.2 Real-world data

Besides the synthetic data, we run parameter selection for the real-world data of
chemoinformatics, which has been used in the work of Yu (2011), Kunapuli (2008),
Demiriz et al. (2001). The original usage of these data sets is for building the quan-
titative structure–activity relationship (QSAR) models, but we borrow the data only
to test the algorithm without discussing the meaning of the real-world application. A
profile of these data sets is shown in Table 5.
We divide the training data into fivefolds and keep the validation data as onefold. For
example, the 100 training data points for the data set aquasol are divided into fivefolds; that is, each fold of training data contains 20 data points and the single fold of validation data contains all 97 validation data points. We follow the same method to create the folds of data.


Table 5 Profile of the chemoinformatics data (Table 3.1 in Kunapuli (2008))

Name of data set | # of sets | # of observations each set | # of observations assigned to training data | # of observations assigned to validation data | # of features each observation
Aquasol | 10 | 197 | 100 | 97 | 25
Blood/brain barrier (BBB) | 20 | 62 | 60 | 2 | 25
Cancer | 20 | 46 | 40 | 6 | 25
Cholecystokinin (CCK) | 10 | 66 | 60 | 6 | 25

In addition, compared to the parameter selection studies in Yu (2011) and Kunapuli (2008) on the same data sets, our experiments have some modifications. In Kunapuli (2008), lower and upper bounds are imposed on the normal vector ws of the hyperplane in the SVM regression problems, making them slightly different from problem (3). The approaches employed in Kunapuli (2008) do not aim at confirming global optimality. In Yu (2011), a branch-and-cut-based algorithm was proposed for global optimality, but all experiments with real-world data were terminated at 7200 s with a local optimal solution. In both Yu (2011) and Kunapuli (2008), the outer-level objective function takes the average of the regression error, while we take the sum. This small change does not affect the optimal values of Ce and εe when the number of training data points in each fold is equal, which is our case, but it affects the outer objective values presented to readers. The analysis in these two previous works focuses on performance, as measured by the mean absolute deviation (MAD) and mean squared error (MSE) of the trained parameters on a testing data set. However, the performance of the parameters on a testing data set is not guaranteed. (See the debates about the meaning of a "best parameter" in Sect. 2.2 and two small-scale performance comparisons in Sect. 5.5.) Instead, we focus on global optimality itself. Under these modifications, we are the first to obtain a certificate of global optimality for the problem sets cancer, BBB and CCK, while global optimality for the set aquasol remains unrealized. The results of solving the real-world instances using the (Ce, εe)-rectangle search are shown in Table 6.
From Table 6, the runs on the data sets cancer, BBB and CCK are successful both in global optimality and in convergence time. For the data set aquasol, we can see that the algorithm cannot converge after 1,052,390 s, and the number of groupings identified for "aquasol_1" is already very large. We then acknowledge that the set aquasol is a challenge to our rectangle search algorithm. Runs on the remaining instances "aquasol_2"–"aquasol_10" are forced to stop at around 8000 s. We reasonably believe that the number of groupings identified at 8000 s is much smaller than the number of all possible groupings.
Numerical results show that the (Ce, εe)-rectangle search algorithm is less sensitive to an increase in the number of folds than to the number of features. For the data sets cancer, BBB and CCK, dividing the training data set into fivefolds actually makes the number of training data points in each fold drop far below the number of features. The difficulty of these instances is categorized in quadrant II (easy). Although

Table 6 Result of the training and validating SVR on the real-world chemoinformatics data which were divided into fivefolds of training data and onefold of validation data
and solved by the (Ce , εe )-rectangle search algorithm

Instance profile | Optimal solution | Algorithm statistics
Name | Data | # of comp. | Global? | Avg. least upper bound | Avg. convergence time (s) | Avg. # of groupings | Avg. # of objective values | Avg. # of rectangles

Aquasol_1 97_20_25_5 300 No3 470.703 >1052390 5260 5220 17579
Aquasol_2–aquasol_10 97_20_25_5 300 No2 429.975 n/a 575.8 564.6 341.8
BBB_1–BBB_20 2_12_25_5 180 Yes 4.10514 1526.44 58.5 53.6 53.8
Cancer_1–cancer_20 6_8_25_5 120 Yes 16.38208 902.15 62.4 50.7 34.9
CCK_1–CCK_10 6_12_25_5 180 Yes 33.85086 3329.85 40.4 36.7 136.3

Ce ∈ [0.1 , 10] and εe ∈ [0.01 , 1]



we need to solve more $\mathrm{LCP}^f_{\mathrm{SVR}}$s, it seems beneficial for the (Ce, εe)-rectangle search to divide the data set into many folds so that the problem for each fold is less difficult. Associated with the smaller size of training data in each fold, however, is the risk of losing the representativeness of future data.

5.3 Performance analysis between methods

In this subsection, we compare the global and local optimal solutions, and the convergence efficiency, provided by various approaches. Among these algorithms, the (Ce, εe)-rectangle search algorithm and the improved integer program with tightened big numbers θ solve the instances to the global optimum at convergence. We implement Algorithm 11 in C++ and use CPLEX 12.2 to solve the program (29). The initial objLB in Step 1 can be set at 0, or at the least sum of residuals of fitting the validation data to the absolute regression hyperplane. The initial objUB is set at the least upper bound obtained in the 1st stage of the (Ce, εe)-rectangle search algorithm. Besides the two global optimal algorithms, the instances were run on a NEOS machine to use KNITRO, the mathematical-program-with-equilibrium-constraints solver. In this way, we obtain a quick local solution to compare with the others.
The results obtained by the three methods on the synthetic instances with onefold, twofolds, and threefolds of training and validation data sets are shown in Tables 7, 8, and 9, respectively. In the tables, the results of the (Ce, εe)-rectangle search algorithm are excerpted from Tables 2, 3, and 4. Some of the objective values obtained by KNITRO are marked by a double asterisk (**); this means that KNITRO returned a solution but claimed it to be an infeasible point.
From Tables 7, 8, and 9, the global optimal objective values obtained from the (Ce, εe)-rectangle search algorithm and the improved integer program basically match each other, except for small discrepancies in instances 15_15_30_1, 20_20_30_1, 25_25_30_1, 15_15_30_2, and 15_15_30_3. We think these discrepancies are the result of different precisions on the right-hand-side feasibility. The local optimal solutions provided by KNITRO are quite good considering the running time, yet we can also see that the global optimal solution to the training and validation SVR application is always better than the local optimal solution provided by KNITRO.
A series of figures comparing the running time of the (Ce, εe)-rectangle search algorithm and the improved integer program is shown in Fig. 10. In general, the (Ce, εe)-rectangle search algorithm can solve many more instances to the global optimum than the improved integer program. The instance 35_35_30_1 is the only instance that was unsolved by the (Ce, εe)-rectangle search algorithm but solved by the improved integer program. However, for instances with fewer training data points, the convergence speed of the improved integer program outperforms that of the (Ce, εe)-rectangle search algorithm, including the onefold-5-feature instances with up to 55 data points, onefold-30-feature instances with up to 35 data points, twofold-5-feature instances with up to 25 data points, twofold-30-feature instances with 10 and 15 data points, threefold-5-feature instances with up to 15 data points, and threefold-30-feature instances with up to 20 data points. As the number of training data points increases, the required processing time for the improved integer program rises suddenly at a critical


Table 7 Comparisons between three methods for training and validating the SVR with onefold of training data and onefold of validation data

Rectangle search | CPLEX 12.2-modified IP | KNITRO 7.0.0 (on NEOS)-MPEC
Data | Global optimal objective value | Global optimal objective value | Ce | εe | Time (s) | Local optimal objective value | Ce | εe | Time (s)

5_5_5_1 15.6326 15.6326 1 0.5 0.33 15.7786 1 0.1 0.01


10_10_5_1 39.6893 39.6895 1 0.38203 2.39 40.4560 1 0.46525 0.04
15_15_5_1 36.2791 36.2791 9.97703 0.1 6.04 36.2791 9.97703 0.1 0.05
20_20_5_1 60.0738 60.0738 1 0.44892 6.58 60.0738 1.40263 0.44892 0.09
25_25_5_1 68.2918 68.2918 10 0.1 14.41 71.9826 10 0.5 0.23
30_30_5_1 76.0628 76.0628 10 0.17840 21.22 77.7165 10 0.5 0.23
35_35_5_1 91.9837 91.9837 1 0.5 48.19 93.2193 1 0.31257 0.18
40_40_5_1 103.696 103.696 4.74122 0.5 66.08 104.676 1 0.38487 0.26
45_45_5_1 116.924 116.924 3.26979 0.5 106.35 116.924 2 0.5 0.65
50_50_5_1 125.983 125.983 1.77608 0.5 1088.94 125.983 1.18875 0.5 0.49
55_55_5_1 137.630 137.630 1 0.5 16920.90 137.630 1 0.5 0.60
60_60_5_1 151.469 Unable to converge after a long time 151.789 1 0.46283 1.34
65_65_5_1 166.002 166.536 1.08105 0.47379 1.05


70_70_5_1 182.654 184.530 1.55488 0.36707 1.50
75_75_5_1 200.019 201.299 1 0.43783 1.05
80_80_5_1 201.423 204.190 1.46351 0.33637 2.06
85_85_5_1 208.301 208.842 1 0.47521 1.81
90_90_5_1 n/a 225.318 1.10350 0.49327 3.86
95_95_5_1 229.582 229.582 1.28808 0.5 2.93
100_100_5_1 239.402 239.402 1 0.5 3.13
5_5_30_1 6.55739 6.55739 1 0.1 2.03 6.55739 1 0.1 0.01


10_10_30_1 18.5152 18.5152 1 0.27121 4.06 18.5152 4.55453 0.27121 0.07
15_15_30_1 28.1909 28.1908 1 0.5 7.22 28.1909 10 0.5 0.13
20_20_30_1 59.0814 60.0000 1 0.43200 10.67 59.0814 3.96691 0.5 0.24
25_25_30_1 89.6480 89.6479 1 0.49912 34.80 89.6479 10 0.5 0.48
30_30_30_1 162.201 162.201 1 0.5 142.02 162.201 1 0.5 0.62
35_35_30_1 209.622 209.622 1 0.5 8087.00 208.560 1.11641 0.49899 0.81
40_40_30_1 191.939 Unable to converge after a long time 199.722 3.68451 0.5 2.00
45_45_30_1 178.813 178.813 1 0.5 1.49
50_50_30_1 170.815 186.680 9.26593 0.45452 3.05
55_55_30_1 185.846 193.096 5.62466 0.42484 2.15


60_60_30_1 218.191 218.190 1.92665 0.42306 3.00
65_65_30_1 232.715 254.189 9.40120 0.14750 5.21
70_70_30_1 261.759 263.206 1.20318 0.43617 4.60


75_75_30_1 290.898 295.756∗∗ 2.49391 0.5 5.79
80_80_30_1 296.855 297.732∗∗ 2.45822 0.5 10.61
85_85_30_1 295.585 299.108∗∗ 4.15007 0.49457 19.54
90_90_30_1 291.292 296.340∗∗ 2.04075 0.5 20.75
95_95_30_1 290.273 291.981∗∗ 2.09729 0.5 235.01
100_100_30_1 304.353 318.601∗∗ 8.50051 0.47386 33.96

Ce ∈ [1 , 10] and εe ∈ [0.1 , 0.5]


Table 8  Comparisons between three methods for training and validating the SVR with twofolds of training data and twofolds of validation data (for the larger instances, the CPLEX entries are absent because the improved integer program was unable to converge after a long time)

Data | Rectangle search: objective value (global or local optimal) | CPLEX 12.2-modified IP: global optimal objective value, Ce, εe, Time (s) | KNITRO 7.0.0 (on NEOS)-MPEC: local optimal objective value, Ce, εe, Time (s)
5_5_5_2 31.7797 31.7797 10 0.5 1.49 32.4176 1 0.1 0.07


10_10_5_2 70.7773 70.7773 1 0.1 17.43 73.4668∗∗ 1.00032 0.5 0.05
15_15_5_2 84.0852 84.0852 9.97703 0.1 58.37 93.2997 1 0.48358 0.14
20_20_5_2 111.454 111.454 1 0.38245 95.47 111.728 1.09922 0.44892 0.26
25_25_5_2 136.731 136.731 4.42991 0.1 941.13 137.779 1 0.5 0.37
30_30_5_2 155.316 153.316 1 0.26902 160899.22 156.836 1 0.38169 0.82
35_35_5_2 183.513 Unable to converge after a long time 183.739∗∗ 1 0.5 0.90
40_40_5_2 203.311 205.228∗∗ 4.48850 0.5 1.29
45_45_5_2 227.785 228.521 1.13580 0.5 2.18
50_50_5_2 255.149 255.191 1.58265 0.44947 1.98
5_5_30_2 20.3554 20.3554 10 0.1 16.70 20.3554 1 0.1 0.04
10_10_30_2 48.9086 48.9068 10 0.5 22.24 48.9068 10 0.5 0.13
15_15_30_2 79.7037 79.7072 10 0.5 38.92 79.7072 1 0.5 0.31


20_20_30_2 132.146 132.146 10 0.5 427.47 132.146 1.78153 0.5 0.55
25_25_30_2 213.516 213.516 10 0.5 10342.53 259.695∗∗ 10 0.1 0.96
30_30_30_2 394.917 Unable to converge after a long time 394.917 1 0.5 1.89
35_35_30_2 405.604 405.604 1 0.5 2.65
40_40_30_2 415.477 449.047∗∗ 8.39580 0.46775 3.25
45_45_30_2 411.858 411.929∗∗ 1 0.5 2.94
50_50_30_2 414.801 414.801 1 0.5 11.42

Ce ∈ [1 , 10] and εe ∈ [0.1 , 0.5]


Table 9  Comparisons between three methods for training and validating the SVR with threefolds of training data and threefolds of validation data (for the larger instances, the CPLEX entries are absent because the improved integer program was unable to converge after a long time)

Data | Rectangle search: objective value (global or local optimal) | CPLEX 12.2-modified IP: global optimal objective value, Ce, εe, Time (s) | KNITRO 7.0.0 (on NEOS)-MPEC: local optimal objective value, Ce, εe, Time (s)
5_5_5_3 49.2105 49.2105 1 0.1 16.46 49.2105 10 0.1 0.05


10_10_5_3 99.2702 99.2702 1 0.1 176.94 101.912∗∗ 1 0.5 0.12
15_15_5_3 132.017 132.017 1 0.1 4552.27 145.881 1 0.48358 0.35
20_20_5_3 176.307 176.307 1 0.38245 16630.10 176.838∗∗ 1.00032 0.5 0.29
25_25_5_3 235.656 Unable to converge after a long time 236.622∗∗ 3.37509 0.5 0.72
30_30_5_3 259.713 264.806 1.22635 0.47940 0.99
35_35_5_3 298.358 298.353∗∗ 1 0.5 1.29
40_40_5_3 334.663 338.828 1 0.46393 1.65
45_45_5_3 364.376 363.918∗∗ 1.13580 0.5 2.36
50_50_5_3 407.516 407.244 2.06124 0.5 3.12
5_5_30_3 41.4245 41.4245 10 0.1 10.30 41.4245 1 0.1 0.05


10_10_30_3 97.3750 97.3750 10 0.5 40.95 98.7191 10 0.38233 0.22
15_15_30_3 159.109 159.112 1 0.5 35975.54 159.112 1 0.5 0.32


20_20_30_3 210.289 210.289 1 0.5 758045.79 229.998∗∗ 10 0.1 0.44
25_25_30_3 323.267 323.267 10 0.5 450402.26 323.267 1 0.5 1.40
30_30_30_3 536.511 Unable to converge after a long time 536.511 2.76246 0.5 2.70
35_35_30_3 660.533 662.131 1 0.5 3.06
40_40_30_3 645.140 723.764 1.38466 0.27788 6.38
45_45_30_3 621.336 621.336 1 0.5 7.57
50_50_30_3 632.321 632.321 1 0.5 12.80

Ce ∈ [1 , 10] and εe ∈ [0.1 , 0.5]


Fig. 10  Comparisons between the (Ce, εe)-rectangle search algorithm and the improved integer program on convergent capability and convergent speed. Six panels (5 and 30 features; one, two, and three folds) plot the convergence time (s) of the two methods against the number of training data points in each fold.

Similarly, we employed the three approaches on the real-world chemoinformatics data
sets. The solutions provided by KNITRO on the real-world data are shown in Table 10.

Table 10  Training and validating SVR solved by KNITRO on the real-world chemoinformatics data
divided into fivefolds training data and onefold validation data

Name | KNITRO 7.0.0 (on NEOS)-MPEC: successful instances, avg. local optimal objective value, Time (s)
aquasol_1-aquasol_10 6/10 435.8246 10.59


BBB_1-BBB_20 20/20 4.327035 0.69
cancer_1-cancer_20 20/20 17.05943 0.31
CCK_1-CCK_10 10/10 34.63429 4.31

Table 11  Training and validating SVR as an integer program solved by CPLEX on the real-world chemoinformatics data divided into fivefolds training data and onefold validation data

Name | CPLEX 12.2-modified IP: avg. local optimal objective value, Time (s)
cancer_1-cancer_20 16.39567 990.75


CCK_1 28.2879 441247
CCK_2 34.5219 150886
CCK_3 24.3504 2134000

Our conclusion about the comparison between KNITRO and the (Ce, εe)-rectangle
search algorithm remains the same: the running time of KNITRO is attractive, but
the solution quality of the (Ce, εe)-rectangle search algorithm is consistently better
than that produced by KNITRO. Results from the improved integer program on
the real-world data are limited, because we only solved the set cancer and a few
instances in CCK. We did not wait (e.g., weeks or months) for the optimal solutions
of the remaining CCK instances, since it already took 25 days to reach convergence
on CCK_3. Table 11 shows that the processing time needed by the improved integer
program for the set cancer is slightly less than, but very close to, the time needed
by the (Ce, εe)-rectangle search algorithm. Knowing that the sets CCK and BBB can
be solved by the (Ce, εe)-rectangle search algorithm, we conclude that the improved
integer program is less effective on instances with a large number of folds.

5.4 Observation on solutions

In all the numerical experiments reported in this work, we found that most of the
optimal (Ce, εe) pairs lie on the boundaries of the initial rectangular area [C, C̄] × [ε, ε̄].
For example, for the instances solved to global optimum shown in Table 2, either the
optimal Ce ∈ {1, 10} or the optimal εe ∈ {0.1, 0.5}. Even though the optimal (Ce, εe)
for the instance 10_10_5_1 was recorded at an interior point, we have noted that a point
on the boundary is also optimal. However, it is not universally true that the parameter
selection problem attains its optimum at a bound of either Ce or εe; counterexamples
do exist.
In applications, although we have shown that the globally optimal parameters are not
always on the boundaries of [C, C̄] × [ε, ε̄], it appears to be a good and efficient strategy
to follow Algorithm 8 and search on the boundaries in order to obtain a good upper bound.
Based on our experiments, it is highly probable that this upper bound equals the true
global optimum. Out of a total of 140 instances (80 generated synthetic instances and
60 real-world instances), the (Ce, εe)-rectangle search algorithm solved 97 of them to
global optimum. For 96 of these 97 instances, the least valid upper bound obtained
at the completion of the 1st stage of the algorithm equals the global optimum
found later. One may therefore apply the 1st stage of the algorithm alone to obtain a
set of trustworthy parameters and save the running time required for convergence in
the 2nd stage.

5.5 Performance of the globally optimized parameters on the testing data set

To compare the globally optimized parameters with the locally grid-searched
parameters, two small-scale experiments on testing data sets (which should have
a pattern similar to the training and validation data sets) are designed as follows.
In each of the eight trials of the first experiment, 1090 data points (xd, yd) are
generated (using a strategy similar to that in Yu (2011)). Each xd contains 5 features that are
uniformly distributed between −2 and 2, and yd = ws^T xd + ε, where ws is uniformly
distributed between 0 and 1 and ε is Gaussian noise. 45 data points are randomly
chosen as training data points and another 45 as validation data points, each set
divided into threefolds. The remaining 1000 data points become the testing data.
We first solve the (threefold) parameter training and validation program (3) to
obtain the globally optimal parameters (Ce, εe). Then, employing this pair of globally
optimized parameters, we find an SVM regression model (ws∗, bs∗) that fits the 90 pooled
training and validation data points.10 Finally, we compute the averaged absolute
residual |yd − ws∗T xd − bs∗|, i.e., the mean absolute deviation (MAD), over the 1000 testing data points.
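For concreteness, the following is a minimal sketch of this data-generation and evaluation protocol. It assumes a Gaussian noise standard deviation of 0.1 (not specified above), and the call that solves program (3) for the globally optimal (Ce, εe) is left as a hypothetical placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1090 points with 5 features uniform on [-2, 2]; y = w_s^T x + Gaussian noise.
n_points, n_features = 1090, 5
w_s = rng.uniform(0.0, 1.0, size=n_features)
X = rng.uniform(-2.0, 2.0, size=(n_points, n_features))
y = X @ w_s + rng.normal(0.0, 0.1, size=n_points)  # noise level 0.1 is an assumption

# 45 training and 45 validation points (three folds of 15 each); the rest is testing data.
perm = rng.permutation(n_points)
train, valid, test = perm[:45], perm[45:90], perm[90:]

def mad(w, b, X, y):
    """Averaged absolute residual |y - w^T x - b| over a data set."""
    return float(np.mean(np.abs(y - X @ w - b)))

# Hypothetical placeholders for the paper's programs:
#   Ce, eps_e = solve_program_3(X[train], y[train], X[valid], y[valid])   # global (Ce, eps_e)
#   w_star, b_star = fit_svr(X[np.r_[train, valid]], y[np.r_[train, valid]], Ce, eps_e)
#   print(mad(w_star, b_star, X[test], y[test]))   # MAD on the 1000 testing points
```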
To obtain the locally grid-searched parameters, the region [Ce, C̄e] × [εe, ε̄e] =
[0.1, 10] × [0.001, 0.1] is evenly partitioned into a grid of 16 coordinates.11 Among
these 16 pairs of (Ce, εe), the pair that minimizes the absolute validation error is
chosen. For both the globally optimized parameters and the grid-searched parameters,
the same partition of the training and validation data is used. As a result, 4 out of 8
trials have a reduced MAD when using the globally optimized parameters compared
with the grid-searched parameters. The averaged MAD reduction is 0.56 % (maximum
MAD reduction: 2.23 %; minimum MAD reduction: −0.44 %).
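A sketch of the grid-search baseline is given below. It stands in scikit-learn's linear-kernel SVR for the paper's own LP-based SVR training problem, which is an assumption made purely for illustration; the 4 × 4 grid reproduces the 16 evenly spaced coordinates described above.

```python
import numpy as np
from sklearn.svm import SVR

def grid_search_parameters(X_tr, y_tr, X_va, y_va, C_grid, eps_grid):
    """Return the (Ce, eps_e) pair that minimizes the absolute validation error."""
    best_C, best_eps, best_err = None, None, np.inf
    for C in C_grid:
        for eps in eps_grid:
            model = SVR(kernel="linear", C=C, epsilon=eps).fit(X_tr, y_tr)
            err = np.abs(model.predict(X_va) - y_va).sum()
            if err < best_err:
                best_C, best_eps, best_err = C, eps, err
    return best_C, best_eps

# 16 evenly spaced coordinates over [0.1, 10] x [0.001, 0.1]; in practice logarithmic
# grids (2**i or 10**j) are the more common choice, as noted in footnote 11.
C_grid = np.linspace(0.1, 10.0, 4)
eps_grid = np.linspace(0.001, 0.1, 4)
```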

10 Because the bs of SVM regression is not unique, we select a bs minimizing the absolute residual of the
45 validation data points. See Sect. 2.3 for the selection of bs.
11 A finer grid might give a locally grid-searched parameter that coincides with the globally optimal parameter,
since finding the global optimum is relatively easy compared to verifying it. Note that the grids commonly
used in SVM parameter selection are logarithmic grids (Fung and Mangasarian 2004) of base 2 or
10, i.e., 2^i or 10^j, where i and j range from some negative integer to some positive integer.


The second experiment uses the CCK real-world data. The testing set, which
contains 660 data points, is obtained by pooling all the training and validation data
of CCK_1-CCK_10, owing to the lack of extra data points. Therefore, each trial of the training
and validation uses exactly one portion of the testing data set. The number of trials with
a reduced MAD is 5 out of 10. On average, there is a 0.09 % reduction (maximum
reduction: 0.14 %; minimum reduction: −1.10 %) in the MAD of the testing data when
using the globally optimized parameters instead of the grid-searched parameters.
Because of the long solution time required to find a global optimum, the data sets
in the above two experiments remain small. The globally optimal parameters reduce the
MAD on average. Nevertheless, the reduction of the averaged MAD is small (and lacks
a proper benchmark), and the number of instances with a reduced MAD is merely
half of the total. We therefore do not conclude that the globally optimal
parameters truly outperform the grid-searched parameters in prediction accuracy
until larger data sets can be tackled.

6 Conclusion

This paper studies the selection of optimal parameters for support vector machine
regression within the framework of training and validation. The parameter selection
problem for support vector machine regression is formulated as a linear program with
complementarity constraints, and the main challenge becomes verifying that a solution
of this linear program with complementarity constraints is globally optimal.
The development of our (Ce, εe)-rectangle search algorithm, which
searches on the parameter plane, relies primarily on the definitions of the grouping and
of the corresponding invariancy region. Conditions sufficient to conclude that
the groupings and the invariancy regions in a rectangular area are fully realized are then
essential. The two stages of the (Ce, εe)-rectangle search algorithm are distinguished
by two different sufficient conditions: (1) the four vertices of a rectangular area
have the same grouping and lie in the same invariancy region (the 1st stage), and (2)
the rectangular area is split up into two invariancy regions (the 2nd stage). A potential
direction for improving the (Ce, εe)-rectangle search algorithm thus lies in discovering
other sufficient conditions.
A total of 140 instances, comprising 80 synthetic instances and 60 real-world instances,
were run in the numerical experiments. The synthetic instances were generated without
imposing a natural structure between the indicators and the dependent variables, while
the real-world instances are chemoinformatics data sets that have been studied
in Yu (2011), Kunapuli (2008), and Demiriz et al. (2001). A total of 56 of the synthetic
instances and 43 of the real-world instances were solved to global optimum, while tight
valid upper bounds were provided for the remaining instances. The results allow us to
categorize the difficulty level of the instances under the (Ce, εe)-rectangle search
algorithm by a four-quadrant diagram whose axes are the number of features and the number of
training data points in each fold. The effect of increasing the number of folds remains within
each quadrant, because the number of folds has a smaller-scale effect than the other
factors (the number of features and the number of training data points in each fold). This
categorization helps users understand the performance of the (Ce, εe)-rectangle
search algorithm and to be aware of its limitations. The unsolved instances from our
experiments all belong to the quadrant of the hardest difficulty level.
The long processing time could be a concern, but in some applications the long-term
benefit brought by the global optimum makes obtaining the globally optimal parameters
more critical than the runtime. Moreover, the global optimum obtained in this paper
can serve as a certificate against which every other local optimal solution of a parameter
training and validation process for support vector machine regression can be measured.

Acknowledgments We thank the reviewers for their thorough comments and suggestions, which allowed
us to significantly improve the completeness and the presentation of this work. Lee and Pang were supported
in part by the Air Force Office of Scientific Research under Grant Number FA9550-11-1-0151. Mitchell
was supported in part by the Air Force Office of Scientific Research under Grant Number FA9550-11-1-0260.
Pang was supported by the National Science Foundation under Grant Number CMMI-1333902. Mitchell
was supported by the National Science Foundation under Grant Number CMMI-1334327. Lee’s new affil-
iation after August 1, 2015 will be Department of Industrial Engineering and Engineering Management,
National Tsing Hua University, Hsinchu, Taiwan.

Appendix A: Semismooth method for SVR

Algorithm 12 Damped Newton method (Algorithm 2 in [17])

Step 0: Initialization: Let a^0 ∈ R^n, ρ ≥ 0, p ≥ 2, and σ ∈ (0, 1/2) be given. Set k = 0. Set tol.12
Step 1: Termination: If g(a^k) := (1/2)‖Φ(a^k)‖²₂ ≤ tol, stop.
Step 2: Direction Generation: Otherwise, let H_k ∈ ∂_B Φ(a^k), and calculate d^k ∈ R^n by solving the Newton system

    H_k d^k = −Φ(a^k).    (31)

If either (31) is unsolvable or the descent condition

    ∇g(a^k)^T d^k < −ρ ‖d^k‖^p₂    (32)

is not satisfied, then set

    d^k = −∇g(a^k).    (33)

Step 3: Line Search: Choose t_k = 2^{−i_k}, where i_k is the smallest nonnegative integer such that

    g(a^k + 2^{−i_k} d^k) ≤ g(a^k) + σ 2^{−i_k} ∇g(a^k)^T d^k.

Step 4: Update: Let a^{k+1} := a^k + t_k d^k and k := k + 1. Go to Step 1.
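A compact sketch of this damped Newton iteration is given below (Python/NumPy). The functions Phi and B_subdiff, which evaluate Φ and return an element of its B-subdifferential, are supplied by the caller, so the sketch is generic rather than tied to the SVR system; the parameter values are illustrative.

```python
import numpy as np

def damped_newton(Phi, B_subdiff, a0, tol=1e-14, rho=1e-8, p=2.1, sigma=1e-4, max_iter=200):
    """Damped Newton method of Algorithm 12 for the semismooth system Phi(a) = 0."""
    a = np.asarray(a0, dtype=float)
    for _ in range(max_iter):
        Phi_a = Phi(a)
        g = 0.5 * Phi_a @ Phi_a                    # g(a) = (1/2) ||Phi(a)||_2^2
        if g <= tol:                               # Step 1: termination
            return a
        H = B_subdiff(a)                           # Step 2: H_k in the B-subdifferential
        grad_g = H.T @ Phi_a                       # gradient of g at a
        try:
            d = np.linalg.solve(H, -Phi_a)         # Newton system (31)
            if grad_g @ d >= -rho * np.linalg.norm(d) ** p:
                d = -grad_g                        # descent safeguard (32)-(33)
        except np.linalg.LinAlgError:
            d = -grad_g
        t = 1.0                                    # Step 3: Armijo line search, t = 2^{-i_k}
        for _ in range(60):
            Phi_t = Phi(a + t * d)
            if 0.5 * Phi_t @ Phi_t <= g + sigma * t * (grad_g @ d):
                break
            t *= 0.5
        a = a + t * d                              # Step 4: update
    return a
```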




The B-subdifferential used in Step 2 of Algorithm 12 is obtained using the following theorem.

12 We use tol = 10^{−14}.


Theorem 13 (Theorem 5 in Ferris and Munson (2004)) Let U : R^n → R^n be continuously differentiable. Then

    ∂_B Φ(a) ⊆ { D_a + D_b U′(a) },

where D_a ∈ R^{n×n} and D_b ∈ R^{n×n} are diagonal matrices with entries defined as follows:

1. For all i ∈ I: If (ai, Ui(a)) ≠ 0, then

    (Da)ii = 1 − ai / ‖(ai, Ui(a))‖,
    (Db)ii = 1 − Ui(a) / ‖(ai, Ui(a))‖;    (34)

otherwise

    ((Da)ii, (Db)ii) ∈ { (1 − κ, 1 − γ) ∈ R² | ‖(κ, γ)‖ ≤ 1 }.    (35)

2. For all i ∈ E:

    (Da)ii = 0,
    (Db)ii = 1.

If (ai, Ui(a)) ≠ 0, then Φ(a) is differentiable at a and formula (34) computes the
exact Jacobian. On the other hand, if (ai, Ui(a)) = 0 occurs at the i-th complementarity,
the κ and γ appearing in (35) are computed as suggested in Facchinei and Pang (2003):

    κ = vi / √(vi² + (U′v)i²)  and  γ = (U′v)i / √(vi² + (U′v)i²),    (36)

where v ∈ R^n is a vector of the user's choice whose i-th element is nonzero.
To compute κ and γ in (36), we can choose v = 1 and let hv := U′(a^f)v. This vector is
constant (its blocks are of the form 0_{n_f×1}, 2_{n_f×1}, and −2_{n_f×1}) and therefore does not
require updating. The computation of κ and γ then simplifies to

    κ = 1 / √(1 + (hv)i²)  and  γ = (hv)i / √(1 + (hv)i²),

where the index i is the same as defined in (36).
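As a small illustration of formulas (34)–(36), the helper below returns the diagonal entries (Da)ii and (Db)ii for a single index i ∈ I, using v_i = 1 as in the simplification above; it is a sketch of the formulas rather than of the paper's full implementation.

```python
import numpy as np

def diag_entries(a_i, U_i, hv_i):
    """(Da)_ii and (Db)_ii from Theorem 13 for an index i in I.

    a_i, U_i: the pair (a_i, U_i(a)); hv_i: the i-th entry of U'(a)v with v = 1.
    """
    norm = np.hypot(a_i, U_i)
    if norm > 0.0:                         # differentiable case, formula (34)
        return 1.0 - a_i / norm, 1.0 - U_i / norm
    denom = np.sqrt(1.0 + hv_i ** 2)       # degenerate case, formulas (35)-(36)
    kappa, gamma = 1.0 / denom, hv_i / denom
    return 1.0 - kappa, 1.0 - gamma
```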


In our experiments, we ignore the safeguard steps (32) and (33) to save running time, because the
system (31) is always solvable and convergence is obtained in most cases with
the initial point a^0 = 0. For the rare cases where the termination condition g(a^k) ≤ tol in Step 1
cannot be fulfilled, we restart from a different initial point.

Appendix B: Steps 2–4 of Algorithm 8

Step 2. Search on the vertical line Ce = C.

Initialize the set GroupingVSet^left = GroupingVSet^lu.
2a: For every piece corresponding to a member of GroupingVSet^left, solve (20)
subject to Ce = C to obtain invariancy intervals. Let [εmin, εmax] be the largest
of these intervals. Do Add-In when repeating. Increase countLeft by 1.
2b: Solve LCP^f_SVR at (C, εmin) and obtain the grouping vectors set.
2c: Replace GroupingVSet^left by the set of the grouping vectors obtained in 2b. If
any members of GroupingVSet^left are not in the set GroupingVFound, add
them to the latter set.
2d: If the objective value of RLP is smaller than LeastUpperBound, update
LeastUpperBound.
2e: If εmin is greater than ε, let newStarting = (C, εmin − deviation).
2f: Solve LCP^f_SVR at newStarting and obtain the grouping vectors set. Do as in
2c–2d.
2g: Repeat 2a–2d until εmin = ε.
Step 3. Search on the horizontal line εe = ε.

Initialize the set GroupingVSet^bottom = GroupingVSet^ul.
3a: For every piece corresponding to a member of GroupingVSet^bottom, solve (19)
subject to εe = ε to obtain invariancy intervals. Let [Cmin, Cmax] be the largest
of these intervals. Do Add-In when repeating. Increase countBottom by 1.
3b: Solve LCP^f_SVR at (Cmin, ε) and obtain the grouping vectors set.
3c: Replace GroupingVSet^bottom by the set of the grouping vectors obtained in 3b.
If any members of GroupingVSet^bottom are not in the set GroupingVFound,
add them to the latter set.
3d: If the objective value of RLP is smaller than LeastUpperBound, update
LeastUpperBound.
3e: If Cmin is greater than C, let newStarting = (Cmin − deviation, ε).
3f: Solve LCP^f_SVR at newStarting and obtain the grouping vectors set. Do as in
3c–3d.
3g: Repeat 3a–3d until Cmin = C.


Step 4. Search on the vertical line Ce = C̄.

Initialize the set GroupingVSet^right = GroupingVSet^uu.
4a: For every piece corresponding to a member of GroupingVSet^right, solve (20)
subject to Ce = C̄ to obtain invariancy intervals. Let [εmin, εmax] be the largest
of these intervals. Do Add-In when repeating. Increase countRight by 1.
4b: Solve LCP^f_SVR at (C̄, εmin) and obtain the grouping vectors set.
4c: Replace GroupingVSet^right by the set of the grouping vectors obtained in 4b. If
any members of GroupingVSet^right are not in the set GroupingVFound, add
them to the latter set.
4d: If the objective value of RLP is smaller than LeastUpperBound, update
LeastUpperBound.
4e: If εmin is greater than ε, let newStarting = (C̄, εmin − deviation).
4f: Solve LCP^f_SVR at newStarting and obtain the grouping vectors set. Do as in
4c–4d.
4g: Repeat 4a–4d until εmin = ε.
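Steps 2–4 all follow the same edge-sweep pattern: repeatedly compute the largest invariancy interval along the edge, record the groupings and the best upper bound found there, and restart just beyond the end of the interval. The schematic sketch below mirrors Step 2; the helpers invariancy_interval, solve_lcp_f, and eval_rlp are placeholders for subproblem (20), LCP^f_SVR, and RLP, respectively, so this is an outline rather than the exact implementation.

```python
def sweep_left_edge(C_lo, eps_lo, eps_hi, grouping_set, found, deviation,
                    invariancy_interval, solve_lcp_f, eval_rlp):
    """Schematic edge sweep for Step 2: walk the line Ce = C_lo from eps_hi down to eps_lo."""
    least_upper_bound = float("inf")
    eps = eps_hi
    while eps > eps_lo:
        # 2a: largest invariancy interval on Ce = C_lo for the current grouping set.
        eps_min, _eps_max = invariancy_interval(grouping_set, C_lo)
        # 2b-2c: recompute the grouping set at (C_lo, eps_min) and record new groupings.
        grouping_set = solve_lcp_f(C_lo, eps_min)
        found |= grouping_set
        # 2d: keep the least valid upper bound seen so far.
        least_upper_bound = min(least_upper_bound, eval_rlp(grouping_set))
        # 2e-2f: step slightly past the interval end and continue until eps_min reaches eps_lo.
        eps = eps_min
        if eps > eps_lo:
            grouping_set = solve_lcp_f(C_lo, eps - deviation)
            found |= grouping_set
            least_upper_bound = min(least_upper_bound, eval_rlp(grouping_set))
    return least_upper_bound, found
```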

Appendix C: Cases 2–6 of Algorithm 10

case 2
  if GroupingVListTop(1) = GroupingVListBottom(1)
  and GroupingVListTop(2) = GroupingVListBottom(2) then
    The Area satisfies Case 2. Terminate the 2nd stage.
  else
    Go to Step 3, Exception 2.
  end if
end case
case 3
  if GroupingVListTop(1) = GroupingVListRight(1)
  and GroupingVListTop(2) = GroupingVListRight(2) then
    The Area satisfies Case 3. Terminate the 2nd stage.
  else
    Go to Step 3, Exception 2.
  end if
end case
case 4
  if GroupingVListLeft(1) = GroupingVListRight(1)
  and GroupingVListLeft(2) = GroupingVListRight(2) then
    The Area satisfies Case 4. Terminate the 2nd stage.
  else
    Go to Step 3, Exception 2.
  end if
end case
case 5
  if GroupingVListLeft(1) = GroupingVListBottom(1)
  and GroupingVListLeft(2) = GroupingVListBottom(2) then
    The Area satisfies Case 5. Terminate the 2nd stage.
  else
    Go to Step 3, Exception 2.
  end if
end case
case 6
  if GroupingVListBottom(1) = GroupingVListRight(2)
  and GroupingVListBottom(2) = GroupingVListRight(1) then
    The Area satisfies Case 6. Terminate the 2nd stage.
  else
    Go to Step 3, Exception 2.
  end if
end case

References
Arlot S, Celisse A (2010) A survey of cross-validation procedures for model selection. Stat Surv 4:40–79
Bard JF, Moore JT (1990) A branch and bound algorithm for the bilevel programming problem. SIAM J
Sci Stat Comput 11(2):281–292
Bemporad A, Morari M, Dua V, Pistikopoulos EN (2002) The explicit linear quadratic regulator for con-
strained systems. Automatica 38(1):3–20
Billups SC (1995) Algorithm for complementarity problems and generalized equations. PhD thesis, Uni-
versity of Wisconsin Madison
Burges CJC, Crisp DJ (1999) Uniqueness of the SVM solution. In NIPS’99, pp 223–229
Byrd RH, Nocedal J, Waltz RA (2006) Knitro: an integrated package for nonlinear optimization. In: Di Pillo
G, Roma M (eds) Large-scale nonlinear optimization. Nonconvex optimization and its applications,
vol 83. Springer, US, pp 35–59
Carrizosa E, Morales DR (2013) Supervised classification and mathematical optimization. Comput Oper
Res 40(1):150–165
Carrizosa E, Martín-Barragán B, Morales DR (2014) A nested heuristic for parameter tuning in support vector
machines. Comput Oper Res 43(0):328–334
Cawley GC, Talbot NLC (2010) On over-fitting in model selection and subsequent selection bias in perfor-
mance evaluation. J Mach Learn Res 11:2079–2107
Columbano S, Fukuda K, Jones CN (2009) An output-sensitive algorithm for multi-parametric LCPs with
sufficient matrices. In CRM proceedings and lecture notes, vol 48
De Luca T, Facchinei F, Kanzow C (1996) A semismooth equation approach to the solution of nonlinear
complementarity problems. Math Program 75:407–439
Demiriz A, Bennett KP, Breneman CM, Embrechts MJ (2001) Support vector machine regression in chemo-
metrics. In: Proceedings of the 33rd symposium on the interface of computing science and statistics
Facchinei F, Soares J (1997) A new merit function for nonlinear complementarity problems and a related
algorithm. SIAM J Optim 7(1):225–247
Facchinei F, Pang J-S (2003) Finite-dimensional variational inequalities and complementarity problems II.
Springer, New York
Ferris MC, Munson TS (2002) Interior-point methods for massive support vector machines. SIAM J Optim
13(3):783–804
Ferris MC, Munson TS (2004) Semismooth support vector machines. Math Program 101:185–204
Floudas CA, Gounaris C (2009) A review of recent advances in global optimization. J Glob Optim 45:3–38
Fung GM, Mangasarian OL (2004) A feature selection Newton method for support vector machine classi-
fication. Comput Optim Appl 28(2):185–202
Ghaffari-Hadigheh A, Romanko O, Terlaky T (2010) Bi-parametric convex quadratic optimization. Optim
Methods Softw 25:229–245


Gumus ZH, Floudas CA (2001) Global optimization of nonlinear bilevel programming problems. J Glob
Optim 20:1–31
IBM ILOG CPLEX Optimizer (2010) http://www-01.ibm.com/software/integration/optimization/
cplex-optimizer/
Hu J, Mitchell JE, Pang J-S, Bennett KP, Kunapuli G (2008) On the global solution of linear programs
with linear complementarity constraints. SIAM J Optim 19:445–471
Kecman V (2005) Support vector machines—an introduction. In: Wang L (ed) Support vector machines:
theory and applications. Studies in fuzziness and soft computing, vol 177. Springer, Berlin, pp 1–48
Keerthi SS, Lin C-J (2003) Asymptotic behaviors of support vector machines with Gaussian kernel. Neural
Comput 15(7):1667–1689
Kunapuli G (2008) A bilevel optimization approach to machine learning. PhD thesis, Rensselaer Polytechnic
Institute
Kunapuli G, Bennett KP, Hu J, Pang J-S (2008) Classification model selection via bilevel programming.
Optim Methods Softw 23(4):475–489
Kunapuli G, Bennett KP, Hu J, Pang J-S (2006) Model selection via bilevel programming. In: Proceedings
of the IEEE international joint conference on neural networks
Lee Y-C, Pang J-S, Mitchell JE (2015) An algorithm for global solution to bi-parametric linear comple-
mentarity constrained linear programs. J Glob Optim 62(2):263–297
Mangasarian OL, Musicant DR (1998) Successive overrelaxation for support vector machines. IEEE Trans
Neural Netw 10:1032–1037
Ng AY (1997) Preventing “overfitting” of cross-validation data. In: Proceedings of the fourteenth interna-
tional conference on machine learning. Morgan Kaufmann, Menlo Park, pp 245–253
Schittkowski K (2005) Optimal parameter selection in support vector machines. J Ind Manag Optim
1(4):465–476
Schölkopf B, Smola AJ (2001) Learning with kernels: support vector machines, regularization, optimization,
and beyond. MIT Press, Cambridge
Tøndel P, Johansen TA, Bemporad A (2003) An algorithm for multi-parametric quadratic programming and
explicit MPC solutions. Automatica 39(3):489–497
Vapnik V, Golowich SE, Smola AJ (1997) Support vector method for function approximation, regression
estimation and signal processing. In: Mozer M, Jordan MI, Petsche T (eds) Advances in neural infor-
mation processing systems, vol 9. Proceedings of the 1996 neural information processing systems
conference (NIPS 1996). MIT Press, Cambridge, pp 281–287
Yu B (2011) A branch and cut approach to linear programs with linear complementarity constraints. PhD
thesis, Rensselaer Polytechnic Institute
