Global resolution of the support vector machine regression...

EURO J Comput Optim (2015) 3:197–261
DOI 10.1007/s13675-015-0041-z
ORIGINAL PAPER
Received: 1 August 2013 / Accepted: 26 June 2015 / Published online: 15 July 2015
© EURO - The Association of European Operational Research Societies 2015
Abstract Support vector machine regression is a robust data-fitting method that minimizes the sum of the residuals deducted by the insensitive tube, and is thus less sensitive to changes of the data near the regression hyperplane. Two design parameters, the insensitive tube size (ε_e) and the weight (C_e) assigned to the regression error, trading it off against the norm of the support vector, are selected by the user to obtain better forecasts. The global training and validation parameter selection procedure for support vector machine regression can be formulated as a bi-level optimization model, which is equivalently reformulated as a linear program with linear complementarity constraints (LPCC). We propose a rectangle search global optimization algorithm to solve this LPCC. The algorithm exhausts the invariancy regions on the parameter plane (the (C_e, ε_e)-plane) without explicitly identifying the edges of the regions. The algorithm is tested on synthetic and real-world support vector machine regression problems with up to hundreds of data points, and its efficiency is compared with several other approaches. The obtained globally optimal parameters serve as an important benchmark for every other selection of the parameters.
Corresponding author: Yu-Ching Lee (ylee77@illinois.edu)
1 Introduction
The support vector machine (SVM) method was originally developed as a tool for data classification by Vapnik in 1964, and its use has been extended to regression since 1997
(Vapnik et al. 1997). This method has drawn much attention in the past 20 years because
of the good prediction accuracy it provides on practical applications in data mining
and machine learning. The SVM regression, or SVR, has two design parameters that
significantly affect its performance: (1) the size of the insensitivity zone, and (2) the
regularization parameter that is assigned to provide a trade-off between the absolute
residual and the separation of the data. These design parameters of the SVR are commonly selected by employing a grid search combined with the training and validation technique.
When grid-searching, the box-constrained feasible region on the parameter plane is
partitioned into grids where the intersection points of the grids correspond to the
pairs of parameters. These pairs form a pool of candidates. Provided a pre-determined
partition of the data set, the SVR model with the fixed parameters is trained by one or
several sets of the training data and validated by the validation data. The parameters
that lead to the least prediction error for the validation data are then the best choices
among the others in the pool.
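As a concrete baseline, the grid search described above can be sketched in a few lines. The snippet below is a minimal illustration, assuming scikit-learn's SVR as the lower-level trainer (its C and epsilon arguments play the roles of C_e and ε_e); the data, grid values, and train/validation split are illustrative choices, not taken from the paper.

```python
import numpy as np
from sklearn.svm import SVR  # assumed available; linear-kernel SVR trainer

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(80, 2))
y = X @ np.array([1.5, -0.5]) + 0.1 * rng.standard_normal(80)

# Pre-determined partition into training and validation data (single fold).
X_tr, y_tr, X_va, y_va = X[:60], y[:60], X[60:], y[60:]

best = (None, np.inf)
for C in [0.1, 1.0, 10.0, 100.0]:        # grid over C_e
    for eps in [0.01, 0.05, 0.1, 0.5]:   # grid over eps_e
        model = SVR(kernel="linear", C=C, epsilon=eps).fit(X_tr, y_tr)
        err = np.abs(model.predict(X_va) - y_va).sum()  # validation residual
        if err < best[1]:
            best = ((C, eps), err)
print(best)
```

Each grid point costs one training run, and the winner is only the best among the finitely many candidates, which is exactly the shortcoming the bi-level formulation addresses.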
The parameters selected by searching the grids, however, carry no guarantee of global optimality within the entire feasible region of parameters. A formulation of the bi-level optimization model has been shown to resolve this shortcoming
in the preceding work of Kunapuli (2008) and Yu (2011). The best parameter that can
be found by the training and validation technique is simply the global optimum of a
bi-level optimization model, where the two levels are referred to as the upper level
and lower level in our later discussion.
In the bi-level parameter selection problem of the SVM regression, the lower-level
optimization problem is an SVM regression model that determines a mapping featur-
ing the normal vector (ws ) and the bias (bs ), given the size of the insensitivity zone (εe )
and the regularization parameter (Ce ). The SVM regression model is a strictly convex
quadratic problem, and its Karush–Kuhn–Tucker (KKT) condition is necessary and
sufficient for optimality. The lower-level problem, thus, can be equivalently reformu-
lated by the KKT condition as a linear complementarity problem (LCP). Making use
of this LCP reformulation, the semismooth method (Ferris and Munson 2004), succes-
sive overrelaxation method (Mangasarian and Musicant 1998), and the interior-point
method (Ferris and Munson 2002) are algorithms that have been studied for solving
SVM problems with large numbers of data points, and have yielded some good results.
(Problems with up to 60 million data points and 34 features are solved in Ferris and
Munson (2002, 2004). Problems with up to 10 million data points and 32 features
are solved in Mangasarian and Musicant (1998).) On the other hand, the upper-level
problem is an absolute deviation regression model subject to the lower-level optimiza-
tion problem and the box constraints on its design parameters. There indeed exist
algorithms for solving the general nonlinear bi-level optimization to global optimum,
including the αBB-type method in Gumus and Floudas (2001), the branch-and-bound-
type method in Bard and Moore (1990), and other methods being reviewed in Floudas
and Gounaris (2009). Nevertheless, algorithmic efficiency in obtaining a globally optimal solution to medium- and large-scale nonlinear bi-level problems remains a challenge.
This work is among a recent series of research about selecting and validating the
parameters of the SVR, which reformulates the bi-level model as a model of bi-
parametric linear complementarity constrained program. Previous works in this series
include the studies presented in Kunapuli et al. (2006, 2008), Kunapuli (2008), Jing
et al. (2008), Yu (2011), Lee et al. (2014). A multi-fold cross-validated SVR parameter
selection model containing a “feature selection” scheme is considered in Kunapuli et al.
(2006, 2008), Kunapuli (2008). The benchmark for parameter selection in their work
is the performance of the selected SVR model on the “hold-out” set of testing data
points rather than the quality of the solution to the bi-level program. The numerical
results in Kunapuli et al. (2006) have shown that the bi-level programming approach is
resistant to overfitting. (Similar results about reducing overfitting when employing the
cross-validation technique can be seen in Cawley and Talbot (2010), Ng (1997), Arlot
and Celisse (2010).) In Yu (2011), a two-stage branch-and-cut algorithm is proposed
for solving the linear program with complementarity constraints (LPCC) to global
optimum. It is concluded that if the lower bound of the objective value can be pushed
closer to the upper bound in the preprocessing stage, the global solution is identified
efficiently in the second stage in Yu (2011). The optimization algorithms proposed in
Jing et al. (2008), Lee et al. (2014) are designed for the general and the bi-parametric
forms of the LPCC, respectively.
The selection of the kernel function's parameters, of which there are one or more depending on the type of kernel function, is also an important issue for the SVM. None of the above papers, including this work, extends the bi-level optimal parameter selection scheme to the SVR with a kernel function. We refer the interested readers to Sathiya Keerthi and Lin (2003), Carrizosa et al. (2014) for heuristic methods, Schittkowski (2005) for a mathematical programming approach, and the approaches surveyed in Carrizosa and Morales (2013) about selecting the kernel parameters alone.
We develop algorithms to accommodate the structure of SVM regression parameter
selection. The main algorithm we propose is to search the optimal design parameters
in the feasible region on (Ce , εe )-plane. To do this, the concept of “invariancy region”
for the inner level problem is crucial. An invariancy region is a region in the parameter
space where the basis remains unchanged. Invariancy regions are convex, and they
partition the whole feasible region without overlapping. The searching and partitioning
scheme on (Ce , εe )-plane initiates the search from a fixed point (Ce , εe ), with which a
lower-level SVR problem with fixed parameters is solved, and proceeds by identifying
the invariancy interval along one chosen direction. A queue of the rectangular (Ce , εe )-
areas, which result from the partitioning, is maintained and searched one by one. For
each rectangular area, we either conclude that all the invariancy regions inside the
area have been examined and remove the rectangle from the queue, or we partition the
rectangular areas horizontally and/or vertically, and add the new rectangular subareas
to the queue while removing the original one. The algorithm terminates when all the
rectangular areas in the queue are eliminated. The solution obtained from this algorithm
can be verified to be globally optimal. Unlike other methods (Ghaffari-Hadigheh et al. 2010; Columbano et al. 2009; Tondel et al. 2003) that also perform the search on
the parameter plane, the boundaries of invariancy regions are not explicitly identified
in our algorithm. Although revisiting a previously examined invariancy region is not
avoided, the effort of finding and storing the boundaries of invariancy regions is saved.
Since the algorithm involves exploring the invariancy regions, we expect the solution time to be proportional to the number of feasible invariancy regions. This conjecture is
confirmed in the numerical experiments.
We propose a second way to solve the parameter training and validation for SVM
regression as an improved integer program (IP). The linear complementarity con-
strained program can be formulated as such an IP via the big-M technique. The valid
values of big-M for this specific application are derived from finding the upper bound
on the multipliers of the lower-level problem of the SVM. Moreover, we propose a procedure to further tighten the upper bound on the multipliers. The tightened multiplier upper bounds reduce the feasible region of the IP and have enabled us to improve the running time for solving the IP with CPLEX (IBM ILOG CPLEX Optimizer 2010). The IP that employs the improved big-M values is what we refer to as the
improved IP. The solutions produced by the improved IP and the (Ce , εe )-rectangular
search mentioned in the previous paragraph are both globally optimal. However, the improved IP loses its efficiency, or even its ability to solve a problem at all, when the problem size exceeds some threshold. By monitoring the branch-and-bound process, we notice that the gap between the lower and upper bounds reduces very
slowly because the lower bound of the objective is hardly improved, whereas the upper
bound of the objective is usually tight.
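To make the big-M construction concrete, the toy check below encodes a single complementarity pair 0 ≤ x ⊥ y ≥ 0 with one binary variable; the bound M = 10 is an assumed valid upper bound, and the paper's tightening procedure is not reproduced here.

```python
# Illustrative big-M encoding of one complementarity pair 0 <= x ⊥ y >= 0
# with a binary z:  x <= M*z  and  y <= M*(1 - z).  M must upper-bound both
# variables for the encoding to be valid; the bound below is assumed.
M = 10.0

def feasible(x, y, z, M=M):
    """Check the big-M constraint system for one complementarity pair."""
    return (0 <= x <= M * z + 1e-9) and (0 <= y <= M * (1 - z) + 1e-9)

# Every point satisfying the constraints is complementary (x*y == 0)...
for x in [0.0, 2.5, 10.0]:
    for y in [0.0, 2.5, 10.0]:
        for z in [0, 1]:
            if feasible(x, y, z):
                assert x * y == 0.0
# ...but an invalid (too small) M cuts off legitimate solutions: x = 2.5,
# y = 0 is complementary, yet infeasible for every z when M = 1.
assert not any(feasible(2.5, 0.0, z, M=1.0) for z in [0, 1])
print("big-M encoding verified")
```

The second check is the motivation for tightening: a smaller yet still valid M shrinks the relaxation the branch-and-bound works with, while an invalid M destroys correctness.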
The contributions of this paper can be seen from the mathematics, algorithms, and
applications points of view. The first contribution is in defining mathematically the
global optimum for the training and validation SVM regression parameter selection
problem, bridging the area of mathematical programming with machine learning. Such
a global optimal parameter and its corresponding regression residual can serve as a
benchmark for other parameters selected by heuristics, such as grid search. The second
contribution is in the development of the two approaches proposed to solve the problem
to global optimum: the (Ce , εe )-rectangle search algorithm that aims to take advantage
of existing efficient methods of solving the lower-level problem and an improved IP
model that relies on an existing IP solver. The algorithms are tested on instances with
single- and multi-fold training and validation data sets, including those generated by
us and those from the real world. A significant number of numerical experiments are
presented to uncover the strengths and limits of each algorithm. We show that the level of difficulty of the instances input to the (C_e, ε_e)-rectangle search algorithm can be classified by a four-quadrant diagram. Comparing the convergence times of instances in different quadrants is meaningless because they belong to different scales of difficulty. The running time of the instances in the same quadrant is proportional to the
size of data. The third contribution is in providing a way to evaluate other parameter
selection algorithms. We compare the global solution with the solution produced by a non-global commercial optimization solver. Although substantially more time is needed by our algorithm (while the solver can produce a non-global optimal solution in seconds), the solution quality is guaranteed.
So far, our methodology can only deal with 2-parameter instances with small training data sets on a single computer. Future research that integrates the methodology with computing techniques should have a good chance of reducing the convergence time and solving problems with larger training data sets. A straightforward extension of the methodology to instances with more than 2 parameters, however, is not possible.
The remaining part of this paper is organized as follows. Section 2 introduces the
SVM classification and regression models and further derives the SVM regression
parameter selection problem as a LPCC. Section 3 introduces the (Ce , εe )-rectangle
search algorithm, including the semismooth method for solving the lower-level prob-
lem of this bi-level parameter selection model, our definition of the invariancy region,
and the recording method of the geometrical allocation of data points. We will explain
how the algorithm searches for the invariancy region on chosen sides of a rectangle, thus reducing the effort to a search for the invariancy "interval", and how the algorithm verifies whether an input rectangular region on the (C_e, ε_e)-plane belongs to
one or two invariancy regions. Section 4 describes the big-M tightening algorithm
that we employed to form the modified IP. Section 5 displays the numerical results
of the (Ce , εe )-rectangle search algorithm and the modified IP. These global optimal
solutions are compared with the local optimal solution obtained by KNITRO (Byrd
et al. 2006). The performance of the algorithms and the difficulty of the instances are
depicted and analyzed. Section 6 summarizes the paper and concludes the findings
from this research.
In Sect. 2.1, we review the parameters used in the soft margin SVM classification, hard
margin SVM classification, and SVM regression. The soft margin SVM classification
involves one parameter Ce , the hard margin SVM classification does not need any
parameters, and the SVM regression involves two parameters (Ce , εe ). In Sect. 2.2,
we introduce the global training and validation technique in selecting the parameters
for the SVM regression specifically. This technique can be formulated as an LPCC.
It can be seen that the multi-fold cross-validation method is a special case of our
formulation. Following this formulation, the meaning of the “optimal parameter” in
our framework is clarified.
The SVM model for classification is the original version of every SVM-related study. The SVM classifier is a hyperplane w_s^T x_d^j + b_s = y_dj ¹ that classifies the data points (x_d^j, y_dj) ∀ j ∈ D, where x_d^j ∈ R^K, y_dj ∈ {+1, −1}, and D denotes the set of training data. The dimension of x_d^j represents the number of features or characteristics that describe the data point j ∈ D, and the value of the variable y_dj indicates the group to which the data point j belongs. A data point j is said to be misclassified by the hyperplane w_s^T x_d^j + b_s = y_dj if the predicted y_dj is +1 yet the true y_dj is −1, or vice versa. A tube area around the target hyperplane w_s^T x_d^j + b_s = 0 is defined by the two parallel hyperplanes w_s^T x_d^j + b_s = +1 and w_s^T x_d^j + b_s = −1. In the basic setting of the SVM classification, the size of the tube (estimated by 2/‖w_s‖) is maximized.

¹ The 1st-order subscript represents the role of that mathematical expression: subscript d, data; subscript s, support vector machine regression; subscript e, exogenous parameter to the support vector machine regression.
In a soft margin SVM classification, the desired hyperplane (w_s, b_s) is the optimizer of the following program:

$$
\begin{aligned}
\min_{w_s, b_s, \xi} \quad & C_e \sum_{j \in D} \xi_j + \frac{1}{2} w_s^T w_s \\
\text{subject to} \quad & w_s^T x_d^j + b_s \ge 1 - \xi_j, \quad \forall\, y_{dj} = +1, \\
& w_s^T x_d^j + b_s \le -1 + \xi_j, \quad \forall\, y_{dj} = -1, \\
\text{and} \quad & \xi_j \ge 0,
\end{aligned} \tag{1}
$$

where C_e is a parameter trading off the two terms in the objective function, and ξ is the slack. The hyperplane (w_s, b_s) minimizes the sum of the distances between the misclassified observations and the closer bound of the tube, and also maximizes the margin size, 2/‖w_s‖, between w_s^T x_d^j + b_s = +1 and w_s^T x_d^j + b_s = −1.
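Program (1) can also be attacked in an unconstrained form: at the optimum the slack works out to ξ_j = max(0, 1 − y_dj(w_s^T x_d^j + b_s)), giving a hinge-loss objective. The sketch below minimizes it by subgradient descent; the toy data, step sizes, and iteration count are illustrative assumptions, not from the paper.

```python
# A minimal subgradient-descent sketch of program (1) via its hinge-loss
# form  C_e * sum_j max(0, 1 - y_j (w^T x_j + b)) + 0.5 * ||w||^2.
import numpy as np

def soft_margin_svm(X, y, C=10.0, steps=2000, lr=0.01):
    n, k = X.shape
    w, b = np.zeros(k), 0.0
    for t in range(steps):
        margins = y * (X @ w + b)
        active = margins < 1                      # points with positive slack
        # Subgradient of C * sum(xi_j) + 0.5 * ||w||^2
        gw = w - C * (y[active, None] * X[active]).sum(axis=0)
        gb = -C * y[active].sum()
        step = lr / (1 + 0.01 * t)                # diminishing step size
        w, b = w - step * gw, b - step * gb
    return w, b

# Linearly separable toy data: class +1 above the line x2 = x1, -1 below.
X = np.array([[0.0, 1.0], [1.0, 2.0], [2.0, 3.5],
              [1.0, 0.0], [2.0, 1.0], [3.0, 1.5]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
w, b = soft_margin_svm(X, y)
print(np.sign(X @ w + b))   # should recover the labels
```

On separable data such as this, the method approaches the maximal-margin hyperplane as C grows; with overlapping classes, the C-weighted slack term takes over exactly as described above.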
In contrast to the soft margin SVM, the hard margin SVM does not allow any misclassification. To formulate the hard margin SVM, the model in (1) is revised by removing the C_e Σ_{j∈D} ξ_j term from the objective function and removing the −ξ_j and +ξ_j terms from the two constraints, respectively. In Kecman (2005), the hard margin version is referred to as a "linear maximal margin classifier for linearly separable data", and the soft margin version as a "linear soft margin classifier for overlapping classes."
Extended from the classification, the SVM regression identifies a hyperplane y_dj = w_s^T x_d^j + b_s where y_dj is a real-valued dependent variable. Besides the C_e parameter, the other parameter, ε_e, defines the size of the tube in which the residual is neglected. Given C_e and ε_e, the desired hyperplane minimizes the following problem:

$$
\min_{w_s, b_s} \left\{ C_e \sum_{j=1}^{n} \max\left(\left|w_s^T x_d^j + b_s - y_{dj}\right| - \varepsilon_e,\ 0\right) + \frac{1}{2} w_s^T w_s \right\}. \tag{2}
$$

The ½ w_s^T w_s term in the objective of (2) is directly borrowed from its usage in (1). This term is also called a regularization term because it imposes strong convexity and forces the optimal w_s to be unique. The other term in the objective is the (least) absolute residual outside the ε_e-insensitive tube, i.e., the tube area defined by the two parallel hyperplanes w_s^T x_d^j + b_s = +ε_e and w_s^T x_d^j + b_s = −ε_e.
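For a fixed (C_e, ε_e), problem (2) is a small nonsmooth convex problem and can be solved numerically without forming the dual. The sketch below evaluates the ε-insensitive objective on an illustrative one-feature data set and minimizes it with SciPy's derivative-free Powell method; the data and parameter values are assumptions for illustration.

```python
# A sketch that solves the lower-level SVR problem (2) numerically for fixed
# (C_e, eps_e).  The derivative-free Powell method is used because the
# eps-insensitive loss is nonsmooth; data are an illustrative 1-feature set.
import numpy as np
from scipy.optimize import minimize

x = np.linspace(0, 1, 20)
y = 2.0 * x + 0.3                      # exactly linear data

def svr_objective(theta, C_e=100.0, eps_e=0.05):
    w, b = theta
    resid = np.abs(w * x + b - y)
    return C_e * np.maximum(resid - eps_e, 0.0).sum() + 0.5 * w * w

sol = minimize(svr_objective, x0=[0.0, 0.0], method="Powell")
w_opt, b_opt = sol.x
print(round(w_opt, 2), round(b_opt, 2))
```

On this exactly linear data the analytic optimum is w = 1.9, b = 0.35: shrinking the slope by 0.1 (with the intercept re-centered) keeps every residual inside the 0.05 tube while reducing the regularization term, and any further shrinkage is punished by the large C_e.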
We consider F folds of training and validation data. That is to say, we divide the training observations (x_d^j, y_dj), j = 1, …, n into F groups and the validation observations (x_d^j, y_dj), j = n+1, …, n+m into another F groups. The partition of the data set is assumed to be pre-determined and unchangeable.
Given any (C_e, ε_e) pair, a hyperplane (w_s^f, b_s^f) minimizing the objective function of the SVR is obtained by solving the model (2) for each fold of the training data. Using this trained hyperplane, the regression residuals of the validation data in each fold are then calculated. A small cumulative regression error on the validation data is desired, and this hope might be achieved by choosing a different (C_e, ε_e) pair and repeating the aforementioned process. The procedure of sequentially choosing a (C_e, ε_e) pair, solving for the optimal hyperplane on the training data, and computing the regression errors on the validation data is what we call the training and validation technique. Such a parameter selection process is nontrivial, since there is no guarantee that a C_e or ε_e with a large value always produces a smaller regression error than those resulting from smaller values of the parameters.
Is there really a “best choice” of the parameters? We say that a set of parameters is
“good” if it defines a model that forecasts the future with small errors. Since the future
is not known, it is impossible to find the parameter that will yield a precise prediction.
It should be clearly noted that the optimal parameters studied in this paper are in fact the best merely under the framework of training and validation with a fixed number of features and a fixed partition of the validation and training data sets.
The observations are arbitrarily labeled in each subset. We denote the first and the last index of the observations in the f-th training data set as front^f and end^f, respectively, and there are n_f observations in the f-th training data set. Similarly, the first and the last index of the observations in the f-th validation data set are denoted as front_v^f and end_v^f, respectively, and there are m_f observations in the f-th validation data set.
The training and validation process of the SVR parameter selection is formulated
as follows:
$$
\begin{aligned}
\min_{C_e, \varepsilon_e, w_s^1, b_s^1, \ldots, w_s^F, b_s^F} \quad & \sum_{f=1}^{F} \sum_{i=\mathrm{front}_v^f}^{\mathrm{end}_v^f} \left| (w_s^f)^T x_d^i + b_s^f - y_{di} \right| \\
\text{subject to} \quad & 0 \le \underline{C} \le C_e \le \overline{C}, \quad 0 \le \underline{\varepsilon} \le \varepsilon_e \le \overline{\varepsilon}, \\
\text{and } \forall f = 1, \ldots, F: \quad & (w_s^f, b_s^f) \in \text{the solution set of the following problem} \\
& \min_{w_s^f, b_s^f} \left\{ C_e \sum_{j=\mathrm{front}^f}^{\mathrm{end}^f} \max\left(\left|(w_s^f)^T x_d^j + b_s^f - y_{dj}\right| - \varepsilon_e,\ 0\right) + \frac{1}{2} (w_s^f)^T w_s^f \right\}.
\end{aligned} \tag{3}
$$
It is known that the KKT conditions of the lower-level problems in (3) are sufficient for optimality because the lower-level problems are convex. The lower-level problems, thus, can be replaced by their KKT conditions, which are expressed as F folds of the linear complementarity problem LCP_SVR^f as follows:
$$
\forall f = 1, \ldots, F: \quad
\mathrm{LCP}_{\mathrm{SVR}}^f :=
\begin{cases}
\displaystyle w_s^f - \sum_{j=\mathrm{front}^f}^{\mathrm{end}^f} (\beta_j - \alpha_j)\, x_d^j = 0, \\[2ex]
\displaystyle 0 = \sum_{j=\mathrm{front}^f}^{\mathrm{end}^f} \alpha_j - \sum_{j=\mathrm{front}^f}^{\mathrm{end}^f} \beta_j \ \perp\ b_s^f, \\[2ex]
\forall j = \mathrm{front}^f, \ldots, \mathrm{end}^f: \\[1ex]
\quad 0 \le e_{sj} + \varepsilon_e - (x_d^j)^T w_s^f - b_s^f + y_{dj} \ \perp\ \alpha_j \ge 0, \\[1ex]
\quad 0 \le e_{sj} + \varepsilon_e + (x_d^j)^T w_s^f + b_s^f - y_{dj} \ \perp\ \beta_j \ge 0, \\[1ex]
\quad 0 \le C_e - \alpha_j - \beta_j \ \perp\ e_{sj} \ge 0.
\end{cases} \tag{4}
$$
Replacing the lower-level problems in (3) by the conditions (4) yields the following LPCC:

$$
\begin{aligned}
\min_{\substack{C_e, \varepsilon_e, w_s^1, b_s^1, \ldots, w_s^F, b_s^F, \\ p_i, \alpha_j, \beta_j, e_{sj}}} \quad & \sum_{f=1}^{F} \sum_{i=\mathrm{front}_v^f}^{\mathrm{end}_v^f} p_i \\
\text{subject to} \quad & 0 \le \underline{C} \le C_e \le \overline{C}, \\
& 0 \le \underline{\varepsilon} \le \varepsilon_e \le \overline{\varepsilon}, \\
& (x_d^i)^T w_s^1 + b_s^1 - y_{di} \le p_i, \quad \forall i = n+1, \ldots, n+m_1, \\
& -(x_d^i)^T w_s^1 - b_s^1 + y_{di} \le p_i, \quad \forall i = n+1, \ldots, n+m_1, \\
& (x_d^i)^T w_s^2 + b_s^2 - y_{di} \le p_i, \quad \forall i = n+m_1+1, \ldots, n+m_1+m_2, \\
& -(x_d^i)^T w_s^2 - b_s^2 + y_{di} \le p_i, \quad \forall i = n+m_1+1, \ldots, n+m_1+m_2, \\
& \qquad \vdots \\
& (x_d^i)^T w_s^F + b_s^F - y_{di} \le p_i, \quad \forall i = \mathrm{front}_v^F, \ldots, \mathrm{end}_v^F, \\
& -(x_d^i)^T w_s^F - b_s^F + y_{di} \le p_i, \quad \forall i = \mathrm{front}_v^F, \ldots, \mathrm{end}_v^F, \\
\text{and} \quad & \text{the constraints in (4).}
\end{aligned} \tag{5}
$$
The optimal parameter is now well defined as the global optimal solution to the
LPCCs. Note that the F-fold cross-validation approach for the parameter selection is
a special case of the formulation (5) when the observations are repeatedly contained
in each subset of training and validation data. See the F-fold cross-validation LPCC
formulation in Kunapuli et al. (2008), Kunapuli (2008).
While the optimal w_s solution to problem (2) is unique due to the strict convexity imposed by the 2-norm regularization term (1/2) w_s^T w_s, the optimal b_s solution is not unique in the SVR problem (see Burges and Crisp 1999 for exceptions). This fact implies that the direct use of the values (w_s^f, b_s^f) obtained by solving the linear complementarity program (4) does not necessarily yield the minimum value of the upper-level objective

$$
\sum_{f=1}^{F} \sum_{i=\mathrm{front}_v^f}^{\mathrm{end}_v^f} \left| (w_s^f)^T x_d^i + b_s^f - y_{di} \right|. \tag{6}
$$

The optimal b_s^f's, which form an interval denoted by [b_min^f, b_max^f], should be obtained by the following procedure.
Given the unique optimal w_s^f values and any single optimal value of b_s^f, the lower-level objective value V^f corresponding to a fixed (C_e, ε_e) is computed by

$$
V^f = C_e \sum_{j=\mathrm{front}^f}^{\mathrm{end}^f} \max\left( \left| (w_s^f)^T x_d^j + b_s^f - y_{dj} \right| - \varepsilon_e,\ 0 \right) + \frac{1}{2} (w_s^f)^T w_s^f
$$

for each fold f. Then, solve one maximizing model and one minimizing model for each fold f = 1, …, F:
$$
\begin{aligned}
b_{\max}^f / b_{\min}^f = \max/\min_{b_s^f,\, a_j} \quad & b_s^f \\
\text{subject to} \quad & C_e \sum_{j=\mathrm{front}^f}^{\mathrm{end}^f} a_j \le -\frac{1}{2} (w_s^f)^T w_s^f + V^f, \\
& b_s^f - a_j \le -(w_s^f)^T x_d^j + y_{dj} + \varepsilon_e, \quad \forall j = \mathrm{front}^f, \ldots, \mathrm{end}^f, \\
& -b_s^f - a_j \le (w_s^f)^T x_d^j - y_{dj} + \varepsilon_e, \quad \forall j = \mathrm{front}^f, \ldots, \mathrm{end}^f, \\
& a_j \ge 0, \quad \forall j = \mathrm{front}^f, \ldots, \mathrm{end}^f,
\end{aligned}
$$

which will give us the optimal intervals of b_s^f: [b_min^f, b_max^f].
For an end user, selecting a b_s ∈ [b_min, b_max] to be used in prediction is of equal importance to selecting the parameters to train the SVM. The simplest selection is the b_s value produced by the software, or the midpoint of b_min and b_max. Sometimes, one can select b_s ∈ [b_min, b_max] to adjust the number of false positives and false negatives, as mentioned in Scholkopf and Smola (2001). If data are at hand, following the previous setting of the validation data and the optimal w_s^f, the optimal b_s^f that minimizes the validation absolute error (6) can easily be obtained by solving a linear mathematical program:
$$
\begin{aligned}
\min_{b_s^1, b_s^2, \ldots, b_s^F,\, p_i} \quad & \sum_{f=1}^{F} \sum_{i=\mathrm{front}_v^f}^{\mathrm{end}_v^f} p_i \\
\text{subject to} \quad & b_{\min}^f \le b_s^f \le b_{\max}^f, \quad \forall f = 1, \ldots, F, \\
& (x_d^i)^T w_s^1 + b_s^1 - y_{di} \le p_i, \quad \forall i = n+1, \ldots, n+m_1, \\
& -(x_d^i)^T w_s^1 - b_s^1 + y_{di} \le p_i, \quad \forall i = n+1, \ldots, n+m_1, \\
& (x_d^i)^T w_s^2 + b_s^2 - y_{di} \le p_i, \quad \forall i = n+m_1+1, \ldots, n+m_1+m_2, \\
& -(x_d^i)^T w_s^2 - b_s^2 + y_{di} \le p_i, \quad \forall i = n+m_1+1, \ldots, n+m_1+m_2, \\
& \qquad \vdots \\
& (x_d^i)^T w_s^F + b_s^F - y_{di} \le p_i, \quad \forall i = \mathrm{front}_v^F, \ldots, \mathrm{end}_v^F, \\
& -(x_d^i)^T w_s^F - b_s^F + y_{di} \le p_i, \quad \forall i = \mathrm{front}_v^F, \ldots, \mathrm{end}_v^F.
\end{aligned} \tag{7}
$$
Note that the b_s^f obtained by solving (7) will be the same as the global solution b_s^f produced by the bi-level model (3) when the intervals [b_min^f, b_max^f] and w_s^f in (7) are obtained at the globally optimal parameters C_e and ε_e. In other words, the bi-level formulation implicitly produces an optimal b_s = Σ_{f=1}^F b_s^f / F and w_s = Σ_{f=1}^F w_s^f / F simultaneously with the global parameters. The analysis of this (w_s, b_s) has not yet been covered in this work.
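With w_s^f fixed, program (7) decouples across folds into one-dimensional least-absolute-deviation problems with a closed-form solution: the minimizer over b_s of Σ_i |b_s − r_i|, where r_i = y_di − (w_s^f)^T x_d^i, is a median of the r_i, and the constrained minimizer is that median clipped to [b_min^f, b_max^f]. The sketch below illustrates this for one fold; the interval endpoints and data are illustrative assumptions, not computed from an actual SVR run.

```python
# For one fold with w_s fixed, program (7) minimizes sum_i |b_s - r_i| over
# b_s in [b_min, b_max], where r_i = y_i - w_s^T x_i are the per-point ideal
# bias values.  The unconstrained minimizer is a median of the r_i, so the
# constrained one is that median clipped to the interval.
import numpy as np

def best_bias(ws, X_val, y_val, b_min, b_max):
    r = y_val - X_val @ ws          # per-point ideal bias values
    return float(np.clip(np.median(r), b_min, b_max))

ws = np.array([2.0])
X_val = np.array([[0.0], [1.0], [2.0], [3.0]])
y_val = np.array([0.5, 2.4, 4.6, 6.5])
# r = [0.5, 0.4, 0.6, 0.5] -> median 0.5
print(best_bias(ws, X_val, y_val, b_min=0.0, b_max=1.0))   # 0.5
print(best_bias(ws, X_val, y_val, b_min=0.6, b_max=1.0))   # clipped to 0.6
```

Clipping is valid here because the objective is convex and piecewise linear in b_s, so its constrained minimum over an interval is attained by projecting the unconstrained minimizer onto that interval.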
We demonstrate a global optimization algorithm that solves the program (5). At every
iteration, we fix the design parameters at different values to solve the lower-level
problem, and with this lower-level solution, an upper bound of the outer-level problem
can be obtained. The algorithm is named rectangle search because the search over parameter values proceeds along the boundaries of rectangular areas to obtain the information needed for termination or further area partitioning.
For any fixed values of (C_e, ε_e), we can solve the lower-level problem (4), an LCP, by existing methodologies, such as the semismooth method. The active and inactive constraints of the solved LCP provide us with a set of linear equalities and inequalities. We call this set of linear equalities and inequalities a piece. Replacing the complementarity constraints in model (5) by the linear constraints (piece) restricts the feasible C_e and ε_e to a smaller but convex region. This region is called an invariancy region on the (C_e, ε_e)-plane. Because of the "invariancy", it is sufficient to find the local best (C_e, ε_e) pair and a local minimum of (5) by solving a linear program (restricted linear program) subject to the piece resulting from the LCP fixed at an arbitrary point within the same convex region. Then, we search for the next (C_e, ε_e) (outside the current invariancy region) at which the lower-level problem is again fixed and solved. The above procedure repeats until the algorithm exhausts all the piecewise-convex feasible regions² and achieves global optimality.
² By exhausting all the invariancy regions, we mean that for every invariancy region, at least one (C_e, ε_e) point within the region is chosen and the associated restricted linear program is solved.
The specialty of our algorithm lies in the search technique for the next (C_e, ε_e). The rectangle search scheme we propose decomposes the entire feasible region into small rectangles. Searching along the boundaries of these rectangles reduces the effort of identifying the geometric location of the invariancy regions to that of identifying the geometric location of invariancy intervals on a given line. Consider the initial [C̲, C̄] × [ε̲, ε̄]
rectangular area on the 2-dimensional (C_e, ε_e)-plane. We first fix (C_e, ε_e) at one vertex of the rectangular area and solve the lower-level program there. Along one side of the boundary, we can easily identify the endpoints of the invariancy interval to which this (C_e, ε_e) point belongs. Then, we find a new (C_e, ε_e) point on the same side of the boundary but outside the current invariancy interval, and solve another lower-level problem with this new (C_e, ε_e). All the invariancy intervals along the four sides of the boundary are identified with this repeated procedure.
For each invariancy region, there is a corresponding allocation of the SVR data points in the feature space. We call one specific data-point allocation a grouping. Recording groupings is equivalent to recording the invariancy regions that have been found. Based on the grouping information associated with the invariancy intervals along the boundary, one can either conclude that all the invariancy regions contained in the rectangular area have been examined or that further area partitioning is required. Throughout the algorithm, we maintain a queue of rectangular areas to be examined. A rectangular area can be removed from the queue if (1) the four corner points of the (C_e, ε_e)-rectangular area have the same grouping vector, or (2) the (C_e, ε_e)-rectangular area is bisected into two regions by a straight line, each belonging to one invariancy region. These two sufficient conditions are called the 1st and the 2nd stage of the algorithm, respectively.
To summarize, this algorithm contains the following 7 key procedural tasks:
1. Solve the lower-level problem with a fixed (C_e, ε_e) by known methodologies, such as the semismooth method (Ferris and Munson 2004).
2. Replace the complementarity constraints by linear constraints (piece), which restricts the feasible region to a smaller but convex region (invariancy region).
3. Solve for the local best (C_e, ε_e) and the corresponding minimum of (5) within the invariancy region. This can be done by solving a linear program.
4. Search for the next (C_e, ε_e) (outside the current invariancy region) at which the lower-level problem is fixed and solved. We propose the rectangle search scheme to search along the boundary of a rectangle, so that the invariancy region is reduced to an invariancy interval.
5. Partition the initial [C̲, C̄] × [ε̲, ε̄] area into small rectangular regions at chosen points and maintain a queue of rectangular areas to be examined.
6. Maintain a list of visited invariancy regions by recording their corresponding data allocations in space (groupings).
7. Eliminate a rectangular area from the queue if (1) the four corner points of the (C_e, ε_e)-rectangular area have the same grouping vector, or (2) the (C_e, ε_e)-rectangular area is bisected into two regions by a straight line, each belonging to one invariancy region.
Task 1 is discussed in Sect. 3.1; the linear constraints (piece) and the convex region
(invariancy region) mentioned in task 2 are defined in Sect. 3.2; the allocation of data
123
Author's personal copy
208 Y.-C. Lee et al.
points in the feature space (grouping) mentioned in task 6 and the degeneracy issue
are also discussed in Sect. 3.2; the linear program mentioned in task 3 is introduced
in Sect. 3.3; the search strategy along the boundary of a rectangle mentioned in task 4
can be seen in Sect. 3.4; the partitioning, recording, and eliminating steps mentioned
in tasks 5–7 are shown in Sects. 3.5 and 3.6.
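Tasks 5–7 above amount to a queue-driven loop over rectangles. The following is a minimal sketch of that control flow, not the paper's implementation: `grouping` is a hypothetical stand-in for solving the lower-level LCP and applying Algorithm 3, and the plain midpoint split used here replaces the interval-midpoint partitioning and the 2nd-stage line test described later.

```python
from collections import deque

def grouping(Ce, eps):
    # Hypothetical stand-in for solving the lower-level LCP at (Ce, eps)
    # and converting the solution to a grouping vector (tasks 1 and 6).
    # Here the parameter plane is split by the line eps = Ce for illustration.
    return 1 if eps >= Ce else 2

def rectangle_search(C_lo, C_hi, e_lo, e_hi, min_size=1e-2):
    """Queue-driven sketch of tasks 5-7: partition, examine, eliminate."""
    queue = deque([(C_lo, C_hi, e_lo, e_hi)])
    groupings_found = set()
    while queue:
        c0, c1, e0, e1 = queue.popleft()
        corners = [grouping(c, e) for c in (c0, c1) for e in (e0, e1)]
        groupings_found.update(corners)
        if len(set(corners)) == 1:
            continue  # one invariancy region covers the rectangle: eliminate it
        if c1 - c0 < min_size and e1 - e0 < min_size:
            continue  # too small to refine further
        cm, em = (c0 + c1) / 2, (e0 + e1) / 2
        queue.extend([(c0, cm, e0, em), (cm, c1, e0, em),
                      (c0, cm, em, e1), (cm, c1, em, e1)])
    return groupings_found

print(sorted(rectangle_search(0.1, 10.0, 0.1, 1.0)))  # → [1, 2]
```

The elimination test is exactly the four-corner condition of task 7; an uneliminated rectangle is simply refined until its groupings are all exposed.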
The lower-level problem with a fixed (Ce , εe ) is the SVM regression model for each
fold of training data. We have shown that the lower-level problem is equivalent to a
collection of LCPs. To solve these LCPs, we employ the semismooth method (Luca
et al. 1996), which involves the use of the semismooth Fischer–Burmeister function.
Consider the general complementarity problem:

0 ≤ U_i(a) ⊥ a_i ≥ 0, ∀ i ∈ I,
0 = U_i(a), a_i free, ∀ i ∈ E, (8)

where I and E denote the nonoverlapping index sets of the inequalities and equalities, respectively. The Fischer–Burmeister function for the LCP (8) is defined as:

φ(a_i, U_i(a)) := a_i + U_i(a) − √(a_i² + U_i(a)²).

It has been proven in many papers, such as Facchinei and Soares (1997), that the merit function Φ(a), whose components are φ(a_i, U_i(a)), has several desirable properties: Φ(a) is a semismooth function, and g(a) := ½‖Φ(a)‖₂² is continuously differentiable. Most importantly, for any H ∈ ∂_B Φ(a), where ∂_B Φ(a) denotes the B-subdifferential of Φ(a), we have ∇g(a) = HᵀΦ(a). These properties hold under the continuous differentiability of U(a), which is satisfied in the SVR application. Thus, a solution to the complementarity problem (8) is found by solving the equation g(a) = 0.
Theoretical foundations of the semismooth method can be seen in Luca et al. (1996),
Billups (1995), Ferris and Munson (2004). The damped Newton method (Facchinei
and Soares 1997) embedded in our algorithm to solve the lower-level problems of the
SVR is in Appendix A.
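As an illustration of the semismooth idea, the sketch below applies a damped Newton iteration to the Fischer–Burmeister reformulation of a single scalar complementarity. It is a toy stand-in for the damped Newton method of Appendix A: the routine name, starting point, and Armijo-style damping scheme are our own choices, not the paper's.

```python
import math

def fb(a, u):
    # Fischer-Burmeister function: zero iff a >= 0, u >= 0 and a*u = 0
    return a + u - math.sqrt(a * a + u * u)

def solve_scalar_lcp(U, dU, a=5.0, tol=1e-10, max_iter=100):
    """Damped-Newton sketch on phi(a) = fb(a, U(a)) = 0 for a single
    complementarity 0 <= U(a) ⊥ a >= 0 (toy illustration only)."""
    for _ in range(max_iter):
        u = U(a)
        phi = fb(a, u)
        if abs(phi) < tol:
            break
        r = math.sqrt(a * a + u * u) or 1.0   # guard the kink at (0, 0)
        # an element of the B-subdifferential of phi at a
        dphi = (1.0 - a / r) + (1.0 - u / r) * dU(a)
        step = phi / dphi                     # Newton step for phi(a) = 0
        t = 1.0                               # damping on g = phi^2 / 2
        while fb(a - t * step, U(a - t * step)) ** 2 > phi * phi and t > 1e-8:
            t *= 0.5
        a = a - t * step
    return a

# 0 <= a - 1 ⊥ a >= 0 has the unique solution a = 1
print(round(solve_scalar_lcp(lambda a: a - 1.0, lambda a: 1.0), 6))  # → 1.0
```

Near the solution the full Newton step is always accepted, mirroring the local behavior quoted from Luca et al. (1996).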
In the context of the SVR, the merit function Φ(a_f) ∈ R^(3n_f+1) takes the form shown in (9), for a fixed pair of parameters (Ce, εe). Following Theorem 13, the computation of the B-subdifferential H requires the matrix U′(a_f) of the following form:

U′(a_f) =
⎡  XᵀX          −XᵀX          I_{n_f×n_f}   −1_{n_f×1} ⎤
⎢ −XᵀX           XᵀX          I_{n_f×n_f}    1_{n_f×1} ⎥
⎢ −I_{n_f×n_f}  −I_{n_f×n_f}  0_{n_f×n_f}    0_{n_f×1} ⎥
⎣  1_{1×n_f}    −1_{1×n_f}    0_{1×n_f}      0         ⎦

where XᵀX is the n_f × n_f matrix with entries (x_d^i)ᵀ x_d^j for all i = front_f, …, end_f and j = front_f, …, end_f; I_{n_f×n_f} is the identity matrix in R^(n_f×n_f); 1_{n_f×1} is the matrix in R^(n_f×1) with all entries equal to 1; 0_{n_f×1} is the matrix in R^(n_f×1) with all entries equal to 0; and the variable vector a_f ∈ R^(3n_f+1) is written as (α, β, e_s, b_s^f).
Now suppose {a^k} is a sequence generated by the damped Newton method (Algorithm 12) and {a^k} → a*, where a* is the final solution to the system Φ(a) = 0. It is known that if k is sufficiently large, the search direction d^k is always chosen as the Newton step computed in (31) rather than the steepest descent step as in (33). Meanwhile, if k is sufficiently large, t^k is always chosen as 1 (Luca et al. 1996). The descent condition in (32) ensures that the semismooth method converges globally, i.e., any initial point, not necessarily close to the solution, can lead to convergence.
The semismooth method is neither the only nor the guaranteed best way to solve the lower-level problem with a fixed (Ce, εe). The successive overrelaxation method (Mangasarian and Musicant 1998) and the interior method (Ferris and Munson 2002)3 applied to the SVM might be substitutes, but the comparison is not within the scope of this work.
3 According to the numerical experiments in Ferris and Munson (2004), the semismooth method outperforms the interior method (Ferris and Munson 2002), specifically in solving large-scale SVM classification problems.
Φ(a_f) =
⎡ φ( α_j , e_sj + εe − (x_d^j)ᵀ Σ_{i=front_f}^{end_f} (β_i − α_i) x_d^i − b_s^f + y_dj ),  j = front_f, …, end_f ⎤
⎢ φ( β_j , e_sj + εe + (x_d^j)ᵀ Σ_{i=front_f}^{end_f} (β_i − α_i) x_d^i + b_s^f − y_dj ),  j = front_f, …, end_f ⎥
⎢ φ( e_sj , Ce − α_j − β_j ),  j = front_f, …, end_f ⎥
⎣ Σ_{j=front_f}^{end_f} α_j − Σ_{j=front_f}^{end_f} β_j ⎦   (9)

where each Fischer–Burmeister entry φ(a, u) expands as a + u − √(a² + u²).
3.2 Piece of the complementarity and data point allocation (grouping) in space
Consider the LCP_SVR in (4). We define the binary variables z_j, z̄_j and η_j for each j = front_f, …, end_f as

z_j = 1 if e_sj + εe − (x_d^j)ᵀ w_s^f − b_s^f + y_dj = 0;  z_j = 0 if α_j = 0.
z̄_j = 1 if e_sj + εe + (x_d^j)ᵀ w_s^f + b_s^f − y_dj = 0;  z̄_j = 0 if β_j = 0.
η_j = 1 if e_sj = 0;  η_j = 0 if Ce − α_j − β_j = 0.

∀ f = 1, …, F:

w_s^f − Σ_{j=front_f}^{end_f} (β_j − α_j) x_d^j = 0,
Σ_{j=front_f}^{end_f} α_j − Σ_{j=front_f}^{end_f} β_j = 0,

∀ j = front_f, …, end_f:

0 ≤ α_j ≤ θ_1j · z_j,
0 ≤ e_sj + εe − (x_d^j)ᵀ w_s^f − b_s^f + y_dj ≤ θ_2j · (1 − z_j),
0 ≤ β_j ≤ θ_3j · z̄_j,
0 ≤ e_sj + εe + (x_d^j)ᵀ w_s^f + b_s^f − y_dj ≤ θ_4j · (1 − z̄_j),
0 ≤ Ce − α_j − β_j ≤ θ_5j · η_j,
0 ≤ e_sj ≤ θ_6j · (1 − η_j).   (10)
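The role of the binary variables and big numbers in (10) can be seen in miniature: the check below encodes one complementarity pair with a single binary z, as in the big-M rows above (the function name and the θ values are illustrative only).

```python
def satisfies_piece(a, u, z, theta_a, theta_u):
    """Big-M linearization of one complementarity 0 <= a ⊥ u >= 0, as in
    the rows of (10): the binary z selects which side is forced to zero.
    (Function name and the theta values used below are illustrative.)"""
    return 0 <= a <= theta_a * z and 0 <= u <= theta_u * (1 - z)

# a = 0.7, u = 0 is complementary: feasible with z = 1 but not with z = 0
print(satisfies_piece(0.7, 0.0, 1, theta_a=10, theta_u=10))  # → True
print(satisfies_piece(0.7, 0.0, 0, theta_a=10, theta_u=10))  # → False
```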
Fig. 1 Within the SVR context, a single data point can be labeled by its allocation in space
such that A_1^f ∪ A_2^f ∪ A_3^f ∪ A_4^f ∪ A_5^f = {front_f, …, end_f}. Since a point cannot be allocated on both hyperplanes or on both sides, two natural cuts are derived:

z_j + z̄_j ≤ 1, ∀ j = front_f, …, end_f, f = 1, …, F,
z_j + z̄_j + η_j ≥ 1, ∀ j = front_f, …, end_f, f = 1, …, F.

A grouping vector has dimension n. There are at most 5^n possible grouping vectors for n training data points, regardless of the choices of Ce and εe. Given a grouping,
4 Lower-hyperplane: y_dj = (x_d^j)ᵀ w_s^f + b_s^f − εe.
5 Upper-hyperplane: y_dj = (x_d^j)ᵀ w_s^f + b_s^f + εe.
the corresponding piece is expressed as the collection of the linear equalities and inequalities:

w_s^f + Σ_{j=front_f}^{end_f} α_j x_d^j − Σ_{j=front_f}^{end_f} β_j x_d^j = 0,
Σ_{j=front_f}^{end_f} α_j − Σ_{j=front_f}^{end_f} β_j = 0,   (11)

(x_d^j)ᵀ w_s^f + b_s^f − y_dj − εe ≥ 0, α_j = Ce, β_j = 0, ∀ j ∈ A_1^f,   (12)
y_dj − (x_d^j)ᵀ w_s^f − b_s^f − εe ≥ 0, α_j = 0, β_j = Ce, ∀ j ∈ A_2^f,   (13)
Ce ≥ α_j ≥ 0, β_j = 0, y_dj = (x_d^j)ᵀ w_s^f + b_s^f − εe, ∀ j ∈ A_3^f,   (14)
α_j = 0, Ce ≥ β_j ≥ 0, y_dj = (x_d^j)ᵀ w_s^f + b_s^f + εe, ∀ j ∈ A_4^f,   (15)
εe − (x_d^j)ᵀ w_s^f − b_s^f + y_dj ≥ 0, εe − y_dj + (x_d^j)ᵀ w_s^f + b_s^f ≥ 0, α_j = 0, β_j = 0, ∀ j ∈ A_5^f.   (16)
Thus, by recording the grouping vector, the piece can be monitored and recovered on the fly. We follow the algorithm below to transform the solution of the lower-level LCP with a fixed (Ce, εe) into a grouping vector.

Algorithm 3 Transform the solutions (α, β, e_s^f, w_s^f, b_s^f) to a grouping vector. (Use this algorithm when the solution to the LCP_SVR^f is not degenerate and b_s^f is unique.)

Given solutions α, β, e_s^f, w_s^f and b_s^f, declare GroupingV as a vector of length n and let A_1^f, A_2^f, A_3^f, A_4^f and A_5^f = ∅.
for f = 1, …, F, j = front_f, …, end_f
  if α_j > e_sj + εe − (x_d^j)ᵀ w_s^f − b_s^f + y_dj then
    if e_sj > Ce − α_j − β_j then
      GroupingV_j = 1, and A_1^f ← A_1^f ∪ {j}.
    else
      GroupingV_j = 3, and A_3^f ← A_3^f ∪ {j}.
    end if
  else if e_sj + εe + (x_d^j)ᵀ w_s^f + b_s^f − y_dj > β_j then
    GroupingV_j = 5, and A_5^f ← A_5^f ∪ {j}.
  else
    if Ce − α_j − β_j > e_sj then
      GroupingV_j = 4, and A_4^f ← A_4^f ∪ {j}.
    else
      GroupingV_j = 2, and A_2^f ← A_2^f ∪ {j}.
    end if
  end if
end for
Return GroupingV.
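A compact code-form sketch of Algorithm 3 follows, assuming the slacks paired with α_j and β_j have already been computed from the lower-level solution; the names here are ours, not the paper's.

```python
def grouping_vector(alpha, beta, e_s, U_alpha, U_beta, Ce):
    """Sketch of Algorithm 3: label each training point 1-5 from a
    nondegenerate lower-level solution. U_alpha[j] and U_beta[j] are the
    slacks e_sj + eps_e -/+ ((x_d^j)^T w + b - y_dj) paired with the
    multipliers alpha_j and beta_j."""
    g = []
    for j in range(len(alpha)):
        if alpha[j] > U_alpha[j]:      # alpha_j > 0: on or below the lower plane
            g.append(1 if e_s[j] > Ce - alpha[j] - beta[j] else 3)
        elif U_beta[j] > beta[j]:      # both multipliers zero: inside the tube
            g.append(5)
        else:                          # beta_j > 0: on or above the upper plane
            g.append(4 if Ce - alpha[j] - beta[j] > e_s[j] else 2)
    return g

# one point in each of the five groups (outlier below, on lower plane,
# inside tube, on upper plane, outlier above)
print(grouping_vector([1, 0.5, 0, 0, 0], [0, 0, 0, 0.5, 1],
                      [0.4, 0, 0, 0, 0.3], [0, 0, 0.3, 0.6, 1.0],
                      [1.0, 0.6, 0.2, 0, 0], Ce=1.0))  # → [1, 3, 5, 4, 2]
```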
The solution to the LCP_SVR^f, however, can be tricky. The tricky parts are the non-uniqueness of the b_s solution and the degeneracy of the complementarities. One of the degenerate cases occurs when

case 1: {y_dj = (x_d^j)ᵀ w_s + b_s − εe, α_j = 0, β_j = 0}

at some index j. This case implies that the j-th data point could lie either on the lower hyperplane or inside the tube, and the index j could be contained in either A_3 or A_5. The degeneracy of the complementarity

0 ≤ e_sj + εe − (x_d^j)ᵀ w_s^f − b_s^f + y_dj ⊥ α_j ≥ 0

can result in the solutions of case 1. It is not hard to deduce that there are a total of four possible cases in which a data point is eligible for two geometric locations. They are:

case 2: {y_dj = (x_d^j)ᵀ w_s + b_s − εe, α_j = Ce, β_j = 0} ⇒ j ∈ A_3 or A_1,
case 3: {y_dj = (x_d^j)ᵀ w_s + b_s + εe, α_j = 0, β_j = 0} ⇒ j ∈ A_4 or A_5, and
case 4: {y_dj = (x_d^j)ᵀ w_s + b_s + εe, α_j = 0, β_j = Ce} ⇒ j ∈ A_4 or A_2.

For case 2, the complementarities 0 ≤ e_sj + εe + (x_d^j)ᵀ w_s^f + b_s^f − y_dj ⊥ β_j ≥ 0 and 0 ≤ Ce − α_j − β_j ⊥ e_sj ≥ 0 are degenerate; for case 3, the complementarity 0 ≤ e_sj + εe + (x_d^j)ᵀ w_s^f + b_s^f − y_dj ⊥ β_j ≥ 0 is degenerate; for case 4, the complementarities 0 ≤ e_sj + εe − (x_d^j)ᵀ w_s^f − b_s^f + y_dj ⊥ α_j ≥ 0 and 0 ≤ Ce − α_j − β_j ⊥ e_sj ≥ 0 are degenerate.
The following algorithm generates a set GS containing all the groupings associated with a (Ce, εe) pair when the solution to the LCP_SVR^f is degenerate or the solution b_s^f is not unique but an interval [b_min^f, b_max^f].

Algorithm 4 Transform the solutions (α, β, e_s^f, w_s^f, [b_min^f, b_max^f]) to a set of grouping vectors. (Use this algorithm when the solution to the LCP_SVR^f is degenerate or b_s^f is not unique.)

Given solutions α, β, e_s^f, w_s^f and b_s^f ∈ [b_min^f, b_max^f], declare V as a vector of length n with entries 0 and initialize a set GS ← {V}.
for f = 1, …, F and i = 1, 2
  Let b_s^f = b_min^f when i = 1. Let b_s^f = b_max^f when i = 2. Initialize GS_i ← {GS}.
Below we define the invariancy region in the context of this work and show that
the invariancy region is convex.
Definition 5 An invariancy region IR is a region on the parameter space, i.e., the
(Ce , εe )-plane, such that the grouping vector induced by every (Ce , εe ) ∈ IR is the
same.
Theorem 6 Consider the following process (1)–(3). (1) Solve the LCP_SVR^f by Algorithm 12 with the variables (Ce, εe) fixed at (C̄e, ε̄e). (2) If the solution is nondegenerate, transform the solution to a grouping vector G by Algorithm 3. If the solution is degenerate, let G be any one member of the set GS obtained by Algorithm 4. (3) Use the grouping G to form the index sets A_i ∀ i = 1, …, 5 and write down the piece P as expressed in (11)–(16). Let the invariancy region IR be the set of (Ce, εe)-pairs whose corresponding grouping vectors equal G. Then IR is a convex set.
Proof Without loss of generality, let F = 1 and ignore the superscripts f in the
notation of variables. Let (C̄e(1) , ε̄e(1) ) ∈ IR and (C̄e(2) , ε̄e(2) ) ∈ IR. Let the solution
to LCPSVR with (Ce , εe ) fixed at (C̄e(1) , ε̄e(1) ) and (C̄e(2) , ε̄e(2) ) be {α j (1) , β j (1) , e j (1) ,
ws(1) , bs(1) } and {α j (2) , β j (2) , e j (2) , ws(2) , bs(2) } respectively. Assume that they give
the same grouping, i.e., Ai (1) = Ai (2) , ∀i = 1, . . . , 5, and that the correspondent
pieces P are given as (11)–(16). For any arbitrary λ ∈ (0, 1), consider (Ce(3) , εe(3) ) =
(λC̄e(1) +(1−λ)C̄e(2) , λε̄e(1) +(1−λ)ε̄e(2) ). Since (C̄e(1) , ε̄e(1) ) and (C̄e(2) , ε̄e(2) ) produce
the same groupings, we claim that the solution {α_j(3), β_j(3), e_j(3), w_s(3), b_s(3)} equals {λα_j(1) + (1 − λ)α_j(2), λβ_j(1) + (1 − λ)β_j(2), λe_j(1) + (1 − λ)e_j(2), λw_s(1) + (1 − λ)w_s(2), λb_s(1) + (1 − λ)b_s(2)} because the following system is satisfied by it.
w_s(3) − Σ_{j=1}^{n} (β_j(3) − α_j(3)) x_d^j = 0,
Σ_{j=1}^{n} α_j(3) − Σ_{j=1}^{n} β_j(3) = 0,
∀ j = 1, …, n:
0 ≤ e_sj(3) + [λε̄e(1) + (1 − λ)ε̄e(2)] − (x_d^j)ᵀ w_s(3) − b_s(3) + y_dj ⊥ α_j(3) ≥ 0,
0 ≤ e_sj(3) + [λε̄e(1) + (1 − λ)ε̄e(2)] + (x_d^j)ᵀ w_s(3) + b_s(3) − y_dj ⊥ β_j(3) ≥ 0,
0 ≤ [λC̄e(1) + (1 − λ)C̄e(2)] − α_j(3) − β_j(3) ⊥ e_sj(3) ≥ 0.
The grouping for {α j (3) , β j (3) , e j (3) , ws(3) , bs(3) } is again the same. So { Ce(3) , εe(3) ,
α j (3) , β j (3) , e j (3) , ws(3) , bs(3) } is feasible to P, and IR is convex.
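The linear-in-parameters structure exploited in the proof can be checked on a toy scalar LCP: within one piece the solution is affine in the parameter, so convex parameter combinations map to convex solution combinations, while the claim fails across pieces. The scalar instance below is purely illustrative and not from the paper.

```python
def solve_lcp_scalar(theta):
    # 0 <= a - theta ⊥ a >= 0: the solution is a = theta on the
    # "a > 0" piece (theta > 0) and a = 0 on the other piece.
    return max(theta, 0.0)

# Two parameters in the same invariancy region: the solution at the
# convex-combination parameter is the convex combination of solutions.
t1, t2, lam = 1.0, 3.0, 0.25
assert solve_lcp_scalar(lam * t1 + (1 - lam) * t2) == \
       lam * solve_lcp_scalar(t1) + (1 - lam) * solve_lcp_scalar(t2)

# Across two different pieces (theta = -1 vs. theta = 3) the claim fails.
assert solve_lcp_scalar(0.5 * -1.0 + 0.5 * 3.0) != \
       0.5 * solve_lcp_scalar(-1.0) + 0.5 * solve_lcp_scalar(3.0)
```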
The following property results directly from Theorem 6; it is one of the sufficient conditions we use in the algorithm to claim that all the invariancy regions inside a rectangular area have been found.

Corollary 7 Given a rectangle6 on the (Ce, εe)-plane, if its four corner points produce the same grouping vector, then every point of the rectangle produces that same grouping vector.
In this section, we introduce three types of restricted programs: the restricted linear program, the reduced restricted linear program, and the restricted quadratically constrained program. The first two are called "restricted" because their feasible sets are restricted to the invariancy region of some fixed values of (Ce, εe). The restricted quadratically constrained program is even more restricted because its feasible region is confined by a single (Ce, εe)-pair.
A restricted linear program RLP of the LPCC (5) is obtained by replacing the LCP_SVR^f with one of its pieces. Since the input of the RLP is the index sets A_i^f for all i = 1, …, 5 and f = 1, …, F, there are ∏_{f=1}^{F} 5^{n_f} many RLPs, defined as follows:

RLP(A_i^f | i = 1, …, 5, f = 1, …, F):

min_{Ce, εe, w_s^f, b_s^f, p_i, α_j, β_j}  Σ_{f=1}^{F} Σ_{i=front_v^f}^{end_v^f} p_i

subject to 0 ≤ C ≤ Ce ≤ C̄,
0 ≤ ε ≤ εe ≤ ε̄,

and ∀ f = 1, …, F:

(x_d^i)ᵀ w_s^f + b_s^f − y_di ≤ p_i, ∀ i = front_v^f, …, end_v^f,
−(x_d^i)ᵀ w_s^f − b_s^f + y_di ≤ p_i, ∀ i = front_v^f, …, end_v^f,
w_s^f + Σ_{j=front_f}^{end_f} α_j x_d^j − Σ_{j=front_f}^{end_f} β_j x_d^j = 0,
Σ_{j=front_f}^{end_f} α_j − Σ_{j=front_f}^{end_f} β_j = 0,   (17)
(x_d^j)ᵀ w_s^f + b_s^f − y_dj − εe ≥ 0, α_j = Ce, β_j = 0, ∀ j ∈ A_1^f,
y_dj − (x_d^j)ᵀ w_s^f − b_s^f − εe ≥ 0, α_j = 0, β_j = Ce, ∀ j ∈ A_2^f,
Ce ≥ α_j ≥ 0, β_j = 0, y_dj = (x_d^j)ᵀ w_s^f + b_s^f − εe, ∀ j ∈ A_3^f,
α_j = 0, Ce ≥ β_j ≥ 0, y_dj = (x_d^j)ᵀ w_s^f + b_s^f + εe, ∀ j ∈ A_4^f,
εe − (x_d^j)ᵀ w_s^f − b_s^f + y_dj ≥ 0, εe − y_dj + (x_d^j)ᵀ w_s^f + b_s^f ≥ 0, α_j = 0, β_j = 0, ∀ j ∈ A_5^f.

If we look closely at the model (17), α_{j∈A_1^f} and β_{j∈A_2^f} can be replaced by the single variable Ce. The variables α_j, ∀ j ∈ A_2^f, A_4^f, A_5^f, and β_j, ∀ j ∈ A_1^f, A_3^f, A_5^f, can be eliminated. Thus, we obtain a reduced restricted linear program RRLP as follows:
RRLP(A_i^f | i = 1, …, 5, f = 1, …, F):

min_{Ce, εe, w_s^f, b_s^f, p_i, α_{j∈A_3^f}, β_{j∈A_4^f}}  Σ_{f=1}^{F} Σ_{i=front_v^f}^{end_v^f} p_i

subject to 0 ≤ C ≤ Ce ≤ C̄,
0 ≤ ε ≤ εe ≤ ε̄,

and ∀ f = 1, …, F:

(x_d^i)ᵀ w_s^f + b_s^f − y_di ≤ p_i, ∀ i = front_v^f, …, end_v^f,
−(x_d^i)ᵀ w_s^f − b_s^f + y_di ≤ p_i, ∀ i = front_v^f, …, end_v^f,
w_s^f + Ce Σ_{j∈A_1^f} x_d^j − Ce Σ_{j∈A_2^f} x_d^j + Σ_{j∈A_3^f} α_j x_d^j − Σ_{j∈A_4^f} β_j x_d^j = 0,
|A_1^f| Ce − |A_2^f| Ce + Σ_{j∈A_3^f} α_j − Σ_{j∈A_4^f} β_j = 0,
(x_d^j)ᵀ w_s^f + b_s^f − y_dj − εe ≥ 0, ∀ j ∈ A_1^f,
y_dj − (x_d^j)ᵀ w_s^f − b_s^f − εe ≥ 0, ∀ j ∈ A_2^f,
Ce ≥ α_j ≥ 0, y_dj = (x_d^j)ᵀ w_s^f + b_s^f − εe, ∀ j ∈ A_3^f,
Ce ≥ β_j ≥ 0, y_dj = (x_d^j)ᵀ w_s^f + b_s^f + εe, ∀ j ∈ A_4^f,
εe − (x_d^j)ᵀ w_s^f − b_s^f + y_dj ≥ 0, εe − y_dj + (x_d^j)ᵀ w_s^f + b_s^f ≥ 0, ∀ j ∈ A_5^f,

where |A_i^f| denotes the cardinality of the set A_i^f.
Besides the two restricted linear programs, a fixed (Ce, εe) pair allows us to formulate a restricted convex quadratically constrained program RQCP as follows:
RQCP(Ce^fix, εe^fix):

min_{w_s^f, b_s^f, p_i, α_j, β_j}  Σ_{f=1}^{F} Σ_{i=front_v^f}^{end_v^f} p_i

subject to ∀ f = 1, …, F:

(x_d^i)ᵀ w_s^f + b_s^f − y_di ≤ p_i, ∀ i = front_v^f, …, end_v^f,
−(x_d^i)ᵀ w_s^f − b_s^f + y_di ≤ p_i, ∀ i = front_v^f, …, end_v^f,
w_s^f + Σ_{j=front_f}^{end_f} α_j x_d^j − Σ_{j=front_f}^{end_f} β_j x_d^j = 0,
Σ_{j=front_f}^{end_f} α_j − Σ_{j=front_f}^{end_f} β_j = 0,   (18)
( Σ_{j=front_f}^{end_f} e_sj ) Ce^fix + ( Σ_{j=front_f}^{end_f} (α_j + β_j) ) εe^fix
+ ( Σ_{j=front_f}^{end_f} (β_j − α_j) x_d^j )ᵀ ( Σ_{i=front_f}^{end_f} (β_i − α_i) x_d^i )
+ Σ_{j=front_f}^{end_f} (α_j − β_j) y_j = 0.
Identifying the invariancy interval on a line is not as complicated as identifying the invariancy region of a point (Ce, εe) (see the methods in Ghaffari-Hadigheh et al. 2010; Bemporad et al. 2002). The line passing through a point (Ce, εe) = (C̃, ε̃) is either a vertical line expressed as Ce = C̃, or a line εe = L_a Ce + L_b. The endpoints of the invariancy interval along the line are obtained from

C_max \ C_min = max \ min Ce
subject to εe = L_a Ce + L_b, (or Ce = C̃)   (19)
and constraints in (17),

and

ε_max \ ε_min = max \ min εe
subject to εe = L_a Ce + L_b, (or Ce = C̃)   (20)
and constraints in (17).
The solution to (19) and (20) is a line segment [(C̃_1, ε̃_1), (C̃_2, ε̃_2)] that belongs in one of the following four cases:
(i) On Ce = C̃ (a vertical line): C̃_1 = C̃, ε̃_1 = ε_max, C̃_2 = C̃, and ε̃_2 = ε_min.
(ii) On εe = L_a Ce + L_b with L_a = 0 and L_b = ε̃ (a horizontal line): C̃_1 = C_max, ε̃_1 = ε̃, C̃_2 = C_min, and ε̃_2 = ε̃.
(iii) On εe = L_a Ce + L_b with L_a positive: C̃_1 = C_max, ε̃_1 = ε_max, C̃_2 = C_min, and ε̃_2 = ε_min.
(iv) On εe = L_a Ce + L_b with L_a negative: C̃_1 = C_max, ε̃_1 = ε_min, C̃_2 = C_min, and ε̃_2 = ε_max.
We propose a procedure to find all the groupings and invariancy intervals along the boundaries of a given rectangle {C̄, C, ε̄, ε}. When searching along a boundary line, either εe or Ce is fixed at the corresponding value. The procedure starts with solving for the grouping (or the set of groupings under degeneracy) at a vertex of the boundary and finding the invariancy interval [C_min, C_max] or [ε_min, ε_max]. When there is more than one grouping, we first compute the invariancy intervals for each of the possible groupings. The interval with the farthest endpoint from the starting point is then chosen to represent the invariancy interval of the starting point, and its corresponding grouping is chosen to represent the grouping of the starting point. After obtaining the endpoint of the interval, to continue the search for a new grouping vector, we select a new starting point that lies outside and deviates by a very small amount from the endpoint of the current interval. The deviation needs to be small enough to ensure that no groupings are missed and that the invariancy interval containing the new starting point is adjacent to the current interval. During the procedure, the number of invariancy intervals on each side of the boundary is recorded by countTop, countLeft, countBottom, and countRight. At the end of the procedure, a set of grouping vectors GroupingVFound and the least objective value LeastUpperBound are obtained. It should be noted that, in the situation of having degeneracy at a point, all the groupings found during the process should be recorded in the set GroupingVFound, but only the invariancy interval that is chosen to represent a point should be counted in countTop, countLeft, countBottom, and countRight.
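The interval-walking loop described above can be sketched as follows, with `interval_at` standing in for the solve of (19) at the grouping found at the current point; the helper names and the toy piecewise structure are assumptions for illustration.

```python
import bisect

def walk_boundary(interval_at, C_lo, C_hi, deviation=1e-4):
    """Sketch of the boundary search with eps_e fixed: start at C_hi, take
    the invariancy interval containing the current point, then restart just
    past its far endpoint. interval_at(c) -> (cmin, cmax) is a hypothetical
    stand-in for solving (19) with the grouping found at c."""
    intervals = []
    c = C_hi
    while c > C_lo:
        cmin, cmax = interval_at(c)
        intervals.append((max(cmin, C_lo), min(cmax, C_hi)))
        c = cmin - deviation      # new starting point outside the interval
    return intervals

# toy parameter line whose invariancy intervals are [0,1), [1,2), [2,3)
breaks = [0.0, 1.0, 2.0, 3.0]
def toy_interval(c):
    k = min(bisect.bisect_right(breaks, c) - 1, len(breaks) - 2)
    return breaks[k], breaks[k + 1]

print(walk_boundary(toy_interval, 0.0, 3.0))
# → [(2.0, 3.0), (1.0, 2.0), (0.0, 1.0)]
```

The small `deviation` step mirrors the text: it must be small enough that no interval between the current endpoint and the new starting point is skipped.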
We have noticed from the experiments that the invariancy intervals obtained at Steps 1a, 2a, 3a, and 4a are sometimes problematic, possibly because of arithmetic errors. An interval I_B is said to be appropriately located adjacent to the previous interval I_A if:

C_max^B ≤ C_min^A and C_min^B < C_max^B.   (21)
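Condition (21) is a one-line predicate; the sketch below encodes it with interval endpoints as tuples (names are ours).

```python
def appropriately_adjacent(I_A, I_B):
    """Condition (21): sweeping toward smaller Ce, the new interval
    I_B = (Cmin_B, Cmax_B) must lie left of I_A = (Cmin_A, Cmax_A)
    without overlap, and must be nonempty."""
    (cmin_a, _), (cmin_b, cmax_b) = I_A, I_B
    return cmax_b <= cmin_a and cmin_b < cmax_b

print(appropriately_adjacent((2.0, 5.0), (1.0, 2.0)))  # → True
print(appropriately_adjacent((2.0, 5.0), (1.5, 3.0)))  # → False (overlap)
```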
The overlapping case is usually due to arithmetic errors, and it can cause an endless loop.
Fig. 2 Appropriate invariancy interval I_B = [C_min^B, C_max^B] subsequent and adjacent to the interval I_A on a line with fixed εe. newStarting is set at (C_min^A − deviation, ε̄) if searching on the line εe = ε̄ at step 1e of Algorithm 10 (or at steps 2e, 3e, and 4e if searching on other line functions of the boundary)
3. Solve for a new interval: Solve (19) subject to εe = ε̄ and the piece derived from GroupingVtop, and obtain the invariancy interval [C_min, C_max]. Increase counter by 1. Go to 1.
Figure 4 provides one example of using Algorithm 8 with the add-in to identify the invariancy intervals along the four boundaries: the top boundary (εe = ε̄ = 1 in this example), the left-hand-side boundary (Ce = C = 1), the bottom boundary (εe = ε = 0.1), and the right-hand-side boundary (Ce = C̄ = 10). In Fig. 4, all appropriate and problematic intervals are shown sequentially. The parameter deviation is set at 0.0001 and perturbation is set at 0.001. This specific instance (35_35_5_2) has two folds, and each fold has 35 training and 35 validation data points with 5 features. Among the intervals obtained, intervals #12, #20, #25, #26, #27, #28, #30, #32, #33, #35, #36, #38, #39, and #41 are problematic intervals at which the add-in is applied to enforce the forward search. The problematic intervals are not counted in the number of invariancy intervals at each side of the boundary. Therefore, at the end of the algorithm, we obtained countTop, countLeft, countBottom, and countRight at 2, 13, 2 and 11, respectively.
Fig. 4 An example of searching along the boundaries of a rectangle and identifying the invariancy intervals using Algorithm 8 with the add-in
The (Ce, εe)-rectangle search algorithm explores the possible groupings and the corresponding objective values of the restricted linear programs for each area [C, C̄] × [ε, ε̄] in the queue AreaToBeSearched. We only examine, remove, and partition areas that are rectangular. The areas are kept rectangular mainly for the convenience of partitioning and the ease of searching invariancy intervals as described in Sect. 3.4. The search for groupings and invariancy intervals starts at the four vertices and proceeds along the boundaries of the rectangles. We do not search the interior of rectangles but infer the interior conditions from the information of the invariancy intervals gathered along the boundary. If we cannot conclude that all groupings in one rectangle are realized, we partition the rectangle into smaller rectangles at the midpoints8 of the invariancy intervals. The process proceeds to sequentially search the rectangular areas decomposed from the initial box.
In the 1st stage of the algorithm, we try to explore as many groupings as possible and eliminate the areas containing only one grouping. By Corollary 7, a rectangular area is guaranteed to have only one grouping when the same grouping vector is obtained at its four vertices. A rectangular area with this property requires no more partitioning and is eliminated (in step 2d of Algorithm 9) from the queue AreaToBeSearched. The total area of the eliminated rectangles is recorded in AreaRealizedInThe1stStage.
When the number of groupings is not greater than 2 on any side of the boundary
of one rectangle, the 2nd stage of the algorithm is activated for that rectangle. The
condition that initializes the 2nd stage is stated in step 4 of Algorithm 9. For a rectangle
passed to the 2nd stage, there are two possible results. One, the rectangle is eliminated
from Ar eaToBeSear ched permanently because no other invariancy regions are in
the interior (explained in Sect. 3.6). Two, the rectangle is partitioned into smaller
rectangles. By updating the queue Ar eaToBeSear ched, these new rectangles are
passed back to the 1st stage. At the end of the algorithm (consisting of the 1st and
2nd stage), a least upper bound (LeastU pper Bound) for the parameter training and
validation SVR model is obtained.
8 Partitioning at the endpoints is also a theoretically valid strategy. We choose to partition at the midpoints
rather than the endpoints to avoid the loss of information due to arithmetic imprecision at the dividing line.
The drawback of it is that most of the rectangles are not eliminated in the 1st stage but have to be passed to
the 2nd stage.
then
  Solve the corresponding RLP and let the objective value be V_UB.
  Let LeastUpperBound ← V_UB if V_UB < LeastUpperBound.
  AreaRealizedInThe1stStage = AreaRealizedInThe1stStage + (C̄ − C) × (ε̄ − ε).
  Go to Step 1 after eliminating Area_current from AreasToBeSearched.
else
  Continue to Step 3.
end if
Step 3. Find groupings along the four boundaries of Area_current.
if C̄ − C < insensitive then
  Apply Algorithm 8 but skip steps 1, 3, and 4.
  Go to Step 1 after eliminating Area_current from the list AreasToBeSearched.
else if ε̄ − ε < insensitive then
(countHorizontalBreakPoint + 1) × (countVerticalBreakPoint + 1)
We do not explicitly find the edges of invariancy regions in the algorithm. Instead,
the allocation of the invariancy intervals along the boundaries of each rectangular area
is monitored. Maintaining the search area as a rectangle is especially convenient in
partitioning and eliminating, but the drawback is the possibility of revisiting the same
invariancy region many times. One can compare this method with the partitioning
technique in Bemporad et al. (2002) and the graph identifying technique in Ghaffari-
Hadigheh et al. (2010).
3.6 The 2nd stage: identify the non-vertical and non-horizontal boundary of
invariancy region
Examining vertices, which is the criterion in the 1st stage, is quick and convenient for removing some areas from AreasToBeSearched. However, the convexity of the invariancy region implies that a boundary of the region is not a curve, but it is not necessarily a vertical or a horizontal line. The 1st stage alone is not enough to clear the queue AreasToBeSearched. The 2nd stage of the algorithm is proposed to resolve this issue. Note that any rectangular area has one of the following three statuses. 1: belonging in one invariancy region; 2: belonging in two invariancy regions; and 3: belonging in more than two invariancy regions. The 1st stage of the algorithm tackles the areas of status 1, and the 2nd stage aims to tackle the areas of status 2. If a rectangle is of status 3, it is decomposed again.
Given that the number of groupings is no more than 2 on any side of the boundary,
we want to see if there is a single straight line splitting the input rectangular area into
two invariancy regions, thus concluding the realization of the area. There are a total
of six possible cases as shown in Fig. 5. In the figure, the node denotes where a task
of finding the grouping vector and solving the restricted linear program is done. The
arrow denotes a task of solving for the invariancy interval given the grouping vector.
In Case 1, the dividing line passes through the top and left-hand-side boundaries; in
Case 2, the dividing line passes through the top and bottom boundaries; in Case 3,
the dividing line passes through the top and right-hand-side boundaries; in Case 4,
the dividing line passes through the left-hand-side and right-hand-side boundaries; in
Case 5, the dividing line passes through the left-hand-side and bottom boundaries; in
Case 6, the dividing line passes through the bottom and right-hand-side boundaries.
To verify whether a rectangular area fits one of the six cases, we first check the number of invariancy intervals at each side of the boundary, denoted as countTop, countLeft, countBottom, and countRight in the 2nd stage of the algorithm. We then compare the grouping vectors obtained. The following algorithm describes the details of verifying the six cases and the way exceptions are handled.
The entries in the lists appear in the order in which they were obtained. We denote the first entry of a list by "(1)" adjacent to the name of the list, and the second entry, if it exists, by "(2)" adjacent to the name of the list.
Fig. 5 The six cases in which the rectangular area is bisected by a straight line into two invariancy regions
A1 and A2 on the (Ce , εe )-plane
Step 2: Find the one of the following cases that fits the situation.
switch s do
  case 1
    if GroupingVListTop(1) = GroupingVListLeft(2)
    and GroupingVListTop(2) = GroupingVListLeft(1) then
      The Area satisfies Case 1. Terminate the 2nd stage.
    else
      Go to Step 3, Exception 2.
    end if
  end case
  cases 2–6
    See Appendix C.
  end case
  case 7
    Go to Step 3, Exception 1.
  end case
end switch
Step 3.
Exception 1: more than two groupings.
  Partition at the middle of the rectangular area.
  Update AreasToBeSearched of Algorithm 9 by removing Area from it and adding the four resulting small areas to it.
Exception 2: unknown situation.
  Let the randomly picked interior point be (C^r, ε^r). Partition the Area into the following four areas:
  {C̄ − insensitive, C^r + insensitive, ε̄ − insensitive, ε^r + insensitive},
  {C̄ − insensitive, C^r + insensitive, ε^r − insensitive, ε + insensitive},
  {C^r − insensitive, C + insensitive, ε̄ − insensitive, ε^r + insensitive},
  {C^r − insensitive, C + insensitive, ε^r − insensitive, ε + insensitive}.
  Update AreasToBeSearched of Algorithm 9 by removing Area from it and adding the four resulting small areas to it.
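The case matching in Step 2 can be sketched as follows for the first two cases; each argument is the list of grouping vectors found on that side, and the count patterns and index conventions are our reading of Fig. 5 and Step 2, not the paper's exact procedure.

```python
def bisection_case(top, left, bottom, right):
    """Sketch of the 2nd-stage case test: return which bisection case of
    Fig. 5 applies, or None for the exceptions handled in Step 3."""
    counts = tuple(len(s) for s in (top, left, bottom, right))
    if counts == (2, 2, 1, 1) and top[0] == left[1] and top[1] == left[0]:
        return 1    # dividing line crosses the top and left-hand sides
    if counts == (2, 1, 2, 1) and top[0] == bottom[0] and top[1] == bottom[1]:
        return 2    # dividing line crosses the top and bottom sides
    # cases 3-6 are symmetric (see Appendix C); anything else is an exception
    return None

print(bisection_case(["A", "B"], ["B", "A"], ["B"], ["B"]))  # → 1
```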
Fig. 6 Example for the Exception 1 in the Step 3 of the 2nd stage. There are three invariancy regions, A1,
A2 and A3
but the areas thrown to Exception 2 in the worst case can only be partitioned and
cropped until they are too thin to have any unrevealed groupings.
An overview of running the whole algorithm on the (Ce, εe)-plane is shown in Fig. 7. A node denotes a (Ce, εe) point at which the LCP_SVR^f is solved; the number marked next to a node, if it exists, denotes the objective value of the associated restricted linear program; an arrow in the 2nd figure denotes the direction of search for the invariancy interval, associated with the starting (Ce, εe) point of that arrow. The 1st and 2nd stages can both be seen in this illustration, but not all details of Algorithms 9 and 10 are stated in the description of the overview.
In this section, we propose a technique for finding and tightening the valid big numbers θ_{1j}, θ_{2j}, θ_{3j}, θ_{4j}, θ_{5j}, and θ_{6j} that are used in the formulation of (10). The direct effect of tightening the big numbers θ_{ij}, i = 1, …, 6, is that the feasible region defined by (10) shrinks. We use these tightened big numbers to form a binary-integer program that can be solved by any commercial IP solver. This integer program with the tightened big numbers is an alternative way to solve the bi-level program (3).
The valid values of θ_{1j}, θ_{3j}, and θ_{5j} are not related to e_{s_j} and are all bounded above by C̄. The following optimization programs, with the objective function chosen as α_j, β_j, or C_e − α_j − β_j, solve for valid θ_{1j}, θ_{3j}, and θ_{5j}, respectively.
\[
\begin{aligned}
\max_{\substack{C_e,\,\varepsilon_e,\,w_s^1,b_s^1,\,\ldots,\,w_s^F,b_s^F,\\ p_i,\,\alpha_j,\,\beta_j,\,e_{s_j}}}\quad
& \alpha_j \;\big\backslash\; \beta_j \;\big\backslash\; \bigl(C_e-\alpha_j-\beta_j\bigr)\\
\text{subject to}\quad
& 0 \le \underline{C} \le C_e \le \bar{C},\\
& 0 \le \underline{\varepsilon} \le \varepsilon_e \le \bar{\varepsilon},\\
& (x_d^i)^T w_s^1 + b_s^1 - y_{d_i} \le p_i, && \forall i = n+1,\ldots,n+m_1,\\
& -(x_d^i)^T w_s^1 - b_s^1 + y_{d_i} \le p_i, && \forall i = n+1,\ldots,n+m_1,\\
& (x_d^i)^T w_s^2 + b_s^2 - y_{d_i} \le p_i, && \forall i = n+m_1+1,\ldots,n+m_1+m_2,\\
& -(x_d^i)^T w_s^2 - b_s^2 + y_{d_i} \le p_i, && \forall i = n+m_1+1,\ldots,n+m_1+m_2,\\
& \qquad\vdots\\
& (x_d^i)^T w_s^F + b_s^F - y_{d_i} \le p_i, && \forall i = \mathrm{front}_v^F,\ldots,\mathrm{end}_v^F,\\
& -(x_d^i)^T w_s^F - b_s^F + y_{d_i} \le p_i, && \forall i = \mathrm{front}_v^F,\ldots,\mathrm{end}_v^F,\\
& \mathrm{obj}_{LB} \le \sum_{f=1}^{F}\sum_{i=\mathrm{front}_v^f}^{\mathrm{end}_v^f} p_i \le \mathrm{obj}_{UB},\\
& \sum_{j=\mathrm{front}^f}^{\mathrm{end}^f}\alpha_j-\sum_{j=\mathrm{front}^f}^{\mathrm{end}^f}\beta_j=0, && \forall f=1,\ldots,F,
\end{aligned}
\tag{22}
\]
and
\[
\Bigl(\sum_{j=\mathrm{front}^f}^{\mathrm{end}^f} e_{s_j}\Bigr)C_e
+\Bigl(\sum_{j=\mathrm{front}^f}^{\mathrm{end}^f}(\alpha_j+\beta_j)\Bigr)\varepsilon_e
+\Bigl(\sum_{j=\mathrm{front}^f}^{\mathrm{end}^f}(\beta_j-\alpha_j)x_d^j\Bigr)^{T}\Bigl(\sum_{i=\mathrm{front}^f}^{\mathrm{end}^f}(\beta_i-\alpha_i)x_d^i\Bigr)
+\sum_{j=\mathrm{front}^f}^{\mathrm{end}^f}(-\alpha_j+\beta_j)b_s^f
+\sum_{j=\mathrm{front}^f}^{\mathrm{end}^f}(\alpha_j-\beta_j)y_{d_j}=0,
\qquad \forall f=1,\ldots,F.
\tag{23}
\]
The last F equalities of the above model are the sums of zeros over the F folds, obtained by summing the products of the two sides of the ⊥ sign. As a result, we get the quadratic term \(\bigl(\sum_{j=\mathrm{front}^f}^{\mathrm{end}^f}(\beta_j-\alpha_j)x_d^j\bigr)^T\bigl(\sum_{i=\mathrm{front}^f}^{\mathrm{end}^f}(\beta_i-\alpha_i)x_d^i\bigr)\) in the equality. Meanwhile, the \(\sum_{j=\mathrm{front}^f}^{\mathrm{end}^f}(-\alpha_j+\beta_j)b_s^f\) term in (23) cancels out because \(\sum_{j=\mathrm{front}^f}^{\mathrm{end}^f}(\alpha_j-\beta_j)=0\), ∀ f = 1, …, F. The two nonconvex terms \(\bigl(\sum_{j=\mathrm{front}^f}^{\mathrm{end}^f} e_{s_j}\bigr)C_e\) and \(\bigl(\sum_{j=\mathrm{front}^f}^{\mathrm{end}^f}(\alpha_j+\beta_j)\bigr)\varepsilon_e\) that remain in Eq. (23) are approximated by Taylor-expansion-type linear expressions. The quadratic relaxation of (23) is as follows:
\[
\begin{aligned}
& x_h + \begin{bmatrix} C_e\\ \varepsilon_e \end{bmatrix} = 0,\\
& z_h^f + \begin{bmatrix} \sum_{j=\mathrm{front}^f}^{\mathrm{end}^f} e_{s_j}\\[2pt] \sum_{j=\mathrm{front}^f}^{\mathrm{end}^f} (\alpha_j+\beta_j) \end{bmatrix} = 0, && \forall f=1,\ldots,F,\\
& 0 \le \underline{x}_h\circ z_h^f+\underline{z}_h^f\circ x_h-\underline{x}_h\circ \underline{z}_h^f+v_f, && \forall f=1,\ldots,F,\\
& 0 \le \bar{x}_h\circ z_h^f+\bar{z}_h^f\circ x_h-\bar{x}_h\circ \bar{z}_h^f+v_f, && \forall f=1,\ldots,F,\\
& \bar{x}_h\circ z_h^f+\underline{z}_h^f\circ x_h-\bar{x}_h\circ \underline{z}_h^f+v_f \le 0, && \forall f=1,\ldots,F,\\
& \underline{x}_h\circ z_h^f+\bar{z}_h^f\circ x_h-\underline{x}_h\circ \bar{z}_h^f+v_f \le 0, && \forall f=1,\ldots,F,\\
& v_f(1)+v_f(2)+\Bigl(\sum_{j=\mathrm{front}^f}^{\mathrm{end}^f}(\beta_j-\alpha_j)x_d^j\Bigr)^{T}\Bigl(\sum_{i=\mathrm{front}^f}^{\mathrm{end}^f}(\beta_i-\alpha_i)x_d^i\Bigr)+\sum_{j=\mathrm{front}^f}^{\mathrm{end}^f}(\alpha_j-\beta_j)y_{d_j}\le 0, && \forall f=1,\ldots,F,
\end{aligned}
\tag{24}
\]
where x_h ∈ R², z_h^f ∈ R², v_f ∈ R², and v_f(1) and v_f(2) are the first and second entries of v_f, respectively. \(\bar{x}_h\), \(\underline{x}_h\), \(\bar{z}_h^f\), and \(\underline{z}_h^f\) are the upper and lower bounds of x_h and z_h^f. We set them at the following values:
\[
\bar{x}_h=\begin{bmatrix}-\underline{C}\\ -\underline{\varepsilon}\end{bmatrix},\qquad
\underline{x}_h=\begin{bmatrix}-\bar{C}\\ -\bar{\varepsilon}\end{bmatrix},\qquad
\bar{z}_h^f=\begin{bmatrix}0\\ 0\end{bmatrix},\quad\text{and}\quad
\underline{z}_h^f=\begin{bmatrix}-\bigl(\text{upper bound of } \sum_{j=\mathrm{front}^f}^{\mathrm{end}^f} e_{s_j}\bigr)\\[4pt] -\bigl(\text{upper bound of } \sum_{j=\mathrm{front}^f}^{\mathrm{end}^f} (\alpha_j+\beta_j)\bigr)\end{bmatrix}.
\]
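The four inequalities in (24) follow the McCormick-envelope pattern for bounding a bilinear product by its box bounds. The scalar version of this idea can be checked numerically; the sketch below is a generic illustration of the relaxation (the function name is ours), not the paper's exact vector-valued construction:

```python
def mccormick_bounds(x, z, xl, xu, zl, zu):
    """McCormick under- and over-estimators for the bilinear term w = x*z,
    valid whenever xl <= x <= xu and zl <= z <= zu."""
    lower = max(xl * z + zl * x - xl * zl,   # two underestimators
                xu * z + zu * x - xu * zu)
    upper = min(xu * z + zl * x - xu * zl,   # two overestimators
                xl * z + zu * x - xl * zu)
    return lower, upper

# The true product always lies inside the envelope on the box.
lo, up = mccormick_bounds(2.0, 3.0, 1.0, 5.0, 0.0, 4.0)
```

In (24) the analogous inequalities are applied entrywise (the ∘ products) to the two bilinear terms collected in x_h ∘ z_h^f, with v_f playing the role of the relaxed products.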
Note that the second entry of \(\underline{z}_h^f\) is the direct result of solving model (22) subject to (24). The main idea of our big-number tightening procedure is that a refinement of the bounds \(\underline{z}_h^f\) is made whenever the objective values of (22) are improved.
On the other hand, the valid values of θ_{2j}, θ_{4j}, and θ_{6j} are all related to e_{s_j}. By definition,
\[
e_{s_j} \ge (x_d^j)^T w_s^f + b_s^f - y_{d_j} - \varepsilon_e,\qquad
e_{s_j} \ge -(x_d^j)^T w_s^f - b_s^f + y_{d_j} - \varepsilon_e,\qquad
e_{s_j} \ge 0.
\]
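The three inequalities say that e_{s_j} is (at least) the ε-insensitive residual of point j. A small numeric sketch in plain Python (names are ours, chosen for illustration):

```python
def eps_insensitive_residual(x, w, b, y, eps):
    """Smallest e satisfying e >= x.w + b - y - eps,
    e >= -(x.w + b) + y - eps, and e >= 0, i.e. the
    epsilon-insensitive residual max(|x.w + b - y| - eps, 0)."""
    pred = sum(xi * wi for xi, wi in zip(x, w)) + b
    return max(abs(pred - y) - eps, 0.0)

# A point inside the tube contributes zero; one outside contributes the excess.
inside = eps_insensitive_residual([1.0, 2.0], [0.5, 0.25], 0.0, 1.0, 0.2)
outside = eps_insensitive_residual([1.0, 2.0], [0.5, 0.25], 0.0, 2.0, 0.2)
```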
This implies that θ_{6j} is the larger of the objective values of the following two optimization problems:
\[
\max_{\substack{\text{variables in (22)},\\ x_h,\, z_h^f,\, v_f}} \;\; (x_d^j)^T w_s^f + b_s^f - y_{d_j} - \varepsilon_e
\quad \text{subject to constraints in (22) and (24)},
\tag{25}
\]
and
\[
\max_{\substack{\text{variables in (22)},\\ x_h,\, z_h^f,\, v_f}} \;\; -(x_d^j)^T w_s^f - b_s^f + y_{d_j} - \varepsilon_e
\quad \text{subject to constraints in (22) and (24)}.
\tag{26}
\]
Let the larger of the objective values of (25) and (26) be \(\bar{e}_{s_j}\); then a valid θ_{2j} can be obtained by solving
\[
\max_{\substack{\text{variables in (22)},\\ x_h,\, z_h^f,\, v_f}} \;\; \bar{e}_{s_j} + \varepsilon_e - (x_d^j)^T w_s^f - b_s^f + y_{d_j}
\quad \text{subject to constraints in (22) and (24)},
\tag{27}
\]
and a valid θ_{4j} by solving
\[
\max_{\substack{\text{variables in (22)},\\ x_h,\, z_h^f,\, v_f}} \;\; \bar{e}_{s_j} + \varepsilon_e + (x_d^j)^T w_s^f + b_s^f - y_{d_j}
\quad \text{subject to constraints in (22) and (24)}.
\tag{28}
\]
\[
\begin{aligned}
\min_{\substack{C_e,\,\varepsilon_e,\,w_s^1,b_s^1,\ldots,w_s^F,b_s^F,\\ p_i,\,\alpha_j,\,\beta_j,\,e_{s_j},\,z_j,\,\eta_j}}\quad & \sum_{f=1}^{F}\sum_{i=\mathrm{front}_v^f}^{\mathrm{end}_v^f} p_i\\
\text{subject to}\quad & 0 \le \underline{C} \le C_e \le \bar{C},\\
& 0 \le \underline{\varepsilon} \le \varepsilon_e \le \bar{\varepsilon},\\
& (x_d^i)^T w_s^1 + b_s^1 - y_{d_i} \le p_i,\quad -(x_d^i)^T w_s^1 - b_s^1 + y_{d_i} \le p_i, && \forall i = n+1,\ldots,n+m_1,\\
& (x_d^i)^T w_s^2 + b_s^2 - y_{d_i} \le p_i,\quad -(x_d^i)^T w_s^2 - b_s^2 + y_{d_i} \le p_i, && \forall i = n+m_1+1,\ldots,n+m_1+m_2,\\
& \qquad\vdots\\
& (x_d^i)^T w_s^F + b_s^F - y_{d_i} \le p_i,\quad -(x_d^i)^T w_s^F - b_s^F + y_{d_i} \le p_i, && \forall i = \mathrm{front}_v^F,\ldots,\mathrm{end}_v^F,\\
& \forall f = 1,\ldots,F:\\
& \quad w_s^f - \sum_{j=\mathrm{front}^f}^{\mathrm{end}^f} (\beta_j-\alpha_j)\,x_d^j = 0,\\
& \quad \sum_{j=\mathrm{front}^f}^{\mathrm{end}^f} \alpha_j - \sum_{j=\mathrm{front}^f}^{\mathrm{end}^f} \beta_j = 0,\\
& \quad \forall j = \mathrm{front}^f,\ldots,\mathrm{end}^f:\\
& \qquad 0 \le \alpha_j \le \theta_{1j}\, z_j,\\
& \qquad 0 \le e_{s_j} + \varepsilon_e - (x_d^j)^T w_s^f - b_s^f + y_{d_j} \le \theta_{2j}\,(1-z_j),\\
& \qquad 0 \le \beta_j \le \theta_{3j}\, z_j,\\
& \qquad 0 \le e_{s_j} + \varepsilon_e + (x_d^j)^T w_s^f + b_s^f - y_{d_j} \le \theta_{4j}\,(1-z_j),\\
& \qquad 0 \le C_e - \alpha_j - \beta_j \le \theta_{5j}\,\eta_j,\\
& \qquad 0 \le e_{s_j} \le \theta_{6j}\,(1-\eta_j).
\end{aligned}
\tag{29}
\]
We suggest using the following big-number tightening procedure.

Algorithm 11 Obtaining the Tightened Big Numbers.
Step 0: Initialization.
  Set obj_LB and obj_UB at exogenous valid lower and upper bounds of \(\sum_{f=1}^{F}\sum_{i=\mathrm{front}_v^f}^{\mathrm{end}_v^f} p_i\), respectively.
Step 1: Obtaining initial big numbers by solving the linear program.
  for j = front^f, …, end^f, f = 1, …, F
    1a: Solve (22) for all choices of the objective functions, without constraints (24). Let the optimal objective values for α_j, β_j, and C_e − α_j − β_j be θ_{1j}, θ_{3j}, and θ_{5j}, respectively.
    1b: Solve (25) and (26) without constraints (24). Let the optimal objective values be \(e_{s_j}^{(1)}\) and \(e_{s_j}^{(2)}\), respectively. Then \(\bar{e}_{s_j} = \max(e_{s_j}^{(1)}, e_{s_j}^{(2)})\) and θ_{6j} = \(\bar{e}_{s_j}\).
    1c: Solve (27) and (28) without constraints (24). Let the optimal objective values be θ_{2j} and θ_{4j}, respectively.
  end for
Step 2: Solving for the improved lower bound.
  Solve \(\max \sum_{f=1}^{F}\sum_{i=\mathrm{front}_v^f}^{\mathrm{end}_v^f} p_i\) subject to the constraints in (22) and (24), where \(\underline{z}_h^f = \bigl[-\sum_{j=\mathrm{front}^f}^{\mathrm{end}^f}\theta_{6j};\; -\sum_{j=\mathrm{front}^f}^{\mathrm{end}^f}(\theta_{1j}+\theta_{3j})\bigr]\). Let the objective value be obj_LB.
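To see why tighter θ values matter, consider the standard big-M encoding of a single complementarity pair 0 ≤ a ⊥ s ≥ 0 via one binary z, as used in (29). The following check is an illustrative Python sketch (names are ours) of the feasibility logic a ≤ θ·z, s ≤ θ′·(1−z); a valid but smaller θ excludes spurious points without cutting off any point satisfying the true bound:

```python
def bigM_feasible(a, s, z, theta_a, theta_s):
    """Check the big-M (indicator) encoding of the pair 0 <= a ⊥ s >= 0:
    a may be positive only when z = 1, and s only when z = 0."""
    return (0 <= a <= theta_a * z) and (0 <= s <= theta_s * (1 - z))

# With z = 1 the slack s is forced to zero; with z = 0 the multiplier a is.
ok_active = bigM_feasible(a=2.0, s=0.0, z=1, theta_a=10.0, theta_s=10.0)
ok_slack = bigM_feasible(a=0.0, s=3.0, z=0, theta_a=10.0, theta_s=10.0)
bad = bigM_feasible(a=2.0, s=3.0, z=1, theta_a=10.0, theta_s=10.0)
```

A smaller yet still valid theta_a shrinks the LP relaxation of the integer program, which is exactly the effect Algorithm 11 aims at.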
The use of the tightened big numbers significantly improves the running time for
the instances that are originally solvable and also allows more instances to be solved.
However, the level of tightening that can be reached by this procedure is limited due to the effect of aggregation. The approximation to the aggregated complementarities becomes looser as the number of complementarity constraints increases. When the size of the training data is above some threshold, the tightened values θ_{1j}, θ_{2j}, θ_{3j}, θ_{4j}, θ_{5j}, and θ_{6j} resulting from Algorithm 11 are not small enough, so the IP (29) cannot be solved by the solver. In this situation, we lack good big numbers and a good lower bound on the upper-level objective function.
Now suppose a global solution is known: (α_j^*, β_j^*, e_{s_j}^*, ε_e^*, C_e^*, w_s^{f*}, b_s^{f*}), and let the benchmark values θ_{1j}^*, …, θ_{6j}^* be defined from this solution as in (30). We say that the integer program defined by the big numbers θ_{1j} = θ_{1j}^*, θ_{2j} = θ_{2j}^*, θ_{3j} = θ_{3j}^*, θ_{4j} = θ_{4j}^*, θ_{5j} = θ_{5j}^*, and θ_{6j} = θ_{6j}^* contains at least one of the global optimal solutions. Although some portions of the feasible region are cut off, the values in (30) provide a benchmark for the tightness of the big numbers θ.
Table 1 gives examples of the averaged tightened values θ_1, θ_2, θ_3, θ_4, θ_5, and θ_6 obtained from Algorithm 11. The instances in the examples have 30 features, C_e ∈ [1, 10], and ε_e ∈ [0.1, 0.5]. Compared with the benchmark values of θ_1^*, θ_3^*, and θ_5^*, these big numbers are still very large even after tightening, especially when the number of training data points exceeds 30.
5 Numerical experiments
In this section we provide the numerical experiments for solving the SVM regression
parameters selection. The following themes are covered:
• Data sources: synthetic data where (xd , yd ) are random values in [0, 10] and real-
world data where xd are the indicators and yd represents the diseases.
• Number of folds for the training data: onefold, twofolds, threefolds, or fivefolds.
• Number of folds for the validation data: the same as the number of folds for the
training data, or a single fold.
• Number of features: 5 features or 30 features. (25 features for the real-world data.)
• Size of training data set: 5, 10, 15, …, 95, 100, 105, 120, 135, or 150 (40, 60, or 100 for the real-world data).
• Algorithms for the global optimal solution: the (Ce , εe )-rectangle search algorithm,
or the improved integer programs solved by CPLEX.
• Algorithms for the local optimal solution: an intermediate solution obtained from the (Ce, εe)-rectangle search algorithm, or a solution produced by KNITRO, a sequential quadratic programming (SQP) nonlinear programming (NLP) solver that can be used for mathematical programs with equilibrium constraints (MPEC).
• Parameters employed for the testing data set: globally optimized parameter and
grid-searched parameter.
The experiments are run on a machine with an Intel i7-2600K CPU, 16 GB of memory, and Windows 7, except those that use KNITRO on NEOS. Both the (Ce, εe)-rectangle search algorithm described in Sect. 3 and the improved integer program described in Sect. 4 are implemented in C++.
In Tables 2, 3 and 4, the results of the parameter selection problem for synthetic data
with onefold, twofolds and threefolds of training data are shown, respectively. Here,
Table 1 Comparison of the averaged values θ_i, i = 1, …, 6, obtained by Algorithm 11 and the optimal values θ_i^*, i = 1, …, 6
Table 2 Result of the training and validating SVR on the synthetic data with onefold of training data and onefold of validation data that were solved by the (Ce, εe)-rectangle search algorithm
(Columns: instance profile; optimal solution; algorithm statistics at termination/interruption)
Table 3 Result of the training and validating SVR on the synthetic data with twofolds of training data and twofolds of validation data that were solved by the (Ce, εe)-rectangle search algorithm
(Columns: data; # of comp.; global?; least upper bound; Ce; εe; convergence time (s); # of groupings; # of objective values; # of rectangles)
the number of folds for the validation data is the same as that for the training data.
The first column is a self-explanatory name for each instance. For example, instance
5_6_30_1 means that there are 5 data points in each fold of the validation data, 6 data
points in each fold of the training data, 30 features for each data point, and a total of
onefold for the training data. The second column is the number of complementarity
constraints. The 3rd to 10th columns record the solution obtained at convergence (if the algorithm converges) or at termination (if the algorithm is interrupted due to an error or after a long wait). The possible entries in the 3rd column are Yes, fail^1, No^2, or No^3. "Yes" denotes the case where the global optimum is obtained, and the Ce and εe values in the 5th and 6th columns are the optimal solution. "fail^1" denotes the case where errors occur in either the 1st or the 2nd stage of the algorithm, so the running process is forced to stop. "No^2" denotes the case where the instance cannot be solved within a time limit. In this case, the Ce and εe recorded in the 5th and 6th columns are the solutions for which the least valid upper bound is obtained. The time limit imposed
on these instances is 8000 s for each stage. Note that we impose the time limit on some instances only when we know the instance is unlikely to be solved in a reasonable running time, or when we have encountered a long waiting time on a simpler or equally sized instance. "No^3" is similar to "No^2" except that the running process was interrupted before converging, at about 8000 s. The 7th column records the time in seconds to obtain the global optimum. If the algorithm does not converge, we mark it by "n/a".
The 8th column records the number of groupings found during the search. The 9th
column records the number of different objective values ever obtained from solving
RLP or RQCP. If the instance is solved to global optimum, the values in the 8th and 9th columns are the final counts of the possible groupings and objective values, respectively; otherwise, they only reflect the intermediate state of the search. The number of objective values of RLP or RQCP is at most the number of possible groupings.
The values in the 9th column can be less than those in the 8th column because it is
possible for two different groupings to lead to the same objective. The 10th column
records the number of rectangular areas being processed in the entire algorithm.
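The instance-naming convention described above can be captured in a small parser (illustrative; the field names are ours):

```python
def parse_instance_name(name):
    """Decode names like '5_6_30_1': validation points per fold,
    training points per fold, features per point, # of training folds."""
    v, t, feat, folds = (int(part) for part in name.split("_"))
    return {"validation_per_fold": v, "training_per_fold": t,
            "features": feat, "training_folds": folds}

info = parse_instance_name("5_6_30_1")
```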
For the instances with 5 features shown in Table 2, the (Ce , εe )-rectangle search
algorithm can solve the onefold-5-features problems with up to 85 training and 85
validation data points to global optimum. The unsuccessful runs for instances with more than 80 data points are due to the failure in solving the restricted linear program RLP and the restricted quadratically constrained program RQCP at the vertices of some rectangular areas. This is indeed a problematic issue for Algorithms 9 and 10 when there is any unsolved RLP (or RQCP) during the process of searching and eliminating. The constraint set (17) of RLP is used in models (19) and (20) for obtaining the invariancy intervals. Without successfully solving RLP and RQCP, an invariancy interval cannot be identified, and the algorithm falls back to an inefficient partitioning at random points. In other words, an accurate lower-level solution that leads to a feasible RLP is critical for controlling the search step of this algorithm.
Shown in Tables 3 and 4, the (Ce, εe)-rectangle search algorithm solves the twofold- and threefold-5-features problems to global optimum without trouble. From the 7th and 8th columns of these tables, we can see that the time spent to reach convergence is roughly linear in the number of groupings discovered. The number of feasible groupings that an instance can carry is determined by two factors: the number of training data points and the number of features.
Fig. 8 The four quadrants of difficulty, separated by the number of training data points in each fold and the number of features of a data point of an instance
For instances with 30 features, as shown in Tables 2, 3 and 4, the search algorithm can only solve instances with up to 30 data points in each fold with onefold, twofolds, or threefolds of training data, except for the single instance 40_40_30_1. The intermediate number of groupings for these unsolved instances is already very large compared with instances of the same training data size but with fewer features.
We notice that the relationship between the number of features and the number of data points affects the difficulty of solving an instance by the (Ce, εe)-rectangle search algorithm. If the number of training data points is less than the number of features, the problem is easier. This observation can be related to the geometric analysis, where we aim to identify a (k−1)-dimensional hyperplane in a k-dimensional space to fit n training data points in a least squares linear regression. Consider the fact that k properly located data points determine a hyperplane in k dimensions. If n < k, there is more than one hyperplane containing the n data points. If n = k, given non-collinearity and other conditions on the relationship of the points, there is exactly one hyperplane containing the n data points. If n > k, a hyperplane is determined under the least squares rule, yet there will be at least n − k points off the hyperplane. Therefore, when n < k, there is more freedom in determining a hyperplane with no residuals.
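The n-versus-k observation can be checked numerically with an ordinary least-squares fit: when n < k an exact fit exists (numerically zero residual), while n > k generally leaves residuals. A NumPy sketch on random data (sizes chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def lsq_residual(n, k):
    """Residual norm of fitting n points in k dimensions by least squares."""
    X = rng.standard_normal((n, k))
    y = rng.standard_normal(n)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.linalg.norm(X @ coef - y))

under = lsq_residual(5, 30)   # n < k: an exact fit exists
over = lsq_residual(30, 5)    # n > k: some points stay off the hyperplane
```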
To analyze the difficulty level of an instance for the (Ce, εe)-rectangle search algorithm, we propose to separate the instances into four quadrants, I, II, III, and IV, which categorize the instances into four levels of difficulty: easiest, easy, hard, and hardest, as shown in Fig. 8. The separating horizontal and vertical axes represent the number of training data points and the number of features, respectively, and the origin denotes the instance with an equal number of training data points and features. Within each quadrant, the running time of the instances is locally proportional to the number of training data points and folds. From Fig. 9, for instances belonging to the same
[Fig. 9: panels plotting running time against the number of training data points (<30 and >5) in each fold and against the number of folds (1, 2, 3)]
difficulty level, the running time is roughly linear in the number of training data points and the number of folds.
The global optimal results shown in Tables 2, 3, and 4 contain the successful runs for the easiest, easy, and hard instances, but the hardest instances remain unsolved using the (Ce, εe)-rectangle search algorithm.
Besides the synthetic data, we run parameter selection for the real-world chemoinformatics data, which have been used in the works of Yu (2011), Kunapuli (2008), and Demiriz et al. (2001). The original usage of these data sets is for building quantitative structure–activity relationship (QSAR) models, but we borrow the data only
to test the algorithm without discussing the meaning of the real-world application. A
profile of these data sets is shown in Table 5.
We divide the training data into fivefolds and keep the validation data as onefold. For example, the 100 training data points for the data set aquasol are divided into fivefolds; that is, each fold of training data contains 20 data points and the single fold of validation
data contains all 97 validation data points. We follow the same method to create the
folds of data. In addition, compared to the parameter selection studies in Yu (2011),
Kunapuli (2008) on the same data set, our experiments have some modifications. In Kunapuli (2008), lower and upper bounds are imposed on the normal vector ws of the hyperplane in the SVM regression problems, making them slightly different from problem (3). The approaches employed in Kunapuli (2008) do not aim at confirming global optimality. In Yu (2011), a branch-and-cut-based algorithm was proposed for global optimality, but all experiments with real-world data were terminated at 7200 s at a local optimal solution. In both Yu (2011) and Kunapuli (2008), the outer-level objective function takes the average of the regression error, while we take the sum. This small change does not affect the optimal values of Ce and εe when the number of training data points in each fold is equal, which is our case, but it does affect the outer objective values presented to readers. The analysis in these two previous works focuses on performance, measured by the mean absolute deviation (MAD) and mean squared error (MSE) of the trained parameters on a testing data set. However, the performance of the parameters on a testing data set is not guaranteed. (See the debates about the meaning of a "best parameter" in Sect. 2.2 and two small-scale performance comparisons in Sect. 5.5.) Instead, we focus on the global optimality itself. Under these modifications, we are the first to obtain a certificate of global optimality for the problem sets cancer, BBB and CCK, while the global optimality for the set aquasol remains out of reach. The results of solving the real-world instances using the (Ce, εe)-rectangle search are shown in Table 6.
From Table 6, the runs on the data sets cancer, BBB and CCK are successful both in global optimality and in convergence time. For the data set aquasol, we can see
that the algorithm cannot converge after 1,052,390 s, and that the number of groupings identified for "aquasol_1" is already very large. We thus acknowledge that the set aquasol is a challenge to our rectangle search algorithm. Runs on the remaining
instances “aquasol_2–aquasol_10” are forced to stop at around 8000 s. We
reasonably believe that the number of groupings identified at 8000 s is much smaller
than the number of all possible groupings.
Numerical results show that the (Ce , εe )-rectangle search algorithm is less sensitive
to the increase of the number of folds than to the number of features. For the data sets
cancer, BBB and CCK, dividing the training data set into fivefolds actually makes the number of training data points in each fold drop far below the number of features. The difficulty of these instances falls in quadrant II (easy). Although
Table 6 Result of the training and validating SVR on the real-world chemoinformatics data, which were divided into fivefolds of training data and onefold of validation data and solved by the (Ce, εe)-rectangle search algorithm
(Columns: name; data; # of comp.; global?; avg. least upper bound; avg. convergence time (s); avg. # of groupings; avg. # of objective values; avg. # of rectangles)
we need to solve more LCPSVR^f instances, it seems to be beneficial for the (Ce, εe)-rectangle
search to divide the data set into many folds such that the problem of each fold is less
difficult. Associated with the smaller size of training data in each fold is the risk of
losing representativeness of future data.
In this subsection, we compare the global and local optimal solutions, along with the convergence efficiency, provided by various approaches. Among these algorithms, the (Ce, εe)-rectangle search algorithm and the improved integer program with tightened big numbers θ solve the instances to the global optimum at convergence. We implement Algorithm 11 in C++ and use CPLEX 12.2 to solve the program (29). The initial obj_LB in Step 1 can be set at 0, or at the least sum of residuals of fitting the validation data to the absolute regression hyperplane. The initial obj_UB is set at the least upper bound obtained in the 1st stage of the (Ce, εe)-rectangle search algorithm. Besides the two global optimal algorithms, the instances were run on a NEOS machine to use KNITRO, a solver for mathematical programs with equilibrium constraints. In this way, we obtained a quick local solution to compare with the others.
The results obtained by the three methods on the synthetic instances with onefold,
twofolds, and threefolds training and validation data sets are shown in Tables 7, 8, and
9, respectively. In the tables, the results of the (Ce, εe)-rectangle search algorithm are excerpted from Tables 2, 3, and 4. Next to the objective values obtained by KNITRO in the tables, some are marked by a double asterisk (**); this means that KNITRO returned a solution but claimed it to be an infeasible point.
From Tables 7, 8, and 9, the global optimal objective values obtained from the (Ce, εe)-rectangle search algorithm and the improved integer program basically match each other, except for small discrepancies in instances 15_15_30_1, 20_20_30_1, 25_25_30_1, 15_15_30_2, and 15_15_30_3. We think these discrepancies are the
result of different precisions on the right-hand-side feasibility. The local optimal solu-
tions provided by KNITRO are quite good considering the running time, yet we can
also see that the global optimal solution to the application of training and validation
SVR is always better than the local optimal solution provided by KNITRO.
A series of figures that compare the running time of the (Ce , εe )-rectangle search
algorithm and the improved integer program is shown in Fig. 10. In general, the
(Ce , εe )-rectangle search algorithm can solve many more instances to the global
optimum than the improved integer program. The instance 35_35_30_1 is the only instance that was unsolved by the (Ce, εe)-rectangle search algorithm but was solved via the improved integer program. However, for instances with fewer training data points, the convergence speed of the improved integer program outperforms that of the (Ce, εe)-rectangle search algorithm, including the onefold-5-feature instances with up to 55
data points, onefold-30-feature instances with up to 35 data points, twofold-5-feature
instances with up to 25 data points, twofold-30-feature instances with 10 and 15 data
points, threefold-5-feature instances with up to 15 data points, and threefold-30-feature
instances with up to 20 data points. As the number of training data points increases, the
required processing time for the improved integer program rises suddenly at a critical
Table 7 Comparisons between three methods for training and validating the SVR with onefold of training data and onefold of validation data
(Columns: global optimal objective value; local optimal objective value; and, for each method, Ce, εe, and time (s))
Table 8 Comparisons between three methods for training and validating the SVR with twofolds of training data and twofolds of validation data
(Columns as in Table 7)
[Fig. 10: six panels plotting convergence time (s) against the number of training data points in each fold, for 5-feature and 30-feature instances with 1, 2, and 3 folds of training data; legend: Rectangle search vs. Improved IP]
Fig. 10 Comparisons between the (Ce, εe)-rectangle search algorithm and the improved integer program in convergence capability and convergence speed
point, after which the larger instances cannot be solved. This sudden rise confirms the
aggregation effect mentioned in Sect. 4.
Similarly, we employed the three approaches on real-world chemoinformatics data
sets. The solution provided by KNITRO on the real-world data is shown in Table 10.
Our conclusion about the comparison between KNITRO and the (Ce , εe )-rectangle
Table 10 Training and validating SVR solved by KNITRO on the real-world chemoinformatics data
divided into fivefolds training data and onefold validation data
Table 11 Training and validating SVR as an integer program solved by CPLEX on the real-world chemoin-
formatics data divided into fivefolds training data and onefold validation data
search algorithm remains the same: the running time of KNITRO is desirable, but the solution quality of the (Ce, εe)-rectangle search algorithm is consistently better than that produced by KNITRO. Results from the improved integer program on the real-world data are limited because we only solved the set cancer and a few instances in CCK. We did not wait long enough (e.g., weeks or months) to get the optimal solutions of the remaining instances in CCK, because it takes 25 days to reach convergence on "CCK_3." Table 11 shows that the processing time needed by the improved integer program for the set cancer is slightly less than, but very close to, the time needed by the (Ce, εe)-rectangle search algorithm. Knowing that the sets CCK and BBB can be solved by the (Ce, εe)-rectangle search algorithm, we conclude that the improved integer program is less effective on instances with a large number of folds.
In all the numerical experiments provided in this work, we found that most of the optimal (Ce, εe) points lie on the boundaries of the initial rectangular area [C̲, C̄] × [ε̲, ε̄].
For example, for those instances solved to global optimum shown in Table 2, either the optimal Ce ∈ {1, 10} or the optimal εe ∈ {0.1, 0.5}. Even though the optimal (Ce, εe) for the instance 10_10_5_1 was recorded at an interior point, we have noted that a point on the boundaries is also optimal. However, the situation of optimizing the parameter
5.5 Performance of the globally optimized parameters on the testing data set
To compare the globally optimized parameters with the locally grid-searched parameters, two small-scale experiments on testing data sets (which should have a pattern similar to the training and validation data sets) are designed as follows.
In each of the eight trials of the first experiment, 1090 data points (x_d, y_d) are generated (using a strategy similar to that in Yu (2011)). x_d contains 5 features that are uniformly distributed between −2 and 2, and y_d = w_s^T x_d + ϵ, where w_s is uniformly distributed between 0 and 1 and ϵ is Gaussian noise. 45 data points are randomly chosen as training data points and another 45 data points are chosen as validation data points, each set divided into threefolds. The remaining 1000 data points become the testing data.
We first solve the (threefold) parameter training and validation program (3) to obtain the globally optimal parameters (Ce, εe). Then, employing this pair of globally optimized parameters, we find an SVM regression model (w_s^*, b_s^*) that fits the 90 (pooled training and validation) data points.^10 Finally, we compute the averaged absolute residual |y_d − w_s^{*T} x_d − b_s^*|, or MAD, for the 1000 testing data points.
To obtain the locally grid-searched parameters, the region [C̲_e, C̄_e] × [ε̲_e, ε̄_e] = [0.1, 10] × [0.001, 0.1] is evenly partitioned into a grid of 16 coordinates.^11 Among these 16 pairs of (Ce, εe), the pair that minimizes the absolute validation error is
chosen. For both the globally optimized parameters and the grid-searched parameters,
the same partition of the training and validation data is used. As a result, 4 out of 8
trials have a reduced MAD when using the globally optimized parameters, compared
with the grid-searched parameters. The averaged MAD reduction is 0.56 % (maximum
MAD reduction: 2.23 %; minimum MAD reduction: −0.44 %).
^10 Because b_s of the SVM regression is not unique, we select a b_s minimizing the absolute residual of the 45 validation data points. See Sect. 2.3 for the selection of b_s.
11 A finer grid might give a locally grid-searched parameter that coincides with the globally optimal parameter,
since finding the global optimum is relatively easy compared to verifying it. Note that the grids commonly
used in SVM parameter selection are logarithmic grids (Fung and Mangasarian 2004) of base 2 or
10, i.e., 2^i or 10^j, where i and j range from some negative integer to some positive integer.
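For illustration, such logarithmic grids can be produced as follows; the exponent ranges are our own example values.

```python
import numpy as np

# Logarithmic grids of base 2 and base 10, as commonly used in
# SVM parameter selection (exponents run from a negative to a
# positive integer).
grid_base2 = 2.0 ** np.arange(-5, 6)    # 2^-5, ..., 2^5
grid_base10 = 10.0 ** np.arange(-3, 4)  # 10^-3, ..., 10^3
```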
123
Author's personal copy
Global resolution of the support vector machine regression... 255
The second experiment uses the CCK real-world data. The testing set, which
contains 660 data points, is obtained by pooling all the training and validation data
of CCK_1-CCK_10, due to the lack of extra data points. Therefore, each trial of the training
and validation uses exactly one portion of the testing data set. The number of trials
with a reduced MAD is 5 out of 10. On average, there is a 0.09 % reduction (maximum
reduction: 0.14 %; minimum reduction: −1.10 %) in the MAD of the testing data when
using the globally optimized parameters instead of the grid-searched parameters.
Due to the long solution time for finding a global optimum, the sizes of the data sets
remain small in the above two experiments. The globally optimal parameters reduce the
MAD on average. Nevertheless, the reduction of the average MAD is small (without
a proper benchmark), and the number of instances with a reduced MAD is merely
half of the total number of instances. We cannot conclude that the globally optimal
parameters really outperform the grid-searched parameters in prediction accuracy
before data sets of a greater size can be tackled.
6 Conclusion
This paper presents research on selecting the optimal parameters for support
vector machine regression within the framework of training and validation. The
parameter selection problem for the support vector machine regression is formulated as a
linear program with complementarity constraints, and the main challenge becomes
the verification of the globally optimal solution to the linear program with complementarity
constraints. The development of our (Ce , εe )-rectangle search algorithm, which
searches on the parameter plane, relies primarily on the definitions of the grouping and
the corresponding invariancy region. Essential, then, are the conditions sufficient to
conclude that the groupings and the invariancy regions in a rectangular area are fully
realized. The two stages of the (Ce , εe )-rectangle search algorithm are distinguished
by two different sufficient conditions: (1) the four vertices of a rectangular area
have the same grouping and lie in the same invariancy region (the 1st stage), and (2)
the rectangular area is split into two invariancy regions (the 2nd stage). A potential
direction for improving the (Ce , εe )-rectangle search algorithm thus lies in discovering
other sufficient conditions.
A total of 140 instances, including 80 synthetic instances and 60 real-world instances,
were run in the numerical experiments. The synthetic instances were generated without
imposing a natural structure between the indicators and the dependent variables, while
the real-world instances are chemoinformatics data that have been studied
in Yu (2011), Kunapuli (2008) and Demiriz et al. (2001). A total of 56 of the synthetic
instances and 43 of the real-world instances were solved to global optimality, while tight
valid upper bounds were provided for the remaining instances. The results allow us to
further categorize the difficulty level of the instances under the (Ce , εe )-rectangle search
algorithm by a 4-quadrant diagram whose axes are the number of features and the number of
training data in each fold. The effect of increasing the number of folds remains within
each quadrant, because the effect of the number of folds is of a smaller scale compared
with the other factors (the number of features and the number of training data in each fold). This
categorization helps users to understand the performance of the (Ce , εe )-rectangle
search algorithm and to be aware of its limitations. The unsolved instances from our
experiments all belong to the quadrant of the hardest difficulty level.
The long processing time could be a concern, but in some applications obtaining the
globally optimal parameters matters more than runtime, for the long-term benefit brought
by the global optimum. Moreover, the global optimum obtained in this paper can
serve as a certificate of global optimality against any other locally
optimal solution to a parameter training and validation process for the support vector
machine regression.
Acknowledgments We thank the reviewers for their thorough comments and suggestions, which allowed
us to significantly improve the completeness and the presentation of this work. Lee and Pang were supported
in part by the Air Force Office of Scientific Research under Grant Number FA9550-11-1-0151. Mitchell
was supported in part by the Air Force Office of Scientific Research under Grant Number FA9550-11-1-0260.
Pang was supported by the National Science Foundation under Grant Number CMMI-1333902. Mitchell
was supported by the National Science Foundation under Grant Number CMMI-1334327. Lee's new affiliation
after August 1, 2015 will be the Department of Industrial Engineering and Engineering Management,
National Tsing Hua University, Hsinchu, Taiwan.
H_k d_k = −Φ(a_k). (31)
where Da ∈ R^{n×n} and Db ∈ R^{n×n} are diagonal matrices with entries defined as
follows:

1. For all i ∈ I: If ‖(a_i , U_i (a))‖ ≠ 0, then

   (Da)_ii = 1 − a_i / ‖(a_i , U_i (a))‖,
   (Db)_ii = 1 − U_i (a) / ‖(a_i , U_i (a))‖; (34)

   otherwise

   (Da)_ii = 1 − κ,  (Db)_ii = 1 − γ. (35)

2. For all i ∈ E:

   (Da)_ii = 0,
   (Db)_ii = 1.

If ‖(a_i , U_i (a))‖ ≠ 0, Φ(a) is differentiable at a and the formula (34) computes the
exact Jacobian. On the other hand, if ‖(a_i , U_i (a))‖ = 0 occurs at the i-th complementarity,
the κ and γ appearing in (35) are computed as suggested in Facchinei and Pang
(2003):

   κ = v_i / √(v_i² + (Uv)_i²),  and  γ = (Uv)_i / √(v_i² + (Uv)_i²), (36)

where v ∈ R^n is a vector of the user's choice whose i-th element is nonzero.
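The case analysis of (34)-(36) can be sketched as follows. Here fb_diagonals is a hypothetical helper of our own; we assume a_i, U_i(a), v_i and (Uv)_i are supplied as NumPy arrays over the complementarity indices i ∈ I.

```python
import numpy as np

def fb_diagonals(a, Ua, v, Uv):
    """Diagonal entries of D_a, D_b following (34)-(36).

    a, Ua hold a_i and U_i(a); v is a user-chosen vector with
    nonzero entries and Uv its image under the map U. A sketch
    under these assumptions, not the authors' implementation."""
    norm = np.sqrt(a**2 + Ua**2)
    Da = np.empty_like(a)
    Db = np.empty_like(a)

    smooth = norm > 0   # (a_i, U_i(a)) != (0, 0): differentiable, use (34)
    Da[smooth] = 1.0 - a[smooth] / norm[smooth]
    Db[smooth] = 1.0 - Ua[smooth] / norm[smooth]

    kink = ~smooth      # nondifferentiable i-th complementarity: use (36)
    denom = np.sqrt(v[kink]**2 + Uv[kink]**2)
    Da[kink] = 1.0 - v[kink] / denom    # 1 - kappa, cf. (35)
    Db[kink] = 1.0 - Uv[kink] / denom   # 1 - gamma
    return Da, Db
```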
To compute κ and γ in (36), we can choose v = 1 and let

   hv := U(a^f) v = ( 0_{n_f ×1}, …, 2_{n_f ×1}, …, −2_{n_f ×1}, …, 0 )^T,

so that

   κ = 1 / √(1 + (hv)_i²),  and  γ = (hv)_i / √(1 + (hv)_i²).
4a: For every piece corresponding to a member of GroupingVSet_right, solve (20)
    subject to Ce = C̄ to obtain invariancy intervals. Let [εmin , εmax ] be the largest
    of these intervals. Do Add-In when repeating. count_Right ← count_Right + 1.
4b: Solve LCP^f_SVR at (C̄, εmin ) and obtain the set of grouping vectors.
4c: Replace GroupingVSet_right by the set of the grouping vectors obtained in 4b. If
    any members of GroupingVSet_right are not in the set GroupingVFound, add
    them to the latter set.
4d: If the objective value of RLP is smaller than LeastUpperBound, update
    LeastUpperBound.
4e: If εmin is greater than ε, let newStarting = (C̄, εmin − deviation).
4f: Solve LCP^f_SVR at newStarting and obtain the set of grouping vectors. Do as in
    4c-4d.
4g: Repeat 4a-4d until εmin = ε.
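The control flow of steps 4a-4g can be illustrated with a toy model in which the invariancy intervals on the edge Ce = C̄ are given explicitly as a table; solving (20) and LCP^f_SVR is simulated by table lookup, so this is a sketch of the sweep's structure only, not the algorithm's subproblems.

```python
def sweep_right_edge(partition, eps_lo, deviation=1e-9):
    """Toy right-edge sweep. `partition` lists the invariancy
    intervals on the edge as (eps_min, eps_max, grouping_id);
    the sweep starts from the top of the edge and repeatedly
    steps just below the current eps_min (4e-4f) until the lower
    bound eps_lo is reached (4g). Returns the grouping ids in the
    order discovered."""
    def grouping_at(eps):  # stands in for solving LCP^f_SVR at (C_bar, eps)
        for lo, hi, g in partition:
            if lo <= eps <= hi:
                return lo, hi, g
        raise ValueError("eps outside the edge")

    found = []
    eps = max(hi for _, hi, _ in partition)  # start at the top vertex
    while True:
        lo, hi, g = grouping_at(eps)         # 4b-4c: grouping at current point
        if g not in found:
            found.append(g)                  # add to the set of found groupings
        if lo <= eps_lo:                     # 4g: bottom of the edge reached
            return found
        eps = lo - deviation                 # 4e-4f: step just below eps_min
```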
case 2
  if GroupingVList_Top(1) = GroupingVList_Bottom(1)
  and GroupingVList_Top(2) = GroupingVList_Bottom(2) then
    The Area satisfies Case 2. Terminate the 2nd stage.
  else
    Go to Step 3, Exception 2.
  end if
end case
case 3
  if GroupingVList_Top(1) = GroupingVList_Right(1)
  and GroupingVList_Top(2) = GroupingVList_Right(2) then
    The Area satisfies Case 3. Terminate the 2nd stage.
  else
    Go to Step 3, Exception 2.
  end if
end case
case 4
  if GroupingVList_Left(1) = GroupingVList_Right(1)
  and GroupingVList_Left(2) = GroupingVList_Right(2) then
    The Area satisfies Case 4. Terminate the 2nd stage.
  else
    Go to Step 3, Exception 2.
  end if
end case
case 5
  if GroupingVList_Left(1) = GroupingVList_Bottom(1)
References
Arlot S, Celisse A (2010) A survey of cross-validation procedures for model selection. Stat Surv 4:40–79
Bard JF, Moore JT (1990) A branch and bound algorithm for the bilevel programming problem. SIAM J
Sci Stat Comput 11(2):281–292
Bemporad A, Morari M, Dua V, Pistikopoulos EN (2002) The explicit linear quadratic regulator for con-
strained systems. Automatica 38(1):3–20
Billups SC (1995) Algorithms for complementarity problems and generalized equations. PhD thesis, University of Wisconsin–Madison
Burges CJC, Crisp DJ (1999) Uniqueness of the SVM solution. In NIPS’99, pp 223–229
Byrd RH, Nocedal J, Waltz RA (2006) Knitro: an integrated package for nonlinear optimization. In: Di Pillo
G, Roma M (eds) Large-scale nonlinear optimization. Nonconvex optimization and its applications,
vol 83. Springer, US, pp 35–59
Carrizosa E, Morales DR (2013) Supervised classification and mathematical optimization. Comput Oper
Res 40(1):150–165
Carrizosa E, Martín-Barragán B, Morales DR (2014) A nested heuristic for parameter tuning in support vector
machines. Comput Oper Res 43(0):328–334
Cawley GC, Talbot NLC (2010) On over-fitting in model selection and subsequent selection bias in perfor-
mance evaluation. J Mach Learn Res 11:2079–2107
Columbano S, Fukuda K, Jones CN (2009) An output-sensitive algorithm for multi-parametric LCPs with
sufficient matrices. In CRM proceedings and lecture notes, vol 48
De Luca T, Facchinei F, Kanzow C (1996) A semismooth equation approach to the solution of nonlinear
complementarity problems. Math Program 75:407–439
Demiriz A, Bennett KP, Breneman CM, Embrechts MJ (2001) Support vector machine regression in chemo-
metrics. In: Proceedings of the 33rd symposium on the interface of computing science and statistics
Facchinei F, Soares J (1997) A new merit function for nonlinear complementarity problems and a related
algorithm. SIAM J Optim 7(1):225–247
Facchinei F, Pang J-S (2003) Finite-dimensional variational inequalities and complementarity problems II.
Springer, New York
Ferris MC, Munson TS (2002) Interior-point methods for massive support vector machines. SIAM J Optim
13(3):783–804
Ferris MC, Munson TS (2004) Semismooth support vector machines. Math Program 101:185–204
Floudas CA, Gounaris C (2009) A review of recent advances in global optimization. J Glob Optim 45:3–38
Fung GM, Mangasarian OL (2004) A feature selection newton method for support vector machine classi-
fication. Comput Optim Appl 28(2):185–202
Ghaffari-Hadigheh A, Romanko O, Terlaky T (2010) Bi-parametric convex quadratic optimization. Optim
Methods Softw 25:229–245
Gümüş ZH, Floudas CA (2001) Global optimization of nonlinear bilevel programming problems. J Glob
Optim 20:1–31
IBM ILOG CPLEX Optimizer (2010) http://www-01.ibm.com/software/integration/optimization/
cplex-optimizer/
Hu J, Mitchell JE, Pang J-S, Bennett KP, Kunapuli G (2008) On the global solution of linear programs
with linear complementarity constraints. SIAM J Optim 19:445–471
Kecman V (2005) Support vector machines—an introduction. In: Wang L (ed) Support vector machines:
theory and applications. Studies in fuzziness and soft computing, vol 177. Springer, Berlin, pp 1–48
Keerthi SS, Lin C-J (2003) Asymptotic behaviors of support vector machines with Gaussian kernel. Neural
Comput 15(7):1667–1689
Kunapuli G (2008) A bilevel optimization approach to machine learning. PhD thesis, Rensselaer Polytechnic
Institute
Kunapuli G, Bennett KP, Hu J, Pang J-S (2008) Classification model selection via bilevel programming.
Optim Methods Softw 23(4):475–489
Kunapuli G, Bennett KP, Hu J, Pang J-S (2006) Model selection via bilevel programming. In: Proceedings
of the IEEE international joint conference on neural networks
Lee Y-C, Pang J-S, Mitchell JE (2015) An algorithm for global solution to bi-parametric linear comple-
mentarity constrained linear programs. J Glob Optim 62(2):263–297
Mangasarian OL, Musicant DR (1998) Successive overrelaxation for support vector machines. IEEE Trans
Neural Netw 10:1032–1037
Ng AY (1997) Preventing “overfitting” of cross-validation data. In: Proceedings of the fourteenth interna-
tional conference on machine learning. Morgan Kaufmann, Menlo Park, pp 245–253
Schittkowski K (2005) Optimal parameter selection in support vector machines. J Ind Manag Optim
1(4):465–476
Schölkopf B, Smola AJ (2001) Learning with kernels: support vector machines, regularization, optimization,
and beyond. MIT Press, Cambridge
Tøndel P, Johansen TA, Bemporad A (2003) An algorithm for multi-parametric quadratic programming and
explicit MPC solutions. Automatica 39(3):489–497
Vapnik V, Golowich SE, Smola AJ (1997) Support vector method for function approximation, regression
estimation and signal processing. In: Mozer M, Jordan MI, Petsche T (eds) Advances in neural infor-
mation processing systems, vol 9. Proceedings of the 1996 neural information processing systems
conference (NIPS 1996). MIT Press, Cambridge, pp 281–287
Yu B (2011) A branch and cut approach to linear programs with linear complementarity constraints. PhD
thesis, Rensselaer Polytechnic Institute