
Sequential Support Vector Classifiers and Regression

Sethu Vijayakumar and Si Wu
RIKEN Brain Science Institute, The Institute for Physical and Chemical Research,
Hirosawa 2-1, Wako-shi, Saitama, Japan 351-0198
{sethu,phwusi}@brain.riken.go.jp

1 Abstract

Support Vector Machines (SVMs) map the input training data into a high dimensional feature space and find a maximal margin hyperplane separating the data in that feature space. Extensions of this approach account for non-separable or noisy training data (soft classifiers) as well as support vector based regression. The optimal hyperplane is usually found by solving a quadratic programming problem, which is quite complex, time consuming and prone to numerical instabilities. In this work, we introduce a sequential gradient ascent based algorithm for fast and simple implementation of the SVM for classification with soft classifiers. The fundamental idea is similar to applying the Adatron algorithm to the SVM, as developed independently in the Kernel-Adatron [7], although the details differ in many respects. We modify the formulation of the bias and consider a modified dual optimization problem. This formulation has made it possible to extend the framework for solving the SVM regression in an online setting. This paper looks at theoretical justifications of the algorithm, which is shown to converge robustly to the optimal solution very fast in terms of the number of iterations, is orders of magnitude faster than conventional SVM solutions and is extremely simple to implement even for large sized problems. Experimental evaluations on benchmark classification problems of sonar data and the USPS and MNIST databases substantiate the speed and robustness of the learning procedure.

2 Introduction

Support Vector Machines (SVMs) were introduced for solving pattern recognition problems by Vapnik et al. [6, 4] as a technique that minimizes the structural risk - that is, the probability of misclassifying novel patterns for a fixed but unknown probability distribution of the input data - as opposed to traditional methods which minimize the empirical risk [15]. This induction principle is equivalent to minimizing an upper bound on the generalization error. What makes the SVMs so attractive is the ability to condense the information in the training data and provide a sparse representation using non-linear decision surfaces of relatively low VC dimension. Moreover, when extended to the regression case, the SVM solution provides a tool for some level of automatic model selection - for example, SVM regression with RBF kernels leads to automatic selection of the optimal number of RBFs and their centers [13].

The SVM solution relies on maximizing the margin between the separating hyperplanes and the data. This solution is achieved by reducing the problem to a quadratic programming problem and using optimization routines from standard numerical packages. In spite of the successes of SVMs in real world applications like handwritten character recognition problems [3], their uptake in practice has been limited because quadratic optimization through standard packages is computationally intensive and susceptible to stability problems, and its implementation, especially for a large training data set, is non-trivial. Recently, some algorithms have dealt with chunking the data sets and solving smaller quadratic problems to deal with the memory storage problems resulting from large training sets [10]. However, this still uses quadratic programming solvers for the basic optimization. The idea of using a non-linear kernel on top of the Adatron algorithm to obtain the Kernel-Adatron algorithm for solving the support vector classification problem has been proposed in [7] in parallel to our work. However, in the initial implementations [7], they consider neither bias nor soft margin classifiers. In later modifications [8], the authors talk of an implementation of bias and soft margin which is conceptually different from our method. Based on the same idea of applying the Adatron method to the SVM, in this work, by exploiting an alternate formulation for the bias term in the classifier, we derive a sequential gradient ascent type learning algorithm which can maximize the margins for the quadratic programming problem in high-dimensional feature space and provide very fast convergence, in terms of the number of iterations, to the optimal solution within particular analytically derived bounds of the learning rate. We look at theoretical implications of the modification of the cost function and derive conditions that are necessary to obtain optimal class separation.
Experimental results on benchmark datasets have shown that the algorithm provides generalization abilities equal to standard SVM implementations while being orders of magnitude faster. The algorithm is capable of handling noisy data and outliers through implementation of the soft classifier with an adaptable trade-off parameter. Furthermore, the algorithm has been extended to provide a fast implementation of Support Vector regression, a unified approach to classification and regression which very few other algorithms have addressed. It should be noted, in relation to the work in Friess et al. [7], that although the fundamental idea is the same and the final algorithm looks quite similar, there are conceptual differences in the way the bias and soft margin are handled. A difference in the update procedures (clipping) can be shown to lead to an extended convergence plateau in [7], resulting in qualitatively similar results but affecting the speed of solution and the number of resulting SVMs.

Section 3 reviews the theoretical framework of the SVMs and looks at the implementation of non-linear decision surfaces. Section 4 provides a new formulation for the bias and describes the sequential algorithm for classification along with rigorous justifications for its convergence. We also perform a theoretically rigorous analysis of the margin maximization which clarifies the additional conditions that need to be satisfied to obtain a balanced classifier. Section 5 compares the learning performance of the proposed method against other techniques for standard benchmark problems. Section 6 describes an extension of the algorithm for implementing SVM regression, while Section 7 summarizes the salient features of the proposed method and looks at future research directions.

3 The classification problem and Support Vectors

We start by introducing the theoretical underpinnings of Support Vector learning. Consider the problem of classifying data {x_i, y_i}, i = 1, ..., l, where x_i are the N dimensional pattern vectors and y_i ∈ {−1, 1} are the target values.

3.1 The linearly separable case

We first consider the case when the data is linearly separable. The classification problem can be reformulated as one of finding a hyperplane f(w, b) = x_i · w + b which separates the positive and negative examples:

    x_i · w + b ≥ 1    for y_i = 1,    (1)
    x_i · w + b ≤ −1   for y_i = −1,   (2)

which is equivalent to

    y_i (x_i · w + b) − 1 ≥ 0   for i = 1, ..., l,    (3)

where a · b ≡ Σ_i a_i b_i represents the dot product. All points for which eq.(1) holds lie on the hyperplane H1 : x_i · w + b = 1 with normal w and perpendicular distance from the origin d_H1 = (1 − b)/||w||. Similarly, those points satisfying eq.(2) lie on the hyperplane H2 : x_i · w + b = −1 with distance from the origin d_H2 = (−1 − b)/||w||. The distance between the two canonical hyperplanes H1 and H2 is a quantity we refer to as the margin and is given by:

    Margin = |d_H1 − d_H2| = 2/||w||.    (4)

There are (in general) many hyperplanes satisfying the condition given in eq.(3) for a data set that is separable. To choose the one with the best generalization capabilities, we try to maximize the margin between the canonical hyperplanes [5]. This is justified from the computational learning theory perspective, since maximizing margins corresponds to minimizing the VC dimension of the resulting classifier [15], and, from the interpretation of attractor type networks in physics, achieves maximal stability [2].

Therefore, the optimization problem to be solved becomes

    Minimize J1[w] = (1/2)||w||²    (5)
    subject to y_i (x_i · w + b) − 1 ≥ 0,  i = 1, ..., l.    (6)

3.2 The linearly non-separable case

In case the data are not linearly separable, as is true in many cases, we introduce slack variables ξ_i ≥ 0 to take care of misclassifications such that the following constraint is satisfied:

    y_i (x_i · w + b) − 1 + ξ_i ≥ 0   for i = 1, ..., l.    (7)

Here, we choose a trade-off between maximizing margins and reducing the number of misclassifications by using a user defined parameter C such that the optimization problem becomes:

    Minimize J2[w, ξ_i] = (1/2)||w||² + C Σ_{i=1}^l ξ_i    (8)
    subject to y_i (x_i · w + b) − 1 + ξ_i ≥ 0,    (9)
               ξ_i ≥ 0,  i = 1, ..., l.    (10)

We can write this constrained optimization as an unconstrained quadratic convex programming problem by introducing Lagrange multipliers as
    Minimize L(w, b, ξ_i, h, r) = (1/2)||w||² + C Σ_{i=1}^l ξ_i − Σ_{i=1}^l h_i [y_i (x_i · w + b) − 1 + ξ_i] − Σ_{i=1}^l r_i ξ_i,    (11)

where h = (h_1, ..., h_l) and r = (r_1, ..., r_l) are the non-negative Lagrange multipliers.

3.3 The dual problem and sparsity of solution

The optimization problem of eq.(11) can be converted into a dual problem, which is much easier to handle [5], by looking at the saddle points of eq.(11):

    Maximize L_D(h) = Σ_{i=1}^l h_i − (1/2) h · Dh    (12)
    subject to Σ_{i=1}^l y_i h_i = 0,    (13)
               0 ≤ h_i ≤ C,  i = 1, ..., l,    (14)

where D_ij = y_i y_j x_i · x_j.

Solutions to this dual problem are traditionally found by using standard quadratic programming packages. Once the optimal multipliers h_i are found, the classifier is simply

    f(x) = sign( Σ_i h_i y_i x_i · x + b_0 ).    (15)

Usually, based on the Kuhn-Tucker optimality condition, the number of non-zero Lagrange multipliers h_i is much smaller than the number of data points and hence, the kind of expansion for the classifier given by eq.(15) is very sparse. The data points corresponding to these non-zero Lagrange multipliers are referred to as support vectors. The bias term b_0 can be found using any support vector x_sv as

    b_0 = y_sv − Σ_{i∈SV} h_i y_i x_i · x_sv.    (16)
corporate the bias term in the weight vector, w0 =
(w1 ; : : : ; wN ; b=), where  is a scalar constant. The
i i i sv
i2SV choice of the magnitude of the augmenting factor  is
discussed in Section 4.1.1.
Based on this modi cation, we write an analogous
3.4 Extension to non-linear decision optimization problem.
surfaces
Obviously, the power of linear decision surfaces are Mininimize L0(w0 ; i ) = ( 1 kw0 k2 + C
Xl  )(20)
i
very limited. SVMs provide a convenient way of ex- 2 i=1
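As a concrete illustration, here is a minimal Python/NumPy sketch of the two kernels of eqs.(18) and (19) together with the kernel (Gram) matrix that the sequential algorithms below operate on; the helper names are our own and are chosen only for illustration.

```python
import numpy as np

def gaussian_rbf_kernel(x, z, sigma=1.0):
    """Gaussian RBF kernel of eq.(18): K(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    d = np.asarray(x) - np.asarray(z)
    return np.exp(-np.dot(d, d) / (2.0 * sigma ** 2))

def polynomial_kernel(x, z, degree=3):
    """Polynomial kernel of eq.(19): K(x, x') = (x . x' + 1)^d."""
    return (np.dot(x, z) + 1.0) ** degree

def gram_matrix(X, kernel, **params):
    """Kernel matrix K_ij = K(x_i, x_j) for patterns X of shape (l, N)."""
    l = len(X)
    K = np.empty((l, l))
    for i in range(l):
        for j in range(i, l):
            K[i, j] = K[j, i] = kernel(X[i], X[j], **params)
    return K
```

For an admissible kernel, the resulting matrix is symmetric and positive semidefinite, which is the finite-sample counterpart of the condition in eq.(17).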
4 Sequential learning of support vectors for classification

The main aim of this work is to implement the support vector solution in a fast, simple, sequential scheme. To this effect, we make a slight modification in the problem formulation.

4.1 An alternate formulation for bias

In the previous sections, the bias has been represented explicitly using an additional variable b. From now on, we will use an additional dimension for the pattern vectors such that x' = (x_1, ..., x_N, λ) and incorporate the bias term in the weight vector, w' = (w_1, ..., w_N, b/λ), where λ is a scalar constant. The choice of the magnitude of the augmenting factor λ is discussed in Section 4.1.1.

Based on this modification, we write an analogous optimization problem:

    Minimize L'(w', ξ_i) = (1/2)||w'||² + C Σ_{i=1}^l ξ_i    (20)
    subject to y_i (x'_i · w') − 1 + ξ_i ≥ 0,    (21)
               ξ_i ≥ 0,  i = 1, ..., l.    (22)

If we look carefully, the margin we are trying to maximize is slightly different from that of the original problem. Instead of maximizing 2/||w||, or minimizing ||w||², we are minimizing ||w'||² = ||w||² + b²/λ². We now look at the implications of this approximation.
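A minimal sketch of this augmentation, assuming Python/NumPy and illustrative names of our own choosing, makes the bookkeeping explicit:

```python
import numpy as np

def augment(X, lam=1.0):
    """Append the augmenting factor to every pattern: x' = (x_1, ..., x_N, lambda).

    With w' = (w_1, ..., w_N, b / lambda), the biased decision value x . w + b
    becomes the plain dot product x' . w'.
    """
    X = np.atleast_2d(np.asarray(X, dtype=float))
    return np.hstack([X, np.full((X.shape[0], 1), lam)])

# Example check of the equivalence (w, b and x are arbitrary illustrative values):
w, b, lam = np.array([0.5, -1.0]), 0.3, 1.0
x = np.array([[2.0, 1.0]])
w_aug = np.append(w, b / lam)
assert np.isclose(augment(x, lam) @ w_aug, x @ w + b).all()
```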
Figure 1: Visualization of the modified margin in higher dimensional space.

Table 1: Effect of the augmenting factor on approximation quality
  Augmenting factor (λ)    φ (degrees)
  0.1                      31.823032
  0.5                      16.999935
  1.0                       6.831879
  2.0                       0.977416
  5.0                       0.275545

4.1.1 Effect of the modified bias formulation on margin maximization

Here, we look at the implications of maximizing a modified margin (refer Section 4.1) which involves using an input vector augmented with a scalar constant λ. For the sake of analysis, let us assume that the augmenting factor λ = 1. Fig. 1 gives a schematic diagram of a linearly separable classification problem in a higher dimensional space after the input vector has been augmented. The margin being maximized in the original problem with explicit bias, based on eq.(4), is

    d² = 1/||w||².    (23)

The modified margin d~ is the closest distance from the data points to a hyperplane in the higher dimensional space passing through the origin. Based on simple geometry, it can be seen from Fig. 1 that

    d~² = d² · cos²φ.    (24)

The angle φ satisfies the following relationship:

    cos²φ = 1/(1 + m²)   where m = b/||w||.    (25)

Therefore, eqs.(23), (24) and (25) yield an explicit relationship for the modified margin:

    d~² = 1/(||w||² + b²).    (26)

So, maximizing the modified margin will, in general, only give an approximate solution of the original optimization problem. In case the input dimension is high, the contribution of the bias term in the margin 1/(||w||² + b²) is negligible and the approximation will be very good, as confirmed in our simulation experiments.

From eq.(24), we see that if the angle φ (refer Fig. 1) is smaller, the approximation tends to be better. This can be ensured by augmenting the input vector by a 'larger' number instead of '1'. To prove this point and also show how good the approximation is, we carried out simulations on an artificial two dimensional linearly separable data set. One data cluster is randomly generated in the range [−1, 1] × [−1, 1] with uniform distribution. The other one is obtained in the same way in the range [1.5, 3.5] × [1.5, 3.5]. It is easy to see that the normal of the optimal classifying hyperplane w* can be obtained in the following way: select a pair of data points, say x_1 and x_2, from different clusters such that the distance between them is the minimum among all such pairs; w* can then be expressed as w* = (x_2 − x_1)/||x_2 − x_1||. Let the solution achieved through SVMseq be w. The angle φ between w and w*, which indicates how good the approximation is, can be written as

    φ = arccos( w · w* / ||w|| ).    (27)

Table 1 compares the effect of the augmenting factor (λ) on the approximation quality. We see that the approximation improves vastly with the increase of the magnitude of the augmenting factor. The simulations used a fixed set of 50 data points for the experiments. In practice, however, there is a trade-off resulting from increasing the augmenting factor. A large augmenting factor will normally lead to slower convergence speeds and instability in the learning process. An appropriate solution is to choose an augmenting factor having the same magnitude as the input vector. However, when using non-linear kernels, the effect of the augmenting factor is not directly interpretable as above and depends on the exact kernel we use. We have noticed that setting the augmenting term to zero (equivalent to neglecting the bias term) in high dimensional kernels gives satisfactory results on real world data.
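The construction behind Table 1 can be sketched in a few lines of Python/NumPy; the names, the random seed and the example weight vector below are our own illustrative assumptions, and in practice the learned w would come from SVMseq trained on the augmented data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two linearly separable clusters as described in the text (25 + 25 = 50 points).
neg = rng.uniform(-1.0, 1.0, size=(25, 2))    # cluster in [-1, 1] x [-1, 1]
pos = rng.uniform(1.5, 3.5, size=(25, 2))     # cluster in [1.5, 3.5] x [1.5, 3.5]

# Reference normal w*: normalized difference of the closest pair across clusters.
dists = np.linalg.norm(pos[:, None, :] - neg[None, :, :], axis=2)
i, j = np.unravel_index(np.argmin(dists), dists.shape)
w_star = (pos[i] - neg[j]) / np.linalg.norm(pos[i] - neg[j])

def approximation_angle(w, w_star):
    """Angle phi of eq.(27) between the learned normal w and the reference w*."""
    c = np.clip(w @ w_star / np.linalg.norm(w), -1.0, 1.0)
    return np.degrees(np.arccos(c))

# Placeholder weight vector standing in for the SVMseq solution:
print(approximation_angle(np.array([1.0, 0.9]), w_star))
```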
4.1.2 Obtaining a balanced classifier
In the case when the data is linearly separable, a balanced classifier is defined as the hyperplane (f(w, b) = x · w + b = 0) that lies in the middle of the two clusters, or more precisely, whose bias term b is given as:

    b = −(1/2)(x_i^+ · w + x_i^− · w),    (28)

where x_i^+ and x_i^− are any two support vectors from the two clusters, respectively. Certainly, a balanced classifier is superior to an unbalanced one. For the original problem, a balanced classifier is automatically obtained. However, while maximizing the modified margin, this is no longer the case; a constraint on the magnitude of w needs to be satisfied.

Suppose there is an unbalanced classifier f(w_0, b_0). Without loss of generality, we assume it is closer to the negative cluster, that is,

    x_i^+ · w_0 + b_0 = c,   c > 1,
    x_i^− · w_0 + b_0 = −1.

It is easy to check, by using the transformation w* = (2/(1 + c)) w_0 and b* = (2/(1 + c)) b_0 + (1 − c)/(1 + c), that we can get a balanced classifier f(w*, b*) parallel to the unbalanced one:

    x_i^+ · w* + b* = 1,
    x_i^− · w* + b* = −1.

So, we can construct a family of parallel classifiers parameterized by c, for c ≥ 1, in the following way:

    w(c) = ((1 + c)/2) w*,    (29)
    b(c) = ((1 + c)/2) b* − (1 − c)/2,    (30)

such that

    x_i^+ · w(c) + b(c) = c,   c > 1,    (31)
    x_i^− · w(c) + b(c) = −1.    (32)

Here, f(w(1), b(1)) gives the balanced classifier, which corresponds to c = 1 and f(w*, b*). For c > 1, an unbalanced classifier closer to the negative cluster is obtained. We will, for the sake of simplicity, assume the augmenting factor to be 1. Then, SVMseq minimizes ||w||² + b². So, when

    L(c) = ||w(c)||² + b(c)²    (33)

takes its minimum value at c = 1, SVMseq will provide a balanced classifier. Calculating the derivative of L(c) with respect to c, we have

    dL(c)/dc = ((1 + c)/2) [ ||w*||² − 1/(1 + c)² ] + ((1 + c)/2) [ b* + c/(1 + c) ]².    (34)

Note that since c ≥ 1, ||w*||² > 1/4 ensures dL(c)/dc > 0 for any c. So, ||w*||² > 1/4 is the sufficient condition that a balanced classifier will be obtained in our method. In practice, this is easy to satisfy by just scaling the input data to smaller values, an automatic result of which is larger magnitudes of w.

Working along the lines of Section 3.3, and solving for the hyperplane in the high dimensional feature space, we can convert the modified optimization problem of eq.(20) into a dual problem:

    Maximize L'_D(h) = Σ_i h_i − (1/2) h · D'h    (35)
    subject to 0 ≤ h_i ≤ C,  i = 1, ..., l,    (36)

where D'_ij = y_i y_j K(x_i, x_j) + λ² y_i y_j. Here, in the non-linear case, we augment the input vector with an extra dimension in the feature space, i.e., Φ(x') = (Φ(x)  λ)^T. The resultant non-linear classifier is computed as

    f(x) = sign( Σ_{i∈SV} h_i y_i K(x_i, x) + h_i y_i λ² ).    (38)

For the sake of clarity, we will stop using the 'dash' notation and from now on assume that the vectors x and the matrix D refer to the modified input vector augmented with the scalar λ and the modified matrix, respectively, as described in the beginning of this section.
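Before turning to the update rule itself, the modified matrix D' of eq.(35) and the classifier of eq.(38) can be written down directly; the Python/NumPy sketch below assumes a kernel function of the form sketched in Section 3.4 and uses illustrative names of our own.

```python
import numpy as np

def modified_D(X, y, kernel, lam=1.0, **params):
    """Matrix of the modified dual, eq.(35): D'_ij = y_i y_j (K(x_i, x_j) + lambda^2)."""
    l = len(X)
    K = np.array([[kernel(X[i], X[j], **params) for j in range(l)] for i in range(l)])
    return (y[:, None] * y[None, :]) * (K + lam ** 2)

def classify(x, X_sv, y_sv, h_sv, kernel, lam=1.0, **params):
    """Non-linear classifier of eq.(38), summed over the support vectors only."""
    k = np.array([kernel(x_i, x, **params) for x_i in X_sv])
    return np.sign(np.sum(h_sv * y_sv * (k + lam ** 2)))
```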
4.2 SVMseq for classification

In this section, we provide the pseudocode for the sequential learning algorithm. The algorithm, on convergence, provides the optimal set of Lagrange multipliers h_i for the classification problem in high-dimensional feature space, which corresponds to the maximum of the cost function (35).

Sequential Algorithm for Classification

1. Initialize h_i = 0. Compute the matrix [D]_ij = y_i y_j (K(x_i, x_j) + λ²) for i, j = 1, ..., l.
2. For each pattern, i = 1 to l, compute
   2.1 E_i = Σ_{j=1}^l h_j D_ij.
   2.2 δh_i = min{ max[ η(1 − E_i), −h_i ], C − h_i }.
   2.3 h_i = h_i + δh_i.
3. If the training has converged, then stop; else go to step 2.

Here, η refers to the learning rate, theoretical bounds for which will be derived in the following analysis. The kernel K(x, x') can be any function which satisfies Mercer's condition [6],

    ∫∫ K(x, y) f(x) f(y) dx dy ≥ 0   for all f ∈ L₂,    (39)
examples of which are the Gaussian and polynomial kernels given in Section 3.4. We can check for the convergence of the algorithm either by monitoring the updates of each Lagrange multiplier h_i or by looking at the rate of change of the cost function, L'_D, which can be written as

    ΔL'_D = δh_i^t (1 − E_i^t − (1/2) D_ii δh_i^t),    (40)

as shown in Section 4.3.2.
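A minimal NumPy sketch of the training loop above is given here for concreteness; it assumes a precomputed matrix D (e.g. from the modified_D helper sketched at the end of Section 4.1.2), uses the learning rate η = 1.9 / max_i D_ii adopted in the experiments of Section 5 (within the bound of eq.(44) derived at the end of Section 4.3), and monitors convergence through the accumulated increments of eq.(40). The function name and stopping tolerance are illustrative choices of ours, not the authors' original code.

```python
import numpy as np

def svmseq_train(D, C=50.0, eta=None, max_epochs=1000, tol=1e-6):
    """Sequential SVM classification (Section 4.2). Returns the multipliers h."""
    l = D.shape[0]
    h = np.zeros(l)                                # step 1: h_i = 0
    if eta is None:
        eta = 1.9 / np.max(np.diag(D))             # learning rate within eq.(44)
    for epoch in range(max_epochs):
        delta_L = 0.0
        for i in range(l):                         # step 2: sweep through the patterns
            E_i = D[i] @ h                         # 2.1: E_i = sum_j h_j D_ij
            dh = min(max(eta * (1.0 - E_i), -h[i]), C - h[i])   # 2.2: clipped update
            h[i] += dh                             # 2.3
            delta_L += dh * (1.0 - E_i - 0.5 * D[i, i] * dh)    # increment, eq.(40)
        if delta_L < tol:                          # step 3: cost function has levelled off
            break
    return h
```

The non-zero entries of the returned h identify the support vectors, and a new pattern is then classified with eq.(38).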

4.3 Rigorous proofs of convergence

Now we prove that the learning algorithm can indeed find the optimal solution of the dual problem given by eq.(35). The proof involves showing that the constraint of eq.(36) is always satisfied and that the algorithm will converge to the global maximum of the cost function given by eq.(35).

4.3.1 The constraint (36) is always satisfied during the learning process

For all the possible updates of h_i, we have:

1. If δh_i^t = C − h_i^t, then h_i^{t+1} = h_i^t + δh_i^t = C.
2. If δh_i^t = max{ η(1 − E_i^t), −h_i^t }, then h_i^{t+1} = h_i^t + δh_i^t ≥ h_i^t − h_i^t = 0. Since, in this case, C − h_i^t ≥ max{ η(1 − E_i^t), −h_i^t }, we also have h_i^{t+1} ≤ h_i^t + C − h_i^t = C.

So, for any initial values h_i^0, the h_i^1 will satisfy the constraint (36) after just one learning step and will maintain this property forever.

4.3.2 The learning process monotonically increases L'_D(h) and stops when a local maximum is reached

Define a ≡ max_i D_ii. We prove that for a learning rate bounded by 0 < η < 2/a, L'_D(h) will monotonically increase during the learning process:

    ΔL'_D = L'_D(h_i^t + δh_i^t) − L'_D(h_i^t)
          = δh_i^t − (1/2) Σ_j h_j^t D_ji δh_i^t − (1/2) Σ_j D_ij h_j^t δh_i^t − (1/2) D_ii (δh_i^t)².

Since D_ij = D_ji and Σ_j D_ij h_j^t = E_i^t, we have

    ΔL'_D = δh_i^t (1 − E_i^t − (1/2) D_ii δh_i^t).    (41)

Now, consider the three possible updates of δh_i^t:

Case 1: δh_i^t = η(1 − E_i^t)

    ΔL'_D = (1 − E_i^t)² (η − (1/2) D_ii η²) ≥ (1 − E_i^t)² (η − (1/2) a η²) ≥ 0.

Case 2: δh_i^t = −h_i^t

This case arises only when −h_i^t ≥ η(1 − E_i^t). Remember h_i^t ≥ 0, which implies E_i^t ≥ 1 in this case. Substituting δh_i^t in eq.(41), we get

    ΔL'_D = −h_i^t (1 − E_i^t + (1/2) D_ii h_i^t).    (42)

Notice that

    1 − E_i^t + (1/2) D_ii h_i^t ≤ 1 − E_i^t + (1/2) D_ii η (E_i^t − 1) = (1 − E_i^t)(1 − (1/2) η D_ii) ≤ 0.    (43)

So, from eqs.(42) and (43), ΔL'_D ≥ 0.

Case 3: δh_i^t = C − h_i^t

This case arises only when C − h_i^t ≤ η(1 − E_i^t). Remember h_i^t ≤ C, which implies E_i^t ≤ 1 in this case. Therefore,

    ΔL'_D = (C − h_i^t) [1 − E_i^t − (1/2) D_ii (C − h_i^t)]
          ≥ (C − h_i^t) [1 − E_i^t − (1/2) D_ii η (1 − E_i^t)]
          = (C − h_i^t)(1 − E_i^t)(1 − (1/2) η D_ii) ≥ 0.

So, for all three possible values of δh_i, L'_D(h) will monotonically increase and stop when a local maximum is reached. In the following we show that the local maximum is also the global one.

4.3.3 Every local maximum is also the global one

Since Σ_ij h_i D_ij h_j = (Σ_i h_i y_i x_i) · (Σ_i h_i y_i x_i) ≥ 0 for any h, D is a positive semidefinite matrix, and L'_D(h) is a concave function. For the linear constraints that arise in this optimization problem, we show that each local maximum is equivalent to the global one. Suppose h_1 is a local maximum point, while h_2 is the global maximum and has a larger value, i.e., L'_D(h_2) > L'_D(h_1). We can construct a straight line connecting h_1 and h_2, parameterized as h(γ) = (1 − γ)h_1 + γh_2, for 0 ≤ γ ≤ 1. Since the constraints are linear, all the points on this line are feasible solutions. On the other hand, since L'_D(h) is concave, L'_D(h(γ)) ≥ (1 − γ)L'_D(h_1) + γL'_D(h_2). For γ > 0, this gives L'_D(h(γ)) > L'_D(h_1), which contradicts the assumption that h_1 is a local maximum. So h_1 must also be the global maximum solution.
Figure 2: Sonar data classification using SVMseq: (a) training and generalization error (in percentage of data) obtained for various values of the Gaussian kernel parameter σ; (b) generalization error vs the number of training epochs, plotted for three different kernel parameters (σ = 0.3, 0.6, 0.7).

Sections 4.3.1, 4.3.2 and 4.3.3 combined complete the proof of convergence. The analysis also gives us a theoretical upper bound on the learning rate to be used:

    η ≤ 2 / max_i D_ii.    (44)

5 Experimental results

We now proceed to look at some experimental classification results using the SVMseq algorithm. The aim of this section is to demonstrate that the SVMseq algorithm is capable of achieving the high levels of generalization capability boasted by SVM-type classifiers, and that it does so with much less computation besides being robust to parameter choices over a broad range. We use benchmark problems reported in [7] as well as some others. Results for the case without bias are similar, but are improved by introducing bias and soft margin as done in our modified formulation. Moreover, in our method the fixing of the learning rate and open parameters has a firm theoretical basis as an off-shoot of the rigorous theoretical analysis.

Here, we give examples of using RBF and polynomial type non-linear decision boundaries. The learning rate is fixed to 1.9/max_i D_ii based on eq.(44), and the convergence of the algorithm is determined by looking at a trace of the cost function, which is easily computed using eq.(40).

5.1 Sonar classification benchmark

We consider a benchmark problem to classify the sonar data set of 208 patterns representing metal cylinders (mines) and rocks [9]. Each of the patterns has 60 input dimensions, and the aspect-angle dependent data set has been split into a training set and a testing set of 104 patterns each, as recommended by Gorman and Sejnowski [9]. We use a Gaussian RBF kernel (refer Section 3.4) with various values of the variance parameter σ in our non-linear implementation.

Table 2: Sonar classification: Gorman and Sejnowski's results with standard NNs using the BP algorithm [9]
  #hidden   0     2     6     12    24
  %gen.     73.1  85.7  89.3  90.4  89.2

Table 3: Sonar classification: generalization error and computational cost for the standard SVM and SVMseq
  ALGO     Epochs   %gen. (#errors)   FLOPS
  SVM      -        93.3% (7)         6.8 × 10^9
  SVMseq   10       93.3% (7)         383520
  SVMseq   50       95.2% (5)         1917600
  SVMseq   130      95.2% (5)         4985760

Tables 2 and 3 compare the generalization performance of our method with the BP trained NNs of Gorman and Sejnowski [9] (up to 24 hidden units and 300 epochs through the training data) and the standard SVM method, on the lines of the analysis in Friess et al. [7]. The values shown are obtained with parameters σ = 0.6 and C = 50. The generalization performance achieved by SVMseq is superior to the NN based method (Table 2) while it is similar to the standard SVM. However, SVMseq is orders of magnitude faster than the standard SVM, as is reflected through the FLOPS comparison in MATLAB (Table 3). The last row (epochs = 130) is when the algorithm detected convergence.
We also plot the evolution of the generalization and training error for various values of the kernel parameter in Fig. 2(a). It is noted that the generalization error is flat in a fairly broad region of the parameter space. Fig. 2(b) shows that the algorithm has a very fast convergence speed (in terms of training epochs) if we are in a region with an approximately correct kernel parameter.

Figure 3: Handwritten character recognition: training error, generalization error and the cost function trace against the number of training epochs for (a) the USPS 16 × 16 pixel database and (b) the MNIST 28 × 28 pixel database.

Table 4: USPS dataset two-class classification comparisons: number of test errors (out of 2007 test patterns)
  Digit                      0   1   2   3   4   5   6   7   8   9
  classical RBF             20  16  43  38  46  31  15  18  37  26
  Optimally tuned SVM       16  12  25  24  32  24  14  16  26  16
  SVMseq (poly. deg. = 3)   19  14  31  29  33  30  17  16  34  26
  SVMseq (poly. deg. = 4)   14  13  30  28  29  27  14  16  28  19
  SVMseq (RBF: σ = 3.5)     11   7  25  24  30  25  17  14  24  14

5.2 USPS and MNIST handwritten character classification

The next classification task we considered used the USPS database of 9298 handwritten digits (7291 for training, 2007 for testing) collected from mail envelopes in Buffalo [11]. Each digit is a 16 × 16 vector with entries between −1 and +1. Preprocessing consists of smoothing with a Gaussian kernel of width σ = 0.75. We used polynomial kernels of third and fourth degree and RBF kernels to generate the non-linear decision surface.

Here, we consider the task of two-class binary classification, i.e. separating one digit from the rest of the digits. Table 4 shows comparisons of the generalization performance of SVMseq using polynomial and RBF kernels against the optimally tuned SVM with RBF kernels (performance reported in Scholkopf et al. [13]) and classical RBF networks. We see that in spite of absolutely no tuning of parameters, SVMseq achieves much better results than classical RBF networks and results comparable to those of the best tuned standard SVMs.

Fig. 3(a) shows a typical learning performance, in this case for the classification of digit '0', using SVMseq. The training error drops to zero very fast (in a few epochs) and the test error also reaches a low of around 0.7%, which corresponds to an average of about 14 test errors. The plot of the cost function L'_D(h), which is an indicator of algorithm convergence, is also shown to reach its maximum very quickly.

6 SV regression and its sequential implementation

In this section we proceed to show how the idea can be extended to incorporate SVM regression. The basic idea of the SV algorithm for regression estimation is to compute a linear function in some high dimensional feature space (which possesses a dot product) and thereby compute a non-linear function in the space of the input data. Working in the spirit of the SRM principle, we minimize the empirical risk and maximize margins by solving the following constrained optimization problem [16]:
    Minimize J(ξ, ξ*, w) = (1/2)||w||² + C Σ_{i=1}^l (ξ_i + ξ_i*)    (45)
    subject to y_i − w · x_i ≤ ε + ξ_i,   i = 1, ..., l,    (46)
               w · x_i − y_i ≤ ε + ξ_i*,  i = 1, ..., l,    (47)
               ξ_i, ξ_i* ≥ 0,   i = 1, ..., l.    (48)

Here, x refers to the input vector augmented with the augmenting factor λ to handle the bias term, ξ_i, ξ_i* are the slack variables and ε is a user defined insensitivity bound for the error function [16]. The corresponding modified dual problem works out to be:

    Maximize J_D(α, α*) = −(1/2) Σ_{i,j=1}^l (α_i − α_i*)(α_j − α_j*) x_i · x_j − ε Σ_{i=1}^l (α_i + α_i*) + Σ_{i=1}^l y_i (α_i − α_i*)    (49)
    subject to 0 ≤ α_i, α_i* ≤ C,   i = 1, ..., l,    (50)

where α_i, α_i* are non-negative Lagrange multipliers. The solution to this optimization problem is traditionally obtained using quadratic programming packages. The optimal approximation surface using the modified formulation, after extending the SVM to the non-linear case, is given as

    f(x) = Σ_{i=1}^l (α_i − α_i*)(K(x_i, x) + λ²).    (51)

As in the classification case, only some of the coefficients (α_i − α_i*) are non-zero, and the corresponding data points are called support vectors.

6.1 SVMseq algorithm for regression

We extend the sequential learning scheme devised for the classification problem to derive a sequential update scheme which can obtain the solution to the optimization problem of eq.(49) for SVM regression.

Sequential Algorithm for Regression

1. Initialize α_i = 0, α_i* = 0. Compute [R]_ij = K(x_i, x_j) + λ² for i, j = 1, ..., l.
2. For each training point, i = 1 to l, compute
   2.1 E_i = y_i − Σ_{j=1}^l (α_j − α_j*) R_ij.
   2.2 δα_i  = min{ max[ η(E_i − ε), −α_i ], C − α_i },
       δα_i* = min{ max[ η(−E_i − ε), −α_i* ], C − α_i* }.
   2.3 α_i  = α_i + δα_i,
       α_i* = α_i* + δα_i*.
3. If the training has converged, then stop; else go to step 2.

Similar to the classification case, any non-linear kernel satisfying Mercer's condition can be used. We have rigorously proved the convergence of the algorithm for the regression case, a result which will be part of another publication with detailed experimental evaluations on benchmark regression problems. Here, the user defined error insensitivity parameter ε controls the balance between the sparseness of the solution and the closeness to the training data. Recent works on finding the asymptotically optimal choice of the ε-loss [14] and the soft ε-tube [12] give theoretical foundations for the optimal choice of the ε-loss function.
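For illustration, a minimal Python/NumPy sketch of the regression scheme above follows; the names are ours, the kernel argument is any function of the form sketched in Section 3.4, and the learning-rate choice η = 1.9/max_i R_ii is an assumption made by analogy with eq.(44), since the convergence analysis for the regression case is deferred to a later publication.

```python
import numpy as np

def svmseq_regression(X, y, kernel, C=10.0, eps=0.1, lam=1.0,
                      eta=None, max_epochs=1000, tol=1e-6, **params):
    """Sequential SVM regression (Section 6.1). Returns (alpha, alpha_star)."""
    l = len(X)
    # Step 1: R_ij = K(x_i, x_j) + lambda^2
    R = np.array([[kernel(X[i], X[j], **params) for j in range(l)]
                  for i in range(l)]) + lam ** 2
    alpha, alpha_star = np.zeros(l), np.zeros(l)
    if eta is None:
        eta = 1.9 / np.max(np.diag(R))            # assumed, by analogy with eq.(44)
    for epoch in range(max_epochs):
        change = 0.0
        for i in range(l):                        # step 2
            E_i = y[i] - R[i] @ (alpha - alpha_star)                            # 2.1
            da  = min(max(eta * ( E_i - eps), -alpha[i]),      C - alpha[i])    # 2.2
            das = min(max(eta * (-E_i - eps), -alpha_star[i]), C - alpha_star[i])
            alpha[i] += da                        # 2.3
            alpha_star[i] += das
            change += abs(da) + abs(das)
        if change < tol:                          # step 3: updates have died out
            break
    return alpha, alpha_star

def svmseq_regress(x, X, alpha, alpha_star, kernel, lam=1.0, **params):
    """Approximation surface of eq.(51): f(x) = sum_i (alpha_i - alpha_i*)(K(x_i, x) + lambda^2)."""
    k = np.array([kernel(x_i, x, **params) for x_i in X]) + lam ** 2
    return (alpha - alpha_star) @ k
```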
7 Discussion and concluding remarks

By modifying the formulation of the bias term, we were able to come up with a sequential update based learning scheme which solves the SVM classification and regression optimization problems with the added advantages of conceptual simplicity and huge computational gains.

- By considering a modified dual problem, the global constraint in the optimization problem, Σ_i h_i y_i = 0, which is difficult to satisfy in a sequential scheme, is circumvented by absorbing it into the maximization functional.

- Unlike many other gradient ascent type algorithms, SVMseq is guaranteed to converge (essentially due to the nature of the optimization problem) very fast in terms of training epochs. Moreover, we have derived a theoretical upper bound for the learning rate, hence ensuring that there are very few open parameters to be handled.

- The algorithm has been shown to be computationally more efficient than conventional SVM routines through comparisons of the computational complexity in MATLAB routines. Moreover, in the proposed learning procedure, non-support vector Lagrange multipliers go exactly to zero, as opposed to the numerical round-off errors that are characteristic of quadratic optimization packages.

- The algorithm also deals with noisy data by implementing soft classifier type decision surfaces which can take care of outliers. Most importantly, the algorithm has been extended to handle SVM regression under the same general framework, something which very few other speed-up techniques have considered.
- The SVM paradigm in the regression domain provides automatic complexity and model order selection for function approximation - a characteristic that should favor these techniques over standard RBF, Neural Network or other parametric approaches.

In addition, this work looks at the implications of the reformulation of the bias using the augmenting factor from a geometric point of view and interprets the effects seen in simulation results theoretically. The choice of the augmenting factor is also discussed. Moreover, the conditions for obtaining a balanced classifier, which are simple but conceptually important, are also derived.

The proposed algorithm is essentially a gradient ascent method on the modified dual problem, using extra clipping terms in the updates to satisfy the linear constraints. It is well known that gradient ascent/descent may be trapped in a long narrow valley if the Hessian matrix (in our context, the matrices D or R in the dual problem) has widely spread eigenvalues. In our future work, we will work on implementing more advanced gradient descent methods, like the natural gradient descent of Amari [1], to overcome this prospective problem.

Acknowledgements

We would like to thank Prof. Shunichi Amari and Dr. Noboru Murata for constructive suggestions. The authors acknowledge the support of the RIKEN Brain Science Institute for funding this project.

References

[1] S. Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251-276, 1998.

[2] J.K. Anlauf and M. Biehl. The Adatron: an adaptive perceptron algorithm. Europhysics Letters, 10(7):687-692, 1989.

[3] P. Bartlett and J. Shawe-Taylor. Generalization performance of support vector machines and other pattern classifiers. In B. Scholkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning. MIT Press, 1998.

[4] B. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings, Fifth Annual Workshop on Computational Learning Theory. ACM Press, 1992.

[5] C.J.C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 1998. (In press.)

[6] C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273-297, 1995.

[7] T.T. Friess, N. Cristianini, and C. Campbell. The kernel adatron algorithm: A fast and simple learning procedure for support vector machines. In Proceedings, 15th Intl. Conference on Machine Learning. Morgan-Kaufmann, 1998.

[8] T.T. Friess and R.F. Harrison. Support vector neural networks: The kernel adatron with bias and soft margin. Technical Report 752, The University of Sheffield, Dept. of ACSE, 1998.

[9] R.P. Gorman and T.J. Sejnowski. Finding the size of a neural network. Neural Networks, 1:75-89, 1988.

[10] T. Joachims. Making large scale SVM learning practical. In B. Scholkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning. MIT Press, 1998.

[11] Y. LeCun, B. Boser, J.S. Denker, D. Henderson, R.E. Howard, W. Hubbard, and L.J. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1:541-551, 1989.

[12] B. Scholkopf, P. Bartlett, A.J. Smola, and R. Williamson. Support vector regression with automatic accuracy control. In Proceedings, Intl. Conf. on Artificial Neural Networks (ICANN'98), volume 1, pages 111-116. Springer, 1998.

[13] B. Scholkopf, K. Sung, C. Burges, F. Girosi, P. Niyogi, T. Poggio, and V. Vapnik. Comparing Support Vector Machines with Gaussian kernels to radial basis function classifiers. A.I. Memo 1599, M.I.T. AI Labs, 1996.

[14] A.J. Smola, N. Murata, B. Scholkopf, and K.-R. Muller. Asymptotically optimal choice of ε-loss for support vector machines. In Proceedings, Intl. Conf. on Artificial Neural Networks (ICANN'98), volume 1, pages 105-110. Springer, 1998.

[15] V. Vapnik. Statistical Learning Theory. John Wiley and Sons, Inc., New York, 1997.

[16] V. Vapnik, S.E. Golowich, and A. Smola. Support vector method for function approximation, regression estimation and signal processing. In Advances in Neural Information Processing Systems, volume 9, pages 281-287, 1997.
