
Sequential Support Vector Classifiers and Regression

Sethu Vijayakumar and Si Wu
RIKEN Brain Science Institute, The Institute for Physical and Chemical Research,
Hirosawa 2-1, Wako-shi, Saitama, Japan 351-0198
{sethu,phwusi}@brain.riken.go.jp

1 Abstract

Support Vector Machines (SVMs) map the input training data into a high dimensional feature space and find a maximal margin hyperplane separating the data in that feature space. Extensions of this approach account for non-separable or noisy training data (soft classifiers) as well as support vector based regression. The optimal hyperplane is usually found by solving a quadratic programming problem, which is quite complex, time consuming and prone to numerical instabilities. In this work, we introduce a sequential gradient ascent based algorithm for fast and simple implementation of the SVM for classification with soft classifiers. The fundamental idea is similar to applying the Adatron algorithm to the SVM, as developed independently in the Kernel-Adatron [7], although the details differ in many respects. We modify the formulation of the bias and consider a modified dual optimization problem. This formulation has made it possible to extend the framework for solving the SVM regression in an online setting. This paper looks at theoretical justifications of the algorithm, which is shown to converge robustly to the optimal solution very fast in terms of the number of iterations, is orders of magnitude faster than conventional SVM solutions and is extremely simple to implement even for large sized problems. Experimental evaluations on benchmark classification problems of sonar data and the USPS and MNIST databases substantiate the speed and robustness of the learning procedure.

2 Introduction

Support Vector Machines (SVMs) were introduced for solving pattern recognition problems by Vapnik et al. [6, 4] as a technique that minimizes the structural risk - that is, the probability of misclassifying novel patterns for a fixed but unknown probability distribution of the input data - as opposed to traditional methods which minimize the empirical risk [15]. This induction principle is equivalent to minimizing an upper bound on the generalization error. What makes the SVMs so attractive is the ability to condense the information in the training data and provide a sparse representation using non-linear decision surfaces of relatively low VC dimension. Moreover, when extended to the regression case, the SVM solution provides a tool for some level of automatic model selection - for example, SVM regression with RBF kernels leads to automatic selection of the optimal number of RBFs and their centers [13].

The SVM solution relies on maximizing the margin between the separating hyperplanes and the data. This solution is achieved by reducing the problem to a quadratic programming problem and using optimization routines from standard numerical packages. In spite of the successes of SVMs in real world applications like handwritten character recognition problems [3], their uptake in practice has been limited because quadratic optimization through standard packages is computationally intensive and susceptible to stability problems, and its implementation, especially for a large training data set, is non-trivial. Recently, some algorithms have dealt with chunking the data sets and solving smaller quadratic problems to deal with the memory storage problems resulting from large training sets [10]. However, this still uses quadratic programming solvers for the basic optimization. The idea of using a non-linear kernel on top of the Adatron algorithm to obtain the Kernel-Adatron algorithm for solving the support vector classification problem has been proposed in [7] in parallel to our work. However, in the initial implementations [7], they consider neither bias nor soft margin classifiers. In later modifications [8], the authors talk of an implementation of bias and soft margin which is conceptually different from our method. Based on the same idea of applying the Adatron method to the SVM, in this work, by exploiting an alternate formulation for the bias term in the classifier, we derive a sequential gradient ascent type learning algorithm which can maximize the margins for the quadratic programming problem in high-dimensional feature space and provide very fast convergence, in terms of the number of iterations, to the optimal solution within particular analytically derived bounds of the learning rate. We look at theoretical implications of the modification of the cost function and derive conditions that are necessary to obtain optimal class separation.
Experimental results on benchmark datasets have shown that the algorithm provides generalization abilities equal to standard SVM implementations while being orders of magnitude faster. The algorithm is capable of handling noisy data and outliers through implementation of the soft classifier with an adaptable trade-off parameter. Furthermore, the algorithm has been extended to provide a fast implementation of Support Vector regression, a unified approach to classification and regression which very few other algorithms have addressed. It should be noted, in relation to the work in Friess et al. [7], that although the fundamental idea is the same and the final algorithm looks quite similar, there are conceptual differences in the way the bias and soft margin are handled. A difference in the update procedures (clipping) can be shown to lead to an extended convergence plateau in [7], resulting in qualitatively similar results but affecting the speed of solution and the number of resulting SVMs.

Section 3 reviews the theoretical framework of the SVMs and looks at the implementation of non-linear decision surfaces. Section 4 provides a new formulation for the bias and describes the sequential algorithm for classification along with rigorous justifications for its convergence. We also perform a theoretically rigorous analysis of the margin maximization which clarifies the additional conditions that need to be satisfied to obtain a balanced classifier. Section 5 compares the learning performance of the proposed method against other techniques for standard benchmark problems. Section 6 describes an extension of the algorithm for implementing SVM regression, while Section 7 summarizes the salient features of the proposed method and looks at future research directions.

3 The classification problem and Support Vectors

We start by introducing the theoretical underpinnings of Support Vector learning. Consider the problem of classifying data {x_i, y_i}, i = 1, ..., l, where x_i are the N dimensional pattern vectors and y_i ∈ {−1, 1} are the target values.

3.1 The linearly separable case

We first consider the case when the data is linearly separable. The classification problem can be reformulated as one of finding a hyperplane f(w, b) = x_i · w + b which separates the positive and negative examples:

    x_i · w + b ≥ 1    for y_i = 1,    (1)
    x_i · w + b ≤ −1   for y_i = −1,   (2)

which is equivalent to

    y_i (x_i · w + b) − 1 ≥ 0   for i = 1, ..., l,    (3)

where a · b ≡ Σ_i a_i b_i represents the dot product. All points for which eq.(1) holds lie on the hyperplane H1 : x_i · w + b = 1 with normal w and perpendicular distance from the origin d_H1 = (1 − b)/||w||. Similarly, those points satisfying eq.(2) lie on the hyperplane H2 : x_i · w + b = −1 with distance from the origin d_H2 = (−1 − b)/||w||. The distance between the two canonical hyperplanes H1 and H2 is a quantity we refer to as the margin and is given by:

    Margin = |d_H1 − d_H2| = 2/||w||.    (4)

There are (in general) many hyperplanes satisfying the condition given in eq.(3) for a data set that is separable. To choose the one with the best generalization capabilities, we try to maximize the margin between the canonical hyperplanes [5]. This is justified from the computational learning theory perspective, since maximizing margins corresponds to minimizing the VC dimension of the resulting classifier [15], and, from the interpretation of attractor type networks in physics, achieves maximal stability [2].

Therefore, the optimization problem to be solved becomes

    Minimize J1[w] = (1/2)||w||²    (5)
    subject to y_i (x_i · w + b) − 1 ≥ 0,  i = 1, ..., l.    (6)

3.2 The linearly non-separable case

In case the data are not linearly separable, as is true in many cases, we introduce slack variables ξ_i ≥ 0 to take care of misclassifications such that the following constraint is satisfied:

    y_i (x_i · w + b) − 1 + ξ_i ≥ 0   for i = 1, ..., l.    (7)

Here, we choose a trade-off between maximizing margins and reducing the number of misclassifications by using a user defined parameter C such that the optimization problem becomes:

    Minimize J2[w, ξ_i] = (1/2)||w||² + C Σ_{i=1}^l ξ_i    (8)
    subject to y_i (x_i · w + b) − 1 + ξ_i ≥ 0,    (9)
               ξ_i ≥ 0,  i = 1, ..., l.    (10)

We can write this constrained optimization as an unconstrained quadratic convex programming problem by introducing Lagrange multipliers as
    Minimize L(w, b, ξ_i, h, r) = (1/2)||w||² + C Σ_{i=1}^l ξ_i − Σ_{i=1}^l h_i [y_i (x_i · w + b) − 1 + ξ_i] − Σ_{i=1}^l r_i ξ_i,    (11)

where h = (h_1, ..., h_l) and r = (r_1, ..., r_l) are the non-negative Lagrange multipliers.

3.3 The dual problem and sparsity of solution

The optimization problem of eq.(11) can be converted into a dual problem, which is much easier to handle [5], by looking at the saddle points of eq.(11):

    Maximize L_D(h) = Σ_{i=1}^l h_i − (1/2) h · Dh    (12)
    subject to Σ_{i=1}^l y_i h_i = 0,    (13)
               0 ≤ h_i ≤ C,  i = 1, ..., l,    (14)

where D_ij = y_i y_j x_i · x_j.

Solutions to this dual problem are traditionally found by using standard quadratic programming packages. Once the optimal multipliers h_i are found, the classifier is simply

    f(x) = sign( Σ_i h_i y_i x_i · x + b_0 ).    (15)

Usually, based on the Kuhn-Tucker optimality condition, the number of non-zero Lagrange multipliers h_i is much smaller than the number of data points and hence, the kind of expansion for the classifier given by eq.(15) is very sparse. The data points corresponding to these non-zero Lagrange multipliers are referred to as support vectors. The bias term b_0 can be found using any support vector x_sv as

    b_0 = y_sv − Σ_{i∈SV} h_i y_i x_i · x_sv.    (16)
corporate the bias term in the weight vector, w0 =
(w1 ; : : : ; wN ; b=), where  is a scalar constant. The
i i i sv
i2SV choice of the magnitude of the augmenting factor  is
discussed in Section 4.1.1.
Based on this modi cation, we write an analogous
3.4 Extension to non-linear decision optimization problem.
surfaces
Obviously, the power of linear decision surfaces are Mininimize L0(w0 ; i ) = ( 1 kw0 k2 + C
Xl  )(20)
i
very limited. SVMs provide a convenient way of ex- 2 i=1
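As a concrete illustration, here is a minimal Python/NumPy sketch of the two kernels of eqs.(18) and (19) together with the kernel (Gram) matrix that the sequential algorithms below operate on; the helper names are our own and are chosen only for illustration.

```python
import numpy as np

def gaussian_rbf_kernel(x, z, sigma=1.0):
    """Gaussian RBF kernel of eq.(18): K(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    d = np.asarray(x) - np.asarray(z)
    return np.exp(-np.dot(d, d) / (2.0 * sigma ** 2))

def polynomial_kernel(x, z, degree=3):
    """Polynomial kernel of eq.(19): K(x, x') = (x . x' + 1)^d."""
    return (np.dot(x, z) + 1.0) ** degree

def gram_matrix(X, kernel, **params):
    """Kernel matrix K_ij = K(x_i, x_j) for patterns X of shape (l, N)."""
    l = len(X)
    K = np.empty((l, l))
    for i in range(l):
        for j in range(i, l):
            K[i, j] = K[j, i] = kernel(X[i], X[j], **params)
    return K
```

For an admissible kernel, the resulting matrix is symmetric and positive semidefinite, which is the finite-sample counterpart of the condition in eq.(17).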
4 Sequential learning of support vectors for classification

The main aim of this work is to implement the support vector solution in a fast, simple, sequential scheme. To this effect, we make a slight modification in the problem formulation.

4.1 An alternate formulation for bias

In the previous sections, the bias has been represented explicitly using an additional variable b. From now on, we will use an additional dimension for the pattern vectors such that x' = (x_1, ..., x_N, λ) and incorporate the bias term in the weight vector, w' = (w_1, ..., w_N, b/λ), where λ is a scalar constant. The choice of the magnitude of the augmenting factor λ is discussed in Section 4.1.1.

Based on this modification, we write an analogous optimization problem:

    Minimize L'(w', ξ_i) = (1/2)||w'||² + C Σ_{i=1}^l ξ_i    (20)
    subject to y_i (x'_i · w') − 1 + ξ_i ≥ 0,    (21)
               ξ_i ≥ 0,  i = 1, ..., l.    (22)

If we look carefully, the margin we are trying to maximize is slightly different from that of the original problem. Instead of maximizing 2/||w||, or minimizing ||w||², we are minimizing ||w'||² = ||w||² + b²/λ². We now look at the implications of this approximation.
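A minimal sketch of this augmentation, assuming Python/NumPy and illustrative names of our own choosing, makes the bookkeeping explicit:

```python
import numpy as np

def augment(X, lam=1.0):
    """Append the augmenting factor to every pattern: x' = (x_1, ..., x_N, lambda).

    With w' = (w_1, ..., w_N, b / lambda), the biased decision value x . w + b
    becomes the plain dot product x' . w'.
    """
    X = np.atleast_2d(np.asarray(X, dtype=float))
    return np.hstack([X, np.full((X.shape[0], 1), lam)])

# Example check of the equivalence (w, b and x are arbitrary illustrative values):
w, b, lam = np.array([0.5, -1.0]), 0.3, 1.0
x = np.array([[2.0, 1.0]])
w_aug = np.append(w, b / lam)
assert np.isclose(augment(x, lam) @ w_aug, x @ w + b).all()
```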
Figure 1: Visualization of the modified margin in higher dimensional space.

Table 1: Effect of the augmenting factor on approximation quality
  Augmenting factor (λ)    φ (degrees)
  0.1                      31.823032
  0.5                      16.999935
  1.0                       6.831879
  2.0                       0.977416
  5.0                       0.275545

4.1.1 Effect of the modified bias formulation on margin maximization

Here, we look at the implications of maximizing a modified margin (refer Section 4.1) which involves using an input vector augmented with a scalar constant λ. For the sake of analysis, let us assume that the augmenting factor λ = 1. Fig. 1 gives a schematic diagram of a linearly separable classification problem in a higher dimensional space after the input vector has been augmented. The margin being maximized in the original problem with explicit bias, based on eq.(4), is

    d² = 1/||w||².    (23)

The modified margin d~ is the closest distance from the data points to a hyperplane in the higher dimensional space passing through the origin. Based on simple geometry, it can be seen from Fig. 1 that

    d~² = d² · cos²φ.    (24)

The angle φ satisfies the following relationship:

    cos²φ = 1/(1 + m²)   where m = b/||w||.    (25)

Therefore, eqs.(23), (24) and (25) yield an explicit relationship for the modified margin:

    d~² = 1/(||w||² + b²).    (26)

So, maximizing the modified margin will, in general, only give an approximate solution of the original optimization problem. In case the input dimension is high, the contribution of the bias term in the margin 1/(||w||² + b²) is negligible and the approximation will be very good, as confirmed in our simulation experiments.

From eq.(24), we see that if the angle φ (refer Fig. 1) is smaller, the approximation tends to be better. This can be ensured by augmenting the input vector by a 'larger' number instead of '1'. To prove this point and also show how good the approximation is, we carried out simulations on an artificial two dimensional linearly separable data set. One data cluster is randomly generated in the range [−1, 1] × [−1, 1] with uniform distribution. The other one is obtained in the same way in the range [1.5, 3.5] × [1.5, 3.5]. It is easy to see that the normal of the optimal classifying hyperplane w* can be obtained in the following way: select a pair of data points, say x_1 and x_2, from different clusters such that the distance between them is the minimum among all such pairs; w* can then be expressed as w* = (x_2 − x_1)/||x_2 − x_1||. Let the solution achieved through SVMseq be w. The angle φ between w and w*, which indicates how good the approximation is, can be written as

    φ = arccos( w · w* / ||w|| ).    (27)

Table 1 compares the effect of the augmenting factor (λ) on the approximation quality. We see that the approximation improves vastly with the increase of the magnitude of the augmenting factor. The simulations used a fixed set of 50 data points for the experiments. In practice, however, there is a trade-off resulting from increasing the augmenting factor. A large augmenting factor will normally lead to slower convergence speeds and instability in the learning process. An appropriate solution is to choose an augmenting factor having the same magnitude as the input vector. However, when using non-linear kernels, the effect of the augmenting factor is not directly interpretable as above and depends on the exact kernel we use. We have noticed that setting the augmenting term to zero (equivalent to neglecting the bias term) in high dimensional kernels gives satisfactory results on real world data.
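The construction behind Table 1 can be sketched in a few lines of Python/NumPy; the names, the random seed and the example weight vector below are our own illustrative assumptions, and in practice the learned w would come from SVMseq trained on the augmented data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two linearly separable clusters as described in the text (25 + 25 = 50 points).
neg = rng.uniform(-1.0, 1.0, size=(25, 2))    # cluster in [-1, 1] x [-1, 1]
pos = rng.uniform(1.5, 3.5, size=(25, 2))     # cluster in [1.5, 3.5] x [1.5, 3.5]

# Reference normal w*: normalized difference of the closest pair across clusters.
dists = np.linalg.norm(pos[:, None, :] - neg[None, :, :], axis=2)
i, j = np.unravel_index(np.argmin(dists), dists.shape)
w_star = (pos[i] - neg[j]) / np.linalg.norm(pos[i] - neg[j])

def approximation_angle(w, w_star):
    """Angle phi of eq.(27) between the learned normal w and the reference w*."""
    c = np.clip(w @ w_star / np.linalg.norm(w), -1.0, 1.0)
    return np.degrees(np.arccos(c))

# Placeholder weight vector standing in for the SVMseq solution:
print(approximation_angle(np.array([1.0, 0.9]), w_star))
```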
4.1.2 Obtaining a balanced classifier
In the case when the data is linearly separable, a balanced classifier is defined as the hyperplane (f(w, b) = x · w + b = 0) that lies in the middle of the two clusters, or more precisely, whose bias term b is given as:

    b = −(1/2)(x_i^+ · w + x_i^− · w),    (28)

where x_i^+ and x_i^− are any two support vectors from the two clusters, respectively. Certainly, a balanced classifier is superior to an unbalanced one. For the original problem, a balanced classifier is automatically obtained. However, while maximizing the modified margin, this is no longer the case; a constraint on the magnitude of w needs to be satisfied.

Suppose there is an unbalanced classifier f(w_0, b_0). Without loss of generality, we assume it is closer to the negative cluster, that is,

    x_i^+ · w_0 + b_0 = c,   c > 1,
    x_i^− · w_0 + b_0 = −1.

It is easy to check, by using the transformation w* = (2/(1 + c)) w_0 and b* = (2/(1 + c)) b_0 + (1 − c)/(1 + c), that we can get a balanced classifier f(w*, b*) parallel to the unbalanced one:

    x_i^+ · w* + b* = 1,
    x_i^− · w* + b* = −1.

So, we can construct a family of parallel classifiers parameterized by c, for c ≥ 1, in the following way:

    w(c) = ((1 + c)/2) w*,    (29)
    b(c) = ((1 + c)/2) b* − (1 − c)/2,    (30)

such that

    x_i^+ · w(c) + b(c) = c,   c > 1,    (31)
    x_i^− · w(c) + b(c) = −1.    (32)

Here, f(w(1), b(1)) gives the balanced classifier, which corresponds to c = 1 and f(w*, b*). For c > 1, an unbalanced classifier closer to the negative cluster is obtained. We will, for the sake of simplicity, assume the augmenting factor to be 1. Then, SVMseq minimizes ||w||² + b². So, when

    L(c) = ||w(c)||² + b(c)²    (33)

takes its minimum value at c = 1, SVMseq will provide a balanced classifier. Calculating the derivative of L(c) with respect to c, we have

    dL(c)/dc = ((1 + c)/2) [ ||w*||² − 1/(1 + c)² ] + ((1 + c)/2) [ b* + c/(1 + c) ]².    (34)

Note that since c ≥ 1, ||w*||² > 1/4 ensures dL(c)/dc > 0 for any c. So, ||w*||² > 1/4 is the sufficient condition that a balanced classifier will be obtained in our method. In practice, this is easy to satisfy by just scaling the input data to smaller values, an automatic result of which is larger magnitudes of w.

Working along the lines of Section 3.3, and solving for the hyperplane in the high dimensional feature space, we can convert the modified optimization problem of eq.(20) into a dual problem:

    Maximize L'_D(h) = Σ_i h_i − (1/2) h · D'h    (35)
    subject to 0 ≤ h_i ≤ C,  i = 1, ..., l,    (36)

where D'_ij = y_i y_j K(x_i, x_j) + λ² y_i y_j. Here, in the non-linear case, we augment the input vector with an extra dimension in the feature space, i.e., Φ(x') = (Φ(x)  λ)^T. The resultant non-linear classifier is computed as

    f(x) = sign( Σ_{i∈SV} h_i y_i K(x_i, x) + h_i y_i λ² ).    (38)

For the sake of clarity, we will stop using the 'dash' notation and from now on assume that the vectors x and the matrix D refer to the modified input vector augmented with the scalar λ and the modified matrix, respectively, as described in the beginning of this section.
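Before turning to the update rule itself, the modified matrix D' of eq.(35) and the classifier of eq.(38) can be written down directly; the Python/NumPy sketch below assumes a kernel function of the form sketched in Section 3.4 and uses illustrative names of our own.

```python
import numpy as np

def modified_D(X, y, kernel, lam=1.0, **params):
    """Matrix of the modified dual, eq.(35): D'_ij = y_i y_j (K(x_i, x_j) + lambda^2)."""
    l = len(X)
    K = np.array([[kernel(X[i], X[j], **params) for j in range(l)] for i in range(l)])
    return (y[:, None] * y[None, :]) * (K + lam ** 2)

def classify(x, X_sv, y_sv, h_sv, kernel, lam=1.0, **params):
    """Non-linear classifier of eq.(38), summed over the support vectors only."""
    k = np.array([kernel(x_i, x, **params) for x_i in X_sv])
    return np.sign(np.sum(h_sv * y_sv * (k + lam ** 2)))
```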
4.2 SVMseq for classification

In this section, we provide the pseudocode for the sequential learning algorithm. The algorithm, on convergence, provides the optimal set of Lagrange multipliers h_i for the classification problem in high-dimensional feature space, which corresponds to the maximum of the cost function (35).

Sequential Algorithm for Classification

1. Initialize h_i = 0. Compute the matrix [D]_ij = y_i y_j (K(x_i, x_j) + λ²) for i, j = 1, ..., l.
2. For each pattern, i = 1 to l, compute
   2.1 E_i = Σ_{j=1}^l h_j D_ij.
   2.2 δh_i = min{ max[ η(1 − E_i), −h_i ], C − h_i }.
   2.3 h_i = h_i + δh_i.
3. If the training has converged, then stop; else go to step 2.

Here, η refers to the learning rate, theoretical bounds for which will be derived in the following analysis. The kernel K(x, x') can be any function which satisfies Mercer's condition [6],

    ∫∫ K(x, y) f(x) f(y) dx dy ≥ 0   for all f ∈ L₂,    (39)
examples of which are the Gaussian and polynomial kernels given in Section 3.4. We can check for the convergence of the algorithm either by monitoring the updates of each Lagrange multiplier h_i or by looking at the rate of change of the cost function, L'_D, which can be written as

    ΔL'_D = δh_i^t (1 − E_i^t − (1/2) D_ii δh_i^t),    (40)

as shown in Section 4.3.2.
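A minimal NumPy sketch of the training loop above is given here for concreteness; it assumes a precomputed matrix D (e.g. from the modified_D helper sketched at the end of Section 4.1.2), uses the learning rate η = 1.9 / max_i D_ii adopted in the experiments of Section 5 (within the bound of eq.(44) derived at the end of Section 4.3), and monitors convergence through the accumulated increments of eq.(40). The function name and stopping tolerance are illustrative choices of ours, not the authors' original code.

```python
import numpy as np

def svmseq_train(D, C=50.0, eta=None, max_epochs=1000, tol=1e-6):
    """Sequential SVM classification (Section 4.2). Returns the multipliers h."""
    l = D.shape[0]
    h = np.zeros(l)                                # step 1: h_i = 0
    if eta is None:
        eta = 1.9 / np.max(np.diag(D))             # learning rate within eq.(44)
    for epoch in range(max_epochs):
        delta_L = 0.0
        for i in range(l):                         # step 2: sweep through the patterns
            E_i = D[i] @ h                         # 2.1: E_i = sum_j h_j D_ij
            dh = min(max(eta * (1.0 - E_i), -h[i]), C - h[i])   # 2.2: clipped update
            h[i] += dh                             # 2.3
            delta_L += dh * (1.0 - E_i - 0.5 * D[i, i] * dh)    # increment, eq.(40)
        if delta_L < tol:                          # step 3: cost function has levelled off
            break
    return h
```

The non-zero entries of the returned h identify the support vectors, and a new pattern is then classified with eq.(38).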

4.3 Rigorous proofs of convergence

Now we prove that the learning algorithm can indeed find the optimal solution of the dual problem given by eq.(35). The proof involves showing that the constraint of eq.(36) is always satisfied and that the algorithm will converge to the global maximum of the cost function given by eq.(35).

4.3.1 The constraint (36) is always satisfied during the learning process

For all the possible updates of h_i, we have:

1. If δh_i^t = C − h_i^t, then h_i^{t+1} = h_i^t + δh_i^t = C.
2. If δh_i^t = max{ η(1 − E_i^t), −h_i^t }, then h_i^{t+1} = h_i^t + δh_i^t ≥ h_i^t − h_i^t = 0. Since, in this case, C − h_i^t ≥ max{ η(1 − E_i^t), −h_i^t }, we also have h_i^{t+1} ≤ h_i^t + C − h_i^t = C.

So, for any initial values h_i^0, the h_i^1 will satisfy the constraint (36) after just one learning step and will maintain this property forever.

4.3.2 The learning process monotonically increases L'_D(h) and stops when a local maximum is reached

Define a ≡ max_i D_ii. We prove that for a learning rate bounded by 0 < η < 2/a, L'_D(h) will monotonically increase during the learning process:

    ΔL'_D = L'_D(h_i^t + δh_i^t) − L'_D(h_i^t)
          = δh_i^t − (1/2) Σ_j h_j^t D_ji δh_i^t − (1/2) Σ_j D_ij h_j^t δh_i^t − (1/2) D_ii (δh_i^t)².

Since D_ij = D_ji and Σ_j D_ij h_j^t = E_i^t, we have

    ΔL'_D = δh_i^t (1 − E_i^t − (1/2) D_ii δh_i^t).    (41)

Now, consider the three possible updates of δh_i^t:

Case 1: δh_i^t = η(1 − E_i^t)

    ΔL'_D = (1 − E_i^t)² (η − (1/2) D_ii η²) ≥ (1 − E_i^t)² (η − (1/2) a η²) ≥ 0.

Case 2: δh_i^t = −h_i^t

This case arises only when −h_i^t ≥ η(1 − E_i^t). Remember h_i^t ≥ 0, which implies E_i^t ≥ 1 in this case. Substituting δh_i^t in eq.(41), we get

    ΔL'_D = −h_i^t (1 − E_i^t + (1/2) D_ii h_i^t).    (42)

Notice that

    1 − E_i^t + (1/2) D_ii h_i^t ≤ 1 − E_i^t + (1/2) D_ii η (E_i^t − 1) = (1 − E_i^t)(1 − (1/2) η D_ii) ≤ 0.    (43)

So, from eqs.(42) and (43), ΔL'_D ≥ 0.

Case 3: δh_i^t = C − h_i^t

This case arises only when C − h_i^t ≤ η(1 − E_i^t). Remember h_i^t ≤ C, which implies E_i^t ≤ 1 in this case. Therefore,

    ΔL'_D = (C − h_i^t) [1 − E_i^t − (1/2) D_ii (C − h_i^t)]
          ≥ (C − h_i^t) [1 − E_i^t − (1/2) D_ii η (1 − E_i^t)]
          = (C − h_i^t)(1 − E_i^t)(1 − (1/2) η D_ii) ≥ 0.

So, for all three possible values of δh_i, L'_D(h) will monotonically increase and stop when a local maximum is reached. In the following we show that the local maximum is also the global one.

4.3.3 Every local maximum is also the global one

Since Σ_ij h_i D_ij h_j = (Σ_i h_i y_i x_i) · (Σ_i h_i y_i x_i) ≥ 0 for any h, D is a positive semidefinite matrix, and L'_D(h) is a concave function. For the linear constraints that arise in this optimization problem, we show that each local maximum is equivalent to the global one. Suppose h_1 is a local maximum point, while h_2 is the global maximum and has a larger value, i.e., L'_D(h_2) > L'_D(h_1). We can construct a straight line connecting h_1 and h_2, parameterized as h(γ) = (1 − γ)h_1 + γh_2, for 0 ≤ γ ≤ 1. Since the constraints are linear, all the points on this line are feasible solutions. On the other hand, since L'_D(h) is concave, L'_D(h(γ)) ≥ (1 − γ)L'_D(h_1) + γL'_D(h_2). For γ > 0, this gives L'_D(h(γ)) > L'_D(h_1), which contradicts the assumption that h_1 is a local maximum. So h_1 must also be the global maximum solution.
Figure 2: Sonar data classification using SVMseq: (a) training and generalization error (in percentage of data) obtained for various values of the Gaussian kernel parameter σ; (b) generalization error vs the number of training epochs, plotted for three different kernel parameters (σ = 0.3, 0.6, 0.7).

Sections 4.3.1, 4.3.2 and 4.3.3 combined complete the proof of convergence. The analysis also gives us a theoretical upper bound on the learning rate to be used:

    η ≤ 2 / max_i D_ii.    (44)

5 Experimental results

We now proceed to look at some experimental classification results using the SVMseq algorithm. The aim of this section is to demonstrate that the SVMseq algorithm is capable of achieving the high levels of generalization capability boasted by SVM-type classifiers, and that it does so with much less computation besides being robust to parameter choices over a broad range. We use benchmark problems reported in [7] as well as some others. Results for the case without bias are similar, but are improved by introducing bias and soft margin as done in our modified formulation. Moreover, in our method the fixing of the learning rate and open parameters has a firm theoretical basis as an off-shoot of the rigorous theoretical analysis.

Here, we give examples of using RBF and polynomial type non-linear decision boundaries. The learning rate is fixed to 1.9/max_i D_ii based on eq.(44), and the convergence of the algorithm is determined by looking at a trace of the cost function, which is easily computed using eq.(40).

5.1 Sonar classification benchmark

We consider a benchmark problem to classify the sonar data set of 208 patterns representing metal cylinders (mines) and rocks [9]. Each of the patterns has 60 input dimensions, and the aspect-angle dependent data set has been split into a training set and a testing set of 104 patterns each, as recommended by Gorman and Sejnowski [9]. We use a Gaussian RBF kernel (refer Section 3.4) with various values of the variance parameter σ in our non-linear implementation.

Table 2: Sonar classification: Gorman and Sejnowski's results with standard NNs using the BP algorithm [9]
  #hidden   0     2     6     12    24
  %gen.     73.1  85.7  89.3  90.4  89.2

Table 3: Sonar classification: generalization error and computational cost for the standard SVM and SVMseq
  ALGO     Epochs   %gen. (#errors)   FLOPS
  SVM      -        93.3% (7)         6.8 × 10^9
  SVMseq   10       93.3% (7)         383520
  SVMseq   50       95.2% (5)         1917600
  SVMseq   130      95.2% (5)         4985760

Tables 2 and 3 compare the generalization performance of our method with the BP trained NNs of Gorman and Sejnowski [9] (up to 24 hidden units and 300 epochs through the training data) and the standard SVM method, on the lines of the analysis in Friess et al. [7]. The values shown are obtained with parameters σ = 0.6 and C = 50. The generalization performance achieved by SVMseq is superior to the NN based method (Table 2) while it is similar to the standard SVM. However, SVMseq is orders of magnitude faster than the standard SVM, as is reflected through the FLOPS comparison in MATLAB (Table 3). The last row (epochs = 130) is when the algorithm detected convergence.
We also plot the evolution of the generalization and training error for various values of the kernel parameter in Fig. 2(a). It is noted that the generalization error is flat in a fairly broad region of the parameter space. Fig. 2(b) shows that the algorithm has a very fast convergence speed (in terms of training epochs) if we are in a region with an approximately correct kernel parameter.

Figure 3: Handwritten character recognition: training error, generalization error and the cost function trace against the number of training epochs for (a) the USPS 16 × 16 pixel database and (b) the MNIST 28 × 28 pixel database.

Table 4: USPS dataset two-class classification comparisons: number of test errors (out of 2007 test patterns)
  Digit                      0   1   2   3   4   5   6   7   8   9
  classical RBF             20  16  43  38  46  31  15  18  37  26
  Optimally tuned SVM       16  12  25  24  32  24  14  16  26  16
  SVMseq (poly. deg. = 3)   19  14  31  29  33  30  17  16  34  26
  SVMseq (poly. deg. = 4)   14  13  30  28  29  27  14  16  28  19
  SVMseq (RBF: σ = 3.5)     11   7  25  24  30  25  17  14  24  14

5.2 USPS and MNIST handwritten character classification

The next classification task we considered used the USPS database of 9298 handwritten digits (7291 for training, 2007 for testing) collected from mail envelopes in Buffalo [11]. Each digit is a 16 × 16 vector with entries between −1 and +1. Preprocessing consists of smoothing with a Gaussian kernel of width σ = 0.75. We used polynomial kernels of third and fourth degree and RBF kernels to generate the non-linear decision surface.

Here, we consider the task of two-class binary classification, i.e. separating one digit from the rest of the digits. Table 4 shows comparisons of the generalization performance of SVMseq using polynomial and RBF kernels against the optimally tuned SVM with RBF kernels (performance reported in Scholkopf et al. [13]) and classical RBF networks. We see that in spite of absolutely no tuning of parameters, SVMseq achieves much better results than classical RBF networks and results comparable to those of the best tuned standard SVMs.

Fig. 3(a) shows a typical learning performance, in this case for the classification of digit '0', using SVMseq. The training error drops to zero very fast (in a few epochs) and the test error also reaches a low of around 0.7%, which corresponds to an average of about 14 test errors. The plot of the cost function L'_D(h), which is an indicator of algorithm convergence, is also shown to reach its maximum very quickly.

6 SV regression and its sequential implementation

In this section we proceed to show how the idea can be extended to incorporate SVM regression. The basic idea of the SV algorithm for regression estimation is to compute a linear function in some high dimensional feature space (which possesses a dot product) and thereby compute a non-linear function in the space of the input data. Working in the spirit of the SRM principle, we minimize the empirical risk and maximize margins by solving the following constrained optimization problem [16]:
    Minimize J(ξ, ξ*, w) = (1/2)||w||² + C Σ_{i=1}^l (ξ_i + ξ_i*)    (45)
    subject to y_i − w · x_i ≤ ε + ξ_i,   i = 1, ..., l,    (46)
               w · x_i − y_i ≤ ε + ξ_i*,  i = 1, ..., l,    (47)
               ξ_i, ξ_i* ≥ 0,   i = 1, ..., l.    (48)

Here, x refers to the input vector augmented with the augmenting factor λ to handle the bias term, ξ_i, ξ_i* are the slack variables and ε is a user defined insensitivity bound for the error function [16]. The corresponding modified dual problem works out to be:

    Maximize J_D(α, α*) = −(1/2) Σ_{i,j=1}^l (α_i − α_i*)(α_j − α_j*) x_i · x_j − ε Σ_{i=1}^l (α_i + α_i*) + Σ_{i=1}^l y_i (α_i − α_i*)    (49)
    subject to 0 ≤ α_i, α_i* ≤ C,   i = 1, ..., l,    (50)

where α_i, α_i* are non-negative Lagrange multipliers. The solution to this optimization problem is traditionally obtained using quadratic programming packages. The optimal approximation surface using the modified formulation, after extending the SVM to the non-linear case, is given as

    f(x) = Σ_{i=1}^l (α_i − α_i*)(K(x_i, x) + λ²).    (51)

As in the classification case, only some of the coefficients (α_i − α_i*) are non-zero, and the corresponding data points are called support vectors.

6.1 SVMseq algorithm for regression

We extend the sequential learning scheme devised for the classification problem to derive a sequential update scheme which can obtain the solution to the optimization problem of eq.(49) for SVM regression.

Sequential Algorithm for Regression

1. Initialize α_i = 0, α_i* = 0. Compute [R]_ij = K(x_i, x_j) + λ² for i, j = 1, ..., l.
2. For each training point, i = 1 to l, compute
   2.1 E_i = y_i − Σ_{j=1}^l (α_j − α_j*) R_ij.
   2.2 δα_i  = min{ max[ η(E_i − ε), −α_i ], C − α_i },
       δα_i* = min{ max[ η(−E_i − ε), −α_i* ], C − α_i* }.
   2.3 α_i  = α_i + δα_i,
       α_i* = α_i* + δα_i*.
3. If the training has converged, then stop; else go to step 2.

Similar to the classification case, any non-linear kernel satisfying Mercer's condition can be used. We have rigorously proved the convergence of the algorithm for the regression case, a result which will be part of another publication with detailed experimental evaluations on benchmark regression problems. Here, the user defined error insensitivity parameter ε controls the balance between the sparseness of the solution and the closeness to the training data. Recent works on finding the asymptotically optimal choice of the ε-loss [14] and the soft ε-tube [12] give theoretical foundations for the optimal choice of the ε-loss function.
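For illustration, a minimal Python/NumPy sketch of the regression scheme above follows; the names are ours, the kernel argument is any function of the form sketched in Section 3.4, and the learning-rate choice η = 1.9/max_i R_ii is an assumption made by analogy with eq.(44), since the convergence analysis for the regression case is deferred to a later publication.

```python
import numpy as np

def svmseq_regression(X, y, kernel, C=10.0, eps=0.1, lam=1.0,
                      eta=None, max_epochs=1000, tol=1e-6, **params):
    """Sequential SVM regression (Section 6.1). Returns (alpha, alpha_star)."""
    l = len(X)
    # Step 1: R_ij = K(x_i, x_j) + lambda^2
    R = np.array([[kernel(X[i], X[j], **params) for j in range(l)]
                  for i in range(l)]) + lam ** 2
    alpha, alpha_star = np.zeros(l), np.zeros(l)
    if eta is None:
        eta = 1.9 / np.max(np.diag(R))            # assumed, by analogy with eq.(44)
    for epoch in range(max_epochs):
        change = 0.0
        for i in range(l):                        # step 2
            E_i = y[i] - R[i] @ (alpha - alpha_star)                            # 2.1
            da  = min(max(eta * ( E_i - eps), -alpha[i]),      C - alpha[i])    # 2.2
            das = min(max(eta * (-E_i - eps), -alpha_star[i]), C - alpha_star[i])
            alpha[i] += da                        # 2.3
            alpha_star[i] += das
            change += abs(da) + abs(das)
        if change < tol:                          # step 3: updates have died out
            break
    return alpha, alpha_star

def svmseq_regress(x, X, alpha, alpha_star, kernel, lam=1.0, **params):
    """Approximation surface of eq.(51): f(x) = sum_i (alpha_i - alpha_i*)(K(x_i, x) + lambda^2)."""
    k = np.array([kernel(x_i, x, **params) for x_i in X]) + lam ** 2
    return (alpha - alpha_star) @ k
```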
7 Discussion and concluding remarks

By modifying the formulation of the bias term, we were able to come up with a sequential update based learning scheme which solves the SVM classification and regression optimization problems with the added advantages of conceptual simplicity and huge computational gains.

- By considering a modified dual problem, the global constraint in the optimization problem, Σ_i h_i y_i = 0, which is difficult to satisfy in a sequential scheme, is circumvented by absorbing it into the maximization functional.

- Unlike many other gradient ascent type algorithms, SVMseq is guaranteed to converge (essentially due to the nature of the optimization problem) very fast in terms of training epochs. Moreover, we have derived a theoretical upper bound for the learning rate, hence ensuring that there are very few open parameters to be handled.

- The algorithm has been shown to be computationally more efficient than conventional SVM routines through comparisons of the computational complexity in MATLAB routines. Moreover, in the proposed learning procedure, non-support vector Lagrange multipliers go exactly to zero, as opposed to the numerical round-off errors that are characteristic of quadratic optimization packages.

- The algorithm also deals with noisy data by implementing soft classifier type decision surfaces which can take care of outliers. Most importantly, the algorithm has been extended to handle SVM regression under the same general framework, something which very few other speed-up techniques have considered.
- The SVM paradigm in the regression domain provides automatic complexity and model order selection for function approximation - a characteristic that should favor these techniques over standard RBF, Neural Network or other parametric approaches.

In addition, this work looks at the implications of the reformulation of the bias using the augmenting factor from a geometric point of view and interprets the effects seen in simulation results theoretically. The choice of the augmenting factor is also discussed. Moreover, the conditions for obtaining a balanced classifier, which are simple but conceptually important, are also derived.

The proposed algorithm is essentially a gradient ascent method on the modified dual problem, using extra clipping terms in the updates to satisfy the linear constraints. It is well known that gradient ascent/descent may be trapped in a long narrow valley if the Hessian matrix (in our context, the matrices D or R in the dual problem) has widely spread eigenvalues. In our future work, we will work on implementing more advanced gradient descent methods, like the natural gradient descent of Amari [1], to overcome this prospective problem.

Acknowledgements

We would like to thank Prof. Shunichi Amari and Dr. Noboru Murata for constructive suggestions. The authors acknowledge the support of the RIKEN Brain Science Institute for funding this project.

References

[1] S. Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251-276, 1998.

[2] J.K. Anlauf and M. Biehl. The Adatron: an adaptive perceptron algorithm. Europhysics Letters, 10(7):687-692, 1989.

[3] P. Bartlett and J. Shawe-Taylor. Generalization performance of support vector machines and other pattern classifiers. In B. Scholkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning. MIT Press, 1998.

[4] B. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings, Fifth Annual Workshop on Computational Learning Theory. ACM Press, 1992.

[5] C.J.C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 1998. (In press.)

[6] C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273-297, 1995.

[7] T.T. Friess, N. Cristianini, and C. Campbell. The kernel adatron algorithm: A fast and simple learning procedure for support vector machines. In Proceedings, 15th Intl. Conference on Machine Learning. Morgan-Kaufmann, 1998.

[8] T.T. Friess and R.F. Harrison. Support vector neural networks: The kernel adatron with bias and soft margin. Technical Report 752, The University of Sheffield, Dept. of ACSE, 1998.

[9] R.P. Gorman and T.J. Sejnowski. Finding the size of a neural network. Neural Networks, 1:75-89, 1988.

[10] T. Joachims. Making large scale SVM learning practical. In B. Scholkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning. MIT Press, 1998.

[11] Y. LeCun, B. Boser, J.S. Denker, D. Henderson, R.E. Howard, W. Hubbard, and L.J. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1:541-551, 1989.

[12] B. Scholkopf, P. Bartlett, A.J. Smola, and R. Williamson. Support vector regression with automatic accuracy control. In Proceedings, Intl. Conf. on Artificial Neural Networks (ICANN'98), volume 1, pages 111-116. Springer, 1998.

[13] B. Scholkopf, K. Sung, C. Burges, F. Girosi, P. Niyogi, T. Poggio, and V. Vapnik. Comparing Support Vector Machines with Gaussian kernels to radial basis function classifiers. A.I. Memo 1599, M.I.T. AI Labs, 1996.

[14] A.J. Smola, N. Murata, B. Scholkopf, and K.-R. Muller. Asymptotically optimal choice of ε-loss for support vector machines. In Proceedings, Intl. Conf. on Artificial Neural Networks (ICANN'98), volume 1, pages 105-110. Springer, 1998.

[15] V. Vapnik. Statistical Learning Theory. John Wiley and Sons, Inc., New York, 1997.

[16] V. Vapnik, S.E. Golowich, and A. Smola. Support vector method for function approximation, regression estimation and signal processing. In Advances in Neural Information Processing Systems, volume 9, pages 281-287, 1997.
