...aration. Experimental results on benchmark datasets have shown that the algorithm provides generalization abilities equal to the standard SVM implementations while being orders of magnitude faster. The algorithm is capable of handling noisy data and outliers through implementation of the soft classifier with an adaptable trade-off parameter. Furthermore, the algorithm has been extended to provide a fast implementation of Support Vector regression, a unified approach to classification and regression which very few other algorithms have addressed. It should be noted, in relation to the work in Friess et al. [7], that although the fundamental idea is the same and the final algorithm looks quite similar, there are conceptual differences in the way the bias and the soft margin are handled. A difference in the update procedures (clipping) can be shown to lead to an extended convergence plateau in [7], resulting in qualitatively similar results but affecting the speed of solution and the number of resulting SVMs.

Section 3 reviews the theoretical framework of SVMs and looks at the implementation of non-linear decision surfaces. Section 4 provides a new formulation for the bias and describes the sequential algorithm for classification along with rigorous justifications for its convergence. We also perform a theoretically rigorous analysis of the margin maximization which clarifies the additional conditions that need to be satisfied to obtain a balanced classifier. Section 5 compares the learning performance of the proposed method against other techniques for standard benchmark problems. Section 6 describes an extension of the algorithm for implementing SVM regression, while Section 7 summarizes the salient features of the proposed method and looks at future research directions.

3 The classification problem and Support Vectors

We start by introducing the theoretical underpinnings of Support Vector learning. Consider the problem of classifying data $\{x_i, y_i\}_{i=1}^{l}$, where the $x_i$ are the $N$-dimensional pattern vectors and $y_i \in \{-1, 1\}$ are the target values.

3.1 The linearly separable case

We first consider the case when the data is linearly separable. The classification problem can be reformulated as one of finding a hyperplane $f(w, b) = x \cdot w + b$ which separates the positive and negative examples:

$x_i \cdot w + b \ge 1$ for $y_i = 1$,  (1)
$x_i \cdot w + b \le -1$ for $y_i = -1$,  (2)

which is equivalent to

$y_i (x_i \cdot w + b) - 1 \ge 0$ for $i = 1, \ldots, l$,  (3)

where $a \cdot b \equiv \sum_i a_i b_i$ represents the dot product. All points for which the bound in eq.(1) is met with equality lie on the hyperplane $H_1: x_i \cdot w + b = 1$ with normal $w$ and perpendicular distance from the origin $d_{H_1} = (1 - b)/\|w\|$. Similarly, those points meeting eq.(2) with equality lie on the hyperplane $H_2: x_i \cdot w + b = -1$ with distance from the origin $d_{H_2} = (-1 - b)/\|w\|$. The distance between the two canonical hyperplanes, $H_1$ and $H_2$, is a quantity we refer to as the margin and is given by:

Margin $= |d_{H_1} - d_{H_2}| = 2/\|w\|$.  (4)

There are (in general) many hyperplanes satisfying the condition given in eq.(3) for a data set that is separable. To choose the one with the best generalization capabilities, we try to maximize the margin between the canonical hyperplanes [5]. This is justified from the computational learning theory perspective, since maximizing margins corresponds to minimizing the V-C dimension of the resulting classifier [15], and, from the interpretation of attractor type networks in physics, achieves maximal stability [2]. Therefore, the optimization problem to be solved becomes

Minimize $J_1[w] = \frac{1}{2}\|w\|^2$  (5)
subject to $y_i (x_i \cdot w + b) - 1 \ge 0, \quad i = 1, \ldots, l$.  (6)
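To make eqs.(1)-(6) concrete, the following fragment is a minimal sketch (not from the paper; the toy data, the candidate hyperplane and the use of numpy are our own choices) that checks the canonical constraint of eq.(3) for a given $(w, b)$ and evaluates the margin of eq.(4).

import numpy as np

# Toy linearly separable patterns and labels (illustrative values only).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# A candidate canonical hyperplane (w, b); in the paper this would be the
# solution of the optimization problem of eqs.(5)-(6).
w = np.array([0.5, 0.5])
b = 0.0

# Constraint of eq.(3): y_i (x_i . w + b) - 1 >= 0 for every pattern.
feasible = np.all(y * (X @ w + b) - 1.0 >= 0.0)

# Margin between the canonical hyperplanes H1 and H2, eq.(4): 2 / ||w||.
margin = 2.0 / np.linalg.norm(w)
print(feasible, margin)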
3.2 The linearly non-separable case

In case the data are not linearly separable, as is true in many cases, we introduce slack variables $\xi_i \ge 0$ to take care of misclassifications, such that the following constraint is satisfied:

$y_i (x_i \cdot w + b) - 1 + \xi_i \ge 0$ for $i = 1, \ldots, l$.  (7)

Here, we choose a trade-off between maximizing margins and reducing the number of misclassifications by using a user-defined parameter $C$ such that the optimization problem looks like:

Minimize $J_2[w, \xi_i] = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} \xi_i$  (8)
subject to $y_i (x_i \cdot w + b) - 1 + \xi_i \ge 0$,  (9)
$\xi_i \ge 0, \quad i = 1, \ldots, l$.  (10)

We can write this constrained optimization as an unconstrained quadratic convex programming problem by introducing Lagrange multipliers as

Minimize $L(w, b, \xi_i, h, r) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} \xi_i - \sum_{i=1}^{l} h_i \left[ y_i (x_i \cdot w + b) - 1 + \xi_i \right] - \sum_{i=1}^{l} r_i \xi_i$,  (11)

where $h = (h_1, \ldots, h_l)$ and $r = (r_1, \ldots, r_l)$ are the non-negative Lagrange multipliers.

3.3 The dual problem and sparsity of solution

The optimization problem of eq.(11) can be converted into a dual problem, which is much easier to handle [5], by looking at the saddle points of eq.(11):

Maximize $L_D(h) = \sum_{i=1}^{l} h_i - \frac{1}{2} h \cdot D h$  (12)
subject to $\sum_{i=1}^{l} y_i h_i = 0$,  (13)
$0 \le h_i \le C, \quad i = 1, \ldots, l$,  (14)

where $D_{ij} = y_i y_j \, x_i \cdot x_j$. Solutions to this dual problem are traditionally found by using standard quadratic programming packages. Once the optimal multipliers $h_i$ are found, the classifier is simply

$f(x) = \mathrm{sign}\left( \sum_i h_i y_i \, x_i \cdot x + b_0 \right)$.  (15)

Usually, based on the Kuhn-Tucker optimality conditions, the number of non-zero Lagrange multipliers $h_i$ is much smaller than the number of data points and hence the expansion for the classifier given by eq.(15) is very sparse. The data points corresponding to these non-zero Lagrange multipliers are referred to as support vectors. The bias term $b_0$ can be found using any support vector $x_{sv}$ as

$b_0 = y_{sv} - \sum_{i \in SV} h_i y_i \, x_i \cdot x_{sv}$.  (16)
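As a concrete reading of eqs.(15) and (16), the sketch below (our own illustration, not code from the paper; the multiplier vector h is assumed to have been produced by a QP solver or by the sequential scheme of Section 4) evaluates the sparse linear classifier and recovers the bias from one support vector.

import numpy as np

def linear_svm_decision(X, y, h, x_new, sv_tol=1e-8):
    # Support vectors are the training points with non-zero multipliers.
    sv = h > sv_tol
    # Bias recovery, eq.(16), using an arbitrary support vector x_sv.
    s = int(np.argmax(sv))
    b0 = y[s] - np.sum(h[sv] * y[sv] * (X[sv] @ X[s]))
    # Sparse expansion of eq.(15): only the support vectors contribute.
    return np.sign(np.sum(h[sv] * y[sv] * (X[sv] @ x_new)) + b0)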
3.4 Extension to non-linear decision surfaces

Obviously, the power of linear decision surfaces is very limited. SVMs provide a convenient way of extending the analysis from the input space to a non-linear feature space by using a high-dimensional mapping $\Phi(x)$. Finding a linear separating hyperplane in this feature space is equivalent to finding a non-linear decision boundary in the input space.

It should be noted that the input vectors appear only in the form of dot products $\Phi(x_i) \cdot \Phi(x_j)$ in the dual problem and its solution. By using the reproducing kernels of a Hilbert space, we can rewrite this dot product of two feature vectors in terms of the kernel, provided some (non-trivial) conditions are satisfied:

$\Phi(x_i) \cdot \Phi(x_j) = K(x_i, x_j) \equiv \sum_k \phi_k(x_i)\, \phi_k(x_j)$.  (17)

Some common examples of kernels are the Gaussian RBF kernel, given as

$K(x, x') \equiv e^{-\|x - x'\|^2 / 2\sigma^2}$,  (18)

and the polynomial kernel of degree $d$, written as

$K(x, x') \equiv (x \cdot x' + 1)^d$.  (19)

The use of kernels instead of dot products in the optimization provides an automatic implementation of representing hyperplanes in the feature space rather than the input space. For further details on SVMs and the use of kernels, the reader is referred to the tutorial by Burges [5].

4 Sequential learning of support vectors for classification

The main aim of this work is to implement the support vectors in a fast, simple, sequential scheme. To this effect, we make a slight modification in the problem formulation.

4.1 An alternate formulation for bias

In the previous sections, the bias has been represented explicitly using an additional variable $b$. From now on, we will use an additional dimension for the pattern vectors such that $x' = (x_1, \ldots, x_N, \lambda)$ and incorporate the bias term in the weight vector, $w' = (w_1, \ldots, w_N, b/\lambda)$, where $\lambda$ is a scalar constant. The choice of the magnitude of the augmenting factor is discussed in Section 4.1.1.

Based on this modification, we write an analogous optimization problem:

Minimize $L'(w', \xi_i) = \frac{1}{2}\|w'\|^2 + C \sum_{i=1}^{l} \xi_i$  (20)
subject to $y_i (x'_i \cdot w') - 1 + \xi_i \ge 0$,  (21)
$\xi_i \ge 0, \quad i = 1, \ldots, l$.  (22)

If we look carefully, the margin we are trying to maximize is slightly different from the original problem. Instead of maximizing $2/\|w\|$, or minimizing $\|w\|^2$, we are minimizing $\|w'\|^2 = \|w\|^2 + b^2/\lambda^2$. We now look at the implications of this approximation.
4.1.1 Effect of the modified bias formulation on margin maximization

Here, we look at the implications of maximizing a modified margin (refer to Section 4.1), which involves using an input vector augmented with a scalar constant $\lambda$. For the sake of analysis, let us assume that the augmenting factor $\lambda = 1$. Fig. 1 gives a schematic diagram of a linearly separable classification problem in a higher dimensional space after the input vector has been augmented.

Figure 1: Visualization of the modified margin in higher dimensional space.

The margin being maximized in the original problem with explicit bias, based on eq.(4), is

$d^2 = 1/\|w\|^2$.  (23)

The modified margin $\tilde{d}$ is the closest distance from the data points to a hyperplane in the higher dimensional space passing through the origin. Based on simple geometry, it can be seen from Fig. 1 that

$\tilde{d}^2 = d^2 \cos^2\phi$.  (24)

The angle $\phi$ satisfies the following relationship:

$\cos^2\phi = 1/(1 + m^2)$, where $m = b/\|w\|$.  (25)

Therefore, eqs.(23), (24) and (25) yield an explicit relationship for the modified margin:

$\tilde{d}^2 = 1/(\|w\|^2 + b^2)$.  (26)

So, maximizing the modified margin will, in general, only give an approximate solution of the original optimization problem. In case the input dimension is high, the contribution of the bias term in the margin $1/(\|w\|^2 + b^2)$ is negligible and the approximation will be very good, as confirmed in our simulation experiments.

From eq.(24), we see that if the angle $\phi$ (refer to Fig. 1) is smaller, the approximation tends to be better. This can be ensured by augmenting the input vector by a `larger' number instead of `1'. To prove this point and also show how good the approximation is, we carried out simulations on an artificial two-dimensional linearly separable data set. One data cluster is randomly generated in the range $[-1, 1] \times [-1, 1]$ with uniform distribution. The other one is obtained in the same way in the range $[1.5, 3.5] \times [1.5, 3.5]$. It is easy to see that the normal of the optimal classifying hyperplane $w^*$ can be obtained in the following way: select a pair of data points, say $x_1$ and $x_2$, from different clusters such that the distance between them is the minimum among all such pairs. $w^*$ can then be expressed as $w^* = (x_2 - x_1)/\|x_2 - x_1\|$. Let the solution achieved through SVMseq be $w$. The angle $\phi$ between $w$ and $w^*$, which indicates how good the approximation is, can be written as

$\phi = \arccos \dfrac{w \cdot w^*}{\|w\|\,\|w^*\|}$.  (27)

Table 1: Effect of the augmenting factor on approximation quality

Augmenting factor ($\lambda$)    $\phi$ (degrees)
0.1                              31.823032
0.5                              16.999935
1.0                               6.831879
2.0                               0.977416
5.0                               0.275545

Table 1 compares the effect of the augmenting factor ($\lambda$) on the approximation quality. We see that the approximation improves vastly with the increase of the magnitude of the augmenting factor. The simulations used a fixed set of 50 data points for the experiments. In practice, however, there is a trade-off resulting from increasing the augmenting factor: a large augmenting factor will normally lead to slower convergence speeds and instability in the learning process. An appropriate solution is to choose an augmenting factor having the same magnitude as the input vector. However, while using non-linear kernels, the effect of the augmenting factor is not directly interpretable as above and depends on the exact kernels we use. We have noticed that setting the augmenting term to zero (equivalent to neglecting the bias term) in high dimensional kernels gives satisfactory results on real world data.
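The two-cluster simulation and the angle of eq.(27) can be reproduced with a few lines of numpy. The sketch below is our own reconstruction of the setup described above (the random seed, the cluster sizes and the placeholder weight vector w are assumptions; in the experiment w would be the SVMseq solution).

import numpy as np

rng = np.random.default_rng(0)

# Two linearly separable 2-D clusters, generated as described in the text.
X_neg = rng.uniform(-1.0, 1.0, size=(25, 2))
X_pos = rng.uniform(1.5, 3.5, size=(25, 2))

# Normal of the optimal classifying hyperplane w*: normalize the difference
# of the closest pair of points taken from different clusters.
diffs = X_pos[:, None, :] - X_neg[None, :, :]
i, j = np.unravel_index(np.argmin(np.linalg.norm(diffs, axis=2)), diffs.shape[:2])
w_star = (X_pos[i] - X_neg[j]) / np.linalg.norm(X_pos[i] - X_neg[j])

# Angle of eq.(27) between w* and a trained weight vector w (placeholder here).
w = np.array([0.7, 0.7])
phi = np.degrees(np.arccos(w @ w_star / (np.linalg.norm(w) * np.linalg.norm(w_star))))
print(f"phi = {phi:.3f} degrees")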
4.1.2 Obtaining a balanced classifier

In the case when the data is linearly separable, a balanced classifier is defined as the hyperplane ($f(w, b) = x \cdot w + b = 0$) that lies in the middle of the two clusters; more precisely, the bias term $b$ is given as

$b = -\frac{1}{2}\left( x_i^+ \cdot w + x_i^- \cdot w \right)$,  (28)

where $x_i^+$ and $x_i^-$ are any two support vectors from the two clusters, respectively. Certainly, a balanced classifier is superior to an unbalanced one. For the original problem, a balanced classifier is automatically obtained. However, while maximizing the modified margin, this is no longer the case: a constraint on the magnitude of $w$ needs to be satisfied.

Suppose there is an unbalanced classifier $f(w_0, b_0)$. Without loss of generality, we assume it is closer to the negative cluster, that is,

$x_i^+ \cdot w_0 + b_0 = c, \quad c > 1$,
$x_i^- \cdot w_0 + b_0 = -1$.

It is easy to check, by using the transformation $w^* = \frac{2}{1+c} w_0$ and $b^* = \frac{2}{1+c} b_0 + \frac{1-c}{1+c}$, that we can get a balanced classifier $f(w^*, b^*)$ parallel to the unbalanced one:

$x_i^+ \cdot w^* + b^* = 1$,
$x_i^- \cdot w^* + b^* = -1$.

So, we can construct a family of parallel classifiers parameterized by $c$, for $c \ge 1$, in the following way:

$w(c) = \frac{1+c}{2} w^*$,  (29)
$b(c) = \frac{1+c}{2} b^* - \frac{1-c}{2}$,  (30)

such that

$x_i^+ \cdot w(c) + b(c) = c, \quad c \ge 1$,  (31)
$x_i^- \cdot w(c) + b(c) = -1$.  (32)

Here, $f(w(1), b(1))$ gives the balanced classifier, which corresponds to $c = 1$ and $f(w^*, b^*)$. For $c > 1$, an unbalanced classifier closer to the negative cluster is obtained. We will, for the sake of simplicity, assume the augmenting factor to be 1. Then, SVMseq minimizes $\|w\|^2 + b^2$, and the balanced member of this family is picked out only when the constraint on the magnitude of $w$ mentioned above is met. In practice, this is easy to satisfy by just scaling the input data to smaller values, an automatic result of which leads to larger magnitudes of $w$.
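The transformation from an unbalanced to a balanced classifier stated above can be verified numerically. The following sketch (toy numbers of our own choosing, not an experiment from the paper) confirms that $w^* = \frac{2}{1+c} w_0$ and $b^* = \frac{2}{1+c} b_0 + \frac{1-c}{1+c}$ place the two support vectors exactly on the canonical hyperplanes.

import numpy as np

# A hypothetical unbalanced classifier (w0, b0): the positive support vector
# satisfies x+ . w0 + b0 = c > 1, the negative one x- . w0 + b0 = -1.
x_pos = np.array([2.0, 1.0])
x_neg = np.array([-0.5, -0.5])
w0 = np.array([1.0, 1.0])
b0 = 0.0
c = x_pos @ w0 + b0            # here c = 3
assert np.isclose(x_neg @ w0 + b0, -1.0) and c > 1.0

# Rebalancing transformation described in Section 4.1.2.
w_bal = 2.0 / (1.0 + c) * w0
b_bal = 2.0 / (1.0 + c) * b0 + (1.0 - c) / (1.0 + c)

assert np.isclose(x_pos @ w_bal + b_bal, 1.0)    # positive canonical hyperplane
assert np.isclose(x_neg @ w_bal + b_bal, -1.0)   # negative canonical hyperplane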
tation and from now on assume that vectors x and ma-
x+i w + b = 1; trix D refers to the modied input vector augmented
x?i w + b = ?1: with the scalar and modied matrix, respectively,
as described in the beginning of this section.
So, we can construct a family of parallel classiers
parameterized by c, for c 1 in the following way: 4.2 SVMseq for classication
w(c) = 1 +2 c w; (29) In this section, we provide the pseudocode for the
sequential learning algorithm. The algorithm, on
b(c) = 1 +2 c b ? 1 ?2 c ; (30) convergence, provides the optimal set of Lagrange
multipliers hi for the classication problem in high-
such that dimensional feature space which corresponds to the
maximum of the cost function (35).
xi w(c) + b(c) = c; c > 1
+
(31)
x?i w(c) + b(c) = ?1: (32) Sequential Algorithm for Classication
Here, f (w(1); b(1)) gives the balanced classier, which 1. Initialize hi = 0. Compute matrix [D]ij =
corresponds to c = 1 and f (w ; b ). For c > 1, an un- yi yj (K (xi ; xj ) + ) for i; j = 1; : : : ; l.
2
balanced classier closer to the negative cluster is ob- 2. For each pattern, i=1 to l, compute
2.1 Ei = Plj hj Dij .
tained. We will, for the sake of simplicity, assume the
augmenting factor to be 1. Then, SVMseq minimizes
kwk2 + b2 . So, when =1
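The sequential algorithm above translates almost line by line into code. The following is an illustrative numpy sketch (our own, not the authors' implementation; the kernel argument, the constants C and lam, the learning rate and the stopping test on the update size are choices made here in line with the bound of Section 4.3).

import numpy as np

def svmseq_classify(X, y, kernel, C=10.0, lam=1.0, n_epochs=100, tol=1e-6):
    """Sequential SVM learning (SVMseq) for classification, Section 4.2."""
    l = len(y)
    K = np.array([[kernel(X[i], X[j]) for j in range(l)] for i in range(l)])
    D = np.outer(y, y) * (K + lam ** 2)     # step 1: D_ij = y_i y_j (K(x_i, x_j) + lam^2)
    h = np.zeros(l)                         # step 1: h_i = 0
    eta = 1.9 / np.max(np.diag(D))          # learning rate kept below 2 / max_i D_ii

    for _ in range(n_epochs):
        max_step = 0.0
        for i in range(l):                  # step 2: sweep over the patterns
            E_i = D[i] @ h                  # step 2.1
            dh = min(max(eta * (1.0 - E_i), -h[i]), C - h[i])   # step 2.2 (clipped update)
            h[i] += dh                      # step 2.3
            max_step = max(max_step, abs(dh))
        if max_step < tol:                  # step 3: stop once the updates vanish
            break

    sv = h > 1e-8                           # support vectors: non-zero multipliers
    def decision(x):
        k = np.array([kernel(X[j], x) for j in range(l)])
        return np.sign(np.sum(h[sv] * y[sv] * (k[sv] + lam ** 2)))   # eq.(38)
    return h, decision

For instance, a linear machine on the toy data of the Section 3.1 sketch could be trained with h, f = svmseq_classify(X, y, lambda a, b: float(a @ b)).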
4.3 Rigorous proofs of convergence

Now we prove that the learning algorithm can indeed find the optimal solution of the dual problem given by eq.(35). The proof involves showing that the constraint of eq.(36) is always satisfied and that the algorithm converges to the global maximum of the cost function given by eq.(35).

4.3.1 The constraint (36) is always satisfied during the learning process

For all the possible updates of $h_i$, we have:

1. If $\delta h_i^t = C - h_i^t$, then $h_i^{t+1} = h_i^t + \delta h_i^t = C$.
2. If $\delta h_i^t = \max\{\eta(1 - E_i^t), -h_i^t\}$, then $h_i^{t+1} = h_i^t + \delta h_i^t \ge h_i^t - h_i^t = 0$. Since, in this case, $C - h_i^t \ge \max\{\eta(1 - E_i^t), -h_i^t\}$, we also have $h_i^{t+1} \le h_i^t + C - h_i^t = C$.

So, for any initial values $h_i^0$, the $h_i^1$ will satisfy the constraint (36) after just one learning step and will maintain this property forever.

4.3.2 The learning process monotonically increases $L'_D(h)$ and stops when a local maximum is reached

Define $a \equiv \max_i D_{ii}$. We prove that for a learning rate bounded by $0 < \eta < 2/a$, $L'_D(h)$ will monotonically increase during the learning process:

$\Delta L'_D = L'_D(h_i^t + \delta h_i^t) - L'_D(h_i^t) = \delta h_i^t - \frac{1}{2} \sum_j h_j^t D_{ji} \, \delta h_i^t - \frac{1}{2} \sum_j D_{ij} h_j^t \, \delta h_i^t - \frac{1}{2} D_{ii} (\delta h_i^t)^2$.

Since $D_{ij} = D_{ji}$ and $\sum_j D_{ij} h_j^t = E_i^t$, we have

$\Delta L'_D = \delta h_i^t \left( 1 - E_i^t - \frac{1}{2} D_{ii} \, \delta h_i^t \right)$.  (41)

Now, consider the three possible updates of $\delta h_i^t$.

Case 1: $\delta h_i^t = \eta(1 - E_i^t)$. Substituting in eq.(41) gives $\Delta L'_D = \eta (1 - E_i^t)^2 \left( 1 - \frac{1}{2}\eta D_{ii} \right) \ge 0$, since $\eta < 2/a \le 2/D_{ii}$.

Case 2: $\delta h_i^t = -h_i^t$. This case arises only when $-h_i^t \ge \eta(1 - E_i^t)$, i.e., $h_i^t \le \eta(E_i^t - 1)$. Substituting in eq.(41),

$\Delta L'_D = -h_i^t \left( 1 - E_i^t + \frac{1}{2} D_{ii} h_i^t \right)$.  (42)

Notice that

$1 - E_i^t + \frac{1}{2} D_{ii} h_i^t \le 1 - E_i^t + \frac{1}{2} \eta D_{ii} (E_i^t - 1) = (1 - E_i^t)\left( 1 - \frac{1}{2}\eta D_{ii} \right) \le 0$.  (43)

So, from eqs.(42) and (43), $\Delta L'_D \ge 0$.

Case 3: $\delta h_i^t = C - h_i^t$. This case arises only when $C - h_i^t \le \eta(1 - E_i^t)$. Remember $h_i^t \le C$, which implies $E_i^t \le 1$ in this case. Therefore,

$\Delta L'_D = (C - h_i^t)\left[ 1 - E_i^t - \frac{1}{2} D_{ii} (C - h_i^t) \right] \ge (C - h_i^t)\left[ 1 - E_i^t - \frac{1}{2} \eta D_{ii} (1 - E_i^t) \right] = (C - h_i^t)(1 - E_i^t)\left( 1 - \frac{1}{2}\eta D_{ii} \right) \ge 0$.

So, for all three possible values of $\delta h_i$, $L'_D(h)$ will monotonically increase, and the process stops when a local maximum is reached. In the following we show that the local maximum is also the global one.

4.3.3 Every local maximum is also the global one

Since $\sum_{ij} h_i D_{ij} h_j = \left( \sum_i h_i y_i x_i \right) \cdot \left( \sum_i h_i y_i x_i \right) \ge 0$ for any $h$, $D$ is a positive semidefinite matrix, and $L'_D(h)$ is therefore a concave function. For the linear constraints that arise in this optimization problem, we show that each local maximum is equivalent to the global one. Suppose $h_1$ is a local maximum point, while $h_2$ is the global maximum which has a larger value, i.e., $L'_D(h_2) > L'_D(h_1)$. We can construct a straight line connecting $h_1$ and $h_2$, parameterized as $h(\theta) = (1 - \theta)h_1 + \theta h_2$ for $0 \le \theta \le 1$. Since the constraints are linear, all the points on the line are feasible solutions. On the other hand, since $L'_D(h)$ is concave, $L'_D(h(\theta)) \ge (1 - \theta) L'_D(h_1) + \theta L'_D(h_2)$. For $\theta > 0$, $L'_D(h(\theta)) > L'_D(h_1)$, which contradicts the assumption that $h_1$ is a local maximum. So $h_1$ must also be the global maximum solution.
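The monotonic increase of $L'_D(h)$ guaranteed by Section 4.3.2 can also be checked empirically. The sketch below (random data, a linear kernel and a learning rate of our own choosing, purely for illustration) evaluates the dual objective of eq.(35) after every single-coefficient update and asserts that it never decreases.

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))
y = np.where(X[:, 0] + X[:, 1] > 0.0, 1.0, -1.0)   # an arbitrary labelling

lam, C = 1.0, 10.0
D = np.outer(y, y) * (X @ X.T + lam ** 2)          # linear kernel plus bias term
h = np.zeros(len(y))
eta = 1.9 / np.max(np.diag(D))                     # respects 0 < eta < 2/a

def dual(h):
    # L'_D(h) of eq.(35): sum_i h_i - (1/2) h . D h
    return np.sum(h) - 0.5 * h @ D @ h

prev = dual(h)
for _ in range(50):
    for i in range(len(y)):
        E_i = D[i] @ h
        h[i] += min(max(eta * (1.0 - E_i), -h[i]), C - h[i])
        cur = dual(h)
        assert cur >= prev - 1e-10                 # each clipped update is non-decreasing
        prev = cur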
Figure 2: Sonar data classification using SVMseq. (a) Training and generalization error (in percentage of the data) obtained using various Gaussian kernel parameters $\sigma$. (b) Generalization error vs. the number of training epochs on the SONAR classification benchmark, plotted for three different kernel parameters ($\sigma$ = 0.3, 0.6, 0.7).
Figure 3: Handwritten character recognition: training error, generalization error (in percentage of the data) and the normalized cost function traced against the number of training epochs for (a) the USPS 16x16 pixel database and (b) the MNIST 28x28 pixel database.
Table 4: USPS dataset two-class classification comparisons: number of test errors (out of 2007 test patterns)

Digits                        0    1    2    3    4    5    6    7    8    9
Classical RBF                20   16   43   38   46   31   15   18   37   26
Optimally tuned SVM          16   12   25   24   32   24   14   16   26   16
SVMseq (poly. deg. = 3)      19   14   31   29   33   30   17   16   34   26
SVMseq (poly. deg. = 4)      14   13   30   28   29   27   14   16   28   19
SVMseq (RBF, σ = 3.5)        11    7   25   24   30   25   17   14   24   14
... training error for various values of the kernel parameter in Fig. 2(a). It is noted that the generalization error is flat in a fairly broad region of the parameter space. Fig. 2(b) shows that the algorithm has a very fast convergence speed (in terms of training epochs) if we are in the region with an approximately correct kernel parameter.

5.2 USPS and MNIST handwritten character classification

The next classification task we considered used the USPS database of 9298 handwritten digits (7291 for training, 2007 for testing) collected from mail envelopes in Buffalo [11]. Each digit is a 16x16 vector with entries between -1 and +1. Preprocessing consists of smoothing with a Gaussian kernel of width $\sigma = 0.75$. We used polynomial kernels of third and fourth degree and RBF kernels to generate the non-linear decision surface.

Here, we consider the task of two-class binary classification, i.e., separating one digit from the rest of the digits. Table 4 shows comparisons of the generalization performance of SVMseq using polynomial and RBF kernels against the optimally tuned SVM with RBF kernels (performance reported in Scholkopf et al. [13]) and classical RBF networks. We see that in spite of absolutely no tuning of parameters, SVMseq achieves much better results than classical RBF and results comparable to those of the best tuned standard SVMs.

Fig. 3(a) shows a typical learning performance, in this case for the classification of digit `0', using SVMseq. The training error drops to zero very fast (in a few epochs) and the test error also reaches a low of around 0.7%, which corresponds to an average of about 14 test errors. The plot of the cost function $L'_D(h)$, which is an indicator of algorithm convergence, is also shown to reach its maximum very quickly.
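For the two-class experiments of Section 5.2, the data preparation amounts to smoothing each digit image and relabelling one class against the rest. The helper below is a hypothetical sketch of that step (the function name, the scipy-based smoothing and the assumed array layout of images and digits are our own, not the authors' preprocessing code).

import numpy as np
from scipy.ndimage import gaussian_filter

def make_one_vs_rest(images, digits, target=0, sigma=0.75):
    # Smooth each 16x16 digit image with a Gaussian of width sigma and label
    # the target digit +1 and all remaining digits -1.
    X = np.array([gaussian_filter(img.reshape(16, 16), sigma=sigma).ravel()
                  for img in images])
    y = np.where(digits == target, 1.0, -1.0)
    return X, y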
6 SV regression and its sequential implementation

In this section we proceed to show how the idea can be extended to incorporate SVM regression. The basic idea of the SV algorithm for regression estimation is to compute a linear function in some high dimensional feature space (which possesses a dot product) and thereby compute a non-linear function in the space of the input data. Working in the spirit of the SRM principle, we minimize the empirical risk and maximize the margin by solving the following constrained optimization problem [16]:

Minimize $J(\xi, \xi^*, w) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} (\xi_i + \xi_i^*)$  (45)
subject to $y_i - w \cdot x_i \le \epsilon + \xi_i, \quad i = 1, \ldots, l$,  (46)
$w \cdot x_i - y_i \le \epsilon + \xi_i^*, \quad i = 1, \ldots, l$,  (47)
$\xi_i, \xi_i^* \ge 0, \quad i = 1, \ldots, l$.  (48)

Here, $x$ refers to the input vector augmented with the augmenting factor $\lambda$ to handle the bias term, $\xi_i, \xi_i^*$ are the slack variables, and $\epsilon$ is a user-defined insensitivity bound for the error function [16]. The corresponding modified dual problem works out to be:

Maximize $J_D(\alpha, \alpha^*) = -\frac{1}{2} \sum_{i,j=1}^{l} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\, x_i \cdot x_j - \epsilon \sum_{i=1}^{l} (\alpha_i + \alpha_i^*) + \sum_{i=1}^{l} y_i (\alpha_i - \alpha_i^*)$  (49)
subject to $0 \le \alpha_i, \alpha_i^* \le C, \quad i = 1, \ldots, l$,  (50)

where $\alpha_i, \alpha_i^*$ are non-negative Lagrange multipliers. The solution to this optimization problem is traditionally obtained using quadratic programming packages. The optimal approximation surface using the modified formulation, after extending the SVM to the nonlinear case, is given as

$f(x) = \sum_{i=1}^{l} (\alpha_i - \alpha_i^*) \left( K(x_i, x) + \lambda^2 \right)$.  (51)

As in the classification case, only some of the coefficients $(\alpha_i - \alpha_i^*)$ are non-zero, and the corresponding data points are called support vectors.

6.1 SVMseq algorithm for regression

We extend the sequential learning scheme devised for the classification problem to derive a sequential update scheme which can obtain the solution to the optimization problem of eq.(49) for SVM regression. A code sketch follows the algorithm.

Sequential Algorithm for Regression
1. Initialize $\alpha_i = 0$, $\alpha_i^* = 0$. Compute $[R]_{ij} = K(x_i, x_j) + \lambda^2$ for $i, j = 1, \ldots, l$.
2. For each training point, $i = 1$ to $l$, compute
   2.1 $E_i = y_i - \sum_{j=1}^{l} (\alpha_j - \alpha_j^*) R_{ij}$.
   2.2 $\delta \alpha_i = \min\{ \max[\eta(E_i - \epsilon), -\alpha_i], \; C - \alpha_i \}$,
       $\delta \alpha_i^* = \min\{ \max[\eta(-E_i - \epsilon), -\alpha_i^*], \; C - \alpha_i^* \}$.
   2.3 $\alpha_i = \alpha_i + \delta \alpha_i$,
       $\alpha_i^* = \alpha_i^* + \delta \alpha_i^*$.
3. If the training has converged, then stop, else go to step 2.

Similar to the classification case, any non-linear kernel satisfying Mercer's condition can be used. We have rigorously proved the convergence of the algorithm for the regression case, a result which will be part of another publication with detailed experimental evaluations on benchmark regression problems. Here, the user-defined error insensitivity parameter $\epsilon$ controls the balance between the sparseness of the solution and the closeness to the training data. Recent works on finding the asymptotically optimal choice of the $\epsilon$-loss [14] and the soft $\epsilon$-tube [12] give theoretical foundations for the optimal choice of the $\epsilon$-loss function.
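Mirroring the classification sketch, the regression algorithm above can be written compactly as follows (again an illustrative sketch of our own, not the authors' code; C, lam, eps, the learning rate and the convergence test are assumed values).

import numpy as np

def svmseq_regress(X, y, kernel, C=10.0, lam=1.0, eps=0.1, n_epochs=200, tol=1e-6):
    """Sequential SVM regression (Section 6.1): returns alpha, alpha* and a
    predictor that follows eq.(51)."""
    l = len(y)
    R = np.array([[kernel(X[i], X[j]) + lam ** 2 for j in range(l)]
                  for i in range(l)])                        # step 1
    a, a_star = np.zeros(l), np.zeros(l)                     # step 1
    eta = 1.9 / np.max(np.diag(R))                           # learning rate (our choice)

    for _ in range(n_epochs):
        max_step = 0.0
        for i in range(l):                                   # step 2
            E_i = y[i] - R[i] @ (a - a_star)                 # step 2.1
            da = min(max(eta * (E_i - eps), -a[i]), C - a[i])            # step 2.2
            da_star = min(max(eta * (-E_i - eps), -a_star[i]), C - a_star[i])
            a[i] += da                                       # step 2.3
            a_star[i] += da_star
            max_step = max(max_step, abs(da), abs(da_star))
        if max_step < tol:                                   # step 3
            break

    def predict(x):
        k = np.array([kernel(X[j], x) + lam ** 2 for j in range(l)])
        return float((a - a_star) @ k)                       # eq.(51)
    return a, a_star, predict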
7 Discussion and concluding remarks

By modifying the formulation of the bias term, we were able to come up with a sequential update based learning scheme which solves the SVM classification and regression optimization problems with the added advantages of conceptual simplicity and huge computational gains.

- By considering a modified dual problem, the global constraint in the optimization problem, $\sum_i h_i y_i = 0$, which is difficult to satisfy in a sequential scheme, is circumvented by absorbing it into the maximization functional.

- Unlike many other gradient ascent type algorithms, SVMseq is guaranteed to converge (essentially due to the nature of the optimization problem) very fast in terms of training epochs. Moreover, we have derived a theoretical upper bound for the learning rate, hence ensuring that there are very few open parameters to be handled.

- The algorithm has been shown to be computationally more efficient than conventional SVM routines through comparisons of the computational complexity in MATLAB routines. Moreover, in the proposed learning procedure, non-support vector Lagrangian values go exactly to zero, as opposed to the numerical round-off errors that are characteristic of quadratic optimization routine packages.

- The algorithm also deals with noisy data by implementing soft classifier type decision surfaces which can take care of outliers. Most importantly, the algorithm has been extended to handle SVM regression under the same general framework, something which very few other speed-up techniques have considered.

- The SVM paradigm in the regression domain provides automatic complexity and model order selection for function approximation - a characteristic that should favor these techniques over standard RBF, Neural Network or other parametric approaches.

In addition, this work looks at the implications of the reformulation of the bias using the augmenting factor $\lambda$ from a geometric point of view and interprets the effects seen in the simulation results theoretically. The choice of the augmenting factor is also discussed. Moreover, the condition for obtaining a balanced classifier, which is simple but conceptually important, is also derived.

The proposed algorithm is essentially a gradient ascent method on the modified dual problem, using extra clipping terms in the updates to satisfy the linear constraints. It is well known that gradient ascent/descent may be trapped in a long narrow valley if the Hessian matrix (in our context, the matrices D or R in the dual problem) has widely spread eigenvalues. In our future work, we will work on implementing more advanced gradient descent methods, like the natural gradient descent of Amari [1], to overcome this prospective problem.

Acknowledgements

We would like to thank Prof. Shunichi Amari and Dr. Noboru Murata for constructive suggestions. The authors acknowledge the support of the RIKEN Brain Science Institute for funding this project.

References

[1] S. Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251-276, 1998.
[2] J.K. Anlauf and M. Biehl. The Adatron: an adaptive perceptron algorithm. Europhysics Letters, 10(7):687-692, 1989.
[3] P. Bartlett and J. Shawe-Taylor. Generalization performance of support vector machines and other pattern classifiers. In B. Scholkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning. MIT Press, 1998.
[4] B. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings, Fifth Annual Workshop on Computational Learning Theory. ACM Press, 1992.
[5] C.J.C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 1998. (In press.)
[6] C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273-297, 1995.
[7] T.T. Friess, N. Cristianini, and C. Campbell. The kernel adatron algorithm: a fast and simple learning procedure for support vector machines. In Proceedings, 15th Intl. Conference on Machine Learning. Morgan Kaufmann, 1998.
[8] T.T. Friess and R.F. Harrison. Support vector neural networks: the kernel adatron with bias and soft margin. Technical Report 752, The Univ. of Sheffield, Dept. of ACSE, 1998.
[9] R.P. Gorman and T.J. Sejnowski. Finding the size of a neural network. Neural Networks, 1:75-89, 1988.
[10] T. Joachims. Making large scale SVM learning practical. In B. Scholkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning. MIT Press, 1998.
[11] Y. LeCun, B. Boser, J.S. Denker, D. Henderson, R.E. Howard, W. Hubbard, and L.J. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1:541-551, 1989.
[12] B. Scholkopf, P. Bartlett, A.J. Smola, and R. Williamson. Support vector regression with automatic accuracy control. In Proceedings, Intl. Conf. on Artificial Neural Networks (ICANN'98), volume 1, pages 111-116. Springer, 1998.
[13] B. Scholkopf, K. Sung, C. Burges, F. Girosi, P. Niyogi, T. Poggio, and V. Vapnik. Comparing support vector machines with Gaussian kernels to radial basis function classifiers. A.I. Memo 1599, M.I.T. AI Labs, 1996.
[14] A.J. Smola, N. Murata, B. Scholkopf, and K.-R. Muller. Asymptotically optimal choice of $\epsilon$-loss for support vector machines. In Proceedings, Intl. Conf. on Artificial Neural Networks (ICANN'98), volume 1, pages 105-110. Springer, 1998.
[15] V. Vapnik. Statistical Learning Theory. John Wiley and Sons, Inc., New York, 1997.
[16] V. Vapnik, S.E. Golowich, and A. Smola. Support vector method for function approximation, regression estimation and signal processing. In Advances in Neural Information Processing Systems, volume 9, pages 281-287, 1997.