
Support Vector Machine (SVM)

p. 205

First papers on SVM


A Training Algorithm for Optimal Margin Classifiers, Bernhard E. Boser, Isabelle M. Guyon and Vladimir Vapnik, Proceedings of the Fifth Annual Workshop on Computational Learning Theory (COLT), 1992
http://www.clopinet.com/isabelle/Papers/colt92.ps.Z

Support-Vector Networks, Corinna Cortes and Vladimir Vapnik, Machine Learning, 20(3):273-297, 1995
http://homepage.mac.com/corinnacortes/papers/SVM.ps

Extracting Support Data for a Given Task, Bernhard Schölkopf, Christopher J.C. Burges and Vladimir Vapnik, First International Conference on Knowledge Discovery & Data Mining, 1995
http://research.microsoft.com/cburges/papers/kdd95.ps.gz
p. 206

Literature (Papers, Book Chapter, Technical Report)

Statistical Learning and Kernel Methods, Bernhard Schölkopf, Microsoft Research Technical Report, 2000
ftp://ftp.research.microsoft.com/pub/tr/tr-2000-23.pdf

An Introduction to Kernel-Based Learning Algorithms, K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda and B. Schölkopf, IEEE Transactions on Neural Networks, 12(2):181-201, 2001
http://ieeexplore.ieee.org/iel5/72/19749/00914517.pdf

Support Vector Machines, S. Mika, C. Schäfer, P. Laskov, D. Tax and K.-R. Müller, book chapter in: Handbook of Computational Statistics (Concepts and Methods)
http://www.xplore-stat.de/ebooks/scripts/csa/html/

Support Vector Machines for Classification and Regression, S.R. Gunn, Technical Report
http://www.ecs.soton.ac.uk/srg/publications/pdf/SVM.pdf
p. 207

Literature (Paper, Books)


A Tutorial on Support Vector Machines for Pattern Recognition, Christopher J.C. Burges, Data Mining and Knowledge Discovery, 2(2):121-167, Kluwer Academic Publishers, Boston, 1998
http://research.microsoft.com/cburges/papers/SVMTutorial.pdf

Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, Bernhard Schölkopf and Alexander J. Smola, MIT Press, 2001

An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Nello Cristianini and John Shawe-Taylor, Cambridge University Press, 2000

Kernel Methods for Pattern Analysis, John Shawe-Taylor and Nello Cristianini, Cambridge University Press, 2004

The Nature of Statistical Learning Theory, Vladimir N. Vapnik, Springer-Verlag, second edition, 1999
p. 208

Literature (WWW and Source Code)

http://www.kernel-machines.org/

http://www.support-vector.net/

http://www.learning-with-kernels.org/

http://www.pascal-network.org/

http://www.r-project.org/ (packages e1071, kernlab)

p. 209

Scaling Freedom (Problem)


[Figure: a separating hyperplane {x | w · x + b = 0} with normal vector w, dividing the space into the half-spaces {x | w · x + b > 0} and {x | w · x + b < 0}.]

Multiplying w and b by the same constant λ gives the same hyperplane, represented in terms of different parameters:

λ[(w · x) + b] = 0  ⟺  ((λw) · x) + λb = 0  ⟺  (w' · x) + b' = 0

Example: w := (5, −4)ᵀ, b := 2:

5x − 4y + 2 = 0  ⟺  y = (5/4)x + 1/2;    with λ := 3:
15x − 12y + 6 = 0  ⟺  y = (5/4)x + 1/2
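This invariance is easy to verify numerically; a small R sketch (the random points and the factor λ = 3 are arbitrary choices, not part of the original slides):

w <- c(5, -4); b <- 2; lambda <- 3
X <- matrix(rnorm(20), ncol = 2)                    # 10 random points in R^2
labels1 <- sign(X %*% w + b)                        # decision with (w, b)
labels2 <- sign(X %*% (lambda * w) + lambda * b)    # decision with (3w, 3b)
all(labels1 == labels2)                             # TRUE: same hyperplane, same labels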

p. 210

Canonical Hyperplane
A canonical hyperplane with respect to m given examples
(x1, y1), ..., (xm, ym) ∈ X × {±1} is defined as a function

f(x) = (w · x) + b

where w is normalized such that

min_{i=1,...,m} |f(xi)| = 1,

i.e. w and b are scaled such that the points closest to the
hyperplane satisfy |(w · xi) + b| = 1.
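A small R sketch of this rescaling (illustrative only; the helper name is mine and it assumes no training point lies exactly on the hyperplane):

# Rescale (w, b) so that the training point closest to the hyperplane
# satisfies |(w . x_i) + b| = 1 (canonical form).
canonicalize <- function(w, b, X) {
  f <- as.vector(X %*% w + b)   # f(x_i) for every training point
  s <- 1 / min(abs(f))          # rescaling factor
  list(w = s * w, b = s * b)
}
X  <- rbind(c(2, 2), c(-1, -3), c(0.5, 1))
cf <- canonicalize(c(1, 1), 0, X)
min(abs(X %*% cf$w + cf$b))     # equals 1 by construction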

p. 211

Plus/Minus and Zero Hyperplane


[Figure: the plus hyperplane {x | w · x + b = +1}, the minus hyperplane {x | w · x + b = −1} and the zero hyperplane {x | w · x + b = 0}, with normal vector w and two training points x1 and x2 on the plus and minus hyperplanes. The distance from the plus (or minus) hyperplane to the zero hyperplane is 1/‖w‖; the total margin width is 2/‖w‖.]
p. 212

Optimal Separating Hyperplane


Let x3 be any point on the minus hyperplane and let x1 be the closest point to x3 on the plus hyperplane.

[Figure: as on the previous slide, with x1 on the plus hyperplane {x | w · x + b = +1} and x3 on the minus hyperplane {x | w · x + b = −1}.]

w · x1 + b = +1
w · x3 + b = −1
x1 = x3 + λw

Substituting the third equation into the first:

w · (x3 + λw) + b = 1  ⟹  (w · x3 + b) + λ‖w‖² = 1  ⟹  λ = 2/‖w‖²

We are interested in the margin, i.e.

‖x1 − x3‖ = ‖λw‖ = λ‖w‖ = (2/‖w‖²)·‖w‖ = 2/‖w‖
p. 213

Formulation as an Optimization Problem


Hyperplane with maximum margin (the smaller the norm of the weight vector w, the larger the margin):

minimize    (1/2)‖w‖²
subject to  yi((w · xi) + b) ≥ 1,    i = 1, ..., m

Introduce Lagrange multipliers αi ≥ 0 and a Lagrangian

L(w, b, α) = (1/2)‖w‖² − Σ_{i=1}^{m} αi (yi((xi · w) + b) − 1)

At the extrema we have

∂L(w, b, α)/∂b = 0,    ∂L(w, b, α)/∂w = 0
p. 214

Optimization and Kuhn-Tucker Theorem


Setting these derivatives to zero leads to the solution

Σ_{i=1}^{m} αi yi = 0,    w = Σ_{i=1}^{m} αi yi xi.

The extreme point solution obtained has an important property that results from optimization theory, known as the Kuhn-Tucker theorem. The theorem says:

A Lagrange multiplier can be non-zero only if the corresponding inequality constraint is an equality at the solution.

This implies that only the training examples xi that lie on the plus and minus hyperplanes have their corresponding αi non-zero.
p. 215

Relevant/Irrelevant Support Vector


More formally, the Karush-Kuhn-Tucker complementarity conditions say:

αi [yi((xi · w) + b) − 1] = 0,    i = 1, ..., m

i.e. the support vectors lie on the margin. That means, for all training points:

yi((xi · w) + b) > 1  ⟹  αi = 0    (xi irrelevant)
yi((xi · w) + b) = 1  (on margin)  ⟹  xi is a support vector
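In code this condition gives a simple test for which points are support vectors. A minimal R sketch, where alpha is assumed to be the solution of the dual problem derived later and the tolerance is an arbitrary choice:

# alpha: solution of the dual problem (assumed given), tol: numerical tolerance
support_vector_index <- function(alpha, tol = 1e-8) which(alpha > tol)
alpha <- c(0, 0.7, 0, 0, 1.3, 0)    # hypothetical solution
support_vector_index(alpha)          # points 2 and 5 are support vectors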

p. 216

Relevant/Irrelevant Support Vector


[Figure: the plus, minus and zero hyperplanes {x | w · x + b = +1}, {x | w · x + b = −1}, {x | w · x + b = 0} as before, with margin width 2/‖w‖. Points lying exactly on the plus or minus hyperplane have multipliers αi, αj, αk, αl > 0 (support vectors); points farther away have αm = αn = 0 (irrelevant).]

p. 217

Dual Form
The dual form has many advantages:

Formulate the optimization problem without w (after the mapping, w may live in a very high-dimensional space).

Formulate the optimization problem by means of α, yi and the dot products xi · xj.

Standard Quadratic Programming solvers can be used.

maximize    W(α) = Σ_{i=1}^{m} αi − (1/2) Σ_{i,j=1}^{m} αi αj yi yj (xi · xj)

subject to  αi ≥ 0,  i = 1, ..., m   and   Σ_{i=1}^{m} αi yi = 0

p. 218

Hyperplane Decision Function


The solution is determined by the examples on the margin. Thus

f(x) = sgn((x · w) + b) = sgn( Σ_{i=1}^{m} yi αi (x · xi) + b )

where

αi [yi((xi · w) + b) − 1] = 0,    i = 1, ..., m

and

w = Σ_{i=1}^{m} αi yi xi
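A direct R transcription of this decision function (the helper name is mine; alpha and b are assumed to come from the dual optimization above):

# f(x) = sgn( sum_i y_i * alpha_i * (x . x_i) + b )
# X: m x d matrix of training points, y: labels in {-1, +1}
decision <- function(x, X, y, alpha, b) {
  sign(sum(alpha * y * as.vector(X %*% x)) + b)
}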

p. 219

Non-separable Case

Case where the constraints yi((w · xi) + b) ≥ 1 cannot be satisfied for all i, i.e. no hyperplane separates the training data without error.
p. 220

Relax Constraints (Soft Margin)


Modify the constraints to

yi((w · xi) + b) ≥ 1 − ξi,   with ξi ≥ 0,   i = 1, ..., m

and add C Σ_{i=1}^{m} ξi, i.e. the distance of the error points to their correct place, to the objective function:

minimize    (1/2)‖w‖² + C Σ_{i=1}^{m} ξi

Same dual as before, with the additional constraints 0 ≤ αi ≤ C.


p. 221

Non-separable Case (Part 2)


[Figure: a two-dimensional data set (coordinates x1, x2) whose two classes cannot be separated by a line.]

This data set is not properly separable with lines, even when using many slack variables.

p. 222

Separate in Higher-Dim. Space


Map the data into a higher-dimensional space and separate it there with a hyperplane.

[Figure: the data in input coordinates (x1, x2) and its image in feature-space coordinates (z1, z2, z3), where it becomes linearly separable.]

Φ : R² → R³
(x1, x2) ↦ (z1, z2, z3) := (x1², √2·x1·x2, x2²)
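As an illustration (not part of the original slides), this particular map can be written down directly in R; with ring-shaped example data, the image in R³ is linearly separable because z1 + z3 = x1² + x2² is the squared radius:

phi <- function(x) c(x[1]^2, sqrt(2) * x[1] * x[2], x[2]^2)
# example data: class -1 close to the origin, class +1 near a circle of radius 2
inner <- matrix(rnorm(40, sd = 0.5), ncol = 2)
outer <- 2 * cbind(cos(1:20), sin(1:20)) + matrix(rnorm(40, sd = 0.1), ncol = 2)
Z_inner <- t(apply(inner, 1, phi))     # images in R^3
Z_outer <- t(apply(outer, 1, phi))
# z1 + z3 is the squared radius, so a plane in R^3 separates the two classes
range(rowSums(Z_inner[, c(1, 3)]))     # small values (with high probability)
range(rowSums(Z_outer[, c(1, 3)]))     # values around 4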
p. 223

Feature Space
Apply the mapping

Φ : R^N → F
x ↦ Φ(x)

to the data x1, x2, ..., xm ∈ X and construct the separating hyperplane in F instead of X. The samples are preprocessed as (Φ(x1), y1), ..., (Φ(xm), ym) ∈ F × {±1}.

Obtained decision function:

f(x) = sgn( Σ_{i=1}^{m} yi αi (Φ(x) · Φ(xi)) + b ) = sgn( Σ_{i=1}^{m} yi αi k(x, xi) + b )
p. 224

Kernels
A kernel is a function k such that for all x, y ∈ X

k(x, y) = (Φ(x) · Φ(y)),

where Φ is a mapping from X into a dot product feature space F.

The m × m matrix K with elements Kij = k(xi, xj) is called the kernel matrix or Gram matrix. The kernel matrix is symmetric and positive semi-definite, i.e. for all ai ∈ R, i = 1, ..., m, we have

Σ_{i,j=1}^{m} ai aj Kij ≥ 0

Positive semi-definite kernels are exactly those giving rise to a positive semi-definite kernel matrix K for all m and all sets {x1, x2, ..., xm} ⊆ X.
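As an illustration, one can build such a kernel matrix in R and confirm numerically that it is symmetric and positive semi-definite (the Gaussian kernel and the random points are arbitrary choices):

# K_ij = k(x_i, x_j) for a Gaussian kernel; all eigenvalues should be >= 0
gauss <- function(u, v, sigma = 1) exp(-sum((u - v)^2) / (2 * sigma^2))
X <- matrix(rnorm(20), ncol = 2)    # 10 points in R^2
m <- nrow(X)
K <- outer(1:m, 1:m, Vectorize(function(i, j) gauss(X[i, ], X[j, ])))
all.equal(K, t(K))                              # symmetric
min(eigen(K, symmetric = TRUE)$values)          # >= 0 up to numerical error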

p. 225

The Kernel Trick Example


Example: compute 2nd-order products of two pixels, i.e.

x = (x1, x2)  and  Φ(x) = (x1², √2·x1·x2, x2²)

(Φ(x) · Φ(z)) = (x1², √2·x1·x2, x2²)(z1², √2·z1·z2, z2²)ᵀ
              = ((x1, x2)(z1, z2)ᵀ)²
              = (x · z)²
              =: k(x, z)
p. 226

Kernel without knowing


Recall: mapping : RN F . SVM depends on the data
through dot products in F , i.e. functions of the form
(xi ) (xj )

With k such that k(xi , xj ) = (xi ) (xj ), it is not


necessary to even know what (x) is.


kuvk2
Example: k(u, v) = exp
, in this example F is

infinite dimensional.

p. 227

Feature Space (Optimization Problem)


Quadratic optimization problem (soft margin) with kernel:

maximize    W(α) = Σ_{i=1}^{m} αi − (1/2) Σ_{i,j=1}^{m} αi αj yi yj k(xi, xj)

subject to  0 ≤ αi ≤ C,  i = 1, ..., m   and   Σ_{i=1}^{m} αi yi = 0

p. 228

(Standard) Kernels
Linear      k0(u, v) = (u · v)

Polynomial  k1(u, v) = ((u · v) + θ)^d

Gaussian    k2(u, v) = exp(−‖u − v‖² / (2σ²))

Sigmoidal   k3(u, v) = tanh(κ(u · v) + θ)
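These kernels translate directly into R; a small sketch in which the parameter names theta, d, sigma and kappa follow the formulas above:

k_linear  <- function(u, v)                       sum(u * v)
k_poly    <- function(u, v, d = 2, theta = 1)     (sum(u * v) + theta)^d
k_gauss   <- function(u, v, sigma = 1)            exp(-sum((u - v)^2) / (2 * sigma^2))
k_sigmoid <- function(u, v, kappa = 1, theta = 0) tanh(kappa * sum(u * v) + theta)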

p. 229

SVM Results for Gaussian Kernel

[Figure: decision regions of an SVM with Gaussian kernel for σ = 0.5, C = 50 and for σ = 0.5, C = 1.]
p. 230

SVM Results for Gaussian Kernel (cont.)

[Figure: decision regions of an SVM with Gaussian kernel for σ = 0.02, C = 50 and for σ = 10, C = 50.]
p. 231

Link to Statistical Learning Theory


Pattern Classification:
Learn f : X → {±1} from input-output training data.
We assume that the data is generated from some unknown (but fixed) probability distribution P(x, y).

[Figure: an example two-class data set drawn from such a distribution P(x, y).]
p. 232

(Empirical) Risk
Learn f from a training set (x1, y1), ..., (xm, ym) ∈ X × {±1} such that the expected misclassification error on a test set, also drawn from P(x, y),

R[f] = ∫ (1/2)|f(x) − y| dP(x, y)        (Expected Risk)

is minimal.

Problem: we cannot minimize the expected risk, because P is unknown. Minimize instead the average risk over the training set, i.e.

Remp[f] = (1/m) Σ_{i=1}^{m} (1/2)|f(xi) − yi|        (Empirical Risk)
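A direct R transcription of the empirical risk for ±1-valued predictions (the example labels below are my own):

# empirical risk: average 0/1 loss (1/2)|f(x_i) - y_i| over the training set
empirical_risk <- function(pred, y) mean(abs(pred - y) / 2)
y    <- c(1, -1, 1, 1, -1)
pred <- c(1, -1, -1, 1, -1)    # one of five points misclassified
empirical_risk(pred, y)         # 0.2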

p. 233

Problems with Empirical Risk


Minimizing the empirical risk (training error) does not imply a small test error. To see this, note that for each function f and any test set (x̄1, ȳ1), ..., (x̄m̄, ȳm̄) ∈ X × {±1} satisfying {x̄1, ..., x̄m̄} ∩ {x1, ..., xm} = {}, there exists another function f* such that f*(xi) = f(xi) for all i = 1, ..., m, but f*(x̄i) ≠ f(x̄i) on all test samples i = 1, ..., m̄.

p. 234

A Bound for Pattern Classification


For any f ∈ F and m > h, with a probability of at least 1 − η,

R[f] ≤ Remp[f] + √( ( h(ln(2m/h) + 1) − ln(η/4) ) / m )

i.e. Expected Risk ≤ Training Error + Complexity term. To make the expected risk small, both terms on the right-hand side have to be minimized.

[Figure: error plotted against the complexity of the function set (from low to high): the training error decreases with growing complexity while the complexity term increases, so the bound is smallest at an intermediate complexity.]
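The complexity (confidence) term of this bound is easy to evaluate; a small R sketch with h, m and η as in the formula (the example values are arbitrary):

vc_confidence <- function(h, m, eta = 0.05) {
  sqrt((h * (log(2 * m / h) + 1) - log(eta / 4)) / m)
}
vc_confidence(h = 3,   m = 1000)   # small: simple function class, many samples
vc_confidence(h = 500, m = 1000)   # large: rich function class, same sample size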
p. 235

Measure Complexity of Function Set


How can we measure the complexity of a given function set?

The Vapnik-Chervonenkis (VC) dimension is defined as the maximum number of points that can be labeled (shattered) in all possible ways by functions of the set.

In R² we can shatter three non-collinear points with separating lines, but we can never shatter four points in R². Hence the VC dimension of separating lines in R² is h = 3.

[Figure: the possible labelings of three non-collinear points in R², each separated by a line, and a four-point labeling that no line can separate.]

p. 236

VC Dimension
Separating hyperplanes in R^N have VC dimension N + 1.
Hence: separating hyperplanes in high-dimensional feature spaces have an extremely large VC dimension and may not generalize well.
But margin hyperplanes can still have a small VC dimension.

p. 237

VC Dimension of Margin Hyperplanes


Consider hyperplanes (w · x) = 0 where w is normalized such that they are in canonical form w.r.t. a set of points X* = {x1, ..., xr}, i.e.,

min_{i=1,...,r} |(w · xi)| = 1.

The set of decision functions fw(x) = sgn(x · w) defined on X* and satisfying the constraint ‖w‖ ≤ Λ has VC dimension h satisfying

h ≤ min(R²Λ² + 1, N + 1)

Here, R is the radius of the smallest sphere around the origin containing X*.

p. 238

Smallest Sphere and VC Dimension


[Figure: the same set of points inside the smallest sphere containing them, separated once by hyperplanes with a large margin (left) and once by hyperplanes with a smaller margin (right).]

Hyperplanes with a large margin induce only a small number of possibilities to separate the data, i.e. the VC dimension is small (left figure). In contrast, smaller margins induce more separating hyperplanes, i.e. the VC dimension is large (right figure).
p. 239

SVM (RBF Kernel) and Bayes Decision


[Figure: the SVM decision region (RBF kernel) compared with the Bayes decision region for the same data set.]
p. 240

Implementation (Matrix notation)


Recall:

maximize    W(α) = Σ_{i=1}^{m} αi − (1/2) Σ_{i,j=1}^{m} αi αj yi yj (xi · xj)

subject to  αi ≥ 0,  i = 1, ..., m   and   Σ_{i=1}^{m} αi yi = 0

This can be expressed in matrix notation as

min_α  (1/2) αᵀ H α + cᵀ α
p. 241

Implementation (Matrix notation) cont.


where

H = ZZᵀ,    cᵀ = (−1, ..., −1)

with constraints

αᵀ Y = 0,    αi ≥ 0,  i = 1, ..., m

where

Z = [ y1·x1ᵀ ; y2·x2ᵀ ; ... ; ym·xmᵀ ]    (row i of Z is yi·xiᵀ)
Y = (y1, y2, ..., ym)ᵀ

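With these definitions the dual can be handed to a standard QP solver. A minimal R sketch using solve.QP from the quadprog package (the function name and the small ridge added to keep H numerically positive definite are my own choices):

library(quadprog)
# Solve min 1/2 a'Ha + c'a with c = (-1,...,-1), y'a = 0 and 0 <= a_i <= C.
# solve.QP() minimizes 1/2 b'Db - d'b s.t. A'b >= b0 (first meq rows: equality),
# so we pass d = (1,...,1) and write the box constraints as inequalities.
svm_dual_alpha <- function(X, y, C = 10, kernel = function(u, v) sum(u * v)) {
  m <- nrow(X)
  K <- outer(1:m, 1:m, Vectorize(function(i, j) kernel(X[i, ], X[j, ])))
  H <- (y %*% t(y)) * K + diag(1e-8, m)    # H = ZZ' plus a tiny ridge
  Amat <- cbind(y, diag(m), -diag(m))      # y'a = 0, a >= 0, -a >= -C
  bvec <- c(0, rep(0, m), rep(-C, m))
  solve.QP(Dmat = H, dvec = rep(1, m), Amat = Amat, bvec = bvec, meq = 1)$solution
}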
p. 242

Implementation (Maple)
# requires the LinearAlgebra and Optimization packages:
# with(LinearAlgebra): with(Optimization):
CalcAlphaVector := proc(LPM::Matrix, Kernel, C)
local H,i,j,k,rows,a,b,c,bl,bu,Aeq,
beq,Y,QPSolution;
rows := RowDimension(LPM);
H := Matrix(1..rows,1..rows);
a := Vector(2);
b := Vector(2);
beq := Vector([0]);
Y := Vector(rows);
for k from 1 to rows do
Y[k] := LPM[k,3];
end do;
p. 243

Implementation (Maple) cont.


for i from 1 to rows do
for j from 1 to rows do
a[1] := LPM[i,1];
a[2] := LPM[i,2];
b[1] := LPM[j,1];
b[2] := LPM[j,2];
H[i,j] := Y[i] * Y[j] * Kernel(a,b);
end do;
end do;

p. 244

Implementation (Maple) cont.


c   := Vector(1..rows, fill = -1);
bl  := Vector(1..rows, fill = 0);
bu  := Vector(1..rows, fill = C);
Aeq := convert(Transpose(Y),Matrix);

QPSolution := QPSolve([c,H],
[NoUserValue,NoUserValue,Aeq,beq],
[bl,bu]);
return QPSolution[2];
end proc:

p. 245

Using SVM in R (e1071)


Example, how to use SVM classification in R:
library(e1071);
# load famous Iris Fisher data set
data(iris);
attach(iris);
# number of rows
n <- nrow(iris);
# number of folds
folds <- 10
samps <- sample(rep(1:folds, length=n),
n, replace=FALSE)

p. 246

Using SVM in R (e1071) cont.


# Using the first fold:
train <- iris[samps!=1,] # fit the model
test <- iris[samps==1,] # predict
# features 1 -> 4 from training set
x_train <- train[,1:4];
# class labels
y_train <- train[,5];
# determine the hyperplane
model <- svm(x_train,y_train,type="C-classification",
cost=10,kernel="radial",
probability = TRUE, scale = FALSE);

p. 247

Using SVM in R (e1071) cont.


# features 1 -> 4 from test set (no class labels)
x_test <- test[,1:4];
# predict class labels for x_test
pred <- predict(model,x_test,decision.values = TRUE);
# get true class labels
y_true <- test[,5];
# check accuracy:
table(pred, y_true);
# determine test error
sum(pred != y_true)/length(y_true);
# do that for all folds, i.e. train <- iris[samps!=2,]
# test <- iris[samps==2,] , ........
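A sketch of the full loop over all folds, simply repeating the steps shown above for fold 1 and collecting the per-fold test errors:

errors <- numeric(folds)
for (k in 1:folds) {
  train <- iris[samps != k, ]
  test  <- iris[samps == k, ]
  model <- svm(train[, 1:4], train[, 5], type = "C-classification",
               cost = 10, kernel = "radial", scale = FALSE)
  pred  <- predict(model, test[, 1:4])
  errors[k] <- sum(pred != test[, 5]) / length(test[, 5])
}
mean(errors)   # cross-validated test error over all folds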
p. 248

Summary (SVM)

Optimal Separating Hyperplane


Formulation as an Optimization Problem (Primal, Dual)

Only Support Vectors are Relevant (Sparseness)


Mapping in High-dimensional Spaces
Kernel Trick
Replacement for Neural Networks?

p. 249
