
Introduction to SVMs

SVMs

• Geometric
– Maximizing Margin
• Kernel Methods
– Making nonlinear decision boundaries linear
– Efficiently!
• Capacity
– Structural Risk Minimization
SVM History
• SVM is a classifier derived from statistical learning
theory by Vapnik and Chervonenkis
• SVM was first introduced by Boser, Guyon and
Vapnik in COLT-92
• SVM became famous when, using pixel maps as
input, it gave accuracy comparable to NNs with
hand-designed features in a handwriting
recognition task
• SVM is closely related to:
– Kernel machines (a generalization of SVMs), large
margin classifiers, reproducing kernel Hilbert space,
Gaussian process, Boosting
Linear Classifiers
f(x, w, b) = sign(w · x − b)
[Figure: a classifier maps an input x through f (with parameters w, b) to an estimate y; the scatter plot shows points labelled +1 and −1]
How would you classify this data?
Linear Classifiers
f(x, w, b) = sign(w · x − b)
[Figure: several candidate separating lines through the +1 / −1 data]
Any of these would be fine..
..but which is best?
Classifier Margin
f(x, w, b) = sign(w · x − b)
[Figure: a separating line with its margin band shaded]
Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.
Maximum Margin
f(x, w, b) = sign(w · x − b)
[Figure: the separating line with the widest possible margin]
The maximum margin linear classifier is the linear classifier with the maximum margin.
This is the simplest kind of SVM (called an LSVM).
Linear SVM
Maximum Margin
f(x, w, b) = sign(w · x − b)
[Figure: the maximum margin separator; the points touching the margin are circled]
The maximum margin linear classifier is the linear classifier with the, um, maximum margin.
This is the simplest kind of SVM (called an LSVM).
Support Vectors are those datapoints that the margin pushes up against.
Linear SVM
Why Maximum Margin?
[Figure: the maximum margin separator and its support vectors, with f(x, w, b) = sign(w · x − b)]
1. Intuitively this feels safest.
2. If we’ve made a small error in the location of the boundary (it’s been jolted in its perpendicular direction), this gives us least chance of causing a misclassification.
3. There’s some theory (using VC dimension) that is related to (but not the same as) the proposition that this is a good thing.
4. Empirically it works very very well.
A “Good” Separator
[Figure: X and O datapoints separated by a line with room to spare]

Noise in the Observations
[Figure: the same X and O datapoints, jittered by observation noise]

Ruling Out Some Separators
[Figure: candidate separators that pass too close to the datapoints]

Lots of Noise
[Figure: the X and O datapoints under heavy noise]

Maximizing the Margin
[Figure: the separator that keeps the largest distance to the nearest X and the nearest O]
Specifying a line and margin

1 ” Plus-Plane
= +
la ss Classifier Boundary
i ct C one
Pr ed z Minus-Plane

- 1”
=
l a ss
i ct C one
Pr ed z

• How do we represent this mathematically?


• …in m input dimensions?
Specifying a line and margin
[Figure: the planes w · x + b = +1, w · x + b = 0, and w · x + b = −1, with the predict-class zones]

• Plus-plane  = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = −1 }

Classify as..   +1 if w · x + b >= 1
                −1 if w · x + b <= −1
                Universe explodes if −1 < w · x + b < 1
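A minimal sketch (not from the slides; the weight vector and bias are made up) of the three-way rule above in NumPy:

```python
import numpy as np

# Hypothetical weight vector and bias for a 2-D example.
w = np.array([2.0, 1.0])
b = -1.0

def classify(x, w, b):
    """Return +1, -1, or None if x falls strictly inside the margin band."""
    s = np.dot(w, x) + b
    if s >= 1:
        return +1
    if s <= -1:
        return -1
    return None  # -1 < w.x + b < 1: between the plus-plane and minus-plane

print(classify(np.array([1.0, 1.0]), w, b))   # w.x + b = 2  -> +1
print(classify(np.array([0.0, 0.0]), w, b))   # w.x + b = -1 -> -1
print(classify(np.array([0.5, 0.0]), w, b))   # w.x + b = 0  -> None (inside margin)
```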
Computing the margin width
[Figure: M = Margin Width, the distance between the Plus-Plane and the Minus-Plane]
How do we compute M in terms of w and b?

• Plus-plane  = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = −1 }
Claim: The vector w is perpendicular to the Plus Plane. Why?
Computing the margin width
[Figure: M = Margin Width between the two planes]
How do we compute M in terms of w and b?

• Plus-plane  = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = −1 }
Claim: The vector w is perpendicular to the Plus Plane. Why? Let u and v be two vectors on the Plus Plane. What is w · (u − v)?
(It is (w · u + b) − (w · v + b) = 1 − 1 = 0, so w has no component along the plane.)
And so of course the vector w is also perpendicular to the Minus Plane.
Computing the margin width
[Figure: x- on the minus-plane and x+, the closest plus-plane point, directly across the margin]
How do we compute M in terms of w and b?

• Plus-plane  = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = −1 }
• The vector w is perpendicular to the Plus Plane
• Let x- be any point on the minus plane
• Let x+ be the closest plus-plane-point to x-.   (Any location in R^m: not necessarily a datapoint)
Computing the margin width
[Figure: as before]
How do we compute M in terms of w and b?

• Plus-plane  = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = −1 }
• The vector w is perpendicular to the Plus Plane
• Let x- be any point on the minus plane
• Let x+ be the closest plus-plane-point to x-.
• Claim: x+ = x- + λ w for some value of λ. Why?
Computing the margin width
[Figure: as before]
The line from x- to x+ is perpendicular to the planes.
So to get from x- to x+, travel some distance in direction w.

• Plus-plane  = { x : w · x + b = +1 }
• Minus-plane = { x : w · x + b = −1 }
• The vector w is perpendicular to the Plus Plane
• Let x- be any point on the minus plane
• Let x+ be the closest plus-plane-point to x-.
• Claim: x+ = x- + λ w for some value of λ. Why?
Computing the margin width
[Figure: as before]
What we know:
• w · x+ + b = +1
• w · x- + b = −1
• x+ = x- + λ w
• |x+ − x-| = M
It’s now easy to get M in terms of w and b.
Computing the margin width
What we know:
• w · x+ + b = +1
• w · x- + b = −1
• x+ = x- + λ w
• |x+ − x-| = M

      w · (x- + λ w) + b = 1
  =>  w · x- + b + λ w·w = 1
  =>  −1 + λ w·w = 1
  =>  λ = 2 / (w·w)
Computing the margin width
M = Margin Width = 2 / sqrt(w·w)

What we know:
• w · x+ + b = +1
• w · x- + b = −1
• x+ = x- + λ w
• |x+ − x-| = M
• λ = 2 / (w·w)

      M = |x+ − x-| = |λ w| = λ |w| = λ sqrt(w·w)
        = 2 sqrt(w·w) / (w·w) = 2 / sqrt(w·w)
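A quick numeric sanity check of M = 2 / sqrt(w·w), sketched with a made-up w and b rather than anything from the slides:

```python
import numpy as np

w = np.array([3.0, 4.0])      # ||w|| = 5, so the margin should be 2/5 = 0.4
b = 0.7

# A point on the minus-plane: it satisfies w.x + b = -1.
x_minus = (-1 - b) * w / np.dot(w, w)
# Closest plus-plane point: move along w by lambda = 2 / (w.w).
lam = 2 / np.dot(w, w)
x_plus = x_minus + lam * w

print(np.dot(w, x_minus) + b)            # -1.0
print(np.dot(w, x_plus) + b)             # +1.0
print(np.linalg.norm(x_plus - x_minus))  # 0.4 == 2 / ||w||
```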
Learning the Maximum Margin Classifier
M = Margin Width = 2 / sqrt(w·w)
[Figure: the planes w · x + b = +1, 0, −1 and the datapoints]

Given a guess of w and b we can
• Compute whether all data points are in the correct half-planes
• Compute the width of the margin
So now we just need to write a program to search the space of w’s and b’s to find the widest margin that matches all the datapoints. How?
Gradient descent? Simulated Annealing? Matrix Inversion? EM? Newton’s Method?
Don’t worry… it’s good for you…
• Linear Programming
      find w
      argmax c·w
      subject to
      w·ai ≤ bi, for i = 1, …, m
      wj ≥ 0 for j = 1, …, n
There are fast algorithms for solving linear programs, including the simplex algorithm and Karmarkar’s algorithm.
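For illustration only, and assuming SciPy is available, a small LP in exactly this form can be handed to scipy.optimize.linprog; the data here are made up, and since linprog minimizes, the maximization of c·w is done by negating c.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical data: maximize c.w subject to A w <= b, w >= 0.
c = np.array([1.0, 2.0])
A = np.array([[1.0, 1.0],
              [3.0, 1.0]])
b = np.array([4.0, 6.0])

# linprog minimizes, so pass -c; bounds (0, None) enforces w_j >= 0.
res = linprog(-c, A_ub=A, b_ub=b, bounds=[(0, None), (0, None)])
print(res.x, -res.fun)   # optimal w and the maximized objective c.w
```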
Learning via Quadratic Programming

• QP is a well-studied class of optimization


algorithms to maximize a quadratic function
of some real-valued variables subject to
linear constraints.
Quadratic Programming

Find  arg max over u of   c + dᵀu + uᵀR u / 2        (quadratic criterion)

Subject to   a11 u1 + a12 u2 + … + a1m um ≤ b1
             a21 u1 + a22 u2 + … + a2m um ≤ b2        (n additional linear inequality constraints)
             ⋮
             an1 u1 + an2 u2 + … + anm um ≤ bn

And subject to   a(n+1)1 u1 + a(n+1)2 u2 + … + a(n+1)m um = b(n+1)
                 a(n+2)1 u1 + a(n+2)2 u2 + … + a(n+2)m um = b(n+2)     (e additional linear equality constraints)
                 ⋮
                 a(n+e)1 u1 + a(n+e)2 u2 + … + a(n+e)m um = b(n+e)
Quadratic Programming
(Same quadratic criterion and linear inequality / equality constraints as on the previous slide.)

There exist algorithms for finding such constrained quadratic optima much more efficiently and reliably than gradient ascent.
(But they’re very fiddly… you probably don’t want to write one yourself.)
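As a hedged sketch of what calling such a solver looks like (assuming the cvxopt package, which minimizes ½uᵀPu + qᵀu subject to Gu ≤ h; the tiny problem below is made up, and maximization is handled by flipping signs):

```python
import numpy as np
from cvxopt import matrix, solvers

# Hypothetical tiny QP: maximize d.u - (1/2) u^T R u  subject to u1 + u2 <= 1, u >= 0.
R = np.array([[2.0, 0.0],
              [0.0, 2.0]])
d = np.array([1.0, 1.0])

P = matrix(R)                        # cvxopt minimizes, so keep R ...
q = matrix(-d)                       # ... and negate the linear term
G = matrix(np.array([[ 1.0,  1.0],   # u1 + u2 <= 1
                     [-1.0,  0.0],   # -u1 <= 0  (i.e. u1 >= 0)
                     [ 0.0, -1.0]])) # -u2 <= 0
h = matrix(np.array([1.0, 0.0, 0.0]))

solvers.options['show_progress'] = False
sol = solvers.qp(P, q, G, h)
print(np.array(sol['x']).ravel())    # roughly [0.5, 0.5]
```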
Learning the Maximum Margin Classifier
M = 2 / sqrt(w·w)
[Figure: the planes w · x + b = +1, 0, −1]

Given a guess of w, b we can
• Compute whether all data points are in the correct half-planes
• Compute the margin width

Assume R datapoints, each (xk, yk) where yk = +/− 1.
What should our quadratic optimization criterion be?
      Minimize w·w
How many constraints will we have? What should they be?
      w · xk + b >= 1   if yk = 1
      w · xk + b <= −1  if yk = −1
This is going to be a problem!
What should we do?
[Figure: +1 and −1 points that are not linearly separable]
Idea 1:
Find minimum w·w, while minimizing number of training set errors.
Problem: Two things to minimize makes for an ill-defined optimization.
This is going to be a problem!
What should we do?
[Figure: the same non-separable data]
Idea 1.1:
Minimize   w·w + C (#train errors),   where C is a tradeoff parameter.
There’s a serious practical problem that’s about to make us reject this approach. Can you guess what it is?
This is going to be a problem!
What should we do?
Idea 1.1:
Minimize   w·w + C (#train errors),   where C is a tradeoff parameter.
Can’t be expressed as a Quadratic Programming problem. Solving it may be too slow.
(Also, doesn’t distinguish between disastrous errors and near misses.)
So… any other ideas?
This is going to be a problem!
What should we do?
Idea 2.0:
Minimize   w·w + C (distance of error points to their correct place)
Learning Maximum Margin with Noise
M = 2 / sqrt(w·w)
[Figure: the margin planes with some points on the wrong side]

Given a guess of w, b we can
• Compute sum of distances of points to their correct zones
• Compute the margin width

Assume R datapoints, each (xk, yk) where yk = +/− 1.
What should our quadratic optimization criterion be?
How many constraints will we have? What should they be?
Large-margin Decision Boundary
• The decision boundary should be as far away from the data of both classes as possible
– We should maximize the margin, m
– Distance between the origin and the line wᵀx = k is k/||w||
[Figure: Class 1 and Class 2 separated by a boundary with margin m]
Finding the Decision Boundary
• Let {x1, ..., xn} be our data set and let yi ∈ {1, −1} be the class label of xi
• The decision boundary should classify all points correctly ⇒
      yi (wᵀxi + b) ≥ 1,  for all i
• The decision boundary can be found by solving the following constrained optimization problem
      Minimize ½ ||w||²  subject to  yi (wᵀxi + b) ≥ 1  for all i
• This is a constrained optimization problem. Solving it requires some new tools
– Feel free to ignore the following several slides; what is important is the constrained optimization problem above
Back to the Original Problem
• The Lagrangian is
      L = ½ wᵀw − Σi αi [ yi (wᵀxi + b) − 1 ],   with αi ≥ 0
– Note that ||w||² = wᵀw
• Setting the gradient of L w.r.t. w and b to zero, we have
      w = Σi αi yi xi    and    Σi αi yi = 0
• The Karush-Kuhn-Tucker conditions:
      αi ≥ 0,    yi (wᵀxi + b) − 1 ≥ 0,    αi [ yi (wᵀxi + b) − 1 ] = 0
The Dual Problem
• If we substitute w = Σi αi yi xi into the Lagrangian, we have
      W(α) = Σi αi − ½ Σi Σj αi αj yi yj xiᵀxj
• Note that Σi αi yi = 0
• This is a function of αi only
The Dual Problem
• The new objective function is in terms of αi only
• It is known as the dual problem: if we know w, we know all αi; if we know all αi, we know w
• The original problem is known as the primal problem
• The objective function of the dual problem needs to be maximized!
• The dual problem is therefore:
      max over α of   Σi αi − ½ Σi Σj αi αj yi yj xiᵀxj
      subject to  αi ≥ 0   (a property of the αi when we introduce the Lagrange multipliers)
      and  Σi αi yi = 0    (the result when we differentiate the original Lagrangian w.r.t. b)
The Dual Problem
• This is a quadratic programming (QP) problem
– A global maximum of αi can always be found
• w can be recovered by
      w = Σi αi yi xi
Characteristics of the Solution
• Many of the αi are zero
– w is a linear combination of a small number of data points
– This “sparse” representation can be viewed as data compression, as in the construction of a k-NN classifier
• xi with non-zero αi are called support vectors (SV)
– The decision boundary is determined only by the SV
– Let tj (j = 1, ..., s) be the indices of the s support vectors. We can write  w = Σ_{j=1..s} αtj ytj xtj
• For testing with a new data point z
– Compute  wᵀz + b = Σ_{j=1..s} αtj ytj (xtjᵀ z) + b,  and classify z as class 1 if the sum is positive, and class 2 otherwise
– Note: w need not be formed explicitly
A Geometrical Interpretation
[Figure: Class 1 and Class 2 with the margin; the support vectors carry the non-zero multipliers α1 = 0.8, α6 = 1.4, α8 = 0.6, while α2, α3, α4, α5, α7, α9, α10 are all 0]
Non-linearly Separable Problems
• We allow “error” ξi in classification; it is based on the output of the discriminant function wᵀx + b
• ξi approximates the number of misclassified samples
[Figure: Class 1 and Class 2 overlapping, with slack ξi for the points on the wrong side of their margin]
Learning Maximum Margin with Noise
M = 2 / sqrt(w·w)
[Figure: the margin planes with slack distances ε1, ε2, ε7 for three misplaced points]

Given a guess of w, b we can
• Compute sum of distances of points to their correct zones
• Compute the margin width

Assume R datapoints, each (xk, yk) where yk = +/− 1.
What should our quadratic optimization criterion be?
      Minimize  ½ w·w + C Σ_{k=1..R} εk
How many constraints will we have? What should they be?
      w · xk + b >= 1 − εk   if yk = 1
      w · xk + b <= −1 + εk  if yk = −1
Learning Maximum Margin with Noise
m = # input dimensions,  R = # records

Our original (noiseless data) QP had m+1 variables: w1, w2, … wm, and b.
Our new (noisy data) QP has m+1+R variables: w1, w2, … wm, b, ε1, … εR.

What should our quadratic optimization criterion be?
      Minimize  ½ w·w + C Σ_{k=1..R} εk
How many constraints will we have? What should they be?
      w · xk + b >= 1 − εk   if yk = 1
      w · xk + b <= −1 + εk  if yk = −1
Learning Maximum Margin with Noise
M = 2 / sqrt(w·w)
[Figure: as before, with slacks ε1, ε2, ε7]

What should our quadratic optimization criterion be?
      Minimize  ½ w·w + C Σ_{k=1..R} εk
How many constraints will we have? What should they be?
      w · xk + b >= 1 − εk   if yk = 1
      w · xk + b <= −1 + εk  if yk = −1

There’s a bug in this QP. Can you spot it?
Learning Maximum Margin with Noise
M = 2 / sqrt(w·w)
[Figure: as before, with slacks ε1, ε2, ε7]

What should our quadratic optimization criterion be?
      Minimize  ½ w·w + C Σ_{k=1..R} εk
How many constraints will we have? 2R. What should they be?
      w · xk + b >= 1 − εk   if yk = 1
      w · xk + b <= −1 + εk  if yk = −1
      εk >= 0  for all k
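The εk in this criterion are just the amounts by which points violate their margin constraints; given a guess of w and b they follow directly from the constraints above. A sketch with made-up data:

```python
import numpy as np

# Hypothetical data: rows of X are the datapoints xk, y holds the labels +/-1.
X = np.array([[2.0, 1.0],
              [1.5, 2.0],
              [0.0, 0.5],
              [1.0, 1.0]])   # this last point will violate its margin
y = np.array([+1, +1, -1, -1])
w = np.array([1.0, 1.0])
b = -2.0

# epsilon_k = max(0, 1 - yk (w.xk + b)) is the distance to the correct zone.
eps = np.maximum(0.0, 1.0 - y * (X @ w + b))
print(eps)
print("penalised criterion:", 0.5 * w @ w + 1.0 * eps.sum())   # C = 1 here
```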
An Equivalent Dual QP

Minimize  ½ w·w + C Σ_{k=1..R} εk   subject to
      w · xk + b >= 1 − εk   if yk = 1
      w · xk + b <= −1 + εk  if yk = −1
      εk >= 0,  for all k

Maximize  Σ_{k=1..R} αk − ½ Σ_{k=1..R} Σ_{l=1..R} αk αl Qkl   where  Qkl = yk yl (xk · xl)
Subject to these constraints:   0 ≤ αk ≤ C  for all k,     Σ_{k=1..R} αk yk = 0
An Equivalent Dual QP

Maximize  Σ_{k=1..R} αk − ½ Σ_{k=1..R} Σ_{l=1..R} αk αl Qkl   where  Qkl = yk yl (xk · xl)
Subject to these constraints:   0 ≤ αk ≤ C  for all k,     Σ_{k=1..R} αk yk = 0

Then define:
      w = Σ_{k=1..R} αk yk xk
      b = yK (1 − εK) − xK · w    where K = arg max over k of αk
Then classify with:
      f(x, w, b) = sign(w · x − b)
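Feeding this dual into a QP solver is mechanical. A sketch only: it assumes cvxopt, adds a tiny ridge to keep the solver happy, uses the w·x + b sign convention, and recovers b by averaging over margin support vectors instead of the arg-max rule on the slide.

```python
import numpy as np
from cvxopt import matrix, solvers

def linear_svm_dual(X, y, C=1.0):
    """Sketch: solve the soft-margin dual with a linear kernel via cvxopt."""
    R = len(y)
    Q = np.outer(y, y) * (X @ X.T)                 # Qkl = yk yl (xk . xl)
    P = matrix(Q + 1e-8 * np.eye(R))               # tiny ridge for numerical safety
    q = matrix(-np.ones(R))                        # maximize sum(alpha) => minimize -sum
    G = matrix(np.vstack([-np.eye(R), np.eye(R)])) # 0 <= alpha_k <= C
    h = matrix(np.hstack([np.zeros(R), C * np.ones(R)]))
    A = matrix(y.reshape(1, -1).astype(float))     # sum_k alpha_k yk = 0
    b = matrix(0.0)
    solvers.options['show_progress'] = False
    alpha = np.array(solvers.qp(P, q, G, h, A, b)['x']).ravel()

    w = (alpha * y) @ X                            # w = sum_k alpha_k yk xk
    margin_sv = (alpha > 1e-6) & (alpha < C - 1e-6)
    b_val = np.mean(y[margin_sv] - X[margin_sv] @ w)
    return w, b_val, alpha

# Tiny made-up, linearly separable dataset.
X = np.array([[2.0, 2.0], [2.5, 3.0], [0.0, 0.0], [-1.0, 0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b, alpha = linear_svm_dual(X, y, C=10.0)
print(w, b)
print(np.sign(X @ w + b))   # reproduces y
```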
Example: XOR problem revisited

Let the nonlinear mapping be:
      φ(x)  = (1, x1², √2 x1x2, x2², √2 x1, √2 x2)ᵀ
And:  φ(xi) = (1, xi1², √2 xi1xi2, xi2², √2 xi1, √2 xi2)ᵀ

Therefore the feature space is 6-D, with input data in 2-D:
      x1 = (−1, −1),  d1 = −1
      x2 = (−1,  1),  d2 =  1
      x3 = ( 1, −1),  d3 =  1
      x4 = ( 1,  1),  d4 = −1
Q(a) = Σ ai − ½ Σ Σ ai aj di dj φ(xi)ᵀφ(xj)
     = a1 + a2 + a3 + a4 − ½(9a1a1 − 2a1a2 − 2a1a3 + 2a1a4
       + 9a2a2 + 2a2a3 − 2a2a4 + 9a3a3 − 2a3a4 + 9a4a4)

To maximize Q, we only need to calculate
      ∂Q(a)/∂ai = 0,   i = 1, ..., 4
(Q is concave, so this stationary point is the maximum), which gives
      1 =  9a1 − a2 − a3 + a4
      1 = −a1 + 9a2 + a3 − a4
      1 = −a1 + a2 + 9a3 − a4
      1 =  a1 − a2 − a3 + 9a4
The solution of which gives the optimal values:
      a0,1 = a0,2 = a0,3 = a0,4 = 1/8

      w0 = Σ a0,i di φ(xi)
         = 1/8 [ −φ(x1) + φ(x2) + φ(x3) − φ(x4) ]
         = (0, 0, −1/√2, 0, 0, 0)ᵀ
Where the first element of w0 gives the bias b.

From earlier we have that the optimal hyperplane is defined by:
      w0ᵀ φ(x) = 0
That is:
      w0ᵀ φ(x) = (0, 0, −1/√2, 0, 0, 0) · (1, x1², √2 x1x2, x2², √2 x1, √2 x2)ᵀ = −x1x2 = 0

which is the optimal decision boundary for the XOR problem.
Furthermore we note that the solution is unique, since the optimal decision boundary is unique.
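A quick numeric check of this solution, sketched with the kernel K(x, xi) = (xᵀxi + 1)² (introduced a few slides later) instead of building φ explicitly; the linear system below is exactly the one printed above.

```python
import numpy as np

# XOR data in the +/-1 encoding used above.
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
d = np.array([-1.0, 1.0, 1.0, -1.0])

# Polynomial kernel K(x, x') = (x.x' + 1)^2 gives the same inner products as phi.
K = (X @ X.T + 1.0) ** 2

# Setting the gradient of Q(a) to zero gives (d d^T * K) a = 1.
H = np.outer(d, d) * K
a = np.linalg.solve(H, np.ones(4))
print(a)                                  # all four equal to 1/8

# Decision function sum_i a_i d_i K(x, x_i); at the corners it equals -x1*x2.
def f(x):
    return sum(a[i] * d[i] * (x @ X[i] + 1.0) ** 2 for i in range(4))

print([round(f(np.array(p)), 3) for p in [(-1, -1), (-1, 1), (1, -1), (1, 1)]])
```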
[Figures: the SVM output surface for a polynomial kernel and for an RBF kernel]
Harder 1-dimensional dataset
Remember how permitting non-linear basis functions made linear regression so much nicer? Let’s permit them here too.
[Figure: a 1-d dataset centred at x = 0 that no single threshold can separate]
      zk = (xk, xk²)
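A sketch of the idea for this 1-d case (the dataset values are invented for illustration): mapping each x to zk = (xk, xk²) makes the two classes separable by a horizontal line in the new 2-d space.

```python
import numpy as np

# Hypothetical 1-d dataset: negatives in the middle, positives on the outside.
x = np.array([-4.0, -3.0, -1.0, 0.0, 1.0, 3.0, 4.0])
y = np.array([+1,   +1,   -1,   -1,  -1,  +1,  +1])

# Map to zk = (xk, xk^2): a horizontal line z2 = 5 now separates the classes.
Z = np.column_stack([x, x ** 2])
print(Z)
print(np.sign(Z[:, 1] - 5.0) == y)   # all True: separable after the mapping
```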
For a non-linearly separable problem we have to first map the data onto a feature space so that it becomes linearly separable:
      xi → φ(xi)
Given the training sample {(xi, yi), i = 1, …, N}, find the optimum values of the weight vector w and bias b:
      w = Σ a0,i yi φ(xi)
where a0,i are the optimal Lagrange multipliers determined by maximizing the following objective function
      Q(a) = Σ_{i=1..N} ai − ½ Σ_{i=1..N} Σ_{j=1..N} ai aj di dj φ(xi)ᵀφ(xj)
subject to the constraints
      Σ ai yi = 0;   ai ≥ 0
SVM building procedure:
• Pick a nonlinear mapping φ
• Solve for the optimal weight vector

However: how do we pick the function φ?
• In practical applications, if it is not totally impossible to find φ, it is very hard
• In the previous example, the function φ is quite complex: how would we find it?

Answer: the Kernel Trick


Notice that in the dual problem the images of the input vectors are involved only through inner products, meaning that the optimization can be performed in the (lower-dimensional) input space and that the inner product can be replaced by an inner-product kernel:

      Q(a) = Σ_{i=1..N} ai − ½ Σ_{i,j=1..N} ai aj di dj φ(xi)ᵀφ(xj)
           = Σ_{i=1..N} ai − ½ Σ_{i,j=1..N} ai aj di dj K(xi, xj)

How do we relate the output of the SVM to the kernel K?
Look at the equation of the boundary in the feature space and use the optimality conditions derived from the Lagrangian formulation.
The hyperplane is defined by
      Σ_{j=1..m1} wj φj(x) + b = 0
or
      Σ_{j=0..m1} wj φj(x) = 0,    where φ0(x) = 1

Writing  φ(x) = [φ0(x), φ1(x), ..., φm1(x)],  we get:
      wᵀ φ(x) = 0
From the optimality conditions:
      w = Σ_{i=1..N} ai di φ(xi)
Thus:
      Σ_{i=1..N} ai di φ(xi)ᵀ φ(x) = 0
and so the boundary is:
      Σ_{i=1..N} ai di K(x, xi) = 0
and
      Output = wᵀ φ(x) = Σ_{i=1..N} ai di K(x, xi)
where
      K(x, xi) = Σ_{j=0..m1} φj(x) φj(xi)
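A minimal sketch of this kernel-form output; the multipliers, labels and training points below are placeholders, since in practice they come from solving the dual.

```python
import numpy as np

def svm_output(x, X_train, d, a, kernel):
    """Kernel-form output: sum_i a_i d_i K(x, x_i); classify by its sign."""
    return sum(a_i * d_i * kernel(x, x_i)
               for a_i, d_i, x_i in zip(a, d, X_train))

# Illustrative values only; a and d would come from the trained SVM.
quad = lambda u, v: (np.dot(u, v) + 1.0) ** 2
X_train = np.array([[1.0, 0.0], [0.0, 1.0]])
d = np.array([+1.0, -1.0])
a = np.array([0.5, 0.5])
print(svm_output(np.array([2.0, 0.0]), X_train, d, a, quad))  # positive -> class +1
```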
In the XOR problem, we chose to use the kernel function:
      K(x, xi) = (xᵀxi + 1)²
               = 1 + x1²xi1² + 2x1x2xi1xi2 + x2²xi2² + 2x1xi1 + 2x2xi2

which implied the form of our nonlinear functions:
      φ(x)  = (1, x1², √2 x1x2, x2², √2 x1, √2 x2)ᵀ
and:  φ(xi) = (1, xi1², √2 xi1xi2, xi2², √2 xi1, √2 xi2)ᵀ

However, we did not need to calculate φ at all, and could simply have used the kernel to calculate
      Q(a) = Σ ai − ½ Σ Σ ai aj di dj K(xi, xj)
maximized it and solved for ai, and derived the hyperplane via:
      Σ_{i=1..N} ai di K(x, xi) = 0
We therefore only need a suitable choice of kernel function; cf. Mercer’s Theorem:

Let K(x, y) be a continuous symmetric kernel defined on the closed interval [a, b]. The kernel K can be expanded in the form
      K(x, y) = φ(x)ᵀ φ(y)
provided it is positive definite. Some of the usual choices for K are:

      Polynomial SVM     (xᵀxi + 1)^p                      p specified by user
      RBF SVM            exp(−1/(2σ²) ||x − xi||²)         σ specified by user
      MLP SVM            tanh(s0 xᵀxi + s1)
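Mercer's condition can be probed numerically; a sketch that builds the kernel matrix on random sample points and checks that its eigenvalues are non-negative (positive semidefinite):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                       # 20 sample points in 3-d

def kernel_matrix(X, k):
    return np.array([[k(a, b) for b in X] for a in X])

poly = lambda a, b: (a @ b + 1.0) ** 2                           # polynomial, p = 2
rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2) / (2 * 1.0**2))  # sigma = 1

for name, k in [("poly", poly), ("rbf", rbf)]:
    K = kernel_matrix(X, k)
    print(name, np.linalg.eigvalsh(K).min() >= -1e-8)   # True: positive semidefinite
```

(The tanh kernel listed above is a known exception: it is not positive definite for all choices of s0 and s1.)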


An Equivalent Dual QP

Maximize  Σ_{k=1..R} αk − ½ Σ_{k=1..R} Σ_{l=1..R} αk αl Qkl   where  Qkl = yk yl (xk · xl)
Subject to these constraints:   0 ≤ αk ≤ C  for all k,     Σ_{k=1..R} αk yk = 0

Then define:
      w = Σ_{k=1..R} αk yk xk
Datapoints with αk > 0 will be the support vectors, ..so this sum only needs to be over the support vectors.
      b = yK (1 − εK) − xK · w    where K = arg max over k of αk
Then classify with:
      f(x, w, b) = sign(w · x − b)


Quadratic Basis Functions

Φ(x) = ( 1,                                  ← constant term
         √2 x1, √2 x2, …, √2 xm,             ← linear terms
         x1², x2², …, xm²,                   ← pure quadratic terms
         √2 x1x2, √2 x1x3, …, √2 x1xm,
         √2 x2x3, …, √2 xm−1xm )ᵀ            ← quadratic cross-terms

Number of terms (assuming m input dimensions) = (m+2)-choose-2
= (m+2)(m+1)/2
= (as near as makes no difference) m²/2

You may be wondering what those √2’s are doing.
• You should be happy that they do no harm
• You’ll find out why they’re there soon.
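A sketch of building Φ(x) explicitly for small m, following the layout above (constant term, √2-scaled linear terms, pure quadratic terms, √2-scaled cross-terms):

```python
import numpy as np

def quad_phi(x):
    """Explicit quadratic basis expansion, matching the ordering above."""
    m = len(x)
    terms = [1.0]
    terms += [np.sqrt(2) * x[i] for i in range(m)]          # linear terms
    terms += [x[i] ** 2 for i in range(m)]                  # pure quadratic terms
    terms += [np.sqrt(2) * x[i] * x[j]                      # cross-terms
              for i in range(m) for j in range(i + 1, m)]
    return np.array(terms)

x = np.array([3.0, 1.0, 2.0])
print(len(quad_phi(x)), (3 + 2) * (3 + 1) // 2)   # 10 terms = (m+2)(m+1)/2
```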
QP with basis functions

Maximize  Σ_{k=1..R} αk − ½ Σ_{k=1..R} Σ_{l=1..R} αk αl Qkl   where  Qkl = yk yl (Φ(xk) · Φ(xl))
Subject to these constraints:   0 ≤ αk ≤ C  for all k,     Σ_{k=1..R} αk yk = 0

Then define:
      w = Σ_{k s.t. αk>0} αk yk Φ(xk)
      b = yK (1 − εK) − xK · w    where K = arg max over k of αk
Then classify with:
      f(x, w, b) = sign(w · φ(x) − b)
QP with basis functions
(Same QP as above, with Qkl = yk yl (Φ(xk) · Φ(xl)).)

We must do R²/2 dot products to get this matrix ready.
Each dot product requires m²/2 additions and multiplications.
The whole thing costs R² m² / 4. Yeeks!
…or does it?
 1   1  1
   
 2a1   2b1 
 +
2 a2   2b2 
    m

 :   :  ∑ 2a b i i
 2 a   2 b  i =1
 m   m 
 a12
  b1 2
 +
 2   2 
 a   b  m

∑ i bi
2 2 2 2
    a
: :
 2
  2
 i =1

 a   b 
Quadratic
m m
Φ(a) • Φ(b) =  •
2a1a2   2b1b2 
   
 2a1a3   2b1b3  +


:  
 
: 

Dot Products
 2a1am   2b1bm 
    m m

 2 a a
2 3   2 b b
2 3  ∑ ∑ 2a a b b i j i j
 :   :  i =1 j =i +1
   
 2 a a
1 m   2b b
1 m 
 :   : 
   
 2am −1am   2bm −1bm 
Quadratic Dot Products

Just out of casual, innocent interest, let’s look at another function of a and b:
      (a·b + 1)² = (a·b)² + 2 a·b + 1
                 = (Σ_{i=1..m} ai bi)² + 2 Σ_{i=1..m} ai bi + 1
                 = Σ_{i=1..m} Σ_{j=1..m} ai bi aj bj + 2 Σ_{i=1..m} ai bi + 1
                 = Σ_{i=1..m} (ai bi)² + 2 Σ_{i=1..m} Σ_{j=i+1..m} ai bi aj bj + 2 Σ_{i=1..m} ai bi + 1

Compare with:
      Φ(a) · Φ(b) = 1 + 2 Σ_{i=1..m} ai bi + Σ_{i=1..m} ai² bi² + Σ_{i=1..m} Σ_{j=i+1..m} 2 ai aj bi bj

They’re the same!
And this is only O(m) to compute!
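A quick numeric confirmation (a sketch; quad_phi repeats the expansion sketched earlier so this snippet stands alone):

```python
import numpy as np

def quad_phi(x):
    m = len(x)
    return np.array([1.0]
                    + [np.sqrt(2) * x[i] for i in range(m)]
                    + [x[i] ** 2 for i in range(m)]
                    + [np.sqrt(2) * x[i] * x[j]
                       for i in range(m) for j in range(i + 1, m)])

rng = np.random.default_rng(1)
a, b = rng.normal(size=5), rng.normal(size=5)

lhs = quad_phi(a) @ quad_phi(b)      # explicit dot product in the expanded space
rhs = (a @ b + 1.0) ** 2             # kernel evaluated in the original space
print(np.isclose(lhs, rhs))          # True
```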
QP with Quadratic basis functions

Maximize  Σ_{k=1..R} αk − ½ Σ_{k=1..R} Σ_{l=1..R} αk αl Qkl   where  Qkl = yk yl (Φ(xk) · Φ(xl))
Subject to these constraints:   0 ≤ αk ≤ C  for all k,     Σ_{k=1..R} αk yk = 0

We must do R²/2 dot products to get this matrix ready.
Each dot product now requires only m additions and multiplications.

Then define:
      w = Σ_{k s.t. αk>0} αk yk Φ(xk)
      b = yK (1 − εK) − xK · w    where K = arg max over k of αk
Then classify with:
      f(x, w, b) = sign(w · φ(x) − b)
Higher Order Polynomials

Polynomial | φ(x)                           | Cost to build Qkl matrix traditionally | Cost if 100 inputs | φ(a)·φ(b) | Cost to build Qkl matrix efficiently | Cost if 100 inputs
Quadratic  | All m²/2 terms up to degree 2  | m² R² / 4                              | 2,500 R²           | (a·b+1)²  | m R² / 2                             | 50 R²
Cubic      | All m³/6 terms up to degree 3  | m³ R² / 12                             | 83,000 R²          | (a·b+1)³  | m R² / 2                             | 50 R²
Quartic    | All m⁴/24 terms up to degree 4 | m⁴ R² / 48                             | 1,960,000 R²       | (a·b+1)⁴  | m R² / 2                             | 50 R²
QP with Quintic basis functions

Maximize  Σ_{k=1..R} αk − ½ Σ_{k=1..R} Σ_{l=1..R} αk αl Qkl   where  Qkl = yk yl (Φ(xk) · Φ(xl))
Subject to these constraints:   0 ≤ αk ≤ C  for all k,     Σ_{k=1..R} αk yk = 0

Then define:
      w = Σ_{k s.t. αk>0} αk yk Φ(xk)
      b = yK (1 − εK) − xK · w    where K = arg max over k of αk
Then classify with:
      f(x, w, b) = sign(w · φ(x) − b)
QP with Quintic basis functions
(Same QP as above.)

We must do R²/2 dot products to get this matrix ready.
In 100-d, each dot product now needs 103 operations instead of 75 million.

But there are still worrying things lurking away. What are they?
• The fear of overfitting with this enormous number of terms
• The evaluation phase (doing a set of predictions on a test set) will be very expensive (why?)
QP with Quintic basis functions
(Same QP as above.)

We must do R²/2 dot products to get this matrix ready.
In 100-d, each dot product now needs 103 operations instead of 75 million.
The use of Maximum Margin magically makes this not a problem.

But there are still worrying things lurking away. What are they?
• The fear of overfitting with this enormous number of terms
• The evaluation phase (doing a set of predictions on a test set) will be very expensive (why?)
  Because each w · φ(x) (see below) needs 75 million operations. What can be done?
QP with Quintic basis functions
(Same QP as above.)

• The fear of overfitting with this enormous number of terms
• The evaluation phase (doing a set of predictions on a test set) will be very expensive (why?)
  Because each w · φ(x) (see below) needs 75 million operations. What can be done?

      w · Φ(x) = Σ_{k s.t. αk>0} αk yk Φ(xk) · Φ(x)
               = Σ_{k s.t. αk>0} αk yk (xk · x + 1)⁵

Only S·m operations (S = #support vectors).

Then classify with:
      f(x, w, b) = sign(w · φ(x) − b)
QP with Quintic basis functions

Maximize  Σ_{k=1..R} αk − ½ Σ_{k=1..R} Σ_{l=1..R} αk αl Qkl   where  Qkl = yk yl (Φ(xk) · Φ(xl))
Subject to these constraints:   0 ≤ αk ≤ C  for all k,     Σ_{k=1..R} αk yk = 0

Why SVMs don’t overfit as much as you’d think:
No matter what the basis function, there are really only up to R parameters: α1, α2, …, αR, and usually most are set to zero by the Maximum Margin.

Asking for small w·w is like “weight decay” in Neural Nets and like Ridge Regression parameters in Linear Regression and like the use of Priors in Bayesian Regression---all designed to smooth the function and reduce overfitting.

Then define:
      w = Σ_{k s.t. αk>0} αk yk Φ(xk)
      w · Φ(x) = Σ_{k s.t. αk>0} αk yk Φ(xk) · Φ(x) = Σ_{k s.t. αk>0} αk yk (xk · x + 1)⁵
Only S·m operations (S = #support vectors).
      b = yK (1 − εK) − xK · w    where K = arg max over k of αk
Then classify with:
      f(x, w, b) = sign(w · φ(x) − b)
SVM Kernel Functions

• K(a, b) = (a · b + 1)^d is an example of an SVM Kernel Function
• Beyond polynomials there are other very high dimensional basis functions that can be made practical by finding the right Kernel Function
– Radial-Basis-style Kernel Function:
      K(a, b) = exp( −(a − b)² / (2σ²) )
– Neural-net-style Kernel Function:
      K(a, b) = tanh(κ a·b − δ)
σ, κ and δ are magic parameters that must be chosen by a model selection method such as CV or VCSRM.
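For completeness, a sketch assuming scikit-learn is installed (just one convenient implementation, not the one used in these slides): the kernels above map directly onto the kernel argument of sklearn.svm.SVC.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)   # a non-linear concept

for name, clf in [
    ("polynomial", SVC(kernel="poly", degree=2, coef0=1.0, C=1.0)),
    ("RBF",        SVC(kernel="rbf", gamma=0.5, C=1.0)),       # gamma = 1/(2 sigma^2)
    ("sigmoid",    SVC(kernel="sigmoid", coef0=-1.0, C=1.0)),  # tanh-style kernel
]:
    clf.fit(X, y)
    print(name, clf.score(X, y), len(clf.support_))   # training accuracy, #SVs
```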
SVM Implementations

• Sequential Minimal Optimization (SMO): an efficient implementation of SVMs, by Platt
– in Weka
• SVMlight
– http://svmlight.joachims.org/
References

• Tutorial on VC-dimension and Support


Vector Machines:
C. Burges. A tutorial on support vector machines for
pattern recognition. Data Mining and Knowledge
Discovery, 2(2):955-974, 1998.
http://citeseer.nj.nec.com/burges98tutorial.html
• The VC/SRM/SVM Bible:
Statistical Learning Theory by Vladimir Vapnik, Wiley-
Interscience; 1998
