
Support Vector Machines & Kernels

Lecture 6

David Sontag
New York University

Slides adapted from Luke Zettlemoyer, Carlos Guestrin, and Vibhav Gogate

SVMs in the dual

Primal: Solve for w, b.

Dual: Solve for the αj. The dual objective depends on the data only through the dot products xj . xk.

The dual is also a quadratic program, and can be efficiently solved to optimality.
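
In their standard (hard-margin) textbook form, which the slide's notation follows, the two programs can be written as:

   Primal:   min over w, b of   (1/2) ||w||^2
             subject to         yj (w.xj + b) >= 1   for all j

   Dual:     max over α >= 0 of   Σj αj  -  (1/2) Σj Σk αj αk yj yk (xj . xk)
             subject to           Σj αj yj = 0

(The soft-margin version with slack penalty C, mentioned later in the lecture, is analogous.)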
Support vectors

•  Complementary slackness conditions: at the optimum, either αj = 0 or point j lies exactly on the margin.

•  Support vectors: points xj such that yj(w.xj + b) = 1
   (includes all j such that αj > 0, but also additional points where αj = 0)

•  Note: the SVM dual solution may not be unique!
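
In standard KKT form, complementary slackness says that for every point j:

   αj [ yj (w.xj + b) - 1 ] = 0

so αj > 0 implies yj (w.xj + b) = 1 (the point lies exactly on the margin), and yj (w.xj + b) > 1 implies αj = 0.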


Dual SVM interpretation: Sparsity

[Figure: the separating hyperplane w.x + b = 0 together with the margin boundaries w.x + b = +1 and w.x + b = -1.]

Final solution tends to be sparse:
•  αj = 0 for most j
•  don’t need to store these points to compute w or make predictions

Non-support vectors:
•  αj = 0
•  moving them will not change w

Support vectors:
•  αj ≥ 0
Classification rule using dual solution

Using the dual solution, the prediction depends only on dot products of the feature vector of the new example with the support vectors.
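
Written out, substituting w = Σj αj yj xj from the dual optimality condition, the standard prediction rule is:

   ŷ(x) = sign( w.x + b ) = sign( Σj αj yj (xj . x) + b )

where the sum effectively runs only over the support vectors, since αj = 0 for all other points.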
SVM with kernels

•  Never compute features explicitly!!!
   –  Compute dot products in closed form

•  Predict with the dual rule, with each dot product computed via the kernel.

•  O(n²) time in the size of the dataset to compute the objective
   –  much work on speeding this up
[Slide shows a high-dimensional feature mapping φ(x) whose entries include the raw coordinates x(1), ..., cross terms such as x(1)x(3), and nonlinear terms such as e^x(1), together with the optimality condition ∂L/∂w = w − Σj αj yj xj = 0, i.e. w = Σj αj yj xj.]

Efficient dot-product of polynomials

Polynomials of degree exactly d:

d = 1:
   φ(u).φ(v) = (u1, u2) . (v1, v2)
             = u1 v1 + u2 v2
             = u.v

d = 2:
   φ(u).φ(v) = (u1^2, u1 u2, u2 u1, u2^2) . (v1^2, v1 v2, v2 v1, v2^2)
             = u1^2 v1^2 + 2 u1 v1 u2 v2 + u2^2 v2^2
             = (u1 v1 + u2 v2)^2
             = (u.v)^2

For any d (we will skip the proof):   φ(u).φ(v) = (u.v)^d
•  Cool! Taking a dot product and exponentiating gives the same result as mapping into the high-dimensional space and then taking the dot product.
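
As a quick numerical check of the d = 2 case (a small sketch, not from the slides; it assumes Python with numpy):

import numpy as np

def phi_degree2(x):
    # Explicit degree-2 feature map for a 2-d input: all ordered monomials, as in the derivation above.
    x1, x2 = x
    return np.array([x1 * x1, x1 * x2, x2 * x1, x2 * x2])

u = np.array([1.0, 2.0])
v = np.array([0.5, -3.0])

explicit = phi_degree2(u) @ phi_degree2(v)   # dot product in feature space
kernel = (u @ v) ** 2                        # kernel computed in the original space

print(explicit, kernel)                      # both print 30.25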
Quadratic kernel

[Tommi Jaakkola]
Quadratic kernel

Feature mapping given by:

[Cynthia Rudin]
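
One standard feature mapping for a 2-dimensional input, corresponding to the quadratic kernel K(u, v) = (u.v + 1)^2 (the exact convention on the slide may differ slightly), is:

   φ(x) = ( 1, √2 x1, √2 x2, x1^2, x2^2, √2 x1 x2 )

so that   φ(u).φ(v) = 1 + 2 u.v + (u.v)^2 = (u.v + 1)^2.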
Common kernels
•  Polynomials of degree exactly d

•  Polynomials of degree up to d

•  Gaussian kernels (the exponent uses the Euclidean distance between the two points, squared)

•  And many others: very active area of research!


(e.g., structured kernels that use dynamic programming
to evaluate)
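
In their usual written form (one common parameterization; the slides may use a slightly different convention for the Gaussian width):

   Polynomials of degree exactly d:   K(u, v) = (u.v)^d
   Polynomials of degree up to d:     K(u, v) = (u.v + 1)^d
   Gaussian kernel:                   K(u, v) = exp( - ||u - v||^2 / (2 σ^2) )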
Gaussian kernel

[Cynthia Rudin] [mblondel.org]


Kernel algebra

Q: How would you prove that the “Gaussian kernel” is a valid kernel?

A: Expand the squared Euclidean norm in the exponent. Then apply (e) from above. To see that the remaining factor is a kernel, use the Taylor series expansion of the exponential, together with repeated application of (a), (b), and (c).

The feature mapping is infinite dimensional!
[Justin Domke]
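
Concretely (using the same parameterization as above), the expansion is:

   exp( - ||u - v||^2 / (2 σ^2) )
      = exp( - ||u||^2 / (2 σ^2) ) · exp( - ||v||^2 / (2 σ^2) ) · exp( (u.v) / σ^2 )

The first two factors have the form f(u)·f(v), which is presumably where rule (e) applies, and the last factor expands as the Taylor series

   exp( (u.v) / σ^2 ) = Σ_{k ≥ 0} (u.v)^k / (σ^{2k} k!)

a non-negative weighted sum of polynomial kernels, handled by repeated application of (a), (b), and (c).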
Overfitting?

•  Huge feature space with kernels: should we worry about overfitting?
–  SVM objective seeks a solution with large margin
•  Theory says that large margin leads to good generalization
(we will see this in a couple of lectures)
–  But everything overfits sometimes!!!
–  Can control by:
•  Setting C
•  Choosing a better kernel
•  Varying parameters of the kernel (width of Gaussian, etc.)
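
For concreteness, a minimal sketch of these three controls using scikit-learn (the library, the synthetic data, and the parameter values are illustrative assumptions, not part of the lecture):

from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic data just to make the example self-contained.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

clf = SVC(
    C=1.0,          # smaller C -> more slack allowed, stronger regularization
    kernel="rbf",   # choice of kernel ("linear", "poly", "rbf", ...)
    gamma=0.1,      # kernel parameter; for the Gaussian kernel, gamma = 1 / (2 * sigma^2)
)
clf.fit(X, y)
print(clf.n_support_)   # number of support vectors per class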
Software

•  SVMlight: one of the most widely used SVM packages. Fast optimization, can handle very large datasets, C++ code.

•  LIBSVM
•  Both of these handle multi-class, weighted SVM for
unbalanced data, etc.

•  There are several new approaches to solving the SVM objective that can be much faster:
–  Stochastic subgradient method (discussed in a few lectures)

–  Distributed computation (also to be discussed)

•  See http://mloss.org, “machine learning open source software”


Machine learning methodology:
Cross Validation
Choosing among several hypotheses

•  Suppose you are choosing between several different hypotheses, e.g. different regression fits

•  For the SVM, we get one linear classifier for each choice of the regularization parameter C

•  How do you choose between them?


General strategy
Split the data up into three parts: a training set, a validation set, and a test set.

This assumes that the available data is randomly allocated to these three parts, e.g. a 60/20/20 split.
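
A minimal sketch of such a split (the 60/20/20 fractions follow the slide; the numpy-based data and code are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: 100 examples with 5 features and binary labels.
X = rng.normal(size=(100, 5))
y = rng.integers(0, 2, size=100)

# Randomly allocate the data into ~60/20/20 train/validation/test parts.
perm = rng.permutation(len(X))
n_train, n_val = int(0.6 * len(X)), int(0.2 * len(X))
train_idx = perm[:n_train]
val_idx = perm[n_train:n_train + n_val]
test_idx = perm[n_train + n_val:]

X_train, y_train = X[train_idx], y[train_idx]
X_val, y_val = X[val_idx], y[val_idx]
X_test, y_test = X[test_idx], y[test_idx]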
Typical approach

•  Learn a model from the training set (e.g., fix a C and learn the SVM). This is the model you learned.

•  Estimate your future performance with the validation data.
More data is better

With more data you can learn better.

Figure legend: Blue: observed data; Red: predicted curve; Green: true distribution.

Compare the predicted curves.

Cross Validation

Recycle the data!

Use (almost) all of the data for training:

LOOCV (Leave-one-out Cross Validation)

Let's say we have N data points, indexed by k = 1...N.

Let (xk, yk) be the kth example.

Temporarily remove (xk, yk) from the dataset; it becomes your single test data point.

Train on the remaining N-1 data points.

Test your error on (xk, yk).

Do this for each k = 1...N and report the average error.

Once the best parameters (e.g., choice of C for the SVM) are found, re-train
using all of the training data
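
A small illustrative sketch of the LOOCV loop described above (scikit-learn's SVC and the synthetic data are assumptions for the sake of a runnable example):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic data just for illustration.
X, y = make_classification(n_samples=40, n_features=5, random_state=0)
N = len(X)

errors = []
for k in range(N):                                    # for each k = 1..N
    mask = np.arange(N) != k                          # temporarily remove (x_k, y_k)
    clf = SVC(kernel="linear", C=1.0)
    clf.fit(X[mask], y[mask])                         # train on the remaining N-1 points
    errors.append(clf.predict(X[k:k+1])[0] != y[k])   # test on (x_k, y_k)

print("LOOCV error:", np.mean(errors))                # report the average error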
LOOCV (Leave-one-out Cross Validation)

There are N data points.
Repeat learning N times.

Notice that the test data (shown in red in the figure) changes each time.
K-fold cross validation

[Figure: the data is divided into k splits; in each run, train on (k - 1) splits and validate/test on the remaining split.]

In 3-fold cross validation, there are 3 runs.
In 5-fold cross validation, there are 5 runs.
In 10-fold cross validation, there are 10 runs.
The error is averaged over all runs.
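
A minimal k-fold sketch (again, scikit-learn and the synthetic data are illustrative assumptions):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.svm import SVC

# Synthetic data for illustration only.
X, y = make_classification(n_samples=90, n_features=5, random_state=0)

kf = KFold(n_splits=3, shuffle=True, random_state=0)   # 3-fold -> 3 runs
errors = []
for train_idx, val_idx in kf.split(X):
    clf = SVC(kernel="rbf", C=1.0).fit(X[train_idx], y[train_idx])
    errors.append(np.mean(clf.predict(X[val_idx]) != y[val_idx]))

print("average error over all runs:", np.mean(errors))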
