
Support Vector Machines & Kernels

Lecture 5

David Sontag
New York University

Slides adapted from Luke Zettlemoyer and Carlos Guestrin


Multi-class SVM
As for the binary SVM, we introduce slack variables and maximize the margin:

To predict, we use:  ŷ = argmax_y ( w_y · x + b_y )

Now, how can we learn it?
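The prediction rule picks the class with the highest score w_y · x + b_y. A minimal numpy sketch, with illustrative (not learned) weights:

```python
import numpy as np

# Multi-class linear SVM prediction: one weight vector w_y (and bias b_y)
# per class; predict the class whose score w_y . x + b_y is largest.
# W and b here are illustrative values, not learned.
W = np.array([[1.0, 0.0],     # class 0
              [0.0, 1.0],     # class 1
              [-1.0, -1.0]])  # class 2
b = np.array([0.0, 0.0, 0.5])

def predict(x):
    scores = W @ x + b   # one score per class
    return int(np.argmax(scores))

print(predict(np.array([2.0, -1.0])))  # class 0 scores highest -> 0
```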


How to deal with imbalanced data?

•  In many practical applications we may have imbalanced data sets
•  We may want errors to be equally distributed between the positive and negative classes
•  A slight modification to the SVM objective does the trick!

Class-specific weighting of the slack variables
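One way to realize class-specific slack weights in practice: scikit-learn's SVC exposes this as `class_weight`, which scales C separately per class. A sketch on illustrative imbalanced toy data:

```python
import numpy as np
from sklearn.svm import SVC

# Imbalanced toy data: 90 negatives, 10 positives (illustrative).
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(90, 2) - 2, rng.randn(10, 2) + 2])
y = np.array([-1] * 90 + [1] * 10)

# class_weight scales C per class: errors on the rare positive class
# cost 9x as much, i.e. C+ = 9C and C- = C in the objective.
clf = SVC(kernel="linear", C=1.0, class_weight={-1: 1, 1: 9}).fit(X, y)
print(clf.score(X, y))
```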


What’s Next!
•  Learn one of the most interesting and exciting recent advancements in machine learning
–  The “kernel trick”
–  High-dimensional feature spaces at no extra cost!
•  But first, a detour
–  Constrained optimization!
Constrained optimization

Minimizing x² under different constraints:

No constraint:  x* = 0
x ≥ −1:         x* = 0
x ≥ 1:          x* = 1

How do we solve with constraints?

Lagrange Multipliers!!!

Lagrange multipliers – Dual variables

Rewrite the constraint:  x − b ≥ 0
Add a Lagrange multiplier:  α ≥ 0
Introduce the Lagrangian (objective):

    L(x, α) = x² − α(x − b)

We will solve:  min_x max_{α ≥ 0} L(x, α)

Why is this equivalent?
•  min is fighting max!
   x < b  ⟹  (x − b) < 0  ⟹  max_{α ≥ 0} −α(x − b) = ∞
•  min won’t let this happen!
   x > b, α ≥ 0  ⟹  (x − b) > 0  ⟹  max_{α ≥ 0} −α(x − b) = 0, with α* = 0
•  min is cool with 0, and L(x, α) = x² (original objective)
   x = b  ⟹  α can be anything, and L(x, α) = x² (original objective)

The min on the outside forces max to behave, so constraints will be satisfied.
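A numeric sanity check of the min-max construction for min x² s.t. x ≥ b, with the inner max over α truncated to a finite grid (a sketch, not a general solver):

```python
import numpy as np

# min_x x^2 subject to x >= b, via the saddle point of
# L(x, a) = x^2 - a*(x - b), a >= 0. The inner max over a is
# truncated to [0, 100], so "infinity" shows up as a large penalty.
b = 1.0
xs = np.linspace(-2, 2, 401)
alphas = np.linspace(0, 100, 1001)

# For each x, take max over a of L(x, a); infeasible x (< b) get penalized.
inner_max = np.array([np.max(x**2 - alphas * (x - b)) for x in xs])
x_star = xs[np.argmin(inner_max)]
print(x_star)  # ~1.0: the constrained optimum x* = b
```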
Dual SVM derivation (1) – the linearly separable case

Original optimization problem:

    min_{w,b} ½‖w‖²   s.t.  y_j (w · x_j + b) ≥ 1  for all j

Rewrite the constraints as y_j (w · x_j + b) − 1 ≥ 0, with one Lagrange multiplier α_j ≥ 0 per example.

Lagrangian:

    L(w, b, α) = ½‖w‖² − Σ_j α_j [ y_j (w · x_j + b) − 1 ]

Our goal now is to solve:

    min_{w,b} max_{α ≥ 0} L(w, b, α)

Dual SVM derivation (2) – the linearly separable case

(Primal)    min_{w,b} max_{α ≥ 0} L(w, b, α)

    Swap min and max

(Dual)      max_{α ≥ 0} min_{w,b} L(w, b, α)

Slater’s condition from convex optimization guarantees that these two optimization problems are equivalent!
 
Dual SVM derivation (3) – the linearly separable case

(Dual)      max_{α ≥ 0} min_{w,b} L(w, b, α)

Can solve for optimal w, b as a function of α:

    ∂L/∂w = w − Σ_j α_j y_j x_j = 0   ⟹   w = Σ_j α_j y_j x_j
    ∂L/∂b = −Σ_j α_j y_j = 0          ⟹   Σ_j α_j y_j = 0

Substituting these values back in (and simplifying), we obtain:

(Dual)      max_{α ≥ 0}  Σ_j α_j − ½ Σ_j Σ_k α_j α_k y_j y_k (x_j · x_k)
            s.t.  Σ_j α_j y_j = 0

The sums are over all training examples; the α_j α_k y_j y_k are scalars, and x_j · x_k is a dot product.
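We can check w = Σ_j α_j y_j x_j numerically: scikit-learn's SVC stores α_j y_j for the support vectors in `dual_coef_`, so the sum reduces to a single matrix product (toy data below is illustrative):

```python
import numpy as np
from sklearn.svm import SVC

# Tiny separable dataset (illustrative).
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0], [3.0, 3.0], [4.0, 4.0]])
y = np.array([-1, -1, -1, 1, 1])
clf = SVC(kernel="linear", C=1e6).fit(X, y)  # large C ~ hard margin

# dual_coef_ holds alpha_j * y_j for the support vectors, so
# w = sum_j alpha_j y_j x_j is a single matrix product.
w_from_dual = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w_from_dual, clf.coef_))  # True
```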


 
So, in dual formulation we will solve for α directly!


•  w and b are computed from α (if needed)
Dual SVM derivation (3) – the linearly separable case

Lagrangian:

    L(w, b, α) = ½‖w‖² − Σ_j α_j [ y_j (w · x_j + b) − 1 ]

α_j > 0 for some j implies the constraint

    y_j (w · x_j + b) ≥ 1

is tight. We use this to obtain b:

(1)  y_j (w · x_j + b) = 1
(2)  y_j ( Σ_k α_k y_k (x_k · x_j) + b ) = 1
(3)  b = y_j − Σ_k α_k y_k (x_k · x_j)
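The recovery of b can be checked numerically: pick a support vector (α_j > 0) and apply b = y_j − w · x_j (toy data illustrative; with large C the margin is effectively hard):

```python
import numpy as np
from sklearn.svm import SVC

# Toy data on a line (illustrative); large C ~ hard margin.
X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [4.0, 0.0]])
y = np.array([-1, -1, 1, 1])
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# A support vector j (alpha_j > 0) lies on the margin, so
# y_j (w . x_j + b) = 1  =>  b = y_j - w . x_j.
w = clf.coef_[0]
j = clf.support_[0]          # index of one support vector
b = y[j] - w @ X[j]
print(b, clf.intercept_[0])  # the two should agree (~ -2 here)
```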
Dual for the non-separable case – same basic story (we will skip details)

Primal: solve for w, b, ξ:

    min_{w,b,ξ} ½‖w‖² + C Σ_j ξ_j   s.t.  y_j (w · x_j + b) ≥ 1 − ξ_j,  ξ_j ≥ 0

Dual:

    max_α  Σ_j α_j − ½ Σ_j Σ_k α_j α_k y_j y_k (x_j · x_k)
    s.t.  Σ_j α_j y_j = 0,   0 ≤ α_j ≤ C

What changed?
•  Added upper bound of C on α_j!
•  Intuitive explanation:
   •  Without slack, α_j → ∞ when constraints are violated (points misclassified)
   •  Upper bound of C limits the α_j, so misclassifications are allowed
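The box constraint 0 ≤ α_i ≤ C is easy to verify on overlapping data, since |dual_coef_| in scikit-learn equals α_i for the support vectors (data below is illustrative):

```python
import numpy as np
from sklearn.svm import SVC

# Overlapping classes (illustrative): slack is needed, and many
# misclassified points push their alpha_i to the bound C.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) - 0.5, rng.randn(50, 2) + 0.5])
y = np.array([-1] * 50 + [1] * 50)

C = 0.1
clf = SVC(kernel="linear", C=C).fit(X, y)

# |dual_coef_| = alpha_i for the support vectors: all within (0, C].
alphas = np.abs(clf.dual_coef_[0])
print(alphas.min() > 0, alphas.max() <= C + 1e-9)  # True True
```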
Wait a minute: why did we learn about the dual
SVM?

•  There are some quadratic programming algorithms that can solve the dual faster than the primal
   –  At least for small datasets

•  But, more importantly, the “kernel trick”!!!


Reminder: What if the data is not
linearly separable?
Use features of features of features of features….

    φ(x) = [ x(1), …, x(n), x(1)x(2), x(1)x(3), …, e^{x(1)}, … ]ᵀ

Feature space can get really large really quickly!


Higher order polynomials

[Plot: number of monomial terms vs. number of input dimensions, for degrees d = 2, 3, 4]

m – input features
d – degree of polynomial

The number of terms grows fast! For d = 6, m = 100: about 1.6 billion terms.
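The count on the slide can be reproduced: the number of distinct monomials of degree exactly d in m variables is C(m + d − 1, d):

```python
from math import comb

# Distinct monomials of degree exactly d in m variables: C(m + d - 1, d).
def n_monomials(m, d):
    return comb(m + d - 1, d)

print(n_monomials(100, 6))  # 1609344100 -- about 1.6 billion, as on the slide
```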
Dual formulation only depends on dot-products, not on w!

First, we introduce features (remember: the examples x only appear in one dot product):

    max_{α ≥ 0}  Σ_j α_j − ½ Σ_j Σ_k α_j α_k y_j y_k ( φ(x_j) · φ(x_k) )

Next, replace the dot product with a kernel:

    K(x_j, x_k) = φ(x_j) · φ(x_k)

Why is this useful???



Efficient dot-product of polynomials

Polynomials of degree exactly d:

d = 1:
    φ(u) · φ(v) = (u1, u2) · (v1, v2) = u1 v1 + u2 v2 = u · v

d = 2:
    φ(u) · φ(v) = (u1², u1 u2, u2 u1, u2²) · (v1², v1 v2, v2 v1, v2²)
                = u1² v1² + 2 u1 v1 u2 v2 + u2² v2²
                = (u1 v1 + u2 v2)²
                = (u · v)²

For any d (we will skip the proof):

    φ(u) · φ(v) = (u · v)^d
•  Cool! Taking a dot product and exponentiating gives the same result as mapping into the high-dimensional space and then taking the dot product
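A quick numeric check of the d = 2 case with the explicit four-component feature map:

```python
import numpy as np

# Explicit degree-2 feature map for 2-D inputs, as on the slide:
# phi(u) = (u1^2, u1 u2, u2 u1, u2^2)
def phi(u):
    return np.array([u[0]**2, u[0]*u[1], u[1]*u[0], u[1]**2])

u = np.array([1.0, 2.0])
v = np.array([3.0, -1.0])

explicit = phi(u) @ phi(v)  # dot product in the 4-D feature space
kernel = (u @ v) ** 2       # same value, computed in the 2-D input space
print(explicit, kernel)     # 1.0 1.0
```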
Finally: the “kernel trick”!

•  Never compute features explicitly!!!
   –  Compute dot products in closed form
•  Constant-time high-dimensional dot-products for many classes of features
•  But, O(n²) time in size of dataset to compute the objective
   –  Naïve implementations are slow
   –  Much work on speeding this up
Common kernels

•  Polynomials of degree exactly d:  K(u, v) = (u · v)^d

•  Polynomials of degree up to d:  K(u, v) = (1 + u · v)^d

•  Gaussian kernels:  K(u, v) = exp( −‖u − v‖² / 2σ² )

•  Sigmoid:  K(u, v) = tanh( η (u · v) + ν )

•  And many others: very active area of research!
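Sketches of these common kernels as plain functions (σ, η, ν are kernel hyperparameters; the demo values are arbitrary):

```python
import numpy as np

# The common kernels listed above; sigma, eta, nu are hyperparameters.
def poly_exact(u, v, d):
    return (u @ v) ** d                # polynomials of degree exactly d

def poly_up_to(u, v, d):
    return (1 + u @ v) ** d            # polynomials of degree up to d

def gaussian(u, v, sigma):
    return np.exp(-np.sum((u - v) ** 2) / (2 * sigma ** 2))

def sigmoid(u, v, eta, nu):
    return np.tanh(eta * (u @ v) + nu)

u, v = np.array([1.0, 0.0]), np.array([1.0, 0.0])
print(gaussian(u, v, sigma=1.0))  # identical points -> 1.0
```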