
Foundations of Machine Learning

Support Vector Machines

Mehryar Mohri
Courant Institute and Google Research
mohri@cims.nyu.edu
Binary Classification Problem
Training data: sample drawn i.i.d. from set X \subseteq \mathbb{R}^N according to some distribution D,

    S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (X \times \{-1, +1\})^m .

Problem: find hypothesis h : X \to \{-1, +1\} in H (classifier) with small generalization error R(h).
• choice of hypothesis set H: learning guarantees of previous lecture.
• linear classification (hyperplanes) if dimension N is not too large.

Mehryar Mohri - Foundations of Machine Learning page 2


This Lecture
Support Vector Machines - separable case
Support Vector Machines - non-separable case
Margin guarantees

Mehryar Mohri - Foundations of Machine Learning page 3


Linear Separation

[Figure: linearly separable sample, separating hyperplane w · x + b = 0, and its margin.]

• classifiers: H = \{x \mapsto \mathrm{sgn}(w \cdot x + b) : w \in \mathbb{R}^N, b \in \mathbb{R}\}.

• geometric margin: \rho = \min_{i \in [1,m]} \dfrac{|w \cdot x_i + b|}{\|w\|} .

• which separating hyperplane?


Mehryar Mohri - Foundations of Machine Learning page 4
Optimal Hyperplane: Max. Margin
(Vapnik and Chervonenkis, 1965)

[Figure: maximum-margin hyperplane w · x + b = 0 with marginal hyperplanes w · x + b = +1 and w · x + b = -1.]

    \rho = \max_{w,b \,:\, y_i(w \cdot x_i + b) \geq 0} \ \min_{i \in [1,m]} \frac{|w \cdot x_i + b|}{\|w\|} .

Mehryar Mohri - Foundations of Machine Learning page 5


Maximum Margin
    \rho = \max_{w,b \,:\, y_i(w \cdot x_i + b) \geq 0} \ \min_{i \in [1,m]} \frac{|w \cdot x_i + b|}{\|w\|}

      = \max_{\substack{w,b \,:\, y_i(w \cdot x_i + b) \geq 0 \\ \min_{i \in [1,m]} |w \cdot x_i + b| = 1}} \ \min_{i \in [1,m]} \frac{|w \cdot x_i + b|}{\|w\|}        (scale-invariance)

      = \max_{\substack{w,b \,:\, y_i(w \cdot x_i + b) \geq 0 \\ \min_{i \in [1,m]} |w \cdot x_i + b| = 1}} \frac{1}{\|w\|}

      = \max_{w,b \,:\, y_i(w \cdot x_i + b) \geq 1} \frac{1}{\|w\|} .        (min. reached)

Mehryar Mohri - Foundations of Machine Learning page 6


Optimization Problem
Constrained optimization:
    \min_{w, b} \ \frac{1}{2} \|w\|^2

subject to: y_i(w \cdot x_i + b) \geq 1, \ \forall i \in [1, m].


Properties:
• Convex optimization.
• Unique solution for linearly separable sample.

Mehryar Mohri - Foundations of Machine Learning page 7
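
Illustration: a minimal sketch, assuming the cvxpy package and a hypothetical linearly separable toy sample, of how this primal QP can be handed to a generic convex solver.

# Sketch: hard-margin primal SVM as a generic QP (assumes cvxpy is available).
import numpy as np
import cvxpy as cp

# Hypothetical linearly separable toy sample.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m, N = X.shape

w = cp.Variable(N)
b = cp.Variable()

# min (1/2)||w||^2  subject to  y_i (w . x_i + b) >= 1 for all i.
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                     [cp.multiply(y, X @ w + b) >= 1])
problem.solve()

print("w =", w.value, " b =", b.value)
print("geometric margin 1/||w|| =", 1.0 / np.linalg.norm(w.value))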


Optimal Hyperplane Equations
Lagrangian: for all w, b, \alpha_i \geq 0,

    L(w, b, \alpha) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{m} \alpha_i \, [y_i(w \cdot x_i + b) - 1].

KKT conditions:

    \nabla_w L = w - \sum_{i=1}^{m} \alpha_i y_i x_i = 0 \ \iff \ w = \sum_{i=1}^{m} \alpha_i y_i x_i .

    \nabla_b L = -\sum_{i=1}^{m} \alpha_i y_i = 0 \ \iff \ \sum_{i=1}^{m} \alpha_i y_i = 0 .

    \forall i \in [1, m], \ \alpha_i \, [y_i(w \cdot x_i + b) - 1] = 0 .

Mehryar Mohri - Foundations of Machine Learning page 8


Support Vectors
Complementarity conditions:

    \alpha_i \, [y_i(w \cdot x_i + b) - 1] = 0 \ \implies \ \alpha_i = 0 \ \lor \ y_i(w \cdot x_i + b) = 1.

Support vectors: vectors x_i such that

    \alpha_i \neq 0 \ \land \ y_i(w \cdot x_i + b) = 1.

• Note: support vectors are not unique.

Mehryar Mohri - Foundations of Machine Learning page 9


Moving to The Dual
Plugging in the expression of w in L gives:
    L = \underbrace{\frac{1}{2} \Big\| \sum_{i=1}^{m} \alpha_i y_i x_i \Big\|^2 - \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)}_{-\frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)} \ - \underbrace{\sum_{i=1}^{m} \alpha_i y_i b}_{0} \ + \sum_{i=1}^{m} \alpha_i .

Thus,

    L = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j).

Mehryar Mohri - Foundations of Machine Learning page 10


Equivalent Dual Opt. Problem
Constrained optimization:
    \max_{\alpha} \ \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)

subject to: \alpha_i \geq 0 \ \land \ \sum_{i=1}^{m} \alpha_i y_i = 0, \ \forall i \in [1, m].

Solution:

    h(x) = \mathrm{sgn}\Big( \sum_{i=1}^{m} \alpha_i y_i (x_i \cdot x) + b \Big),

with b = y_i - \sum_{j=1}^{m} \alpha_j y_j (x_j \cdot x_i) for any SV x_i.

Mehryar Mohri - Foundations of Machine Learning page 11
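
Illustration: a sketch of the dual under the same assumptions (cvxpy, hypothetical toy data). The quadratic term equals \| \sum_i \alpha_i y_i x_i \|^2, which keeps the objective concave without forming the Gram matrix explicitly.

# Sketch: dual of the hard-margin SVM (assumes cvxpy; hypothetical toy data).
import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m = len(y)

alpha = cp.Variable(m)
# max  sum_i alpha_i - (1/2) || sum_i alpha_i y_i x_i ||^2
objective = cp.Maximize(cp.sum(alpha)
                        - 0.5 * cp.sum_squares(X.T @ cp.multiply(alpha, y)))
cp.Problem(objective, [alpha >= 0, y @ alpha == 0]).solve()

a = alpha.value
w = X.T @ (a * y)                      # w = sum_i alpha_i y_i x_i
sv = int(np.argmax(a))                 # index of a support vector (largest alpha_i)
b = y[sv] - w @ X[sv]                  # b = y_i - sum_j alpha_j y_j (x_j . x_i)
h = lambda x: np.sign(w @ x + b)
print("alpha =", np.round(a, 4), " w =", w, " b =", b)

At the optimum, the w and b recovered this way coincide with the primal solution of the previous sketch (strong duality for this convex QP).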


Leave-One-Out Error
Definition: let hS be the hypothesis output by
learning algorithm L after receiving sample S of
size m. Then, the leave-one-out error of L over S
is:

    \widehat{R}_{\mathrm{loo}}(L) = \frac{1}{m} \sum_{i=1}^{m} 1_{h_{S - \{x_i\}}(x_i) \neq f(x_i)} .

Property: unbiased estimate of the expected error of a hypothesis trained on a sample of size m - 1:

    \mathop{\mathrm{E}}_{S \sim D^m}\big[\widehat{R}_{\mathrm{loo}}(L)\big] = \frac{1}{m} \sum_{i=1}^{m} \mathop{\mathrm{E}}_{S}\big[1_{h_{S - \{x_i\}}(x_i) \neq f(x_i)}\big] = \mathop{\mathrm{E}}_{S}\big[1_{h_{S - \{x\}}(x) \neq f(x)}\big]

      = \mathop{\mathrm{E}}_{S \sim D^{m-1}}\Big[ \mathop{\mathrm{E}}_{x \sim D}\big[1_{h_S(x) \neq f(x)}\big] \Big] = \mathop{\mathrm{E}}_{S \sim D^{m-1}}\big[R(h_S)\big].
Mehryar Mohri - Foundations of Machine Learning page 12
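
Illustration: a direct sketch of this estimate, assuming a hypothetical fit callable that trains on a sample and returns a predictor. Computing it requires m retrainings, one per held-out point.

# Sketch: leave-one-out error of a learning algorithm (hypothetical `fit` interface).
import numpy as np

def loo_error(fit, X, y):
    """R_loo = (1/m) * sum_i 1[h_{S - {x_i}}(x_i) != y_i]."""
    m = len(y)
    mistakes = 0
    for i in range(m):
        keep = np.arange(m) != i
        h = fit(X[keep], y[keep])          # hypothesis trained on S minus the i-th point
        mistakes += int(h(X[i]) != y[i])
    return mistakes / m
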
Leave-One-Out Analysis
Theorem: let hS be the optimal hyperplane for a
sample S and let NSV (S) be the number of support
vectors defining hS. Then,
    \mathop{\mathrm{E}}_{S \sim D^m}\big[R(h_S)\big] \leq \mathop{\mathrm{E}}_{S \sim D^{m+1}}\Big[\frac{N_{SV}(S)}{m+1}\Big].

Proof: Let S \sim D^{m+1} be a linearly separable sample and let x \in S. If h_{S - \{x\}} misclassifies x, then x must be a SV for h_S. Thus,

    \widehat{R}_{\mathrm{loo}}(\text{opt.-hyp.}) \leq \frac{N_{SV}(S)}{m+1}.

Mehryar Mohri - Foundations of Machine Learning page 13


Notes
Bound on expectation of error only, not the
probability of error.
Argument based on sparsity (number of support vectors). We will later see other arguments in support of the optimal hyperplane, based on the concept of margin.

Mehryar Mohri - Foundations of Machine Learning page 14


This Lecture
Support Vector Machines - separable case
Support Vector Machines - non-separable case
Margin guarantees

Mehryar Mohri - Foundations of Machine Learning page 15


Support Vector Machines
(Cortes and Vapnik, 1995)
Problem: data often not linearly separable in practice. For any hyperplane, there exists x_i such that

    y_i \, [w \cdot x_i + b] \not\geq 1 .

Idea: relax the constraints using slack variables \xi_i \geq 0:

    y_i \, [w \cdot x_i + b] \geq 1 - \xi_i .

Mehryar Mohri - Foundations of Machine Learning page 16


Soft-Margin Hyperplanes
[Figure: soft-margin hyperplane w · x + b = 0 with marginal hyperplanes w · x + b = +1 and w · x + b = -1; slack variables \xi_i, \xi_j measure the margin violations of outliers.]

Support vectors: points along the margin or outliers.


Soft margin: \rho = 1/\|w\| .

Mehryar Mohri - Foundations of Machine Learning page 17


Optimization Problem
(Cortes and Vapnik, 1995)
Constrained optimization:
    \min_{w, b, \xi} \ \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{m} \xi_i

subject to: y_i(w \cdot x_i + b) \geq 1 - \xi_i \ \land \ \xi_i \geq 0, \ \forall i \in [1, m].

Properties:
• C \geq 0 trade-off parameter.
• Convex optimization.
• Unique solution.

Mehryar Mohri - Foundations of Machine Learning page 18
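
Illustration: the same solver-based sketch adapted to the soft-margin primal (assuming cvxpy and hypothetical toy data).

# Sketch: soft-margin primal SVM as a QP (assumes cvxpy; hypothetical toy data).
import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [1.5, 1.0], [0.5, -0.5], [-1.0, -1.5], [-2.0, -2.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0])
m, N = X.shape
C = 1.0                                    # trade-off parameter

w, b, xi = cp.Variable(N), cp.Variable(), cp.Variable(m)
objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
constraints = [cp.multiply(y, X @ w + b) >= 1 - xi, xi >= 0]
cp.Problem(objective, constraints).solve()

print("w =", w.value, " b =", b.value)
print("slack variables =", np.round(xi.value, 4))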


Notes
Parameter C : trade-off between maximizing margin
and minimizing training error. How do we
determine C ?
The general problem of determining a hyperplane
minimizing the error on the training set is NP-
complete (as a function of the dimension).
Other convex functions of the slack variables
could be used: this choice and a similar one with
squared slack variables lead to a convenient
formulation and solution.

Mehryar Mohri - Foundations of Machine Learning page 19


SVM - Equivalent Problem
Optimization:
    \min_{w, b} \ \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{m} \big(1 - y_i(w \cdot x_i + b)\big)_{+} .

Loss functions:
• hinge loss: L(h(x), y) = (1 - y h(x))_{+} .
• quadratic hinge loss: L(h(x), y) = (1 - y h(x))_{+}^2 .

Mehryar Mohri - Foundations of Machine Learning page 20
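
Illustration: the two losses and the unconstrained objective above written out in plain numpy.

# Sketch: hinge loss, quadratic hinge loss, and the unconstrained SVM objective.
import numpy as np

def hinge(y, scores):
    """(1 - y h(x))_+ , elementwise."""
    return np.maximum(0.0, 1.0 - y * scores)

def quadratic_hinge(y, scores):
    """(1 - y h(x))_+^2 , elementwise."""
    return hinge(y, scores) ** 2

def svm_objective(w, b, X, y, C):
    """(1/2)||w||^2 + C * sum_i (1 - y_i (w . x_i + b))_+ ."""
    return 0.5 * float(w @ w) + C * hinge(y, X @ w + b).sum()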


Hinge Loss
[Figure: the 0/1 loss, the hinge loss, and the 'quadratic' hinge loss as functions of yh(x).]

Mehryar Mohri - Foundations of Machine Learning page 21


SVMs Equations
Lagrangian: for all w, b, \alpha_i \geq 0, \beta_i \geq 0,

    L(w, b, \xi, \alpha, \beta) = \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{m} \xi_i - \sum_{i=1}^{m} \alpha_i \, [y_i(w \cdot x_i + b) - 1 + \xi_i] - \sum_{i=1}^{m} \beta_i \xi_i .

KKT conditions:

    \nabla_w L = w - \sum_{i=1}^{m} \alpha_i y_i x_i = 0 \ \iff \ w = \sum_{i=1}^{m} \alpha_i y_i x_i .

    \nabla_b L = -\sum_{i=1}^{m} \alpha_i y_i = 0 \ \iff \ \sum_{i=1}^{m} \alpha_i y_i = 0 .

    \nabla_{\xi_i} L = C - \alpha_i - \beta_i = 0 \ \iff \ \alpha_i + \beta_i = C .

    \forall i \in [1, m], \ \alpha_i \, [y_i(w \cdot x_i + b) - 1 + \xi_i] = 0 \ \land \ \beta_i \xi_i = 0 .
Mehryar Mohri - Foundations of Machine Learning page 22
Support Vectors
Complementarity conditions:

    \alpha_i \, [y_i(w \cdot x_i + b) - 1 + \xi_i] = 0 \ \implies \ \alpha_i = 0 \ \lor \ y_i(w \cdot x_i + b) = 1 - \xi_i .

Support vectors: vectors x_i such that

    \alpha_i \neq 0 \ \land \ y_i(w \cdot x_i + b) = 1 - \xi_i .

• Note: support vectors are not unique.

Mehryar Mohri - Foundations of Machine Learning page 23


Moving to The Dual
Plugging in the expression of w in L gives:
    L = \underbrace{\frac{1}{2} \Big\| \sum_{i=1}^{m} \alpha_i y_i x_i \Big\|^2 - \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)}_{-\frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)} \ - \underbrace{\sum_{i=1}^{m} \alpha_i y_i b}_{0} \ + \sum_{i=1}^{m} \alpha_i .

Thus,

    L = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j).

The condition \beta_i \geq 0 is equivalent to \alpha_i \leq C.

Mehryar Mohri - Foundations of Machine Learning page 24


Dual Optimization Problem
Constrained optimization:
    \max_{\alpha} \ \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)

subject to: 0 \leq \alpha_i \leq C \ \land \ \sum_{i=1}^{m} \alpha_i y_i = 0, \ \forall i \in [1, m].

Solution:

    h(x) = \mathrm{sgn}\Big( \sum_{i=1}^{m} \alpha_i y_i (x_i \cdot x) + b \Big),

with b = y_i - \sum_{j=1}^{m} \alpha_j y_j (x_j \cdot x_i) for any x_i with 0 < \alpha_i < C.
Mehryar Mohri - Foundations of Machine Learning page 25
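
Illustration: a sketch of the box-constrained dual (assuming cvxpy and hypothetical toy data); b is read off a point with 0 < \alpha_i < C, assumed to exist for this sample.

# Sketch: box-constrained dual of the soft-margin SVM (assumes cvxpy; toy data).
import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m, C = len(y), 10.0

alpha = cp.Variable(m)
objective = cp.Maximize(cp.sum(alpha)
                        - 0.5 * cp.sum_squares(X.T @ cp.multiply(alpha, y)))
cp.Problem(objective, [alpha >= 0, alpha <= C, y @ alpha == 0]).solve()

a = alpha.value
w = X.T @ (a * y)                                  # w = sum_i alpha_i y_i x_i
free = np.where((a > 1e-6) & (a < C - 1e-6))[0]    # support vectors with 0 < alpha_i < C
b = y[free[0]] - w @ X[free[0]]                    # assumed non-empty for this sample
h = lambda x: np.sign(w @ x + b)
print("alpha =", np.round(a, 4), " b =", b)
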
This Lecture
Support Vector Machines - separable case
Support Vector Machines - non-separable case
Margin guarantees

Mehryar Mohri - Foundations of Machine Learning page 26


High-Dimension
Learning guarantees: for hyperplanes in dimension N, with probability at least 1 - \delta,

    R(h) \leq \widehat{R}(h) + \sqrt{\frac{2(N+1) \log \frac{em}{N+1}}{m}} + \sqrt{\frac{\log \frac{1}{\delta}}{2m}} .

• bound is uninformative for N \gg m.


• but SVMs have been remarkably successful in
high-dimension.
• can we provide a theoretical justification?
• analysis of underlying scoring function.
Mehryar Mohri - Foundations of Machine Learning page 27
Confidence Margin
Definition: the confidence margin of a real-valued function h at (x, y) \in X \times Y is \rho_h(x, y) = y h(x).
• interpreted as the hypothesis' confidence in prediction.
• when the point is correctly classified, it coincides with |h(x)|.
• relationship with the geometric margin for linear functions h : x \mapsto w \cdot x + b: for x in the sample, |\rho_h(x, y)| \geq \rho_{\mathrm{geom}} \|w\| .

Mehryar Mohri - Foundations of Machine Learning page 28


Confidence Margin Loss
Definition: for any confidence margin parameter \rho > 0, the \rho-margin loss function \Phi_\rho is defined by

    \Phi_\rho(u) = 1 \text{ for } u \leq 0, \quad \Phi_\rho(u) = 1 - u/\rho \text{ for } 0 \leq u \leq \rho, \quad \Phi_\rho(u) = 0 \text{ for } u \geq \rho .

[Figure: \Phi_\rho(yh(x)) as a function of yh(x), decreasing linearly from 1 at 0 to 0 at \rho.]

For a sample S = (x_1, \ldots, x_m) and real-valued hypothesis h, the empirical margin loss is

    \widehat{R}_\rho(h) = \frac{1}{m} \sum_{i=1}^{m} \Phi_\rho(y_i h(x_i)) \leq \frac{1}{m} \sum_{i=1}^{m} 1_{y_i h(x_i) < \rho} .

Mehryar Mohri - Foundations of Machine Learning page 29
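
Illustration: the \rho-margin loss and the empirical margin loss in plain numpy; the piecewise definition collapses to a single clip.

# Sketch: rho-margin loss Phi_rho and the empirical margin loss R_rho(h).
import numpy as np

def margin_loss(margins, rho):
    """Phi_rho(u) = 1 for u <= 0, 1 - u/rho for 0 <= u <= rho, 0 for u >= rho."""
    return np.clip(1.0 - margins / rho, 0.0, 1.0)

def empirical_margin_loss(y, h_values, rho):
    """R_rho(h) = (1/m) sum_i Phi_rho(y_i h(x_i)), upper-bounded by the fraction
    of points with confidence margin y_i h(x_i) < rho."""
    margins = y * h_values
    return margin_loss(margins, rho).mean()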


General Margin Bound
Theorem: Let H be a set of real-valued functions. Fix \rho > 0. For any \delta > 0, with probability at least 1 - \delta, the following holds for all h \in H:

    R(h) \leq \widehat{R}_\rho(h) + \frac{2}{\rho} \mathfrak{R}_m(H) + \sqrt{\frac{\log \frac{1}{\delta}}{2m}}

    R(h) \leq \widehat{R}_\rho(h) + \frac{2}{\rho} \widehat{\mathfrak{R}}_S(H) + 3 \sqrt{\frac{\log \frac{2}{\delta}}{2m}} .

Proof: Let \widetilde{H} = \{z = (x, y) \mapsto y h(x) : h \in H\}. Consider the family of functions taking values in [0, 1]:

    \overline{H} = \{\Phi_\rho \circ f : f \in \widetilde{H}\}.
Mehryar Mohri - Foundations of Machine Learning page 30
• By the theorem of Lecture 3, with probability at least 1 - \delta, for all g \in \overline{H},

    \mathop{\mathrm{E}}[g(z)] \leq \frac{1}{m} \sum_{i=1}^{m} g(z_i) + 2 \mathfrak{R}_m(\overline{H}) + \sqrt{\frac{\log \frac{1}{\delta}}{2m}} .

• Thus,

    \mathop{\mathrm{E}}[\Phi_\rho(y h(x))] \leq \widehat{R}_\rho(h) + 2 \mathfrak{R}_m(\Phi_\rho \circ \widetilde{H}) + \sqrt{\frac{\log \frac{1}{\delta}}{2m}} .

• Since \Phi_\rho is \frac{1}{\rho}-Lipschitz, by Talagrand's lemma,

    \mathfrak{R}_m(\Phi_\rho \circ \widetilde{H}) \leq \frac{1}{\rho} \mathfrak{R}_m(\widetilde{H}) = \frac{1}{\rho} \mathop{\mathrm{E}}_{\sigma, S}\Big[ \sup_{h \in H} \sum_{i=1}^{m} \frac{\sigma_i y_i h(x_i)}{m} \Big] = \frac{1}{\rho} \mathfrak{R}_m(H) .

• Since 1_{y h(x) < 0} \leq \Phi_\rho(y h(x)), this shows the first statement, and similarly the second one.

Mehryar Mohri - Foundations of Machine Learning page 31


Rademacher Complexity of Linear Hypotheses

Theorem: Let S \subseteq \{x : \|x\| \leq R\} be a sample of size m and let H = \{x \mapsto w \cdot x : \|w\| \leq \Lambda\}. Then,

    \widehat{\mathfrak{R}}_S(H) \leq \sqrt{\frac{R^2 \Lambda^2}{m}} .

Proof:

    \widehat{\mathfrak{R}}_S(H) = \frac{1}{m} \mathop{\mathrm{E}}_{\sigma}\Big[ \sup_{\|w\| \leq \Lambda} \sum_{i=1}^{m} \sigma_i \, w \cdot x_i \Big] = \frac{1}{m} \mathop{\mathrm{E}}_{\sigma}\Big[ \sup_{\|w\| \leq \Lambda} w \cdot \sum_{i=1}^{m} \sigma_i x_i \Big]

      \leq \frac{\Lambda}{m} \mathop{\mathrm{E}}_{\sigma}\Big[ \Big\| \sum_{i=1}^{m} \sigma_i x_i \Big\| \Big] \leq \frac{\Lambda}{m} \Big[ \mathop{\mathrm{E}}_{\sigma} \Big\| \sum_{i=1}^{m} \sigma_i x_i \Big\|^2 \Big]^{1/2}

      \leq \frac{\Lambda}{m} \Big[ \sum_{i=1}^{m} \|x_i\|^2 \Big]^{1/2} \leq \frac{\Lambda \sqrt{m R^2}}{m} = \sqrt{\frac{R^2 \Lambda^2}{m}} .
Mehryar Mohri - Foundations of Machine Learning page 32
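
Illustration: a small Monte Carlo check of this bound. By Cauchy-Schwarz, the inner supremum has the closed form \Lambda \|\sum_i \sigma_i x_i\| / m, so the empirical Rademacher complexity can be estimated by sampling \sigma (hypothetical random data).

# Sketch: Monte Carlo estimate of R_S(H) for H = {x -> w.x : ||w|| <= Lambda},
# compared with the bound sqrt(R^2 Lambda^2 / m).
import numpy as np

rng = np.random.default_rng(0)
m, N, Lambda = 200, 10, 1.0
X = rng.normal(size=(m, N))
X /= np.linalg.norm(X, axis=1, keepdims=True)        # place every x_i on the unit sphere
R = np.linalg.norm(X, axis=1).max()                  # here R = 1

sigma = rng.choice([-1.0, 1.0], size=(5000, m))      # 5000 Rademacher vectors
sup_vals = Lambda * np.linalg.norm(sigma @ X, axis=1) / m   # sup_w (1/m) sum_i sigma_i w.x_i
print("Monte Carlo estimate of R_S(H):", sup_vals.mean())
print("bound sqrt(R^2 Lambda^2 / m)  :", np.sqrt(R**2 * Lambda**2 / m))
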
Margin Bound - Linear Classifiers
Corollary: Let \rho > 0 and H = \{x \mapsto w \cdot x : \|w\| \leq \Lambda\}. Assume that X \subseteq \{x : \|x\| \leq R\}. Then, for any \delta > 0, with probability at least 1 - \delta, for any h \in H,

    R(h) \leq \widehat{R}_\rho(h) + 2 \sqrt{\frac{R^2 \Lambda^2 / \rho^2}{m}} + 3 \sqrt{\frac{\log \frac{2}{\delta}}{2m}} .

Proof: follows directly from the general margin bound and the bound on \widehat{\mathfrak{R}}_S(H) for linear classifiers.

Mehryar Mohri - Foundations of Machine Learning page 33
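
Illustration: evaluating the right-hand side of the corollary for hypothetical values; a larger \rho shrinks the complexity term but can only increase the empirical margin loss.

# Sketch: right-hand side of the margin bound for linear classifiers.
import numpy as np

def margin_bound(emp_margin_loss, R, Lambda, rho, m, delta):
    """R_rho(h) + 2*sqrt(R^2 Lambda^2 / rho^2 / m) + 3*sqrt(log(2/delta) / (2m))."""
    complexity = 2.0 * np.sqrt(R**2 * Lambda**2 / rho**2 / m)
    confidence = 3.0 * np.sqrt(np.log(2.0 / delta) / (2.0 * m))
    return emp_margin_loss + complexity + confidence

# Hypothetical values: 5% empirical margin loss at rho = 0.1, m = 10000 points.
print(margin_bound(0.05, R=1.0, Lambda=1.0, rho=0.1, m=10000, delta=0.05))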


High-Dimensional Feature Space
Observations:
• generalization bound does not depend on the
dimension but on the margin.
• this suggests seeking a large-margin hyperplane
in a higher-dimensional feature space.
Computational problems:
• taking dot products in a high-dimensional feature
space can be very costly.
• solution based on kernels (next lecture).

Mehryar Mohri - Foundations of Machine Learning page 34


References
• Corinna Cortes and Vladimir Vapnik. Support-Vector Networks. Machine Learning, 20, 1995.

• Vladimir Koltchinskii and Dmitry Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. The Annals of Statistics, 30(1), 2002.

• Michel Ledoux and Michel Talagrand. Probability in Banach Spaces. Springer, New York, 1991.

• Vladimir N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer, Berlin, 1982.

• Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.

• Vladimir N. Vapnik. Statistical Learning Theory. Wiley-Interscience, New York, 1998.

Mehryar Mohri - Foundations of Machine Learning page 35 Courant Institute, NYU


Appendix

Mehryar Mohri - Foundations of Machine Learning


Saddle Point
Let (w^*, b^*, \alpha^*) be the saddle point of the Lagrangian. Multiplying both sides of the equation giving b^* by \alpha_i^* y_i and taking the sum leads to:

    \sum_{i=1}^{m} \alpha_i^* y_i b^* = \sum_{i=1}^{m} \alpha_i^* y_i^2 - \sum_{i,j=1}^{m} \alpha_i^* \alpha_j^* y_i y_j (x_i \cdot x_j).

Using y_i^2 = 1, \sum_{i=1}^{m} \alpha_i^* y_i = 0, and w^* = \sum_{i=1}^{m} \alpha_i^* y_i x_i yields

    0 = \sum_{i=1}^{m} \alpha_i^* - \|w^*\|^2 .

Thus, the margin is also given by:

    \rho^2 = \frac{1}{\|w^*\|^2} = \frac{1}{\|\alpha^*\|_1} .
Mehryar Mohri - Foundations of Machine Learning page 37
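
Illustration: a quick numerical check of this identity from any dual solution (X, y, \alpha), such as the one returned by the earlier dual sketch.

# Sketch: check rho = 1/||w*|| = 1/sqrt(||alpha*||_1) given a dual solution.
import numpy as np

def margin_identity(X, y, alpha):
    """Return (1/||w*||, 1/sqrt(sum_i alpha_i*)); the two should agree up to solver tolerance."""
    w = X.T @ (alpha * y)                  # w* = sum_i alpha_i* y_i x_i
    return 1.0 / np.linalg.norm(w), 1.0 / np.sqrt(alpha.sum())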


Talagrand’s Contraction Lemma
(Ledoux and Talagrand, 1991; pp. 112-114)
Theorem: Let \Phi : \mathbb{R} \to \mathbb{R} be an L-Lipschitz function. Then, for any hypothesis set H of real-valued functions,

    \widehat{\mathfrak{R}}_S(\Phi \circ H) \leq L \, \widehat{\mathfrak{R}}_S(H) .

Proof: fix a sample S = (x_1, \ldots, x_m). By definition,

    \widehat{\mathfrak{R}}_S(\Phi \circ H) = \frac{1}{m} \mathop{\mathrm{E}}_{\sigma}\Big[ \sup_{h \in H} \sum_{i=1}^{m} \sigma_i (\Phi \circ h)(x_i) \Big]
      = \frac{1}{m} \mathop{\mathrm{E}}_{\sigma_1, \ldots, \sigma_{m-1}}\Big[ \mathop{\mathrm{E}}_{\sigma_m}\Big[ \sup_{h \in H} u_{m-1}(h) + \sigma_m (\Phi \circ h)(x_m) \Big] \Big],

with u_{m-1}(h) = \sum_{i=1}^{m-1} \sigma_i (\Phi \circ h)(x_i).
Mehryar Mohri - Foundations of Machine Learning page 38
Talagrand’s Contraction Lemma
Now, assuming that the suprema are reached, there exist h_1, h_2 \in H such that

    \mathop{\mathrm{E}}_{\sigma_m}\Big[ \sup_{h \in H} u_{m-1}(h) + \sigma_m (\Phi \circ h)(x_m) \Big]
      = \frac{1}{2} \big[ u_{m-1}(h_1) + (\Phi \circ h_1)(x_m) \big] + \frac{1}{2} \big[ u_{m-1}(h_2) - (\Phi \circ h_2)(x_m) \big]
      \leq \frac{1}{2} \big[ u_{m-1}(h_1) + u_{m-1}(h_2) + s L (h_1(x_m) - h_2(x_m)) \big]
      = \frac{1}{2} \big[ u_{m-1}(h_1) + s L h_1(x_m) \big] + \frac{1}{2} \big[ u_{m-1}(h_2) - s L h_2(x_m) \big]
      \leq \mathop{\mathrm{E}}_{\sigma_m}\Big[ \sup_{h \in H} u_{m-1}(h) + \sigma_m L h(x_m) \Big],

where s = \mathrm{sgn}(h_1(x_m) - h_2(x_m)).


Mehryar Mohri - Foundations of Machine Learning page 39
Talagrand’s Contraction Lemma
When the suprema are not reached, the same argument can be carried out within \epsilon, followed by letting \epsilon \to 0.

Proceeding similarly for the other \sigma_i's directly leads to the result.

Mehryar Mohri - Foundations of Machine Learning page 40


VC Dimension of Canonical Hyperplanes
Theorem: Let S \subseteq \{x : \|x\| \leq R\}. Then, the VC dimension d of the set of canonical hyperplanes

    \{x \mapsto \mathrm{sgn}(w \cdot x) : \min_{x \in S} |w \cdot x| = 1 \ \land \ \|w\| \leq \Lambda\}

verifies

    d \leq R^2 \Lambda^2 .

Proof: Let \{x_1, \ldots, x_d\} be a set fully shattered. Then, for all y \in \{-1, +1\}^d, there exists w such that

    \forall i \in [1, d], \ 1 \leq y_i (w \cdot x_i).

Mehryar Mohri - Foundations of Machine Learning page 41


• Summing up the inequalities gives

    d \leq w \cdot \sum_{i=1}^{d} y_i x_i \leq \|w\| \Big\| \sum_{i=1}^{d} y_i x_i \Big\| \leq \Lambda \Big\| \sum_{i=1}^{d} y_i x_i \Big\| .

• Taking the expectation over y \sim U (uniform) yields

    d \leq \Lambda \mathop{\mathrm{E}}_{y \sim U}\Big[ \Big\| \sum_{i=1}^{d} y_i x_i \Big\| \Big] \leq \Lambda \Big[ \mathop{\mathrm{E}}_{y \sim U} \Big\| \sum_{i=1}^{d} y_i x_i \Big\|^2 \Big]^{1/2}        (Jensen's ineq.)

      = \Lambda \Big[ \sum_{i,j=1}^{d} \mathop{\mathrm{E}}[y_i y_j] (x_i \cdot x_j) \Big]^{1/2}

      = \Lambda \Big[ \sum_{i=1}^{d} (x_i \cdot x_i) \Big]^{1/2} \leq \Lambda \sqrt{d R^2} = \Lambda R \sqrt{d} .

• Thus, \sqrt{d} \leq \Lambda R, that is d \leq R^2 \Lambda^2 .

Mehryar Mohri - Foundations of Machine Learning page 42
