
Foundations of Machine Learning

Support Vector Machines

Mehryar Mohri
Courant Institute and Google Research
mohri@cims.nyu.edu
Binary Classification Problem
Training data: sample drawn i.i.d. from set X \subseteq \mathbb{R}^N according to some distribution D,

    S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (X \times \{-1, +1\})^m .

Problem: find hypothesis h : X \to \{-1, +1\} in H (classifier) with small generalization error R(h).
• choice of hypothesis set H: learning guarantees of previous lecture.
• linear classification (hyperplanes) if dimension N is not too large.

Mehryar Mohri - Foundations of Machine Learning page 2


This Lecture
Support Vector Machines - separable case
Support Vector Machines - non-separable case
Margin guarantees

Mehryar Mohri - Foundations of Machine Learning page 3


Linear Separation

[Figure: linearly separable sample, separating hyperplane w · x + b = 0, and its margin.]

• classifiers: H = \{x \mapsto \mathrm{sgn}(w \cdot x + b) : w \in \mathbb{R}^N, b \in \mathbb{R}\}.

• geometric margin: \rho = \min_{i \in [1,m]} \dfrac{|w \cdot x_i + b|}{\|w\|} .

• which separating hyperplane?


Mehryar Mohri - Foundations of Machine Learning page 4
Optimal Hyperplane: Max. Margin
(Vapnik and Chervonenkis, 1965)

[Figure: maximum-margin hyperplane w · x + b = 0 with marginal hyperplanes w · x + b = +1 and w · x + b = -1.]

    \rho = \max_{w,b \,:\, y_i(w \cdot x_i + b) \geq 0} \ \min_{i \in [1,m]} \frac{|w \cdot x_i + b|}{\|w\|} .

Mehryar Mohri - Foundations of Machine Learning page 5


Maximum Margin
    \rho = \max_{w,b \,:\, y_i(w \cdot x_i + b) \geq 0} \ \min_{i \in [1,m]} \frac{|w \cdot x_i + b|}{\|w\|}

      = \max_{\substack{w,b \,:\, y_i(w \cdot x_i + b) \geq 0 \\ \min_{i \in [1,m]} |w \cdot x_i + b| = 1}} \ \min_{i \in [1,m]} \frac{|w \cdot x_i + b|}{\|w\|}        (scale-invariance)

      = \max_{\substack{w,b \,:\, y_i(w \cdot x_i + b) \geq 0 \\ \min_{i \in [1,m]} |w \cdot x_i + b| = 1}} \frac{1}{\|w\|}

      = \max_{w,b \,:\, y_i(w \cdot x_i + b) \geq 1} \frac{1}{\|w\|} .        (min. reached)

Mehryar Mohri - Foundations of Machine Learning page 6


Optimization Problem
Constrained optimization:
    \min_{w, b} \ \frac{1}{2} \|w\|^2

subject to: y_i(w \cdot x_i + b) \geq 1, \ \forall i \in [1, m].


Properties:
• Convex optimization.
• Unique solution for linearly separable sample.

Mehryar Mohri - Foundations of Machine Learning page 7
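
Illustration: a minimal sketch, assuming the cvxpy package and a hypothetical linearly separable toy sample, of how this primal QP can be handed to a generic convex solver.

# Sketch: hard-margin primal SVM as a generic QP (assumes cvxpy is available).
import numpy as np
import cvxpy as cp

# Hypothetical linearly separable toy sample.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m, N = X.shape

w = cp.Variable(N)
b = cp.Variable()

# min (1/2)||w||^2  subject to  y_i (w . x_i + b) >= 1 for all i.
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                     [cp.multiply(y, X @ w + b) >= 1])
problem.solve()

print("w =", w.value, " b =", b.value)
print("geometric margin 1/||w|| =", 1.0 / np.linalg.norm(w.value))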


Optimal Hyperplane Equations
Lagrangian: for all w, b, \alpha_i \geq 0,

    L(w, b, \alpha) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{m} \alpha_i \, [y_i(w \cdot x_i + b) - 1].

KKT conditions:

    \nabla_w L = w - \sum_{i=1}^{m} \alpha_i y_i x_i = 0 \ \iff \ w = \sum_{i=1}^{m} \alpha_i y_i x_i .

    \nabla_b L = -\sum_{i=1}^{m} \alpha_i y_i = 0 \ \iff \ \sum_{i=1}^{m} \alpha_i y_i = 0 .

    \forall i \in [1, m], \ \alpha_i \, [y_i(w \cdot x_i + b) - 1] = 0 .

Mehryar Mohri - Foundations of Machine Learning page 8


Support Vectors
Complementarity conditions:

    \alpha_i \, [y_i(w \cdot x_i + b) - 1] = 0 \ \implies \ \alpha_i = 0 \ \lor \ y_i(w \cdot x_i + b) = 1.

Support vectors: vectors x_i such that

    \alpha_i \neq 0 \ \land \ y_i(w \cdot x_i + b) = 1.

• Note: support vectors are not unique.

Mehryar Mohri - Foundations of Machine Learning page 9


Moving to The Dual
Plugging in the expression of w in L gives:
    L = \underbrace{\frac{1}{2} \Big\| \sum_{i=1}^{m} \alpha_i y_i x_i \Big\|^2 - \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)}_{-\frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)} \ - \underbrace{\sum_{i=1}^{m} \alpha_i y_i b}_{0} \ + \sum_{i=1}^{m} \alpha_i .

Thus,

    L = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j).

Mehryar Mohri - Foundations of Machine Learning page 10


Equivalent Dual Opt. Problem
Constrained optimization:
    \max_{\alpha} \ \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)

subject to: \alpha_i \geq 0 \ \land \ \sum_{i=1}^{m} \alpha_i y_i = 0, \ \forall i \in [1, m].

Solution:

    h(x) = \mathrm{sgn}\Big( \sum_{i=1}^{m} \alpha_i y_i (x_i \cdot x) + b \Big),

with b = y_i - \sum_{j=1}^{m} \alpha_j y_j (x_j \cdot x_i) for any SV x_i.

Mehryar Mohri - Foundations of Machine Learning page 11
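
Illustration: a sketch of the dual under the same assumptions (cvxpy, hypothetical toy data). The quadratic term equals \| \sum_i \alpha_i y_i x_i \|^2, which keeps the objective concave without forming the Gram matrix explicitly.

# Sketch: dual of the hard-margin SVM (assumes cvxpy; hypothetical toy data).
import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m = len(y)

alpha = cp.Variable(m)
# max  sum_i alpha_i - (1/2) || sum_i alpha_i y_i x_i ||^2
objective = cp.Maximize(cp.sum(alpha)
                        - 0.5 * cp.sum_squares(X.T @ cp.multiply(alpha, y)))
cp.Problem(objective, [alpha >= 0, y @ alpha == 0]).solve()

a = alpha.value
w = X.T @ (a * y)                      # w = sum_i alpha_i y_i x_i
sv = int(np.argmax(a))                 # index of a support vector (largest alpha_i)
b = y[sv] - w @ X[sv]                  # b = y_i - sum_j alpha_j y_j (x_j . x_i)
h = lambda x: np.sign(w @ x + b)
print("alpha =", np.round(a, 4), " w =", w, " b =", b)

At the optimum, the w and b recovered this way coincide with the primal solution of the previous sketch (strong duality for this convex QP).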


Leave-One-Out Error
Definition: let hS be the hypothesis output by
learning algorithm L after receiving sample S of
size m. Then, the leave-one-out error of L over S
is:

    \widehat{R}_{\mathrm{loo}}(L) = \frac{1}{m} \sum_{i=1}^{m} 1_{h_{S - \{x_i\}}(x_i) \neq f(x_i)} .

Property: unbiased estimate of the expected error of a hypothesis trained on a sample of size m - 1:

    \mathop{\mathrm{E}}_{S \sim D^m}\big[\widehat{R}_{\mathrm{loo}}(L)\big] = \frac{1}{m} \sum_{i=1}^{m} \mathop{\mathrm{E}}_{S}\big[1_{h_{S - \{x_i\}}(x_i) \neq f(x_i)}\big] = \mathop{\mathrm{E}}_{S}\big[1_{h_{S - \{x\}}(x) \neq f(x)}\big]

      = \mathop{\mathrm{E}}_{S \sim D^{m-1}}\Big[ \mathop{\mathrm{E}}_{x \sim D}\big[1_{h_S(x) \neq f(x)}\big] \Big] = \mathop{\mathrm{E}}_{S \sim D^{m-1}}\big[R(h_S)\big].
Mehryar Mohri - Foundations of Machine Learning page 12
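
Illustration: a direct sketch of this estimate, assuming a hypothetical fit callable that trains on a sample and returns a predictor. Computing it requires m retrainings, one per held-out point.

# Sketch: leave-one-out error of a learning algorithm (hypothetical `fit` interface).
import numpy as np

def loo_error(fit, X, y):
    """R_loo = (1/m) * sum_i 1[h_{S - {x_i}}(x_i) != y_i]."""
    m = len(y)
    mistakes = 0
    for i in range(m):
        keep = np.arange(m) != i
        h = fit(X[keep], y[keep])          # hypothesis trained on S minus the i-th point
        mistakes += int(h(X[i]) != y[i])
    return mistakes / m
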
Leave-One-Out Analysis
Theorem: let hS be the optimal hyperplane for a
sample S and let NSV (S) be the number of support
vectors defining hS. Then,
    \mathop{\mathrm{E}}_{S \sim D^m}\big[R(h_S)\big] \leq \mathop{\mathrm{E}}_{S \sim D^{m+1}}\Big[\frac{N_{SV}(S)}{m+1}\Big].

Proof: Let S \sim D^{m+1} be a linearly separable sample and let x \in S. If h_{S - \{x\}} misclassifies x, then x must be a SV for h_S. Thus,

    \widehat{R}_{\mathrm{loo}}(\text{opt.-hyp.}) \leq \frac{N_{SV}(S)}{m+1}.

Mehryar Mohri - Foundations of Machine Learning page 13


Notes
Bound on expectation of error only, not the
probability of error.
Argument based on sparsity (number of support vectors). We will later see other arguments in support of the optimal hyperplane, based on the concept of margin.

Mehryar Mohri - Foundations of Machine Learning page 14


This Lecture
Support Vector Machines - separable case
Support Vector Machines - non-separable case
Margin guarantees

Mehryar Mohri - Foundations of Machine Learning page 15


Support Vector Machines
(Cortes and Vapnik, 1995)
Problem: data often not linearly separable in practice. For any hyperplane, there exists x_i such that

    y_i \, [w \cdot x_i + b] \not\geq 1 .

Idea: relax the constraints using slack variables \xi_i \geq 0:

    y_i \, [w \cdot x_i + b] \geq 1 - \xi_i .

Mehryar Mohri - Foundations of Machine Learning page 16


Soft-Margin Hyperplanes
[Figure: soft-margin hyperplane w · x + b = 0 with marginal hyperplanes w · x + b = +1 and w · x + b = -1; slack variables \xi_i, \xi_j measure the margin violations of outliers.]

Support vectors: points along the margin or outliers.


Soft margin: \rho = 1/\|w\| .

Mehryar Mohri - Foundations of Machine Learning page 17


Optimization Problem
(Cortes and Vapnik, 1995)
Constrained optimization:
    \min_{w, b, \xi} \ \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{m} \xi_i

subject to: y_i(w \cdot x_i + b) \geq 1 - \xi_i \ \land \ \xi_i \geq 0, \ \forall i \in [1, m].

Properties:
• C \geq 0 trade-off parameter.
• Convex optimization.
• Unique solution.

Mehryar Mohri - Foundations of Machine Learning page 18
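
Illustration: the same solver-based sketch adapted to the soft-margin primal (assuming cvxpy and hypothetical toy data).

# Sketch: soft-margin primal SVM as a QP (assumes cvxpy; hypothetical toy data).
import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [1.5, 1.0], [0.5, -0.5], [-1.0, -1.5], [-2.0, -2.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0])
m, N = X.shape
C = 1.0                                    # trade-off parameter

w, b, xi = cp.Variable(N), cp.Variable(), cp.Variable(m)
objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
constraints = [cp.multiply(y, X @ w + b) >= 1 - xi, xi >= 0]
cp.Problem(objective, constraints).solve()

print("w =", w.value, " b =", b.value)
print("slack variables =", np.round(xi.value, 4))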


Notes
Parameter C : trade-off between maximizing margin
and minimizing training error. How do we
determine C ?
The general problem of determining a hyperplane
minimizing the error on the training set is NP-
complete (as a function of the dimension).
Other convex functions of the slack variables
could be used: this choice and a similar one with
squared slack variables lead to a convenient
formulation and solution.

Mehryar Mohri - Foundations of Machine Learning page 19


SVM - Equivalent Problem
Optimization:
    \min_{w, b} \ \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{m} \big(1 - y_i(w \cdot x_i + b)\big)_{+} .

Loss functions:
• hinge loss: L(h(x), y) = (1 - y h(x))_{+} .
• quadratic hinge loss: L(h(x), y) = (1 - y h(x))_{+}^2 .

Mehryar Mohri - Foundations of Machine Learning page 20
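
Illustration: the two losses and the unconstrained objective above written out in plain numpy.

# Sketch: hinge loss, quadratic hinge loss, and the unconstrained SVM objective.
import numpy as np

def hinge(y, scores):
    """(1 - y h(x))_+ , elementwise."""
    return np.maximum(0.0, 1.0 - y * scores)

def quadratic_hinge(y, scores):
    """(1 - y h(x))_+^2 , elementwise."""
    return hinge(y, scores) ** 2

def svm_objective(w, b, X, y, C):
    """(1/2)||w||^2 + C * sum_i (1 - y_i (w . x_i + b))_+ ."""
    return 0.5 * float(w @ w) + C * hinge(y, X @ w + b).sum()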


Hinge Loss
[Figure: the 0/1 loss, the hinge loss, and the 'quadratic' hinge loss as functions of yh(x).]

Mehryar Mohri - Foundations of Machine Learning page 21


SVMs Equations
Lagrangian: for all w, b, \alpha_i \geq 0, \beta_i \geq 0,

    L(w, b, \xi, \alpha, \beta) = \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{m} \xi_i - \sum_{i=1}^{m} \alpha_i \, [y_i(w \cdot x_i + b) - 1 + \xi_i] - \sum_{i=1}^{m} \beta_i \xi_i .

KKT conditions:

    \nabla_w L = w - \sum_{i=1}^{m} \alpha_i y_i x_i = 0 \ \iff \ w = \sum_{i=1}^{m} \alpha_i y_i x_i .

    \nabla_b L = -\sum_{i=1}^{m} \alpha_i y_i = 0 \ \iff \ \sum_{i=1}^{m} \alpha_i y_i = 0 .

    \nabla_{\xi_i} L = C - \alpha_i - \beta_i = 0 \ \iff \ \alpha_i + \beta_i = C .

    \forall i \in [1, m], \ \alpha_i \, [y_i(w \cdot x_i + b) - 1 + \xi_i] = 0 \ \land \ \beta_i \xi_i = 0 .
Mehryar Mohri - Foundations of Machine Learning page 22
Support Vectors
Complementarity conditions:

    \alpha_i \, [y_i(w \cdot x_i + b) - 1 + \xi_i] = 0 \ \implies \ \alpha_i = 0 \ \lor \ y_i(w \cdot x_i + b) = 1 - \xi_i .

Support vectors: vectors x_i such that

    \alpha_i \neq 0 \ \land \ y_i(w \cdot x_i + b) = 1 - \xi_i .

• Note: support vectors are not unique.

Mehryar Mohri - Foundations of Machine Learning page 23


Moving to The Dual
Plugging in the expression of w in L gives:
    L = \underbrace{\frac{1}{2} \Big\| \sum_{i=1}^{m} \alpha_i y_i x_i \Big\|^2 - \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)}_{-\frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)} \ - \underbrace{\sum_{i=1}^{m} \alpha_i y_i b}_{0} \ + \sum_{i=1}^{m} \alpha_i .

Thus,

    L = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j).

The condition \beta_i \geq 0 is equivalent to \alpha_i \leq C.

Mehryar Mohri - Foundations of Machine Learning page 24


Dual Optimization Problem
Constrained optimization:
    \max_{\alpha} \ \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)

subject to: 0 \leq \alpha_i \leq C \ \land \ \sum_{i=1}^{m} \alpha_i y_i = 0, \ \forall i \in [1, m].

Solution:

    h(x) = \mathrm{sgn}\Big( \sum_{i=1}^{m} \alpha_i y_i (x_i \cdot x) + b \Big),

with b = y_i - \sum_{j=1}^{m} \alpha_j y_j (x_j \cdot x_i) for any x_i with 0 < \alpha_i < C.
Mehryar Mohri - Foundations of Machine Learning page 25
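
Illustration: a sketch of the box-constrained dual (assuming cvxpy and hypothetical toy data); b is read off a point with 0 < \alpha_i < C, assumed to exist for this sample.

# Sketch: box-constrained dual of the soft-margin SVM (assumes cvxpy; toy data).
import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m, C = len(y), 10.0

alpha = cp.Variable(m)
objective = cp.Maximize(cp.sum(alpha)
                        - 0.5 * cp.sum_squares(X.T @ cp.multiply(alpha, y)))
cp.Problem(objective, [alpha >= 0, alpha <= C, y @ alpha == 0]).solve()

a = alpha.value
w = X.T @ (a * y)                                  # w = sum_i alpha_i y_i x_i
free = np.where((a > 1e-6) & (a < C - 1e-6))[0]    # support vectors with 0 < alpha_i < C
b = y[free[0]] - w @ X[free[0]]                    # assumed non-empty for this sample
h = lambda x: np.sign(w @ x + b)
print("alpha =", np.round(a, 4), " b =", b)
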
This Lecture
Support Vector Machines - separable case
Support Vector Machines - non-separable case
Margin guarantees

Mehryar Mohri - Foundations of Machine Learning page 26


High-Dimension
Learning guarantees: for hyperplanes in dimension N, with probability at least 1 - \delta,

    R(h) \leq \widehat{R}(h) + \sqrt{\frac{2(N+1) \log \frac{em}{N+1}}{m}} + \sqrt{\frac{\log \frac{1}{\delta}}{2m}} .

• bound is uninformative for N \gg m.


• but SVMs have been remarkably successful in
high-dimension.
• can we provide a theoretical justification?
• analysis of underlying scoring function.
Mehryar Mohri - Foundations of Machine Learning page 27
Confidence Margin
Definition: the confidence margin of a real-valued function h at (x, y) \in X \times Y is \rho_h(x, y) = y h(x).
• interpreted as the hypothesis' confidence in prediction.
• when the point is correctly classified, it coincides with |h(x)|.
• relationship with the geometric margin for linear functions h : x \mapsto w \cdot x + b: for x in the sample, |\rho_h(x, y)| \geq \rho_{\mathrm{geom}} \|w\| .

Mehryar Mohri - Foundations of Machine Learning page 28


Confidence Margin Loss
Definition: for any confidence margin parameter \rho > 0, the \rho-margin loss function \Phi_\rho is defined by

    \Phi_\rho(u) = 1 \text{ for } u \leq 0, \quad \Phi_\rho(u) = 1 - u/\rho \text{ for } 0 \leq u \leq \rho, \quad \Phi_\rho(u) = 0 \text{ for } u \geq \rho .

[Figure: \Phi_\rho(yh(x)) as a function of yh(x), decreasing linearly from 1 at 0 to 0 at \rho.]

For a sample S = (x_1, \ldots, x_m) and real-valued hypothesis h, the empirical margin loss is

    \widehat{R}_\rho(h) = \frac{1}{m} \sum_{i=1}^{m} \Phi_\rho(y_i h(x_i)) \leq \frac{1}{m} \sum_{i=1}^{m} 1_{y_i h(x_i) < \rho} .

Mehryar Mohri - Foundations of Machine Learning page 29
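
Illustration: the \rho-margin loss and the empirical margin loss in plain numpy; the piecewise definition collapses to a single clip.

# Sketch: rho-margin loss Phi_rho and the empirical margin loss R_rho(h).
import numpy as np

def margin_loss(margins, rho):
    """Phi_rho(u) = 1 for u <= 0, 1 - u/rho for 0 <= u <= rho, 0 for u >= rho."""
    return np.clip(1.0 - margins / rho, 0.0, 1.0)

def empirical_margin_loss(y, h_values, rho):
    """R_rho(h) = (1/m) sum_i Phi_rho(y_i h(x_i)), upper-bounded by the fraction
    of points with confidence margin y_i h(x_i) < rho."""
    margins = y * h_values
    return margin_loss(margins, rho).mean()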


General Margin Bound
Theorem: Let H be a set of real-valued functions. Fix \rho > 0. For any \delta > 0, with probability at least 1 - \delta, the following holds for all h \in H:

    R(h) \leq \widehat{R}_\rho(h) + \frac{2}{\rho} \mathfrak{R}_m(H) + \sqrt{\frac{\log \frac{1}{\delta}}{2m}}

    R(h) \leq \widehat{R}_\rho(h) + \frac{2}{\rho} \widehat{\mathfrak{R}}_S(H) + 3 \sqrt{\frac{\log \frac{2}{\delta}}{2m}} .

Proof: Let \widetilde{H} = \{z = (x, y) \mapsto y h(x) : h \in H\}. Consider the family of functions taking values in [0, 1]:

    \overline{H} = \{\Phi_\rho \circ f : f \in \widetilde{H}\}.
Mehryar Mohri - Foundations of Machine Learning page 30
• By the theorem of Lecture 3, with probability at least 1 - \delta, for all g \in \overline{H},

    \mathop{\mathrm{E}}[g(z)] \leq \frac{1}{m} \sum_{i=1}^{m} g(z_i) + 2 \mathfrak{R}_m(\overline{H}) + \sqrt{\frac{\log \frac{1}{\delta}}{2m}} .

• Thus,

    \mathop{\mathrm{E}}[\Phi_\rho(y h(x))] \leq \widehat{R}_\rho(h) + 2 \mathfrak{R}_m(\Phi_\rho \circ \widetilde{H}) + \sqrt{\frac{\log \frac{1}{\delta}}{2m}} .

• Since \Phi_\rho is \frac{1}{\rho}-Lipschitz, by Talagrand's lemma,

    \mathfrak{R}_m(\Phi_\rho \circ \widetilde{H}) \leq \frac{1}{\rho} \mathfrak{R}_m(\widetilde{H}) = \frac{1}{\rho} \mathop{\mathrm{E}}_{\sigma, S}\Big[ \sup_{h \in H} \sum_{i=1}^{m} \frac{\sigma_i y_i h(x_i)}{m} \Big] = \frac{1}{\rho} \mathfrak{R}_m(H) .

• Since 1_{y h(x) < 0} \leq \Phi_\rho(y h(x)), this shows the first statement, and similarly the second one.

Mehryar Mohri - Foundations of Machine Learning page 31


Rademacher Complexity of Linear Hypotheses

Theorem: Let S \subseteq \{x : \|x\| \leq R\} be a sample of size m and let H = \{x \mapsto w \cdot x : \|w\| \leq \Lambda\}. Then,

    \widehat{\mathfrak{R}}_S(H) \leq \sqrt{\frac{R^2 \Lambda^2}{m}} .

Proof:

    \widehat{\mathfrak{R}}_S(H) = \frac{1}{m} \mathop{\mathrm{E}}_{\sigma}\Big[ \sup_{\|w\| \leq \Lambda} \sum_{i=1}^{m} \sigma_i \, w \cdot x_i \Big] = \frac{1}{m} \mathop{\mathrm{E}}_{\sigma}\Big[ \sup_{\|w\| \leq \Lambda} w \cdot \sum_{i=1}^{m} \sigma_i x_i \Big]

      \leq \frac{\Lambda}{m} \mathop{\mathrm{E}}_{\sigma}\Big[ \Big\| \sum_{i=1}^{m} \sigma_i x_i \Big\| \Big] \leq \frac{\Lambda}{m} \Big[ \mathop{\mathrm{E}}_{\sigma} \Big\| \sum_{i=1}^{m} \sigma_i x_i \Big\|^2 \Big]^{1/2}

      \leq \frac{\Lambda}{m} \Big[ \sum_{i=1}^{m} \|x_i\|^2 \Big]^{1/2} \leq \frac{\Lambda \sqrt{m R^2}}{m} = \sqrt{\frac{R^2 \Lambda^2}{m}} .
Mehryar Mohri - Foundations of Machine Learning page 32
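
Illustration: a small Monte Carlo check of this bound. By Cauchy-Schwarz, the inner supremum has the closed form \Lambda \|\sum_i \sigma_i x_i\| / m, so the empirical Rademacher complexity can be estimated by sampling \sigma (hypothetical random data).

# Sketch: Monte Carlo estimate of R_S(H) for H = {x -> w.x : ||w|| <= Lambda},
# compared with the bound sqrt(R^2 Lambda^2 / m).
import numpy as np

rng = np.random.default_rng(0)
m, N, Lambda = 200, 10, 1.0
X = rng.normal(size=(m, N))
X /= np.linalg.norm(X, axis=1, keepdims=True)        # place every x_i on the unit sphere
R = np.linalg.norm(X, axis=1).max()                  # here R = 1

sigma = rng.choice([-1.0, 1.0], size=(5000, m))      # 5000 Rademacher vectors
sup_vals = Lambda * np.linalg.norm(sigma @ X, axis=1) / m   # sup_w (1/m) sum_i sigma_i w.x_i
print("Monte Carlo estimate of R_S(H):", sup_vals.mean())
print("bound sqrt(R^2 Lambda^2 / m)  :", np.sqrt(R**2 * Lambda**2 / m))
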
Margin Bound - Linear Classifiers
Corollary: Let \rho > 0 and H = \{x \mapsto w \cdot x : \|w\| \leq \Lambda\}. Assume that X \subseteq \{x : \|x\| \leq R\}. Then, for any \delta > 0, with probability at least 1 - \delta, for any h \in H,

    R(h) \leq \widehat{R}_\rho(h) + 2 \sqrt{\frac{R^2 \Lambda^2 / \rho^2}{m}} + 3 \sqrt{\frac{\log \frac{2}{\delta}}{2m}} .

Proof: follows directly from the general margin bound and the bound on \widehat{\mathfrak{R}}_S(H) for linear classifiers.

Mehryar Mohri - Foundations of Machine Learning page 33
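
Illustration: evaluating the right-hand side of the corollary for hypothetical values; a larger \rho shrinks the complexity term but can only increase the empirical margin loss.

# Sketch: right-hand side of the margin bound for linear classifiers.
import numpy as np

def margin_bound(emp_margin_loss, R, Lambda, rho, m, delta):
    """R_rho(h) + 2*sqrt(R^2 Lambda^2 / rho^2 / m) + 3*sqrt(log(2/delta) / (2m))."""
    complexity = 2.0 * np.sqrt(R**2 * Lambda**2 / rho**2 / m)
    confidence = 3.0 * np.sqrt(np.log(2.0 / delta) / (2.0 * m))
    return emp_margin_loss + complexity + confidence

# Hypothetical values: 5% empirical margin loss at rho = 0.1, m = 10000 points.
print(margin_bound(0.05, R=1.0, Lambda=1.0, rho=0.1, m=10000, delta=0.05))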


High-Dimensional Feature Space
Observations:
• generalization bound does not depend on the
dimension but on the margin.
• this suggests seeking a large-margin hyperplane
in a higher-dimensional feature space.
Computational problems:
• taking dot products in a high-dimensional feature
space can be very costly.
• solution based on kernels (next lecture).

Mehryar Mohri - Foundations of Machine Learning page 34


References
• Corinna Cortes and Vladimir Vapnik. Support-Vector Networks. Machine Learning, 20, 1995.

• Vladimir Koltchinskii and Dmitry Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. The Annals of Statistics, 30(1), 2002.

• Michel Ledoux and Michel Talagrand. Probability in Banach Spaces. Springer, New York, 1991.

• Vladimir N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer, Berlin, 1982.

• Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.

• Vladimir N. Vapnik. Statistical Learning Theory. Wiley-Interscience, New York, 1998.

Mehryar Mohri - Foundations of Machine Learning page 35 Courant Institute, NYU


Appendix

Mehryar Mohri - Foundations of Machine Learning


Saddle Point
Let (w^*, b^*, \alpha^*) be the saddle point of the Lagrangian. Multiplying both sides of the equation giving b^* by \alpha_i^* y_i and taking the sum leads to:

    \sum_{i=1}^{m} \alpha_i^* y_i b^* = \sum_{i=1}^{m} \alpha_i^* y_i^2 - \sum_{i,j=1}^{m} \alpha_i^* \alpha_j^* y_i y_j (x_i \cdot x_j).

Using y_i^2 = 1, \sum_{i=1}^{m} \alpha_i^* y_i = 0, and w^* = \sum_{i=1}^{m} \alpha_i^* y_i x_i yields

    0 = \sum_{i=1}^{m} \alpha_i^* - \|w^*\|^2 .

Thus, the margin is also given by:

    \rho^2 = \frac{1}{\|w^*\|^2} = \frac{1}{\|\alpha^*\|_1} .
Mehryar Mohri - Foundations of Machine Learning page 37
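
Illustration: a quick numerical check of this identity from any dual solution (X, y, \alpha), such as the one returned by the earlier dual sketch.

# Sketch: check rho = 1/||w*|| = 1/sqrt(||alpha*||_1) given a dual solution.
import numpy as np

def margin_identity(X, y, alpha):
    """Return (1/||w*||, 1/sqrt(sum_i alpha_i*)); the two should agree up to solver tolerance."""
    w = X.T @ (alpha * y)                  # w* = sum_i alpha_i* y_i x_i
    return 1.0 / np.linalg.norm(w), 1.0 / np.sqrt(alpha.sum())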


Talagrand’s Contraction Lemma
(Ledoux and Talagrand, 1991; pp. 112-114)
Theorem: Let \Phi : \mathbb{R} \to \mathbb{R} be an L-Lipschitz function. Then, for any hypothesis set H of real-valued functions,

    \widehat{\mathfrak{R}}_S(\Phi \circ H) \leq L \, \widehat{\mathfrak{R}}_S(H) .

Proof: fix a sample S = (x_1, \ldots, x_m). By definition,

    \widehat{\mathfrak{R}}_S(\Phi \circ H) = \frac{1}{m} \mathop{\mathrm{E}}_{\sigma}\Big[ \sup_{h \in H} \sum_{i=1}^{m} \sigma_i (\Phi \circ h)(x_i) \Big]
      = \frac{1}{m} \mathop{\mathrm{E}}_{\sigma_1, \ldots, \sigma_{m-1}}\Big[ \mathop{\mathrm{E}}_{\sigma_m}\Big[ \sup_{h \in H} u_{m-1}(h) + \sigma_m (\Phi \circ h)(x_m) \Big] \Big],

with u_{m-1}(h) = \sum_{i=1}^{m-1} \sigma_i (\Phi \circ h)(x_i).
Mehryar Mohri - Foundations of Machine Learning page 38
Talagrand’s Contraction Lemma
Now, assuming that the suprema are reached, there exist h_1, h_2 \in H such that

    \mathop{\mathrm{E}}_{\sigma_m}\Big[ \sup_{h \in H} u_{m-1}(h) + \sigma_m (\Phi \circ h)(x_m) \Big]
      = \frac{1}{2} \big[ u_{m-1}(h_1) + (\Phi \circ h_1)(x_m) \big] + \frac{1}{2} \big[ u_{m-1}(h_2) - (\Phi \circ h_2)(x_m) \big]
      \leq \frac{1}{2} \big[ u_{m-1}(h_1) + u_{m-1}(h_2) + s L (h_1(x_m) - h_2(x_m)) \big]
      = \frac{1}{2} \big[ u_{m-1}(h_1) + s L h_1(x_m) \big] + \frac{1}{2} \big[ u_{m-1}(h_2) - s L h_2(x_m) \big]
      \leq \mathop{\mathrm{E}}_{\sigma_m}\Big[ \sup_{h \in H} u_{m-1}(h) + \sigma_m L h(x_m) \Big],

where s = \mathrm{sgn}(h_1(x_m) - h_2(x_m)).


Mehryar Mohri - Foundations of Machine Learning page 39
Talagrand’s Contraction Lemma
When the suprema are not reached, the same argument can be carried out within \epsilon, followed by letting \epsilon \to 0.

Proceeding similarly for the other \sigma_i's directly leads to the result.

Mehryar Mohri - Foundations of Machine Learning page 40


VC Dimension of Canonical Hyperplanes
Theorem: Let S \subseteq \{x : \|x\| \leq R\}. Then, the VC dimension d of the set of canonical hyperplanes

    \{x \mapsto \mathrm{sgn}(w \cdot x) : \min_{x \in S} |w \cdot x| = 1 \ \land \ \|w\| \leq \Lambda\}

verifies

    d \leq R^2 \Lambda^2 .

Proof: Let \{x_1, \ldots, x_d\} be a set fully shattered. Then, for all y \in \{-1, +1\}^d, there exists w such that

    \forall i \in [1, d], \ 1 \leq y_i (w \cdot x_i).

Mehryar Mohri - Foundations of Machine Learning page 41


• Summing up the inequalities gives

    d \leq w \cdot \sum_{i=1}^{d} y_i x_i \leq \|w\| \Big\| \sum_{i=1}^{d} y_i x_i \Big\| \leq \Lambda \Big\| \sum_{i=1}^{d} y_i x_i \Big\| .

• Taking the expectation over y \sim U (uniform) yields

    d \leq \Lambda \mathop{\mathrm{E}}_{y \sim U}\Big[ \Big\| \sum_{i=1}^{d} y_i x_i \Big\| \Big] \leq \Lambda \Big[ \mathop{\mathrm{E}}_{y \sim U} \Big\| \sum_{i=1}^{d} y_i x_i \Big\|^2 \Big]^{1/2}        (Jensen's ineq.)

      = \Lambda \Big[ \sum_{i,j=1}^{d} \mathop{\mathrm{E}}[y_i y_j] (x_i \cdot x_j) \Big]^{1/2}

      = \Lambda \Big[ \sum_{i=1}^{d} (x_i \cdot x_i) \Big]^{1/2} \leq \Lambda \sqrt{d R^2} = \Lambda R \sqrt{d} .

• Thus, \sqrt{d} \leq \Lambda R, that is d \leq R^2 \Lambda^2 .

Mehryar Mohri - Foundations of Machine Learning page 42
