Foundations of Machine Learning: Support Vector Machines
Mehryar Mohri
Courant Institute and Google Research
mohri@cims.nyu.edu
Binary Classification Problem
Training data: sample drawn i.i.d. from a set X \subseteq \mathbb{R}^N according to some distribution D,

S = ((x_1, y_1), \ldots, (x_m, y_m)) \in (X \times \{-1, +1\})^m.
[Figure: separating hyperplane w \cdot x + b = 0 with marginal hyperplanes w \cdot x + b = +1 and w \cdot x + b = -1; the margin is the distance from the separating hyperplane to the closest points.]
\rho = \max_{w,b \,:\, y_i(w \cdot x_i + b) \ge 0} \ \min_{i \in [1,m]} \frac{|w \cdot x_i + b|}{\|w\|}

    = \max_{\substack{w,b \,:\, y_i(w \cdot x_i + b) \ge 0 \\ \min_{i \in [1,m]} |w \cdot x_i + b| = 1}} \frac{1}{\|w\|}

    = \max_{w,b \,:\, y_i(w \cdot x_i + b) \ge 1} \frac{1}{\|w\|}. \qquad \text{(min. reached)}
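Equivalently, since maximizing 1/\|w\| is the same as minimizing \frac{1}{2}\|w\|^2 under the constraints y_i(w \cdot x_i + b) \ge 1, the margin maximization is a convex QP. A minimal sketch of that primal using cvxpy; the library choice and the toy data are assumptions for illustration, not part of the slides:

```python
import cvxpy as cp
import numpy as np

# Toy linearly separable data (assumed for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m, n = X.shape

w = cp.Variable(n)
b = cp.Variable()

# Hard-margin primal: minimize (1/2)||w||^2  s.t.  y_i (w . x_i + b) >= 1.
problem = cp.Problem(
    cp.Minimize(0.5 * cp.sum_squares(w)),
    [cp.multiply(y, X @ w + b) >= 1],
)
problem.solve()

print("w =", w.value, " b =", b.value)
print("margin rho = 1/||w|| =", 1.0 / np.linalg.norm(w.value))
```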
\nabla_b L = -\sum_{i=1}^m \alpha_i y_i = 0 \implies \sum_{i=1}^m \alpha_i y_i = 0.

Thus,

L = \sum_{i=1}^m \alpha_i - \frac{1}{2} \sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j (x_i \cdot x_j).
[Figure: soft-margin SVM; the slack variable \xi_j measures how much a point violates the marginal hyperplanes w \cdot x + b = +1 and w \cdot x + b = -1.]
Properties:
• C \ge 0: trade-off parameter (its effect is illustrated in the sketch below).
• Convex optimization problem.
• Unique solution.
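A minimal sketch of the role of C using scikit-learn's SVC; the synthetic data and the parameter grid are assumptions for illustration:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Two overlapping Gaussian clusters (assumed toy data).
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(1.0, 1.0, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

# Small C tolerates margin violations (larger margin, more slack);
# large C penalizes them heavily, approaching the hard-margin solution.
for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    print(f"C={C:>6}: margin = {1.0 / np.linalg.norm(w):.3f}, "
          f"#support vectors = {len(clf.support_)}")
```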
Loss functions:
• hinge loss:

  L(h(x), y) = (1 - y\,h(x))_+ .

[Figure: the hinge loss as a function of y\,h(x).]
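As a quick illustration, a one-line NumPy version of the hinge loss; the function and variable names are ours, not the slides':

```python
import numpy as np

def hinge_loss(scores, labels):
    """Hinge loss (1 - y h(x))_+ averaged over a sample."""
    return np.mean(np.maximum(0.0, 1.0 - labels * scores))

# A correctly classified point beyond the margin costs 0; a point inside
# the margin or misclassified costs 1 - y h(x).
print(hinge_loss(np.array([2.0, 0.5, -1.0]), np.array([1, 1, 1])))  # (0 + 0.5 + 2) / 3
```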
KKT conditions:

\nabla_w L = w - \sum_{i=1}^m \alpha_i y_i x_i = 0 \implies w = \sum_{i=1}^m \alpha_i y_i x_i.

\nabla_b L = -\sum_{i=1}^m \alpha_i y_i = 0 \implies \sum_{i=1}^m \alpha_i y_i = 0.

\nabla_{\xi_i} L = C - \alpha_i - \beta_i = 0 \implies \alpha_i + \beta_i = C.

\forall i \in [1, m], \quad \alpha_i \big[ y_i (w \cdot x_i + b) - 1 + \xi_i \big] = 0, \qquad \beta_i \xi_i = 0.
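These conditions can be checked numerically on a trained model. A sketch assuming scikit-learn's SVC, whose dual_coef_ attribute stores y_i \alpha_i for the support vectors; the data are an assumption:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1.5, 1.0, (40, 2)), rng.normal(1.5, 1.0, (40, 2))])
y = np.array([-1.0] * 40 + [1.0] * 40)

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)

alpha_y = clf.dual_coef_[0]       # y_i * alpha_i, support vectors only
alpha = np.abs(alpha_y)           # alpha_i >= 0

# Stationarity in w: w = sum_i alpha_i y_i x_i.
w = alpha_y @ clf.support_vectors_
print(np.allclose(w, clf.coef_[0]))                  # True

# Stationarity in b: sum_i alpha_i y_i = 0.
print(np.isclose(alpha_y.sum(), 0.0))                # True

# Box constraint from alpha_i + beta_i = C, beta_i >= 0: 0 <= alpha_i <= C.
print(np.all((alpha >= 0) & (alpha <= C + 1e-8)))    # True
```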
Support Vectors
Complementarity conditions:
\alpha_i \big[ y_i (w \cdot x_i + b) - 1 + \xi_i \big] = 0 \implies \alpha_i = 0 \ \lor\ y_i (w \cdot x_i + b) = 1 - \xi_i.

Thus, only the points x_i with \alpha_i \ne 0, the support vectors, contribute to the expansion of w: they lie on a marginal hyperplane (\xi_i = 0) or violate the margin (\xi_i > 0).
Thus,

L = \sum_{i=1}^m \alpha_i - \frac{1}{2} \sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j (x_i \cdot x_j).
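The dual is itself a QP in \alpha, maximized subject to \sum_i \alpha_i y_i = 0 and 0 \le \alpha_i \le C. A sketch solving it with cvxpy (the data, names, and C value are assumptions); note how few \alpha_i end up nonzero:

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1.5, 1.0, (30, 2)), rng.normal(1.5, 1.0, (30, 2))])
y = np.array([-1.0] * 30 + [1.0] * 30)
m, C = len(y), 1.0

Z = y[:, None] * X        # rows y_i x_i, so alpha^T K alpha = ||Z^T alpha||^2
alpha = cp.Variable(m)

objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(Z.T @ alpha))
constraints = [alpha >= 0, alpha <= C, alpha @ y == 0]
cp.Problem(objective, constraints).solve()

a = alpha.value
w = Z.T @ a               # recover w = sum_i alpha_i y_i x_i
print("support vectors (alpha_i > 0):", int(np.sum(a > 1e-6)), "of", m)
print("w =", w)
```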
[Figure: the \rho-margin loss \Phi_\rho as a function of y\,h(x): equal to 1 for y\,h(x) \le 0, linear on [0, \rho], and 0 for y\,h(x) \ge \rho.]
With probability at least 1 - \delta, for all h \in H,

E[\Phi_\rho(y\,h(x))] \le \hat{R}_\rho(h) + 2\, \mathfrak{R}_m(\Phi_\rho \circ H) + \sqrt{\frac{\log \frac{1}{\delta}}{2m}}.
• Since \Phi_\rho is \frac{1}{\rho}-Lipschitz, by Talagrand's lemma,

\mathfrak{R}_m(\Phi_\rho \circ H) \le \frac{1}{\rho}\, \mathbb{E}_{\sigma, S} \left[ \sup_{h \in H} \frac{1}{m} \sum_{i=1}^m \sigma_i y_i h(x_i) \right] = \frac{1}{\rho}\, \mathfrak{R}_m(H),

since \sigma_i y_i and \sigma_i have the same distribution.
R(h) \le \hat{R}_\rho(h) + 2 \sqrt{\frac{R^2/\rho^2}{m}} + 3 \sqrt{\frac{\log \frac{2}{\delta}}{2m}}.
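For intuition: for the bounded-norm linear class \{x \mapsto w \cdot x : \|w\| \le \Lambda\}, the supremum in the Rademacher average is \frac{\Lambda}{m} \|\sum_i \sigma_i x_i\| by Cauchy-Schwarz, so the empirical Rademacher complexity is easy to estimate by Monte Carlo. A sketch; the sample, \Lambda, and trial count are assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
m, dim, Lam = 200, 5, 1.0
X = rng.normal(0.0, 1.0, (m, dim))   # sample S of points x_1, ..., x_m

# hat{R}_S(H) = (Lam / m) * E_sigma ||sum_i sigma_i x_i||, estimated by
# averaging over many Rademacher draws sigma in {-1, +1}^m.
trials = 10_000
sigma = rng.choice([-1.0, 1.0], size=(trials, m))
rad = Lam / m * np.linalg.norm(sigma @ X, axis=1).mean()

R = np.linalg.norm(X, axis=1).max()
print(f"MC estimate: {rad:.4f}  vs. bound R*Lam/sqrt(m) = {R * Lam / np.sqrt(m):.4f}")
```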
• Ledoux, M. and Talagrand, M. (1991). Probability in Banach Spaces. Springer, New York.
Summing the complementarity conditions \alpha_i [y_i (w \cdot x_i + b) - 1] = 0 over i gives

\sum_{i=1}^m \alpha_i = b \sum_{i=1}^m \alpha_i y_i + \sum_{i,j=1}^m \alpha_i \alpha_j y_i y_j (x_i \cdot x_j).
Using y_i^2 = 1, \sum_{i=1}^m \alpha_i y_i = 0, and w = \sum_{i=1}^m \alpha_i y_i x_i yields

0 = \sum_{i=1}^m \alpha_i - \|w\|^2.
Thus, the margin is also given by:

\rho^2 = \frac{1}{\|w\|_2^2} = \frac{1}{\|\alpha\|_1}.
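This identity is easy to verify numerically on separable data by training with a very large C, so the solution is effectively hard-margin. A sketch assuming scikit-learn's SVC and toy data:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e8).fit(X, y)    # huge C ~ hard margin

w = clf.coef_[0]
alpha_l1 = np.abs(clf.dual_coef_[0]).sum()     # ||alpha||_1 (all alpha_i >= 0)

# rho^2 = 1 / ||w||_2^2 = 1 / ||alpha||_1: both prints agree.
print(1.0 / np.dot(w, w), 1.0 / alpha_l1)
```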
The VC dimension d of the family of canonical hyperplanes over points in a ball of radius R verifies

d \le \frac{R^2}{\rho^2}.

• If x_1, \ldots, x_d are shattered, then for every labeling y \in \{-1, +1\}^d some w with \|w\| \le 1/\rho satisfies

\forall i \in [1, d], \quad 1 \le y_i (w \cdot x_i).

• Summing over i and taking the expectation over uniformly random labels gives d \le (R/\rho)\sqrt{d}; thus, d \le R^2/\rho^2.