Support Vector Machines
Charlie Frogner
MIT
2011
Slides mostly stolen from Ryan Rifkin (Google).
[Figure: the hinge loss as a function of $y \cdot f(x)$; it is zero for $y \cdot f(x) \ge 1$ and grows linearly as $y \cdot f(x)$ falls below 1.]
Note that we don't have a $\frac{1}{2}$ multiplier on the regularization term.
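For reference, the hinge loss plotted above is

$$V(y, f(x)) = \max(0,\ 1 - y f(x)) = (1 - y f(x))_+,$$

and at the optimum the slack variables $\xi_i$ introduced below take exactly this value: $\xi_i = (1 - y_i f(x_i))_+$.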
Substituting in

$$f^*(x) = \sum_{i=1}^n c_i K(x, x_i),$$

the problem becomes

$$\begin{aligned}
&\operatorname*{argmin}_{c \in \mathbb{R}^n,\, \xi \in \mathbb{R}^n} \; \frac{1}{n} \sum_{i=1}^n \xi_i + \lambda c^T K c \\
&\text{subject to: } \xi_i \ge 1 - y_i \sum_{j=1}^n c_j K(x_i, x_j), \quad i = 1, \dots, n \\
&\phantom{\text{subject to: }} \xi_i \ge 0, \quad i = 1, \dots, n
\end{aligned}$$
Adding an unregularized bias term $b$:

$$\begin{aligned}
&\operatorname*{argmin}_{c \in \mathbb{R}^n,\, b \in \mathbb{R},\, \xi \in \mathbb{R}^n} \; \frac{1}{n} \sum_{i=1}^n \xi_i + \lambda c^T K c \\
&\text{subject to: } \xi_i \ge 1 - y_i \Big( \sum_{j=1}^n c_j K(x_i, x_j) + b \Big), \quad i = 1, \dots, n \\
&\phantom{\text{subject to: }} \xi_i \ge 0, \quad i = 1, \dots, n
\end{aligned}$$
We define

$$C = \frac{1}{2\lambda n}.$$

Using this definition (after multiplying our objective function by the constant $\frac{1}{2\lambda}$), the regularization problem becomes

$$\operatorname*{argmin}_{f \in \mathcal{H}} \; C \sum_{i=1}^n V(y_i, f(x_i)) + \frac{1}{2} \|f\|^2_{\mathcal{H}}.$$
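To see this, scale the original objective by the positive constant $\frac{1}{2\lambda}$, which does not change the minimizer:

$$\frac{1}{2\lambda} \left( \frac{1}{n} \sum_{i=1}^n V(y_i, f(x_i)) + \lambda \|f\|^2_{\mathcal{H}} \right) = \frac{1}{2\lambda n} \sum_{i=1}^n V(y_i, f(x_i)) + \frac{1}{2} \|f\|^2_{\mathcal{H}} = C \sum_{i=1}^n V(y_i, f(x_i)) + \frac{1}{2} \|f\|^2_{\mathcal{H}}.$$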
In terms of $C$, the SVM primal problem is:

$$\begin{aligned}
&\operatorname*{argmin}_{c \in \mathbb{R}^n,\, b \in \mathbb{R},\, \xi \in \mathbb{R}^n} \; C \sum_{i=1}^n \xi_i + \frac{1}{2} c^T K c \\
&\text{subject to: } \xi_i \ge 1 - y_i \Big( \sum_{j=1}^n c_j K(x_i, x_j) + b \Big), \quad i = 1, \dots, n \\
&\phantom{\text{subject to: }} \xi_i \ge 0, \quad i = 1, \dots, n
\end{aligned}$$
Introducing Lagrange multipliers $\alpha_i \ge 0$ for the margin constraints and $\zeta_i \ge 0$ for the constraints $\xi_i \ge 0$, the Lagrangian is

$$L(c, \xi, b, \alpha, \zeta) = C \sum_{i=1}^n \xi_i + \frac{1}{2} c^T K c - \sum_{i=1}^n \alpha_i \Big( y_i \Big\{ \sum_{j=1}^n c_j K(x_i, x_j) + b \Big\} - 1 + \xi_i \Big) - \sum_{i=1}^n \zeta_i \xi_i$$
Dual:

$$\operatorname*{argmax}_{\alpha, \zeta \ge 0} \; \inf_{c, \xi, b} L(c, \xi, b, \alpha, \zeta)$$
Optimality conditions:
(1) $c_i = \alpha_i y_i$
(2) $\sum_{i=1}^n \alpha_i y_i = 0$
(3) $\alpha_i \in [0, C]$
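These follow from setting the partial derivatives of $L$ with respect to $c$, $b$, and $\xi$ to zero:

$$\begin{aligned}
\frac{\partial L}{\partial c} &= K c - K (\operatorname{diag} y)\, \alpha = 0, && \text{satisfied by } c_i = \alpha_i y_i, \\
\frac{\partial L}{\partial b} &= -\sum_{i=1}^n \alpha_i y_i = 0, && \text{giving (2)}, \\
\frac{\partial L}{\partial \xi_i} &= C - \alpha_i - \zeta_i = 0, && \text{so } \alpha_i = C - \zeta_i \in [0, C] \text{ since } \alpha_i, \zeta_i \ge 0.
\end{aligned}$$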
Plug in (1):

$$\operatorname*{argmax}_{\alpha \ge 0} \; L(\alpha) = \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i,j=1}^n \alpha_i y_i K(x_i, x_j) \alpha_j y_j = \sum_{i=1}^n \alpha_i - \frac{1}{2} \alpha^T (\operatorname{diag} Y) K (\operatorname{diag} Y)\, \alpha$$
Adding the remaining optimality conditions as constraints, the SVM primal above is equivalent to the dual problem

$$\begin{aligned}
&\max_{\alpha \in \mathbb{R}^n} \; \sum_{i=1}^n \alpha_i - \frac{1}{2} \alpha^T Q \alpha \\
&\text{subject to: } \sum_{i=1}^n y_i \alpha_i = 0 \\
&\phantom{\text{subject to: }} 0 \le \alpha_i \le C, \quad i = 1, \dots, n,
\end{aligned}$$

where $Q = (\operatorname{diag} Y) K (\operatorname{diag} Y)$.
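As a concrete illustration, here is a minimal sketch of solving this dual with the cvxopt QP solver; the function name solve_svm_dual, the precomputed kernel matrix K, labels y in {-1, +1}, and the value of C are our assumptions, not from the slides:

```python
import numpy as np
from cvxopt import matrix, solvers

def solve_svm_dual(K, y, C):
    """Solve max_a sum(a) - 0.5 a'Qa  s.t.  y'a = 0,  0 <= a_i <= C."""
    n = len(y)
    y = y.astype(float)
    Q = np.outer(y, y) * K              # Q = diag(Y) K diag(Y)
    # cvxopt minimizes 0.5 a'P a + q'a, so flip the sign of the linear term.
    P = matrix(Q)
    q = matrix(-np.ones(n))
    # Box constraints 0 <= a <= C, written as G a <= h.
    G = matrix(np.vstack([-np.eye(n), np.eye(n)]))
    h = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))
    # The single equality constraint y'a = 0.
    A = matrix(y.reshape(1, n))
    b = matrix(0.0)
    sol = solvers.qp(P, q, G, h, A, b)
    return np.ravel(sol['x'])
```

A call like alpha = solve_svm_dual(K, y, C=1.0) then yields the multipliers analyzed below.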
Basic idea: solve the dual problem to find the optimal α's, and use them to find b and c. The dual problem is easier to solve than the primal problem. It has simple box constraints and a single equality constraint, and the problem can be decomposed into a sequence of smaller problems (see appendix).
α tells us:
c and b.
The identities of the misclassified points.
How to analyze? Use the optimality conditions.
Already used: derivative of L w.r.t. (c, ξ, b) is zero at
optimality.
Haven’t used: complementary slackness, primal/dual
constraints.
$$\frac{\partial L}{\partial c} = 0 \implies c_i = \alpha_i y_i, \; \forall i$$
For any $i$ with $0 < \alpha_i < C$, complementary slackness gives

$$\begin{aligned}
\alpha_i < C &\implies \zeta_i > 0 \implies \xi_i = 0 \\
\alpha_i > 0 &\implies y_i \Big( \sum_{j=1}^n y_j \alpha_j K(x_i, x_j) + b \Big) - 1 = 0 \\
&\implies b = y_i - \sum_{j=1}^n y_j \alpha_j K(x_i, x_j),
\end{aligned}$$

where the last step multiplies through by $y_i$ and uses $y_i^2 = 1$.
(Remember we defined $f(x) = \sum_{i=1}^n y_i \alpha_i K(x, x_i) + b$.)
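Continuing the sketch from the dual solver, $b$ can be recovered by averaging this identity over the points with $0 < \alpha_i < C$; the helper name compute_bias and the numerical tolerance are our choices:

```python
import numpy as np

def compute_bias(K, y, alpha, C, tol=1e-6):
    """Recover b = y_i - sum_j y_j alpha_j K(x_i, x_j), averaged over margin SVs."""
    f_no_b = K @ (alpha * y)                       # f(x_i) minus the bias, for each i
    on_margin = (alpha > tol) & (alpha < C - tol)  # points with 0 < alpha_i < C
    return np.mean(y[on_margin] - f_no_b[on_margin])
```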
So

$$y_i f(x_i) < 1 \implies \alpha_i = C.$$

Conversely, suppose $\alpha_i = C$:

$$\alpha_i = C \implies \xi_i = 1 - y_i f(x_i) \implies y_i f(x_i) \le 1.$$

Summarizing:

$$\begin{aligned}
\alpha_i = 0 &\implies y_i f(x_i) \ge 1 & y_i f(x_i) > 1 &\implies \alpha_i = 0 \\
0 < \alpha_i < C &\implies y_i f(x_i) = 1 & & \\
\alpha_i = C &\implies y_i f(x_i) \le 1 & y_i f(x_i) < 1 &\implies \alpha_i = C
\end{aligned}$$
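These conditions also give a quick sanity check on a dual solution; this sketch (function name and tolerance ours) asserts each case:

```python
import numpy as np

def check_kkt(K, y, alpha, b, C, tol=1e-4):
    """Assert the relations between alpha_i and the margin y_i f(x_i)."""
    yf = y * (K @ (alpha * y) + b)       # margins y_i f(x_i)
    zero = alpha < tol                   # alpha_i = 0
    boxed = alpha > C - tol              # alpha_i = C
    assert np.all(yf[zero] >= 1 - tol)                    # alpha_i = 0 => margin >= 1
    assert np.all(yf[boxed] <= 1 + tol)                   # alpha_i = C => margin <= 1
    assert np.allclose(yf[~zero & ~boxed], 1, atol=tol)   # otherwise margin = 1
```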
For the linearly separable case, the classification function is $f(x) = \operatorname{sign}(\langle w, x \rangle)$.
Separable means $\exists w$ s.t. all points are beyond the margin, i.e.

$$y_i \langle w, x_i \rangle \ge 1, \; \forall i.$$

So we solve:

$$\begin{aligned}
&\operatorname*{argmin}_{w} \; \|w\|^2 \\
&\text{s.t. } y_i \langle w, x_i \rangle \ge 1, \; \forall i
\end{aligned}$$
(Why minimizing $\|w\|^2$ maximizes the margin follows.)
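A short justification, using the distance from a point to the hyperplane $\langle w, x \rangle = 0$ (a standard fact, not spelled out on the slides): the constraints $y_i \langle w, x_i \rangle \ge 1$ imply

$$\min_i \frac{|\langle w, x_i \rangle|}{\|w\|} \ge \frac{1}{\|w\|},$$

so every point is at distance at least $1/\|w\|$ from the separating hyperplane, and minimizing $\|w\|^2$ maximizes this margin.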
Our plan will be to solve the dual problem to find the α's, and use them to find b and our function f. The dual problem is easier to solve than the primal problem. It has simple box constraints and a single equality constraint and, even better, we will see that the problem can be decomposed into a sequence of smaller problems.
The reduced problems are fixed size, and can be solved using
a standard QP code. Convergence proofs are difficult, but this
approach seems to always converge to an optimal solution in
practice.
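In practice one rarely writes this loop by hand: for example, the decomposition solver in libsvm, exposed through scikit-learn's SVC (our example of such a QP code, not referenced on the slides), returns exactly the quantities discussed here:

```python
import numpy as np
from sklearn.svm import SVC

# Two Gaussian blobs with labels in {-1, +1}.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])

clf = SVC(C=1.0, kernel="rbf").fit(X, y)
print(clf.support_)     # indices i with alpha_i > 0 (the support vectors)
print(clf.dual_coef_)   # y_i * alpha_i for those support vectors
print(clf.intercept_)   # the bias b
```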