
Support Vector Machine (SVM)

DSA5103 Lecture 5

Yangjing Zhang
12-Sep-2023
NUS
Today’s content

1. SVM
2. Lagrange duality and KKT
3. Dual of SVM and kernels
4. SVM with soft constraints

SVM
Idea of support vector machine (SVM)

• Data: xi ∈ Rp, yi ∈ {−1, 1} (instead of {0, 1} in logistic regression), i = 1, . . . , n
• The two classes are assumed to be linearly separable
• Aim: learn a linear classifier f(x) = sign(βᵀx + β0)
• Question: what is the “best” separating hyperplane?
• SVM answer: the hyperplane with maximum margin.
• Margin = the distance to the closest data points.
Maximum margin separating hyperplane

For the separating hyperplane with maximum margin,
distance to points in positive class = distance to points in negative class
Normal cone of a hyperplane

Hyperplane H = H_{β,β0} = {x ∈ Rp | βᵀx + β0 = 0}
• a linear decision boundary
• a (p − 1)-dimensional affine subspace, closed, convex

Figure 1: Left: p = 2, H is a line. Right: p = 3, H is a plane.

• For any x̄ ∈ H, the normal cone is NH(x̄) = {λβ | λ ∈ R}; it must be 1-dimensional.
  We can show that β ∈ NH(x̄), i.e., ⟨β, z − x̄⟩ ≤ 0 ∀ z ∈ H.
  This is true since z, x̄ ∈ H ⇒ βᵀz + β0 = 0 and βᵀx̄ + β0 = 0 ⇒ βᵀ(z − x̄) = 0.
  Obviously, we also have −β ∈ NH(x̄).
Distance of a point to a hyperplane

Compute the distance of a point x to a hyperplane H = {x ∈ Rp | βᵀx + β0 = 0}.

1. x̄ = ΠH(x) ⇐⇒ x − x̄ ∈ NH(x̄) ⇐⇒ x − x̄ = λβ for some λ ∈ R (see Lecture 4, page 24)
2. x̄ ∈ H ⇒ βᵀx̄ + β0 = 0 ⇒ βᵀ(x − λβ) + β0 = 0 ⇒ λ = (βᵀx + β0)/(βᵀβ)
3. x − x̄ = ((βᵀx + β0)/(βᵀβ)) β, so ‖x − x̄‖ = |βᵀx + β0|/‖β‖

The distance of a point x to the hyperplane H is |βᵀx + β0|/‖β‖; it is invariant to scaling of the parameters β, β0.
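As a quick sanity check of this formula, here is a minimal NumPy sketch (NumPy is an assumption; the hyperplane and point are made-up values) comparing the closed-form distance with the distance obtained by explicit projection.

import numpy as np

beta = np.array([3.0, 4.0])   # normal vector of H = {x : β^T x + β0 = 0}
beta0 = -5.0
x = np.array([4.0, 3.0])

# distance via the closed-form expression derived above
dist_formula = abs(beta @ x + beta0) / np.linalg.norm(beta)

# distance via explicit projection: x_bar = x - λβ with λ = (β^T x + β0)/(β^T β)
lam = (beta @ x + beta0) / (beta @ beta)
x_bar = x - lam * beta
dist_projection = np.linalg.norm(x - x_bar)

print(dist_formula, dist_projection)  # both equal 19/5 = 3.8 here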
Maximize margin

• Margin: γ = γ(β, β0) = min_{i=1,...,n} |βᵀxi + β0|/‖β‖
• All data points must lie on the correct side:
  βᵀxi + β0 ≥ 0 when yi = 1,   βᵀxi + β0 ≤ 0 when yi = −1
  ⇐⇒ yi(βᵀxi + β0) ≥ 0, ∀ i ∈ [n] = {1, . . . , n}
• Therefore, the optimization problem is

  max_{β,β0}  min_{i=1,...,n} |βᵀxi + β0|/‖β‖
  s.t.        yi(βᵀxi + β0) ≥ 0, ∀ i ∈ [n]
Simplify the optimization problem

  max_{β,β0}  (1/‖β‖) min_{i=1,...,n} |βᵀxi + β0|
  s.t.        yi(βᵀxi + β0) ≥ 0 ∀ i

      ⇐⇒

  max_{β,β0}  1/‖β‖
  s.t.        yi(βᵀxi + β0) ≥ 0 ∀ i
              min_{i=1,...,n} |βᵀxi + β0| = 1

• The hyperplane and margin are scale invariant: (β, β0) → (cβ, cβ0) for any c ≠ 0
• If xk is the closest point to H, i.e., k = arg min_{i=1,...,n} |βᵀxi + β0|, we can scale β, β0 such that |βᵀxk + β0| = 1
Simplify the optimization problem

  max_{β,β0}  1/‖β‖
  s.t.        yi(βᵀxi + β0) ≥ 0 ∀ i
              min_{i=1,...,n} |βᵀxi + β0| = 1

      ⇐⇒

  min_{β,β0}  ‖β‖²
  s.t.        yi(βᵀxi + β0) ≥ 1 ∀ i

• “⇒” Note that yi ∈ {−1, 1}
• “⇐” Note that we minimize ‖β‖
SVM

SVM is a quadratic programming (QP) problem — it can be solved by generic QP solvers:

  min_{β,β0}  (1/2)‖β‖²
  s.t.        yi(βᵀxi + β0) ≥ 1 ∀ i ∈ [n]

• Later, we will discuss Lagrangian duality and derive the dual problem of the above
• The dual problem will play a key role in allowing us to use kernels (introduced later)
• The dual problem will also allow us to derive an efficient algorithm better than generic QP solvers (especially when n ≪ p)
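To make the QP concrete, here is a sketch that solves the primal with the generic convex-optimization package cvxpy (an assumption: cvxpy is not part of this lecture); X and y are made-up, linearly separable toy data.

import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])  # n x p toy data
y = np.array([1.0, 1.0, -1.0, -1.0])
n, p = X.shape

beta = cp.Variable(p)
beta0 = cp.Variable()
# constraints y_i (β^T x_i + β0) >= 1, objective (1/2)||β||^2
constraints = [cp.multiply(y, X @ beta + beta0) >= 1]
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(beta)), constraints)
prob.solve()

print(beta.value, beta0.value)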
Support vectors

Support vectors are the data points xi whose constraints are tight (active):

  yi(βᵀxi + β0) = 1

• Support vectors must exist
• Number of support vectors ≪ sample size n
• The resulting hyperplane may change if some support vectors are removed
Lagrange duality and KKT
Lagrangian

Consider a general nonlinear programming problem (NLP), known as the primal problem:

  (P)  min_{x ∈ Rp}  f(x)
       s.t.          gi(x) = 0, i ∈ [m]
                     hj(x) ≤ 0, j ∈ [l]

• Define the Lagrangian

  L(x, v, u) = f(x) + Σ_{i=1}^m vi gi(x) + Σ_{j=1}^l uj hj(x)

  for v = [v1; . . . ; vm] ∈ Rm, u = [u1; . . . ; ul] ∈ Rl+.

• Define the Lagrange dual function (always concave)

  θ(v, u) = min_x L(x, v, u)
Lagrangian

• In evaluating θ(v, u) for each (v, u), we must solve

  min_x L(x, v, u) = f(x) + Σ_{i=1}^m vi gi(x) + Σ_{j=1}^l uj hj(x)

  We may set ∂L/∂x = 0 if f, gi, hj are convex and differentiable
• The dual function θ is concave even when (P) is not convex — verify that
  θ(λv + (1 − λ)v′, λu + (1 − λ)u′) ≥ λθ(v, u) + (1 − λ)θ(v′, u′)
• Suppose x is a feasible point of (P). Then for any v ∈ Rm, u ∈ Rl+,

  θ(v, u) = min_x L(x, v, u) ≤ L(x, v, u)
          = f(x) + Σ_{i=1}^m vi gi(x) + Σ_{j=1}^l uj hj(x)
          ≤ f(x)
Lagrangian dual problem

• The dual function θ(v, u), at any dual feasible v ∈ Rm, u ∈ Rl+, is a lower bound of the primal function f(x) at any primal feasible x
• We want to search for the largest lower bound — leading to the Lagrangian dual problem

  (D)  max_{v,u}  θ(v, u)
       s.t.       v ∈ Rm, u ∈ Rl+

  Here vi, uj are called dual variables or Lagrange multipliers.
Primal and dual

Definition (Lagrangian dual problem)
For a primal nonlinear programming problem (P)

  (P)  min_{x ∈ Rp}  f(x)
       s.t.          gi(x) = 0, i ∈ [m]
                     hj(x) ≤ 0, j ∈ [l]

the Lagrangian dual problem (D) is the following nonlinear programming problem

  (D)  max_{v,u}  θ(v, u) = min_x { f(x) + Σ_{i=1}^m vi gi(x) + Σ_{j=1}^l uj hj(x) }
       s.t.       v ∈ Rm, u ∈ Rl+

• Weak duality: optimal value of (D) ≤ optimal value of (P)
• Under certain assumptions (see the KKT slide below), strong duality: optimal value of (D) = optimal value of (P)
Example

Find the dual problem of the convex program

  min  x1² + x2²          f(x) = x1² + x2²
  s.t. x1 + x2 ≥ 4         h1(x) = 4 − x1 − x2 ≤ 0  ← u1 ≥ 0

Solution. For u1 ≥ 0, the Lagrangian is

  L(x1, x2, u1) = f(x) + u1 h1(x) = x1² + x2² + u1(4 − x1 − x2)

The dual function is

  θ(u1) = min_{x1,x2} { x1² + x2² + u1(4 − x1 − x2) }
        = 4u1 + min_{x1} {x1² − u1 x1} + min_{x2} {x2² − u1 x2}
        = 4u1 − u1²/2    (attained at x1 = u1/2, x2 = u1/2)

The dual problem is

  max  4u1 − u1²/2
  s.t. u1 ≥ 0
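As a numerical check of strong duality for this example, the sketch below (again assuming cvxpy is available) solves both problems; both optimal values should equal 8, attained at x = (2, 2) and u1 = 4.

import cvxpy as cp

x = cp.Variable(2)
primal = cp.Problem(cp.Minimize(cp.sum_squares(x)), [x[0] + x[1] >= 4])
primal.solve()

u1 = cp.Variable(nonneg=True)
dual = cp.Problem(cp.Maximize(4 * u1 - cp.square(u1) / 2))
dual.solve()

print(primal.value, x.value)   # ~8.0, [2. 2.]
print(dual.value, u1.value)    # ~8.0, 4.0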
Example: LP

Consider the linear programming (LP) problem in standard form

  min_x  cᵀx
  s.t.   Ax = b
         x ≥ 0

where A ∈ Rm×n, b ∈ Rm, c ∈ Rn, and x ≥ 0 means xi ≥ 0, i ∈ [n]. Find the dual function and dual problem.

Solution. Let v ∈ Rm and u ∈ Rn+. The dual function is

  θ(v, u) = min_x { cᵀx + vᵀ(b − Ax) − uᵀx } = vᵀb + min_x { xᵀ(c − Aᵀv − u) }
          = vᵀb   if c − Aᵀv − u = 0
            −∞    otherwise

Dual problem:

  max_{v,u}  bᵀv                         max_v  bᵀv
  s.t.       Aᵀv + u = c      i.e.,      s.t.   Aᵀv ≤ c
             u ≥ 0
Example: LP

Consider the LP in standard inequality form

  min_x  cᵀx
  s.t.   Ax ≤ b

where A ∈ Rm×n, b ∈ Rm, c ∈ Rn, and the inequality in the constraint Ax ≤ b is interpreted component-wise. Find the dual function and dual problem.

Solution. Let u ∈ Rm+. The dual function is

  θ(u) = min_{x ∈ Rn} { cᵀx + uᵀ(Ax − b) } = −uᵀb + min_{x ∈ Rn} { xᵀ(c + Aᵀu) }
       = −uᵀb   if c + Aᵀu = 0
         −∞     otherwise

Dual problem:

  max_u  −bᵀu
  s.t.   Aᵀu + c = 0
         u ≥ 0
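The following sketch checks this primal and dual pair numerically on a small instance, assuming SciPy is installed; the values of A, b, c are toy data chosen so that both problems are bounded.

import numpy as np
from scipy.optimize import linprog

A = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
b = np.array([1.0, 1.0, 0.0, 0.0])
c = np.array([-1.0, -1.0])

# Primal: min c^T x  s.t.  Ax <= b  (x free)
primal = linprog(c, A_ub=A, b_ub=b, bounds=(None, None))

# Dual: max -b^T u  s.t.  A^T u + c = 0, u >= 0
# (written as a minimization of b^T u for linprog)
dual = linprog(b, A_eq=A.T, b_eq=-c, bounds=(0, None))

print(primal.fun, -dual.fun)  # both should be -2.0 (strong duality)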
Example: Lasso

Consider the problem

  min_{β,z}  (1/2)‖z‖² + λ‖β‖₁
  s.t.       z + Xβ = Y

where X ∈ Rn×p, Y ∈ Rn, λ > 0. Find the dual function and dual problem.

Solution.

1. Let y ∈ Rn. The Lagrangian is

   L(β, z, y) = (1/2)‖z‖² + λ‖β‖₁ + ⟨y, Y − z − Xβ⟩
              = (1/2)‖z‖² − ⟨y, z⟩ + λ‖β‖₁ − ⟨Xᵀy, β⟩ + ⟨y, Y⟩
Example: Lasso

2. The dual function is θ(y) = min_{β,z} L(β, z, y)

   = min_{β,z} { (1/2)‖z‖² − ⟨y, z⟩ + λ‖β‖₁ − ⟨Xᵀy, β⟩ + ⟨y, Y⟩ }
   = min_β { λ‖β‖₁ − ⟨Xᵀy, β⟩ } + min_z { (1/2)‖z‖² − ⟨y, z⟩ } + ⟨y, Y⟩

   • Setting the gradient with respect to z to zero, z − y = 0 ⇒ z = y, so min_z { (1/2)‖z‖² − ⟨y, z⟩ } = −(1/2)‖y‖²
   • min_β { λ‖β‖₁ − ⟨Xᵀy, β⟩ } = −λ max_β { ⟨Xᵀy/λ, β⟩ − h(β) } with h(β) = ‖β‖₁
     = −λ h*(Xᵀy/λ) = −δ_{B1}(Xᵀy/λ),  where B1 = {β ∈ Rp | ‖β‖∞ ≤ 1}  (see Lecture 4, page 34)
Example: Lasso

2. The dual function is

   θ(y) = −δ_{B1}(Xᵀy/λ) − (1/2)‖y‖² + ⟨y, Y⟩,   B1 = {β ∈ Rp | ‖β‖∞ ≤ 1}

3. The dual problem is

   max_y θ(y) = − min_y { δ_{B1}(Xᵀy/λ) + (1/2)‖y‖² − ⟨y, Y⟩ }
              = − min_y { (1/2)‖y‖² − ⟨y, Y⟩  s.t.  ‖Xᵀy‖∞ ≤ λ }
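At the optimum, z* = y*, so a candidate dual solution is y* = Y − Xβ*, and dual feasibility reads ‖Xᵀ(Y − Xβ*)‖∞ ≤ λ. The sketch below checks this numerically with scikit-learn (an assumption, not part of the lecture); note that sklearn's Lasso minimizes (1/(2n))‖Y − Xβ‖² + α‖β‖₁, so λ here is taken as n·α, which is a scaling assumption worth double-checking against the scikit-learn documentation.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 50, 10
X = rng.standard_normal((n, p))
Y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.standard_normal(n)

lam = 5.0
model = Lasso(alpha=lam / n, fit_intercept=False).fit(X, Y)  # scaling assumption: λ = n*alpha
beta_hat = model.coef_

y_star = Y - X @ beta_hat                        # candidate dual solution y* = Y - Xβ*
print(np.max(np.abs(X.T @ y_star)), "<=", lam)   # dual feasibility ||X^T y*||_inf <= λ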
KKT

Assumptions

1. f, hj : Rp → R are differentiable and convex
2. gi : Rp → R are affine (gi(x) = aiᵀx + bi)
3. Slater’s condition holds, i.e., there exists x̂ such that
   gi(x̂) = 0 ∀ i,   hj(x̂) < 0 ∀ j

Under the above assumptions, strong duality holds, and there exist a solution x* to (P) and a solution (v*, u*) to (D) satisfying the Karush-Kuhn-Tucker (KKT) conditions:

  (∂/∂x) L(x*, v*, u*) = ∇f(x*) + Σ_{i=1}^m vi* ∇gi(x*) + Σ_{j=1}^l uj* ∇hj(x*) = 0
  gi(x*) = 0,  hj(x*) ≤ 0,  uj* ≥ 0,  uj* hj(x*) = 0,  ∀ i ∈ [m], j ∈ [l]
KKT

• We say (x*, v*, u*) (or simply x*) is a KKT point or a KKT solution if (x*, v*, u*) satisfies the KKT conditions
• Under the above assumptions, (x*, v*, u*) is a KKT solution ⇐⇒ x* is an optimal solution to (P) and (v*, u*) is an optimal solution to (D)
• We call
    uj* hj(x*) = 0, ∀ j ∈ [l]
  the complementary slackness condition. It implies
    uj* = 0 if hj(x*) < 0,   hj(x*) = 0 if uj* > 0

  That is, for the pair of constraints hj(x*) ≤ 0 and uj* ≥ 0 with uj* hj(x*) = 0:
  • If the constraint hj(x*) ≤ 0 is slack (hj(x*) < 0), then the constraint uj* ≥ 0 is active (uj* = 0)
  • If the constraint uj* ≥ 0 is slack (uj* > 0), then the constraint hj(x*) ≤ 0 is active (hj(x*) = 0)
Dual of SVM
Dual of SVM

Derive the dual of the following SVM problem

  min_{β,β0}  (1/2)‖β‖²
  s.t.        1 − yi(βᵀxi + β0) ≤ 0 ∀ i ∈ [n]

1. For α ∈ Rn+, the Lagrangian is

   L(β, β0, α) = (1/2)‖β‖² + Σ_{i=1}^n αi (1 − yi(βᵀxi + β0))

2. The dual function is

   θ(α) = min_{β,β0} L(β, β0, α)
        = min_{β,β0} { (1/2)‖β‖² − Σ_{i=1}^n αi yi xiᵀβ − (Σ_{i=1}^n αi yi) β0 + Σ_{i=1}^n αi }
Dual of SVM

We need to solve the optimization problem

  min_{β,β0}  (1/2)‖β‖² − Σ_{i=1}^n αi yi xiᵀβ − (Σ_{i=1}^n αi yi) β0 + Σ_{i=1}^n αi

Setting ∂L/∂β = β − Σ_{i=1}^n αi yi xi = 0 and ∂L/∂β0 = −Σ_{i=1}^n αi yi = 0, we obtain that

  θ(α) = Σ_{i=1}^n αi − (1/2) Σ_{i,j=1}^n αi αj yi yj xiᵀxj   if Σ_{i=1}^n αi yi = 0
         −∞                                                   otherwise

3. The dual problem is

  max_α  Σ_{i=1}^n αi − (1/2) Σ_{i,j=1}^n αi αj yi yj ⟨xi, xj⟩
  s.t.   Σ_{i=1}^n αi yi = 0,   αi ≥ 0, i ∈ [n]
KKT of SVM

Primal:
  min_{β,β0}  (1/2)‖β‖²
  s.t.        yi(βᵀxi + β0) ≥ 1, i ∈ [n]

Dual:
  max_α  Σ_{i=1}^n αi − (1/2) Σ_{i,j=1}^n αi αj yi yj ⟨xi, xj⟩
  s.t.   Σ_{i=1}^n αi yi = 0
         αi ≥ 0, i ∈ [n]

Verify the assumptions (on the KKT slide above) for strong duality and the existence of KKT points: (Slater’s condition) there exist β̂, β̂0 such that yi(β̂ᵀxi + β̂0) > 1, i ∈ [n]. This requires that the two classes are strictly separable.

Figure 2: Left: strictly separable. Right: separable but not strictly separable.
KKT of SVM

KKT conditions:

  Σ_{i=1}^n αi* yi xi = β*,   Σ_{i=1}^n αi* yi = 0
  yi((β*)ᵀxi + β0*) ≥ 1, i ∈ [n]
  αi* ≥ 0, i ∈ [n]
  αi* (1 − yi((β*)ᵀxi + β0*)) = 0, i ∈ [n]

1. If we obtain a dual solution α* (via solving the SVM dual problem), then we can construct a primal solution (β*, β0*) from the KKT conditions:

   β* = Σ_{i=1}^n αi* yi xi
   β0* = yk − Σ_{i=1}^n αi* yi ⟨xi, xk⟩,  for some k satisfying αk* > 0
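The sketch below solves the SVM dual numerically with cvxpy (assumed installed) and reconstructs (β*, β0*) exactly as in point 1; the data are small made-up separable points. The quadratic term is written as ‖Σ αi yi xi‖², which is valid here because the xi are given explicitly (with a kernel one would keep the Gram-matrix form instead).

import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])   # toy data, n x p
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

alpha = cp.Variable(n, nonneg=True)
# Σ_{i,j} αi αj yi yj <xi, xj> = || Σ_i αi yi xi ||^2 since the xi are explicit here
objective = cp.sum(alpha) - 0.5 * cp.sum_squares(X.T @ cp.multiply(y, alpha))
prob = cp.Problem(cp.Maximize(objective), [cp.sum(cp.multiply(y, alpha)) == 0])
prob.solve()

a = alpha.value
beta = X.T @ (a * y)                       # β* = Σ αi* yi xi
k = int(np.argmax(a))                      # an index with αk* > 0 (a support vector)
beta0 = y[k] - X[k] @ beta                 # β0* = yk − Σ αi* yi <xi, xk>
print(np.round(a, 4), beta, beta0)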
KKT of SVM

2. If αi* > 0, then xi is a support vector, by the complementary slackness condition:

   αi* > 0 ⇒ yi((β*)ᵀxi + β0*) = 1

3. |{i | αi* > 0}| ≤ the number of support vectors ≪ n
   The dual solution is sparse (many αi* = 0)
KKT of SVM

4. Decision boundary:

   0 = (β*)ᵀx + β0* = Σ_{i=1}^n αi* yi ⟨xi, x⟩ + β0* = Σ_{i: αi*>0} αi* yi ⟨xi, x⟩ + β0*

   For a new test point x, the prediction depends only on ⟨xi, x⟩ with αi* > 0, namely, the inner products between x and the support vectors.
Primal vs Dual

Primal:
  min_{β,β0}  (1/2)‖β‖²
  s.t.        yi(βᵀxi + β0) ≥ 1, i ∈ [n]
Classifier: f(x) = sign(βᵀx + β0)

Dual:
  max_α  Σ_{i=1}^n αi − (1/2) Σ_{i,j=1}^n αi αj yi yj ⟨xi, xj⟩
  s.t.   Σ_{i=1}^n αi yi = 0,   αi ≥ 0, i ∈ [n]
Classifier: f(x) = sign( Σ_{i=1}^n αi yi ⟨xi, x⟩ + β0 )
Many αi’s are zero (sparse solutions)

• Optimize p + 1 variables for the primal, n variables for the dual
• When n ≪ p, it might be more efficient to solve the dual
• The dual problem only involves ⟨xi, xj⟩ — allowing the use of kernels
Feature mapping

• Recall feature expansion, for example, the i-th feature vector

  xi = [xi1; xi2]  →(feature expansion)→  [xi1; xi2; xi1²; xi2²; xi1 xi2]

• Let φ denote the feature mapping, which maps from the original features to the new features. For example,

  φ([z1; z2]) = [z1; z2; z1²; z2²; z1 z2]

• Instead of using the original feature vectors xi, i ∈ [n], we may apply SVM using the new features φ(xi), i ∈ [n]
• The new feature space can be very high dimensional
Kernel

Recall the primal and dual of SVM above. Using the dual:
• For feature expansion, simply replace ⟨xi, xj⟩ with ⟨φ(xi), φ(xj)⟩
• Given a feature mapping φ, we define the corresponding kernel

  K(a, b) = ⟨φ(a), φ(b)⟩,  a, b ∈ Rp

• Computing K(a, b) is often very cheap, even though computing φ(a), φ(b) (high dimensional vectors) may be expensive
• The dual of SVM only requires the computation of the kernel values K(xi, xj). Explicitly calculating φ(xi) is not necessary
Example: (homogeneous) polynomial kernel

For a, b ∈ Rp, consider

  K(a, b) = (aᵀb)²

It can be written as

  K(a, b) = (Σ_{i=1}^p ai bi)(Σ_{j=1}^p aj bj) = Σ_{i,j=1}^p ai aj bi bj = Σ_{i,j=1}^p (ai aj)(bi bj)

Thus, we see that K(a, b) = ⟨φ(a), φ(b)⟩, where the feature mapping φ : Rp → Rp² is given by

  φ(a) = φ([a1; a2; . . . ; ap]) = [a1 a1; a1 a2; a1 a3; . . . ; ap ap]

Computing φ(a): O(p²) operations; computing K(a, b): O(p) operations.
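A quick numerical check (NumPy assumed) that the kernel value (aᵀb)² matches the inner product of the explicitly expanded features:

import numpy as np

rng = np.random.default_rng(1)
p = 4
a, b = rng.standard_normal(p), rng.standard_normal(p)

phi = lambda v: np.outer(v, v).ravel()        # all p^2 products v_i v_j
print((a @ b) ** 2, phi(a) @ phi(b))          # the two values agree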
Example: (inhomogeneous) polynomial kernel

Given c ≥ 0. For a, b ∈ Rp, consider

  K(a, b) = (aᵀb + c)²
          = Σ_{i,j=1}^p (ai aj)(bi bj) + Σ_{i=1}^p (√(2c) ai)(√(2c) bi) + c²

Thus, we see that K(a, b) = ⟨φ(a), φ(b)⟩, where the feature mapping φ : Rp → R^(p²+p+1) is given by

  φ(a) = [a1 a1, a1 a2, a1 a3, . . . , ap ap, √(2c) a1, √(2c) a2, . . . , √(2c) ap, c]ᵀ
         (second-order terms, then first-order terms, then a constant)

The parameter c controls the relative weighting between the first-order and second-order terms.
Common kernels

• Polynomials of degree d

  K(a, b) = (aᵀb)^d

• Polynomials up to degree d

  K(a, b) = (aᵀb + 1)^d

• Gaussian kernel — polynomials of all orders (since eˣ = Σ_{n=0}^∞ xⁿ/n!)

  K(a, b) = exp(−‖a − b‖²/(2σ²)),  σ > 0
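For instance, the Gaussian kernel matrix Kij = exp(−‖xi − xj‖²/(2σ²)) over a data set can be formed without any explicit feature map; a minimal NumPy sketch (the function name and data are illustrative):

import numpy as np

def gaussian_kernel_matrix(X, sigma=1.0):
    # squared pairwise distances via ||xi||^2 + ||xj||^2 - 2 xi^T xj
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T
    return np.exp(-np.maximum(sq_dists, 0) / (2 * sigma ** 2))

X = np.random.default_rng(2).standard_normal((5, 3))
print(gaussian_kernel_matrix(X, sigma=2.0).shape)   # (5, 5), ones on the diagonal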
Kernel

• SVM can be applied in high dimensional feature spaces, without explicitly applying the feature mapping
• The two classes might be separable in the high dimensional feature space, but not separable in the original feature space
• Kernels can be used efficiently in the dual problem of SVM because the dual only involves inner products
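An illustrative sketch with scikit-learn (assumed installed; its SVC trains the kernelized dual): on data that are not linearly separable in R², a linear kernel fails while the Gaussian (RBF) kernel separates the two classes. The data generator make_circles and the parameter values are illustrative choices, not part of the lecture.

from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)
rbf_svm = SVC(kernel="rbf", gamma=2.0, C=1.0).fit(X, y)

print("linear kernel accuracy:", linear_svm.score(X, y))  # poor, near chance level
print("RBF kernel accuracy:   ", rbf_svm.score(X, y))     # close to 1.0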
SVM with soft constraints
SVM with soft constraints

When the two classes are not separable, no feasible separating hyperplane exists. We allow the constraints to be violated slightly (C > 0 is given):

  min_{β,β0,ε}  (1/2)‖β‖² + C Σ_{i=1}^n εi
  s.t.          yi(βᵀxi + β0) ≥ 1 − εi ∀ i ∈ [n]
                εi ≥ 0, i ∈ [n]

At the optimum,

  εi = 1 − yi(βᵀxi + β0)  if yi(βᵀxi + β0) < 1,   εi = 0  if yi(βᵀxi + β0) ≥ 1
     = max{1 − yi(βᵀxi + β0), 0}

So SVM with soft constraints solves

  min_{β,β0}  (1/2)‖β‖²  +  C Σ_{i=1}^n max{1 − yi(βᵀxi + β0), 0}
              [ridge regularization]   [hinge-loss function]
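The hinge-loss form of the objective is easy to evaluate directly; a small NumPy sketch with made-up values for β, β0, X, y:

import numpy as np

def soft_svm_objective(beta, beta0, X, y, C):
    margins = y * (X @ beta + beta0)
    hinge = np.maximum(1.0 - margins, 0.0)    # ε_i = max{1 - y_i(β^T x_i + β0), 0}
    return 0.5 * beta @ beta + C * np.sum(hinge)

X = np.array([[1.0, 2.0], [2.0, 0.5], [-1.0, -1.5], [-2.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
print(soft_svm_objective(np.array([1.0, 0.5]), 0.0, X, y, C=1.0))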
Logistic regression

Recall that (Lecture 3, page 19)

  logistic-loss = Σ_{i=1}^n [ log(1 + e^(β0 + βᵀxi)) − yi(β0 + βᵀxi) ]

Equivalently, for each observation,

  logistic-loss = log(1 + e^(−(βᵀxi + β0)))   if yi = 1
                  log(1 + e^(βᵀxi + β0))       if yi = 0

Change the label yi = 0 → yi = −1:

  logistic-loss = log(1 + e^(−yi(βᵀxi + β0))),  yi ∈ {−1, 1}

Logistic regression with ridge regularization:

  min_{β,β0}  Σ_{i=1}^n log(1 + e^(−yi(βᵀxi + β0))) + λ‖β‖²
SVM vs. logistic regression

SVM with soft constraints:

  min_{β,β0}  C Σ_{i=1}^n max{1 − yi(βᵀxi + β0), 0} + (1/2)‖β‖²

Logistic regression with ridge regularization:

  min_{β,β0}  Σ_{i=1}^n log(1 + e^(−yi(βᵀxi + β0))) + λ‖β‖²

Hinge-loss: max{1 − z, 0}, with z = yi(βᵀxi + β0); we hope z ≥ 1
Logistic-loss: log(1 + e^(−z)), with z = yi(βᵀxi + β0); we hope z ≫ 0

(Plot of the two losses as functions of z: the logistic loss is a “smoothed version” of the hinge loss.)
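A quick numerical comparison of the two losses at a few margin values z (NumPy assumed):

import numpy as np

z = np.array([-2.0, 0.0, 1.0, 3.0])
hinge = np.maximum(1.0 - z, 0.0)
logistic = np.log1p(np.exp(-z))
print(hinge)     # [3.  1.  0.  0. ]
print(logistic)  # [2.127  0.693  0.313  0.049]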
SVM with soft constraints: dual

SVM with soft constraints:

  min_{β,β0,ε}  (1/2)‖β‖² + C Σ_{i=1}^n εi
  s.t.          1 − εi − yi(βᵀxi + β0) ≤ 0 ∀ i ∈ [n]
                −εi ≤ 0, i ∈ [n]

Find the dual problem.

1. For α ∈ Rn+, r ∈ Rn+, the Lagrangian is

   L(β, β0, ε, α, r) = (1/2)‖β‖² + C Σ_{i=1}^n εi + Σ_{i=1}^n αi (1 − εi − yi(βᵀxi + β0)) − Σ_{i=1}^n ri εi
                     = (1/2)‖β‖² − Σ_{i=1}^n αi yi xiᵀβ − (Σ_{i=1}^n αi yi) β0 + Σ_{i=1}^n (C − αi − ri) εi + Σ_{i=1}^n αi
SVM with soft constraints: dual

2. The dual function is θ(α, r) = min_{β,β0,ε} L(β, β0, ε, α, r)

   = min_{β,β0,ε} { (1/2)‖β‖² − Σ_{i=1}^n αi yi xiᵀβ − (Σ_{i=1}^n αi yi) β0 + Σ_{i=1}^n (C − αi − ri) εi + Σ_{i=1}^n αi }

   Setting ∂L/∂β = β − Σ_{i=1}^n αi yi xi = 0,  ∂L/∂β0 = −Σ_{i=1}^n αi yi = 0,  ∂L/∂εi = C − αi − ri = 0,
   we obtain that

   θ(α, r) = Σ_{i=1}^n αi − (1/2) Σ_{i,j=1}^n αi αj yi yj xiᵀxj   if Σ_{i=1}^n αi yi = 0 and αi + ri = C, i ∈ [n]
             −∞                                                   otherwise
SVM with soft constraints: dual

3. The dual problem is max_{α,r} { θ(α, r) | α ∈ Rn+, r ∈ Rn+ }

   ⇐⇒  max_{α,r}  Σ_{i=1}^n αi − (1/2) Σ_{i,j=1}^n αi αj yi yj ⟨xi, xj⟩
        s.t.      Σ_{i=1}^n αi yi = 0
                  αi + ri = C, i ∈ [n]
                  α ∈ Rn+, r ∈ Rn+

   ⇐⇒  max_α  Σ_{i=1}^n αi − (1/2) Σ_{i,j=1}^n αi αj yi yj ⟨xi, xj⟩
        s.t.   Σ_{i=1}^n αi yi = 0
               0 ≤ αi ≤ C, i ∈ [n]
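In practice this box-constrained dual is what LIBSVM solves. The sketch below (scikit-learn assumed installed; its SVC wraps LIBSVM) checks that the fitted dual variables satisfy 0 ≤ αi ≤ C and that only a few are nonzero; SVC stores yi·αi for the support vectors in the attribute dual_coef_. The data generator and parameter values are illustrative choices.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=5, random_state=0)
C = 2.0
clf = SVC(kernel="linear", C=C).fit(X, y)

alphas = np.abs(clf.dual_coef_).ravel()              # α_i for the support vectors
print(len(clf.support_), "support vectors out of", len(y))
print(alphas.min() >= 0, alphas.max() <= C + 1e-8)   # box constraint 0 <= α_i <= C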
In practice

You are encouraged to learn two popular open source machine learning
libraries:
LIBLINEAR https://www.csie.ntu.edu.tw/~cjlin/liblinear/
LIBSVM https://www.csie.ntu.edu.tw/~cjlin/libsvm/

