
Support Vector Machine (SVM)

DSA5103 Lecture 5

Yangjing Zhang
12-Sep-2023
NUS
Today’s content

1. SVM
2. Lagrange duality and KKT
3. Dual of SVM and kernels
4. SVM with soft constraints

SVM
Idea of support vector machine (SVM)

• Data: xi ∈ Rp, yi ∈ {−1, 1} (instead of {0, 1} in logistic regression), i = 1, . . . , n
• The two classes are assumed to be linearly separable
• Aim: learn a linear classifier f(x) = sign(βᵀx + β0)
• Question: what is the “best” separating hyperplane?
• SVM answer: the hyperplane with maximum margin.
• Margin = the distance to the closest data points.
Maximum margin separating hyperplane

For the separating hyperplane with maximum margin,
distance to points in positive class = distance to points in negative class
Normal cone of a hyperplane

Hyperplane H = H_{β,β0} = {x ∈ Rp | βᵀx + β0 = 0}
• a linear decision boundary
• a (p − 1)-dimensional affine subspace, closed, convex

Figure 1: Left: p = 2, H is a line. Right: p = 3, H is a plane.

• For any x̄ ∈ H, the normal cone is NH(x̄) = {λβ | λ ∈ R}; it must be 1-dimensional.
  We can show that β ∈ NH(x̄), i.e., ⟨β, z − x̄⟩ ≤ 0 ∀ z ∈ H.
  This is true since z, x̄ ∈ H ⇒ βᵀz + β0 = 0 and βᵀx̄ + β0 = 0 ⇒ βᵀ(z − x̄) = 0.
  Obviously, we also have −β ∈ NH(x̄).
Distance of a point to a hyperplane

Compute the distance of a point x to a hyperplane H = {x ∈ Rp | βᵀx + β0 = 0}.

1. x̄ = ΠH(x) ⇐⇒ x − x̄ ∈ NH(x̄) ⇐⇒ x − x̄ = λβ for some λ ∈ R (see Lecture 4, page 24)
2. x̄ ∈ H ⇒ βᵀx̄ + β0 = 0 ⇒ βᵀ(x − λβ) + β0 = 0 ⇒ λ = (βᵀx + β0)/(βᵀβ)
3. x − x̄ = ((βᵀx + β0)/(βᵀβ)) β, so ‖x − x̄‖ = |βᵀx + β0|/‖β‖

The distance of a point x to the hyperplane H is |βᵀx + β0|/‖β‖; it is invariant to scaling of the parameters β, β0.
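As a quick sanity check of this formula, here is a minimal NumPy sketch (NumPy is an assumption; the hyperplane and point are made-up values) comparing the closed-form distance with the distance obtained by explicit projection.

import numpy as np

beta = np.array([3.0, 4.0])   # normal vector of H = {x : β^T x + β0 = 0}
beta0 = -5.0
x = np.array([4.0, 3.0])

# distance via the closed-form expression derived above
dist_formula = abs(beta @ x + beta0) / np.linalg.norm(beta)

# distance via explicit projection: x_bar = x - λβ with λ = (β^T x + β0)/(β^T β)
lam = (beta @ x + beta0) / (beta @ beta)
x_bar = x - lam * beta
dist_projection = np.linalg.norm(x - x_bar)

print(dist_formula, dist_projection)  # both equal 19/5 = 3.8 here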
Maximize margin

• Margin: γ = γ(β, β0) = min_{i=1,...,n} |βᵀxi + β0|/‖β‖
• All data points must lie on the correct side:
  βᵀxi + β0 ≥ 0 when yi = 1,   βᵀxi + β0 ≤ 0 when yi = −1
  ⇐⇒ yi(βᵀxi + β0) ≥ 0, ∀ i ∈ [n] = {1, . . . , n}
• Therefore, the optimization problem is

  max_{β,β0}  min_{i=1,...,n} |βᵀxi + β0|/‖β‖
  s.t.        yi(βᵀxi + β0) ≥ 0, ∀ i ∈ [n]
Simplify the optimization problem

  max_{β,β0}  (1/‖β‖) min_{i=1,...,n} |βᵀxi + β0|
  s.t.        yi(βᵀxi + β0) ≥ 0 ∀ i

      ⇐⇒

  max_{β,β0}  1/‖β‖
  s.t.        yi(βᵀxi + β0) ≥ 0 ∀ i
              min_{i=1,...,n} |βᵀxi + β0| = 1

• The hyperplane and margin are scale invariant: (β, β0) → (cβ, cβ0) for any c ≠ 0
• If xk is the closest point to H, i.e., k = arg min_{i=1,...,n} |βᵀxi + β0|, we can scale β, β0 such that |βᵀxk + β0| = 1
Simplify the optimization problem

  max_{β,β0}  1/‖β‖
  s.t.        yi(βᵀxi + β0) ≥ 0 ∀ i
              min_{i=1,...,n} |βᵀxi + β0| = 1

      ⇐⇒

  min_{β,β0}  ‖β‖²
  s.t.        yi(βᵀxi + β0) ≥ 1 ∀ i

• “⇒” Note that yi ∈ {−1, 1}
• “⇐” Note that we minimize ‖β‖
SVM

SVM is a quadratic programming (QP) problem — it can be solved by generic QP solvers:

  min_{β,β0}  (1/2)‖β‖²
  s.t.        yi(βᵀxi + β0) ≥ 1 ∀ i ∈ [n]

• Later, we will discuss Lagrangian duality and derive the dual problem of the above
• The dual problem will play a key role in allowing us to use kernels (introduced later)
• The dual problem will also allow us to derive an efficient algorithm better than generic QP solvers (especially when n ≪ p)
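To make the QP concrete, here is a sketch that solves the primal with the generic convex-optimization package cvxpy (an assumption: cvxpy is not part of this lecture); X and y are made-up, linearly separable toy data.

import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])  # n x p toy data
y = np.array([1.0, 1.0, -1.0, -1.0])
n, p = X.shape

beta = cp.Variable(p)
beta0 = cp.Variable()
# constraints y_i (β^T x_i + β0) >= 1, objective (1/2)||β||^2
constraints = [cp.multiply(y, X @ beta + beta0) >= 1]
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(beta)), constraints)
prob.solve()

print(beta.value, beta0.value)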
Support vectors

Support vectors are the data points xi whose constraints are tight (active):

  yi(βᵀxi + β0) = 1

• Support vectors must exist
• Number of support vectors ≪ sample size n
• The resulting hyperplane may change if some support vectors are removed
Lagrange duality and KKT
Lagrangian

Consider a general nonlinear programming problem (NLP), known as the primal problem:

  (P)  min_{x ∈ Rp}  f(x)
       s.t.          gi(x) = 0, i ∈ [m]
                     hj(x) ≤ 0, j ∈ [l]

• Define the Lagrangian

  L(x, v, u) = f(x) + Σ_{i=1}^m vi gi(x) + Σ_{j=1}^l uj hj(x)

  for v = [v1; . . . ; vm] ∈ Rm, u = [u1; . . . ; ul] ∈ Rl+.

• Define the Lagrange dual function (always concave)

  θ(v, u) = min_x L(x, v, u)
Lagrangian

• In evaluating θ(v, u) for each (v, u), we must solve

  min_x L(x, v, u) = f(x) + Σ_{i=1}^m vi gi(x) + Σ_{j=1}^l uj hj(x)

  We may set ∂L/∂x = 0 if f, gi, hj are convex and differentiable
• The dual function θ is concave even when (P) is not convex — verify that
  θ(λv + (1 − λ)v′, λu + (1 − λ)u′) ≥ λθ(v, u) + (1 − λ)θ(v′, u′)
• Suppose x is a feasible point of (P). Then for any v ∈ Rm, u ∈ Rl+,

  θ(v, u) = min_x L(x, v, u) ≤ L(x, v, u)
          = f(x) + Σ_{i=1}^m vi gi(x) + Σ_{j=1}^l uj hj(x)
          ≤ f(x)
Lagrangian dual problem

• The dual function θ(v, u), at any dual feasible v ∈ Rm, u ∈ Rl+, is a lower bound of the primal function f(x) at any primal feasible x
• We want to search for the largest lower bound — leading to the Lagrangian dual problem

  (D)  max_{v,u}  θ(v, u)
       s.t.       v ∈ Rm, u ∈ Rl+

  Here vi, uj are called dual variables or Lagrange multipliers.
Primal and dual

Definition (Lagrangian dual problem)
For a primal nonlinear programming problem (P)

  (P)  min_{x ∈ Rp}  f(x)
       s.t.          gi(x) = 0, i ∈ [m]
                     hj(x) ≤ 0, j ∈ [l]

the Lagrangian dual problem (D) is the following nonlinear programming problem

  (D)  max_{v,u}  θ(v, u) = min_x { f(x) + Σ_{i=1}^m vi gi(x) + Σ_{j=1}^l uj hj(x) }
       s.t.       v ∈ Rm, u ∈ Rl+

• Weak duality: optimal value of (D) ≤ optimal value of (P)
• Under certain assumptions (see the KKT slide below), strong duality: optimal value of (D) = optimal value of (P)
Example

Find the dual problem of the convex program

  min  x1² + x2²          f(x) = x1² + x2²
  s.t. x1 + x2 ≥ 4         h1(x) = 4 − x1 − x2 ≤ 0  ← u1 ≥ 0

Solution. For u1 ≥ 0, the Lagrangian is

  L(x1, x2, u1) = f(x) + u1 h1(x) = x1² + x2² + u1(4 − x1 − x2)

The dual function is

  θ(u1) = min_{x1,x2} { x1² + x2² + u1(4 − x1 − x2) }
        = 4u1 + min_{x1} {x1² − u1 x1} + min_{x2} {x2² − u1 x2}
        = 4u1 − u1²/2    (attained at x1 = u1/2, x2 = u1/2)

The dual problem is

  max  4u1 − u1²/2
  s.t. u1 ≥ 0
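As a numerical check of strong duality for this example, the sketch below (again assuming cvxpy is available) solves both problems; both optimal values should equal 8, attained at x = (2, 2) and u1 = 4.

import cvxpy as cp

x = cp.Variable(2)
primal = cp.Problem(cp.Minimize(cp.sum_squares(x)), [x[0] + x[1] >= 4])
primal.solve()

u1 = cp.Variable(nonneg=True)
dual = cp.Problem(cp.Maximize(4 * u1 - cp.square(u1) / 2))
dual.solve()

print(primal.value, x.value)   # ~8.0, [2. 2.]
print(dual.value, u1.value)    # ~8.0, 4.0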
Example: LP

Consider the linear programming (LP) problem in standard form

  min_x  cᵀx
  s.t.   Ax = b
         x ≥ 0

where A ∈ Rm×n, b ∈ Rm, c ∈ Rn, and x ≥ 0 means xi ≥ 0, i ∈ [n]. Find the dual function and dual problem.

Solution. Let v ∈ Rm and u ∈ Rn+. The dual function is

  θ(v, u) = min_x { cᵀx + vᵀ(b − Ax) − uᵀx } = vᵀb + min_x { xᵀ(c − Aᵀv − u) }
          = vᵀb   if c − Aᵀv − u = 0
            −∞    otherwise

Dual problem:

  max_{v,u}  bᵀv                         max_v  bᵀv
  s.t.       Aᵀv + u = c      i.e.,      s.t.   Aᵀv ≤ c
             u ≥ 0
Example: LP

Consider the LP in standard inequality form

  min_x  cᵀx
  s.t.   Ax ≤ b

where A ∈ Rm×n, b ∈ Rm, c ∈ Rn, and the inequality in the constraint Ax ≤ b is interpreted component-wise. Find the dual function and dual problem.

Solution. Let u ∈ Rm+. The dual function is

  θ(u) = min_{x ∈ Rn} { cᵀx + uᵀ(Ax − b) } = −uᵀb + min_{x ∈ Rn} { xᵀ(c + Aᵀu) }
       = −uᵀb   if c + Aᵀu = 0
         −∞     otherwise

Dual problem:

  max_u  −bᵀu
  s.t.   Aᵀu + c = 0
         u ≥ 0
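The following sketch checks this primal and dual pair numerically on a small instance, assuming SciPy is installed; the values of A, b, c are toy data chosen so that both problems are bounded.

import numpy as np
from scipy.optimize import linprog

A = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
b = np.array([1.0, 1.0, 0.0, 0.0])
c = np.array([-1.0, -1.0])

# Primal: min c^T x  s.t.  Ax <= b  (x free)
primal = linprog(c, A_ub=A, b_ub=b, bounds=(None, None))

# Dual: max -b^T u  s.t.  A^T u + c = 0, u >= 0
# (written as a minimization of b^T u for linprog)
dual = linprog(b, A_eq=A.T, b_eq=-c, bounds=(0, None))

print(primal.fun, -dual.fun)  # both should be -2.0 (strong duality)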
Example: Lasso

Consider the problem

  min_{β,z}  (1/2)‖z‖² + λ‖β‖₁
  s.t.       z + Xβ = Y

where X ∈ Rn×p, Y ∈ Rn, λ > 0. Find the dual function and dual problem.

Solution.

1. Let y ∈ Rn. The Lagrangian is

   L(β, z, y) = (1/2)‖z‖² + λ‖β‖₁ + ⟨y, Y − z − Xβ⟩
              = (1/2)‖z‖² − ⟨y, z⟩ + λ‖β‖₁ − ⟨Xᵀy, β⟩ + ⟨y, Y⟩
Example: Lasso

2. The dual function is θ(y) = min_{β,z} L(β, z, y)

   = min_{β,z} { (1/2)‖z‖² − ⟨y, z⟩ + λ‖β‖₁ − ⟨Xᵀy, β⟩ + ⟨y, Y⟩ }
   = min_β { λ‖β‖₁ − ⟨Xᵀy, β⟩ } + min_z { (1/2)‖z‖² − ⟨y, z⟩ } + ⟨y, Y⟩

   • Setting the gradient with respect to z to zero, z − y = 0 ⇒ z = y, so min_z { (1/2)‖z‖² − ⟨y, z⟩ } = −(1/2)‖y‖²
   • min_β { λ‖β‖₁ − ⟨Xᵀy, β⟩ } = −λ max_β { ⟨Xᵀy/λ, β⟩ − h(β) } with h(β) = ‖β‖₁
     = −λ h*(Xᵀy/λ) = −δ_{B1}(Xᵀy/λ),  where B1 = {β ∈ Rp | ‖β‖∞ ≤ 1}  (see Lecture 4, page 34)
Example: Lasso

2. The dual function is

   θ(y) = −δ_{B1}(Xᵀy/λ) − (1/2)‖y‖² + ⟨y, Y⟩,   B1 = {β ∈ Rp | ‖β‖∞ ≤ 1}

3. The dual problem is

   max_y θ(y) = − min_y { δ_{B1}(Xᵀy/λ) + (1/2)‖y‖² − ⟨y, Y⟩ }
              = − min_y { (1/2)‖y‖² − ⟨y, Y⟩  s.t.  ‖Xᵀy‖∞ ≤ λ }
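At the optimum, z* = y*, so a candidate dual solution is y* = Y − Xβ*, and dual feasibility reads ‖Xᵀ(Y − Xβ*)‖∞ ≤ λ. The sketch below checks this numerically with scikit-learn (an assumption, not part of the lecture); note that sklearn's Lasso minimizes (1/(2n))‖Y − Xβ‖² + α‖β‖₁, so λ here is taken as n·α, which is a scaling assumption worth double-checking against the scikit-learn documentation.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 50, 10
X = rng.standard_normal((n, p))
Y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.standard_normal(n)

lam = 5.0
model = Lasso(alpha=lam / n, fit_intercept=False).fit(X, Y)  # scaling assumption: λ = n*alpha
beta_hat = model.coef_

y_star = Y - X @ beta_hat                        # candidate dual solution y* = Y - Xβ*
print(np.max(np.abs(X.T @ y_star)), "<=", lam)   # dual feasibility ||X^T y*||_inf <= λ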
KKT

Assumptions

1. f, hj : Rp → R are differentiable and convex
2. gi : Rp → R are affine (gi(x) = aiᵀx + bi)
3. Slater’s condition holds, i.e., there exists x̂ such that
   gi(x̂) = 0 ∀ i,   hj(x̂) < 0 ∀ j

Under the above assumptions, strong duality holds, and there exist a solution x* to (P) and a solution (v*, u*) to (D) satisfying the Karush-Kuhn-Tucker (KKT) conditions:

  (∂/∂x) L(x*, v*, u*) = ∇f(x*) + Σ_{i=1}^m vi* ∇gi(x*) + Σ_{j=1}^l uj* ∇hj(x*) = 0
  gi(x*) = 0,  hj(x*) ≤ 0,  uj* ≥ 0,  uj* hj(x*) = 0,  ∀ i ∈ [m], j ∈ [l]
KKT

• We say (x*, v*, u*) (or simply x*) is a KKT point or a KKT solution if (x*, v*, u*) satisfies the KKT conditions
• Under the above assumptions, (x*, v*, u*) is a KKT solution ⇐⇒ x* is an optimal solution to (P) and (v*, u*) is an optimal solution to (D)
• We call
    uj* hj(x*) = 0, ∀ j ∈ [l]
  the complementary slackness condition. It implies
    uj* = 0 if hj(x*) < 0,   hj(x*) = 0 if uj* > 0

  That is, for the pair of constraints hj(x*) ≤ 0 and uj* ≥ 0 with uj* hj(x*) = 0:
  • If the constraint hj(x*) ≤ 0 is slack (hj(x*) < 0), then the constraint uj* ≥ 0 is active (uj* = 0)
  • If the constraint uj* ≥ 0 is slack (uj* > 0), then the constraint hj(x*) ≤ 0 is active (hj(x*) = 0)
Dual of SVM
Dual of SVM

Derive the dual of the following SVM problem

  min_{β,β0}  (1/2)‖β‖²
  s.t.        1 − yi(βᵀxi + β0) ≤ 0 ∀ i ∈ [n]

1. For α ∈ Rn+, the Lagrangian is

   L(β, β0, α) = (1/2)‖β‖² + Σ_{i=1}^n αi (1 − yi(βᵀxi + β0))

2. The dual function is

   θ(α) = min_{β,β0} L(β, β0, α)
        = min_{β,β0} { (1/2)‖β‖² − Σ_{i=1}^n αi yi xiᵀβ − (Σ_{i=1}^n αi yi) β0 + Σ_{i=1}^n αi }
Dual of SVM

We need to solve the optimization problem

  min_{β,β0}  (1/2)‖β‖² − Σ_{i=1}^n αi yi xiᵀβ − (Σ_{i=1}^n αi yi) β0 + Σ_{i=1}^n αi

Setting ∂L/∂β = β − Σ_{i=1}^n αi yi xi = 0 and ∂L/∂β0 = −Σ_{i=1}^n αi yi = 0, we obtain that

  θ(α) = Σ_{i=1}^n αi − (1/2) Σ_{i,j=1}^n αi αj yi yj xiᵀxj   if Σ_{i=1}^n αi yi = 0
         −∞                                                   otherwise

3. The dual problem is

  max_α  Σ_{i=1}^n αi − (1/2) Σ_{i,j=1}^n αi αj yi yj ⟨xi, xj⟩
  s.t.   Σ_{i=1}^n αi yi = 0,   αi ≥ 0, i ∈ [n]
KKT of SVM

Primal:
  min_{β,β0}  (1/2)‖β‖²
  s.t.        yi(βᵀxi + β0) ≥ 1, i ∈ [n]

Dual:
  max_α  Σ_{i=1}^n αi − (1/2) Σ_{i,j=1}^n αi αj yi yj ⟨xi, xj⟩
  s.t.   Σ_{i=1}^n αi yi = 0
         αi ≥ 0, i ∈ [n]

Verify the assumptions (on the KKT slide above) for strong duality and the existence of KKT points: (Slater’s condition) there exist β̂, β̂0 such that yi(β̂ᵀxi + β̂0) > 1, i ∈ [n]. This requires that the two classes are strictly separable.

Figure 2: Left: strictly separable. Right: separable but not strictly separable.
KKT of SVM

KKT conditions:

  Σ_{i=1}^n αi* yi xi = β*,   Σ_{i=1}^n αi* yi = 0
  yi((β*)ᵀxi + β0*) ≥ 1, i ∈ [n]
  αi* ≥ 0, i ∈ [n]
  αi* (1 − yi((β*)ᵀxi + β0*)) = 0, i ∈ [n]

1. If we obtain a dual solution α* (via solving the SVM dual problem), then we can construct a primal solution (β*, β0*) from the KKT conditions:

   β* = Σ_{i=1}^n αi* yi xi
   β0* = yk − Σ_{i=1}^n αi* yi ⟨xi, xk⟩,  for some k satisfying αk* > 0
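The sketch below solves the SVM dual numerically with cvxpy (assumed installed) and reconstructs (β*, β0*) exactly as in point 1; the data are small made-up separable points. The quadratic term is written as ‖Σ αi yi xi‖², which is valid here because the xi are given explicitly (with a kernel one would keep the Gram-matrix form instead).

import cvxpy as cp
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])   # toy data, n x p
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

alpha = cp.Variable(n, nonneg=True)
# Σ_{i,j} αi αj yi yj <xi, xj> = || Σ_i αi yi xi ||^2 since the xi are explicit here
objective = cp.sum(alpha) - 0.5 * cp.sum_squares(X.T @ cp.multiply(y, alpha))
prob = cp.Problem(cp.Maximize(objective), [cp.sum(cp.multiply(y, alpha)) == 0])
prob.solve()

a = alpha.value
beta = X.T @ (a * y)                       # β* = Σ αi* yi xi
k = int(np.argmax(a))                      # an index with αk* > 0 (a support vector)
beta0 = y[k] - X[k] @ beta                 # β0* = yk − Σ αi* yi <xi, xk>
print(np.round(a, 4), beta, beta0)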
KKT of SVM

2. If αi* > 0, then xi is a support vector, by the complementary slackness condition:

   αi* > 0 ⇒ yi((β*)ᵀxi + β0*) = 1

3. |{i | αi* > 0}| ≤ the number of support vectors ≪ n
   The dual solution is sparse (many αi* = 0)
KKT of SVM

4. Decision boundary:

   0 = (β*)ᵀx + β0* = Σ_{i=1}^n αi* yi ⟨xi, x⟩ + β0* = Σ_{i: αi*>0} αi* yi ⟨xi, x⟩ + β0*

   For a new test point x, the prediction depends only on ⟨xi, x⟩ with αi* > 0, namely, the inner products between x and the support vectors.
Primal vs Dual

Primal:
  min_{β,β0}  (1/2)‖β‖²
  s.t.        yi(βᵀxi + β0) ≥ 1, i ∈ [n]
Classifier: f(x) = sign(βᵀx + β0)

Dual:
  max_α  Σ_{i=1}^n αi − (1/2) Σ_{i,j=1}^n αi αj yi yj ⟨xi, xj⟩
  s.t.   Σ_{i=1}^n αi yi = 0,   αi ≥ 0, i ∈ [n]
Classifier: f(x) = sign( Σ_{i=1}^n αi yi ⟨xi, x⟩ + β0 )
Many αi’s are zero (sparse solutions)

• Optimize p + 1 variables for the primal, n variables for the dual
• When n ≪ p, it might be more efficient to solve the dual
• The dual problem only involves ⟨xi, xj⟩ — allowing the use of kernels
Feature mapping

• Recall feature expansion, for example, the i-th feature vector

  xi = [xi1; xi2]  →(feature expansion)→  [xi1; xi2; xi1²; xi2²; xi1 xi2]

• Let φ denote the feature mapping, which maps from the original features to the new features. For example,

  φ([z1; z2]) = [z1; z2; z1²; z2²; z1 z2]

• Instead of using the original feature vectors xi, i ∈ [n], we may apply SVM using the new features φ(xi), i ∈ [n]
• The new feature space can be very high dimensional
Kernel

Recall the primal and dual of SVM above. Using the dual:
• For feature expansion, simply replace ⟨xi, xj⟩ with ⟨φ(xi), φ(xj)⟩
• Given a feature mapping φ, we define the corresponding kernel

  K(a, b) = ⟨φ(a), φ(b)⟩,  a, b ∈ Rp

• Computing K(a, b) is often very cheap, even though computing φ(a), φ(b) (high dimensional vectors) may be expensive
• The dual of SVM only requires the computation of the kernel values K(xi, xj). Explicitly calculating φ(xi) is not necessary
Example: (homogeneous) polynomial kernel

For a, b ∈ Rp, consider

  K(a, b) = (aᵀb)²

It can be written as

  K(a, b) = (Σ_{i=1}^p ai bi)(Σ_{j=1}^p aj bj) = Σ_{i,j=1}^p ai aj bi bj = Σ_{i,j=1}^p (ai aj)(bi bj)

Thus, we see that K(a, b) = ⟨φ(a), φ(b)⟩, where the feature mapping φ : Rp → Rp² is given by

  φ(a) = φ([a1; a2; . . . ; ap]) = [a1 a1; a1 a2; a1 a3; . . . ; ap ap]

Computing φ(a): O(p²) operations; computing K(a, b): O(p) operations.
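A quick numerical check (NumPy assumed) that the kernel value (aᵀb)² matches the inner product of the explicitly expanded features:

import numpy as np

rng = np.random.default_rng(1)
p = 4
a, b = rng.standard_normal(p), rng.standard_normal(p)

phi = lambda v: np.outer(v, v).ravel()        # all p^2 products v_i v_j
print((a @ b) ** 2, phi(a) @ phi(b))          # the two values agree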
Example: (inhomogeneous) polynomial kernel

Given c ≥ 0. For a, b ∈ Rp, consider

  K(a, b) = (aᵀb + c)²
          = Σ_{i,j=1}^p (ai aj)(bi bj) + Σ_{i=1}^p (√(2c) ai)(√(2c) bi) + c²

Thus, we see that K(a, b) = ⟨φ(a), φ(b)⟩, where the feature mapping φ : Rp → R^(p²+p+1) is given by

  φ(a) = [a1 a1, a1 a2, a1 a3, . . . , ap ap, √(2c) a1, √(2c) a2, . . . , √(2c) ap, c]ᵀ
         (second-order terms, then first-order terms, then a constant)

The parameter c controls the relative weighting between the first-order and second-order terms.
Common kernels

• Polynomials of degree d

  K(a, b) = (aᵀb)^d

• Polynomials up to degree d

  K(a, b) = (aᵀb + 1)^d

• Gaussian kernel — polynomials of all orders (since eˣ = Σ_{n=0}^∞ xⁿ/n!)

  K(a, b) = exp(−‖a − b‖²/(2σ²)),  σ > 0
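For instance, the Gaussian kernel matrix Kij = exp(−‖xi − xj‖²/(2σ²)) over a data set can be formed without any explicit feature map; a minimal NumPy sketch (the function name and data are illustrative):

import numpy as np

def gaussian_kernel_matrix(X, sigma=1.0):
    # squared pairwise distances via ||xi||^2 + ||xj||^2 - 2 xi^T xj
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T
    return np.exp(-np.maximum(sq_dists, 0) / (2 * sigma ** 2))

X = np.random.default_rng(2).standard_normal((5, 3))
print(gaussian_kernel_matrix(X, sigma=2.0).shape)   # (5, 5), ones on the diagonal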
Kernel

• SVM can be applied in high dimensional feature spaces, without explicitly applying the feature mapping
• The two classes might be separable in the high dimensional feature space, but not separable in the original feature space
• Kernels can be used efficiently in the dual problem of SVM because the dual only involves inner products
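An illustrative sketch with scikit-learn (assumed installed; its SVC trains the kernelized dual): on data that are not linearly separable in R², a linear kernel fails while the Gaussian (RBF) kernel separates the two classes. The data generator make_circles and the parameter values are illustrative choices, not part of the lecture.

from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)
rbf_svm = SVC(kernel="rbf", gamma=2.0, C=1.0).fit(X, y)

print("linear kernel accuracy:", linear_svm.score(X, y))  # poor, near chance level
print("RBF kernel accuracy:   ", rbf_svm.score(X, y))     # close to 1.0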
SVM with soft constraints
SVM with soft constraints

When the two classes are not separable, no feasible separating hyperplane exists. We allow the constraints to be violated slightly (C > 0 is given):

  min_{β,β0,ε}  (1/2)‖β‖² + C Σ_{i=1}^n εi
  s.t.          yi(βᵀxi + β0) ≥ 1 − εi ∀ i ∈ [n]
                εi ≥ 0, i ∈ [n]

At the optimum,

  εi = 1 − yi(βᵀxi + β0)  if yi(βᵀxi + β0) < 1,   εi = 0  if yi(βᵀxi + β0) ≥ 1
     = max{1 − yi(βᵀxi + β0), 0}

So SVM with soft constraints solves

  min_{β,β0}  (1/2)‖β‖²  +  C Σ_{i=1}^n max{1 − yi(βᵀxi + β0), 0}
              [ridge regularization]   [hinge-loss function]
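The hinge-loss form of the objective is easy to evaluate directly; a small NumPy sketch with made-up values for β, β0, X, y:

import numpy as np

def soft_svm_objective(beta, beta0, X, y, C):
    margins = y * (X @ beta + beta0)
    hinge = np.maximum(1.0 - margins, 0.0)    # ε_i = max{1 - y_i(β^T x_i + β0), 0}
    return 0.5 * beta @ beta + C * np.sum(hinge)

X = np.array([[1.0, 2.0], [2.0, 0.5], [-1.0, -1.5], [-2.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
print(soft_svm_objective(np.array([1.0, 0.5]), 0.0, X, y, C=1.0))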
Logistic regression

Recall that (Lecture 3, page 19)

  logistic-loss = Σ_{i=1}^n [ log(1 + e^(β0 + βᵀxi)) − yi(β0 + βᵀxi) ]

Equivalently, for each observation,

  logistic-loss = log(1 + e^(−(βᵀxi + β0)))   if yi = 1
                  log(1 + e^(βᵀxi + β0))       if yi = 0

Change the label yi = 0 → yi = −1:

  logistic-loss = log(1 + e^(−yi(βᵀxi + β0))),  yi ∈ {−1, 1}

Logistic regression with ridge regularization:

  min_{β,β0}  Σ_{i=1}^n log(1 + e^(−yi(βᵀxi + β0))) + λ‖β‖²
SVM vs. logistic regression

SVM with soft constraints:

  min_{β,β0}  C Σ_{i=1}^n max{1 − yi(βᵀxi + β0), 0} + (1/2)‖β‖²

Logistic regression with ridge regularization:

  min_{β,β0}  Σ_{i=1}^n log(1 + e^(−yi(βᵀxi + β0))) + λ‖β‖²

Hinge-loss: max{1 − z, 0}, with z = yi(βᵀxi + β0); we hope z ≥ 1
Logistic-loss: log(1 + e^(−z)), with z = yi(βᵀxi + β0); we hope z ≫ 0

(Plot of the two losses as functions of z: the logistic loss is a “smoothed version” of the hinge loss.)
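A quick numerical comparison of the two losses at a few margin values z (NumPy assumed):

import numpy as np

z = np.array([-2.0, 0.0, 1.0, 3.0])
hinge = np.maximum(1.0 - z, 0.0)
logistic = np.log1p(np.exp(-z))
print(hinge)     # [3.  1.  0.  0. ]
print(logistic)  # [2.127  0.693  0.313  0.049]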
SVM with soft constraints: dual

SVM with soft constraints:

  min_{β,β0,ε}  (1/2)‖β‖² + C Σ_{i=1}^n εi
  s.t.          1 − εi − yi(βᵀxi + β0) ≤ 0 ∀ i ∈ [n]
                −εi ≤ 0, i ∈ [n]

Find the dual problem.

1. For α ∈ Rn+, r ∈ Rn+, the Lagrangian is

   L(β, β0, ε, α, r) = (1/2)‖β‖² + C Σ_{i=1}^n εi + Σ_{i=1}^n αi (1 − εi − yi(βᵀxi + β0)) − Σ_{i=1}^n ri εi
                     = (1/2)‖β‖² − Σ_{i=1}^n αi yi xiᵀβ − (Σ_{i=1}^n αi yi) β0 + Σ_{i=1}^n (C − αi − ri) εi + Σ_{i=1}^n αi
SVM with soft constraints: dual

2. The dual function is θ(α, r) = min_{β,β0,ε} L(β, β0, ε, α, r)

   = min_{β,β0,ε} { (1/2)‖β‖² − Σ_{i=1}^n αi yi xiᵀβ − (Σ_{i=1}^n αi yi) β0 + Σ_{i=1}^n (C − αi − ri) εi + Σ_{i=1}^n αi }

   Setting ∂L/∂β = β − Σ_{i=1}^n αi yi xi = 0,  ∂L/∂β0 = −Σ_{i=1}^n αi yi = 0,  ∂L/∂εi = C − αi − ri = 0,
   we obtain that

   θ(α, r) = Σ_{i=1}^n αi − (1/2) Σ_{i,j=1}^n αi αj yi yj xiᵀxj   if Σ_{i=1}^n αi yi = 0 and αi + ri = C, i ∈ [n]
             −∞                                                   otherwise
SVM with soft constraints: dual

3. The dual problem is max_{α,r} { θ(α, r) | α ∈ Rn+, r ∈ Rn+ }

   ⇐⇒  max_{α,r}  Σ_{i=1}^n αi − (1/2) Σ_{i,j=1}^n αi αj yi yj ⟨xi, xj⟩
        s.t.      Σ_{i=1}^n αi yi = 0
                  αi + ri = C, i ∈ [n]
                  α ∈ Rn+, r ∈ Rn+

   ⇐⇒  max_α  Σ_{i=1}^n αi − (1/2) Σ_{i,j=1}^n αi αj yi yj ⟨xi, xj⟩
        s.t.   Σ_{i=1}^n αi yi = 0
               0 ≤ αi ≤ C, i ∈ [n]
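In practice this box-constrained dual is what LIBSVM solves. The sketch below (scikit-learn assumed installed; its SVC wraps LIBSVM) checks that the fitted dual variables satisfy 0 ≤ αi ≤ C and that only a few are nonzero; SVC stores yi·αi for the support vectors in the attribute dual_coef_. The data generator and parameter values are illustrative choices.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=5, random_state=0)
C = 2.0
clf = SVC(kernel="linear", C=C).fit(X, y)

alphas = np.abs(clf.dual_coef_).ravel()              # α_i for the support vectors
print(len(clf.support_), "support vectors out of", len(y))
print(alphas.min() >= 0, alphas.max() <= C + 1e-8)   # box constraint 0 <= α_i <= C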
In practice

You are encouraged to learn two popular open source machine learning
libraries:
LIBLINEAR https://www.csie.ntu.edu.tw/~cjlin/liblinear/
LIBSVM https://www.csie.ntu.edu.tw/~cjlin/libsvm/

