
Support Vector Machines
- Dual formulation and Kernel Trick

Aarti Singh

Machine Learning 10-315
Oct 28, 2020
Constrained Optimization – Dual Problem

Primal problem:  minimize over x, subject to x ≥ b   (b positive)

Moving the constraint into the objective function:

Lagrangian:  L(x, α), with multiplier α ≥ 0

Dual problem:  maximize over α ≥ 0 of  d(α) = min over x of L(x, α)
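A minimal sketch of the equations on this slide, assuming the scalar objective x² (the choice consistent with the solution α′ = 2b obtained on the "Solving the dual" slide):

$$\text{Primal:}\quad \min_{x}\; x^2 \quad \text{s.t.}\quad x \ge b \qquad (b > 0)$$

$$\text{Lagrangian:}\quad L(x, \alpha) = x^2 - \alpha\,(x - b), \qquad \alpha \ge 0$$

$$\text{Dual:}\quad \max_{\alpha \ge 0}\; d(\alpha), \qquad d(\alpha) = \min_{x}\, L(x, \alpha)$$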
Connection between Primal and Dual
Primal problem: optimal value p*          Dual problem: optimal value d*

• Weak duality: the dual solution d* lower bounds the primal solution p*, i.e. d* ≤ p*

  Duality gap = p* − d*

• Strong duality: d* = p* often holds for many problems of interest, e.g. if the primal is a feasible convex objective with linear constraints (Slater's condition)

Connection between Primal and Dual
What does strong duality say about α* (the α that achieves the optimal value of the
dual) and x* (the x that achieves the optimal value of the primal problem)?

Whenever strong duality holds, the following conditions (known as KKT conditions)
are true for α* and x*:

1. ∇L(x*, α*) = 0, i.e. the gradient of the Lagrangian at x* and α* is zero.

2. x* ≥ b, i.e. x* is primal feasible.

3. α* ≥ 0, i.e. α* is dual feasible.

4. α*(x* − b) = 0 (called complementary slackness).

We use the first one to relate x* and α*. We use the last one (complementary
slackness) to argue that α* = 0 if the constraint is inactive and α* > 0 if the
constraint is active and tight.
Solving the dual

Find the dual: Optimization over x is unconstrained.

Solve: Now need to maximize L(x*, α) over α ≥ 0

Solve the unconstrained problem to get α′ and then take max(α′, 0)

⇒ α′ = 2b

α = 0: constraint is inactive;  α > 0: constraint is active (tight)
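A worked sketch of these two steps, under the same assumed example (objective x², constraint x ≥ b):

$$\frac{\partial L}{\partial x} = 2x - \alpha = 0 \;\Rightarrow\; x^* = \frac{\alpha}{2}, \qquad d(\alpha) = L(x^*, \alpha) = \alpha b - \frac{\alpha^2}{4}$$

$$\frac{\partial d}{\partial \alpha} = b - \frac{\alpha}{2} = 0 \;\Rightarrow\; \alpha' = 2b, \qquad \alpha^* = \max(\alpha', 0)$$

So for b > 0 the constraint is active and tight (α* = 2b > 0, x* = b); for b ≤ 0 it is inactive (α* = 0, x* = 0).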


Dual SVM – linearly separable case
n training points, d features (x1, …, xn) where xi is a d-dimensional
vector

• Primal problem:

w – weights on features (d-dim problem)

• Dual problem (derivation):

α – weights on training points (n-dim problem)


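For reference, a sketch of the standard hard-margin formulation this slide works with (w·x denotes the dot product):

$$\text{Primal:}\quad \min_{w,\,b}\; \tfrac{1}{2}\lVert w \rVert^2 \quad \text{s.t.}\quad y_j\,(w \cdot x_j + b) \ge 1, \;\; j = 1, \dots, n$$

$$\text{Dual:}\quad \max_{\alpha}\; \sum_j \alpha_j \;-\; \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j\, y_i y_j\, (x_i \cdot x_j) \quad \text{s.t.}\quad \alpha_j \ge 0, \;\; \sum_j \alpha_j y_j = 0$$

with the primal weights recovered as w = Σj αj yj xj.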
Dual SVM – linearly separable case
• Dual problem (derivation):

If we can solve for the αs (dual problem), then we have a solution for w, b (primal problem)
Dual SVM – linearly separable case
• Dual problem:

Dual SVM – linearly separable case

Dual problem is also QP


Solution gives the αjs
What about b?

Dual SVM: Sparsity of dual solution

[Figure: separating hyperplane w·x + b = 0 with margins; points on the margin have αj > 0, all other points have αj = 0]

Only a few αjs can be non-zero: those where the constraint is active and tight,
(w·xj + b)yj = 1

Support vectors – training points j whose αjs are non-zero
Dual SVM – linearly separable case

Dual problem is also QP


Solution gives the αjs

Use any one of the support vectors, with αk > 0, to compute b, since the constraint is tight: (w·xk + b)yk = 1
Dual SVM – non-separable case
• Primal problem:  minimize over w, b, {ξj}

• Dual problem:  maximize over the Lagrange multipliers α, μ of
  min over w, b, {ξj} of L(w, b, ξ, α, μ)

HW3!
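A sketch of the standard soft-margin formulation behind this slide (C is the slack penalty; α, μ are the Lagrange multipliers named above):

$$\min_{w,\,b,\,\{\xi_j\}}\; \tfrac{1}{2}\lVert w \rVert^2 + C \sum_j \xi_j \quad \text{s.t.}\quad y_j\,(w \cdot x_j + b) \ge 1 - \xi_j, \quad \xi_j \ge 0$$

$$L(w, b, \xi, \alpha, \mu) = \tfrac{1}{2}\lVert w \rVert^2 + C \sum_j \xi_j \;-\; \sum_j \alpha_j\big[y_j (w \cdot x_j + b) - 1 + \xi_j\big] \;-\; \sum_j \mu_j\, \xi_j$$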
Dual SVM – non-separable case

The upper bound αj ≤ C comes from ∂L/∂ξ = 0

Intuition: if C → ∞, we recover the hard-margin SVM

Dual problem is also QP

Solution gives the αjs
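Under the soft-margin Lagrangian sketched above, the bound follows in one line:

$$\frac{\partial L}{\partial \xi_j} = C - \alpha_j - \mu_j = 0 \;\Rightarrow\; \alpha_j = C - \mu_j \le C \quad (\text{since } \mu_j \ge 0),$$

so together with dual feasibility αj ≥ 0 we get 0 ≤ αj ≤ C; as C → ∞ the upper bound vanishes and the hard-margin dual is recovered.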
So why solve the dual SVM?
• There are some quadratic programming algorithms that can solve the dual faster than the primal (especially in high dimensions, d >> n)

• But, more importantly, the “kernel trick”!!!

Separable using higher-order features

[Figures: data that is not linearly separable in the original features (x1, x2) becomes separable using higher-order features, e.g. the polar features r = √(x1² + x2²) and θ, or the squared feature x1²]
What if data is not linearly separable?
Use features of features
of features of features….

Φ(x) = (x1², x2², x1x2, …, exp(x1))

Feature space becomes really large very quickly!


Higher Order Polynomials
m – number of input features,  d – degree of polynomial

grows fast!
d = 6, m = 100
about 1.6 billion terms

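As a rough check of the number above, assuming the slide counts monomials of degree exactly d in m input features:

$$\binom{m + d - 1}{d} = \binom{105}{6} \approx 1.6 \times 10^{9} \qquad \text{for } m = 100,\; d = 6.$$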
Dual formulation only depends on
dot-products, not on w!

Φ(x) – high-dimensional feature space, but we never need it explicitly, as long as we can compute the dot product fast using some kernel K
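Written out, the dual objective with every dot product replaced by a kernel K(xi, xj) = Φ(xi)·Φ(xj):

$$\max_{\alpha}\; \sum_j \alpha_j \;-\; \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j\, y_i y_j\, K(x_i, x_j) \quad \text{s.t.}\quad \sum_j \alpha_j y_j = 0,\;\; \alpha_j \ge 0 \;(\text{or } 0 \le \alpha_j \le C \text{ in the non-separable case})$$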
Dot Product of Polynomials

d = 1

d = 2

general d
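A sketch for x, z ∈ R², assuming the weighted monomial feature map Φ(x) = (x1², √2 x1x2, x2²) for d = 2:

$$d=1:\;\; \Phi(x)\cdot\Phi(z) = x \cdot z \qquad\quad d=2:\;\; \Phi(x)\cdot\Phi(z) = x_1^2 z_1^2 + 2\,x_1 x_2\, z_1 z_2 + x_2^2 z_2^2 = (x \cdot z)^2$$

$$\text{general } d:\;\; \Phi(x)\cdot\Phi(z) = (x \cdot z)^d$$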
Finally: The Kernel Trick!

• Never represent features explicitly
  – Compute dot products in closed form

• Constant-time high-dimensional dot products for many classes of features
Common Kernels
• Polynomials of degree d

• Polynomials of degree up to d

• Gaussian/Radial kernels (polynomials of all orders – recall the series expansion of exp)

• Sigmoid

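Commonly written forms of these kernels (the exact parameterizations on the slide may differ):

$$K(x, z) = (x \cdot z)^d, \qquad K(x, z) = (1 + x \cdot z)^d, \qquad K(x, z) = \exp\!\left(-\frac{\lVert x - z\rVert^2}{2\sigma^2}\right), \qquad K(x, z) = \tanh(\gamma\, x \cdot z + r)$$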
Mercer Kernels
What functions are valid kernels that correspond to feature vectors φ(x)?

Answer: Mercer kernels K


• K is continuous
• K is symmetric
• K is positive semi-definite, i.e. xTKx ≥ 0 for all x

Ensures optimization is concave maximization

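A minimal numerical sketch of these conditions for the Gaussian kernel (the sample size, dimension, and σ = 1 are illustrative choices):

```python
import numpy as np

def rbf_kernel(X, Z, sigma=1.0):
    # Gaussian/RBF kernel matrix: K[i, j] = exp(-||X[i] - Z[j]||^2 / (2 sigma^2))
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

# Empirical check on a random sample: the Gram matrix should be
# symmetric and positive semi-definite (all eigenvalues >= 0).
X = np.random.default_rng(0).normal(size=(50, 3))
K = rbf_kernel(X, X)
assert np.allclose(K, K.T)          # symmetric
eigvals = np.linalg.eigvalsh(K)
assert eigvals.min() > -1e-8        # PSD up to numerical error
print("min eigenvalue:", eigvals.min())
```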
Overfitting
• Huge feature space with kernels, what about
overfitting???
– Maximizing margin leads to sparse set of support
vectors
– Some interesting theory says that SVMs search for
simple hypothesis with large margin
– Often robust to overfitting

What about classification time?
• For a new input x, if we need to represent Φ(x), we are in trouble!
• Recall classifier: sign(w·Φ(x) + b)

• Using kernels we are cool!

SVMs with Kernels
• Choose a set of features and kernel function
• Solve the dual problem to obtain the support vector coefficients αi
• At classification time, compute:

Classify as

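A minimal sketch of that classification rule (the names support_X, support_y, alpha, b and the polynomial kernel parameters are illustrative, not from the slides):

```python
import numpy as np

def decision_function(x, support_X, support_y, alpha, b, kernel):
    # f(x) = sum_j alpha_j * y_j * K(x_j, x) + b, summed over the support vectors only
    k = np.array([kernel(xj, x) for xj in support_X])
    return float(np.dot(alpha * support_y, k) + b)

def classify(x, support_X, support_y, alpha, b, kernel):
    # Classify as sign(f(x))
    return np.sign(decision_function(x, support_X, support_y, alpha, b, kernel))

# Example kernel: polynomial of degree up to 2
poly2 = lambda u, v: (1.0 + np.dot(u, v)) ** 2
```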
SVMs with Kernels
• Iris dataset, class 2 vs classes 1 & 3, Linear Kernel

SVMs with Kernels
• Iris dataset, class 1 vs classes 2 & 3, Polynomial Kernel of degree 2

SVMs with Kernels
• Iris dataset, class 1 vs classes 2 & 3, Gaussian RBF kernel

SVMs with Kernels
• Iris dataset, class 1 vs classes 2 & 3, Gaussian RBF kernel

SVMs with Kernels
• Chessboard dataset, Gaussian RBF kernel

SVMs with Kernels
• Chessboard dataset, Polynomial kernel

USPS Handwritten digits

SVMs vs. Logistic Regression
                          SVMs            Logistic Regression
Loss function             Hinge loss      Log-loss
SVMs vs. Logistic Regression
SVM: Hinge loss

Logistic Regression: Log loss (negative log conditional likelihood)

[Figure: 0-1 loss, hinge loss, and log loss plotted against the margin y(w·x + b)]
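In terms of the margin m = y(w·x + b), the three losses plotted above are usually written as:

$$\ell_{0\text{-}1}(m) = \mathbf{1}[m \le 0], \qquad \ell_{\text{hinge}}(m) = \max(0,\, 1 - m), \qquad \ell_{\text{log}}(m) = \log\!\left(1 + e^{-m}\right)$$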
SVMs vs. Logistic Regression
                                 SVMs            Logistic Regression
Loss function                    Hinge loss      Log-loss
High dimensional features        Yes!            Yes!
  with kernels
Solution sparse                  Often yes!      Almost always no!
Semantics of output              “Margin”        Real probabilities
Kernels in Logistic Regression

• Define weights in terms of features:

• Derive a simple gradient descent rule on the αi


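One common way to make this concrete (assuming the weights are defined as a combination of the training features; the slide's exact definition may also carry the labels yi):

$$w = \sum_i \alpha_i\, \phi(x_i) \;\;\Rightarrow\;\; w \cdot \phi(x) = \sum_i \alpha_i\, K(x_i, x),$$

so the log loss and its gradient with respect to the αi depend on the data only through kernel evaluations, giving a simple gradient descent rule on the αi.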
What you need to know
• Maximizing margin
• Derivation of SVM formulation
• Slack variables and hinge loss
• Relationship between SVMs and logistic regression
– 0/1 loss
– Hinge loss
– Log loss
• Tackling multiple classes
– One against All
– Multiclass SVMs
• Dual SVM formulation
– Easier to solve when the dimension is high (d > n)
– Kernel Trick
