Primal problem: min over x of x², subject to x ≥ b (with b positive)
Dual problem: max over α ≥ 0 of g(α), where g(α) = min over x of L(x, α), with Lagrangian L(x, α) = x² + α(b − x); here g(α) = αb − α²/4
Connection between Primal and Dual
Primal problem: p* = min over x of max over α ≥ 0 of L(x, α)
Dual problem: d* = max over α ≥ 0 of min over x of L(x, α)
Weak duality always gives d* ≤ p*; strong duality means d* = p*.
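A quick numerical sanity check of this (a minimal sketch assuming NumPy; b = 1.5 is an arbitrary illustrative value):

```python
import numpy as np

b = 1.5
x = np.linspace(b, 10, 100001)       # primal-feasible region: x >= b
p_star = np.min(x**2)                # primal optimum p* = b^2

alpha = np.linspace(0, 10, 100001)   # dual-feasible region: alpha >= 0
g = alpha * b - alpha**2 / 4         # g(alpha) = min_x L(x, alpha)
d_star = np.max(g)                   # dual optimum d*

print(p_star, d_star)                # both 2.25 = b^2: no duality gap
```

Both optima come out as b², so strong duality holds for this toy problem.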
Connection between Primal and Dual
What does strong duality say about α* (the α that achieves the optimal value of the dual) and x* (the x that achieves the optimal value of the primal problem)?
Whenever strong duality holds, the following conditions (known as KKT conditions) are true for α* and x*:
• 1. ∇L(x*, α*) = 0, i.e. the gradient of the Lagrangian at x* and α* is zero.
• 2. x* ≥ b, i.e. x* is primal feasible.
• 3. α* ≥ 0, i.e. α* is dual feasible.
• 4. α*(x* − b) = 0 (called complementary slackness).
We use the first one to relate x* and α*. We use the last one (complementary slackness) to argue that α* = 0 if the constraint is inactive and α* > 0 if the constraint is active and tight.
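The four conditions are easy to check numerically for the toy problem above (a minimal sketch; x* = b and α* = 2b anticipate the solution derived in the next section):

```python
# Check the KKT conditions for min x^2 s.t. x >= b,
# with L(x, alpha) = x^2 + alpha * (b - x); b = 1.5 is arbitrary.
b = 1.5
x_star, a_star = b, 2 * b
assert abs(2 * x_star - a_star) < 1e-12      # 1. dL/dx = 2x - alpha = 0
assert x_star >= b                           # 2. primal feasible: x* >= b
assert a_star >= 0                           # 3. dual feasible: alpha* >= 0
assert abs(a_star * (x_star - b)) < 1e-12    # 4. complementary slackness
print("KKT conditions hold")
```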
Solving the dual
Maximize g(α) = αb − α²/4 over α ≥ 0: set dg/dα = b − α/2 = 0
⇒ α₀ = 2b
Since b is positive, α₀ > 0 and x* = α₀/2 = b: the constraint is active and tight, exactly as complementary slackness predicts.
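The same derivation can be confirmed symbolically (a sketch assuming SymPy is available):

```python
import sympy as sp

x, a, b = sp.symbols("x alpha b", positive=True)
L = x**2 + a * (b - x)                  # Lagrangian of min x^2 s.t. x >= b
x_star = sp.solve(sp.diff(L, x), x)[0]  # dL/dx = 0  ->  x = alpha/2
g = sp.expand(L.subs(x, x_star))        # g(alpha) = alpha*b - alpha**2/4
print(sp.solve(sp.diff(g, a), a))       # [2*b]  ->  alpha0 = 2b
```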
Dual SVM – linearly separable case
• Primal problem: min over w, b of ½ w·w, subject to yⱼ(w·xⱼ + b) ≥ 1 for all training points j.
• Lagrangian (one multiplier αⱼ ≥ 0 per constraint): L(w, b, α) = ½ w·w − Σⱼ αⱼ [yⱼ(w·xⱼ + b) − 1].
Dual SVM: Sparsity of dual solution
Only a few αⱼ's can be non-zero: exactly those where the constraint is active and tight, (w·xⱼ + b)yⱼ = 1.
Support vectors: training points j whose αⱼ's are non-zero.
[Figure: separating hyperplane w·x + b = 0 with margin; points on the margin have αⱼ > 0, all other points have αⱼ = 0.]
Dual SVM – linearly separable case
• Dual problem (the αⱼ are Lagrange multipliers):
  max over α of Σⱼ αⱼ − ½ Σⱼ Σₖ αⱼ αₖ yⱼ yₖ (xⱼ·xₖ), subject to αⱼ ≥ 0 for all j and Σⱼ αⱼ yⱼ = 0
• The primal solution is recovered as w = Σⱼ αⱼ yⱼ xⱼ.
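A minimal sketch of solving this dual numerically, using SciPy's general-purpose SLSQP solver rather than a dedicated QP package, on made-up separable data; note how only the margin points end up with αⱼ > 0, matching the sparsity slide above:

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data; labels in {-1, +1}.
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 1.0],
              [-1.0, -1.0], [-2.0, -2.0], [-1.5, 0.5]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
Q = (y[:, None] * y[None, :]) * (X @ X.T)      # Q_jk = y_j y_k (x_j . x_k)

obj = lambda a: 0.5 * a @ Q @ a - a.sum()      # negated dual objective
res = minimize(obj, np.zeros(len(y)),
               bounds=[(0.0, None)] * len(y),                 # alpha_j >= 0
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])

alpha = res.x
sv = alpha > 1e-6                              # support vectors: alpha_j > 0
w = (alpha * y) @ X                            # w = sum_j alpha_j y_j x_j
b = np.mean(y[sv] - X[sv] @ w)                 # b from the tight constraints
print(alpha.round(4), w.round(4), round(b, 4))
```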
Dual SVM – non-separable case
• Primal problem with slack variables {ξⱼ}: min over w, b, {ξⱼ} of ½ w·w + C Σⱼ ξⱼ, subject to yⱼ(w·xⱼ + b) ≥ 1 − ξⱼ and ξⱼ ≥ 0; introduce multipliers α and µ and form the Lagrangian L(w, b, ξ, α, µ) (derivation: HW3!).
• Dual problem:
  max over α of Σⱼ αⱼ − ½ Σⱼ Σₖ αⱼ αₖ yⱼ yₖ (xⱼ·xₖ), subject to 0 ≤ αⱼ ≤ C for all j and Σⱼ αⱼ yⱼ = 0
• The upper bound αⱼ ≤ C comes from ∂L/∂ξⱼ = 0.
• Intuition: if C → ∞, we recover the hard-margin SVM.
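A sketch of the effect of C, assuming scikit-learn and synthetic Gaussian blobs; SVC's dual_coef_ exposes αⱼyⱼ for the support vectors, whose magnitude is capped by the box constraint:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.r_[-np.ones(50), np.ones(50)]

for C in (0.01, 1.0, 1e6):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # |alpha_j * y_j| <= C; with very large C the bound stops binding
    # and the fit approaches the hard-margin SVM.
    print(C, clf.n_support_, np.abs(clf.dual_coef_).max())
```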
Separable using higher-order features
[Figure: data plotted in (x₁, x₂) is not linearly separable, but becomes separable using higher-order features such as r = √(x₁² + x₂²), or x₁² plotted against x₁.]
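A sketch of the idea on synthetic ring data (names and thresholds are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 200)
radius = np.r_[rng.uniform(0.0, 1.0, 100), rng.uniform(2.0, 3.0, 100)]
X = np.c_[radius * np.cos(theta), radius * np.sin(theta)]
y = np.r_[-np.ones(100), np.ones(100)]       # inner ring vs. outer ring

r_feat = np.hypot(X[:, 0], X[:, 1])          # r = sqrt(x1^2 + x2^2)
print((np.sign(r_feat - 1.5) == y).all())    # True: a threshold on r separates
```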
What if data is not linearly separable?
Use features of features of features of features…. But the number of terms grows fast! For polynomial features of degree d = 6 over m = 100 input features: about 1.6 billion terms.
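The count follows from stars and bars: there are C(m + d − 1, d) distinct monomials of degree exactly d in m variables, which reproduces the figure above:

```python
from math import comb

m, d = 100, 6
print(comb(m + d - 1, d))   # 1,609,344,100: about 1.6 billion terms
```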
Dual formulation only depends on dot-products, not on w!
• d = 1: Φ(u)·Φ(v) = u·v
• d = 2: with Φ(u) = (u₁², √2 u₁u₂, u₂²), Φ(u)·Φ(v) = (u·v)²
• General d: the degree-d feature map satisfies Φ(u)·Φ(v) = (u·v)^d
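A quick check of the d = 2 identity (a sketch; the test vectors are arbitrary):

```python
import numpy as np

def phi(u):
    # Degree-2 feature map in two dimensions.
    return np.array([u[0]**2, np.sqrt(2.0) * u[0] * u[1], u[1]**2])

u, v = np.array([0.3, -1.2]), np.array([2.0, 0.7])
print(phi(u) @ phi(v), (u @ v) ** 2)   # equal up to float rounding
```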
Finally: The Kernel Trick!
Define K(u, v) = Φ(u)·Φ(v) and replace every dot-product in the dual (and in the classifier) by K: the feature vectors Φ never have to be computed explicitly.
Common Kernels
• Polynomials of degree d: K(u, v) = (u·v)^d
• Polynomials of degree up to d: K(u, v) = (u·v + 1)^d
• Gaussian (RBF): K(u, v) = exp(−‖u − v‖² / 2σ²)
• Sigmoid: K(u, v) = tanh(η u·v + ν)
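The same kernels written as plain functions (a sketch; sigma, eta, nu are free parameters as in the formulas above):

```python
import numpy as np

def poly_exact(u, v, d):       return (u @ v) ** d
def poly_up_to(u, v, d):       return (u @ v + 1.0) ** d
def gaussian_rbf(u, v, sigma): return np.exp(-np.sum((u - v)**2) / (2 * sigma**2))
def sigmoid_k(u, v, eta, nu):  return np.tanh(eta * (u @ v) + nu)
```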
Mercer Kernels
What functions are valid kernels that correspond to feature vectors Φ(x)?
Mercer's condition: K is a valid kernel if and only if the kernel matrix [K(xᵢ, xⱼ)] is symmetric positive semidefinite for every finite set of points.
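An empirical version of this check (a sketch: the eigenvalues of an RBF kernel matrix on random points should be non-negative up to rounding):

```python
import numpy as np

X = np.random.default_rng(1).normal(size=(20, 3))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise ||xi - xj||^2
K = np.exp(-sq / 2.0)                                 # Gaussian RBF, sigma = 1
print(np.linalg.eigvalsh(K).min())                    # >= -1e-10: PSD
```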
Overfitting
• Huge feature space with kernels: what about overfitting?
  – Maximizing the margin leads to a sparse set of support vectors
  – Some interesting theory says that SVMs search for simple hypotheses with a large margin
  – Often robust to overfitting
What about classification time?
• For a new input x, if we need to represent Φ(x) explicitly, we are in trouble!
• Recall the classifier: sign(w·Φ(x) + b)
• But w = Σⱼ αⱼ yⱼ Φ(xⱼ), so w·Φ(x) = Σⱼ αⱼ yⱼ K(xⱼ, x): only kernel evaluations at the support vectors are needed.
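A sketch of the resulting kernelized classifier; the argument names are placeholders for the outputs of the dual solver sketched earlier:

```python
import numpy as np

def predict(x, support_X, support_y, support_alpha, b, K):
    # f(x) = sum_j alpha_j y_j K(x_j, x) + b, over support vectors only;
    # neither w nor Phi(x) is ever formed explicitly.
    f = sum(a * yj * K(xj, x)
            for a, yj, xj in zip(support_alpha, support_y, support_X)) + b
    return np.sign(f)
```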
SVMs with Kernels
• Choose a set of features and a kernel function
• Solve the dual problem to obtain the support vectors and their αᵢ
• At classification time, compute: f(x) = Σᵢ αᵢ yᵢ K(xᵢ, x) + b
• Classify as sign(f(x))
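The same recipe with scikit-learn on Iris, mirroring the figures that follow (a sketch; the kernel parameters are arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
clf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X[:, :2], y)  # 2 features to plot
print(clf.support_.size, "support vectors")
```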
SVMs with Kernels: examples
• Iris dataset, class 2 vs. classes 1 & 3, linear kernel
• Iris dataset, class 1 vs. classes 2 & 3, polynomial kernel of degree 2
• Iris dataset, class 1 vs. classes 2 & 3, Gaussian RBF kernel (two plots)
• Chessboard dataset, Gaussian RBF kernel
• Chessboard dataset, polynomial kernel
• USPS handwritten digits
[Figures: decision boundary and support vectors for each dataset/kernel combination.]
SVMs vs. Logistic Regression

                 SVMs          Logistic Regression
Loss function    Hinge loss    Log-loss
SVMs vs. Logistic Regression
[Figure: SVM hinge loss max(0, 1 − z) plotted against the 0-1 loss, as a function of the margin z = y(w·x + b), over the range −1 to 1.]
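For concreteness, the three losses evaluated at a few margins (a minimal sketch assuming NumPy; the natural log is used for the log-loss):

```python
import numpy as np

z = np.linspace(-2, 2, 9)                # margin z = y * f(x)
hinge = np.maximum(0.0, 1.0 - z)         # SVM hinge loss
log_loss = np.log(1.0 + np.exp(-z))      # logistic regression log-loss
zero_one = (z <= 0).astype(float)        # 0-1 loss (ties counted as errors)
print(np.c_[z, hinge, log_loss, zero_one].round(3))
```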