10: Empirical Risk Minimization
Recap
Recall the unconstrained SVM formulation:
$$\min_{\mathbf{w}}\; C\,\underbrace{\sum_{i=1}^{n} \max\Bigl[1 - y_i \underbrace{\bigl(\mathbf{w}^\top \mathbf{x}_i + b\bigr)}_{h(\mathbf{x}_i)},\, 0\Bigr]}_{\text{Hinge-Loss}} \;+\; \underbrace{\|\mathbf{w}\|_2^2}_{l_2\text{-Regularizer}}$$
The hinge loss is the SVM's error function of choice, whereas the l2-regularizer reflects the complexity of the solution and penalizes complex solutions. This is an example of empirical risk minimization with a loss function ℓ and a regularizer r:
$$\min_{\mathbf{w}}\; \underbrace{\frac{1}{n}\sum_{i=1}^{n} \ell\bigl(h_\mathbf{w}(\mathbf{x}_i), y_i\bigr)}_{\text{Loss}} \;+\; \underbrace{\lambda\, r(\mathbf{w})}_{\text{Regularizer}}$$
where the loss function is a continuous function which penalizes training error, and the regularizer is a continuous function which penalizes classifier complexity. Here, we define λ = 1/C from the previous lecture.[1]
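As a concrete illustration, the following is a minimal numpy sketch of this regularized ERM objective, instantiated with the hinge loss and the l2-regularizer from the SVM above (function and variable names are illustrative, not from the lecture):

```python
import numpy as np

def svm_objective(w, b, X, y, C):
    """Unconstrained SVM objective: C * (sum of hinge losses) + l2-regularizer.

    X: (n, d) matrix with one example per row; y: (n,) labels in {-1, +1}.
    """
    margins = y * (X @ w + b)               # y_i * h(x_i)
    hinge = np.maximum(1.0 - margins, 0.0)  # max[1 - y_i h(x_i), 0]
    return C * hinge.sum() + w @ w          # + ||w||_2^2
```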
Hinge-Loss (used by the standard SVM with p = 1, and by the differentiable Squared Hingeless SVM with p = 2):

$$\max\bigl[1 - h_\mathbf{w}(\mathbf{x}_i)\, y_i,\; 0\bigr]^p$$

When used for the standard SVM, the loss function denotes the size of the margin between the linear separator and its closest points in either class. Only differentiable everywhere with p = 2.

Exponential Loss (AdaBoost):

$$e^{-h_\mathbf{w}(\mathbf{x}_i)\, y_i}$$

The loss of a mis-prediction grows exponentially with the value of $-h_\mathbf{w}(\mathbf{x}_i)\, y_i$. This can lead to nice convergence results, for example in the case of AdaBoost, but it can also cause problems with noisy data.

Zero-One Loss (the actual classification loss):

$$\delta\bigl(\operatorname{sign}(h_\mathbf{w}(\mathbf{x}_i)) \neq y_i\bigr)$$

Non-continuous and thus impractical to optimize.
Figure 4.1: Plots of common classification loss functions. x-axis: $h(\mathbf{x}_i)\, y_i$, or "correctness" of prediction; y-axis: loss value.
Log-Cosh Loss:

$$\log\bigl(\cosh(h(\mathbf{x}_i) - y_i)\bigr), \qquad \cosh(x) = \frac{e^{x} + e^{-x}}{2}$$

ADVANTAGE: Similar to the Huber loss, but twice differentiable everywhere.
Figure 4.2: Plots of common regression loss functions. x-axis: $h(\mathbf{x}_i) - y_i$, or "error" of prediction; y-axis: loss value.
Regularizers
When we look at regularizers, it helps to change the formulation of the optimization problem to obtain a better geometric intuition:
$$\min_{\mathbf{w},b}\; \sum_{i=1}^{n} \ell\bigl(h_\mathbf{w}(\mathbf{x}_i), y_i\bigr) + \lambda\, r(\mathbf{w}) \quad\Longleftrightarrow\quad \min_{\mathbf{w},b}\; \sum_{i=1}^{n} \ell\bigl(h_\mathbf{w}(\mathbf{x}_i), y_i\bigr) \;\text{ subject to: } r(\mathbf{w}) \le B \tag{4.1}$$
For each λ ≥ 0, there exists B ≥ 0 such that the two formulations in (4.1) are equivalent, and vice versa (a sketch of one direction is given below). In previous sections, the l2-regularizer was introduced as the component in SVM that reflects the complexity of the solution. Besides the l2-regularizer, other types of useful regularizers and their properties are listed in Table 4.3.
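One direction of this equivalence can be sketched through the Lagrangian of the constrained problem: for a suitable multiplier λ ≥ 0,

$$\min_{\mathbf{w},b}\; \sum_{i=1}^{n} \ell\bigl(h_\mathbf{w}(\mathbf{x}_i), y_i\bigr) + \lambda\bigl(r(\mathbf{w}) - B\bigr) \;=\; \Bigl(\min_{\mathbf{w},b}\; \sum_{i=1}^{n} \ell\bigl(h_\mathbf{w}(\mathbf{x}_i), y_i\bigr) + \lambda\, r(\mathbf{w})\Bigr) - \lambda B,$$

and since the term $-\lambda B$ is constant in $\mathbf{w}$ and $b$, it does not affect the minimizer, so the penalized and constrained problems share the same solution.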
Ordinary Least Squares (squared loss, no regularization):

$$\min_{\mathbf{w}}\; \frac{1}{n}\sum_{i=1}^{n} \bigl(\mathbf{w}^\top \mathbf{x}_i - y_i\bigr)^2$$

Closed form solution: $\mathbf{w} = (\mathbf{X}\mathbf{X}^\top)^{-1}\mathbf{X}\mathbf{y}^\top$, where $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_n]$ and $\mathbf{y} = [y_1, \ldots, y_n]$.
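A quick numpy sketch of this closed form, following the note's convention that X stacks the $\mathbf{x}_i$ as columns (np.linalg.solve is used instead of an explicit inverse; assumes $\mathbf{X}\mathbf{X}^\top$ is invertible):

```python
import numpy as np

def ols_closed_form(X, y):
    """Closed-form OLS with X = [x_1, ..., x_n] of shape (d, n), y of shape (n,).

    Solves (X X^T) w = X y rather than forming the inverse explicitly.
    """
    return np.linalg.solve(X @ X.T, X @ y)
```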
Elastic Net:

$$\min_{\mathbf{w}}\; \frac{1}{n}\sum_{i=1}^{n} \bigl(\mathbf{w}^\top \mathbf{x}_i - y_i\bigr)^2 + \alpha \|\mathbf{w}\|_1 + (1 - \alpha)\|\mathbf{w}\|_2^2, \qquad \alpha \in [0, 1)$$

ADVANTAGE: Strictly convex (i.e. unique solution); sparsity inducing (good for feature selection); dual of squared-loss SVM, see SVEN.
DISADVANTAGE: Non-differentiable.
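For a runnable illustration, one could use scikit-learn's ElasticNet; note that its (alpha, l1_ratio) parametrization scales the two penalty terms differently from the formulation above, so this is a sketch of the idea rather than an exact match:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))   # scikit-learn expects one sample per row
w_true = np.zeros(20)
w_true[:3] = 1.0                 # sparse ground-truth weights
y = X @ w_true + 0.1 * rng.normal(size=100)

# l1_ratio mixes the l1 and l2 penalties (the role alpha plays above), but
# scikit-learn scales both terms by its own alpha, so the values are not
# directly comparable to the note's formulation.
model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(model.coef_)               # many coefficients driven to (near) zero
```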
Logistic Regression (often l1- or l2-regularized):

$$\min_{\mathbf{w},b}\; \frac{1}{n}\sum_{i=1}^{n} \log\Bigl(1 + e^{-y_i(\mathbf{w}^\top \mathbf{x}_i + b)}\Bigr)$$

Solve with gradient descent. $\Pr(y \mid \mathbf{x}) = \frac{1}{1 + e^{-y(\mathbf{w}^\top \mathbf{x} + b)}}$. Typically l2-regularized (sometimes l1).
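A minimal gradient-descent sketch for this objective (step size, iteration count, and the optional l2 strength lam are illustrative defaults, not from the lecture):

```python
import numpy as np

def logistic_regression_gd(X, y, lr=0.1, steps=1000, lam=0.0):
    """Gradient descent on (1/n) sum_i log(1 + exp(-y_i (w^T x_i + b))) + lam ||w||^2.

    X: (n, d) with one example per row; y: (n,) labels in {-1, +1}.
    """
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        m = y * (X @ w + b)          # margins y_i (w^T x_i + b)
        s = -y / (1.0 + np.exp(m))   # derivative of the i-th log-loss w.r.t. (w^T x_i + b)
        w -= lr * (X.T @ s / n + 2.0 * lam * w)
        b -= lr * s.mean()
    return w, b
```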
Linear Support Vector Machine:

$$\min_{\mathbf{w},b}\; C \sum_{i=1}^{n} \max\bigl[1 - y_i(\mathbf{w}^\top \mathbf{x}_i + b),\, 0\bigr] + \|\mathbf{w}\|_2^2$$

Quadratic program. When kernelized, leads to sparse solutions. The kernelized version can be solved very efficiently with specialized algorithms (e.g. SMO).
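In practice one would use a QP or SMO solver as noted above, but a simple subgradient-descent sketch of this primal objective looks as follows (hyperparameters illustrative):

```python
import numpy as np

def linear_svm_subgradient(X, y, C=1.0, lr=1e-3, steps=1000):
    """Subgradient descent on C sum_i max[1 - y_i(w^T x_i + b), 0] + ||w||_2^2.

    The hinge is non-differentiable at its kink, so a subgradient is used there.
    X: (n, d) with one example per row; y: (n,) labels in {-1, +1}.
    """
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        viol = y * (X @ w + b) < 1   # examples whose hinge term is active
        gw = -C * (X[viol].T @ y[viol]) + 2.0 * w
        gb = -C * y[viol].sum()
        w -= lr * gw
        b -= lr * gb
    return w, b
```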
[1] In Bayesian Machine Learning, it is common to optimize λ, but for the purposes of this class, it is assumed to be fixed.