
10: Empirical Risk Minimization

Recap
Remember the unconstrained SVM Formulation
$$
\min_{w,b}\; C\sum_{i=1}^{n}\underbrace{\max\left[\,1 - y_i \underbrace{(w^\top x_i + b)}_{h(x_i)},\; 0\,\right]}_{\text{Hinge-Loss}} \;+\; \underbrace{\lVert w\rVert_2^2}_{l_2\text{-Regularizer}}
$$

The hinge loss is the SVM's error function of choice, whereas the $l_2$-regularizer reflects the complexity of the solution and penalizes complex solutions. This is an example of empirical risk minimization with a loss function $\ell$ and a regularizer $r$,

$$
\min_{w}\;\underbrace{\frac{1}{n}\sum_{i=1}^{n} \ell\big(h_w(x_i), y_i\big)}_{\text{Loss}} \;+\; \underbrace{\lambda\, r(w)}_{\text{Regularizer}},
$$

where the loss function is a continuous function which penalizes training error, and the regularizer is a continuous function which penalizes classifier complexity. Here, we define $\lambda = \frac{1}{C}$ from the previous lecture.[1]
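To make this template concrete, here is a minimal NumPy sketch (the names `empirical_risk`, `hinge_loss`, and `l2_reg` are illustrative, not from the lecture) that evaluates the regularized empirical risk for the linear hypothesis $h_w(x) = w^\top x + b$ with the hinge loss and the $l_2$-regularizer:

```python
import numpy as np

def hinge_loss(preds, y):
    # per-example hinge loss: max[1 - y_i * h_w(x_i), 0]
    return np.maximum(1.0 - y * preds, 0.0)

def l2_reg(w):
    # r(w) = ||w||_2^2
    return np.dot(w, w)

def empirical_risk(w, b, X, y, loss=hinge_loss, reg=l2_reg, lam=0.1):
    """(1/n) * sum_i loss(h_w(x_i), y_i) + lambda * r(w)"""
    preds = X @ w + b          # h_w(x_i) = w^T x_i + b, one prediction per row of X
    return loss(preds, y).mean() + lam * reg(w)

# toy data: four 2-D points with labels in {-1, +1}
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = np.array([0.5, 0.5]), 0.0
print(empirical_risk(w, b, X, y))
```

Swapping in a different `loss` or `reg` function reproduces the other combinations discussed below.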

Commonly Used Binary Classification Loss Functions


Different Machine Learning algorithms use different loss functions; Table 4.1 shows just a few:

| Loss $\ell(h_w(x_i), y_i)$ | Usage | Comments |
| --- | --- | --- |
| Hinge-Loss: $\max\left[1 - h_w(x_i)\,y_i,\, 0\right]^p$ | Standard SVM ($p = 1$), (Differentiable) Squared Hingeless SVM ($p = 2$) | When used for the Standard SVM, the loss function denotes the size of the margin between the linear separator and its closest points in either class. Only differentiable everywhere with $p = 2$. |
| Log-Loss: $\log\left(1 + e^{-h_w(x_i)\,y_i}\right)$ | Logistic Regression | One of the most popular loss functions in Machine Learning, since its outputs are well-calibrated probabilities. |
| Exponential Loss: $e^{-h_w(x_i)\,y_i}$ | AdaBoost | This function is very aggressive. The loss of a mis-prediction increases exponentially with the value of $-h_w(x_i)\,y_i$. This can lead to nice convergence results, for example in the case of AdaBoost, but it can also cause problems with noisy data. |
| Zero-One Loss: $\delta\left(\mathrm{sign}(h_w(x_i)) \neq y_i\right)$ | Actual Classification Loss | Non-continuous and thus impractical to optimize. |

Table 4.1: Loss Functions With Classification $y \in \{-1, +1\}$


Quiz: What do all these loss functions look like with respect to z = yh(x) ?

Figure 4.1: Plots of Common Classification Loss Functions. x-axis: $h(x_i)\,y_i$, or "correctness" of prediction; y-axis: loss value.

Some questions about the loss functions:


1. Which functions are strict upper bounds on the 0/1-loss?
2. What can you say about the hinge-loss and the log-loss as z → −∞?
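As a hint for the quiz, every loss in Table 4.1 can be written purely as a function of $z = y\,h(x)$. The following NumPy sketch (function names are mine, not from the notes) computes all four and could be fed into any plotting library to reproduce Figure 4.1:

```python
import numpy as np

def hinge(z, p=1):
    # max[1 - z, 0]^p : p = 1 is the standard SVM hinge, p = 2 the squared hinge
    return np.maximum(1.0 - z, 0.0) ** p

def log_loss(z):
    # log(1 + exp(-z)), written in a numerically stable way
    return np.logaddexp(0.0, -z)

def exponential(z):
    # exp(-z): grows very quickly for badly mis-classified points (z << 0)
    return np.exp(-z)

def zero_one(z):
    # 1 if the sign of the prediction disagrees with the label, 0 otherwise
    return (z <= 0).astype(float)

z = np.linspace(-3, 3, 7)
for f in (hinge, log_loss, exponential, zero_one):
    print(f.__name__, np.round(f(z), 3))
```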

Commonly Used Regression Loss Functions


Regression algorithms (where a prediction can lie anywhere on the real-number line) also
have their own host of loss functions:

| Loss $\ell(h_w(x_i), y_i)$ | Comments |
| --- | --- |
| Squared Loss: $\left(h(x_i) - y_i\right)^2$ | Most popular regression loss function. Estimates Mean Label. ADVANTAGE: Differentiable everywhere. DISADVANTAGE: Somewhat sensitive to outliers/noise. Also known as Ordinary Least Squares (OLS). |
| Absolute Loss: $\lvert h(x_i) - y_i\rvert$ | Also a very popular loss function. Estimates Median Label. ADVANTAGE: Less sensitive to noise. DISADVANTAGE: Not differentiable at $0$. |
| Huber Loss: $\frac{1}{2}\left(h(x_i) - y_i\right)^2$ if $\lvert h(x_i) - y_i\rvert < \delta$, otherwise $\delta\left(\lvert h(x_i) - y_i\rvert - \frac{\delta}{2}\right)$ | Also known as Smooth Absolute Loss. ADVANTAGE: "Best of Both Worlds" of Squared and Absolute Loss. Once-differentiable. Takes on the behavior of Squared Loss when the loss is small, and of Absolute Loss when the loss is large. |
| Log-Cosh Loss: $\log\left(\cosh\left(h(x_i) - y_i\right)\right)$, with $\cosh(x) = \frac{e^x + e^{-x}}{2}$ | ADVANTAGE: Similar to Huber Loss, but twice differentiable everywhere. |

Table 4.2: Loss Functions With Regression, i.e. $y \in \mathbb{R}$


Quiz: What do the loss functions in Table 4.2 look like with respect to $z = h(x_i) - y_i$?

Figure 4.2: Plots of Common Regression Loss Functions. x-axis: $h(x_i) - y_i$, or "error" of prediction; y-axis: loss value.

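Likewise, each regression loss in Table 4.2 depends only on the residual $z = h(x_i) - y_i$. A minimal NumPy sketch (function names are mine; $\delta = 1$ is an arbitrary choice) for comparing or plotting them:

```python
import numpy as np

def squared(z):
    return z ** 2

def absolute(z):
    return np.abs(z)

def huber(z, delta=1.0):
    # quadratic near 0, linear in the tails
    return np.where(np.abs(z) < delta,
                    0.5 * z ** 2,
                    delta * (np.abs(z) - 0.5 * delta))

def log_cosh(z):
    # log(cosh(z)): behaves like z^2 / 2 near 0 and like |z| - log(2) for large |z|
    return np.log(np.cosh(z))

z = np.linspace(-3, 3, 7)
for f in (squared, absolute, huber, log_cosh):
    print(f.__name__, np.round(f(z), 3))
```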

Regularizers
When we look at regularizers it helps to change the formulation of the optimization
problem to obtain a better geometric intuition:
$$
\min_{w,b}\; \sum_{i=1}^{n} \ell\big(h_w(x_i), y_i\big) + \lambda\, r(w) \quad\Longleftrightarrow\quad \min_{w,b}\; \sum_{i=1}^{n} \ell\big(h_w(x_i), y_i\big) \;\;\text{subject to: } r(w) \le B \tag{4.1}
$$

For each $\lambda \ge 0$, there exists $B \ge 0$ such that the two formulations in (4.1) are equivalent, and vice versa. In previous sections, the $l_2$-regularizer was introduced as the component of the SVM objective that reflects the complexity of a solution. Besides the $l_2$-regularizer, other useful regularizers and their properties are listed in Table 4.3.

| Regularizer $r(w)$ | Properties |
| --- | --- |
| $l_2$-Regularization: $r(w) = w^\top w = \lVert w\rVert_2^2$ | ADVANTAGE: Strictly Convex. ADVANTAGE: Differentiable. DISADVANTAGE: Uses weights on all features, i.e. relies on all features to some degree (ideally we would like to avoid this); these are known as Dense Solutions. |
| $l_1$-Regularization: $r(w) = \lVert w\rVert_1$ | Convex (but not strictly). DISADVANTAGE: Not differentiable at $0$ (the point which minimization is intended to bring us to). Effect: Sparse (i.e. not Dense) Solutions. |
| $l_p$-Norm (often $0 < p \le 1$): $\lVert w\rVert_p = \left(\sum_{i=1}^{d} \lvert w_i\rvert^p\right)^{1/p}$ | DISADVANTAGE: Non-convex. ADVANTAGE: Very sparse solutions. Initialization dependent. DISADVANTAGE: Not differentiable. |

Table 4.3: Types of Regularizers


Figure 4.3: Plots of Common Regularizers
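The dense-vs-sparse contrast in Table 4.3 is easy to observe empirically. The sketch below is an illustration rather than part of the lecture: it assumes scikit-learn is available and fits $l_2$- and $l_1$-regularized squared loss (Ridge and Lasso, see Table 4.4) to a synthetic problem in which only a few features matter:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, d = 100, 20
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:3] = [2.0, -1.5, 1.0]              # only 3 of the 20 features are relevant
y = X @ w_true + 0.1 * rng.normal(size=n)

ridge = Ridge(alpha=1.0).fit(X, y)          # squared loss + l2 regularizer
lasso = Lasso(alpha=0.1).fit(X, y)          # squared loss + l1 regularizer

print("ridge non-zero coefficients:", int(np.sum(np.abs(ridge.coef_) > 1e-6)))  # typically all 20 (dense)
print("lasso non-zero coefficients:", int(np.sum(np.abs(lasso.coef_) > 1e-6)))  # typically around 3 (sparse)
```

The $l_1$ penalty drives most coefficients exactly to zero, which is the sparsity effect referred to in the table, while the $l_2$ penalty only shrinks them.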

Famous Special Cases


This section includes several special cases that deal with risk minimization, such as
Ordinary Least Squares, Ridge Regression, Lasso, and Logistic Regression. Table 4.4
provides information on their loss functions, regularizers, as well as solutions.

| Loss and Regularizer | Comments |
| --- | --- |
| Ordinary Least Squares: $\min_{w}\; \frac{1}{n}\sum_{i=1}^{n} (w^\top x_i - y_i)^2$ | Squared Loss. No Regularization. Closed form solution: $w = (XX^\top)^{-1} X y^\top$, where $X = [x_1, \ldots, x_n]$ and $y = [y_1, \ldots, y_n]$. |
| Ridge Regression: $\min_{w}\; \frac{1}{n}\sum_{i=1}^{n} (w^\top x_i - y_i)^2 + \lambda \lVert w\rVert_2^2$ | Squared Loss. $l_2$-Regularization. Closed form solution: $w = (XX^\top + \lambda I)^{-1} X y^\top$. |
| Lasso: $\min_{w}\; \frac{1}{n}\sum_{i=1}^{n} (w^\top x_i - y_i)^2 + \lambda \lVert w\rVert_1$ | + sparsity inducing (good for feature selection). + Convex. - Not strictly convex (no unique solution). - Not differentiable (at $0$). Solve with (sub)-gradient descent or SVEN. |
| Elastic Net: $\min_{w}\; \frac{1}{n}\sum_{i=1}^{n} (w^\top x_i - y_i)^2 + \alpha \lVert w\rVert_1 + (1 - \alpha)\lVert w\rVert_2^2$, with $\alpha \in [0, 1)$ | ADVANTAGE: Strictly convex (i.e. unique solution). + sparsity inducing (good for feature selection). + Dual of squared-loss SVM, see SVEN. DISADVANTAGE: Non-differentiable. |
| Logistic Regression: $\min_{w,b}\; \frac{1}{n}\sum_{i=1}^{n} \log\left(1 + e^{-y_i(w^\top x_i + b)}\right)$ | Often $l_1$ or $l_2$ Regularized. Solve with gradient descent. $\Pr(y \mid x) = \frac{1}{1 + e^{-y(w^\top x + b)}}$ |
| Linear Support Vector Machine: $\min_{w,b}\; C\sum_{i=1}^{n} \max\left[1 - y_i(w^\top x_i + b),\, 0\right] + \lVert w\rVert_2^2$ | Typically $l_2$ regularized (sometimes $l_1$). Quadratic program. When kernelized leads to sparse solutions. Kernelized version can be solved very efficiently with specialized algorithms (e.g. SMO). |

Table 4.4: Special Cases

Some additional notes on the Special Cases:


1. Ridge Regression is very fast if data isn't too high dimensional.
2. Ridge Regression is just 1 line of Julia / Python (see the sketch after this list).
3. There is an interesting connection between Ordinary Least Squares and the first principal component of PCA (Principal Component Analysis). PCA also minimizes squared loss, but it measures the perpendicular (orthogonal) distance between each point and the fitted line, rather than the vertical distance used by OLS.
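As a sketch of note 2 (illustrative only; it follows Table 4.4's convention that $X = [x_1, \ldots, x_n]$ stores one training point per column), the closed-form Ridge Regression solution $w = (XX^\top + \lambda I)^{-1} X y^\top$ really is essentially one line of NumPy:

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge solution w = (X X^T + lam I)^{-1} X y,
    with X of shape d x n (one column per training point) and y of length n."""
    d = X.shape[0]
    return np.linalg.solve(X @ X.T + lam * np.eye(d), X @ y)

# sanity check on synthetic data drawn from a known weight vector
rng = np.random.default_rng(0)
d, n = 3, 50
X = rng.normal(size=(d, n))
w_true = np.array([1.0, -2.0, 0.5])
y = w_true @ X + 0.01 * rng.normal(size=n)
print(ridge(X, y, lam=1e-3))   # should be close to w_true
```

Using `np.linalg.solve` instead of forming the inverse explicitly computes the same closed form but is the numerically preferred way to do it.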

[1] In Bayesian Machine Learning, it is common to optimize λ , but for the purposes of
this class, it is assumed to be fixed.
