
10: Empirical Risk Minimization

Recap
Remember the unconstrained SVM Formulation
$$
\min_{w,b}\; C\sum_{i=1}^{n}\underbrace{\max\left[\,1 - y_i \underbrace{(w^\top x_i + b)}_{h(x_i)},\; 0\,\right]}_{\text{Hinge-Loss}} \;+\; \underbrace{\lVert w\rVert_2^2}_{l_2\text{-Regularizer}}
$$

The hinge loss is the SVM's error function of choice, whereas the $l_2$-regularizer reflects the complexity of the solution and penalizes complex solutions. This is an example of empirical risk minimization with a loss function $\ell$ and a regularizer $r$,

$$
\min_{w}\;\underbrace{\frac{1}{n}\sum_{i=1}^{n} \ell\big(h_w(x_i), y_i\big)}_{\text{Loss}} \;+\; \underbrace{\lambda\, r(w)}_{\text{Regularizer}},
$$

where the loss function is a continuous function which penalizes training error, and the regularizer is a continuous function which penalizes classifier complexity. Here, we define $\lambda = \frac{1}{C}$ from the previous lecture.[1]
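To make this template concrete, here is a minimal NumPy sketch (the names `empirical_risk`, `hinge_loss`, and `l2_reg` are illustrative, not from the lecture) that evaluates the regularized empirical risk for the linear hypothesis $h_w(x) = w^\top x + b$ with the hinge loss and the $l_2$-regularizer:

```python
import numpy as np

def hinge_loss(preds, y):
    # per-example hinge loss: max[1 - y_i * h_w(x_i), 0]
    return np.maximum(1.0 - y * preds, 0.0)

def l2_reg(w):
    # r(w) = ||w||_2^2
    return np.dot(w, w)

def empirical_risk(w, b, X, y, loss=hinge_loss, reg=l2_reg, lam=0.1):
    """(1/n) * sum_i loss(h_w(x_i), y_i) + lambda * r(w)"""
    preds = X @ w + b          # h_w(x_i) = w^T x_i + b, one prediction per row of X
    return loss(preds, y).mean() + lam * reg(w)

# toy data: four 2-D points with labels in {-1, +1}
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = np.array([0.5, 0.5]), 0.0
print(empirical_risk(w, b, X, y))
```

Swapping in a different `loss` or `reg` function reproduces the other combinations discussed below.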

Commonly Used Binary Classification Loss Functions


Different Machine Learning algorithms use different loss functions; Table 4.1 shows just a few:

| Loss $\ell(h_w(x_i), y_i)$ | Usage | Comments |
| --- | --- | --- |
| Hinge-Loss: $\max\left[1 - h_w(x_i)\,y_i,\, 0\right]^p$ | Standard SVM ($p = 1$), (Differentiable) Squared Hingeless SVM ($p = 2$) | When used for the Standard SVM, the loss function denotes the size of the margin between the linear separator and its closest points in either class. Only differentiable everywhere with $p = 2$. |
| Log-Loss: $\log\left(1 + e^{-h_w(x_i)\,y_i}\right)$ | Logistic Regression | One of the most popular loss functions in Machine Learning, since its outputs are well-calibrated probabilities. |
| Exponential Loss: $e^{-h_w(x_i)\,y_i}$ | AdaBoost | This function is very aggressive. The loss of a mis-prediction increases exponentially with the value of $-h_w(x_i)\,y_i$. This can lead to nice convergence results, for example in the case of AdaBoost, but it can also cause problems with noisy data. |
| Zero-One Loss: $\delta\left(\mathrm{sign}(h_w(x_i)) \neq y_i\right)$ | Actual Classification Loss | Non-continuous and thus impractical to optimize. |

Table 4.1: Loss Functions With Classification $y \in \{-1, +1\}$


Quiz: What do all these loss functions look like with respect to z = yh(x) ?

Figure 4.1: Plots of Common Classification Loss Functions. x-axis: $h(x_i)\,y_i$, or "correctness" of prediction; y-axis: loss value.

Some questions about the loss functions:


1. Which functions are strict upper bounds on the 0/1-loss?
2. What can you say about the hinge-loss and the log-loss as z → −∞?
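As a hint for the quiz, every loss in Table 4.1 can be written purely as a function of $z = y\,h(x)$. The following NumPy sketch (function names are mine, not from the notes) computes all four and could be fed into any plotting library to reproduce Figure 4.1:

```python
import numpy as np

def hinge(z, p=1):
    # max[1 - z, 0]^p : p = 1 is the standard SVM hinge, p = 2 the squared hinge
    return np.maximum(1.0 - z, 0.0) ** p

def log_loss(z):
    # log(1 + exp(-z)), written in a numerically stable way
    return np.logaddexp(0.0, -z)

def exponential(z):
    # exp(-z): grows very quickly for badly mis-classified points (z << 0)
    return np.exp(-z)

def zero_one(z):
    # 1 if the sign of the prediction disagrees with the label, 0 otherwise
    return (z <= 0).astype(float)

z = np.linspace(-3, 3, 7)
for f in (hinge, log_loss, exponential, zero_one):
    print(f.__name__, np.round(f(z), 3))
```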

Commonly Used Regression Loss Functions


Regression algorithms (where a prediction can lie anywhere on the real-number line) also
have their own host of loss functions:

| Loss $\ell(h_w(x_i), y_i)$ | Comments |
| --- | --- |
| Squared Loss: $\left(h(x_i) - y_i\right)^2$ | Most popular regression loss function. Estimates Mean Label. ADVANTAGE: Differentiable everywhere. DISADVANTAGE: Somewhat sensitive to outliers/noise. Also known as Ordinary Least Squares (OLS). |
| Absolute Loss: $\lvert h(x_i) - y_i\rvert$ | Also a very popular loss function. Estimates Median Label. ADVANTAGE: Less sensitive to noise. DISADVANTAGE: Not differentiable at $0$. |
| Huber Loss: $\frac{1}{2}\left(h(x_i) - y_i\right)^2$ if $\lvert h(x_i) - y_i\rvert < \delta$, otherwise $\delta\left(\lvert h(x_i) - y_i\rvert - \frac{\delta}{2}\right)$ | Also known as Smooth Absolute Loss. ADVANTAGE: "Best of Both Worlds" of Squared and Absolute Loss. Once-differentiable. Takes on the behavior of Squared Loss when the loss is small, and of Absolute Loss when the loss is large. |
| Log-Cosh Loss: $\log\left(\cosh\left(h(x_i) - y_i\right)\right)$, with $\cosh(x) = \frac{e^x + e^{-x}}{2}$ | ADVANTAGE: Similar to Huber Loss, but twice differentiable everywhere. |

Table 4.2: Loss Functions With Regression, i.e. $y \in \mathbb{R}$


Quiz: What do the loss functions in Table 4.2 look like with respect to $z = h(x_i) - y_i$?

Figure 4.2: Plots of Common Regression Loss Functions. x-axis: $h(x_i) - y_i$, or "error" of prediction; y-axis: loss value.

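Likewise, each regression loss in Table 4.2 depends only on the residual $z = h(x_i) - y_i$. A minimal NumPy sketch (function names are mine; $\delta = 1$ is an arbitrary choice) for comparing or plotting them:

```python
import numpy as np

def squared(z):
    return z ** 2

def absolute(z):
    return np.abs(z)

def huber(z, delta=1.0):
    # quadratic near 0, linear in the tails
    return np.where(np.abs(z) < delta,
                    0.5 * z ** 2,
                    delta * (np.abs(z) - 0.5 * delta))

def log_cosh(z):
    # log(cosh(z)): behaves like z^2 / 2 near 0 and like |z| - log(2) for large |z|
    return np.log(np.cosh(z))

z = np.linspace(-3, 3, 7)
for f in (squared, absolute, huber, log_cosh):
    print(f.__name__, np.round(f(z), 3))
```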

Regularizers
When we look at regularizers it helps to change the formulation of the optimization
problem to obtain a better geometric intuition:
$$
\min_{w,b}\; \sum_{i=1}^{n} \ell\big(h_w(x_i), y_i\big) + \lambda\, r(w) \quad\Longleftrightarrow\quad \min_{w,b}\; \sum_{i=1}^{n} \ell\big(h_w(x_i), y_i\big) \;\;\text{subject to: } r(w) \le B \tag{4.1}
$$

For each $\lambda \ge 0$, there exists $B \ge 0$ such that the two formulations in (4.1) are equivalent, and vice versa. In previous sections, the $l_2$-regularizer was introduced as the component of the SVM objective that reflects the complexity of a solution. Besides the $l_2$-regularizer, other useful regularizers and their properties are listed in Table 4.3.

| Regularizer $r(w)$ | Properties |
| --- | --- |
| $l_2$-Regularization: $r(w) = w^\top w = \lVert w\rVert_2^2$ | ADVANTAGE: Strictly Convex. ADVANTAGE: Differentiable. DISADVANTAGE: Uses weights on all features, i.e. relies on all features to some degree (ideally we would like to avoid this); these are known as Dense Solutions. |
| $l_1$-Regularization: $r(w) = \lVert w\rVert_1$ | Convex (but not strictly). DISADVANTAGE: Not differentiable at $0$ (the point which minimization is intended to bring us to). Effect: Sparse (i.e. not Dense) Solutions. |
| $l_p$-Norm (often $0 < p \le 1$): $\lVert w\rVert_p = \left(\sum_{i=1}^{d} \lvert w_i\rvert^p\right)^{1/p}$ | DISADVANTAGE: Non-convex. ADVANTAGE: Very sparse solutions. Initialization dependent. DISADVANTAGE: Not differentiable. |

Table 4.3: Types of Regularizers


Figure 4.3: Plots of Common Regularizers
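The dense-vs-sparse contrast in Table 4.3 is easy to observe empirically. The sketch below is an illustration rather than part of the lecture: it assumes scikit-learn is available and fits $l_2$- and $l_1$-regularized squared loss (Ridge and Lasso, see Table 4.4) to a synthetic problem in which only a few features matter:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, d = 100, 20
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:3] = [2.0, -1.5, 1.0]              # only 3 of the 20 features are relevant
y = X @ w_true + 0.1 * rng.normal(size=n)

ridge = Ridge(alpha=1.0).fit(X, y)          # squared loss + l2 regularizer
lasso = Lasso(alpha=0.1).fit(X, y)          # squared loss + l1 regularizer

print("ridge non-zero coefficients:", int(np.sum(np.abs(ridge.coef_) > 1e-6)))  # typically all 20 (dense)
print("lasso non-zero coefficients:", int(np.sum(np.abs(lasso.coef_) > 1e-6)))  # typically around 3 (sparse)
```

The $l_1$ penalty drives most coefficients exactly to zero, which is the sparsity effect referred to in the table, while the $l_2$ penalty only shrinks them.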

Famous Special Cases


This section includes several special cases that deal with risk minimization, such as
Ordinary Least Squares, Ridge Regression, Lasso, and Logistic Regression. Table 4.4
provides information on their loss functions, regularizers, as well as solutions.

| Loss and Regularizer | Comments |
| --- | --- |
| Ordinary Least Squares: $\min_{w}\; \frac{1}{n}\sum_{i=1}^{n} (w^\top x_i - y_i)^2$ | Squared Loss. No Regularization. Closed form solution: $w = (XX^\top)^{-1} X y^\top$, where $X = [x_1, \ldots, x_n]$ and $y = [y_1, \ldots, y_n]$. |
| Ridge Regression: $\min_{w}\; \frac{1}{n}\sum_{i=1}^{n} (w^\top x_i - y_i)^2 + \lambda \lVert w\rVert_2^2$ | Squared Loss. $l_2$-Regularization. Closed form solution: $w = (XX^\top + \lambda I)^{-1} X y^\top$. |
| Lasso: $\min_{w}\; \frac{1}{n}\sum_{i=1}^{n} (w^\top x_i - y_i)^2 + \lambda \lVert w\rVert_1$ | + sparsity inducing (good for feature selection). + Convex. - Not strictly convex (no unique solution). - Not differentiable (at $0$). Solve with (sub)-gradient descent or SVEN. |
| Elastic Net: $\min_{w}\; \frac{1}{n}\sum_{i=1}^{n} (w^\top x_i - y_i)^2 + \alpha \lVert w\rVert_1 + (1 - \alpha)\lVert w\rVert_2^2$, with $\alpha \in [0, 1)$ | ADVANTAGE: Strictly convex (i.e. unique solution). + sparsity inducing (good for feature selection). + Dual of squared-loss SVM, see SVEN. DISADVANTAGE: Non-differentiable. |
| Logistic Regression: $\min_{w,b}\; \frac{1}{n}\sum_{i=1}^{n} \log\left(1 + e^{-y_i(w^\top x_i + b)}\right)$ | Often $l_1$ or $l_2$ Regularized. Solve with gradient descent. $\Pr(y \mid x) = \frac{1}{1 + e^{-y(w^\top x + b)}}$ |
| Linear Support Vector Machine: $\min_{w,b}\; C\sum_{i=1}^{n} \max\left[1 - y_i(w^\top x_i + b),\, 0\right] + \lVert w\rVert_2^2$ | Typically $l_2$ regularized (sometimes $l_1$). Quadratic program. When kernelized leads to sparse solutions. Kernelized version can be solved very efficiently with specialized algorithms (e.g. SMO). |

Table 4.4: Special Cases

Some additional notes on the Special Cases:


1. Ridge Regression is very fast if data isn't too high dimensional.
2. Ridge Regression is just 1 line of Julia / Python (see the sketch after this list).
3. There is an interesting connection between Ordinary Least Squares and the first principal component of PCA (Principal Component Analysis). PCA also minimizes squared loss, but it measures the perpendicular (orthogonal) distance between each point and the fitted line, rather than the vertical distance used by OLS.
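As a sketch of note 2 (illustrative only; it follows Table 4.4's convention that $X = [x_1, \ldots, x_n]$ stores one training point per column), the closed-form Ridge Regression solution $w = (XX^\top + \lambda I)^{-1} X y^\top$ really is essentially one line of NumPy:

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge solution w = (X X^T + lam I)^{-1} X y,
    with X of shape d x n (one column per training point) and y of length n."""
    d = X.shape[0]
    return np.linalg.solve(X @ X.T + lam * np.eye(d), X @ y)

# sanity check on synthetic data drawn from a known weight vector
rng = np.random.default_rng(0)
d, n = 3, 50
X = rng.normal(size=(d, n))
w_true = np.array([1.0, -2.0, 0.5])
y = w_true @ X + 0.01 * rng.normal(size=n)
print(ridge(X, y, lam=1e-3))   # should be close to w_true
```

Using `np.linalg.solve` instead of forming the inverse explicitly computes the same closed form but is the numerically preferred way to do it.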

[1] In Bayesian Machine Learning, it is common to optimize λ , but for the purposes of
this class, it is assumed to be fixed.
