
FIE453 – Big Data with Applications to Finance
Regularization

Who? Walt Pohl


From? Norwegian School of Economics
Department of Finance

When? September 19, 2021


Overfitting

Including many features introduces a new problem: overfitting.
In sample, adding more features always improves the fit.
Out of sample, it can worsen the fit.
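
A rough illustration (my own sketch, not part of the slides): the snippet below fits a regression on more and more features and compares in-sample and out-of-sample R². The use of numpy and scikit-learn, and the simulated data, are assumptions made for the example.

# In sample, R^2 keeps rising as we add features (even useless ones);
# out of sample, it eventually deteriorates.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n, p = 100, 50                                   # 50 candidate features, only 2 matter
X = rng.normal(size=(n, p))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=2.0, size=n)
X_new = rng.normal(size=(n, p))
y_new = X_new[:, 0] - 0.5 * X_new[:, 1] + rng.normal(scale=2.0, size=n)

for k in (2, 10, 30, 50):                        # fit on the first k features
    fit = LinearRegression().fit(X[:, :k], y)
    print(k,
          round(fit.score(X[:, :k], y), 2),          # in-sample R^2: never decreases in k
          round(fit.score(X_new[:, :k], y_new), 2))  # out-of-sample R^2: eventually falls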
Bias and Variance of Predictions
This has a clear description for least-squares loss.
Suppose that the true model is

Y = f(X) + ε,

and let Ŷ be our prediction.
The bias of Ŷ is

E(Ŷ − f(X)).

The variance of Ŷ is of course

E((Ŷ − E(Ŷ))²).
Bias-Variance Tradeoff
We want to minimize both bias and variance:
Bias is bad: on average the predictions are wrong.
Variance is bad: the predictions vary a lot from sample to sample.
Unfortunately, we can’t minimize both at once.
Expected least-squares loss equals

Bias² + Var(Ŷ) + Var(ε).
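
A small simulation (my own sketch, not from the slides) that estimates the bias and variance of a prediction at a single point by refitting the same, deliberately misspecified, model on many simulated samples. The true function, the noise level, and the use of scikit-learn are all assumptions for the example.

# Estimate bias and variance of the prediction at a fixed point x0
# by refitting a (misspecified) linear model on many simulated samples.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
f = lambda x: np.sin(3 * x)                    # "true" f(X); the noise ε has variance 0.25
x0 = np.array([[0.5]])

preds = []
for _ in range(2000):
    X = rng.uniform(0, 1, size=(40, 1))
    y = f(X).ravel() + rng.normal(scale=0.5, size=40)
    preds.append(LinearRegression().fit(X, y).predict(x0)[0])

preds = np.array(preds)
bias = preds.mean() - float(f(0.5))            # estimates E(Ŷ − f(X)) at x0
variance = preds.var()                         # estimates E((Ŷ − E(Ŷ))²)
print(bias**2 + variance + 0.25)               # approx. expected squared loss at x0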


Adding Features

In general:
Adding features lowers bias.
Adding features increases variance.
How do we choose?
Econometrics versus Machine Learning, cont’d

In traditional econometrics, the trade-off is handled differently:
We aim for a bias of zero.
If the variance is high, then the results of the model won’t be statistically significant.
Econometrics versus Machine Learning, cont’d
The logic plays out differently in machine learning:
We’re not looking to accept or reject a model, but rather to predict as well as we can.
In econometrics the models are deliberately kept small, which helps minimize the variance.
In machine learning, we want to use big models if they help us predict.
So in machine learning we are willing to tolerate some bias in order to reduce variance.
Regularization

Practitioners have settled on a technique to manage the bias-variance trade-off: regularization.
Also known as shrinkage in statistics.
The idea? Penalize complexity.
Regularized Loss Function

We take a loss function and add a penalty term, R:

E(L(Y, f̂(X))) + λ R(f̂).

Different choices of penalty function, R, lead to different methods.
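
As a minimal sketch (not from the slides) of what this objective looks like in code, assuming numpy; the function and penalty names are illustrative only.

# Regularized empirical loss: mean squared error plus lambda times a penalty R(beta).
import numpy as np

def regularized_loss(beta, X, y, lam, penalty):
    residuals = y - X @ beta
    return np.mean(residuals ** 2) + lam * penalty(beta)

ridge_penalty = lambda b: np.sum(b ** 2)      # leads to ridge regression (next slides)
lasso_penalty = lambda b: np.sum(np.abs(b))   # leads to the lasso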
Regularization as Constraint
We can view regularization as a constraint.
We minimize

E(L(Y, Ŷ)),

subject to a complexity constraint,

R(f̂) ≤ C,

for some constant C.
This is really the same thing – λ is the Lagrange multiplier in the Lagrangian for the constrained optimization problem.
Ridge Regression
Example:
E(Y − Σᵢ βᵢ Xᵢ)² + λ Σᵢ βᵢ²

Minimizing this penalized loss gives you ridge regression.
Note that λ – known as the ridge parameter – cannot be estimated from the data. It must be given.
λ = 0 is ordinary regression. λ → ∞ will force the coefficients towards zero.
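
A minimal numpy sketch of the ridge estimator in closed form (my own illustration; in practice one would typically use a library routine such as scikit-learn's Ridge).

# Ridge coefficients in closed form: beta = (X'X + lambda * I)^(-1) X'y.
# lambda = 0 reproduces OLS; a huge lambda shrinks the coefficients towards zero.
# (This sketch ignores the intercept, which is normally left unpenalized.)
import numpy as np

def ridge(X, y, lam):
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# ridge(X, y, 0.0)   # ordinary least squares
# ridge(X, y, 1e6)   # coefficients pushed towards zero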
Feature Selection Methods

Ridge regression works by pushing all coefficients towards zero.
A natural alternative is to set some coefficients exactly to zero and leave the others alone.
Subset Selection

Subset selection works by choosing a subset of the features and regressing on only those.
Several standard techniques:
Best subset
Forward stepwise
Backward stepwise
Best subset

For a fixed k, choose the k features that maximize the R².
Downside: can be computationally expensive.
Unspecified: k.
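
A brute-force sketch (my own, assuming scikit-learn for the regressions); the helper name is made up.

# Best-subset selection: for fixed k, try every k-feature subset and keep the
# one with the highest in-sample R^2. The number of subsets explodes with p.
from itertools import combinations
import numpy as np
from sklearn.linear_model import LinearRegression

def best_subset(X, y, k):
    best_r2, best_cols = -np.inf, None
    for subset in combinations(range(X.shape[1]), k):
        cols = list(subset)
        r2 = LinearRegression().fit(X[:, cols], y).score(X[:, cols], y)
        if r2 > best_r2:
            best_r2, best_cols = r2, cols
    return best_cols, best_r2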
Forward selection

Start with only the intercept, and add one feature at a time. Choose the feature that increases the R² the most.
Downside: not optimal fit.
Unspecified: when to stop.
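
A greedy sketch of forward selection (my own illustration, again assuming scikit-learn); since the slide leaves the stopping rule unspecified, this version simply runs a fixed number of steps.

# Forward stepwise selection: greedily add the feature that raises in-sample R^2 the most.
from sklearn.linear_model import LinearRegression

def forward_select(X, y, n_steps):
    chosen, remaining = [], list(range(X.shape[1]))
    for _ in range(n_steps):
        scores = {j: LinearRegression().fit(X[:, chosen + [j]], y)
                                       .score(X[:, chosen + [j]], y)
                  for j in remaining}
        best = max(scores, key=scores.get)
        chosen.append(best)
        remaining.remove(best)
    return chosen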
Backward selection

Start with all features, and remove one feature at a time. Remove the feature whose deletion decreases the R² the least.
Downside: not optimal fit.
Unspecified: when to stop.
Feature Selection as Regularization
Note that feature selection is still a form of regularization: R(f̂) is the number of non-zero coefficients.
The optimal regularized solution is best subset selection.
Since R is a discrete function, it’s hard to optimize, which is why we resort to heuristics such as forward and backward subset selection.
These kinds of heuristics, where we make the best local decision at each step, are known as greedy algorithms.
The Lasso

The lasso superficially resembles ridge regression, but has some of the aspects of subset selection.
It’s regression with a penalty term,

E(Y − Σᵢ βᵢ Xᵢ)² + λ Σᵢ |βᵢ|,

but minimizing this penalized loss typically sets some of the coefficients exactly to zero.
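
A short sketch (mine, not from the slides) comparing lasso and ridge on simulated data; scikit-learn's Lasso and Ridge are used, and their alpha plays the role of λ up to the library's scaling conventions.

# On the same data, the lasso sets many coefficients exactly to zero,
# while ridge only shrinks them towards zero.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 20))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)   # only 2 of 20 features matter

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)
print(np.sum(lasso.coef_ == 0))   # many exact zeros
print(np.sum(ridge.coef_ == 0))   # typically no exact zeros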
Other Penalties
Other penalties appear in the literature:
p-norm (for 1 ≤ p ≤ 2):

Σⱼ |βⱼ|^p

elastic net:

α Σⱼ βⱼ² + (1 − α) Σⱼ |βⱼ|

Both sit somewhere between ridge and lasso in behavior.
There are more sophisticated penalty functions we will consider later.
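
For completeness, a sketch using scikit-learn's ElasticNet (an assumption, not the slides' notation); its l1_ratio parameter controls the mix of the two penalties and corresponds roughly to 1 − α above, up to the library's scaling conventions.

# Elastic net: a convex mix of the squared and absolute-value penalties.
from sklearn.linear_model import ElasticNet

enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)   # X, y as in the lasso sketch above
print(enet.coef_)   # typically a milder sparsity pattern than the pure lasso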
How much do we regularize?

For each technique, one question remains unanswered. How do we choose λ? Or, how do we choose the number of features?
This too has a standard answer: validation.
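
As a preview (my own sketch, not from the slides), scikit-learn's cross-validated estimators pick the penalty strength from a grid; RidgeCV, LassoCV, and the alpha grid below are assumptions made for the example.

# Choose the penalty strength (lambda, called alpha in scikit-learn) by cross-validation.
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV

alphas = np.logspace(-3, 3, 50)
ridge_cv = RidgeCV(alphas=alphas).fit(X, y)          # X, y as in the earlier sketches
lasso_cv = LassoCV(alphas=alphas, cv=5).fit(X, y)
print(ridge_cv.alpha_, lasso_cv.alpha_)              # the selected penalty strengths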
