
FIE453 – Big Data with Applications to Finance
Regularization

Who? Walt Pohl


From? Norwegian School of Economics
Department of Finance

When? September 19, 2021


Overfitting

Including many features introduces a new problem: overfitting.
In sample, adding more features always improves the fit.
Out of sample, it can worsen the fit.
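
A rough illustration (my own sketch, not part of the slides): the snippet below fits a regression on more and more features and compares in-sample and out-of-sample R². The use of numpy and scikit-learn, and the simulated data, are assumptions made for the example.

# In sample, R^2 keeps rising as we add features (even useless ones);
# out of sample, it eventually deteriorates.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n, p = 100, 50                                   # 50 candidate features, only 2 matter
X = rng.normal(size=(n, p))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=2.0, size=n)
X_new = rng.normal(size=(n, p))
y_new = X_new[:, 0] - 0.5 * X_new[:, 1] + rng.normal(scale=2.0, size=n)

for k in (2, 10, 30, 50):                        # fit on the first k features
    fit = LinearRegression().fit(X[:, :k], y)
    print(k,
          round(fit.score(X[:, :k], y), 2),          # in-sample R^2: never decreases in k
          round(fit.score(X_new[:, :k], y_new), 2))  # out-of-sample R^2: eventually falls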
Bias and Variance of Predictions
This has a clear description for least-squares loss.
Suppose that the true model is

Y = f(X) + ε,

and let Ŷ be our prediction.
The bias of Ŷ is

E(Ŷ − f(X)).

The variance of Ŷ is of course

E((Ŷ − E(Ŷ))²).
Bias-Variance Tradeoff
We want to minimize both bias and variance:
Bias is bad: on average the predictions are wrong.
Variance is bad: the predictions vary a lot from sample to sample.
Unfortunately, we can’t minimize both at once.
Expected least-squares loss equals

Bias² + Var(Ŷ) + Var(ε).
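
A small simulation (my own sketch, not from the slides) that estimates the bias and variance of a prediction at a single point by refitting the same, deliberately misspecified, model on many simulated samples. The true function, the noise level, and the use of scikit-learn are all assumptions for the example.

# Estimate bias and variance of the prediction at a fixed point x0
# by refitting a (misspecified) linear model on many simulated samples.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
f = lambda x: np.sin(3 * x)                    # "true" f(X); the noise ε has variance 0.25
x0 = np.array([[0.5]])

preds = []
for _ in range(2000):
    X = rng.uniform(0, 1, size=(40, 1))
    y = f(X).ravel() + rng.normal(scale=0.5, size=40)
    preds.append(LinearRegression().fit(X, y).predict(x0)[0])

preds = np.array(preds)
bias = preds.mean() - float(f(0.5))            # estimates E(Ŷ − f(X)) at x0
variance = preds.var()                         # estimates E((Ŷ − E(Ŷ))²)
print(bias**2 + variance + 0.25)               # approx. expected squared loss at x0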


Adding Features

In general:
Adding features lowers bias.
Adding features increases variance.
How do we choose?
Econometrics versus Machine Learning, cont’d

In traditional econometrics, the trade-off is handled differently:
We aim for a bias of zero.
If the variance is high, then the results of the model won’t be statistically significant.
Econometrics versus Machine Learning, cont’d
The logic plays out differently in machine learning:
We’re not looking to accept or reject a model, but rather to predict as well as we can.
In econometrics the models are deliberately kept small, which helps minimize the variance.
In machine learning, we want to use big models if they help us predict.
So in machine learning we are willing to tolerate some bias in order to reduce variance.
Regularization

Practitioners have settled on a technique to manage the bias-variance trade-off: regularization.
Also known as shrinkage in statistics.
The idea? Penalize complexity.
Regularized Loss Function

We take a loss function and add a penalty term, R:

E(L(Y, f̂(X))) + λ R(f̂).

Different choices of penalty function, R, lead to different methods.
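
As a minimal sketch (not from the slides) of what this objective looks like in code, assuming numpy; the function and penalty names are illustrative only.

# Regularized empirical loss: mean squared error plus lambda times a penalty R(beta).
import numpy as np

def regularized_loss(beta, X, y, lam, penalty):
    residuals = y - X @ beta
    return np.mean(residuals ** 2) + lam * penalty(beta)

ridge_penalty = lambda b: np.sum(b ** 2)      # leads to ridge regression (next slides)
lasso_penalty = lambda b: np.sum(np.abs(b))   # leads to the lasso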
Regularization as Constraint
We can view regularization as a constraint.
We minimize

E(L(Y, Ŷ)),

subject to a complexity constraint,

R(f̂) ≤ C,

for some constant C.
This is really the same thing – λ is the Lagrange multiplier in the Lagrangian for the constrained optimization problem.
Ridge Regression
Example:
E(Y − Σᵢ βᵢ Xᵢ)² + λ Σᵢ βᵢ²

Minimizing this penalized loss gives you ridge regression.
Note that λ – known as the ridge parameter – cannot be estimated from the data. It must be given.
λ = 0 is ordinary regression. λ → ∞ will force the coefficients towards zero.
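
A minimal numpy sketch of the ridge estimator in closed form (my own illustration; in practice one would typically use a library routine such as scikit-learn's Ridge).

# Ridge coefficients in closed form: beta = (X'X + lambda * I)^(-1) X'y.
# lambda = 0 reproduces OLS; a huge lambda shrinks the coefficients towards zero.
# (This sketch ignores the intercept, which is normally left unpenalized.)
import numpy as np

def ridge(X, y, lam):
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# ridge(X, y, 0.0)   # ordinary least squares
# ridge(X, y, 1e6)   # coefficients pushed towards zero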
Feature Selection Methods

Ridge regression works by pushing all coefficients towards zero.
A natural alternative is to set some coefficients exactly to zero and leave the others alone.
Subset Selection

Subset selection works by choosing a subset of the features and regressing on only those.
Several standard techniques:
Best subset
Forward stepwise
Backward stepwise
Best subset

For a fixed k, choose the k features that maximize the R².
Downside: can be computationally expensive.
Unspecified: k.
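
A brute-force sketch (my own, assuming scikit-learn for the regressions); the helper name is made up.

# Best-subset selection: for fixed k, try every k-feature subset and keep the
# one with the highest in-sample R^2. The number of subsets explodes with p.
from itertools import combinations
import numpy as np
from sklearn.linear_model import LinearRegression

def best_subset(X, y, k):
    best_r2, best_cols = -np.inf, None
    for subset in combinations(range(X.shape[1]), k):
        cols = list(subset)
        r2 = LinearRegression().fit(X[:, cols], y).score(X[:, cols], y)
        if r2 > best_r2:
            best_r2, best_cols = r2, cols
    return best_cols, best_r2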
Forward selection

Start with only the intercept, and add one feature at a time. Choose the feature that increases the R² the most.
Downside: not optimal fit.
Unspecified: when to stop.
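
A greedy sketch of forward selection (my own illustration, again assuming scikit-learn); since the slide leaves the stopping rule unspecified, this version simply runs a fixed number of steps.

# Forward stepwise selection: greedily add the feature that raises in-sample R^2 the most.
from sklearn.linear_model import LinearRegression

def forward_select(X, y, n_steps):
    chosen, remaining = [], list(range(X.shape[1]))
    for _ in range(n_steps):
        scores = {j: LinearRegression().fit(X[:, chosen + [j]], y)
                                       .score(X[:, chosen + [j]], y)
                  for j in remaining}
        best = max(scores, key=scores.get)
        chosen.append(best)
        remaining.remove(best)
    return chosen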
Backward selection

Start with all features, and remove one feature at a time. Remove the feature whose deletion decreases the R² the least.
Downside: not optimal fit.
Unspecified: when to stop.
Feature Selection as Regularization
Note that feature selection is still a form of regularization: R(f̂) is the number of non-zero coefficients.
The optimal regularized solution is best subset selection.
Since R is a discrete function, it’s hard to optimize, which is why we resort to heuristics such as forward and backward subset selection.
These kinds of heuristics, where we make the best local decision at each step, are known as greedy algorithms.
The Lasso

The lasso superficially resembles ridge regression, but has some of the aspects of subset selection.
It’s regression with a penalty term,

E(Y − Σᵢ βᵢ Xᵢ)² + λ Σᵢ |βᵢ|,

but minimizing this penalized loss typically sets some of the coefficients exactly to zero.
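
A short sketch (mine, not from the slides) comparing lasso and ridge on simulated data; scikit-learn's Lasso and Ridge are used, and their alpha plays the role of λ up to the library's scaling conventions.

# On the same data, the lasso sets many coefficients exactly to zero,
# while ridge only shrinks them towards zero.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 20))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)   # only 2 of 20 features matter

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)
print(np.sum(lasso.coef_ == 0))   # many exact zeros
print(np.sum(ridge.coef_ == 0))   # typically no exact zeros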
Other Penalties
Other penalties appear in the literature:
p-norm (for 1 ≤ p ≤ 2):

Σⱼ |βⱼ|^p

elastic net:

α Σⱼ βⱼ² + (1 − α) Σⱼ |βⱼ|

Both sit somewhere between ridge and lasso in behavior.
There are more sophisticated penalty functions we will consider later.
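
For completeness, a sketch using scikit-learn's ElasticNet (an assumption, not the slides' notation); its l1_ratio parameter controls the mix of the two penalties and corresponds roughly to 1 − α above, up to the library's scaling conventions.

# Elastic net: a convex mix of the squared and absolute-value penalties.
from sklearn.linear_model import ElasticNet

enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)   # X, y as in the lasso sketch above
print(enet.coef_)   # typically a milder sparsity pattern than the pure lasso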
How much do we regularize?

For each technique, one question remains unanswered. How do we choose λ? Or, how do we choose the number of features?
This too has a standard answer: validation.
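
As a preview (my own sketch, not from the slides), scikit-learn's cross-validated estimators pick the penalty strength from a grid; RidgeCV, LassoCV, and the alpha grid below are assumptions made for the example.

# Choose the penalty strength (lambda, called alpha in scikit-learn) by cross-validation.
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV

alphas = np.logspace(-3, 3, 50)
ridge_cv = RidgeCV(alphas=alphas).fit(X, y)          # X, y as in the earlier sketches
lasso_cv = LassoCV(alphas=alphas, cv=5).fit(X, y)
print(ridge_cv.alpha_, lasso_cv.alpha_)              # the selected penalty strengths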
