
University of Sousse

National Engineering School of Sousse


Academic year: 2023-2024

Ridge Regression
Developed by:

● Houda El Amri
● Oumayma Ben AbdEnnadher
● Safa Kochat
● Tasnim Lahmar

Supervised by:

Mr. Anis Zaglaoui
Table of contents

01 Introduction to Ridge Regression
02 The benefit of the ridge estimator over the OLS estimator
03 Contribution of the penalty parameter
04 How to choose the penalty parameter
05 The bias and the variance of the ridge estimator
06 The MSE of the ridge and OLS estimators
07 Application domain of ridge regression
01
Introduction to Ridge
Regression
Introduction to Ridge Regression
Ridge regression is a linear regression method that adds a quadratic (L2) penalty on the
regression coefficients. This has the effect of shrinking the coefficients towards zero,
which can be useful in cases where the independent variables are highly correlated.
History of Ridge Regression
We assume only that the X's and Y have been centered, so that we have no need for a constant
term in the regression:

● X is an n × p matrix with centered columns,

● Y is a centered n-vector.

Hoerl and Kennard (1970) proposed that potential instability in the LS estimator

β̂ = (XᵀX)⁻¹ XᵀY

could be improved by adding a small constant value λ to the diagonal entries of the matrix XᵀX before taking its inverse.

The result is the ridge regression estimator

β̂_λ = (XᵀX + λI)⁻¹ XᵀY
Ridge regression places a particular form of constraint on the parameters: β̂_λ is chosen
to minimize the penalized sum of squares

Σᵢ (yᵢ − Σⱼ xᵢⱼ βⱼ)² + λ Σⱼ βⱼ²,

which is equivalent to minimization of Σᵢ (yᵢ − Σⱼ xᵢⱼ βⱼ)² subject to Σⱼ βⱼ² ≤ c, for some c > 0,

i.e. constraining the sum of the squared coefficients.

Therefore, ridge regression puts further constraints on the parameters, the βⱼ's, in the linear model. Instead of
just minimizing the residual sum of squares, we also have a penalty term on the β's. This penalty term is λ
(a pre-chosen constant) times the squared norm of the β vector. This means that if the βⱼ's take on large values,
the objective function is penalized. We would therefore prefer smaller βⱼ's, or βⱼ's close to zero, to keep the
penalty term small.
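The closed-form ridge estimator above can be computed directly. Below is a minimal sketch in Python (NumPy); the synthetic data, the dimensions and the value λ = 1.0 are illustrative assumptions, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative synthetic data: 50 observations, 5 predictors,
# two of which are made highly correlated.
n, p = 50, 5
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=n)
beta_true = np.array([2.0, 0.0, -1.0, 0.5, 1.5])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Center X and y so that no intercept is needed.
Xc = X - X.mean(axis=0)
yc = y - y.mean()

lam = 1.0  # penalty parameter λ (illustrative value)

# Ridge estimator: (XᵀX + λI)⁻¹ Xᵀ y
beta_ridge = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)

# OLS estimator for comparison: (XᵀX)⁻¹ Xᵀ y
beta_ols = np.linalg.solve(Xc.T @ Xc, Xc.T @ yc)

print("OLS coefficients:  ", np.round(beta_ols, 3))
print("Ridge coefficients:", np.round(beta_ridge, 3))
```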
Geometric Interpretation of Ridge Regression

In the usual geometric picture, ellipses represent the contours of the residual sum of
squares (RSS) and a circle represents the constraint region. The ridge estimate, at the
intersection of these shapes, balances the trade-off between minimizing the RSS and the
penalty associated with large coefficient values. The penalty term acts as a constraint,
limiting the norm of β to be smaller than a constant c. The relationship between the
regularization parameter λ and c dictates how strongly the β values are pulled towards
zero. Ridge regression thus finely tunes this trade-off: a higher λ emphasizes
constraining the β values, while a lower λ prioritizes fitting the data.
02
The benefit of ridge
estimator over OLS
estimator
What is the benefit of ridge
regression over OLS?
The Ridge estimator offers several benefits over the Ordinary Least Squares (OLS) estimator, particularly
in scenarios where challenges such as multicollinearity, high variance in predictors, and overfitting
may arise. Here are some key advantages of the Ridge estimator:

Multicollinearity Mitigation:

● Ridge Estimator: Effectively addresses multicollinearity, a situation where predictor variables are
highly correlated. The regularization term in Ridge Regression helps to stabilize coefficient estimates
even when multicollinearity is present.

● OLS: OLS is highly sensitive to multicollinearity; in its presence, it may produce unstable and
inflated coefficient estimates, making interpretations less reliable.

Variance Reduction:

● Ridge Estimator: Reduces the variance of coefficient estimates by introducing a regularization term.
This is particularly valuable when dealing with predictors that have high variance.

● OLS: Tends to give higher variance for coefficients corresponding to predictors with higher variance,
making the model less robust and potentially leading to overfitting.

Improved Generalization:

● Ridge Estimator: Helps to improve the generalization performance of the model by preventing
overfitting. The shrinkage of coefficients toward zero results in a more balanced model that is less
sensitive to the idiosyncrasies of the training data.

● OLS: May overfit the training data, especially when dealing with a large number of predictors relative
to the number of observations, potentially leading to poor performance on new, unseen data.
Robustness to Outliers:

● Ridge Estimator: Provides some robustness to outliers, as the regularization term limits
the influence of extreme data points on coefficient estimates.

● OLS: Sensitive to outliers, and a single outlier can have a disproportionate impact on the
estimates, potentially leading to biased results.

Unique Solution in the Presence of Multicollinearity:

● Ridge Estimator: Ensures a stable solution even when multicollinearity is present,


offering a unique solution that is less susceptible to the instability associated with OLS.

● OLS: May encounter issues in the presence of multicollinearity, leading to unreliable and
highly sensitive estimates.
Conclusion :
Ridge regression regularizes ("shrinks") the linear regression (OLS) coefficient
estimates. The estimated parameters are pushed towards zero so that they perform
better on fresh data sets. This allows you to employ sophisticated models while
avoiding overfitting.
Benefits of Ridge Regression:

● Prevents a model from overfitting.
● Does not require unbiased estimators.
● Introduces just enough bias into the estimates to make them reasonably reliable
approximations of the true population values.
● When there is multicollinearity, the ridge estimator is particularly effective at
improving on the least-squares estimate (illustrated in the sketch below).
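To make the multicollinearity point concrete, here is a minimal sketch using scikit-learn on synthetic, highly correlated predictors; the data, the value alpha=10.0 (playing the role of λ) and all variable names are illustrative assumptions, not taken from the slides.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(42)

# Synthetic data with two nearly identical (highly correlated) predictors.
n = 30
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)           # almost a copy of x1
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.5, size=n)  # only x1 truly matters

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)           # alpha plays the role of λ

# OLS typically splits the effect between x1 and x2 with large, unstable values;
# ridge keeps both coefficients small and stable.
print("OLS coefficients:  ", np.round(ols.coef_, 2))
print("Ridge coefficients:", np.round(ridge.coef_, 2))
```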
03
Contribution of
the Penalty
Parameter
Contribution of the Penalty
Parameter
Role of Penalty Parameter (λ):
The hyperparameter λ plays a crucial role in the regularization of Ridge Regression
models. It controls the strength of the penalty applied to the model coefficients. When
λ is zero, Ridge Regression becomes an ordinary linear regression. As λ increases, the
penalty on the coefficients also increases, and the values of the coefficients decrease
towards zero. This can reduce overfitting as the model becomes less complex.
However, if λ is too high, the model can become overly simple and fail to capture the
complexity of the data adequately, leading to underfitting. Therefore, the choice of λ is
a balancing act: it should be large enough to regularize the effect of the predictor
variables, but not so large that the model loses its predictive accuracy.
Selecting the optimal value of λ is usually done through techniques like
cross-validation, where different values of λ are tested, and the value that results in
the best prediction performance is chosen.
Impact of different λ Values

● When λ = 0, the penalty term has no effect, and ridge regression produces the
classical least-squares coefficients.
● Now let's see what happens as we increase the value of λ.

(Plots of the fitted line for λ = 1, λ = 2 and λ = 3.)

● The larger we make λ, the closer the slope gets (asymptotically) to 0, and our prediction for Size
becomes less sensitive to Weight (a numerical sketch follows below).

(Plots of the fitted line for λ = 10, λ = 1000 and λ = 100000.)
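A minimal numerical sketch of this shrinkage effect, assuming a small synthetic Weight → Size dataset; the data and the grid of λ values are illustrative, and scikit-learn's Ridge(alpha=...) plays the role of λ.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)

# Illustrative Weight -> Size data (synthetic, roughly linear).
weight = rng.uniform(1.0, 5.0, size=20)
size = 0.8 * weight + rng.normal(scale=0.3, size=20)
X = weight.reshape(-1, 1)

# Fit ridge regression for increasing λ and watch the slope shrink towards zero.
# λ = 0 reproduces the classical least-squares fit.
for lam in [0, 1, 2, 3, 10, 1000, 100000]:
    slope = Ridge(alpha=lam).fit(X, size).coef_[0]
    print(f"lambda = {lam:>6}: slope = {slope:.4f}")
```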
So how do we decide what value to give λ ?

if λ is too high, the model can become overly simple and fail to capture
the complexity of the data adequately, leading to underfitting. Therefore,
the choice of λ is a balancing act: it should be large enough to regularize
the effect of the predictor variables, but not so large that the model loses
its predictive accuracy.

Selecting the optimal value of λ is usually done through techniques like
cross-validation, where different values of λ are tested, and the value that
results in the best prediction performance is chosen.
Cross-validation is a statistical technique used to assess the performance of a
predictive model and evaluate its ability to generalize to new, unseen data. It
involves partitioning the dataset into multiple subsets, training the model on some of
these subsets, and then evaluating its performance on the remaining subsets.

The basic idea behind cross-validation is to use different parts of the dataset for both
training and testing, allowing for a more robust estimation of the model's
performance. This is particularly important to avoid issues such as overfitting, where
a model performs well on the training data but poorly on new, unseen data.

There are various types of cross-validation, and one of the most commonly used
methods is k-fold cross-validation.
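To illustrate, here is a minimal sketch of k-fold cross-validation used to pick λ with scikit-learn; the synthetic dataset and the grid of candidate values are illustrative assumptions, not taken from the slides.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (illustrative only).
X, y = make_regression(n_samples=100, n_features=20, noise=10.0, random_state=0)

candidate_lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# For each candidate λ, estimate out-of-sample error with 5-fold cross-validation
# and keep the value with the smallest mean squared error.
scores = {}
for lam in candidate_lambdas:
    mse = -cross_val_score(Ridge(alpha=lam), X, y,
                           cv=cv, scoring="neg_mean_squared_error").mean()
    scores[lam] = mse
    print(f"lambda = {lam:>6}: CV MSE = {mse:.2f}")

best_lambda = min(scores, key=scores.get)
print("Selected lambda:", best_lambda)
```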
04
How to choose
the penalty
parameter
Selecting the penalty parameter (often denoted as α) for Ridge
Regression involves choosing a value that strikes a balance between
fitting the training data well and preventing overfitting. The penalty
parameter controls the amount of regularization applied to the model,
influencing the shrinkage of the coefficients towards zero.
Selecting the penalty parameter α for Ridge Regression involves balancing model fit and simplicity. Here's a
concise guide:

1. Cross-Validation:
- Objective: Choose the α that minimizes the error on a validation set in cross-validation.

2. Regularization Path Analysis:
- Objective: Look for stability in the coefficients as α varies; choose the value where coefficients stabilize or approach zero.

3. Grid Search:
- Objective: Systematically test a grid of α values; select the one with the best cross-validated performance (see the sketch below).

4. Information Criteria:
- Objective: Use criteria like AIC or BIC; choose the α that minimizes the selected criterion.

5. Domain Knowledge:
- Objective: Incorporate any prior insights about expected coefficient sizes; align α with domain knowledge.

Consider the specific characteristics of your data and modeling goals when choosing α. Cross-validation is often a
reliable choice for unbiased performance estimation on unseen data.
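For the grid-search approach (item 3 above), a minimal sketch with scikit-learn could look like the following; the synthetic data and the α grid are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for a real dataset (illustrative only).
X, y = make_regression(n_samples=200, n_features=30, noise=15.0, random_state=1)

# Grid of candidate penalty values α (called alpha in scikit-learn).
param_grid = {"alpha": np.logspace(-3, 3, 13)}

# Exhaustively evaluate each candidate with 5-fold cross-validation.
search = GridSearchCV(Ridge(), param_grid,
                      scoring="neg_mean_squared_error", cv=5)
search.fit(X, y)

print("Best alpha:", search.best_params_["alpha"])
print("Best CV MSE:", -search.best_score_)
```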
Conclusion :
The choice of the penalty parameter depends on the specific
characteristics of the dataset and the modeling goals. Cross-validation is a
commonly recommended approach as it provides an unbiased estimate of
the model's performance on unseen data. However, other methods can be
valuable depending on the context and available information.
05
The bias of the ridge
estimator
In this section we derive the bias and
variance of the ridge estimator under
the commonly made assumption that,
conditional on X, the errors of the
regression have zero mean and
constant variance 𝞼² and are
uncorrelated:

E[ε | X] = 0 and Var(ε | X) = 𝞼² I,

where ε is the vector of regression errors, 𝞼² is a positive constant
and I is the N×N identity matrix.
The conditional expected value of the ridge estimator is

E[β̂_λ | X] = (XᵀX + λI)⁻¹ XᵀX β,

which is different from β unless λ = 0 (the OLS case).

The bias of the estimator is

Bias(β̂_λ | X) = E[β̂_λ | X] − β = −λ (XᵀX + λI)⁻¹ β.
Proof

We can write the ridge estimator as

β̂_λ = (XᵀX + λI)⁻¹ XᵀY = (XᵀX + λI)⁻¹ Xᵀ (Xβ + ε).

Therefore,

E[β̂_λ | X] = (XᵀX + λI)⁻¹ XᵀX β + (XᵀX + λI)⁻¹ Xᵀ E[ε | X] = (XᵀX + λI)⁻¹ XᵀX β.
Proof

The ridge estimator is unbiased, that is,

E[β̂_λ | X] = β,

if and only if

(XᵀX + λI)⁻¹ XᵀX = I,

where I is the identity matrix. But this is possible if and only if λ = 0, that is, if the ridge estimator
coincides with the OLS estimator.
Proof

The bias is:

Bias(β̂_λ | X) = E[β̂_λ | X] − β = (XᵀX + λI)⁻¹ XᵀX β − β = [(XᵀX + λI)⁻¹ XᵀX − I] β = −λ (XᵀX + λI)⁻¹ β.
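As a quick numerical check of these formulas, the sketch below (synthetic design matrix, true β, 𝞼 and λ all chosen for illustration) averages the ridge estimates over many simulated error vectors and compares the result with the theoretical expectation and bias.

```python
import numpy as np

rng = np.random.default_rng(3)

# Fixed design matrix X, true coefficients beta, noise level and λ (all illustrative).
n, p = 60, 4
X = rng.normal(size=(n, p))
beta = np.array([1.0, -2.0, 0.5, 3.0])
sigma, lam = 1.0, 5.0

ridge_inv = np.linalg.inv(X.T @ X + lam * np.eye(p))
H = ridge_inv @ X.T                           # maps y to the ridge estimate
expected = ridge_inv @ X.T @ X @ beta         # theoretical E[beta_ridge | X]
bias_theory = -lam * ridge_inv @ beta         # theoretical bias

# Monte Carlo: average the ridge estimator over many simulated samples of Y.
estimates = [H @ (X @ beta + rng.normal(scale=sigma, size=n)) for _ in range(20000)]
mc_mean = np.mean(estimates, axis=0)

print("Monte Carlo mean of the ridge estimator:", np.round(mc_mean, 3))
print("Theoretical conditional expectation:    ", np.round(expected, 3))
print("Theoretical bias:                       ", np.round(bias_theory, 3))
print("Monte Carlo bias:                       ", np.round(mc_mean - beta, 3))
```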
05
The Variance of the
ridge estimator
Importantly, the variance of the ridge estimator is always smaller than the variance
of the OLS estimator.

More precisely, the difference between the covariance matrix of the OLS estimator
and that of the ridge estimator,

Var(β̂_OLS | X) − Var(β̂_λ | X) = 𝞼² (XᵀX)⁻¹ − 𝞼² (XᵀX + λI)⁻¹ XᵀX (XᵀX + λI)⁻¹,

is positive definite (remember from the lecture on the Gauss-Markov theorem that
the covariance matrices of two estimators are compared by checking whether their
difference is positive definite).

Now, define the matrix

A = I + λ (XᵀX)⁻¹,

which is invertible. Since A XᵀX = XᵀX + λI, we can rewrite the covariance matrix of the ridge estimator
as follows:

Var(β̂_λ | X) = 𝞼² (XᵀX + λI)⁻¹ XᵀX (XᵀX + λI)⁻¹ = 𝞼² A⁻¹ (XᵀX)⁻¹ A⁻¹.

Result:

Var(β̂_OLS | X) − Var(β̂_λ | X) = 𝞼² A⁻¹ [ A (XᵀX)⁻¹ A − (XᵀX)⁻¹ ] A⁻¹ = 𝞼² A⁻¹ [ 2λ (XᵀX)⁻² + λ² (XᵀX)⁻³ ] A⁻¹,

which is positive definite whenever λ > 0.
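A small numerical check of this result, with an arbitrary synthetic design matrix and an illustrative λ: the eigenvalues of the covariance difference should all be positive.

```python
import numpy as np

rng = np.random.default_rng(7)

# Arbitrary design matrix and settings (illustrative only).
n, p = 80, 5
X = rng.normal(size=(n, p))
sigma2, lam = 1.0, 2.0

XtX = X.T @ X
var_ols = sigma2 * np.linalg.inv(XtX)
M = np.linalg.inv(XtX + lam * np.eye(p))
var_ridge = sigma2 * M @ XtX @ M

# The difference should be positive definite: all eigenvalues > 0.
diff = var_ols - var_ridge
print("Eigenvalues of Var(OLS) - Var(ridge):",
      np.round(np.linalg.eigvalsh(diff), 6))
```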
06
the MSE of the Ridge and
OLS estimators
Mean Squared Error (MSE) is a crucial metric for evaluating the performance of
regression models. It measures the average squared difference between the
predicted and actual values. A lower MSE indicates a better fit of the model to the
data.

Ridge: the mean squared error (MSE) of the ridge estimator is equal to the trace of its
covariance matrix plus the squared norm of its bias (the so-called bias-variance
decomposition):

MSE(β̂_λ) = tr( Var(β̂_λ | X) ) + ‖Bias(β̂_λ | X)‖²

OLS: the OLS estimator has zero bias, so its MSE is

MSE(β̂_OLS) = tr( Var(β̂_OLS | X) ) = 𝞼² tr( (XᵀX)⁻¹ )
The difference between the two MSEs is

MSE(β̂_OLS) − MSE(β̂_λ) = tr( Var(β̂_OLS | X) − Var(β̂_λ | X) ) − ‖Bias(β̂_λ | X)‖².

We have a difference between two terms (1 and 2). We will prove that the matrix

Var(β̂_OLS | X) − Var(β̂_λ | X)

is positive definite.

Proof

The difference between the two covariance matrices is:

Var(β̂_OLS | X) − Var(β̂_λ | X) = 𝞼² A⁻¹ [ 2λ (XᵀX)⁻² + λ² (XᵀX)⁻³ ] A⁻¹.

Proof

If λ > 0, the latter matrix is positive definite because, for any v ≠ 0, setting w = A⁻¹ v ≠ 0, we have

vᵀ A⁻¹ [ 2λ (XᵀX)⁻² + λ² (XᵀX)⁻³ ] A⁻¹ v = 2λ wᵀ (XᵀX)⁻² w + λ² wᵀ (XᵀX)⁻³ w > 0,

because XᵀX and its inverse (and hence their powers) are positive definite.


The square of the bias (term 2) is also strictly positive. Therefore, the difference
between 1 and 2 could in principle be either positive or negative.

Whether the difference is positive or negative depends on the penalty parameter λ,
and it is always possible to find a value of λ such that the difference is positive.

Thus, there always exists a value of the penalty parameter such that the ridge
estimator has lower mean squared error than the OLS estimator.

This result is very important from both a practical and a theoretical standpoint.
Although, by the Gauss-Markov theorem, the OLS estimator has the lowest variance
(and hence the lowest MSE) among linear unbiased estimators, there exists a biased
estimator (a ridge estimator) whose MSE is lower than that of OLS.
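The existence of such a λ can be illustrated numerically. The sketch below (synthetic design matrix, true β and 𝞼² chosen for illustration) evaluates the theoretical MSE of the ridge estimator over a grid of λ values and compares it with the OLS MSE.

```python
import numpy as np

rng = np.random.default_rng(5)

# Illustrative setup: design matrix, true coefficients, noise variance.
n, p = 50, 5
X = rng.normal(size=(n, p))
beta = np.array([1.0, -1.0, 2.0, 0.5, -0.5])
sigma2 = 4.0

XtX = X.T @ X
mse_ols = sigma2 * np.trace(np.linalg.inv(XtX))   # trace of OLS covariance (zero bias)

def ridge_mse(lam):
    """Theoretical MSE of the ridge estimator: trace of covariance + squared bias."""
    M = np.linalg.inv(XtX + lam * np.eye(p))
    variance = sigma2 * np.trace(M @ XtX @ M)
    bias = -lam * M @ beta
    return variance + bias @ bias

print(f"OLS MSE: {mse_ols:.4f}")
for lam in [0.1, 1.0, 5.0, 20.0, 100.0]:
    print(f"lambda = {lam:>5}: ridge MSE = {ridge_mse(lam):.4f}")
```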
The plot shows that, compared to the coefficients of the Ridge Regression
model, those of the OLS model are larger in magnitude and have a
wider range. As a result, the OLS model has higher variance and is more
sensitive to noise in the data than the Ridge Regression model.

● OLS Model: The higher MSE of the OLS model (0.13) indicates that it
has a relatively higher overall variance compared to the Ridge
Regression model.
● Ridge Regression Model: The lower MSE of the Ridge Regression
model (0.09) suggests that it has a lower overall variance compared
to the OLS model.
Bias-Variance Tradeoff
● OLS: tends to have lower bias but higher variance, especially when dealing with
multicollinearity. It might overfit the training data and perform poorly on new, unseen data.
● Ridge: introduces a regularization term that helps to reduce the variance of the model,
making it more stable and better at generalizing to new data.

Effect on Coefficient Estimates
● OLS: provides unbiased estimates, but they might be highly sensitive to outliers and
multicollinearity.
● Ridge: shrinks the coefficient estimates, reducing their variance. This can lead to a small
bias but often improves the overall predictive performance of the model.
Model Complexity
● OLS: tends to result in more complex models with larger coefficients, especially when
dealing with multicollinearity.
● Ridge: encourages simpler models by penalizing large coefficients, which can enhance
model interpretability and generalization.

Selection of the Regularization Parameter λ
● OLS: does not have a regularization parameter, so the model is solely based on
minimizing the sum of squared errors.
● Ridge: introduces a hyperparameter (λ) that controls the strength of the regularization.
The choice of λ affects the tradeoff between fitting the training data and regularization.
07
application domain of
ridge regression
Ridge regression is a regression method that is useful in a wide range of
applications where there is a strong correlation between the explanatory variables.
It is particularly useful when the number of explanatory variables is greater than the
number of observations.

Here are some examples of applications of ridge regression:

● Stock price prediction


● Patient classification
● Health models
● Data analysis
Here are some specific examples of applications of ridge regression:

● Stock price prediction: Ridge regression can be used to predict future stock prices
using historical data on stock prices, corporate profits, and other economic
indicators.
● Patient classification: Ridge regression can be used to classify patients into different
risk categories using data on patient symptoms, medical test results, and other
medical information.
● Health models: Ridge regression can be used to develop health models that can
predict the risk of developing specific diseases, such as heart disease or diabetes.
● Data analysis: Ridge regression can be used to analyze large-scale data to discover
relationships between variables.

In general, ridge regression is a useful method in any situation where there is a strong
correlation between the explanatory variables and where the number of explanatory
variables is greater than the number of observations.
Ridge Regression: An Example Using Cookies

In this example, we will use ridge regression to predict the sugar content of unbaked cookies. The data
consists of 40 cookies for which the near-infrared absorption spectrum and sugar content have been
measured.

Correlation between Explanatory Variables

The first step in the analysis is to check the correlation between the explanatory variables. In this example,
the explanatory variables are the 700 wavelengths measured in the absorption spectrum.

The scatter plot above shows the correlation between the wavelengths. It can be seen that the explanatory
variables are strongly correlated with each other. Indeed, the average correlation between the wavelengths
is 0.7. This strong correlation can lead to overfitting of the ordinary least squares regression.
Selecting the Regularization Parameter

The next step is to choose the value of the regularization parameter. The value of the
regularization parameter controls the amount of penalty applied to the coefficients. A higher
value of the regularization parameter will lead to a larger reduction in the size of the
coefficients.

Cross-validation is used to choose the value of the regularization parameter. Cross-validation
consists of dividing the data sample into several subsamples. Ridge regression is estimated
on all but one of the subsamples, and predictions are computed on the held-out subsample;
this is repeated so that every subsample is used once for validation. The value of the
regularization parameter that minimizes the average prediction error is chosen as the
optimal value.

In this example, the optimal value of the regularization parameter is 0.206.


Predictions

The predictions are calculated on the validation sample. The average prediction error is 4.95. This
error is much lower than that of ordinary least squares regression, which is 4304.

Interpretation

The improvement in prediction error is due to the fact that ridge regression has reduced the size of
the coefficients. Indeed, the coefficients of ridge regression are much smaller than those of ordinary
least squares regression, so the predictions depend less heavily on any single explanatory variable
when explaining the sugar content.
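To show what such a workflow can look like in code, here is a minimal sketch on synthetic data that mimics the setting (40 samples, 700 highly correlated "wavelength" features); it is not the actual cookie dataset, and the λ grid, train/test split and all numbers are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(11)

# Synthetic stand-in for the cookie data: 40 samples, 700 highly correlated
# "wavelength" features driven by a few underlying factors.
n, p = 40, 700
latent = rng.normal(size=(n, 5))
X = latent @ rng.normal(size=(5, p)) + 0.05 * rng.normal(size=(n, p))
y = latent @ rng.normal(size=5) + 0.1 * rng.normal(size=n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Choose λ by cross-validation over an illustrative grid, then predict.
ridge = RidgeCV(alphas=np.logspace(-3, 3, 25)).fit(X_train, y_train)
ols = LinearRegression().fit(X_train, y_train)   # p >> n, so OLS overfits badly

print("Selected lambda:", ridge.alpha_)
print("Ridge test MSE: ", mean_squared_error(y_test, ridge.predict(X_test)))
print("OLS test MSE:   ", mean_squared_error(y_test, ols.predict(X_test)))
```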
Conclusion

Ridge regression is an effective method for predicting a variable from a set of
highly correlated explanatory variables. In the cookie example, ridge
regression was able to significantly improve the accuracy of predictions.

Overall, ridge regression is a useful tool for handling multicollinearity and
improving the generalizability of prediction models.
