Ridge Regression
Developed by :
● Houda El Amri
● Oumayma Ben AbdEnnadher
● Safa Kochat
● Tasnim Lahmar
Supervised by :
Mr. Anis Zaglaoui
Table of contents
01 Introduction to Ridge Regression
02 The benefit of the ridge estimator over OLS
03 Contribution of the Penalty Parameter
04 How to choose the penalty parameter
05 The bias and the variance of the ridge estimator
06 The MSE of the Ridge and OLS estimators
07 Application domain of ridge regression
01
Introduction to Ridge
Regression
Introduction to Ridge Regression
Ridge regression is a linear regression method that adds a quadratic
penalty on the size of the regression coefficients. This has the effect of
shrinking the coefficients, which is useful when the
independent variables are highly correlated.
History of Ridge Regression
We assume only that the X's and Y have been centered, so that we have no need for a constant
term in the regression.
Hoerl and Kennard (1970) proposed ridge regression as a remedy for the potential instability
of the least squares (LS) estimator, which arises when the predictors are highly correlated.
Therefore, ridge regression puts further constraints on the parameters, βj's, in the linear model. In
this case, what we are doing is that instead of just minimizing the residual sum of squares we also
have a penalty term on the β's. This penalty term is λ (a pre-chosen constant) times the squared
norm of the β vector. This means that if the βj's take on large values, the optimization function is
penalized. We would prefer smaller βj's, or βj's close to zero, to keep the penalty
term small.
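The penalized objective described above can be sketched in a few lines of numpy, using the closed-form solution β̂ = (XᵀX + λI)⁻¹Xᵀy; the synthetic data below (two nearly collinear predictors) are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical centered data: two highly correlated predictors.
n = 50
x1 = rng.normal(size=n)
X = np.column_stack([x1, x1 + 0.01 * rng.normal(size=n)])
y = X @ np.array([1.0, 1.0]) + 0.1 * rng.normal(size=n)
X = X - X.mean(axis=0)
y = y - y.mean()

def ridge(X, y, lam):
    """Minimizer of ||y - X b||^2 + lam * ||b||^2, i.e. (X'X + lam I)^{-1} X'y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

b_ols = ridge(X, y, 0.0)    # lam = 0 recovers ordinary least squares
b_ridge = ridge(X, y, 1.0)  # lam > 0 shrinks the coefficient vector

print(np.linalg.norm(b_ridge) < np.linalg.norm(b_ols))  # True
```

With nearly collinear columns, the OLS coefficients can be large and opposite-signed while the ridge coefficients stay moderate; that is the shrinkage the penalty term buys.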
02
The Benefit of the Ridge Estimator over OLS
Multicollinearity Mitigation:
● Ridge Estimator: Effectively addresses multicollinearity, a situation where predictor variables are
highly correlated. The regularization term in Ridge Regression helps to stabilize coefficient estimates
even when multicollinearity is present.
● OLS: OLS is highly sensitive to multicollinearity; in its presence, it may produce unstable and
inflated coefficient estimates, making interpretations less reliable.
Variance Reduction:
● Ridge Estimator: Reduces the variance of coefficient estimates by introducing a regularization term.
This is particularly valuable when dealing with predictors that have high variance.
● OLS: Tends to give higher variance for coefficients corresponding to predictors with higher variance,
making the model less robust and potentially leading to overfitting.
Improved Generalization:
● Ridge Estimator: Helps to improve the generalization performance of the model by preventing
overfitting. The shrinkage of coefficients toward zero results in a more balanced model that is less
sensitive to the idiosyncrasies of the training data.
● OLS: May overfit the training data, especially when dealing with a large number of predictors relative
to the number of observations, potentially leading to poor performance on new, unseen data.
Robustness to Outliers:
● Ridge Estimator: Provides some robustness to outliers, as the regularization term limits
the influence of extreme data points on coefficient estimates.
● OLS: Sensitive to outliers, and a single outlier can have a disproportionate impact on the
estimates, potentially leading to biased results.
Conclusion :
Ridge shrinks the ordinary least squares (OLS) coefficient
estimates. This means that the estimated parameters are
pushed towards zero to improve their performance on fresh data
sets. It allows you to employ flexible models while avoiding
overfitting.
03
Contribution of the Penalty Parameter
● The larger we make λ, the closer the slope gets asymptotically to 0, and our predictions for Size
become less sensitive to Weight.
[Plots of the fitted line for λ = 10, λ = 1000 and λ = 100000]
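The shrinkage of the slope can be sketched for a single predictor, where the ridge slope has the closed form Σxy / (Σx² + λ); the Size/Weight data here are synthetic stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for the Size/Weight example (both centered).
weight = rng.normal(size=30)
size = 2.0 * weight + rng.normal(size=30)
weight = weight - weight.mean()
size = size - size.mean()

def ridge_slope(x, y, lam):
    # One-predictor ridge slope: sum(x*y) / (sum(x^2) + lam)
    return (x @ y) / (x @ x + lam)

for lam in [0, 10, 1000, 100000]:
    print(lam, ridge_slope(weight, size, lam))
# The slope shrinks monotonically toward 0 as lam grows.
```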
So how do we decide what value to give λ ?
If λ is too low, the penalty has little effect and the model can still overfit;
if λ is too high, the model can become overly simple and fail to capture
the complexity of the data adequately, leading to underfitting. Therefore,
the choice of λ is a balancing act: it should be large enough to regularize
the effect of the predictor variables, but not so large that the model loses
its predictive accuracy.
The basic idea behind cross-validation is to use different parts of the dataset for both
training and testing, allowing for a more robust estimation of the model's
performance. This is particularly important to avoid issues such as overfitting, where
a model performs well on the training data but poorly on new, unseen data.
There are various types of cross-validation, and one of the most commonly used
methods is k-fold cross-validation.
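The k-fold idea can be sketched with a plain loop: split the data into k folds, hold each fold out once, and average the validation errors. The data set and λ grid below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data set for the sketch.
n, p = 60, 8
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

def ridge_fit(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def kfold_mse(X, y, lam, k=5):
    """Average validation MSE over k folds: each fold is held out once."""
    folds = np.array_split(np.arange(len(y)), k)
    errors = []
    for val_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), val_idx)
        b = ridge_fit(X[train_idx], y[train_idx], lam)
        resid = y[val_idx] - X[val_idx] @ b
        errors.append(np.mean(resid ** 2))
    return np.mean(errors)

# Pick the lam with the lowest cross-validated error (grid is arbitrary).
scores = {lam: kfold_mse(X, y, lam) for lam in [0.01, 0.1, 1.0, 10.0]}
best_lam = min(scores, key=scores.get)
print(best_lam, scores[best_lam])
```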
04
How to choose
the penalty
parameter
Selecting the penalty parameter (often denoted as α) for Ridge
Regression involves choosing a value that strikes a balance between
fitting the training data well and preventing overfitting. The penalty
parameter controls the amount of regularization applied to the model,
influencing the shrinkage of the coefficients towards zero.
Selecting the penalty parameter α for Ridge Regression involves balancing model fit and simplicity. Here's a
concise guide:
1. Cross-Validation:
- Objective: Choose α that minimizes the error on a validation set in cross-validation.
2. Grid Search:
- Objective: Systematically test a grid of α values; select the one with the best cross-validated performance.
3. Information Criteria:
- Objective: Use criteria like AIC or BIC; choose α that minimizes the selected criterion.
4. Domain Knowledge:
- Objective: Incorporate any prior insights about expected coefficient sizes; align α with domain knowledge.
Consider the specific characteristics of your data and modeling goals when choosing α. Cross-validation is often a
reliable choice for unbiased performance estimation on unseen data.
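The cross-validation and grid-search approaches are one-liners in scikit-learn (assuming scikit-learn is available; the α grid below is an arbitrary choice for illustration):

```python
import numpy as np
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(3)
X = rng.normal(size=(80, 10))
y = X @ rng.normal(size=10) + rng.normal(size=80)

alphas = np.logspace(-3, 3, 13)  # hypothetical grid of candidate penalties

# Grid search with built-in cross-validation, specialized for ridge.
model = RidgeCV(alphas=alphas, cv=5).fit(X, y)
print(model.alpha_)  # selected penalty parameter

# Equivalent generic grid search with 5-fold cross-validation.
search = GridSearchCV(Ridge(), {"alpha": alphas}, cv=5).fit(X, y)
print(search.best_params_["alpha"])
```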
Conclusion :
The choice of the penalty parameter depends on the specific
characteristics of the dataset and the modeling goals. Cross-validation is a
commonly recommended approach as it provides an unbiased estimate of
the model's performance on unseen data. However, other methods can be
valuable depending on the context and available information.
05
The bias of the ridge
estimator
In this section we derive the bias and
variance of the ridge estimator under
the commonly made assumption that,
conditional on X, the errors of the
regression have zero mean and
constant variance σ² and are
uncorrelated:

E[ε | X] = 0,  Var(ε | X) = σ²I.

Therefore, writing the ridge estimator as β̂(λ) = (XᵀX + λI)⁻¹Xᵀy,

E[β̂(λ) | X] = (XᵀX + λI)⁻¹XᵀXβ.

Proof
Substituting y = Xβ + ε gives β̂(λ) = (XᵀX + λI)⁻¹XᵀXβ + (XᵀX + λI)⁻¹Xᵀε, and the second term has zero conditional mean.

The bias is:

Bias[β̂(λ)] = E[β̂(λ) | X] − β = −λ(XᵀX + λI)⁻¹β,

which is nonzero whenever λ > 0 and β ≠ 0.
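The standard bias formula for the ridge estimator, −λ(XᵀX + λI)⁻¹β, can be checked numerically against the definition E[β̂(λ) | X] − β = (XᵀX + λI)⁻¹XᵀXβ − β; the X and β below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, lam = 40, 3, 2.0
X = rng.normal(size=(n, p))
beta = np.array([1.0, -2.0, 0.5])  # arbitrary true coefficients
A = np.linalg.inv(X.T @ X + lam * np.eye(p))

# E[ridge estimator | X] = (X'X + lam I)^{-1} X'X beta, so the bias is:
bias_direct = A @ (X.T @ X) @ beta - beta
bias_formula = -lam * A @ beta

print(np.allclose(bias_direct, bias_formula))  # True
```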
05
The Variance of the
ridge estimator
Importantly, the variance of the ridge estimator is always smaller than the variance
of the OLS estimator.
More precisely, the difference between the covariance matrix of the OLS estimator,

Var(β̂_OLS | X) = σ²(XᵀX)⁻¹,

and that of the ridge estimator,

Var(β̂(λ) | X) = σ²(XᵀX + λI)⁻¹XᵀX(XᵀX + λI)⁻¹,

is positive definite (remember from the lecture on the Gauss-Markov theorem that
the covariance matrices of two estimators are compared by checking whether their
difference is positive definite).
Now, define the matrix

W = (I + λ(XᵀX)⁻¹)⁻¹,

which is invertible. Then, we can rewrite the covariance matrix of the ridge estimator
as follows:

Var(β̂(λ) | X) = W Var(β̂_OLS | X) Wᵀ.

Result:
For λ > 0 the eigenvalues of W lie strictly between 0 and 1, so the difference
Var(β̂_OLS | X) − Var(β̂(λ) | X) is positive definite.
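This positive-definiteness claim can be verified numerically by checking that all eigenvalues of the difference of the two covariance matrices are strictly positive; X, σ² and λ below are arbitrary illustration values:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, lam, sigma2 = 40, 3, 1.5, 1.0  # arbitrary illustration values
X = rng.normal(size=(n, p))
XtX = X.T @ X
A = np.linalg.inv(XtX + lam * np.eye(p))

var_ols = sigma2 * np.linalg.inv(XtX)
var_ridge = sigma2 * A @ XtX @ A

# All eigenvalues of the difference should be strictly positive.
eigvals = np.linalg.eigvalsh(var_ols - var_ridge)
print(np.all(eigvals > 0))  # True
```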
06
The MSE of the Ridge and
OLS estimators
Mean Squared Error (MSE) is a crucial metric for evaluating the performance of
regression models. It measures the average squared difference between the
predicted and actual values. A lower MSE indicates a better fit of the model to the
data.
Ridge: the mean squared error (MSE) of the ridge estimator is equal to the trace of
its covariance matrix plus the squared norm of its bias (the so-called bias-variance
decomposition):

MSE[β̂(λ)] = tr(Var(β̂(λ) | X)) + ‖Bias[β̂(λ)]‖².

OLS: the OLS estimator has zero bias, so its MSE is

MSE[β̂_OLS] = tr(Var(β̂_OLS | X)) = σ² tr((XᵀX)⁻¹).
The difference between the two MSEs is

MSE[β̂_OLS] − MSE[β̂(λ)] = tr(Var(β̂_OLS | X) − Var(β̂(λ) | X)) − ‖Bias[β̂(λ)]‖².

We have a difference between two terms (1 and 2). We will prove that the matrix

Var(β̂_OLS | X) − Var(β̂(λ) | X)

is positive definite.
Proof
If λ > 0, the latter matrix is positive definite because for any v ≠ 0 we have

vᵀ(Var(β̂_OLS | X) − Var(β̂(λ) | X))v > 0.

And the squared norm of the bias, ‖λ(XᵀX + λI)⁻¹β‖², is of order λ², while the
reduction in variance is of order λ, so for sufficiently small λ > 0 the first
term dominates. Thus, there always exists a value of the penalty parameter such
that the ridge estimator has lower mean squared error than the OLS estimator.
This result is very important from both a practical and a theoretical standpoint.
Although, by the Gauss-Markov theorem, the OLS estimator has the lowest variance
(and the lowest MSE) among the estimators that are unbiased, there exists a biased
estimator (a ridge estimator) whose MSE is lower than that of OLS.
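The existence claim can be illustrated by evaluating the theoretical MSE (trace of the covariance plus squared norm of the bias) over a grid of λ values; X, β and σ² below are arbitrary illustration values:

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, sigma2 = 40, 3, 1.0  # arbitrary illustration values
X = rng.normal(size=(n, p))
beta = np.array([1.0, -1.0, 0.5])
XtX = X.T @ X

def mse_ridge(lam):
    """Theoretical MSE: trace of the covariance plus squared norm of the bias."""
    A = np.linalg.inv(XtX + lam * np.eye(p))
    var = sigma2 * A @ XtX @ A
    bias = -lam * A @ beta
    return np.trace(var) + bias @ bias

mse_ols = mse_ridge(0.0)  # lam = 0: zero bias, OLS covariance
lams = np.linspace(0.0, 5.0, 101)
print(min(mse_ridge(l) for l in lams) < mse_ols)  # True
```

The minimizing λ is small here because the predictors are uncorrelated; with strong multicollinearity the gain from ridge is much larger.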
The plot shows that, compared to the coefficients of the Ridge Regression
model, those of the OLS model are bigger in magnitude and have a
wider range. As a result, it can be concluded that the Ridge Regression
model outperforms the OLS model in terms of variance and
sensitivity to data noise.
● OLS Model: The higher MSE of the OLS model (0.13) indicates that it
has a relatively higher overall variance compared to the Ridge
Regression model.
● Ridge Regression Model: The lower MSE of the Ridge Regression
model (0.09) suggests that it has a lower overall variance compared
to the OLS model.
Bias-Variance Tradeoff
● OLS: Tends to have lower bias but higher variance, especially when dealing with
multicollinearity. It might overfit the training data and perform poorly on new,
unseen data.
● Ridge: Introduces a regularization term that helps to reduce the variance of the
model, making it more stable and better at generalizing to new data.
Effect on Coefficient Estimates
● OLS: Provides unbiased estimates, but they might be highly sensitive to outliers
and multicollinearity.
● Ridge: Shrinks the coefficient estimates, reducing their variance. This can lead to a
small bias but often improves the overall predictive performance of the model.
07
Application domain of ridge regression
● Stock price prediction: Ridge regression can be used to predict future stock prices
using historical data on stock prices, corporate profits, and other economic
indicators.
● Patient classification: Ridge regression can be used to classify patients into different
risk categories using data on patient symptoms, medical test results, and other
medical information.
● Health models: Ridge regression can be used to develop health models that can
predict the risk of developing specific diseases, such as heart disease or diabetes.
● Data analysis: Ridge regression can be used to analyze large-scale data to discover
relationships between variables.
In general, ridge regression is a useful method in any situation where there is a strong
correlation between the explanatory variables and where the number of explanatory
variables is greater than the number of observations.
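The p > n case mentioned above can be illustrated directly: XᵀX is singular, so the OLS normal equations have no unique solution, while the ridge system stays solvable; the dimensions below are chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)

# More predictors than observations (p > n): X'X is singular,
# so the OLS normal equations have no unique solution.
n, p = 20, 100
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))  # at most n = 20, far below p = 100

# Adding lam * I makes the matrix invertible: ridge stays well-defined.
lam = 1.0
b_ridge = np.linalg.solve(XtX + lam * np.eye(p), X.T @ y)
print(b_ridge.shape)  # (100,)
```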
Ridge Regression: An Example Using Cookies
In this example, we will use ridge regression to predict the sugar content of unbaked cookies. The data
consists of 40 cookies for which the near-infrared absorption spectrum and sugar content have been
measured.
The first step in the analysis is to check the correlation between the explanatory variables. In this example,
the explanatory variables are the 700 wavelengths measured in the absorption spectrum.
The scatter plot above shows the correlation between the wavelengths. It can be seen that the explanatory
variables are strongly correlated with each other. Indeed, the average correlation between the wavelengths
is 0.7. This strong correlation can lead to overfitting of the ordinary least squares regression.
Selecting the Regularization Parameter
The next step is to choose the value of the regularization parameter. The value of the
regularization parameter controls the amount of penalty applied to the coefficients. A higher
value of the regularization parameter will lead to a larger reduction in the size of the
coefficients.
The predictions are calculated on the validation sample. The average prediction error is 4.95. This
error is much lower than that of ordinary least squares regression, which is 4304.
Interpretation
The improvement in prediction error is due to the fact that ridge regression has reduced the size of
the coefficients. Indeed, the coefficients of ridge regression are much smaller than those of ordinary
least squares regression. This means that the explanatory variables are less important for explaining
the sugar content.
Conclusion