INTRODUCTION
There are various techniques to check the accuracy of different kinds of problems. In the case of classification problems, we use the confusion matrix, F1-score, precision, recall, etc. For regression problems, R-Squared and Adjusted R-Squared are the key techniques for checking accuracy. We will understand each in detail in the subsequent sections.
R-SQUARED
The formula for R-Squared is:
R2 = 1 − (SSres / SStot)
Where,
SSres = residual sum of squares (squared differences between the actual and predicted values)
SStot = total sum of squares (squared differences between the actual values and their average)
The blue dots in the graph are the actual points. The double-ended arrow between each blue dot and the diagonal line (the best fit line) shows the difference between the predicted and the actual point. This is the error, or residual. The summation of the squares of all these differences between the actual and the predicted points is what we call SSres:
SSres = ∑(yi − ŷi)²
In the above figure, instead of the best fit line, the average output line is taken. The blue dots in the graph are again the actual points. The double-ended arrow between each blue dot and the average output line gives the difference between the average prediction and the actual point. The summation of the squares of all these differences is what we call SStot:
SStot = ∑(yi − ȳ)²
The logic behind this is that the error in SStot will generally be higher, because the average line is a crude fit. The error in SSres for the best fit line will be comparatively lower than SStot.
Therefore, the ratio SSres / SStot will be a small value, and subtracting it from 1 gives us a value somewhere between 0 and 1.
If the R2 value is near 1, then our best fit line has fit the data quite well.
But wait! Can we encounter a scenario where the R2 value is less than 0?
Yes, the value of R2 can be less than 0 in cases where the output of the best fit line is worse than the average output line. That means SSres > SStot.
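The definitions above can be sketched in a few lines of Python (a minimal illustration; the sample values below are made up for demonstration):

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R2 = 1 - SSres/SStot, computed exactly as described above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # errors vs. the best fit line
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # errors vs. the average output line
    return 1.0 - ss_res / ss_tot

y = [1.0, 2.0, 3.0, 4.0]
good = [1.1, 1.9, 3.2, 3.8]   # predictions close to the actual points -> R2 near 1
bad = [4.0, 1.0, 4.0, 1.0]    # predictions worse than the average line -> R2 below 0
print(r_squared(y, good))
print(r_squared(y, bad))
```

Note that the `bad` predictions produce a negative R2, exactly the SSres > SStot scenario described above.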
DRAWBACK OF R2
There is a drawback to R2 which often makes it difficult to judge the accuracy of the model.
Let’s say we have a simple linear regression model with one independent feature and the equation y = ax + b. Now we add a few more independent features to the model. Our new model is a multiple linear regression with an equation somewhat like y = ax1 + bx2 + cx3 + d.
Every time we add an independent feature, the linear regression algorithm assigns a coefficient to that feature. For example, the coefficients a, b, c in the above equation were added when the features x1, x2, x3 were introduced to the model.
The linear regression algorithm assigns the coefficients in such a way that the value of SSres never increases when we add a new independent feature.
This sounds perfect, right? Not really!
As we increase the number of independent features in the model, the R2 value will keep increasing even when the added features are not correlated with the dependent variable.
Chances are a feature we include is a complete one-off. It might have no relation to the target dependent variable, yet it still receives some coefficient value that contributes to the output, because the linear regression algorithm assigns a coefficient to every feature present in the model.
For example, suppose we are predicting the age of students, and one of the features in our model is the students’ contact number. This feature clearly has no correlation with age, but it can still receive some coefficient value contributing to the output, thereby increasing the overall R2 of the model.
This means that an increase in R2 does not necessarily reflect any correlation between the independent features and the dependent variable. R2 simply increases (or at least never decreases) whenever we add a new feature to the model.
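This behaviour is easy to demonstrate with an ordinary least-squares fit. The sketch below (a minimal illustration using NumPy, with synthetic data) fits the model once with a genuine feature and once with an extra pure-noise feature; R2 still does not go down:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
noise_feature = rng.normal(size=n)              # pure noise, unrelated to y
y = 3.0 * x1 + rng.normal(scale=0.5, size=n)    # y truly depends only on x1

def fit_r2(features, y):
    """Least-squares fit with an intercept; returns the R2 of the fit."""
    X = np.column_stack([np.ones(len(y)), *features])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    ss_res = np.sum((y - X @ coef) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

r2_one = fit_r2([x1], y)
r2_two = fit_r2([x1, noise_feature], y)  # adding noise still nudges R2 up
print(r2_one, r2_two)
```

The second fit always reports an R2 at least as high as the first, even though the extra feature carries no information about y.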
ADJUSTED R-SQUARED
The formula for Adjusted R-Squared is as follows:
Adjusted R2 = 1 − (1 − R2) × (N − 1) / (N − P − 1)
Where,
R2 = R-Squared value of the model
P = number of independent features
N = sample size of the dataset
Adjusted R2 has a penalizing factor. It penalizes the addition of independent variables that don’t contribute to the model in any way or are not correlated with the dependent variable.
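The formula translates directly into code. A minimal sketch (the example numbers here are arbitrary, chosen only to show the penalty):

```python
def adjusted_r2(r2, n, p):
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# With 50 samples and 3 features, an R2 of 0.90 is adjusted slightly downward.
print(adjusted_r2(0.90, n=50, p=3))
```

Note that for any p > 0 the adjusted value is strictly below the raw R2 (when R2 < 1), which is exactly the penalizing factor at work.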
CASE – I
Let’s say we increase the number of independent features (P) for the model. These features
are not really contributing to the model much or are not correlated to the dependent
variable.
Let’s substitute this logic into the Adjusted R-Squared equation. The value of N − P − 1 decreases as P increases, so the factor (N − 1) / (N − P − 1) increases.
Now there is one thing we need to understand here. As we add new features, the R-Squared value will increase, but this increase will be insignificant compared to the growth of (N − 1) / (N − P − 1), because the newly added features are not correlated with the dependent variable. So (1 − R2) will not decrease much.
The product of (N − 1) / (N − P − 1) and (1 − R2) will therefore not decrease either:
Adjusted R2 = 1 − (1 − R2) × (N − 1) / (N − P − 1) = 1 − (increased value less than 1) = smaller value
This is how Adjusted R-Squared penalizes when the features are not correlated to the
dependent variable.
CASE – II
Now let’s say we are adding features which are strongly correlated with the dependent variable. In this case R2 will be high and will overwhelm the (N − 1) / (N − P − 1) factor.
So (1 − R2) will be a small value, and multiplying it by (N − 1) / (N − P − 1) still gives a small value. Subtracting this from 1 gives an Adjusted R-Squared that is higher than in the previous case:
Adjusted R2 = 1 − (1 − R2) × (N − 1) / (N − P − 1) = 1 − (small value) = larger value
So this signifies that when the independent features are correlated with the dependent variable, the Adjusted R-Squared value goes up.
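Both cases can be checked with made-up numbers (N = 50 here; the R2 values and feature counts are hypothetical, chosen only to illustrate the two cases):

```python
def adjusted_r2(r2, n, p):
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

n = 50
base = adjusted_r2(0.80, n, p=2)    # starting model: 2 features, R2 = 0.80
case1 = adjusted_r2(0.81, n, p=7)   # Case I: 5 junk features, tiny R2 gain
case2 = adjusted_r2(0.95, n, p=3)   # Case II: 1 informative feature, big gain
print(base, case1, case2)
```

The five uninformative features in Case I push Adjusted R2 below the starting model despite the slightly higher raw R2, while the single informative feature in Case II raises it.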
CONCLUSION
R-Squared measures how much better the best fit line is than the average output line, but it never decreases when a feature is added, even a useless one. Adjusted R-Squared corrects this by penalizing features that do not contribute to the model, making it the more reliable accuracy check for regression problems.
THANK YOU