
linkedin.com/in/vikrantkumar95

Linear Regression
Clearly Explained

What Is Linear Regression? (1/2)
Technical definition: Linear Regression is a statistical method
used to model a relationship between one dependent variable and
one or more independent variables.

Suppose we had a dataset that showed the Crop Yield of farmers in different regions and the Rainfall received by each region.

Here our goals are:

To see how strong the relationship is between Rainfall (the independent variable) and Crop Yield (the dependent variable)
To see if we can predict the expected yield of a crop at certain levels of rainfall

What Is Linear Regression? (2/2)
Simply put, Linear Regression is a way to predict one thing ( Crop Yield
or Sales) based on the linear relationships it has with other things
(Time or Rainfall). It's like drawing a straight line through a scatter of
points to show the general trend.

[Figure: scatter of points with a straight trend line; the slope of the line is B = tan θ]

The above line would have an equation with Crop Yield (the dependent
variable we are trying to predict) as Y-axis and Rainfall (the
independent variable that will be the predictor) as X-axis. The
equation would be
Crop Yield = A + B (Rainfall)

A is the Y-intercept and B is the slope of the line. The higher the value of B, the steeper the line, i.e. the more sensitive the Crop Yield is to changes in Rainfall.
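To make the equation concrete, here is a minimal sketch in Python. The intercept A and slope B below are made-up illustrative values, not coefficients fitted from real data:

```python
def predict_crop_yield(rainfall, a=2.0, b=0.5):
    """Predict crop yield using the line Y = A + B * X.

    a (intercept) and b (slope) are hypothetical values for illustration.
    """
    return a + b * rainfall

# A larger b means the predicted yield is more sensitive to rainfall.
print(predict_crop_yield(100))  # 2.0 + 0.5 * 100 = 52.0
```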

So How Do We Fit A Line? (1/2)
[Figure: Graphs 1, 2 and 3, showing the line being rotated and shifted toward the best fit]

In Linear Regression, a line is fit by minimizing the Sum of Squared Residuals. Here, a Residual is the difference between the Predicted Crop Yield (Y predicted) and the actual Crop Yield (Y actual).

We start with an arbitrary line in Graph 1 and calculate the Sum of Squared Residuals. We then rotate the line (i.e. change the slope B) and adjust the intercept (i.e. A) to get Graph 2. We see that the Sum of Squared Residuals has reduced. We then adjust the line further to get Graph 3, which has the minimum Sum of Squared Residuals. This is the best fit line.
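The rotate-and-adjust idea can be sketched as a small grid search over candidate slopes and intercepts, keeping whichever line has the smallest Sum of Squared Residuals. The data points and the candidate ranges below are made up for illustration:

```python
# Hypothetical (rainfall, crop_yield) observations.
data = [(50, 30), (80, 45), (100, 52), (120, 65), (150, 78)]

def sum_squared_residuals(a, b):
    """Sum of squared gaps between actual and predicted crop yield."""
    return sum((y - (a + b * x)) ** 2 for x, y in data)

# Try a coarse grid of intercepts (A) and slopes (B), like rotating
# and shifting the line from Graph 1 to Graph 3, and keep the best.
best = min(
    ((a, 0.1 * k) for a in range(0, 11) for k in range(0, 11)),
    key=lambda ab: sum_squared_residuals(*ab),
)
print("best A, B:", best, "SSR:", sum_squared_residuals(*best))
```

Real statistical software solves for the minimum directly rather than searching a grid, but the objective being minimized is the same.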

So How Do We Fit A Line? (2/2)
As we saw in the previous slide, Linear Regression involves fitting a line by minimizing the Sum of Squared Residuals, which is why it's sometimes known as Least Squares Regression.

Different orientations of the line result in different Sums of Squared Residuals, with the goal being to achieve the minimum, which is the best fit.

Don't worry though, the actual process of arriving at the minimum Sum of Squared Residuals is handled by statistical software.

How Do We Tell How Good The Fit Is?
We fit a line by minimizing the squared residuals. However, how do we know if the best fit line actually captures the underlying relationship? It could just be a less poor fit amongst a bunch of poor fits. This is where R² comes in.

R² = (SST - SSR) / SST

Here:
SST is the Sum of Squares Total
SSR is the Sum of Squared Residuals

R² is a ratio between 0 and 1, with 0 implying that Rainfall (the independent variable) has no predictive power with respect to Crop Yield (the dependent variable), while 1 implies that Rainfall perfectly explains the variation in Crop Yield.

Don't worry, we'll be looking at exactly what this formula and its components are, and how they tell us how well a line has fit.

R² Explained (1/3)
We saw two terms in the R² formula - SST and SSR. We’ll take
a look at both here.

SST stands for Sum of Squares Total, which is the sum of the squared residuals around the mean of the dependent variable (Crop Yield). This basically means calculating the Sum of Squares (SS) around a horizontal fit line that passes through the mean of Crop Yield.

[Figure: horizontal line through the mean of Crop Yield, with squared deviations labelled SST]

We'll see in the next slide that this is essentially equivalent to shifting all the data points to the Y-axis and calculating the Sum of Squared Residuals around the mean of the Crop Yields.

R² Explained (2/3)
We can see below that calculating the Sum of Squared Residuals (SS) for either of the graphs gives the same result, which is the SST.

[Figure: shifting all the points horizontally to the Y-axis leaves the SST unchanged]

In simple terms, Sum of Squares Total (SST) measures the total variation in the dependent variable (the Crop Yield in this case) in a dataset. It's calculated by summing up the squares of the differences between each observed value and the overall mean of the dependent variable.

Variation around Mean = Var(mean) = SST / n

The average of the Sum of Squares Total is what we call the Variation of Crop Yield around its Mean.
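As a minimal sketch (the yield values are made up for illustration), SST and Var(mean) follow directly from the definitions above:

```python
# Hypothetical crop yields for illustration.
yields = [30, 45, 52, 65, 78]

mean_yield = sum(yields) / len(yields)            # the horizontal fit line
sst = sum((y - mean_yield) ** 2 for y in yields)  # Sum of Squares Total
var_mean = sst / len(yields)                      # variation around the mean

print(mean_yield, sst, var_mean)
```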

R² Explained (3/3)
The second term in the formula was SSR, the Sum of Squared Residuals. This is what we minimized to achieve our best fit. It's calculated by summing the squares of the differences between each observed value and its corresponding predicted value from the regression model.

[Figure: fitted regression line with squared residuals labelled SSR]

Essentially, SSR quantifies how much the data points deviate from the fitted regression line.

Variation around Fit = Var(fit) = SSR / n

The average of the Sum of Squared Residuals is what we call the Variation of Crop Yield around its Fit line.
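SSR can be sketched the same way, given a fitted line. The data points and the intercept/slope values below are hypothetical, chosen only to illustrate the calculation:

```python
# Hypothetical (rainfall, crop_yield) observations and a candidate line.
data = [(50, 30), (80, 45), (100, 52), (120, 65), (150, 78)]
a, b = 4.0, 0.5  # hypothetical intercept and slope of the fitted line

# Sum of Squared Residuals: squared gaps between actual and predicted yield.
ssr = sum((y - (a + b * x)) ** 2 for x, y in data)
var_fit = ssr / len(data)  # variation around the fitted line

print(ssr, var_fit)
```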
Calculating & Interpreting R² (1/2)
[Figures: variation around the mean (SST) vs variation around the fitted line (SSR)]

Suppose there are 10 data points (n) and the SST comes out to be
400. Then Var(mean) would be:
Var(mean) = 400 / 10 = 40
Now let's assume the SSR comes out to be 120. Then Var(fit)
would be:
Var(fit) = 120 / 10 = 12

We can see that the variation in the second graph, i.e. around the line fit by least squares, is less than the variation around the mean in the first graph. Therefore, we can say that some of the variation in the Crop Yield is explained by taking the Rainfall into consideration.

Calculating & Interpreting R² (2/2)
Let us look at the formula for R² again:

R² = Explained Variance / Total Variance

Here the Explained Variance is the difference between Var(mean) and Var(fit), and the Total Variance is the variance around the mean, Var(mean). Hence, based on the values previously, the calculation for R² would be:

R² = (40 - 12) / 40 = 0.7
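This arithmetic can be checked in a couple of lines, using the SST = 400, SSR = 120 and n = 10 values from the example above:

```python
n, sst, ssr = 10, 400, 120

var_mean = sst / n  # total variance, around the mean
var_fit = ssr / n   # variance around the fitted line

r_squared = (var_mean - var_fit) / var_mean
print(var_mean, var_fit, r_squared)  # 40.0 12.0 0.7
```

Note that dividing both terms by n cancels out, so this is the same as computing (SST - SSR) / SST directly.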

Here we get an R² of 0.7, or 70%. This means that there is a 70% reduction in variance when we take Rainfall into consideration. Or we can say that Rainfall explains 70% of the variation in Crop Yield. An R² of 0.7 would be considered a reasonably good Linear Regression model (assuming it's significant, something we'll see in later modules).

A perfect model that passes through all the data points would have zero variance around the fit and hence an R² = 1. Although that rarely happens in the real world and is usually a sign of overfitting.

Future Learnings
What we saw here was just an example of Simple Linear Regression (only one independent and one dependent variable). There are a few more things that we'll cover later which will broaden your understanding of Linear Regression models:

Multiple Linear Regression: Expanding from one independent variable to multiple variables.
Assumptions of Linear Regression: Discussing normality, homoscedasticity, independence, and linearity.
Interpreting Coefficients: Understanding what the regression coefficients mean.
Model Validation Techniques: Techniques like cross-validation and splitting data into training and test sets.
Diagnostics and Residual Analysis: Identifying issues like multicollinearity, autocorrelation, and influential outliers.
Model Improvement Strategies: Techniques like transformation and regularization.
Comparison with Non-Linear Models: Understanding when linear models are appropriate and when to consider non-linear alternatives.

Sample Code to Train and
Visualize a Simple Linear
Regression Model
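The code on the original slide was an image; here is a minimal sketch of what such code typically looks like, fitting a simple linear regression with the closed-form least-squares formulas on made-up rainfall/yield data. The plotting step is optional and is skipped if matplotlib isn't installed:

```python
# Hypothetical rainfall (mm) and crop yield data for illustration.
rainfall = [50, 80, 100, 120, 150]
crop_yield = [30, 45, 52, 65, 78]

n = len(rainfall)
mean_x = sum(rainfall) / n
mean_y = sum(crop_yield) / n

# Closed-form least-squares estimates: B = cov(x, y) / var(x), A = mean_y - B * mean_x.
sxx = sum((x - mean_x) ** 2 for x in rainfall)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(rainfall, crop_yield))
b = sxy / sxx
a = mean_y - b * mean_x

# Goodness of fit: R² = (SST - SSR) / SST.
sst = sum((y - mean_y) ** 2 for y in crop_yield)
ssr = sum((y - (a + b * x)) ** 2 for x, y in zip(rainfall, crop_yield))
r2 = (sst - ssr) / sst

print(f"Crop Yield = {a:.2f} + {b:.3f} * Rainfall, R² = {r2:.3f}")

# Optional visualization, skipped if matplotlib isn't available.
try:
    import matplotlib
    matplotlib.use("Agg")  # headless backend so this runs without a display
    import matplotlib.pyplot as plt

    plt.scatter(rainfall, crop_yield, label="observed")
    plt.plot(rainfall, [a + b * x for x in rainfall], label="best fit")
    plt.xlabel("Rainfall")
    plt.ylabel("Crop Yield")
    plt.legend()
    plt.savefig("linear_regression.png")
except ImportError:
    pass
```

In practice you would more likely reach for a library such as scikit-learn or statsmodels, which also report significance and diagnostics; the point here is that the fit itself is just these two formulas.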

Enjoyed reading?

Follow for everything Data and AI!

