
Simple Linear Regression
Contents
• Measures of Association
• Covariance
• Correlation
• Correlation Properties
• Regression Analysis
• Ordinary Least Squares method (OLS)
• Assumptions of Regression Analysis
• Output interpretation
Covariance
• Covariance is a measure of association between two random variables.

• Measures the linear relationship between two variables.

• Covariance can be negative or positive or zero.

• Formula for the sample covariance:

$\mathrm{Cov}(X, Y) = \dfrac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$
Covariance
• A positive covariance value means the two variables tend to vary in the same direction (i.e. if one increases, the other tends to increase too).
• A negative value means they vary in opposite directions (i.e. if one increases, the other decreases).
• A covariance of zero means they do not vary together.

• Limitations
• Covariance measures only the direction of the relationship between two variables.
• It does not show the strength of the relationship between them.
• Covariance values are not standardized.
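
A minimal sketch of computing the sample covariance with NumPy; the data values here are made-up illustrations, not taken from the slides:

```python
import numpy as np

# Hypothetical paired observations
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Sample covariance computed from the formula above
cov_manual = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

# np.cov returns the 2x2 covariance matrix; the off-diagonal entry is Cov(X, Y)
cov_numpy = np.cov(x, y)[0, 1]

print(cov_manual, cov_numpy)  # both give the same positive value
```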
Correlation
• A correlation coefficient measures the extent to which two variables tend
to change together. The coefficient describes both the strength and the
direction of the relationship.
• It is considered to be the normalised version of the covariance:

$r = \dfrac{\mathrm{Cov}(X, Y)}{\sigma_X \, \sigma_Y}$

• The correlation coefficient is bounded between -1 and 1.
Correlation

[Scatterplots illustrating positive correlation, no correlation, and negative correlation]

Limitation
Correlation does not and cannot be taken to imply causation. Even if there is a very strong association between two variables, we cannot assume that one causes the other.
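
A minimal sketch showing that the correlation coefficient is just the covariance rescaled by the two standard deviations (illustrative data, assumed for this example):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

cov_xy = np.cov(x, y)[0, 1]                            # sample covariance
r = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))   # normalise by std devs

print(r)                         # close to +1 for this strongly linear data
print(np.corrcoef(x, y)[0, 1])   # NumPy's built-in correlation agrees
```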
Covariance vs Correlation
Types of Relationships
Regression Analysis
• Linear regression is a very simple approach for supervised learning.
• Useful technique for predicting a quantitative response.
• Aim of regression analysis is to find a best fit line that passes through the
points.
• The relationship between two variables x and y can be expressed by the following equation:

𝒀 = 𝒄 + 𝒎𝒙 + 𝜺
Regression Analysis

Regression models are classified by the number of independent variables:

• Simple Regression - one independent variable (Linear or Non-linear)
• Multiple Regression - more than one independent variable (Linear or Non-linear)

Ordinary Least Squares method (OLS)
𝒀 = 𝜷𝟎 + 𝜷𝟏 𝒙 + 𝜺

• 𝑌 - predicted value of the dependent variable (Dependent/Response variable).
• 𝜷𝟎 - the intercept (the predicted value of Y when all the predictor
variables equal zero).
• 𝜷𝟏 - the regression coefficient (Slope) for the predictor – X.
• 𝒙 - predictor value, Independent/Explanatory variable (Input).
• 𝜺 - Random error/Noise.
OLS – What is the best fit?
[Scatterplot of Trunk Diameter versus Tree Height]
OLS – Least Squares
Minimise the Sum of Squares:

$\sum_{i=1}^{6} \varepsilon_i^2 = \varepsilon_1^2 + \varepsilon_2^2 + \varepsilon_3^2 + \varepsilon_4^2 + \varepsilon_5^2 + \varepsilon_6^2$
OLS – Least Squares
• Let $\varepsilon_i = y_i - \hat{y}_i$ be the prediction error for observation $i$.
• Sum of Squares of Errors: $SSE = \sum_{i=1}^{n} \varepsilon_i^2$
• For a good fit, SSE should be a minimum, that is, "Least Squares".

$SSE = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
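
A minimal sketch of ordinary least squares for one predictor, using the closed-form slope and intercept that minimise the SSE above; the tree measurements are hypothetical values invented for illustration:

```python
import numpy as np

# Hypothetical measurements (e.g. tree height vs. trunk diameter)
x = np.array([10, 20, 30, 40, 50, 60], dtype=float)
y = np.array([2.1, 4.0, 5.8, 8.1, 9.9, 12.2])

# Closed-form OLS estimates of slope (b1) and intercept (b0)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

y_hat = b0 + b1 * x                # fitted values on the regression line
sse = np.sum((y - y_hat) ** 2)     # the quantity being minimised
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, SSE = {sse:.3f}")
```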
Least Squares Regression Properties

• The sum of the residuals from the least squares regression line is 0: $\sum (y - \hat{y}) = 0$.
• The sum of the squared residuals is a minimum: $\sum (y - \hat{y})^2$ is minimised.
• The simple regression line always passes through the mean of the y variable and the mean of the x variable.
• The least squares coefficients are unbiased estimates of $\beta_0$ and $\beta_1$.
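
A quick numerical check of the first and third properties, reusing the same hypothetical data and closed-form estimates as in the earlier sketch:

```python
import numpy as np

x = np.array([10, 20, 30, 40, 50, 60], dtype=float)
y = np.array([2.1, 4.0, 5.8, 8.1, 9.9, 12.2])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
residuals = y - (b0 + b1 * x)

print(np.isclose(residuals.sum(), 0.0))          # residuals sum to ~0
print(np.isclose(y.mean(), b0 + b1 * x.mean()))  # line passes through the means
```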
Explained and Unexplained variation

$SST = SSE + SSR$

• Total Sum of Squares: $SST = \sum (y - \bar{y})^2$
• Sum of Squares Error: $SSE = \sum (y - \hat{y})^2$
• Sum of Squares Regression: $SSR = \sum (\hat{y} - \bar{y})^2$

Where:
• $\bar{y}$ = average value of the dependent variable
• $y$ = observed values of the dependent variable
• $\hat{y}$ = estimated value of y for the given x value
Explained and Unexplained variation
• SST = Total sum of squares
o Measures the variation of the $y_i$ values around their mean $\bar{y}$.
• SSE = Error sum of squares
o Variation attributable to factors other than the relationship between x and y (unexplained).
• SSR = Regression sum of squares
o Explained variation attributable to the relationship between x and y.
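
A minimal sketch of the SST = SSE + SSR decomposition, again on the hypothetical data used above:

```python
import numpy as np

x = np.array([10, 20, 30, 40, 50, 60], dtype=float)
y = np.array([2.1, 4.0, 5.8, 8.1, 9.9, 12.2])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)      # total variation
sse = np.sum((y - y_hat) ** 2)         # unexplained variation
ssr = np.sum((y_hat - y.mean()) ** 2)  # explained variation

print(np.isclose(sst, sse + ssr))      # the decomposition holds
```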
Multiple Linear Regression
Contents
• Multiple Linear Regression
• Assumptions of Regression Analysis
• Measures for goodness-of-fit
Multiple Linear Regression
• Simple Linear Regression equation is:
𝒀 = 𝜷𝟎 + 𝜷𝟏 𝒙

• The Multiple Linear Regression equation is:

𝒀 = 𝜷𝟎 + 𝜷𝟏 𝒙𝟏 + 𝜷𝟐 𝒙𝟐 + 𝜷𝟑 𝒙𝟑 + ⋯ + 𝜷𝒏 𝒙𝒏

• Goal is to fit a regression line between all the independent variables and
the dependent variable.
MLR using Least Squares
• The process of fitting a multiple regression line is the same as in simple regression.
• Here you have a number of additional independent variables ($x_1, x_2, \ldots, x_n$) and you are trying to estimate multiple coefficients ($\beta_1, \beta_2, \ldots, \beta_n$).
• As expected, the best-fit line should minimise the sum of squared errors (a sketch follows below):

$SSE = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \big( y_i - (\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \cdots + \beta_n x_n) \big)^2$
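
A minimal sketch of multiple linear regression with NumPy's least-squares solver; the data and variable names are illustrative assumptions, not taken from the slides:

```python
import numpy as np

# Hypothetical data: 6 observations, 2 independent variables
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 6.0],
              [6.0, 5.0]])
y = np.array([4.1, 4.9, 9.2, 10.1, 14.0, 15.1])

# Add a column of ones so the first coefficient is the intercept
X_design = np.column_stack([np.ones(len(X)), X])

# Solve for the coefficients that minimise the SSE
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print("Intercept and coefficients:", beta)
```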
Assumptions
1. Linearity - the relationship between the independent variable(x) and the
dependent variable (y) is linear.
 Detection of Linearity and Fixing the Violation of Linearity
 Detected by using a scatterplot.
 If the relation between x and y is not linear, transformations such as square of x, log(x), exp(x),
sqrt(x), and so on, can be tried.

2. Normality - the dependent variable y is distributed normally for each value of the independent variable x.
 The errors (the deviations between actual and predicted values) should follow a normal distribution with a mean close to zero. Normality is a basic assumption when calculating regression coefficients. Generally, outliers are the main reason for the violation of this assumption; outlier-handling techniques are used when it is violated.
Assumptions
2. Normality - A normal probability versus residual probability distribution plot is also known as a P-P plot. The plot draws a perfect normal probability distribution (P) and compares it with the probability distribution of the residuals (P).
• If the distribution is normal, then all the points in the P-P plot should fall close to a
diagonal straight line, indicating that the probability distribution of residuals (dots
in the graph) is almost the same as standard normal probability distribution (line).
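
A minimal sketch of drawing a P-P plot of residuals with statsmodels' ProbPlot helper; the residuals here are simulated for illustration, whereas in practice they would come from your fitted model:

```python
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.gofplots import ProbPlot

# Simulated residuals standing in for the errors of a fitted regression
rng = np.random.default_rng(0)
residuals = rng.normal(loc=0.0, scale=1.0, size=100)

# P-P plot: points close to the 45-degree line indicate that the residual
# distribution is approximately normal
ProbPlot(residuals).ppplot(line="45")
plt.title("P-P plot of residuals")
plt.show()
```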
Assumptions
3. Observations are Independent - Independence of y is one of the most
common assumptions while attempting to fit a linear regression. You can
expect the values of y to depend on independent variables but not on its
own previous values.

4. Homoscedasticity - the variance in y is the same at each level of x; there is no particular segment or interval of X where the dispersion in Y is distinct.
Assumptions

• homoscedasticity - the points must be about the same distance from the line.
• heteroscedasticity - points are at widely varying distances from the regression
line.
Assumptions
Detection of Homoscedasticity Violation and Fixing It
A good approach is to plot the residuals against the predicted values (not the actual values). This gives you a picture of whether the residual variation is growing, shrinking, or random around the predicted line (see the sketch below).
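
A minimal sketch of a residuals-versus-predicted plot; the fitted values and residuals are simulated here purely to illustrate what a homoscedastic pattern looks like:

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated fitted values and residuals with roughly constant spread
rng = np.random.default_rng(1)
y_hat = np.linspace(5, 50, 100)
residuals = rng.normal(loc=0.0, scale=1.0, size=100)

# An even, patternless band around zero suggests homoscedasticity;
# a funnel shape (spread growing with y_hat) suggests heteroscedasticity
plt.scatter(y_hat, residuals)
plt.axhline(0, color="red")
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.show()
```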
Coefficient of determination ($R^2$)
• How to judge a good fit line -
• SSE (Minimum or Maximum?)
• SSR (Minimum or Maximum?)
• SSR/SSE(Minimum or Maximum?)
• The coefficient of determination is the portion of the total variation in the
dependent variable that is explained by variation in the independent variable.
• R-squared provides an estimate of the strength of the relationship between
your model and the response variable.
• The coefficient of determination is also called R-squared and is denoted as 𝑅2 .
$R^2 = \dfrac{SSR}{SST}, \quad \text{where } 0 \le R^2 \le 1$
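
A minimal sketch of computing $R^2$ from the SSR and SST defined earlier, on the same hypothetical data:

```python
import numpy as np

x = np.array([10, 20, 30, 40, 50, 60], dtype=float)
y = np.array([2.1, 4.0, 5.8, 8.1, 9.9, 12.2])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

ssr = np.sum((y_hat - y.mean()) ** 2)
sst = np.sum((y - y.mean()) ** 2)
r_squared = ssr / sst               # proportion of variation explained
print(f"R-squared = {r_squared:.4f}")
```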
Problems with R-squared
• The R-squared will either increase or remain the same when adding a new
independent variable to the model. It will never decrease unless you
remove a variable.
• R-squared is the total amount of variation explained by the set of independent variables in the model. If you add a new junk independent variable, one that has no impact on or relation to the dependent variable, the R-squared may still increase slightly, but it will never decrease.
Adjusted R-squared
• Its value depends on the number of explanatory variables
• Imposes a penalty for adding additional explanatory variables/junk
variable.
• Adjusted R-squared is calculated as below:
$R^2_{adj} = R^2 - (1 - R^2)\,\dfrac{k - 1}{n - k}$

where n = number of observations, k = number of variables.
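
A minimal sketch of the adjusted R-squared formula above as a small helper function (the example numbers are arbitrary):

```python
def adjusted_r_squared(r_squared: float, n: int, k: int) -> float:
    """Adjusted R-squared as defined on the slide: penalises extra variables."""
    return r_squared - (1 - r_squared) * (k - 1) / (n - k)

# Example: R^2 = 0.90 with 30 observations and 4 variables
print(adjusted_r_squared(0.90, n=30, k=4))  # slightly below 0.90
```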


F-statistics
• The F-statistic, also known as the F-ratio, is F = MSR / MSE and is a measure of the strength of the regression.
• Here MSR is the mean square due to regression and MSE is the mean squared error. A strong relationship between $y_i$ and $x_i$ gives a high F-ratio.
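
A minimal sketch of computing the F-ratio from SSR and SSE; the degrees of freedom used here (k for the regression, n - k - 1 for the error, with k predictors) are the conventional choices and are an assumption, since the slides do not state them:

```python
import numpy as np

def f_statistic(y, y_hat, k):
    """F = MSR / MSE for a regression with k predictors (conventional d.o.f.)."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    n = len(y)
    ssr = np.sum((y_hat - y.mean()) ** 2)   # explained variation
    sse = np.sum((y - y_hat) ** 2)          # unexplained variation
    msr = ssr / k                           # mean square due to regression
    mse = sse / (n - k - 1)                 # mean squared error
    return msr / mse

# Example with the hypothetical simple-regression data (k = 1 predictor)
x = np.array([10, 20, 30, 40, 50, 60], dtype=float)
y = np.array([2.1, 4.0, 5.8, 8.1, 9.9, 12.2])
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
print(f_statistic(y, b0 + b1 * x, k=1))
```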
F-statistics
• The hypotheses of the F-test are:

• $H_0: \beta_1 = \beta_2 = \beta_3 = \cdots = \beta_n = 0$ (no relationship)

• $H_a:$ at least one $\beta_i \neq 0$ (at least one independent variable affects y)
