SLR & MLR
Regression
Contents
• Measures of Association
• Covariance
• Correlation
• Correlation Properties
• Regression Analysis
• Ordinary Least Squares method (OLS)
• Assumptions of Regression Analysis
• Output interpretation
Covariance
• Covariance is a measure of association between two random variables.
• Limitations:
• It indicates only the direction of the relationship between the two variables.
• It does not show the strength of the relationship between them.
• Covariance values are not standardized, so they are hard to compare across datasets.
Correlation
• A correlation coefficient measures the extent to which two variables tend
to change together. The coefficient describes both the strength and the
direction of the relationship.
• It can be viewed as the normalised version of covariance: $r = \dfrac{\mathrm{cov}(X, Y)}{\sigma_X \, \sigma_Y}$
Limitation
Correlation does not, and cannot be taken to, imply causation. Even if there is a very strong association between two variables, we cannot assume that one causes the other.
Covariance vs Correlation
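As a quick numerical contrast between the two measures, here is a minimal sketch using NumPy; the data are made up for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

cov_xy = np.cov(x, y)[0, 1]          # unstandardised; depends on the units of x and y
corr_xy = np.corrcoef(x, y)[0, 1]    # standardised to the range [-1, 1]
print(cov_xy, corr_xy)

# Rescaling x (say, metres to centimetres) inflates the covariance
# by the same factor but leaves the correlation unchanged.
print(np.cov(x * 100, y)[0, 1], np.corrcoef(x * 100, y)[0, 1])
```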
Types of Relationships
Regression Analysis
• Linear regression is a very simple approach for supervised learning.
• Useful technique for predicting a quantitative response.
• The aim of regression analysis is to find a best-fit line through the points.
• The relationship between two variables $x$ and $y$ can be expressed by the following equation:
$Y = c + mx + \varepsilon$
where $c$ is the intercept, $m$ is the slope, and $\varepsilon$ is the error term.
Regression Analysis
• Regression analysis splits into two cases: Simple Regression (one independent variable) and Multiple Regression (several independent variables).
[Figure: scatter plot of Trunk Diameter against Tree Height with a fitted regression line.]
OLS – Least Squares
Minimise the sum of squared errors, illustrated here for six data points:
$\sum_{i=1}^{6} \varepsilon_i^2 = \varepsilon_1^2 + \varepsilon_2^2 + \varepsilon_3^2 + \varepsilon_4^2 + \varepsilon_5^2 + \varepsilon_6^2$
OLS – Least Squares
• Let $\varepsilon_i = y_i - \hat{y}_i$ be the prediction error for observation $i$.
• Sum of Squares of Errors: $SSE = \sum_{i=1}^{n} \varepsilon_i^2$
• For a good fit, SSE should be as small as possible; that is "Least Squares":
$\min \sum_{i=1}^{n} \varepsilon_i^2 = \min \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
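To make the least-squares idea concrete, here is a minimal sketch of the closed-form OLS fit for simple regression with NumPy; the tree measurements are illustrative, echoing the figure above.

```python
import numpy as np

# Illustrative data: tree height (x) and trunk diameter (y)
x = np.array([35.0, 49.0, 27.0, 33.0, 60.0, 21.0])
y = np.array([8.0, 9.0, 7.0, 6.0, 13.0, 7.0])

# Closed-form OLS estimates for the slope and intercept
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

y_hat = b0 + b1 * x                  # fitted values
sse = np.sum((y - y_hat) ** 2)       # the quantity OLS minimises
print(f"intercept={b0:.3f}, slope={b1:.3f}, SSE={sse:.3f}")
```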
Least Squares Regression Properties
Explained and Unexplained Variation
• The total variation splits into an explained and an unexplained part: $SST = SSR + SSE$.
• SST = Total sum of squares: $SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$
o Measures the variation of the $y_i$ values around their mean $\bar{y}$.
• SSR = Regression (explained) sum of squares: $SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$
• SSE = Error (unexplained) sum of squares, as defined above.
Where:
$\bar{y}$ = average value of the dependent variable
$y_i$ = observed values of the dependent variable
$\hat{y}_i$ = estimated value of $y$ for the given $x$ value
Multiple Linear Regression
$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \dots + \beta_n x_n + \varepsilon$
• The goal is to fit a regression line between all the independent variables and the dependent variable.
MLR using Least Squares
• The process of fitting a multiple regression line is the same as in simple regression.
• Here you have a number of additional independent variables ($x_1, x_2, \dots, x_n$) and you are trying to estimate multiple coefficients ($\beta_1, \beta_2, \dots, \beta_n$).
• As expected, the best-fit line should minimise the sum of squared errors:
$\min \sum_{i=1}^{n} \varepsilon_i^2 = \min \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
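The same objective with several predictors can be solved in one call; a minimal sketch with synthetic data, using NumPy's least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
X = rng.normal(size=(n, 3))                      # three independent variables
y = 4.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=n)

# Prepend an intercept column, then solve min ||y - Xb||^2
X_design = np.column_stack([np.ones(n), X])
coeffs, _, _, _ = np.linalg.lstsq(X_design, y, rcond=None)
print("beta_0..beta_3:", np.round(coeffs, 3))
```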
Assumptions
• Homoscedasticity: the points are about the same distance from the regression line, i.e. the error variance is constant.
• Heteroscedasticity: the points are at widely varying distances from the regression line.
Detection of Homoscedasticity Violation and Fixing It
A good approach is to plot the residuals against the predicted values (not the actual values). This gives you a picture of whether the residual variation is growing, shrinking, or scattering randomly around zero as the predicted values increase; a widening funnel shape is a classic sign of heteroscedasticity.
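A minimal sketch of such a residual-versus-predicted plot with matplotlib, reusing the illustrative tree data from the OLS sketch above:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([35.0, 49.0, 27.0, 33.0, 60.0, 21.0])
y = np.array([8.0, 9.0, 7.0, 6.0, 13.0, 7.0])
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

plt.scatter(y_hat, y - y_hat)        # residuals against predicted values
plt.axhline(0.0, linestyle="--")     # residuals should scatter randomly around zero
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.title("A widening funnel suggests heteroscedasticity")
plt.show()
```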
Coefficient of Determination ($R^2$)
• How do we judge a good-fit line?
• SSE (minimum or maximum?)
• SSR (minimum or maximum?)
• SSR/SSE (minimum or maximum?)
• The coefficient of determination is the portion of the total variation in the
dependent variable that is explained by variation in the independent variable.
• R-squared provides an estimate of the strength of the relationship between
your model and the response variable.
• The coefficient of determination is also called R-squared and is denoted $R^2$.
$R^2 = \dfrac{SSR}{SST}, \quad \text{where } 0 \le R^2 \le 1$
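Continuing the illustrative tree data from earlier, the three sums of squares and $R^2$ can be computed directly:

```python
import numpy as np

x = np.array([35.0, 49.0, 27.0, 33.0, 60.0, 21.0])
y = np.array([8.0, 9.0, 7.0, 6.0, 13.0, 7.0])
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)      # total variation
ssr = np.sum((y_hat - y.mean()) ** 2)  # explained variation
sse = np.sum((y - y_hat) ** 2)         # unexplained variation
print(f"R^2 = {ssr / sst:.3f}  (check: SSR + SSE = {ssr + sse:.3f} vs SST = {sst:.3f})")
```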
Problems with R-squared
• The R-squared will either increase or remain the same when adding a new
independent variable to the model. It will never decrease unless you
remove a variable.
• R-squared is the total amount of variation explained by the independent variables in the model. If you add a junk independent variable, one that has no impact on or relation to the dependent variable, R-squared may still increase slightly, but it will never decrease.
Adjusted R-squared
• Its value depends on the number of explanatory variables.
• It imposes a penalty for adding additional explanatory variables, including junk variables.
• Adjusted R-squared is calculated as below, where $n$ is the number of observations and $k$ is the number of model parameters:
$R^2_{adj} = R^2 - (1 - R^2)\dfrac{k - 1}{n - k}$
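A direct translation of the slide's formula; the sample values passed in are made up for illustration:

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted R^2 as above: R^2 - (1 - R^2) * (k - 1) / (n - k),
    where n is the number of observations and k the number of model parameters."""
    return r_squared - (1 - r_squared) * (k - 1) / (n - k)

print(adjusted_r_squared(r_squared=0.85, n=50, k=4))   # illustrative values
```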
Output Interpretation
• Overall F-test: $H_0 : \beta_1 = \beta_2 = \beta_3 = \dots = \beta_n = 0$ (no relationship between the dependent variable and any of the independent variables)
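One way to see this test in a real output (statsmodels is an assumption here, not prescribed by the slides) is the OLS summary, whose F-statistic and its p-value test exactly this null hypothesis:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = 4.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=50)

model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.summary())   # the F-statistic tests H0: all slope coefficients are zero
```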