
Linear Regression

Introduction to Regression
• Linear Regression is a Supervised Learning algorithm

• The dependent variable is a continuous variable, e.g. Revenue

• Regression is an attempt to explain the variation in a continuous dependent variable using the variation in independent (explanatory) variables
Discussion
• What are scenarios where Regression can be used?
Question
• Identify the type of problem:

• A construction company is making a bid on a project in a remote area of Mumbai. A certain component of the project will take place in December, and is very sensitive to the daily high temperatures. They would like to estimate what the average high temperature will be at the location in December.

a. Supervised learning – Classification problem
b. Unsupervised learning – Regression problem
c. Supervised learning – Regression problem
d. Unsupervised learning – Classification problem
Correlation
• Correlation – a linear relationship between two continuous variables

• It determines both the nature and strength of the relationship between two variables

• The Coefficient of Correlation, r, takes values between -1 and +1.

• Correlation of ‘-1’ indicates a perfect negative correlation

• Correlation of ‘+1’ indicates a perfect positive correlation


Correlation

• r = Cov(x, y) / (Sx × Sy)

  – Cov(x,y) is the covariance of x and y
  – Sx is the standard deviation of x
  – Sy is the standard deviation of y

• In real-life data, you will typically not observe r values of exactly -1, +1, or 0
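As an illustration (not part of the original slides), here is a minimal Python sketch computing r from the covariance and standard deviations, assuming numpy and using the first ten rows of the marketing-budget data shown later in the deck:

```python
import numpy as np

# First ten rows of the marketing-budget data used later in the deck
x = np.array([23, 64, 67, 45, 19, 56, 65, 35, 45, 24], dtype=float)
y = np.array([5, 15, 16, 10, 4.5, 12, 15.5, 5.6, 4.9, 5.01])

# r = Cov(x, y) / (Sx * Sy)
cov_xy = np.cov(x, y)[0, 1]                           # sample covariance of x and y
r = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))  # divide by the two std devs

print(round(r, 3))                                    # same as np.corrcoef(x, y)[0, 1]
```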


Exercise
Calculate Coefficient of Correlation
Simple and Multiple Regression
• Regression is a technique for finding the relationship between a response variable and one
or more explanatory variables.

• Simple Linear Regression: Predict Y using only one independent variable
  – Estimated y = b0 + b1 * x1

• Multiple Linear Regression: Predict Y by considering more than one independent variable
  – Estimated y = b0 + b1 * x1 + b2 * x2
Simple Linear Regression

Marketing Budget (X)    Insurance term amount (Y)
(in lakhs)              (in Crs)
23                      5
64                      15
67                      16
45                      10
19                      4.5
56                      12
65                      15.5
35                      5.6
45                      4.9
24                      5.01
46                      10
76                      15.2
45                      11
38                      7
34                      6
32                      5.5
56                      12

Line Equation:
Insurance term = Slope × Marketing budget + Intercept
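A minimal Python sketch (not from the slides) that fits this line by least squares to the table above, assuming numpy; the result should come out near the slope 0.239 and intercept -1.379 quoted later in the deck:

```python
import numpy as np

# Data from the table above: marketing budget (lakhs), insurance term amount (Crs)
x = np.array([23, 64, 67, 45, 19, 56, 65, 35, 45, 24,
              46, 76, 45, 38, 34, 32, 56], dtype=float)
y = np.array([5, 15, 16, 10, 4.5, 12, 15.5, 5.6, 4.9, 5.01,
              10, 15.2, 11, 7, 6, 5.5, 12])

# Least-squares fit: Insurance term = slope * Marketing budget + intercept
slope, intercept = np.polyfit(x, y, deg=1)
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```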
Question
• Which of the following methods do we use to find the best fit line for data in
Linear Regression?

a. Least Square Error
b. Maximum Likelihood
c. Logarithmic Loss
d. Maximum information gain
Simple Linear Regression – Best Fit Line

The best fit line is the line that minimizes the residual sum of squares.

Key questions:
• How well does the best-fit line represent the scatter plot?
• How well does the best-fit line predict new data?
Simple Linear Regression – Strength of Best Fit Line

• R-squared is the standard metric for measuring the strength of the best fit line

• R-squared is a statistical measure of how close the data are to the fitted regression line

• It always takes a value between 0 and 1; a value of 1 indicates that the variance in the dependent variable is completely explained by the independent variable

• Mathematically, R-squared = 1 – RSS / TSS

  where RSS = Residual sum of squares
        TSS = Total sum of squares
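As a sketch (assuming numpy and reusing the marketing-budget data from the earlier table), R-squared can be computed directly from RSS and TSS; the output should come out near the RSS ≈ 37.12, TSS ≈ 292.42, and R² ≈ 0.87 quoted on the next slide:

```python
import numpy as np

x = np.array([23, 64, 67, 45, 19, 56, 65, 35, 45, 24,
              46, 76, 45, 38, 34, 32, 56], dtype=float)
y = np.array([5, 15, 16, 10, 4.5, 12, 15.5, 5.6, 4.9, 5.01,
              10, 15.2, 11, 7, 6, 5.5, 12])
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept            # predictions from the best fit line

rss = np.sum((y - y_hat) ** 2)           # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)        # total sum of squares
print(f"RSS={rss:.2f}, TSS={tss:.2f}, R^2={1 - rss / tss:.2f}")
```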
Simple Linear Regression – Coefficient of Determination

• R-squared = 1 – RSS / TSS
  where RSS = Residual sum of squares, TSS = Total sum of squares

Example
• Dependent variable: Insurance amount
• Independent variable: Marketing Budget
• RSS = 37.12
• TSS = 292.42
• R-squared = 1 – 37.12 / 292.42 ≈ 0.87
• An R-squared of 0.87 indicates that 87% of the variation in 'insurance amount' is explained by the independent variable, 'Marketing Budget'
SLR – Equation Interpretation

Interpretation of coefficients; the best fit line equation is:

• Insurance_amount = 0.239 * Marketing_budget - 1.379
• β0 (intercept) = -1.379
• β1 (slope) = 0.239

• The simple linear regression equation tells us that the predicted insurance term amount increases by 0.239 (in Crs) for every one-lakh increase in the marketing budget.
Exercise
Calculate Predicted Y and R-square
Multiple Linear Regression
• It explains the relationship between two or more independent variables and a response variable by fitting a linear equation (a plane, or hyperplane, rather than a single line) to the observed data.

Marketing Budget (X1)   Yearly Income (X2)   Insurance term amount (Y)
(in lakhs)              (in Lakhs)           (in Crs)
23                      4                    5
64                      40                   15
67                      42                   16
45                      15                   10
19                      32                   4.5
56                      23                   12
65                      34                   15.5
35                      3                    5.6
45                      4                    4.9
24                      5                    5.01
46                      12                   10
76                      32                   15.2
45                      14                   11
38                      7                    7
34                      1                    6
32                      5.5                  5.5
56                      28                   12

Equation:
Insurance term = β1 × Marketing budget (X1) + β2 × Yearly Income (X2) + β0
MLR General Eq. with 'n' variables:

Y = β0 + β1*X1 + β2*X2 + β3*X3 + … + βn*Xn

where,

Y = Expected output variable

X1, X2, …, Xn = 'n' independent variables

β0, β1, …, βn = coefficients of the independent variables, with the constant (intercept) β0
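A minimal sketch (not from the slides) fitting β0, β1, β2 to the table above with numpy's least squares:

```python
import numpy as np

# Data from the table: X1 = marketing budget, X2 = yearly income, Y = insurance term amount
x1 = np.array([23, 64, 67, 45, 19, 56, 65, 35, 45, 24,
               46, 76, 45, 38, 34, 32, 56], dtype=float)
x2 = np.array([4, 40, 42, 15, 32, 23, 34, 3, 4, 5,
               12, 32, 14, 7, 1, 5.5, 28])
y = np.array([5, 15, 16, 10, 4.5, 12, 15.5, 5.6, 4.9, 5.01,
              10, 15.2, 11, 7, 6, 5.5, 12])

# Design matrix with a column of ones for the intercept β0
X = np.column_stack([np.ones_like(x1), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"β0={beta[0]:.3f}, β1={beta[1]:.3f}, β2={beta[2]:.3f}")
```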
Adjusted R-square
• The adjusted R² statistic penalizes the analyst for adding terms to the model

• It can help guard against overfitting (including regressors that are not really useful)

• Particularly for small N, and where results are to be generalized, pay more attention to adjusted R²

• Adjusted R² is used for estimating the explained variance in a population

• Adjusted R-squared is therefore a better metric than R-squared for assessing how well the model fits the data
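The slides do not give the formula; the standard one is Adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1) for n observations and p predictors. A small Python sketch:

```python
def adjusted_r_squared(r2: float, n: int, p: int) -> float:
    """Adjusted R^2 for n observations and p predictors (standard formula)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# E.g. the SLR example: R^2 = 0.87 with n = 17 rows and p = 1 predictor
print(round(adjusted_r_squared(0.87, n=17, p=1), 3))   # ≈ 0.861
```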
MLR Framework
• Business Objective → Data Preparation → Exploratory Data Analysis → Variable Selection for model → Final Model
• Variable selection addresses Multicollinearity, checked via the Variance Inflation Factor
Stepwise Variable Selection
• Start with no variables in the model
• Add variables one by one, checking the p-value at each step
• Check the p-values of the variables already in the model
• Eliminate any variable whose p-value is larger than the threshold (a sketch of this loop follows below)
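A minimal Python sketch of the elimination step, assuming statsmodels and pandas; the function name backward_eliminate and the 0.05 threshold are illustrative choices, not from the slides:

```python
import pandas as pd
import statsmodels.api as sm

def backward_eliminate(X: pd.DataFrame, y, threshold: float = 0.05) -> list:
    """Repeatedly drop the predictor with the largest p-value above threshold."""
    cols = list(X.columns)
    while cols:
        model = sm.OLS(y, sm.add_constant(X[cols])).fit()
        pvals = model.pvalues.drop("const")   # p-values of predictors only
        worst = pvals.idxmax()
        if pvals[worst] <= threshold:
            break                             # all remaining predictors significant
        cols.remove(worst)                    # eliminate the least significant one
    return cols

# Hypothetical usage: kept = backward_eliminate(df[["X1", "X2"]], df["Y"])
```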
Multicollinearity
• It is a state of very high inter-correlations or inter-associations among the independent variables (i.e. X1, X2, X3, etc.)

• Say there are three variables in the dataset, X1, X2, and X3, and there is strong correlation between them

• What will be the effect of this? How do we handle this situation?


Dummy Variables
• Independent variables can be categorical variables, for example:
  – Gender
  – Brand of laptop
  – Nationality

• Since algorithms expect numerical values in independent variables, these need to be encoded

• What can be a problem if, for example, laptop brands are coded as follows: HP=1, Dell=2, Lenovo=3, Asus=4, Acer=5?
Dummy Variables
• The correct way to encode categorical variables is by using dummy variables

• A dummy variable takes on the values 1 and 0 only

• If a categorical variable has n possible values, then create n-1 dummy variables

• For example, if Laptop brand can take the following 5 values: HP, Dell, Lenovo, Asus, Acer; then create 4 dummy variables: Brand_HP, Brand_Dell, Brand_Lenovo, Brand_Asus. Each of these variables can take the value 0 or 1 (see the sketch below)
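A minimal pandas sketch (not from the slides); with drop_first=True, get_dummies keeps n-1 dummies, with the dropped category acting as the baseline:

```python
import pandas as pd

# Hypothetical column of laptop brands
df = pd.DataFrame({"Brand": ["HP", "Dell", "Lenovo", "Asus", "Acer", "HP"]})

# drop_first=True keeps n-1 dummies (Acer, first alphabetically, is the baseline)
dummies = pd.get_dummies(df["Brand"], prefix="Brand", drop_first=True)
print(dummies)   # columns: Brand_Asus, Brand_Dell, Brand_HP, Brand_Lenovo
```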
Why Multicollinearity is a problem?
• Multicollinearity has no impact on the predictive power of the model as a whole, but it affects the calculations for individual predictors

• Parameter estimates may change erratically for small changes in the model or data, making the estimates highly unstable

• This instability increases the variance of the estimates: a small change in the data for the independent variables produces large changes in the coefficient estimates
Multicollinearity –
Variance Inflation Factor

• The VIF of an independent variable is calculated by regressing that variable on the other independent variables: VIF_j = 1 / (1 - R²_j), where R²_j is the R-squared of that regression

• The variable with the higher VIF, and/or the variable with lower significance from a business point of view, can be removed

• The VIF threshold depends on the business problem; a VIF greater than 2, 5, or 10 is commonly used as a threshold (see the sketch below)
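A minimal sketch using statsmodels' variance_inflation_factor on the MLR data from the earlier table (the two-predictor setup is illustrative):

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# X1 = marketing budget, X2 = yearly income (from the MLR table)
X = pd.DataFrame({
    "X1": [23, 64, 67, 45, 19, 56, 65, 35, 45, 24, 46, 76, 45, 38, 34, 32, 56],
    "X2": [4, 40, 42, 15, 32, 23, 34, 3, 4, 5, 12, 32, 14, 7, 1, 5.5, 28],
})
Xc = sm.add_constant(X)   # VIF is computed on a design matrix with an intercept

# VIF_j = 1 / (1 - R^2_j), regressing column j on the remaining columns
for i, col in enumerate(Xc.columns):
    if col != "const":
        print(col, round(variance_inflation_factor(Xc.values, i), 2))
```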
P-value significance
• Remove the variables with higher p-values, in order of their insignificance

• If the final model still has a large number of variables, the criterion for keeping a variable in the model can be made stricter, e.g. p < 0.02 or 0.03
Assumptions
Top-1:
• The regression model is linear in parameters
  o For example, the equation below is linear in the parameters:

    Y = β0 + β1*X1² + β2*X2

• Linearity means that the coefficients of the variables enter linearly, not that the variables themselves must be linear

• A non-linear equation is one in which the coefficients do not enter linearly (see the contrast below)
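For contrast, here is an illustrative side-by-side (not from the original slides) in LaTeX notation:

```latex
% Linear in parameters: valid for linear regression, even though X_1 enters squared
Y = \beta_0 + \beta_1 X_1^2 + \beta_2 X_2

% Non-linear in parameters: cannot be fit directly by ordinary linear regression
Y = \beta_0 + \beta_1^2 X_1 \qquad \text{or} \qquad Y = \beta_0 + e^{\beta_1 X_1}
```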
Assumptions
Top-2:
• Normality of residuals, i.e. the residuals should follow a bell-curve (normal) distribution with zero mean

• Detect using a plot of the residuals (a sketch follows below)

• Possible cause of non-normal residuals: a missing X variable
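A minimal sketch (assuming scipy and matplotlib) drawing a normal Q-Q plot of the residuals from the SLR fit; points close to the reference line suggest approximately normal residuals:

```python
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

# Residuals from the SLR fit (recomputed here so the sketch is self-contained)
x = np.array([23, 64, 67, 45, 19, 56, 65, 35, 45, 24,
              46, 76, 45, 38, 34, 32, 56], dtype=float)
y = np.array([5, 15, 16, 10, 4.5, 12, 15.5, 5.6, 4.9, 5.01,
              10, 15.2, 11, 7, 6, 5.5, 12])
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# Q-Q plot: points should fall close to the line if residuals are normal
stats.probplot(residuals, dist="norm", plot=plt)
plt.show()
```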


Assumptions
Top-3:
• Homoscedasticity of residuals, i.e. equal variance

• The opposite of homoscedasticity is heteroscedasticity: the error variance changes in a systematic pattern with changes in the X value

• Meaning that the error variance itself changes with the value of X

• An appropriate transformation can eliminate the heteroscedasticity problem
Assumptions
Top-3 (continued):
• Detection of heteroscedasticity
  o The best way to detect heteroscedasticity is to plot the squared residuals against the independent variable and check whether the plot exhibits any pattern

  o For multiple linear regression, plot the squared residuals against each of the independent variables

  o If the number of independent variables is large, plot the squared residuals against the predicted output variable (see the sketch below)
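A minimal matplotlib sketch (not from the slides) plotting squared residuals against the predicted values from the SLR fit:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([23, 64, 67, 45, 19, 56, 65, 35, 45, 24,
              46, 76, 45, 38, 34, 32, 56], dtype=float)
y = np.array([5, 15, 16, 10, 4.5, 12, 15.5, 5.6, 4.9, 5.01,
              10, 15.2, 11, 7, 6, 5.5, 12])
slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept

# Squared residuals vs predicted values: any visible pattern (e.g. a funnel
# shape) suggests heteroscedasticity; a flat random cloud suggests equal variance
plt.scatter(y_hat, (y - y_hat) ** 2)
plt.xlabel("Predicted insurance term amount")
plt.ylabel("Squared residual")
plt.show()
```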
Assumptions
Top-4:
• No multicollinearity among the independent variables
• Multicollinear variables should be removed from the final model
• The VIF (variance inflation factor) is used to detect, and guide the removal of, multicollinearity
Thank you
