
Linear Regression

Introduction to Regression
• Linear Regression is a Supervised Learning algorithm

• The dependent variable is a continuous variable, e.g. Revenue

• Regression is an attempt to explain the variation in a continuous dependent variable using the variation in independent (explanatory) variables
Discussion
• What are scenarios where Regression can be used?
Question
• Identify the type of problem:

• A construction company is making a bid on a project in a remote area of Mumbai. A certain component of the project will take place in December, and is very sensitive to the daily high temperatures. They would like to estimate what the average high temperature will be at the location in December.

a. Supervised learning – Classification problem
b. Unsupervised learning – Regression problem
c. Supervised learning – Regression problem
d. Unsupervised learning – Classification problem
Correlation
• Correlation – a linear relationship between two continuous variables

• It determines both the nature and strength of the relationship between two variables

• The Coefficient of Correlation, r, takes values between -1 and +1.

• Correlation of ‘-1’ indicates a perfect negative correlation

• Correlation of ‘+1’ indicates a perfect positive correlation


Correlation

• r = Cov(x, y) / (Sx × Sy)

  – Cov(x,y) is the covariance of x and y
  – Sx is the standard deviation of x
  – Sy is the standard deviation of y

• In real-life data, you will typically not observe r values of exactly -1, +1, or 0
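As an illustration (not part of the original slides), here is a minimal Python sketch computing r from the covariance and standard deviations, assuming numpy and using the first ten rows of the marketing-budget data shown later in the deck:

```python
import numpy as np

# First ten rows of the marketing-budget data used later in the deck
x = np.array([23, 64, 67, 45, 19, 56, 65, 35, 45, 24], dtype=float)
y = np.array([5, 15, 16, 10, 4.5, 12, 15.5, 5.6, 4.9, 5.01])

# r = Cov(x, y) / (Sx * Sy)
cov_xy = np.cov(x, y)[0, 1]                           # sample covariance of x and y
r = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))  # divide by the two std devs

print(round(r, 3))                                    # same as np.corrcoef(x, y)[0, 1]
```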


Exercise
Calculate Coefficient of Correlation
Simple and Multiple Regression
• Regression is a technique for finding the relationship between a response variable and one
or more explanatory variables.

• Simple Linear Regression: Predict Y using only one independent variable
  – Estimated y = b0 + b1 * x1

• Multiple Linear Regression: Predict Y by considering more than one independent variable
  – Estimated y = b0 + b1 * x1 + b2 * x2
Simple Linear Regression

Marketing Budget (X)    Insurance term amount (Y)
(in lakhs)              (in Crs)
23                      5
64                      15
67                      16
45                      10
19                      4.5
56                      12
65                      15.5
35                      5.6
45                      4.9
24                      5.01
46                      10
76                      15.2
45                      11
38                      7
34                      6
32                      5.5
56                      12

Line Equation:
Insurance term = Slope × Marketing budget + Intercept
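A minimal Python sketch (not from the slides) that fits this line by least squares to the table above, assuming numpy; the result should come out near the slope 0.239 and intercept -1.379 quoted later in the deck:

```python
import numpy as np

# Data from the table above: marketing budget (lakhs), insurance term amount (Crs)
x = np.array([23, 64, 67, 45, 19, 56, 65, 35, 45, 24,
              46, 76, 45, 38, 34, 32, 56], dtype=float)
y = np.array([5, 15, 16, 10, 4.5, 12, 15.5, 5.6, 4.9, 5.01,
              10, 15.2, 11, 7, 6, 5.5, 12])

# Least-squares fit: Insurance term = slope * Marketing budget + intercept
slope, intercept = np.polyfit(x, y, deg=1)
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```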
Question
• Which of the following methods do we use to find the best fit line for data in
Linear Regression?

a. Least Square Error
b. Maximum Likelihood
c. Logarithmic Loss
d. Maximum information gain
Simple Linear Regression – Best Fit Line

The best fit line is the line that minimizes the residual sum of squares.

Key questions:
• How well does the best-fit line represent the scatter plot?
• How well does the best-fit line predict new data?
Simple Linear Regression – Strength of Best Fit Line

• R-squared is the standard metric for measuring the strength of the best fit line

• R-squared is a statistical measure of how close the data are to the fitted regression line

• It always takes a value between 0 and 1; a value of 1 indicates that the variance in the dependent variable is completely explained by the independent variable

• Mathematically, R-squared = 1 – RSS / TSS

  where RSS = Residual sum of squares
        TSS = Total sum of squares
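As a sketch (assuming numpy and reusing the marketing-budget data from the earlier table), R-squared can be computed directly from RSS and TSS; the output should come out near the RSS ≈ 37.12, TSS ≈ 292.42, and R² ≈ 0.87 quoted on the next slide:

```python
import numpy as np

x = np.array([23, 64, 67, 45, 19, 56, 65, 35, 45, 24,
              46, 76, 45, 38, 34, 32, 56], dtype=float)
y = np.array([5, 15, 16, 10, 4.5, 12, 15.5, 5.6, 4.9, 5.01,
              10, 15.2, 11, 7, 6, 5.5, 12])
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept            # predictions from the best fit line

rss = np.sum((y - y_hat) ** 2)           # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)        # total sum of squares
print(f"RSS={rss:.2f}, TSS={tss:.2f}, R^2={1 - rss / tss:.2f}")
```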
Simple Linear Regression – Coefficient of Determination

• R-squared = 1 – RSS / TSS
  where RSS = Residual sum of squares, TSS = Total sum of squares

Example
• Dependent variable: Insurance amount
• Independent variable: Marketing Budget
• RSS = 37.12
• TSS = 292.42
• R-squared = 1 – 37.12 / 292.42 ≈ 0.87
• An R-squared of 0.87 indicates that 87% of the variation in 'insurance amount' is explained by the independent variable, 'Marketing Budget'
SLR – Equation Interpretation

Interpretation of coefficients; the best fit line equation is:

• Insurance_amount = 0.239 * Marketing_budget - 1.379
• β0 (intercept) = -1.379
• β1 (slope) = 0.239

• The simple linear regression equation tells us that the predicted insurance term amount increases by 0.239 (in Crs) for every one-lakh increase in the marketing budget.
Exercise
Calculate Predicted Y and R-square
Multiple Linear Regression
• It explains the relationship between two or more independent variables and a response variable by fitting a linear equation (a plane, or hyperplane, rather than a single line) to the observed data.

Marketing Budget (X1)   Yearly Income (X2)   Insurance term amount (Y)
(in lakhs)              (in Lakhs)           (in Crs)
23                      4                    5
64                      40                   15
67                      42                   16
45                      15                   10
19                      32                   4.5
56                      23                   12
65                      34                   15.5
35                      3                    5.6
45                      4                    4.9
24                      5                    5.01
46                      12                   10
76                      32                   15.2
45                      14                   11
38                      7                    7
34                      1                    6
32                      5.5                  5.5
56                      28                   12

Equation:
Insurance term = β1 × Marketing budget (X1) + β2 × Yearly Income (X2) + β0
MLR General Eq. with 'n' variables:

Y = β0 + β1*X1 + β2*X2 + β3*X3 + … + βn*Xn

where,

Y = Expected output variable

X1, X2, …, Xn = 'n' independent variables

β0, β1, …, βn = coefficients of the independent variables, with the constant (intercept) β0
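A minimal sketch (not from the slides) fitting β0, β1, β2 to the table above with numpy's least squares:

```python
import numpy as np

# Data from the table: X1 = marketing budget, X2 = yearly income, Y = insurance term amount
x1 = np.array([23, 64, 67, 45, 19, 56, 65, 35, 45, 24,
               46, 76, 45, 38, 34, 32, 56], dtype=float)
x2 = np.array([4, 40, 42, 15, 32, 23, 34, 3, 4, 5,
               12, 32, 14, 7, 1, 5.5, 28])
y = np.array([5, 15, 16, 10, 4.5, 12, 15.5, 5.6, 4.9, 5.01,
              10, 15.2, 11, 7, 6, 5.5, 12])

# Design matrix with a column of ones for the intercept β0
X = np.column_stack([np.ones_like(x1), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"β0={beta[0]:.3f}, β1={beta[1]:.3f}, β2={beta[2]:.3f}")
```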
Adjusted R-square
• The adjusted R² statistic penalizes the analyst for adding terms to the model

• It can help guard against overfitting (including regressors that are not really useful)

• Particularly for small N, and where results are to be generalized, pay more attention to adjusted R²

• Adjusted R² is used for estimating the explained variance in a population

• Adjusted R-squared is therefore a better metric than R-squared for assessing how well the model fits the data
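The slides do not give the formula; the standard one is Adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1) for n observations and p predictors. A small Python sketch:

```python
def adjusted_r_squared(r2: float, n: int, p: int) -> float:
    """Adjusted R^2 for n observations and p predictors (standard formula)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# E.g. the SLR example: R^2 = 0.87 with n = 17 rows and p = 1 predictor
print(round(adjusted_r_squared(0.87, n=17, p=1), 3))   # ≈ 0.861
```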
MLR Framework
• Business Objective → Data Preparation → Exploratory Data Analysis → Variable Selection for model → Final Model
• Variable selection addresses Multicollinearity, checked via the Variance Inflation Factor
Stepwise Variable Selection
• Start with no variables in the model
• Add variables one by one, checking the p-value at each step
• Check the p-values of the variables already in the model
• Eliminate any variable whose p-value is larger than the threshold (a sketch of this loop follows below)
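A minimal Python sketch of the elimination step, assuming statsmodels and pandas; the function name backward_eliminate and the 0.05 threshold are illustrative choices, not from the slides:

```python
import pandas as pd
import statsmodels.api as sm

def backward_eliminate(X: pd.DataFrame, y, threshold: float = 0.05) -> list:
    """Repeatedly drop the predictor with the largest p-value above threshold."""
    cols = list(X.columns)
    while cols:
        model = sm.OLS(y, sm.add_constant(X[cols])).fit()
        pvals = model.pvalues.drop("const")   # p-values of predictors only
        worst = pvals.idxmax()
        if pvals[worst] <= threshold:
            break                             # all remaining predictors significant
        cols.remove(worst)                    # eliminate the least significant one
    return cols

# Hypothetical usage: kept = backward_eliminate(df[["X1", "X2"]], df["Y"])
```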
Multicollinearity
• It is a state of very high inter-correlations or inter-associations among the independent variables (i.e. X1, X2, X3, etc.)

• Say there are three variables in the dataset, X1, X2, and X3, and there is strong correlation between them

• What will be the effect of this? How do we handle this situation?


Dummy Variables
• Independent variables can be categorical variables, for example:
  – Gender
  – Brand of laptop
  – Nationality

• Since algorithms expect numerical values in independent variables, these need to be encoded

• What can be a problem if, for example, laptop brands are coded as follows: HP=1, Dell=2, Lenovo=3, Asus=4, Acer=5?
Dummy Variables
• The correct way to encode categorical variables is by using dummy variables

• A dummy variable takes on the values 1 and 0 only

• If a categorical variable has n possible values, then create n-1 dummy variables

• For example, if Laptop brand can take the following 5 values: HP, Dell, Lenovo, Asus, Acer; then create 4 dummy variables: Brand_HP, Brand_Dell, Brand_Lenovo, Brand_Asus. Each of these variables can take the value 0 or 1 (see the sketch below)
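A minimal pandas sketch (not from the slides); with drop_first=True, get_dummies keeps n-1 dummies, with the dropped category acting as the baseline:

```python
import pandas as pd

# Hypothetical column of laptop brands
df = pd.DataFrame({"Brand": ["HP", "Dell", "Lenovo", "Asus", "Acer", "HP"]})

# drop_first=True keeps n-1 dummies (Acer, first alphabetically, is the baseline)
dummies = pd.get_dummies(df["Brand"], prefix="Brand", drop_first=True)
print(dummies)   # columns: Brand_Asus, Brand_Dell, Brand_HP, Brand_Lenovo
```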
Why Multicollinearity is a problem?
• Multicollinearity has no impact on the predictive power of the model as a whole, but it affects the calculations for individual predictors

• Parameter estimates may change erratically for small changes in the model or data, making the estimates highly unstable

• This instability increases the variance of the estimates: a small change in the data for the independent variables produces large changes in the coefficient estimates
Multicollinearity –
Variance Inflation Factor

• The VIF of an independent variable is calculated by regressing that variable on the other independent variables: VIF_j = 1 / (1 - R²_j), where R²_j is the R-squared of that regression

• The variable with the higher VIF, and/or the variable with lower significance from a business point of view, can be removed

• The VIF threshold depends on the business problem; a VIF greater than 2, 5, or 10 is commonly used as a threshold (see the sketch below)
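A minimal sketch using statsmodels' variance_inflation_factor on the MLR data from the earlier table (the two-predictor setup is illustrative):

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# X1 = marketing budget, X2 = yearly income (from the MLR table)
X = pd.DataFrame({
    "X1": [23, 64, 67, 45, 19, 56, 65, 35, 45, 24, 46, 76, 45, 38, 34, 32, 56],
    "X2": [4, 40, 42, 15, 32, 23, 34, 3, 4, 5, 12, 32, 14, 7, 1, 5.5, 28],
})
Xc = sm.add_constant(X)   # VIF is computed on a design matrix with an intercept

# VIF_j = 1 / (1 - R^2_j), regressing column j on the remaining columns
for i, col in enumerate(Xc.columns):
    if col != "const":
        print(col, round(variance_inflation_factor(Xc.values, i), 2))
```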
P-value significance
• Remove the variables with higher p-values, in order of their insignificance

• If the final model still has a large number of variables, the criterion for keeping a variable in the model can be made stricter, e.g. p < 0.02 or 0.03
Assumptions
Top-1:
• The regression model is linear in parameters
  o For example, the equation below is linear in the parameters:

    Y = β0 + β1*X1² + β2*X2

• Linearity means that the coefficients of the variables enter linearly, not that the variables themselves must be linear

• A non-linear equation is one in which the coefficients do not enter linearly (see the contrast below)
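For contrast, here is an illustrative side-by-side (not from the original slides) in LaTeX notation:

```latex
% Linear in parameters: valid for linear regression, even though X_1 enters squared
Y = \beta_0 + \beta_1 X_1^2 + \beta_2 X_2

% Non-linear in parameters: cannot be fit directly by ordinary linear regression
Y = \beta_0 + \beta_1^2 X_1 \qquad \text{or} \qquad Y = \beta_0 + e^{\beta_1 X_1}
```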
Assumptions
Top-2:
• Normality of residuals, i.e. the residuals should follow a bell-curve (normal) distribution with zero mean

• Detect using a plot of the residuals (a sketch follows below)

• Possible cause of non-normal residuals: a missing X variable
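A minimal sketch (assuming scipy and matplotlib) drawing a normal Q-Q plot of the residuals from the SLR fit; points close to the reference line suggest approximately normal residuals:

```python
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

# Residuals from the SLR fit (recomputed here so the sketch is self-contained)
x = np.array([23, 64, 67, 45, 19, 56, 65, 35, 45, 24,
              46, 76, 45, 38, 34, 32, 56], dtype=float)
y = np.array([5, 15, 16, 10, 4.5, 12, 15.5, 5.6, 4.9, 5.01,
              10, 15.2, 11, 7, 6, 5.5, 12])
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# Q-Q plot: points should fall close to the line if residuals are normal
stats.probplot(residuals, dist="norm", plot=plt)
plt.show()
```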


Assumptions
Top-3:
• Homoscedasticity of residuals, i.e. equal variance

• The opposite of homoscedasticity is heteroscedasticity: the error variance changes in a systematic pattern with changes in the X value

• Meaning that the error variance itself changes with the value of X

• An appropriate transformation can eliminate the heteroscedasticity problem
Assumptions
Top-3 (continued):
• Detection of heteroscedasticity
  o The best way to detect heteroscedasticity is to plot the squared residuals against the independent variable and check whether the plot exhibits any pattern

  o For multiple linear regression, plot the squared residuals against each of the independent variables

  o If the number of independent variables is large, plot the squared residuals against the predicted output variable (see the sketch below)
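A minimal matplotlib sketch (not from the slides) plotting squared residuals against the predicted values from the SLR fit:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([23, 64, 67, 45, 19, 56, 65, 35, 45, 24,
              46, 76, 45, 38, 34, 32, 56], dtype=float)
y = np.array([5, 15, 16, 10, 4.5, 12, 15.5, 5.6, 4.9, 5.01,
              10, 15.2, 11, 7, 6, 5.5, 12])
slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept

# Squared residuals vs predicted values: any visible pattern (e.g. a funnel
# shape) suggests heteroscedasticity; a flat random cloud suggests equal variance
plt.scatter(y_hat, (y - y_hat) ** 2)
plt.xlabel("Predicted insurance term amount")
plt.ylabel("Squared residual")
plt.show()
```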
Assumptions
Top-4:
• No multicollinearity among the independent variables
• Multicollinear variables should be removed from the final model
• The VIF (variance inflation factor) is used to detect, and guide the removal of, multicollinearity
Thank you
