
Correlation

Dr Shirish Jeble
ICFAI Business School, Pune

Definitions
• Dependent variable / Response variable / Predicted variable / Outcome variable
• Independent variable / Explanatory variable / Predictor variable
• Correlation
• Spurious Correlation

Correlation
• Correlation is a statistical measure of the association between two random variables

Correlation
• Correlation helps in identifying the variables to be used in model building
• Correlation helps in eliminating variables which may cause multi-collinearity
• The value of the correlation coefficient lies between -1 and +1
• A positive value of r indicates positive correlation (as the value of x increases, the value of y also increases)
• r² = R² (the coefficient of determination in simple regression)
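As a quick illustration, the correlation coefficient and its square can be computed in R; this is a minimal sketch, and the vectors x and y are hypothetical:

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
cor(x, y)     # Pearson correlation r, always between -1 and +1
cor(x, y)^2   # r^2, which equals R^2 from the simple regression of y on x
```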
Simple and Multiple
Regression

Dr Shirish Jeble
ICFAI Business School, Pune

Master Contents
Video 1
• Types of Regression
• Business Applications of Regression
Video 2
• Simple Regression
• Multiple Regression
• Non-Linear Regression
• Assumptions in Regression
Video 3
• Variable Selection Approaches
Video 4
• Handling Qualitative Variables
• Preprocessing of Data
Video 5
• Stages of project using regression
• Data Issues in Regression
– Multi-collinearity
– Heteroscedasticity
– Auto-Correlation

Types of Regression
S1/5

Dr Shirish Jeble
ICFAI Business School, Pune

Contents
• Types of Regression
• Business Applications of Regression

Types of Regression
• Linear Regression
– Simple Regression (dependent variable is continuous)
– Multiple Regression
• Non-Linear Regression
– Logistic Regression (dependent variable takes Boolean values: Yes/No, 1/0)
– Polynomial Regression (e.g. Y = m1x² + m2x³ + c, Y = Log(x), Y = SQRT(x))
• Multivariate Regression
– Multiple dependent variables (Structural Equation Modelling)
Applications of Regression
Estimate Salary for an
experienced candidate

[Diagram: Experience, Education and Communication skills as inputs → Salary]
Predict Sales for the next year, quarter or month

[Diagram: Features, Quality, Advertisement and Customer Satisfaction as inputs → Product Sales]
Estimate Car Insurance Premium
for New Customers

[Diagram: Car Model, Driver's driving experience, Miles driven on car, Value of Car, Driver's Age and Risk profile of area as inputs → Insurance Premium]
Thank You
Simple, Multiple and Non-linear Regression
S2/5
Dr Shirish Jeble
ICFAI Business School, Pune

Contents
• What is regression
– Simple Regression
– Multiple Regression
– Non-Linear Regression
• R-Square and Adj. R-Squared
• Assumptions in Regression

• Regression establishes an association between two variables, not a causal relationship
Higher level of relationship → higher correlation and higher R-squared
[Scatter plot: Sales vs. Money Spent on TV Advertisement; Correlation = 0.80]
Lower level of relationship → low correlation and lower R-squared
[Scatter plot: Sales vs. Money Spent on Radio Advertisement; Correlation = 0.23]
Regression line
Minimise the sum of squared errors: min Σ eᵢ²
[Plot: actual values as points, predicted values on the fitted line, with residuals e1 … e5 between them]
Simple Regression
X1 = Advertisement
Y = Sales

Y = mX + c + e
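A minimal sketch of fitting this in R, assuming a data frame ads with columns Sales and Advertisement (hypothetical names):

```r
# Fit Y = mX + c by least squares; lm() estimates the slope m and intercept c
fit <- lm(Sales ~ Advertisement, data = ads)
summary(fit)   # slope, intercept, p-values, R-squared
```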
Multiple Regression
Money spent by a customer on an ecommerce site (Y), modelled on:
X1 = Income, X2 = Family Size, X3 = Age

Y = m1X1 + m2X2 + m3X3 + … + miXi + c
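In R the same lm() call extends to several predictors; a sketch with hypothetical data frame and column names:

```r
# Y = m1*X1 + m2*X2 + m3*X3 + c
fit <- lm(Spend ~ Income + FamilySize + Age, data = customers)
coef(fit)   # estimated m1, m2, m3 and the intercept c
```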
Non-Linear Regressions
• Y = m1x1² + c
• Y = m1x1² + m2x2³ + c
• Y = SQRT(x1)
• Y = Log(x1)
• Y = 1/x
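These forms can still be fitted with lm() by transforming the predictors inside the formula; a sketch with hypothetical df, x1 and x2:

```r
lm(Y ~ I(x1^2), data = df)             # Y = m1*x1^2 + c
lm(Y ~ I(x1^2) + I(x2^3), data = df)   # Y = m1*x1^2 + m2*x2^3 + c
lm(Y ~ sqrt(x1), data = df)            # Y = m*SQRT(x1) + c
lm(Y ~ log(x1), data = df)             # Y = m*Log(x1) + c
lm(Y ~ I(1/x1), data = df)             # Y = m*(1/x1) + c
```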
R² and Adjusted R²
Assumptions in Regression
Assumptions in Regression
1. The error term should be normally distributed
2. The variance of the error term should not change with the predicted value of the dependent variable (no heteroscedasticity)
3. For time series: successive values of the error term should be independent (no auto-correlation)
Thank You
Steps for Regression
• Open the data file in R: read.csv()
• Fit the linear regression using the R command lm()
• Define the regression equation using the dependent variable (Y = mX + c)
• Draw a scatter plot between X and Y
• Check for normality of errors
– Draw a histogram of the residuals
– Draw qqnorm(residual) and qqline(residual)
– Shapiro-Wilk test
• Use the model to predict values of Y when X is known (see the sketch below)
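A minimal end-to-end sketch of these steps, assuming a file data.csv with columns X and Y (hypothetical names):

```r
df  <- read.csv("data.csv")                 # open the data file
fit <- lm(Y ~ X, data = df)                 # fit Y = mX + c
plot(df$X, df$Y); abline(fit)               # scatter plot with fitted line
res <- residuals(fit)
hist(res)                                   # histogram of residuals
qqnorm(res); qqline(res)                    # Q-Q plot of residuals
shapiro.test(res)                           # Shapiro-Wilk normality test
predict(fit, newdata = data.frame(X = 25))  # predict Y for a known X
```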

Variable Selection
Approaches in Regression
S3/5
Dr Shirish Jeble
ICFAI Business School, Pune

Variable Selection Approaches
• Forward Selection
• Backward Elimination
• Stepwise Regression
Y = f(X1, X2, X3, X4, …)

Forward Selection

Forward Selection
1. Find out correlations between Y and all Xs
X1 X2 X3 X4 X5
Y 0.23 0.09 0.87 0.03 0.63

Forward Selection
1. Find out the correlations between Y and all Xs
     X1    X2    X3    X4    X5
Y    0.23  0.09  0.87  0.03  0.63
2. Iteration 1 – Model 1
A. Select the variable with the highest correlation (X3)
B. Run regression model 1 (Y = β1X3 + c)
C. Get the p-value for X3, the F-statistic and R-square
D. If p-value < 0.05, retain X3 in the model
E. Once in the equation, the variable remains there
3. Iteration 2 – Model 2
A. Select the variable with the next highest correlation (X5)
B. Run regression model 2
C. Get the p-value for X5, the F-statistic and R-square
D. If p-value < 0.05, retain X5 in the model
4. Iterations 3, 4 and 5 – Models 3, 4 and 5
A. Repeat the process for the rest of the variables in order of decreasing correlation: X1, X2, X4
B. Retain a variable in the model only if its p-value < 0.05
Forward Selection
Iteration 1 (Model 1): X3
Iteration 2 (Model 2): X3 + X5
Iteration 3 (Model 3): X3 + X5 + X1
Iteration 4 (Model 4): X3 + X5 + X1 + X2
Iteration 5 (Model 5): X3 + X5 + X1 + X2 + X4

• A variable is retained in the next model only if it is significant (p-value < 0.05)
• Once a variable is added, it remains in the model

     X1    X2    X3    X4    X5
Y    0.23  0.09  0.87  0.03  0.63
Forward Selection
Iteration 1 (Model 1): X3
Iteration 2 (Model 2): X3 + X5
Iteration 3 (Model 3): X3 + X1
Iteration 4 (Model 4): X3 + X1 + X2
Iteration 5 (Model 5): X3 + X1 + X2 + X4

X5 (p-value = 0.09) is dropped from Model 3 onwards, as its p-value is NOT less than 0.05 (see the sketch below)
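A minimal sketch of this procedure as an R function; the data frame df and the variable names are hypothetical, and p-values are read from summary(lm(...)):

```r
forward_select <- function(df, response, candidates, alpha = 0.05) {
  # Order candidate predictors by absolute correlation with the response
  cors <- sapply(candidates, function(v) abs(cor(df[[response]], df[[v]])))
  candidates <- candidates[order(cors, decreasing = TRUE)]
  selected <- character(0)
  for (v in candidates) {
    fit <- lm(reformulate(c(selected, v), response = response), data = df)
    p <- summary(fit)$coefficients[v, "Pr(>|t|)"]
    if (p < alpha) selected <- c(selected, v)  # retain only if significant
  }
  lm(reformulate(selected, response = response), data = df)  # final model
}

# Usage (hypothetical): forward_select(df, "Y", c("X1","X2","X3","X4","X5"))
```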
Backward Elimination

Backward Elimination
• Add all variables X1 to Xi to the model
• Model 1: run the regression with all i variables
– Remove the variable with the highest p-value
• Model 2: run the regression with the remaining i-1 variables
– Remove the next variable with the highest p-value
• Repeat (up to i models) until all remaining variables are significant (see the sketch below)
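A minimal sketch in R, under the same assumptions (hypothetical data frame df and variable names):

```r
backward_eliminate <- function(df, response, predictors, alpha = 0.05) {
  repeat {
    fit <- lm(reformulate(predictors, response = response), data = df)
    p <- summary(fit)$coefficients[-1, "Pr(>|t|)", drop = FALSE]  # skip intercept
    if (max(p) <= alpha) return(fit)   # all remaining variables significant
    predictors <- setdiff(predictors, rownames(p)[which.max(p)])  # drop worst
  }
}

# Usage (hypothetical): backward_eliminate(df, "Y", paste0("X", 1:6))
```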
Backward Elimination
Model 1 (p-values): X1 (0.04), X2 (0.06), X3 (0.03), X4 (0.11), X5 (0.07), X6 (0.08)
Model 1 fit: R² = 0.95, Adj. R² = 0.75, F-test p-value = 0.003
→ Remove X4 (largest p-value)

Remove the variable with the largest p-value if it is > 0.05
Backward Elimination
Model 1 (p-values): X1 (0.04), X2 (0.06), X3 (0.03), X4 (0.11), X5 (0.07), X6 (0.08) → Remove X4
Model 2 (p-values): X1 (0.06), X2 (0.07), X3 (0.20), X5 (0.33), X6 (0.07) → Remove X5

Remove the variable with the largest p-value if it is > 0.05
Backward Elimination
Model 1 (p-values): X1 (0.04), X2 (0.06), X3 (0.03), X4 (0.11), X5 (0.07), X6 (0.08) → Remove X4
Model 2 (p-values): X1 (0.06), X2 (0.07), X3 (0.20), X5 (0.33), X6 (0.07) → Remove X5
Model 3 (p-values): X1 (0.06), X2 (0.07), X3 (0.20), X6 (0.07) → Remove X3
Model 4 (p-values): X1 (0.03), X2 (0.10), X6 (0.07) → Final model

Remove the variable with the largest p-value if it is > 0.05
Stepwise Regression

Stepwise Regression
1. Find out the correlations between Y and all Xs
     X1    X2    X3    X4    X5
Y    0.23  0.09  0.87  0.03  0.63
2. Decide the α and β values for the regression (say we select α = 0.05 and β = 0.05)
Stepwise Regression
1. Find out the correlations between Y and all Xs
2. Iteration 1 – Model 1
A. Select the variable with the highest correlation (X3)
B. Run regression model 1 (Y = β1X3 + c)
C. Get the p-value for X3, the F-statistic and R-square
D. If p-value <= 0.05, retain X3 in the model
E. If p-value > 0.05, remove X3 from the model
3. Iteration 2 – Model 2
A. Select the variable with the next highest correlation (X5)
B. Run regression model 2
C. Get the p-value for X5, the F-statistic and R-square
D. If p-value <= 0.05, retain X5 in the model
E. If the p-value of the earlier variable X3 has become > 0.05, remove X3 from the model; otherwise X3 continues in the model (see the sketch below)

Correlation:   X1    X2    X3    X4    X5
Y              0.23  0.09  0.87  0.03  0.63
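R's built-in step() performs stepwise selection, though it adds and drops terms by AIC rather than by the α and β p-value thresholds above; shown here for contrast, with a hypothetical data frame df:

```r
null <- lm(Y ~ 1, data = df)                            # intercept-only model
full <- lm(Y ~ X1 + X2 + X3 + X4 + X5, data = df)       # all candidate variables
step(null, scope = formula(full), direction = "both")   # stepwise search by AIC
```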
Thank You
Pre-Processing of Data
(Before Regression)
S4/5
Dr Shirish Jeble
ICFAI Business School, Pune

Handling Qualitative Variables

Examples of Qualitative Variables
• Gender (Male, Female)
• Quarter (Q1, Q2, Q3, Q4)
• Customer Satisfaction (satisfied, not satisfied)
• Age in years, converted to age categories (10-20, 20-30, 30-40, etc.)
• Income (lacs), converted to income groups (5-10 lacs, 10-15 lacs, 15-20 lacs, etc.)
• Country: IND, AUS, ENG, USA, JPN, etc.
Gender as a Qualitative Variable
Codification: Male = 0, Female = 1

Record #   Gender
1          1
2          0
3          0
4          1

Salaries = β1 * Gender + β2 * Age + c + ε
Quarter as a Qualitative Variable

#   Sales (Y)   Advertisement   Date         Qtr
1   10000       1098            3/1/2019     1
2   12003       1133            20/3/2019    1
3   15090       1480            6/4/2019     2
4   13001       1294            20/6/2019    2
5   17300       1630            18/7/2019    3
6   15980       1490            1/9/2019     3
7   11515       1011            30/11/2019   4
8   18764       1766            15/12/2019   4

Sales = β1 * Advt + β2 * Qtr + c + ε

Sales = β1 * Advt + β2 * 1 + c + ε … for Qtr 1
Sales = β1 * Advt + β2 * 2 + c + ε … for Qtr 2
Sales = β1 * Advt + β2 * 3 + c + ε … for Qtr 3
Sales = β1 * Advt + β2 * 4 + c + ε … for Qtr 4
Codify Quarter

#   Date         Qtr   Q1   Q2   Q3   Q4
1   3/1/2019     1     1    0    0    0
2   20/3/2019    1     1    0    0    0
3   6/4/2019     2     0    1    0    0
4   20/6/2019    2     0    1    0    0
5   18/7/2019    3     0    0    1    0
6   1/9/2019     3     0    0    1    0
7   30/11/2019   4     0    0    0    1
8   15/12/2019   4     0    0    0    1

Sales = β1 * Advt + β2 * Qtr + c + ε
Codify Quarter

#   Date         Qtr   Q1   Q2   Q3
1   3/1/2019     1     1    0    0
2   20/3/2019    1     1    0    0
3   6/4/2019     2     0    1    0
4   20/6/2019    2     0    1    0
5   18/7/2019    3     0    0    1
6   1/9/2019     3     0    0    1
7   30/11/2019   4     0    0    0
8   15/12/2019   4     0    0    0

Sales = β1 * Advt + β2 * Q1 + β3 * Q2 + β4 * Q3 + c + ε

Quarter 4 is implied whenever Q1, Q2, Q3 = 0 (see the sketch below)
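In R a factor() column makes lm() create the k-1 dummies automatically; a sketch with a hypothetical sales_df, noting that R's default baseline is the first level (Q1), whereas the slide uses Q4 as the implied quarter:

```r
sales_df$Qtr <- factor(sales_df$Qtr, levels = 1:4,
                       labels = c("Q1", "Q2", "Q3", "Q4"))
fit <- lm(Sales ~ Advt + Qtr, data = sales_df)  # dummies Q2, Q3, Q4 vs. Q1 baseline
# To reproduce the slide's coding (Q4 implied), make Q4 the reference level:
sales_df$Qtr <- relevel(sales_df$Qtr, ref = "Q4")
```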
Age

#   Age   10-20   20-30   30-40   40-50   50-60
1   10    1       0       0       0       0
2   25    0       1       0       0       0
3   47    0       0       0       1       0
4   33    0       0       1       0       0
5   15    1       0       0       0       0
6   58    0       0       0       0       1

The 50-60 category is not required explicitly (it is implied when all other dummies are 0)

Age converted from a continuous variable to a categorical variable
Adding Calculation Fields in Data

Adding Calculation Fields in Data
• Given data for unit-price and qty-sold
• Add one more column, line-total (datatype = currency)
• Add the formula (unit-price * qty-sold)
• Update all records (see the sketch below)
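In R this is a one-liner; orders, unit_price and qty_sold are hypothetical names:

```r
# Add a computed column: line total = unit price * quantity sold
orders$line_total <- orders$unit_price * orders$qty_sold
```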
Interaction Variables

Interaction Variables
Model 1: Y ~ X1 + X2 + X3
Model 2: Y ~ X1 + X2 + X3 + (X1 * X2)
Model 3: Y ~ X1 + X2 + X3 + (X1 * X2) + (X2 * X3)
Model 4: Y ~ X1 + X2 + X3 + (X1 * X2) + (X2 * X3) + (X1 * X3)

Compare R², Adj. R² and the F-statistic across the models
Interaction Variables

Bowler   Batsman   Allrounder   SR-B     RUNS-S
0        0         1            0.00     0
1        0         0            0.00     0
1        0         0            121.01   167
1        0         0            76.32    58
0        1         0            120.71   1317
0        1         0            95.45    63
1        0         0            72.22    26
1        0         0            165.88   21
0        0         1            114.73   335

Batsman and Batting Strike Rate interact to form a new variable (Batsman * SR-B);
Batsman and the number of runs scored by a player form (Batsman * RUNS-S); see the sketch below
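In an lm() formula the : operator builds exactly such product terms; a sketch with a hypothetical data frame ipl_df and response Y:

```r
# Interaction terms Batsman*SR_B and Batsman*RUNS_S alongside the main effects
fit <- lm(Y ~ Batsman + SR_B + RUNS_S + Batsman:SR_B + Batsman:RUNS_S,
          data = ipl_df)
# Note: Batsman*SR_B is shorthand for Batsman + SR_B + Batsman:SR_B
```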
Interaction Variables
Which interaction variables would be applicable for a Bowler?
Outliers in Data
[Plot of the residuals for each point, with bands at +2 Std Error and -2 Std Error; a point outside the bands is an outlier]
How to detect Outliers
• Boxplot of residuals
• Plot the residuals for all observations
– Residual points greater than 2 SE (in absolute value) are outliers
• Mahalanobis distance > 10 (in R)
• Cook's distance > 1 (in R; see the sketch below)
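A sketch of these checks in R for a hypothetical fitted model fit (the Mahalanobis step assumes more than one predictor):

```r
res <- rstandard(fit)                 # standardised residuals
which(abs(res) > 2)                   # points beyond +/- 2 standard errors
which(cooks.distance(fit) > 1)        # Cook's distance rule of thumb
X  <- model.matrix(fit)[, -1]         # predictor matrix without the intercept
md <- mahalanobis(X, colMeans(X), cov(X))
which(md > 10)                        # Mahalanobis distance rule of thumb
```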
Thank You
Stages and Data Issues in
Regression Models
S5/5
Dr Shirish Jeble
ICFAI Business School, Pune

Stages of Regression Project

Pre-Process
1. Data Collection
2. Pre-processing of Data (codification, data cleansing)
3. Selecting Variables (Forward Selection, Backward Elimination, etc.)

Regression
4. Develop Regression Model (using training data)
5. Check Model Diagnostics (F-test, t-test, R², normality of residuals)
6. Remove Outliers, if any
7. Check for Data Issues
8. Validate Model (using validation data)

Deployment
9. Deploy Model
5. Model Diagnostics
• F-test: check the p-value of the overall F-statistic (should be < 0.05)
• t-tests: check the statistical significance of the independent variables X1, X2, X3, …, Xi (p-value < 0.05)
• Check normality of residuals
Checking normality of Residuals
[Box plot and histogram of the residuals]
5. Model Diagnostics
• F-test: check the p-value of the overall F-statistic (should be < 0.05)
• t-tests: check the statistical significance of the independent variables X1, X2, X3, …, Xi (p-value < 0.05)
• Check normality of residuals
– Draw a boxplot to get a general idea of the residuals
– Draw a Q-Q plot of the residuals using qqnorm(residual) and qqline(residual)
– Check skewness and kurtosis of the residuals
• Explanatory power of the model
– Check the values of R² and Adj. R² (see the sketch below)
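A sketch of pulling these diagnostics from a fitted model in R (fit is hypothetical; skewness and kurtosis assume the e1071 package is installed):

```r
s <- summary(fit)
s$coefficients                       # t-tests: Pr(>|t|) per variable
s$r.squared; s$adj.r.squared         # explanatory power
pf(s$fstatistic[1], s$fstatistic[2], s$fstatistic[3],
   lower.tail = FALSE)               # p-value of the overall F-test
r <- residuals(fit)
boxplot(r); qqnorm(r); qqline(r)     # normality checks
e1071::skewness(r); e1071::kurtosis(r)
```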
6. Check for Outliers
– Mahalanobis distance (when greater than 10, it is an outlier)
– Cook's distance (when greater than 1, it is an outlier)
– Plot the residuals on a scatter plot; all points beyond +2 SE or below -2 SE are outliers
7. Data Issues in Regression

Assumptions
1. The error term should be normally distributed
2. The variance of the error term should not change with the predicted value of the dependent variable (no heteroscedasticity)
3. For time series: successive values of the error term should be independent (no auto-correlation)
Multi-Collinearity

      X1     X2     X3
X1    -      0.30   0.90
X2    0.20   -      0.20
X3    0.90   0.23   -

• The correlation between two independent variables is very high (here X1 and X3)
• Check the Variance Inflation Factor (VIF); VIF > 5 indicates the presence of multi-collinearity (see the sketch below)
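A sketch of both checks in R; vif() comes from the car package (assumed installed), and the data frame and column names are hypothetical:

```r
cor(df[, c("X1", "X2", "X3")])   # pairwise correlation matrix of predictors
library(car)
vif(fit)                         # VIF > 5 suggests multi-collinearity
```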
Heteroscedasticity
[Scatter plot: ABS(Residuals) vs. Predicted Food Spending, showing a fanning pattern]
Plot a scatterplot between the predicted value of Y and ABS(residual)
Remove Heteroscedasticity

Income          Food Spending   Log(Food Spending)
(independent)   (dependent)     (transformed dependent)
$74,201.00      $9,646.13       3.984353
$41,659.00      $8,331.80       3.920739
$44,085.00      $9,698.70       3.986714
$63,529.00      $10,799.93      4.033421

Use Log(Y) or SQRT(Y) to remove heteroscedasticity (see the sketch below)
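A sketch of the check and the log remedy in R, assuming a data frame df with Income and FoodSpending columns (hypothetical names; the table's values look like base-10 logs):

```r
fit1 <- lm(FoodSpending ~ Income, data = df)
plot(fitted(fit1), abs(residuals(fit1)))   # fanning pattern => heteroscedasticity
fit2 <- lm(log10(FoodSpending) ~ Income, data = df)  # transformed dependent
plot(fitted(fit2), abs(residuals(fit2)))   # pattern should flatten out
```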
Heteroscedasticity removed
[Scatter plot: ABS(Residuals) vs. Predicted Log(Food Spending); the fanning pattern is gone]
Auto-Correlation
Positive autocorrelation: the residuals show few sign changes
[Residual plot with 2 sign changes, n = 14]
• If the number of sign changes on the residuals plot is less than (n-1)/2 - √(n-1), positive autocorrelation exists
• For n = 14: (n-1)/2 - √(n-1) = 2.89, so roughly 3 or more sign changes are required for "no autocorrelation"
• With only 2 sign changes, positive autocorrelation exists in the above data
Auto-Correlation
Negative autocorrelation: the residuals show many sign changes
[Residual plot with 13 sign changes, n = 14]
• If the number of sign changes on the residuals plot is greater than (n-1)/2 + √(n-1), negative autocorrelation exists
• For n = 14: (n-1)/2 + √(n-1) = 10.1, so roughly 10 or fewer sign changes are required for "no autocorrelation"
• With 13 sign changes, negative autocorrelation exists in the above data
Auto-correlation
• Definition: high correlation between e_t and e_(t-1)
• How to detect:
– Option 1: draw a plot of the residuals and count the number of sign changes (applicable only for time-series data)
• Few sign changes: positive autocorrelation
• Many sign changes: negative autocorrelation
• If the number of sign changes is less than (n-1)/2 - √(n-1), positive autocorrelation exists
• If the number of sign changes is greater than (n-1)/2 + √(n-1), negative autocorrelation exists
– Option 2: Durbin-Watson test (see the sketch below)
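For Option 2, the Durbin-Watson test is available in R via the lmtest package (assumed installed); fit is a hypothetical fitted model:

```r
library(lmtest)
dwtest(fit)   # DW near 2: no autocorrelation; < 2: positive; > 2: negative
```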
Remedy for Auto-correlation
• Find the correlation ρ between e_t and e_(t-1)
• Define new variables (Y_t - ρ*Y_(t-1)) and (X_t - ρ*X_(t-1)) for the regression
• Run the regression and check the number of sign changes (see the sketch below)
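A sketch of this remedy in R; the vectors Y and X and the fitted model fit are hypothetical:

```r
e   <- residuals(fit)
n   <- length(e)
rho <- cor(e[-1], e[-n])              # correlation between e_t and e_(t-1)
Y2  <- Y[-1] - rho * Y[-length(Y)]    # Y_t - rho*Y_(t-1)
X2  <- X[-1] - rho * X[-length(X)]    # X_t - rho*X_(t-1)
fit2 <- lm(Y2 ~ X2)                   # re-run and re-check the sign changes
```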
Checking for Data Issues

Multicollinearity
• How to identify: the correlation between two independent variables is very high; check the Variance Inflation Factor (VIF should be < 4)
• Remedy: remove one of the variables, or use Principal Component Analysis

Heteroscedasticity
• How to identify: non-constant variance of the error term; draw a plot between the predicted value of the dependent variable Y and ABS(residual)
• Remedy: run the regression between Log(Y) or SQRT(Y) and the Xi

Autocorrelation (for time series only)
• How to identify: high correlation between e_t and e_(t-1); draw a plot of the residuals and count the sign changes; fewer than (n-1)/2 - √(n-1) sign changes → positive autocorrelation; more than (n-1)/2 + √(n-1) → negative autocorrelation
• Remedy: find the correlation ρ between e_t and e_(t-1); define new variables (Y_t - ρ*Y_(t-1)) and (X_t - ρ*X_(t-1)); run the regression and check the number of sign changes

Outlier Analysis
• How to identify: Mahalanobis distance greater than 10; Cook's distance greater than 1; residual scatter plot points beyond +2 SE or below -2 SE
• Remedy: drop the outlier points from the data and re-run the regression
8. Validate the Model
• 80% of the available data: training set
• 20% of the available data: validation set
• Run the regression model on the training set
• Develop multiple models with the variable selection approach
• Run all diagnostics and arrive at the best final model (Y = m1X1 + m2X2 + m3X3 + c), i.e. the model with the highest R-square and significant independent variables
• Apply this model to the validation set (20%)
• Check that the Std Error on the validation dataset < the Std Error on the training dataset
• If this is true, the model is ready to go to production: Deploy Model (see the sketch below)
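A sketch of the 80/20 split and the validation check in R; the data frame df and the column names are hypothetical:

```r
set.seed(42)
idx   <- sample(nrow(df), floor(0.8 * nrow(df)))   # random 80% for training
train <- df[idx, ]; valid <- df[-idx, ]            # remaining 20% for validation
fit   <- lm(Y ~ X1 + X2 + X3, data = train)
rmse  <- function(actual, pred) sqrt(mean((actual - pred)^2))
rmse(train$Y, fitted(fit))                # error on the training data
rmse(valid$Y, predict(fit, valid))        # error on the validation data
```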
Stages of Regression Project

Pre-Process
1. Data Collection
2. Pre-processing of Data (codification, data cleansing)
3. Selecting Variables (Forward Selection, Backward Elimination, etc.)

Regression
4. Develop Regression Model (using training data)
5. Check Model Diagnostics (F-test, t-test, R², normality of residuals)
6. Remove Outliers, if any
7. Check for Data Issues
8. Validate Model (using validation data)

Deployment
9. Deploy Model
Thank You
