
Correlation

Dr Shirish Jeble
ICFAI Business School, Pune

Definitions
• Dependent variable / Response variable / Predicted variable / Outcome variable
• Independent variable / Explanatory variable / Predictor variable
• Correlation
• Spurious Correlation

Correlation
• Correlation is a statistical measure of the association between two random variables

Correlation
• Correlation helps in identifying the variables to be used in model building
• Correlation helps in eliminating variables which may cause multi-collinearity
• The value of the correlation coefficient lies between -1 and +1
• A positive value of r indicates positive correlation (as the value of x increases, the value of y also increases)
• r² = R² (the coefficient of determination in simple regression)
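As a quick illustration, the correlation coefficient and its square can be computed in R; this is a minimal sketch, and the vectors x and y are hypothetical:

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
cor(x, y)     # Pearson correlation r, always between -1 and +1
cor(x, y)^2   # r^2, which equals R^2 from the simple regression of y on x
```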
Simple and Multiple
Regression

Dr Shirish Jeble
ICFAI Business School, Pune

Master Contents
Video 1
• Types of Regression
• Business Applications of Regression
Video 2
• Simple Regression
• Multiple Regression
• Non-Linear Regression
• Assumptions in Regression
Video 3
• Variable Selection Approaches
Video 4
• Handling Qualitative Variables
• Preprocessing of Data
Video 5
• Stages of project using regression
• Data Issues in Regression
– Multi-collinearity
– Heteroscedasticity
– Auto-Correlation

Types of Regression
S1/5

Dr Shirish Jeble
ICFAI Business School, Pune

Contents
• Types of Regression
• Business Applications of Regression

Types of Regression
• Linear Regression
– Simple Regression (dependent variable is continuous)
– Multiple Regression
• Non-Linear Regression
– Logistic Regression (dependent variable takes Boolean values: Yes/No, 1/0)
– Polynomial Regression (e.g. Y = m1x² + m2x³ + c, Y = Log(x), Y = SQRT(x))
• Multivariate Regression
– Multiple dependent variables (Structural Equation Modelling)
Applications of Regression
Estimate Salary for an
experienced candidate

[Diagram: Experience, Education and Communication skills as inputs → Salary]
Predict Sales for the next year, quarter or month

[Diagram: Features, Quality, Advertisement and Customer Satisfaction as inputs → Product Sales]
Estimate Car Insurance Premium
for New Customers

[Diagram: Car Model, Driver's driving experience, Miles driven on car, Value of Car, Driver's Age and Risk profile of area as inputs → Insurance Premium]
Thank You
Simple, Multiple and Non-linear Regression
S2/5
Dr Shirish Jeble
ICFAI Business School, Pune

Contents
• What is regression
– Simple Regression
– Multiple Regression
– Non-Linear Regression
• R-Square and Adj. R-Squared
• Assumptions in Regression

• Regression establishes an association between two variables, not a causal relationship
Higher level of relationship → higher correlation and higher R-squared
[Scatter plot: Sales vs. Money Spent on TV Advertisement; Correlation = 0.80]
Lower level of relationship → low correlation and lower R-squared
[Scatter plot: Sales vs. Money Spent on Radio Advertisement; Correlation = 0.23]
Regression line
Minimise the sum of squared errors: min Σ eᵢ²
[Plot: actual values as points, predicted values on the fitted line, with residuals e1 … e5 between them]
Simple Regression
X1 = Advertisement
Y = Sales

Y = mX + c + e
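A minimal sketch of fitting this in R, assuming a data frame ads with columns Sales and Advertisement (hypothetical names):

```r
# Fit Y = mX + c by least squares; lm() estimates the slope m and intercept c
fit <- lm(Sales ~ Advertisement, data = ads)
summary(fit)   # slope, intercept, p-values, R-squared
```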
Multiple Regression
Money spent by a customer on an ecommerce site (Y), modelled on:
X1 = Income, X2 = Family Size, X3 = Age

Y = m1X1 + m2X2 + m3X3 + … + miXi + c
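In R the same lm() call extends to several predictors; a sketch with hypothetical data frame and column names:

```r
# Y = m1*X1 + m2*X2 + m3*X3 + c
fit <- lm(Spend ~ Income + FamilySize + Age, data = customers)
coef(fit)   # estimated m1, m2, m3 and the intercept c
```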
Non-Linear Regressions
• Y = m1x1² + c
• Y = m1x1² + m2x2³ + c
• Y = SQRT(x1)
• Y = Log(x1)
• Y = 1/x
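These forms can still be fitted with lm() by transforming the predictors inside the formula; a sketch with hypothetical df, x1 and x2:

```r
lm(Y ~ I(x1^2), data = df)             # Y = m1*x1^2 + c
lm(Y ~ I(x1^2) + I(x2^3), data = df)   # Y = m1*x1^2 + m2*x2^3 + c
lm(Y ~ sqrt(x1), data = df)            # Y = m*SQRT(x1) + c
lm(Y ~ log(x1), data = df)             # Y = m*Log(x1) + c
lm(Y ~ I(1/x1), data = df)             # Y = m*(1/x1) + c
```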
R² and Adjusted R²
Assumptions in Regression
Assumptions in Regression
1. The error term should be normally distributed
2. The variance of the error term should not change with the predicted value of the dependent variable (no heteroscedasticity)
3. For time series: successive values of the error term should be independent (no auto-correlation)
Thank You
Steps for Regression
• Open the data file in R: read.csv()
• Fit the linear regression using the R command lm()
• Define the regression equation using the dependent variable (Y = mX + c)
• Draw a scatter plot between X and Y
• Check for normality of errors
– Draw a histogram of the residuals
– Draw qqnorm(residual) and qqline(residual)
– Shapiro-Wilk test
• Use the model to predict values of Y when X is known (see the sketch below)
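A minimal end-to-end sketch of these steps, assuming a file data.csv with columns X and Y (hypothetical names):

```r
df  <- read.csv("data.csv")                 # open the data file
fit <- lm(Y ~ X, data = df)                 # fit Y = mX + c
plot(df$X, df$Y); abline(fit)               # scatter plot with fitted line
res <- residuals(fit)
hist(res)                                   # histogram of residuals
qqnorm(res); qqline(res)                    # Q-Q plot of residuals
shapiro.test(res)                           # Shapiro-Wilk normality test
predict(fit, newdata = data.frame(X = 25))  # predict Y for a known X
```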

Variable Selection
Approaches in Regression
S3/5
Dr Shirish Jeble
ICFAI Business School, Pune

Variable Selection Approaches
• Forward Selection
• Backward Elimination
• Stepwise Regression
Y = f(X1, X2, X3, X4, …)

Forward Selection

Forward Selection
1. Find out correlations between Y and all Xs
X1 X2 X3 X4 X5
Y 0.23 0.09 0.87 0.03 0.63

Forward Selection
1. Find out the correlations between Y and all Xs
     X1    X2    X3    X4    X5
Y    0.23  0.09  0.87  0.03  0.63
2. Iteration 1 – Model 1
A. Select the variable with the highest correlation (X3)
B. Run regression model 1 (Y = β1X3 + c)
C. Get the p-value for X3, the F-statistic and R-square
D. If p-value < 0.05, retain X3 in the model
E. Once in the equation, the variable remains there
3. Iteration 2 – Model 2
A. Select the variable with the next highest correlation (X5)
B. Run regression model 2
C. Get the p-value for X5, the F-statistic and R-square
D. If p-value < 0.05, retain X5 in the model
4. Iterations 3, 4 and 5 – Models 3, 4 and 5
A. Repeat the process for the rest of the variables in order of decreasing correlation: X1, X2, X4
B. Retain a variable in the model only if its p-value < 0.05
Forward Selection
Iteration 1 (Model 1): X3
Iteration 2 (Model 2): X3 + X5
Iteration 3 (Model 3): X3 + X5 + X1
Iteration 4 (Model 4): X3 + X5 + X1 + X2
Iteration 5 (Model 5): X3 + X5 + X1 + X2 + X4

• A variable is retained in the next model only if it is significant (p-value < 0.05)
• Once a variable is added, it remains in the model

     X1    X2    X3    X4    X5
Y    0.23  0.09  0.87  0.03  0.63
Forward Selection
Iteration 1 (Model 1): X3
Iteration 2 (Model 2): X3 + X5
Iteration 3 (Model 3): X3 + X1
Iteration 4 (Model 4): X3 + X1 + X2
Iteration 5 (Model 5): X3 + X1 + X2 + X4

X5 (p-value = 0.09) is dropped from Model 3 onwards, as its p-value is NOT less than 0.05 (see the sketch below)
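A minimal sketch of this procedure as an R function; the data frame df and the variable names are hypothetical, and p-values are read from summary(lm(...)):

```r
forward_select <- function(df, response, candidates, alpha = 0.05) {
  # Order candidate predictors by absolute correlation with the response
  cors <- sapply(candidates, function(v) abs(cor(df[[response]], df[[v]])))
  candidates <- candidates[order(cors, decreasing = TRUE)]
  selected <- character(0)
  for (v in candidates) {
    fit <- lm(reformulate(c(selected, v), response = response), data = df)
    p <- summary(fit)$coefficients[v, "Pr(>|t|)"]
    if (p < alpha) selected <- c(selected, v)  # retain only if significant
  }
  lm(reformulate(selected, response = response), data = df)  # final model
}

# Usage (hypothetical): forward_select(df, "Y", c("X1","X2","X3","X4","X5"))
```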
Backward Elimination

Backward Elimination
• Add all variables X1 to Xi to the model
• Model 1: run the regression with all i variables
– Remove the variable with the highest p-value
• Model 2: run the regression with the remaining i-1 variables
– Remove the next variable with the highest p-value
• Repeat (up to i models) until all remaining variables are significant (see the sketch below)
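A minimal sketch in R, under the same assumptions (hypothetical data frame df and variable names):

```r
backward_eliminate <- function(df, response, predictors, alpha = 0.05) {
  repeat {
    fit <- lm(reformulate(predictors, response = response), data = df)
    p <- summary(fit)$coefficients[-1, "Pr(>|t|)", drop = FALSE]  # skip intercept
    if (max(p) <= alpha) return(fit)   # all remaining variables significant
    predictors <- setdiff(predictors, rownames(p)[which.max(p)])  # drop worst
  }
}

# Usage (hypothetical): backward_eliminate(df, "Y", paste0("X", 1:6))
```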
Backward Elimination
Model 1 (p-values): X1 (0.04), X2 (0.06), X3 (0.03), X4 (0.11), X5 (0.07), X6 (0.08)
Model 1 fit: R² = 0.95, Adj. R² = 0.75, F-test p-value = 0.003
→ Remove X4 (largest p-value)

Remove the variable with the largest p-value if it is > 0.05
Backward Elimination
Model 1 (p-values): X1 (0.04), X2 (0.06), X3 (0.03), X4 (0.11), X5 (0.07), X6 (0.08) → Remove X4
Model 2 (p-values): X1 (0.06), X2 (0.07), X3 (0.20), X5 (0.33), X6 (0.07) → Remove X5

Remove the variable with the largest p-value if it is > 0.05
Backward Elimination
Model 1 (p-values): X1 (0.04), X2 (0.06), X3 (0.03), X4 (0.11), X5 (0.07), X6 (0.08) → Remove X4
Model 2 (p-values): X1 (0.06), X2 (0.07), X3 (0.20), X5 (0.33), X6 (0.07) → Remove X5
Model 3 (p-values): X1 (0.06), X2 (0.07), X3 (0.20), X6 (0.07) → Remove X3
Model 4 (p-values): X1 (0.03), X2 (0.10), X6 (0.07) → Final model

Remove the variable with the largest p-value if it is > 0.05
Stepwise Regression

Stepwise Regression
1. Find out the correlations between Y and all Xs
     X1    X2    X3    X4    X5
Y    0.23  0.09  0.87  0.03  0.63
2. Decide the α and β values for the regression (say we select α = 0.05 and β = 0.05)
Stepwise Regression
1. Find out the correlations between Y and all Xs
2. Iteration 1 – Model 1
A. Select the variable with the highest correlation (X3)
B. Run regression model 1 (Y = β1X3 + c)
C. Get the p-value for X3, the F-statistic and R-square
D. If p-value <= 0.05, retain X3 in the model
E. If p-value > 0.05, remove X3 from the model
3. Iteration 2 – Model 2
A. Select the variable with the next highest correlation (X5)
B. Run regression model 2
C. Get the p-value for X5, the F-statistic and R-square
D. If p-value <= 0.05, retain X5 in the model
E. If the p-value of the earlier variable X3 has become > 0.05, remove X3 from the model; otherwise X3 continues in the model (see the sketch below)

Correlation:   X1    X2    X3    X4    X5
Y              0.23  0.09  0.87  0.03  0.63
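R's built-in step() performs stepwise selection, though it adds and drops terms by AIC rather than by the α and β p-value thresholds above; shown here for contrast, with a hypothetical data frame df:

```r
null <- lm(Y ~ 1, data = df)                            # intercept-only model
full <- lm(Y ~ X1 + X2 + X3 + X4 + X5, data = df)       # all candidate variables
step(null, scope = formula(full), direction = "both")   # stepwise search by AIC
```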
Thank You
Pre-Processing of Data
(Before Regression)
S4/5
Dr Shirish Jeble
ICFAI Business School, Pune

Handling Qualitative Variables

Examples of Qualitative Variables
• Gender (Male, Female)
• Quarter (Q1, Q2, Q3, Q4)
• Customer Satisfaction (satisfied, not satisfied)
• Age in years, converted to age categories (10-20, 20-30, 30-40, etc.)
• Income (lacs), converted to income groups (5-10 lacs, 10-15 lacs, 15-20 lacs, etc.)
• Country: IND, AUS, ENG, USA, JPN, etc.
Gender as a Qualitative Variable
Codification: Male = 0, Female = 1

Record #   Gender
1          1
2          0
3          0
4          1

Salaries = β1 * Gender + β2 * Age + c + ε
Quarter as a Qualitative Variable

#   Sales (Y)   Advertisement   Date         Qtr
1   10000       1098            3/1/2019     1
2   12003       1133            20/3/2019    1
3   15090       1480            6/4/2019     2
4   13001       1294            20/6/2019    2
5   17300       1630            18/7/2019    3
6   15980       1490            1/9/2019     3
7   11515       1011            30/11/2019   4
8   18764       1766            15/12/2019   4

Sales = β1 * Advt + β2 * Qtr + c + ε

Sales = β1 * Advt + β2 * 1 + c + ε … for Qtr 1
Sales = β1 * Advt + β2 * 2 + c + ε … for Qtr 2
Sales = β1 * Advt + β2 * 3 + c + ε … for Qtr 3
Sales = β1 * Advt + β2 * 4 + c + ε … for Qtr 4
Codify Quarter

#   Date         Qtr   Q1   Q2   Q3   Q4
1   3/1/2019     1     1    0    0    0
2   20/3/2019    1     1    0    0    0
3   6/4/2019     2     0    1    0    0
4   20/6/2019    2     0    1    0    0
5   18/7/2019    3     0    0    1    0
6   1/9/2019     3     0    0    1    0
7   30/11/2019   4     0    0    0    1
8   15/12/2019   4     0    0    0    1

Sales = β1 * Advt + β2 * Qtr + c + ε
Codify Quarter

#   Date         Qtr   Q1   Q2   Q3
1   3/1/2019     1     1    0    0
2   20/3/2019    1     1    0    0
3   6/4/2019     2     0    1    0
4   20/6/2019    2     0    1    0
5   18/7/2019    3     0    0    1
6   1/9/2019     3     0    0    1
7   30/11/2019   4     0    0    0
8   15/12/2019   4     0    0    0

Sales = β1 * Advt + β2 * Q1 + β3 * Q2 + β4 * Q3 + c + ε

Quarter 4 is implied whenever Q1, Q2, Q3 = 0 (see the sketch below)
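In R a factor() column makes lm() create the k-1 dummies automatically; a sketch with a hypothetical sales_df, noting that R's default baseline is the first level (Q1), whereas the slide uses Q4 as the implied quarter:

```r
sales_df$Qtr <- factor(sales_df$Qtr, levels = 1:4,
                       labels = c("Q1", "Q2", "Q3", "Q4"))
fit <- lm(Sales ~ Advt + Qtr, data = sales_df)  # dummies Q2, Q3, Q4 vs. Q1 baseline
# To reproduce the slide's coding (Q4 implied), make Q4 the reference level:
sales_df$Qtr <- relevel(sales_df$Qtr, ref = "Q4")
```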
Age

#   Age   10-20   20-30   30-40   40-50   50-60
1   10    1       0       0       0       0
2   25    0       1       0       0       0
3   47    0       0       0       1       0
4   33    0       0       1       0       0
5   15    1       0       0       0       0
6   58    0       0       0       0       1

The 50-60 category is not required explicitly (it is implied when all other dummies are 0)

Age converted from a continuous variable to a categorical variable
Adding Calculation Fields in Data

Adding Calculation Fields in Data
• Given data for unit-price and qty-sold
• Add one more column, line-total (datatype = currency)
• Add the formula (unit-price * qty-sold)
• Update all records (see the sketch below)
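In R this is a one-liner; orders, unit_price and qty_sold are hypothetical names:

```r
# Add a computed column: line total = unit price * quantity sold
orders$line_total <- orders$unit_price * orders$qty_sold
```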
Interaction Variables

Interaction Variables
Model 1: Y ~ X1 + X2 + X3
Model 2: Y ~ X1 + X2 + X3 + (X1 * X2)
Model 3: Y ~ X1 + X2 + X3 + (X1 * X2) + (X2 * X3)
Model 4: Y ~ X1 + X2 + X3 + (X1 * X2) + (X2 * X3) + (X1 * X3)

Compare R², Adj. R² and the F-statistic across the models
Interaction Variables

Bowler   Batsman   Allrounder   SR-B     RUNS-S
0        0         1            0.00     0
1        0         0            0.00     0
1        0         0            121.01   167
1        0         0            76.32    58
0        1         0            120.71   1317
0        1         0            95.45    63
1        0         0            72.22    26
1        0         0            165.88   21
0        0         1            114.73   335

Batsman and Batting Strike Rate interact to form a new variable (Batsman * SR-B);
Batsman and the number of runs scored by a player form (Batsman * RUNS-S); see the sketch below
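In an lm() formula the : operator builds exactly such product terms; a sketch with a hypothetical data frame ipl_df and response Y:

```r
# Interaction terms Batsman*SR_B and Batsman*RUNS_S alongside the main effects
fit <- lm(Y ~ Batsman + SR_B + RUNS_S + Batsman:SR_B + Batsman:RUNS_S,
          data = ipl_df)
# Note: Batsman*SR_B is shorthand for Batsman + SR_B + Batsman:SR_B
```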
Interaction Variables
Which interaction variables would be applicable for a Bowler?
Outliers in Data
[Plot of the residuals for each point, with bands at +2 Std Error and -2 Std Error; a point outside the bands is an outlier]
How to detect Outliers
• Boxplot of residuals
• Plot the residuals for all observations
– Residual points greater than 2 SE (in absolute value) are outliers
• Mahalanobis distance > 10 (in R)
• Cook's distance > 1 (in R; see the sketch below)
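A sketch of these checks in R for a hypothetical fitted model fit (the Mahalanobis step assumes more than one predictor):

```r
res <- rstandard(fit)                 # standardised residuals
which(abs(res) > 2)                   # points beyond +/- 2 standard errors
which(cooks.distance(fit) > 1)        # Cook's distance rule of thumb
X  <- model.matrix(fit)[, -1]         # predictor matrix without the intercept
md <- mahalanobis(X, colMeans(X), cov(X))
which(md > 10)                        # Mahalanobis distance rule of thumb
```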
Thank You
Stages and Data Issues in
Regression Models
S5/5
Dr Shirish Jeble
ICFAI Business School, Pune

Stages of Regression Project

Pre-Process
1. Data Collection
2. Pre-processing of Data (codification, data cleansing)
3. Selecting Variables (Forward Selection, Backward Elimination, etc.)

Regression
4. Develop Regression Model (using training data)
5. Check Model Diagnostics (F-test, t-test, R², normality of residuals)
6. Remove Outliers, if any
7. Check for Data Issues
8. Validate Model (using validation data)

Deployment
9. Deploy Model
5. Model Diagnostics
• F-test: check the p-value of the overall F-statistic (should be < 0.05)
• t-tests: check the statistical significance of the independent variables X1, X2, X3, …, Xi (p-value < 0.05)
• Check normality of residuals
Checking normality of Residuals
[Box plot and histogram of the residuals]
5. Model Diagnostics
• F-test: check the p-value of the overall F-statistic (should be < 0.05)
• t-tests: check the statistical significance of the independent variables X1, X2, X3, …, Xi (p-value < 0.05)
• Check normality of residuals
– Draw a boxplot to get a general idea of the residuals
– Draw a Q-Q plot of the residuals using qqnorm(residual) and qqline(residual)
– Check skewness and kurtosis of the residuals
• Explanatory power of the model
– Check the values of R² and Adj. R² (see the sketch below)
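A sketch of pulling these diagnostics from a fitted model in R (fit is hypothetical; skewness and kurtosis assume the e1071 package is installed):

```r
s <- summary(fit)
s$coefficients                       # t-tests: Pr(>|t|) per variable
s$r.squared; s$adj.r.squared         # explanatory power
pf(s$fstatistic[1], s$fstatistic[2], s$fstatistic[3],
   lower.tail = FALSE)               # p-value of the overall F-test
r <- residuals(fit)
boxplot(r); qqnorm(r); qqline(r)     # normality checks
e1071::skewness(r); e1071::kurtosis(r)
```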
6. Check for Outliers
– Mahalanobis distance (when greater than 10, it is an outlier)
– Cook's distance (when greater than 1, it is an outlier)
– Plot the residuals on a scatter plot; all points beyond +2 SE or below -2 SE are outliers
7. Data Issues in Regression

Assumptions
1. The error term should be normally distributed
2. The variance of the error term should not change with the predicted value of the dependent variable (no heteroscedasticity)
3. For time series: successive values of the error term should be independent (no auto-correlation)
Multi-Collinearity

      X1     X2     X3
X1    -      0.30   0.90
X2    0.20   -      0.20
X3    0.90   0.23   -

• The correlation between two independent variables is very high (here X1 and X3)
• Check the Variance Inflation Factor (VIF); VIF > 5 indicates the presence of multi-collinearity (see the sketch below)
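A sketch of both checks in R; vif() comes from the car package (assumed installed), and the data frame and column names are hypothetical:

```r
cor(df[, c("X1", "X2", "X3")])   # pairwise correlation matrix of predictors
library(car)
vif(fit)                         # VIF > 5 suggests multi-collinearity
```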
Heteroscedasticity
[Scatter plot: ABS(Residuals) vs. Predicted Food Spending, showing a fanning pattern]
Plot a scatterplot between the predicted value of Y and ABS(residual)
Remove Heteroscedasticity

Income          Food Spending   Log(Food Spending)
(independent)   (dependent)     (transformed dependent)
$74,201.00      $9,646.13       3.984353
$41,659.00      $8,331.80       3.920739
$44,085.00      $9,698.70       3.986714
$63,529.00      $10,799.93      4.033421

Use Log(Y) or SQRT(Y) to remove heteroscedasticity (see the sketch below)
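A sketch of the check and the log remedy in R, assuming a data frame df with Income and FoodSpending columns (hypothetical names; the table's values look like base-10 logs):

```r
fit1 <- lm(FoodSpending ~ Income, data = df)
plot(fitted(fit1), abs(residuals(fit1)))   # fanning pattern => heteroscedasticity
fit2 <- lm(log10(FoodSpending) ~ Income, data = df)  # transformed dependent
plot(fitted(fit2), abs(residuals(fit2)))   # pattern should flatten out
```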
Heteroscedasticity removed
[Scatter plot: ABS(Residuals) vs. Predicted Log(Food Spending); the fanning pattern is gone]
Auto-Correlation
Positive autocorrelation: the residuals show few sign changes
[Residual plot with 2 sign changes, n = 14]
• If the number of sign changes on the residuals plot is less than (n-1)/2 - √(n-1), positive autocorrelation exists
• For n = 14: (n-1)/2 - √(n-1) = 2.89, so roughly 3 or more sign changes are required for "no autocorrelation"
• With only 2 sign changes, positive autocorrelation exists in the above data
Auto-Correlation
Negative autocorrelation: the residuals show many sign changes
[Residual plot with 13 sign changes, n = 14]
• If the number of sign changes on the residuals plot is greater than (n-1)/2 + √(n-1), negative autocorrelation exists
• For n = 14: (n-1)/2 + √(n-1) = 10.1, so roughly 10 or fewer sign changes are required for "no autocorrelation"
• With 13 sign changes, negative autocorrelation exists in the above data
Auto-correlation
• Definition: high correlation between e_t and e_(t-1)
• How to detect:
– Option 1: draw a plot of the residuals and count the number of sign changes (applicable only for time-series data)
• Few sign changes: positive autocorrelation
• Many sign changes: negative autocorrelation
• If the number of sign changes is less than (n-1)/2 - √(n-1), positive autocorrelation exists
• If the number of sign changes is greater than (n-1)/2 + √(n-1), negative autocorrelation exists
– Option 2: Durbin-Watson test (see the sketch below)
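For Option 2, the Durbin-Watson test is available in R via the lmtest package (assumed installed); fit is a hypothetical fitted model:

```r
library(lmtest)
dwtest(fit)   # DW near 2: no autocorrelation; < 2: positive; > 2: negative
```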
Remedy for Auto-correlation
• Find the correlation ρ between e_t and e_(t-1)
• Define new variables (Y_t - ρ*Y_(t-1)) and (X_t - ρ*X_(t-1)) for the regression
• Run the regression and check the number of sign changes (see the sketch below)
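A sketch of this remedy in R; the vectors Y and X and the fitted model fit are hypothetical:

```r
e   <- residuals(fit)
n   <- length(e)
rho <- cor(e[-1], e[-n])              # correlation between e_t and e_(t-1)
Y2  <- Y[-1] - rho * Y[-length(Y)]    # Y_t - rho*Y_(t-1)
X2  <- X[-1] - rho * X[-length(X)]    # X_t - rho*X_(t-1)
fit2 <- lm(Y2 ~ X2)                   # re-run and re-check the sign changes
```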
Checking for Data Issues

Multicollinearity
• How to identify: the correlation between two independent variables is very high; check the Variance Inflation Factor (VIF should be < 4)
• Remedy: remove one of the variables, or use Principal Component Analysis

Heteroscedasticity
• How to identify: non-constant variance of the error term; draw a plot between the predicted value of the dependent variable Y and ABS(residual)
• Remedy: run the regression between Log(Y) or SQRT(Y) and the Xi

Autocorrelation (for time series only)
• How to identify: high correlation between e_t and e_(t-1); draw a plot of the residuals and count the sign changes; fewer than (n-1)/2 - √(n-1) sign changes → positive autocorrelation; more than (n-1)/2 + √(n-1) → negative autocorrelation
• Remedy: find the correlation ρ between e_t and e_(t-1); define new variables (Y_t - ρ*Y_(t-1)) and (X_t - ρ*X_(t-1)); run the regression and check the number of sign changes

Outlier Analysis
• How to identify: Mahalanobis distance greater than 10; Cook's distance greater than 1; residual scatter plot points beyond +2 SE or below -2 SE
• Remedy: drop the outlier points from the data and re-run the regression
8. Validate the Model
• 80% of the available data: training set
• 20% of the available data: validation set
• Run the regression model on the training set
• Develop multiple models with the variable selection approach
• Run all diagnostics and arrive at the best final model (Y = m1X1 + m2X2 + m3X3 + c), i.e. the model with the highest R-square and significant independent variables
• Apply this model to the validation set (20%)
• Check that the Std Error on the validation dataset < the Std Error on the training dataset
• If this is true, the model is ready to go to production: Deploy Model (see the sketch below)
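A sketch of the 80/20 split and the validation check in R; the data frame df and the column names are hypothetical:

```r
set.seed(42)
idx   <- sample(nrow(df), floor(0.8 * nrow(df)))   # random 80% for training
train <- df[idx, ]; valid <- df[-idx, ]            # remaining 20% for validation
fit   <- lm(Y ~ X1 + X2 + X3, data = train)
rmse  <- function(actual, pred) sqrt(mean((actual - pred)^2))
rmse(train$Y, fitted(fit))                # error on the training data
rmse(valid$Y, predict(fit, valid))        # error on the validation data
```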
Stages of Regression Project

Pre-Process
1. Data Collection
2. Pre-processing of Data (codification, data cleansing)
3. Selecting Variables (Forward Selection, Backward Elimination, etc.)

Regression
4. Develop Regression Model (using training data)
5. Check Model Diagnostics (F-test, t-test, R², normality of residuals)
6. Remove Outliers, if any
7. Check for Data Issues
8. Validate Model (using validation data)

Deployment
9. Deploy Model
Thank You
