Dr Shirish Jeble
ICFAI Business School, Pune
Definitions
• Dependent variable (also called response, predicted, or outcome variable)
• Independent variable (also called explanatory or predictor variable)
• Correlation
• Spurious correlation
Correlation
• Correlation is a statistical measure of the association between two random variables.
Correlation
• Correlation helps in identifying variables to be used in model building
• Correlation helps in eliminating variables that may cause multicollinearity
• The value of the correlation coefficient lies between −1 and +1
• A positive value of r indicates positive correlation (as the value of x increases, the value of y also increases)
• In simple regression, r² = R² (the coefficient of determination)
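A minimal sketch in R of computing a correlation and relating it to R²; the x and y values below are made up purely for illustration.

x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
r <- cor(x, y)                   # correlation coefficient, lies between -1 and +1
r^2                              # equals R-squared of the simple regression of y on x
summary(lm(y ~ x))$r.squared     # same value, from the fitted model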
Simple and Multiple
Regression
Dr Shirish Jeble
ICFAI Business School, Pune
Master Contents
Video 1
• Types of Regression
• Business Applications of Regression
Video 2
• Simple Regression
• Multiple Regression
• Non-Linear Regression
• Assumptions in Regression
Video 3
• Variable Selection Approaches
Video 4
• Handling Qualitative Variables
• Preprocessing of Data
Video 5
• Stages of project using regression
• Data Issues in Regression
– Multicollinearity
– Heteroscedasticity
– Auto-Correlation
Types of Regression
S1/5
Dr Shirish Jeble
ICFAI Business School, Pune
Contents
• Types of Regression
• Business Applications of Regression
Types of Regression
• Simple Regression
• Multiple Regression
• Logistic Regression
• Polynomial Regression
• Regression with multiple dependent variables
Applications of Regression
Estimate Salary for an Experienced Candidate
[Diagram: Experience, Communication skills, and Education as inputs to Salary]
Predict Sales for the Next Year, Quarter or Month
[Diagram: Features, Quality, Advertisement, and Customer Satisfaction as inputs to Product Sales]
Estimate Car Insurance Premium for New Customers
[Diagram: Car model, Driver's driving experience, Miles driven, and Value of car as inputs to Insurance Premium]
Thank You
Simple, Multiple and Non-linear Regression
S2/5
Dr Shirish Jeble
ICFAI Business School, Pune
Contents
• What is regression
– Simple Regression
– Multiple Regression
– Non-linear Regression
• R² and Adjusted R²
• Assumptions in Regression
• Regression establishes an association between two variables, not a causal relationship.
Higher level of relationship → higher correlation and higher R²
[Scatter plot of Sales: points close to the trend line; correlation = 0.80]
Lower level of relationship → lower correlation and lower R²
[Scatter plot of Sales: widely scattered points; correlation = 0.23]
Regression Line
The regression line minimises the sum of squared errors, Σeᵢ², where each eᵢ is the difference between the actual and predicted value of Y.
[Plot of X vs Y showing actual points, the fitted line, and residuals e₁ … e₅]
Simple Regression
One predictor: X₁ = Advertisement; outcome: Y = Sales
Y = mX + c + e
Multiple Regression
Outcome: money spent by a customer on an e-commerce site
Predictors: X₁ = Income, X₂ = Family size, X₃ = Age
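A minimal sketch of a multiple regression fit in R; the data frame and column names (spend, income, family_size, age) are hypothetical and the values are made up.

customers <- data.frame(
  spend       = c(120, 340, 260, 410, 190, 305),   # Y: money spent on the site
  income      = c(30, 75, 58, 90, 42, 66),         # X1
  family_size = c(1, 4, 3, 5, 2, 3),               # X2
  age         = c(24, 41, 36, 50, 29, 38)          # X3
)
fit <- lm(spend ~ income + family_size + age, data = customers)
summary(fit)   # coefficients, p-values, R-squared and Adjusted R-squared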
R² and Adjusted R²
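For reference, with n observations and k independent variables, Adjusted R² penalises each added variable: Adj. R² = 1 − (1 − R²)(n − 1)/(n − k − 1). Unlike R², it can decrease when a variable that adds little explanatory power is included.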
Assumptions in Regression
Assumptions in Regression
1. The error term should be normally distributed
2. The variance of the error term should not change with the level of the dependent variable (no heteroscedasticity)
3. For time series: successive values of the error term should be independent (no auto-correlation)
Thank You
Steps for Regression
• Open the data file in R: read.csv()
• Define the linear regression using the R command lm()
• Define the regression equation using the dependent variable (Y = mX + c)
• Draw a scatter plot between X and Y
• Check for normality of the error term:
– Draw a histogram of the residuals
– Draw qqnorm(residual) and qqline(residual)
– Shapiro-Wilk test
• Use the model to predict values of Y when X is known (see the R sketch below)
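A minimal sketch of these steps in R, assuming a CSV file "sales.csv" with columns Advertisement (X) and Sales (Y); the file and column names are hypothetical.

data <- read.csv("sales.csv")

plot(data$Advertisement, data$Sales)              # scatter plot between X and Y

model <- lm(Sales ~ Advertisement, data = data)   # fit Y = mX + c
summary(model)                                    # coefficients, p-values, R-squared

res <- residuals(model)                           # check normality of the error term
hist(res)                                         # histogram of residuals
qqnorm(res); qqline(res)                          # Q-Q plot of residuals
shapiro.test(res)                                 # Shapiro-Wilk normality test

# predict Y for new values of X
predict(model, newdata = data.frame(Advertisement = c(1200, 1500)))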
Variable Selection
Approaches in Regression
S3/5
Dr Shirish Jeble
ICFAI Business School, Pune
Variable Selection Approaches
• Forward Selection
• Backward Elimination
• Stepwise Regression
Y = f(X1, X2, X3, X4, …)
Forward Selection
Forward Selection
1. Find the correlations between Y and all Xs:
        X1     X2     X3     X4     X5
Y       0.23   0.09   0.87   0.03   0.63
2. Iteration 1 – Model 1
A. Select the variable with the highest correlation (X3)
B. Run regression model 1 (Y = β₁X₃ + c)
C. Get the p-value for X3, the F-statistic, and R²
D. If p-value < 0.05, retain X3 in the model
E. Once in the equation, the variable remains there
3. Iteration 2 – Model 2
A. Select the variable with the next highest correlation (X5)
B. Run regression model 2
C. Get the p-value for X5, the F-statistic, and R²
D. If p-value < 0.05, retain X5 in the model
4. Iterations 3–5 – Models 3, 4 and 5
A. Repeat the process for the remaining variables in decreasing order of correlation: X1, X2, X4
B. Retain a variable in the model only if its p-value < 0.05
Forward Selection
Iteration 1: Model 1 = X3
Iteration 2: Model 2 = X3 + X5
Iteration 3: Model 3 = X3 + X5 + X1
Iteration 4: Model 4 = X3 + X5 + X1 + X2
Iteration 5: Model 5 = X3 + X5 + X1 + X2 + X4
• A variable is retained in the next model only if it is significant (p-value < 0.05)
• Once added, a variable remains in the model
        X1     X2     X3     X4     X5
Y       0.23   0.09   0.87   0.03   0.63
Forward Selection (example where X5 turns out not to be significant)
Iteration 1: Model 1 = X3
Iteration 2: Model 2 = X3 + X5 → X5 has p-value 0.09 > 0.05, so X5 is not retained
Iteration 3: Model 3 = X3 + X1
Iteration 4: Model 4 = X3 + X1 + X2
Iteration 5: Model 5 = X3 + X1 + X2 + X4
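R's built-in step() function automates this idea; a minimal sketch, assuming a data frame df with response Y and predictors X1..X5 (hypothetical names). Note that step() adds variables by AIC rather than by the p-value rule described above, so results can differ.

null_model <- lm(Y ~ 1, data = df)                 # start with no predictors
forward <- step(null_model,
                scope = ~ X1 + X2 + X3 + X4 + X5,  # candidate variables
                direction = "forward")             # only ever adds variables
summary(forward)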
Backward Elimination
Backward Elimination
• Start with all variables X1 to Xi in the model
• Model 1: run regression with all i variables
– Remove the variable with the highest p-value
• Model 2: run regression with the remaining i−1 variables
– Remove the variable that now has the highest p-value
• Repeat (up to i models) until all remaining variables are significant
Backward Elimination (p-values shown; – means the variable has been removed)
Variables     Model 1     Model 2     Model 3     Model 4
X1            0.04        0.06        0.06        0.03
X2            0.06        0.07        0.07        0.10
X3            0.03        0.20        0.20        –
X4            0.11        –           –           –
X5            0.07        0.33        –           –
X6            0.08        0.07        0.07        0.07
R²            0.95
Adj. R²       0.75
F-statistic   0.003
Action        Remove X4   Remove X5   Remove X3   Final model
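A minimal sketch of backward elimination by p-value in R, assuming a data frame df with response Y and predictors X1..X6 (hypothetical names).

fit <- lm(Y ~ X1 + X2 + X3 + X4 + X5 + X6, data = df)
summary(fit)                     # inspect p-values; suppose X4 has the highest

fit <- update(fit, . ~ . - X4)   # Model 2: drop X4 and refit
summary(fit)                     # suppose X5 now has the highest p-value

fit <- update(fit, . ~ . - X5)   # Model 3: drop X5 and refit
# ...repeat until every remaining variable has p-value < 0.05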
Stepwise Regression
1. Find the correlations between Y and all Xs:
        X1     X2     X3     X4     X5
Y       0.23   0.09   0.87   0.03   0.63
2. Iteration 1 – Model 1
A. Select the variable with the highest correlation (X3)
B. Run regression model 1 (Y = β₁X₃ + c)
C. Get the p-value for X3, the F-statistic, and R²
D. If p-value ≤ 0.05, retain X3 in the model
E. If p-value > 0.05, remove X3 from the model
3. Iteration 2 – Model 2
A. Select the variable with the next highest correlation (X5)
B. Run regression model 2
C. Get the p-value for X5, the F-statistic, and R²
D. If p-value ≤ 0.05, retain X5 in the model
E. If the p-value of the earlier variable X3 has become > 0.05, remove X3 from the model; otherwise X3 continues in the model
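step() can also run in stepwise mode, where a variable added earlier may be dropped later; a minimal sketch with the same hypothetical df, Y and X1..X5 (again, step() uses AIC rather than the p-value rule above).

start <- lm(Y ~ 1, data = df)
both <- step(start,
             scope = ~ X1 + X2 + X3 + X4 + X5,
             direction = "both")   # may add or remove a variable at each step
summary(both)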
Thank You
Pre-Processing of Data
(Before Regression)
S4/5
Dr Shirish Jeble
ICFAI Business School, Pune
Handling Qualitative Variables
Examples of Qualitative Variables
• Gender (Male, Female)
• Quarter (Q1, Q2, Q3, Q4)
• Customer satisfaction (satisfied, not satisfied)
• Age in years, converted to age categories (10-20, 20-30, 30-40, etc.)
• Income in lacs, converted to income groups (5-10 lacs, 10-15 lacs, 15-20 lacs, etc.)
• Country: IND, AUS, ENG, USA, JPN, etc.
Gender as a Qualitative Variable
Codification: Male = 0, Female = 1
Record #   Gender
1          1
2          0
3          0
4          1
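A minimal sketch of this coding in R; the data values are hypothetical. In practice lm() dummy-codes factors automatically.

gender <- factor(c("Female", "Male", "Male", "Female"))
as.numeric(gender == "Female")   # manual coding from the slide: Male = 0, Female = 1
model.matrix(~ gender)           # the dummy column lm() would create (genderMale,
                                 # since R takes the first level, Female, as baseline)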
Quarter as a Qualitative Variable
#   Sales (Y)   Advertisement   Date         Qtr
1   10000       1098            3/1/2019     1
2   12003       1133            20/3/2019    1
3   15090       1480            6/4/2019     2
4   13001       1294            20/6/2019    2
5   17300       1630            18/7/2019    3
6   15980       1490            1/9/2019     3
7   11515       1011            30/11/2019   4
8   18764       1766            15/12/2019   4
Codify Quarter
Sales = β₁·Advt + β₂·Q1 + β₃·Q2 + β₄·Q3 + c + ε
The last category (Q4) is not required explicitly: when Q1 = Q2 = Q3 = 0, the record belongs to Q4.
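A minimal sketch in R, assuming a hypothetical data frame sales_df with columns Sales, Advt and Qtr. Declaring Qtr a factor makes lm() create the dummy columns automatically; R drops the first level (Q1) as its baseline, whereas the slide drops Q4, so relevel() is used to match the slide.

sales_df$Qtr <- factor(sales_df$Qtr)
sales_df$Qtr <- relevel(sales_df$Qtr, ref = "4")   # make Q4 the implicit baseline
fit <- lm(Sales ~ Advt + Qtr, data = sales_df)
summary(fit)   # one coefficient per remaining quarter dummy (Q1, Q2, Q3)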
Adding Calculated Fields to the Data
• Given data for unit-price and qty-sold
• Add one more column, line-total (datatype = currency)
• Add the formula (unit-price × qty-sold)
• Update all records (see the one-line R sketch below)
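In R, the same calculated field is one line, assuming a hypothetical data frame orders with columns unit_price and qty_sold:

orders$line_total <- orders$unit_price * orders$qty_sold   # updates all records at once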
Interaction Variables
Interaction Variables
Model 1: X1, X2, X3
Model 2: X1, X2, X3, X1*X2
Model 3: X1, X2, X3, X1*X2, X2*X3
Model 4: X1, X2, X3, X1*X2, X2*X3, X1*X3
Compare R², Adj. R² and the F-statistic across the models.
Interaction Variables
Bowler   Batsman   Allrounder   SR-B     RUNS-S
0        0         1            0.00     0
1        0         0            0.00     0
1        0         0            121.01   167
1        0         0            76.32    58
0        1         0            120.71   1317
0        1         0            95.45    63
1        0         0            72.22    26
1        0         0            165.88   21
0        0         1            114.73   335
Batsman and batting strike rate interact to form a new variable (Batsman * SR-B); Batsman and number of runs scored form another (Batsman * RUNS-S).
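A minimal sketch of an interaction term in R; the data frame players mirrors the table above (hyphens in column names become dots after read.csv), and the outcome Price (e.g. a player's auction price) is a hypothetical addition.

fit <- lm(Price ~ Batsman + SR.B + Batsman:SR.B, data = players)
summary(fit)   # Batsman:SR.B is the new interaction variable (Batsman * SR-B)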
Outliers in Data
[Plot of residuals for each observation, with bands at +2 and −2 standard errors; a point outside the bands is an outlier]
How to Detect Outliers
• Boxplot of residuals
• Plot residuals for all the observations
– Residual points beyond ±2 SE are outliers
• Mahalanobis distance > 10 (in R)
• Cook's distance > 1 (in R)
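A minimal sketch of these checks in R; fit is assumed to be an lm model fitted earlier.

res <- rstandard(fit)                       # standardized residuals, in SE units
boxplot(res)                                # boxplot of residuals
which(abs(res) > 2)                         # observations beyond +/- 2 SE

which(cooks.distance(fit) > 1)              # Cook's distance rule

X <- model.matrix(fit)[, -1, drop = FALSE]  # predictor columns
md <- mahalanobis(X, colMeans(X), cov(X))
which(md > 10)                              # Mahalanobis distance rule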
Thank You
Stages and Data Issues in
Regression Models
S5/5
Dr Shirish Jeble
ICFAI Business School, Pune
Stages of Regression Project
1. Data Collection
Checking Normality of Residuals
[Box plot and histogram of the residuals]
5. Model Diagnostics
• F-test: check the p-value of the F-statistic (should be < 0.05)
• t-tests: check statistical significance of the independent variables (X1, X2, X3, …, Xi): p-value < 0.05
• Check normality of residuals (see the R sketch below)
– Draw a boxplot to get a general idea of the residuals
– Plot a histogram of the residuals; draw a Q-Q plot using qqnorm(residual) and qqline(residual)
– Check skewness and kurtosis of the residuals
• Explanatory power of the model
– Check the values of R² and Adj. R²
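A minimal sketch of these diagnostics in R for a fitted lm model fit; the skewness and kurtosis lines use the textbook formulas directly to avoid extra packages.

summary(fit)               # F-statistic p-value, t-tests, R-squared, Adj. R-squared
res <- residuals(fit)
boxplot(res); hist(res)                    # general shape of the residuals
qqnorm(res); qqline(res)                   # Q-Q plot
mean((res - mean(res))^3) / sd(res)^3      # skewness (should be near 0)
mean((res - mean(res))^4) / sd(res)^4      # kurtosis (near 3 for a normal)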
6. Check for Outliers
– Mahalanobis distance (greater than 10 → outlier)
– Cook's distance (greater than 1 → outlier)
– Plot residuals on a scatter plot; all points beyond +2 SE or below −2 SE are outliers
7. Data Issues in Regression
Assumptions
1. The error term should be normally distributed
2. The variance of the error term should not change with the level of the dependent variable (no heteroscedasticity)
3. For time series: successive values of the error term should be independent (no auto-correlation)
Multicollinearity
     X1     X2     X3
X1   –      0.30   0.90
X2   0.20   –      0.20
X3   0.90   0.23   –
The high correlation (0.90) between X1 and X3 signals multicollinearity; one of the two should be dropped.
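A minimal sketch of detecting multicollinearity in R; df, Y and X1..X3 are hypothetical, and vif() assumes the car package is installed.

cor(df[, c("X1", "X2", "X3")])     # pairwise correlation matrix, as above
fit <- lm(Y ~ X1 + X2 + X3, data = df)
car::vif(fit)                      # variance inflation factors; large values
                                   # (often > 10) indicate multicollinearity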
Heteroscedasticity: remedied by transforming the dependent variable, e.g. taking Log(Food Spending)
Income        Food Spending   Log(Food Spending)
$74,201.00    $9,646.13       3.984353
$41,659.00    $8,331.80       3.920739
$44,085.00    $9,698.70       3.986714
$63,529.00    $10,799.93      4.033421
[Plot of ABS(Residuals) after the transform: heteroscedasticity removed]
Positive auto-correlation: few sign changes
[Residual plot with only 2 sign changes]
• If the number of sign changes on the residuals plot is less than (n−1)/2 − √(n−1), positive auto-correlation exists
• For n = 14: (n−1)/2 − √(n−1) = 2.89, so ~3 or more sign changes are required for "no auto-correlation"
• With only 2 sign changes, positive auto-correlation exists in the data above
Negative auto-correlation: many sign changes
[Residual plot with 13 sign changes]
• If the number of sign changes on the residuals plot is greater than (n−1)/2 + √(n−1), negative auto-correlation exists
• For n = 14: (n−1)/2 + √(n−1) ≈ 10, so 10 or fewer sign changes are required for "no auto-correlation"
• With 13 sign changes, negative auto-correlation exists in the data above
Auto-correlation
• Definition: high correlation between eₜ and eₜ₋₁
• How to detect:
– Option 1: draw a plot of the residuals and observe the number of sign changes (applicable only to time-series data)
• Few sign changes: positive auto-correlation
• Many sign changes: negative auto-correlation
• If the number of sign changes is less than (n−1)/2 − √(n−1), positive auto-correlation exists
• If the number of sign changes is greater than (n−1)/2 + √(n−1), negative auto-correlation exists
– Option 2: Durbin-Watson test (see the R sketch below)
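A minimal sketch of Option 2 in R, assuming the lmtest package is installed and fit is an lm model on time-series data.

library(lmtest)
dwtest(fit)   # Durbin-Watson statistic near 2 suggests no auto-correlation;
              # well below 2 suggests positive, well above 2 negative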
Remedy for Auto-correlation
• Find the correlation ρ between eₜ and eₜ₋₁
• Define new variables (Yₜ − ρ·Yₜ₋₁) and (Xₜ − ρ·Xₜ₋₁) for the regression
• Run the regression again and check the number of sign changes
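A minimal sketch of this remedy in R; y and x are hypothetical numeric vectors in time order, and fit is the original lm model.

res <- residuals(fit)
rho <- cor(res[-1], res[-length(res)])   # correlation between e_t and e_(t-1)
n <- length(y)
y_star <- y[-1] - rho * y[-n]            # Y_t - rho * Y_(t-1)
x_star <- x[-1] - rho * x[-n]            # X_t - rho * X_(t-1)
fit2 <- lm(y_star ~ x_star)              # rerun regression on transformed data
summary(fit2)                            # then recheck the sign changes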
Checking for Data Issues
Data Issue           How to identify the issue                          Remedy
Multicollinearity    High correlation between Xs (correlation matrix)   Drop one of the correlated variables
Heteroscedasticity   Residual plot against the dependent variable       Transform Y (e.g. log)
Auto-correlation     Sign changes on residual plot; Durbin-Watson test  ρ-differencing of Y and X