Model Development
Flowchart of the model-building steps:
Explore the data: conduct exploratory data analysis; derive and analyze descriptive statistics
→ Define the functional form of the relationship
→ Estimate the regression parameters
→ Perform diagnostic tests
→ Does the model satisfy the diagnostic tests? (NO: return to an earlier step; YES: continue)
→ Influential points analysis
→ STOP
Things to do in Multiple Linear Regression:
The Multiple Regression Model
First, recall Simple Regression Model:
y = β0 + β1 x + ε
Multiple Regression Model extends idea:
y = β0 + β1 x1 + β2 x2 + … + βm xm + ε
Example
Data on the amount of money spent (Y) by customers at an e-commerce portal, their monthly
income (X1), and family size (X2) were collected for 200 customers. Build a regression model.
For every one-unit increase in Income, the amount spent increases by 0.0174
when Family.Size is held constant; for every one-unit increase in
Family.Size, the amount spent decreases by 129.4 when Income is held
constant.
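The "held constant" reading of the two slopes can be checked numerically. The following is a minimal sketch, not the fitted model itself: the intercept is not reported in the text, so `B0` is a hypothetical placeholder (it cancels out of every difference), and the income and family-size inputs are made-up values.

```python
# Slopes quoted in the example. The intercept is NOT given in the source,
# so B0 is a hypothetical placeholder that cancels out of differences.
B0 = 1000.0            # hypothetical intercept (assumption)
B_INCOME = 0.0174      # change in amount spent per unit of Income
B_FAMILY = -129.4      # change in amount spent per unit of Family.Size


def predict_spend(income, family_size):
    """Fitted value y = b0 + b1 * Income + b2 * Family.Size."""
    return B0 + B_INCOME * income + B_FAMILY * family_size


# Raise Income by one unit while Family.Size stays at 4 (made-up inputs).
effect_income = predict_spend(50001, 4) - predict_spend(50000, 4)

# Raise Family.Size by one while Income stays at 50000 (made-up inputs).
effect_family = predict_spend(50000, 5) - predict_spend(50000, 4)

print(round(effect_income, 4))   # 0.0174
print(round(effect_family, 1))   # -129.4
```

Whatever starting values are used, the differences equal the slopes, which is what "keeping the other variable constant" means in a linear model.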
Model Performance Measures
Adjusted R-squared:
R²adj = 1 − (1 − R²) × (n − 1) / (n − m − 1)
      = 1 − (1 − 0.5841) × (200 − 1) / (200 − 2 − 1)
      ≈ 0.58
where n is the number of observations and m is the number of predictors.
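The adjusted R-squared arithmetic can be reproduced in a few lines. This sketch uses only the values quoted in the example (R-squared = 0.5841, n = 200, m = 2); the function name is an illustrative choice.

```python
def adjusted_r_squared(r_squared, n, m):
    """Adjusted R-squared for n observations and m predictors."""
    return 1.0 - (1.0 - r_squared) * (n - 1) / (n - m - 1)


# Values from the example: R-squared = 0.5841, 200 customers, 2 predictors.
adj = adjusted_r_squared(0.5841, 200, 2)
print(round(adj, 2))  # 0.58
```

For a fixed R-squared, increasing m lowers the result, which is the penalty for adding predictors that do not improve the fit.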
Standard error of the estimate (SEE): s = √MSE = √(SSE / (n − m − 1))
Model     Formula     R-squared   Adjusted R-squared
Model 1   Y1~X2       0.9557      0.9493
Model 2   Y1~X1+X2    0.9573      0.9431
F-test for Significance of Overall Regression Model
H0: β1 = β2 = … = βm = 0, so the model reduces to y = β0 + ε
Ha: At least one βi ≠ 0
Under H0, the observed F-statistic Fobs = MSR/MSE follows an F distribution
with m and n − m − 1 degrees of freedom
The F-test is a right-tailed test: F values are non-negative, and large values count against H0
H0 is rejected when the p-value is small
The p-value = P(F(m, n−m−1) > Fobs) is the area in the
tail to the right of the observed value
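Because SSR = R² × SST and SSE = (1 − R²) × SST, the observed F-statistic can be computed from R-squared alone. The sketch below applies this identity to the values from the earlier customer-spending example (R-squared = 0.5841, n = 200, m = 2); the function name is illustrative.

```python
def overall_f_statistic(r_squared, n, m):
    """F_obs = MSR / MSE, written in terms of R-squared.

    Uses SSR = R^2 * SST and SSE = (1 - R^2) * SST, so SST cancels.
    """
    msr_over_sst = r_squared / m
    mse_over_sst = (1.0 - r_squared) / (n - m - 1)
    return msr_over_sst / mse_over_sst


f_obs = overall_f_statistic(0.5841, n=200, m=2)
print(round(f_obs, 1))  # 138.3
```

The p-value is the right-tail area of the F distribution with m and n − m − 1 degrees of freedom beyond `f_obs`. It is not computed here because the standard library has no F distribution; with SciPy installed, `scipy.stats.f.sf(f_obs, m, n - m - 1)` returns that tail area.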
t-test for Significance of Individual Variable
R output callout: overall model significance
Example
You are an HR analyst for a company and want to build a model for
predicting the salary of employees. To build the model, you use the
data available with the company, i.e., the work experience of each employee.
You are also interested in exploring whether the gender of an employee has
an impact on salary.
The table below gives the salary, gender, and work experience (WE) of 27 workers in a firm. In
the table, gender = 1 denotes female and gender = 2 denotes male; WE is the work experience in years.
Inclusion of a categorical variable in Regression Model
If a categorical variable has k categories, create k − 1 dummy variables.
The category for which there is no dummy variable is called the base category.
Example
The Gender variable contains two categories: 1 (Female); 2 (Male)
Create 'Gender2', a dummy variable representing females:
Gender2 = 1 if Gender = 1; Gender2 = 0 otherwise
Y = β0 + β1 × Gender2
Gender2 = 1: Y1 = β0 + β1
Gender2 = 0: Y2 = β0
β1 = Y1 − Y2
The coefficient of a specific category of the variable represents the change in the
value of Y from the base category.
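The k − 1 rule can be sketched as a small helper. This is an illustration under stated assumptions: the function name, the `is_<category>` column naming, and the five-row gender list are all made up for the example; only the coding 1 = Female, 2 = Male comes from the text.

```python
def make_dummies(values, base):
    """Return k-1 indicator columns for a categorical variable.

    `base` is the base category: it gets no column of its own and is
    represented by all indicators being 0.
    """
    categories = sorted(set(values))
    if base not in categories:
        raise ValueError("base category not present in the data")
    return {
        f"is_{cat}": [1 if v == cat else 0 for v in values]
        for cat in categories
        if cat != base
    }


# Gender coded as in the example: 1 = Female, 2 = Male (base category).
gender = [1, 2, 2, 1, 2]          # made-up rows for illustration
dummies = make_dummies(gender, base=2)
print(dummies)  # {'is_1': [1, 0, 0, 1, 0]}
```

With two categories the helper returns one column, which plays the role of Gender2; a three-category variable would yield two columns.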
Does Salary depend on the gender of the employee?
Solution
Let the regression model be:
Y = β0 + β1 × Gender2 + β2 × WE
Gender2: 1 = Female, 0 = Male
R output:
Assumptions
Absence of Multicollinearity: Predictor variables should not be
highly correlated with each other.
L Linear Function: The mean of the response, E(Yi), at each value of the
predictor, xi, is a linear function of xi.
N Normally Distributed: The errors, ϵi, at each value of the predictor, xi, are
normally distributed.
E Equal Variances (denoted σ²): The errors, ϵi, at each value of the
predictor, xi, have equal variances.
Multicollinearity
Multicollinearity is a condition in which two or more predictors
are correlated with each other.
Effects of multicollinearity
A data set with severe multicollinearity may have a
significant overall F-test while none of the t-tests
for the individual predictors is significant.
Interpreting coefficients as marginal effects is
unwarranted when the predictors are correlated,
because one predictor cannot change while the others stay fixed.
When the predictor variables are uncorrelated, the coefficients remain the
same when other predictor variables are included.
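This property can be seen directly from the closed-form least-squares slopes for one and two predictors. The sketch below is an illustration with made-up data: `y`, `x1`, `x2_uncorr`, and `x2_corr` are hypothetical, and the two-predictor slope is obtained by solving the 2×2 normal equations on centered variables.

```python
def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _center(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def slope_alone(x1, y):
    """Coefficient of x1 in the one-predictor fit y ~ x1."""
    d1, dy = _center(x1), _center(y)
    return _dot(d1, dy) / _dot(d1, d1)


def slope_with_x2(x1, x2, y):
    """Coefficient of x1 in the two-predictor fit y ~ x1 + x2."""
    d1, d2, dy = _center(x1), _center(x2), _center(y)
    s11, s22, s12 = _dot(d1, d1), _dot(d2, d2), _dot(d1, d2)
    s1y, s2y = _dot(d1, dy), _dot(d2, dy)
    return (s22 * s1y - s12 * s2y) / (s11 * s22 - s12 ** 2)


y = [3.1, 7.9, 9.2, 14.0]        # hypothetical response values
x1 = [-1, -1, 1, 1]
x2_uncorr = [-1, 1, -1, 1]       # zero correlation with x1
x2_corr = [-1, -2, 2, 1]         # positively correlated with x1

print(round(slope_alone(x1, y), 2))                # 3.05
print(round(slope_with_x2(x1, x2_uncorr, y), 2))   # 3.05
print(round(slope_with_x2(x1, x2_corr, y), 2))     # 10.25
```

When the cross-product term between the predictors is zero, the two-predictor formula collapses to the one-predictor formula, so the coefficient of x1 is unchanged; once the predictors are correlated, the coefficient shifts.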
Example -1
Note the effect on the coefficient of X1 in the
presence of X2 and X3.
The coefficient of X2 changes sign in
the presence of X1 and X3.
Variance Inflation Factor
Variance Inflation Factors (VIFs) indicate the presence of multicollinearity:
VIFi = 1 / (1 − Ri²)
where Ri² is the R-squared obtained by regressing predictor xi on the other predictors.
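The VIF formula is simple enough to tabulate. In this sketch, Ri² is taken as given (it would come from regressing predictor xi on the remaining predictors); the function name and the sample Ri² values are illustrative.

```python
def vif(r_squared_i):
    """VIF_i = 1 / (1 - R_i^2), where R_i^2 comes from regressing
    predictor x_i on all the other predictors."""
    if not 0.0 <= r_squared_i < 1.0:
        raise ValueError("R_i^2 must lie in [0, 1)")
    return 1.0 / (1.0 - r_squared_i)


for r2 in (0.0, 0.5, 0.8, 0.9):
    print(r2, round(vif(r2), 2))
# 0.0 1.0
# 0.5 2.0
# 0.8 5.0
# 0.9 10.0
```

A VIF of 1 means the predictor is uncorrelated with the others, and the value grows without bound as Ri² approaches 1. Cutoffs such as 5 or 10 are commonly quoted rules of thumb; the text itself does not state a threshold.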
Remedial Measures
1. Eliminate one of the two correlated variables
Limitation: the magnitudes of the regression coefficients of the remaining (correlated) predictors may
be affected.
Variable Selection Methods
Several variable selection methods are available
They assist the analyst in determining which variables to include in the
model
The algorithms help select the predictors that lead to a good model
Four variable selection methods:
(1) Forward Selection
(2) Backward Elimination
(3) Stepwise Selection
(4) Best Subsets
Variables: Y, x1, x2, x3, x4
Candidate models, from the null model to the full model:
Y~1
Y~x1
Y~x1+x3
Y~x1+x2+x3+x4
Forward Selection Procedure
The procedure begins with no variables in the model
Step 1:
– The predictor most highly correlated with the response is selected (call it x1)
Step 2:
– For the remaining predictors, compute the sequential F-statistic given the predictors already
in the model
– For example, on the first pass the sequential F-statistics F(x2|x1), F(x3|x1),
F(x4|x1) are computed
– Select the variable with the largest sequential F-statistic
Step 3:
– Test the significance of the sequential F-statistic for the variable selected in Step 2
– If the variable is not significant, stop and report the current model, without adding the
variable selected in Step 2
– Otherwise, add the variable from Step 2 and return to Step 2
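The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the fixed cutoff `f_to_enter = 4.0` is a hypothetical stand-in for the significance test, the data are made up, and the first variable is chosen by its F-statistic against the intercept-only model, which picks the same variable as "most highly correlated with the response".

```python
def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def sse(columns, y):
    """Error sum of squares of the least-squares fit y ~ 1 + columns."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))


def forward_selection(data, y, f_to_enter=4.0):
    """Add, one at a time, the variable with the largest sequential F-statistic;
    stop when that statistic falls below f_to_enter (assumed cutoff)."""
    selected, remaining = [], list(data)
    current_sse, n = sse([], y), len(y)
    while remaining:
        best = None
        for name in remaining:
            trial_sse = sse([data[v] for v in selected + [name]], y)
            df_error = n - (len(selected) + 1) - 1
            f_stat = (current_sse - trial_sse) / (trial_sse / df_error)
            if best is None or f_stat > best[1]:
                best = (name, f_stat, trial_sse)
        if best[1] < f_to_enter:
            break                      # Step 3: not significant, stop
        selected.append(best[0])       # otherwise add and return to Step 2
        remaining.remove(best[0])
        current_sse = best[2]
    return selected


# Made-up data: y depends strongly on x1, moderately on x3, barely on x2.
data = {
    "x1": [-1, -1, -1, -1, 1, 1, 1, 1],
    "x2": [-1, -1, 1, 1, -1, -1, 1, 1],
    "x3": [-1, 1, -1, 1, -1, 1, -1, 1],
}
y = [2.4, 7.4, 3.6, 6.6, 13.4, 16.4, 12.6, 17.6]
print(forward_selection(data, y))  # ['x1', 'x3']
```

A real analysis would replace the fixed cutoff with a p-value from the F distribution with 1 and n − (number of predictors in the trial model) − 1 degrees of freedom.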
Stepwise Procedure
• The Stepwise Procedure is a modification of the Forward Selection
Procedure
• A variable entered into the model during forward selection
may turn out to be non-significant once additional variables enter
the model
• If a variable in the model is no longer significant, it is removed
from the model. If there is more than one such variable, the one
with the smallest partial F-statistic is removed
• The procedure terminates when no additional variables can enter or be
removed from the model
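The removal rule in the stepwise procedure can be sketched as a single function. It assumes the partial F-statistics of the variables currently in the model have already been computed; the cutoff `f_to_remove = 4.0` and the sample statistics are hypothetical.

```python
def variable_to_remove(partial_f, f_to_remove=4.0):
    """Return the variable to drop in a stepwise pass, or None.

    partial_f maps each variable currently in the model to its partial
    F-statistic given the other variables in the model. Variables whose
    statistic falls below f_to_remove (assumed cutoff) count as no longer
    significant; among those, the smallest statistic is removed.
    """
    weak = {name: f for name, f in partial_f.items() if f < f_to_remove}
    if not weak:
        return None
    return min(weak, key=weak.get)


# Hypothetical partial F-statistics after a new variable entered the model.
print(variable_to_remove({"x1": 1.2, "x2": 25.0, "x3": 0.4}))  # x3
print(variable_to_remove({"x1": 12.0, "x2": 25.0}))            # None
```

In a full stepwise loop this check would run after every forward step, and the procedure would end when neither an entry nor a removal is possible.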
The Partial F-Test
Suppose the model has predictors x1, …, xp and we consider adding an additional predictor x*
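The formula that followed on this slide did not survive extraction. The sketch below uses the standard extra-sum-of-squares definition of the partial F-statistic, which is an assumption rather than a quotation from the source: F(x* | x1, …, xp) = SSextra / MSEfull, where SSextra is the drop in the error sum of squares when x* is added. The numeric inputs are hypothetical.

```python
def partial_f_statistic(sse_reduced, sse_full, n, p):
    """Partial F-statistic for adding one predictor x* to a model that
    already contains p predictors, fitted on n observations.

    sse_reduced: error sum of squares with x1..xp only
    sse_full:    error sum of squares with x1..xp and x*
    """
    ss_extra = sse_reduced - sse_full      # extra sum of squares due to x*
    mse_full = sse_full / (n - p - 2)      # full model has p + 1 predictors
    return ss_extra / mse_full


# Hypothetical sums of squares, for illustration only.
f_partial = partial_f_statistic(sse_reduced=34.08, sse_full=2.08, n=8, p=1)
print(round(f_partial, 1))  # 76.9
```

If x* contributes nothing beyond x1, …, xp, this statistic follows an F distribution with 1 and n − p − 2 degrees of freedom, so a large value supports adding x*.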
The Partial F-Test (cont’d)
Null hypothesis for Partial F-Test