
Chapter 14

Multiple Regression and Model Building

Copyright © 2015 McGraw-Hill Education. All rights reserved. No reproduction or distribution without the prior written consent of McGraw-Hill Education.
Multiple Regression and Model
Building
14.1 The Multiple Regression Model and the
Least Squares Point Estimate
14.2 Model Assumptions and the Standard Error
14.3 R² and Adjusted R²
14.4 The Overall F Test
14.5 Testing the Significance of an Independent
Variable
14.6 Confidence and Prediction Intervals

Multiple Regression and Model
Building Continued
14.7 The Sales Territory Performance Case:
Evaluating Employee Performance
14.8 Using Dummy Variables to Model
Qualitative Independent Variables
14.9 Using Squared and Interaction Variables
14.10 Model Building and the Effects of
Multicollinearity
14.11 Residual Analysis in Multiple Regression
14.12 Logistic Regression

LO 14-1: Explain the
multiple regression
model and the related
least squares point
estimates.
14.1 The Multiple Regression Model and
the Least Squares Point Estimate
 Simple linear regression used one independent
variable to explain the dependent variable
 Some relationships are too complex to be described
using a single independent variable
 Multiple regression uses two or more independent
variables to describe the dependent variable
 This allows multiple regression models to handle more
complex situations
 There is no limit to the number of independent
variables a model can use
 Multiple regression has only one dependent variable


The Multiple Regression Model

 The linear regression model relating y to x1, x2,…, xk
is y = β0 + β1x1 + β2x2 +…+ βkxk + ε
 µy = β0 + β1x1 + β2x2 +…+ βkxk is the mean value of
the dependent variable y when the values of the
independent variables are x1, x2,…, xk
 β0, β1, β2,…, βk are the unknown regression
parameters relating the mean value of y to
x1, x2,…, xk
 ε is an error term that describes the effects on y of all
factors other than the independent variables
x1, x2,…, xk

The Least Squares Estimates and Point
Estimation and Prediction
1. Estimation/prediction equation
ŷ = b0 + b1x1 + b2x2 + … + bkxk
is the point estimate of the mean value of the
dependent variable when the values of the
independent variables are x1, x2,…, xk
2. It is also the point prediction of an individual value of
the dependent variable when the values of the
independent variables are x1, x2,…, xk
3. b0, b1, b2,…, bk are the least squares point
estimates of the parameters β0, β1, β2,…, βk
4. x1, x2,…, xk are specified values of the independent
predictor variables x1, x2,…, xk
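
The least squares point estimates can be sketched in code. The example below is only an illustration: it solves the normal equations (X'X)b = X'y with plain Gaussian elimination, and the function names `least_squares` and `predict` and the data are made up for this sketch (they are not the Tasty Sub Shop data).

```python
# Least squares point estimates b0, b1, ..., bk from the normal equations
# (X'X) b = X'y, solved by Gaussian elimination. Standard library only.

def least_squares(xs, y):
    """xs: list of rows [x1, ..., xk]; y: list of observed values.
    Returns the list [b0, b1, ..., bk]."""
    X = [[1.0] + list(row) for row in xs]      # prepend the intercept column
    p = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in X) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(X, y))]
         for i in range(p)]
    # Forward elimination with partial pivoting
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(c + 1, p):
            f = A[r][c] / A[c][c]
            for j in range(c, p + 1):
                A[r][j] -= f * A[c][j]
    # Back substitution
    b = [0.0] * p
    for i in reversed(range(p)):
        tail = sum(A[i][j] * b[j] for j in range(i + 1, p))
        b[i] = (A[i][p] - tail) / A[i][i]
    return b

def predict(b, x):
    """Point estimate / point prediction: b0 + b1*x1 + ... + bk*xk."""
    return b[0] + sum(bj * xj for bj, xj in zip(b[1:], x))

# Hypothetical data generated exactly from y = 2 + 3*x1 - x2 (no error term),
# so the estimates should recover 2, 3 and -1 up to rounding.
xs = [[1, 2], [2, 1], [3, 5], [4, 3], [5, 8]]
y = [2 + 3 * x1 - x2 for x1, x2 in xs]
b = least_squares(xs, y)
print([round(v, 6) for v in b])
print(round(predict(b, [6, 4]), 6))
```

In practice a statistics package would be used instead; the point here is only that b0, b1,…, bk come from one linear system and that the same equation gives both the point estimate and the point prediction.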

EXAMPLE 14.1 The Tasty Sub
Shop Case

Figure 14.4 (a)


LO 14-2: Explain the
assumptions behind
multiple regression and
calculate the standard
error.
14.2 Model Assumptions and the
Standard Error
 The model is
y = β0 + β1x1 + β2x2 + … + βkxk + ε
 Assumptions for multiple regression are
stated about the model error terms, the ε’s


The Regression Model Assumptions
Continued

1. Mean of Zero Assumption
The mean of the error terms is equal to 0
2. Constant Variance Assumption
The variance of the error terms, σ², is the
same for every combination of values of x1,
x2,…, xk
3. Normality Assumption
The error terms follow a normal distribution
for every combination of values of x1, x2,…, xk
4. Independence Assumption
The values of the error terms are statistically
independent of each other


Sum of Squares
 Sum of squared errors
$$SSE = \sum e_i^2 = \sum (y_i - \hat{y}_i)^2$$
 Mean squared error: point estimate of the
residual variance σ²
$$s^2 = MSE = \frac{SSE}{n - k - 1}$$
 Standard error: point estimate of the residual
standard deviation σ
$$s = \sqrt{MSE} = \sqrt{\frac{SSE}{n - k - 1}}$$
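
As a small worked illustration of these three quantities, the sketch below computes SSE, MSE and s from observed and fitted values. The function name `standard_error` and the numbers are hypothetical; the fitted values are invented for the arithmetic and do not come from an actual least squares fit.

```python
import math

def standard_error(y, y_hat, k):
    """Return (SSE, MSE, s) for n observations and k independent variables."""
    n = len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # sum of squared residuals
    mse = sse / (n - k - 1)                                # point estimate of sigma^2
    return sse, mse, math.sqrt(mse)                        # s estimates sigma

# Hypothetical observed and fitted values (n = 6) for a model with k = 2
y = [10.0, 12.0, 15.0, 19.0, 24.0, 30.0]
y_hat = [11.0, 11.0, 16.0, 18.0, 25.0, 29.0]
sse, mse, s = standard_error(y, y_hat, k=2)
print(sse, mse, s)
```

Note that the divisor is n - k - 1, not n: one degree of freedom is used up for each estimated parameter b0, b1,…, bk.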
LO 14-3: Calculate and
interpret the multiple
and adjusted multiple
coefficients of
determination.
14.3 R² and Adjusted R²

1. Total variation is given by the formula
Σ(yi - ȳ)²
2. Explained variation is given by the formula
Σ(ŷi - ȳ)²
3. Unexplained variation is given by the
formula
Σ(yi - ŷi)²
4. Total variation is the sum of explained and
unexplained variation
This section can be read anytime
after reading Section 14.1

R² and Adjusted R² Continued

5. The multiple coefficient of determination, R², is
the ratio of explained variation to total
variation
6. R² is the proportion of the total variation that
is explained by the overall regression model
7. The multiple correlation coefficient R is the
square root of R²


Multiple Correlation Coefficient R

 The multiple correlation coefficient R is just
the square root of R²
 With simple linear regression, r would take on
the sign of b1
 With multiple regression there are several bj’s,
and their signs can differ
 For this reason, R is defined as the nonnegative
square root and is never negative
 To interpret the direction of the relationship
between a particular xj and y, you must look at the
sign of the corresponding bj coefficient


The Adjusted R²
 Adding an independent variable to a multiple
regression model never lowers R²
 R² will usually rise slightly even if the new variable
has no relationship to y
 The adjusted R² corrects this tendency in R²
 As a result, it gives a better estimate of the
importance of the independent variables
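
The correction can be made concrete with a short sketch. The slides do not print the adjusted R² formula; the code below assumes the standard one, adjusted R² = (R² - k/(n - 1)) · (n - 1)/(n - (k + 1)). The function names and the variation figures are hypothetical.

```python
def r_squared(explained, total):
    """Multiple coefficient of determination: explained / total variation."""
    return explained / total

def adjusted_r_squared(r2, n, k):
    """Standard adjusted R^2 for n observations and k independent variables."""
    return (r2 - k / (n - 1)) * ((n - 1) / (n - (k + 1)))

# Hypothetical model: n = 25 observations, k = 4 independent variables
r2_a = r_squared(80.0, 100.0)
adj_a = adjusted_r_squared(r2_a, n=25, k=4)

# Add a fifth variable that explains almost nothing extra
r2_b = r_squared(80.1, 100.0)
adj_b = adjusted_r_squared(r2_b, n=25, k=5)

print(r2_a, adj_a)
print(r2_b, adj_b)
```

In this made-up comparison R² goes up when the fifth variable is added while the adjusted R² goes down, which is the behavior the bullets above describe.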

LO 14-4: Test the
significance of a
multiple regression
model by using an F
test.
14.4 The Overall F Test
 To test
H0: β1 = β2 = … = βk = 0 versus
Ha: At least one of β1, β2,…, βk ≠ 0
Test statistic
$$F(\text{model}) = \frac{(\text{Explained variation})/k}{(\text{Unexplained variation})/[n - (k + 1)]}$$
 Reject H0 in favor of Ha if F(model) > Fα* or
p-value < α
*Fα is based on k numerator and n-(k+1) denominator degrees of freedom
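
A minimal sketch of the overall F test follows. The function name `f_model` and the variation figures are hypothetical, and the critical value 2.87 is assumed to have been read from an F table for α = 0.05 with 4 and 20 degrees of freedom; the code itself does not compute critical values or p-values.

```python
def f_model(explained, unexplained, n, k):
    """Overall F statistic for a model with k independent variables."""
    return (explained / k) / (unexplained / (n - (k + 1)))

# Hypothetical values: n = 25 observations, k = 4 independent variables
F = f_model(explained=80.0, unexplained=20.0, n=25, k=4)

F_CRIT = 2.87            # assumed table value: F(.05) with 4 and 20 df
reject_h0 = F > F_CRIT   # reject H0: beta1 = ... = betak = 0 when True
print(F, reject_h0)
```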

LO 14-5: Test the
significance of a
single independent
variable.
14.5 Testing the Significance of an
Independent Variable
 A variable in a multiple regression model is
not likely to be useful unless there is a
significant relationship between it and y
 To test significance, we use the null
hypothesis H0: βj = 0
 versus the alternative hypothesis
Ha: βj ≠ 0

Testing Significance of an Independent
Variable #2

Testing Significance of an
Independent Variable #3
 It is customary to test the significance of every
independent variable
 If we can reject H0: βj = 0 at α = 0.05, we have
strong evidence that the independent variable xj is
significantly related to y
 If we can reject H0: βj = 0 at α = 0.01, we have
very strong evidence that the independent
variable xj is significantly related to y
 The smaller the significance level α at which
H0 can be rejected, the stronger the evidence
that xj is significantly related to y

A Confidence Interval for the
Regression Parameter βj
 If the regression assumptions hold, a 100(1 - α)%
confidence interval for βj
is [bj ± tα/2 sbj]
 tα/2 is based on n – (k + 1) degrees of
freedom
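
The test and the interval for a single coefficient can be sketched together. The test statistic t = bj / sbj is the standard one and is an assumption here, since it does not appear in the extracted slide text. The function names and the numbers are hypothetical, and the t point 2.086 is assumed to come from a t table (α/2 = 0.025, 20 degrees of freedom).

```python
def t_statistic(b_j, s_bj):
    """Standard t statistic for H0: beta_j = 0."""
    return b_j / s_bj

def coefficient_interval(b_j, s_bj, t_crit):
    """100(1 - alpha)% confidence interval [b_j +/- t_crit * s_bj]."""
    half_width = t_crit * s_bj
    return (b_j - half_width, b_j + half_width)

# Hypothetical estimate and standard error, n - (k + 1) = 20 degrees of freedom
b_j, s_bj = 2.5, 0.5
T_CRIT = 2.086                      # assumed table value: t(.025) with 20 df

t = t_statistic(b_j, s_bj)
lower, upper = coefficient_interval(b_j, s_bj, T_CRIT)
print(t, abs(t) > T_CRIT)           # reject H0 at alpha = 0.05 when True
print(lower, upper)
```

The two views agree: H0: βj = 0 is rejected at level α exactly when the 100(1 - α)% interval excludes 0.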

LO 14-6: Find and
interpret a confidence
interval for a mean
value and a prediction
interval for an individual
value.
14.6 Confidence and Prediction
Intervals
 The point on the regression line
corresponding to a particular value of x1,
x2,…, xk, of the independent variables is
ŷ = b0 + b1x1 + b2x2 + … + bkxk
 It is unlikely that this value will equal the
mean value of y for these x values
 Therefore, we need to place bounds on how
far away the predicted value might be
 We can do this by calculating a confidence
interval for the mean value of y and a
prediction interval for an individual value of y


Distance Value
 Both the confidence interval for the mean
value of y and the prediction interval for an
individual value of y employ a quantity called
the distance value
 With simple regression, we were able to
calculate the distance value fairly easily
 However, for multiple regression, calculating
the distance value requires matrix algebra

A Confidence Interval for a Mean
Value of y
 Assume the regression assumptions hold
Confidence interval for the mean value of y
$$[\hat{y} \pm t_{\alpha/2}\, s_{\hat{y}}], \qquad s_{\hat{y}} = s\sqrt{\text{Distance value}}$$
Prediction interval for an individual value of y
$$[\hat{y} \pm t_{\alpha/2}\, s_{(y-\hat{y})}], \qquad s_{(y-\hat{y})} = s\sqrt{1 + \text{Distance value}}$$
 tα/2 is based on n-(k+1) degrees of freedom
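
Once the point prediction, the standard error s and the distance value are available (typically from software output), both intervals are simple arithmetic. The sketch below uses hypothetical inputs and function names; the t point is passed in rather than computed.

```python
import math

def mean_confidence_interval(y_hat, s, distance_value, t_crit):
    """Confidence interval for the mean value of y."""
    half = t_crit * s * math.sqrt(distance_value)
    return (y_hat - half, y_hat + half)

def prediction_interval(y_hat, s, distance_value, t_crit):
    """Prediction interval for an individual value of y."""
    half = t_crit * s * math.sqrt(1 + distance_value)
    return (y_hat - half, y_hat + half)

# Hypothetical inputs, chosen only to keep the arithmetic easy to follow
y_hat, s, distance_value, t_crit = 100.0, 4.0, 0.25, 2.0

ci = mean_confidence_interval(y_hat, s, distance_value, t_crit)
pi = prediction_interval(y_hat, s, distance_value, t_crit)
print(ci)
print(pi)
```

The extra 1 under the square root accounts for the error term of the individual observation, so the prediction interval is always wider than the confidence interval at the same x values.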

14.7 The Sales Territory Performance
Case: Evaluating Employee Performance
yi Yearly sales of the company’s product
x1 Number of months the representative has
been employed
x2 Sales of products in the sales territory
x3 Dollar advertising expenditure in the territory
x4 Weighted average of the company’s market
share in territory for the previous four years
x5 Change in the company’s market share in
the territory over the previous four years

Partial Excel Output of a Regression Analysis
of the Sales Territory Performance Data

Figure 14.10


LO 14-7: Use dummy
variables to model
qualitative independent
variables.
14.8 Using Dummy Variables to Model
Qualitative Independent Variables
 So far, we have only looked at including
quantitative data in a regression model
 However, we may wish to include descriptive
qualitative data as well
 For example, we might want to include the gender
of respondents
 We can model the effects of different levels of
a qualitative variable by using what are called
dummy variables
 These are also known as indicator variables

How to Construct Dummy
Variables
 A dummy variable always has a value of
either 0 or 1
 For example, to model sales at two locations,
we would code the first location as 0 and
the second as 1
 Operationally, it does not matter which is
coded 0 and which is coded 1

What If We Have More Than Two
Categories?
 Consider having three categories, say A, B
and C
 Cannot code this using one dummy variable
 A=0, B=1 and C=2 would be invalid
 Assumes the difference between A and B is
the same as B and C
 We must use multiple dummy variables
 Specifically, k categories requires k-1 dummy
variables


What If We Have Three Categories?


 For A, B, and C, would need two dummy
variables
 x1 is 1 for A, zero otherwise
 x2 is 1 for B, zero otherwise
 If x1 and x2 are zero, must be C
 This is why the third dummy variable is not
needed
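
The coding scheme above can be sketched as a small helper. The function name `dummy_code` and the choice of C as the baseline category are assumptions made for this illustration; any category could serve as the baseline.

```python
def dummy_code(values, baseline):
    """Code a qualitative variable with (number of categories - 1) dummies.

    Returns the non-baseline levels and one row of 0/1 values per observation.
    The baseline category is the one with every dummy equal to 0.
    """
    levels = sorted(set(values) - {baseline})
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows

# Three categories A, B, C with C as the baseline: x1 flags A, x2 flags B
levels, rows = dummy_code(["A", "B", "C", "A"], baseline="C")
print(levels)
for row in rows:
    print(row)
```

Each row has at most one 1, and the all-zero row identifies the baseline category, which is why a third dummy variable would carry no extra information.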


Interaction Models
 So far, we have only considered dummy
variables as stand-alone variables
 The model so far is y = β0 + β1x + β2D + ε
 where D is a dummy variable
 However, we can also look at the interaction
between a dummy variable and other variables
 That model would take the form
y = β0 + β1x + β2D + β3xD + ε
 With an interaction term, both the intercept
and slope are shifted
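
To see why both the intercept and the slope shift, substitute the two possible values of D into the mean of the interaction model (a short derivation, using only the symbols defined above):

```latex
\begin{aligned}
D = 0:&\quad \mu_y = \beta_0 + \beta_1 x \\
D = 1:&\quad \mu_y = (\beta_0 + \beta_2) + (\beta_1 + \beta_3)\,x
\end{aligned}
```

So β2 is the change in intercept and β3 is the change in slope for the group coded D = 1. Without the xD term (β3 = 0) the two lines are parallel and only the intercept differs.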

LO 14-8: Use
squared and
interaction variables.
14.9 Using Squared and
Interaction Variables
 The quadratic regression model is:
y = β0 + β1x + β2x² + ε
 where
1. β0 + β1x + β2x² is μy
2. β0, β1, and β2 are the regression parameters
3. ε is an error term


Using Interaction Variables

 Regression models often contain interaction
variables
 These are formed by multiplying two independent
variables together
 Consider a model where x3 and x4 interact
and x3 also enters as a squared term
y = β0 + β1x4 + β2x3 + β3x3² + β4x4x3 + ε

LO 14-9: Describe
multicollinearity and
build a multiple
regression model.
14.10 Model Building and the
Effects of Multicollinearity
 Multicollinearity: when “independent”
variables are related to one another
 Considered severe when the simple
correlation between two independent variables exceeds 0.9
 Even moderate multicollinearity can be a
problem
 Another measurement is the variance inflation
factor
$$VIF_j = \frac{1}{1 - R_j^2}$$
where Rj² is the multiple coefficient of determination for the
regression of xj on the other independent variables
 Multicollinearity is a problem when VIF > 10
 It is a moderate problem when VIF > 5
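
A minimal sketch of the VIF rule of thumb follows. It assumes Rj² has already been obtained by regressing xj on the other independent variables; the function names, the labels returned by `classify` and the example values are hypothetical.

```python
def vif(r2_j):
    """Variance inflation factor 1 / (1 - Rj^2) for independent variable xj."""
    if not 0.0 <= r2_j < 1.0:
        raise ValueError("Rj^2 must be in [0, 1)")
    return 1.0 / (1.0 - r2_j)

def classify(v):
    """Apply the thresholds from the slide: VIF > 10 and VIF > 5."""
    if v > 10:
        return "problem"
    if v > 5:
        return "moderate"
    return "not flagged"

for r2_j in (0.5, 0.875, 0.95):     # hypothetical Rj^2 values
    v = vif(r2_j)
    print(r2_j, v, classify(v))
```

When xj is unrelated to the other independent variables, Rj² = 0 and the VIF takes its smallest value, 1.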

Effect of Adding Independent
Variable
 Adding any independent variable will increase
R²
 This is true even when adding an unimportant independent
variable
 Thus, R² cannot tell us that adding an
independent variable is undesirable

A Better Criterion is the Standard
Error
 A better criterion is the size of the standard
error s
$$s = \sqrt{\frac{SSE}{n - k - 1}}$$
 If s increases when an independent variable
is added, we should not add that variable
 However, decreasing s alone is not enough
 An independent variable should only be
included if it reduces s enough to offset the
higher tα/2 value (adding a variable costs one degree
of freedom) and so reduces the length of the
desired prediction interval for y

C Statistic
 Another quantity for comparing regression
models is called the C (a.k.a. Cp) statistic
 First, calculate the mean square error for the
model containing all p potential independent
variables (s²p)
 Next, calculate SSE for a reduced model with
k independent variables
$$C = \frac{SSE}{s_p^2} - [n - 2(k + 1)]$$
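
The C statistic is straightforward to compute once the two ingredients are known. In the sketch below the function name `c_statistic` and all numbers are hypothetical.

```python
def c_statistic(sse_reduced, mse_full, n, k):
    """C = SSE / s_p^2 - [n - 2(k + 1)] for a reduced model with k variables.

    sse_reduced: SSE of the reduced model
    mse_full:    mean square error s_p^2 of the model with all p variables
    """
    return sse_reduced / mse_full - (n - 2 * (k + 1))

n, p = 25, 5
mse_full = 2.0                                   # hypothetical s_p^2

# Hypothetical reduced model with k = 3 independent variables
c_reduced = c_statistic(sse_reduced=44.0, mse_full=mse_full, n=n, k=3)

# Sanity check: for the full model, SSE = s_p^2 * (n - p - 1), so C = p + 1
c_full = c_statistic(mse_full * (n - p - 1), mse_full, n, k=p)
print(c_reduced, c_full)
```

The sanity check shows why k + 1 is the natural benchmark for C: the full model always lands exactly on it.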


C Statistic Continued
 We want the value of C to be small
 Adding unimportant independent variables
will raise the value of C
 While we want C to be small, we also wish to
find a model for which C roughly equals k+1
 A model with C substantially greater than k+1
has substantial bias and is undesirable
 If a model has a small value of C and C for
this model is less than k+1, then it is not
biased and the model should be considered
desirable

The Partial F Test: An F Test for a Portion
of a Regression Model
 To test
 H0: All of the βj coefficients corresponding to the
independent variables in the subset are zero
 Ha: At least one of the βj coefficients is not equal to
zero
$$F(\text{partial}) = \frac{(SSE_R - SSE_C)/(k - g)}{SSE_C/[n - (k + 1)]}$$
where SSE_C is the SSE of the complete model with k independent variables and
SSE_R is the SSE of the reduced model with g independent variables (the subset removed)
 Reject H0 in favor of Ha if:
 F(partial) > Fα* or
 p-value < α
*Fα is based on k-g numerator and n-(k+1) denominator degrees of freedom
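
A minimal sketch of the partial F statistic, assuming the usual setup in which the complete model has k independent variables and the reduced model keeps g of them (so k - g coefficients are being tested). The function name `partial_f` and the SSE values are hypothetical.

```python
def partial_f(sse_reduced, sse_complete, n, k, g):
    """Partial F statistic for testing the k - g variables dropped
    from the complete model to form the reduced model."""
    numerator = (sse_reduced - sse_complete) / (k - g)
    denominator = sse_complete / (n - (k + 1))
    return numerator / denominator

# Hypothetical values: n = 25, complete model k = 4, reduced model g = 2
F_partial = partial_f(sse_reduced=60.0, sse_complete=40.0, n=25, k=4, g=2)
print(F_partial)   # compare with the F point for 2 and 20 degrees of freedom
```

Dropping variables can never lower SSE, so SSE_R - SSE_C is nonnegative; the test asks whether that increase is large relative to the complete model's mean square error.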

LO 14-10: Use residual
analysis to check the
assumptions of multiple
regression.
14.11 Residual Analysis in
Multiple Regression
 For an observed value of yi, the residual is
ei = yi - ŷi = yi – (b0 + b1xi1 + … + bkxik)
 If the regression assumptions hold, the residuals
should look like a random sample from a normal
distribution with mean 0 and variance σ²
 Residual plots
 Residuals versus each independent variable
 Residuals versus predicted y’s
 Residuals in time order (if the response is a time
series)
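
The residuals that feed these plots can be computed directly from the definition. In the sketch below the function name `residuals`, the coefficients and the data are hypothetical; the coefficients are illustrative values, not the result of an actual fit.

```python
def residuals(b, xs, y):
    """Residuals e_i = y_i - (b0 + b1*x_i1 + ... + bk*x_ik)."""
    fitted = [b[0] + sum(bj * xij for bj, xij in zip(b[1:], row)) for row in xs]
    return [yi - fi for yi, fi in zip(y, fitted)]

# Hypothetical coefficients [b0, b1, b2] and three observations
b = [1.0, 2.0, 3.0]
xs = [[1.0, 1.0], [2.0, 0.0], [0.0, 2.0]]
y = [6.5, 4.0, 7.5]

e = residuals(b, xs, y)
print(e)
```

Plotting `e` against each column of `xs`, against the fitted values, and in time order gives the three kinds of residual plots listed above.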

Residual Plots for the Sales
Territory Performance Model

Figure 14.31