Professional Documents
Culture Documents
Chapter 14
Chapter 14
Introduction to Multiple
Regression
Objectives
In this chapter, you learn:
How to develop a multiple regression model.
How to interpret the regression coefficients.
How to determine which independent variables to include in
the regression model.
How to determine which independent variables are most
important in predicting a dependent variable
How to use categorical independent variables in a regression
model.
How to use logistic regression to predict a categorical
dependent variable.
Yi β 0 β1X1i β 2 X 2i β k X ki ε i
Where:
β0 = Y intercept
β1 = slope of Y with variable X1, holding X2, X3, X4, . . . , Xk constant
β2 = slope of Y with variable X2, holding X1, X3, X4, . . . , Xk constant
β3 = slope of Y with variable X3, holding X1, X2, X4,. . . , Xk constant
.
.
βk = slope of Y with variable Xk, holding X1, X2, X3,. . . , Xk-1 constant
εi = random error in Y for observation i
Yi β 0 β1X1i β 2 X 2i ε i
Where:
β0 = Y intercept
β1 = slope of Y with variable X1, holding X2 constant
β2 = slope of Y with variable X2, holding X1 constant
εi = random error in Y for observation i
ˆ b b X b X b X
Yi 0 1 1i 2 2i k ki
In this chapter we will use Excel, JMP, and Minitab to obtain
the regression slope coefficients and other regression
summary measures.
Dr. Tran Quoc Duy Applied Statistics for Business
ĐẠI HỌC FPT CẦN THƠ
X2
X1
Dr. Tran Quoc Duy Applied Statistics for Business
ĐẠI HỌC FPT CẦN THƠ
Example:
2 Independent Variables
DCOVA
A distributor of frozen dessert pies wants to
evaluate factors thought to influence demand.
R Square 0.52148
Adjusted R Square 0.44172
Standard Error 47.46341 Sales 306.526 - 24.975(Price) 74.131(Advertising)
Observations 15
ANOVA df SS MS F Significance F
Regression 2 29460.027 14730.013 6.53861 0.01201
Residual 12 27033.306 2252.776
Total 14 56493.333
DCOVA
JMP & Minitab Multiple Regression Output
Analysis of Variance
Source DF SS MS F P
Regression 2 29460 14730 6.54 0.012
Residual Error 12 27033 2253
Total 14 56493
Predictions in Excel
DCOVA
Input values
<
Predicted Y value
Confidence interval for the
mean value of Y, given
these X values.
New
Obs Fit SE Fit 95% CI 95% PI
1 428.6 17.2 (391.1, 466.1) (318.6, 538.6)
New
Obs Price Advertising
1 5.50 3.50
Multiple Coefficient of
Determination In Excel DCOVA
Regression Statistics
SSR 29,460.027
Multiple R 0.72213
r2 .52148
R Square 0.52148 SST 56,493.306
Adjusted R Square 0.44172
52.1% of the variation in pie sales
Standard Error 47.46341
is explained by the variation in
Observations 15
price and advertising.
ANOVA df SS MS F Significance F
Regression 2 29460.027 14730.013 6.53861 0.01201
Residual 12 27033.306 2252.776
Total 14 56493.333
Multiple Coefficient of
Determination In JMP & Minitab
DCOVA
SSR 29,460.027
r
2
.52148
SST 56,493.306
Analysis of Variance
Source DF SS MS F P
Regression 2 29460 14730 6.54 0.012
Residual Error 12 27033 2253
Total 14 56493
Adjusted r2
DCOVA
r2 never decreases when a new X variable is added
to the model.
– This can be a disadvantage when comparing
models.
What is the net effect of adding a new variable?
– We lose a degree of freedom when a new X
variable is added.
– Did the new X variable add enough explanatory
power to offset the loss of one degree of freedom?
Adjusted r2
DCOVA
Shows the proportion of variation in Y explained by all
X variables adjusted for the number of X variables
used:
2 n 1
r2
1 (1 r )
n k 1
adj
(where n = sample size, k = number of independent variables)
SSR
MSR k
FSTAT
MSE SSE
n k 1
ei = (Yi – Yi)
<
Yi
x2i
X2
x1i
The best fit equation is found
X1 by minimizing the sum of
squared errors, e2.
Dr. Tran Quoc Duy Applied Statistics for Business
ĐẠI HỌC FPT CẦN THƠ
<
ei = (Yi – Yi)
Assumptions:
The errors are normally distributed.
Errors have a constant variance.
The model errors are independent.
<
– Residuals vs. X1i.
– Residuals vs. X2i.
– Residuals vs. time (if time series data).
100 100
Residuals
Residuals
50 50
0 0
-50 -50
50
0 multiple regression
-50
-100 2.5
model is appropriate for
3 3.5 4 4.5 5
Advertising predicting pie sales.
Test Statistic:
bj 0
t STAT (df = n – k – 1)
Sb
j
b j tα / 2 Sb where t has
(n – k – 1) d.f.
j
Here, t has
(15 – 2 – 1) = 12 d.f.
SSR(X1 | X2)
= SSR (all variables) – SSR(X2)
= .05, df = 1 and 12
F0.05 = 4.75
(For X1 and X2) (For X2 only)
ANOVA ANOVA
df SS MS df SS
Regression 2 29460.02687 14730.01343 Regression 1 17484.22249
Residual 12 27033.30647 2252.775539 Residual 13 39009.11085
Total 14 56493.33333 Total 14 56493.33333
DCOVA
(For X1 and X2) (For X2 only)
ANOVA ANOVA
df SS MS df SS
Regression 2 29460.02687 14730.01343 Regression 1 17484.22249
Residual 12 27033.30647 2252.775539 Residual 13 39009.11085
Total 14 56493.33333 Total 14 56493.33333
= .05, df = 1 and 12
F0.05 = 4.75
(For X1 and X2) (For X1 only)
ANOVA ANOVA
df SS MS df SS
Regression 2 29460.02687 14730.01343 Regression 1 11100.43803
Residual 12 27033.30647 2252.775539 Residual 13 45392.8953
Total 14 56493.33333 Total 14 56493.33333
t 2
STAT FSTAT
SSR (X 2 |X1 )
r2
Y2.1
SST SSR(X1 and X 2 ) SSR(X 1 |X 2 )
Measures the proportion of variation in Y explained by X2
while controlling for (holding constant) X1.
11,975.804
2
rY1.2 0.276
54 ,493.333 29,460.027 18,359.589
18,359.589
2
rY2.1 0.496
54 ,493.333 29,460.027 11,975.804
Dummy-Variable Example
(with 2 Levels)
DCOVA
Ŷ b0 b1X1 b2 X2
Let:
Y = pie sales
X1 = price
X2 = holiday (X2 = 1 if a holiday occurred during the week).
(X2 = 0 if there was no holiday that week).
Dummy-Variable Example
(with 2 Levels)
DCOVA
Ŷ b0 b1X1 b2 (1) (b0 b2 ) b1X1 Holiday
Dummy-Variable Models
(more than 2 Levels)
DCOVA
Dummy-Variable Models
(more than 2 Levels) (continued)
DCOVA
Example: Let “colonial” be the default category, and let X2
and X3 be used for the other two categories:
Y = house price
X1 = square feet
X2 = 1 if ranch, 0 otherwise
X3 = 1 if split level, 0 otherwise
Ŷ b0 b1X1 b2 X2 b3 X3
–
b0 b1X1 b2 X2 b3 (X1X2 )
Dr. Tran Quoc Duy Applied Statistics for Business
ĐẠI HỌC FPT CẦN THƠ
Effect of Interaction
DCOVA
Given: Y β0 β1X1 β2 X2 β3 X1X2 ε
Interaction Example
DCOVA
Suppose X2 is a dummy variable and the estimated
regression equation is Ŷ = 1 + 2X1 + 3X2 + 4X1X2
Y
12
X2 = 1:
8 Y = 1 + 2X1 + 3(1) + 4X1(1) = 4 + 6X1
4 X2 = 0:
Y = 1 + 2X1 + 3(0) + 4X1(0) = 1 + 2X1
0
X1
0 0.5 1 1.5
Slopes are different if the effect of X1 on Y depends on X2 value
Dr. Tran Quoc Duy Applied Statistics for Business
ĐẠI HỌC FPT CẦN THƠ
Yi 0 1 X 1i 2 X 2i 3 X 3i 4 X 4i 5 X 5i 6 X 6 i i
Logistic Regression
DCOVA
Used when the dependent variable Y is binary (i.e.,
Y takes on only two values).
Examples:
– Customer prefers Brand A or Brand B.
– Employee chooses to work full-time or part-time.
– Loan is delinquent or is not delinquent.
– Person voted in last election or did not.
Logistic regression allows you to predict the
probability of a particular categorical response.
Dr. Tran Quoc Duy Applied Statistics for Business
ĐẠI HỌC FPT CẦN THƠ
Logistic Regression
DCOVA
(continued)
Logistic Regression
(continued)
DCOVA
Logistic Regression Model:
•b1 = 0.1395. This means that holding constant the effect of whether the credit
cardholder has additional cards for members of the household, for each increase of
$1,000 in annual credit card spending, the estimated natural logarithm of the odds
ratio of purchasing the premium card increases by 0.1395. Therefore, cardholders
who charged more in the previous year are more likely to upgrade to a premium
card.
•b2 = 2.7743. This means that holding constant the annual credit card spending, the
estimated natural logarithm of the odds ratio of purchasing the premium card
increases by 2.7743 for a credit cardholder who has additional cards for members
of the household compared with one who does not have additional cards.
Therefore, cardholders possessing additional cards for other members of the
household are much more likely to upgrade to a premium card.
Chapter Summary
In this chapter we discussed:
How to develop a multiple regression model.
How to interpret the regression coefficients.
How to determine which independent variables to include in
the regression model.
How to determine which independent variables are most
important in predicting a dependent variable
How to use categorical independent variables in a regression
model.
How to use logistic regression to predict a categorical
dependent variable.