
14- 1

Chapter Fourteen

McGraw- 2005 The McGraw-Hill Companies, Inc., All


14- 2
Chapter Fourteen
Multiple Regression and Correlation
Analysis
GOALS
When you have completed this chapter, you
will be able to:
ONE
Describe the relationship between two or more independent
variables and the dependent variable using a multiple regression
equation.
TWO
Compute and interpret the multiple standard error of estimate and
the coefficient of determination.
THREE
Interpret a correlation matrix.
FOUR
Set up and interpret an ANOVA table.
14- 3
Chapter Fourteen continued
Multiple Regression and Correlation Analysis
GOALS
When you have completed this chapter, you
will be able to:
FIVE
Conduct a test of hypothesis to determine whether any of the set of
regression coefficients differ from zero.
SIX
Conduct a test of hypothesis on each of the regression
coefficients.

14- 4



The general multiple regression with k
independent variables is given by:

$$Y' = a + b_1 X_1 + b_2 X_2 + \dots + b_k X_k$$

X1 to Xk are the independent variables.
a is the Y-intercept.
Greek letters are used for a (α) and b (β) when denoting
population parameters.
Multiple Regression
Analysis
14- 5

bj is the net change in Y for each unit change in Xj,
holding all other values constant, where j = 1 to k. It is
called a partial regression coefficient, a net regression
coefficient, or just a regression coefficient.

The least squares criterion is used to develop this equation.

Because determining b1, b2, etc. is very tedious, a software
package such as Excel or MINITAB is recommended.
Multiple Regression
Analysis
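
To make the least squares criterion concrete, here is a minimal sketch that solves the normal equations in plain Python. It is not the routine Excel or MINITAB uses. The function names are hypothetical, and the data are synthetic (generated exactly from Y = 1 + 2*X1 + 3*X2), so the fit should recover a = 1, b1 = 2, b2 = 3.

```python
# Sketch: least-squares fit of Y' = a + b1*X1 + ... + bk*Xk via the normal equations.
# The data are synthetic (built from Y = 1 + 2*X1 + 3*X2), not taken from the chapter.

def solve_linear_system(A, y):
    """Solve A x = y by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, y)]  # augmented matrix
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_multiple_regression(X_rows, y):
    """Return [a, b1, ..., bk] minimizing the sum of (Y - Y')^2."""
    Z = [[1.0] + list(row) for row in X_rows]  # leading 1 carries the intercept a
    p = len(Z[0])
    ZtZ = [[sum(z[i] * z[j] for z in Z) for j in range(p)] for i in range(p)]
    Zty = [sum(z[i] * yi for z, yi in zip(Z, y)) for i in range(p)]
    return solve_linear_system(ZtZ, Zty)


X_rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 2)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in X_rows]
coefficients = fit_multiple_regression(X_rows, y)
print([round(c, 6) for c in coefficients])
```

For real data, a statistics package remains the practical choice, since it also reports standard errors and test statistics.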
14- 6

The Multiple Standard Error of Estimate is
a measure of the effectiveness of the regression equation.

It is measured in the same units as the dependent variable.

It is difficult to determine what is a large value and
what is a small value of the standard error.

The formula is:

$$s_{y.12 \dots k} = \sqrt{\frac{\sum (Y - Y')^2}{n - (k + 1)}}$$
Multiple Standard Error
of Estimate
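
As a quick check of the formula, the sketch below computes the standard error from an error sum of squares. The function name is hypothetical. The inputs come from the Example 1 MINITAB output later in this chapter (Residual Error SS = 2623764, with n = 12 families and k = 3 independent variables).

```python
import math


def multiple_standard_error(sse, n, k):
    """s_{y.12...k} = sqrt(SSE / (n - (k + 1))), where SSE is the sum of (Y - Y')^2."""
    return math.sqrt(sse / (n - (k + 1)))


# Values read from the Example 1 output: SSE = 2623764, n = 12, k = 3.
s = multiple_standard_error(sse=2623764, n=12, k=3)
print(round(s, 1))  # 572.7, matching S = 572.7 in the output
```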
14- 7

Assumptions In Multiple Regression and Correlation

The independent variables and the dependent variable
have a linear relationship.

The dependent variable must be continuous and at
least interval-scaled.

The variation in (Y-Y') or residual must be the same
for all values of Y. When this is the case, we say the
difference exhibits homoscedasticity.

The residuals should follow the normal
distribution with mean 0.

Successive values of the dependent variable must
be uncorrelated.

Multiple Regression and Correlation Assumptions
14- 8
ANOVA TABLE

Source       df         SS         MS
Regression   k          SSR        SSR/k
Error        n-(k+1)    SSE        SSE/(n-(k+1))
Total        n-1        SS Total

Explained Variation: $SSR = \sum (Y' - \bar{Y})^2$
Variation accounted for by the set of independent variables.

Unexplained or Random Variation: $SSE = \sum (Y - Y')^2$
Variation not accounted for by the independent variables.

Total Variation: $SS\ Total = \sum (Y - \bar{Y})^2$
ANOVA table
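
The sketch below assembles the ANOVA quantities from the two sums of squares, using k regression degrees of freedom and n-(k+1) error degrees of freedom. The function name and dictionary keys are hypothetical. The inputs are the sums of squares from the Example 1 MINITAB output later in this chapter.

```python
def anova_table(ssr, sse, n, k):
    """Build ANOVA quantities from SSR, SSE, sample size n, and k independent variables."""
    df_regression = k
    df_error = n - (k + 1)
    df_total = n - 1
    msr = ssr / df_regression   # mean square regression
    mse = sse / df_error        # mean square error
    return {
        "df": (df_regression, df_error, df_total),
        "SS total": ssr + sse,
        "MSR": msr,
        "MSE": mse,
        "F": msr / mse,
        "R-sq": ssr / (ssr + sse),  # coefficient of determination
    }


# Example 1 output: SSR = 10762903, SSE = 2623764, n = 12, k = 3.
table = anova_table(ssr=10762903, sse=2623764, n=12, k=3)
print(table["df"], round(table["F"], 2), round(table["R-sq"], 3))
```

The degrees of freedom add up (3 + 8 = 11), which is a useful sanity check when setting up the table by hand.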
14- 9

A correlation matrix is used to show all possible
simple correlation coefficients among the variables.

The matrix is useful for locating correlated
independent variables.

It shows how strongly each independent variable is
correlated with the dependent variable.

Correlation
Coefficients   Cars    Advertising   Sales force
Cars           1.000
Advertising    0.808   1.000
Sales force    0.872   0.537         1.000

Correlation Matrix
14- 10
The global test is used to investigate whether any of the
independent variables have significant coefficients. The
hypotheses are:
$$H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0$$
$$H_1: \text{Not all } \beta\text{s equal } 0$$

The test statistic follows the F distribution with k
(number of independent variables) and
n-(k+1) degrees of freedom, where n is the
sample size.

Global Test
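
The slide names the distribution but not the statistic itself. As a sketch, assuming the standard ratio of mean squares from the ANOVA table, the statistic and its value for the Example 1 output later in this chapter (SSR = 10762903, SSE = 2623764, n = 12, k = 3) work out as follows:

$$F = \frac{SSR / k}{SSE / (n - (k + 1))} = \frac{10762903 / 3}{2623764 / 8} \approx \frac{3587634}{327970} \approx 10.94$$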
14- 11

The test of individual variables is used to determine which
independent variables have nonzero regression coefficients.

The variables that have zero regression coefficients are
usually dropped from the analysis.

The test statistic is the t distribution with n-(k+1) degrees of
freedom.

$$t = \frac{b_j - 0}{s_{b_j}}$$

Test for Individual


Variables
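
The t formula can be checked against software output. The sketch below recomputes the T column of the Example 1 MINITAB output later in this chapter from its Coef and SE Coef columns. The function and variable names are hypothetical.

```python
def t_statistic(b, se_b, hypothesized=0.0):
    """t = (b_j - hypothesized value) / s_{b_j}."""
    return (b - hypothesized) / se_b


# (Coef, SE Coef) pairs read from the Example 1 output.
coefficients = {
    "Constant": (954.0, 1581.0),
    "Income": (1.092, 3.153),
    "Size": (748.4, 303.0),
    "Student": (564.5, 495.1),
}

t_values = {name: round(t_statistic(b, se), 2) for name, (b, se) in coefficients.items()}
print(t_values)
```

Each value is compared with the t distribution on n-(k+1) = 8 degrees of freedom, or equivalently judged by its p-value.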
14- 12
A market researcher for Super
Dollar Super Markets is
studying the yearly amount
families of four or more spend
on food. Three independent
variables are thought to be
related to yearly food
expenditures (Food). Those
variables are: total family
income (Income) in $00, size of
family (Size), and whether the
family has children in college
(College).

EXAMPLE 1
14- 13

Food expenditures = a + b1*(Income) + b2*(Size) + b3*(College)

Note the following regarding the regression equation.
The variable college is called a dummy or indicator variable.
It can take only one of two possible outcomes. That is, a
child is a college student or not.

Other examples of dummy variables include gender, whether a
part is acceptable or unacceptable, and whether a voter will
or will not vote for the incumbent governor.

We usually code one value of the dummy
variable as 1 and the other 0.

Example 1
continued
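
A minimal sketch of 0/1 coding for a dummy variable follows. The function name and the category labels are hypothetical, chosen only to illustrate the coding rule.

```python
def encode_dummy(value, category_coded_as_one):
    """Return 1 for the chosen category and 0 for the other."""
    return 1 if value == category_coded_as_one else 0


# Hypothetical labels for three families.
families = ["college student", "no college student", "college student"]
college = [encode_dummy(f, "college student") for f in families]
print(college)  # [1, 0, 1]
```

Which category receives the 1 is a matter of convention. It only changes how the sign of the coefficient is read.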
14- 14

Example 1 continued
14- 15

Use a computer software package,
such as MINITAB or Excel, to
develop a correlation matrix.

From the analysis provided by MINITAB, write
out the regression equation:

$$Y' = 954 + 1.09 X_1 + 748 X_2 + 565 X_3$$

Food Expenditure = $954 + $1.09*income + $748*size + $565*college

What food expenditure would you
estimate for a family of 4, with no
college students, and an income of
$50,000 (which is input as 500)?

Example 1 continued
14- 16
The regression equation is
Food = 954 + 1.09 Income + 748 Size + 565 Student

Predictor      Coef   SE Coef      T      P
Constant        954      1581   0.60  0.563
Income        1.092     3.153   0.35  0.738
Size          748.4     303.0   2.47  0.039
Student       564.5     495.1   1.14  0.287

S = 572.7   R-Sq = 80.4%   R-Sq(adj) = 73.1%

Analysis of Variance

Source           DF        SS       MS      F      P
Regression        3  10762903  3587634  10.94  0.003
Residual Error    8   2623764   327970
Total            11  13386667

Example 1 continued
14- 17

Food Expenditure = $954 + $1.09*income + $748*size + $565*college

Because income is entered in hundreds of dollars, each additional
$100 of income per year will increase the amount spent on food
by $1.09 per year.
An additional family member will increase the amount
spent per year on food by $748.
A family with a college student will spend $565 more per
year on food than those without a college student.

Food Expenditure = $954 + $1.09*500 + $748*4 + $565*0 = $4,491

So a family of 4, with no college
students, and an income of $50,000
will spend an estimated $4,491.

Example 1 continued
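
The estimate can be reproduced with a few lines of code. This is only a sketch: the function name is hypothetical, the coefficients are the rounded values from the regression equation, and income is entered in hundreds of dollars as in the example.

```python
def predict_food(income_hundreds, size, college):
    """Estimated yearly food expenditure from the Example 1 regression equation."""
    return 954 + 1.09 * income_hundreds + 748 * size + 565 * college


# Family of 4, no college students, income of $50,000 (entered as 500).
estimate = predict_food(income_hundreds=500, size=4, college=0)
print(round(estimate))  # 4491
```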
14- 18
From the regression output we note:

The coefficient of determination is 80.4 percent. This means that
more than 80 percent of the variation in the amount spent on food is
accounted for by the variables income, family size, and student.

Correlation matrix

          Food    Income  Size    College
Food      1.000
Income    0.587   1.000
Size      0.876   0.609   1.000
College   0.773   0.491   0.743   1.000

The strongest correlation between the dependent variable
and an independent variable is between family size and amount
spent on food.

Most of the correlations among the independent variables should not
cause problems because they are between -.70 and .70. The exception
is Size and College at 0.743, which is slightly outside that range.
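
The -.70 to .70 screen can be applied mechanically. The sketch below checks the three correlations among the independent variables from the matrix above. The function name is hypothetical, and the .70 limit is the rule of thumb stated on the slide rather than a formal test.

```python
def outside_range(pairs, limit=0.70):
    """Return the variable pairs whose correlation falls outside -limit to +limit."""
    return [pair for pair, r in pairs.items() if abs(r) > limit]


# Correlations among the independent variables, read from the matrix.
correlations = {
    ("Income", "Size"): 0.609,
    ("Income", "College"): 0.491,
    ("Size", "College"): 0.743,
}

flagged = outside_range(correlations)
print(flagged)  # [('Size', 'College')]
```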
14- 19

Conduct a global test of hypothesis to determine if
any of the regression coefficients are not zero.

$$H_0: \beta_1 = \beta_2 = \beta_3 = 0$$
$$H_1: \text{at least one } \beta_j \neq 0$$

H0 is rejected if F > 4.07 (the .05 critical value with 3 and 8
degrees of freedom).
From the MINITAB output, the computed value of
F is 10.94.
Decision: H0 is rejected. Not all the regression
coefficients are zero.

Example 1 continued
14- 20

Conduct an individual test to determine which coefficients
are not zero. These are the hypotheses for the independent
variable family size.

$$H_0: \beta_2 = 0 \qquad H_1: \beta_2 \neq 0$$

Thus, using the 5% level of significance, reject H0
if the p-value < .05.

From the MINITAB output, the only significant variable
is Size (family size), using the p-values. The
other variables can be omitted from the model.
Example 1 continued
14- 21

We rerun the analysis using only the significant independent
variable, family size.
The new regression equation is:

$$Y' = 340 + 1031 X_2$$

The coefficient of determination is 76.8 percent. We dropped
two independent variables, and the R-square term was
reduced by only 3.6 percentage points (from 80.4 to 76.8 percent).

Example 1 continued
14- 22

Regression Analysis: Food versus Size

The regression equation is
Food = 340 + 1031 Size

Predictor      Coef   SE Coef      T      P
Constant      339.7     940.7   0.36  0.726
Size         1031.0     179.4   5.75  0.000

S = 557.7   R-Sq = 76.8%   R-Sq(adj) = 74.4%

Analysis of Variance

Source           DF        SS        MS      F      P
Regression        1  10275977  10275977  33.03  0.000
Residual Error   10   3110690    311069
Total            11  13386667

Example 1 continued
14- 23

A residual is the difference between the actual
value of Y and the predicted value Y'.

Residuals should be approximately normally
distributed. Histograms and stem-and-leaf
charts are useful in checking this requirement.

A plot of the residuals and their corresponding
Y' values is used for showing that there are no
trends or patterns in the residuals.

Analysis of Residuals
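
As a small illustration of the definition, the sketch below computes residuals (actual Y minus predicted Y') using the reduced equation Food = 340 + 1031 Size from Example 1. The function name is hypothetical, and the three observations are made up for illustration. They are not the chapter's data.

```python
def predict(size):
    """Predicted food expenditure Y' from the reduced equation Food = 340 + 1031 Size."""
    return 340 + 1031 * size


# Hypothetical (Size, Food) observations, for illustration only.
observations = [(4, 4400), (5, 5600), (6, 6400)]

# Residual = actual Y minus predicted Y'.
residuals = [food - predict(size) for size, food in observations]
print(residuals)  # [-64, 105, -126]
```

Plotting such residuals against the predicted values, or drawing their histogram, is how the checks described above are carried out in practice.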
14- 24

[Figure: Residual plots against estimated values of Y. Vertical axis: Residuals; horizontal axis: estimated Y.]

Residual Plot
14- 25

[Figure: Histogram of residuals. Vertical axis: Frequency; horizontal axis: Residuals.]

Histograms of Residuals