Regression Analysis

Chapter 7
Regression Analysis

Solutions:

1. a. The scatter chart with weight as the independent variable follows.

[Scatter chart: Price (vertical axis) versus Weight (horizontal axis)]

This scatter chart indicates there may be a negative linear relationship between weight and price.
Lighter bicycles are generally expected to be more expensive, and this scatter chart is consistent with
what would be expected.

b. The following Excel output provides the estimated regression equation that could be used to estimate
the price (y) given the bicycle weight (x).

The estimated simple linear regression equation is \(\hat{y} = 28818.0037 - 1439.0064x\).

c. First we check the conditions necessary for valid inference in regression. The Excel plot of the
residuals and weight follows.

[Residual plot: Residuals (vertical axis) versus Weight (horizontal axis)]

Because we are working with only 10 observations, assessing the conditions necessary for inference
to be valid in regression is extremely difficult. However, this scatter chart does not provide strong
evidence of a violation of the conditions, so we will proceed with our inference.

The p-value associated with the estimated regression parameter \(b_1\) is 2.14888E-05. Because this
p-value is less than the 0.05 level of significance, we reject the hypothesis that \(\beta_1 = 0\). We conclude
that there is a relationship between the weight and price of the bicycles, and our best estimate is that
an increase in weight of one pound corresponds to a price decrease of $1439.01. The price of a
racing bicycle is expected to increase as the weight of the bicycle decreases, so this result is
consistent with what is expected.

The estimated regression parameter \(b_0\) suggests that when the weight of a bicycle is zero pounds, the
price is $28,818.00. This result is obviously not realistic, but this parameter estimate and the test of
the hypothesis that \(\beta_0 = 0\) are meaningless because the y-intercept has been estimated through
extrapolation (there is no bicycle in the sample data with a weight near zero pounds).

d. The coefficient of determination \(r^2\) is 0.8637, so the regression model estimated in part (b) explains
approximately 86% of the variation in the prices of the bicycles in the sample.

e. We predict the price of the 15 pound D’Onofrio Elite bicycle will be

\(\hat{y} = 28818.0037 - 1439.0064(15) = 7232.9071\)

or approximately $7233.
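
The prediction in part (e) can be checked with a short Python sketch. The coefficients are the rounded values reported in part (b); the function name `predict_price` is illustrative and not part of the original solution. Because the displayed coefficients are rounded, the result may differ from the Excel value in the last decimal places.

```python
# Rounded coefficient estimates from part (b).
B0 = 28818.0037   # estimated intercept (dollars)
B1 = -1439.0064   # estimated slope (dollars per pound)


def predict_price(weight_lb: float) -> float:
    """Return the predicted price in dollars for a bicycle of the given weight."""
    return B0 + B1 * weight_lb


if __name__ == "__main__":
    # Predicted price of a 15 pound bicycle, rounded to the nearest dollar.
    print(round(predict_price(15)))
```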

2. a. The scatter chart with line speed as the independent variable follows.

[Scatter chart: Number of Defective Parts Found (vertical axis) versus Line Speed (horizontal axis)]

This scatter chart indicates there may be a negative linear relationship between line speed and
number of defective parts found. The number of defective parts found is expected to decrease as the
line speed increases, so this scatter chart is consistent with what would be expected.

b. The following Excel output provides the estimated regression equation that could be used to predict
the number of defective parts found (y) given the line speed (x).

The estimated simple linear regression equation is \(\hat{y} = 22.1739 - 0.1478x\).

c. First we check the conditions necessary for valid inference in regression. The Excel plot of the
residuals and line speed follows.

[Residual plot: Residuals (vertical axis) versus Line Speed (horizontal axis)]

Because we are working with only 6 observations, assessing the conditions necessary for inference
to be valid in regression is extremely difficult. However, this scatter chart does not provide strong
evidence of a violation of the conditions, so we will proceed with our inference.

The p-value associated with the estimated regression parameter \(b_1\) is 0.02813. Because this p-value
is greater than the 0.01 level of significance, we do not reject the hypothesis that \(\beta_1 = 0\). At the
0.01 level of significance, we cannot conclude that there is a relationship between line speed and
number of defective parts found.

The estimated regression parameter \(b_0\) suggests that when the line speed is zero, the number of
defective parts found is 22.1739. This result is obviously not realistic, but this parameter estimate and
the test of the hypothesis that \(\beta_0 = 0\) are meaningless because the y-intercept has been estimated
through extrapolation (there is no observation in the sample data with a line speed near zero).
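
The decision rule used in part (c) compares the p-value with the chosen level of significance. The sketch below applies that rule to the reported p-value of 0.02813 at two levels; the helper name `reject_null` is illustrative. It shows that the conclusion depends on the level chosen: the same p-value would lead to rejection at the 0.05 level.

```python
def reject_null(p_value: float, alpha: float) -> bool:
    """Reject the hypothesis that the slope parameter is zero when p-value < alpha."""
    return p_value < alpha


if __name__ == "__main__":
    p_line_speed = 0.02813  # p-value for b1 reported in part (c)
    for alpha in (0.05, 0.01):
        decision = "reject" if reject_null(p_line_speed, alpha) else "do not reject"
        print(f"alpha = {alpha}: {decision} the hypothesis that beta_1 = 0")
```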

d. The coefficient of determination \(r^2\) is 0.7391, so the regression model estimated in part (b) explains
approximately 74% of the variation in the number of defective parts found in the sample.

3. a. The scatter chart with weekly usage as the independent variable follows.

[Scatter chart: Annual Maintenance Expense (vertical axis) versus Weekly Usage (hours) (horizontal axis)]

This scatter chart indicates there may be a positive linear relationship between weekly usage and
annual maintenance expense. Annual maintenance expense is expected to increase as weekly usage
increases, and this scatter chart is consistent with what would be expected.

b. The following Excel output provides the estimated regression equation that could be used to predict
the annual maintenance expense (y) given the weekly usage (x).

The estimated simple linear regression equation is \(\hat{y} = 10.5280 + 0.9534x\).

c. First we check the conditions necessary for valid inference in regression. The Excel plot of the
residuals and weekly usage follows.

[Residual plot: Residuals (vertical axis) versus Weekly Usage (hours) (horizontal axis)]

Because we are working with only 10 observations, assessing the conditions necessary for inference
to be valid in regression is extremely difficult. However, this scatter chart does not provide strong
evidence of a violation of the conditions, so we will proceed with our inference.

The p-value associated with the estimated regression parameter \(b_1\) is 0.0001. Because this p-value is
less than the 0.05 level of significance, we reject the hypothesis that \(\beta_1 = 0\). We conclude that there
is a relationship between weekly usage and annual maintenance expense, and our best estimate is
that a one hour increase in weekly usage corresponds to an increase of $95.34 in annual maintenance
expense. Annual maintenance expense is expected to increase as weekly usage increases, so this
result is consistent with what is expected.

The estimated regression parameter \(b_0\) suggests that when weekly usage is zero, the annual
maintenance expense is $1,052.80. This result is obviously not realistic, but this parameter estimate
and the test of the hypothesis that \(\beta_0 = 0\) are meaningless because the y-intercept has been estimated
through extrapolation (there is no observation in the sample data with a weekly usage near zero).

d. The coefficient of determination \(r^2\) is 0.8562, so the regression model estimated in part (b) explains
approximately 86% of the variation in the values of annual maintenance expenses in the sample.

e. The mean weekly usage in our sample is 25.3 hours. Using our simple linear regression model, for this
many hours of usage we predict the annual maintenance expense to be

\(\hat{y} = 10.5280 + 0.9534(25.3) = 34.65\)

or $3,465, which exceeds the $3000 cost of the proposed contract. Unless Jensen’s has a valid reason
to expect weekly usage to be less than usual, the company should purchase the contract.

We can use the estimated regression model to estimate the breakeven hours of weekly usage for the
proposed $3000 contract. If we set the estimated regression equation equal to 30 (the value of the
dependent variable annual maintenance expense that corresponds to the $3000 cost of the contract)
and solve for x (weekly usage)

\[
\begin{aligned}
10.5280 + 0.9534x &= 30 \\
0.9534x &= 30 - 10.5280 = 19.4720 \\
x &= \frac{30 - 10.5280}{0.9534} = \frac{19.4720}{0.9534} = 20.4229
\end{aligned}
\]

we find the estimated breakeven point for the contract is approximately 20.4 usage hours. If Jensen’s
believes weekly usage will exceed 20.4 hours during the life of the contract, the estimated annual
maintenance expense will exceed $3000 and Jensen’s should purchase the contract.
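
The breakeven calculation can be sketched in Python by inverting the estimated regression equation. The coefficients are the rounded values from part (b), with expense measured in hundreds of dollars; the function name `breakeven_usage` is illustrative. With rounded coefficients the result agrees with the value above to about two decimal places.

```python
B0 = 10.5280  # estimated intercept, in $100s
B1 = 0.9534   # estimated slope, in $100s per hour of weekly usage


def breakeven_usage(contract_cost_hundreds: float) -> float:
    """Weekly usage (hours) at which predicted annual expense equals the contract cost."""
    return (contract_cost_hundreds - B0) / B1


if __name__ == "__main__":
    hours = breakeven_usage(30.0)  # the $3000 contract expressed in $100s
    print(f"Breakeven weekly usage: {hours:.1f} hours")
```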

4. a. The scatter chart with distance to work as the independent variable follows.

[Scatter chart: Number of Days Absent (vertical axis) versus Distance to Work (miles) (horizontal axis)]

This scatter chart indicates there may be a negative linear relationship between distance to work and
number of days absent. This is not what would be expected – an employee who lives farther from
her/his job is expected to be absent more frequently than an employee who lives closer to her/his job.

b. The following Excel output provides the estimated regression equation that could be used to predict
the number of days absent (y) given the distance to work (x).

The estimated simple linear regression equation is \(\hat{y} = 8.0978 - 0.3442x\).

c. First we check the conditions necessary for valid inference in regression. The Excel plot of the
residuals and distance to work follows.

[Residual plot: Residuals (vertical axis) versus Distance to Work (miles) (horizontal axis)]

Because we are working with only 10 observations, assessing the conditions necessary for inference
to be valid in regression is extremely difficult. However, this scatter chart does not provide strong
evidence of a violation of the conditions, so we will proceed with our inference.

The 99% confidence interval for the regression parameter \(\beta_1\) provided in the Excel output is
(-0.6046, -0.0838). Because this interval does not include zero, we reject the hypothesis that \(\beta_1 = 0\).
We conclude that there is a relationship between distance to work and number of absences. Our best
estimate is that a one mile increase in the distance to work corresponds to a decrease of 0.3442 days
absent.
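
The interval-based test in part (c) amounts to checking whether zero lies inside the reported confidence interval. A minimal sketch follows; the helper name `interval_excludes_zero` is illustrative. Because the interval is symmetric about the point estimate, its midpoint also recovers the estimated slope of -0.3442.

```python
def interval_excludes_zero(lower: float, upper: float) -> bool:
    """Return True when zero lies outside the interval [lower, upper]."""
    return lower > 0 or upper < 0


if __name__ == "__main__":
    ci_slope = (-0.6046, -0.0838)  # 99% confidence interval reported in part (c)
    print("Reject beta_1 = 0:", interval_excludes_zero(*ci_slope))
    print("Interval midpoint:", round(sum(ci_slope) / 2, 4))
```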

d. The 99% confidence interval for the regression parameter \(\beta_0\) provided in the Excel output is (5.3839,
10.8117). However, this confidence interval and the test of the hypothesis that \(\beta_0 = 0\) are
meaningless because the y-intercept has been estimated through extrapolation (there is no
observation in the sample data with a distance to work near zero).

e. The coefficient of determination \(r^2\) is 0.7109, so the regression model estimated in part (b) explains
approximately 71% of the variation in the values of number of days absent in the sample.

5. a. The scatter chart with age of bus as the independent variable follows.

[Scatter chart: Annual Maintenance Cost ($) (vertical axis) versus Age of Bus (Years) (horizontal axis)]

This scatter chart indicates there may be a positive linear relationship between age of bus and annual
maintenance cost. Older buses generally cost more to maintain, and this scatter chart is consistent
with what is expected.

b. The following Excel output provides the estimated regression equation that could be used to predict
the annual maintenance cost (y) given the age of the bus (x).

The estimated simple linear regression equation is \(\hat{y} = 220 + 131.6667x\).

c. First we check the conditions necessary for valid inference in regression. The Excel plot of the
residuals and age of bus follows.

[Residual plot: Residuals (vertical axis) versus Age of Bus (years) (horizontal axis)]

Because we are working with only 10 observations, assessing the conditions necessary for inference
to be valid in regression is extremely difficult. However, this scatter chart does not provide strong
evidence of a violation of the conditions, so we will proceed with our inference.

The p-value associated with the estimated regression parameter \(b_1\) is 7.62662E-05. Because this
p-value is less than the 0.05 level of significance, we reject the hypothesis that \(\beta_1 = 0\). We conclude
that there is a relationship between age of bus and annual maintenance cost, and our best estimate is
that a one year increase in age of bus corresponds to an increase of $131.67 in annual maintenance
cost. Annual maintenance cost of a bus is expected to increase as the bus ages, so these results are
consistent with what is expected.

The estimated regression parameter \(b_0\) suggests that when the age of the bus is zero, the annual
maintenance expense is $220. This result is obviously not realistic, but this parameter estimate and
the test of the hypothesis that \(\beta_0 = 0\) are meaningless because the y-intercept has been estimated
through extrapolation (there is no observation in the sample data with a bus age near zero).

d. The coefficient of determination \(r^2\) is 0.8725, so the regression model estimated in part (b) explains
approximately 87% of the variation in the values of annual maintenance cost in the sample.

e. Using this regression model, the predicted annual maintenance cost for a 3.5 year old bus is

\(\hat{y} = 220 + 131.6667(3.5) = 680.8333\)

or approximately $681.

6. a. The scatter chart with hours spent studying as the independent variable follows.

[Scatter chart: Total Points Earned (vertical axis) versus Hours Spent Studying (horizontal axis)]

This scatter chart indicates there may be a positive linear relationship between hours spent studying
and total points earned. Students who spend more time studying generally earn more points, and this
scatter chart is consistent with what is expected.

b. The following Excel output provides the estimated regression equation showing how total points
earned (y) is related to hours spent studying (x).

The estimated simple linear regression equation is \(\hat{y} = 8.6742 + 0.8014x\).

c. First we check the conditions necessary for valid inference in regression. The Excel plot of the
residuals and hours spent studying follows.

[Residual plot: Residuals (vertical axis) versus Hours Spent Studying (horizontal axis)]

The residuals at each value of hours spent studying appear to have a mean of 0, have similar
variances, and be concentrated around 0. The conditions necessary for inference to be valid in
regression appear to be satisfied, and so we will proceed with our inference.

The p-value associated with the estimated regression parameter \(b_1\) is 1.09619E-60. Because this
p-value is less than the 0.01 level of significance, we reject the hypothesis that \(\beta_1 = 0\). We conclude
that there is a relationship between hours spent studying and total points earned, and our best
estimate is that a one hour increase in hours spent studying corresponds to an increase of 0.8014 in
total points earned. This result is consistent with what would be expected.

The estimated regression parameter \(b_0\) suggests that when a student spends zero hours studying, the
total points earned by the student is 8.6742. This parameter estimate and the test of the hypothesis
that \(\beta_0 = 0\) are meaningless because the y-intercept has been estimated through extrapolation (there is
no observation in the sample data with hours of studying near zero).

d. The coefficient of determination \(r^2\) is 0.8277, so the regression model estimated in part (b) explains
approximately 83% of the variation in the values of total points earned in the sample.

e. Using this regression model, the predicted total points earned by Mark (who spent 95 hours
studying) is

\(\hat{y} = 8.6742 + 0.8014(95) = 84.8056\)

or approximately 85 points.

7. a. The scatter chart with DJIA as the independent variable follows.

[Scatter chart: S&P 500 (vertical axis) versus DJIA (horizontal axis)]

This scatter chart indicates there may be a positive linear relationship between DJIA and S&P 500.
Since both indexes are used as measures of overall movement in the stock market, a positive
relationship is expected between these two variables. This scatter chart is consistent with what is
expected.

b. The following Excel output provides the estimated regression equation showing how S&P 500 (y) is
related to DJIA (x).

The estimated simple linear regression equation is \(\hat{y} = -669.0212 + 0.1573x\).

c. First we check the conditions necessary for valid inference in regression. The Excel plot of the
residuals and DJIA follows.

[Residual plot: Residuals (vertical axis) versus DJIA (horizontal axis)]

Because we are working with only 15 observations, assessing the conditions necessary for inference
to be valid in regression is extremely difficult. However, this scatter chart does not provide strong
evidence of a violation of the conditions, so we will proceed with our inference.

The 95% confidence interval for the regression parameter \(\beta_1\) provided in the Excel output is (0.1353,
0.1792). Because this interval does not include zero, we reject the hypothesis that \(\beta_1 = 0\). We
conclude that there is a relationship between DJIA and S&P 500, and our best estimate is that a one
point increase in DJIA corresponds to an increase of 0.1573 points for the S&P 500. This result
appears to be reasonable.

d. The 95% confidence interval for the regression parameter \(\beta_0\) provided in the Excel output is
(-951.4540, -386.5885). However, this parameter estimate and confidence interval (and the
corresponding test of the hypothesis that \(\beta_0 = 0\)) are meaningless because the y-intercept has been
estimated through extrapolation (there is no observation in the sample data with a value of DJIA near
zero).

e. The coefficient of determination \(r^2\) is 0.9486, so the regression model estimated in part (b) explains
approximately 95% of the variation in the values of S&P 500 in the sample.

f. Using this regression model, when DJIA is 13,500 the predicted S&P 500 is

\(\hat{y} = -669.0212 + 0.1573(13500) = 1454.0836\)

or approximately 1454 points.

g. The maximum DJIA in our sample data is 13,233, so when the DJIA value of 13,500 is used to
predict the S&P 500 value in part (f), the regression model is extrapolated beyond the
experimental region of the data. You should therefore be concerned about this prediction.
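
The extrapolation concern in part (g) can be expressed as a simple range check. The sketch below uses only the sample maximum stated above (13,233); the sample minimum is not given in this solution, so only the upper bound is checked. The function name `exceeds_observed_max` is illustrative.

```python
MAX_OBSERVED_DJIA = 13233  # largest DJIA value in the sample, as stated in part (g)


def exceeds_observed_max(x: float, observed_max: float = MAX_OBSERVED_DJIA) -> bool:
    """Return True when x lies above the largest observed value of the independent variable."""
    return x > observed_max


if __name__ == "__main__":
    djia = 13500
    if exceeds_observed_max(djia):
        print(f"DJIA = {djia} is above the sample maximum; the prediction is an extrapolation.")
```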

8. a. The scatter chart with miles as the independent variable follows.

[Scatter chart: Price (vertical axis) versus Miles (horizontal axis)]

This scatter chart indicates there may be a negative linear relationship between miles and price.
Since a Camry with higher miles will generally sell for a lower price, a negative relationship is
expected between these two variables. This scatter chart is consistent with what is expected.

b. The following Excel output provides the estimated regression equation showing how price (y) is
related to miles (x).

The estimated simple linear regression equation is \(\hat{y} = 16.4698 - 0.0588x\).

c. First we check the conditions necessary for valid inference in regression. The Excel plot of the
residuals and miles follows.

[Residual plot: Residuals (vertical axis) versus Miles (1000s) (horizontal axis)]

Because we are working with only 19 observations, assessing the conditions necessary for inference
to be valid in regression is extremely difficult. However, this scatter chart does not provide strong
evidence of a violation of the conditions, so we will proceed with our inference.

The p-value associated with the estimated regression parameter \(b_1\) is 0.0003. Because this p-value is
less than the 0.01 level of significance, we reject the hypothesis that \(\beta_1 = 0\). We conclude that there
is a relationship between miles and price, and our best estimate is that a one thousand mile increase
corresponds to a decrease of $58.77 in price. This result appears to be reasonable.

The estimated regression parameter \(b_0\) suggests that if a Camry has zero miles, the predicted price is
$16,469.76. This result is obviously not realistic, but this parameter estimate and the test of the
hypothesis that \(\beta_0 = 0\) are meaningless because the y-intercept has been estimated through
extrapolation (there is no observation in the sample data with miles near zero).

d. The coefficient of determination \(r^2\) is 0.5387, so the regression model estimated in part (b) explains
approximately 54% of the variation in the values of price in the sample.

e. Excel output for the predicted prices and residuals for the model estimated in part (b) follows:

A bargain is a Camry for which the predicted price given its miles exceeds the actual price (the price
of the automobile is less than the value of the automobile given its miles). The residuals
\(e_i = y_i - \hat{y}_i\) for this simple linear regression model are the differences between the actual
prices and the predicted prices, so the Camry with the largest negative residual is the best bargain.
The twelfth automobile in the sample sold for $12,500, and with 28,000 miles its predicted price is
$14,824. Thus, this automobile sold for $2,324 less than the predicted price for a Camry with 28,000
miles. The fourth automobile in the data, which has 47,000 miles, was almost as big a bargain, selling
for $2,207 less than the predicted price for a Camry with 47,000 miles.
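
The residual for the twelfth automobile can be reproduced with a short sketch. It uses the rounded coefficients from part (b), with price in thousands of dollars and miles in thousands; the function names are illustrative. Only the twelfth automobile is computed because its actual price and mileage are the only complete observation quoted here, and rounding of the coefficients can shift the result by about a dollar.

```python
B0 = 16.4698   # estimated intercept, price in $1000s
B1 = -0.0588   # estimated slope, $1000s per 1000 miles


def predicted_price(miles_thousands: float) -> float:
    """Predicted price in $1000s for a Camry with the given mileage (in 1000s)."""
    return B0 + B1 * miles_thousands


def residual(actual_price: float, miles_thousands: float) -> float:
    """Actual price minus predicted price; a large negative value indicates a bargain."""
    return actual_price - predicted_price(miles_thousands)


if __name__ == "__main__":
    # Twelfth automobile: sold for $12,500 with 28,000 miles.
    e12 = residual(12.5, 28)
    print(f"Residual: {e12:.4f} (in $1000s)")
```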

f. Using the estimated regression equation developed in part (b), the predicted mean price for a
previously owned 2007 Camry that has been driven 60,000 miles is

\(\hat{y} = 16.4698 - 0.0588(60) = 12.9433\)

or $12,943. Depending on other factors not considered in the model (various options, the physical
condition of the body and interior, etc.), this is a reasonable price to expect to pay for a Camry that
has been driven 60,000 miles.

9. a. The following Excel output provides the estimated regression equation showing how weekly gross
revenue (y) is related to the amount of television advertising (x).

The estimated simple linear regression equation is yˆ  45.4323  40.0640 x .

Before performing any hypothesis tests on the results, we check the conditions necessary for valid
inference in regression. The Excel plot of the residuals and television advertising follows.

[Residual plot: Residuals (vertical axis) versus Television Advertising (100s) (horizontal axis)]

The variance in the residuals is possibly increasing as the amount of television advertising increases,
which gives some cause for concern. However, when working with few observations (in this case,
we are working with only 8 observations), assessing the conditions necessary for inference to be
valid in regression is extremely difficult, and inference is usually performed unless there is evidence
of an extreme violation of one or more of the conditions necessary for valid inference. This scatter
chart does not provide strong evidence of a violation of the conditions, so we will proceed with our
inference.

The p-value associated with the estimated regression parameter \(b_1\) is 0.0339. Because this p-value is
less than the 0.05 level of significance, we reject the hypothesis that \(\beta_1 = 0\). We conclude that there
is a relationship between television advertising and weekly gross revenue at the 0.05 level of
significance, and our best estimate is that a $100 increase in television advertising corresponds to an
increase of $4006.40 in weekly gross revenue. This result appears to be reasonable.

b. The coefficient of determination \(r^2\) is 0.5552, so the regression model estimated in part (a) explains
approximately 56% of the variation in the values of weekly gross revenue in the sample.

c. The following Excel output provides the estimated regression equation with both television
advertising (x1) and newspaper advertising (x2) as the independent variables.

The estimated multiple linear regression equation is \(\hat{y} = -42.5696 + 22.4022x_1 + 19.4986x_2\).

First we check the conditions necessary for valid inference in regression. The Excel plots of the
residuals and each of the two independent variables follow.

[Residual plot: Residuals (vertical axis) versus Television Advertising (100s) (horizontal axis)]

Because we are working with only 8 observations, assessing the conditions necessary for inference
to be valid in regression is extremely difficult. This scatter chart of the residuals versus television
advertising does not provide strong evidence of a violation of the conditions.

[Residual plot: Residuals (vertical axis) versus Newspaper Advertising (100s) (horizontal axis)]

Similarly, this scatter chart of the residuals versus newspaper advertising does not provide strong
evidence of a violation of the conditions, so we will proceed with our inference.

The p-value associated with the estimated regression parameter \(b_1\) is 0.0252. Because this p-value is
less than the 0.05 level of significance, we reject the hypothesis that \(\beta_1 = 0\). We conclude that there
is a relationship between television advertising and weekly gross revenue at the 0.05 level of
significance, and our best estimate is that if we hold newspaper advertising constant, a $100 increase
in television advertising corresponds to an increase of $2240.22 in weekly gross revenue. This result
appears to be reasonable.

The p-value associated with the estimated regression parameter \(b_2\) is 0.0033. Because this p-value is
less than the 0.05 level of significance, we reject the hypothesis that \(\beta_2 = 0\). We conclude that there
is a relationship between newspaper advertising and weekly gross revenue at the 0.05 level of
significance, and our best estimate is that if we hold television advertising constant, a $100 increase
in newspaper advertising corresponds to an increase of $1949.86 in weekly gross revenue. This
result also appears to be reasonable.

The estimated regression parameter \(b_0\) suggests that when television advertising and newspaper
advertising are both zero, the predicted weekly gross revenue is -$4256.96. This result is obviously
not realistic, but this parameter estimate and the test of the hypothesis that \(\beta_0 = 0\) are meaningless
because the y-intercept has been estimated through extrapolation (there is no observation in the
sample data for which both television advertising and newspaper advertising are near zero).
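
The "holding the other variable constant" interpretation of the coefficients can be illustrated with a short sketch. It uses the coefficient estimates from part (c), with all quantities in hundreds of dollars. The function name `predict_revenue` and the spending levels of 3.0 are hypothetical values chosen only for illustration.

```python
B0, B1, B2 = -42.5696, 22.4022, 19.4986  # estimates from part (c), in $100s


def predict_revenue(tv: float, newspaper: float) -> float:
    """Predicted weekly gross revenue ($100s) given advertising spending ($100s)."""
    return B0 + B1 * tv + B2 * newspaper


if __name__ == "__main__":
    base = predict_revenue(3.0, 3.0)  # hypothetical spending levels
    # Raise one variable by $100 (one unit) while holding the other constant.
    tv_effect = predict_revenue(4.0, 3.0) - base
    newspaper_effect = predict_revenue(3.0, 4.0) - base
    print(f"TV: {tv_effect * 100:.2f} dollars, newspaper: {newspaper_effect * 100:.2f} dollars")
```

The difference between two predictions that vary in only one independent variable does not depend on the level at which the other variable is held, which is why each coefficient can be read as a per-unit effect.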

d. The coefficient of determination \(R^2\) is 0.9322, so the regression model estimated in part (c) explains
approximately 93% of the variation in the values of weekly gross revenue in the sample.

e. For the multiple linear regression model estimated in part (c), the overall regression relationship is
significant, the estimated regression coefficients \(b_1\) and \(b_2\) are significant and consistent with what
would be expected, and this model explains approximately 38 percentage points more of the variation
in the values of weekly gross revenue in the sample (0.9322 versus 0.5552) than the simple linear
regression model estimated in part (a). The model estimated in part (c) is satisfactory and is superior
to the model estimated in part (a).

f. Management can feel confident that increased spending on both television and newspaper
advertising coincides with increased weekly gross revenue. The results also suggest that television
advertising may be slightly more effective than newspaper advertising in generating revenue.

10. a. The following Excel output provides the estimated multiple linear regression equation with comfort
(x1), amenities (x2), and in-house dining (x3) as the independent variables.

The estimated multiple linear regression equation is \(\hat{y} = 35.6967 + 0.1093x_1 + 0.2443x_2 + 0.2474x_3\).

b. Before performing any hypothesis tests on the results, we check the conditions necessary for valid
inference in regression. The Excel plots of the residuals and comfort, amenities, and in-house dining
follow.

[Residual plot: Residuals (vertical axis) versus Comfort (horizontal axis)]

[Residual plot: Residuals (vertical axis) versus Amenities (horizontal axis)]

[Residual plot: Residuals (vertical axis) versus In-House Dining (horizontal axis)]

None of these scatter charts provide strong evidence of a violation of the conditions necessary for
valid inference in regression, so we will proceed with our inference.

The p-value associated with the estimated regression parameter \(b_1\) is 0.4117. Because this p-value is
greater than the 0.01 level of significance, we do not reject the hypothesis that \(\beta_1 = 0\). We cannot
conclude that there is a relationship between the score on comfort and the overall score at the 0.01
level of significance when controlling for amenities and in-house dining.

The p-value associated with the estimated regression parameter \(b_2\) is 3.69454E-05. Because this
p-value is less than the 0.01 level of significance, we reject the hypothesis that \(\beta_2 = 0\). We conclude
that there is a relationship between the score on amenities and the overall score at the 0.01 level of
significance, and our best estimate is that if we hold the scores on comfort and in-house dining
constant, a one point increase in the score on amenities corresponds to an increase of 0.2443 in
overall score.

The p-value associated with the estimated regression parameter \(b_3\) is 0.0011. Because this p-value is
less than the 0.01 level of significance, we reject the hypothesis that \(\beta_3 = 0\). We conclude that there
is a relationship between the score on in-house dining and the overall score at the 0.01 level of
significance, and our best estimate is that if we hold the scores on comfort and amenities constant, a
one point increase in the score on in-house dining corresponds to an increase of 0.2474 in overall
score.

If ratings for comfort, amenities, and in-house dining are related to overall score, the relationships
are expected to be positive. The results are consistent with expectations for all three relationships.

c. The following Excel output provides the estimated multiple linear regression equation with
amenities (x1) and in-house dining (x2) as the independent variables.

The estimated multiple linear regression equation is \(\hat{y} = 45.1461 + 0.2526x_1 + 0.2483x_2\).

A review of the Excel plots of the residuals and the two independent variables that follow shows no
dramatic departures from the conditions necessary for valid inference in regression. We can proceed
with inference.

[Residual plot: Residuals (vertical axis) versus Amenities (horizontal axis)]

[Residual plot: Residuals (vertical axis) versus In-House Dining (horizontal axis)]

The p-value associated with the estimated regression parameter \(b_1\) (which now corresponds to
amenities) is 1.32524E-05. Because this p-value is less than the 0.01 level of significance, we reject
the hypothesis that \(\beta_1 = 0\). We conclude that there is a relationship between the score on amenities
and the overall score at the 0.01 level of significance, and our best estimate is that if we hold the
scores on in-house dining constant, a one point increase in the score on amenities corresponds to an
increase of 0.2526 in overall score.

The p-value associated with the estimated regression parameter \(b_2\) (which now corresponds to
in-house dining) is 0.0009. Because this p-value is less than the 0.01 level of significance, we reject the
hypothesis that \(\beta_2 = 0\). We conclude that there is a relationship between the score on in-house dining
and the overall score at the 0.01 level of significance, and our best estimate is that if we hold the
scores on amenities constant, a one point increase in the score on in-house dining corresponds to an
increase of 0.2483 in overall score.

For this multiple linear regression model, the overall regression relationship is significant, and the
estimated regression coefficients \(b_1\) and \(b_2\) are significant and consistent with what would be
expected. Furthermore, this model has a coefficient of determination of \(R^2 = 0.7387\). The model with
all three independent variables (comfort, amenities, and in-house dining) from part (a) has a multiple
coefficient of determination of \(R^2 = 0.7498\), and so the three-variable model explains little more than
1% more of the variation in overall ratings in the sample than does the model that includes only the
independent variables amenities and in-house dining (that is, removing comfort as an independent
variable resulted in a loss of little more than 1% of the explained variation in overall score). Thus, the
simpler multiple regression model developed in part (c) is preferred.

11. a. The following Excel output provides the estimated multiple linear regression equation with
satisfaction with trade price (x1) and satisfaction with speed of execution (x2) as the independent variables.

The estimated multiple linear regression equation is $\hat{y} = -0.7835 + 0.5580x_1 + 0.7342x_2$.

The coefficient of determination $R^2$ is 0.6827, so this regression model explains approximately 68%
of the variation in the values of overall satisfaction in the sample.

b. Before testing any hypotheses, we check the conditions necessary for valid inference in regression.
The Excel plots of the residuals and each of the two independent variables follow.

[Figure: Satisfaction with Trade Price Residual Plot. Residuals (vertical axis) versus satisfaction with trade price (horizontal axis).]

Because we are working with only 14 observations, assessing the conditions necessary for inference
to be valid in regression is extremely difficult. However, this scatter chart does not provide strong
evidence of a violation of the conditions.

[Figure: Satisfaction with Speed of Execution Residual Plot. Residuals (vertical axis) versus satisfaction with speed of execution (horizontal axis).]

Similarly, this scatter chart does not provide strong evidence of a violation of the conditions, so we
will proceed with our inference.

The p-value associated with the estimated regression parameter b1 is 0.0357. Because this p-value is
less than the 0.05 level of significance, we reject the hypothesis that $\beta_1 = 0$. We conclude that there
is a relationship between satisfaction with trade price and overall satisfaction with the electronic
trade at the 0.05 level of significance.

The p-value associated with the estimated regression parameter b2 is 0.0006. Because this p-value is
less than the 0.05 level of significance, we reject the hypothesis that $\beta_2 = 0$. We conclude that there
is a relationship between satisfaction with speed of execution and overall satisfaction with the
electronic trade at the 0.05 level of significance.

c. With regard to the relationship between satisfaction with trade price and overall satisfaction with the
electronic trade, we estimate that if we hold satisfaction with speed of execution constant, a 1 point
increase in satisfaction with trade price corresponds to an increase of 0.5580 in overall satisfaction
with the electronic trade.

With regard to the relationship between satisfaction with speed of execution and overall satisfaction
with the electronic trade, we estimate that if we hold satisfaction with trade price constant, a 1 point
increase in satisfaction with speed of execution corresponds to an increase of 0.7342 in overall
satisfaction with the electronic trade.

Overall satisfaction with the electronic trade should generally increase as satisfaction with trade
price increases and as satisfaction with speed of execution increases. Both of the estimated
relationships in this multiple linear regression model are consistent with what would be expected.

d. If Finger Lakes Investments can achieve its performance level goals (a satisfactory service level
(3) for both trade price and speed of execution), its predicted overall satisfaction level will be

$\hat{y} = -0.7835 + 0.5580(3) + 0.7342(3) = 3.0929$,

or approximately 3.1.
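
The arithmetic in part (d) can be checked with a short Python sketch. It assumes the rounded coefficients of the estimated equation (intercept -0.7835, slopes 0.5580 and 0.7342); the function name is ours and is used only for illustration. Because the coefficients are rounded to four decimals, the result can differ from the Excel-based value of 3.0929 in the fourth decimal place.

```python
# Rounded coefficients from the estimated regression equation in part (a).
B0, B1, B2 = -0.7835, 0.5580, 0.7342


def predict_overall_satisfaction(trade_price: float, speed: float) -> float:
    """Predicted overall satisfaction for the two satisfaction scores."""
    return B0 + B1 * trade_price + B2 * speed


# Both goals met at the "satisfied" level (score of 3).
prediction = predict_overall_satisfaction(3, 3)
print(round(prediction, 1))
```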

e. The possible responses (scores) for each question were no opinion (0), unsatisfied (1), somewhat
satisfied (2), satisfied (3), and very satisfied (4). The responses unsatisfied, somewhat satisfied,
satisfied, and very satisfied each represent a degree of satisfaction. However, the response no
opinion does not represent a degree of satisfaction and should not be part of this scale. Giving the no
opinion response a value of zero is not appropriate.

12. a. The following Excel output provides the estimated simple linear regression equation that could be
used to predict the percentage of games won (y) given the average number of passing yards per
attempt (x).

The estimated simple linear regression equation is $\hat{y} = -58.7703 + 16.3906x$, and the coefficient of
determination $r^2$ is 0.5771, so this regression model explains approximately 58% of the variation in
the sample values of percentage of games won.

b. The following Excel output provides the estimated simple linear regression equation that could be
used to predict the percentage of games won (y) given the number of interceptions thrown per
attempt (x).

The estimated simple linear regression equation is $\hat{y} = 97.5383 - 1600.4909x$, and the coefficient of
determination $r^2$ is 0.4379, so this regression model explains approximately 44% of the variation in
the sample values of percentage of games won.

c. The following Excel output provides the estimated multiple linear regression equation that could be
used to predict the percentage of games won (y) given the average number of passing yards per
attempt (x1) and the number of interceptions thrown per attempt (x2).

The estimated multiple linear regression equation is $\hat{y} = -5.7633 + 12.9494x_1 - 1083.7880x_2$, and
the coefficient of determination $R^2$ is 0.7525, so this regression model explains approximately 75%
of the variation in the sample values of percentage of games won.

d. Using the estimated regression equation developed in part (c), the predicted percentage of games
won by the Kansas City Chiefs for the 2011 season (during which the Kansas City Chiefs' average
number of passing yards per attempt was 6.2 and the number of interceptions thrown per attempt
was 0.036) is

$\hat{y} = -5.7633 + 12.9494(6.2) - 1083.7880(0.036) = 35.5064$,

or 35.51%. During the 2011 season the Kansas City Chiefs won 43.75% of their games (recall that the
team’s record for the 2011 season was 7 wins and 9 losses), so the team performed better than
what we would predict for a team with an average number of passing yards per attempt of 6.2 and
number of interceptions thrown per attempt of 0.036.
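
As a check on part (d), the following Python sketch evaluates the estimated equation from part (c), taken here as intercept -5.7633, slope 12.9494 on passing yards per attempt, and slope -1083.7880 on interceptions per attempt, and compares the prediction with the actual winning percentage. The function and variable names are illustrative assumptions, and the rounded coefficients can produce a value that differs from 35.5064 in the last decimal places.

```python
# Rounded coefficients from the estimated regression equation in part (c).
B0, B1, B2 = -5.7633, 12.9494, -1083.7880


def predict_win_pct(yards_per_attempt: float, interceptions_per_attempt: float) -> float:
    """Predicted percentage of games won."""
    return B0 + B1 * yards_per_attempt + B2 * interceptions_per_attempt


predicted = predict_win_pct(6.2, 0.036)   # 2011 Kansas City Chiefs
actual = 100 * 7 / 16                     # 7 wins in 16 games
residual = actual - predicted             # positive: better than predicted

print(f"predicted {predicted:.2f}%, actual {actual:.2f}%, residual {residual:.2f}")
```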

e. The estimated simple linear regression equation that uses only the average number of passing yards
per attempt as the independent variable to predict the percentage of games won has a coefficient of
determination of $r^2 = 0.5771$, and the estimated multiple linear regression equation that uses both the
average number of passing yards per attempt and the number of interceptions thrown per attempt as
the independent variables to predict the percentage of games won has a coefficient of determination
of $R^2 = 0.7525$. The multiple linear regression model fits the data better, as it explains over 17
percentage points more of the variation in the percentage of games won than did the simple linear
regression.

13. a. The following Excel output provides the estimated simple linear regression equation that could be
used to predict repair time (y) given the number of months since the last maintenance service (x).

The estimated simple linear regression equation is $\hat{y} = 2.1473 + 0.3041x$.

Before testing the hypotheses of no relationship between repair time and the number of months since
the last maintenance service, we check the conditions necessary for valid inference in regression.
The Excel plot of the residuals and the number of months since the last maintenance service follows.

[Figure: Months Since Last Service Residual Plot. Residuals (vertical axis) versus months since last service (horizontal axis).]

Because we are working with only 10 observations, assessing the conditions necessary for inference
to be valid in regression is extremely difficult. However, this scatter chart does not provide strong
evidence of a violation of the conditions, so we will proceed with our inference. Since the level of
significance for use in hypothesis testing has not been given, we will use the standard 0.05 level
throughout this problem.

The p-value associated with the estimated regression parameter b1 is 0.0163. Because this p-value is
less than the 0.05 level of significance, we reject the hypothesis that $\beta_1 = 0$. We conclude that there
is a relationship between repair time and the number of months since the last maintenance service at
the 0.05 level of significance, and our best estimate is that a 1 month increase in number of months
since the last maintenance service corresponds to an increase of 0.3041 hours in repair time.

The coefficient of determination $r^2$ is 0.5342, so the regression model explains approximately 53%
of the variation in the values of repair time in the sample.

b. The predicted repair time and residual for each of the ten repairs in the data, taken from the Excel
output and sorted by residual, are shown in the following table.

Repair Time   Months Since   Type of                      Predicted Repair
in Hours      Last Service   Repair       Repairperson    Time in Hours      Residual
1.8           3              Mechanical   Donna Newton    3.0597             -1.2597
3.0           6              Mechanical   Donna Newton    3.9721             -0.9721
4.2           9              Mechanical   Bob Jones       4.8845             -0.6845
2.9           2              Electrical   Donna Newton    2.7555              0.1445
2.9           2              Electrical   Donna Newton    2.7555              0.1445
4.8           8              Electrical   Bob Jones       4.5803              0.2197
4.8           8              Mechanical   Bob Jones       4.5803              0.2197
4.5           6              Electrical   Donna Newton    3.9721              0.5279
4.9           7              Electrical   Bob Jones       4.2762              0.6238
4.4           4              Electrical   Bob Jones       3.3638              1.0362

Mechanical repairs generally have negative residuals and electrical repairs generally have positive
residuals. The residuals $e_i = y_i - \hat{y}_i$ for this simple linear regression model are the differences
between the actual repair times and the predicted repair times, so mechanical repairs tend to take less
time than predicted and electrical repairs generally take more time than predicted.
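
The residual pattern described above can be reproduced with a short Python sketch. It refits the simple linear regression from part (a) on the ten observations in the table and averages the residuals by type of repair. The use of NumPy and the variable names are our choices for illustration, not part of the Excel-based solution.

```python
import numpy as np

# (repair time in hours, months since last service, type of repair),
# in the order the ten repairs appear in the table.
repairs = [
    (1.8, 3, "Mechanical"), (3.0, 6, "Mechanical"), (4.2, 9, "Mechanical"),
    (2.9, 2, "Electrical"), (2.9, 2, "Electrical"), (4.8, 8, "Electrical"),
    (4.8, 8, "Mechanical"), (4.5, 6, "Electrical"), (4.9, 7, "Electrical"),
    (4.4, 4, "Electrical"),
]
y = np.array([r[0] for r in repairs])
x = np.array([r[1] for r in repairs], dtype=float)
types = np.array([r[2] for r in repairs])

# Least-squares fit of y = b0 + b1*x; np.polyfit returns the slope first.
b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)

# Average residual for each type of repair.
mean_resid = {t: residuals[types == t].mean() for t in ("Mechanical", "Electrical")}
print(f"b0 = {b0:.4f}, b1 = {b1:.4f}")
print(mean_resid)
```

A negative average for mechanical repairs and a positive average for electrical repairs is the numerical counterpart of the pattern read off the table.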

Two of the repairs made by Donna Newton have large negative residuals indicating that the model
greatly overestimated the amount of time that these repairs would take. On the other hand, repairs
made by Bob Jones typically have positive residuals, indicating that repairs made by Bob Jones
generally take more time than predicted.

These results suggest that using dummy variables to represent the type of repair (mechanical or
electrical) and repairperson (Donna Newton or Bob Jones) may enhance the fit of the regression
model.

The scatter chart of months since last service and repair time in hours for which the points
representing electrical and mechanical repairs are shown with different shapes and colors follows.

[Figure: Scatter chart of repair time in hours (vertical axis) versus months since last service (horizontal axis), with electrical and mechanical repairs shown as separate series.]

This chart suggests that electrical repairs generally take longer than mechanical repairs, and so using
dummy variables to represent the type of repair (mechanical or electrical) may enhance the fit of the
regression model.

The scatter chart of months since last service and repair time in hours for which the points
representing repairs by Bob Jones and Donna Newton are shown with different shapes and colors
follows.

[Figure: Scatter chart of repair time in hours (vertical axis) versus months since last service (horizontal axis), with repairs by Bob Jones and Donna Newton shown as separate series.]

This chart suggests that repairs made by Bob Jones generally take longer than repairs made by
Donna Newton, and so using dummy variables to represent the repairperson (Donna Newton or Bob
Jones) may enhance the fit of the regression model.

c. The following Excel output provides the estimated multiple linear regression equation that could be
used to predict repair time given the number of months since the last maintenance service (x1) and
the type of repair (mechanical or electrical, x2).

The estimated multiple linear regression equation is $\hat{y} = 0.9305 + 0.3876x_1 + 1.2627x_2$.

Before testing the hypotheses of no relationship between repair time and the independent variables in
this model, we check the conditions necessary for valid inference in regression. Excel plots of the
residuals with each independent variable in this model follow.

[Figure: Months Since Last Service Residual Plot. Residuals (vertical axis) versus months since last service (horizontal axis).]

[Figure: Type of Repair Dummy Residual Plot. Residuals (vertical axis) versus type of repair dummy (horizontal axis).]

Because we are working with only 10 observations, assessing the conditions necessary for inference
to be valid in regression is extremely difficult. However, these scatter charts do not provide strong
evidence of a violation of the conditions. We also note that the parameter estimate and associated
p-value corresponding to months since last service do not change substantially when the dummy
variable for type of repair is introduced into the model. This suggests that multicollinearity is not an
issue for this regression model. We will therefore proceed with our inferences.

The p-value associated with the estimated regression parameter b1 is 0.0004. Because this p-value is
less than the 0.05 level of significance, we reject the hypothesis that $\beta_1 = 0$. We conclude that there
is a relationship between number of months since the last maintenance service and repair time at the
0.05 level of significance. We estimate that holding the type of repair constant, a 1 month increase in
number of months since the last maintenance service corresponds with an increase of 0.3876 hours
in repair time.

The p-value associated with the estimated regression parameter b2 is 0.0051. Because this p-value is
less than the 0.05 level of significance, we reject the hypothesis that $\beta_2 = 0$. We conclude that there
is a relationship between the type of repair and repair time at the 0.05 level of significance. We
estimate that holding the number of months since the last maintenance service constant, an electrical
repair takes 1.2627 hours longer than a mechanical repair.

The y-intercept for this model has been estimated through extrapolation and so does not have a
meaningful interpretation.

The coefficient of determination is $R^2 = 0.8592$, so the regression model explains approximately
86% of the variation in the values of repair time in the sample.
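
The part (c) model can be refit outside Excel as a check. The sketch below is a minimal NumPy least-squares fit on the ten repairs from the table in part (b). It assumes the dummy variable is coded 1 for an electrical repair and 0 for a mechanical repair, which is inferred from the interpretation of b2 above rather than stated in the Excel output; variable names are illustrative.

```python
import numpy as np

# Ten repairs in the order of the table in part (b).
months = np.array([3, 6, 9, 2, 2, 8, 8, 6, 7, 4], dtype=float)
# Assumed coding: 1 = electrical repair, 0 = mechanical repair.
electrical = np.array([0, 0, 0, 1, 1, 1, 0, 1, 1, 1], dtype=float)
hours = np.array([1.8, 3.0, 4.2, 2.9, 2.9, 4.8, 4.8, 4.5, 4.9, 4.4])

# Design matrix with an intercept column, then ordinary least squares.
X = np.column_stack([np.ones_like(months), months, electrical])
coef, *_ = np.linalg.lstsq(X, hours, rcond=None)

# Coefficient of determination: 1 - SSE / SST.
fitted = X @ coef
sse = np.sum((hours - fitted) ** 2)
sst = np.sum((hours - hours.mean()) ** 2)
r_squared = 1 - sse / sst

print(np.round(coef, 4), round(r_squared, 4))
```

The fitted intercept and slopes should agree with the estimated equation for part (c) up to rounding.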

d. The following Excel output provides the estimated multiple linear regression equation that could be
used to predict repair time given the number of months since the last maintenance service (x1) and
the repairperson (Bob Jones or Donna Newton, x2).

The estimated multiple linear regression equation is $\hat{y} = 3.5263 + 0.1519x_1 - 1.0835x_2$.

Before testing the hypotheses of no relationship between repair time and the independent variables in
this model, we check the conditions necessary for valid inference in regression. Excel plots of the
residuals with each independent variable in this model follow.

[Figure: Months Since Last Service Residual Plot. Residuals (vertical axis) versus months since last service (horizontal axis).]

[Figure: Repairperson Dummy Residual Plot. Residuals (vertical axis) versus repairperson dummy (horizontal axis).]

Because we are working with only 10 observations, assessing the conditions necessary for inference
to be valid in regression is extremely difficult. However, these scatter charts do not provide strong
evidence of a violation of the conditions. We also note that the parameter estimate and associated p-
value corresponding to months since last service change substantially when the dummy variable for
the repairperson is introduced into the model. This suggests that multicollinearity is possibly an issue
for this regression model. We will keep this result in mind as we proceed with our inferences.
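
As a supplementary check that is not part of the solution's procedure (which compares parameter estimates across models), one could also compute the sample correlation between months since last service and each dummy variable. The sketch below does so with NumPy for the ten repairs in the table from part (b). The 0/1 codings (1 = Donna Newton, 1 = electrical) are assumptions made for illustration; reversing a coding only flips the sign of the correlation.

```python
import numpy as np

# Ten repairs in the order of the table in part (b).
months = np.array([3, 6, 9, 2, 2, 8, 8, 6, 7, 4], dtype=float)
# Assumed codings: 1 = Donna Newton (0 = Bob Jones); 1 = electrical (0 = mechanical).
donna = np.array([1, 1, 0, 1, 1, 0, 0, 1, 0, 0], dtype=float)
electrical = np.array([0, 0, 0, 1, 1, 1, 0, 1, 1, 1], dtype=float)

# Off-diagonal entry of the 2x2 correlation matrix is the sample correlation.
r_person = np.corrcoef(months, donna)[0, 1]
r_type = np.corrcoef(months, electrical)[0, 1]

print(f"months vs. repairperson dummy:   r = {r_person:.3f}")
print(f"months vs. type of repair dummy: r = {r_type:.3f}")
```

A correlation that is clearly larger in magnitude for the repairperson dummy than for the type of repair dummy is in line with the observation that the months estimate shifts when the repairperson dummy enters the model.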

The estimated regression parameter b1 implies that holding the repairperson constant, a 1 month
increase in number of months since the last maintenance service corresponds to an increase of
0.1519 hours in repair time. However, the p-value associated with the parameter estimate is 0.25671,
which exceeds the 0.05 level of significance, so we do not reject the hypothesis that
$\beta_1 = 0$. Based on this multiple linear regression model, we cannot conclude that, holding the
repairperson constant, there is a relationship between months since the last maintenance service and
repair time. We note, however, that we have evidence that the independent variables repairperson and
months since the last maintenance service are related, which may explain why months since the last
maintenance service is not statistically significant in this model.

The estimated regression parameter b2 implies that holding the number of months since the last
maintenance service constant, Donna Newton takes 1.0835 hours less than Bob Jones to make a
repair. However, the p-value associated with the parameter estimate is 0.1165, which exceeds the
0.05 level of significance, so we do not reject the hypothesis that $\beta_2 = 0$. Based on
this multiple linear regression model, we cannot conclude that, holding the months since the last
maintenance service constant, there is a difference in the repair times for Donna Newton and Bob
Jones. Again we note that we have evidence that the independent variables repairperson and months
since the last maintenance service are related, which may explain why the repairperson dummy
variable is not statistically significant in this model.

The y-intercept for this model has been estimated through extrapolation and so does not have a
meaningful interpretation.

The coefficient of determination is $R^2 = 0.6805$, so the regression model explains approximately
68% of the variation in the values of repair time in the sample.

e. The following Excel output provides the estimated multiple linear regression equation that could be
used to predict repair time given the number of months since the last maintenance service (x1), type
of repair (mechanical or electrical, x2), and the repairperson (Bob Jones or Donna Newton, x3).

The estimated multiple linear regression equation is $\hat{y} = 1.8602 + 0.2914x_1 + 1.1024x_2 - 0.6091x_3$.

Before testing the hypotheses of no relationship between repair time and the number of months since
the last maintenance service, we check the conditions necessary for valid inference in regression.
The Excel plots of the residuals with each independent variable follow.

[Figure: Months Since Last Service Residual Plot. Residuals (vertical axis) versus months since last service (horizontal axis).]

[Figure: Type of Repair Dummy Residual Plot. Residuals (vertical axis) versus type of repair dummy (horizontal axis).]

[Figure: Repairperson Dummy Residual Plot. Residuals (vertical axis) versus repairperson dummy (horizontal axis).]

Because we are working with only 10 observations, assessing the conditions necessary for inference
to be valid in regression is extremely difficult. However, these scatter charts do not provide strong
evidence of a violation of the conditions.

To check for the potential introduction of multicollinearity that may occur when we add the dummy
variables to the model, we compare the parameter estimates and p-values from the model that
includes only the number of months since the last maintenance service and the type of repair dummy
variable (from part c) with those from the model that includes all three independent variables (from
part e). When making these comparisons we observe that the parameter estimates and p-values
associated with the number of months since the last maintenance service and the type of repair
dummy variable do not change substantially when the dummy variable for repairperson is introduced
into or removed from the model.

We also compare the parameter estimates and p-values from the model that includes only the
number of months since the last maintenance service and the repairperson dummy variable (from
part d) with those from the model that includes all three independent variables (from part e). When
making these comparisons we observe that i) the parameter estimates and p-values associated with
the number of months since the last maintenance service change substantially and ii) the parameter
estimates and p-values associated with the repairperson dummy variable do not change substantially
when the dummy variable for type of repair is introduced into or removed from the model. These
results suggest that multicollinearity between the number of months since the last maintenance
service and the repairperson dummy variable may be an issue for this regression model. We will keep
this in mind as we proceed with our inferences.

The p-value associated with the estimated regression parameter b1 is 0.0130. Because this p-value is
less than the 0.05 level of significance, we reject the hypothesis that $\beta_1 = 0$. We conclude that there
is a relationship between number of months since the last maintenance service and repair time at the
0.05 level of significance. We estimate that holding the type of repair and repairperson constant, a
1 month increase in number of months since the last maintenance service corresponds to an increase
of 0.2914 hours in repair time.

The p-value associated with the estimated regression parameter b2 is 0.0109. Because this p-value is
less than the 0.05 level of significance, we reject the hypothesis that $\beta_2 = 0$. We conclude that there
is a relationship between the type of repair and repair time at the 0.05 level of significance. We
estimate that holding the number of months since the last maintenance service and the
repairperson (Donna Newton or Bob Jones) constant, an electrical repair takes 1.1024 hours longer
than a mechanical repair.

Furthermore, the estimated regression parameter b3 implies that holding the number of months since
the last maintenance service and the type of repair (mechanical or electrical) constant, Donna Newton
takes 0.6091 hours less than Bob Jones to make a repair. However, the p-value associated with the
parameter estimate is 0.1674, which exceeds the 0.05 level of significance, so we do not
reject the hypothesis that $\beta_3 = 0$. Based on this multiple linear regression model, we
cannot conclude that there is a difference in repair times for Bob Jones and Donna Newton.

Finally, the y-intercept for this model has been estimated through extrapolation and so does not have
a meaningful interpretation.

The coefficient of determination is $R^2 = 0.9002$, so the regression model explains approximately 90%
of the variation in the values of repair time in the sample.

f. In the model from part (c), which includes the number of months since the last maintenance service
and the type of repair (mechanical or electrical), we found a significant relationship between each of
the independent variables in the model and the dependent variable, and each of these relationships is
what would be expected. Furthermore, although the model from part (e) with all three
independent variables has the highest $R^2$ and so explains the greatest proportion of
variation in the sample repair times, the $R^2$ for the multiple linear regression model from part (c) is
only moderately smaller. Note that the model from part (c) includes the number of months since the
last maintenance service and the type of repair as independent variables, and to build the model in
part (e) we have added the repairperson dummy variable to the model in part (c). Because of the
multicollinearity between number of months since the last maintenance service and the repairperson
dummy variable, adding the repairperson dummy variable to a model that already includes
the number of months since the last maintenance service will do little to enhance the ability of the
model to explain variation in the dependent variable repair time.

We want to select the simplest model that works well, and so the preferred multiple linear regression
model is the model in part (c) that includes the number of months since the last maintenance service
and the type of repair (mechanical or electrical).
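
The comparison of fit across parts (c), (d), and (e) can be reproduced with the following NumPy sketch, which refits each model on the ten repairs from the table in part (b) and reports its coefficient of determination. The helper function name and the 0/1 codings (1 = electrical, 1 = Donna Newton) are assumptions made for illustration; the choice of coding does not affect the coefficient of determination.

```python
import numpy as np

# Ten repairs in the order of the table in part (b).
months = np.array([3, 6, 9, 2, 2, 8, 8, 6, 7, 4], dtype=float)
electrical = np.array([0, 0, 0, 1, 1, 1, 0, 1, 1, 1], dtype=float)  # assumed: 1 = electrical
donna = np.array([1, 1, 0, 1, 1, 0, 0, 1, 0, 0], dtype=float)       # assumed: 1 = Donna Newton
hours = np.array([1.8, 3.0, 4.2, 2.9, 2.9, 4.8, 4.8, 4.5, 4.9, 4.4])


def r_squared(*predictors: np.ndarray) -> float:
    """Coefficient of determination for an OLS fit of hours on the predictors."""
    X = np.column_stack([np.ones_like(hours), *predictors])
    coef, *_ = np.linalg.lstsq(X, hours, rcond=None)
    resid = hours - X @ coef
    return float(1 - (resid @ resid) / np.sum((hours - hours.mean()) ** 2))


r2_c = r_squared(months, electrical)         # part (c)
r2_d = r_squared(months, donna)              # part (d)
r2_e = r_squared(months, electrical, donna)  # part (e)

print(f"(c) {r2_c:.4f}  (d) {r2_d:.4f}  (e) {r2_e:.4f}  gain from (c) to (e): {r2_e - r2_c:.4f}")
```

The small gain from part (c) to part (e) is the quantity behind the preference for the simpler model.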

14. a. The following Excel output provides the estimated multiple linear regression equation that could be
used to predict delay given the industry dummy variable (x1), the public dummy variable (x2), quality
(x3), and finished (x4).

The estimated multiple linear regression equation is

$\hat{y} = 80.4286 + 11.9442x_1 - 4.8163x_2 - 2.6236x_3 - 4.0725x_4$.

b. The coefficient of determination is $R^2 = 0.3826$, so the regression model explains approximately
38% of the variation in the values of delay in the sample. Other independent variables that
could be included in this regression model to improve the fit include the amount of taxes reported by
the company that is being audited and what type of audit (Taxpayer Compliance Measurement
Program Audit, IRS Correspondence, IRS Office Audit, or IRS Field Audit) is being conducted.

c. Before testing any hypotheses about this regression model, we check the conditions necessary for
valid inference in regression. Excel plots of the residuals and each of the independent variables
follow.

[Figure: Industry Residual Plot. Residuals (vertical axis) versus industry dummy (horizontal axis).]

[Figure: Public Residual Plot. Residuals (vertical axis) versus public dummy (horizontal axis).]

[Figure: Quality Residual Plot. Residuals (vertical axis) versus quality (horizontal axis).]

[Figure: Finished Residual Plot. Residuals (vertical axis) versus finished (horizontal axis).]

The residuals appear to have a relatively constant variance across the values of each independent
variable and do not appear to be badly skewed for any variable. However, the mean of the residuals
possibly differs from zero at several values of each of the quantitative independent variables (quality
and finished). A closer look at these scatter charts suggests that both quality and finished may have
a nonlinear relationship with delay. We will keep these findings in mind as we proceed.

In checking for multicollinearity, we first calculate the correlation coefficient r for the quantitative
independent variables (quality and finished) to determine if our quantitative variables are strongly
correlated. We note that the correlation between quality and finished is 0.0356, which indicates
that multicollinearity between the quantitative variables is not a concern.

Next we rerun the regression after removing the industry dummy variable (x1) from the original
model and compare the parameter estimates and associated p-values for each of the remaining
independent variables to the parameter estimates and associated p-values for the original model.

When making these comparisons we observe that these values do not change substantially when the
dummy variable for industry is introduced into or removed from the model and conclude that the
industry dummy variable does not create a problem with multicollinearity.

Finally, we rerun the regression after removing the public dummy variable (x2) from the original
model and compare the parameter estimates and associated p-values for each of the remaining
independent variables to the parameter estimates and associated p-values for the original model.

When making these comparisons we observe that these values do not change substantially when the
dummy variable for whether the company is publicly traded is introduced into or removed from the
model and conclude that the public dummy variable does not create a problem with
multicollinearity.

Our results suggest that multicollinearity is not an issue for this regression model. We will therefore
proceed with our inferences.

The p-value for the test of the hypothesis that $\beta_1 = 0$ is 0.0034. Because this p-value is less than the
0.05 level of significance, we reject the hypothesis that $\beta_1 = 0$, and conclude that there is a difference
in delay between the industries at the 0.05 level of significance. We estimate that, holding the values
of public, quality, and finished constant, the delay experienced by an industrial company is 11.9442
days longer than the delay experienced by a bank, savings and loan, or insurance company.

The p-value for the test of the hypothesis that $\beta_2 = 0$ is 0.2625. Because this p-value is greater than
the 0.05 level of significance, we do not reject the hypothesis that $\beta_2 = 0$, and we cannot conclude
that there is a difference in the delays whether the company was traded on an organized exchange or
over the counter when controlling for industry, quality, and finished.

The p-value for the test of the hypothesis that $\beta_3 = 0$ is 0.0332. Because this p-value is less than the
0.05 level of significance, we reject the hypothesis that $\beta_3 = 0$, and conclude that there is a
relationship between delay and quality at the 0.05 level of significance. We estimate that, holding the
values of industry, public, and finished constant, when the overall quality of internal controls (as
judged by the auditor) increases by one point the delay decreases by 2.6236 days.

The p-value for the test of the hypothesis that $\beta_4 = 0$ is 0.0345. Because this p-value is less than the
0.05 level of significance, we reject the hypothesis that $\beta_4 = 0$, and conclude that there is a
relationship between delay and finished at the 0.05 level of significance. We estimate that, holding
the values of industry, public, and quality constant, when finished (as judged by the auditor)
increases by one point the delay decreases by 4.0725 days.

d. Since we did not reject the hypothesis $\beta_2 = 0$ in the previous model, we will remove x2 (the public
dummy) from our multiple linear regression model. The following Excel output provides the
estimated multiple linear regression equation that could be used to predict delay given the industry
dummy variable (x1), quality (x2), and finished (x3).

The estimated multiple linear regression equation is $\hat{y} = 79.7324 + 12.6453x_1 - 2.8204x_2 - 4.1940x_3$,
and the coefficient of determination for this model is $R^2 = 0.3597$, so the regression model explains
almost as much variation in the values of delay in the sample as did the model that included all
four independent variables.

Before testing any hypotheses about this regression model, we again check the conditions necessary
for valid inference in regression.

The Excel plots of the residuals and each of the independent variables follow.

[Figure: Industry Residual Plot. Residuals (vertical axis) versus industry dummy (horizontal axis).]

[Figure: Quality Residual Plot. Residuals (vertical axis) versus quality (horizontal axis).]

[Figure: Finished Residual Plot. Residuals (vertical axis) versus finished (horizontal axis).]

The residuals appear to have a relatively constant variance across the values of each independent
variable and do not appear to be badly skewed at any value of any independent variable. However,
the mean of the residuals possibly differs from zero at several values of each of the quantitative
independent variables (quality and finished). A closer look at these scatter charts again suggests that
quality and finished may each have a nonlinear relationship with delay. We will proceed with our
inference but will keep these findings in mind.

We have already determined that the correlation coefficient r for quality and finished is 0.0356,
which indicates that multicollinearity between the quantitative variables is not a concern.

Next we rerun this regression after removing the industry dummy variable (x1) from our model and
compare the parameter estimates and associated p-values for each of the remaining independent
variables to the parameter estimates and associated p-values for the original model.

When making these comparisons we observe that these values do not change substantially when the
dummy variable for industry is introduced into or removed from the model and conclude that the
industry dummy variable does not create a problem with multicollinearity.

Our results suggest that multicollinearity is not an issue for this regression model. We will therefore
proceed with our inferences.

The p-value for the test of the hypothesis that β1 = 0 is 0.0019. Because this p-value is less than the
0.05 level of significance, we again reject the hypothesis that β1 = 0, and conclude that there is a
difference in delay between the industries at the 0.05 level of significance. We estimate that, holding
quality and finished constant, the delay experienced by an industrial company is 12.6453 days longer
than the delay experienced by a bank, savings and loan, or insurance company.

The p-value for the test of the hypothesis that β2 = 0 is 0.0217. Because this p-value is less than the
0.05 level of significance, we reject the hypothesis that β2 = 0, and conclude that there is a
relationship between delay and quality at the 0.05 level of significance. We estimate that, holding
industry and finished constant, when the overall quality of internal controls (as judged by the
auditor) increases by one point the delay decreases by 2.8204 days.

The p-value for the test of the hypothesis that β3 = 0 is 0.0300. Because this p-value is less than the
0.05 level of significance, we reject the hypothesis that β3 = 0, and conclude that there is a
relationship between delay and finished at the 0.05 level of significance. We estimate that, holding
industry and quality constant, when finished increases by one point the delay decreases by
4.1940 days.

We have noted that the residuals plotted over each of the quantitative variables suggested possible
nonlinear relationships between the dependent variable delay and the two quantitative variables
(quality and finished). If we can think of plausible reasons why these two relationships could be
nonlinear, we may wish to consider this quadratic model next:

ŷ = β0 + β1x1 + β2x2 + β3(x2)^2 + β4x3 + β5(x3)^2

where x1 is industry, x2 is quality, and x3 is finished.

15. a. The following Excel output provides the estimated simple linear regression equation that can be
used to predict the fuel efficiency for highway driving (y) given the engine’s displacement (x).

The estimated simple linear regression equation is ŷ = 35.3950 − 2.8821x, and the coefficient of
determination for this model is r2 = 0.6945, so this simple linear regression model explains
approximately 69% of the variation in the values of HwyMPG in the sample.

Before testing the hypothesis β1 = 0 for this regression model, we check the conditions necessary for
valid inference in regression. The Excel plot of the residuals and displacement follows.

[Figure: Displacement residual plot (residuals versus displacement)]

The residuals appear to deviate somewhat from a constant variance with a mean of zero, but do not
appear to be badly skewed at any value of displacement. Because the apparent violations of the
conditions necessary for valid inference in regression do not appear to be severe, we will proceed
with our inference and keep these findings in mind.

The p-value for the test of the hypothesis that β1 = 0 is 1.51247E-81. Because this p-value is less
than the 0.05 level of significance, we reject the hypothesis that β1 = 0, and conclude that there is a
relationship between HwyMPG and displacement at the 0.05 level of significance. We estimate that
a one liter increase in the engine’s displacement coincides with a decrease of 2.8821 in HwyMPG.

b. The scatter chart of HwyMPG and displacement for which the points representing compact, midsize,
and large automobiles are shown in different shapes and/or colors follows.

[Figure: scatter chart of HwyMPG versus displacement, with compact, midsize, and large automobiles marked separately]

This chart suggests that for each class of automobile (compact, midsize, and large) HwyMPG
decreases as displacement increases. The chart also suggests that midsize automobiles generally
have the highest HwyMPG while compact automobiles generally have the lowest HwyMPG.
Although this seems counterintuitive, the chart shows that this is likely occurring because the
midsize automobiles in the sample data tend to have low engine displacement, while the compact
automobiles in the sample data tend to have high engine displacement. The chart does suggest that
using dummy variables to represent the classes of automobile may enhance the fit of the regression
model.

c. The following Excel output provides the estimated multiple linear regression equation that can be
used to predict the fuel efficiency for highway driving (y) given the engine’s displacement (x1) and
the dummy variables ClassMidsize (x2) and ClassLarge (x3).

The estimated multiple linear regression equation is ŷ = 29.0359 − 1.6625x1 + 4.4686x2 + 1.8047x3,
and the coefficient of determination for this model is R2 = 0.8182, so this multiple linear regression
model explains approximately 82% of the variation in the values of HwyMPG in the sample.

d. Before testing any hypotheses for this regression model, we check the conditions necessary for valid
inference in regression. Excel plots of the residuals and each independent variable follow.

[Figures: plots of the residuals against Displacement, ClassMidsize, and ClassLarge]

The residuals appear to have a mean of zero and do not appear to be badly skewed at any value of
each independent variable. However, the variance does not appear to be constant across levels of
displacement or ClassLarge. These violations do not appear to be severe.

We also note that the parameter estimate and associated p-value for the independent variable
displacement change substantially when the dummy variables that represent the class of
automobile are introduced into the model. This makes sense because the displacement or size of the
engine is generally related to the class or size of the automobile, and it suggests that multicollinearity
is possibly a concern. We will keep this result in mind as we proceed with our inference.

The p-value for the test of the hypothesis that β2 = 0 is 7.14209E-35. Because this p-value is less
than the 0.05 level of significance, we reject the hypothesis that β2 = 0, and conclude that there is a
difference in HwyMPG between midsized automobiles and compact automobiles at the 0.05 level of
significance. We estimate that holding displacement constant, midsized automobiles get 4.4686 more
HwyMPG than do compact automobiles. That is, if a compact automobile and a midsized
automobile have the same displacement, we expect a midsized automobile to get about 4.5 more
miles per gallon than the compact automobile.

The p-value for the test of the hypothesis that β3 = 0 is 9.14602E-09. Because this p-value is less
than the 0.05 level of significance, we reject the hypothesis that β3 = 0, and conclude that there is a
difference in HwyMPG between large automobiles and compact automobiles at the 0.05 level of
significance. We estimate that holding displacement constant, large automobiles get 1.8047 more
HwyMPG than do compact automobiles. That is, if a compact automobile and a large automobile
have the same displacement, we expect a large automobile to get about 1.8 more miles per gallon
than the compact automobile.
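
The interpretation of the dummy variable coefficients can be checked with a short sketch. It assumes the coefficients reported in part (c) with compact as the baseline class; the function name and the displacement of 3.0 liters are hypothetical and chosen only for illustration.

```python
# Coefficients as reported in part (c); the baseline class is compact.
B0, B_DISP, B_MIDSIZE, B_LARGE = 29.0359, -1.6625, 4.4686, 1.8047


def predict_hwy_mpg(displacement, vehicle_class):
    """Predicted HwyMPG for a displacement (liters) and a class name."""
    midsize = 1 if vehicle_class == "midsize" else 0  # ClassMidsize dummy
    large = 1 if vehicle_class == "large" else 0      # ClassLarge dummy
    return B0 + B_DISP * displacement + B_MIDSIZE * midsize + B_LARGE * large


displacement = 3.0  # hypothetical engine size, not taken from the data
compact = predict_hwy_mpg(displacement, "compact")
midsize = predict_hwy_mpg(displacement, "midsize")
large = predict_hwy_mpg(displacement, "large")

# With displacement held constant, each difference equals the dummy coefficient.
print(round(midsize - compact, 4))
print(round(large - compact, 4))
```

Whatever displacement is used, the differences between classes stay the same, which is what "holding displacement constant" means in this model.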

e. The following Excel output provides the estimated multiple linear regression equation that can be
used to predict the fuel efficiency for highway driving (y) given the engine’s displacement (x1) and
the dummy variables ClassMidsize (x2), ClassLarge (x3), and FuelPremium (x4).

The estimated multiple linear regression equation is

ŷ = 29.7624 − 1.6347x1 + 3.9634x2 + 1.6450x3 − 1.1210x4

and the coefficient of determination for this model is R2 = 0.8338, so this multiple linear regression
model explains approximately 83% of the variation in the values of HwyMPG in the sample.

f. Before testing any hypotheses for this regression model, we check the conditions necessary for valid
inference in regression. Excel plots of the residuals and each independent variable follow.

[Figures: plots of the residuals against Displacement, ClassMidsize, ClassLarge, and FuelPremium]

The residuals appear to have a mean of zero and do not appear to be badly skewed at any value of
any independent variable. However, the variance does not appear to be constant across values of the
independent variable displacement, but this violation does not appear to be severe. We also note that
the parameter estimates and associated p-values for each of the independent variables do not
change substantially when the dummy variable FuelPremium is introduced into the model,
suggesting that the dummy variable FuelPremium does not create further issues with
multicollinearity. We therefore will proceed with our inference.

The p-value for the test of the hypothesis that β1 = 0 is 1.06276E-34. Because this p-value is less
than the 0.05 level of significance, we reject the hypothesis that β1 = 0, and conclude that there is a
relationship between HwyMPG and displacement at the 0.05 level of significance. We estimate that,
for a fixed class of automobile and type of fuel used, a one liter increase in the engine’s
displacement coincides with a decrease of 1.6347 in HwyMPG.

The p-value for the test of the hypothesis that β2 = 0 is 6.31598E-29. Because this p-value is less
than the 0.05 level of significance, we reject the hypothesis that β2 = 0, and conclude that there is a
difference in HwyMPG between midsized automobiles and compact automobiles at the 0.05 level of
significance. We estimate that holding displacement and fuel type constant, a midsized automobile
gets 3.9634 more HwyMPG than does a compact automobile. That is, if a compact automobile and a
midsized automobile have the same displacement and use the same type of fuel, we expect the
midsized automobile to get about 4.0 more miles per gallon than the compact automobile.

The p-value for the test of the hypothesis that β3 = 0 is 4.89555E-08. Because this p-value is less
than the 0.05 level of significance, we reject the hypothesis that β3 = 0, and conclude that there is a
difference in HwyMPG between large automobiles and compact automobiles at the 0.05 level of
significance. We estimate that holding displacement and fuel type constant, a large automobile gets
1.6450 more HwyMPG than does a compact automobile. That is, if a compact automobile and a large
automobile have the same displacement and use the same type of fuel, we expect the large
automobile to get about 1.6 more miles per gallon than the compact automobile.

The p-value for the test of the hypothesis that β4 = 0 is 1.6116E-07. Because this p-value is less than
the 0.05 level of significance, we reject the hypothesis that β4 = 0, and conclude that there is a
difference in HwyMPG between automobiles that use premium fuel and automobiles that do not use
premium fuel at the 0.05 level of significance. We estimate that holding displacement and class of
automobile constant, an automobile that uses premium fuel gets 1.1210 less HwyMPG than does an
automobile that does not use premium fuel. That is, if two automobiles are of the same class and have
the same displacement, and one of the automobiles uses premium fuel and the other does not, we
expect the automobile that uses premium fuel to get about 1.1 fewer miles per gallon than the
automobile that does not use premium fuel.

16. a. The scatter chart with vehicle speed as the independent variable follows:

[Figure: scatter chart of traffic flow versus vehicle speed]

The scatter chart suggests that vehicle speed and traffic flow are positively related.

b. The following Excel output provides the estimated simple linear regression equation that could be
used to predict traffic flow (y) given the vehicle speed (x).

The estimated simple linear regression equation is ŷ = 1039.5757 + 6.6006x, and the coefficient of
determination for this model is r2 = 0.3133, so the regression model explains approximately 31% of
the variation in the sample values of traffic flow.

Before testing any hypotheses about this regression model, we check the conditions necessary for
valid inference in regression. The Excel plot of the residuals and vehicle speed follows.

[Figure: Vehicle Speed residual plot (residuals versus vehicle speed)]

The residuals appear to have a relatively constant variance across the values of vehicle speed and do
not appear to be badly skewed at any value of the independent variable. However, the mean of the
residuals possibly differs from zero at several values of the independent variable; this suggests the
relationship between vehicle speed and traffic flow may be nonlinear. We will proceed with our
inference but will keep our findings in mind as we proceed.

The p-value for the test of the hypothesis that β1 = 0 is 1.41658E-09. Because this p-value is less
than the 0.05 level of significance, we reject the hypothesis that β1 = 0, and conclude that there is a
relationship between vehicle speed and traffic flow at the 0.05 level of significance. Our best
estimate is that when vehicle speed increases by 1 mph, traffic flow increases by 6.6006 vehicles.

c. The following Excel output provides the estimated second order quadratic regression equation that
could be used to predict traffic flow (y) given the vehicle speed (x).

The estimated second order quadratic regression equation is ŷ = 621.2138 + 28.0372x − 0.2665x^2,
and the coefficient of determination for this model is R2 = 0.3431, so the quadratic regression model
explains approximately 3% more of the variation in the sample values of traffic flow than did the
linear regression model in part (b).

Before testing any hypotheses about this regression model, we again check the conditions necessary
for valid inference in regression. The Excel plot of the residuals and vehicle speed follows.

[Figure: Vehicle Speed residual plot (residuals versus vehicle speed)]

This scatter chart is very similar to the scatter chart of the residuals from the simple linear regression
model estimated in part (b). The plot of the residuals from the quadratic model against the squared
values of vehicle speed follows.

[Figure: Vehicle Speed Sq residual plot (residuals versus squared vehicle speed)]

This scatter chart does not provide strong evidence of a violation of the conditions, so we will proceed
with our inference.

The p-value for the test of the hypothesis that β1 = 0 is 0.0074. Because this p-value is less than the
0.05 level of significance, we again reject the hypothesis that β1 = 0. Similarly, the p-value for the
test of the hypothesis that β2 = 0 is 0.0384. Because this p-value is less than the 0.05 level of
significance, we reject the hypothesis that β2 = 0. We therefore conclude that there is a nonlinear
relationship between vehicle speed and traffic flow. We estimate that when vehicle speed increases
from some value x to x+1, the traffic flow changes by

28.0372 [(x + 1) - x] - 0.2665 [(x + 1)^2 - x^2]

= 28.0372 (x - x + 1) - 0.2665 (x^2 + 2x + 1 - x^2)

= 28.0372 - 0.2665 (2x + 1)

= 27.7707 - 0.5331x

That is, estimated traffic flow initially increases as vehicle speed increases when the traffic is
traveling at a relatively low speed, and then eventually decreases as vehicle speed increases. Solving
this result for x

27.7707 - 0.5331x = 0

-0.5331x = -27.7707

x = -27.7707 / -0.5331 = 52.0935

tells us that estimated maximum traffic flow occurs at a vehicle speed of approximately 52 miles per
hour; at speeds below 52 miles per hour the traffic flow increases as vehicle speed increases, and at
speeds above 52 miles per hour the traffic flow decreases as vehicle speed increases. Substituting 52
miles per hour
into the estimated second order quadratic regression equation:

ŷ = 621.2138 + 28.0372(52) − 0.2665(52)^2 = 1358.41

yields the estimated maximum traffic flow of approximately 1,358 vehicles.
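
The derivation above can be reproduced with a minimal sketch. It assumes the coefficients as rounded to four decimal places in this solution, so its results differ slightly from the values quoted in the text (52.0935 and 1358.41), which appear to be based on Excel's unrounded estimates. The function names are hypothetical.

```python
# Coefficients of the estimated quadratic equation from part (c), as rounded in the text.
B0, B1, B2 = 621.2138, 28.0372, -0.2665


def predict_flow(speed):
    """Predicted traffic flow at a given vehicle speed (mph)."""
    return B0 + B1 * speed + B2 * speed ** 2


def one_unit_change(speed):
    """Change in predicted flow when speed goes from `speed` to `speed + 1`."""
    return predict_flow(speed + 1) - predict_flow(speed)


# Algebraically the change is (B1 + B2) + 2 * B2 * speed, which is zero at:
turning_speed = -(B1 + B2) / (2 * B2)

print(round(turning_speed, 2))     # speed at which the one-unit change is zero
print(round(predict_flow(52), 2))  # predicted flow at 52 mph
```

Below the turning speed `one_unit_change` is positive, and above it the change is negative, which matches the interpretation given in the text.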

A plot of the linear and quadratic regression lines helps us better understand the difference in how
these two models fit the sample data.

[Figure: scatter chart of traffic flow versus vehicle speed with the linear and quadratic regression lines overlaid]

This display shows that there is little difference in how the simple linear regression line (in green)
and the quadratic regression line (in red) fit the sample data. Comparison of the coefficients of
determination for these two models shows that the estimated second order quadratic regression
equation explains only about 3% more of the variation in the sample values of traffic flow than did
the less complex simple linear regression model. Since the simple linear regression model has almost
the same explanatory power as the quadratic regression model and is far simpler, the simple linear
regression model is superior.

d. By reducing the range of the axes for the scatter chart we developed in part (a), we can see more
clearly where (if at all) the relationship between vehicle speed and traffic flow changes:

[Figure: scatter chart of traffic flow versus vehicle speed with reduced axis ranges]

If there is a change in the relationship between vehicle speed and traffic flow, it is not prominent.
We will use 45 as the knot (you could select a different value to use as the knot – this is subjective –
and the results of part (c) could be used to estimate the value to use for the knot).

First we create a dummy variable that is equal to 1 if vehicle speed exceeds the knot value of 45 and
zero otherwise, then we multiply this dummy variable by the difference between vehicle speed and
the knot value of 45. We then estimate a regression model with this new variable (the product of the
knot dummy variable and the difference between vehicle speed and the knot value of 45) and vehicle
speed as the independent variables.
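
The construction of the new variable can be sketched as follows. This is only an illustration under stated assumptions: the function name and the list of speeds are hypothetical, and the speeds are not taken from the problem's data.

```python
KNOT = 45  # knot value chosen in the text


def knot_variable(speed, knot=KNOT):
    """Knot dummy (1 if speed exceeds the knot, else 0) times (speed - knot)."""
    dummy = 1 if speed > knot else 0
    return dummy * (speed - knot)


# Hypothetical vehicle speeds, used only to show the resulting columns.
speeds = [30, 44, 45, 46, 55]

# Each row holds the two independent variables: vehicle speed and the knot variable.
design_rows = [(s, knot_variable(s)) for s in speeds]
for row in design_rows:
    print(row)
```

The knot variable is zero for every speed at or below 45, so the second coefficient affects the fitted line only above the knot.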

The following Excel output provides the estimated piecewise linear regression equation with a knot
at vehicle speed = 45 that could be used to predict traffic flow (y) given the vehicle speed (x).

If vehicle speed does not exceed 45 miles per hour, the estimated regression equation is

ŷ = 984.5875 + 8.1287x

and if vehicle speed exceeds 45 miles per hour, the estimated regression equation is

ŷ = 984.5875 + 8.1287x − 6.5507(x − 45)

= 1279.3672 + 1.5780x

According to this model, the estimated increase in traffic flow that corresponds with a 1 mile per
hour increase in vehicle speed is much smaller if the traffic speed is over 45 miles per hour.

Note that the coefficient of determination for this model is R2 = 0.3281, so the piecewise linear
regression model with a knot at vehicle speed = 45 explains approximately 1% more of the variation
in the sample values of traffic flow than did the much less complex simple linear regression model in
part (b). Also note that the p-value for the test of the hypothesis that β2 = 0 is 0.1460. Because this
p-value exceeds the 0.05 level of significance, we do not reject the hypothesis that β2 = 0 (i.e., the knot
interaction is not statistically significant). Furthermore, the piecewise linear regression model with a
knot at vehicle speed = 45 explains less of the variation in the sample values of traffic flow than did
the second order quadratic regression model in part (c). Thus, the piecewise linear regression model
with a knot of 45 should not be considered further. Note that a piecewise linear regression model
with a different knot (perhaps a knot of 52) may perform much better than our piecewise linear
regression model with a knot of 45.

e. We split the data set so that the first data set contains 65 observations with values of vehicle speed
less than 45 and the second data set contains 35 observations with values of vehicle speed greater
than or equal to 45.

The following Excel output provides the estimated simple linear regression equation that could be
used to predict traffic flow (y) given vehicle speed (x) below 45.

The estimated simple linear regression equation is 𝑦̂ = 961.5736 + 8.8039𝑥, and the coefficient
of determination for this model is r2 = 0.2978, so the regression model explains approximately 30%
of the variation in the sample values of traffic flow corresponding to vehicle speeds less than 45.

The following Excel output provides the estimated simple linear regression equation that could be
used to predict traffic flow (y) given vehicle speed (x) greater than or equal to 45.

The estimated simple linear regression equation is 𝑦̂ = 1167.7323 + 3.8026𝑥, and the coefficient
of determination for this model is r2 = 0.0207, so the regression model explains approximately 2% of
the variation in the sample values of traffic flow corresponding to vehicle speeds greater than or
equal to 45.

Separating the data into two sets and fitting separate simple linear regression equations to each set
results in an even worse fit than the piecewise linear regression with a single knot at vehicle speed =
45.

Comparing predicted values of traffic flow for vehicle speeds of 44 and 46 (slightly below and above
the knot value of 45) will allow us to see the difference between the piecewise linear regression with
a single knot and two separate simple regression equations.

For vehicle speed = 44 the piecewise linear regression with a single knot produces

𝑦̂ = 984.5875 + 8.1287(44) = 1342.25

For vehicle speed = 46, the piecewise linear regression with a single knot produces

𝑦̂ = 984.5875 + 8.1287(46) − 6.5507(46 − 45) = 1351.96

Alternatively, for vehicle speed = 44 the simple linear regression fit on observations with vehicle
speeds < 45 produces

𝑦̂ = 961.5736 + 8.8039(44) = 1348.94

For vehicle speed = 46, the simple linear regression fit on observations with vehicle speeds ≥ 45 produces

𝑦̂ = 1167.7323 + 3.8026(46) = 1342.65

That is, fitting two separate simple linear regression equations results in predicted traffic flow being
lower at vehicle speed = 46 than at vehicle speed = 44. This is opposite the behavior
predicted by the piecewise linear regression with a single knot at vehicle speed = 45.
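
The comparison above can be reproduced with a short sketch. It assumes the coefficients as rounded in this solution, so a prediction may differ from the text in the last decimal place; the function names are hypothetical.

```python
KNOT = 45


def piecewise_prediction(speed):
    """Piecewise linear regression with a single knot at 45 (part d)."""
    extra = speed - KNOT if speed > KNOT else 0
    return 984.5875 + 8.1287 * speed - 6.5507 * extra


def separate_prediction(speed):
    """Two simple linear regressions fit separately below and at/above 45 (part e)."""
    if speed < KNOT:
        return 961.5736 + 8.8039 * speed
    return 1167.7323 + 3.8026 * speed


for speed in (44, 46):
    print(speed, round(piecewise_prediction(speed), 2), round(separate_prediction(speed), 2))

# The separately fitted lines do not meet at the knot; this is the size of the gap.
left_limit = 961.5736 + 8.8039 * KNOT
jump = separate_prediction(KNOT) - left_limit
print(round(jump, 2))
```

The piecewise model is continuous at the knot by construction, while the separately fitted lines leave a gap there, which is why their predictions can fall as speed passes 45.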

To visualize how this happens, note from the charts below how the piecewise linear regression
“connects” two regression lines at the knot value of 45 while the two linear regression equations fit
separately result in a disjointed fit.

[Figure: Fitting Two Separate Linear Regression Equations. Traffic flow versus vehicle speed, with fitted lines y = 8.8039x + 961.57 and y = 3.8026x + 1167.7]

[Figure: Piecewise Linear Regression. Traffic flow versus vehicle speed, with fitted lines y = 8.1287x + 984.59 and y = 1.5780x + 1279.37]

f. Other independent variables that you could add to your regression model to explain more variation
in traffic flow include number of accidents and weather conditions (e.g., rainy or snowy).

17. a. The scatter chart with years to maturity as the independent variable follows.

[Figure: scatter chart of yield versus years to maturity]

A simple linear regression model does not appear to be appropriate; there appears to be a curvilinear
relationship between years to maturity and yield.

b. The following Excel output provides the estimated second order quadratic regression equation that
could be used to predict yield (y) given the years to maturity (x).

The estimated second order quadratic regression equation is ŷ = 1.0170 + 0.4606x − 0.0103x^2, and
the coefficient of determination for this model is R2 = 0.6678, so the quadratic regression model
explains approximately 67% of the variation in sample values of yield.

Before testing any hypotheses about this regression model, we again assess the conditions necessary
for valid inference in regression. Excel plots of the residuals with years to maturity and years to
maturity squared follow.

[Figures: plots of the residuals against Years and Years Sq]

These scatter charts do not provide strong evidence of a violation of the conditions, so we will
proceed with our inference.

The p-value for the test of the hypothesis that β1 = 0 is 1.80383E-06. Because this p-value is less
than the 0.05 level of significance, we reject the hypothesis that β1 = 0. Similarly, the p-value for the
test of the hypothesis that β2 = 0 is 0.0003. Because this p-value is less than the 0.05 level of
significance, we reject the hypothesis that β2 = 0. We therefore conclude that there is a nonlinear
relationship between years to maturity and yield. We estimate that when years to maturity increases
by 1 year from some value x to x+1, the yield changes by

0.4606 [(x + 1) - x] - 0.0103 [(x + 1)^2 - x^2]

= 0.4606 (x - x + 1) - 0.0103 (x^2 + 2x + 1 - x^2)

= 0.4606 - 0.0103 (2x + 1)

= 0.4503 - 0.0206x

That is, estimated yield initially increases as years to maturity increases, and then eventually
decreases as years to maturity increases. Solving this result for x

0.4503 - 0.0206x = 0

- 0.0206x = - 0.4503

x = -0.4503 / - 0.0206 = 21.9631

tells us that estimated maximum yield to maturity occurs at approximately 22 years. Substituting 22
years into the estimated second order quadratic regression equation:

ŷ = 1.0170 + 0.4606(22) − 0.0103(22)^2 = 6.19

yields the estimated maximum yield to maturity of approximately 6.19%.

c. A plot of the linear and quadratic regression lines overlaid on the scatter chart of years to maturity
and yield follows.

[Figure: scatter chart of yield versus years to maturity with the linear and quadratic regression lines overlaid]

This display shows that there is a substantial difference in how the simple linear regression line (in
green) and the quadratic regression line (in red) fit the sample data. If we were to run the simple
linear regression in Excel, we would find a multiple R of 0.7258 (a coefficient of determination of
approximately 0.53), whereas the quadratic regression model has a multiple R of 0.8172 (a coefficient
of determination of 0.6678, as found in part (b)). The quadratic regression model therefore explains
approximately 14% more of the variation in our sample yields.

d. Other independent variables you could add to the regression model to explain more variation in yield
include the prevailing market rate at the time the bond was issued and the credit rating given to the
company issuing the bond by Moody's, Standard and Poor's, or Fitch.

18. a. The scatter chart with average asking rent as the independent variable follows.

[Figure: scatter chart of Mortgage ($) versus Rent ($)]

The scatter chart suggests that rent is positively related to mortgage. However, it is not clear that the
relationship is linear, and so a simple linear regression model may not be appropriate.

b. The following Excel output provides the estimated simple linear regression equation that could be
used to predict the monthly mortgage on the median priced home (y) given the average asking rent
(x).

The estimated simple linear regression equation is ŷ = −197.9583 + 1.0699x.

The plot of the residuals for this model against the independent variable average asking rent follows.

[Figure: Rent ($) residual plot (residuals versus average asking rent)]

The mean of the residuals appears to differ from zero at several values of the independent variable
average asking rent; residuals for observations with relatively small or relatively large values of the
independent variable average asking rent tend to be positive, while the remaining residuals tend to be
negative. This suggests the relationship between the independent variable average asking rent and
the dependent variable monthly mortgage on the median priced home may be nonlinear, and so a
simple linear regression model may not be appropriate.

c. The following Excel output provides the estimated second-order quadratic regression equation that
could be used to predict the monthly mortgage on the median priced home (y) given the average
asking rent (x).

The estimated second-order quadratic regression equation is ŷ = 3965.6331 − 8.2606x + 0.0051x^2.

d. Excel plots of the residuals for the estimated second-order quadratic regression model against the
independent variables average asking rent and average asking rent squared follow.

[Figures: plots of the residuals against Rent ($) and Rent Squared]

These scatter charts suggest that the estimated second-order quadratic regression model fits the
sample data much better than the simple linear regression model.

A plot of the linear and quadratic regression lines overlaid on the scatter chart of the monthly
mortgage on the median priced home and the average asking rent will also help us better understand
the difference in how the quadratic regression model and a simple linear regression model fit the
sample data.

[Figure: scatter chart of Mortgage ($) versus Rent ($) with the linear and quadratic regression lines overlaid]

This display shows that there is a substantial difference in how the simple linear regression line (in
green) and the quadratic regression line (in red) fit the sample data. Note that the coefficient of
determination for the second-order quadratic regression model is R2 = 0.8985, so the regression model
explains almost 13% more variation in the sample values of the monthly mortgage on the median
priced home than does the simple linear regression model. The estimated regression model
developed in part (c) is superior to the model developed in part (b).

19. a. The following Excel output provides the estimated multiple linear regression equation that relates
risk of a stroke to the person’s age (x1), blood pressure (x2), and whether the person is a smoker (x3).

The estimated multiple linear regression equation is ŷ = −91.7595 + 1.0767x1 + 0.2518x2 + 8.7399x3.

b. Before testing any hypotheses about this regression model, we again check the conditions necessary
for valid inference in regression. Excel plots of the residuals and each of the independent variables
follow.

[Figures: plots of the residuals against Age, Blood Pressure, and Smoker Dummy]

None of these scatter charts provides strong evidence of a violation of the conditions, so we will
proceed with our inference.

Next we check for evidence of multicollinearity. First note that, using Excel, we can determine
that the correlation coefficient r for age and blood pressure is -0.3090, which indicates that
multicollinearity between the quantitative variables is not a concern.
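
For readers working outside Excel, the correlation coefficient used in this check could be computed with a minimal sketch such as the following. The function name is hypothetical, and the ages and blood pressures are made-up values for illustration only; they are not the data from this problem and will not reproduce r = -0.3090.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Sample correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical values for illustration only; not the data from this problem.
ages = [52, 61, 68, 74, 80]
blood_pressures = [160, 172, 155, 148, 150]

r = pearson_r(ages, blood_pressures)
print(round(r, 4))
```

A value of r far from both -1 and 1 suggests that the two quantitative independent variables do not carry largely the same information.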

Now we rerun this regression after removing the smoker dummy variable (x3) from our model and
compare the parameter estimates and associated p-values for each of the remaining independent
variables to the parameter estimates and associated p-values for the original model.

When making these comparisons we observe that these values do not change substantially when the
smoker dummy variable is introduced into or removed from the model and conclude that the smoker
dummy variable does not create a problem with multicollinearity.

Our results suggest that multicollinearity is not an issue for this regression model. We will therefore
proceed with our inferences.

The p-value for the test of the hypothesis that β3 = 0 is 0.0102. Because this p-value is less than the
0.05 level of significance, we reject the hypothesis that β3 = 0 and conclude that there is a
difference between smokers and nonsmokers in the risk of a stroke. We estimate that, holding age
and blood pressure constant, smokers have a risk of stroke that is 8.7399 percentage points higher
than that of nonsmokers.

c. For a patient with the profile of Art Speen (a 68-year-old smoker who has blood pressure of 175),
the predicted risk of a stroke is:

\[ \hat{y} = -91.7595 + 1.0767(68) + 0.2518(175) + 8.7399(1) = 34.2661 \]

or a probability of approximately 0.34.
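
The prediction above can be reproduced with a short Python sketch. The function name `predicted_stroke_risk` is hypothetical, and the coefficients are the rounded values from part (a). With these rounded coefficients the result is about 34.26, slightly below the 34.2661 reported above; the gap presumably arises because Excel carries unrounded coefficients, which is an assumption here.

```python
# Rounded coefficients from the estimated regression equation in part (a).
INTERCEPT = -91.7595
B_AGE = 1.0767        # per year of age
B_PRESSURE = 0.2518   # per unit of blood pressure
B_SMOKER = 8.7399     # dummy variable: 1 = smoker, 0 = nonsmoker

def predicted_stroke_risk(age, blood_pressure, smoker):
    """Predicted risk of a stroke (in percent) from the estimated equation."""
    return (INTERCEPT
            + B_AGE * age
            + B_PRESSURE * blood_pressure
            + B_SMOKER * smoker)

# Art Speen: a 68-year-old smoker with blood pressure of 175.
risk = predicted_stroke_risk(age=68, blood_pressure=175, smoker=1)
print(round(risk, 2))
```

Calling the same function with `smoker=0` and subtracting shows that the smoker and nonsmoker predictions differ by exactly the dummy coefficient, which is the interpretation given in part (b).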

d. Other factors that could be included in the model as independent variables include family history of
stroke, weight/obesity, and gender.

20. a. The following Excel output provides the estimated multiple linear regression equation that
includes critical reading (x1) and mathematics (x2) SAT scores as independent variables.

The estimated multiple linear regression equation is

\[ \hat{y} = -2.6717 + 0.0043x_1 + 0.0045x_2 \]

and the coefficient of determination for this multiple regression model is R² = 0.7722, so the regression
model explains approximately 77% of the variation in the sample values of the Freshman GPA.

Before testing any hypotheses about this regression model, we check the conditions necessary for
valid inference in regression. The correlation between the two independent variables (critical reading
and mathematics SAT scores) is -0.0161, so there is no need for concern about multicollinearity.
Excel plots of the residuals and each of the quantitative independent variables
(critical reading and mathematics SAT scores) follow.

[Figure: Reading Residual Plot (residuals versus Reading)]

[Figure: Math Residual Plot (residuals versus Math)]

The residuals appear to have a relatively constant variance across the values of each independent
variable and do not appear to be badly skewed at any value of either variable. However, the mean of
the residuals possibly differs from zero at several values of each independent variable. Because
these violations do not appear to be dramatic, we will proceed with our inference but will keep these
findings in mind as we proceed.

The p-value for the test of the hypothesis that β1 = 0 is 9.06411E-44. Because this p-value is less
than the 0.05 level of significance, we reject the hypothesis that β1 = 0 and conclude that there
is a relationship between SAT scores on critical reading and Freshman GPA. We estimate that,
holding SAT score on mathematics constant, a one-point increase in SAT score on critical reading
corresponds to an increase in Freshman GPA of 0.0043. We expect freshman GPA to increase as the
SAT score on critical reading increases, so this result appears to be reasonable.

The p-value for the test of the hypothesis that β2 = 0 is 1.20495E-45. Because this p-value is less
than the 0.05 level of significance, we again reject the hypothesis that β2 = 0 and conclude that there
is a relationship between SAT scores on mathematics and Freshman GPA. We estimate that, holding
SAT score on critical reading constant, a one-point increase in SAT score on mathematics
corresponds to an increase in Freshman GPA of 0.0045. We expect freshman GPA to increase as the
SAT score on mathematics increases, so this result appears to be reasonable.

b. Using the multiple linear regression model developed in part (a), the predicted freshman GPA of
Bobby Engle (a student who has been admitted to Ruggles College with a 660 SAT score on critical
reading and a 630 SAT score on mathematics) is

\[ \hat{y} = -2.6717 + 0.0043(660) + 0.0045(630) = 3.0519 \]

or approximately 3.05.
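
As a rough check, the same substitution can be written in Python. The function name `predicted_gpa` is hypothetical, and the coefficients are the rounded values from part (a). Because the slopes are rounded to only two significant digits, this sketch gives about 3.00 rather than the 3.0519 reported above; the difference presumably reflects the unrounded coefficients in the Excel output, which is an assumption here.

```python
def predicted_gpa(reading, math_score):
    """Predicted freshman GPA from the rounded part (a) coefficients."""
    intercept = -2.6717
    b_reading = 0.0043   # per SAT critical reading point
    b_math = 0.0045      # per SAT mathematics point
    return intercept + b_reading * reading + b_math * math_score

# Bobby Engle: 660 on critical reading, 630 on mathematics.
gpa = predicted_gpa(reading=660, math_score=630)
print(round(gpa, 4))
```

The sensitivity to rounding is worth noting: with SAT scores in the hundreds, a change of 0.00005 in a slope shifts the prediction by roughly 0.03.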

c. The following Excel output provides the estimated multiple linear regression equation that
includes critical reading (x1) and mathematics (x2) SAT scores and their interaction (x1x2) as
independent variables.

The estimated multiple linear regression equation is

\[ \hat{y} = 0.7977 - 0.0011x_1 - 0.0010x_2 + 0.000009x_1x_2 \]

and the coefficient of determination for this multiple regression model is R² = 0.7855, so the
regression model explains approximately 79% of the variation in the sample values of the Freshman
GPA.

Before testing any hypotheses about this regression model, we again check the conditions necessary
for valid inference in regression. As noted before, the correlation between the two independent
variables (critical reading and mathematics SAT scores) is -0.0161, so there is no need for concern
about multicollinearity. Excel plots of the residuals and each of the independent variables follow.

[Figure: Reading Residual Plot (residuals versus Reading)]

[Figure: Math Residual Plot (residuals versus Math)]

[Figure: ReadMath Residual Plot (residuals versus ReadMath)]

The addition of the ReadMath interaction appears to have had relatively little impact on the residuals.
However, we did not consider the violations extreme in the multiple regression model that
did not include the interaction, so we will again proceed with our inference.

The p-value for the test of the hypothesis that β1 = 0 is 0.4810. Because this p-value is greater than
the 0.05 level of significance, we do not reject the hypothesis that β1 = 0, and we conclude that there
is not a linear relationship between SAT scores on critical reading and Freshman GPA when
controlling for SAT scores on mathematics and the interaction between SAT scores on critical
reading and SAT scores on mathematics.

The p-value for the test of the hypothesis that β2 = 0 is 0.5479. Because this p-value is greater than
the 0.05 level of significance, we do not reject the hypothesis that β2 = 0, and we conclude that there
is not a linear relationship between SAT scores on mathematics and Freshman GPA when
controlling for SAT scores on critical reading and the interaction between SAT scores on critical
reading and SAT scores on mathematics.

The p-value for the test of the hypothesis that β3 = 0 is 0.0006. Because this p-value is less than the
0.05 level of significance, we reject the hypothesis that β3 = 0, and we conclude that there is a
relationship between the interaction of SAT scores on critical reading and mathematics and the
dependent variable Freshman GPA. We estimate that when the SAT score on critical reading
increases by one point, Freshman GPA increases by 0.000009 × (SAT score on mathematics).
Similarly, we estimate that when the SAT score on mathematics increases by one point, Freshman
GPA increases by 0.000009 × (SAT score on critical reading). These results support the conjecture
made by the Ruggles College Director of Admissions.
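
To make the interaction interpretation concrete, the following Python sketch evaluates the per-point effect for one student. It is an illustration under stated assumptions: the function name is hypothetical, it uses only the rounded interaction coefficient 0.000009, and, following the interpretation above, it leaves out the two main-effect coefficients that were not statistically significant. The scores 660 and 630 are the values for Bobby Engle from part (b).

```python
B_INTERACTION = 0.000009  # rounded coefficient on x1*x2 from part (c)

def interaction_effect_per_point(other_score):
    """Estimated GPA change for a one-point increase in one SAT score,
    counting only the interaction term, given the other SAT score."""
    return B_INTERACTION * other_score

# Effect of one more critical reading point for a student with 630 on mathematics.
per_reading_point = interaction_effect_per_point(630)
# Effect of one more mathematics point for a student with 660 on critical reading.
per_math_point = interaction_effect_per_point(660)
print(round(per_reading_point, 5), round(per_math_point, 5))
```

The point of the sketch is that, unlike in part (a), the estimated effect of one score is not a constant: it grows in proportion to the other score.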

d. The model developed in part (a) is simpler and explains almost as much variation in the sample
values of freshman GPA as the regression model developed in part (c), so the regression model
developed in part (a) is superior.

e. Other factors that could be added to the model as independent variables include the student’s high
school GPA, the number of credit hours the student plans to take, and the number of hours per week
the student plans to work in paid employment during her/his freshman year.

21. a. The following Excel output provides the estimated simple linear regression equation with the
customer’s annual household income as the independent variable (x) and credit card charges accrued
by a customer over the past year as the dependent variable (y).

The estimated simple linear regression equation is \( \hat{y} = 3146.3609 + 121.3549x \).

We estimate that as a customer’s annual income increases by $1000, the credit card charges accrued
by the customer over the past year will be $121.35 higher.

The coefficient of determination for this simple linear regression model is r² = 0.3135, so the
model explains approximately 31% of the variation in the sample values of credit
card charges accrued by a customer over the past year.

b. The following Excel output provides the estimated simple linear regression equation with the
number of members in the customer’s household as the independent variable (x) and credit card
charges accrued by a customer over the past year as the dependent variable (y).

The estimated simple linear regression equation is \( \hat{y} = 8853.1544 + 550.9074x \).

We estimate that as the number of members in the customer’s household increases by one, the credit
card charges accrued by the customer over the past year will be $550.91 higher.

The coefficient of determination for this simple linear regression model is r² = 0.0354, so the
model explains approximately 4% of the variation in the sample values of credit
card charges accrued by a customer over the past year.

c. The following Excel output provides the estimated simple linear regression equation with the
customer’s number of years of post-high school education as the independent variable (x) and credit
card charges accrued by a customer over the past year as the dependent variable (y).

The estimated simple linear regression equation is \( \hat{y} = 12{,}638.0766 - 509.8743x \).

We estimate that as a customer’s years of post-high school education increases by one year, the
credit card charges accrued by the customer over the past year will be $509.87 lower.

The coefficient of determination for this simple linear regression model is r² = 0.0160, so the
model explains approximately 2% of the variation in the sample values of credit
card charges accrued by a customer over the past year.

d. The following Excel output from Figure 4.25 provides the estimated multiple linear regression
equation with the customer’s annual household income (x1), number of members of the household
(x2), and number of years of post-high school education (x3) as independent variables and credit card
charges accrued by a customer over the past year as the dependent variable (y).

The multiple regression estimates of b1, b2, and b3 are almost identical to the estimate b1 from the
estimated simple linear regressions in parts (a), (b), and (c), respectively. This stability in the values
of the parameter estimates suggests little multicollinearity in the multiple linear regression that
includes the customer’s annual household income (x1), number of members of the household (x2),
and number of years of post-high school education (x3) as independent variables.

e. The coefficient of determination for the multiple regression model in Figure 4.25 is R² = 0.3635, so
the regression model explains approximately 36% of the variation in the sample values of credit card
charges accrued by a customer over the past year. This R² is only slightly less than the sum of the r²
values from parts (a), (b), and (c), which is 0.3667. This indicates that there is little redundancy in the
variation in sample values of credit card charges accrued by a customer over the past year that is
explained by these three independent variables, i.e., there is almost no multicollinearity in the
estimated multiple linear regression from Figure 4.25.

f. The following Excel output provides the estimated multiple linear regression equation with the
customer’s annual household income (x1), number of members of the household (x2), number of
years of post-high school education (x3), age (x4), a dummy variable for gender (x5), and a dummy
variable for whether a customer has exceeded his/her credit limit in the past 12 months (x6) as
independent variables and credit card charges accrued by a customer over the past year as the
dependent variable (y).

The coefficient of determination for this multiple regression model is R² = 0.3678, so the regression
model explains approximately 37% of the variation in the sample values of credit card charges
accrued by a customer over the past year. This R² is only slightly larger than the R² from the multiple
regression in Figure 4.25. Therefore, adding age, a dummy variable for gender, and a dummy
variable for whether a customer has exceeded his/her credit limit in the past 12 months as independent
variables to the multiple regression in Figure 4.25 does not substantially improve the fit of the
model.
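
The size of the improvement can be checked with simple arithmetic, sketched here in Python. The variable names are hypothetical; the two R² values are the ones reported above for the three-variable model in Figure 4.25 and the six-variable model in part (f).

```python
# Coefficients of determination reported above.
r2_three_predictors = 0.3635  # income, household size, education (Figure 4.25)
r2_six_predictors = 0.3678    # adds age, gender dummy, credit-limit dummy

# Additional share of variation explained by the three added variables.
gain = r2_six_predictors - r2_three_predictors
print(f"{gain:.4f}")  # 0.0043
```

Three additional independent variables therefore account for less than half a percentage point of additional explained variation in the sample values.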
