Uploaded by ijaz afzal

Chapter 7

Regression Analysis

Solutions:

1. a. [Scatter chart: Price versus Weight]

This scatter chart indicates there may be a negative linear relationship between weight and price.

Lighter bicycles are generally expected to be more expensive, and this scatter chart is consistent with

what would be expected.

b. The following Excel output provides the estimated regression equation that could be used to estimate

the price (y) given the bicycle weight (x).

c. First we check the conditions necessary for valid inference in regression. The Excel plot of the

residuals and weight follows.

[Residual plot: Residuals versus Weight]

Because we are working with only 10 observations, assessing the conditions necessary for inference

to be valid in regression is extremely difficult. However, this scatter chart does not provide strong

evidence of a violation of the conditions, so we will proceed with our inference.

The p-value associated with the estimated regression parameter b1 is 2.14888E-05. Because this

p-value is less than the 0.05 level of significance, we reject the hypothesis that β1 = 0. We conclude

that there is a relationship between the weight and price of the bicycles, and our best estimate is that

an increase in weight of one pound corresponds to a price decrease of $1439.01. The price of a

racing bicycle is expected to increase as the weight of the bicycle decreases, so this result is

consistent with what is expected.

The estimated regression parameter b0 suggests that when the weight of a bicycle is zero pounds, the

price is $28,818.00. This result is obviously not realistic, but this parameter estimate and the test of

the hypothesis that β0 = 0 are meaningless because the y-intercept has been estimated through

extrapolation (there is no bicycle in the sample data with a weight near zero pounds).

d. The coefficient of determination r² is 0.8637, so the regression model estimated in part (b) explains

approximately 86% of the variation in the prices of the bicycles in the sample.
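
In a simple linear regression, the sample correlation coefficient can be recovered from the coefficient of determination and the sign of the slope. The following is a minimal Python sketch using the values quoted in this problem (0.8637 and b1 = -1439.01). The function name is illustrative, and the correlation itself is not reported in the solution.

```python
import math


def correlation_from_r2(r_squared, slope):
    """Sample correlation for a simple linear regression: sign(slope) * sqrt(r^2)."""
    if not 0.0 <= r_squared <= 1.0:
        raise ValueError("r_squared must lie in [0, 1]")
    # copysign attaches the sign of the slope to the square root of r^2
    return math.copysign(math.sqrt(r_squared), slope)


# Values quoted in the solution to problem 1
r = correlation_from_r2(0.8637, -1439.01)
print(round(r, 4))
```

The negative sign matches the downward-sloping pattern described for the scatter chart. This shortcut applies only to a model with a single independent variable.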

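e. Using the estimated regression equation from part (b), the predicted price for a bicycle that weighs 15 pounds is ŷ = 28,818.00 - 1439.01(15) = 7232.85,

or approximately $7233.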

2. a. The scatter chart with line speed as the independent variable follows.

[Scatter chart: Number of Defective Parts Found versus Line Speed]

This scatter chart indicates there may be a negative linear relationship between line speed and

number of defective parts found. The number of defective parts found is expected to decrease as the

line speed increases, so this scatter chart is consistent with what would be expected.

b. The following Excel output provides the estimated regression equation that could be used to predict

the number of defective parts found (y) given the line speed (x).

c. First we check the conditions necessary for valid inference in regression. The Excel plot of the

residuals and line speed follows.

[Residual plot: Residuals versus Line Speed]

Because we are working with only 6 observations, assessing the conditions necessary for inference

to be valid in regression is extremely difficult. However, this scatter chart does not provide strong

evidence of a violation of the conditions, so we will proceed with our inference.

The p-value associated with the estimated regression parameter b1 is 0.02813. Because this p-value

is greater than the 0.01 level of significance, we do not reject the hypothesis that β1 = 0. We

cannot conclude that there is a relationship between line speed and number of defective parts found at the 0.01 level of significance.
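
The outcome of this test depends on the level of significance chosen. The sketch below applies the p-value decision rule to the reported p-value of 0.02813 at the 0.01 level used here and, purely for contrast, at the 0.05 level (an assumption, since that level is not used in this problem). The function name is illustrative.

```python
def reject_null(p_value, alpha):
    """Reject the hypothesis that the slope is zero when p_value < alpha."""
    if not (0.0 <= p_value <= 1.0 and 0.0 < alpha < 1.0):
        raise ValueError("p_value must be in [0, 1] and alpha in (0, 1)")
    return p_value < alpha


p_value_b1 = 0.02813  # reported p-value for b1 in problem 2

# 0.01 is the level used in the solution; 0.05 is added only for comparison
decisions = {alpha: reject_null(p_value_b1, alpha) for alpha in (0.01, 0.05)}
print(decisions)
```

Failing to reject at the 0.01 level means the sample does not provide strong enough evidence of a relationship at that level. It is not proof that no relationship exists.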

The estimated regression parameter b0 suggests that when the line speed is zero, the number of

defective parts found is 22.1739. This result is obviously not realistic, but this parameter estimate and

the test of the hypothesis that β0 = 0 are meaningless because the y-intercept has been estimated

through extrapolation (there is no observation in the sample data with a line speed near zero).

d. The coefficient of determination r² is 0.7391, so the regression model estimated in part (b) explains

approximately 74% of the variation in the number of defective parts found in the sample.

3. a. The scatter chart with weekly usage as the independent variable follows.

[Scatter chart: Annual Maintenance Expense versus Weekly Usage (hours)]

This scatter chart indicates there may be a positive linear relationship between weekly usage and

annual maintenance expense. Annual maintenance expense is expected to increase as weekly usage

increases, and this scatter chart is consistent with what would be expected.

b. The following Excel output provides the estimated regression equation that could be used to predict

the annual maintenance expense (y) given the weekly usage (x).

c. First we check the conditions necessary for valid inference in regression. The Excel plot of the

residuals and weekly usage follows.

[Residual plot: Residuals versus Weekly Usage (hours)]

Because we are working with only 10 observations, assessing the conditions necessary for inference

to be valid in regression is extremely difficult. However, this scatter chart does not provide strong

evidence of a violation of the conditions, so we will proceed with our inference.

The p-value associated with the estimated regression parameter b1 is 0.0001. Because this p-value is

less than the 0.05 level of significance, we reject the hypothesis that β1 = 0. We conclude that there

is a relationship between weekly usage and annual maintenance expense, and our best estimate is

that a one hour increase in weekly usage corresponds to an increase of $95.34 in annual maintenance

expense. Annual maintenance expense is expected to increase as weekly usage increases, so this

result is consistent with what is expected.

The estimated regression parameter b0 suggests that when weekly usage is zero, the annual

maintenance expense is $1,052.80. This result is obviously not realistic, but this parameter estimate

and the test of the hypothesis that β0 = 0 are meaningless because the y-intercept has been estimated

through extrapolation (there is no observation in the sample data with a weekly usage near zero).

d. The coefficient of determination r² is 0.8562, so the regression model estimated in part (b) explains

approximately 86% of the variation in the values of annual maintenance expenses in the sample.

e. The mean weekly usage in our sample is 25.3. Using our simple linear regression model, for this

many hours of usage we predict the annual maintenance expense to be ŷ = 10.5280 + 0.9534(25.3) = 34.65 (in hundreds of dollars),

or approximately $3,465. Unless Jensen’s has a valid reason to expect weekly usage to be less than usual, the

company should purchase the contract.

We can use the estimated regression model to estimate the breakeven hours of weekly usage for the

proposed $3000 contract. If we set the estimated regression equation equal to 30 (the value of the

dependent variable annual maintenance expense that corresponds to the $3000 cost of the contract)

and solve for x (weekly usage)

10.5280 + 0.9534 x1 = 30

0.9534 x1 = 30 - 10.5280 = 19.4720

x1 = 19.4720 / 0.9534 = 20.4237

we find the estimated breakeven point for the contract is approximately 20.4 usage hours. If Jensen’s

believes weekly usage will exceed 20.4 hours during the life of the contract, the estimated annual

maintenance expense will exceed $3000 and Jensen’s should purchase the contract.
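
The breakeven calculation above amounts to inverting the estimated regression line. A small Python sketch, using the coefficients 10.5280 and 0.9534 from this problem (expense measured in hundreds of dollars); the function names are illustrative:

```python
def predict(b0, b1, x):
    """Point prediction from an estimated simple linear regression line."""
    return b0 + b1 * x


def breakeven_x(b0, b1, target_y):
    """Value of x at which the estimated line reaches target_y."""
    if b1 == 0:
        raise ValueError("slope is zero, so the line never reaches a new target")
    return (target_y - b0) / b1


B0, B1 = 10.5280, 0.9534  # coefficients quoted in the solution (expense in $100s)

hours = breakeven_x(B0, B1, 30.0)  # 30 corresponds to the $3000 contract
print(round(hours, 1))

# Sanity check: predicting at the breakeven point returns the target expense
print(round(predict(B0, B1, hours), 4))
```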

4. a. The scatter chart with distance to work as the independent variable follows.

[Scatter chart: Number of Days Absent versus Distance to Work (miles)]

This scatter chart indicates there may be a negative linear relationship between distance to work and

number of days absent. This is not what would be expected – an employee who lives farther from

her/his job is expected to be absent more frequently than an employee who lives closer to her/his job.

b. The following Excel output provides the estimated regression equation that could be used to predict

the number of days absent (y) given the distance to work (x).

c. First we check the conditions necessary for valid inference in regression. The Excel plot of the

residuals and distance to work follows.

[Residual plot: Residuals versus Distance to Work (miles)]

Because we are working with only 10 observations, assessing the conditions necessary for inference

to be valid in regression is extremely difficult. However, this scatter chart does not provide strong

evidence of a violation of the conditions, so we will proceed with our inference.

The 99% confidence interval for the regression parameter β1 provided in the Excel output is

(-0.6046, -0.0838). Because this interval does not include zero, we reject the hypothesis that β1 = 0.

We conclude that there is a relationship between distance to work and number of absences. Our best

estimate is that a one mile increase in the distance to work corresponds to a decrease of 0.3442 days

absent.
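
The interval-based test used here can be sketched in a few lines of Python. The sketch assumes the reported interval is the usual symmetric t-based interval, in which case its midpoint equals the point estimate of the slope. The function names are illustrative.

```python
def interval_excludes_zero(lower, upper):
    """True when a confidence interval lies entirely on one side of zero."""
    if lower > upper:
        raise ValueError("lower bound must not exceed upper bound")
    return lower > 0 or upper < 0


def interval_midpoint(lower, upper):
    """Center of the interval; for a symmetric interval this is the point estimate."""
    return (lower + upper) / 2


slope_ci = (-0.6046, -0.0838)  # 99% interval reported for the slope in problem 4

reject = interval_excludes_zero(*slope_ci)
slope_estimate = interval_midpoint(*slope_ci)
print(reject, round(slope_estimate, 4))
```

The midpoint reproduces the estimated decrease of 0.3442 days absent per additional mile quoted in the text.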

d. The 99% confidence interval for the regression parameter β0 provided in the Excel output is (5.3839,

10.8117). However, this confidence interval and the test of the hypothesis that β0 = 0 are

meaningless because the y-intercept has been estimated through extrapolation (there is no

observation in the sample data with a distance to work near zero).

e. The coefficient of determination r² is 0.7109, so the regression model estimated in part (b) explains

approximately 71% of the variation in the values of number of days absent in the sample.

5. a. The scatter chart with age of bus as the independent variable follows.

[Scatter chart: Annual Maintenance Cost ($) versus Age of Bus (Years)]

This scatter chart indicates there may be a positive linear relationship between age of bus and annual

maintenance cost. Older buses generally cost more to maintain, and this scatter chart is consistent

with what is expected.

b. The following Excel output provides the estimated regression equation that could be used to predict

the annual maintenance cost (y) given the age of the bus (x).

c. First we check the conditions necessary for valid inference in regression. The Excel plot of the

residuals and age of bus follows.

[Residual plot: Residuals versus Age of Bus (years)]

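Because we are working with only a small number of observations, assessing the conditions necessary for inference

to be valid in regression is extremely difficult. However, this scatter chart does not provide strong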

evidence of a violation of the conditions, so we will proceed with our inference.

The p-value associated with the estimated regression parameter b1 is 7.62662E-05. Because this

p-value is less than the 0.05 level of significance, we reject the hypothesis that β1 = 0. We conclude

that there is a relationship between age of bus and annual maintenance cost, and our best estimate is

that a one year increase in age of bus corresponds to an increase of $131.67 in annual maintenance

cost. Annual maintenance cost of a bus is expected to increase as the bus ages, so these results are

consistent with what is expected.

The estimated regression parameter b0 suggests that when the age of the bus is zero, the annual

maintenance expense is $220. This result is obviously not realistic, but this parameter estimate and

the test of the hypothesis that β0 = 0 are meaningless because the y-intercept has been estimated

through extrapolation (there is no observation in the sample data with a bus age near zero).

d. The coefficient of determination r² is 0.8725, so the regression model estimated in part (b) explains

approximately 87% of the variation in the values of annual maintenance cost in the sample.

e. Using this regression model, the predicted annual maintenance cost for a 3.5 year old bus is ŷ = 220 + 131.67(3.5) = 680.85,

or approximately $680.

6. a. The scatter chart with hours spent studying as the independent variable follows.

[Scatter chart: Total Points Earned versus Hours Spent Studying]

This scatter chart indicates there may be a positive linear relationship between hours spent studying

and total points earned. Students who spend more time studying generally earn more points, and this

scatter chart is consistent with what is expected.

b. The following Excel output provides the estimated regression equation showing how total points

earned (y) is related to hours spent studying (x).

c. First we check the conditions necessary for valid inference in regression. The Excel plot of the

residuals and hours spent studying follows.

[Residual plot: Residuals versus Hours Spent Studying]

The residuals at each value of hours spent studying appear to have a mean of 0, have similar

variances, and be concentrated around 0. The conditions necessary for inference to be valid in

regression appear to be satisfied, and so we will proceed with our inference.

The p-value associated with the estimated regression parameter b1 is 1.09619E-60. Because this

p-value is less than the 0.01 level of significance, we reject the hypothesis that β1 = 0. We conclude

that there is a relationship between hours spent studying and total points earned, and our best

estimate is that a one hour increase in hours spent studying corresponds to an increase of 0.8014 in

total points earned. This result is consistent with what would be expected.

The estimated regression parameter b0 suggests that when a student spends zero hours studying, the

total points earned by the student is 8.6742. This parameter estimate and the test of the hypothesis

that β0 = 0 are meaningless because the y-intercept has been estimated through extrapolation (there is

no observation in the sample data with hours of studying near zero).

d. The coefficient of determination r² is 0.8277, so the regression model estimated in part (b) explains

approximately 83% of the variation in the values of total points earned in the sample.

e. Using this regression model, the predicted total points earned by Mark (who spent 95 hours

studying) is ŷ = 8.6742 + 0.8014(95) = 84.8072,

or approximately 85 points.
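
The prediction for Mark follows directly from the estimates quoted in part (c), b0 = 8.6742 and b1 = 0.8014. A minimal Python sketch; the function name and default arguments are illustrative:

```python
def predict_points(hours, b0=8.6742, b1=0.8014):
    """Predicted total points earned for a given number of hours spent studying."""
    return b0 + b1 * hours


mark = predict_points(95)  # Mark spent 95 hours studying
print(round(mark, 4))
```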

7. a. [Scatter chart: S&P 500 versus DJIA]

This scatter chart indicates there may be a positive linear relationship between DJIA and S&P 500.

Since both indexes are used as measures of overall movement in the stock market, a positive

relationship is expected between these two variables. This scatter chart is consistent with what is

expected.

b. The following Excel output provides the estimated regression equation showing how S&P 500 (y) is

related to DJIA (x).

c. First we check the conditions necessary for valid inference in regression. The Excel plot of the

residuals and DJIA follows.

[Residual plot: Residuals versus DJIA]

Because we are working with only 15 observations, assessing the conditions necessary for inference

to be valid in regression is extremely difficult. However, this scatter chart does not provide strong

evidence of a violation of the conditions, so we will proceed with our inference.

The 95% confidence interval for the regression parameter β1 provided in the Excel output is (0.1353,

0.1792). Because this interval does not include zero, we reject the hypothesis that β1 = 0. We

conclude that there is a relationship between DJIA and S&P 500, and our best estimate is that a one

point increase in DJIA corresponds to an increase of 0.1573 points for the S&P 500. This result

appears to be reasonable.

d. The 95% confidence interval for the regression parameter β0 provided in the Excel output is

(-951.4540, -386.5885). However, this parameter estimate and confidence interval (and the

corresponding test of the hypothesis that β0 = 0) are meaningless because the y-intercept has been

estimated through extrapolation (there is no observation in the sample data with a value of DJIA near

zero).

e. The coefficient of determination r² is 0.9486, so the regression model estimated in part (b) explains

approximately 95% of the variation in the values of S&P 500 in the sample.

f. Using this regression model, when DJIA is 13,500 the predicted S&P 500 is

g. The maximum DJIA in our sample data is 13,233, so when the DJIA value of 13,500 is used to

predict the S&P 500 value in part (f) the regression model has been extrapolated beyond the

experimental region of the data, so you should be concerned about this prediction.
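
A guard of this kind can be automated before a prediction is reported. The sketch below flags a value of the independent variable that falls outside the observed range. Only the sample maximum (13,233) is stated in the solution, so the lower bound is left optional. The function name is illustrative.

```python
def is_extrapolation(x_new, x_max, x_min=None):
    """True if x_new lies outside the observed range of the independent variable."""
    if x_new > x_max:
        return True
    # The lower bound is checked only when it is known
    return x_min is not None and x_new < x_min


SAMPLE_MAX_DJIA = 13233  # largest DJIA value in the sample, as stated in the text

print(is_extrapolation(13500, SAMPLE_MAX_DJIA))  # the value used for the prediction
print(is_extrapolation(13000, SAMPLE_MAX_DJIA))  # a value below the sample maximum
```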

8. a. [Scatter chart: Price versus Miles]

This scatter chart indicates there may be a negative linear relationship between miles and price.

Since a Camry with higher miles will generally sell for a lower price, a negative relationship is

expected between these two variables. This scatter chart is consistent with what is expected.

b. The following Excel output provides the estimated regression equation showing how price (y) is

related to miles (x).

c. First we check the conditions necessary for valid inference in regression. The Excel plot of the

residuals and miles follows.

[Residual plot: Residuals versus Miles (1000s)]

Because we are working with only 19 observations, assessing the conditions necessary for inference

to be valid in regression is extremely difficult. However, this scatter chart does not provide strong

evidence of a violation of the conditions, so we will proceed with our inference.

The p-value associated with the estimated regression parameter b1 is 0.0003. Because this p-value is

less than the 0.01 level of significance, we reject the hypothesis that β1 = 0. We conclude that there

is a relationship between miles and price, and our best estimate is that a one thousand mile increase

corresponds to a decrease of $58.77 in price. This result appears to be reasonable.

The estimated regression parameter b0 suggests that if a Camry has zero miles, the predicted price is

$16,469.76. This result is obviously not realistic, but this parameter estimate and the test of the

hypothesis that β0 = 0 are meaningless because the y-intercept has been estimated through

extrapolation (there is no observation in the sample data with miles near zero).

d. The coefficient of determination r² is 0.5387, so the regression model estimated in part (b) explains

approximately 54% of the variation in the values of price in the sample.

e. Excel output for the predicted prices and residuals for the model estimated in part (b) follows:

A bargain is a Camry for which the predicted price given its miles exceeds the actual price (the price

of the automobile is less than the value of the automobile given its miles). The residuals e_i = y_i - ŷ_i for

this simple linear regression model are the differences between the actual prices and the predicted

prices, so the Camry with the largest negative residual is the best bargain. The twelfth automobile in

the sample sold for $12,500, and with 28,000 miles its predicted price is $14,824. Thus, this

automobile sold for $2,324 less than the predicted price for a Camry with 28,000 miles. The fourth

automobile in the data, which has 47,000 miles, was almost as big a bargain, selling for $2,207 less

than the predicted price for a Camry with 47,000 miles.
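
The search for the best bargain can be sketched as a search for the most negative residual. The Python example below uses the coefficients quoted in part (c) (b0 = 16,469.76 dollars, b1 = -58.77 dollars per thousand miles) and only the two automobiles discussed above. The $11,500 price for the fourth automobile is an assumption inferred from the reported $2,207 shortfall, not a value given in the text, and the labels are illustrative.

```python
B0, B1 = 16469.76, -58.77  # dollars, and dollars per 1000 miles

cars = [
    # (label, miles in 1000s, actual price in dollars)
    ("car 12", 28, 12500),
    ("car 4", 47, 11500),  # assumed price, inferred from the reported shortfall
]


def residual(miles_k, price):
    """Actual price minus the price predicted from miles (in thousands)."""
    return price - (B0 + B1 * miles_k)


residuals = {label: residual(miles_k, price) for label, miles_k, price in cars}

# The best bargain is the car with the most negative residual
best = min(residuals, key=residuals.get)
print(best, round(residuals[best], 2))
```

With the full sample, the same dictionary would simply hold one entry per automobile.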

f. Using the estimated regression equation developed in part (b), the predicted mean price for a

previously owned 2007 Camry that has been driven 60,000 miles is ŷ = 16,469.76 - 58.77(60) = 12,943.56,

or approximately $12,943. Depending on other factors not considered in the model (various options, the physical

condition of the body and interior, etc.), this is a reasonable price to expect to pay for a Camry that

has been driven 60,000 miles.

9. a. The following Excel output provides the estimated regression equation showing how weekly gross

revenue (y) is related to the amount of television advertising (x).

Before performing any hypothesis tests on the results, we check the conditions necessary for valid

inference in regression. The Excel plot of the residuals and television advertising follows.

[Residual plot: Residuals versus Television Advertising (100s)]

The variance in the residuals is possibly increasing as the amount of television advertising increases,

which gives some cause for concern. However, when working with few observations (in this case,

we are working with only 8 observations), assessing the conditions necessary for inference to be

valid in regression is extremely difficult, and inference is usually performed unless there is evidence

of an extreme violation of one or more of the conditions necessary for valid inference. This scatter

chart does not provide strong evidence of a violation of the conditions, so we will proceed with our

inference.

The p-value associated with the estimated regression parameter b1 is 0.0339. Because this p-value is

less than the 0.05 level of significance, we reject the hypothesis that β1 = 0. We conclude that there

is a relationship between television advertising and weekly gross revenue at the 0.05 level of

significance, and our best estimate is that a $100 increase in television advertising corresponds to an

increase of $4006.40 in weekly gross revenue. This result appears to be reasonable.

b. The coefficient of determination r² is 0.5552, so the regression model estimated in part (a) explains

approximately 56% of the variation in the values of weekly gross revenue in the sample.

c. The following Excel output provides the estimated regression equation with both television

advertising (x1) and newspaper advertising (x2) as the independent variables.

First we check the conditions necessary for valid inference in regression. The Excel plots of the

residuals and each of the two independent variables follow.

[Residual plot: Residuals versus Television Advertising (100s)]

Because we are working with only 8 observations, assessing the conditions necessary for inference

to be valid in regression is extremely difficult. This scatter chart of the residuals versus television

advertising does not provide strong evidence of a violation of the conditions.

[Residual plot: Residuals versus Newspaper Advertising (100s)]

Similarly, this scatter chart of the residuals versus newspaper advertising does not provide strong

evidence of a violation of the conditions, so we will proceed with our inference.

The p-value associated with the estimated regression parameter b1 is 0.0252. Because this p-value is

less than the 0.05 level of significance, we reject the hypothesis that β1 = 0. We conclude that there

is a relationship between television advertising and weekly gross revenue at the 0.05 level of

significance, and our best estimate is that if we hold newspaper advertising constant, a $100 increase

in television advertising corresponds to an increase of $2240.22 in weekly gross revenue. This result

appears to be reasonable.

The p-value associated with the estimated regression parameter b2 is 0.0033. Because this p-value is

less than the 0.05 level of significance, we reject the hypothesis that β2 = 0. We conclude that there

is a relationship between newspaper advertising and weekly gross revenue at the 0.05 level of

significance, and our best estimate is that if we hold television advertising constant, a $100 increase

in newspaper advertising corresponds to an increase of $1949.86 in weekly gross revenue. This

result also appears to be reasonable.

The estimated regression parameter b0 suggests that when television advertising and newspaper

advertising are both zero, the predicted weekly gross revenue is -$4256.96. This result is obviously

not realistic, but this parameter estimate and the test of the hypothesis that β0 = 0 are meaningless

because the y-intercept has been estimated through extrapolation (there is no observation in the

sample data for which both television advertising and newspaper advertising are near zero).

d. The coefficient of determination R² is 0.9322, so the regression model estimated in part (c) explains

approximately 93% of the variation in the values of weekly gross revenue in the sample.

e. For the multiple linear regression model estimated in part (c), the overall regression relationship is

significant, the estimated regression coefficients b1 and b2 are significant and consistent with what

would be expected, and this model explains approximately 38 percentage points more of the variation in the values of

weekly gross revenue in the sample (0.9322 versus 0.5552) than the simple linear regression model estimated in part (a). The

model estimated in part (c) is satisfactory and is superior to the model estimated in part (a).
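
Because adding an independent variable never lowers the coefficient of determination, one common complement to this comparison is the adjusted coefficient of determination, which penalizes extra variables. The solution does not report it. The sketch below applies the standard formula to the two values quoted in this problem with n = 8 observations, and the function name is illustrative.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for a model with k independent variables and n observations."""
    if n - k - 1 <= 0:
        raise ValueError("need more observations than estimated parameters")
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


N_OBS = 8  # number of observations stated in the solution

simple = adjusted_r2(0.5552, N_OBS, k=1)    # television advertising only
multiple = adjusted_r2(0.9322, N_OBS, k=2)  # television and newspaper advertising
gain = 0.9322 - 0.5552                      # raw gain in explained variation

print(round(simple, 4), round(multiple, 4), round(gain, 4))
```

Under this measure the two-variable model still comes out well ahead, which is consistent with the conclusion that it is the superior model.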

f. Management can feel confident that increased spending on both television and newspaper

advertising coincides with increased weekly gross revenue. The results also suggest that television

advertising may be slightly more effective than newspaper advertising in generating revenue.

10. a. The following Excel output provides the estimated multiple linear regression equation with comfort

(x1), amenities (x2), and in-house dining (x3) as the independent variables.

The estimated multiple linear regression equation is ŷ = 35.6967 + 0.1093x1 + 0.2443x2 + 0.2474x3.

b. Before performing any hypothesis tests on the results, we check the conditions necessary for valid

inference in regression. The Excel plots of the residuals and comfort, amenities, and in-house dining

follow.

[Residual plots: Residuals versus Comfort; Residuals versus Amenities; Residuals versus In-House Dining]

None of these scatter charts provide strong evidence of a violation of the conditions necessary for

valid inference in regression, so we will proceed with our inference.

The p-value associated with the estimated regression parameter b1 is 0.4117. Because this p-value is

greater than the 0.01 level of significance, we do not reject the hypothesis that β1 = 0. We cannot conclude

that there is a relationship between the score on comfort and the overall score at the 0.01 level of

significance when controlling for amenities and in-house dining.

The p-value associated with the estimated regression parameter b2 is 3.69454E-05. Because this

p-value is less than the 0.01 level of significance, we reject the hypothesis that β2 = 0. We conclude

that there is a relationship between the score on amenities and the overall score at the 0.01 level of

significance, and our best estimate is that if we hold the scores on comfort and in-house dining

constant, a one point increase in the score on amenities corresponds to an increase of 0.2443 in

overall score.

The p-value associated with the estimated regression parameter b3 is 0.0011. Because this p-value is

less than the 0.01 level of significance, we reject the hypothesis that β3 = 0. We conclude that there

is a relationship between the score on in-house dining and the overall score at the 0.01 level of

significance, and our best estimate is that if we hold the scores on comfort and amenities constant, a

one point increase in the score on in-house dining corresponds to an increase of 0.2474 in overall

score.

If ratings for comfort, amenities, and in-house dining are related to overall score, the relationships

are expected to be positive. The results are consistent with expectations for all three relationships.
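
The "holding the other scores constant" interpretation can be illustrated with the estimated equation from part (a). The Python sketch below takes every coefficient as positive, in line with the expected signs discussed above. The input scores (95, 90, 90) are hypothetical values chosen only for illustration, and the names are illustrative.

```python
# Coefficients from the estimated equation in part (a); positive signs are assumed
INTERCEPT = 35.6967
B_COMFORT, B_AMENITIES, B_DINING = 0.1093, 0.2443, 0.2474


def predict_overall(comfort, amenities, dining):
    """Predicted overall score from the three component scores."""
    return (INTERCEPT
            + B_COMFORT * comfort
            + B_AMENITIES * amenities
            + B_DINING * dining)


# Hypothetical scores, not taken from the data set
base = predict_overall(95, 90, 90)
bumped = predict_overall(95, 91, 90)  # amenities raised by one point, others fixed

print(round(base, 4), round(bumped - base, 4))
```

The change in the prediction equals the amenities coefficient, which is what the one-point interpretation in the text describes.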

c. The following Excel output provides the estimated multiple linear regression equation with

amenities (x1), and in-house dining (x2) as the independent variables.

A review of the Excel plots of the residuals and the two independent variables that follow shows no

dramatic departures from the conditions necessary for valid inference in regression. We can proceed

with inference.

[Residual plots: Residuals versus Amenities; Residuals versus In-House Dining]

The p-value associated with the estimated regression parameter b1 (which now corresponds to

amenities) is 1.32524E-05. Because this p-value is less than the 0.01 level of significance, we reject

the hypothesis that β1 = 0. We conclude that there is a relationship between the score on amenities

and the overall score at the 0.01 level of significance, and our best estimate is that if we hold the

scores on in-house dining constant, a one point increase in the score on amenities corresponds to an

increase of 0.2526 on overall score.

The p-value associated with the estimated regression parameter b2 (which now corresponds to

in-house dining) is 0.0009. Because this p-value is less than the 0.01 level of significance, we reject the

hypothesis that β2 = 0. We conclude that there is a relationship between the score on in-house dining

and the overall score at the 0.01 level of significance, and our best estimate is that if we hold the

scores on amenities constant, a one point increase in the score on in-house dining corresponds to an

increase of 0.2483 in overall score.

For this multiple linear regression model, the overall regression relationship is significant, and the

estimated regression coefficients b1 and b2 are significant and consistent with what would be

expected. Furthermore, this model has a coefficient of determination of R² = 0.7387. The model with

all three independent variables (comfort, amenities, and in-house dining) from part (a) has a multiple

coefficient of determination of R² = 0.7498, and so the three-variable model explains little more than 1% more of

the variation in overall ratings in the sample than does the model that includes only the independent

variables amenities and in-house dining (that is, removing comfort as an independent variable

resulted in a loss of little more than 1% of the explained variation in overall score). Thus, the simpler

multiple regression model developed in part (c) is preferred.

11. a. The following Excel output provides the estimated multiple linear regression equation with

satisfaction with trade price (x1) and satisfaction with speed of execution (x2) as the independent variables.

The coefficient of determination R² is 0.6827, so this regression model explains approximately 68%

of the variation in the values of overall satisfaction in the sample.

b. Before testing any hypotheses, we check the conditions necessary for valid inference in regression.

The Excel plots of the residuals and each of the two independent variables follow.

[Residual plot: Residuals versus Satisfaction with Trade Price]

Because we are working with only 14 observations, assessing the conditions necessary for inference

to be valid in regression is extremely difficult. However, this scatter chart does not provide strong

evidence of a violation of the conditions.

[Residual plot: residuals versus satisfaction with speed of execution]

Similarly, this scatter chart does not provide strong evidence of a violation of the conditions, so we

will proceed with our inference.

The p-value associated with the estimated regression parameter b1 is 0.0357. Because this p-value is

less than the 0.05 level of significance, we reject the hypothesis that β1 = 0. We conclude that there

is a relationship between satisfaction with trade price and overall satisfaction with the electronic

trade at the 0.05 level of significance.

The p-value associated with the estimated regression parameter b2 is 0.0006. Because this p-value is

less than the 0.05 level of significance, we reject the hypothesis that β2 = 0. We conclude that there

is a relationship between satisfaction with speed of execution and overall satisfaction with the

electronic trade at the 0.05 level of significance.
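
The decision rule applied to both coefficients above can be sketched in a few lines of Python. This is only an illustration: the function name reject_null is hypothetical, and the p-values are the ones quoted from the Excel output.

```python
def reject_null(p_value: float, alpha: float = 0.05) -> bool:
    """Return True when the p-value is below the level of significance."""
    return p_value < alpha

# p-values quoted above for b1 (trade price) and b2 (speed of execution)
p_values = {"b1": 0.0357, "b2": 0.0006}
decisions = {name: reject_null(p) for name, p in p_values.items()}
print(decisions)
```

At a stricter 0.01 level, reject_null(0.0357, alpha=0.01) returns False, so the conclusion for b1 depends on the chosen level of significance.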

c. With regard to the relationship between satisfaction with trade price and overall satisfaction with the

electronic trade, we estimate that if we hold satisfaction with speed of execution constant, a 1 point

increase in satisfaction with trade price corresponds to an increase of 0.5580 in overall satisfaction

with the electronic trade.

With regard to the relationship between satisfaction with speed of execution and overall satisfaction with

the electronic trade, we estimate that if we hold satisfaction with trade price constant, a 1 point

increase in satisfaction with speed of execution corresponds to an increase of 0.7342 in overall

satisfaction with the electronic trade.

Overall satisfaction with the electronic trade should generally increase as satisfaction with trade

price increases and as satisfaction with speed of execution increases. Both of the estimated

relationships in this multiple linear regression model are consistent with what would be expected.

d. If Finger Lakes Investments can achieve its performance level goals (a satisfactory service level
(3) for both trade price and speed of execution), its predicted overall satisfaction level will be
approximately 3.1.

e. The possible responses (scores) for each question were no opinion (0), unsatisfied (1), somewhat

satisfied (2), satisfied (3), and very satisfied (4). The responses unsatisfied, somewhat satisfied,

satisfied, and very satisfied each represent a degree of satisfaction. However, the response no

opinion does not represent a degree of satisfaction and should not be part of this scale. Giving the no

opinion response a value of zero is not appropriate.

12. a. The following Excel output provides the estimated simple linear regression equation that could be

used to predict the percentage of games won (y) given the average number of passing yards per

attempt (x).

The estimated simple linear regression equation is ŷ = -58.7703 + 16.3906x, and the coefficient of

determination r2 is 0.5771, so this regression model explains approximately 58% of the variation in

the sample values of percentage of games won.

b. The following Excel output provides the estimated simple linear regression equation that could be

used to predict the percentage of games won (y) given the number of interceptions thrown per

attempt (x).

The estimated simple linear regression equation is ŷ = 97.5383 - 1600.4909x, and the coefficient of

determination r2 is 0.4379, so this regression model explains approximately 44% of the variation in

the sample values of percentage of games won.

c. The following Excel output provides the estimated multiple linear regression equation that could be

used to predict the percentage of games won (y) given the average number of passing yards per

attempt (x1) and the number of interceptions thrown per attempt (x2).

The estimated multiple linear regression equation is ŷ = -5.7633 + 12.9494x1 - 1083.7880x2, and

the coefficient of determination R2 is 0.7525, so this regression model explains approximately 75%

of the variation in the sample values of percentage of games won.

d. Using the estimated regression equation developed in part (c), the predicted percentage of games
won by the Kansas City Chiefs for the 2011 season (during which the Kansas City Chiefs' average
number of passing yards per attempt was 6.2 and the number of interceptions thrown per attempt
was 0.036) is

ŷ = -5.7633 + 12.9494(6.2) - 1083.7880(0.036) = 35.5066

or 35.51%. During the 2011 season the Kansas City Chiefs won 43.75% of its games (recall the
team's record for the 2011 season was 7 wins and 9 losses), and so the team performed better than
what we would predict for a team with an average number of passing yards per attempt of 6.2 and
number of interceptions thrown per attempt of 0.036.
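
The comparison in part (d) can be reproduced with a short Python sketch. The function name predict_win_pct is hypothetical, and the coefficient signs are an assumption inferred from the reported prediction of 35.51%, since the Excel output itself is not reproduced here.

```python
# Coefficients of the part (c) model. The signs are inferred from the
# reported prediction of 35.51%; they are an assumption, not Excel output.
B0, B1, B2 = -5.7633, 12.9494, -1083.7880

def predict_win_pct(yards_per_attempt: float, int_per_attempt: float) -> float:
    """Predicted percentage of games won."""
    return B0 + B1 * yards_per_attempt + B2 * int_per_attempt

predicted = predict_win_pct(6.2, 0.036)
actual = 7 / 16 * 100          # 7 wins in 16 games
residual = actual - predicted  # positive: the team did better than predicted
print(round(predicted, 2), round(actual, 2), round(residual, 2))
```

The positive residual of roughly 8 percentage points is the numerical counterpart of the statement that the team performed better than predicted.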

e. The estimated simple linear regression equation that uses only the average number of passing yards

per attempt as the independent variable to predict the percentage of games won has a coefficient of

determination of r2 = 0.5771, and the estimated multiple linear regression equation that uses both the

average number of passing yards per attempt and the number of interceptions thrown per attempt as

the independent variables to predict the percentage of games won has a coefficient of determination

of R2 = 0.7525. The multiple linear regression model fits the data better, as it explains over 17%

more of the variation in the percentage of games won than did the simple linear regression.
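
As a quick check of the "over 17%" figure, the three coefficients of determination reported in parts (a) through (c) can be compared directly. The dictionary keys below are informal labels chosen for this sketch.

```python
# Coefficients of determination reported in parts (a), (b), and (c)
r_squared = {
    "yards per attempt only": 0.5771,
    "interceptions per attempt only": 0.4379,
    "both variables": 0.7525,
}

gain = r_squared["both variables"] - r_squared["yards per attempt only"]
print(f"additional share of variation explained: {gain:.4f}")
```

The difference is about 0.175, that is, roughly 17.5 percentage points of the variation in the sample.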

13. a. The following Excel output provides the estimated simple linear regression equation that could be

used to predict repair time (y) given the number of months since the last maintenance service (x).

Before testing the hypotheses of no relationship between repair time and the number of months since

the last maintenance service, we check the conditions necessary for valid inference in regression.

The Excel plot of the residuals and the number of months since the last maintenance service follows.

[Residual plot: residuals versus months since last service]

Because we are working with only 10 observations, assessing the conditions necessary for inference

to be valid in regression is extremely difficult. However, this scatter chart does not provide strong

evidence of a violation of the conditions, so we will proceed with our inference. Since the level of

significance for use in hypothesis testing has not been given, we will use the standard 0.05 level

throughout this problem.

The p-value associated with the estimated regression parameter b1 is 0.0163. Because this p-value is

less than the 0.05 level of significance, we reject the hypothesis that β1 = 0. We conclude that there

is a relationship between repair time and the number of months since the last maintenance service at

the 0.05 level of significance, and our best estimate is that a 1 month increase in number of months

since the last maintenance service corresponds to an increase of 0.3041 hours in repair time.

The coefficient of determination r2 is 0.5342, so the regression model explains approximately 53%

of the variation in the values of repair time in the sample.

b. The predicted repair time and residual for each of the ten repairs in the data, provided in the Excel

output and sorted by residual, are provided in the following table.

[Table: repair time in hours, months since last service, type of repair, repairperson, predicted repair time in hours, and residual for each of the ten repairs]

Mechanical repairs generally have negative residuals and electrical repairs generally have positive

residuals. The residuals ei = yi - ŷi for this simple linear regression model are the differences

between the actual repair times and the predicted repair times, so mechanical repairs tend to take less

time than predicted and electrical repairs generally take more time than predicted.
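
The idea of grouping residuals by a categorical variable can be sketched as follows. The data below are hypothetical and are NOT the data from this problem; the fit uses the standard least-squares formulas for simple linear regression, and all variable names are chosen for this illustration only.

```python
from statistics import mean

# Hypothetical data, not the problem's data set: (months, type, hours)
repairs = [
    (2, "electrical", 2.9), (6, "mechanical", 3.0), (8, "electrical", 4.8),
    (3, "mechanical", 1.8), (7, "electrical", 4.4), (9, "mechanical", 4.2),
]

x = [m for m, _, _ in repairs]
y = [h for _, _, h in repairs]
x_bar, y_bar = mean(x), mean(y)

# Least-squares slope and intercept for simple linear regression
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
b0 = y_bar - b1 * x_bar

# Residual e_i = y_i - y_hat_i, collected by type of repair
residuals_by_type = {}
for months, kind, hours in repairs:
    residuals_by_type.setdefault(kind, []).append(hours - (b0 + b1 * months))

for kind, res in residuals_by_type.items():
    print(kind, round(mean(res), 3))
```

When one group has a clearly positive mean residual and the other a clearly negative one, as in this made-up example, a dummy variable for the group is a natural candidate for the model.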

Two of the repairs made by Donna Newton have large negative residuals indicating that the model

greatly overestimated the amount of time that these repairs would take. On the other hand, repairs

made by Bob Jones typically have positive residuals, indicating that repairs made by Bob Jones

generally take more time than predicted.

These results suggest that using dummy variables to represent the type of repair (mechanical or

electrical) and repairperson (Donna Newton or Bob Jones) may enhance the fit of the regression

model.

The scatter chart of months since last service and repair time in hours for which the points

representing electrical and mechanical repairs are shown with different shapes and colors follows.

[Scatter chart: repair time in hours versus months since last service, with electrical and mechanical repairs plotted as separate series]

This chart suggests that electrical repairs generally take longer than mechanical repairs, and so using

dummy variables to represent the type of repair (mechanical or electrical) may enhance the fit of the

regression model.

The scatter chart of months since last service and repair time in hours for which the points

representing repairs by Bob Jones and Donna Newton are shown with different shapes and colors

follows.

[Scatter chart: repair time in hours versus months since last service, with repairs by Bob Jones and Donna Newton plotted as separate series]

This chart suggests that repairs made by Bob Jones generally take longer than repairs made by

Donna Newton, and so using dummy variables to represent the repairperson (Donna Newton or Bob

Jones) may enhance the fit of the regression model.

c. The following Excel output provides the estimated multiple linear regression equation that could be

used to predict repair time given the number of months since the last maintenance service (x1) and

the type of repair (mechanical or electrical, x2).

Before testing the hypotheses of no relationship between repair time and the independent variables in

this model, we check the conditions necessary for valid inference in regression. Excel plots of the

residuals with each independent variable in this model follow.

[Residual plot: residuals versus months since last service]

[Residual plot: residuals versus type of repair dummy]

Because we are working with only 10 observations, assessing the conditions necessary for inference

to be valid in regression is extremely difficult. However, these scatter charts do not provide strong

evidence of a violation of the conditions. We also note that the parameter estimate and associated p-

value corresponding to months since last service do not change substantially when the dummy

variable for type of repair is introduced into the model. This suggests that multicollinearity is not an

issue for this regression model. We will therefore proceed with our inferences.

The p-value associated with the estimated regression parameter b1 is 0.0004. Because this p-value is

less than the 0.05 level of significance, we reject the hypothesis that β1 = 0. We conclude that there

is a relationship between number of months since the last maintenance service and repair time at the

0.05 level of significance. We estimate that holding the type of repair constant, a 1 month increase in

number of months since the last maintenance service corresponds with an increase of 0.3876 hours

in repair time.

The p-value associated with the estimated regression parameter b2 is 0.0051. Because this p-value is

less than the 0.05 level of significance, we reject the hypothesis that β2 = 0. We conclude that there

is a relationship between the type of repair and repair time at the 0.05 level of significance. We

estimate that holding the number of months since the last maintenance service constant, an electrical

repair takes 1.2627 hours longer than a mechanical repair.

The y-intercept for this model has been estimated through extrapolation and so does not have a

meaningful interpretation.

The regression model explains approximately 86% of the variation in the values of repair time in the sample.
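
The interpretation of the dummy-variable coefficient in part (c) can be checked numerically. In the sketch below the slope estimates are the ones quoted above, but the intercept is a placeholder because it is not reported in this text, and the coding of the dummy (1 for electrical, 0 for mechanical) is an assumption consistent with the stated interpretation.

```python
B1_MONTHS = 0.3876   # hours per additional month, from part (c)
B2_TYPE = 1.2627     # electrical minus mechanical, from part (c)

def predicted_hours(months: float, electrical: bool, intercept: float) -> float:
    # Assumed coding: dummy = 1 for electrical, 0 for mechanical.
    return intercept + B1_MONTHS * months + B2_TYPE * (1 if electrical else 0)

# Placeholder intercept: the estimated value is not given in this text.
b0 = 1.0
for months in (2, 6):
    gap = predicted_hours(months, True, b0) - predicted_hours(months, False, b0)
    print(months, round(gap, 4))
```

Whatever the intercept and the number of months, the gap between the two predictions equals the dummy coefficient, which is why b2 is read as the difference between electrical and mechanical repairs with months held constant.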

d. The following Excel output provides the estimated multiple linear regression equation that could be

used to predict repair time given the number of months since the last maintenance service (x1) and

the repairperson (Bob Jones or Donna Newton, x2).

Before testing the hypotheses of no relationship between repair time and the independent variables in

this model, we check the conditions necessary for valid inference in regression. Excel plots of the

residuals with each independent variable in this model follow.

[Residual plot: residuals versus months since last service]

[Residual plot: residuals versus repairperson dummy]

Because we are working with only 10 observations, assessing the conditions necessary for inference

to be valid in regression is extremely difficult. However, these scatter charts do not provide strong

evidence of a violation of the conditions. We also note that the parameter estimate and associated p-

value corresponding to months since last service change substantially when the dummy variable for

the repairperson is introduced into the model. This suggests that multicollinearity is possibly an issue

for this regression model. We will keep this result in mind as we proceed with our inferences.

The estimated regression parameter b1 implies that holding the repairperson constant, a 1 month

increase in number of months since the last maintenance service corresponds to an increase of

0.1519 hours in repair time. However, the p-value associated with the parameter estimate is 0.25671,

which exceeds the 0.05 level of significance and so leads us to not reject the hypothesis that

β1 = 0. Based on this multiple linear regression model, we conclude that, holding the repairperson

constant, there is no relationship between months since the last maintenance service and repair time.

We note, however, that we have evidence that the independent variables repairperson and months

since the last maintenance service are related, which may explain why months since the last

maintenance service is not statistically significant in this model.

The estimated regression parameter b2 implies that holding the number of months since the last

maintenance service constant, Donna Newton takes 1.0835 hours less than Bob Jones to make a

repair. However, the p-value associated with the parameter estimate is 0.1165, which exceeds the

0.05 level of significance and so leads us to not reject the hypothesis that β2 = 0. Based on

this multiple linear regression model, we conclude that, holding the months since the last

maintenance service constant, there is no difference in repair times for Donna Newton and Bob

Jones. Again we note that we have evidence that the independent variables repairperson and months

since the last maintenance service are related, which may explain why the repairperson dummy

variable is not statistically significant in this model.

The y-intercept for this model has been estimated through extrapolation and so does not have a

meaningful interpretation.

The regression model explains approximately 68% of the variation in the values of repair time in the sample.

e. The following Excel output provides the estimated multiple linear regression equation that could be

used to predict repair time given the number of months since the last maintenance service (x1), type

of repair (mechanical or electrical, x2), and the repairperson (Bob Jones or Donna Newton, x3).

The estimated multiple linear regression equation is ŷ = 1.8602 + 0.2914x1 + 1.1024x2 - 0.6091x3.

Before testing the hypotheses of no relationship between repair time and the number of months since

the last maintenance service, we check the conditions necessary for valid inference in regression.

The Excel plots of the residuals with each independent variable follow.

[Residual plot: residuals versus months since last service]

[Residual plot: residuals versus type of repair dummy]

[Residual plot: residuals versus repairperson dummy]

Because we are working with only 10 observations, assessing the conditions necessary for inference

to be valid in regression is extremely difficult. However, these scatter charts do not provide strong

evidence of a violation of the conditions.

To check for the potential introduction of multicollinearity that may occur when we add the dummy
variables to the model, we compare the parameter estimates and p-values from the model that
includes only the number of months since the last maintenance service and the type of repair dummy
variable (from part c) with those from the model that includes all three independent variables (from
part e). When making these comparisons we observe that the parameter estimates and p-values
associated with the number of months since the last maintenance service and the type of repair
dummy variable do not change substantially when the dummy variable for repairperson is introduced
into or removed from the model.

We also compare the parameter estimates and p-values from the model that includes only the
number of months since the last maintenance service and the repairperson dummy variable (from
part d) with those from the model that includes all three independent variables (from part e). When
making these comparisons we observe that i) the parameter estimates and p-values associated with
the number of months since the last maintenance service change substantially and ii) the parameter
estimates and p-values associated with the repairperson dummy variable do not change substantially
when the dummy variable for type of repair is introduced into or removed from the model. These
results suggest that multicollinearity between the number of months since the last maintenance
service and the repairperson dummy variable may be an issue for this regression model. We will keep
this in mind as we proceed with our inferences.
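
The comparisons described above can be made concrete by computing how much the estimate for months since last service moves between models. The estimates and p-values below are the ones quoted in parts (c), (d), and (e); the helper pct_change and the dictionary labels are hypothetical names for this sketch, and what counts as a substantial change remains a judgment call.

```python
# (estimate, p-value) for months since last service, as quoted in the text
months_coef = {
    "c: months + type": (0.3876, 0.0004),
    "d: months + repairperson": (0.1519, 0.25671),
    "e: all three": (0.2914, 0.0130),
}

def pct_change(old: float, new: float) -> float:
    """Percentage change from old to new."""
    return (new - old) / old * 100

full_est = months_coef["e: all three"][0]
changes = {
    name: pct_change(est, full_est)
    for name, (est, _) in months_coef.items()
    if name != "e: all three"
}
for name, change in changes.items():
    print(f"{name} -> e: {change:+.1f}%")
```

The estimate moves far more between the part (d) and part (e) models than between the part (c) and part (e) models, which matches the pattern described in the two paragraphs above.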

The p-value associated with the estimated regression parameter b1 is 0.0130. Because this p-value is
less than the 0.05 level of significance, we reject the hypothesis that β1 = 0. We conclude that there
is a relationship between number of months since the last maintenance service and repair time at the
0.05 level of significance. We estimate that holding the type of repair and repairperson constant, a
1 month increase in number of months since the last maintenance service corresponds to an increase
of 0.2914 hours in repair time.

The p-value associated with the estimated regression parameter b2 is 0.0109. Because this p-value is
less than the 0.05 level of significance, we reject the hypothesis that β2 = 0. We conclude that there
is a relationship between the type of repair and repair time at the 0.05 level of significance. We
estimate that holding the number of months since the last maintenance service and the
repairperson (Donna Newton or Bob Jones) constant, an electrical repair takes 1.1024 hours longer
than a mechanical repair.

Furthermore, the estimated regression parameter b3 implies that holding the number of months since
the last maintenance service and the type of repair (mechanical or electrical) constant, Donna Newton
takes 0.6091 hours less than Bob Jones to make a repair. However, the p-value associated with the
parameter estimate is 0.1674, which exceeds the 0.05 level of significance and so leads us to not
reject the hypothesis that β3 = 0. Based on this multiple linear regression model, we
conclude that there is no difference in repair times for Bob Jones and Donna Newton.

Finally, the y-intercept for this model has been estimated through extrapolation and so does not have

a meaningful interpretation.

The coefficient of determination R2 = 0.9002, so the regression model explains approximately 90%

of the variation in the values of repair time in the sample.

f. In the model from part (c) that includes the number of months since the last maintenance service
and the type of repair (mechanical or electrical), we found a significant relationship between each of
the independent variables in the model and the dependent variable, and each of these relationships is
what would be expected. Furthermore, although the model from part (e) with all three
independent variables has the highest R2 and so explains the greatest proportion of
variation in the sample repair times, the R2 for the multiple linear regression model from part (c) is
only moderately smaller. Note that the model from part (c) includes the number of months since the
last maintenance service and the type of repair as independent variables, and to build the model in
part (e) we have added the repairperson dummy variable to the model in part (c). Because of the
multicollinearity between number of months since the last maintenance service and the repairperson
dummy variable, adding the repairperson dummy variable to a model that already includes
the number of months since the last maintenance service will do little to enhance the ability of the
model to explain variation in the dependent variable repair time.

We want to select the simplest model that works well, and so the preferred multiple linear regression

model is the model in part (c) that includes the number of months since the last maintenance service

and the type of repair (mechanical or electrical).

14. a. The following Excel output provides the estimated multiple linear regression equation that could be

used to predict delay given the industry dummy variable (x1), the public dummy variable (x2), quality

(x3), and finished (x4).

b. The regression model explains approximately 38% of the variation in the values of delay in the
sample. Other independent variables that you could include in this regression model to improve the
fit include the amount of taxes reported by the company that is being audited and what type of audit
(Taxpayer Compliance Measurement Program Audit, IRS Correspondence, IRS Office Audit, or IRS
Field Audit) is being conducted.

c. Before testing any hypotheses about this regression model, we check the conditions necessary for

valid inference in regression. Excel plots of the residuals and each of the independent variables

follow.

[Residual plot: residuals versus industry]

[Residual plot: residuals versus public]

[Residual plot: residuals versus quality]

[Residual plot: residuals versus finished]

The residuals appear to have a relatively constant variance across the values of each independent

variable and do not appear to be badly skewed for any variable. However, the mean of the residuals

possibly differs from zero at several values of each of the quantitative independent variables (quality

and finished). A closer look at these scatter charts suggests that both quality and finished may have

a nonlinear relationship with delay. We will keep these findings in mind as we proceed.

In checking for multicollinearity, we first calculate the correlation coefficient r for the quantitative

independent variables (quality and finished) to determine if our quantitative variables are strongly

correlated. We note that the correlation between quality and finished is 0.0356, which indicates

that multicollinearity between the quantitative variables is not a concern.
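
For readers who want to see how such a correlation coefficient is obtained, a minimal sketch of the sample correlation calculation follows. The scores below are hypothetical and are not the audit data, so the result will not equal 0.0356; the function name pearson_r is chosen for this illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Sample correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical scores, not the audit data set
quality = [1, 2, 3, 4, 5, 3, 2, 4]
finished = [2, 4, 1, 3, 2, 4, 1, 3]
r = pearson_r(quality, finished)
print(round(r, 4))
```

A value of r close to zero, as reported for quality and finished, indicates little linear association between the two independent variables.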

Next we rerun the regression after removing the industry dummy variable (x1) from the original

model and we compare the parameter estimates and associated p-values for each of the remaining

independent variables to the parameter estimates and associated p-values for the original model.

When making these comparisons we observe that these values do not change substantially when the

dummy variable for industry is introduced into or removed from the model and conclude that the

industry dummy variable does not create a problem with multicollinearity.

Finally, we rerun the regression after removing the public dummy variable (x2) from the original

model and compare the parameter estimates and associated p-values for each of the remaining

independent variables to the parameter estimates and associated p-values for the original model.

When making these comparisons we observe that these values do not change substantially when the

dummy variable for whether the company is publicly traded is introduced into or removed from the

model and conclude that the public dummy variable does not create a problem with

multicollinearity.

Our results suggest that multicollinearity is not an issue for this regression model. We will therefore

proceed with our inferences.

The p-value for the test of the hypothesis that β1 = 0 is 0.0034. Because this p-value is less than the

0.05 level of significance, we reject the hypothesis that β1 = 0, and conclude that there is a difference

in delay between the industries at the 0.05 level of significance. We estimate that, holding the values

of public, quality, and finished constant, the delay experienced by an industrial company is 11.9442

days longer than the delay experienced by a bank, savings and loan, or insurance company.

The p-value for the test of the hypothesis that β2 = 0 is 0.2625. Because this p-value is greater than

the 0.05 level of significance, we do not reject the hypothesis that β2 = 0, and we conclude that there

is no difference in the delays whether the company was traded on an organized exchange or over the

counter when controlling for industry, quality, and finished.

The p-value for the test of the hypothesis that β3 = 0 is 0.0332. Because this p-value is less than the

0.05 level of significance, we reject the hypothesis that β3 = 0, and conclude that there is a

relationship between delay and quality at the 0.05 level of significance. We estimate that, holding the

values of industry, public, and finished constant, when the overall quality of internal controls (as

judged by the auditor) increases by one point the delay decreases by 2.6236 days.

The p-value for the test of the hypothesis that β4 = 0 is 0.0345. Because this p-value is less than the
0.05 level of significance, we reject the hypothesis that β4 = 0, and conclude that there is a
relationship between delay and finished at the 0.05 level of significance. We estimate that, holding
the values of industry, public, and quality constant, when finished (as judged by the auditor)
increases by one point the delay decreases by 4.0725 days.

d. Since we did not reject the hypothesis β2 = 0 in the previous model, we will remove x2 (the public

dummy) from our multiple linear regression model. The following Excel output provides the

estimated multiple linear regression equation that could be used to predict delay given the industry

dummy variable (x1), quality (x2), and finished (x3).

The estimated multiple linear regression equation is ŷ = 79.7324 + 12.6453x1 - 2.8204x2 - 4.1940x3,
and the coefficient of determination for this model is R2 = 0.3597, so the regression model explains
almost as much variation in the values of delay in the sample as did the model that included all
four independent variables.

Before testing any hypotheses about this regression model, we again check the conditions necessary

for valid inference in regression.

The Excel plots of the residuals and each of the independent variables follow.

[Residual plot: residuals versus industry]

[Residual plot: residuals versus quality]

[Residual plot: residuals versus finished]

The residuals appear to have a relatively constant variance across the values of each independent

variable and do not appear to be badly skewed at any value of any independent variable. However,

the mean of the residuals possibly differs from zero at several values of each of the quantitative

independent variables (quality and finished). A closer look at these scatter charts again suggests that

quality and finished may each have a nonlinear relationship with delay. We will proceed with our

inference but will keep our findings in mind as we proceed.

We have already determined that the correlation coefficient r for quality and finished is 0.0356,

which indicates that multicollinearity between the quantitative variables is not a concern.

Next we rerun this regression after removing the industry dummy variable (x1) from our model and

compare the parameter estimates and associated p-values for each of the remaining independent

variables to the parameter estimates and associated p-values for the original model.

When making these comparisons we observe that these values do not change substantially when the

dummy variable for industry is introduced into or removed from the model and conclude that the

industry dummy variable does not create a problem with multicollinearity.

Our results suggest that multicollinearity is not an issue for this regression model. We will therefore

proceed with our inferences.

The p-value for the test of the hypothesis that β1 = 0 is 0.0019. Because this p-value is less than the

0.05 level of significance, we again reject the hypothesis that β1 = 0, and conclude that there is a

difference in delay between the industries at the 0.05 level of significance. We estimate that, holding

quality and finished constant, the delay experienced by an industrial company is 12.6453 days longer

than the delay experienced by a bank, savings and loan, or insurance company.

The p-value for the test of the hypothesis that β2 = 0 is 0.0217. Because this p-value is less than the

0.05 level of significance, we reject the hypothesis that β2 = 0, and conclude that there is a

relationship between delay and quality at the 0.05 level of significance. We estimate that, holding

industry and finished constant, when the overall quality of internal controls (as judged by the

auditor) increases by one point the delay decreases by 2.8204 days.

The p-value for the test of the hypothesis that β3 = 0 is 0.0300. Because this p-value is less than the
0.05 level of significance, we reject the hypothesis that β3 = 0, and conclude that there is a
relationship between delay and finished at the 0.05 level of significance. We estimate that, holding
industry and quality constant, when finished (as judged by the auditor) increases by one point the
delay decreases by 4.1940 days.
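
To illustrate how the part (d) estimates combine into a prediction, the sketch below evaluates the estimated equation for one made-up company. The function name predict_delay and the input values are hypothetical, and the coefficient signs follow the interpretations given above (industrial companies have longer delays; higher quality and finished scores shorten the delay).

```python
def predict_delay(industrial: int, quality: float, finished: float) -> float:
    """Predicted delay in days from the part (d) estimates."""
    # Signs follow the interpretations stated in the text for part (d).
    return 79.7324 + 12.6453 * industrial - 2.8204 * quality - 4.1940 * finished

# Hypothetical inputs chosen only for illustration
example = predict_delay(industrial=1, quality=3, finished=2)
print(round(example, 4))
```

Changing only the industry dummy from 1 to 0 lowers the prediction by 12.6453 days, which is the estimated industry effect with quality and finished held constant.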

We have noted that the residuals plotted over each of the quantitative variables suggested possible

nonlinear relationships between the dependent variable delay and the two quantitative variables

(quality and finished). If we can think of plausible reasons why these two relationships could be

nonlinear, we may wish to consider this quadratic model next:

ŷ = b0 + b1x1 + b2x2 + b3(x2)² + b4x3 + b5(x3)²
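
Fitting such a quadratic model only requires adding squared columns for quality and finished to the data before running the regression. A minimal sketch of that step follows; the function name quadratic_row and the sample observation are hypothetical.

```python
def quadratic_row(x1: float, x2: float, x3: float) -> list:
    """Columns: intercept, industry, quality, quality^2, finished, finished^2."""
    return [1.0, x1, x2, x2 ** 2, x3, x3 ** 2]

# Hypothetical observation: industrial company, quality 3, finished 2
print(quadratic_row(1, 3, 2))
```

The model remains linear in its parameters, so the same least-squares tools apply once the squared columns have been created.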

15. a. The following Excel output provides the estimated simple linear regression equation that can be

used to predict the fuel efficiency for highway driving (y) given the engine’s displacement (x).

The estimated simple linear regression equation is ŷ = 35.3950 - 2.8821x, and the coefficient of

determination for this model is r2 = 0.6945, so this simple linear regression model explains

approximately 69% of the variation in the values of HwyMPG in the sample.

Before testing the hypothesis $\beta_1 = 0$ for this regression model, we check the conditions necessary for

valid inference in regression. The Excel plot of the residuals and displacement follows.

[Figure: residuals plotted against displacement]

The residuals appear to deviate somewhat from a constant variance with a mean of zero, but do not

appear to be badly skewed at any value of displacement. Because the apparent violations of the

conditions necessary for valid inference in regression do not appear to be severe, we will proceed

with our inference and keep these findings in mind.

The p-value for the test of the hypothesis that $\beta_1 = 0$ is 1.51247E-81. Because this p-value is less

than the 0.05 level of significance, we reject the hypothesis that $\beta_1 = 0$, and conclude that there is a

relationship between HwyMPG and displacement at the 0.05 level of significance. We estimate that

a one liter increase in the engine’s displacement coincides with a decrease of 2.8821 in HwyMPG.


b. The scatter chart of HwyMPG and displacement for which the points representing compact, midsize,

and large automobiles are shown in different shapes and/or colors follows.

[Figure: scatter chart of HwyMPG against displacement, with compact, midsize, and large automobiles marked separately]

This chart suggests that for each class of automobile (compact, midsize, and large) HwyMPG

decreases as displacement increases. The chart also suggests that midsize automobiles generally

have the highest HwyMPG while compact automobiles generally have the lowest HwyMPG.

Although this seems counterintuitive, the chart shows that this is likely occurring because the

midsize automobiles in the sample data tend to have low engine displacement, while the compact

automobiles in the sample data tend to have high engine displacement. The chart does suggest that

using dummy variables to represent the classes of automobile may enhance the fit of the regression

model.

c. The following Excel output provides the estimated multiple linear regression equation that can be

used to predict the fuel efficiency for highway driving (y) given the engine’s displacement (x1) and

the dummy variables ClassMidsize (x2) and ClassLarge (x3).


The estimated multiple linear regression equation is $\hat{y} = 29.0359 - 1.6625x_1 + 4.4686x_2 + 1.8047x_3$,

and the coefficient of determination for this model is $R^2 = 0.8182$, so this multiple linear regression

model explains approximately 82% of the variation in the values of HwyMPG in the sample.

d. Before testing any hypotheses for this regression model, we check the conditions necessary for valid

inference in regression. Excel plots of the residuals and each independent variable follow.

[Figures: residuals plotted against displacement, ClassMidsize, and ClassLarge]

The residuals appear to have a mean of zero and do not appear to be badly skewed at any value of

each independent variable. However, the variance does not appear to be constant across levels of

displacement or ClassLarge. These violations do not appear to be severe.


We also note that the parameter estimate and associated p-value for the independent variable

displacement change substantially when the dummy variables that represent the class of

automobile are introduced into the model (which makes sense – the displacement or size of the

engine is generally related to the class or size of the automobile), suggesting that multicollinearity is

possibly a concern. We will keep this result in mind as we proceed with our inference.

The p-value for the test of the hypothesis that $\beta_2 = 0$ is 7.14209E-35. Because this p-value is less

than the 0.05 level of significance, we reject the hypothesis that $\beta_2 = 0$, and conclude that there is a

difference in HwyMPG between midsized automobiles and compact automobiles at the 0.05 level of

significance. We estimate that holding displacement constant, midsized automobiles get 4.4686 more

HwyMPG than do compact automobiles. That is, if a compact automobile and a midsized

automobile have the same displacement, we expect a midsized automobile to get about 4.5 more

miles per gallon than the compact automobile.

The p-value for the test of the hypothesis that $\beta_3 = 0$ is 9.14602E-09. Because this p-value is less

than the 0.05 level of significance, we reject the hypothesis that $\beta_3 = 0$, and conclude that there is a

difference in HwyMPG between large automobiles and compact automobiles at the 0.05 level of

significance. We estimate that holding displacement constant, large automobiles get 1.8047 more

HwyMPG than do compact automobiles. That is, if a compact automobile and a large automobile

have the same displacement, we expect a large automobile to get about 1.8 more miles per gallon

than the compact automobile.
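The interpretation of the dummy variable coefficients can be checked with a short sketch that evaluates the estimated equation from part (c). The coefficient signs used below are the ones implied by the interpretation in the text (HwyMPG falls with displacement and is higher for midsize and large automobiles); the function name `predict_hwy_mpg` is hypothetical.

```python
# Coefficients from the estimated equation in part (c).
# Compact is the baseline class, so both dummy variables are 0 for a compact automobile.
B0, B_DISPLACEMENT, B_MIDSIZE, B_LARGE = 29.0359, -1.6625, 4.4686, 1.8047


def predict_hwy_mpg(displacement, car_class):
    """Predicted HwyMPG for a displacement (liters) and a class in {compact, midsize, large}."""
    class_midsize = 1 if car_class == "midsize" else 0
    class_large = 1 if car_class == "large" else 0
    return (B0 + B_DISPLACEMENT * displacement
            + B_MIDSIZE * class_midsize + B_LARGE * class_large)


# At any fixed displacement, the gap between classes equals the dummy coefficient.
for car_class in ("compact", "midsize", "large"):
    print(car_class, round(predict_hwy_mpg(3.0, car_class), 4))
```

Because the dummy variables only shift the intercept, the midsize and large differences relative to compact are the same at every displacement value.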

e. The following Excel output provides the estimated multiple linear regression equation that can be

used to predict the fuel efficiency for highway driving (y) given the engine’s displacement (x1) and

the dummy variables ClassMidsize (x2), ClassLarge (x3), and FuelPremium (x4).

The coefficient of determination for this model is $R^2 = 0.8338$, so this multiple linear regression

model explains approximately 83% of the variation in the values of HwyMPG in the sample.


f. Before testing any hypotheses for this regression model, we check the conditions necessary for valid

inference in regression. Excel plots of the residuals and each independent variable follow.

[Figures: residuals plotted against displacement, ClassMidsize, ClassLarge, and FuelPremium]

The residuals appear to have a mean of zero and do not appear to be badly skewed at any value of

any independent variable. However, the variance does not appear to be constant across values of the

independent variable displacement, but this violation does not appear to be severe. We also note that

the parameter estimates and associated p-values for each of the independent variables do not

change substantially when the dummy variable FuelPremium is introduced into the model,

suggesting that the dummy variable FuelPremium does not create further issues with

multicollinearity. We therefore will proceed with our inference.

The p-value for the test of the hypothesis that $\beta_1 = 0$ is 1.06276E-34. Because this p-value is less

than the 0.05 level of significance, we reject the hypothesis that $\beta_1 = 0$, and conclude that there is a

relationship between HwyMPG and displacement at the 0.05 level of significance. We estimate that,

for a fixed class of automobile and type of fuel used, a one liter increase in the engine’s

displacement coincides with a decrease of 1.6347 in HwyMPG.

The p-value for the test of the hypothesis that $\beta_2 = 0$ is 6.31598E-29. Because this p-value is less

than the 0.05 level of significance, we reject the hypothesis that $\beta_2 = 0$, and conclude that there is a

difference in HwyMPG between midsized automobiles and compact automobiles at the 0.05 level of

significance. We estimate that holding displacement and fuel type constant, a midsized automobile

gets 3.9634 more HwyMPG than a compact automobile. That is, if a compact automobile and a

midsized automobile have the same displacement and use the same type of fuel, we expect a

midsized automobile to get about 4.0 more miles per gallon than the compact automobile.

The p-value for the test of the hypothesis that $\beta_3 = 0$ is 4.89555E-08. Because this p-value is less

than the 0.05 level of significance, we reject the hypothesis that $\beta_3 = 0$, and conclude that there is a

difference in HwyMPG between large automobiles and compact automobiles at the 0.05 level of

significance. We estimate that holding displacement and fuel type constant, a large automobile gets

1.6450 more HwyMPG than a compact automobile. That is, if a compact automobile and a large

automobile have the same displacement and use the same type of fuel, we expect the large

automobile to get about 1.6 more miles per gallon than the compact automobile.

The p-value for the test of the hypothesis that $\beta_4 = 0$ is 1.6116E-07. Because this p-value is less than

the 0.05 level of significance, we reject the hypothesis that $\beta_4 = 0$, and conclude that there is a

difference in HwyMPG between automobiles that use premium fuel and automobiles that do not use

premium fuel at the 0.05 level of significance. We estimate that holding displacement and class of

automobile constant, an automobile that uses premium fuel gets 1.1210 less HwyMPG than an

automobile that does not use premium fuel. That is, if two automobiles are of the same class and have

the same displacement, and one of the automobiles uses premium fuel and the other does not, we

expect the automobile that uses premium fuel to get about 1.1 fewer miles per gallon than the

automobile that does not use premium fuel.

16. a. The scatter chart with vehicle speed as the independent variable follows:

[Figure: scatter chart of traffic flow against vehicle speed]

The scatter chart suggests that vehicle speed and traffic flow are positively related.

b. The following Excel output provides the estimated simple linear regression equation that could be

used to predict traffic flow (y) given the vehicle speed (x).

The estimated simple linear regression equation is $\hat{y} = 1039.5757 + 6.6006x$, and the coefficient of

determination for this model is $r^2 = 0.3133$, so the regression model explains approximately 31% of

the variation in the sample values of traffic flow.


Before testing any hypotheses about this regression model, we check the conditions necessary for

valid inference in regression. The Excel plot of the residuals and vehicle speed follows.

[Figure: residuals plotted against vehicle speed]

The residuals appear to have a relatively constant variance across the values of vehicle speed and do

not appear to be badly skewed at any value of the independent variable. However, the mean of the

residuals possibly differs from zero at several values of the independent variable; this suggests the

relationship between vehicle speed and traffic flow may be nonlinear. We will proceed with our

inference but will keep our findings in mind as we proceed.

The p-value for the test of the hypothesis that $\beta_1 = 0$ is 1.41658E-09. Because this p-value is less

than the 0.05 level of significance, we reject the hypothesis that $\beta_1 = 0$, and conclude that there is a

relationship between vehicle speed and traffic flow at the 0.05 level of significance. Our best

estimate is that when vehicle speed increases by 1 mph, traffic flow increases by 6.6006 vehicles.

c. The following Excel output provides the estimated second order quadratic regression equation that

could be used to predict traffic flow (y) given the vehicle speed (x).

The estimated second order quadratic regression equation is $\hat{y} = 621.2138 + 28.0372x - 0.2665x^2$,

and the coefficient of determination for this model is $R^2 = 0.3431$, so the quadratic regression model

explains approximately 3 percentage points more of the variation in the sample values of traffic flow than did the

linear regression model in part (b).

Before testing any hypotheses about this regression model, we again check the conditions necessary

for valid inference in regression. The Excel plot of the residuals and vehicle speed follows.

[Figure: residuals from the quadratic model plotted against vehicle speed]

This scatter chart is very similar to the scatter chart of the residuals from the simple linear regression

model estimated in part (b). When we plot the residuals from the quadratic model against squared

values of vehicle speed, the scatter chart does not provide strong evidence of a violation of the

conditions, so we will proceed with our inference.

[Figure: residuals from the quadratic model plotted against vehicle speed squared]

The p-value for the test of the hypothesis that $\beta_1 = 0$ is 0.0074. Because this p-value is less than the

0.05 level of significance, we again reject the hypothesis that $\beta_1 = 0$. Similarly, the p-value for the

test of the hypothesis that $\beta_2 = 0$ is 0.0384. Because this p-value is less than the 0.05 level of

significance, we reject the hypothesis that $\beta_2 = 0$. We therefore conclude that there is a nonlinear

relationship between vehicle speed and traffic flow. We estimate that when vehicle speed increases

from some value x to x+1, the traffic flow changes by

$$b_1 + b_2\left[(x+1)^2 - x^2\right] = b_1 + b_2(2x+1) = 27.7707 - 0.5331x$$

That is, estimated traffic flow initially increases as vehicle speed increases when the traffic is

traveling at a relatively low speed, and then eventually decreases as vehicle speed increases. Solving

this result for x

$$27.7707 - 0.5331x = 0$$

$$-0.5331x = -27.7707$$

$$x \approx 52.09$$

tells us that estimated maximum traffic flow occurs at a vehicle speed of approximately 52 miles per hour; at speeds

below 52 miles per hour the traffic flow increases as vehicle speed increases, and at speeds above 52

miles per hour the traffic flow decreases as vehicle speed increases. Substituting 52 miles per hour

into the estimated second order quadratic regression equation:

$$\hat{y} = 621.2138 + 28.0372(52) - 0.2665(52)^2 \approx 1358.41$$
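The one-unit change and the speed at which it reaches zero can be reproduced with a minimal sketch. It uses the rounded coefficients printed above with the signs implied by the text (a positive linear term and a negative squared term), so results differ slightly from figures computed with Excel's unrounded estimates. The function names are hypothetical.

```python
# Rounded coefficients of the estimated quadratic equation in part (c).
B0, B1, B2 = 621.2138, 28.0372, -0.2665


def traffic_flow(speed):
    """Estimated traffic flow at a given vehicle speed (mph)."""
    return B0 + B1 * speed + B2 * speed ** 2


def change_per_mph(speed):
    """Change in estimated flow when speed rises from `speed` to `speed + 1`."""
    return traffic_flow(speed + 1) - traffic_flow(speed)


# Algebraically, change_per_mph(x) = B1 + B2 * (2x + 1); setting it to zero gives:
speed_zero_change = -(B1 + B2) / (2 * B2)

# The vertex of the parabola itself (where the derivative is zero) is slightly higher.
vertex_speed = -B1 / (2 * B2)

print(round(speed_zero_change, 2), round(vertex_speed, 2))
```

Both values round to roughly 52 miles per hour, which is why the choice between the one-unit difference used in the text and the derivative makes little practical difference here.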

A plot of the linear and quadratic regression lines helps us better understand the difference in how

these two models fit the sample data.

[Figure: simple linear regression line (green) and quadratic regression line (red) overlaid on the scatter chart of traffic flow against vehicle speed]

This display shows that there is little difference in how the simple linear regression line (in green)

and the quadratic regression line (in red) fit the sample data. Comparison of the coefficients of

determination for these two models shows that the estimated second order quadratic regression

equation explains only about 3 percentage points more of the variation in the sample values of traffic flow than did

the less complex simple linear regression model. Since the simple linear regression model has almost

the same explanatory power as the quadratic regression model and is far simpler, the simple linear

regression model is superior.


d. By reducing the range of the axes for the scatter chart we developed in part (a), we can see more

clearly where (if at all) the relationship between vehicle speed and traffic flow changes:

[Figure: scatter chart of traffic flow against vehicle speed with reduced axis ranges]

If there is a change in the relationship between vehicle speed and traffic flow, it is not prominent.

We will use 45 as the knot (you could select a different value to use as the knot – this is subjective –

and the results of part (c) could be used to estimate the value to use for the knot).

First we create a dummy variable that is equal to 1 if vehicle speed exceeds the knot value of 45 and

zero otherwise, then we multiply this dummy variable by the difference between vehicle speed and

the knot value of 45. We then estimate a regression model with this new variable (the product of the

knot dummy variable and the difference between vehicle speed and the knot value of 45) and vehicle

speed as the independent variables.
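The construction of the knot variable described above can be sketched as follows. The function name `knot_term` and the sample speeds are hypothetical; the knot value of 45 is the one chosen in the text.

```python
KNOT = 45  # knot value chosen in the text


def knot_term(speed, knot=KNOT):
    """Knot dummy (1 if speed exceeds the knot, else 0) times (speed - knot)."""
    knot_dummy = 1 if speed > knot else 0
    return knot_dummy * (speed - knot)


# Hypothetical vehicle speeds; each row holds the two independent variables
# (vehicle speed, knot term) that would be passed to the regression.
speeds = [30, 44, 45, 46, 52]
design_rows = [(speed, knot_term(speed)) for speed in speeds]
print(design_rows)
```

The new variable is zero at or below the knot and grows one-for-one with speed above it, so its coefficient measures how much the slope changes once vehicle speed passes 45.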

The following Excel output provides the estimated piecewise linear regression equation with a knot

at vehicle speed = 45 that could be used to predict traffic flow (y) given the vehicle speed (x).


If vehicle speed does not exceed 45 miles per hour, the estimated regression equation is

$$\hat{y} = 984.5875 + 8.1287x$$

and if vehicle speed exceeds 45 miles per hour, the estimated regression equation is

$$\hat{y} = 1279.3672 + 1.5780x$$

According to this model, the estimated increase in traffic flow that corresponds with a 1 mile per

hour increase in vehicle speed is much smaller if the traffic speed is over 45 miles per hour.

Note that the coefficient of determination for this model is $R^2 = 0.3281$, so the piecewise linear

regression model with a knot at vehicle speed = 45 explains approximately 1% more of the variation

in the sample values of traffic flow than did the much less complex simple linear regression model in

part (b). Also note that the p-value for the test of the hypothesis that $\beta_2 = 0$ is 0.1460. Because this p-value

exceeds the 0.05 level of significance, we do not reject the hypothesis that $\beta_2 = 0$ (i.e., the knot

interaction is not statistically significant). Furthermore, the piecewise linear regression model with a

knot at vehicle speed = 45 explains less of the variation in the sample values of traffic flow than did

the second order quadratic regression model in part (c). Thus, the piecewise linear regression model

with a knot of 45 should not be considered further. Note that a piecewise linear regression model

with a different knot (perhaps a knot of 52) may perform much better than our piecewise linear

regression model with a knot of 45.

e. We split the data set so that the first data set contains 65 observations with values of vehicle speed

less than 45 and the second data set contains 35 observations with values of vehicle speed greater

than or equal to 45.

The following Excel output provides the estimated simple linear regression equation that could be

used to predict traffic flow (y) given vehicle speed (x) below 45.

The estimated simple linear regression equation is $\hat{y} = 961.5736 + 8.8039x$, and the coefficient

of determination for this model is $r^2 = 0.2978$, so the regression model explains approximately 30%

of the variation in the sample values of traffic flow corresponding to vehicle speeds less than 45.


The following Excel output provides the estimated simple linear regression equation that could be

used to predict traffic flow (y) given vehicle speed (x) greater than or equal to 45.

The estimated simple linear regression equation is $\hat{y} = 1167.7323 + 3.8026x$, and the coefficient

of determination for this model is $r^2 = 0.0207$, so the regression model explains approximately 2% of

the variation in the sample values of traffic flow corresponding to vehicle speeds greater than or

equal to 45.

Separating the data into two sets and fitting separate simple linear regression equations to each set

results in an even worse fit than the piecewise linear regression with a single knot at vehicle speed =

45.

Comparing predicted values of traffic flow for vehicle speeds of 44 and 46 (slightly below and above

the knot value of 45) will allow us to see the difference between the piecewise linear regression with

a single knot and two separate simple regression equations.

For vehicle speed = 44 the piecewise linear regression with a single knot produces

$$\hat{y} = 984.5875 + 8.1287(44) \approx 1342.25$$

For vehicle speed = 46, the piecewise linear regression with a single knot produces

$$\hat{y} = 1279.3672 + 1.5780(46) \approx 1351.96$$

Alternatively, for vehicle speed = 44 the simple linear regression fit on observations with vehicle

speeds < 45 produces

$$\hat{y} = 961.5736 + 8.8039(44) \approx 1348.95$$

For vehicle speed = 46, the simple linear regression fit on observations with vehicle speeds ≥ 45 produces

$$\hat{y} = 1167.7323 + 3.8026(46) \approx 1342.65$$

That is, fitting two separate simple linear regression equations results in predicted traffic flow being

considerably less at vehicle speed = 46 than at vehicle speed = 44. This is opposite the behavior

predicted by the piecewise linear regression with a single knot at vehicle speed = 45.

To visualize how this happens, note from the charts below how the piecewise linear regression

“connects” two regression lines at the knot value of 45 while the two linear regression equations fit

separately result in a disjointed fit.

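[Figure: two separately fitted regression lines on the scatter chart of traffic flow against vehicle speed, y = 8.8039x + 961.57 for vehicle speeds below 45 and y = 3.8026x + 1167.7 for vehicle speeds of 45 or more]

[Figure: piecewise linear regression with a knot at vehicle speed = 45 on the scatter chart of traffic flow against vehicle speed, y = 8.1287x + 984.59 up to the knot and y = 1.5780x + 1279.40 above it]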
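The difference between the connected and the disjointed fit can also be checked numerically. The sketch below evaluates the rounded equations reported in parts (d) and (e) on either side of the knot; the function names are hypothetical, and small discrepancies at the knot come from coefficient rounding.

```python
KNOT = 45


def piecewise_fit(speed):
    """Piecewise linear regression from part (d), written as its two segments."""
    if speed <= KNOT:
        return 984.5875 + 8.1287 * speed
    return 1279.3672 + 1.5780 * speed


def separate_fits(speed):
    """Two simple linear regressions from part (e), fit on speeds < 45 and >= 45."""
    if speed < KNOT:
        return 961.5736 + 8.8039 * speed
    return 1167.7323 + 3.8026 * speed


# Gap between the two segments of each approach, evaluated exactly at the knot.
piecewise_gap = (1279.3672 + 1.5780 * KNOT) - (984.5875 + 8.1287 * KNOT)
separate_gap = (1167.7323 + 3.8026 * KNOT) - (961.5736 + 8.8039 * KNOT)

print(round(piecewise_gap, 4), round(separate_gap, 4))
for speed in (44, 46):
    print(speed, round(piecewise_fit(speed), 2), round(separate_fits(speed), 2))
```

The piecewise segments meet at the knot up to rounding, while the separately fitted lines drop by roughly 19 vehicles there, which is why the separate fits predict lower traffic flow at 46 than at 44.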

f. Other independent variables that you could add to your regression model to explain more variation

in traffic flow include number of accidents and weather conditions (i.e., rainy or snowy).

17. a. The scatter chart with years to maturity as the independent variable follows.

[Figure: scatter chart of yield against years to maturity]

A simple linear regression model does not appear to be appropriate; there appears to be a curvilinear

relationship between years to maturity and yield.

b. The following Excel output provides the estimated second order quadratic regression equation that

could be used to predict yield (y) given the years to maturity (x).

The estimated second order quadratic regression equation is $\hat{y} = 1.0170 + 0.4606x - 0.0103x^2$, and

the coefficient of determination for this model is $R^2 = 0.6678$, so the quadratic regression model

explains approximately 67% of the variation in sample values of yield.

Before testing any hypotheses about this regression model, we again assess the conditions necessary

for valid inference in regression. Excel plots of the residuals with years to maturity and years to

maturity squared follow.

[Figures: residuals plotted against years to maturity and against years to maturity squared]

These scatter charts do not provide strong evidence of a violation of the conditions, so we will

proceed with our inference.

The p-value for the test of the hypothesis that $\beta_1 = 0$ is 1.80383E-06. Because this p-value is less

than the 0.05 level of significance, we reject the hypothesis that $\beta_1 = 0$. Similarly, the p-value for the

test of the hypothesis that $\beta_2 = 0$ is 0.0003. Because this p-value is less than the 0.05 level of

significance, we reject the hypothesis that $\beta_2 = 0$. We therefore conclude that there is a nonlinear

relationship between years to maturity and yield. We estimate that when years to maturity increases

by 1 year from some value x to x+1, the yield changes by

$$b_1 + b_2\left[(x+1)^2 - x^2\right] = b_1 + b_2(2x+1) = 0.4503 - 0.0206x$$

That is, estimated yield initially increases as years to maturity increases, and then eventually

decreases as years to maturity increases. Solving this result for x

$$0.4503 - 0.0206x = 0$$

$$-0.0206x = -0.4503$$

$$x \approx 21.86$$

tells us that estimated maximum yield to maturity occurs at approximately 22 years. Substituting 22

years into the estimated second order quadratic regression equation (with the rounded coefficients shown above):

$$\hat{y} = 1.0170 + 0.4606(22) - 0.0103(22)^2 \approx 6.165$$

c. A plot of the linear and quadratic regression lines overlaid on the scatter chart of years to maturity

and yield follows.

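[Figure: simple linear regression line (green) and quadratic regression line (red) overlaid on the scatter chart of yield against years to maturity]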

This display shows that there is a substantial difference in how the simple linear regression line (in

green) and the quadratic regression line (in red) fit the sample data. If we were to run the simple

linear regression in Excel, we would find that the coefficient of determination is 0.7258, and so the

quadratic regression model (with a coefficient of determination of 0.8172) explains almost 10%

more of the variation in our sample yields.

d. Other independent variables you could add to the regression model to explain more variation in yield

include the prevailing market rate at the time the bond was issued and the credit rating given to the

company issuing the bond by Moody's, Standard and Poor's, or Fitch.

18. a. The scatter chart with average asking rent as the independent variable follows.

[Figure: scatter chart of monthly mortgage ($) against average asking rent ($)]

The scatter chart suggests that rent is positively related to mortgage. However, it is not clear that the

relationship is linear, and so a simple linear regression model may not be appropriate.

b. The following Excel output provides the estimated simple linear regression equation that could be

used to predict the monthly mortgage on the median priced home (y) given the average asking rent

(x).

The plot of the residuals for this model against the independent variable average asking rent follows.

[Figure: residuals plotted against average asking rent ($)]

The mean of the residuals appears to differ from zero at several values of the independent variable

average asking rent; residuals for observations with relatively small or relatively large values of the

independent variable average asking rent tend to be positive, while the remaining residuals tend to be

negative. This suggests the relationship between the independent variable average asking rent and

the dependent variable monthly mortgage on the median priced home may be nonlinear, and so a

simple linear regression model may not be appropriate.

c. The following Excel output provides the estimated second-order quadratic regression equation that

could be used to predict the monthly mortgage on the median priced home (y) given the average

asking rent (x).

The estimated second-order quadratic regression equation is $\hat{y} = 3965.6331 - 8.2606x + 0.0051x^2$.

d. Excel plots of the residuals for the estimated second-order quadratic regression model against the

independent variables average asking rent and average asking rent squared follow.

[Figures: residuals from the quadratic model plotted against average asking rent ($) and against average asking rent squared]

These scatter charts suggest that the estimated second-order quadratic regression model fits the

sample data much better than the simple linear regression model.

A plot of the linear and quadratic regression lines overlaid on the scatter chart of the monthly

mortgage on the median priced home and the average asking rent will also help us better understand

the difference in how the quadratic regression model and a simple linear regression model fit the

sample data.

[Figure: simple linear regression line (green) and quadratic regression line (red) overlaid on the scatter chart of monthly mortgage ($) against average asking rent ($)]

This display shows that there is a substantial difference in how the simple linear regression line (in

green) and the quadratic regression line (in red) fit the sample data. Note that the coefficient of

determination for the second-order quadratic regression model is $R^2 = 0.8985$, so the regression model

explains almost 13% more variation in the sample values of the monthly mortgage on the median

priced home than does the simple linear regression model. The estimated regression model

developed in part (c) is superior to the model developed in part (b).

19. a. The following Excel output provides the estimated multiple linear regression equation that relates

risk of a stroke to the person’s age (x1), blood pressure (x2), and whether the person is a smoker (x3).

The estimated multiple linear regression equation is $\hat{y} = -91.7595 + 1.0767x_1 + 0.2518x_2 + 8.7399x_3$.

b. Before testing any hypotheses about this regression model, we again check the conditions necessary

for valid inference in regression. Excel plots of the residuals and each of the independent variables

follow.

[Figures: residuals plotted against age, blood pressure, and the smoker dummy variable]

None of these scatter charts provides strong evidence of a violation of the conditions, so we will

proceed with our inference.

Next we check for evidence of multicollinearity. First note that by using Excel we can determine

that the correlation coefficient r for age and blood pressure is -0.3090, which indicates that

multicollinearity between the quantitative variables is not a concern.
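The correlation check can be sketched without Excel. The function below computes the sample correlation coefficient r for two quantitative variables; the name `pearson_r` and the data values are hypothetical and are not the textbook data, so the result will not equal -0.3090.

```python
from math import sqrt


def pearson_r(x, y):
    """Sample correlation coefficient between two equally long lists of numbers."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical ages and blood pressure readings, for illustration only.
age = [45, 52, 60, 68, 73, 80]
blood_pressure = [170, 150, 165, 140, 155, 135]

r = pearson_r(age, blood_pressure)
print(f"r = {r:.4f}")
```

Values of r close to +1 or -1 between two independent variables would signal that they carry largely the same information; a value such as the -0.3090 reported in the text is far from either extreme.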

Now we rerun this regression after removing the smoker dummy variable (x3) from our model and

compare the parameter estimates and associated p-values for each of the remaining independent

variables to the parameter estimates and associated p-values for the original model.


When making these comparisons we observe that these values do not change substantially when the

smoker dummy variable is introduced into or removed from the model and conclude that the smoker

dummy variable does not create a problem with multicollinearity.

Our results suggest that multicollinearity is not an issue for this regression model. We will therefore

proceed with our inferences.

The p-value for the test of the hypothesis that $\beta_3 = 0$ is 0.0102. Because this p-value is less than the

0.05 level of significance, we again reject the hypothesis that $\beta_3 = 0$, and conclude that there is a

difference between smokers and nonsmokers in the risk of a stroke. We estimate that holding age

and blood pressure constant, smokers have a risk of stroke that is 8.7399 percent higher than

nonsmokers.

c. For a patient with the profile of Art Speen (a 68-year-old smoker who has blood pressure of 175),

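the predicted risk of a stroke, using the rounded coefficients from part (a), is:

$$\hat{y} = -91.7595 + 1.0767(68) + 0.2518(175) + 8.7399(1) \approx 34.26$$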

d. Other factors that could be included in the model as independent variables include family history of

stroke, weight/obesity, and gender.

20. a. The following Excel output provides the estimated multiple linear regression equation that

includes critical reading (x1) and mathematics (x2) SAT scores as independent variables.


The estimated multiple linear regression equation is $\hat{y} = -2.6717 + 0.0043x_1 + 0.0045x_2$, and the

coefficient of determination for this multiple regression model is $R^2 = 0.7722$, so the regression

model explains approximately 77% of the variation in the sample values of the Freshman GPA.

Before testing any hypotheses about this regression model, we check the conditions necessary for

valid inference in regression. The correlation between the two independent variables (critical reading

and mathematics SAT scores) is -0.0161, so there is no need for concern about multicollinearity.

Furthermore, the Excel plots of the residuals and each of the quantitative independent variables

(critical reading and mathematics SAT scores) follow.

[Figures: residuals plotted against critical reading SAT score and against mathematics SAT score]

The residuals appear to have a relatively constant variance across the values of each independent

variable and do not appear to be badly skewed at any value of either variable. However, the mean of

the residuals possibly differs from zero at several values of each independent variable. Because

these violations do not appear to be dramatic, we will proceed with our inference but will keep our

findings in mind as we proceed.

The p-value for the test of the hypothesis that $\beta_1 = 0$ is 9.06411E-44. Because this p-value is less

than the 0.05 level of significance, we again reject the hypothesis that $\beta_1 = 0$, and conclude that there

is a relationship between SAT scores on critical reading and Freshman GPA. We estimate that

holding SAT score on mathematics constant, a one point increase in SAT score on critical reading

corresponds to an increase in Freshman GPA of 0.0043. We expect freshman GPA to increase as the

SAT score on critical reading increases, so this result appears to be reasonable.

The p-value for the test of the hypothesis that $\beta_2 = 0$ is 1.20495E-45. Because this p-value is less

than the 0.05 level of significance, we again reject the hypothesis that $\beta_2 = 0$, and conclude that there

is a relationship between SAT scores on mathematics and Freshman GPA. We estimate that holding

SAT score on critical reading constant, a one point increase in SAT score on mathematics

corresponds to an increase in Freshman GPA of 0.0045. We expect freshman GPA to increase as the

SAT score on mathematics increases, so this result appears to be reasonable.

b. Using the multiple linear regression model developed in part (a), the predicted freshman GPA of

Bobby Engle (a student who has been admitted to Ruggles College with a 660 SAT score on critical

reading and a 630 SAT score on mathematics) is

or approximately 3.05.
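
The general form of this prediction can be written out as follows. The slopes are the rounded values interpreted in part (a); the intercept b0 comes from the Excel output, which is not reproduced in this text, so it is left symbolic rather than guessed.

```latex
\hat{y} = b_0 + b_1 x_1 + b_2 x_2
        = b_0 + b_1(660) + b_2(630) \approx 3.05,
\qquad b_1 \approx 0.0043,\quad b_2 \approx 0.0045
```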

c. The following Excel output provides the estimated multiple linear regression equation that

includes critical reading (x1) and mathematics (x2) SAT scores and their interaction (x1x2) as

independent variables.

and the coefficient of determination for this multiple regression model is R2 = 0.7855, so the

regression model explains approximately 79% of the variation in the sample values of the Freshman

GPA.

Before testing any hypotheses about this regression model, we again check the conditions necessary

for valid inference in regression. As noted before, the correlation between the two independent

variables (critical reading and mathematics SAT scores) is -0.0161, so there is no need for concern

about multicollinearity. Excel plots of the residuals and each of the independent variables follow.

[Figure: residuals plotted against critical reading SAT score (Reading)]

[Figure: residuals plotted against mathematics SAT score (Math)]

[Figure: residuals plotted against the reading-mathematics interaction (ReadMath)]

The addition of the ReadMath interaction appears to have had relatively little impact on the residuals.

However, we did not believe that these violations were extreme in the multiple regression model that

did not include the interaction, and so we will proceed with our inference.

The p-value for the test of the hypothesis that β1 = 0 is 0.4810. Because this p-value is greater than

the 0.05 level of significance, we do not reject the hypothesis that β1 = 0, and we conclude that there

is not a linear relationship between SAT scores on critical reading and Freshman GPA when

controlling for SAT scores on mathematics and the interaction between SAT scores on critical

reading and SAT scores on mathematics.

The p-value for the test of the hypothesis that β2 = 0 is 0.5479. Because this p-value is greater than

the 0.05 level of significance, we do not reject the hypothesis that β2 = 0, and we conclude that there

is not a linear relationship between SAT scores on mathematics and Freshman GPA when

controlling for SAT scores on critical reading and the interaction between SAT scores on critical reading

and SAT scores on mathematics.

The p-value for the test of the hypothesis that β3 = 0 is 0.0006. Because this p-value is less than the

0.05 level of significance, we reject the hypothesis that β3 = 0, and we conclude that there is a

relationship between the interaction of SAT scores on critical reading and mathematics and the

dependent variable Freshman GPA. We estimate that when the SAT score on critical reading

increases by one point, Freshman GPA increases by 0.000009*SAT score on mathematics.

Similarly, we estimate that when the SAT score on mathematics increases by one point, Freshman

GPA increases by 0.000009*SAT score on critical reading. These results support the conjecture

made by the Ruggles College Director of Admissions.
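
To make the interaction interpretation concrete: in a model of the form ŷ = b0 + b1x1 + b2x2 + b3x1x2, the estimated change in ŷ for a one-point increase in x1 is b1 + b3x2, and symmetrically b2 + b3x1 for x2. The solution above drops b1 and b2 because they are not statistically significant, leaving b3 times the other score. The sketch below follows that simplification using the rounded b3 = 0.000009 quoted above and Bobby Engle's scores from part (b); the function name `marginal_effect` is hypothetical.

```python
B3 = 0.000009  # rounded interaction coefficient quoted in the solution

def marginal_effect(other_score, b_main=0.0, b3=B3):
    """Estimated change in Freshman GPA per one-point increase in one SAT
    score, holding the other SAT score fixed at `other_score`.

    b_main defaults to 0 to mirror the solution's treatment of the
    nonsignificant main effects; pass the fitted value to include it.
    """
    return b_main + b3 * other_score

# Effect of one more point on critical reading when mathematics = 630
effect_reading = marginal_effect(630)
# Effect of one more point on mathematics when critical reading = 660
effect_math = marginal_effect(660)

print(f"per-point effect of reading at math = 630: {effect_reading:.5f}")
print(f"per-point effect of math at reading = 660: {effect_math:.5f}")
```

The point of the interaction is visible here: the payoff from an extra point on one section grows with the student's score on the other section.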

d. The model developed in part (a) is simpler and explains almost as much variation in the sample

values of freshman GPA as the regression model developed in part (c), so the regression model

developed in part (a) is superior.

e. Other factors that could be added to the model as independent variables include the student’s high

school GPA, the number of credit hours, and the number of hours per week the student plans to work in

paid employment during her/his freshman year.

21. a. The following Excel output provides the estimated simple linear regression equation with the

customer’s annual household income as the independent variable (x) and credit card charges accrued

by a customer over the past year as the dependent variable (y).

We estimate that as a customer’s annual income increases by $1000, the credit card charges accrued

by the customer over the past year will be $121.35 higher.

The coefficient of determination for this regression model is r2 = 0.3135, so this simple

linear regression model explains approximately 31% of the variation in the sample values of credit

card charges accrued by a customer over the past year.

b. The following Excel output provides the estimated simple linear regression equation with the

number of members in the customer’s household as the independent variable (x) and credit card

charges accrued by a customer over the past year as the dependent variable (y).

We estimate that as the number of members in the customer’s household increases by one, the credit

card charges accrued by the customer over the past year will be $550.91 higher.

The coefficient of determination for this regression model is r2 = 0.0354, so this simple

linear regression model explains approximately 4% of the variation in the sample values of credit

card charges accrued by a customer over the past year.

c. The following Excel output provides the estimated simple linear regression equation with the

customer’s number of years of post-high school education as the independent variable (x) and credit

card charges accrued by a customer over the past year as the dependent variable (y).

We estimate that as a customer’s years of post-high school education increases by one year, the

credit card charges accrued by the customer over the past year will be $509.87 lower.

The coefficient of determination for this regression model is r2 = 0.0160, so this simple

linear regression model explains approximately 2% of the variation in the sample values of credit

card charges accrued by a customer over the past year.

d. The following Excel output from Figure 4.25 provides the estimated multiple linear regression

equation with the customer’s annual household income (x1), number of members of the household

(x2), and number of years of post-high school education (x3) as independent variables and credit card

charges accrued by a customer over the past year as the dependent variable (y).

The multiple regression estimates of b1, b2, and b3 are almost identical to the estimates of b1 from the

estimated simple linear regressions in parts (a), (b), and (c), respectively. This stability in the values

of the parameter estimates suggests little multicollinearity in the multiple linear regression that

includes the customer’s annual household income (x1), number of members of the household (x2),

and number of years of post-high school education (x3) as independent variables.

e. The coefficient of determination for the multiple regression model in Figure 4.25 is R2 = 0.3635, so

the regression model explains approximately 36% of the variation in the sample values of credit card

charges accrued by a customer over the past year. This R2 is only slightly less than the sum of the r2 values

from parts (a), (b), and (c), which is 0.3649. This indicates that there is little redundancy in the

variation in sample values of credit card charges accrued by a customer over the past year that is

explained by these three independent variables, i.e., there is almost no multicollinearity in the

estimated multiple linear regression from Figure 4.25.
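
The comparison in part (e) can be reproduced from the rounded values quoted in parts (a) through (c). Because those inputs are rounded to four decimals, the total may differ in the last digits from a sum taken directly from unrounded Excel output. The variable names below are illustrative only.

```python
# r2 values quoted in parts (a), (b), and (c) for the simple regressions
r2_simple = {
    "annual household income": 0.3135,
    "household members": 0.0354,
    "years of post-high school education": 0.0160,
}
R2_MULTIPLE = 0.3635  # R2 quoted for the three-variable model in Figure 4.25

total = sum(r2_simple.values())
overlap = total - R2_MULTIPLE  # variation "explained twice" by the predictors

print(f"sum of simple r2 values: {total:.4f}")
print(f"multiple regression R2:  {R2_MULTIPLE:.4f}")
print(f"difference (redundancy): {overlap:.4f}")
```

A difference this close to zero is what one expects when the independent variables are nearly uncorrelated with one another, which is the sense in which the solution reports almost no multicollinearity.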

f. The following Excel output provides the estimated multiple linear regression equation with the

customer’s annual household income (x1), number of members of the household (x2), number of

years of post-high school education (x3), age (x4), a dummy variable for gender (x5), and a dummy

variable for whether a customer has exceeded his/her credit limit in the past 12 months (x6) as

independent variables and credit card charges accrued by a customer over the past year as the

dependent variable (y).

The coefficient of determination for this multiple regression model is R2 = 0.3678, so the regression

model explains approximately 37% of the variation in the sample values of credit card charges

accrued by a customer over the past year. This R2 is only slightly larger than the R2 from the multiple

regression in Figure 4.25. Therefore, adding age, a dummy variable for gender, and a dummy

variable for whether a customer has exceeded his/her credit limit in the past 12 months as independent

variables to the multiple regression in Figure 4.25 does not substantially improve the fit of the

model.
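
One hedged way to put a number on "does not substantially improve the fit" is to compare adjusted R2 values, or to compute a partial F statistic for the three added variables. Neither calculation appears in the original solution, and the sample size is not stated in this text, so `N_HYPOTHETICAL` below is a placeholder that must be replaced with the actual number of customers before drawing any conclusion.

```python
def adjusted_r2(r2, n_obs, k):
    """Adjusted R2 for a model with k independent variables and n_obs rows."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - k - 1)

def partial_f(r2_reduced, r2_full, n_obs, k_full, q_added):
    """Partial F statistic for adding q_added variables to a nested model.

    k_full is the number of independent variables in the larger model.
    """
    numerator = (r2_full - r2_reduced) / q_added
    denominator = (1.0 - r2_full) / (n_obs - k_full - 1)
    return numerator / denominator

N_HYPOTHETICAL = 500  # ASSUMPTION: placeholder only, not taken from the data

R2_THREE = 0.3635  # three-variable model (Figure 4.25)
R2_SIX = 0.3678    # six-variable model from part (f)

print("adjusted R2, 3 variables:", round(adjusted_r2(R2_THREE, N_HYPOTHETICAL, 3), 4))
print("adjusted R2, 6 variables:", round(adjusted_r2(R2_SIX, N_HYPOTHETICAL, 6), 4))
print("partial F for the 3 added variables:",
      round(partial_f(R2_THREE, R2_SIX, N_HYPOTHETICAL, 6, 3), 3))
```

Because R2 can never decrease when variables are added, a small rise such as 0.3635 to 0.3678 is weak evidence on its own; adjusted R2 and the partial F statistic account for the three extra parameters.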
