
Chapter 16: Regression Analysis: Model Building

Textbook Exercises:
1. Consider the following data for two variables, X and Y.

a. Develop an estimated regression equation for the data of the form


ŷ = b0 + b1x.

b. Use the results from part (a) to test for a significant relationship between
X and Y. Use α = 0.05.

c. Develop a scatter diagram for the data. Does the scatter diagram suggest an
estimated regression equation of the form ŷ = b0 + b1x + b2x²? Explain.

d. Develop an estimated regression equation for the data of the form


ŷ = b0 + b1x + b2x².

e. Refer to part (d). Is the relationship between X, X², and Y significant? Use α = 0.05.

f. Predict the value of Y when X = 25.

2. Consider the following data for two variables, X and Y.

a. Develop an estimated regression equation for the data of the form


ŷ = b0 + b1x. Comment on the adequacy of this equation for predicting Y.
b. Develop an estimated regression equation for the data of the form
ŷ = b0 + b1x + b2x².
Comment on the adequacy of this equation for predicting Y.
c. Predict the value of Y when X = 20.

3. Consider the following data for two variables, X and Y.

a. Does there appear to be a linear relationship between X and Y? Explain.


b. Develop the estimated regression equation relating X and Y.
c. Plot the standardized residuals versus ŷ for the estimated regression equation
developed in part (b). Do the model assumptions appear to be satisfied? Explain.
d. Perform a logarithmic transformation on the dependent variable Y. Develop an
estimated regression equation using the transformed dependent variable. Do the
model assumptions appear to be satisfied by using the transformed dependent
variable? Does a reciprocal transformation work better in this case? Explain.

4. The table below lists the number of people (millions) living with HIV globally
from 2013 to 2017 (www.avert.org/global-hiv-and-aids-statistics).

Number of people
Year living with HIV (m)
2013 35.2
2014 35.9
2015 36.7
2016 36.7
2017 36.9

a. Plot the data, letting X = 0 correspond to the year 2013. Find a linear function
ŷ = b0 + b1x that models the data.
b. Plot the function on the graph with the data and determine how well the graph fits the
data.

5. In working further with the problem of exercise 4, statisticians suggested the use of a
curvilinear estimated regression equation of the form ŷ = b0 + b1x + b2x².
a. Use the data of exercise 4 to determine the estimated regression equation.
b. Use α = 0.01 to test for a significant relationship.

6. An international study of life expectancy by Ross (1994) covers the following variables:

LifeExp Life expectancy in years


People.per.TV Average number of people per TV
People.per.Dr Average number of people per physician
LifeExp.Male Male life expectancy in years
LifeExp.Female Female life expectancy in years

With data details as follows:


(Note that the average number of people per TV is not given for Tanzania and Zaire.)

a. Develop scatter diagrams for these data, treating LifeExp as the dependent variable.
b. Does a simple linear model appear to be appropriate? Explain.
c. Estimate simple regression equations for the data accordingly. Which do you prefer
and why?

7. To assess the reliability of computer media, Choice magazine (www.choice.com.au) has
obtained data on the following variables:

price the price (A$) paid in April 2005


pack the number of disks in the pack
media one of CD (CD), DVD (DVD-R) or DVDRW (DVD+/-RW)

with details as follows:

a. Develop scatter diagrams for these data with pack and media as potential
independent variables.
b. Does a simple or multiple linear regression model appear to be appropriate?
c. Develop an estimated regression equation for the data you believe will best explain
the relationship between these variables.

8. In Europe the number of Internet users varies widely from country to country. In 1999, 44.3
per cent of all Swedes used the Internet, while in France the audience was less than 10 per
cent. The disparities are expected to persist even though Internet usage is expected to grow
dramatically over the next several years. The following table shows the number of Internet
users in 2011 and in 2018 for selected European countries. (www.internetworldstats.com/)

Country 2011 2018
Austria 74.8 87.9
Belgium 81.4 94.4
Denmark 89.0 96.9
Finland 88.6 94.3
France 77.2 92.6
Germany 82.7 96.2
Ireland 66.8 92.7
Netherlands 89.5 95.9
Norway 97.2 99.2
Spain 65.6 92.6
Sweden 92.9 96.7
Switzerland 84.2 91.0
UK 84.5 94.7
a. Develop a scatter diagram of the data using the 2011 Internet user percentage
as the independent variable. Does a simple linear regression model appear to
be appropriate? Discuss.
b. Develop an estimated multiple regression equation with X = the 2011 Internet
user percentage and X² as the two independent variables.
c. Consider the nonlinear relationship shown by equation (16.6). Use logarithms
to develop an estimated regression equation for this model.
d. Do you prefer the estimated regression equation developed in part (b) or part
(c)? Explain.

9. In a regression analysis involving 27 observations, the following estimated regression


equation was developed.

For this estimated regression equation SST = 1550 and SSE = 520.

a. At α = 0.05, test whether X1 is significant.


Suppose that variables X2 and X3 are added to the model and the following regression
equation is obtained.

For this estimated regression equation SST = 1550 and SSE = 100.
b. Use an F test and a 0.05 level of significance to determine whether X2 and X3
contribute significantly to the model.

10. In a regression analysis involving 30 observations, the following estimated regression


equation was obtained.

For this estimated regression equation SST = 1805 and SSR = 1760.

a. At α = 0.05, test the significance of the relationship among the variables.


Suppose variables X1 and X4 are dropped from the model and the following estimated
regression equation is obtained.

For this model SST = 1805 and SSR = 1705.

b. Compute SSE(x1, x2, x3, x4).


c. Compute SSE(x2, x3).
d. Use an F test and a 0.05 level of significance to determine whether X1 and X4
contribute significantly to the model.

11. In an experiment involving measurements of Heat Production (calories) at various Body


Masses (kgs) and Work levels (Calories/hour) on a stationary bike, the following results were
obtained:
Body Mass (M) Work level (W) Heat production (H)
43.7 19 177
43.7 43 279
43.7 56 346
54.6 13 160
54.6 19 193
54.6 43 280
54.6 56 335
55.7 13 169
55.7 26 212
55.7 34.5 244
55.7 43 285
58.8 13 181
58.8 43 298
60.5 19 212
60.5 43 317
60.5 56 347
61.9 13 186
61.9 19 216
61.9 34.5 265
61.9 43 306
61.9 56 348
66.7 13 209
66.7 43 324
66.7 56 352
a. Develop an estimated regression equation that can be used to predict Heat production
for a given Body Mass and Work level.
b. Consider adding an independent variable to the model developed in part (a) for the
interaction between Body Mass and Work level. Develop an estimated regression
equation using these three independent variables.
c. At a 0.05 level of significance, test to see whether the addition of the interaction term
contributes significantly to the estimated regression equation developed in part (a).

12. Failure data obtained in the course of the development of a silver-zinc battery for a NASA
programme were analyzed by Sidik, Leibecki and Bozek in 1980. Relevant variables were as
follows:

Adopting ln(y) as the response variable, a number of regression models were estimated for the
data using MINITAB:
a. Explain this computer output, carrying out any additional tests you think necessary
or appropriate.
b. Is the first model significantly better than the second?
c. Which model do you prefer and why?

13. A section of MINITAB output from an analysis of data relating to truck exhaust
emissions under different atmospheric conditions (Hare and Bradow, 1977) is as follows:
Variables used in this analysis are defined as follows:
Nox Nitrous oxides, NO and NO2, (grams/km)
Humi Humidity (grains H2O/lbm dry air)
Temp Temperature (°F)
HT humi × temp

a. Provide a descriptive summary of this information, carrying out any further calculations or
statistical tests you think relevant or necessary.
b. It has been argued that the inclusion of quadratic terms
HH = humi × humi
TT = temp × temp

on the right-hand side of the model will lead to a significantly improved R-square outcome.
Details of the revised analysis are shown below. Is the claim justified?
14. Brownlee (1965) presents stack loss data for a chemical plant involving 21 observations
on four variables, namely:
Airflow: Flow of cooling air
Temp: Cooling Water Inlet Temperature
Acid: Concentration of acid [per 1000, minus 500]
Loss: Stack loss (the dependent variable) is 10 times the percentage of the ingoing
ammonia to the plant that escapes from the absorption column unabsorbed; that
is, an (inverse) measure of the over-all efficiency of the plant

Loss Airflow Temp Acid


42 80 27 89
37 80 27 88
37 75 25 90
28 62 24 87
18 62 22 87
18 62 23 87
19 62 24 93
20 62 24 93
15 58 23 87
14 58 18 80
14 58 18 89
13 58 17 88
11 58 18 82
12 58 19 93
8 50 18 89
7 50 18 86
8 50 19 72
8 50 19 79
9 50 20 80
15 56 20 82
15 70 20 91
Develop an estimated regression equation that can be used to predict loss. Briefly discuss
the process you used to develop a recommended estimated regression equation for these
data.
15. A study investigated the relationship between audit delay (Delay), the length of time
from a company’s fiscal year-end to the date of the auditor’s report, and variables that
describe the client and the auditor. Some of the independent variables that were included
in this study follow.

Industry A dummy variable coded 1 if the firm was an industrial company or 0 if


the firm was a bank, savings and loan, or insurance company.
Public A dummy variable coded 1 if the company was traded on an organized
exchange or over the counter; otherwise coded 0.
Quality A measure of overall quality of internal controls, as judged by the auditor,
on a five-point scale ranging from “virtually none” (1) to “excellent” (5).
Finished A measure ranging from 1 to 4, as judged by the auditor, where 1 indicates
“all work performed subsequent to year-end” and 4 indicates “most work
performed prior to year-end.”

A sample of 40 companies provided the following data.

Delay Industry Public Quality Finished


62 0 0 3 1
45 0 1 3 3
54 0 0 2 2
71 0 1 1 2
91 0 0 1 1
62 0 0 4 4
61 0 0 3 2
69 0 1 5 2
80 0 0 1 1
52 0 0 5 3
47 0 0 3 2
65 0 1 2 3
60 0 0 1 3
81 1 0 1 2
73 1 0 2 2
89 1 0 2 1
71 1 0 5 4
76 1 0 2 2
68 1 0 1 2
68 1 0 5 2
86 1 0 2 2
76 1 1 3 1
67 1 0 2 3
57 1 0 4 2
55 1 1 3 2
54 1 0 5 2
69 1 0 3 3
82 1 0 5 1
94 1 0 1 1
74 1 1 5 2
75 1 1 4 3
69 1 0 2 2
71 1 0 4 4
79 1 0 5 2
80 1 0 1 4
91 1 0 4 1
92 1 0 1 4
46 1 1 4 3
72 1 0 5 2
85 1 0 5 1

a. Develop the estimated regression equation using all of the independent variables.
b. Did the estimated regression equation developed in part (a) provide a good fit?
Explain.
c. Develop a scatter diagram showing Delay as a function of Finished. What does this
scatter diagram indicate about the relationship between Delay and Finished?
d. On the basis of your observations about the relationship between Delay and Finished,
develop an alternative estimated regression equation to the one developed in (a) to
explain as much of the variability in Delay as possible.

16. In a study of car ownership in 24 countries, data (OECD, 1982) have been collected on the
following variables:

ao cars per person


pop population (millions)
den population density
gdp per capita income ($)
pr petrol price (cents per litre)
con petrol consumption (tonnes per car per year)
tr bus and rail use (passenger km per person)

Selective results from a linear modelling analysis (ao is the dependent variable) are as
follows:
a. Which of the various model options considered here do you prefer and why?
b. Corresponding stepwise output from MINITAB terminates after two stages, gdp
being the first independent variable selected and pr the second. How does this latest
information reconcile with that summarized earlier?
c. Does it alter your inferences for a. in any way? If so, why, and if not, why not?

17. In a regression analysis of data from a cloud-seeding experiment (Hand et al, 1994)
relevant variables are defined thus:

x1 = 1, seeding or 0, no seeding
x2 = number of days since the experiment began
x3 = seeding suitability factor
x4 = percent cloud cover
x5 = total rainfall on target area one hour before seeding
x6 = 1, moving radar echo or 2, a stationary radar echo
y = amount of rain (cubic metres * 107) that fell in target area for a 6 hour
period on each day seeding was suitable

Corresponding MINITAB results are as follows:


Correlation: x1, x2, x3, x4, x5, x6, y

x1 x2 x3 x4 x5 x6
x2 0.030
0.888

x3 0.177 0.451
0.408 0.027

x4 0.062 -0.350 -0.151


0.773 0.094 0.481

x5 -0.030 -0.269 0.040 0.648


0.889 0.204 0.854 0.001

x6 -0.103 -0.218 -0.186 -0.019 -0.257


0.633 0.305 0.384 0.928 0.225

y 0.076 -0.496 -0.408 0.270 0.174 0.332


0.724 0.014 0.048 0.202 0.417 0.113
Cell Contents: Pearson correlation
P-Value

MODEL 1

Regression Analysis: y versus x2, x3, x4, x5, x1, x6

Method

Categorical predictor coding (1, 0)

Analysis of Variance

Source DF Adj SS Adj MS F-Value P-Value


Regression 6 85.584 14.2640 1.77 0.165
x2 1 9.925 9.9255 1.23 0.282
x3 1 11.827 11.8268 1.47 0.242
x4 1 0.019 0.0195 0.00 0.961
x5 1 3.595 3.5951 0.45 0.513
x1 1 5.703 5.7029 0.71 0.411
x6 1 15.159 15.1589 1.88 0.188
Error 17 136.751 8.0442
Total 23 222.335

Model Summary

S R-sq R-sq(adj) R-sq(pred)


2.83623 38.49% 16.78% 0.00%

Coefficients

Term Coef SE Coef T-Value P-Value VIF


Constant 6.82 2.45 2.78 0.013
x2 -0.0321 0.0289 -1.11 0.282 1.53
x3 -0.911 0.751 -1.21 0.242 1.39
x4 0.006 0.115 0.05 0.961 1.95
x5 1.84 2.76 0.67 0.513 2.12
x1
1 1.01 1.20 0.84 0.411 1.08
x6
2 2.17 1.58 1.37 0.188 1.23

Regression Equation

x1 x6
0 1 y = 6.82 - 0.0321 x2 - 0.911 x3 + 0.006 x4 + 1.84 x5

0 2 y = 8.99 - 0.0321 x2 - 0.911 x3 + 0.006 x4 + 1.84 x5

1 1 y = 7.84 - 0.0321 x2 - 0.911 x3 + 0.006 x4 + 1.84 x5

1 2 y = 10.00 - 0.0321 x2 - 0.911 x3 + 0.006 x4 + 1.84 x5

Fits and Diagnostics for Unusual Observations

Obs y Fit Resid Std Resid


1 12.85 7.98 4.87 2.08 R
2 5.52 7.89 -2.37 -2.28 R
7 0.47 5.65 -5.18 -2.18 R
15 11.86 5.05 6.81 2.72 R

R Large residual

Durbin-Watson Statistic

Durbin-Watson Statistic = 1.44792


By allowing for possible interactions between x1 and x3, x4, x5 and x6 new modelling results
were obtained - with details given below:

MODEL 2
Regression Analysis: y versus x2, x3, x4, x5, x1x5, x1x3, x1x4, x1, x6, x1x6

The following terms cannot be estimated and were removed:


x1x6

Method

Categorical predictor coding (1, 0)

Analysis of Variance

Source DF Adj SS Adj MS F-Value P-Value


Regression 9 158.926 17.658 3.90 0.011
x2 1 15.522 15.522 3.43 0.085
x3 1 1.017 1.017 0.22 0.643
x4 1 18.102 18.102 4.00 0.065
x5 1 6.288 6.288 1.39 0.258
x1x5 1 1.363 1.363 0.30 0.592
x1x3 1 30.787 30.787 6.80 0.021
x1x4 1 23.039 23.039 5.09 0.041
x1 1 60.852 60.852 13.44 0.003
x6 1 22.452 22.452 4.96 0.043
Error 14 63.409 4.529
Total 23 222.335

Model Summary

S R-sq R-sq(adj) R-sq(pred)


2.12819 71.48% 53.15% 0.00%
Coefficients

Term Coef SE Coef T-Value P-Value VIF


Constant -0.14 2.53 -0.06 0.955
x2 -0.0447 0.0242 -1.85 0.085 1.89
x3 0.373 0.787 0.47 0.643 2.71
x4 0.402 0.201 2.00 0.065 10.58
x5 3.84 3.26 1.18 0.258 5.27
x1x5 -2.21 4.04 -0.55 0.592 7.94
x1x3 -3.19 1.22 -2.61 0.021 24.17
x1x4 -0.501 0.222 -2.26 0.041 15.16
x1
1 15.53 4.24 3.67 0.003 23.79
x6
2 2.85 1.28 2.23 0.043 1.44

Regression Equation

x1 x6
0 1 y = -0.14 - 0.0447 x2 + 0.373 x3 + 0.402 x4 + 3.84 x5 - 2.21 x1x5 - 3.19 x1x3
- 0.501 x1x4

0 2 y = 2.71 - 0.0447 x2 + 0.373 x3 + 0.402 x4 + 3.84 x5 - 2.21 x1x5 - 3.19 x1x3


- 0.501 x1x4

1 1 y = 15.39 - 0.0447 x2 + 0.373 x3 + 0.402 x4 + 3.84 x5 - 2.21 x1x5 - 3.19 x1x3


- 0.501 x1x4

1 2 y = 18.24 - 0.0447 x2 + 0.373 x3 + 0.402 x4 + 3.84 x5 - 2.21 x1x5 - 3.19 x1x3


- 0.501 x1x4

Fits and Diagnostics for Unusual Observations


Obs y Fit Resid Std Resid
1 12.85 9.80 3.05 2.12 R
15 11.86 7.46 4.40 2.57 R

R Large residual

Durbin-Watson Statistic = 2.26671

a. Is the second model significantly more effective than the first?


b. From a broad comparison of the two models, which would you choose for forecasting
rainfall in the target area? Give your reasons for and against.

18. Refer to the data in exercise 15.


a. Develop an estimated regression equation that can be used to predict Delay by using
Industry and Quality.
b. Plot the residuals obtained from the estimated regression equation developed in part
(a) as a function of the order in which the data are presented. Does any
autocorrelation appear to be present in the data? Explain.
c. At the 0.05 level of significance, test for any positive autocorrelation in the data.
Chapter 16: Regression Analysis: Model Building

Textbook Exercises Solutions:


1. a. The Minitab output is shown below:

The regression equation is


Y = - 6.8 + 1.23 X

Predictor Coef SE Coef T p


Constant -6.77 14.17 -0.48 0.658
X 1.2296 0.4697 2.62 0.059

S = 7.269 R-sq = 63.1% R-sq(adj) = 53.9%

Analysis of Variance

SOURCE DF SS MS F p
Regression 1 362.13 362.13 6.85 0.059
Residual Error 4 211.37 52.84
Total 5 573.50

b. Since the p-value corresponding to F = 6.85 is 0.059 > α = 0.05, the relationship
is not significant.

c.
-
40+ *
-
Y - * *
- *
-
30+
-
-
-
- *
20+
-
-
-
- *
10+
------+---------+---------+---------+---------+---------+X
20.0 25.0 30.0 35.0 40.0 45.0

The scatter diagram suggests that a curvilinear relationship may be appropriate.

d. The Minitab output is shown below:

The regression equation is


Y = - 169 + 12.2 X - 0.177 XSQ

Predictor Coef SE Coef T p


Constant -168.88 39.79 -4.24 0.024
X 12.187 2.663 4.58 0.020
XSQ -0.17704 0.04290 -4.13 0.026

S = 3.248 R-sq = 94.5% R-sq(adj) = 90.8%

Analysis of Variance

SOURCE DF SS MS F p
Regression 2 541.85 270.92 25.68 0.013
Residual Error 3 31.65 10.55
Total 5 573.50

e. Since the p-value corresponding to F = 25.68 is 0.013 < α = 0.05, the
relationship is significant.

f. ŷ = -168.88 + 12.187(25) - 0.17704(25)² = 25.145
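The prediction in part (f) can be reproduced directly from the fitted coefficients; the
following is a minimal Python sketch (Python is not used in these solutions, it simply
mirrors the Minitab figures above):

```python
# Evaluate the estimated quadratic regression equation from part (d) at x = 25.
# Coefficients are taken from the Minitab output above.
b0, b1, b2 = -168.88, 12.187, -0.17704

def predict(x):
    """Return the fitted value b0 + b1*x + b2*x^2."""
    return b0 + b1 * x + b2 * x ** 2

print(round(predict(25), 3))  # 25.145
```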


2. a. The Minitab output is shown below:

The regression equation is


Y = 9.32 + 0.424 X

Predictor Coef SE Coef T p


Constant 9.315 4.196 2.22 0.113
X 0.4242 0.1944 2.18 0.117

S = 3.531 R-sq = 61.4% R-sq(adj) = 48.5%

Analysis of Variance

SOURCE DF SS MS F p
Regression 1 59.39 59.39 4.76 0.117
Residual Error 3 37.41 12.47
Total 4 96.80

The high p-value (0.117) indicates a weak relationship; note that 61.4% of the
variability in y has been explained by x.

b. The Minitab output is shown below:

The regression equation is


Y = - 8.10 + 2.41 X - 0.0480 XSQ

Predictor Coef SE Coef T p


Constant -8.101 4.104 -1.97 0.187
X 2.4127 0.4409 5.47 0.032
XSQ -0.04797 0.01050 -4.57 0.045

S = 1.279 R-sq = 96.6% R-sq(adj) = 93.2%

Analysis of Variance

SOURCE DF SS MS F p
Regression 2 93.529 46.765 28.60 0.034
Residual Error 2 3.271 1.635
Total 4 96.800

At the 0.05 level of significance, the relationship is significant; the fit is


excellent.

c. ŷ = -8.101 + 2.4127(20) - 0.04797(20)² = 20.965

3. a. The scatter diagram shows some evidence of a possible linear


relationship.

b. The Minitab output is shown below:

The regression equation is


Y = 2.32 + 0.637 X

Predictor Coef SE Coef T p


Constant 2.322 1.887 1.23 0.258
X 0.6366 0.3044 2.09 0.075

S = 2.054 R-sq = 38.5% R-sq(adj) = 29.7%

Analysis of Variance

SOURCE DF SS MS F p
Regression 1 18.461 18.461 4.37 0.075
Residual Error 7 29.539 4.220
Total 8 48.000

c. The following standardized residual plot indicates that the constant


variance assumption is not satisfied.

-
- *
-
1.2+ *
-
-
- *
- * *
0.0+
-
- * *
-
-
-1.2+
- * *
-
-
+---------+---------+---------+---------+---------+------YHAT
3.0 4.0 5.0 6.0 7.0 8.0

d. The logarithmic transformation does not appear to eliminate the wedged-shaped


pattern in the above residual plot. The reciprocal transformation does,
however, remove the wedge-shaped pattern. Neither transformation provides a
good fit. The Minitab output for the reciprocal transformation and the
corresponding standardized residual plot are shown below.

The regression equation is


1/Y = 0.275 - 0.0152 X

Predictor Coef SE Coef T p


Constant 0.27498 0.04601 5.98 0.000
X -0.015182 0.007421 -2.05 0.080

S = 0.05009 R-sq = 37.4% R-sq(adj) = 28.5%

Analysis of Variance

SOURCE DF SS MS F p
Regression 1 0.010501 0.010501 4.19 0.080
Residual Error 7 0.017563 0.002509
Total 8 0.028064

- *
-
-
-
1.0+ *
- *
-
-
- *
0.0+ *
-
-
- * *
-
-1.0+
- * *
-
--+---------+---------+---------+---------+---------+----YHAT
0.140 0.160 0.180 0.200 0.220 0.240
4. a./b.

The proposed linear function looks to be a fairly good fit from the plot above. The
high R² of 86.13% appears to corroborate this viewpoint.
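The quoted R² of 86.13% can be verified with a short hand calculation of the
least-squares fit; a Python sketch using the data from the table in exercise 4
(with x = 0 for 2013):

```python
# Simple linear regression of people living with HIV (millions) on coded year.
xs = [0, 1, 2, 3, 4]                       # 2013..2017, with x = 0 for 2013
ys = [35.2, 35.9, 36.7, 36.7, 36.9]

n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
sxx = sum((x - xbar) ** 2 for x in xs)
b1 = sxy / sxx                             # slope
b0 = ybar - b1 * xbar                      # intercept
sst = sum((y - ybar) ** 2 for y in ys)
r2 = b1 * sxy / sst                        # SSR / SST

print(round(b0, 2), round(b1, 2), round(100 * r2, 2))  # 35.44 0.42 86.13
```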

5. The Minitab output is shown below:

The regression equation is


Y = 433 + 37.4 X - 0.383 XSQ

Predictor Coef SE Coef T p


Constant 432.6 141.2 3.06 0.055
X 37.429 7.807 4.79 0.017
XSQ -0.3829 0.1036 -3.70 0.034

S = 15.83 R-sq = 98.0% R-sq(adj) = 96.7%

Analysis of Variance

SOURCE DF SS MS F p
Regression 2 36643 18322 73.15 0.003
Residual Error 3 751 250
Total 5 37395
a. Since the linear relationship was significant (Exercise 4), this relationship must
be significant. Note also that since the p-value of 0.003 < α = 0.05, we can
reject H0.

b. The fitted value is 1302.01, with a standard deviation of 9.93. The 95%
confidence interval is 1270.41 to 1333.61; the 95% prediction interval is
1242.55 to 1361.47.

6. a. The scatter diagrams are shown below:

[Four scatter diagrams, each with LifeExp (0-90) on the vertical axis: LifeExp versus
People.per.TV (0-700), LifeExp versus People.per.Dr (0-40 000), LifeExp versus
LifeExp.Male (50-90) and LifeExp versus LifeExp.Female (45-80).]

b. The relationship between LifeExp and LifeExp.Male is almost
perfectly linear, as is the relationship between LifeExp and
LifeExp.Female. This is only to be expected, since the
LifeExp.Male and LifeExp.Female values directly make up the
corresponding LifeExp ones. In these circumstances, using either
LifeExp.Female or LifeExp.Male as a predictor of LifeExp would
make no real sense from a causal regression standpoint.

The other two variables (People.per.TV and People.per.Dr) do not,
from the first two scattergrams above, look wholly convincing
predictors of LifeExp in linear modelling terms (a hyperbolic or
negative exponential fit might be more convincing). Nevertheless, as
part c. shows, significant linear regression models can be obtained in
each case.
c.

ŷ = 69.648 - 0.036 People.per.TV R² = 0.367

ŷ = 69.902 - 0.0007 People.per.Dr R² = 0.444

The latter equation looks marginally better from an R square point of view.
However, neither model looks particularly impressive against their respective
scattergrams shown earlier.

7. a. The scatter diagrams are shown below:


Note that for CD, the dummy variables DVD = 0 and DVDRW = 0; for DVD, DVD = 1 and
DVDRW = 0; and for DVDRW, DVD = 0 and DVDRW = 1.

b. Yes, the scattergrams in a. suggest that a regression model is likely to hold.
c. From the selective SPSS (stepwise regression) output below, a number of
significant regression models can be found for the data.

In particular, the three-predictor model:

ŷ = 1.327 + 2.157 DVDRW + 0.786 DVD - 0.24 pack

8. a. The scatter diagram is shown below:


A simple linear regression model does not look particularly appropriate from
this plot. The coefficient of determination result of 48.72% only seems to
support this conclusion.

b.
Regression Analysis: y versus x, x^2

Analysis of Variance

Source DF Adj SS Adj MS F-Value P-Value


Regression 2 63.74 31.869 8.11 0.008
x 1 10.53 10.528 2.68 0.133
x^2 1 13.56 13.556 3.45 0.093
Error 10 39.27 3.927
Total 12 103.01

Model Summary

S R-sq R-sq(adj) R-sq(pred)


1.98174 61.87% 54.25% 45.25%
Coefficients

Term Coef SE Coef T-Value P-Value VIF


Constant 149.5 39.7 3.77 0.004
x -1.625 0.993 -1.64 0.133 271.01
x^2 0.01143 0.00615 1.86 0.093 271.01

Regression Equation

y = 149.5 - 1.625 x + 0.01143 x^2

Fits and Diagnostics for Unusual Observations

Obs y Fit Resid Std Resid


1 87.90 91.92 -4.02 -2.22 R

R Large residual

Though the overall regression is significant according to the F value in the ANOVA table, the
coefficients for x and x² in the preceding output are not individually significant (respective
p-values of 0.133 and 0.093 are both > α = 0.05).
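The very large VIF of 271.01 reported for both predictors can be illustrated directly
from the data: with only x and x² in the model, each VIF equals 1/(1 - r²), where r is
the correlation between x and x². A Python sketch (x values are the 2011 percentages
from the table in the exercise):

```python
# Demonstrate the multicollinearity between x and x^2 in the part (b) model.
import math

x = [74.8, 81.4, 89.0, 88.6, 77.2, 82.7, 66.8,
     89.5, 97.2, 65.6, 92.9, 84.2, 84.5]        # 2011 percentages
xsq = [v ** 2 for v in x]

def corr(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(va * vb)

r = corr(x, xsq)
vif = 1 / (1 - r ** 2)    # with two predictors, both VIFs equal this value
print(round(vif, 2))      # approximately 271, as in the output above
```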

c.
Regression Analysis: lny versus x

Analysis of Variance

Source DF Adj SS Adj MS F-Value P-Value


Regression 1 0.005636 0.005636 10.08 0.009
x 1 0.005636 0.005636 10.08 0.009
Error 11 0.006149 0.000559
Total 12 0.011785
Model Summary

S R-sq R-sq(adj) R-sq(pred)


0.0236437 47.82% 43.08% 24.82%

Coefficients

Term Coef SE Coef T-Value P-Value VIF


Constant 4.3566 0.0598 72.83 0.000
x 0.002284 0.000719 3.18 0.009 1.00

Regression Equation

lny = 4.3566 + 0.002284 x

Fits and Diagnostics for Unusual Observations

Obs lny Fit Resid Std Resid


1 4.4762 4.5275 -0.0513 -2.33 R

R Large residual

From the above, the regression of ln(y) on x yields a significant fit, with p-values < 0.05
for both regression coefficients.

d. The estimated regression equation in part (c) is preferred even though it has a lower
R-square of 47.82%, because the multiple regression model in part (b) seems to be
suffering from multicollinearity.
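The part (c) log-linear model can be reproduced with a simple least-squares calculation
on ln(y); a Python sketch using the 13 country observations from the exercise
(x = 2011 percentage, y = 2018 percentage):

```python
# Fit lny = b0 + b1*x by ordinary least squares, then back-transform a prediction.
import math

x = [74.8, 81.4, 89.0, 88.6, 77.2, 82.7, 66.8,
     89.5, 97.2, 65.6, 92.9, 84.2, 84.5]        # 2011 percentages
y = [87.9, 94.4, 96.9, 94.3, 92.6, 96.2, 92.7,
     95.9, 99.2, 92.6, 96.7, 91.0, 94.7]        # 2018 percentages
lny = [math.log(v) for v in y]

n = len(x)
xbar, lbar = sum(x) / n, sum(lny) / n
b1 = (sum((u - xbar) * (v - lbar) for u, v in zip(x, lny))
      / sum((u - xbar) ** 2 for u in x))
b0 = lbar - b1 * xbar

print(round(b0, 4), round(b1, 6))               # 4.3566 0.002284

# Back-transform to predict a 2018 percentage from, say, a 2011 value of 80:
print(round(math.exp(b0 + b1 * 80), 1))
```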

9. a. SSR = SST - SSE = 1550 - 520 = 1030

MSR = 1030/1 = 1030 MSE = 520/25 = 20.8 F = 1030/20.8 = 49.52


F.05 = 4.24 (1 numerator and 25 denominator DF)

Since 49.52 > 4.24 we reject H0: β1 = 0 and conclude that x1 is significant.

b.

F = [(520 - 100)/2] / [100/(27 - 3 - 1)] = 210/4.348 = 48.3

F.05 = 3.42 (2 degrees of freedom numerator and 23 denominator)

Since 48.3 > 3.42 the addition of variables x2 and x3 is statistically significant
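The F value of 48.3 in part (b) comes from the partial F formula for added variables,
comparing the SSE of the reduced model (x1 only) with that of the full model; a quick
Python sketch of the calculation:

```python
# Partial F test for adding x2 and x3 (n = 27 observations).
n = 27
sse_reduced = 520.0    # SSE with x1 only
sse_full = 100.0       # SSE with x1, x2, x3
p, q = 3, 1            # number of predictors in the full and reduced models

f = ((sse_reduced - sse_full) / (p - q)) / (sse_full / (n - p - 1))
print(round(f, 1))     # 48.3, which exceeds F.05(2, 23) = 3.42
```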

10. a. SSE = SST - SSR = 1805 - 1760 = 45

MSR = 1760/4 = 440 MSE =45/25 = 1.8

F = 440/1.8 = 244.44

F.05 = 2.76 (4 degrees of freedom numerator and 25 denominator)

Since 244.44 > 2.76, variables x1 and x4 contribute significantly to the model

b. SSE(x1, x2, x3, x4) = 45

c. SSE(x2, x3) = 1805 - 1705 = 100

d.

F = [(100 - 45)/2] / [45/(30 - 4 - 1)] = 27.5/1.8 = 15.28

F.05 = 3.39 (2 numerator and 25 denominator DF)

Since 15.28 > 3.39 we conclude that x1 and x4 contribute significantly to the
model.

11. a. The Minitab output is shown below:

The regression equation is


H = 127 + 2.21 W + 0.0297 M
Predictor Coef SE Coef T P VIF
Constant 126.588 6.393 19.80 0.000
W 2.2115 0.6129 3.61 0.002 13.007
M 0.02974 0.01024 2.90 0.009 13.007

S = 13.3340 R-Sq = 96.3% R-Sq(adj) = 95.9%

Analysis of Variance

Source DF SS MS F P
Regression 2 96157 48079 270.41 0.000
Residual Error 21 3734 178
Total 23 99891

Source DF Seq SS
W 1 94659
M 1 1499

Unusual Observations

Obs W H Fit SE Fit Residual St Resid


3 56.0 346.00 323.21 9.15 22.79 2.35RX
22 13.0 209.00 181.12 4.66 27.88 2.23R

R denotes an observation with a large standardized residual.


X denotes an observation whose X value gives it large leverage.

Ominously, the VIFs above are both greater than 10, indicating a potential
multicollinearity problem. Corresponding correlation results below support this
assessment.
b. The Minitab output is shown below:

The regression equation is


H = 121 + 2.33 W + 0.0354 M - 0.000114 M*W

Predictor Coef SE Coef T P VIF


Constant 120.76 14.73 8.20 0.000
W 2.3308 0.6810 3.42 0.003 15.442
M 0.03538 0.01652 2.14 0.045 32.539
M*W -0.0001141 0.0002587 -0.44 0.664 36.221

S = 13.5973 R-Sq = 96.3% R-Sq(adj) = 95.7%

Analysis of Variance

Source DF SS MS F P
Regression 3 96193 32064 173.43 0.000
Residual Error 20 3698 185
Total 23 99891

Source DF Seq SS
W 1 94659
M 1 1499
M*W 1 36

Unusual Observations

Obs W H Fit SE Fit Residual St Resid


3 56.0 346.00 322.25 9.58 23.75 2.46R
22 13.0 209.00 180.46 4.99 28.54 2.26R

R denotes an observation with a large standardized residual.


Again, the VIFs are a problem here, in fact even more so.

c. Stepwise Regression: H versus M, W, M*W

Alpha-to-Enter: 0.15 Alpha-to-Remove: 0.15

Response is H on 3 predictors, with N = 24

Step 1 2
Constant 126.6 126.6

W 3.92 2.21
T-Value 19.95 3.61
P-Value 0.000 0.002

M 0.030
T-Value 2.90
P-Value 0.009

S 15.4 13.3
R-Sq 94.76 96.26
R-Sq(adj) 94.52 95.91
Mallows Cp 8.3 2.2

Because the interaction term M*W has not been loaded into the model in the
latter Stepwise Regression we deduce it does not contribute significantly to the
model.
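The stepwise conclusion can be cross-checked with a partial F test on the interaction
term, using the error sums of squares from the two ANOVA tables in parts (a) and (b);
a Python sketch:

```python
# Partial F test for the interaction term M*W (n = 24 observations).
n = 24
sse_reduced = 3734.0   # SSE for the model with W and M only (part a)
sse_full = 3698.0      # SSE after adding M*W (part b)
p, q = 3, 2            # predictors in the full and reduced models

f = ((sse_reduced - sse_full) / (p - q)) / (sse_full / (n - p - 1))
print(round(f, 2))     # 0.19, far below F.05(1, 20) = 4.35: not significant
```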

12. For the first model featuring the five predictors x1, x2, x3, x4 and x5, the significant F
ratio from the ANOVA table (p-value = 0.005 < α = 0.05) suggests that the overall model is
a significant fit to the data. Yet none of the individual t tests associated with the
regression slopes is significant except that for x4 (p-value = 0.005 < α = 0.05).
From the VIFs, which are all close to 1, multicollinearity would not appear to be a
problem for the data. The R-square of 66.3% indicates that the multiple regression model
explains 66.3% of the variation in the response variable, which might be regarded as quite
favourable. On the downside, the model suffers from a single outlier according to
MINITAB, but for a sample of size 20 this does not seem unreasonable. The Durbin-Watson
statistic is 1.72, but for a two-sided Durbin-Watson test the relevant dL and dU
values (based on n = 20 and k = 5 predictors) are 0.70 and 1.87. As dL < 1.72 < dU we
deduce the test is inconclusive.
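The Durbin-Watson statistic used here and in later solutions is computed from the
residuals e1, ..., en as d = Σ(et - et-1)² / Σet²; a Python sketch with hypothetical
residuals for illustration (the actual residual series is not reproduced in this output):

```python
# Durbin-Watson statistic: values near 2 suggest no first-order serial
# correlation; values near 0 suggest positive, and near 4 negative, correlation.

def durbin_watson(e):
    """d = sum((e[t] - e[t-1])^2 for t = 2..n) / sum(e[t]^2 for t = 1..n)."""
    num = sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e)))
    return num / sum(v ** 2 for v in e)

resid = [1.0, -1.0, 2.0, -2.0]   # hypothetical residuals for illustration
print(durbin_watson(resid))      # 2.9
```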

The second model is a simple regression with just x4 as the predictor. The model is
significant according to both the overall F test and the t test associated with the
regression slope for x4. As would be expected, the R-square value has dropped, in this
case to 51.7%. Again there is an outlier (observation 12 now, instead of observation 1
previously), but with a corresponding standardized residual of -2.07 this does not look
too serious.

To check whether the earlier five-predictor model is a significant improvement on this
one-predictor model, a partial F test can be undertaken. The relevant calculation, using
equation (16.11), is as follows (note that p = 5, q = 1):

F = {[SSE(x1, x2, ..., xq) - SSE(x1, x2, ..., xq, xq+1, ..., xp)] / (p - q)}
    / {SSE(x1, x2, ..., xq, xq+1, ..., xp) / (n - p - 1)}

= [(23.002 - 16.032)/4] / (16.032/14)

= 1.52 < 3.11 = F.05(4, 14)

Hence, we are not able to reject H0: β1 = β2 = β3 = β5 = 0 at a 5% significance level and
deduce that the five-predictor model is not a significant improvement on the
corresponding one-predictor equivalent.
Note that because of the ‘ln’ transformation on y the relationship between y and x4 has an
essentially exponential character despite the fact that we have effectively used a linear
modelling formulation for the analysis.

13. a. For the first model, the F ratio from the ANOVA table (p-value = 0.000 < α = 0.05) is
highly significant, which suggests the overall model offers a significant fit to the data.
Ignoring the constant, t tests for the regression slopes corresponding to the humi, temp
and HT variables are all significant (have a p-value < 5%). The R-square (coefficient of
determination) of 71.5% is favourable and suggests the multiple regression model
explains the variation in the response quite well. There is one outlier, but given the
sample size of 44 this does not seem to be especially problematic. Observation 6 is
categorized as influential and this should be investigated further. The Durbin-Watson
statistic is 1.63, but for a two-sided Durbin-Watson test the relevant dL and dU values
(based on n = 44 and k = 3 predictors) are 1.29 and 1.58. As 1.58 = dU < 1.63 < 4 - dU
= 2.42 we deduce no evidence of first-order serial correlation of residuals is present.

b. Again, according to the F ratio details provided, the second model is also significant.
However, from the corresponding t tests, only the slopes for humi and HH can be considered
significantly different from zero. In this case, however, the R square is an impressive
79.7%. There are two outliers and one (different) influential observation with this model.
The outliers do not look serious, but as before the influential observation should be
investigated. The Durbin-Watson statistic is 1.78. For n = 44 and k = 5, dL and dU are 1.29
and 1.58 respectively. As 1.58 = dU < 1.78 < 4 - dU = 2.42, we deduce there is no problem
with residuals suffering from first order serial correlation.
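The decision rule applied in both parts can be written as a small helper. This is a sketch (the function name and return strings are our own), using the tabulated dL and dU bounds quoted in the text.

```python
# Hedged sketch of the Durbin-Watson decision rule: compare the statistic
# d with the tabulated bounds dL < dU (and their mirror images 4-dU, 4-dL).
def dw_assessment(d, dL, dU):
    if d < dL:
        return "evidence of positive autocorrelation"
    if d > 4 - dL:
        return "evidence of negative autocorrelation"
    if dU <= d <= 4 - dU:
        return "no evidence of first-order autocorrelation"
    return "inconclusive"

# Part (a): d = 1.63; part (b): d = 1.78; both with dL = 1.29, dU = 1.58.
```

Values of d falling between dL and dU (or between 4 - dU and 4 - dL) leave the test inconclusive, which is why the tabulated bounds matter and not just the value 2.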

To check whether the five predictor model is a significant improvement on this three
predictor model, a partial F test can again be undertaken. The relevant calculation using
equation (16.11) is as follows (note that p = 5, q = 3):

F = [SSE(x1, x2, . . . , xq) - SSE(x1, x2, . . . , xq, xq+1, . . . , xp)] / (p - q)
    -----------------------------------------------------------------------------
             SSE(x1, x2, . . . , xq, xq+1, . . . , xp) / (n - p - 1)

  = [(0.14166 - 0.100887) / (5 - 3)] / [0.100887 / 38]

  = 7.68 > 3.25 = F.95(2,38)


Hence, we reject H0: 4 = 5 = 0 at the 5% significance level and deduce that the five predictor
model is a significant improvement on the corresponding 3 predictor alternative.

14. Correlation: Loss, Airflow, Temp, Acid


Correlations
Loss Airflow Temp
Airflow 0.920
0.000

Temp 0.876 0.782


0.000 0.000

Acid 0.400 0.500 0.391


0.073 0.021 0.080
Cell Contents
Pearson correlation
P-Value

Regression Analysis: Loss versus Airflow, Temp, Acid


Stepwise Selection of Terms
α to enter = 0.15, α to remove = 0.15
Analysis of Variance
Source DF Adj SS Adj MS F-Value P-Value
Regression 2 1880.44 940.221 89.64 0.000
Airflow 1 294.36 294.355 28.06 0.000
Temp 1 130.32 130.321 12.42 0.002
Error 18 188.80 10.489
Lack-of-Fit 17 188.30 11.076 22.15 0.166
Pure Error 1 0.50 0.500
Total 20 2069.24
Model Summary
R-
S R-sq sq(adj) R-sq(pred)
3.23862 90.88% 89.86% 85.81%
Coefficients
Term Coef SE Coef T-Value P-Value VIF
Constant -50.36 5.14 -9.80 0.000
Airflow 0.671 0.127 5.30 0.000 2.57
Temp 1.295 0.367 3.52 0.002 2.57
Regression Equation
Loss = -50.36 + 0.671 Airflow + 1.295 Temp
Fits and Diagnostics for Unusual Observations
Obs Loss Fit Resid Std Resid
21 15.00 22.53 -7.53 -2.73 R
R Large residual

Best Subsets Regression: Loss versus Airflow, Temp, Acid


Response is Loss
A
i
r
f T A
Mallow l e c
R-Sq R-Sq s o m i
Vars R-Sq (adj) (pred) Cp S w p d
1 84.6 83.8 80.7 13.3 4.0982 X
1 76.7 75.4 70.1 28.9 5.0427 X
2 90.9 89.9 85.8 2.9 3.2386 X X
2 85.1 83.4 80.8 14.4 4.1442 X X
3 91.4 89.8 85.9 4.0 3.2434 X X X

From the correlation matrix, Loss is significantly correlated with
Airflow and Temp, but not Acid, at the 5% level. Using MINITAB’s
Stepwise procedure, only Airflow and Temp are retained as predictors,
in line with the correlation analysis. Similarly, the standout model of
the five summarized in the Best Subsets output is the first of the
two variable options, as highlighted. This has a better R-sq(adj) than
the full three predictor model, a better S value, and very comparable
R-sq and Cp outcomes.
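The idea behind a best-subsets search can be sketched directly: fit ordinary least squares for every subset of the candidate predictors and rank the fits by adjusted R-square. The data below are randomly generated for illustration only; they are not the Loss/Airflow/Temp/Acid values.

```python
# Hedged sketch of a best-subsets search over all predictor combinations,
# ranked by adjusted R-square.  Data are synthetic, not the stack-loss data.
from itertools import combinations
import numpy as np

def fit_r2(X, y):
    """OLS via least squares; returns R-square."""
    Xd = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

def best_subsets(X, y, names):
    n = len(y)
    results = []
    for k in range(1, X.shape[1] + 1):
        for cols in combinations(range(X.shape[1]), k):
            r2 = fit_r2(X[:, list(cols)], y)
            adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
            results.append((tuple(names[c] for c in cols), r2, adj))
    return sorted(results, key=lambda t: -t[2])  # best adjusted R-sq first

rng = np.random.default_rng(0)
X = rng.normal(size=(21, 3))
y = 2 * X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=21)
ranking = best_subsets(X, y, ["Airflow", "Temp", "Acid"])
```

As in the MINITAB table, raw R-square can only increase as predictors are added, which is why adjusted R-square (or Cp, or S) is used to compare subsets of different sizes.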

15. a. From the correlation matrix provided, the sales response is significantly correlated
with all of the three predictors listed. However, the attract variable is also significantly
correlated with that for airplay suggesting potential problems of multicollinearity if both
variables are fitted together in a linear model.
Regression Analysis: Delay versus Industry, Public, Quality, Finished

Method

Categorical predictor coding (1, 0)

Analysis of Variance

Source DF Adj SS Adj MS F-Value P-Value


Regression 9 4794.80 532.76 8.12 0.000
Industry 1 742.53 742.53 11.31 0.002
Public 1 18.85 18.85 0.29 0.596
Quality 4 1292.66 323.16 4.92 0.004
Finished 3 1973.74 657.91 10.02 0.000
Error 30 1969.17 65.64
Lack-of-Fit 18 738.42 41.02 0.40 0.961
Pure Error 12 1230.75 102.56
Total 39 6763.97

Model Summary

S R-sq R-sq(adj) R-sq(pred)


8.10179 70.89% 62.15% 47.13%

Coefficients
Term Coef SE Coef T-Value P-Value VIF
Constant 83.98 3.83 21.91 0.000
Industry
1 9.94 2.95 3.36 0.002 1.17
Public
1 1.85 3.46 0.54 0.596 1.27
Quality
2 -3.70 4.21 -0.88 0.387 1.73
3 -16.70 4.28 -3.90 0.000 1.61
4 -12.96 4.53 -2.86 0.008 1.59
5 -8.63 3.89 -2.22 0.034 1.73
Finished
2 -16.72 3.44 -4.86 0.000 1.79
3 -20.58 4.28 -4.81 0.000 1.79
4 -9.82 4.73 -2.08 0.047 1.49

Regression Equation

Delay = 83.98 + 0.0 Industry_0 + 9.94 Industry_1 + 0.0 Public_0 + 1.85 Public_1
+ 0.0 Quality_1 - 3.70 Quality_2 - 16.70 Quality_3 - 12.96 Quality_4 - 8.63 Quality_5
+ 0.0 Finished_1 - 16.72 Finished_2 - 20.58 Finished_3 - 9.82 Finished_4

Fits and Diagnostics for Unusual Observations

Obs Delay Fit Resid Std Resid


38 46.00 62.23 -16.23 -2.40 R

R Large residual

b. Yes, though the R-sq(pred) value suggests the model may not predict new
observations as well as it fits the existing data (over-fitting may be an issue).
c.

[Scatterplot of Delay vs Finished: Delay (vertical axis, 40 to 100) plotted against
Finished (horizontal axis, 1.0 to 4.0).]

d. The relationship between Delay and Finished looks more quadratic than linear, so,
adjusting the model in a. to allow for an additional Finished_squared term, we obtain the
new regression analysis based on the stepwise method:

Regression Analysis: Delay versus Finished_squared, Industry, Public, Quality, Finished

Method

Categorical predictor coding (1, 0)

Stepwise Selection of Terms

α to enter = 0.15, α to remove = 0.15

Analysis of Variance

Source DF Adj SS Adj MS F-Value P-Value


Regression 8 4775.9 596.99 9.31 0.000
Industry 1 731.2 731.17 11.40 0.002
Quality 4 1294.6 323.65 5.05 0.003
Finished 3 2013.9 671.31 10.47 0.000
Error 31 1988.0 64.13
Lack-of-Fit 19 757.3 39.86 0.39 0.968
Pure Error 12 1230.8 102.56
Total 39 6764.0

Model Summary

S R-sq R-sq(adj) R-sq(pred)


8.00811 70.61% 63.02% 50.77%

Coefficients

Term Coef SE Coef T-Value P-Value VIF


Constant 84.13 3.78 22.26 0.000
Industry
1 9.84 2.92 3.38 0.002 1.16
Quality
2 -3.88 4.15 -0.93 0.357 1.72
3 -16.32 4.17 -3.91 0.000 1.57
4 -12.60 4.43 -2.85 0.008 1.56
5 -8.54 3.84 -2.22 0.034 1.72
Finished
2 -16.45 3.36 -4.89 0.000 1.75
3 -19.91 4.04 -4.93 0.000 1.63
4 -10.05 4.66 -2.16 0.039 1.48

Regression Equation

Delay = 84.13 + 0.0 Industry_0 + 9.84 Industry_1 + 0.0 Quality_1 - 3.88 Quality_2
- 16.32 Quality_3 - 12.60 Quality_4 - 8.54 Quality_5 + 0.0 Finished_1
- 16.45 Finished_2 - 19.91 Finished_3 - 10.05 Finished_4
Fits and Diagnostics for Unusual Observations

Obs Delay Fit Resid Std Resid


38 46.00 61.46 -15.46 -2.26 R

R Large residual

This looks a marginally better model than that in a., but ironically it does not include a term
in Finished_squared. The non-significant variable Public has also been dropped from the
earlier model.
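Fitting a quadratic of this kind is mechanical. The sketch below uses a small made-up series (an exact quadratic, not the audit-delay data) purely to show how the squared term enters the fit.

```python
# Hedged sketch: fit y = b0 + b1*x + b2*x^2 by least squares.  The data
# are invented and exactly quadratic, so the quadratic fit recovers the
# coefficients; np.polyfit returns them highest degree first.
import numpy as np

x = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
y = 80 - 30 * x + 6 * x**2

lin = np.polyfit(x, y, 1)    # straight-line fit, for comparison
quad = np.polyfit(x, y, 2)   # quadratic fit: quad = [b2, b1, b0]
```

In a package like MINITAB the equivalent step is simply creating the squared column (here Finished_squared) and offering it to the stepwise procedure alongside the original predictors.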

16.
a. From the best subsets regression summary, the 5 predictor model, with an R Square of
86.2%, is almost as good on all measures as the full 6 predictor model represented by the
bottom line of the table. The same five predictor model is described in detail after the
correlation matrix and can be seen from the ANOVA F statistic to be significant overall.
The corresponding t statistics are also significant (though technically the p-value of 0.054
associated with the regression slope for the pop variable is just slightly above the test size
of 5%).
b. Clearly multicollinearity is a problem here. This is indicated by significant correlations
between predictor variables, e.g. pr and con. Also, the sign of the coefficient of the con
predictor in the detailed regression output is opposite to that of the corresponding correlation
between con and ao.
c. Yes. In these circumstances the two predictor model from Stepwise now looks technically
more appealing.

17. a. Yes, the test for improvement yields a significant result:

F(3,14) = [(136.751 - 63.409)/(17 - 14)] / [63.409/14] = 5.40 > 3.34 = 5% critical value
for the F(3,14) distribution.

b. Model 2 is much favoured over Model 1, which seems to be handicapped by serious
multicollinearity. Model 2 also has a much higher adjusted R square.

18 a.
Regression Analysis: Delay versus Industry, Quality

Method

Categorical predictor coding (1, 0)

Analysis of Variance

Source DF Adj SS Adj MS F-Value P-Value


Regression 5 2762.0 552.40 4.69 0.002
Industry 1 1023.8 1023.75 8.70 0.006
Quality 4 1685.9 421.48 3.58 0.015
Error 34 4002.0 117.70
Lack-of-Fit 4 114.3 28.58 0.22 0.925
Pure Error 30 3887.6 129.59
Total 39 6764.0

Model Summary

S R-sq R-sq(adj) R-sq(pred)


10.8492 40.83% 32.13% 17.63%

Coefficients

Term Coef SE Coef T-Value P-Value VIF


Constant 73.33 4.21 17.43 0.000
Industry
1 11.41 3.87 2.95 0.006 1.12
Quality
2 -9.51 5.33 -1.79 0.083 1.54
3 -18.93 5.49 -3.45 0.002 1.48
4 -15.83 5.82 -2.72 0.010 1.47
5 -11.85 5.07 -2.34 0.025 1.64
Regression Equation

Delay = 73.33 + 0.0 Industry_0 + 11.41 Industry_1 + 0.0 Quality_1 - 9.51 Quality_2
- 18.93 Quality_3 - 15.83 Quality_4 - 11.85 Quality_5

Fits and Diagnostics for Unusual Observations

Obs Delay Fit Resid Std Resid


36 91.00 68.90 22.10 2.24 R
38 46.00 68.90 -22.90 -2.32 R

R Large residual

b.

[Residual Plots for Delay: four panels showing a Normal Probability Plot, Residuals
versus Fits, a Histogram of the residuals, and Residuals versus Observation Order.]

See the bottom right hand residual plot (residuals versus observation order) in particular.
Yes, this does seem to suggest possible first order serial correlation is present.

c. Following on from b. the three predictor model based on Retired, Unemployment and
Total Staff would be preferred.
d. Durbin-Watson Statistic = 1.66441

For n = 40, k = 2 and a 5% significance level, dL = 1.39 and dU = 1.60, and we have d = 1.66441 >
dU = 1.60. So we deduce from Figure 14.17 that there is no evidence of positive autocorrelation.
Chapter 16: Regression Analysis: Model Building

Supplementary Exercises:
19. A study investigated the relationship between audit delay (Delay), the length of time from
a company’s fiscal year-end to the date of the auditor's report, and variables that describe
the client and the auditor. Some of the independent variables that were included in this
study follow.

Industry A dummy variable coded 1 if the firm was an industrial company or 0 if the
firm was a bank, savings and loan, or insurance company.

Public A dummy variable coded 1 if the company was traded on an organized


exchange or over the counter; otherwise coded 0.

Quality A measure of overall quality of internal controls, as judged by the auditor, a


five-point scale ranging from "virtually none" (1) to "excellent" (5)

Finished A measure ranging from 1 to 4, as judged by the auditor, where 1 indicates


“all work performed subsequent to year-end" and 4 indicates "most work
performed prior to year-end."

A sample of 40 companies provided the following data.


Delay Industry Public Quality Finished
62 0 0 3 1
45 0 1 3 3
54 0 0 2 2
71 0 1 1 2
91 0 0 1 1
62 0 0 4 4
61 0 0 3 2
69 0 1 5 2
80 0 0 1 1
52 0 0 5 3
47 0 0 3 2
65 0 1 2 3
60 0 0 1 3
81 1 0 1 2
73 1 0 2 2
89 1 0 2 1
71 1 0 5 4
76 1 0 2 2
68 1 0 1 2
68 1 0 5 2
86 1 0 2 2
76 1 1 3 1
67 1 0 2 3
57 1 0 4 2
55 1 1 3 2
54 1 0 5 2
69 1 0 3 3
82 1 0 5 1
94 1 0 1 1
74 1 1 5 2
75 1 1 4 3
69 1 0 2 2
71 1 0 4 4
79 1 0 5 2
90 1 0 1 4
91 1 0 4 1
92 1 0 1 4
46 1 1 4 3
72 1 0 5 2
85 1 0 5 1

a. Develop the estimated regression equation using all of the independent variables

b. How well does the estimated regression equation developed in part (a) represent the
data?

c. Develop a scatter diagram showing Delay as a function of Finished. What does this
scatter diagram indicate about the relationship between Delay and Finished?

d. On the basis of your observations about the relationship between Delay and Finished,
develop an alternative estimated regression equation to the one developed in (a) to
explain as much of the variability in Delay as possible.

20. Annual data published by Conrad (1989) over a 21 year period features the following
variables:

Y = consumption of tobacco goods


X1 = real personal disposable income per capita
X2 = real price of tobacco goods
HSA, HSB, HSC = health scare dummy variables
(where HSA = 1 for years 8 and 9, 0 otherwise
HSB = 1 for years 10 and 11, 0 otherwise
HSC = 1 for years 17 and 18, 0 otherwise)

Corresponding MINITAB modelling output is as follows:

The regression equation is


lnY = 5.63 + 0.0478 HSA + 0.0149 HSB - 0.0535 HSC - 0.0126 lnX1 + 0.001 lnX2

Predictor Coef SE Coef T P


Constant 5.6307 0.2093 26.90 0.000
HSA 0.04785 0.02932 1.63 0.124
HSB 0.01485 0.03131 0.47 0.642
HSC -0.05352 0.03057 -1.75 0.100
lnX1 -0.01261 0.03274 -0.39 0.706
lnX2 0.0009 0.1133 0.01 0.994

S = 0.0382038 R-Sq = 34.8% R-Sq(adj) = 13.1%

Analysis of Variance

Source DF SS MS F P
Regression 5 0.011700 0.002340 1.60 0.219
Residual Error 15 0.021893 0.001460
Total 20 0.033593

Source DF Seq SS
HSA 1 0.004986
HSB 1 0.000710
HSC 1 0.005775
lnX1 1 0.000229
lnX2 1 0.000000

Unusual Observations

Obs HSA lnY Fit SE Fit Residual St Resid


7 0.00 5.62654 5.55090 0.01145 0.07564 2.08R

R denotes an observation with a large standardized residual.

Stepwise Regression: lnY versus HSA, HSB, HSC, lnX1, lnX2

Alpha-to-Enter: 0.15 Alpha-to-Remove: 0.15

Response is lnY on 5 predictors, with N = 21

Step 1 2
Constant 5.556 5.551

HSC -0.064 -0.059


T-Value -2.30 -2.23
P-Value 0.033 0.039

HSA 0.046
T-Value 1.75
P-Value 0.096
S 0.0372 0.0353
R-Sq 21.81 33.23
R-Sq(adj) 17.70 25.82
Mallows C-p 1.0 0.4

Best Subsets Regression: lnY versus HSA, HSB, HSC, lnX1, lnX2

Response is lnY

ll
HHHnn
Mallows SSSXX
Vars R-Sq R-Sq(adj) C-p S ABC12
1 21.8 17.7 1.0 0.037181 X
1 14.8 10.4 2.6 0.038802 X
2 33.2 25.8 0.4 0.035299 X X
2 22.8 14.2 2.8 0.037965 X X
3 34.1 22.5 2.2 0.036073 X X X
3 33.7 22.0 2.3 0.036200 X X X
4 34.8 18.5 4.0 0.036991 X X X X
4 34.2 17.7 4.1 0.037173 X X X X
5 34.8 13.1 6.0 0.038204 X X X X X

a. Comment on the effectiveness of the various models here carrying out any statistical
tests or additional analysis you think appropriate.

b. How would you advise the Tobacco Research Council who sourced the data?

21. Refer to the data in exercise 19. Consider a model in which only Industry is used to
predict Delay. At a 0.01 level of significance, test for any positive autocorrelation in the
data.

22. Refer to the data in exercise 19.


a. Develop an estimated regression equation that can be used to predict Delay by using
Industry and Quality.

b. Plot the residuals obtained from the estimated regression equation developed in part (a)
as a function of the order in which the data are presented. Does any autocorrelation
appear to be present in the data? Explain.

c. At the 0.05 level of significance, test for any positive autocorrelation in the data.

23. A regression analysis of heart disease by country (Cooper & Weekes, 1983) is based on the
following variables:
sug sugar consumption
tdp total dairy products consumption
agemp percentage employment in agriculture, fishing and forestry
ihdmr ischaemic heart disease mortality rate (RESPONSE variable)

Relevant MINITAB output for two contrasting models is given below:

MODEL 1

Regression Analysis

The regression equation is


ihdmr = 1.6 + 1.41 sug + 0.178 tdp - 2.12 agemp

Predictor Coef StDev T P


Constant 1.62 86.16 0.02 0.985
sug 1.4070 0.9752 1.44 0.167
tdp 0.1775 0.1214 1.46 0.162
agemp -2.124 2.137 -0.99 0.334

S = 58.38 R-Sq = 68.0% R-Sq(adj) = 62.3%

Analysis of Variance

Source DF SS MS F P
Regression 3 123022 41007 12.03 0.000
Residual Error 17 57950 3409
Total 20 180972

Source DF Seq SS
sug 1 114848
tdp 1 4806
agemp 1 3369

Unusual Observations
Obs sug ihdmr Fit SE Fit Residual St Resid
6 115 240.9 288.2 46.3 -47.3 -1.33 X
19 117 105.3 227.2 15.5 -121.9 -2.17R

R denotes an observation with a large standardized residual


X denotes an observation whose X value gives it large influence.

MODEL 2

Regression Analysis

The regression equation is


ihdmr = - 84.4 + 2.73 sug

Predictor Coef StDev T P


Constant -84.35 50.10 -1.68 0.109
sug 2.7255 0.4744 5.74 0.000

S = 58.99 R-Sq = 63.5% R-Sq(adj) = 61.5%

Analysis of Variance

Source DF SS MS F P
Regression 1 114848 114848 33.00 0.000
Residual Error 19 66124 3480
Total 20 180972

Unusual Observations
Obs sug ihdmr Fit SE Fit Residual St Resid
19 117 105.3 234.3 14.7 -129.0 -2.26R

R denotes an observation with a large standardized residual

Explain this computer output, carrying out any additional tests you think necessary or
appropriate. Is the first model significantly better than the second? Which model do you
prefer and why?

24. A regression analysis of UK imports (Barrow, 2001) is based on the following variables:

lnimport natural log of UK imports in real prices (£bn) (RESPONSE variable)


lngdp natural log of UK Gross Domestic Product in real value terms (£bn)
lnprice natural logarithm of the unit value index of imports
laglnimport lagged value of lnimport variable

Relevant MINITAB output is given below:

Regression Analysis: lnimport versus lngdp, lnprice

The regression equation is


lnimport = - 4.08 + 1.76 lngdp - 0.292 lnprice

Predictor Coef SE Coef T P VIF


Constant -4.078 1.514 -2.69 0.015
lngdp 1.7625 0.1644 10.72 0.000 4.9
lnprice -0.2917 0.1280 -2.28 0.035 4.9

S = 0.04093 R-Sq = 97.8% R-Sq(adj) = 97.6%

Analysis of Variance

Source DF SS MS F P
Regression 2 1.35348 0.67674 403.88 0.000
Residual Error 18 0.03016 0.00168
Total 20 1.38364
Source DF Seq SS
lngdp 1 1.34478
lnprice 1 0.00870

Unusual Observations
Obs lngdp lnimport Fit SE Fit Residual St Resid
2 5.55 4.32281 4.22598 0.01726 0.09683 2.61R

R denotes an observation with a large standardized residual

Durbin-Watson statistic = 1.09

Correlations: lnimport, lngdp, lnprice, laglnimport

lnimport lngdp lnprice


lngdp 0.986
0.000

lnprice -0.916 -0.893


0.000 0.000

laglnimp 0.977 0.954 -0.939


0.000 0.000 0.000

Cell Contents: Pearson correlation


P-Value

Stepwise Regression: lnimport versus lngdp, lnprice, laglnimport

Alpha-to-Enter: 0.15 Alpha-to-Remove: 0.15

Response is lnimport on 3 predictors, with N = 20


N(cases with missing observations) = 1 N(all cases) = 21
Step 1 2
Constant -7.605 -4.881

lngdp 2.134 1.332


T-Value 26.48 7.21
P-Value 0.000 0.000

laglnimp 0.408
T-Value 4.55
P-Value 0.000

S 0.0430 0.0297
R-Sq 97.50 98.87
R-Sq(adj) 97.36 98.74
C-p 19.6 2.1

a. Explain this computer output, carrying out any additional tests you think necessary or
appropriate.
b. Which of the various models shown do you prefer and why?

25. Tony runs a used car business. He would like to predict monthly sales. Tony believes that
sales, y (in €’0,000s), is directly related to the number of sales-people employed (x1) and the
average number of cars on the lot for sale (x2). The following data were collected over a
period of 10 months:

y x1 x2

5.8 4 20
8.1 4 25
7.5 5 15
13.3 8 30
11.4 7 25
15.0 9 35
7.0 3 17
8.3 5 20
5.1 2 18
6.8 4 23

a. Calculate the correlations between y, x1 and x2. What do these suggest?

b. Estimate the regression equation of y on x1 and x2 and calculate the corresponding


VIF’s for the independent variables.
c. What do you infer from (b)?

d. Plot the residuals by ŷ, x1 and x2 and comment on the validity of the theoretical
assumptions for regression in this case.

26. For the data in Exercise 25, use MINITAB to carry out a stepwise and best subsets
analysis.

a. Interpret the resulting computer outputs.


b. Which of the various models covered by these outputs do you most prefer and why?

27. The CEO of a computer firm is interested in funding research proposals by graduate students
who wish to perform experiments in the firm's advanced technology laboratory during the
summer months. The CEO receives 18 proposals and sends these proposals to the director
of the laboratory for evaluation. The director rates the proposals on two different criteria and
gives a score between zero and ten for each criterion, with 10 representing the best score
possible. (The variables x1 and x2 represent these two scores. The variable y (in €000s) is
the level of funding that the CEO grants for the proposal.) The collected data are given
below:

y x1 x2
9.5 8.7 9.2
7.3 8.1 8.0
6.5 7.4 7.7
8.4 8.4 8.6
8.0 8.3 8.0
6.1 7.0 7.3
8.5 8.6 8.8
7.2 8.3 7.8
5.8 6.7 7.0
6.3 7.3 7.5
9.0 8.6 9.0
6.4 7.7 7.5
7.0 7.9 7.9
7.4 8.2 8.0
8.3 8.5 8.4
8.2 8.6 7.9
5.3 6.6 6.9
6.7 7.8 7.5

The director tries to work out what the CEO will grant, given how he scores a proposal.

a. Find a 90% confidence interval for the mean value of y at x1 = 8.0 and
x2 = 7.8.

b. Find a 90% confidence interval for the value of Y at x1 = 8.0 and x2 = 7.8.

c. Interpret each of these confidence intervals: what is the difference between them?
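The distinction asked about in part (c) can be illustrated numerically. The sketch below uses made-up data (not the proposal scores above) and the standard OLS interval formulas: at a given point x0, the two half-widths differ only through the extra "1 +" under the square root, so the prediction interval for an individual Y is always wider than the confidence interval for the mean of Y.

```python
# Hedged sketch: 90% confidence interval for the mean of Y versus 90%
# prediction interval for an individual Y, evaluated at the same point x0.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(18), rng.uniform(6, 9, 18), rng.uniform(6, 9, 18)])
y = X @ np.array([1.0, 0.5, 0.4]) + rng.normal(scale=0.3, size=18)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
n, p = X.shape
resid = y - X @ beta
s2 = resid @ resid / (n - p)            # estimated error variance

x0 = np.array([1.0, 8.0, 7.8])          # point of interest (with intercept term)
h0 = x0 @ np.linalg.inv(X.T @ X) @ x0   # leverage of x0

t = stats.t.ppf(0.95, n - p)            # two-sided 90% level
ci_half = t * np.sqrt(s2 * h0)          # half-width, mean response
pi_half = t * np.sqrt(s2 * (1 + h0))    # half-width, individual response
```

The confidence interval reflects only uncertainty about the fitted mean at x0; the prediction interval adds the irreducible error variance of a single new observation, hence the extra term.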

28. For the data in Exercise 27, use MINITAB to carry out a stepwise and best subsets
analysis.

a. Interpret the resulting computer outputs.


b. Which of the various models covered by these outputs do you most prefer and why?

29. Consider the following dataset for 12 growth-orientated companies. Y represents the growth
rate of a company for the current year. X1 represents the growth rate of the company for the
previous year. X2 represents the percentage of the market that does not use the company's
product or a similar product, and X3 represents the current growth rate for the industry sector
to which the company belongs. (All values are percentages.)

Y X1 X2 X3

20 10 30 2.8
30 15 60 3.4
24 12 35 5.6
36 42 38 2.8
18 15 25 10.1
47 45 40 6.2
33 30 40 2.8
35 32 32 7.9
27 19 32 3.4
28 24 31 10.1
20 24 20 7.9
32 20 50 2.8

a. Using MINITAB, derive the sample correlations for the variables and estimate the
regression equation of Y on X1, X2 and X3. Test the significance of X1, X2 and X3 in the
model. What do you deduce?

b. Perform a stepwise analysis of the data using the backward elimination procedure.
Comment on the results obtained and compare these with the outputs from (a). Are they
consistent?

30. Chatterjee and Price (1977) present attitude data for clerical staff towards their supervisors
within a large commercial organization. Details of the variables involved in the study and
of the predictive model obtained using the MINITAB package, are as follows:

y : Overall rating of job being done by supervisor


cmplain : Handles staff complaints
prvileg : Does not allow special privileges
learn : Opportunity to learn new things
rises : Rises based on performances
critcal : Too critical of poor performances
advance : Rate of advancing to better jobs

ŷ = 10.8 + 0.613 cmplain - 0.073 prvileg + 0.320 learn + 0.082 rises + 0.038 critcal - 0.217 advance
           (0.161)         (0.136)         (0.169)       (0.222)      (0.147)        (0.178)
vif         2.7             1.6             2.3           3.1          1.2            2.0

Note that the figures in brackets here are the standard errors of the associated estimated regression
coefficients. Also, the total (corrected) sum of squares on y = 4296.97 and the sample size =
30.
The sample correlations for the data are as follows:
y cmplain prvileg learn rises critcal
cmplain 0.825
prvileg 0.426 0.558
learn 0.624 0.597 0.493
rises 0.590 0.669 0.445 0.640
critcal 0.156 0.188 0.147 0.116 0.377
advance 0.155 0.225 0.343 0.532 0.574 0.283

Results from running the MINITAB’s best subsets procedure for the data are given below:

Best Subsets Regression of y

(The six rightmost columns indicate, with an X, which of the predictors
cmplain, prvileg, learn, rises, critcal and advance appear in each model.)

            Adj.
VARS  R-sq  R-sq  C-p   s
1 68.1 67.0 1.4 6.9933 X
1 38.9 36.7 26.6 9.6835 X
2 70.8 68.6 1.1 6.8168 X X
2 68.4 66.0 3.2 7.0927 X X
3 72.6 69.4 1.6 6.7343 X X X
3 71.5 68.2 2.5 6.8630 XXX
4 72.9 68.6 3.3 6.8206 XXX X
4 72.9 68.5 3.4 6.8310 X XX X
5 73.2 67.6 5.1 6.9294 XXXX X
5 73.1 67.5 5.1 6.9396 XXX XX
6 73.3 66.3 7.0 7.0680 XXXXXX

a. Given the evidence provided here and making any additional calculations and / or
statistical tests you think necessary, how would you interpret this information?
b. What is your overall view of the model's effectiveness?

31. Pre-employment tests are widely used in many large corporations as an approach for
estimating likely job performance. In a published study, separate regression analyses (see
MODEL 2 below) were conducted for white and minority sections of a recruitment
sample. The results, given, contrast with those from a pooled analysis of the entire sample
(MODEL 1):
jperf : Job Performance
test : Pre-employment test
race : 1 if a minority applicant, 0 if a white applicant
racetest : race X test

MODEL 1
jperf = 1.03 + 2.36 test
(0.868) (0.538)
ANOVA
SOURCE df SS MS F
Regression 1 48.723 48.723 19.25
Error 18 45.568 2.532
Total 19 94.291

Note that the figures in brackets above denote the standard errors of the corresponding regression slope
estimates. Corresponding to the 'test' variable taking the value 2.5, we have the predicted jperf value,
confidence and prediction intervals as follows:

Fit 95% C.I. 95% P.I.


6.936 (5.554, 8.319) (3.318, 10.554)
MODEL 2
jperf = 2.01 - 1.91 race + 1.31 test + 2.00 racetest
(1.540) (0.670) (0.954)

Error SS = 31.655

For this alternative formulation the predicted value of jperf, corresponding to the value of 2.5 for
'test', confidence and prediction intervals shown separately for white and minority employees are
as follows:-

Fit 95% C.I. 95% P.I.


Minority 8.374 (6.681,10.068) (4.945, 11.804)
White 5.294 (3.491, 7.097) (1.809, 8.779)

a. Interpret these results, carrying out any additional calculations, tests etc. you think necessary.
b. Is MODEL 2 significantly better than MODEL 1? Depending on your answer here, what
would you say this signifies in terms of the two groups?

32. Data relating to import activity in the French economy have been analysed by Malinvaud
(1966). Details of a multiple regression model developed from these data appear below:
import : Imports
doprod : Domestic Production
stock : Stock Formation
consum : Domestic Consumption
The sample correlations for these data are as follows:
import doprod stock
import
doprod 0.984
stock 0.266 0.215
consum 0.985 0.999 0.214

import = - 19.7 + 0.032 doprod + 0.414 stock + 0.243 consum
                  (0.187)        (0.322)       (0.285)
vif               469.7          1.0           469.4

(Note the figures in brackets here are the standard errors of the corresponding regression slope
estimates.)

Estimated Error Variance, s2 = 5.10


Sample Size = 18
R Square = 97.3%
Durbin-Watson statistic = 0.24

a. Interpret these results, carrying out any additional calculations, tests etc. you think
necessary.
The VIF values here reveal major problems with multicollinearity. Thus, estimated
coefficients in the regression model as well as corresponding t tests are likely to be very
dubious.
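The VIF computation behind these figures can be sketched as follows: regress each predictor on the remaining ones and set VIF_j = 1/(1 - R_j²). The data below are synthetic, with two near-collinear columns mimicking the doprod/consum situation; they are not Malinvaud's series.

```python
# Hedged sketch of variance inflation factors: VIF_j = 1 / (1 - R_j^2),
# where R_j^2 comes from regressing predictor j on the other predictors.
import numpy as np

def vif(X):
    """VIFs for the columns of X (X should not include an intercept column)."""
    out = []
    n, k = X.shape
    for j in range(k):
        others = np.delete(X, j, axis=1)
        Xd = np.column_stack([np.ones(n), others])
        beta, *_ = np.linalg.lstsq(Xd, X[:, j], rcond=None)
        resid = X[:, j] - Xd @ beta
        tss = ((X[:, j] - X[:, j].mean()) ** 2).sum()
        r2 = 1 - resid @ resid / tss
        out.append(1 / (1 - r2))
    return out

rng = np.random.default_rng(2)
x1 = rng.normal(size=18)
x2 = rng.normal(size=18)
x3 = x1 + rng.normal(scale=0.02, size=18)   # nearly a copy of x1
vifs = vif(np.column_stack([x1, x2, x3]))   # first and third VIFs are huge
```

As in the Malinvaud output, the two nearly collinear columns produce very large VIFs while the unrelated column's VIF stays near 1, which is exactly the diagnostic pattern described above.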
b. What is your overall view of the model as a technology for predicting French Imports?
What improvements (if any) are necessary, in your opinion, before implementation of
the model is finally considered?

33. A regression analysis of data from a cloud-seeding experiment (Woodley et al; 1977)
yields the following results:-

MODEL 1

ŷ = 4.654 + 1.013 x1 - 0.032 x2 - 0.911 x3 + 0.006 x4 + 1.844 x5 + 2.168 x6
   (3.337)  (1.203)    (0.029)    (0.751)    (0.115)    (2.758)    (1.579)
where y = amount of rain (cubic metres × 10^7) that fell in the target area for a 6 hour period on
each day seeding was suitable.

x1 =1, seeding or 0, no seeding


x2 = number of days since the experiment began
x3 = seeding suitability factor
x4 = per cent cloud cover
x5 = total rainfall on target area one hour before seeding
x6 = 1, moving radar echo, or 2, stationary radar echo
(Note the bracketed figures are the standard errors of the estimated regression coefficients)

R2 = 0.385, Durbin-Watson statistic = 1.448


s2 = 8.044, Sample size = 24
Correlations
x1 x2 x3 x4 x5 x6 y
x1 1 .03 .177 .062 -.030 -.103 .076
x2 1 .451 -.350 -.269 -.218 -.496
x3 1 -.151 .040 -.186 -.408
x4 1 .648 -.019 .270
x5 1 -.257 .174
x6 1 .332
y 1
A second analysis of the data yields the alternative model:

MODEL 2

ŷ = - 3.499 + 16.245 x1 - 0.045 x2 + 0.420 x3 + 0.388 x4 + 4.108 x5 + 3.153 x6
    (4.063)   (5.522)     (0.025)    (0.845)    (0.218)    (3.601)    (1.933)

    - 3.200 x1x3 - 0.486 x1x4 - 2.557 x1x5 - 0.526 x1x6
      (1.267)      (0.241)      (4.481)      (2.643)
R2 = 0.72
s2 = 4.86
a. Is the second model significantly more effective than the first?
b. From a broad comparison of the two models, which would you choose for forecasting
rainfall in the target area? Give your reasons, for and against.
Chapter 16: Regression Analysis: Model Building

Supplementary Exercises Solutions:


19. a. The Minitab output is shown below:

The regression equation is


AUDELAY = 80.4 + 11.9 INDUS - 4.82 PUBLIC - 2.62 ICQUAL - 4.07 INTFIN

Predictor Coef SE Coef T p


Constant 80.429 5.916 13.60 0.000
INDUS 11.944 3.798 3.15 0.003
PUBLIC -4.816 4.229 -1.14 0.263
ICQUAL -2.624 1.184 -2.22 0.033
INTFIN -4.073 1.851 -2.20 0.035

S = 10.92 R-sq = 38.3% R-sq(adj) = 31.2%

Analysis of Variance

SOURCE DF SS MS F p
Regression 4 2587.7 646.9 5.42 0.002
Residual Error 35 4176.3 119.3
Total 39 6764.0

b. The low value of the adjusted coefficient of determination (31.2%) does not
indicate a good fit.

c. The scatter diagram is shown below:

96+
- * *
AUDELAY - 3
- * *
- *
80+ * 2 *
- * *
- 3 *
- 3 * 2
- 2 *
64+ *
- * * * *
- *
- 3
- *
48+ *
- 2
--+---------+---------+---------+---------+---------+----INTFIN
0.0 1.0 2.0 3.0 4.0 5.0

The scatter diagram suggests a curvilinear relationship between these two


variables.

d. The output from the stepwise procedure is shown below, where INTFINSQ is
the square of INTFIN.

Response is AUDELAY on 5 predictors, with N = 40

Step 1 2
Constant 112.4 112.8

INDUS 11.5 11.6


T-Value 3.67 3.80
P-Value 0.001 0.001

PUBLIC -1.0
T-Value -0.29
P-Value 0.775

ICQUAL -2.45 -2.49


T-Value -2.51 -2.60
P-Value 0.017 0.014
INTFIN -36.0 -36.6
T-Value -4.61 -4.91
P-Value 0.000 0.000

INTFINSQ 6.5 6.6


T-Value 4.17 4.44
P-Value 0.000 0.000

S 9.01 8.90
R-Sq 59.15 59.05
R-Sq(adj) 53.14 54.37
C-p 6.0 4.1

20. a. The results from the Stepwise procedure indicate that lnY can be significantly
explained in terms of the dummy variables HSC and HSA. At the same time, the R2
and adjusted R2 values for this model (33.23% and 25.82% respectively) are not particularly
high. The same model features in the Best Subsets output (it corresponds to the
first of the two predictor alternatives in the list) and technically appears to
have the edge on its eight competitors. However, one practical problem with the HSA
variable is that the sign of its estimated regression coefficient is positive, suggesting
that the health scare in year 8 actually resulted in a growth rather than a decline in
tobacco consumption.

b. From the various comments in (a), the linear formulation adopted for analysing the data
does not seem to have been helpful or productive. The absence of lnX1 or lnX2 as
predictors in any of the models is a particular indictment, so much so that one wonders
why this approach was ever investigated in the first place.

21. The computer output is shown below:

The regression equation is


AUDELAY = 63.0 + 11.1 INDUS

Predictor Coef SE Coef T p


Constant 63.000 3.393 18.57 0.000
INDUS 11.074 4.130 2.68 0.011

S = 12.23 R-sq = 15.9% R-sq(adj) = 13.7%

Analysis of Variance

SOURCE DF SS MS F p
Regression 1 1076.1 1076.1 7.19 0.011
Residual Error 38 5687.9 149.7
Total 39 6764.0

Unusual Observations
Obs. INDUS AUDELAY Fit Stdev.Fit Residual St.Resid
5 0.00 91.00 63.00 3.39 28.00 2.38R
38 1.00 46.00 74.07 2.35 -28.07 -2.34R

Durbin-Watson statistic = 1.55

At the 0.01 level of significance, dL = 1.25 and dU = 1.34. Since d = 1.55 > dU,
there is no significant positive autocorrelation.
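For completeness, the Durbin-Watson statistic itself is computed from the residuals taken in observation order as d = Σ(e_t − e_{t−1})² / Σ e_t². A minimal sketch, using made-up residual series rather than the audit-delay residuals:

```python
# Hedged sketch: compute the Durbin-Watson statistic from a residual
# series.  Smoothly varying residuals push d towards 0 (positive
# autocorrelation); alternating residuals push d towards 4.
import numpy as np

def durbin_watson(residuals):
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
```

Residuals with no serial correlation give d near 2, which is why values such as 1.55 are then judged against the tabulated dL and dU bounds rather than against 2 directly.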

22. a. The Minitab output is shown below:

The regression equation is


AUDELAY = 70.6 + 12.7 INDUS - 2.92 ICQUAL

Predictor Coef SE Coef T p


Constant 70.634 4.558 15.50 0.000
INDUS 12.737 3.966 3.21 0.003
ICQUAL -2.919 1.238 -2.36 0.024

S = 11.56 R-sq = 26.9% R-sq(adj) = 22.9%

Analysis of Variance

SOURCE DF SS MS F p
Regression 2 1818.6 909.3 6.80 0.003
Residual Error 37 4945.4 133.7
Total 39 6764.0

SOURCE DF SEQ SS
INDUS 1 1076.1
ICQUAL 1 742.4

Unusual Observations
Obs. INDUS AUDELAY Fit Stdev.Fit Residual St.Resid
5 0.00 91.00 67.71 3.78 23.29 2.13R
38 1.00 46.00 71.70 2.44 -25.70 -2.27R

R denotes an obs. with a large st. resid.

Durbin-Watson statistic = 1.43

b. The residual plot as a function of the order in which the data are presented is
shown below:

[Minitab character plot of RESID against observation order (0-40) omitted; the residuals scatter about zero, roughly between -25 and +25, with no systematic pattern over time.]

There is no obvious pattern in the data indicative of positive autocorrelation.

c. At the .05 level of significance, dL = 1.44 and dU = 1.54. Since d = 1.43 < dL = 1.44,
the test (marginally) indicates significant positive autocorrelation.

23.
MODEL 1
This is a particularly flawed model. None of the predictors here is significant
according to its individual p-value, yet the F statistic has a p-value of 0.000 < α
= 0.05, indicating that the model, overall, is significant. Because of this there is a
strong suspicion that multicollinearity is present. The R2 of 68% for the model is
relatively good. There is a single outlying residual as well as an influential
observation. The outlier does not look serious because of its standardized residual
value, but observation 6's influence needs to be carefully checked.

MODEL 2
This is a much simpler model which not surprisingly overcomes many of the
problems of MODEL 1. The single predictor used in the model is significant.
Observation 6 is no longer influential. Observation 19 is still associated with an
outlying residual but this is hardly any worse than before.

Is MODEL 1, despite its various difficulties, an improvement on MODEL 2?

To find out we note the following Error SS details:

Model DF Error SS
1 17 57950
2 19 66124

Based on formula (16.11) the relevant test statistic is:

F = [(66124 - 57950)/(19 - 17)] / [57950/17] = 1.20

Under the hypothesis H0: β2 = β3 = 0, F has an F distribution on 2 and 17 degrees of
freedom. Since the 5% critical value for this distribution is 3.59, we cannot reject H0
and deduce therefore that MODEL 1 is not a significant improvement on MODEL 2.
This is the clincher, and so MODEL 2 would be preferred.
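The calculation above can be sketched in code. The function name is illustrative; the inputs are the Error SS and degrees-of-freedom values quoted in the table.

```python
# Sketch of the partial (extra sum of squares) F test of formula (16.11):
# does the full model (here MODEL 1) significantly improve on the reduced
# model (MODEL 2)?

def partial_f(sse_reduced, df_reduced, sse_full, df_full):
    """F statistic for testing the extra parameters in the full model."""
    numerator = (sse_reduced - sse_full) / (df_reduced - df_full)
    denominator = sse_full / df_full
    return numerator / denominator

f_stat = partial_f(sse_reduced=66124, df_reduced=19, sse_full=57950, df_full=17)
print(round(f_stat, 2))  # 1.2, well below the 5% critical value of 3.59
```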

24. a. The model is significant overall, with all its predictors significant also. This is borne
out by the p-values for the F and t statistics, which are < α = 0.05 without exception. A
two-sided Durbin-Watson test (α = 0.05) yields an inconclusive result since dL = 1.01
< d = 1.09 < dU = 1.41. There is a single outlier, but this does not appear to be too
extreme according to its standardized residual value. The main problem with the
model is multicollinearity, as evidenced by the high correlations between all variables
– and which was somehow played down by the earlier VIF values. The earlier t test
results are therefore likely to be very dubious.

The Stepwise output features a new predictor laglnimp which happens to be selected
for the final step 2 model. The problem is that this model too is likely to suffer from
multicollinearity.

b. Hence the preferred model of all those considered is the Stepwise (step 1) simple
regression model with the lngdp predictor. This model has a very high R2
(97.5%) and is highly significant according to the p-value (0.000) provided by
Stepwise.

25. a.

y x1
x1 0.964
0.000

x2 0.873 0.815
0.001 0.004

Cell Contents: Pearson correlation
              P-Value

All the correlations here are very high as well as being highly significant (p-value < 0.01).

b.

The regression equation is
y = 0.01 + 1.11 x1 + 0.139 x2

Predictor Coef SE Coef T P VIF
Constant 0.009 1.098 0.01 0.994
x1 1.1102 0.2116 5.25 0.001 3.0
x2 0.13855 0.07648 1.81 0.113 3.0

S = 0.821927 R-Sq = 95.2% R-Sq(adj) = 93.8%

Analysis of Variance

Source DF SS MS F P
Regression 2 93.072 46.536 68.88 0.000
Residual Error 7 4.729 0.676
Total 9 97.801

Source DF Seq SS
x1 1 90.855
x2 1 2.217

c. The VIF values in (b) do not suggest that multicollinearity is a problem, despite the
significant correlation between x1 and x2 found in (a). Clearly, however, there are
problems, since x2 (with a p-value = 0.113 > 0.05 = α) is not a significant predictor whereas
x1 (with a p-value = 0.001 < 0.05 = α) is. (According to the correlations, both should be
significant predictors.)
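With only two predictors, each VIF can in fact be computed directly from their sample correlation, r12 = 0.815 from part (a). A quick check (illustrative, not part of the original output):

```python
# For a two-predictor model, VIF_1 = VIF_2 = 1 / (1 - r12^2), where r12 is
# the sample correlation between x1 and x2 (0.815 in part (a)).

r12 = 0.815
vif = 1 / (1 - r12 ** 2)
print(round(vif, 1))  # 3.0, agreeing with the VIF column in the output
```

This illustrates why a VIF of 3.0 can coexist with a statistically significant predictor correlation: the conventional "VIF > 10" alarm threshold is far from being reached.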
d. The relevant plots are as follows:
None of these plots seem to be out of line with theoretical assumptions but the sample
size is relatively small, so this is not altogether unexpected.

26. a.

Stepwise Regression: y versus x1, x2

Alpha-to-Enter: 0.15 Alpha-to-Remove: 0.15

Response is y on 2 predictors, with N = 10

Step 1 2
Constant 1.575278 0.008928

x1 1.42 1.11
T-Value 10.23 5.25
P-Value 0.000 0.001

x2 0.139
T-Value 1.81
P-Value 0.113

S 0.932 0.822
R-Sq 92.90 95.16
R-Sq(adj) 92.01 93.78
Mallows C-p 4.3 3.0

Best Subsets Regression: y versus x1, x2

Response is y

Mallows xx
Vars R-Sq R-Sq(adj) C-p S 12
1 92.9 92.0 4.3 0.93182 X
1 76.1 73.2 28.5 1.7078 X
2 95.2 93.8 3.0 0.82193 X X

Stepwise seems to favour the full two-predictor model, which also corresponds to the model
described on the bottom line of the Best Subsets output. Yet the adjusted R2 value (92.0%) for
the single-predictor x1 model (step 1) is almost the same as that for the full model (93.8%).
Note that the corresponding difference between root mean square error values is slightly more
pronounced.

b. Despite this, the single x1 alternative might be favoured given earlier concerns about hidden
multicollinearity.

27. Relevant regression output is as follows:

The regression equation is
y = - 6.93 + 0.698 x1 + 1.10 x2

Predictor Coef SE Coef T P
Constant -6.9262 0.7543 -9.18 0.000
x1 0.6979 0.1835 3.80 0.002
x2 1.0978 0.1911 5.74 0.000
S = 0.249288 R-Sq = 96.0% R-Sq(adj) = 95.5%

Analysis of Variance

Source DF SS MS F P
Regression 2 22.344 11.172 179.77 0.000
Residual Error 15 0.932 0.062
Total 17 23.276

Source DF Seq SS
x1 1 20.294
x2 1 2.050

Unusual Observations

Obs x1 y Fit SE Fit Residual St Resid
16 8.60 8.2000 7.7481 0.1434 0.4519 2.22R

R denotes an observation with a large standardized residual.

a.

New
Obs Fit SE Fit 90% CI
1 7.2196 0.0709 (7.0953, 7.3439)

b.
New
Obs Fit SE Fit 90% PI
1 7.2196 0.0709 (6.7653, 7.6740)
c. The (confidence) interval in a. estimates the mean score for all proposals with x1 = 8.0 and
x2 = 7.8, whereas the (prediction) interval in b. applies to a single specific proposal with the
scores x1 = 8.0 and x2 = 7.8.
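The relationship between the two intervals can be checked numerically: both are centred on the same fit, but the prediction interval adds the model error variance s2 to the variance of the fitted mean. The figures below come from the output above; the t value 1.753 (t.05 on n - p - 1 = 15 degrees of freedom, for 90% intervals) is a tabled value assumed here.

```python
import math

# Reconstructing the 90% CI and PI from the regression output above.
fit, se_fit, s = 7.2196, 0.0709, 0.249288
t = 1.753  # tabled t_{.05}(15), giving 90% two-sided intervals

ci_half = t * se_fit                           # CI uses SE of the fitted mean
pi_half = t * math.sqrt(s ** 2 + se_fit ** 2)  # PI adds the error variance s^2

print(round(fit - ci_half, 4), round(fit + ci_half, 4))  # ~(7.0953, 7.3439)
print(round(fit - pi_half, 4), round(fit + pi_half, 4))  # ~(6.7653, 7.6739)
```

Both reconstructed intervals agree with the Minitab output to rounding, which confirms the interpretation given in part c.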

28. a.
Stepwise Regression: y versus x1, x2

Alpha-to-Enter: 0.15 Alpha-to-Remove: 0.15

Response is y on 2 predictors, with N = 18

Step 1 2
Constant -6.436 -6.926

x2 1.73 1.10
T-Value 13.69 5.74
P-Value 0.000 0.000

x1 0.70
T-Value 3.80
P-Value 0.002

S 0.338 0.249
R-Sq 92.13 96.00
R-Sq(adj) 91.64 95.46
Mallows C-p 15.5 3.0

Best Subsets Regression: y versus x1, x2

Response is y

Mallows xx
Vars R-Sq R-Sq(adj) C-p S 12
1 92.1 91.6 15.5 0.33834 X
1 87.2 86.4 34.0 0.43170 X
2 96.0 95.5 3.0 0.24929 X X

From the Stepwise and Best Subsets output it is clear the full two-predictor model is
most favoured. Both predictors X1 and X2 contribute very significantly to the model
according to the relevant T ratios and p-values. The root mean square error value is
also markedly better for this model than for the alternatives.

b. The two-predictor model is conspicuously better than either single-predictor
alternative for representing the data.

29. a. Relevant output is as follows:

Correlations: Y, X1, X2, X3

Y X1 X2
X1 0.828
0.001

X2 0.492 0.041
0.104 0.899

X3 -0.235 0.012 -0.578
   0.463 0.969 0.049

Cell Contents: Pearson correlation
              P-Value

From the correlations here, it can be seen that X1 and X2 are the predictors most strongly
correlated with Y.

Regression Analysis: Y versus X1, X2, X3

The regression equation is
Y = 1.41 + 0.589 X1 + 0.364 X2 + 0.087 X3

Predictor Coef SE Coef T P
Constant 1.409 5.717 0.25 0.811
X1 0.58919 0.08290 7.11 0.000
X2 0.3641 0.1066 3.42 0.009
X3 0.0870 0.3965 0.22 0.832

S = 3.10495 R-Sq = 89.7% R-Sq(adj) = 85.8%

Analysis of Variance

Source DF SS MS F P
Regression 3 670.54 223.51 23.18 0.000
Residual Error 8 77.13 9.64
Total 11 747.67

Source DF Seq SS
X1 1 513.14
X2 1 156.94
X3 1 0.46

The output above shows a significant linear model has been fitted (the p-value for the F ratio is
0.000 < 0.05 = α). X1 and X2 are significant predictors of Y (for each of their T ratios, p-value <
0.05 = α).

b.

Stepwise Regression: Y versus X1, X2, X3

Backward elimination. Alpha-to-Remove: 0.1

Response is Y on 3 predictors, with N = 12
Response is Y on 3 predictors, with N = 12

Step 1 2
Constant 1.409 2.356

X1 0.589 0.590
T-Value 7.11 7.53
P-Value 0.000 0.000

X2 0.364 0.351
T-Value 3.42 4.27
P-Value 0.009 0.002

X3 0.09
T-Value 0.22
P-Value 0.832

S 3.10 2.94
R-Sq 89.68 89.62
R-Sq(adj) 85.82 87.32
Mallows C-p 4.0 2.0

The best model according to this procedure is the one featuring the two predictors X1
and X2. This is essentially what we would have expected following the regression
analysis in (a).

30. The initial model does not seem to be affected by multicollinearity judging from the VIF
values, yet the sample correlations between predictors do look potentially problematic in
places, e.g. the correlation between cmplain and rises of 0.666. T ratios for the model are
given below:

Predictor Coefficient Standard Error t
cmplain 0.613 0.161 3.81
prvileg -0.073 0.136 -0.54
learn 0.32 0.169 1.89
rises 0.082 0.222 0.37
critcal 0.038 0.147 0.26
advance -0.217 0.178 -1.22

Only the t ratio for cmplain is significant: under H0: βi = 0, i = 1, 2, ... 6, t is
distributed on n – p – 1 = 30 – 6 – 1 = 23 degrees of freedom, and apart from cmplain
none of the ratios above is > t.025(23) = 2.069 or < -t.025(23) = -2.069.

Using the Total sum of squares result of 4296.97 and the root mean square error value =
7.0680 from the bottom line of the Best Subsets output, the ANOVA table for the
model can be constructed as follows:
df SS MS F
Regression 6 3147.968 524.661 10.502
Error 23 1149.002 49.957
Total 29 4296.97

The F statistic here is significant (10.502 > 2.53 = F.05 for an F distribution on 6 and
23 degrees of freedom). Thus, we would reject H0: β1 = β2 = …β6 = 0 and deduce the
model is significant.
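The reconstruction above can be sketched as follows. The only inputs are the Total SS (4296.97) and the root mean square error S = 7.0680 quoted from the Best Subsets output; the variable names are illustrative.

```python
# Rebuilding the ANOVA table from Total SS and the root mean square error:
# SSE = S^2 * (n - p - 1), SSR = SST - SSE, F = (SSR/p) / (SSE/(n - p - 1)).

n, p = 30, 6
sst, s = 4296.97, 7.0680

df_error = n - p - 1           # 23 error degrees of freedom
sse = s ** 2 * df_error        # error sum of squares
ssr = sst - sse                # regression sum of squares
f = (ssr / p) / (sse / df_error)

print(round(sse, 1), round(ssr, 1), round(f, 2))  # ~1149.0, ~3148.0, ~10.5
```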

From the Best subsets output two models stand out - namely the first of the two
predictor models and the first of the three predictor models listed. The three predictor
model features the advance predictor which is not strongly correlated with y. In this
sense the two predictor one might therefore be preferred. The full six predictor model
falls well short of either of these alternatives.

31. a. MODEL 1

This is a significant regression model based on the F ratio result of 19.25. (Under H0:
β1 = 0, F has an F distribution on 1 and 18 degrees of freedom. The 5% critical value
for this distribution is F.05 = 4.41. Since F = 19.25 > F.05 = 4.41 we would therefore
reject H0.)

MODEL 2
From the Error SS information provided we can recreate the corresponding ANOVA
table as follows:

df SS MS F
Regression 3 62.636 20.879 10.553
Error 16 31.655 1.978
Total 19 94.291

(since the TOTAL SS remains the same however many predictors we choose for the
modelling). The F statistic here is significant (10.553 > 3.24 = F.05 for an F
distribution on 3 and 16 degrees of freedom). Thus, we would reject H0: β1 = β2 = β3 =
0 and deduce the model is significant.

Corresponding t ratios are as follows:

Predictor Coefficient Standard Error t
race -1.91 1.54 -1.24
test 1.31 0.67 1.96
raceXtest 2 0.954 2.10

The relevant t distribution under H0: βi = 0, i = 1, 2, 3, is t on 16 degrees of freedom.
As none of the ratios above is > t.025(16) = 2.12 or < -t.025(16) = -2.12, we cannot
reject H0 for i = 1, 2, 3.

b. To test if MODEL 2 is an improvement over MODEL 1 we note the following error
sums of squares details:

Model Error SS
1 45.568
2 31.655

Based on formula (16.11) the relevant test statistic is:

F = [(45.568 - 31.655)/2] / [31.655/(20 - 3 - 1)] = 3.52
Under the hypothesis H0: β2 = β3 = 0, F has an F distribution on 2 and 16 degrees
of freedom. Since the 5% critical value for this distribution is 3.63, we cannot reject
H0 and deduce MODEL 2 is not a significant improvement on MODEL 1. Hence
introducing the race variable into the model, either explicitly or through an
interaction, does not seem to have improved the model's performance. The fact that
both the confidence and prediction intervals overlap for minority and white
candidates for MODEL 2 suggests that the classification has no significant value
for the exercise. MODEL 1 would therefore be preferred.

32. a. The VIF values here reveal a major problem of multicollinearity. Thus, estimated
coefficients for the regression model as well as corresponding t tests are likely to be very
dubious. From the correlation matrix the source of the multicollinearity seems to be
between the doprod and consum predictors. With a correlation of 0.999 they would be
regarded mathematically by MINITAB as being in essence, identical variables. One of
them needs to be dropped from the model – it is up to the analyst to decide which.
Whether the stock predictor is worth retaining is another issue and could be investigated
using stepwise procedures.

The R2 = 97.3% result is impressive and corresponds with an F value for the ANOVA
table of

F = R2(n - p - 1) / [(1 - R2)p] = 0.973(18 - 3 - 1) / [(1 - 0.973)3] = 168.17

Under H0: β1 = β2 = β3 = 0, F has an F distribution on 3 and 14 degrees of freedom. The
5% critical value for this distribution is F.05 = 3.34. Since F = 168.17 > F.05 = 3.34 we
would reject H0 and deduce the model is significant.
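The working above can be sketched as a small function (names illustrative); the same formula is used again for MODEL 1 of problem 33.

```python
# F statistic recovered from R^2 alone:
# F = [R^2 / p] / [(1 - R^2) / (n - p - 1)].

def f_from_r2(r2, n, p):
    """Overall F statistic implied by R^2 for n observations, p predictors."""
    return (r2 / p) / ((1 - r2) / (n - p - 1))

f_stat = f_from_r2(0.973, n=18, p=3)
print(round(f_stat, 2))  # ~168.17, far above F.05(3, 14) = 3.34
```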

From the Durbin-Watson tables provided, it can be shown for n = 18, p = 3 and α =
0.025 that dL = 0.82 and dU = 1.56. Based on a two-sided test approach, we deduce the
test result of d = 0.24 indicates that significant positive autocorrelation of errors exists.

b. The model is very problematic as it stands. Both the multicollinearity and the first-order
serial correlation of errors need to be resolved before it can be seriously
considered as a statistical prediction tool.
33. MODEL 1
T statistics can be calculated for the estimated model by calculating the regression
coefficient / standard error ratios as follows:

t
Constant 1.39
x1 0.84
x2 -1.10
x3 -1.21
x4 0.05
x5 0.69
x6 1.37

Given:

H0: βi = 0
H1: βi ≠ 0 i= 0, 1, 2, …6
and α = .05, the above ratios under H0 have a t distribution on 17 degrees of freedom,
where 17 = n – p – 1 with n = 24 and p = the number of independent variables = 6.
As none of the ratios above is > t.025(17) = 2.11 or < -t.025(17) = -2.11, we cannot
reject H0 for i = 0, 1, 2, …6.

From the R2 value of 0.385, the F statistic for the ANOVA table can be calculated
as follows:

F = R2(n - p - 1) / [(1 - R2)p] = 0.385(24 - 6 - 1) / [(1 - 0.385)6] = 1.78

Under H0: β1 = β2 = ……= β6 = 0, F has an F distribution on 6 and 17 degrees of
freedom. It can be shown that for this distribution F.05 = 2.92. As F = 1.78 < 2.92,
H0 cannot be rejected and we deduce the multiple regression model is not
significant.

Superficially the sample correlations provided do not indicate problems of
multicollinearity (though it would have helped to have been provided with
corresponding p-values).

The Durbin-Watson statistic cannot strictly be tested using the critical values
provided in the book (only a maximum of 5 predictors is catered for), but given that
for n = 24, p = 5 and α = 0.025, dL = 0.83 and dU = 1.79, a two-sided test is likely
to be inconclusive.

MODEL 2
The second model features an additional four interaction terms. Their presence
seems to improve the R2 result considerably, and it can be shown from the t
ratios below that x1 and x3x1 are significant predictors:
t
Constant -0.86
x1 2.94
x2 1.8
x3 0.50
x4 1.78
x5 1.14
x6 1.63
x3x1 -2.53
x1x4 -2.01
x1x5 -0.57
x1x6 -0.21

(As before,

H0: βi = 0
H1: βi ≠ 0, i = 0, 1, 2, …10

and α = .05. The above ratios under H0 have a t distribution on 13 degrees of
freedom, where 13 = n – p – 1, n = 24 and p = the number of independent variables
= 10. The relevant critical values in this case are t.025(13) = 2.16 and -t.025(13) =
-2.16.)

a. To test if MODEL 2 is an improvement over MODEL 1 we calculate the
error sums of squares = mean square error × degrees of freedom for each model =
s2(n - p - 1).

From the details provided we have:

Model s2 (n - p - 1) Error SS
1 8.044 17 136.748
2 4.86 13 63.18

Based on formula (16.11) the relevant test statistic is:

F = [(136.748 - 63.18)/4] / [63.18/13] = 3.78

Under the hypothesis H0: β7 = β8 = ……= β10 = 0, F has an F distribution on 4 and
13 degrees of freedom. Since the 5% critical value for this distribution is 3.18, we
reject H0 in favour of

H1: One or more of the parameters is not equal to zero

and deduce MODEL 2 is a significant improvement on MODEL 1.
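The two steps above — recovering each model's Error SS from its mean square error and then forming the partial F statistic of formula (16.11) — can be sketched together (names illustrative):

```python
# Error SS from the mean square error: SSE = s^2 * (n - p - 1), then the
# partial F test of formula (16.11) for the four interaction terms.

def sse_from_mse(mse, n, p):
    """Error sum of squares from the mean square error s^2."""
    return mse * (n - p - 1)

sse1 = sse_from_mse(8.044, n=24, p=6)   # MODEL 1: 136.748 on 17 df
sse2 = sse_from_mse(4.86, n=24, p=10)   # MODEL 2: 63.18 on 13 df

f = ((sse1 - sse2) / (17 - 13)) / (sse2 / 13)
print(round(f, 2))  # ~3.78, exceeding the 5% critical value of 3.18
```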

b. Both models are problematic, but MODEL 2 is at least a significant improvement on
MODEL 1, so this would be preferred. (MODEL 1 is not significant in any way.)
Many terms in MODEL 2 are not significant and needlessly clutter it, so ideally these
should be eliminated using, for example, the stepwise technique before the model
actually comes into operation.
