
Chapter 8

Multiple Regression

8.1 Chapter Outline


In this chapter:

• the multiple regression model will be introduced

• you will learn how to perform basic multiple regression calculations

• you will learn how to check the assumptions required for the multiple regression
model

• you will learn how to interpret the multiple regression model

• you will learn how to make predictions on the basis of the multiple regression model

• techniques for investigating whether the model can be simplified will be examined

• you will learn some special uses of the multiple regression technique

8.2 The Multiple Regression Model – Summary


The Population A single population of subjects

The Variables Several continuous numerical variables of interest, one of which is con-
sidered as a response, and the others as predictors

The Null Hypothesis The response is independent of the predictors


The Model The population is conceived as a spectrum of sub-populations, corresponding


to the continuum of possible values each predictor variable may take

• The Fundamental Sampling Assumption –

– the subjects in the sample have been selected at random from the population;
or,
– the experimenter may pre-determine which values of the predictor variables to
run the experiment for and how many times to do so for each choice, in which
case it is assumed that for each of the chosen sub-populations the subjects in
that sub-sample have been selected at random from the sub-population

• Linearity – the sub-population average value of the response is linearly related to


the predictors:
average(response for the given predictors)
= a + b1 × predictor1 + b2 × predictor2 + · · · + bk × predictork

• Homoscedasticity – the sub-population standard deviation of the response is the


same for each sub-population
• Conditional Normality – the sub-population distribution of the response is normal
for each sub-population

Assessing the Model

• The Fundamental Sampling Assumption can only really be assessed by a critical


consideration of the experimental protocols – often the data themselves are not useful
here. If the data were recorded serially in time and / or space, then a serial plot of
the residuals from the model fit can reveal patterns that would indicate a violation
of this assumption
• A plot of the residuals from the model fit against the predicted values from the model
fit should be examined to establish that the Linearity and Homoscedasticity
assumptions are reasonable
• Plots of the residuals from the model fit against each predictor should also be exam-
ined, especially if the plot of residuals against predicted values reveals that there is
a problem
• If the Linearity and Homoscedasticity assumptions are reasonable, then a normal
quantile-quantile (Q-Q) plot of the residuals from the model fit should be examined
to establish that the Conditional Normality assumption is reasonable.

The Statistical Procedure The standard procedure used for this problem is called
least-squares

8.3 Multiple Predictors

8.3.1 Case Study – The Boys’ Lung Capacity data


❡ Illustrative Example – The Boys’ Lung Capacity data (Data Set A.7 in Appendix A
of the Subject Reader)

• In Chapter 6, linear regression was introduced as a technique for studying the rela-
tionship between a predictor variable and a response variable

• In the case study, lung capacity was the response and height was the predictor

• In multiple regression there is a single response variable but now there are two or
more predictor variables to be accounted for simultaneously

• In the case study, a second predictor variable, namely weight, was recorded

• The problem at hand is now to:

– Quantify the joint effect of height and weight on lung capacity


– Utilise the relationship in order to make predictions of lung capacity based on
both height and weight

• An appropriate exploratory investigation is to examine scatterplots of

1. lung capacity against height (Figure 8.1)


2. lung capacity against weight (Figure 8.2)
3. weight against height (Figure 8.3)

• Examination of these graphs indicates:

– There is a positive dependence between each pair of variables


– In each case this dependence is roughly linear
– There are no unusual observations apparent in the graphs
– The scatterplot of lung capacity against weight also suggests that the variability
in lung capacity increases with weight (this is not surprising if you think about
the context)

Figure 8.1: Boys’ lung capacities (measured as forced vital capacity in litres) against heights
(centimetres)

[scatterplot: fvc against height]

Figure 8.2: Boys’ lung capacities (measured as forced vital capacity in litres) against
weights (kilograms)
[scatterplot: fvc against weight]

• Note that, as well as scatterplots of response against each predictor, scatterplots of


predictor against predictor are also examined

– In general, there will be inter-relationships between the predictors, and this can
have implications for the analysis and the interpretation of results

• It is important to note that Figures 8.1 – 8.3 are only two-dimensional views of a
data set that is intrinsically three-dimensional

– They can inform, but they cannot tell the whole story
– This observation has implications for model checking

8.4 The Multiple Regression Model

8.4.1 Motivation for the Model


• The framework for the multiple regression model is essentially the same as for simple
linear regression (§ 6.5)
• Recall that for lung capacity and height, we considered a separate sub-population of
12-year-old boys for each possible height

– Eg, for boys 130cm, 131cm and so on

• In the multiple regression context, where both height and weight are considered
simultaneously, we now consider a separate sub-population of 12-year-old boys for
each possible combination of height and weight

– Eg, boys with


(height=130cm, weight=30kg), (height=131cm, weight=30kg),
(height=130cm, weight=31kg), (height=131cm, weight=31kg)
and so on

• Analogously to the case of regression on one variable, the multiple regression model
is a set of assumptions about the distribution of lung capacity within these sub-
populations

1. The sub-population average lung capacity for some given height and weight is
linearly related to the height and the weight:
average(lung capacity for a given height and weight)
= a + b1 × height + b2 × weight

2. The sub-population standard deviation of the lung capacity for boys of a given
height and weight is the same for each height and weight (homoscedasticity).
Call this common value the conditional standard deviation and denote it SD
3. The sub-population distribution of the lung capacity for boys of a given height
and weight is normal (conditional normality)

• The Fundamental Sampling Assumption and methods for assessing the other three
modelling assumptions will be examined in full in § 8.5

8.4.2 Fitting the Multiple Regression Model


• The analysis of multiple regression data is similar to the simple linear regression
analysis with a single predictor variable discussed in § 6.5.2

✾ How to . . . perform multiple regression analysis

1. Use the menu item


Statistics / Fit models / Linear regression. . .

2. Name the model (fvc.on.height.weight)

3. Select the response variable (fvc) and the explanatory (ie, predictor) variables (height
and weight)

4. There is no need to alter any of the other settings

5. Click on the OK button

• The summary of the multiple regression model is shown in Table 8.1

• The (Intercept) entry is the estimate of the intercept parameter a. In this case the
value is −3.799689

• The height entry is the estimate of the height coefficient parameter b1 . In this case
the value is 0.039651

• The weight entry is the estimate of the weight coefficient parameter b2 . In this case
the value is 0.014871

• The quantity labelled Residual standard error is the estimate of the conditional
standard deviation, SD. In this case the value is 0.3027
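Underneath the menus, the fit above is the least-squares calculation: solve the normal equations (XᵀX)β = Xᵀy. The following pure-Python sketch illustrates the idea on a small hypothetical dataset (the data, the coefficient values and the helper names `solve` and `fit_two_predictor` are invented for illustration, not the Boys’ Lung Capacity data; because the responses are generated without noise, the fit recovers the generating coefficients exactly):

```python
def solve(A, b):
    """Solve the small square system A x = b by Gaussian elimination
    with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def fit_two_predictor(x1, x2, y):
    """Least-squares estimates (a, b1, b2) for y = a + b1*x1 + b2*x2,
    obtained by solving the normal equations (X'X) beta = X'y."""
    X = [[1.0, u, v] for u, v in zip(x1, x2)]
    XtX = [[sum(row[r] * row[c] for row in X) for c in range(3)] for r in range(3)]
    Xty = [sum(row[r] * yi for row, yi in zip(X, y)) for r in range(3)]
    return solve(XtX, Xty)

# Hypothetical heights (cm) and weights (kg); responses generated exactly
# from y = -3.8 + 0.04*height + 0.015*weight, with no noise.
heights = [145.0, 150.0, 150.0, 155.0, 160.0, 160.0]
weights = [35.0, 40.0, 45.0, 50.0, 40.0, 45.0]
fvc = [-3.8 + 0.04 * h + 0.015 * w for h, w in zip(heights, weights)]
a, b1, b2 = fit_two_predictor(heights, weights, fvc)
```

With real data the responses do not lie exactly on a plane; the same calculation then returns the coefficients minimising the sum of squared residuals.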

Figure 8.3: Boys’ weights (kilograms) against heights (centimetres)

[scatterplot: weight against height]

Table 8.1: Boys’ lung capacities (measured as forced vital capacity in litres) on heights in
centimetres and weights in kilograms

Call:
lm(formula = fvc ~ height + weight, data = Boys.Lung.Capacity)

Residuals:
Min 1Q Median 3Q Max
-0.659600 -0.216124 -0.002729 0.177483 0.882400

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.799689 0.664114 -5.721 7.48e-08 ***
height 0.039651 0.005252 7.549 8.23e-12 ***
weight 0.014871 0.004652 3.197 0.00176 **
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 0.3027 on 124 degrees of freedom


Multiple R-squared: 0.6533, Adjusted R-squared: 0.6477
F-statistic: 116.8 on 2 and 124 DF, p-value: < 2.2e-16

8.5 Assessing the Multiple Regression Assumptions


• Recall the assumptions of the multiple regression model (§ 8.2)

• The Fundamental Sampling Assumption can only really be assessed by a critical


consideration of the experimental protocols

• The remaining three modelling assumptions can be studied on the basis of the ob-
served data by calculating and plotting, in appropriate ways, the residuals from the
multiple regression

– The residuals in a multiple regression model are an obvious generalisation to


more than one predictor of the residuals in a simple linear regression model
(§ 6.6.1)
– The residuals in a multiple regression model are computed analogously to those
in a simple linear regression model

• Recall from § 6.6 that, in the case of simple linear regression, although linearity
and homoscedasticity can be assessed from a scatterplot of the data, psychological
experiments have shown that a residuals analysis provides a better judgement

• Now that there is more than one predictor, linearity and homoscedasticity can only
be assessed from a scatterplot of residuals against predicted values

– As was noted in § 8.3.1, any scatterplots of the data are two-dimensional views of
something that is intrinsically higher dimensional, and so cannot be conclusive

• Residual plots and their interpretation are as described in § 6.6.2

• The residuals against predicted values plot for the Boys’ Lung Capacity data is shown
in Figure 8.4

– From Figure 8.4, it is concluded that the assumptions of linearity and ho-
moscedasticity are reasonable

• Recall that in Figure 8.2 there was evidence that the variability in lung capacity
increases with weight

– Although this two-dimensional observation suggests that there may be some


heteroscedasticity, the residuals against predicted values, Figure 8.4, reveals
that the multiple regression model for the full three-dimensional data set is
satisfactorily homoscedastic
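Concretely, each residual is just the observed response minus the predicted response from the fitted equation. A minimal sketch using the coefficients of Table 8.1 (the observed fvc of 3.0 litres for a boy of height 150cm and weight 40kg is invented for illustration):

```python
# Predicted value from the estimated equation of Table 8.1
fitted = -3.799689 + 0.039651 * 150 + 0.014871 * 40

# Residual = observed - predicted (observed value is hypothetical)
observed_fvc = 3.0
residual = observed_fvc - fitted
```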

Figure 8.4: Residuals against predicted values plot for the multiple regression of boys’
lung capacities (measured as forced vital capacity in litres) on heights in centimetres and
weights in kilograms
[scatterplot: residuals against predicted]

– This reinforces the caution given in § 8.3.1 that the two-dimensional scatterplots
are informative in an exploratory sense, but cannot be relied upon to tell the
whole story

• The normal Q-Q plot of the residuals for the Boys’ Lung Capacity data is shown in
Figure 8.5

– From Figure 8.5, it is concluded that the assumption of conditional normality


is reasonable

• What if the assumptions of the multiple regression model are not satisfied?

• The ideas developed in § 7 apply here also

• Frequently, transformations of one or more of the variables – usually just the response
variable – are the best approach

8.5.1 The Residuals against Predictors Plots


• In the simple linear regression model and its generalisation to the multiple regression
model, the principal means of investigating the acceptability of the linearity and
homoscedasticity assumptions is by critically examining the plot of residuals against
predicted values

– Indeed the idea of assessing the suitability of a model by examining a plot


of residuals against predicted values extends to models other than regression
models – it is a very general approach to model checking

• In regression models, for each subject, the predicted value is the value obtained from
the estimated regression equation

• Consider the special case of the simple linear regression model:

– The estimated regression equation is simply a linear function of the predictor


variable – a rescaling of the predictor variable
– Thus the plot of the residuals against the predicted values will be qualitatively
the same as the plot of the residuals against the predictor variable
– The only difference will be the numerical scale on the horizontal axis, and, if
the estimated slope is negative, the horizontal axis will be reversed
– The two plots will provide identical information regarding the acceptability of
the linearity and homoscedasticity assumptions

• Now consider the more general case of multiple regression:



Figure 8.5: Normal Q-Q plot of the residuals for the multiple regression of boys’ lung
capacities (measured as forced vital capacity in litres) on heights in centimetres and weights
in kilograms
[normal Q-Q plot: Boys.Lung.Capacity$residuals against norm quantiles]

– The estimated regression equation is no longer simply a linear function of any


one of the predictor variables
– Thus the plot of the residuals against the predicted values will be qualitatively
different from a plot of the residuals against any one of the predictor variables
– If the plot of the residuals against the predicted values indicates that the linear-
ity and homoscedasticity assumptions are acceptable, then it would be highly
unusual for a plot of the residuals against any of the predictor variables to reveal
any problems with the suitability of the model
– However, it was noted in § 7.2 that, in some applications, particularly if the
problem is with the linearity and not the homoscedasticity, it is appropriate to
extend the model to allow for the average response to depend on the predictor
in a more complex way
– In the multiple regression context, if the above is the case, then plots of the
residuals against each of the predictor variables can help identify which predic-
tors to focus attention on
– An example of this situation is presented in § 8.9.2

• Thus, in multiple regression applications, it is recommended as good practice to


critically examine not only the plot of the residuals against the predicted values, but
also the plots of the residuals against each of the predictor variables
• The residuals against heights plot for the Boys’ Lung Capacity data is shown in
Figure 8.6 and the residuals against weights plot for the Boys’ Lung Capacity data
is shown in Figure 8.7
• In this example, neither Figure 8.6 nor Figure 8.7 show anything untoward, as is to
be expected since Figure 8.4 appeared to be satisfactory

8.6 Interpreting the Multiple Regression Model

8.6.1 Testing the Significance of the Multiple Regression


• The appropriate null hypothesis is that lung capacity is independent of both height
and weight

– or equivalently, that there is no relationship between lung capacity and height


and weight
– or equivalently, that there is no regression of lung capacity on height and weight

• The P-value is given in the final line of Table 8.1:


F-statistic: 116.8 on 2 and 124 DF, p-value: < 2.2e-16

Figure 8.6: Residuals against heights plot for the multiple regression of boys’ lung capaci-
ties (measured as forced vital capacity in litres) on heights in centimetres and weights in
kilograms

[scatterplot: residuals against height]

Figure 8.7: Residuals against weights plot for the multiple regression of boys’ lung capac-
ities (measured as forced vital capacity in litres) on heights in centimetres and weights in
kilograms
[scatterplot: residuals against weight]

• The P-value is so small here that it is reported as


p-value < 2.2e-16

– In this topic, this should be written up as something like ‘The P-value is less
than 2.2 × 10⁻¹⁶’ or ‘The P-value is 0.000 (to three decimal places)’, or the like

• It was argued in § 6.7.1 in the context of linear regression with only one predictor, that
the null hypothesis is logically equivalent to the hypothesis that the single population
slope parameter is zero

– Now that there is more than one predictor, there is no such simple connection
– Investigating whether the multiple regression model can be simplified is nonethe-
less a relevant exercise
– This will be discussed in detail later, in § 8.8

8.6.2 Interpreting the Multiple Regression Coefficients


• From Table 8.1, the estimated multiple regression equation is:

average(lung capacity for a given height and weight)


= (0.039651 × height) + (0.014871 × weight) − 3.799689

• For example:

– The average lung capacity for 12-year-old boys of height 150cm and weight 40
kg is estimated to be

(0.039651 × 150) + (0.014871 × 40) − 3.799689


= 2.74 litres (to two decimal places)

– The average lung capacity for 12-year-old boys of height 150cm and weight 45
kg is estimated to be

(0.039651 × 150) + (0.014871 × 45) − 3.799689


= 2.82 litres (to two decimal places)

– The average lung capacity for 12-year-old boys of height 160cm and weight 40
kg is estimated to be

(0.039651 × 160) + (0.014871 × 40) − 3.799689


= 3.14 litres (to two decimal places)

– The average lung capacity for 12-year-old boys of height 160cm and weight 45
kg is estimated to be

(0.039651 × 160) + (0.014871 × 45) − 3.799689


= 3.21 litres (to two decimal places)
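The four worked examples above all amount to evaluating the estimated regression equation; a minimal sketch (the coefficient values are those reported in Table 8.1):

```python
def predicted_fvc(height_cm, weight_kg):
    # Estimated regression equation from Table 8.1
    return -3.799689 + 0.039651 * height_cm + 0.014871 * weight_kg

# The four (height, weight) combinations worked through above
estimates = [round(predicted_fvc(h, w), 2)
             for h, w in [(150, 40), (150, 45), (160, 40), (160, 45)]]
```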

• Analogously to the slope parameter in simple linear regression, the coefficient pa-
rameters in multiple regression have meaningful physical interpretations

– It is estimated that an increase of 1cm in height corresponds, on average, to an


increase of 0.039651 litres in lung capacity, for a fixed weight
– It is estimated that an increase of 1kg in weight corresponds, on average, to an
increase of 0.014871 litres in lung capacity, for a fixed height
– The qualifications “for a fixed weight” and “for a fixed height”, respectively, are
crucial and sometimes have surprising implications
– We shall see an extreme example of this later in § 8.8.2

• As with simple linear regression, the intercept parameter sometimes has a meaningful
physical interpretation as the estimate of the average value of y when x1 = 0 and
x2 = 0

– This only applies when x1 = 0 and x2 = 0 are part of the physical range
– The intercept parameter in our example does not have a meaningful physical
interpretation
– The fact that it is negative is of no special importance

• Since the population coefficients and the population intercept are being estimated,
it is sometimes desirable to indicate the accuracy as well as the size of the estimates
– confidence intervals for the population coefficients and the population intercept
• These are computed in the same way as described in § 6.7.2
• The confidence intervals are shown in Table 8.2
• Thus it may be claimed with 95% confidence that an increase of 1cm in height
corresponds to an increase between 0.02925472 litres and 0.05004630 litres in average
lung capacity, for a fixed weight
• Likewise, it may be claimed with 95% confidence that an increase of 1kg in weight
corresponds to an increase between 0.00566343 litres and 0.02407863 litres in average
lung capacity, for a fixed height
• Similar statements may be made for the intercept parameter if appropriate and re-
quired
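Each interval in Table 8.2 is of the standard form estimate ± t × (standard error). The sketch below reproduces the table from the estimates and standard errors of Table 8.1; the t quantile is hard-coded (approximately 1.9793, the 97.5th percentile of the t distribution with 124 degrees of freedom — an assumed value, since the Python standard library has no t quantile function):

```python
T_CRIT = 1.9793   # approx. 97.5th percentile of t with 124 df (assumed value)

def conf_int(estimate, std_error):
    # 95% confidence interval: estimate +/- t_crit * standard error
    half_width = T_CRIT * std_error
    return (estimate - half_width, estimate + half_width)

height_ci = conf_int(0.039651, 0.005252)   # estimate and SE from Table 8.1
weight_ci = conf_int(0.014871, 0.004652)
```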

8.6.3 The Coefficient of Determination


• The coefficient of determination R2 in a multiple regression is analogous to that in
a simple linear regression (§ 6.7.3)

• From Table 8.1, R2 = 0.6533

• 65.33% of the variation in 12-year-old boys’ lung capacities can be attributed to


variation in their heights and weights
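R² has the same definition here as in simple linear regression: one minus the ratio of the residual sum of squares to the total sum of squares. A minimal helper, checked on two hypothetical toy cases (an exact fit, and a fit that just predicts the mean):

```python
def r_squared(y, y_hat):
    # Coefficient of determination: 1 - (residual SS / total SS)
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

perfect = r_squared([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0])   # exact fit
useless = r_squared([1.0, 2.0, 3.0, 4.0], [2.5, 2.5, 2.5, 2.5])   # predicts the mean
```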

8.7 Prediction
• The concepts and methodologies for predictions using the multiple regression model
are analogous to those for a simple linear regression model (§ 6.8)

• Table 8.3 shows the predictions and prediction intervals for the cases discussed in
§ 8.6.2

• From Table 8.3, statements such as the following may be made: “With 95% confi-
dence, it may be claimed that a new 12-year-old boy with height 150cm and weight
40kg will have a lung capacity in the range 2.14 litres to 3.34 litres”

• A new subject with height 150cm, weight 40kg and a lung capacity outside that range
would be considered to have an unusual lung capacity for that height and weight

• Note that the multiple regression model has permitted us to make judgements about
the boy’s lung capacity taking account of his height and weight

• In § 6.8.3, there was a discussion about the graphical representation of the fitted
model and the prediction bands

• Now that there is more than one predictor, this is no longer directly applicable

• For the Boys’ Lung Capacity data, there are now three dimensions

Table 8.2: Confidence intervals for the multiple regression coefficients for boys’ lung ca-
pacities (measured as forced vital capacity in litres) on heights in centimetres and weights
in kilograms

2.5 % 97.5 %
(Intercept) -5.11415659 -2.48522102
height 0.02925472 0.05004630
weight 0.00566343 0.02407863

• It may be useful to the investigator to pick a fixed weight and then represent graph-
ically on axes of lung capacity against height the 95% prediction bands for that
weight

• Similarly, it may be useful to the investigator to pick a fixed height and then represent
graphically on axes of lung capacity against weight the 95% prediction bands for that
height

• Superimposing these on a scatterplot of the original data here though would not be
appropriate, since the original subjects varied both in height and weight
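A rough check on Table 8.3: a naive 95% prediction interval is fit ± t × SD, using the residual standard error alone. This ignores the uncertainty in the estimated coefficients, so it comes out slightly narrower than the exact intervals in the table (the t quantile of 1.9793 for 124 degrees of freedom is an assumed hard-coded value):

```python
T_CRIT = 1.9793     # approx. 97.5th percentile of t with 124 df (assumed value)
RESID_SD = 0.3027   # residual standard error from Table 8.1
fit = 2.742729      # prediction for height 150, weight 40 (Table 8.3)

# Naive interval: ignores coefficient-estimation uncertainty, so it sits
# strictly inside the exact interval (2.140798, 3.344660) from Table 8.3.
naive_lwr = fit - T_CRIT * RESID_SD
naive_upr = fit + T_CRIT * RESID_SD
```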

8.8 Model Simplification


• In practice, resources are required to measure variables

• If a model may be simplified without diminishing the quality of its predictions, then
costs of practical significance might be saved

8.8.1 Two Predictors


• Given that the height of each boy has been measured, do we gain any benefit by also
measuring the boy’s weight?

• Given that the weight of each boy has been measured, do we gain any benefit by also
measuring the boy’s height?

• These questions are effectively asking whether it is possible to simplify the multiple
regression model

• We focus attention on the structural form of the linearity assumption

Table 8.3: For selected heights (centimetres) and weights (kilograms), the estimated aver-
age boys’ lung capacities (measured as forced vital capacity in litres) and 95% prediction
intervals

height weight fit lwr upr


1 150 40 2.742729 2.140798 3.344660
2 150 45 2.817084 2.214405 3.419763
3 160 40 3.139234 2.528698 3.749770
4 160 45 3.213589 2.608193 3.818985

• The model states that the sub-population average lung capacity for some given height
and weight is linearly related to the height and the weight:

M12 : average(lung capacity for a given height and weight)


= a + b1 × height + b2 × weight

• The first question above can be thought of as testing the hypothesis that the model
may be simplified to

M1 : average(lung capacity for a given height and weight)


= a + b1 × height

– We refer to testing the sub-model M1 against the model M12 and as a short-hand
we refer to the hypothesis M1 : M12
– We are asking whether, in the context of the model M12 , the weight variable is
really needed – the P-value for this test appears in Table 8.1 in the line labelled
weight: 0.00176
– Since the P-value = 0.00176 < 0.05, the hypothesis M1 : M12 is rejected
– Thus there is evidence that, even if the height of the boy is known, the weight
of the boy also adds significantly to the predictive power

• The second question above can be thought of as testing the hypothesis that the model
may be simplified to

M2 : average(lung capacity for a given height and weight)


= a + b2 × weight

– We refer to testing the sub-model M2 against the model M12 and as a short-hand
we refer to the hypothesis M2 : M12
– We are asking whether, in the context of the model M12 , the height variable is
really needed – the P-value for this test appears in Table 8.1 in the line labelled
height: 8.23 × 10⁻¹²
– Since the P-value = 8.23 × 10⁻¹² < 0.05, the hypothesis M2 : M12 is rejected
– Thus there is evidence that, even if the weight of the boy is known, the height
of the boy also adds significantly to the predictive power

• In conclusion, the multiple regression model M12 cannot be simplified
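Each of these single-coefficient tests is a t-test: the t value column of Table 8.1 is simply the estimate divided by its standard error, and the P-value is then obtained from the t distribution with 124 degrees of freedom. A quick check of the reported t values:

```python
# t statistic for testing whether a single coefficient is zero,
# using the estimates and standard errors of Table 8.1.
t_height = 0.039651 / 0.005252   # reported as 7.549
t_weight = 0.014871 / 0.004652   # reported as 3.197
```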



• Recall that R2 was found to be 0.6533 (Table 8.1)

– However, when the simple linear regression of lung capacity on height was con-
sidered in § 6.3.1, R2 for that model was found to be 0.6248 (Table 6.1)
– Thus, although in this case study weight is statistically significant, it only ex-
plains a very small proportion of the variability in lung capacity after height has
been accounted for
– This fact is connected to the inter-relationship that exists between height and
weight in the first place

8.8.2 Case Study – Heart Catheter Length


❡ Illustrative Example – The Heart Catheter Length data (Data Set A.26 in Ap-
pendix A of the Subject Reader)

• We shall assume that the subjects can be regarded as a random sample from the
population of children in need of this heart surgery

– That is, we shall assume that the experimenters followed protocols that enable
us to assume the Fundamental Sampling Assumption

• Consider the exploratory scatterplots of

1. catheter length against height (Figure 8.8)


2. catheter length against weight (Figure 8.9)
3. weight against height (Figure 8.10)

• The most notable feature in all three plots is a positive linear relationship

– The relationship between the two predictors, height and weight, will give rise to
an interesting phenomenon when the issue of model simplification is considered

• A plot of residuals against predicted values for the multiple regression model of
catheter length on height and weight is shown in Figure 8.11

• A plot of residuals against each predictor, height and weight, are shown in Fig-
ures 8.12 and 8.13, respectively

• From these plots, the assumptions of linearity and homoscedasticity may be regarded
as acceptable

• The normal Q-Q plot of the residuals in Figure 8.14 shows that the assumption of
conditional normality is acceptable

Figure 8.8: Heart catheter lengths (centimetres) against heights (centimetres)

[scatterplot: catheter.length against height]

Figure 8.9: Heart catheter lengths (centimetres) against weights (kilograms)


[scatterplot: catheter.length against weight]

Figure 8.10: Weights (kilograms) against heights (centimetres)

[scatterplot: weight against height]

Figure 8.11: Residuals against predicted values plot for the multiple regression of heart
catheter lengths (centimetres) on heights (centimetres) and weights (kilograms)
[scatterplot: residuals against predicted]

Figure 8.12: Residuals against heights plot for the multiple regression of heart catheter
lengths (centimetres) on heights (centimetres) and weights (kilograms)

[scatterplot: residuals against height]

Figure 8.13: Residuals against weights plot for the multiple regression of heart catheter
lengths (centimetres) on heights (centimetres) and weights (kilograms)
[scatterplot: residuals against weight]

• Thus we may assume the model of multiple regression of heart catheter lengths
(centimetres) on heights (centimetres) and weights (kilograms)
• The summary of the multiple regression model is shown in Table 8.4

Table 8.4: Heart catheter lengths (centimetres) on heights (centimetres) and weights (kilo-
grams)

Call:
lm(formula = catheter.length ~ height + weight, data = Heart.Catheter.Length)

Residuals:
Min 1Q Median 3Q Max
-6.961 -1.247 -0.262 1.898 7.001

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 20.58787 8.72398 2.360 0.0426 *
height 0.08413 0.14119 0.596 0.5659
weight 0.40467 0.36118 1.120 0.2916
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 3.939 on 9 degrees of freedom


Multiple R-squared: 0.8057, Adjusted R-squared: 0.7626
F-statistic: 18.67 on 2 and 9 DF, p-value: 0.0006276

• The appropriate null hypothesis to test the significance of the multiple regression is
that heart catheter length is independent of both height and weight
• The P-value is given in the final line of Table 8.4:
F-statistic: 18.67 on 2 and 9 DF, p-value: 0.0006276
• Since the P-value = 0.0006276 < 0.05, the null hypothesis is rejected, and it is
concluded that heart catheter length does depend on height and/or weight
• From Table 8.4, the estimated multiple regression equation is:

average(heart catheter length required for a given height and weight)


= (0.08413 × height) + (0.40467 × weight) + 20.58787

• From Table 8.4, R2 = 0.8057



• 80.57% of the variation in heart catheter lengths required can be attributed to vari-
ation in patients’ heights and weights

• This model may now be used to determine predictions and prediction intervals for
the heart catheter lengths required for children who are to undergo this surgery based
on their heights and weights
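As a sketch of such a prediction, using the estimated equation from Table 8.4 (the child's height and weight below are hypothetical values chosen for illustration):

```python
def predicted_catheter_length(height_cm, weight_kg):
    # Estimated regression equation from Table 8.4
    return 20.58787 + 0.08413 * height_cm + 0.40467 * weight_kg

# Hypothetical child: height 100 cm, weight 20 kg
estimate = round(predicted_catheter_length(100, 20), 2)
```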

• But can the model be simplified?

• First, given the model

M12 : average(heart catheter length required for a given height and weight)
= a + b1 × height + b2 × weight ,

we test the hypothesis that the model may be simplified to

M1 : average(heart catheter length required for a given height and weight)


= a + b1 × height

– That is, we test the hypothesis M1 : M12


– We are asking whether, in the context of the model M12 , the weight variable is
really needed – the P-value for this test appears in Table 8.4 in the line labelled
weight: 0.2916
– Since the P-value = 0.2916 > 0.05, the hypothesis M1 : M12 is accepted

• As a separate exercise, given the model

M12 : average(heart catheter length required for a given height and weight)
= a + b1 × height + b2 × weight ,

we test the hypothesis that the model may be simplified to

M2 : average(heart catheter length required for a given height and weight)


= a + b2 × weight

– That is, we test the hypothesis M2 : M12


– We are asking whether, in the context of the model M12 , the height variable is
really needed – the P-value for this test appears in Table 8.4 in the line labelled
height: 0.5659

– Since the P-value = 0.5659 > 0.05, the hypothesis M2 : M12 is accepted

• It is extremely important to note that the investigation above does not conclude that
neither predictor needs to be measured!

– Indeed, we have already tested and rejected the null hypothesis that heart
catheter length is independent of both height and weight

• Rather, the investigation concludes that either height or weight may be omitted from
the model provided the other is included

– We saw in Figure 8.10 that height and weight are strongly correlated
– We now see that this correlation is so strong that it is redundant to measure
both variables; one may serve as a surrogate for the other
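Each of these single-variable tests is equivalent to an extra-sum-of-squares F test comparing the nested models: fit both models, and ask whether the reduction in residual sum of squares is large relative to the residual variance of the larger model. A pure-Python sketch on fabricated toy data (not the catheter data, which are not reproduced here):

```python
def solve(A, b):
    # Solve the linear system A x = b by Gaussian elimination with pivoting.
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def rss(X, y):
    # Residual sum of squares of the least-squares fit (normal equations).
    n, p = len(X), len(X[0])
    XtX = [[sum(r[a] * r[b] for r in X) for b in range(p)] for a in range(p)]
    Xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(p)]
    beta = solve(XtX, Xty)
    return sum((y[i] - sum(X[i][j] * beta[j] for j in range(p))) ** 2 for i in range(n))

# Fabricated toy data: y really does depend on both x1 and x2.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [2, 1, 4, 3, 6, 5, 8, 7]
e = [0.1, -0.1, -0.1, 0.1, 0.1, -0.1, -0.1, 0.1]   # small fixed "noise"
y = [1 + 2 * a + 3 * b + c for a, b, c in zip(x1, x2, e)]

n = len(y)
X_full = [[1.0, a, b] for a, b in zip(x1, x2)]     # full model: intercept, x1, x2
X_red = [[1.0, a] for a in x1]                     # reduced model: intercept, x1
rss_full, rss_red = rss(X_full, y), rss(X_red, y)

# Extra-sum-of-squares F statistic for "is x2 needed, given x1?"
F = (rss_red - rss_full) / 1 / (rss_full / (n - 3))
print(F > 10)  # here x2 clearly matters, so F is large
```

With real data, this F statistic is exactly what the quoted single-variable P-values summarise.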

• The summary of the model M1 is shown in Table 8.5 and the summary of the model
M2 is shown in Table 8.6

Table 8.5: Heart catheter lengths (centimetres) on heights (centimetres)

Call:
lm(formula = catheter.length ~ height, data = Heart.Catheter.Length)

Residuals:
Min 1Q Median 3Q Max
-7.1461 -0.7427 -0.2279 1.1876 6.6588

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.0122 4.2389 2.834 0.017737 *
height 0.2361 0.0398 5.931 0.000145 ***
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 3.989 on 10 degrees of freedom


Multiple R-squared: 0.7787, Adjusted R-squared: 0.7565
F-statistic: 35.18 on 1 and 10 DF, p-value: 0.0001449

• Having determined that M1 and M2 are each acceptable models, it is appropriate
to ask whether either of these models can be further simplified

Figure 8.14: Normal Q-Q plot of the residuals for the multiple regression of heart catheter
lengths (centimetres) on heights (centimetres) and weights (kilograms)


Table 8.6: Heart catheter lengths (centimetres) on weights (kilograms)

Call:
lm(formula = catheter.length ~ weight, data = Heart.Catheter.Length)

Residuals:
Min 1Q Median 3Q Max
-8.0210 -1.4952 -0.1132 2.0951 7.0553

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 25.63665 2.00922 12.759 1.64e-07 ***
weight 0.61137 0.09725 6.287 9.06e-05 ***
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 3.81 on 10 degrees of freedom


Multiple R-squared: 0.7981, Adjusted R-squared: 0.7779
F-statistic: 39.52 on 1 and 10 DF, p-value: 9.061e-05

• Given the model

M1 : average(heart catheter length required for a given height and weight) = a + b1 × height

we test the hypothesis that the model may be simplified to

M0 : average(heart catheter length required for a given height and weight) = a

– That is, we test the hypothesis M0 : M1


– We are asking whether, in the context of the model M1 , the height variable is
really needed – the P-value for this test appears in Table 8.5 in the line labelled
height: 0.000145
– Since the P-value = 0.000145 < 0.05, the hypothesis M0 : M1 is rejected
– This is of course to be expected, as we are now effectively considering H0 , which
has already been rejected by the argument on page 153

• Given the model

M2 : average(heart catheter length required for a given height and weight) = a + b2 × weight

we test the hypothesis that the model may be simplified to

M0 : average(heart catheter length required for a given height and weight) = a

– That is, we test the hypothesis M0 : M2


– We are asking whether, in the context of the model M2 , the weight variable is
really needed – the P-value for this test appears in Table 8.6 in the line labelled
weight: 9.06 × 10−5
– Since the P-value = 9.06 × 10−5 < 0.05, the hypothesis M0 : M2 is rejected
– This is of course to be expected, as again we are now effectively considering H0 ,
which has already been rejected by the argument on page 153

• Thus, although either M1 or M2 may be considered to be acceptable models, neither
of these can be further simplified

– To repeat again for emphasis, it is not necessary to measure both height and
weight, provided one of them is measured

• Note that this investigation is not saying that the original model M12 is somehow
invalid, but rather that it is more complex than it need be

• We have three acceptable models for this problem, M12 , M1 and M2

– Indeed, this is reflected in the fact that all three models have approximately the
same R²: M12 has R² = 80.57%; M1 has R² = 77.87%; M2 has R² = 79.81%

• Which of these models we use becomes more of a practical than a statistical question

– As mentioned above, measuring variables costs resources, and so usually a model
with fewer variables is preferred
– In the clinical setting of this problem, is it easier to measure the child’s height
or weight?
– If both are quick and cheap to measure, then why not use the full model M12 ?

8.8.3 More Than Two Predictors


• In practice we often have several predictor variables of which not all will be necessary

• The case study in § 8.8.2 shows that even in the case of two predictors the task of
determining which variables are necessary is not always straightforward

• The number of potential intermediate models grows quickly with the number of
predictors

• As well as the full model with all predictors included and the null model with no
predictors included, there are

– two intermediate models for two predictors: M1 and M2
– six intermediate models for three predictors: M12 , M13 , M23 , M1 , M2 and M3
– 14 intermediate models for four predictors: M123 , M124 , M134 , M234 , M12 , M13 , M14 , M23 , M24 , M34 , M1 , M2 , M3 and M4
– ...
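These counts follow the pattern 2^k − 2 intermediate models for k predictors (all non-empty proper subsets of the predictors), which is easy to verify by enumeration; a quick Python check:

```python
from itertools import combinations

def intermediate_models(k):
    """All non-empty proper subsets of k predictors (between the null and full models)."""
    preds = range(1, k + 1)
    return [subset for r in range(1, k) for subset in combinations(preds, r)]

for k in (2, 3, 4):
    print(k, len(intermediate_models(k)))  # prints: 2 2, then 3 6, then 4 14
```

For ten predictors this already gives 1,022 intermediate models, which is why systematic search schemes are needed.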

• There are many techniques for assessing the acceptability of these intermediate mod-
els, and for giving guidance as to which of the acceptable models to settle on

• If the number of predictor variables is small, say up to about five, then the scheme
called backward elimination is feasible

– The analysis that was conducted in § 8.8.2 is in fact backward elimination
applied to that problem

✾ How to . . . perform backward elimination

1. Begin with the full model in which all predictors are included

2. Conduct a residuals analysis to confirm that the model of multiple regression is
suitable

• Obviously, the process of model simplification cannot be executed unless and
until there is an acceptable multiple regression model to start with

3. Examine the Coefficients table of the multiple regression model summary and
identify which (if any) sub-models with one fewer predictor are acceptable

4. For each of the acceptable sub-models with one fewer predictor (if any) identified in
step 3, repeat step 3 with that sub-model as the model

5. This process will eventually terminate, once every branch of the tree of sub-models
corresponding to iterative omissions of a single predictor has been investigated

6. There will now be a set of acceptable models; examine practical (as opposed to
statistical) considerations to select one of these

7. As a precaution, conduct a residuals analysis for the selected model to check that
this process has not introduced any strange artefacts
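The steps above can be sketched in code. This toy illustration is not the analysis of any data set in this chapter: it uses drop-one F statistics with a rough cut-off of 4 (close to the 5% critical value for moderate degrees of freedom) in place of P-values read from a summary table, and fabricated data in which only x1 genuinely matters:

```python
def solve(A, b):
    # Solve A x = b by Gaussian elimination with partial pivoting.
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def rss(X, y):
    # Residual sum of squares of the least-squares fit (normal equations).
    n, p = len(X), len(X[0])
    XtX = [[sum(r[a] * r[b] for r in X) for b in range(p)] for a in range(p)]
    Xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(p)]
    beta = solve(XtX, Xty)
    return sum((y[i] - sum(X[i][j] * beta[j] for j in range(p))) ** 2 for i in range(n))

def backward_eliminate(cols, y, f_crit=4.0):
    # cols: dict mapping predictor name -> column of values (intercept always kept).
    keep = list(cols)
    while keep:
        n = len(y)
        X_full = [[1.0] + [cols[m][i] for m in keep] for i in range(n)]
        rss_full = rss(X_full, y)
        df = n - (len(keep) + 1)
        stats = {}
        for m in keep:                      # drop-one F statistic for each predictor
            others = [v for v in keep if v != m]
            X_red = [[1.0] + [cols[v][i] for v in others] for i in range(n)]
            stats[m] = (rss(X_red, y) - rss_full) / (rss_full / df)
        weakest = min(stats, key=stats.get)
        if stats[weakest] >= f_crit:        # every remaining predictor is needed
            break
        keep.remove(weakest)                # drop it and refit
    return keep

x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [2, 1, 4, 3, 6, 5, 8, 7]               # carries no extra information here
e = [0.1, -0.1, -0.1, 0.1, 0.1, -0.1, -0.1, 0.1]
y = [1 + 2 * a + c for a, c in zip(x1, e)]  # y depends on x1 only

print(backward_eliminate({"x1": x1, "x2": x2}, y))  # ['x1']
```

In practice the P-values in the Coefficients table of the summary play the role of these drop-one statistics.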

• The “Evaporation from the Soil” problem in Practical 6.1 is an example of backward
elimination applied to a four-predictor data set

8.9 Polynomial Regression

8.9.1 Quadratic Relationships


• Two variables are said to be quadratically related if they satisfy the equation

y = a + b × x + c × x²

for some numbers a, b and c



• The curve produced by this relationship is called a parabola – see Figure 8.15
• Sometimes in regression problems it is found that the model of linear regression does
not hold, but simply because the average response is not linearly related to one of
the predictors, but rather related quadratically
• This will manifest itself in the plot of residuals against predicted values in the form
of a quadratic (parabolic) pattern for the trend, but nonetheless there will still be
reasonable homoscedasticity about that trend
• Rather than attempting to find a model using transformations, in this case it is more
appropriate to explicitly incorporate the squared term in the model
• That is, include in the set of predictor variables both the predictor concerned and
also its square as another predictor

– Note that the “linearity” in the “linearity assumption” of the multiple regression
model refers to the fact that the regression equation is linear in the unknown
parameters (a, b1 , b2 , etc)
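Concretely, fitting a quadratic is just a multiple regression of the response on the two columns x and x². A minimal Python sketch via the normal equations, on fabricated data that is exactly quadratic:

```python
def solve(A, b):
    # Solve A x = b by Gaussian elimination with partial pivoting.
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def fit_quadratic(x, y):
    # Multiple regression of y on the columns [1, x, x**2]: still LINEAR in a, b, c.
    cols = [[1.0] * len(x), list(x), [v * v for v in x]]
    XtX = [[sum(ca * cb for ca, cb in zip(cols[i], cols[j])) for j in range(3)]
           for i in range(3)]
    Xty = [sum(c * v for c, v in zip(cols[i], y)) for i in range(3)]
    return solve(XtX, Xty)  # [a, b, c] in y = a + b*x + c*x**2

x = [0, 1, 2, 3, 4, 5, 6]
y = [5 + 2 * v + 0.5 * v * v for v in x]     # exactly quadratic data
a, b, c = fit_quadratic(x, y)
print(round(a, 6), round(b, 6), round(c, 6))  # 5.0 2.0 0.5
```

In R the same model is fitted by including the squared column in the data frame and writing it in the `lm` formula, exactly as the `dge2` term is used in § 8.9.2.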

• This exploitation of the multiple regression model to handle this situation does not
contradict the Fundamental Sampling Assumption

– The “weaker form” of the Fundamental Sampling Assumption


“the experimenter may pre-determine which values of the predictor
variables to run the experiment for and how many times to do so for
each choice, in which case it is assumed that for each of the chosen
sub-populations the subjects in that sub-sample have been selected at
random from the sub-population”
allows us to “get away with this device”

• If necessary, even higher order powers of the predictor (x³, x⁴, . . .) could be added
as extra predictors

– In general these relationships are called polynomial relationships, hence the gen-
eral term polynomial regression

8.9.2 Case Study – Predicting Appliance Sales


❡ Illustrative Example – Predicting Appliance Sales (Data Set A.23 in Appendix A
of the Subject Reader)

• Consider the model of multiple regression of dishwasher shipments on durable goods
expenditures and private residential investment

• The Fundamental Sampling Assumption in this type of data set relates to the fact
that the data form a time series of measurements

– We are not interested in examining this as a time series per se


– Rather, the Fundamental Sampling Assumption is that the fundamental rela-
tionships between these variables in the United States’ economy do not change
over time
– This is one of the special cases in which the data can be informative as to
whether the Fundamental Sampling Assumption is reasonable
– Once we have settled on a model, we shall examine the residuals from that
model as a time series: we require this time series to be able to be regarded as
simply noise (recall the discussion in § 2.4.4)

• Consider the exploratory scatterplots of

1. dishwasher shipments against durable goods expenditures (Figure 8.16)
2. dishwasher shipments against private residential investment (Figure 8.17)
3. private residential investment against durable goods expenditures (Figure 8.18)

• The following remarks are pertinent:

1. Recall that each of these scatterplots is only a two-dimensional view of a data
set that is intrinsically three-dimensional
2. The factory shipments (domestic) of dishwashers appears to be positively cor-
related with both economic indicators (Figures 8.16 and 8.17)
3. The relationship between the factory shipments (domestic) of dishwashers and
durable goods expenditures, ignoring private residential investment, (Figure 8.16)
appears not to be linear or homoscedastic
4. The two economic indicators are correlated (Figure 8.18)

• To properly assess the suitability of the model of multiple regression, a residuals
analysis of the model is required

• Consider the scatterplots of

1. residuals against predicted values (Figure 8.19)
2. residuals against durable goods expenditures (Figure 8.20)
3. residuals against private residential investment (Figure 8.21)

• Observe from Figure 8.19 that there is a problem with the linearity assumption, but
not with the homoscedasticity assumption

Figure 8.15: An example of a quadratic relationship, a parabola


Figure 8.16: Factory shipments (domestic) of dishwashers (thousands) against durable
goods expenditures (billions of 1972 dollars)

Figure 8.17: Factory shipments (domestic) of dishwashers (thousands) against private
residential investment (billions of 1972 dollars)


Figure 8.18: Private residential investment (billions of 1972 dollars) against durable goods
expenditures (billions of 1972 dollars)

Figure 8.19: Residuals against predicted values plot, for the multiple regression of factory
shipments (domestic) of dishwashers (thousands) on durable goods expenditures (billions
of 1972 dollars) and private residential investment (billions of 1972 dollars)


Figure 8.20: Residuals against durable goods expenditures plot, for the multiple regression
of factory shipments (domestic) of dishwashers (thousands) on durable goods expenditures
(billions of 1972 dollars) and private residential investment (billions of 1972 dollars)

• Inspection of Figures 8.20 and 8.21 suggests that the problem lies with the way in
which we are modelling the dependence of factory shipments (domestic) of dishwash-
ers on durable goods expenditures

– Refer to the discussion on page 140

• The discussion in § 8.9.1 suggests modifying the model

average(dish) = a + b1 × dge + b2 × pri

to the model

average(dish) = a + b11 × dge + b12 × dge² + b2 × pri

with the other assumptions as they were

• This is still a multiple regression, but now with three predictors: dge, dge² and pri

• Now for this augmented model, consider the scatterplots of

1. residuals against predicted values (Figure 8.22)
2. residuals against durable goods expenditures (Figure 8.23)
3. residuals against durable goods expenditures squared (Figure 8.24)
4. residuals against private residential investment (Figure 8.25)

• The assumptions of linearity and homoscedasticity are reasonably satisfied for the
augmented model

• The normal Q-Q plot of the residuals in Figure 8.26 shows that the assumption of
conditional normality is acceptable

• In the discussion earlier in § 8.9.2, it was noted that once an acceptable model had
been found, it was appropriate to examine a time series plot of the residuals – see
Figure 8.27

• There are no time trends apparent in Figure 8.27

• The summary of the augmented model is shown in Table 8.7

• Observe, from the Coefficients section of Table 8.7 that all three terms in the
model are significant

• In conclusion, we have adopted the model

average(dish) = −3,412 + 54.31 × dge − 0.1744 × dge² + 45.26 × pri



Figure 8.21: Residuals against private residential investment plot, for the multiple re-
gression of factory shipments (domestic) of dishwashers (thousands) on durable goods
expenditures (billions of 1972 dollars) and private residential investment (billions of 1972
dollars)


Figure 8.22: Residuals against predicted values plot, for the augmented multiple regression
of factory shipments (domestic) of dishwashers (thousands) on durable goods expenditures
(billions of 1972 dollars) and private residential investment (billions of 1972 dollars)

Figure 8.23: Residuals against durable goods expenditures plot, for the augmented multiple
regression of factory shipments (domestic) of dishwashers (thousands) on durable goods
expenditures (billions of 1972 dollars) and private residential investment (billions of 1972
dollars)


Figure 8.24: Residuals against durable goods expenditures squared plot, for the augmented
multiple regression of factory shipments (domestic) of dishwashers (thousands) on durable
goods expenditures (billions of 1972 dollars) and private residential investment (billions of
1972 dollars)

Figure 8.25: Residuals against private residential investment plot, for the augmented mul-
tiple regression of factory shipments (domestic) of dishwashers (thousands) on durable
goods expenditures (billions of 1972 dollars) and private residential investment (billions of
1972 dollars)


Figure 8.26: Normal Q-Q plot of the residuals for the augmented multiple regression of
factory shipments (domestic) of dishwashers (thousands) on durable goods expenditures
(billions of 1972 dollars) and private residential investment (billions of 1972 dollars)

Figure 8.27: Time series plot of the residuals for the augmented multiple regression of
factory shipments (domestic) of dishwashers (thousands) on durable goods expenditures
(billions of 1972 dollars) and private residential investment (billions of 1972 dollars)


Table 8.7: Augmented model of factory shipments (domestic) of dishwashers (thousands)
on durable goods expenditures (billions of 1972 dollars) and private residential investment
(billions of 1972 dollars)

Call:
lm(formula = dish ~ dge + dge2 + pri, data = Appliance.Sales)

Residuals:
Min 1Q Median 3Q Max
-319.243 -157.061 -2.999 141.060 650.402

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.412e+03 3.988e+02 -8.555 1.90e-08 ***
dge 5.431e+01 7.175e+00 7.569 1.46e-07 ***
dge2 -1.744e-01 2.989e-02 -5.832 7.21e-06 ***
pri 4.526e+01 7.353e+00 6.156 3.39e-06 ***
---
Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

Residual standard error: 242.4 on 22 degrees of freedom


Multiple R-squared: 0.9518, Adjusted R-squared: 0.9453
F-statistic: 144.9 on 3 and 22 DF, p-value: 1.225e-14

8.10 Dummy Variables


• Regression models are concerned with the situation of a single response variable and
one or more predictor variables, all of which are continuous numerical variables

• In some experiments, some of the predictors may be categorical variables

• The general way of dealing with such experiments is introduced later in Chapter 12

• In the special case in which all the categorical predictors are simply dichotomous (or
binary), we may exploit the multiple regression model

– Dichotomous (or binary) variables are categorical variables that only have two
values

• Artificially constructed predictor variables are entered into the multiple regression
model

– Each of these dummy variables (or indicator variables) is constructed as a
numerical variable with values either 0 or 1 to indicate the allocation of the
subject to whichever of the two possible classifications applies to that subject

• Like polynomial regression (§ 8.9), this exploitation of the multiple regression model
to handle this situation does not contradict the Fundamental Sampling Assumption

– The “weaker form” of the Fundamental Sampling Assumption


“the experimenter may pre-determine which values of the predictor
variables to run the experiment for and how many times to do so for
each choice, in which case it is assumed that for each of the chosen
sub-populations the subjects in that sub-sample have been selected at
random from the sub-population”
allows us to “get away with this device”

• Consider the case of one continuous numerical predictor (predictor1 ) and one di-
chotomous variable (predictor2 )

– Pick one of the values predictor2 can take as the base-line; create a numerical
(coded) variable (predictor2 .coded) that has the value 0 for the base-line and
1 for the other
– Use the model of multiple regression with predictor1 and predictor2 .coded
as the two predictors

– This will therefore fit the model

average(response for the given predictors)
= a + b1 × predictor1 + b2 × predictor2.coded
= a + b1 × predictor1 , if predictor2.coded = 0
= a + b1 × predictor1 + b2 , if predictor2.coded = 1

that is,

= a + b1 × predictor1 , if predictor2 is at base-line
= a + b1 × predictor1 + b2 , if predictor2 is not

• This model is often sufficient for the problem at hand
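A minimal Python sketch of this device, on fabricated data in which the dichotomous predictor shifts the average response up by exactly 3 units; the fit via the normal equations recovers that shift as the coefficient of the coded column:

```python
def solve(A, b):
    # Solve A x = b by Gaussian elimination with partial pivoting.
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def fit(X, y):
    # Least-squares coefficients via the normal equations X'X beta = X'y.
    p = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * v for r, v in zip(X, y)) for i in range(p)]
    return solve(XtX, Xty)

# Dichotomous predictor coded 0 (base-line) or 1; group 1 sits exactly 3 units higher.
data = [(x, g, 10 + 2 * x + 3 * g) for g in (0, 1) for x in (1, 2, 3, 4)]
X = [[1.0, x, g] for x, g, _ in data]        # columns: intercept, predictor1, coded
y = [resp for _, _, resp in data]
a, b1, b2 = fit(X, y)
print(round(a, 6), round(b1, 6), round(b2, 6))  # 10.0 2.0 3.0
```

The two fitted lines share the slope b1 and differ only in intercept, by b2, exactly as the model display above describes.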

❡ Illustrative Example – Fuel consumption data – consider the exercise in Practical 6.2

• It is important to note that this “trick” only works with a 0-1 coding
• It is also important to note that this “trick” will not work with a categorical variable
with three or more possible values

– Were the dummy variable approach to be emulated for a categorical variable
with three possible values then the model would insist that the difference in the
average response between the second and the first value was the same as the
difference in the average response between the third and the second value, and
furthermore that the difference in the average response between the third and
the first value was twice either of those!
– This is because the following model would be fitted:

average(response for the given predictors)
= a + b1 × predictor1 + b2 × predictor2.coded
= a + b1 × predictor1 , if predictor2.coded = 0
= a + b1 × predictor1 + b2 , if predictor2.coded = 1
= a + b1 × predictor1 + 2 × b2 , if predictor2.coded = 2

that is,

= a + b1 × predictor1 , if predictor2 is at base-line
= a + b1 × predictor1 + b2 , if predictor2 is at level 1
= a + b1 × predictor1 + 2 × b2 , if predictor2 is at level 2

– This would place unrealistic constraints on the model


