Multiple Regression
• you will learn how to check the assumptions required for the multiple regression
model
• you will learn how to make predictions on the basis of the multiple regression model
• techniques for investigating whether the model can be simplified will be examined
• you will learn some special uses of the multiple regression technique
The Variables Several continuous numerical variables of interest, one of which is con-
sidered as a response, and the others as predictors
132 CHAPTER 8. MULTIPLE REGRESSION
– the subjects in the sample have been selected at random from the population;
or,
– the experimenter may pre-determine which values of the predictor variables to
run the experiment for and how many times to do so for each choice, in which
case it is assumed that for each of the chosen sub-populations the subjects in
that sub-sample have been selected at random from the sub-population
The Statistical Procedure The standard procedure used for this problem is called
least-squares
• In Chapter 6, linear regression was introduced as a technique for studying the rela-
tionship between a predictor variable and a response variable
• In the case study, lung capacity was the response and height was the predictor
• In multiple regression there is a single response variable but now there are two or
more predictor variables to be accounted for simultaneously
• In the case study, a second predictor variable, namely weight, was recorded
Figure 8.1: Boys’ lung capacities (measured as forced vital capacity in litres) against heights
(centimetres)
[Scatterplot: fvc on the vertical axis, height on the horizontal axis]
Figure 8.2: Boys’ lung capacities (measured as forced vital capacity in litres) against
weights (kilograms)
[Scatterplot: fvc on the vertical axis, weight on the horizontal axis]
8.4. THE MULTIPLE REGRESSION MODEL 135
– In general, there will be inter-relationships between the predictors, and this can
have implications for the analysis and the interpretation of results
• It is important to note that Figures 8.1 – 8.3 are only two-dimensional views of a
data set that is intrinsically three-dimensional
– They can inform, but they cannot tell the whole story
– This observation has implications for model checking
• In the multiple regression context, where both height and weight are considered
simultaneously, we now consider a separate sub-population of 12-year-old boys for
each possible combination of height and weight
• Analogously to the case of regression on one variable, the multiple regression model
is a set of assumptions about the distribution of lung capacity within these sub-
populations
1. The sub-population average lung capacity for some given height and weight is
linearly related to the height and the weight:
average(lung capacity for a given height and weight)
= a + b1 × height + b2 × weight
2. The sub-population standard deviation of the lung capacity for boys of a given
height and weight is the same for each height and weight (homoscedasticity).
Call this common value the conditional standard deviation and denote it SD
3. The sub-population distribution of the lung capacity for boys of a given height
and weight is normal (conditional normality)
• The Fundamental Sampling Assumption and methods for assessing the other three
modelling assumptions will be examined in full in § 8.5
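The three modelling assumptions can be made concrete by simulating data that satisfy them and then fitting the model. This is only a minimal sketch: the sample size, the "true" parameter values and the height and weight distributions are all hypothetical choices made for illustration, not values from the case study.

```r
# Hypothetical simulated data; nothing here comes from the Boys' Lung Capacity data
set.seed(1)
n      <- 500
height <- rnorm(n, mean = 150, sd = 8)   # centimetres
weight <- rnorm(n, mean = 45,  sd = 8)   # kilograms

# Assumed "true" parameters for the simulation
a  <- -3.8
b1 <- 0.04
b2 <- 0.015
SD <- 0.3

# Assumption 1: the average response is a + b1 * height + b2 * weight
# Assumption 2: the same SD applies for every height and weight
# Assumption 3: the scatter about the average is normal
fvc <- a + b1 * height + b2 * weight + rnorm(n, mean = 0, sd = SD)

fit <- lm(fvc ~ height + weight)
coef(fit)    # least-squares estimates of a, b1 and b2
sigma(fit)   # estimate of the conditional standard deviation SD
```

With a sample of this size the estimates should land close to the assumed parameter values, which is what the least-squares procedure is designed to achieve when the assumptions hold.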
3. Select the response variable (fvc) and the explanatory (i.e., predictor) variables (height
and weight)
• The (Intercept) entry is the estimate of the intercept parameter a. In this case the
value is −3.799689
• The height entry is the estimate of the height coefficient parameter b1 . In this case
the value is 0.039651
• The weight entry is the estimate of the weight coefficient parameter b2 . In this case
the value is 0.014871
• The quantity labelled Residual standard error is the estimate of the conditional
standard deviation, SD. In this case the value is 0.3027
[Scatterplot: weight on the vertical axis, height on the horizontal axis]
Table 8.1: Boys’ lung capacities (measured as forced vital capacity in litres) on heights in
centimetres and weights in kilograms

Call:
lm(formula = fvc ~ height + weight, data = Boys.Lung.Capacity)

Residuals:
      Min        1Q    Median        3Q       Max
-0.659600 -0.216124 -0.002729  0.177483  0.882400

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.799689   0.664114  -5.721 7.48e-08 ***
height       0.039651   0.005252   7.549 8.23e-12 ***
weight       0.014871   0.004652   3.197  0.00176 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
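As a quick check on how the columns of the Coefficients table relate to one another, each t value is the estimate divided by its standard error. The sketch below simply re-types the numbers from Table 8.1; because those numbers are rounded, the recomputed ratios can differ from the printed t values in the last decimal place.

```r
# Values copied from the Coefficients section of Table 8.1
estimate  <- c(intercept = -3.799689, height = 0.039651, weight = 0.014871)
std.error <- c(intercept =  0.664114, height = 0.005252, weight = 0.004652)

# Each t value is the estimate measured in units of its standard error
t.value <- estimate / std.error
round(t.value, 2)
```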
• The remaining three modelling assumptions can be studied on the basis of the ob-
served data by calculating and plotting, in appropriate ways, the residuals from the
multiple regression
• Recall from § 6.6 that, in the case of simple linear regression, although linearity
and homoscedasticity can be assessed from a scatterplot of the data, psychological
experiments have shown that a residuals analysis provides a better judgement
• Now that there is more than one predictor, linearity and homoscedasticity can only
be assessed from a scatterplot of residuals against predicted values
– As was noted in § 8.3.1, any scatterplots of the data are two-dimensional views of
something that is intrinsically higher dimensional, and so cannot be conclusive
• The residuals against predicted values plot for the Boys’ Lung Capacity data is shown
in Figure 8.4
– From Figure 8.4, it is concluded that the assumptions of linearity and ho-
moscedasticity are reasonable
• Recall that in Figure 8.2 there was evidence that the variability in lung capacity
increases with weight
Figure 8.4: Residuals against predicted values plot for the multiple regression of boys’
lung capacities (measured as forced vital capacity in litres) on heights in centimetres and
weights in kilograms
– This reinforces the caution given in § 8.3.1 that the two-dimensional scatterplots
are informative in an exploratory sense, but cannot be relied upon to tell the
whole story
• The normal Q-Q plot of the residuals for the Boys’ Lung Capacity data is shown in
Figure 8.5
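The residuals analysis described above can be sketched in R as follows. The data are simulated under hypothetical parameter values, since the case study data are not reproduced here; only the plotting steps are the point of the example.

```r
# Hypothetical simulated data standing in for the case study
set.seed(2)
n      <- 200
height <- rnorm(n, mean = 150, sd = 8)
weight <- rnorm(n, mean = 45,  sd = 8)
fvc    <- -3.8 + 0.04 * height + 0.015 * weight + rnorm(n, sd = 0.3)
fit    <- lm(fvc ~ height + weight)

res  <- residuals(fit)   # observed response minus predicted value
pred <- fitted(fit)      # predicted values from the estimated equation

# Linearity and homoscedasticity: look for no trend and an even vertical spread
plot(pred, res, xlab = "predicted", ylab = "residuals")
abline(h = 0, lty = 2)

# Conditional normality: points should lie close to the reference line
qqnorm(res)
qqline(res)
```

By construction, least-squares residuals average to zero and are uncorrelated with the predicted values, so any visible trend in the first plot points to a failure of the linearity assumption rather than to an artefact of the fit.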
• What if the assumptions of the multiple regression model are not satisfied?
• Frequently, transformations of one or more of the variables – usually just the response
variable – are the best approach
• In regression models, for each subject, the predicted value is the value obtained from
the estimated regression equation
Figure 8.5: Normal Q-Q plot of the residuals for the multiple regression of boys’ lung
capacities (measured as forced vital capacity in litres) on heights in centimetres and weights
in kilograms
Figure 8.6: Residuals against heights plot for the multiple regression of boys’ lung capacities
(measured as forced vital capacity in litres) on heights in centimetres and weights in
kilograms
Figure 8.7: Residuals against weights plot for the multiple regression of boys’ lung capacities
(measured as forced vital capacity in litres) on heights in centimetres and weights in
kilograms
– In this topic, this should be written up as something like ‘The P-value is less
than 2.2 × 10^−16’ or ‘The P-value is 0.000 (to three decimal places)’, or the like
• It was argued in § 6.7.1 in the context of linear regression with only one predictor, that
the null hypothesis is logically equivalent to the hypothesis that the single population
slope parameter is zero
– Now that there is more than one predictor, there is no such simple connection
– Investigating whether the multiple regression model can be simplified is nonethe-
less a relevant exercise
– This will be discussed in detail later, in § 8.8
• For example:
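– The average lung capacity for 12-year-old boys of height 150cm and weight 40
kg is estimated to be −3.799689 + 0.039651 × 150 + 0.014871 × 40 ≈ 2.74 litres
– The average lung capacity for 12-year-old boys of height 150cm and weight 45
kg is estimated to be −3.799689 + 0.039651 × 150 + 0.014871 × 45 ≈ 2.82 litres
– The average lung capacity for 12-year-old boys of height 160cm and weight 40
kg is estimated to be −3.799689 + 0.039651 × 160 + 0.014871 × 40 ≈ 3.14 litres
– The average lung capacity for 12-year-old boys of height 160cm and weight 45
kg is estimated to be −3.799689 + 0.039651 × 160 + 0.014871 × 45 ≈ 3.21 litres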
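The four estimates above follow directly from substituting each height and weight into the estimated regression equation. A minimal sketch, with the coefficients re-typed from Table 8.1 (the variable names are chosen here for illustration):

```r
# Estimates copied from Table 8.1
a  <- -3.799689   # intercept
b1 <-  0.039651   # height coefficient (litres per centimetre)
b2 <-  0.014871   # weight coefficient (litres per kilogram)

# The four height and weight combinations discussed in the text
cases <- expand.grid(height = c(150, 160), weight = c(40, 45))
cases$fvc <- a + b1 * cases$height + b2 * cases$weight
cases
```

Comparing rows shows the interpretation of the coefficients: at a fixed height, an extra 5 kg of weight changes the estimate by 5 × b2, and at a fixed weight, an extra 10 cm of height changes it by 10 × b1.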
• Analogously to the slope parameter in simple linear regression, the coefficient pa-
rameters in multiple regression have meaningful physical interpretations
• As with simple linear regression, the intercept parameter sometimes has a meaningful
physical interpretation as the estimate of the average value of y when x1 = 0 and
x2 = 0
– This only applies when x1 = 0 and x2 = 0 are part of the physical range
– The intercept parameter in our example does not have a meaningful physical
interpretation
– The fact that it is negative is of no special importance
• Since the population coefficients and the population intercept are being estimated,
it is sometimes desirable to indicate the accuracy as well as the size of the estimates
– confidence intervals for the population coefficients and the population intercept
• These are computed in the same way as described in § 6.7.2
• The confidence intervals are shown in Table 8.2
• Thus it may be claimed with 95% confidence that an increase of 1cm in height
corresponds to an increase between 0.02925472 litres and 0.05004630 litres in average
lung capacity, for a fixed weight
• Likewise, it may be claimed with 95% confidence that an increase of 1kg in weight
corresponds to an increase between 0.00566343 litres and 0.02407863 litres in average
lung capacity, for a fixed height
• Similar statements may be made for the intercept parameter if appropriate and re-
quired
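The intervals in Table 8.2 have the familiar form estimate ± multiplier × standard error. The sketch below re-types the values from Tables 8.1 and 8.2 and backs out the multiplier. The multiplier should be the 97.5% quantile of the t distribution on the residual degrees of freedom; the degrees of freedom are not quoted in this section, so they are not assumed here.

```r
# Estimates and standard errors copied from Table 8.1
est <- c(intercept = -3.799689, height = 0.039651, weight = 0.014871)
se  <- c(intercept =  0.664114, height = 0.005252, weight = 0.004652)

# Interval end points copied from Table 8.2
lower <- c(intercept = -5.11415659, height = 0.02925472, weight = 0.00566343)
upper <- c(intercept = -2.48522102, height = 0.05004630, weight = 0.02407863)

midpoint   <- (lower + upper) / 2          # should reproduce the estimates
multiplier <- (upper - lower) / (2 * se)   # half-width in standard-error units

round(midpoint, 6)
round(multiplier, 3)
```

The same multiplier appears for all three parameters, which is what one expects when every interval is built from the same t quantile.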
8.7 Prediction
• The concepts and methodologies for predictions using the multiple regression model
are analogous to those for a simple linear regression model (§ 6.8)
• Table 8.3 shows the predictions and prediction intervals for the cases discussed in
§ 8.6.2
• From Table 8.3, statements such as the following may be made: “With 95% confi-
dence, it may be claimed that a new 12-year-old boy with height 150cm and weight
40kg will have a lung capacity in the range 2.14 litres to 3.34 litres”
• A new subject with height 150cm, weight 40kg and a lung capacity outside that range
would be considered to have an unusual lung capacity for that height and weight
• Note that the multiple regression model has permitted us to make judgements about
the boy’s lung capacity taking account of his height and weight
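In R, predictions and prediction intervals of this kind can be obtained with predict(). The sketch below uses simulated data with hypothetical parameter values, so its numbers will not match Table 8.3; it is meant only to show the calls and the difference between the two kinds of interval.

```r
# Hypothetical simulated data standing in for the case study
set.seed(4)
n   <- 100
dat <- data.frame(height = rnorm(n, mean = 150, sd = 8),
                  weight = rnorm(n, mean = 45,  sd = 8))
dat$fvc <- -3.8 + 0.04 * dat$height + 0.015 * dat$weight + rnorm(n, sd = 0.3)
fit <- lm(fvc ~ height + weight, data = dat)

# New subjects for whom a prediction is wanted
new.boys <- expand.grid(height = c(150, 160), weight = c(40, 45))

# Interval for an individual new subject
pred.int <- predict(fit, newdata = new.boys, interval = "prediction", level = 0.95)
# Interval for the sub-population average
conf.int <- predict(fit, newdata = new.boys, interval = "confidence", level = 0.95)

cbind(new.boys, pred.int)
```

The prediction interval is always wider than the confidence interval for the same height and weight, because it must allow for the scatter of individual subjects about the sub-population average as well as the uncertainty in the estimated equation.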
• In § 6.8.3, there was a discussion about the graphical representation of the fitted
model and the prediction bands
• Now that there is more than one predictor, this is not relevant
• For the Boys’ Lung Capacity data, there are now three dimensions
Table 8.2: Confidence intervals for the multiple regression coefficients for boys’ lung
capacities (measured as forced vital capacity in litres) on heights in centimetres and weights
in kilograms

                  2.5 %      97.5 %
(Intercept) -5.11415659 -2.48522102
height       0.02925472  0.05004630
weight       0.00566343  0.02407863
8.8. MODEL SIMPLIFICATION 147
• It may be useful to the investigator to pick a fixed weight and then represent graph-
ically on axes of lung capacity against height the 95% prediction bands for that
weight
• Similarly, it may be useful to the investigator to pick a fixed height and then represent
graphically on axes of lung capacity against weight the 95% prediction bands for that
height
• Superimposing these on a scatterplot of the original data here though would not be
appropriate, since the original subjects varied both in height and weight
• If a model may be simplified without diminishing the quality of its predictions, then
costs of practical significance might be saved
• Given that the weight of each boy has been measured, do we gain any benefit by also
measuring the boy’s height?
• These questions are effectively asking whether it is possible to simplify the multiple
regression model
Table 8.3: For selected heights (centimetres) and weights (kilograms), the estimated aver-
age boys’ lung capacities (measured as forced vital capacity in litres) and 95% prediction
intervals
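• The model states that the sub-population average lung capacity for some given height
and weight is linearly related to the height and the weight:
M12 : average(lung capacity for a given height and weight)
= a + b1 × height + b2 × weight
• The first question above can be thought of as testing the hypothesis that the model
may be simplified to
M1 : average(lung capacity for a given height)
= a + b1 × height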
– We refer to testing the sub-model M1 against the model M12 and as a short-hand
we refer to the hypothesis M1 : M12
– We are asking whether, in the context of the model M12 , the weight variable is
really needed – the P-value for this test appears in Table 8.1 in the line labelled
weight: 0.00176
– Since the P-value = 0.00176 < 0.05, the hypothesis M1 : M12 is rejected
– Thus there is evidence that, even if the height of the boy is known, the weight
of the boy also adds significantly to the predictive power
• The second question above can be thought of as testing the hypothesis that the model
may be simplified to
M2 : average(lung capacity for a given weight)
= a + b2 × weight
– We refer to testing the sub-model M2 against the model M12 and as a short-hand
we refer to the hypothesis M2 : M12
– We are asking whether, in the context of the model M12 , the height variable is
really needed – the P-value for this test appears in Table 8.1 in the line labelled
height: 8.23 × 10^−12
– Since the P-value = 8.23 × 10^−12 < 0.05, the hypothesis M2 : M12 is rejected
– Thus there is evidence that, even if the weight of the boy is known, the height
of the boy also adds significantly to the predictive power
– However, when the simple linear regression of lung capacity on height was
considered in § 6.3.1, R² for that model was found to be 0.6248 (Table 6.1)
– Thus, although in this case study weight is statistically significant, it only ex-
plains a very small proportion of the variability in lung capacity after height has
been accounted for
– This fact is connected to the inter-relationship that exists between height and
weight in the first place
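The test of a sub-model with one fewer predictor against the full model can also be run as an explicit comparison of the two fitted models; the resulting P-value agrees with the one on the corresponding line of the Coefficients table. The sketch below uses simulated data with hypothetical parameter values, with weight deliberately generated to be related to height.

```r
# Hypothetical simulated data in which weight is related to height
set.seed(5)
n   <- 100
dat <- data.frame(height = rnorm(n, mean = 150, sd = 8))
dat$weight <- 45 + 0.6 * (dat$height - 150) + rnorm(n, sd = 6)
dat$fvc    <- -3.8 + 0.04 * dat$height + 0.015 * dat$weight + rnorm(n, sd = 0.3)

M12 <- lm(fvc ~ height + weight, data = dat)   # both predictors
M1  <- lm(fvc ~ height, data = dat)            # weight omitted

# P-value for weight from the Coefficients table of the full model
p.from.t <- summary(M12)$coefficients["weight", "Pr(>|t|)"]
# P-value from comparing the sub-model M1 with the model M12
p.from.F <- anova(M1, M12)[2, "Pr(>F)"]
c(t.test = p.from.t, F.test = p.from.F)

# Proportion of variability explained with and without weight
c(M1 = summary(M1)$r.squared, M12 = summary(M12)$r.squared)
```

Comparing the two R² values shows how much extra variability the second predictor explains once the first has been accounted for, which is a separate matter from whether it is statistically significant.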
• We shall assume that the subjects can be regarded as a random sample from the
population of children in need of this heart surgery
– That is, we shall assume that the experimenters followed protocols that enable
us to assume the Fundamental Sampling Assumption
• The most notable feature in all three plots is a positive linear relationship
– The relationship between the two predictors, height and weight, will give rise to
an interesting phenomenon when the issue of model simplification is considered
• A plot of residuals against predicted values for the multiple regression model of
catheter length on height and weight is shown in Figure 8.11
• Plots of residuals against each predictor, height and weight, are shown in
Figures 8.12 and 8.13, respectively
• From these plots, the assumptions of linearity and homoscedasticity may be regarded
as acceptable
• The normal Q-Q plot of the residuals in Figure 8.14 shows that the assumption of
conditional normality is acceptable
[Scatterplot: catheter.length on the vertical axis, height on the horizontal axis]
[Scatterplot: weight on the horizontal axis]
[Figure 8.10: scatterplot with weight on the vertical axis and height on the horizontal axis]
Figure 8.11: Residuals against predicted values plot for the multiple regression of heart
catheter lengths (centimetres) on heights (centimetres) and weights (kilograms)
Figure 8.12: Residuals against heights plot for the multiple regression of heart catheter
lengths (centimetres) on heights (centimetres) and weights (kilograms)
Figure 8.13: Residuals against weights plot for the multiple regression of heart catheter
lengths (centimetres) on heights (centimetres) and weights (kilograms)
• Thus we may assume the model of multiple regression of heart catheter lengths
(centimetres) on heights (centimetres) and weights (kilograms)
• The summary of the multiple regression model is shown in Table 8.4
Table 8.4: Heart catheter lengths (centimetres) on heights (centimetres) and weights
(kilograms)

Call:
lm(formula = catheter.length ~ height + weight, data = Heart.Catheter.Length)

Residuals:
   Min     1Q Median     3Q    Max
-6.961 -1.247 -0.262  1.898  7.001

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 20.58787    8.72398   2.360   0.0426 *
height       0.08413    0.14119   0.596   0.5659
weight       0.40467    0.36118   1.120   0.2916
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
• The appropriate null hypothesis to test the significance of the multiple regression is
that heart catheter length is independent of both height and weight
• The P-value is given in the final line of Table 8.4:
F-statistic: 18.67 on 2 and 9 DF, p-value: 0.0006276
• Since the P-value = 0.0006276 < 0.05, the null hypothesis is rejected, and it is
concluded that heart catheter length does depend on height and/or weight
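• From Table 8.4, the estimated multiple regression equation is:
average(heart catheter length required for a given height and weight)
= 20.58787 + 0.08413 × height + 0.40467 × weight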
• 80.57% of the variation in heart catheter lengths required can be attributed to vari-
ation in patients’ heights and weights
• This model may now be used to determine predictions and prediction intervals for
the heart catheter lengths required for children who are to undergo this surgery based
on their heights and weights
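The quoted P-value and R² are both tied to the F-statistic line of the model summary. As a rough check, the sketch below recomputes them from the printed values 18.67, 2 and 9; since the F-statistic is rounded, the results agree with the quoted figures only to about three significant figures.

```r
# Values copied from the F-statistic line of the model summary
F.stat <- 18.67
df1    <- 2    # number of predictors
df2    <- 9    # residual degrees of freedom

# P-value: upper tail area of the F distribution beyond the observed statistic
p.value <- pf(F.stat, df1, df2, lower.tail = FALSE)

# R-squared recovered from the F-statistic and its degrees of freedom
r.squared <- (F.stat * df1) / (F.stat * df1 + df2)

c(p.value = p.value, r.squared = r.squared)
```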
M12 : average(heart catheter length required for a given height and weight)
= a + b1 × height + b2 × weight ,
– Since the P-value = 0.5659 > 0.05, the hypothesis M2 : M12 is not rejected, so the
sub-model M2 is acceptable
• It is extremely important to note that the investigation above does not conclude that
neither predictor needs to be measured!
– Indeed, we have already tested and rejected the null hypothesis that heart
catheter length is independent of both height and weight
• Rather, the investigation concludes that either height or weight may be omitted from
the model provided the other is included
– We saw in Figure 8.10 that height and weight are strongly correlated
– We now see that this correlation is so strong that it is redundant to measure
both variables; one may serve as a surrogate for the other
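The phenomenon can be reproduced with artificial data. In the sketch below, x1 and x2 are hypothetical predictors constructed so that x2 is nearly a copy of x1; all values are simulated and have no connection with the heart catheter data.

```r
# Hypothetical simulated data with two almost interchangeable predictors
set.seed(6)
n  <- 40
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)   # nearly a copy of x1
y  <- x1 + x2 + rnorm(n)

full  <- lm(y ~ x1 + x2)
only1 <- lm(y ~ x1)
only2 <- lm(y ~ x2)

cor(x1, x2)                                 # strength of the inter-relationship
summary(full)$coefficients[, "Pr(>|t|)"]    # tests for omitting one predictor

# Overall test that y is independent of both predictors
f <- summary(full)$fstatistic
overall.p <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))
overall.p

# R-squared for the full model and the two one-predictor models
r2 <- sapply(list(full = full, only1 = only1, only2 = only2),
             function(m) summary(m)$r.squared)
r2
```

With predictors this strongly correlated, the overall test is highly significant while the tests for omitting a single predictor will typically not be, and the three R² values are close to one another.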
• The summary of the model M1 is shown in Table 8.5 and the summary of the model
M2 is shown in Table 8.6
Call:
lm(formula = catheter.length ~ height, data = Heart.Catheter.Length)

Residuals:
    Min      1Q  Median      3Q     Max
-7.1461 -0.7427 -0.2279  1.1876  6.6588

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  12.0122     4.2389   2.834 0.017737 *
height        0.2361     0.0398   5.931 0.000145 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
• Having now determined that M1 and M2 are each acceptable models, it is now ap-
propriate to ask whether either of these models can be further simplified
Figure 8.14: Normal Q-Q plot of the residuals for the multiple regression of heart catheter
lengths (centimetres) on heights (centimetres) and weights (kilograms)
Call:
lm(formula = catheter.length ~ weight, data = Heart.Catheter.Length)

Residuals:
    Min      1Q  Median      3Q     Max
-8.0210 -1.4952 -0.1132  2.0951  7.0553

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 25.63665    2.00922  12.759 1.64e-07 ***
weight       0.61137    0.09725   6.287 9.06e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
– To repeat for emphasis, it is not necessary to measure both height and
weight, provided one of them is measured
• Note that this investigation is not saying that the original model M12 is somehow
invalid, but rather that it is more complex than it need be
– Indeed, this is reflected in the fact that all three models have approximately the
same R²: M12 has R² = 80.57%; M1 has R² = 77.87%; M2 has R² = 79.81%
• Which of these models we use becomes more of a practical than a statistical question
• The case study in § 8.8.2 shows that even in the case of two predictors the task of
determining which variables are necessary is not always straightforward
• The number of potential intermediate models grows quickly with the number of
predictors
• As well as the full model with all predictors included and the null model with no
predictors included, there are 2^k − 2 intermediate models when there are k predictors
• There are many techniques for assessing the acceptability of these intermediate mod-
els, and for giving guidance as to which of the acceptable models to settle on
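To see how quickly the number of intermediate models grows, note that each predictor is either in or out of a model, so k predictors give 2^k possible models; setting aside the full model and the null model leaves 2^k − 2. A small sketch, in which the predictor names x1, x2 and x3 are hypothetical:

```r
# Number of intermediate models for k = 2, ..., 5 predictors
k <- 2:5
counts <- data.frame(predictors = k, intermediate.models = 2^k - 2)
counts

# Listing the intermediate models for three hypothetical predictors
predictors <- c("x1", "x2", "x3")
subsets <- unlist(lapply(1:(length(predictors) - 1), function(m) {
  combn(predictors, m, FUN = paste, collapse = " + ")
}))
subsets
```

For the two-predictor case studies in this chapter the count is 2, corresponding to the sub-models M1 and M2.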
8.9. POLYNOMIAL REGRESSION 159
• If the number of predictor variables is small, say up to about five, then the scheme
called backward elimination is feasible
1. Begin with the full model in which all predictors are included
3. Examine the Coefficients table of the multiple regression model summary and
identify which (if any) sub-models with one fewer predictor are acceptable
4. For each of the acceptable sub-models with one fewer predictor (if any) identified in
step 3, repeat step 3 with that sub-model as the model
5. This process will eventually terminate, once every branch of the tree of sub-models
corresponding to iterative omissions of a single predictor has been investigated
6. There will now be a set of acceptable models; examine practical (as opposed to
statistical) considerations to select one of these
7. As a precaution, conduct a residuals analysis for the selected model to check that
this process has not introduced any strange artefacts
• The “Evaporation from the Soil” problem in Practical 6.1 is an example of backward
elimination applied to a four-predictor data set
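One pass of step 3 can be sketched in R with drop1(), which tests each sub-model with one fewer predictor against the current model. The data are simulated with hypothetical predictors x1, x2 and x3, of which x3 has no effect by construction, and the 5% cut-off is the one used elsewhere in this chapter. This illustrates a single step only, not the whole procedure.

```r
# Hypothetical simulated data: x3 plays no role in generating y
set.seed(7)
n   <- 80
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - 1.5 * dat$x2 + rnorm(n)

full <- lm(y ~ x1 + x2 + x3, data = dat)

# P-value for omitting each single predictor from the current model
step.table <- drop1(full, test = "F")
step.table

# Predictors whose omission gives an acceptable sub-model at the 5% level
p <- step.table[-1, "Pr(>F)"]
names(p) <- rownames(step.table)[-1]
names(p)[p > 0.05]
```

The P-values reported by drop1() are the same as those in the Coefficients table of summary(full), so either display can be used for step 3. Each acceptable sub-model would then be refitted and examined in the same way.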
y = a + b × x + c × x²
• The curve produced by this relationship is called a parabola – see Figure 8.15
• Sometimes in regression problems it is found that the model of linear regression does
not hold simply because the average response is related quadratically, rather than
linearly, to one of the predictors
• This will manifest itself in the plot of residuals against predicted values in the form
of a quadratic (parabolic) pattern for the trend, but there will still be reasonable
homoscedasticity about that trend
• Rather than attempting to find a model using transformations, in this case it is more
appropriate to explicitly incorporate the squared term in the model
• That is, include in the set of predictor variables both the predictor concerned and
also its square as another predictor
– Note that the “linearity” in the “linearity assumption” of the multiple regression
model refers to the fact that the regression equation is linear in the unknown
parameters (a, b1 , b2 , etc)
• This exploitation of the multiple regression model to handle this situation does not
contradict the Fundamental Sampling Assumption
• If necessary, even higher order powers of the predictor (x³, x⁴, ...) could be added
as extra predictors
– In general these relationships are called polynomial relationships, hence the gen-
eral term polynomial regression
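A minimal sketch of quadratic regression in R follows. The data are simulated from a hypothetical parabola, and the squared term is entered in two equivalent ways: inline with I(x^2), and as a separately constructed column x2.

```r
# Hypothetical simulated data following a parabola plus normal scatter
set.seed(8)
x <- seq(0, 20, length.out = 60)
y <- 5 + 0.9 * x - 0.04 * x^2 + rnorm(60, sd = 0.3)

linear    <- lm(y ~ x)            # ignores the curvature
quadratic <- lm(y ~ x + I(x^2))   # squared term entered inline

x2 <- x^2                         # squared term as an explicit extra predictor
quadratic.b <- lm(y ~ x + x2)

# The linear fit leaves a parabolic trend in its residuals
plot(fitted(linear), residuals(linear), xlab = "predicted", ylab = "residuals")
abline(h = 0, lty = 2)

coef(quadratic)
c(linear = sigma(linear), quadratic = sigma(quadratic))
```

Both ways of entering the squared term give the same estimates. The model remains a multiple regression because it is still linear in the unknown parameters.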
• The Fundamental Sampling Assumption in this type of data set relates to the fact
that the data form a time series of measurements
• Observe from Figure 8.19 that there is a problem with the linearity assumption, but
not with the homoscedasticity assumption
[Plot: y on the vertical axis; horizontal axis running from 0 to 20]
[Scatterplot: dge on the horizontal axis]
[Scatterplot: dish on the vertical axis, pri on the horizontal axis]
Figure 8.18: Private residential investment (billions of 1972 dollars) against durable goods
expenditures (billions of 1972 dollars)
Figure 8.19: Residuals against predicted values plot, for the multiple regression of factory
shipments (domestic) of dishwashers (thousands) on durable goods expenditures (billions
of 1972 dollars) and private residential investment (billions of 1972 dollars)
Figure 8.20: Residuals against durable goods expenditures plot, for the multiple regression
of factory shipments (domestic) of dishwashers (thousands) on durable goods expenditures
(billions of 1972 dollars) and private residential investment (billions of 1972 dollars)
• Inspection of Figures 8.20 and 8.21 suggests that the problem lies with the way in
which we are modelling the dependence of factory shipments (domestic) of dishwash-
ers on durable goods expenditures
• This suggests adding the square of durable goods expenditures, dge2, as a further
predictor to the model
• This is still a multiple regression, but now with three predictors, dge, dge2 , and pri
• The assumptions of linearity and homoscedasticity are reasonably satisfied for the
augmented model
• The normal Q-Q plot of the residuals in Figure 8.26 shows that the assumption of
conditional normality is acceptable
• In the discussion in § 8.9.2, it was noted that once an acceptable model had
been found, it was appropriate to examine a time series plot of the residuals – see
Figure 8.27
• Observe from the Coefficients section of Table 8.7 that all three terms in the
model are significant
Figure 8.21: Residuals against private residential investment plot, for the multiple
regression of factory shipments (domestic) of dishwashers (thousands) on durable goods
expenditures (billions of 1972 dollars) and private residential investment (billions of 1972
dollars)
Figure 8.22: Residuals against predicted values plot, for the augmented multiple regression
of factory shipments (domestic) of dishwashers (thousands) on durable goods expenditures
(billions of 1972 dollars) and private residential investment (billions of 1972 dollars)
Figure 8.23: Residuals against durable goods expenditures plot, for the augmented multiple
regression of factory shipments (domestic) of dishwashers (thousands) on durable goods
expenditures (billions of 1972 dollars) and private residential investment (billions of 1972
dollars)
Figure 8.24: Residuals against durable goods expenditures squared plot, for the augmented
multiple regression of factory shipments (domestic) of dishwashers (thousands) on durable
goods expenditures (billions of 1972 dollars) and private residential investment (billions of
1972 dollars)
Figure 8.25: Residuals against private residential investment plot, for the augmented
multiple regression of factory shipments (domestic) of dishwashers (thousands) on durable
goods expenditures (billions of 1972 dollars) and private residential investment (billions of
1972 dollars)
Figure 8.26: Normal Q-Q plot of the residuals for the augmented multiple regression of
factory shipments (domestic) of dishwashers (thousands) on durable goods expenditures
(billions of 1972 dollars) and private residential investment (billions of 1972 dollars)
Figure 8.27: Time series plot of the residuals for the augmented multiple regression of
factory shipments (domestic) of dishwashers (thousands) on durable goods expenditures
(billions of 1972 dollars) and private residential investment (billions of 1972 dollars)
Call:
lm(formula = dish ~ dge + dge2 + pri, data = Appliance.Sales)

Residuals:
     Min       1Q   Median       3Q      Max
-319.243 -157.061   -2.999  141.060  650.402

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.412e+03  3.988e+02  -8.555 1.90e-08 ***
dge          5.431e+01  7.175e+00   7.569 1.46e-07 ***
dge2        -1.744e-01  2.989e-02  -5.832 7.21e-06 ***
pri          4.526e+01  7.353e+00   6.156 3.39e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
• The general way of dealing with such experiments is introduced later in Chapter 12
• In the special case in which all the categorical predictors are simply dichotomous (or
binary), we may exploit the multiple regression model
– Dichotomous (or binary) variables are categorical variables that only have two
values
• Artificially constructed predictor variables are entered into the multiple regression
model
• Like polynomial regression (§ 8.9), this exploitation of the multiple regression model
to handle this situation does not contradict the Fundamental Sampling Assumption
• Consider the case of one continuous numerical predictor (predictor1 ) and one di-
chotomous variable (predictor2 )
– Pick one of the values predictor2 can take as the base-line; create a numerical
(coded) variable (predictor2 .coded) that has the value 0 for the base-line and
1 for the other
– Use the model of multiple regression with predictor1 and predictor2 .coded
as the two predictors
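The coding step can be sketched as follows. The data are simulated, the category labels "A" and "B" are hypothetical, and "A" is arbitrarily taken as the base-line.

```r
# Hypothetical simulated data: one continuous and one dichotomous predictor
set.seed(9)
n <- 60
predictor1 <- runif(n, min = 0, max = 10)
predictor2 <- sample(c("A", "B"), n, replace = TRUE)

# 0-1 coding: 0 for the base-line "A", 1 for the other value "B"
predictor2.coded <- ifelse(predictor2 == "A", 0, 1)

# Simulated response: two parallel lines, 1.5 units apart
response <- 2 + 0.5 * predictor1 + 1.5 * predictor2.coded + rnorm(n, sd = 0.5)

fit <- lm(response ~ predictor1 + predictor2.coded)
coef(fit)
```

With this coding, the coefficient of predictor2.coded estimates the difference in average response between the two categories at any fixed value of predictor1, so the fitted model is a pair of parallel lines.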
8.10. DUMMY VARIABLES 171
❡ Illustrative Example – Fuel consumption data – consider the exercise in Practical 6.2
• It is important to note that this “trick” only works with a 0-1 coding
• It is also important to note that this “trick” will not work with a categorical variable
with three or more possible values