
Statistics

Regression Analysis
Master of Statistics

Prof. dr. Mia Hubert


Prof. dr. Stefan Van Aelst
Prof. dr. Thomas Neyens

KU Leuven, Department of Mathematics

2020-2021
Contents

Introduction

1 The simple regression model
1.1 Examples
1.2 The simple linear model
1.3 Estimation of the regression parameters
1.3.1 The least squares estimator
1.3.2 Examples

2 The general linear model
2.1 The linear model
2.2 Estimation of the regression parameters
2.2.1 The least squares estimator
2.2.2 Properties and geometrical interpretation
2.2.3 Statistical properties of the LS estimator
2.2.4 An example in R
2.3 Analysis of variance
2.3.1 The decomposition of the total sum of squares
2.3.2 The coefficient of multiple determination
2.3.3 The extra sum of squares
2.4 Equivariance properties
2.5 The standardized regression model

3 Statistical inference
3.1 Inference for individual parameters
3.2 Inference for several parameters
3.3 The overall F-test
3.4 Test for all parameters
3.5 A general linear hypothesis
3.6 Mean response and prediction
3.6.1 Inference about the mean response
3.6.2 Inference about the unknown response
3.7 Residual plots

4 Polynomial regression
4.1 One predictor variable
4.2 Several regressors and interaction terms
4.3 Estimation and inference
4.4 Example
4.5 Detecting curvature
4.5.1 Residual plots
4.5.2 Partial residual plots

5 Categorical predictors
5.1 One dichotomous predictor variable
5.1.1 Constructing the model
5.1.2 Estimation and inference
5.1.3 Adding interaction terms
5.2 Extensions
5.2.1 One polytomous predictor variable
5.2.2 More than one categorical variable
5.3 Piecewise linear regression

6 Transformations
6.1 The family of power and root transformations
6.2 Transforming proportions
6.3 Transformations in regression
6.3.1 Power transformation
6.3.2 Box-Cox transformation
6.4 Nonconstant variance
6.4.1 Detecting heteroscedasticity
6.4.2 Variance-stabilizing transformations
6.4.3 Weighted least squares regression

7 Variable selection methods
7.1 Reduction of explanatory variables
7.1.1 Surgical unit example
7.2 All-possible-regressions procedure for variable reduction
7.2.1 Rp^2 criterion
7.2.2 MSEp criterion
7.2.3 Mallows’ Cp
7.2.4 Akaike’s Information Criterion
7.2.5 PRESSp Criterion
7.3 Stepwise regression
7.3.1 Backward elimination
7.3.2 Forward selection
7.3.3 Stepwise regression
7.4 Model validation
7.4.1 Collection of new data
7.4.2 Data splitting

8 Multicollinearity
8.1 The effects of multicollinearity
8.1.1 Uncorrelated predictor variables
8.1.2 Perfectly or highly correlated predictors
8.2 Multicollinearity diagnostics
8.2.1 Informal methods
8.2.2 Variance inflation factors
8.2.3 The eigenvalues of the correlation matrix
8.3 Multicollinearity remedies
8.3.1 Specific solutions
8.3.2 Principal component regression
8.3.3 Ridge regression

9 Influential observations and outliers
9.1 Vertical outliers
9.2 Leverage points
9.2.1 Residuals
9.2.2 Diagnostic plot
9.2.3 The Hat matrix
9.3 Single-case diagnostics
9.3.1 DFFITS
9.3.2 Cook's distance
9.3.3 DFBETAS
9.3.4 Examples
9.4 The LTS estimator
9.4.1 Parameter estimates
9.4.2 Computation
9.4.3 Reweighted LTS
9.5 The MCD estimator
9.5.1 Parameter estimates
9.5.2 Computation
9.5.3 Reweighted MCD-estimator
9.6 A robust R-squared
9.7 Model selection

10 Nonlinear regression
10.1 The nonlinear regression model
10.2 Estimation of the regression parameters
10.3 Numerical algorithms
10.3.1 Deepest descent
10.3.2 The Gauss-Newton procedure
10.3.3 The Levenberg-Marquardt procedure
10.3.4 Starting values
10.4 Example
10.5 Inference about regression parameters

11 Nonparametric regression
11.1 The nonparametric regression model
11.2 The lowess method
11.3 Inference
11.4 Example

Bibliography
Introduction

In its simplest form regression aims to model the relation between an input vari-
able X and an output or response variable Y . Contrary to a correlation analysis,
the regression model is asymmetric. It models the influence or effect of the
input or predictor variable X on the response variable Y . The regression model
allows us to evaluate to what extent the outcome Y changes due to a change
in the value of X. The regression model can then be used to predict Y from
X. Therefore, the input variable X is also called the independent variable, or
regressor, whereas the response variable Y is also called the dependent variable.

More generally, regression analysis models the relationship between a set of


predictor variables X1 , X2 , . . . , Xl−1 and a response variable Y that are mea-
sured on n observations. The goal is now to find a relation between the
Xj (j = 1, . . . , l − 1) and Y , which reveals the joint influence of the X-variables
on Y . This model can then also be used to predict the dependent variable Y
from the independent variables X1 , . . . , Xl−1 . In a very general form, we seek
real functions g, f and a parameter vector β = (β0 , . . . , βp−1 )t such that g(Y )
can be well described by f (X1 , . . . , Xl−1 , β). Unless otherwise stated, we will
assume that the response variable is continuous.

Since the observations will in general not satisfy this functional relation exactly,
the regression model will also include a stochastic component ε which expresses
the variation of the data points around the regression curve. A regression model
thus postulates that:

1. There is a probability distribution of Y for each level of X = (X1, . . . , Xl−1).

2. The means of these probability distributions vary in some systematic fashion with X.

This is illustrated in Figure 1.4.

We will especially study the general linear regression model which is defined
as

g(yi ) = β0 + β1 f1 (xi1 , . . . , xi,l−1 ) + . . . + βp−1 fp−1 (xi1 , . . . , xi,l−1 ) + εi

for i = 1, . . . , n and for certain choices of g, f1 , . . . , fp−1 . The error terms εi


represent the random variation of the data points around the regression curve.
We assume that the standard Gauss-Markov conditions are satisfied:

E[εi] = 0
Var[εi] = σ^2
E[εi εj] = 0 for all i ≠ j.

The first condition expresses that at each level of (X1 , . . . , Xl−1 ) the regression
curve represents the mean of the corresponding probability distribution of Y .
The second condition states that the probability distribution of Y at each level
of (X1 , . . . , Xl−1 ) has the same variance, namely σ 2 . The last condition implies
that the error terms are uncorrelated.
This general linear model includes:

1. the first-order regression model: l = p, g(yi) = yi, fj(xi1, . . . , xi,p−1) = xij (j = 1, . . . , p − 1), so that

   yi = β0 + β1 xi1 + β2 xi2 + . . . + βp−1 xi,p−1 + εi.

2. simple regression: the first-order regression model with p = 2.

3. polynomial regression: g, f1, . . . , fl−1 as in the first-order regression model, and e.g. additionally fl = X1^2, fl+1 = X3^2, fl+2 = X1 X2.

4. variable selection: g, f1, . . . , fl−3 as in the first-order regression model, all other fj = 0.

5. transformations in X or Y: g(Y) = log(Y), g(Y) = (Y^λ − 1)/λ, fj = log(Xj).

Note that this model is linear in β and not necessarily in the independent variables Xj. An example of a nonlinear model is

yi = β0 + β1 e^{β2 xi} + εi.

Chapter 1

The simple regression model

1.1 Examples

The ‘Old Faithful’ geyser is the most famous geyser in Yellowstone National Park (Wyoming, USA). Eruptions occur at intervals with lengths between 45 and 125 minutes. An eruption lasts 1.5 to 5 minutes, during which 14,000 to 32,000 liters of boiling water are shot into the air to a height of 32 to 56 meters. It has been observed that there is a relation between the waiting time until an eruption and the duration of that eruption. To examine this relation, both times (in minutes) have been recorded for 272 eruptions.

Questions that can be examined based on these data are: Is there indeed a strong
influence of waiting time on the following eruption time? Can the waiting time
be used to predict the length of the subsequent eruption? To answer these
questions we model the relation between the waiting and eruption times. First,
we graphically explore this relationship by making a scatterplot of the data.

[Figure: scatterplot of eruption time (eruptions, in minutes) versus waiting time (waiting, in minutes) for the 272 eruptions.]

Note that the predictor variable ’waiting time’ is plotted horizontally, while the
response variable ’eruption time’ is plotted vertically. The scatterplot clearly
reveals that longer waiting times result in longer eruption times. The main
pattern can at least approximately be represented by a line. However, it is also
clear that the relation between both times is far from perfect. We will see how
we can model the effect of the waiting time on the eruption time.

The second example is the result of an industrial laboratory experiment. Under


uniform conditions, batches of electrical insulating fluid have been subjected to
a constant voltage dose (in kV) until the insulating property of the fluid broke
down. The process was repeatedly executed for seven different voltage levels, and in each experiment the time until breakdown (in minutes) of the insulating property of the fluid was recorded. In total 76 experiments were executed. The main question of interest is how the breakdown time depends on the administered voltage.

[Figure: scatterplot of breakdown time (in minutes) versus voltage (in kV).]

A scatterplot of the data is again used to explore the relationship. From the
scatterplot we clearly see that the breakdown time decreases rapidly as the volt-
age increases. However, the type of relation between the two variables is difficult
to see from this plot. This is caused by the skewness in the response variable.
To improve the graphical representation of the data, we apply a logarithmic
transformation on the response variable.
[Figure: scatterplot of the logarithm of breakdown time versus voltage (in kV).]

This scatterplot provides more insight into the data. We see a decrease of the breakdown time (on the logarithmic scale) with increasing voltage dose, a pattern that can be represented by a decreasing line. For every voltage dose we can also see that there is considerable variation in the logarithmic breakdown time. We are now ready to model this relationship.

Note that there is an important difference between both examples. In the first
example, the recorded values for both the waiting and eruption times are ob-
served values. In this example both the X and Y variable are thus random
variables. The observed measurement pairs can even be considered to be a
random sample from the joint distribution of the two variables. In the second
example, the voltage dose is chosen by the experimenters and the breakdown
time is recorded for these fixed doses. In this example, the Y variable is still
random, but the X variable is not. Consequently, in this case the paired dose-time measurements cannot be regarded as a random sample from a joint distribution. Regression modeling as discussed next can be used for both types of data under suitable conditions.

1.2 The simple linear model


The simple linear model is given by

yi = β0 + β1 xi + εi (1.1)

for i = 1, . . . , n. The parameter β0 is called the intercept, whereas β1 is called


the regression slope.

In this model the values xi are not necessarily values of the observed predic-
tor variable X, but can be values for any suitable function f (X). We assume
that X does not contain any random effect or measurement error. Note that
this assumption is naturally satisfied in an experimental setting as in the sec-
ond example where the values of the predictor are chosen and fixed by the
experimenter. In the case of an observational study where the values of X are
observed, as is the case for the response Y , it is far more difficult to satisfy this
assumption. In this case, it is up to the statistician and/or data collectors to
judge whether the X variable is observed with sufficient accuracy so that the
assumption is (approximately) satisfied. If there is considerable randomness in
the observation of X, then more complex models such as measurement error
models are needed.



Similarly as for X, the yi values are not necessarily values of the observed
response variable Y , but can be values for any suitable function g(Y ). For the
error term, we assume that the Gauss-Markov conditions are satisfied which
means that

E[εi] = 0 (1.2)
Var[εi] = σ^2 (1.3)
E[εi εj] = 0 for all i ≠ j (1.4)

for i = 1, . . . , n.
In the case that X is random, it is also assumed that the errors εi are indepen-
dent of X.
As the εi are random variables with zero mean, also Y is a random variable
that satisfies:
E[Y |X] = β0 + β1 X.

Here, E[Y |X] is a function of X that for each value X = x yields the mean of the
corresponding distribution of the response variable Y at X = x. Conditionally
on the observed values for X, this can also be written as:

E[Y |X = xi ] = β0 + β1 xi (1.5)

For the first-order regression model, where the X and Y variables in (1.1) correspond to the observed predictor variable and response variable respectively, this linear relation geometrically implies that we try to estimate a regression line

E[Y |X] = β0 + β1 X

in the (X, Y )-space.


For X = 0 we immediately obtain from (1.5) that E[Y |X = 0] = β0 , hence
the intercept of the model can be interpreted as the expected response when
X equals zero. That is, β0 is the mean of the distribution of Y at X = 0.
If we increase the value of X from an arbitrary value x to x + 1, then the
expected response increases from E[Y |X = x] = β0 + β1 x to E[Y |X = x + 1] =
β0 + β1 (x + 1). Therefore, we find that

β1 = E[Y |X = x + 1] − E[Y |X = x],

so the slope β1 can be interpreted as the change in the expected response Y if


X increases by one unit. That is, if X increases one unit, then Y is expected to
change by β1 units on average.



1.3 Estimation of the regression parameters

1.3.1 The least squares estimator

The simple linear model in (1.1) contains three parameters β0 , β1 and σ. Note
that the two regression parameters β0 and β1 are inherent in the model (1.1)
while the scale parameter σ is a consequence of the second Gauss-Markov condi-
tion (1.3). These model parameters are unknown and need to be estimated from
the available data. A natural strategy is to estimate the regression parameters
such that the corresponding linear function fits the available data points as well
as possible. Otherwise stated, the estimation method should aim to keep the
errors as small as possible. Here, the errors corresponding to any parameter
estimates β̂0 and β̂1 are given by

ei (β̂0 , β̂1 ) = yi − β̂0 − β̂1 xi ; i = 1, . . . , n.

The first Gauss-Markov condition (1.2) implies that positive and negative errors
occur. To avoid that large positive and large negative errors can cancel each
other out in the estimation strategy, a function needs to be used that adds up
all errors regardless of their sign. The two most common functions to achieve
this goal are the absolute value and the square. Hence, the parameters β0 and
β1 can be estimated by minimizing the sum of the absolute errors:
(β̂0,LAD, β̂1,LAD) = argmin_{β0,β1} Σ_{i=1}^n |ei(β0, β1)| = argmin_{β0,β1} Σ_{i=1}^n |yi − (β0 + β1 xi)|.   (1.6)

This estimator is called the least absolute deviations estimator. The other option is to estimate the parameters β0 and β1 by minimizing the sum of the squared errors:

(β̂0,LS, β̂1,LS) = argmin_{β0,β1} Σ_{i=1}^n ei^2(β0, β1) = argmin_{β0,β1} Σ_{i=1}^n (yi − (β0 + β1 xi))^2.   (1.7)

This estimator is called the least squares estimator. Both estimators have
their merits, but the least squares estimator is the standard estimator for the
regression parameters in linear models because it can be solved analytically and
it has some good (optimal) statistical properties that will be discussed later.

The residual sum of squares Σ_{i=1}^n ei^2(β0, β1) is called the objective function or loss function of the least squares estimator. Differentiating this objective function L(β0, β1) = Σ_{i=1}^n [yi − (β0 + β1 xi)]^2 with respect to β0 and β1 and setting these derivatives equal to zero yields the normal equations

∂L(β0, β1)/∂β0 = −2 Σ_{i=1}^n [yi − (β0 + β1 xi)] = 0   (1.8)
∂L(β0, β1)/∂β1 = −2 Σ_{i=1}^n [yi − (β0 + β1 xi)] xi = 0.   (1.9)

The least squares estimators β̂0,LS and β̂1,LS for the simple regression model are
the solution of this system of equations. From (1.8) we find that

β̂0,LS = ȳn − β̂1,LS x̄n . (1.10)

The second equation can be replaced by

∂L(β0, β1)/∂β1 − x̄n · ∂L(β0, β1)/∂β0 = 0,

which leads to the equation

Σ_{i=1}^n [yi − (β0 + β1 xi)](xi − x̄n) = 0.

By substituting result (1.10) into this equation, we obtain that

Σ_{i=1}^n [(yi − ȳn) − β̂1,LS (xi − x̄n)](xi − x̄n) = 0.

Solving this equation yields

β̂1,LS = Σ_{i=1}^n (xi − x̄n)(yi − ȳn) / Σ_{i=1}^n (xi − x̄n)^2 = cov(X, Y) / sX^2 = cor(X, Y) · sY/sX,   (1.11)

where sX and sY are the sample standard deviations of the variables X and Y, and cov(X, Y) and cor(X, Y) are respectively the sample covariance and sample correlation between X and Y.

Note that for the existence of the least squares estimator it is required that sX^2 > 0. This means that the values of the variable X cannot all be the same, which is a natural condition. If all values of X are equal to each other, then
the data do not provide any information on how the value of the response Y
changes with changes in X.



1.3.2 Examples
In the old faithful geyser example, the simple regression model estimated based
on the available data of 272 eruptions becomes

Average eruption time = −1.87 + 0.076 ∗ Waiting time.

The graphical representation of this regression fit shows that the estimated
regression line represents well the main trend in the data.
[Figure: scatterplot of eruptions versus waiting, with the fitted least squares regression line.]
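The fitted coefficients above can be reproduced in R. A minimal sketch, assuming the geyser data are available as R's built-in faithful data frame (272 eruptions, with variables waiting and eruptions):

data(faithful)                                           # 272 eruptions of Old Faithful
geyserfit <- lm(eruptions ~ waiting, data = faithful)
coef(geyserfit)                                          # intercept -1.87, slope 0.076
with(faithful, cov(waiting, eruptions) / var(waiting))   # slope again, via (1.11)
plot(eruptions ~ waiting, data = faithful)
abline(geyserfit)                                        # add the fitted line to the scatterplot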

Note that the intercept has a negative sign which is physically not a meaningful
value because an eruption time cannot be negative. This is an illustration of the
danger of extrapolation from a regression model. The data do not contain any
information about the length of eruption for very short waiting times (because
short waiting times do not occur in reality). Hence, the model cannot be used
to reliably predict what would happen with the eruption time after such small
waiting times. There is no reason why the model would still be valid in this case; in fact, the unrealistic values predicted by the model indicate that it is not valid beyond the range of the observed data.

For the insulating fluid experiments, the simple regression model estimated from
the available data is

Average log(Breakdown time) = 18.96 − 0.51 ∗ Voltage.



The graphical representation of this regression fit shows that the estimated
regression line again represents well the main trend in the data.

[Figure: scatterplot of the logarithm of breakdown time versus voltage, with the fitted least squares regression line.]

The interpretation of the regression coefficients in terms of the transformed re-


sponse variable remains as before. For instance, the slope yields the change in
average logarithmic breakdown time when the voltage dose is increased by one
unit. However, this cannot easily be transformed into an interpretation in terms
of the originally measured response variable. The reason is that the expected
logarithmic response cannot be related to some expectation of the response vari-
able on its original scale (why not?).

In case of a logarithmic transformation the regression coefficients can be inter-


preted in terms of the original response if we assume that the error distribution
is symmetric around its mean zero. This is an acceptable assumption because it
implies that at each value of X the corresponding distribution of Y is centered
around the value of the regression line with equal positive and negative devia-
tions. The symmetry assumption implies that E[Y |X = x] = med[Y |X = x],
that is, the mean coincides with the median of Y at each X = x. Hence the
regression line also models

med[Y |X] = β0 + β1 X

in this case. Now, if the response Y in the simple linear model is the logarithm

12 | CHAPTER 1. THE SIMPLE REGRESSION MODEL


of an actual observed response Ỹ , i.e. Y = log(Ỹ ), then we obtain that

med[Y |X] = med[log(Ỹ )|X] = log(med[Ỹ |X]) = β0 + β1 X,

or equivalently,
med[Ỹ |X]) = exp(β0 ) exp(β1 X).

Moreover, we easily find that

med[Ỹ|X = x + 1] / med[Ỹ|X = x] = exp(β1),

or

med[Ỹ|X = x + 1] = exp(β1) med[Ỹ|X = x].

Hence, exp(β1 ) is the multiplicative change of the median of the measured re-
sponse Ỹ if the predictor X increases by one unit. A similar interpretation holds
for the intercept. In the insulating fluid example, we find that exp(−0.51) = 0.60, so with every unit increase in voltage the median breakdown time is only 60% of what it was before. Otherwise stated, the median breakdown time decreases by 40% for every unit increase in X. For example, the median breakdown time at X = 32 kV is estimated at 15.2 minutes. The median breakdown time at X = 33 kV then becomes 15.2 ∗ 0.6 ≈ 9.1 minutes.

Similarly, if the predictor is included in the linear model in a transformed man-


ner, then one should also think carefully about the interpretation of the regres-
sion coefficients in terms of a change in the value of the originally measured X.
For example, consider a linear model of the form

E[Y |X] = β0 + β1 log(X).

Hence, the predictor X has been logarithmically transformed before inclusion


in the simple regression model. The slope β1 can of course be interpreted as the
change in the expected response Y if log(X) increases by one unit. However,
what can we say about Y if the originally measured X is changed? By using
properties of the logarithm, we find for any c > 0 that

E[Y |X = cx] = β0 + β1 log(cx) = β0 + β1 log(c) + β1 log(x),

such that
E[Y |X = cx] − E[Y |X = x] = β1 log(c).



For c = 2, this result implies that with every doubling of the value of X, the ex-
pected response Y changes by the value β1 log(2). Similarly, a ten-fold increase
of X induces a β1 log(10) change in the expected value of Y .

Being able to estimate the parameters of a linear model is not sufficient. We


should be able to check model adequacy and measure precision of parameter
estimates. In particular, we are often interested in testing whether X does have
an effect on the response Y . Also of interest may be the prediction of the ex-
pected or individual response value for given values of X. Such questions can
be answered by confidence intervals and hypothesis tests. However, the con-
struction of such confidence intervals and hypothesis tests requires assumptions
about the distribution of the errors εi . Usually, it is assumed that these errors
follow a normal distribution. It then becomes important to investigate whether
this assumption is sufficiently reasonable to validate the resulting inference.
These aspects will be discussed in the next chapters in the more general con-
text of linear models with multiple regressors together with solutions if certain
assumptions are not satisfied.



Chapter 2

The general linear model

2.1 The linear model


To simplify the notations we will write the general linear model as

yi = β0 + β1 xi1 + β2 xi2 + . . . + βp−1 xi,p−1 + εi (2.1)

for i = 1, . . . , n. The parameter β0 is called the intercept, whereas the βj (j =


1, . . . , p − 1) are the regression slopes. In this model the Xj do not necessarily
stand for the observed predictor variables, but for any function fj of them. We
also assume that the Xj do not contain any random effect or measurement
error. Moreover we assume that the Gauss-Markov conditions are satisfied. For
all i = 1, . . . , n they state:

E[εi] = 0 (2.2)
Var[εi] = σ^2 (2.3)
E[εi εj] = 0 for all i ≠ j. (2.4)

As the εi are random with zero mean, also Y is a random variable that satisfies:

E[Y |X1 , . . . , Xp−1 ] = β0 + β1 X1 + β2 X2 + . . . + βp−1 Xp−1 .

Conditionally on the observed values for X1 , . . . , Xp−1 , this can also be written
as:
E[Y |xi ] = β0 + β1 xi1 + β2 xi2 + . . . + βp−1 xi,p−1 (2.5)

with xi = (1, xi1 , . . . , xi,p−1 )t . Note that the first element of the x-vector is
1, which is the x-value for the intercept. For the first-order regression model (where the Xj in (2.1) correspond with the observed predictor variables), this
linear relation geometrically implies that we try to estimate a hyperplane in the
(X, Y )-space. With p = 2 we recover simple regression as a special case and
thus fit the regression line

E[Y |X] = β0 + β1 X.



The intercept β0 is the expected response value at xi = (1, 0, . . . , 0)t , that is
when all predictors take the value 0. The slope parameter βj now indicates the
change in the expected value of the response Y due to a unit increase in the
variable Xj when all other predictor variables are held constant. Let xi(j) =
(1, xi1 , . . . , xij , . . . , xi,p−1 )t and xi(j+1) = (1, xi1 , . . . , xij + 1, . . . , xi,p−1 )t , then
from (2.5) it follows that

E(Y |xi(j) ) = β0 + β1 xi1 + β2 xi2 + . . . + βj xij + . . . + βp−1 xi,p−1


E(Y |xi(j+1) ) = β0 + β1 xi1 + β2 xi2 + . . . + βj (xij + 1) + . . . + βp−1 xi,p−1

hence indeed
βj = E(Y |xi(j+1) ) − E(Y |xi(j) ).

Often it is very convenient to write the general linear model (2.1) in matrix
form. Let the vectors y = (y1 , . . . , yn )t , ε = (ε1 , . . . , εn )t and the matrix X =
(x1 , x2 , . . . , xn )t , then (2.1) is equivalent to

y = Xβ + ε (2.6)

whereas (2.2), (2.3) and (2.4) correspond with

E[ε] = 0 (2.7)
Σ(ε) = σ 2 In . (2.8)

Here, Σ(ε) stands for the variance-covariance matrix of the errors, and In for
the n × n identity matrix.

2.2 Estimation of the regression parameters

2.2.1 The least squares estimator

Any parameter estimate β̂ = (β̂0 , . . . , β̂p−1 )t yields fitted values ŷi and residuals
ei :

ei (β̂) = yi − ŷi
= yi − xti β̂.



The least squares estimator β̂ LS is defined as the β̂ for which the sum of
the squared residuals is minimal, or
β̂ LS = argmin_β Σ_{i=1}^n ei^2(β).   (2.9)

The residual sum of squares Σ_{i=1}^n ei^2(β) is called the objective function. Differentiating this objective function with respect to each βj (j = 0, . . . , p − 1) and setting the derivatives equal to zero yields the normal equations

X^t X β = X^t y.

If rank(X) = p ≤ n, the solution of this linear system is given by:

β̂ LS = (X^t X)^{-1} X^t y.   (2.10)

Note that X^t X is the matrix of cross-products:

(X^t X)jk = Σ_{i=1}^n xij xik   (2.11)
(X^t X)jj = Σ_{i=1}^n xij^2.   (2.12)

The condition rank(X) = p ≤ n is necessary to ensure that X^t X is non-singular. Indeed, assume that X^t X is singular. Then there exists a ∈ R^p, a ≠ 0, such that X^t Xa = 0 (∈ R^p). Consequently, 0 = a^t X^t Xa = ‖Xa‖^2, and thus Xa = 0 (∈ R^n). This implies that there exists a linear relation between the columns of X, or rank(X) < p.

Figure 13.3 shows the LS objective function Σ_{i=1}^n ei^2(β) for varying values of β (here, for two regressors). If the rank of X is exactly p, as in Figure 13.3(a), we see that this objective function is convex and hence yields a unique minimum which can be derived analytically. If rank(X) < p, as in Figure 13.3(b), there are an infinite number of LS solutions. In practice, such a perfect linear relationship between the X-variables is not often encountered, but the X-variables might be strongly correlated. This situation is known as multicollinearity. In such a case, the LS fit is uniquely defined, but many other parameter estimates β̂ attain a residual sum of squares which is close to the minimal value attained by β̂ LS (see Figure 13.3(c)). Consequently, small changes in the data set may cause a large change in the parameter estimates. Figure 13.2 illustrates these effects in the data space.
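As a small numerical check of (2.10), the closed-form solution can be compared with R's lm() fit. A sketch with simulated data (not one of the course data sets):

set.seed(1)
n <- 50
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)      # simulated model with beta = (1, 2, -0.5)
X  <- cbind(1, x1, x2)                      # design matrix with intercept column
betahat <- solve(t(X) %*% X, t(X) %*% y)    # solves the normal equations X'X beta = X'y
cbind(betahat, coef(lm(y ~ x1 + x2)))       # the two columns coincide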

2.2.2 Properties and geometrical interpretation
If no confusion is possible, we simply denote β̂ LS as β̂. Let

H = X(X t X)−1 X t (2.13)

and
M = In − H

then the following relations hold for ŷ = (ŷ1 , . . . , ŷn )t and e = (e1 , . . . , en )t :

ŷ = Hy (2.14)
e = My (2.15)
e = Mε (2.16)
Σ(e) = σ^2 (In − H) = σ^2 M. (2.17)

Equations (2.14) and (2.15) are trivial and explain why the matrix H is called the hat matrix. The hat matrix H is symmetric (H^t = H) and idempotent: H^2 = HH = H. From (2.14) we also derive that Σ(ŷ) = HΣ(y)H^t = σ^2 H, hence the diagonal elements hii of the hat matrix are always positive. Other properties of the hat matrix will be derived in Section 9.2.3.

Equation (2.16) follows from e = My = M(Xβ + ε) = Mε because MX = X − HX = X − X(X^t X)^{-1} X^t X = 0n,p. Finally, Σ(e) = MΣ(ε)M^t = σ^2 MM^t = σ^2 M because M is also symmetric and idempotent.



Moreover, the least squares residuals satisfy:

Σ_{i=1}^n ei = 0   (2.18)
Σ_{i=1}^n xij ei = 0   for all j = 1, . . . , p − 1   (2.19)
Σ_{i=1}^n ei ŷi = 0   (2.20)

The first two equations (2.18) and (2.19) follow from X^t e = X^t M ε = 0p,n ε = 0p. Moreover, ŷ^t e = β̂^t X^t e = 0. These equations thus imply that the mean of the least squares residuals is zero, and that the residuals are orthogonal to the design matrix X as well as to the predicted values.
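These identities are easy to verify numerically. A sketch using R's built-in mtcars data (not a course data set):

fit <- lm(mpg ~ wt + hp, data = mtcars)
X   <- model.matrix(fit)                        # design matrix, including the intercept column
H   <- X %*% solve(t(X) %*% X) %*% t(X)         # hat matrix (2.13)
e   <- resid(fit)
all.equal(as.vector(H %*% mtcars$mpg), as.vector(fitted(fit)))   # yhat = Hy, cf. (2.14)
c(sum(e), sum(mtcars$wt * e), sum(fitted(fit) * e))              # (2.18)-(2.20): all essentially 0
sum(e^2) / (nrow(X) - ncol(X))                  # residual variance estimate s^2 (see below)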

From (2.18) we can also deduce that the LS hyperplane passes through the mean of the data points. Indeed, as (1/n) Σi (yi − ŷi) = 0 we have that

ȳ = (1/n) Σi ŷi
  = (1/n) Σi (β̂0 + β̂1 xi1 + . . . + β̂p−1 xi,p−1)
  = β̂0 + β̂1 x̄1 + . . . + β̂p−1 x̄p−1.   (2.21)

As a result, the intercept of the LS fit will be zero if we first mean-center the data, by setting yi^c = yi − ȳ and xij^c = xij − x̄j for each i = 1, . . . , n and j = 1, . . . , p − 1. From (2.21) we see indeed that the intercept β̂0^c of the LS fit through the transformed data equals

β̂0^c = ȳ^c − β̂1^c x̄1^c − . . . − β̂p−1^c x̄p−1^c = 0.

Relations (2.19) and (2.20) can also be derived from the geometrical interpretation of the least squares estimator. If we consider the observed y-vector and the column vectors of X as points in R^n, the least squares estimate β̂ LS is defined as the vector β which minimizes the Euclidean norm ‖y − Xβ‖. This is because for any vector z ∈ R^n it holds that

‖z‖ = (Σ_{i=1}^n zi^2)^{1/2} = √(z^t z).

Therefore, ŷ is the orthogonal projection of y on the linear subspace of R^n spanned by the columns of X. Consequently, e = y − ŷ is orthogonal to both ŷ and each column of X (see Figure 10.6).



The variance of the errors σ^2 can be estimated by the mean squared error (MSE):

σ̂^2 = s^2 = 1/(n − p) Σ_{i=1}^n ei^2.

Following (2.17), the variance-covariance matrix of the residuals is then estimated by

Σ̂(e) = s^2 (In − H) = MSE (In − H).   (2.22)



2.2.3 Statistical properties of the LS estimator
Under the Gauss-Markov conditions (2.2), (2.3) and (2.4), the following prop-
erties hold:

Property 1 The least squares estimator β̂ LS is an unbiased and consistent


estimator of β (we only need (2.2) to obtain the unbiasedness).

The unbiasedness follows directly from

E[β̂ LS ] = (X t X)−1 X t E[y] = (X t X)−1 X t Xβ = β.

Property 2 The variance-covariance matrix of β̂ LS is given by:

Σ(β̂ LS ) = σ 2 (X t X)−1 . (2.23)

This follows from

Σ(β̂ LS ) = (X t X)−1 X t Σ(y)X(X t X)−1 = σ 2 (X t X)−1 .

Property 3 (Gauss-Markov theorem) β̂ LS is the best linear unbiased esti-


mator (BLUE) of β, i.e. any other linear and unbiased estimator of the form
Ay has a larger variance than β̂ LS .

Property 4 The MSE s2 is an unbiased and consistent estimator of σ 2 .

To show the unbiasedness of s^2 we compute

E[Σi ei^2] = E[e^t e] = E[ε^t M^t M ε] = E[ε^t M ε]
           = E[trace(ε^t M ε)] = E[trace(M εε^t)]
           = trace(E[M εε^t]) = trace(M E[εε^t]) = σ^2 trace(M).

Further it holds that

trace(M) = trace(In − H) = n − trace(H)
         = n − trace(X(X^t X)^{-1} X^t) = n − trace((X^t X)^{-1} X^t X)
         = n − trace(Ip) = n − p.

Property 5 s^2 (X^t X)^{-1} is an unbiased and consistent estimator of σ^2 (X^t X)^{-1}.

We thus estimate the covariance matrix of β̂ LS as:

Σ̂(β̂ LS) = s^2 (X^t X)^{-1}.   (2.24)



Property 6 If the errors ε are normally distributed, β̂ LS is the maximum likelihood estimator of β. The maximum likelihood estimator of σ^2 is given by

σ̂_ML^2 = (1/n) Σ_{i=1}^n ei^2.   (2.25)

If ε ∼ Nn(0, σ^2 In), the likelihood function can be expressed as

L(β, σ^2 | y) = Π_{i=1}^n (1/σ) φ((yi − xi^t β)/σ)

with φ the density of the standard normal distribution. Hence

L(β, σ^2 | y) = Π_{i=1}^n 1/√(2πσ^2) exp(−(yi − xi^t β)^2 / (2σ^2)).

The log-likelihood function l = log(L) is then equal to

l(β, σ^2 | y) = const − (n/2) log σ^2 − 1/(2σ^2) Σ_{i=1}^n (yi − xi^t β)^2.   (2.26)

Consequently, the log-likelihood function is maximized over β by minimizing the last term, or equivalently by taking β̂ = β̂ LS. Finally, it is easy to show that l(β̂ LS, σ^2 | y) is maximized (over all σ^2) by the expression (2.25).



2.2.4 An example in R

The data frame fuel.frame (from the ‘SemiPar’ library) contains information
of 60 cars. This data set contains 5 variables: Weight (the weight of the car
in pounds), Disp. (the engine displacement in liters), Mileage (gas mileage in
miles/gallon), Fuel (fuel consumption in gallons per 100 miles, it thus is equal
to 100/Mileage), and Type (a factor giving the general type of car, with levels:
Small, Sporty, Compact, Medium, Large, Van).

We want to predict the fuel consumption of a car by its weight and engine
displacement. The postulated model is:

Fueli = β0 + β1 Weighti + β2 Dispi + εi ,

with εi ∼ N (0, σ 2 ).

library(SemiPar)        # provides the fuel.frame data
data(fuel.frame)
attach(fuel.frame)
## help(fuel.frame)
names(fuel.frame)
pairs(~ Fuel + Weight + Disp.)

[Figure: pairwise scatterplots of Fuel, Weight and Disp.]

The pairwise plots suggest that both Weight and Disp. are linearly related to
Fuel. The analysis yields:



Fuelfit <- lm(Fuel~Weight+Disp.)
Fuelsum <- summary(Fuelfit)
Fuelsum

Call:
lm(formula = Fuel ~ Weight + Disp.)

Residuals:
Min 1Q Median 3Q Max
-0.81089 -0.25586 0.01971 0.26734 0.98124

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.4789731 0.3417877 1.401 0.167
Weight 0.0012414 0.0001720 7.220 1.37e-09 ***
Disp. 0.0008544 0.0015743 0.543 0.589
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3901 on 57 degrees of freedom


Multiple R-squared:  0.7438,    Adjusted R-squared:  0.7348
F-statistic: 82.75 on 2 and 57 DF, p-value: < 2.2e-16

The fitted model is thus:

F̂ueli = 0.48 + 0.0012 Weighti + 0.00085 Dispi

with σ̂ = 0.39.



2.3 Analysis of variance

2.3.1 The decomposition of the total sum of squares

For an individual observation we have the identity

yi − ȳ = (yi − ŷi ) + (ŷi − ȳ)

which is illustrated for simple regression in Figure 5.4.

Squaring both sides of the equation and summing over all observations gives

Σ_{i=1}^n (yi − ȳ)^2 = Σ_{i=1}^n (ŷi − ȳ)^2 + Σ_{i=1}^n (yi − ŷi)^2.   (2.27)

The cross-product term vanishes because of the Pythagorean theorem (see Figure 10.6). It can also be deduced from (2.18) and (2.20) by noting that

Σi (ŷi − ȳ)(yi − ŷi) = Σi (ŷi − ȳ)ei = Σi ŷi ei − ȳ Σi ei = 0.

Relation (2.27) is the ANOVA decomposition which says that the total variation
(SST) in the response y can be decomposed into an ‘explained’ component due
to the regression (SSR) and an ‘unexplained’ component due to the errors (SSE).
We thus have:
SST = SSR + SSE



with degrees of freedom n − 1, p − 1 and n − p, respectively. The mean squares are defined as the sums of squares divided by their degrees of freedom:

MSR = SSR / (p − 1)
MSE = SSE / (n − p).

They are typically written in an ANOVA table as in Table 6.1.
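A short check of the decomposition (2.27) in R, assuming the fit Fuelfit and the data frame fuel.frame from Section 2.2.4 are still available:

SST <- sum((fuel.frame$Fuel - mean(fuel.frame$Fuel))^2)
SSE <- sum(resid(Fuelfit)^2)
SSR <- sum((fitted(Fuelfit) - mean(fuel.frame$Fuel))^2)
all.equal(SST, SSR + SSE)         # TRUE: SST = SSR + SSE
c(MSR = SSR / 2, MSE = SSE / 57)  # degrees of freedom p - 1 = 2 and n - p = 57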



2.3.2 The coefficient of multiple determination
The coefficient of multiple determination is defined as

R^2 = SSR/SST = 1 − SSE/SST   (2.28)
    = 1 − Σ_{i=1}^n (yi − ŷi)^2 / Σ_{i=1}^n (yi − ȳ)^2.   (2.29)

It measures the proportion of the total variation in the response y that is explained by the linear model (2.1) which includes the variables X1, . . . , Xp−1. By construction 0 ≤ R^2 ≤ 1. The minimum value 0 is attained when all ŷi = ȳ, i.e. when all β̂j = 0 for j = 1, . . . , p − 1. The maximum value 1 is attained when all the observations fall exactly on the fitted regression surface, i.e. when yi = ŷi for all cases i.

Remarks:

1. In simple regression, R^2 coincides with the squared correlation coefficient r^2 between X = X1 and Y.

2. A high value of R^2 does not necessarily imply that the fitted model is useful to make predictions.

3. One can always increase R^2 by adding variables to the model. Therefore the adjusted coefficient of determination Ra^2 corrects for the number of variables:

   Ra^2 = 1 − (SSE/(n − p)) / (SST/(n − 1)).   (2.30)

   However, Ra^2 can take negative values.

4. If the general model (2.1) does not contain an intercept term, that is, when β0 = 0, the ANOVA decomposition becomes:

   Σ_{i=1}^n yi^2 = Σ_{i=1}^n ŷi^2 + Σ_{i=1}^n (yi − ŷi)^2.

   The R^2 coefficient is then defined by:

   R^2 = 1 − Σ_{i=1}^n (yi − ŷi)^2 / Σ_{i=1}^n yi^2.   (2.31)

   As the denominators of (2.29) and (2.31) are different, it is dangerous to compare the R^2-value of a model with intercept to the R^2-value of the same model without intercept.

5. The positive square root of R^2, i.e. R = √(R^2), is called the coefficient of multiple correlation. From Figure 10.8 it is clear that

   R = cos(y, ŷ).
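These quantities are also returned by summary(). A sketch, again assuming Fuelfit and fuel.frame from Section 2.2.4 are available:

SSE <- sum(resid(Fuelfit)^2)
SST <- sum((fuel.frame$Fuel - mean(fuel.frame$Fuel))^2)
c(R2  = 1 - SSE / SST,                  # (2.29): 0.7438
  R2a = 1 - (SSE / 57) / (SST / 59))    # (2.30): 0.7348
summary(Fuelfit)$r.squared              # same values as reported in the summary output
summary(Fuelfit)$adj.r.squared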


2.3.3 The extra sum of squares
The extra sum of squares measures the marginal reduction in the error sum of
squares (SSE) when one or several predictor variables are added to the regres-
sion model, given that other variables are already in the model. Equivalently,
it measures the marginal increase in the regression sum of squares (SSR) when
one or several regressors are added to the model.

Consider as an example a regression model with three predictors: X1 , X2 and


X3 . If we only include X1 in the model, we know that

SST = SSR(X1 ) + SSE(X1 ). (2.32)

Now, add variable X2 , then

SST = SSR(X1 , X2 ) + SSE(X1 , X2 ) (2.33)

because the total sum of squares did not change. Note that

SSE(X1, X2) ≤ SSE(X1).

We now define the extra sum of squares as:

SSR(X2 |X1 ) = SSE(X1 ) − SSE(X1 , X2 ) (2.34)

or equivalently

SSR(X2 |X1 ) = SSR(X1 , X2 ) − SSR(X1 ). (2.35)

Analogously, if we include X3 as well:

SSR(X3 |X1 , X2 ) = SSE(X1 , X2 ) − SSE(X1 , X2 , X3 )


= SSR(X1 , X2 , X3 ) − SSR(X1 , X2 )

These definitions and formulas also hold when we include several regressors at
once. For example,

SSR(X2 , X3 |X1 ) = SSE(X1 ) − SSE(X1 , X2 , X3 )


= SSR(X1 , X2 , X3 ) − SSR(X1 ).

When we combine (2.32) and (2.34) we see that the total sum of squares (SST)
can be written as

SST = SSR(X1 ) + SSR(X2 |X1 ) + SSE(X1 , X2 )



or as
SST = SSR(X2 ) + SSR(X1 |X2 ) + SSE(X1 , X2 )

if we include X2 first in the regression model.


Equation (2.35) is equivalent to:

SSR(X1 , X2 ) = SSR(X1 ) + SSR(X2 |X1 ). (2.36)

We can thus decompose the SSR of the full model (here, with all 3 predictors)
into several extra sum of squares, as in Table 7.3. Note that the degrees of
freedom associated with each sum of squares is equal to the number of variables
that are added to the model.

The ANOVA analysis of the fuel.frame data set yields:

summary(aov(Fuelfit))

Df Sum Sq Mean Sq F value Pr(>F)


Weight 1 25.139 25.139 165.209 <2e-16 ***
Disp. 1 0.045 0.045 0.295 0.589
Residuals 57 8.673 0.152
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Note the difference

Fuelfit2 <- lm(Fuel ~ Disp. + Weight)
summary(aov(Fuelfit2))



Df Sum Sq Mean Sq F value Pr(>F)
Disp. 1 17.253 17.253 113.38 3.58e-15 ***
Weight 1 7.931 7.931 52.12 1.37e-09 ***
Residuals 57 8.673 0.152
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
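The extra sums of squares can equivalently be obtained by comparing nested fits with anova(). A sketch, assuming the fuel.frame data from Section 2.2.4 (SemiPar package) is loaded:

fit1  <- lm(Fuel ~ Weight, data = fuel.frame)
fit12 <- lm(Fuel ~ Weight + Disp., data = fuel.frame)
anova(fit1, fit12)   # 'Sum of Sq' equals SSR(Disp. | Weight) = 0.045
fit2  <- lm(Fuel ~ Disp., data = fuel.frame)
anova(fit2, fit12)   # 'Sum of Sq' equals SSR(Weight | Disp.) = 7.931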



2.4 Equivariance properties
The least squares estimator satisfies several equivariance properties. If the
model
yi = xti β + εi (2.37)

holds, then also

yi + xti v = xti β + xti v + εi


= xti (β + v) + εi

for any vector v. Hence, if a regression estimator applied to the (xi , yi ) yields
β̂, then it is desirable that the estimator applied to the (xi , yi + xti v) yields
β̂ + v.

Property 7 β̂ LS is regression equivariant:

β̂ LS (xi , yi + xti v) = β̂ LS (xi , y i ) + v

for any vector v.

This follows from

β̂ LS(xi, yi + xi^t v) = argmin_β Σ_{i=1}^n (yi + xi^t v − xi^t β)^2
                      = argmin_β Σ_{i=1}^n (yi − xi^t (β − v))^2.

Thus β̂ LS(xi, yi + xi^t v) − v = β̂ LS(xi, yi).

Model (2.37) also yields the equalities

cyi = xti cβ + cεi

for any constant c and


yi = (Axi)^t (A^t)^{-1} β + εi

for any non-singular p × p matrix A.

Property 8 β̂ LS is scale equivariant:

β̂ LS(xi, cyi) = c β̂ LS(xi, yi)

for any constant c, and

σ̂_LS^2(xi, cyi) = c^2 σ̂_LS^2(xi, yi).



Property 9 β̂ LS is affine equivariant:

β̂ LS(Axi, yi) = (A^t)^{-1} β̂ LS(xi, yi)

for any non-singular p × p matrix A.

Scale equivariance implies that the fit is essentially independent of the choice of measurement unit for the response variable y. Also, if we apply e.g. a logarithmic transformation to y, it does not really matter whether we use the natural logarithm log(y) = ln(y) or log10(y), as they only differ by a constant factor.

The affine equivariance allows linear transformations of the regressors, including


changes in the measurement units.
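A quick numerical illustration of scale equivariance (a sketch on R's built-in mtcars data, not a course data set):

fit  <- lm(mpg ~ wt + hp, data = mtcars)
fitc <- lm(I(10 * mpg) ~ wt + hp, data = mtcars)   # response rescaled by c = 10
all.equal(coef(fitc), 10 * coef(fit))              # TRUE: all coefficients scale by c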



2.5 The standardized regression model

The computation of the least squares parameters involves the inverse of X t X.


This matrix operation will be sensitive to roundoff errors, when

1. the determinant of X t X is close to zero (multicollinearity!)

2. the elements of X t X differ significantly in order of magnitude, which oc-


curs when the predictor variables have substantially different magnitudes.

The standardized regression model is obtained by transforming the X (and Y )


variables such that the new X t X matrix corresponds with the correlation ma-
trix of the original X-variables. Consequently its entries (in absolute value) are
bounded by 1 and thus are less sensitive to roundoff errors.

This transformation is also used when we want to compare the regression coefficients
in common units. Consider e.g. the estimated regression plane:

Ŷ = 200 + 20000X1 + 0.2X2

with Y measured in dollars, X1 in thousands of dollars and X2 in cents. Then a 1-unit increase of X1 (i.e. an increase of $1000) with X2 held constant corresponds with β̂1 = $20000. The same $1000 increase equals a 100000-unit increase of X2, which with X1 held constant yields 100000 β̂2 = $20000. Both regressors thus have the same effect on Y, although the regression coefficients suggest otherwise.

The correlation transformation is defined for each observation i = 1, . . . , n and for each variable j = 1, . . . , p − 1 as:

xij' = (1/√(n − 1)) · (xij − x̄j)/sj   (2.38)
yi'  = (1/√(n − 1)) · (yi − ȳ)/sY    (2.39)

with sj and sY the standard deviations of Xj and Y, respectively. Using (2.11) and (2.12) we then obtain for the transformed variables:

((X')^t X')jk = Σ_{i=1}^n xij' xik' = 1/(n − 1) · Σi (xij − x̄j)(xik − x̄k) / (sj sk) = cov(Xj, Xk)/(sj sk) = rjk
((X')^t X')jj = sj^2/(sj sj) = 1
((X')^t y')j = cov(Xj, Y)/(sj sY) = rjy

with rjk the simple correlation between Xj and Xk, and rjy the correlation between Xj and Y.

In terms of the transformed variables, the general linear model (2.1)

yi = β0 + β1 xi1 + . . . + βp−1 xi,p−1 + εi

now becomes

yi − ȳ = β1 (xi1 − x̄1) + . . . + βp−1 (xi,p−1 − x̄p−1) + εi

and thus

(yi − ȳ)/sY = β1 (s1/sY) (xi1 − x̄1)/s1 + . . . + βp−1 (sp−1/sY) (xi,p−1 − x̄p−1)/sp−1 + εi/sY.

We could drop the intercept term from the model because the observations are mean-centered! If we finally divide each term by √(n − 1), we obtain the standardized regression model

yi' = β1' xi1' + β2' xi2' + . . . + βp−1' xi,p−1' + εi'   (2.40)

for i = 1, . . . , n with

εi' = εi / (√(n − 1) sY)   (2.41)
βj' = (sj/sY) βj.   (2.42)
The regression coefficients βj' are often called the standardized regression coefficients. Because of the correlation transformation, their least squares estimates satisfy:

β̂' = R_XX^{-1} r_XY.   (2.43)



Here, RXX is the correlation matrix of X, and rXY = (r1y , . . . , rp−1,y )t con-
tains the correlations between each predictor variable and the response variable.

To return to the estimates with respect to the original variables we use (2.42), (2.21) and the equivariance properties of the least squares estimator:

β̂j = (sY/sj) β̂j'   (2.44)
β̂0 = ȳ − β̂1 x̄1 − . . . − β̂p−1 x̄p−1.   (2.45)
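A sketch of the correlation transformation and the back-transformation (2.44) in R, using the built-in mtcars data for illustration (the 1/√(n − 1) factor cancels and does not affect the coefficients):

fit  <- lm(mpg ~ wt + hp, data = mtcars)
zfit <- lm(scale(mpg) ~ scale(wt) + scale(hp) - 1, data = mtcars)   # standardized model (2.40)
coef(zfit)                                                          # standardized coefficients beta_j'
coef(zfit) * sd(mtcars$mpg) / c(sd(mtcars$wt), sd(mtcars$hp))       # (2.44): equals coef(fit)[-1]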

Example.
Dwaine Studios Inc. operates portrait studios in 21 cities of medium size. They
are specialized in portraits of children. The company wants to investigate
whether sales in a community (Y , expressed in $1000) can be predicted from
the number of persons aged 16 or younger in that community (X1 in thousands
of persons) and the per capita personal income (X2 in $1000). Some data and
results are shown in Table 7.5. The standardized regression model yields:

ŷi' = 0.75 xi1' + 0.25 xi2'

whereas the model in the original variables yields estimated values:

ŷi = −68.86 + 1.45 xi1 + 9.36 xi2.

In the latter model, β̂1 and β̂2 cannot be compared directly because the variables X1 and X2 are measured in different units. The standardized coefficients tell us that an increase of one standard deviation of X1 when X2 is fixed leads to a much larger increase in expected sales than if we fix X1 and increase X2 by one standard deviation. We should, however, be cautious about this interpretation, as the correlation between X1 and X2 also has an effect on the regression coefficients.

Chapter 3

Statistical inference

When we want to make inferences about β we assume that the errors are inde-
pendent and normally distributed, i.e.

ε ∼ Nn (0, σ 2 In ). (3.1)

Under this condition, the general linear model satisfies:

y ∼ Nn (Xβ, σ 2 In ) (3.2)
β̂ LS ∼ Np (β, σ 2 (X t X)−1 ). (3.3)

Note that (3.2) does not say that the {yi, i = 1, . . . , n} follow a common univariate normal distribution. It says that at a certain x, the corresponding response variable y is normally distributed. In particular, yi ∼ N(xi^t β, σ^2) for i = 1, . . . , n. This is in general difficult to check as we often only have one measurement for each xi. Normality of the residuals, on the other hand, can be verified using residual plots (see Section 3.7).

3.1 Inference for individual parameters


From (3.3), we obtain

β̂j ∼ N(βj, σ^2 [(X^t X)^{-1}]jj)

and its estimated standard error

s(β̂j) = s √([(X^t X)^{-1}]jj).

Moreover, it can be shown that

(n − p) s^2/σ^2 ∼ χ^2_{n−p}

and that β̂j and s^2 are independent. Consequently,

(β̂j − βj) / s(β̂j) ∼ t_{n−p}.

Under the null hypothesis

H0: βj = 0
H1: βj ≠ 0

it then holds that

t = β̂j / s(β̂j) ∼H0 t_{n−p}.   (3.4)

These t-values and their corresponding p-values are usually reported in the out-
put of an analysis with a statistical software package. If the p-value is smaller
than α, we reject the H0 hypothesis in favor of the alternative.

Equivalently, we can construct a (1 − α)100% confidence interval for βj:

CI(βj, α) = [β̂j − t_{n−p, α/2} s(β̂j), β̂j + t_{n−p, α/2} s(β̂j)]

and reject H0 if 0 does not belong to CI(βj, α). Note that the quantile t_{n−p, α/2} satisfies

P(T > t_{n−p, α/2}) = α/2   with T ∼ t_{n−p}.
Remember that α is the probability of a type I error, i.e.

α = P (H0 is rejected |H0 is correct).
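In R these quantities are directly available; a sketch, assuming the fit Fuelfit from Section 2.2.4 is still available:

summary(Fuelfit)$coefficients      # estimates, standard errors, t values (3.4), p-values
confint(Fuelfit, level = 0.95)     # 95% confidence intervals; 0 lies inside the Disp. interval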



Example: Fuel data
Let us look again at the fuel consumption data. Based on the output of the linear model fit in Section 2.2.4 we can now examine the relevance of both regressors in the model.
The hypothesis H0: β1 (Weight) = 0 is rejected because the corresponding p-value is essentially 0 (1.37e-09). On the other hand, the hypothesis H0: β2 (Disp.) = 0 cannot be rejected at the 5% significance level because its p-value 0.59 > 0.05.



3.2 Inference for several parameters
When we want to test whether a group of parameters is significant we state the
null and alternative hypothesis as:

H0 : βp−q = βp−q+1 = . . . = βp−1 = 0


H1 : not all βj equal zero (j = p − q, . . . , p − 1)

(we assume that we want to make a test on the last q parameters).

Example:
Suppose we fit a regression model with three slope parameters:

yi = β0 + β1 xi1 + β2 xi2 + β3 xi3 + εi

and we want to test whether

H0 : β2 = 0 and β3 = 0.

We could then accept H0 at the α·100% significance level if 0 ∈ CI(β2, α) and 0 ∈ CI(β3, α). However, the probability of a type I error then increases:

P(H0 is accepted | H0 is correct)
  = P(0 ∈ CI(β2, α) and 0 ∈ CI(β3, α))
  = 1 − P(0 ∉ CI(β2, α) or 0 ∉ CI(β3, α))
  = 1 − P(0 ∉ CI(β2, α)) − P(0 ∉ CI(β3, α)) + P(0 ∉ CI(β2, α) and 0 ∉ CI(β3, α))
  ≥ 1 − P(0 ∉ CI(β2, α)) − P(0 ∉ CI(β3, α))
  = 1 − α − α = 1 − 2α.

Hence,

P(H0 is rejected | H0 is correct) ≤ 1 − (1 − 2α) = 2α.

If we want to be sure that

P(H0 is rejected | H0 is correct) ≤ α

we can apply the Bonferroni correction. For this, we construct simultaneous confidence intervals which are wider than the individual confidence intervals:

SCI(βj, α) = CI(βj, α/2).



In general, simultaneous confidence intervals for testing g ≤ p parameters with a confidence of at least 1 − α are given by

[β̂j − t_{n−p, α/(2g)} s(β̂j), β̂j + t_{n−p, α/(2g)} s(β̂j)].
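These Bonferroni intervals can be obtained with confint() by adjusting the confidence level, since level 1 − α/g corresponds to the quantile t_{n−p, α/(2g)}. A sketch for the two slopes of Fuelfit (g = 2, α = 0.05), assuming the fit from Section 2.2.4:

confint(Fuelfit, parm = c("Weight", "Disp."), level = 1 - 0.05 / 2)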

Because the simultaneous confidence intervals are wider than the individual confidence intervals, they define a much larger region in R^g. Therefore they yield a larger type II error, i.e. the probability that the H0 hypothesis will not be rejected although the alternative is true will be larger. Equivalently, the probability to detect that H1 is correct will be smaller. Recall that

β = P(type II error) = P(H0 is accepted | H1 is correct).

Therefore, the Bonferroni method is to be preferred when the number of hy-


potheses that we want to test is limited. Otherwise, other simultaneous confidence
intervals (e.g. Scheffé) will have a coverage which is closer to (1−α)100%. They
will give a better balance between the probabilities of type I and type II error.

Another disadvantage of this procedure is that the correlation between the pa-
rameter estimates is not taken into account. A partial F-test, which is based
on Σ̂(β̂), attains the correct significance level. Under H0 we obtain the reduced model:

yi = β0 + β1 xi1 + . . . + βp−q−1 xi,p−q−1 + εi.

Let SSEp−q denote the error sum of squares under this reduced model, i.e.
SSEp−q = SSE(X1 , . . . , Xp−q−1 ) and let SSEp be the error sum of squares under
the full model. Thus SSEp = SSE(X1 , . . . , Xp−1 ). Under condition (3.1) it can
be shown that
F = [(SSEp−q − SSEp)/q] / [SSEp/(n − p)] ∼H0 Fq,n−p   (3.5)

This test statistic can as well be described using the extra sum of squares.
By (2.34) we have that

SSEp−q − SSEp = SSR(Xp−q , . . . , Xp−1 |X1 , . . . , Xp−q−1 ).

Thus, (3.5) becomes:

F = MSR(Xp−q, . . . , Xp−1 | X1, . . . , Xp−q−1) / MSE(X1, . . . , Xp−1)    (3.6)



Moreover,

MSR(Xp−q, . . . , Xp−1 | X1, . . . , Xp−q−1) = (1/q) [ SSR(Xp−q | X1, . . . , Xp−q−1)
+ SSR(Xp−q+1 | X1, . . . , Xp−q) + . . . + SSR(Xp−1 | X1, . . . , Xp−2) ].

Therefore, this F-statistic can easily be computed from a (detailed) ANOVA


table as in Table 7.3.

Example: Body Fat Data.

The Body Fat data study the relation of amount of body fat (Y ) to three
possible predictor variables: triceps skinfold thickness (X1 ), thigh circumference
(X2 ) and midarm circumference (X3 ). Measurements are taken on 20 healthy
women between 25 and 34 years old. Assume we want to test:

H0 : β2 = β3 = 0
H1 : not both β2 and β3 equal zero

Using the ANOVA Table 7.4, we derive the partial F-statistic


F = [ (33.17 + 11.54)/2 ] / (98.41/16) = 3.63.

As F2,16,0.05 = 3.63, the p-value of our test is 5% and we are exactly at the boundary
of the decision rule. At a stricter level, e.g. 1%, we would not reject the H0
hypothesis.
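As a quick check, the same F-statistic and its p-value can be computed in R from the ANOVA table entries; with fitted full and reduced models the same test is also produced by anova(reduced, full). This is only a sketch of the computation:

Fobs <- ((33.17 + 11.54)/2) / (98.41/16)          # partial F-statistic
Fobs
pf(Fobs, df1 = 2, df2 = 16, lower.tail = FALSE)   # p-value, approximately 0.05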



How is this F-statistic related to Σ̂(β̂) = s2 (X t X)−1 ? Let bq = (β̂p−q , . . . , β̂p−1 )t
be the vector with the last q components of the LS fit β̂ on the full model, and
let Vqq represent the square submatrix consisting of the last q rows and columns
of (X t X)−1 . Then it can be shown that

SSEp−q − SSEp = bq^t Vqq^−1 bq.
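This identity is easy to verify numerically in R: vcov() returns s^2 (X^t X)^−1, so dividing by s^2 recovers the required submatrix. A sketch, assuming a fitted full model `fit` whose last q coefficients are tested (the variable names in the update() call are hypothetical):

q   <- 2
b   <- coef(fit)
bq  <- b[(length(b) - q + 1):length(b)]
Vqq <- vcov(fit)[names(bq), names(bq)] / summary(fit)$sigma^2   # submatrix of (X'X)^{-1}
drop(t(bq) %*% solve(Vqq) %*% bq)                               # equals SSE_{p-q} - SSE_p
# compare with: deviance(update(fit, . ~ . - X2 - X3)) - deviance(fit)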

3.3 The overall F-test


The overall F-test is used to test whether there is a regression relation between
the response variable Y and the set of X-variables X1 , . . . , Xp−1 :

H0 : β1 = β2 = . . . = βp−1 = 0
H1 : not all βj equal zero

The test statistic is derived from the partial F-test (3.6) with q = p − 1:

F = MSR / MSE ∼H0 Fp−1,n−p    (3.7)

From (2.28) and (3.7) it can easily be derived that this F-statistic is equivalent
to:
F = [ R^2/(p − 1) ] / [ (1 − R^2)/(n − p) ].
Its value is usually reported in a statistical package:

F-statistic: 82.75 on 2 and 57 DF, p-value: < 2.2e-16
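The equivalence with R^2 can be verified from the summary of a fit. A sketch for the fuel consumption model of Section 2.2.4, assuming the object Fuelfit is still available:

Fuelsum <- summary(Fuelfit)
R2 <- Fuelsum$r.squared
p  <- length(coef(Fuelfit))
n  <- nrow(model.frame(Fuelfit))
(R2/(p - 1)) / ((1 - R2)/(n - p))   # reproduces the reported F-statistic
Fuelsum$fstatistic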

3.4 Test for all parameters


A (1 − α)100% joint confidence region for the unknown β ∈ Rp is given by an
ellipsoid with center β̂ LS :

Eα = { x ∈ Rp : (x − β̂LS)^t (X^t X)(x − β̂LS) / (p s^2) ≤ Fp,n−p,α }.

This ellipsoid can be used to test hypotheses of the form:

H0 : β = β0
H1 : β ≠ β0



for some fixed vector β 0 ∈ Rp . If β 0 does not belong to Eα , we reject the H0
hypothesis at the α significance level.
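In R the test amounts to computing the quadratic form directly. A minimal sketch, assuming a fitted model `fit`, a hypothesized coefficient vector `beta0` (in the same order as coef(fit)) and a chosen level `alpha`:

X    <- model.matrix(fit)
p    <- length(coef(fit))
n    <- nrow(X)
s2   <- summary(fit)$sigma^2
d    <- coef(fit) - beta0
Fobs <- drop(t(d) %*% crossprod(X) %*% d) / (p * s2)
Fobs > qf(1 - alpha, p, n - p)   # TRUE: beta0 lies outside E_alpha, so reject H0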

Again, this procedure attains the correct significance level by employing the
covariance matrix of β̂. A drawback is that if the H0 hypothesis is rejected,
we can not directly deduce statements about the individual parameters. This is
why individual or simultaneous confidence intervals for β0j are still useful. The
geometric differences between the two types of tests are illustrated in Figure 5.1.

If the correlation between the parameter estimates is large, the ellipsoid will be
more elongated, and tests based on confidence intervals will be too conservative.
So, it is very important to look at the correlation of the regression parameters.



Example: Fuel data

summary(Fuelfit,correlation=TRUE)

The last part of the output now yields the correlation between the three param-
eter estimates for the fuel consumption data:

Correlation of Coefficients:
(Intercept) Weight
Weight -0.90
Disp. 0.47 -0.80

Clearly, the correlation between the coefficient estimates of Weight and Disp. is very
high in absolute value (−0.80), as in Figure 5.1. Hence, the large p-value for Disp. is not very informative.
This high correlation is caused by the high correlation between the regressors Weight and Disp. themselves:

cor(Weight,Disp.)

[1] 0.8032804

Here, cor(Weight, Disp.) = −cor(β̂1, β̂2), but this equality is not satisfied in
general.



3.5 A general linear hypothesis
All the above hypotheses belong to the class of linear hypotheses of the form:

H0 : Cβ = 0    (3.8)
H1 : Cβ ≠ 0

with C a (q × p) matrix with rank(C) = q ≤ p.

Example 1:

H0 : β1 = β2 , β3 = 0

is equivalent to (3.8) with

C = [ 0  1  −1  0  0  . . .  0 ]
    [ 0  0   0  1  0  . . .  0 ]

Example 2:
H0 : β1 = β2 = . . . = βp−1 = 0

is equivalent to (3.8) with C = (0 Ip−1 ).

Example 3:
H0 : βp−q = βp−q+1 = . . . = βp−1 = 0

is equivalent to (3.8) with C = (0q,p−q Iq ).

The linear hypothesis Cβ = 0 provides q independent equations, so q of the


βj ’s can be expressed in terms of the other p − q. Under the null hypothesis
we thus obtain a reduced model with only p − q parameters. Let SSEp−q again
denote the error sum of squares under this reduced model. Then we obtain the
same partial F-statistic (3.5).
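If the car package is available, its linearHypothesis() function carries out this partial F-test for an arbitrary matrix C. A sketch for Example 1, assuming a fitted model `fit` with p = 4 coefficients (intercept plus three slopes):

library(car)
C <- rbind(c(0, 1, -1, 0),
           c(0, 0,  0, 1))               # H0: beta1 = beta2 and beta3 = 0
linearHypothesis(fit, C, rhs = c(0, 0))  # reports the F-statistic of (3.5)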



3.6 Mean response and prediction

3.6.1 Inference about the mean response


At a fixed point x0 = (1, x01 , . . . , x0,p−1 )t , the (unknown) mean response is
denoted as E[Y0 |x0 ] = xt0 β. An unbiased estimator of the mean response is
given by
ŷ0 = xt0 β̂

with
Var(ŷ0 ) = xt0 Σ(β̂)x0

which is estimated by s2 xt0 (X t X)−1 x0 .


A (1 − α)100% confidence interval for the mean response E[Y0 | x0] is then given
by

ŷ0 ± tn−p,α/2 s √( x0^t (X^t X)^−1 x0 ).

3.6.2 Inference about the unknown response


Consider now a new point x0 = (1, x01 , . . . , x0,p−1 )t which does not belong to
the data set. A confidence interval for the unknown response

y0 = xt0 β + ε0

is constructed as follows. Consider the random variable ŷ0 −y0 = xt0 β̂−xt0 β−ε0 .
It holds that

E[ŷ0 − y0 ] = xt0 E[β̂] − xt0 β = 0


Var[ŷ0 − y0 ] = σ 2 xt0 (X t X)−1 x0 + σ 2

because β̂ and ε0 are independent and ε0 ∼ N (0, σ 2 ).

A (1 − α)100% prediction interval for the unknown response y0 is then given by

ŷ0 ± tn−p,α/2 s √( x0^t (X^t X)^−1 x0 + 1 ).

This interval is larger than the confidence interval for the mean response be-
cause it also includes the uncertainty given by ε0 .
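In R both intervals are obtained with predict(). A sketch for the fuel consumption model of Section 2.2.4; the values of the hypothetical new car are made up:

newcar <- data.frame(Weight = 2800, Disp. = 150)
predict(Fuelfit, newdata = newcar, interval = "confidence", level = 0.95)
predict(Fuelfit, newdata = newcar, interval = "prediction", level = 0.95)

The prediction interval is always the wider of the two.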



Example.
Figure 3.6.2 contains the 95% confidence intervals for the average weight of 10
people based on their length (dotted curves). This is however not so interesting,
as we are mainly interested in the prediction of someone's weight based on his
or her length. For instance, given an individual with length x = 170 cm, what can
we say about his/her weight? The prediction interval (dashed curve) is wider than
the confidence interval because it takes into account two sources of variability:
the variability of the fitted line, and the variability of the observation around
the regression line.

We notice that the confidence and the prediction intervals become larger as we
move away from the mean of the data. This illustrates how dangerous it is to
draw conclusions about an observation with x-values outside the range of the
observed xi values. This is called extrapolation.



3.7 Residual plots
Residual plots are very helpful to check the validity of our model assumptions.
If our model is correct, we can consider the residuals ei as the observed errors.
These residuals should exhibit tendencies which confirm the assumptions we
have made, or at least do not exhibit a denial of the assumptions. Residual
plots will be helpful to check for

1. non-normality

2. time effects (correlation)

3. nonconstant variance (and transformations of Y )

4. curvature (and transformations of the regressors)

5. outlier detection ...

Normal quantile plot


We usually assume condition (3.1) which a.o. says that the errors are normally
distributed with zero mean. Since the least squares residuals always have zero
mean, remember (2.18), we do not have to check for it! Normality can be
verified using a normal quantile plot. One can make such a qq-plot of the
residuals themselves, but they have different standard errors, hence different
distributions. Therefore, we prefer to make a normal quantile plot of the stan-
dardized residuals. They are defined by dividing each residual by its standard
error (2.22):
ei^(s) = ei / ( s √(1 − hii) )    (3.9)

with hii the ith diagonal element of the hat matrix H. Note that the normality
assumption can also be assessed formally through the Shapiro-Wilk statistic or
the Kolmogorov-Smirnov test.

Plot of the residuals versus their index


This plot is useful if the index i has a physical interpretation, such as ‘time’.
When we make a plot of the ei versus i, we do not expect to see any pattern
as the errors should be uncorrelated. An example of positively autocorrelated
error terms is presented in Figure 12.1.

Moreover we can check whether the variance is constant over time (known as
homoscedasticity). A pattern as in Figure 2.6 (1) gives evidence that the vari-
ance is increasing (heteroscedasticity). Patterns (2) and (3) show that the linear
model should be refined by adding first-order or second order terms.

Plot of the residuals versus f itted values


If the linear model is correct, it follows from equations (2.18) and (2.20) that
the ei are uncorrelated with the ŷi . A non-horizontal or curved band in a (ŷi , ei )
plot thus shows that the linear model is not appropriate. Sometimes the vari-
ance of the errors increases with the level of the dependent variable. This is
again visible when we observe a funnel.

Plot of the residuals versus independent variables


From equation (2.19) we can deduce that each independent variable Xj is un-
correlated with the residuals ei . Therefore, (xij , ei ) plots (for j = 1, . . . , p − 1)
again can indicate that the regression fit is defective.



Plot of the standardized residuals
Diagnostics to detect influential observations and outliers will be discussed in
Chapter 9. Now, we can already make a plot of the standardized residuals (3.9).
If they are a good approximation of the true errors, they should be approxi-
mately gaussian distributed. Hence observations whose absolute standardized
residual is larger than, say 2.5, can be pinpointed as outliers.

In R the standardized residuals can be retrieved from the lm.influence func-


tion, or using the library MASS:

Fuelinf <- lm.influence(Fuelfit)

e <- residuals(Fuelfit)
h <- Fuelinf$hat                      # diagonal elements of the hat matrix
Fuelsum <- summary(Fuelfit)           # summary of the fit (as in Section 2.2.4)
s <- Fuelsum$sigma
es <- e/(s*(1-h)^.5)                  # standardized residuals (3.9)
library(MASS)
es2 <- stdres(Fuelfit)                # the same standardized residuals via MASS
qqnorm(es,ylab="Standardized residuals")
qqline(es)
plot(e,xlab="Index",ylab="Residuals")
plot(Fuelfit) # first graph yields residuals versus fitted values
plot(es,xlab="Index",ylab="Standardized Residuals")
abline(h=-2.5,lty=2)
abline(h=2.5,lty=2)
plot(Weight,e,ylab="Residuals")



abline(h=0,lty=3)
plot(Disp.,e,ylab="Residuals")
abline(h=0,lty=3)

This yields the following plots.

[Plots: normal Q–Q plot of the standardized residuals; residuals versus index; residuals versus fitted values for lm(Fuel ~ Weight + Disp.); standardized residuals versus index with bounds at ±2.5; residuals versus Weight; residuals versus Disp.]



Chapter 4

Polynomial regression

A polynomial regression model is used when the response function is a polyno-


mial function, or when a polynomial function is a good approximation to the
true unknown function.

4.1 One predictor variable


Example: one predictor - quadratic model

yi = β0 + β1 xi1 + β2 xi1^2 + εi    (4.1)
   = β0 + β1 xi1 + β11 xi1^2 + εi.    (4.2)

Note that the β2 coefficient in (4.1) is often denoted as β11 to indicate that it
is the parameter related to X1^2. The response function of this quadratic model
is a parabola as in Figure 7.4:

E[Y | X1] = β0 + β1 X1 + β11 X1^2.

Polynomial models may provide good fits, but one should be careful with ex-
trapolation! Figure 7.4 shows why extrapolation is dangerous: beyond x = 2
the curve is descending (a) or increasing (b), and this might not be appropriate.

Often, xi and xi^2 will be highly correlated. Example: generate xi ∼ N(5, 4) for
i = 1, . . . , 40. Then cor(xi, xi^2) = 0.97. But cor(xi − x̄, (xi − x̄)^2) = −0.09.
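This is easy to reproduce in R (a small sketch; the exact correlations depend on the simulated sample):

set.seed(1)
x  <- rnorm(40, mean = 5, sd = 2)   # N(5, 4): variance 4, standard deviation 2
cor(x, x^2)                         # typically very close to 1
xc <- x - mean(x)
cor(xc, xc^2)                       # much closer to 0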

To avoid multicollinearity, it is thus advisable to center the regressors. We then
obtain the model

yi = β′0 + β′1 (xi1 − x̄1) + β′11 (xi1 − x̄1)^2 + εi    (4.3)
   = β′0 + β′1 x′i1 + β′11 (x′i1)^2 + εi    (4.4)

if we let x′i1 = xi1 − x̄1. To avoid carrying both sets of parameters (β and β′), we will
assume in this chapter that the regressors xi are centered, and we will denote
the corresponding regression parameters with β.

The centering of the regressors also protects against round-off errors when we
solve the normal equations. In general, these involve a.o. the cross-products
Σi xij xik and the sums of squares Σi xij^2. Here, this implies terms of the form
Σi xij · xij = Σi xij^2, Σi xij · xij^2 = Σi xij^3 and Σi xij^2 · xij^2 = Σi xij^4.

We can extend the quadratic model (4.1) by including a term in X1^3:

yi = β0 + β1 xi1 + β11 xi1^2 + β111 xi1^3 + εi.

This is called a cubic regression model, with a response function shown in Figure
7.5.

Higher-order terms in xi are rarely used because the interpretation of the coefficients
becomes difficult and the model may be erratic for interpolations and small



extrapolations. If we include too many terms in the regression model, the fit
is probably good for the actual sample, but not necessarily for another sample
from the same population. We could e.g. postulate a polynomial model of order
n − 1. Because this model contains n parameters, it would yield an exact fit (if
all the xi are distinct), and thus SSE = 0, or R2 = 1.

4.2 Several regressors and interaction terms


Polynomial models may involve several regressors. In that case, also an interac-
tion term may be appropriate. The following is a second-order regression model
with two predictor variables:

yi = β0 + β1 xi1 + β2 xi2 + β11 xi1^2 + β22 xi2^2 + β12 xi1 xi2 + εi.    (4.5)

Figure 7.6 shows an example of such a quadratic response surface. This model
can easily be extended to a second-order model with three predictor variables.
Then (at most) three interaction terms can be included: β12 xi1 xi2 , β13 xi1 xi3
and β23 xi2 xi3 .

When an interaction term is present as in model (4.5), the interpretation of the


regression coefficients that belong to the first-order terms is different from their
interpretation at the first-order regression model. Assume e.g. that we have
a model with two regressors with (only) linear effects on the response and an
interaction term:

E[Y |(xi1 , xi2 )] = β0 + β1 xi1 + β2 xi2 + β12 xi1 xi2 .



If we hold X2 constant and increase X1 with one unit,

E[Y |(xi1 + 1, xi2 )] = β0 + β1 (xi1 + 1) + β2 xi2 + β12 (xi1 + 1)xi2 .

Consequently,

E[Y |(xi1 + 1, xi2 )] − E[Y |(xi1 , xi2 )] = β1 + β12 xi2 .

Hence, the effect of X1 for a given level of X2 depends on the level of X2 , and
vice versa. Figure 7.10 illustrates the effect on the response function when an
interaction term is present in the model. Picture (a) shows the response function
without an interaction term. Here,

E[Y |X1 , X2 ] = 10 + 2X1 + 5X2 ,

hence the mean response increases with β2 × (3 − 1) = 5 × 2 = 10 when we


increase X2 from 1 to 3. On the next graph, the response function is:

E[Y |X1 , X2 ] = 10 + 2X1 + 5X2 + 0.5X1 X2 .

We now see that the slope of the response function at X2 = 3 is larger than the
slope at X2 = 1. When both β1 and β2 are positive, we say that the interaction
is of reinforcement or synergistic type when β12 is positive. A negative β12 ,
as in Figure 7.10(c), yields an interaction effect of interference or antagonistic
type. A three-dimensional representation of these response functions is given in
Figure 7.11.

4.3 Estimation and inference
Because the polynomial regression models are special cases of the general linear
model (2.1), the parameters can be estimated with the least squares estimator.
To check whether the higher-order terms are significant, we should perform a
partial F-test as described in Chapter 3, formula (3.5).

When a polynomial term of a given order is retained, then all related terms of
lower order are usually also retained in the model. This can be seen as follows.
Consider the quadratic model (4.1). Geometrically, the fitted parabola attains
its minimum at x = −β̂1/(2β̂11) if β̂11 is positive. If we drop β1 from the model, we
require the fitted parabola to attain its minimum at x = 0. Even if β1 is not


statistically significant, omitting it from the model will often produce distortion
in the refit. If β11 on the other hand is not significant, it means that curvature,
if any, is very small. Then, a linear fit might still produce a good fit. This hi-
erarchical approach also implies that, when we retain an interaction term, also
the first-order terms of the predictors should be retained.

Often we want to express the final model in terms of the original (non-centered)
observations. This can be easily done. For the quadratic model e.g. we have for
the original variables, see (4.1):

ŷi = β̂0 + β̂1 xi1 + β̂11 xi1^2

and for the centered regressors, see (4.3):

ŷi = β̂′0 + β̂′1 (xi1 − x̄1) + β̂′11 (xi1 − x̄1)^2.

Combining these two equations gives:

β̂0 = β̂′0 − β̂′1 x̄1 + β̂′11 x̄1^2
β̂1 = β̂′1 − 2β̂′11 x̄1
β̂11 = β̂′11.

4.4 Example
We consider the Power Cells data set, presented in Table 7.9. The response
variable (Y ) is the life of a power cell, measured in terms of the number of



discharge-charge cycles that a power cell underwent before it failed. The regres-
sors are the charge rate (X1 ), measured in amperes, and the ambient tempera-
ture (X2 ) measured in Celcius degrees.

The regressors were first centered and standardized in order to obtain the coded
values -1, 0 and 1 for each regressor. Note that this was possible because in this
experiment both regressors were controlled at three levels. Doing so, the corre-
lation between X1 and X12 was reduced from 0.991 to zero, and the correlation
between X2 and X22 from 0.986 to zero.

The researcher decided to fit the second-order polynomial model (4.5). The
output from R:

Powerfit <- lm(power ~ charge * temp + I(charge^2) + I(temp^2))


summary(Powerfit)

Call:
lm(formula = power ~ charge * temp + I(charge^2) + I(temp^2))

Residuals:



1 2 3 4 5 6 7 8
-21.465 9.263 12.202 41.930 -5.842 -31.842 21.158 -25.404
9 10 11
-20.465 7.263 13.202

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 162.84 16.61 9.805 0.000188 ***
charge -55.83 13.22 -4.224 0.008292 **
temp 75.50 13.22 5.712 0.002297 **
I(charge^2) 27.39 20.34 1.347 0.235856
I(temp^2) -10.61 20.34 -0.521 0.624352
charge:temp 11.50 16.19 0.710 0.509184
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 32.37 on 5 degrees of freedom


Multiple R-squared: 0.9135,Adjusted R-squared: 0.8271
F-statistic: 10.57 on 5 and 5 DF, p-value: 0.01086

summary(aov(Powerfit))

Df Sum Sq Mean Sq F value Pr(>F)


charge 1 18704 18704 17.846 0.00829 **
temp 1 34202 34202 32.632 0.00230 **
I(charge^2) 1 1646 1646 1.570 0.26555
I(temp^2) 1 285 285 0.272 0.62435
charge:temp 1 529 529 0.505 0.50918
Residuals 5 5240 1048
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Now, let us test whether the quadratic terms and the interaction term are
significant:
H0 : β11 = β22 = β12 = 0.
The partial F-statistic is F = [ (1645.97 + 284.93 + 529.00)/3 ] / 1048.09 = 0.78. As F3,5,0.05 = 5.41,
we do not reject the H0 hypothesis and we can thus simplify the model.
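The same partial F-test can be obtained in R by comparing the reduced and the full model with anova(); a sketch, using the variables of the Power Cells fit above:

Powerred <- lm(power ~ charge + temp)   # reduced model: first-order terms only
anova(Powerred, Powerfit)               # F of about 0.78 on 3 and 5 degrees of freedom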

4.5 Detecting curvature

4.5.1 Residual plots


To find out that a quadratic or cubic term in Xj is appropriate, it is useful to
make plots of the residuals versus the fitted values, and versus each regressor,
as we have seen in Chapter 3, Section 3.7.

Assume for instance that the true model is quadratic in X1 :

yi = β0 + β1 xi1 + β11 x2i1 + εi

but we fit only a linear model:

ŷi = β̂0 + β̂1 xi1 .

Then

ei = yi − ŷi
   = (β0 − β̂0) + (β1 − β̂1) xi1 + β11 xi1^2 + εi
   = γ0 + γ1 (β̂0 + β̂1 xi1) + γ2 (β̂0 + β̂1 xi1)^2 + εi

for certain values of γ0 , γ1 and γ2 . This shows that both a (xi1 , ei ) and a (ŷi , ei )
plot will show a quadratic curve.

4.5.2 Partial residual plots


Another useful graphical tool is the partial residual plot. For each regressor
Xj it is constructed as follows:

1. First, consider the linear model:

yi = β0 + β1 xi1 + β2 xi2 + . . . + βp−1 xi,p−1 + εi

and compute the corresponding least squares estimate β̂.

2. Next, compute the partial residuals

   e′i = yi − (ŷi − β̂j xij)
       = ei + β̂j xij
       = yi − β̂0 − β̂1 xi1 − . . . − β̂j−1 xi,j−1 − β̂j+1 xi,j+1 − . . . − β̂p−1 xi,p−1.

3. Finally, plot the partial residuals e′i versus xij.



Such a partial residual plot in Xj will often show the relationship between Y
and Xj after the effects of the other predictors have been removed. This can be
seen as follows: ŷi − β̂j xij is an estimate of the ith response using all regressors
except Xj . Such an estimate can also be found by eliminating Xj from the lin-
ear model, but then the resulting estimates β̃0 , β̃1 , . . . , β̃p−1 might differ more
from β̂ than using this approach. Consequently, yi − (ŷi − β̂j xij ) expresses the
information in Y which is left after the effects of all predictors but Xj have been
eliminated.

Example: assume e.g. that the true model includes a quadratic term in Xj, thus

yi = β0 + β1 xi1 + . . . + βj xij + βjj xij^2 + . . . + εi.

We first approximate the true response curve with only a linear term in Xj. If
then the other estimated coefficients β̂k for k ≠ j are close to the true βk,

e′i = yi − ŷi + β̂j xij
    ≈ βj xij + βjj xij^2 + εi

and a (xij, e′i) plot will show curvature.

This reasoning also holds for any other transformation of Xj , e.g.

yi = β0 + β1 xi1 + . . . + βj g(xij ) + . . . + εi

as long as β̂k is close to βk for all k ≠ j if we fit the full linear model.

One useful property of partial residuals is that the LS regression of e′i on xij
yields a slope β̂j (which is the parameter estimate for Xj of the full model) and
a zero intercept. Thus partial residual plots also provide information on the
direction and magnitude of the linearity as well as the nonlinearity of Xj. Plots
of the residuals ei versus xij on the other hand have a zero slope and intercept,
due to (2.18) and (2.19), so they only show deviations from linearity.
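A partial residual plot is easily constructed by hand in R; base R also provides it via termplot(fit, partial.resid = TRUE). A sketch for a generic regressor, where the name "xj" and the model object `fit` are placeholders:

xj     <- model.frame(fit)[, "xj"]
e.part <- residuals(fit) + coef(fit)["xj"] * xj   # partial residuals e'_i
plot(xj, e.part, xlab = "xj", ylab = "Partial residuals")
abline(lm(e.part ~ xj))   # this LS line has slope coef(fit)["xj"] and zero intercept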

In Figure 12.6 we see the three partial residual plots of a regression model with
three predictors, including the LS slope estimate. Figure 12.6(a) slightly sug-
gests a cubic regression in the variable Education, Figure 12.6(c) a quadratic
term in Percentage of Women, whereas the curve in Figure 12.6(b) shows that
a logarithmic transformation of Income could probably improve the fit.

Chapter 5

Categorical predictors

Contrary to continuous or quantitative regressors, categorical or qualitative pre-


dictor variables only take on a finite number of values. Typical examples are
gender (male/female), age (child, adult, senior), profession, type of production
machine, time period (1941-1960, 1961-1980, 1981-2000), ...

5.1 One dichotomous predictor variable

5.1.1 Constructing the model

In the simplest case, we have one continuous regressor and one categorical pre-
dictor that takes on only two different values.

Example: Insurance Innovation data set.


An economist wished to relate the speed with which a particular insurance
innovation is adopted to the size of the insurance firm and the type of firm.
The variables in this data set are:

Y : the number of months elapsed between the time the first firm adopted
the innovation and the time the given firm adopted the innovation
X1 : the size of the firm, measured by the amount of total assets, in million $
T2 : the type of the firm: stock company or mutual company

firm

Months Size Type


1 17 151 Mutual
2 26 92 Mutual
3 21 175 Mutual
4 30 31 Mutual
5 22 104 Mutual
6 0 277 Mutual
7 12 210 Mutual
8 19 120 Mutual
9 4 290 Mutual
10 16 238 Mutual
11 28 164 Stock
12 15 272 Stock
13 11 295 Stock
14 38 68 Stock
15 31 85 Stock
16 21 224 Stock
17 20 166 Stock
18 13 305 Stock
19 30 124 Stock
20 14 246 Stock

To include the type of firm in a regression model, this dichotomous variable is


usually converted into a binary variable:

X2^(1) = 1 if firm i is a stock company, and X2^(1) = 0 if firm i is a mutual company.    (5.1)

The regression model then becomes

yi = β0 + β1 xi1 + β2 xi2^(1) + εi.    (5.2)

The interpretation of the regression coefficients is as follows:

• for a stock company, we have

E[Y |X1 ] = β0 + β1 X1 + β2
= (β0 + β2 ) + β1 X1



which is a straight line with slope β1 and intercept β0 + β2 .

• for a mutual company, we obtain

E[Y |X1 ] = β0 + β1 X1 + 0
= β0 + β1 X1

which is a straight line with the same slope β1 and intercept β0 .

In the (X1, Y) space, both lines are thus parallel. The parameter β0 is the ex-
pected value for mutual companies at x1 = 0, and β2 indicates how much higher
the elapsed time is for stock firms than for mutual firms, for any given size of
firm; see Figure 11.1.

In general, β2 shows how much higher or lower the mean response line is for the
class coded 1 than the line for the class coded 0, for any given level of X1 . The



class coded 0 thus serves as the reference group.

Other coding schemes could be used as well, e.g.



X2^(2) = 0 if firm i is a stock company, and X2^(2) = 1 if firm i is a mutual company,

or

X2^(3) = 1 if firm i is a stock company, and X2^(3) = −1 if firm i is a mutual company.

Then of course the interpretation of the regression coefficients changes! Consider
e.g. the model

yi = β0 + β1 xi1 + β2 xi2^(3) + εi.    (5.3)

Then stock firms have

E[Y |X1 ] = (β0 + β2 ) + β1 X1

whereas mutual firms have

E[Y |X1 ] = (β0 − β2 ) + β1 X1 .

Now the difference between the expected time of a stock firm and the expected
time of a mutual firm at a given x1 is expressed by 2β2 ! The reference line
y = β0 + β1 x1 then lies in between the other two response functions.

Note that we only need one binary variable X2 although T2 has two levels. If
we would e.g. define

1

if firm i is a stock company
X2 =
0

if firm i is a mutual company

and 
0

if firm i is a stock company
X3 =
1

if firm i is a mutual company

then our design matrix would not have full rank, because xi2 + xi3 = 1 which is
the intercept term. Consequently the LS estimator would not be unique.



5.1.2 Estimation and inference
The parameters in models (5.2) or (5.3) can be estimated as before, because
these models belong to the class of general linear models (2.1). Which binary
variable is used depends mainly on the tests we want to apply afterwards, or
the confidence intervals we want to construct.

Example.
Assume that the economist is most interested in the effect of type of firm (T2)
on the elapsed time and wished to obtain a 95% confidence interval for the
mean increase of the time of stock firms compared to mutual firms. Then it is
recommended to work with the binary variable X2^(1). In R we obtain

attach(firm)
mst <- lm(Months ~ Size + Type)
coefficients(summary(mst))

Estimate Std. Error t value Pr(>|t|)


(Intercept) 33.8740690 1.813858297 18.675146 9.145269e-13
Size -0.1017421 0.008891218 -11.442990 2.074687e-09
TypeStock 8.0554692 1.459105700 5.520826 3.741874e-05

We find that the fitted model is

ŷi = 33.874 − 0.102 xi1 + 8.055 xi2^(1)

and that a 95% confidence interval for β2 is given by

8.055 ± t17,0.025 · 1.459 = [4.98, 11.13].

So, with 95% confidence, we conclude that on the average, stock companies tend
to adopt the innovation between 5 and 11 months later than mutual companies,
for any given size of firm. A scatter plot of the data and the two regression
lines are shown in Figure 5.1. This is the default coding scheme in R.
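The same interval also follows directly from the fitted object; a one-line sketch:

confint(mst, "TypeStock", level = 0.95)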



Figure 5.1: The Insurance Innovation data set, with the two regression lines
superimposed. [Scatter plot of Time elapsed versus Size.]

The coding scheme can be checked by examining the design matrix

model.matrix(mst)

(Intercept) Size TypeStock


1 1 151 0
2 1 92 0
3 1 175 0
...
18 1 305 1
19 1 124 1
20 1 246 1
attr(,"assign")
[1] 0 1 2
attr(,"contrasts")
attr(,"contrasts")$Type
[1] "contr.treatment"



To work with X2^(3) we set

options(contrasts = c("contr.helmert","contr.poly"))
mst2 <- lm(Months ~ Size + Type)
summary(mst2)

which yields

Call:
lm(formula = Months ~ Size + Type)

Residuals:
Min 1Q Median 3Q Max
-5.6915 -1.7036 -0.4385 1.9210 6.3406

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.901804 1.770041 21.413 9.78e-14 ***
Size -0.101742 0.008891 -11.443 2.07e-09 ***
Type1 4.027735 0.729553 5.521 3.74e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.221 on 17 degrees of freedom


Multiple R-squared: 0.8951,Adjusted R-squared: 0.8827
F-statistic: 72.5 on 2 and 17 DF, p-value: 4.765e-09

The design matrix now is

model.matrix(mst2)

(Intercept) Size Type1


1 1 151 -1
2 1 92 -1
3 1 175 -1
...
18 1 305 1



19 1 124 1
20 1 246 1
attr(,"assign")
[1] 0 1 2
attr(,"contrasts")
attr(,"contrasts")$Type
[1] "contr.helmert"

Note that the fitted values and consequently also the residuals are the same
whether we use X2^(1) or X2^(3). Hence, the MSE and the overall F-test yield the
same results.

5.1.3 Adding interaction terms


So far, we have assumed that the regression lines of both classes are parallel.
If this assumption is not plausible, we can extend model (5.2) by adding an
interaction term. This yields the model

yi = β0 + β1 xi1 + β2 xi2 + β3 xi1 xi2 + εi (5.4)

with X2 = X2^(1) as defined in (5.1). The meaning of the regression coefficients
now becomes:

• for stock firms:

E[Y |X1 ] = β0 + β1 X1 + β2 + β3 X1
= (β0 + β2 ) + (β1 + β3 )X1

which is a line with intercept β0 + β2 and slope β1 + β3 .

• for mutual firms:

E[Y |X1 ] = β0 + β1 X1

which is a line with intercept β0 and slope β1 (see Figure 11.3).

The hypothesis test


H0 : β3 = 0    H1 : β3 ≠ 0

is performed to see whether the interaction term is significant, or equivalently


whether the two regression lines are parallel or not.



Example.
If we include an interaction term in the Insurance Innovation regression and use
the default dummy encoding in R, then the result of the analysis is:

mst3 <- lm(Months ~ Size * Type)


summary(mst3)

Call:
lm(formula = Months ~ Size * Type)

Residuals:
Min 1Q Median 3Q Max
-5.7144 -1.7064 -0.4557 1.9311 6.3259

Coefficients:
Estimate Std. Error t value Pr(>|t|)



(Intercept) 33.8383695 2.4406498 13.864 2.47e-10 ***
Size -0.1015306 0.0130525 -7.779 7.97e-07 ***
TypeStock 8.1312501 3.6540517 2.225 0.0408 *
Size:TypeStock -0.0004171 0.0183312 -0.023 0.9821
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.32 on 16 degrees of freedom


Multiple R-squared: 0.8951,Adjusted R-squared: 0.8754
F-statistic: 45.49 on 3 and 16 DF, p-value: 4.675e-08

The large p-value for β3 confirms that the simplified regression model (5.2)
without interaction term is appropriate for this data set.

Remark:

1. The model with the interaction term (5.4) is almost identical to the model
in which we assume a separate regression line for both groups that are
defined by the dichotomous predictor variable. The only difference is
that model (5.4) assumes that the data points of both classes show the
same variability around their regression line. Doing so, tests about the
equality of the slopes and the intercepts become very easy to apply.

2. The principle of marginality specifies that a model including a high-order


term, such as an interaction, should normally also include the lower-order
relatives of that term (the main effects that compose the interaction).

Suppose that we fit model (5.4) and conclude that β2 = 0 but β3 ≠ 0,


which would result in the model

yi = β0 + β1 xi1 + β3 xi1 xi2 + εi .

This model describes regression lines that have the same intercept but
different slopes which is a peculiar specification (generally) of no substan-
tive interest. Similarly, the model which retains β3 but removes β1 :

yi = β0 + β2 xi2 + β3 xi1 xi2 + εi

has a zero slope for the class coded X2 = 0 which is usually too restrictive.



5.2 Extensions
There are two straightforward extensions of models (5.2) and (5.4). The first
one includes a predictor variable with more than two classes, the second one
includes more than one categorical variable.

5.2.1 One polytomous predictor variable


If a qualitative predictor variable T2 has more than two levels, we need additional
indicator variables in the regression model. Assume e.g. that T2 indicates a tool
model which can take on four different values: M1, M2, M3 and M4. Then we
need three binary variables:

1

if T2 = M 1
X2 =
0 if T2 6= M 1


1

if T2 = M 2
X3 =
0 if T2 6= M 2


1

if T2 = M 3
X4 =
0 if T2 6= M 3

With Y = tool wear, and X1 = tool speed, a first-order regression model is:

yi = β0 + β1 xi1 + β2 xi2 + β3 xi3 + β4 xi4 + εi . (5.5)

The response functions are again lines with the same slope β1 for all values of
T2 , see Figure 11.5. The coefficients β2 , β3 and β4 indicate how much higher
(lower) the response functions are for tool models M1, M2 and M3 than for tool
model M4, for any given level of tool speed.

A hypothesis test of the form H0 : βj = 0 for j = 2, . . . , 4 is thus concerned
with the differential effect of class j compared with the reference class M4, for
which X2 = X3 = X4 = 0. If we want to compare the intercepts of e.g. class
M3 with M2, we use the point estimator β̂4 − β̂3 with variance

Var(β̂4 − β̂3 ) = Var(β̂4 ) + Var(β̂3 ) − 2cov(β̂4 , β̂3 )

which can be estimated from Σ̂(β̂).
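In R this is a matter of extracting the relevant entries of vcov(). A sketch, assuming a fitted model `toolfit` in which the dummy coefficients are named "M3" and "M2" (these names are hypothetical and depend on the chosen contrasts):

V   <- vcov(toolfit)
est <- coef(toolfit)["M3"] - coef(toolfit)["M2"]             # estimate of beta4 - beta3
se  <- sqrt(V["M3", "M3"] + V["M2", "M2"] - 2 * V["M3", "M2"])
est / se                                                     # t-statistic for H0: beta4 = beta3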

If interaction effects are present, the regression model (5.5) becomes:

yi = β0 + β1 xi1 + β2 xi2 + β3 xi3 + β4 xi4 + β5 xi1 xi2 + β6 xi1 xi3 + β7 xi1 xi4 + εi .

This model again implies that each tool model has its own regression line, with
different intercepts and slopes for the different tool models.



5.2.2 More than one categorical variable
If several predictor variables are qualitative, they should all be converted into
indicator variables.
Example: one quantitative and two dichotomous qualitative regressors. The
first-order model becomes:

yi = β0 + β1 xi1 + β2 xi2 + β3 xi3 + εi .

The response functions can be summarized as:

X2 = 0 X2 = 1
X3 = 0 β0 + β1 X1 (β0 + β2 ) + β1 X1
X3 = 1 (β0 + β3 ) + β1 X1 (β0 + β2 + β3 ) + β1 X1

With interactions between each pair of the predictor variables:

yi = β0 + β1 xi1 + β2 xi2 + β3 xi3 + β4 xi1 xi2 + β5 xi1 xi3 + β6 xi2 xi3 + εi

with response functions

X2 = 0 X2 = 1
X3 = 0 β0 + β1 X1 (β0 + β2 ) + (β1 + β4 )X1
X3 = 1 (β0 + β3 ) + (β1 + β5 )X1 (β0 + β2 + β3 + β6 ) + (β1 + β4 + β5 )X1

Remarks.

• If all the explanatory variables are qualitative, the models are called anal-
ysis of variance models.

• If the model contains qualitative and quantitative regressors, but the main
variables of interest are the qualitative ones, it is called an analysis of
covariance model.

5.3 Piecewise linear regression
Indicator variables can also be used when the regression of Y on X follows a cer-
tain linear relation in some range of X but follows a different relation elsewhere.

Example: lot size data set.


The unit cost of a lot depends linearly on the lot size, but the slope changes
once the lot size exceeds 500 (e.g. because the unit price of some raw materials
decreases if larger amounts are purchased). The model, as illustrated in Figure
11.9, can be expressed as:

yi = β0 + β1 xi1 + β2 (xi1 − 500)xi2 + εi

where X1 is the lot size and



1

if Xi1 > 500
X2 =
0

otherwise.

For X1 ≤ 500 (so X2 = 0) we obtain the response function:

E[Y |X1 ] = β0 + β1 X1

whereas for X1 > 500 (so X2 = 1) we have

E[Y |X1 ] = β0 + β1 X1 + β2 (X1 − 500)


= (β0 − 500β2 ) + (β1 + β2 )X1 .



When the regression function not only changes its slope at some value Xp but
also makes a jump there, then we need an additional term. The response func-
tions in Figure 11.10 could be modelled as:

yi = β0 + β1 xi1 + β2 (xi1 − 40)xi2 + β3 xi2 + εi

with X2 = I(X1 > 40). Then for X1 ≤ 40 the response function becomes

E[Y |X1 ] = β0 + β1 X1

and for X1 > 40 we obtain

E[Y |X1 ] = β0 + β1 X1 + β2 (X1 − 40) + β3


= (β0 − 40β2 + β3 ) + (β1 + β2 )X1

so β2 represents the difference in the slopes, and β3 the difference in the mean
responses at Xp = 40.
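Both piecewise models are ordinary linear models and can be fitted with lm(). A sketch with hypothetical variable names `cost` (response) and `lotsize`:

fit.pw   <- lm(cost ~ lotsize + I(pmax(lotsize - 500, 0)))                    # slope change at 500
fit.jump <- lm(cost ~ lotsize + I(pmax(lotsize - 40, 0)) + I(lotsize > 40))   # slope change and jump at 40

Here pmax(x − c, 0) creates the term (X1 − c)X2, and the logical I(lotsize > 40) is coerced by lm() to the 0/1 variable X2.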

Chapter 6

Transformations

In regression both transformations of the regressors and of the response variable


are often useful to obtain a model that better fits the assumptions of the general
linear model. First we will study the effect of a transformation on some variable
X.

6.1 The family of power and root transforma-


tions
If X is strictly positive, a useful group of transformations is the family of power
and roots:
X → X^λ    (6.1)

If λ is negative, this transformation is an inverse power, e.g. X^−1 = 1/X and X^−2 = 1/X^2.
If λ is a fraction, the transformation represents a root, e.g. X^(1/3) = ∛X and X^(−1/2) = 1/√X.
A more convenient family of power transformations is defined as:

f(x) = x^(λ) = (x^λ − 1)/λ    for λ ≠ 0, x > 0
f(x) = x^(λ) = log(x)         for λ = 0, x > 0    (6.2)

Figure 4.1 shows several of those power transformations. Note that this family
of transformations is monotone increasing in X, whereas the simple form (6.1)
is decreasing if λ < 0. Moreover, for λ = 0, transformation (6.1) would be use-
less (X^0 = 1), whereas the logarithmic transformation is often very appropriate.
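The family (6.2) is straightforward to implement in R; a minimal sketch:

powtrans <- function(x, lambda) {
  stopifnot(all(x > 0))                              # (6.2) requires strictly positive values
  if (lambda == 0) log(x) else (x^lambda - 1)/lambda
}
powtrans(c(1, 2, 5, 10), lambda = 0.5)   # square-root-type transformation
powtrans(c(1, 2, 5, 10), lambda = 0)     # logarithmic transformation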

The effect of a power transformation is the following:

1. if λ < 1 large values of X are compressed, whereas small values are more
spread out.

2. if λ > 1 the inverse effect takes place: large values are more dispersed,
whereas small values are compressed.

The first property makes the power transformation interesting if the distribu-
tion of X is skewed to the right, the second if X is skewed to the left (which
occurs less in practice). Consider e.g. the distribution of income in Figure 4.2.
The non-parametric density estimate clearly shows a right-tailed distribution.
The log of income on the other hand is more symmetric, as illustrated in Figure
4.3.

Remarks:

• If X also contains negative values, we can first add a positive constant


to each observation to make them all strictly positive. Example: X =
{−3, −1, 4, 7, 9} then X + 4 = {1, 3, 7, 11, 13} > 0.

• The power transformation is not very effective when the ratio between the
largest and the smallest value is small. Example:
X : 2001 2002 2003 2004 2005
log(X) : 7.6014 7.6019 7.6024 7.6029 7.6034
Here, 2005/2001 = 1.002 ≈ 1. When we first subtract 2000 from the data,
we obtain a ratio of 5/1 = 5:
X − 2000 : 1 2 3 4 5
log(X − 2000) : 0 0.6931 1.0986 1.3863 1.6094
and then the logarithmic transformation has more effect.

• An adequate power transformation can often be found in the range −2.5 ≤ λ ≤ 3.
We usually select integer values, or simple fractions such as 1/2 or 1/3.



6.2 Transforming proportions
Power transformations are often not helpful for proportions, because these quan-
tities are bounded below by 0 and above by 1. Also many sorts of rates (e.g.
infant mortality rate per 1000 live births) are rescaled proportions. Or ”the
number of questions correct on an exam of fixed length” is essentially a propor-
tion.

Common transformations for proportions are:

1. The logit transformation:


logit(P) = log( P / (1 − P) ).

The logit transformation (Figure 4.15) removes the boundaries of the scale,
spreads out the tails of the distribution and makes the transformed vari-
able symmetric around zero. It is the inverse of the cumulative distribution
function of the logistic distribution and is essential in logistic regression.

2. The probit transformation uses the inverse of the standard normal distri-
bution:
probit(P ) = Φ−1 (P )

as is done in probit regression.


3. Also the arcsine-square-root transformation arcsin(√P) has a similar shape.
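All three transformations are available in base R; a quick sketch:

P <- c(0.05, 0.25, 0.50, 0.75, 0.95)
qlogis(P)        # logit: log(P/(1-P))
qnorm(P)         # probit: inverse standard normal cdf
asin(sqrt(P))    # arcsine square root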



6.3 Transformations in regression

6.3.1 Power transformation


When a normal quantile plot of the (standardized) residuals shows that they
are not normally distributed, a power transformation of the response variable
can sometimes correct this non-normality. A positive skew in the residuals e.g.
will often be corrected by a log or a square-root transformation.

A power transformation can also be used to make a nonlinear relationship more


nearly linear. Assume e.g. that Y ≈ 4X^2. An (x, y) plot will show a parabola.
If we transform Y into Y′ = √Y we obtain approximately a line, since then
Y′ ≈ 2X. The same happens when we transform X into X′ = X^2, yielding
Y ≈ 4X′. To select a good transformation, we can use Tukey and Mosteller’s
‘bulge rule’, illustrated in Figure 4.6. Note that a transformation of one or
more of the regressors is often preferred over a transformation of the response
variable. Transforming Y not only changes the shape of the error distribution
(which is often the primary goal of the transformation), but it also has an effect
on the relationship between Y and the X’s.

6.3.2 Box-Cox transformation
A more sophisticated approach to transform the response variable is the Box-
Cox transformation. The object of this transformation is to normalize the error
distribution, to stabilize the error variance and to straighten the relation be-
tween Y and X.

The general Box-Cox model assumes that a certain power transformation of Y


yields a general linear model with normal homoscedastic errors:

yi^(λ) = β0 + β1 xi1 + β2 xi2 + . . . + βp−1 xi,p−1 + εi

with εi i.i.d. N(0, σ^2), and yi^(λ) defined by (6.2). Note that all the yi must be
positive, otherwise a constant should first be added. For a particular choice of
λ, the maximized log-likelihood (profile log-likelihood) is

log L(λ) = const − (n/2) log σ̂^2(λ) + (λ − 1) Σi log yi

where σ̂^2(λ) = Σi ei^2(λ)/n and the ei(λ) are the least squares residuals from the
regression of y^(λ) on the X’s. To find the maximum-likelihood estimator λ̂ the
profile log-likelihood is evaluated over a range of λ values, say, between −2 and 2.

An approximate (1 − α)100% confidence interval for λ is given by the interval
of all λ that satisfy

2(log L(λ̂) − log L(λ)) ≤ χ^2(1, α).

This stems from the fact that the likelihood-ratio statistic G = −2(log L(λ) −
log L(λ̂)) is asymptotically distributed as a χ^2 distribution with 1 degree of freedom. Usually a plot of
the log-likelihood log L(λ) versus λ is made, with the 95% confidence interval
for λ indicated, as in Figure 12.8. Then a value of λ̂ is selected which belongs
to this confidence interval and which coincides with rounded numbers such as
-1.5, -1, -0.5, 0, 0.5, 1, . . . .
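In R the profile log-likelihood plot is produced by boxcox() in the MASS package. A sketch, assuming a fitted model `fit` with a strictly positive response:

library(MASS)
bc <- boxcox(fit, lambda = seq(-2, 2, by = 0.1))   # plots log L(lambda) with a 95% interval
bc$x[which.max(bc$y)]                              # maximum-likelihood estimate of lambda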

Remark that the Box-Cox methodology assumes that there exists a λ such
that on the transformed scale the assumptions of the general linear model are
fulfilled. It is thus important to verify (using residual plots and normal quantile
plots) whether the proposed transformation indeed improved the appropriate-
ness of the model assumptions.



6.4 Nonconstant variance

6.4.1 Detecting heteroscedasticity

The second Gauss-Markov condition (2.3) states that the error variance is ev-
erywhere the same around the regression surface. Nonconstant error variance
is called heteroscedasticity. In that case the least squares estimator is still un-
biased and consistent, but the variances of the parameter estimates tend to be
large and thus affect the tests of hypothesis substantially. Also s2 (X t X)−1 need
no longer be an unbiased estimate of the covariance matrix of β̂ LS .

Heteroscedasticity can sometimes be detected if we understand the underlying


situation. For example, if the responses are counts, then it is likely that the
response variable is approximately Poisson distributed, for which the variance
is equal to its expected value, i.e. Var[Y |xi ] = E[Y |xi ]. Hence it cannot be
expected that σi2 := Var[Y |xi ] will be constant. Or consider house prices or
income. Less expensive houses or low incomes usually show less variation than
more expensive houses or higher incomes.

Heteroscedasticity can often be seen through residual plots. If σi2 varies with
E[Y |xi ], then a plot of the residuals, which are estimates of εi , against the

fitted values ŷi , which are estimates of E[Y |xi ], might reveal that the residuals
are more spread out for some values of ŷi than for others. A typical example is
shown in Figure 6.2. Because the least-squares residuals have unequal variances
even in the homoscedastic case, it is preferable to use the standardized residuals.

Also plots of the absolute (standardized) residuals or the squared residuals versus
ŷi are often used. Since

σi^2 = E[εi^2] − (E[εi])^2 = E[εi^2]

we notice that the squared residual e2i is an estimator of the variance σi2 , and
that the absolute residual |ei | is an estimate of the standard deviation σi .

Sometimes the variance of the errors varies with one or more of the regressors
X. Therefore residual plots versus each of the independent variables might also
be appropriate.

Note that heteroscedasticity can also be due to the omission of an important


variable or interaction term from the model. This can however be very difficult
to detect.



6.4.2 Variance-stabilizing transformations
Variance-stabilizing transformations can be used when the variance of the re-
sponse variable depends on its expected value, e.g. when Yx := Y |xx has a
Poisson or an exponential distribution.

Assume that Yx has mean µx and variance σx2 and that g is a function of Yx
such that E[g(Yx )] can be well approximated with g(E[Yx ]) = g(µx ). The Taylor
expansion of g(Yx ) around µx gives:

g(Yx) ≈ g(µx) + (Yx − µx) g′(µx).

Consequently

Var(g(Yx)) = E[(g(Yx) − E[g(Yx)])^2] ≈ E[(g(Yx) − g(µx))^2]
           ≈ E[((Yx − µx) g′(µx))^2]
           = g′(µx)^2 E[(Yx − µx)^2]
           = g′(µx)^2 σx^2.

If we want to apply a transformation such that Var(g(Yx)) = c^2 for a certain
constant c, we thus have to find a function g such that g′(µx)^2 σx^2 = c^2, or

g(µx) = c ∫ dµx / σx.

Examples.

1. Assume that Yx is Poisson distributed with E(Yx) = λx = Var(Yx). Then

   g(λx) = ∫ dλx / √λx = 2√λx.

   Hence taking the square root of Y will lead to a more constant variance.

2. If the response is a proportion, E(Yx) = px and Var(Yx) ∼ px(1 − px), thus

   g(px) = ∫ dpx / √(px(1 − px)) = arcsin(√px)

   which gives us again the arcsine-square-root transformation.

6.4.3 Weighted least squares regression

Transformations in Y , such as the Box-Cox transformation, might be helpful


to reduce unequal error variances, but they have the disadvantage of changing
the relation between the response and the independent variables. If the linear
relationship seems appropriate, but one is left with unequal error variances, the
Weighted Least Squares procedure is recommended.

For this we consider the generalized linear regression model:

yi = β0 + β1 xi1 + β2 xi2 + . . . + βp−1 xi,p−1 + εi (6.3)

for i = 1, . . . , n, with
εi independent N (0, σi2 ).

Moreover we assume that the variances σi2 are known up to a constant of pro-
portionality:
σi2 = σ 2 /wi

with wi known weights, and σ 2 unknown. The ratio of two variances σj2 and σk2
is then indeed known, since σj2 /σk2 = wk /wj . Let W = diag(w1 , . . . , wn ) then

Σ(ε) = σ 2 W −1 . (6.4)

This generalized linear model (6.3) is equivalent to the general linear model:

√wi yi = β0 √wi + β1 √wi xi1 + β2 √wi xi2 + . . . + βp−1 √wi xi,p−1 + √wi εi

since the √wi εi are independent N(0, σ^2). Or
since wi εi are independent N (0, σ 2 ). Or

y^(W) = X^(W) β + ε^(W)

with y^(W) = W^(1/2) y, X^(W) = W^(1/2) X, ε^(W) = W^(1/2) ε, and Σ(ε^(W)) = W^(1/2) Σ(ε) W^(1/2) = σ^2 In.

The least-squares estimator thus becomes

β̂ WLS = ((X (W ) )t X (W ) )−1 (X (W ) )t y (W )


= (X t W 1/2 W 1/2 X)−1 X t W 1/2 W 1/2 y
= (X t W X)−1 X t W y



Since for any β, ŷ^(W)(β) = X^(W) β = W^(1/2) X β, we see that

β̂WLS = argminβ Σi (yi^(W) − ŷi^(W))^2
      = argminβ Σi (√wi yi − √wi xi^t β)^2
      = argminβ Σi wi (yi − xi^t β)^2.

We thus apply the ordinary least squares estimator (OLS) on the weighted vari-
ables, where observations with a large variance get a small weight and those
with a small variance a large weight (wi ∼ 1/σi2 ).

It can be shown that β̂ WLS is the BLUE estimator of β in the generalized linear
model (6.3) that satisfies (6.4). The variance-covariance matrix of β̂ WLS is given
by

Σ(β̂ WLS ) = (X t W X)−1 X t W σ 2 W −1 W t X(X t W X)−1


= σ 2 (X t W X)−1 .

The residual vector is

e(W ) = y (W ) − ŷ (W )
= W 1/2 y − W 1/2 X β̂ WLS
= W 1/2 (y − ŷ) with ŷ = X β̂ WLS . (6.5)

An unbiased estimate of σ^2 is given by

σ̂^2 = (1/(n − p)) Σi (ei^(W))^2
    = (1/(n − p)) Σi wi (yi − ŷi)^2

from which
Σ̂(β̂ WLS ) = σ̂ 2 (X t W X)−1 (6.6)

follows.

Estimation of the variance function.
Because the error variances σi2 or the weights wi are in general not known, we
are forced to estimate them. As the σi2 often vary with one or several predictor
variables or with the mean response E(yi ), the following procedure might be
helpful:

1. Fit the regression model by ordinary least squares (OLS) and analyze the
residuals.

2. Regress the squared residuals or the absolute residuals on the fitted values
or one or several independent variables. This makes sense because the
squared residuals e2i estimate the variances σi2 , while the absolute residuals
|ei | are estimates of the standard deviations σi .

3. Use the fitted values from the estimated variance or standard deviation
to obtain the weights wi .

4. Fit the regression model by WLS and analyze the residuals.

If β̂ WLS differs significantly from β̂ LS , it is advisable to iterate the WLS process


by using the residuals from the WLS fit to reestimate the variance or standard
deviation function and then obtain revised weights. This iterative process is
called iteratively reweighted least squares.

Note that the estimated standard deviations of the coefficients, derived from (6.6),
are now only approximate, because the estimation of the variances σi2 has in-
troduced another source of variability. The approximation will often be quite
good when the sample size is not too small.



Example: Blood data set.
Consider the Blood data set containing the age and the diastolic blood pressure
of 54 healthy women.

lmba <- lm(Bloodpr~Age)


summary(lmba)

Call:
lm(formula = Bloodpr ~ Age)

Residuals:
Min 1Q Median 3Q Max
-16.4786 -5.7877 -0.0784 5.6117 19.7813

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 56.15693 3.99367 14.061 < 2e-16 ***
Age 0.58003 0.09695 5.983 2.05e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.146 on 52 degrees of freedom


Multiple R-squared: 0.4077,Adjusted R-squared: 0.3963
F-statistic: 35.79 on 1 and 52 DF, p-value: 2.05e-07

The OLS analysis yields the fitted model:

Bloodpressure = 56.16 + 0.58Age

The scatter plot of the data, the plot of residuals versus age, and absolute
residuals versus age using OLS clearly demonstrate heteroscedasticity.

plot(Age,Bloodpr)
abline(lm(Bloodpr~Age))
resid <- residuals(lmba)
plot(Age,resid)
plot(Age,abs(resid))

[Plots: Bloodpr versus Age with the fitted OLS line; residuals versus Age; absolute residuals versus Age.]

When we regress the absolute residuals versus Age we obtain the estimated
expected standard deviation:

σ̂i = si = −1.55 + 0.198Age

stdev <- lm(abs(resid)~Age)


abline(lm(abs(resid)~Age),lty=2)
summary(stdev)

Call:
lm(formula = abs(resid) ~ Age)

Residuals:
Min 1Q Median 3Q Max
-9.7639 -2.7882 -0.1587 3.0757 10.0350

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.54948 2.18692 -0.709 0.48179
Age 0.19817 0.05309 3.733 0.00047 ***
---



Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.461 on 52 degrees of freedom


Multiple R-squared: 0.2113,Adjusted R-squared: 0.1962
F-statistic: 13.93 on 1 and 52 DF, p-value: 0.0004705

Finally we apply WLS with weights wi = 1/s2i .

weightblood <- 1/stdev$fitted^2


wlmba <- lm(Bloodpr~Age, weights=weightblood)
summary(wlmba)

Call:
lm(formula = Bloodpr ~ Age, weights = weightblood)

Weighted Residuals:
Min 1Q Median 3Q Max
-2.0230 -0.9939 -0.0327 0.9250 2.2008

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 55.56577 2.52092 22.042 < 2e-16 ***
Age 0.59634 0.07924 7.526 7.19e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.213 on 52 degrees of freedom


Multiple R-squared: 0.5214,Adjusted R-squared: 0.5122
F-statistic: 56.64 on 1 and 52 DF, p-value: 7.187e-10

The final estimated regression line is thus:

Bloodpressure = 55.56 + 0.596Age

which is not so different from the OLS line. Therefore, an extra reweighting
should not be considered. We see that the standard error of β̂1 has decreased
from 0.097 in the OLS analysis to 0.079 in the WLS analysis. Consequently the



(approximate) 95% confidence interval for β1 , which is 0.596 ± t53,0.025 0.079, is
also smaller than the confidence interval based on the ordinary analysis.

Finally we test whether the heteroscedasticity has gone now. Note that R
computes the residuals as ei(β̂WLS) = yi − ŷi(β̂WLS). They will still show the
heteroscedasticity. By (6.5) it is the residuals ei^(W)(β̂WLS) = √wi (yi − ŷi(β̂WLS))
that should have constant variance.

plot(Age,resid(wlmba))
plot(Age,resid(wlmba)*sqrt(weightblood),ylab="Weighted residuals")
[Plots: residuals of the WLS fit versus Age; weighted residuals versus Age.]



Chapter 7

Variable selection methods

7.1 Reduction of explanatory variables


Variable reduction is often recommended because:

• a regression model with many predictors may be difficult and expensive


to maintain

• a model with a limited number of regressors is easier to work with and to


understand

• the presence of highly intercorrelated explanatory variables may increase


the variance of the regression coefficients, and increase the problem of
roundoff errors.

• the presence of explanatory variables that are not related to the response
variable increases the variance of the predicted values and hence decreases
the model’s predictive ability.

On the other hand, omitting important variables (or latent explanatory vari-
ables) leads to biased estimates of the regression coefficients, the error variance,
the mean responses and predictions of new observations.

This is illustrated in Figure 4.3. If too many variables are selected, too much of
the redundancy in the x-variables is used and the solution becomes overfitted.
The regression equation will be very data dependent and gives poor prediction
results. If too few variables are retained, it is called underfitting which means

that the model is not large enough to capture the important variability in the
data. The optimal number of variables is usually found in between the two
extremes. It is therefore often a good idea to consider several ‘good’ subsets of
explanatory variables.

7.1.1 Surgical unit example


We consider the surgical unit example to illustrate the model-building process
and some variable reduction methods.
The data set consists of n = 108 patients undergoing a liver operation. A
hospital surgical unit was interested in predicting the survival time of patients.
Available explanatory variables are:

X1 : blood clotting score


X2 : prognostic index (including the age of the patient)
X3 : enzyme function test score
X4 : liver function test score
An exploratory data analysis shows that a transformation of the response vari-
able into Y 0 = log10 Y makes the distribution of the residuals in the first order
linear model more normal (symmetric), eliminates the need for interaction terms

and makes the relation between every regressor and the response variable more
linear.

[Scatterplot matrix of Log.Survival, Liver, Enzyme, Prognostic and Blood.Clotting for the surgical unit training data.]

We take the first half of the data as training data to build a regression model.
Table 8.1 shows part of these data, which are graphically displayed in the scatterplot
matrix above. The first order linear model (full model) for the data yields:



surgicalunit <- surgicalunit[,-(5:8)]
surgicalunit1 <- surgicalunit[1:54,]
attach(surgicalunit1)
surg.full <- lm(Log.Survival ~ Blood.Clotting + Prognostic + Enzyme + Liver)

Call:
lm(formula = Log.Survival ~ Blood.Clotting + Prognostic + Enzyme +
Liver)

Residuals:
Min 1Q Median 3Q Max
-0.43500 -0.17591 -0.02091 0.18400 0.56192

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.851948 0.266258 14.467 < 2e-16 ***
Blood.Clotting 0.083684 0.028833 2.902 0.00554 **
Prognostic 0.012665 0.002315 5.471 1.51e-06 ***
Enzyme 0.015632 0.002100 7.443 1.37e-09 ***
Liver 0.032161 0.051465 0.625 0.53493
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2509 on 49 degrees of freedom


Multiple R-squared: 0.7592,Adjusted R-squared: 0.7396
F-statistic: 38.62 on 4 and 49 DF, p-value: 1.388e-14



7.2 All-possible-regressions procedure for variable reduction
The first procedure considers all possible subsets of the pool of potential pre-
dictors, including products of observed variables for interaction or higher-order
terms. A regression is then performed for each subset and evaluated through
some criterion. Note that the number of subsets grows exponentially with the
number of predictors. For example there are 210 = 1024 regressions to be in-
spected for a data set with 10 possible predictors. Since it would not be possible
to examine each model carefully, it is useful to compute one simple criterion for
each model and then to select a few good subsets.
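As a sketch of how such an all-subsets search could be organised in R, the leaps
package (not used elsewhere in these notes) fits all subsets and returns the criteria
discussed in the following subsections; the variable names are those of the surgical
unit training data of Section 7.1.1.

library(leaps)
allsub <- regsubsets(Log.Survival ~ Blood.Clotting + Prognostic + Enzyme + Liver,
                     data = surgicalunit1, nbest = 2)
summary(allsub)$rsq      # R^2_p of the best subsets of each size
summary(allsub)$adjr2    # adjusted R^2_a (Section 7.2.2)
summary(allsub)$cp       # Mallows' C_p (Section 7.2.3)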

7.2.1 Rp2 criterion


The Rp2 coefficient (of multiple determination) reflects the proportion of the
variance of the response variable that is explained by the linear model with p
coefficients. Recalling (2.28),

\[ R_p^2 = 1 - \frac{SSE_p}{SST}. \]

Subsets with a high Rp2 coefficient (or equivalently with a low SSEp ) are consid-
ered good. We know that Rp2 always increases if we include additional variables
in the model. Therefore it makes no sense to maximize Rp2 , but we should find
the point where adding more variables is not worthwhile because it leads to a
very small increase in Rp2 .

Figure 8.4 contains a plot of the Rp2 values versus p, and its maximum for each
number of parameters p in the model. From this plot it can be seen that the
inclusion of the fourth variable Liver does not lead to a large increase in the
explained variance. This might be surprising because the correlation between
Liver and Log.Survival is the largest among all the pairwise correlations with
the response variable. This indicates that X1 , X2 and X3 contain much of the
information presented by X4 .

print(cor(surgicalunit1[, -5]), digits = 2)

Blood.Clotting Prognostic Enzyme Liver


Blood.Clotting 1.00 0.090 -0.150 0.50
Prognostic 0.09 1.000 -0.024 0.37
Enzyme -0.15 -0.024 1.000 0.42
Liver 0.50 0.369 0.416 1.00
Log.Survival 0.25 0.470 0.654 0.65
Log.Survival
Blood.Clotting 0.25
Prognostic 0.47
Enzyme 0.65
Liver 0.65
Log.Survival 1.00

7.2.2 MSEp criterion
Because Rp2 does not take account of the number of parameters in the regression
model and because it can never decrease as p increases, the adjusted Ra2 , defined
in (2.30), can be used as an alternative criterion:

\[ R_a^2 = 1 - \frac{SSE/(n-p)}{SST/(n-1)} = 1 - \frac{MSE_p}{SST/(n-1)}. \]
Since SST remains constant over all regression models, considering the adjusted
Ra2 is equivalent to looking at the mean squared error \( MSE_p = \hat\sigma^2 = \frac{1}{n-p}\sum_{i=1}^n e_{i,p}^2 \),
with ei,p the residuals from a model with p parameters. Although

\[ SSE_{p+1} = \sum_i e_{i,p+1}^2 \;\leq\; SSE_p = \sum_i e_{i,p}^2 , \]

the denominator n − (p + 1) of the larger model is smaller than the denominator
n − p of the smaller model. Hence, if the decrease in SSE is very small, the loss
of one degree of freedom can result in MSEp+1 > MSEp .

When we consider the MSE criterion we can thus look for the subset(s) with
minimal MSE, or whose MSE is very close to the minimum. Figure 8.5 shows
the MSEp plot for the surgical unit data. Again, the fourth explanatory variable
Liver appears not to be needed in the model.

7.2.3 Mallows’ Cp
The Cp statistic, suggested by C.L. Mallows, has the form

\[ C_p = \frac{SSE_p}{s^2} - (n - 2p) \qquad (7.1) \]

with s2 the MSE from the ’largest’ model, presumed to be a reliable unbiased
estimate of the error variance σ 2 .

It is an estimator of the standardized total mean squared error of the fitted
values:

\[ \gamma_p = \frac{1}{\sigma^2}\sum_{i=1}^n MSE(\hat y_i)
      = \frac{1}{\sigma^2}\sum_{i=1}^n E\big((\hat y_i - E(y_i \mid x_i))^2\big)
      = \frac{1}{\sigma^2}\sum_{i=1}^n \big(\mathrm{bias}^2(\hat y_i) + \mathrm{Var}(\hat y_i)\big). \]

It can be shown that γp can be expressed as

\[ \gamma_p = \frac{E(SSE_p)}{\sigma^2} - (n - 2p), \]
from which (7.1) follows. In order to minimize the total mean squared error we
prefer models with small Cp value.

Note that for each p there are several Cp statistics (one for every subset of that
size), except for the model that includes all variables, whose value we denote by
CP . For this model

\[ C_P = \frac{(n-P)\,MSE_P}{s^2} - n + 2P = P \]

since s2 = MSEP .

If a model with p parameters is adequate, E(SSEp ) ≈ (n − p)σ 2 . Since we


assume that also E(s2 ) = σ 2 , approximately E(SSEp /s2 ) ≈ n − p. Hence

E(Cp ) ≈ p

for an adequate model. When the Cp values for all regression models are thus
plotted against p, the models with little bias will be close to the line Cp = p.
Models with substantial bias (due to the omission of some predictors) will fall

above this line.

The Cp plot of the surgical unit data clearly shows that only the subset with the
first three variables has little bias. Here, the Cp value falls below the line Cp = p,
because SSE1,2,3 = 0.1099 is only slightly larger than SSE1,2,3,4 = 0.1098 and
consequently MSE1,2,3 = 0.00220 < MSE1,2,3,4 = 0.00224.
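A minimal sketch of evaluating (7.1) in R for the subset {X1 , X2 , X3 } of the
training data, assuming (as in (7.1)) that s² is taken to be the MSE of the full
model fitted earlier:

surg.123 <- lm(Log.Survival ~ Blood.Clotting + Prognostic + Enzyme, data = surgicalunit1)
s2   <- summary(surg.full)$sigma^2     # MSE of the 'largest' model
SSEp <- sum(resid(surg.123)^2)
n <- nrow(surgicalunit1); p <- length(coef(surg.123))
Cp <- SSEp / s2 - (n - 2 * p)
Cp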

7.2.4 Akaike’s Information Criterion
Akaike’s information criterion (AIC) is an asymptotically unbiased estimator of
the Kullback-Leibler information number (or discrepancy):
\[ \text{K-L} = E_P\left[\log\left(\frac{f_P(x)}{f_p(x)}\right)\right] \]

with fP the likelihood of the full model, and fp the likelihood of the reduced
model.

As derived in (2.26), for a regression model with n observations, p parameters
and normally distributed errors, the log-likelihood function is given by:

\[ \log L(\beta, \sigma^2 \mid y) = \text{const} - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - x_i^t\beta)^2 . \]

The least squares estimator β̂ = β̂ LS maximizes this log-likelihood function:

\[ \log L(\hat\beta, \sigma^2 \mid y) = \text{const} - \frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\, SSE_p . \]
Akaike’s Information Criterion is now defined as

AICp = −2 max log L + 2p.

If σ 2 is known, this corresponds to

\[ AIC_p = \frac{SSE_p}{\sigma^2} + 2p + \text{const}. \]

From (7.1) we see that, if σ 2 is known,

\[ C_p = \frac{SSE_p}{\sigma^2} - (n - 2p) \]
and thus Cp and AICp only differ by a constant.
If σ 2 is unknown, we estimate it as σ̂ 2ML = SSEp /n (see (2.25)) and thus

\[ \max \log L = \log L(\hat\beta, \hat\sigma^2_{ML} \mid y) = \text{const} - \frac{n}{2}\log\hat\sigma^2_{ML} - \frac{n}{2}, \]

from which

\[ AIC_p = n \log(SSE_p/n) + 2p + \text{const} \]

follows.

The AIC criterion is used in R to perform stepwise regression (see Section 7.3).
In a forward search strategy, one starts with a model of e.g. p − 1 explanatory
variables, and then includes the variable that yields the largest reduction in the
AIC.
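As a small sketch, the AIC value that stepAIC reports for a fitted lm object, i.e.
n log(SSEp /n) + 2p up to the additive constant, can be inspected with extractAIC():

extractAIC(surg.full)      # c(equivalent degrees of freedom, AIC)
n <- length(resid(surg.full))
n * log(sum(resid(surg.full)^2) / n) + 2 * length(coef(surg.full))   # the same AIC by hand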



7.2.5 PRESSp Criterion
The goal of the PRESSp criterion (prediction sum of squares) is to estimate the
mean squared error of prediction:

MSEP = E[(y0 − ŷ0 )2 ]

with y0 a new observation, independent of the original n observations.

To obtain independent observations in an artificial way, every observed data point


(xi , yi ) in turn is withheld from the data set. This leaves a new data set with
n − 1 observations. Under the independent errors assumption, we know that yi
is independent of this new data set. The remaining n − 1 observations yield the
least squares fit β̂ (i) and for observation i the fitted value ŷi(i) = xti β̂ (i) .

The PRESSp value is then obtained by summing the squared prediction errors
di = yi − ŷi(i) over all i = 1, . . . , n:

\[ PRESS_p = \sum_{i=1}^n \big(y_i - \hat y_{i(i)}\big)^2 \qquad (7.2) \]

Models with small PRESSp values (or PRESSp /n) are considered good candi-
date models. The prediction error di = yi − ŷi(i) is also called the deleted residual
for the ith observation. It can be shown to be equal to

\[ d_i = \frac{e_i}{1 - h_{ii}} \qquad (7.3) \]
and thus can be computed without recomputing the regression function.
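A minimal sketch of (7.2) and (7.3) in R for the full model fitted on the surgical
unit training data:

d <- resid(surg.full) / (1 - hatvalues(surg.full))   # deleted residuals d_i, see (7.3)
PRESS <- sum(d^2)                                    # PRESS_p, see (7.2)
PRESS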

7.3 Stepwise regression
Stepwise regression procedures develop a sequence of regression models, such
that at each step an explanatory variable is added to or removed from the model.
They are automatic search methods that avoid fitting all subset regressions.

7.3.1 Backward elimination


The backward elimination method starts with the regression model that contains
all P explanatory variables. Then it computes for each variable j = 1, . . . , P in
the model its partial F-value. Following (3.6),

\[ F_j^* = \frac{MSR(X_j \mid X_1, \ldots, X_{j-1}, X_{j+1}, \ldots, X_P)}{MSE(X_1, \ldots, X_P)} , \]

whose numerator represents the decrease in the SSE when Xj is added to the model
that does not yet contain Xj . Equivalently,

\[ F_j^* = \left(\frac{\hat\beta_j}{s(\hat\beta_j)}\right)^{\!2} \]

and thus Fj∗ equals the squared t-value for the parameter test H0 : βj = 0 versus
H1 : βj ≠ 0.

The variable for which this Fj∗ is smallest is the candidate for deletion. If this
Fj∗ value falls below a predetermined limit (e.g. F1,n−P,α , or equivalently the
corresponding p-value is larger than α), then this variable is deleted and the
procedure starts over with the P − 1 remaining variables. Otherwise the process
is stopped.
A drawback of this method is that a variable can never come back in the model
once it is deleted.
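As a sketch, the partial F-tests for all candidate deletions from the full model
fitted earlier can be obtained in a single call with drop1():

drop1(surg.full, test = "F")   # partial F-statistic and p-value for removing each regressor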
To illustrate the stepwise procedures we use a random half-sample of the surgical
unit data.

set.seed(1)
surgicalunit2 <- surgicalunit[sample(1:108,54),]
attach(surgicalunit2)
surg.full <- lm(Log.Survival ~ Blood.Clotting + Prognostic + Enzyme + Liver)
summary(surg.full)

Call:
lm(formula = Log.Survival ~ Blood.Clotting + Prognostic + Enzyme +
    Liver)

Residuals:
Min 1Q Median 3Q Max
-0.60295 -0.20763 -0.01256 0.20764 0.57249

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.870175 0.278091 13.917 < 2e-16 ***
Blood.Clotting 0.070782 0.030310 2.335 0.0237 *
Prognostic 0.014444 0.002477 5.830 4.27e-07 ***
Enzyme 0.014245 0.002226 6.399 5.66e-08 ***
Liver 0.041781 0.052601 0.794 0.4308
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2693 on 49 degrees of freedom


Multiple R-squared: 0.7128,Adjusted R-squared: 0.6893
F-statistic: 30.4 on 4 and 49 DF, p-value: 9.842e-13

Because F4∗ = 0.794² = 0.63 < F1,54−5,0.05 = 4.04 (or equivalently because
the corresponding p-value 0.431 > 0.05), the fourth variable Liver will be re-
moved.

surg.bpe <- update(surg.full, .~. - Liver)


coefficients(summary(surg.bpe))

Estimate Std. Error t value Pr(>|t|)


(Intercept) 3.76437069 0.243211225 15.477784 1.048185e-20
Blood.Clotting 0.08579894 0.023603496 3.635010 6.559042e-04
Prognostic 0.01529755 0.002223887 6.878744 9.288718e-09
Enzyme 0.01528730 0.001791835 8.531641 2.532426e-11

The procedure then stops because all remaining p-values are smaller than the limit.
In R the function stepAIC performs backward selection automatically based on
the AIC criterion. The variable whose removal results in the largest decrease
in AIC is dropped from the model. The procedure stops when AIC cannot
decrease anymore.



library(MASS)
surg.full <- lm(Log.Survival~Blood.Clotting + Prognostic + Enzyme + Liver)
surg.stepb <- stepAIC(surg.full, list(upper = ~ Blood.Clotting +
Prognostic + Enzyme + Liver, lower = ~ 1), direction = "back")

Start: AIC=-136.94
Log.Survival ~ Blood.Clotting + Prognostic + Enzyme + Liver

Df Sum of Sq RSS AIC


- Liver 1 0.04575 3.5989 -138.25
<none> 3.5531 -136.94
- Blood.Clotting 1 0.39544 3.9485 -133.24
- Prognostic 1 2.46490 6.0180 -110.49
- Enzyme 1 2.96904 6.5221 -106.14

Step: AIC=-138.25
Log.Survival ~ Blood.Clotting + Prognostic + Enzyme

Df Sum of Sq RSS AIC


<none> 3.5989 -138.252
- Blood.Clotting 1 0.9511 4.5499 -127.589
- Prognostic 1 3.4058 7.0046 -104.290
- Enzyme 1 5.2391 8.8380 -91.736

Note that the column ”Sum of Sq” indicates the extra sum of squares
SSR(Xj |X1 , . . . , Xj−1 , Xj+1 , . . . , Xp ) whereas RSS = SSEp−1 .



7.3.2 Forward selection
The forward selection procedure starts with the model that only contains the
intercept. Then a simple regression model is fitted for each of the P − 1 ex-
planatory variables. Also here the partial F ∗ -value is computed:

\[ F_j^* = \frac{MSR(X_j)}{MSE(X_j)} \]

and the variable with the largest Fj∗ value is the candidate for the first addition.
If this Fj∗ value exceeds a predetermined level (or the p-value is lower than α),
the Xj variable is added. Otherwise none of the regressors are considered to be
helpful in the prediction of the response variable.
Assume variable X7 is entered, then the next partial Fj∗ values are

\[ F_j^* = \frac{MSR(X_j \mid X_7)}{MSE(X_j, X_7)} \]

and again the variable with the largest Fj∗ -value is included (if it is large enough).
As with the backward selection procedure, none of the variables can be removed
once they are entered in the model.
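As a sketch, the partial F-values for the candidate additions can be obtained with
add1(); the scope below lists the four regressors of the surgical unit example.

add1(lm(Log.Survival ~ 1),
     scope = ~ Blood.Clotting + Prognostic + Enzyme + Liver, test = "F")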

surg.initial <- lm(Log.Survival~1)


surg.stepf <- stepAIC(surg.initial, list(upper = ~ Blood.Clotting +
Prognostic + Enzyme + Liver, lower = ~ 1), direction = "forward")

Start: AIC=-77.58
Log.Survival ~ 1

Df Sum of Sq RSS AIC


+ Liver 1 4.7380 7.6321 -101.658
+ Enzyme 1 4.0988 8.2713 -97.314
+ Prognostic 1 2.7667 9.6034 -89.251
+ Blood.Clotting 1 1.0279 11.3421 -80.265
<none> 12.3701 -77.580

Step: AIC=-101.66
Log.Survival ~ Liver

Df Sum of Sq RSS AIC


+ Enzyme 1 1.52473 6.1073 -111.693
+ Prognostic 1 1.10663 6.5254 -108.117

<none> 7.6321 -101.658
+ Blood.Clotting 1 0.02354 7.6085 -99.825

Step: AIC=-111.69
Log.Survival ~ Liver + Enzyme

Df Sum of Sq RSS AIC


+ Prognostic 1 2.15879 3.9485 -133.24
<none> 6.1073 -111.69
+ Blood.Clotting 1 0.08933 6.0180 -110.49

Step: AIC=-133.24
Log.Survival ~ Liver + Enzyme + Prognostic

Df Sum of Sq RSS AIC


+ Blood.Clotting 1 0.39544 3.5531 -136.94
<none> 3.9485 -133.24

Step: AIC=-136.94
Log.Survival ~ Liver + Enzyme + Prognostic + Blood.Clotting

7.3.3 Stepwise regression


Stepwise regression combines the forward and backward selection procedure. At
each stage of the procedure, it is checked whether a new variable should come
in or should be removed from the model. Let us consider the forward stepwise
procedure. Then, starting with the intercept term, a variable is added as in
the forward selection method. Assume it to be X3 . Next, a second variable is
selected, e.g. X4 . But before a third one is added, it is checked whether X3
should be dropped from the actual model or not. For this the partial F-value

\[ F_j^* = \frac{MSR(X_3 \mid X_4)}{MSE(X_3, X_4)} \]

is computed. Assume that both variables remain in the model, then the partial
F-values

\[ F_j^* = \frac{MSR(X_j \mid X_3, X_4)}{MSE(X_j, X_3, X_4)} \]
have to be computed and so on.



Stepwise regression thus allows a variable that is brought into the model at
a certain stage to be dropped subsequently, if it is no longer helpful in conjunction
with other variables added at a later stage.

Stepwise regression can also be performed by looking at the increase and de-
crease of the AIC when removing and adding variables:

surg.stepfbi <- stepAIC(surg.initial, list(upper = ~ Blood.Clotting +


Prognostic + Enzyme + Liver, lower = ~ 1), direction = "both")

Start: AIC=-77.58
Log.Survival ~ 1

Df Sum of Sq RSS AIC


+ Liver 1 4.7380 7.6321 -101.658
+ Enzyme 1 4.0988 8.2713 -97.314
+ Prognostic 1 2.7667 9.6034 -89.251
+ Blood.Clotting 1 1.0279 11.3421 -80.265
<none> 12.3701 -77.580

Step: AIC=-101.66
Log.Survival ~ Liver

Df Sum of Sq RSS AIC


+ Enzyme 1 1.5247 6.1073 -111.693
+ Prognostic 1 1.1066 6.5254 -108.117
<none> 7.6321 -101.658
+ Blood.Clotting 1 0.0235 7.6085 -99.825
- Liver 1 4.7380 12.3701 -77.580

Step: AIC=-111.69
Log.Survival ~ Liver + Enzyme

Df Sum of Sq RSS AIC


+ Prognostic 1 2.15879 3.9485 -133.244
<none> 6.1073 -111.693
+ Blood.Clotting 1 0.08933 6.0180 -110.488
- Enzyme 1 1.52473 7.6321 -101.658
- Liver 1 2.16398 8.2713 -97.314



Step: AIC=-133.24
Log.Survival ~ Liver + Enzyme + Prognostic

Df Sum of Sq RSS AIC


+ Blood.Clotting 1 0.39544 3.5531 -136.94
<none> 3.9485 -133.24
- Liver 1 0.60137 4.5499 -127.59
- Prognostic 1 2.15879 6.1073 -111.69
- Enzyme 1 2.57689 6.5254 -108.12

Step: AIC=-136.94
Log.Survival ~ Liver + Enzyme + Prognostic + Blood.Clotting

Df Sum of Sq RSS AIC


- Liver 1 0.04575 3.5989 -138.25
<none> 3.5531 -136.94
- Blood.Clotting 1 0.39544 3.9485 -133.24
- Prognostic 1 2.46490 6.0180 -110.49
- Enzyme 1 2.96904 6.5221 -106.14

Step: AIC=-138.25
Log.Survival ~ Enzyme + Prognostic + Blood.Clotting

Df Sum of Sq RSS AIC


<none> 3.5989 -138.252
+ Liver 1 0.0457 3.5531 -136.943
- Blood.Clotting 1 0.9511 4.5499 -127.589
- Prognostic 1 3.4058 7.0046 -104.290
- Enzyme 1 5.2391 8.8380 -91.736

surg.stepfbf <- stepAIC(surg.full, list(upper = ~ Blood.Clotting +


Prognostic + Enzyme + Liver, lower = ~ 1), direction = "both")

Start: AIC=-136.94
Log.Survival ~ Blood.Clotting + Prognostic + Enzyme + Liver

Df Sum of Sq RSS AIC


- Liver 1 0.04575 3.5989 -138.25
<none> 3.5531 -136.94

- Blood.Clotting 1 0.39544 3.9485 -133.24
- Prognostic 1 2.46490 6.0180 -110.49
- Enzyme 1 2.96904 6.5221 -106.14

Step: AIC=-138.25
Log.Survival ~ Blood.Clotting + Prognostic + Enzyme

Df Sum of Sq RSS AIC


<none> 3.5989 -138.252
+ Liver 1 0.0457 3.5531 -136.943
- Blood.Clotting 1 0.9511 4.5499 -127.589
- Prognostic 1 3.4058 7.0046 -104.290
- Enzyme 1 5.2391 8.8380 -91.736



7.4 Model validation
Model validation involves checking the model against independent data. Three
basic ways of validation are

• collection of new data to check the model and its predictive ability

• comparison of results with theoretical expectations, earlier empirical re-


sults, simulation results

• use of a holdout sample to check the model and its predictive ability

7.4.1 Collection of new data

The purpose of collecting new data is to be able to examine whether the regression
model developed from the earlier data is still applicable for the new data. This
is in particular of interest for exploratory observational studies, as they also
involve model building.

Validity checking can be performed

• by re-estimating the final model using the new data and comparing the
estimated regression coefficients and other characteristics of the fitted
model

• by re-estimating from the new data all the ’good’ subset models that had
been considered to see whether the selected regression model is still the
preferred one.

• by measuring the predictive ability of the regression model. Since the


selected model is chosen to fit the original data, the MSE often underes-
timates the true variance. The mean squared prediction error
\[ MSEP = \frac{\sum_{i=1}^m (y_i - \hat y_i)^2}{m} \]

computes the mean of the squared prediction errors of the new data (of
size m), and should be compared with MSE. If MSEP is much larger
than MSE, one should rely on the MSEP as an indicator of how well the
selected regression model will predict in the future.
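A minimal sketch of this check, assuming the model without Liver is refitted on the
first half of the surgical unit data (the training set of Section 7.1.1) and evaluated
on the remaining 54 patients:

train <- surgicalunit[1:54, ]
valid <- surgicalunit[55:108, ]
fit   <- lm(Log.Survival ~ Blood.Clotting + Prognostic + Enzyme, data = train)
MSEP  <- mean((valid$Log.Survival - predict(fit, newdata = valid))^2)
MSEP
summary(fit)$sigma^2     # MSE on the training data, for comparison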



7.4.2 Data splitting
Since it is often very difficult to collect new data, an alternative is to split the
data into two sets: a training set used to develop the model and a validation
set which is used to evaluate the reasonableness and predictive capability of the
selected model. This validation procedure is often called cross-validation. The
validation set then plays the role of the ’new data’ in the previous section.

To obtain reliable results, the training set should be large enough (remember,
n > 5p) otherwise the variances of the regression coefficients will be too large.
If data splitting is impractical for small data sets, the PRESS criterion (7.2),

\[ PRESS_p = \sum_{i=1}^n \big(y_i - \hat y_{i(i)}\big)^2 , \]

can also be employed as a form of data splitting.

If a data set for an exploratory observational study is very large, it can even be
split into three parts: one for developing the regression model, the second for
estimating the parameters and the third for validation. This approach avoids
bias resulting from estimating the regression parameters from the same data set
used for developing the model. On the other hand this approach yields larger
variances of the parameter estimates.

In any case, once the model has been validated, it is customary to use the entire
data set for estimating the final regression model.



Chapter 8

Multicollinearity

8.1 The effects of multicollinearity


We say that there exists multicollinearity among the predictors if there exists a
nontrivial linear combination of the regressors which is (almost) zero:

\[ \exists\, \{c_j\}: \quad c_0 + \sum_{j=1}^{p-1} c_j X_j \approx 0 . \]

As illustrated in Figures 13.2 and 13.3 of Chapter 2, there is a large effect on


the regression parameter estimates when the predictor variables are correlated
among themselves. Before exploring these difficulties in detail, we first examine
the situation when all the predictor variables are uncorrelated.

8.1.1 Uncorrelated predictor variables


Let us consider the crew productivity example, investigating the effect of work
crew size (X1 ) and the level of bonus pay (X2 ) on crew productivity (Y ).

attach(crew)
crew

crew.size bonus.pay crew.productivity


1 4 2 42
2 4 2 39
3 4 3 48
4 4 3 51
5 6 2 49

6 6 2 53
7 6 3 61
8 6 3 60

Both predictors are uncorrelated, r12 = 0. We compare the regression when


both X1 and X2 are included in the model, to the simple regression models
containing only X1 and only X2 .

crew.lm12 <- lm(crew.productivity ~ crew.size + bonus.pay)


summary(crew.lm12)

Call:
lm(formula = crew.productivity ~ crew.size + bonus.pay)

Residuals:
1 2 3 4 5 6 7 8
1.625 -1.375 -1.625 1.375 -2.125 1.875 0.625 -0.375

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.3750 4.7405 0.079 0.940016
crew.size 5.3750 0.6638 8.097 0.000466 ***
bonus.pay 9.2500 1.3276 6.968 0.000937 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.877 on 5 degrees of freedom


Multiple R-squared: 0.958,Adjusted R-squared: 0.9412
F-statistic: 57.06 on 2 and 5 DF, p-value: 0.000361

anova(crew.lm12)

Analysis of Variance Table

Response: crew.productivity
Df Sum Sq Mean Sq F value Pr(>F)
crew.size 1 231.125 231.125 65.567 0.0004657 ***
bonus.pay 1 171.125 171.125 48.546 0.0009366 ***
Residuals 5 17.625 3.525

---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

crew.lm1 <- lm(crew.productivity ~ crew.size)


summary(crew.lm1)

Call:
lm(formula = crew.productivity ~ crew.size)

Residuals:
Min 1Q Median 3Q Max
-6.750 -3.750 0.125 4.500 6.000

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 23.500 10.111 2.324 0.0591 .
crew.size 5.375 1.983 2.711 0.0351 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.609 on 6 degrees of freedom


Multiple R-squared: 0.5505,Adjusted R-squared: 0.4755
F-statistic: 7.347 on 1 and 6 DF, p-value: 0.03508

anova(crew.lm1)

Analysis of Variance Table

Response: crew.productivity
Df Sum Sq Mean Sq F value Pr(>F)
crew.size 1 231.12 231.125 7.347 0.03508 *
Residuals 6 188.75 31.458
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

crew.lm2 <- lm(crew.productivity ~ bonus.pay)


summary(crew.lm2)

Call:

lm(formula = crew.productivity ~ bonus.pay)

Residuals:
Min 1Q Median 3Q Max
-7.000 -4.688 -0.250 5.250 7.250

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.250 11.608 2.348 0.0572 .
bonus.pay 9.250 4.553 2.032 0.0885 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.439 on 6 degrees of freedom


Multiple R-squared: 0.4076,Adjusted R-squared: 0.3088
F-statistic: 4.128 on 1 and 6 DF, p-value: 0.08846

anova(crew.lm2)

Analysis of Variance Table

Response: crew.productivity
Df Sum Sq Mean Sq F value Pr(>F)
bonus.pay 1 171.12 171.125 4.1276 0.08846 .
Residuals 6 248.75 41.458
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The fitted response functions are:

Ŷ = 0.375 + 5.375X1 + 9.250X2


Ŷ = 23.50 + 5.375X1
Ŷ = 27.25 + 9.250X2

We see that β̂1 = 5.375, the regression coefficient for X1 , is the same whether
or not X2 is also included in the model. The same holds for β̂2 = 9.250. This
is a general result, which can be most easily deduced from the estimate of β in
the standardized regression model (2.40):

\[ y_i' = \beta_1' x_{i1}' + \beta_2' x_{i2}' + \ldots + \beta_{p-1}' x_{i,p-1}' + \varepsilon_i' \qquad (8.1) \]

with parameter estimates (2.43)

\[ \hat\beta{}' = R_{XX}^{-1}\, r_{XY} . \]

If all the X variables are uncorrelated, RXX = Ip−1 , and thus β̂j′ = rjY only
depends on Xj and Y . This remains true for the original coefficients

\[ \hat\beta_j = \Big(\frac{s_Y}{s_j}\Big)\hat\beta_j' = \frac{s_{jY}}{s_j^2} . \]

A geometrical illustration is provided in Figure 10.11 for uncorrelated regressors


and in Figure 10.10 for correlated predictor variables.

Moreover we observe in our example that

SSR(X2 |X1 ) = 171.125 = SSR(X2 )

and equivalently SSR(X1 |X2 ) = SSR(X1 ) = 231.125. Since

SSR(X2 |X1 ) = SSR(X1 , X2 ) − SSR(X1 ) = SSR(X2 )

we see that the regression sum of squares due to X1 and X2 together can be
split into the SSR due to X1 alone and the SSR due to X2 alone, when X1 and
X2 are uncorrelated.



8.1.2 Perfectly or highly correlated predictors

Now, consider an example where two predictor variables are perfectly correlated,
as in Table 7.8.

Here, the response function is not unique. Both Ŷ = −87 + X1 + 18X2 and
Ŷ = −7 + 9X1 + 2X2 yield the same fitted values and (zero) residuals.

As argued in Section 2.2, the matrix of cross-products X t X is singular when


rank(X) < p, hence the least-squares solution is not unique. This could also be
seen in Figure 13.2(b) of Chapter 2.



In real data sets, predictor variables are rarely perfectly correlated and the least
squares solution will thus still be unique. But other problems occur when some
(or all) of the regressors are highly correlated:

• because (X t X) is close to a singular matrix, the variance of the estimated


coefficients
Σ(β̂) = σ 2 (X t X)−1

will be large. Consequently

– a new sample can yield very different estimates

– many of the regression coefficients may be statistically non significant


(large confidence intervals), even though a statistical relation exists
between the response variable and the set of predictor variables. The
t and p-values corresponding to the univariate tests βj = 0 are thus
not very informative (see also the discussion in Section 3.4).

• the interpretation of a regression coefficient is not fully applicable. When


regressors are strongly correlated, it might be hard to vary one predictor
while holding the others constant (e.g. X1 = amount of rainfall, X2 =
hours of sunshine).

We will illustrate some of these effects on the Body Fat data, relating the
amount of body fat (Y ) to triceps skinfold thickness (X1 ), thigh circumference
(X2 ) and midarm circumference (X3 ), measured on 20 healthy women aged 25 to
34 years.

attach(bodyfat)
bodyfat

triceps thigh midarm body.fat


1 19.5 43.1 29.1 11.9
2 24.7 49.8 28.2 22.8
3 30.7 51.9 37.0 18.7
4 29.8 54.3 31.1 20.1
5 19.1 42.2 30.9 12.9
6 25.6 53.9 23.7 21.7
7 31.4 58.5 27.6 27.1
8 27.9 52.1 30.6 25.4

9 22.1 49.9 23.2 21.3
10 25.5 53.5 24.8 19.3
11 31.1 56.6 30.0 25.4
12 30.4 56.7 28.3 27.2
13 18.7 46.5 23.0 11.7
14 19.7 44.2 28.6 17.8
15 14.6 42.7 21.3 12.8
16 29.5 54.4 30.1 23.9
17 27.7 55.3 25.7 22.6
18 30.2 58.6 24.6 25.4
19 22.7 48.2 27.1 14.8
20 25.2 51.0 27.5 21.1

From the correlation matrix RXX we deduce that X1 and X2 are highly corre-
lated.

print(cor(bodyfat),digits=2)

triceps thigh midarm body.fat


triceps 1.00 0.924 0.458 0.84
thigh 0.92 1.000 0.085 0.88
midarm 0.46 0.085 1.000 0.14
body.fat 0.84 0.878 0.142 1.00

The third variable X3 is not highly correlated with X1 and X2 but if we regress
X3 on X1 and X2 we obtain R2 = 0.98. We see that the regression coefficient
for a certain predictor varies a lot depending on the presence of some or all of
the other predictors, and even can change sign as for β̂2 . Also the standard
error of the parameter estimates increases considerably when more variables are
added to the model.
variables in model β̂1 β̂2 s(β̂1 ) s(β̂2 )
X1 0.8572 - 0.1288 -
X2 - 0.8565 - 0.1100
X1 , X2 0.2224 0.6594 0.3034 0.2912
X1 , X2 , X3 4.3341 -2.8568 3.0155 2.5820
This example illustrates again that a regression coefficient reflects the marginal
or partial effect of a predictor on the response variable, given the other variables
in the model!



If we compute the extra sums of squares for X1 , we see that SSR(X1 ) = 352.27
is much larger that SSR(X1 |X2 ) = 3.47. This is again due to the fact that X1
and X2 are highly correlated. When X2 is already in the model, the marginal
contribution of X1 is small, because X1 contains almost the same information
as X2 .

Finally we illustrate the effect of the multicollinearity on hypothesis tests. Con-


sider the linear model with only the first two predictors.

body.lm12 <- lm(body.fat ~ triceps + thigh)


summary(body.lm12, correlation = TRUE)

Call:
lm(formula = body.fat ~ triceps + thigh)

Residuals:
Min 1Q Median 3Q Max
-3.9469 -1.8807 0.1678 1.3367 4.0147

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -19.1742 8.3606 -2.293 0.0348 *
triceps 0.2224 0.3034 0.733 0.4737
thigh 0.6594 0.2912 2.265 0.0369 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.543 on 17 degrees of freedom


Multiple R-squared: 0.7781,Adjusted R-squared: 0.7519
F-statistic: 29.8 on 2 and 17 DF, p-value: 2.774e-06

Correlation of Coefficients:
(Intercept) triceps
triceps 0.73
thigh -0.93 -0.92

The overall F-test


H0 : β 1 = β 2 = 0



has a p-value close to 0 (2.8 · 10−6 ), which leads us to conclude that at least one
of β1 and β2 differs from zero. The individual t-tests for β1 and β2 :

H0 : β 1 = 0 and H0 : β 2 = 0

are however both in favor of the H0 hypothesis. If we use the Bonferroni method
at the α = 5% significance level, we see that both p-values (0.47 and 0.037) are
larger than 0.025 = α/2. Reconsidering Figure 5.1 from Chapter 3, we notice
that the correlation between β̂1 and β̂2 is strongly negative (−0.92), which leads
to a very narrow, elongated confidence ellipse that excludes (0, 0). But the two
univariate confidence intervals contain 0.



8.2 Multicollinearity diagnostics

8.2.1 Informal methods


To summarize, multicollinearity can be informally detected by the following
diagnostics:

1. large changes in the estimated regression coefficients when a predictor


variable is added or deleted

2. nonsignificant results in individual tests on the regression coefficients for


important predictor variables

3. estimated regression coefficients with an opposite sign as we would expect


from theoretical considerations or prior experience

4. large simple correlations between pairs of predictor variables.

8.2.2 Variance inflation factors


A formal method to detect multicollinearity is by means of the variance inflation
factors.

Let Rj2 be the value of the R2 coefficient of determination when Xj is regressed
against all the other independent variables. Then, for each j = 1, . . . , p − 1 the
variance inflation factor is defined as

\[ VIF_j = \frac{1}{1 - R_j^2} . \]

• If Xj is not related to the other X variables, Rj2 = 0 and hence, VIFj = 1.

• When there is a perfect linear association, Rj2 = 1 and VIFj is unbounded.

• In the general case 0 < Rj2 < 1, so 1 < VIFj < ∞.

It can be shown that VIFj equals the jth diagonal element of the inverse corre-
lation matrix of X:

\[ VIF_j = (R_{XX}^{-1})_{jj} \qquad (8.2) \]

For the Body Fat data, we have indeed large variance inflation factors for all
three regressors: VIF1 = 708.84, VIF2 = 564.34 and VIF3 = 104.61.



solve(cor(bodyfat[, 1:3]))

triceps thigh midarm


triceps 708.8429 -631.9152 -270.9894
thigh -631.9152 564.3434 241.4948
midarm -270.9894 241.4948 104.6060

The variance inflation factors measure how much the variances of the estimated
regression coefficients are inflated compared to when the predictor variables are
not linearly related. This can be seen as follows: in the standardized regression
model (2.40),

\[ \Sigma(\hat\beta{}') = (\sigma')^2\, ((X')^t X')^{-1} = (\sigma')^2\, R_{XX}^{-1} \]

with (σ ′ )2 the error variance of the transformed data. Because of (8.2) we derive
that Var(β̂j′ ) = (σ ′ )2 VIFj . In terms of the original variables, this yields

\[ \mathrm{Var}(\hat\beta_j) = \Big(\frac{s_Y}{s_j}\Big)^2 \mathrm{Var}(\hat\beta_j')
   = \Big(\frac{s_Y}{s_j}\Big)^2 \frac{\sigma^2}{(n-1)s_Y^2}\, VIF_j
   = \frac{\sigma^2}{(n-1)s_j^2}\, VIF_j , \]

using (2.42) and (2.41).

We speak about strong multicollinearity if

• the largest VIF is larger than 10, or if

• the mean of the VIF values is considerably larger than 1.
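As a small sketch for the Body Fat data, using the inverse correlation matrix
computed above:

vifs <- diag(solve(cor(bodyfat[, 1:3])))   # VIF_j = (R_XX^-1)_jj, see (8.2)
max(vifs)     # largest VIF, to be compared with 10
mean(vifs)    # mean VIF, to be compared with 1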

8.2.3 The eigenvalues of the correlation matrix

As another diagnostic tool we can look at the eigenvalues of RXX . Because the
correlation matrix is symmetric and semi-positive definite, it can be decomposed
as
\[ R_{XX} = P L P^t = \sum_{j=1}^{p-1} \lambda_j\, v_j v_j^t \qquad (8.3) \]

with the columns of P = (v 1 , . . . , v p−1 ) containing the (normalized) eigenvec-


tors of RXX and L = diag(λ1 , . . . , λp−1 ) the corresponding eigenvalues. We
assume from now on that the eigenvalues are sorted in descending order, so

λ1 > λ2 > . . . > λp−1 . If there is perfect multicollinearity, some of the eigenval-
ues are zero. Near collinearities are associated with small eigenvalues.

To judge the eigenvalues with respect to their size, we use the equality

\[ \sum_{j=1}^{p-1} \lambda_j = \mathrm{tr}(R_{XX}) = p - 1 \]

so we can

• compare the eigenvalues with p − 1 by computing λj /(p − 1)

• compare them with the largest eigenvalue λmax = λ1 . The condition number
is defined as

\[ \eta_j = \sqrt{\lambda_{max}/\lambda_j} . \]

A condition number ηj > 30 is an indication for multicollinearity.
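A minimal sketch of these diagnostics for the Body Fat data:

lambda <- eigen(cor(bodyfat[, 1:3]))$values   # eigenvalues of R_XX, in descending order
lambda / length(lambda)                       # lambda_j / (p - 1), here p - 1 = 3
sqrt(max(lambda) / lambda)                    # condition numbers eta_j, compare with 30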



8.3 Multicollinearity remedies

8.3.1 Specific solutions


To reduce multicollinearity, we can

• drop one or several predictor variables that are highly correlated to the
remaining variables

• for polynomial regression: center the variables

• apply a biased regression method with a smaller variance.

In Figure 10.2 it is illustrated why a biased estimator with a small variance


might be preferable over an unbiased estimator with large variance: the biased
one will have a larger probability of being close to the true parameter value.
Two popular biased estimators are: Principal component regression and Ridge
regression.



8.3.2 Principal component regression
Principal Component Regression (PCR) combines Principal Component Analy-
sis (PCA) and linear regression. In the first step, the largest principal com-
ponents of the regressors are computed, yielding a new set of uncorrelated
regressors. Secondly, the response variable is regressed onto these principal
components.

This is illustrated in Figure 4.2.



Because principal components are attracted by the variables that have the
largest variance, it is common to start by standardizing the variables via the cor-
relation transformation (2.38) and (2.39). We thus consider the standardized
linear model:

\[ y_i' = \beta_1' x_{i1}' + \ldots + \beta_{p-1}' x_{i,p-1}' + \varepsilon_i' . \]

Denoting Zn,p−1 = Z = X ′ , the least squares estimator then satisfies:

\[ \hat\beta{}' = (\hat\beta_1', \ldots, \hat\beta_{p-1}')^t = (Z^t Z)^{-1} Z^t y' = R_{XX}^{-1} Z^t y' \]

with variance Σ(β̂ ′ ) = (σ ′ )2 (Z t Z)−1 .

From (8.3) it follows that

\[ (Z^t Z)^{-1} = \sum_{j=1}^{p-1} \lambda_j^{-1} v_j v_j^t = P L^{-1} P^t . \qquad (8.4) \]

Here, the loading vectors vj define the principal components Tj = v1j Z1 +
v2j Z2 + . . . + vp−1,j Zp−1 of Z, satisfying the property that Var(Tj ) is maximal
under the constraints that ∥vj ∥ = 1 and cor(Tj , Tl ) = 0 for all j < l.

Relation (8.4) illustrates again that the presence of small eigenvalues yields a
large sampling variability. Hence, to reduce the variance of β̂ ′ , we can decide to
eliminate the eigenvectors for which the corresponding eigenvalue is too small.
If λk+1 , . . . , λp−1 are sufficiently small, this corresponds to setting

\[ (Z^t Z)^{+} = \sum_{j=1}^{k} \lambda_j^{-1} v_j v_j^t \]

and defining

\[ \hat\beta{}^{+} = (Z^t Z)^{+} Z^t y' . \]



Equivalently we obtain β̂ + by first applying a principal component analysis to
the z’s and retaining the first k principal components. They coincide with the
Tj for j = 1, . . . , k. These principal components span a k-dimensional subspace
of Rp−1 with basis vectors v1 , . . . , vk . The coordinates of the observations
projected onto this subspace are given by the scores

\[ t_i = \tilde P^{\,t} z_i \]

with P̃ = P̃p−1,k = (v1 , . . . , vk ), or equivalently Tn,k = Zn,p−1 P̃p−1,k with
Tn,k = (t1 , . . . , tn )t . In the remainder, we drop the subscripts and write T = Z P̃ .

Next, the response variable y ′ is regressed onto the scores. We thus consider
the regression model

\[ y_i' = t_i^t \alpha + \varepsilon_i \]

with least squares estimate α̂ = (T t T )−1 T t y ′ , where we again drop the sub-
scripts and write T = Tn,k . This estimate is not affected by multicollinearity
problems as the scores are uncorrelated! Since T = Z P̃ (where P̃ is short for
P̃p−1,k ), we find using (8.3) that

\[ T^t T = \tilde P^{\,t} Z^t Z \tilde P = \tilde P^{\,t} R_{XX} \tilde P = \tilde P^{\,t} P\, L\, P^t \tilde P = \tilde L \]

with L̃ = L̃k,k the upper left k × k submatrix of L. Thus, α̂ = L̃−1 P̃ t Z t y ′ .

Finally we note that yi′ = tit α + εi = zit P̃ α + εi , so

\[ \hat\beta{}^{+} = \tilde P \hat\alpha = \tilde P\, \tilde L^{-1} \tilde P^{\,t} Z^t y' = (Z^t Z)^{+} Z^t y' . \]



This approach is illustrated in Figure 5.1.
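A minimal sketch of this construction in R for the Body Fat data, assuming the
correlation transformation of (2.38)–(2.39) divides the centred and scaled variables
by √(n − 1), and (arbitrarily) retaining k = 2 components:

n  <- nrow(bodyfat)
Z  <- scale(bodyfat[, 1:3]) / sqrt(n - 1)              # standardized regressors
y0 <- scale(bodyfat$body.fat) / sqrt(n - 1)            # standardized response
k  <- 2
Ptilde  <- eigen(cor(bodyfat[, 1:3]))$vectors[, 1:k]   # loadings v_1, ..., v_k
Tscores <- Z %*% Ptilde                                # scores T = Z Ptilde
alpha   <- solve(t(Tscores) %*% Tscores, t(Tscores) %*% y0)
beta.plus <- Ptilde %*% alpha                          # standardized PCR coefficients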



The PCR estimator β̂ + is biased. Using the orthogonality of the eigenvectors,
it follows that

\[ (Z^t Z)^{+}(Z^t Z)
   = \Big(\sum_{j=1}^{k}\lambda_j^{-1} v_j v_j^t\Big)\Big(\sum_{j=1}^{k}\lambda_j v_j v_j^t + \sum_{j=k+1}^{p-1}\lambda_j v_j v_j^t\Big)
   = \sum_{j=1}^{k} v_j v_j^t
   = I_{p-1} - \sum_{j=k+1}^{p-1} v_j v_j^t \]

and that

\[ (Z^t Z)^{+}(Z^t Z)(Z^t Z)^{+} = (Z^t Z)^{+} . \]

Consequently

\[ E(\hat\beta{}^{+}) = (Z^t Z)^{+} Z^t E(Y')
   = (Z^t Z)^{+} Z^t Z \beta'
   = \beta' - \sum_{j=k+1}^{p-1} v_j v_j^t\, \beta' . \]

On the other hand, the variance of β̂ + has decreased:

\[ \Sigma(\hat\beta{}^{+}) = (Z^t Z)^{+} Z^t \Sigma(Y') Z (Z^t Z)^{+} = \sigma^2 (Z^t Z)^{+} , \]

hence

\[ \mathrm{Var}(\hat\beta_l^{+}) = \sigma^2 \sum_{j=1}^{k} \lambda_j^{-1} v_{lj}^2 \]

whereas

\[ \mathrm{Var}(\hat\beta_l') = \sigma^2 \sum_{j=1}^{p-1} \lambda_j^{-1} v_{lj}^2 . \]

Remarks.

• The PCR method is in particular very useful when n < p. When there are
more variables than observations, there is always perfect multicollinearity
because rank(X) ≤ min(n, p) = n < p.

• Another advantage of PCR is its ease of computation and its transparency.



• A drawback is that the principal components are not always easy to in-
terpret. Moreover they only make sense if all the variables are measured
in the same units.

• PCR selects components which contain most of the variation in the re-
gressors. More sophisticated methods such as Partial Least Squares Re-
gression (PLS) compute components that maximize their covariance with
the response variable, with the goal of retaining components that are more
informative with respect to the regression model.

A very important issue in PCR is the choice of k, the optimal number of principal
components that are retained in the analysis. Some popular strategies are the
following:

• to make a scree plot, which is a graph of the eigenvalues in decreasing


order.

• to select k such that the first k components explain a prescribed percentage
of the total variance of the x-variables; e.g. one could take k as the smallest
integer such that

\[ \Big(\sum_{j=1}^{k} \lambda_j\Big) \Big/ \Big(\sum_{j=1}^{p-1} \lambda_j\Big) > 80\% \]

• to use variable selection techniques as discussed in Chapter 7.

• to compute the RMSEP value on a validation set (or by cross-validation):

\[ RMSEP_k = \sqrt{\frac{1}{m}\sum_{i=1}^{m} (y_i - \hat y_{i,k})^2} \]

with ŷi,k the fitted response value for the ith case based on a PCR regres-
sion with k components, and m the number of observations in the vali-
dation set. The RMSEPk curve for k = 1, . . . , kmax often has the shape
of the upper curve of Figure 4.3 (in Chapter 7, Section 7.1). Its minimal
value then determines the chosen number of components, see Figure 5.2.



Example: Police Height data.
Measurements on height are taken for 33 female police department applicants
together with 9 predictor variables: sitting height, upper arm length, forearm
length, hand length, upper leg length, lower leg length, foot length, brachial
index and Tibio-Femural index.

The smallest two eigenvalues of Z t Z are λ9 = 0.00047 and λ8 = 0.00087 whereas


λ7 = 0.23145, so 7 principal components are retained. This yields the param-
eter estimates and the variance inflation factors of Table 10.3. We see that
the least squares and the PCR estimates differ very little for all the predictor
variables that have a small VIF for least squares. These regressors are not in-
tercorrelated and are only slightly affected by the deletion of v 8 and v 9 . The
correlated variables on the other hand have estimates that are greatly altered
by the elimination of v 8 and v 9 and their VIF values are significantly reduced.
Also the closeness of σ̂ 2 and R2 show that the PCR model is appropriate here.



8.3.3 Ridge regression

Ridge regression is a biased regression method which starts by transforming the


variables by the correlation transformation, yielding the standardized regression
model (2.40), whose least squares solution satisfies

RXX β = rXY .

The ridge standardized regression estimator β ∗ is obtained by adding a
constant c > 0 to the diagonal elements of the correlation matrix of X:

\[ (R_{XX} + c\, I_{p-1})\, \beta^{*} = r_{XY} \qquad (8.5) \]

With c = 0 the ridge and the least squares estimators coincide. When c > 0 the
ridge estimator is biased, but has less variability.
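A minimal sketch of (8.5) in R for the Body Fat data, with the value c = 0.02 that
will be adopted below, and with the back-transformation to the original scale via
β̂j = (sY /sj ) βj∗ (as in the uncorrelated case of Section 8.1.1):

RXX <- cor(bodyfat[, 1:3])
rXY <- cor(bodyfat[, 1:3], bodyfat$body.fat)
c0  <- 0.02
beta.star <- drop(solve(RXX + c0 * diag(3), rXY))     # standardized ridge coefficients
sY <- sd(bodyfat$body.fat); sX <- apply(bodyfat[, 1:3], 2, sd)
beta.ridge <- (sY / sX) * beta.star                   # coefficients on the original scale
b0 <- mean(bodyfat$body.fat) - sum(beta.ridge * colMeans(bodyfat[, 1:3]))   # intercept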

It can be shown that the bias of β ∗ increases with c, whereas the variance
(expressed as the trace of the variance-covariance matrix) decreases with c.
The mean squared error combines the bias and the variance of an estimator.
For an estimator of a univariate parameter β:

MSE(β̂) = E[(β̂ − β)2 ] = (E[β̂] − β)2 + E[(β̂ − E(β̂))2 ]


= bias(β̂)2 + Var(β̂).

For a (p − 1)-dimensional estimator of β, the total mean squared error can be
defined as

\[ TMSE(\hat\beta) = \sum_{j=1}^{p-1} E\big[(\hat\beta_j - \beta_j)^2\big] = \sum_{j=1}^{p-1} \big[\mathrm{bias}(\hat\beta_j)^2 + \mathrm{Var}(\hat\beta_j)\big] . \]

It has been shown that for any data set there always exists a value of c such that
the ridge estimator β ∗ has a smaller TMSE than the least squares estimator β̂ LS .

To determine the constant c we will consider the ridge trace method, and the
variance inflation factors. The ridge trace plots the evolution of the ridge stan-
dardized regression coefficients βj∗ for different values of c, usually between 0
and 1.

The VIF values for ridge regression are defined as for OLS: they measure for
each coefficient how large the variance of β̂j∗ is relative to what the variance
would be if the predictors were uncorrelated. It can be shown that VIFj for
ridge regression equals the jth diagonal element of the matrix

(RXX + cI)−1 RXX (RXX + cI)−1 .

In the Body fat example (Table 10.3) we see that the VIF’s decrease rapidly as
c changes from 0 towards 1. The constant c is then chosen as the smallest value
where the plot and the VIF’s become stable. Here, it was decided to employ

c = 0.02 since the VIF values are then close to 1 and the regression coefficients
are quite stable. The resulting fitted model for c = 0.02 is:

\[ \hat Y' = 0.5463\, X_1' + 0.3774\, X_2' - 0.1369\, X_3' \]

or in terms of the original variables

Bodyfat = −7.4034 + 0.5554 triceps + 0.3681 thigh − 0.1916 midarm

Also notice that the R2 value only decreased slightly: from 0.8014 to 0.7818.
Since the total sum of squares for the transformed variables equals

\[ SST = \sum_{i=1}^{n} (y_i' - \bar y')^2 = 1 , \]

the coefficient of multiple determination for ridge regression equals

\[ R^2 = 1 - SSE = 1 - \sum_{i=1}^{n} (y_i' - \hat y_i')^2 . \]

Chapter 9

Influential observations and


outliers

Real data sets often contain outlying observations. Although a precise definition
of outliers is hard to give, they are characterized as the observations that do not
follow the pattern of the majority of the data. In regression, data points can be
split into 4 types:

1. regular observations with internal xi and well-fitting yi

2. vertical outliers, with internal xi and non-fitting yi

3. good leverage points, with outlying xi and well-fitting yi

4. bad leverage points, with outlying xi and non-fitting yi

For simple regression, these different types of observations are illustrated in Fig-
ure 9.1. It is well-known that the least-squares estimator β̂ LS is very sensitive
to vertical outliers and bad leverage points.

Figure 9.1: Different types of outliers in regression.

9.1 Vertical outliers


We consider the Telephone data set, which contains the number of international
telephone calls (in millions) from Belgium in the years 1950-1973.

library(MASS)
phones

$year
[1] 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69
[21] 70 71 72 73

$calls
[1] 4.4 4.7 4.7 5.9 6.6 7.3 8.1 8.8 10.6 12.0
[11] 13.5 14.9 16.1 21.2 119.0 124.0 142.0 159.0 182.0 212.0
[21] 43.0 24.0 27.0 29.0

This data set contains six remarkable vertical outliers. It turned out that from
1964 to 1969 another recording system was used, giving the total number of
minutes of these calls. The LS fit has clearly been affected by the outlying y-
values, as shown in Figure 9.2. The robust LTS method, which will be defined
in Section 9.4.1, avoids the outliers and nicely fits the linear model of the
majority of the data.

attach(phones)
plot(year,calls)
phones.lm <- lm(calls ~ year)
abline(phones.lm)
text(70,100,"LS")
library(robustbase)
phones.wlts <- ltsReg(calls~year,alpha=0.75)
abline(phones.wlts,lty=2)
text(67,30,"LTS")

To detect vertical outliers, we consider the standardized robust residuals,
defined as

\[ e_{i,R}^{(s)} = \frac{y_i - \hat y_{i,R}}{s_R} \qquad (9.1) \]

with ŷi,R the fitted values obtained by applying a robust regression method,
and sR a robust measure of scale. If the majority of the data points follows
the general linear model with normal errors, these standardized robust residu-
als approximately lie in [-2,2] with a confidence of 95% and in [-2.5,2.5] with
a confidence of 99%. The robust LTS method nicely detects the outliers, as
shown on the residual plot in Figure 9.3. On the other hand, none of the obser-
vations seems outlying if we look at the plot of the standardized LS residuals
(Figure 9.4).

Figure 9.2: Telephone data set with LS and LTS fit superimposed.
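As a sketch, the standardized LTS residuals shown in Figure 9.3 can be reproduced
from the components of the ltsReg fit (assuming the residuals and scale components
of that object):

stdres.lts <- phones.wlts$residuals / phones.wlts$scale   # standardized robust residuals (9.1)
plot(stdres.lts, ylab = "Standardized LTS residual")
abline(h = c(-2.5, 2.5))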



Figure 9.3: Telephone data set: Index plot of the standardized robust residuals.

Figure 9.4: Telephone data set: Index plot of the standardized least squares
residuals.



The LS residuals do not detect the vertical outliers because the LS fit itself is
attracted to those outliers as it tries to make the (squared) residuals of all the
cases as small as possible. A classical approach to find the vertical outliers,
thus based on β̂ LS , consists of computing the deleted residual, introduced in
Chapter 7, equation (7.3):

\[ d_i = y_i - \hat y_{i(i)} = \frac{e_i}{1 - h_{ii}} \]

with ŷi(i) the fitted value of case i, excluded from the data set to estimate the
regression coefficients. It can be shown that

\[ s(d_i) = \frac{s_{(i)}}{\sqrt{1 - h_{ii}}} \]

and that

\[ e_i^{*} = \frac{d_i}{s(d_i)} = \frac{e_i}{s_{(i)}\sqrt{1 - h_{ii}}} \;\sim\; t_{n-p-1} . \qquad (9.2) \]

Hence, the e∗i are called the studentized residuals. They can be computed
without refitting the model each time an observation is deleted, by using the
relation

\[ e_i^{*} = e_i \sqrt{\frac{n - p - 1}{SSE\,(1 - h_{ii}) - e_i^2}} . \]
Here, also a plot of the studentized residuals of the Telephone data (Figure 9.5),
does not pinpoint the outliers. This is due to the fact that the outliers are not
isolated here. Deleting one of the outliers does not change the fit drastically!

phones.studres <- studres(phones.lm)

or alternatively

phones.lmi <- lm.influence(phones.lm)


si <- phones.lmi$sigma
h <- phones.lmi$hat
phones.studres <- residuals(phones.lm)/(si*(1-h)^0.5)

plot(phones.studres,ylim=c(-3,3))
abline(h=c(-2.5,2.5))



Figure 9.5: Telephone data set: Index plot of the studentized residuals.

9.2 Leverage points

9.2.1 Residuals
Bad leverage points, which are outlying observations in the predictor space that
do not follow the linear model of the majority of the data points, also have a
large influence on the classical LS estimator. Let us illustrate this effect on
the stars data set. These data form the Hertzsprung-Russell diagram of the
star cluster CYG OB1, which contains 47 stars in the direction of Cygnus. The
regressor X is the logarithm of the effective temperature at the surface of the
star, and the response Y is the logarithm of its light intensity.



In the plot of the data in Figure 9.6 we see two groups of observations: the
majority, following a steep band, and four stars in the upper left corner (with
indices 11, 20, 30 and 34). The 43 ’regular’ observations lie on the main se-
quence, whereas the four outlying data points are giant stars. The LS fit is
again highly attracted by the giant stars and does not at all reflect the linear
trend of the majority of the data points, in contrast to the robust LTS fit.

If we plot the studentized LS residuals (Figure 9.7) we cannot detect any devi-
ating observation, but the four outliers stand out in the plot of the standardized
LTS residuals (Figure 9.8).



Figure 9.6: Stars data set with LS and LTS fit superimposed.
Figure 9.7: Stars data set: Index plot of studentized LS residuals.



Figure 9.8: Stars data set: Index plot of standardized LTS residuals.

9.2.2 Diagnostic plot


In this example, the outliers are bad leverage points, hence they can be detected
based on their (large) robust residuals. This residual plot however cannot dis-
tinguish between bad leverage points and vertical outliers. Also good leverage
points will not be highlighted on a residual plot, as they have a small residual.

Therefore we will need a metric within the X-space to compute the distance of
each observation to the center of the data cloud. For (p − 1)-dimensional vectors
xi = (xi1 , . . . , xi,p−1 )t the classical Mahalanobis distance is defined as:

\[ MD(x_i) = \sqrt{(x_i - \bar x)^t S^{-1} (x_i - \bar x)} \qquad (9.3) \]

with

\[ \bar x = \frac{1}{n}\sum_{i=1}^{n} x_i \]

the sample mean, and

\[ S = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar x)(x_i - \bar x)^t \]

the empirical covariance matrix of the xi . Both the sample mean and the sample
covariance matrix are however non-robust: the mean will be shifted towards the

outliers whereas the covariance matrix will be inflated by them. Consider e.g.
Figure 9.9, which contains the logarithms of the body weight (in kilograms) and
the brain weight (in grams) of 28 animals.

data(Animals)
Animals

body brain
Mountain beaver 1.350 8.1
Cow 465.000 423.0
Grey wolf 36.330 119.5
Goat 27.660 115.0
Guinea pig 1.040 5.5
Dipliodocus 11700.000 50.0
Asian elephant 2547.000 4603.0
Donkey 187.100 419.0
Horse 521.000 655.0
Potar monkey 10.000 115.0
Cat 3.300 25.6
Giraffe 529.000 680.0
Gorilla 207.000 406.0
Human 62.000 1320.0
African elephant 6654.000 5712.0
Triceratops 9400.000 70.0
Rhesus monkey 6.800 179.0
Kangaroo 35.000 56.0
Golden hamster 0.120 1.0
Mouse 0.023 0.4
Rabbit 2.500 12.1
Sheep 55.500 175.0
Jaguar 100.000 157.0
Chimpanzee 52.160 440.0
Rat 0.280 1.9
Brachiosaurus 87000.000 154.5
Mole 0.122 3.0
Pig 192.000 180.0

Three animals are clearly outlying: these are dinosaurs, with a small brain as
compared with a heavy body. We see that the classical mean, indicated by a plus
sign, is shifted towards the outliers. The covariance matrix can be visualized

Figure 9.9: Body and brain weight for 28 animals with classical and robust
tolerance ellipse superimposed.

through the classical tolerance ellipsoid, defined by

\[ \{\, x \;;\; MD(x) = \sqrt{\chi^2_{p-1,0.025}} \,\} . \]

Under a (p − 1)-variate normal distribution this ellipsoid should contain approxi-
mately 97.5% of the data points, since the squared Mahalanobis distances are
then χ2p−1 distributed. We see that this ellipsoid is highly attracted to the out-
liers, and tries to engulf them. On the other hand, the robust tolerance ellipsoid,
defined as

\[ \{\, x \;;\; RD(x) = \sqrt{\chi^2_{p-1,0.025}} \,\} , \]

is much smaller and essentially contains the majority of the data points. Here,
the robust distance is defined analogously to the Mahalanobis distance:

\[ RD(x_i) = \sqrt{(x_i - \hat\mu_R)^t\, \hat\Sigma_R^{-1}\, (x_i - \hat\mu_R)} \qquad (9.4) \]

where µ̂R and Σ̂R are robust estimates of the center µ and shape Σ of the
x-part of the data points. In Section 9.5 we will discuss the MCD estimator as
a highly robust estimator of location and shape.
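A minimal sketch for the Animals data, using the MCD estimator from the
robustbase package (loaded earlier) to compute the robust distances and the
distance-distance plot of Figure 9.10:

X   <- log(as.matrix(Animals))                      # log body and brain weight
mcd <- covMcd(X)
md  <- sqrt(mahalanobis(X, colMeans(X), cov(X)))    # classical Mahalanobis distances
rd  <- sqrt(mahalanobis(X, mcd$center, mcd$cov))    # robust distances (9.4)
cutoff <- sqrt(qchisq(0.975, df = ncol(X)))         # cut-off 2.72 for p - 1 = 2
plot(md, rd, xlab = "Mahalanobis distance", ylab = "Robust distance")
abline(h = cutoff, v = cutoff)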



If we compare the Mahalanobis distances of the Animals data set with the
robust distances as in Figure 9.10, we see that all but one of the Mahalanobis
distances are smaller than \( \sqrt{\chi^2_{2,0.025}} = 2.72 \), whereas the robust distances of the
dinosaurs are much larger than this cut-off value.

Figure 9.10: Animals data set: Robust distances versus Mahalanobis distances.

Leverage points will thus be characterized as having a large robust distance.


If we now plot for each observation its standardized robust residual versus its
robust distance, we obtain the diagnostic plot, on which the four types of
observations can be distinguished as in Figure 9.11.

Figure 9.11: Diagnostic plot with 4 types of observations.

For the stars data set, this yields Figure 9.12, on which we clearly see the giant
stars and star 7 as bad leverage points. Star 14 is a good leverage point, whereas
star 9 is found to be a vertical outlier.

Figure 9.12: Stars data set: diagnostic plot.

9.2.3 The Hat matrix


Classical diagnostics to detect leverage points are based on the hat matrix

H = X(X t X)−1 X t

as defined in (2.13) which transforms the observed response vector y into its LS
estimate
ŷ = Hy

or equivalently
ŷi = hi1 y1 + hi2 y2 + . . . + hin yn .

(Note that the X matrix here includes a constant column of ones for the intercept
term.) The element hij of H thus measures the effect of the jth observation
on ŷi , and the diagonal element hii the effect of the ith observation on its own
prediction. A diagonal element hii = 0 indicates a point with no influence on
the fit. Since

tr(H) = tr(X(X t X)−1 X t ) = tr(X t X(X t X)−1 ) = tr(Ip ) = p

we have

\sum_{i=1}^{n} h_{ii} = p

and consequently

\bar{h}_{ii} = p/n.

Moreover, since H is symmetric (H^t = H) and idempotent (HH = H), we see that

h_{ii} = (HH)_{ii} = \sum_{j=1}^{n} h_{ij} h_{ji} = h_{ii} h_{ii} + \sum_{j \neq i} h_{ij} h_{ji} = h_{ii}^2 + \sum_{j \neq i} h_{ij}^2

and thus 0 \leq h_{ii} and h_{ii} \geq h_{ii}^2 for all i = 1, . . . , n. Finally, this implies

0 \leq h_{ii} \leq 1.

These limits do not yet tell us when h_{ii} is large. Some authors suggest using

h_{ii} > 2p/n

as cut-off value. Note that, when h_{ii} = 1, h_{ij} = 0 for all j \neq i, and consequently \hat{y}_i = y_i and e_i = y_i - \hat{y}_i = 0. The ith observation is thus so influential that the LS fit passes through it. Moreover, the variance of the ith residual is then zero:

s^2(e_i) = \mathrm{MSE}(1 - h_{ii}) = 0.

It can also be shown that there is a one-to-one correspondence between the squared Mahalanobis distance (9.3) for object i and its h_{ii}:

h_{ii} = \frac{\mathrm{MD}_i^2}{n-1} + \frac{1}{n}. \qquad (9.5)

From this expression, we see that h_{ii} also measures the distance of x_i to the center of the data points in the X-space. On the other hand, equation (9.5) shows that the hat diagnostic is not robust!
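In R, the leverages h_{ii} of a fitted linear model are returned by hatvalues, and relation (9.5) can be checked numerically. A small sketch, where stars.lm denotes an LS fit for the Stars data (the object name is illustrative):

h <- hatvalues(stars.lm)                         # diagonal elements of the hat matrix
x <- model.matrix(stars.lm)[, -1, drop = FALSE]  # regressors, without the intercept column
md2 <- mahalanobis(x, colMeans(x), cov(x))       # squared Mahalanobis distances (9.3)
n <- length(h)

all.equal(unname(h), unname(md2)/(n - 1) + 1/n)  # verifies relation (9.5)
which(h > 2 * length(coef(stars.lm))/n)          # points exceeding the 2p/n cut-off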

Example 1: the Telephone data set


Table 2 lists some diagnostics for the Telephone data set. We see that none of the observations has a leverage larger than 2p/n = 0.167 or a squared Mahalanobis distance exceeding \chi^2_{1,0.05} = 3.84. This is not surprising, as

the only regressor in this data set, Year, does not contain any outlying value. Note that the legend of this table uses a different terminology than the one used in this book: the standardized residuals e_i^{(s)}, defined by (3.9), are here denoted as 'studentized residuals' t_i, and the studentized residuals e_i^*, introduced in (9.2), as 'jackknifed residuals' t_{(i)}. The standardized residuals of Table 1 are obtained as e_i/s and can also be compared to the cut-off value 2.5.

Example 2: the Stars data set
Table 1 lists the same diagnostics for the Stars data set. We see that both the
diagonal elements of the hat matrix and the Mahalanobis distances of the giant
stars exceed their cut-off value. So in this example these diagnostics are able to
identify the most extreme outliers in x. But the bad leverage point 7 and the good leverage point 14 are not recognized.

Example 3: the Hawkins-Bradu-Kass data set
This artificial data set contains 75 observations in four dimensions and is listed
in Table 9. The first 10 observations are bad leverage points, and the next four
points are good leverage points.

If we now take a look at the hii and the MD(xi ) values for these 14 leverage
points, we see that the classical diagnostics fail completely. The LS standardized
and studentized residuals are large for the good leverage points, but not for
the bad leverage points. The LS fit is thus tilted towards these bad leverage
points, and makes them appear as regular observations. Even the diagnostics in the predictor space cannot identify the first 10 observations. The good leverage points, on the other hand, are wrongly flagged as bad leverage points.

9.3 Single-case diagnostics
After the outlying observations in X- or Y-space are identified, classical diagnostics proceed by assessing the influence of these outlying cases on the regression
fit. There exist several single-case diagnostics that are based on the omission
of a single case to measure its influence. Since all these diagnostics are char-
acterized by some function of the LS residuals, or the diagonal elements of the
hat matrix, they will however not be able to identify the true influential data
points when the outliers in the data set are not isolated.

9.3.1 DFFITS

A measure of the influence that case i has on the fitted value ŷi is

\mathrm{DFFITS}_i = \frac{\hat{y}_i - \hat{y}_{i(i)}}{\sqrt{s^2_{(i)} h_{ii}}} \qquad (9.6)

Since \hat{y}_i = x_i^t \hat{\beta}, the variance of the fitted value equals

\mathrm{Var}(\hat{y}_i) = \sigma^2 x_i^t (X^t X)^{-1} x_i = \sigma^2 h_{ii}

and is usually estimated by

s^2(\hat{y}_i) = s^2 h_{ii}.

In (9.6) the denominator is the estimated standard deviation of ŷi but the un-
known error variance is now estimated by the MSE obtained by omitting the ith
case from the data set. The DFFITS value thus measures the (standardized)
effect on the prediction when an observation is deleted.

Like other single-case diagnostics, e.g. the deleted residual (7.3), the DFFITS values can be computed from the results of fitting the entire data set:

\mathrm{DFFITS}_i = e_i^* \sqrt{\frac{h_{ii}}{1 - h_{ii}}}.

The DFFITSi value thus depends on the size of the studentized residual e∗i and
the leverage value hii , and will be large if either e∗i is large, or hii is large or
they are both large. A case is considered to be influential if
|\mathrm{DFFITS}_i| > 2\sqrt{p/n}.

9.3.2 Cook’s distance
Cook's distance D_i measures the influence of the ith case on all n fitted values. Let \hat{y}_{j(i)} = x_j^t \hat{\beta}_{(i)}; then it is defined as

D_i = \frac{\sum_{j=1}^{n} (\hat{y}_j - \hat{y}_{j(i)})^2}{p s^2} \qquad (9.7)

= \frac{e_i^2 h_{ii}}{p s^2 (1 - h_{ii})^2} \qquad (9.8)

= (e_i^{(s)})^2 \, \frac{h_{ii}}{1 - h_{ii}} \, \frac{1}{p} \qquad (9.9)
and thus Di is essentially the same as the square of the DFFITSi .

In matrix notation, it can be written as

Di = (ŷ − ŷ(i) )t (ŷ − ŷ(i) )/(ps2 )

so Di is the squared distance between ŷ and ŷ(i) , divided by ps2 . Since (ŷ −
ŷ(i) ) = X(β̂ − β̂ (i) ) it is also equivalent to

Di = (β̂ − β̂ (i) )t (X t X)(β̂ − β̂ (i) )/(ps2 ).

Hence Di also measures the influence of the ith case on the regression coefficients.

There is no formal test to decide when Di is large. We can simply compare


the sizes of the large Di with the base level indicated by the majority of the
distances. Some authors suggest declaring a data point influential if

Di > 1.

9.3.3 DFBETAS
The DFBETAS measure quantifies, for each case i, its influence on each regression coefficient \hat{\beta}_j:

\mathrm{DFBETAS}_{ij} = \frac{\hat{\beta}_j - \hat{\beta}_{j(i)}}{\sqrt{s^2_{(i)} (X^t X)^{-1}_{jj}}} \qquad (9.10)

for j = 1, . . . , p − 1. The denominator is an estimate of the standard error of \hat{\beta}_j, because from (2.23) it follows that

s(\hat{\beta}_j) = \sqrt{\sigma^2 (X^t X)^{-1}_{jj}}.

A guideline to identify influential cases is

|\mathrm{DFBETAS}_{ij}| > 2/\sqrt{n}.
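All three single-case diagnostics are also available as built-in R functions. A minimal sketch, where fit denotes any fitted lm object (for instance the phones.lm fit used in the next section):

n <- nobs(fit)
p <- length(coef(fit))

dff <- dffits(fit)
which(abs(dff) > 2 * sqrt(p/n))              # cases exceeding the DFFITS cut-off

cd <- cooks.distance(fit)
which(cd > 1)                                # cases exceeding the Cook's distance cut-off

dfb <- dfbetas(fit)                          # one column per regression coefficient
which(abs(dfb) > 2/sqrt(n), arr.ind = TRUE)  # cases exceeding the DFBETAS cut-off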

9.3.4 Examples
Let us look at the values of the single-case diagnostics DFFITS_i, D_i and DFBETAS_{ij} for the three data sets considered before. Table 5 lists the outlier diagnostics
for the Telephone data. None of the diagnostics identifies the vertical outliers
15-19, and only the DFFITS and DFBETAS values for observation 20 exceed
the corresponding cut-off values. Moreover, two regular data points (23 and 24) are also declared influential!

Note that these diagnostics can be computed in R with the following commands:

library(MASS)                            # provides stdres()
e <- residuals(phones.lm)
h <- hatvalues(phones.lm)                # leverages h_ii
si <- lm.influence(phones.lm)$sigma      # leave-one-out scale estimates s_(i)
phones.dfbetas <- dfbetas(phones.lm)
phones.dffits <- h^0.5*e/(si*(1-h))      # DFFITS, cf. (9.6)
p <- phones.lm$rank
phones.stres <- stdres(phones.lm)        # standardized residuals
phones.cd <- (1/p * phones.stres^2 * h)/(1 - h)   # Cook's distance, cf. (9.9)

Next, we consider the outlier diagnostics for the Stars data set, listed in Table
4. We see that the giant stars 11, 20, 30 and 34 are not noticed by Cook’s
distance Di , but DFFITS and DFBETAS are more powerful.

Finally, we look at the diagnostics for the Hawkins-Bradu-Kass data. Again we
see that none of the classical diagnostics succeed in separating the ‘bad’ points
from the good ones. The robust diagnostic RDi on the other hand (do not
confuse the last column in the tables with the robust distance (9.4)!) which is
based on robust residuals finds all the outlying data points. Its definition will
not be given here, since the good and the bad leverage points can be detected
by means of the diagnostic plot.

9.4 The LTS estimator

9.4.1 Parameter estimates


The Least Trimmed Squares estimator is a highly robust regression estima-
tor. It is defined as

\hat{\beta}_{LTS} = \mathrm{argmin}_{\hat{\beta}} \sum_{i=1}^{h} \big(e^2(\hat{\beta})\big)_{i:n} \qquad (9.11)

with h an integer between [n + p + 1]/2 and n, and (e^2)_{i:n} the ith smallest squared residual. For any candidate \hat{\beta} we thus rank the squared residuals from smallest to largest, (e^2(\hat{\beta}))_{1:n} \leq (e^2(\hat{\beta}))_{2:n} \leq \ldots \leq (e^2(\hat{\beta}))_{n:n}, and compute the sum of the h smallest squared residuals. The LTS fit then corresponds to the \hat{\beta} which yields the smallest sum. The LTS estimator does not try to make all the residuals as small as possible, but only the 'majority', where the 'majority' corresponds to a fraction h/n of the observations.

The robustness of an estimator can be measured by its breakdown value which


says how many of the n observations need to be replaced before the estimate
is carried away. Formally, the finite-sample breakdown value of any regression
estimator T (Z) = T (X, y) is given by

\varepsilon_n^* = \varepsilon_n^*(T, Z) = \min_m \left\{ \frac{m}{n} \; : \; \sup_{Z'} \|T(Z')\| = \infty \right\}

where Z' = (X', y') ranges over all data sets obtained by replacing any m observations of Z = (X, y) by arbitrary points.

The breakdown value of the LTS estimator satisfies:

ε∗n = (n − h + 1)/n

and is maximal for h = [(n + p + 1)/2]. Roughly speaking, the maximal break-
down value of 50% is obtained for h ≈ n/2. If we choose h = 0.75n, the
breakdown value is approximately 25% etc. If h gets closer to n, the LTS es-
timator approaches the LS estimator. The larger we choose h the better the
finite-sample efficiency of the LTS estimator will be, but the lower its resistance
towards outliers!

With the parameter estimates \hat{\beta}_{LTS} we can associate an estimator of the error scale \sigma:

s_{LTS} = d_{h,n} \sqrt{ \frac{1}{h} \sum_{i=1}^{h} \big(e^2(\hat{\beta}_{LTS})\big)_{i:n} }.

The constant d_{h,n} is chosen to make the scale estimator consistent at the Gaussian model, which gives

c_{h,n} = 1 \big/ \Phi^{-1}\!\left( \frac{h+n}{2n} \right), \qquad d_{h,n} = 1 \Big/ \sqrt{ 1 - \frac{2n}{h\, c_{h,n}}\, \varphi(1/c_{h,n}) }.

9.4.2 Computation
Contrary to the LS estimator, the objective function of the LTS estimator
\sum_{i=1}^{h} \big(e^2(\hat{\beta})\big)_{i:n} \qquad (9.12)

is not convex and has many local minima. Therefore, one has to rely on ap-
proximate algorithms to compute the LTS estimator. Several approaches exist
which differ in speed and/or accuracy.

The p-subset algorithm, such as PROGRESS, starts by drawing a random


subset of p observations out of the n data points. Then, the hyperplane \hat{\beta} through these p data points is computed and the objective function (9.12) is evaluated at \hat{\beta}. By drawing many random p-subsets (500-3000, depending on n and p) we obtain many candidate fits, from which the one with the smallest objective function is selected as \hat{\beta}_{LTS}. The FAST-LTS algorithm also starts with
random subsets but it then uses more advanced steps that decrease the objective
function.

9.4.3 Reweighted LTS


Although the LTS estimator is asymptotically normal, its asymptotic and finite-
sample efficiency is not very high. This implies that its variance at uncontami-
nated data is much larger than the variance of β̂ LS . To improve the efficiency of
LTS, we can apply a reweighted procedure. Another advantage of the reweighted
LTS is that it yields inferential information such as standard errors of the esti-
mates, t and p-values and so on, which can be used for model improvement.

First, the standardized residuals

e_i(\hat{\beta}_{LTS})/s_{LTS}

are computed and a weight function is applied to them. An example of a weight function is

w_i = \begin{cases} 1 & \text{if } |e_i(\hat{\beta}_{LTS})/s_{LTS}| < 2.5 \\ 0 & \text{otherwise.} \end{cases}

This is called hard rejection and produces a clear distinction between accepted and rejected points. Next, a weighted LS fit is computed, which is equivalent to applying OLS to the transformed observations (\sqrt{w_i}\, x_i, \sqrt{w_i}\, y_i) as discussed in Chapter 6, Section 6.4.3. If we denote the resulting parameter estimates as \hat{\beta}_{RLTS}, we can again compute the corresponding residuals e_i(\hat{\beta}_{RLTS}) and the scale estimate

s_{RLTS} = \sqrt{ \frac{\sum_i w_i\, e_i^2(\hat{\beta}_{RLTS})}{\sum_i w_i - p} }.

In the R package robustbase the default output is actually the reweighted LTS.
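For instance, a (reweighted) LTS fit for the Stars data can be obtained with the ltsReg function. A sketch, assuming the data are available as the data frame starsCYG of robustbase, with variables log.Te and log.light:

library(robustbase)
data(starsCYG)

# alpha is roughly h/n: alpha = 0.5 gives the maximal breakdown value,
# alpha = 0.75 a breakdown value of about 25%
stars.lts <- ltsReg(log.light ~ log.Te, data = starsCYG, alpha = 0.5)
summary(stars.lts)     # coefficients and scale estimate of the reweighted LTS
plot(stars.lts)        # diagnostic plots, including the one of Figure 9.12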

9.5 The MCD estimator

9.5.1 Parameter estimates

The Minimum Covariance Determinant estimator is a robust estimator for


the center µ and the shape Σ of a multivariate data set. In the regression con-
text, it will be applied to the set of predictor variables X1 , . . . , Xp−1 to detect
good and bad leverage points. In this section we therefore assume that our data
points are (p − 1)-dimensional.

The MCD estimator is defined by:

• find the h observations out of n whose classical covariance matrix has the
lowest determinant

• then, µ̂0 is the average of those h observations, and Σ̂0 is the covariance
matrix of those h observations (multiplied by a consistency factor)

with [n + p]/2 \leq h \leq n. This definition is inspired by the following relation: let


x̄h denote the mean and Sh the covariance matrix of h observations. Consider
now the tolerance ellipsoid

\{x \; ; \; (x - \bar{x}_h)^t S_h^{-1} (x - \bar{x}_h) \leq c^2\}

for some constant c. Then it can be shown that the volume of this ellipsoid is
proportional to the square root of the determinant of Sh .

The breakdown value of the MCD estimator is \varepsilon_n^* = (n - h + 1)/n. For any scatter matrix \hat{\Sigma}, breakdown means that the largest eigenvalue becomes arbitrarily large or that the smallest eigenvalue becomes zero:

\varepsilon_n^*(\hat{\Sigma}, X) = \min_m \left\{ \frac{m}{n} \; : \; \sup_{X'} \frac{\lambda_{\max}(X')}{\lambda_{\min}(X')} = \infty \right\}

with X' obtained by replacing m points of X. This implies that either the
tolerance ellipsoid explodes (i.e. becomes unbounded) or that it implodes (i.e.
is flattened to a lower dimension and deflated to a zero volume). Remember
that the determinant of a square matrix is equal to the product of its eigenvalues.

9.5.2 Computation
The computation of the MCD estimator is very difficult. Exhaustive search
over all h-subsets is usually too time-consuming. Again a p-subset approach
can be followed.

It starts by drawing p random points out of the n, and computing their mean m_0 and covariance matrix C_0. Since the observations are (p − 1)-dimensional, this covariance matrix will be non-singular unless the p observations lie on a hyperplane of R^{p-1}. We then select the h observations with the smallest distance with respect to m_0 and C_0. Finally, the mean m_1 and covariance matrix C_1 of these h data points are computed and the determinant of C_1 is evaluated.

Improvements to this elemental approach are incorporated in the FAST-MCD al-


gorithm, which can be used in R with the covMcd function in package robustbase.

9.5.3 Reweighted MCD-estimator


The one-step reweighted MCD-estimator is defined analogously to the reweighted
LTS estimator. Based on µ̂0 and Σ̂0 we compute the robust distances as in (9.4):
\mathrm{RD}(x_i) = \sqrt{ (x_i - \hat{\mu}_0)^t \hat{\Sigma}_0^{-1} (x_i - \hat{\mu}_0) }

and assign a weight to each observation:

w_i = \begin{cases} 1 & \text{if } \mathrm{RD}(x_i) \leq \sqrt{\chi^2_{p-1,0.025}} \\ 0 & \text{otherwise.} \end{cases}

Finally, the weighted mean and weighted covariance matrix are obtained:

\hat{\mu}_1 = \left( \sum_{i=1}^{n} w_i x_i \right) \Big/ \left( \sum_{i=1}^{n} w_i \right)

\hat{\Sigma}_1 = \left( \sum_{i=1}^{n} w_i (x_i - \hat{\mu}_1)(x_i - \hat{\mu}_1)^t \right) \Big/ \left( \sum_{i=1}^{n} w_i - 1 \right)

These reweighted estimates attain a higher finite-sample efficiency than the


raw MCD estimates, but retain the same breakdown value and are the default
output of the covMcd function in R package robustbase.
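The raw and reweighted MCD estimates can be inspected as follows (a sketch, again for the log-transformed Animals data, assuming MASS and robustbase are loaded as in the earlier sketch):

mcd <- covMcd(log(Animals), alpha = 0.5)   # alpha is roughly h/n

mcd$raw.center; mcd$raw.cov                # raw MCD estimates from the optimal h-subset
mcd$center;     mcd$cov                    # one-step reweighted estimates (default output)
mcd$best                                   # indices of the optimal h-subset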

9.6 A robust R-squared
The classical coefficient of determination (2.28)

R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}

can also be written as

R^2_{ML} = 1 - \frac{\hat{\sigma}^2_{ML}(X, y)}{\hat{\sigma}^2_{ML}(1, y)}

with \hat{\sigma}^2_{ML} the maximum likelihood estimator for \sigma^2 (under normal errors), given by \hat{\sigma}^2_{ML} = \frac{1}{n} \sum e_i^2.
Remember that the mean of the y_i corresponds with the univariate least squares estimator:

\bar{y} = \mathrm{argmin}_{\hat{\mu}} \sum_{i=1}^{n} (y_i - \hat{\mu})^2.

Therefore a robust R^2 can be defined analogously as

R^2_{LTS} = 1 - \frac{s^2_{LTS}(X, y)}{s^2_{LTS}(1, y)}.

The denominator is equal to the squared univariate LTS or MCD scale estimator.
It is defined as the variance of the h-subset with the smallest variance, and can be
computed by an explicit algorithm of O(n log n). For this we first have to sort
the (univariate) observations and compute the variance of h successive points.
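A robust R^2 of this kind can be computed in R, for instance for the Stars data. The sketch below assumes robustbase is loaded and the starsCYG data are available as before; the univariate MCD scale of the response is used for the denominator:

fit <- ltsReg(log.light ~ log.Te, data = starsCYG)         # LTS regression fit
s2.num <- fit$scale^2                                      # squared LTS scale of the residuals
s2.den <- covMcd(as.matrix(starsCYG$log.light))$cov[1, 1]  # squared univariate MCD scale of y
1 - s2.num/s2.den                                          # robust R-squared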

9.7 Model selection
All the variable selection methods discussed in Chapter 7 are based on LS esti-
mates and hence they are very sensitive to outliers! Simple robust alternatives
are e.g. based on the robust R2 value.

One should however be very cautious when outliers are detected. They are found
not to satisfy the linear model that is followed by the majority of the data points.
It is very important to investigate the reason why they differ: it can be due to
the fact that they indeed belong to another population and hence satisfy another
relation. But they can also point us to a model-misspecification. The inclusion
of a quadratic term or a transformation of a variable can e.g. accommodate the
outliers. Knowledge about the problem at hand is thus indispensable for model
building and outlier detection!!

Chapter 10

Nonlinear regression

10.1 The nonlinear regression model


The general linear model (2.1)

yi = β0 + β1 xi1 + β2 xi2 + . . . + βp−1 xi,p−1 + εi

can be expressed as
yi = f (β, xi ) + εi (10.1)

with f (β, xi ) = E(yi ) = xti β a function that is linear in the regression parame-
ters β = (β0 , β1 , . . . , βp−1 )t . This general model includes polynomial regression
models, models with interaction terms, binary variables, and transformed vari-
ables as discussed in Chapters 4-6.

A nonlinear regression model is of the same form as (10.1):

yi = f (β, xi ) + εi (10.2)

but with a mean response function which is not linear in the unknown parame-
ters β = (β0 , β1 , . . . , βp−1 )t . The error terms εi are usually assumed to satisfy
the Gauss-Markov conditions (2.2)-(2.4) as for linear regression. Their expec-
tation is thus zero, they have constant variance and they are uncorrelated. In
matrix notation this nonlinear model is written as:

y = f (β, X) + ε

with, for normally distributed errors,

ε ∼ Nn (0, σ 2 In ). (10.3)

Examples.
Exponential regression models.

y_i = \beta_0 e^{\beta_1 x_i} + \varepsilon_i \qquad (10.4)

y_i = \beta_0 + \beta_1 e^{\beta_2 x_i} + \varepsilon_i. \qquad (10.5)

Typical examples of (10.5) are growth curves where yi represents the length of
individuals at time xi = ti . Then, if β2 < 0, β0 is the maximum length, β0 + β1
the length at time 0 (thus \beta_1 < 0), whereas \beta_2 expresses the proportionality at time t of the rate of growth y'(t) = \beta_1 \beta_2 e^{\beta_2 t} to the remaining amount of growth \beta_0 - y(t) = -\beta_1 e^{\beta_2 t}.

Logistic regression models.

y_i = \frac{\beta_0}{1 + \beta_1 e^{\beta_2 x_i}} + \varepsilon_i.

This model is popular in population studies with y_i the population size at time x_i = t_i. With \beta_2 < 0, \beta_0 again represents the maximal population size. The response function is now S-shaped. This response function is also used in logistic regression to model the probability of success for a binary (Bernoulli) outcome variable.

Some nonlinear response functions can be linearized by a transformation, hence


they are called intrinsically linear. The logarithm of the response function (10.4)

e.g.

\log f(\beta, x_i) = \log \beta_0 + \beta_1 x_i = \beta_0' + \beta_1 x_i \qquad (10.6)

is linear in \beta_0' and \beta_1. Model (10.6) could thus be analyzed with linear regression techniques. However, it depends on the error terms whether (10.4) or (10.6) should be studied. If we use (10.6), we assume that

\log y_i = \beta_0' + \beta_1 x_i + \varepsilon_i \qquad (10.7)

or equivalently that

y_i = \beta_0 e^{\beta_1 x_i} e^{\varepsilon_i} = \beta_0 e^{\beta_1 x_i} \varepsilon_i'

with the \varepsilon_i' being lognormally distributed. This is not equivalent to (10.4), so even in this situation we might prefer the nonlinear model (10.4) over (10.7).

Note that contrary to the general linear model (2.1), the number of predictor
variables is not necessarily the same as the number of regression parameters.
We denote the regression parameters β = (β0 , β1 , . . . , βp−1 )t as before, but now
the predictor variables xi = (xi1 , . . . , xiq )t have length q and do not (always)
include a constant element 1.

10.2 Estimation of the regression parameters


Just as for linear models, the regression parameters are typically estimated using
the least squares criterion, or using maximum likelihood estimation. The least
squares method minimizes the sum of the squared residuals:
\hat{\beta}_{LS} = \mathrm{argmin}_{\beta} \sum_{i=1}^{n} \big( y_i - f(\beta, x_i) \big)^2. \qquad (10.8)

The maximum likelihood approach requires specifying the error distribution.


If we assume independent normal errors (10.3), the likelihood becomes

L(\beta, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\Big[ -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \big( y_i - f(\beta, x_i) \big)^2 \Big]

which is maximized for \hat{\beta}_{ML} = \hat{\beta}_{LS}. So, under normality, both estimation procedures coincide, just as in linear regression.

A very important difference with linear regression is that the solution of the minimization problem (10.8) in general cannot be computed analytically, because the normal equations are not linear in \beta. These normal equations are obtained by setting the p partial derivatives of

Q(\beta) = \sum_{i=1}^{n} \big( y_i - f(\beta, x_i) \big)^2

equal to zero. For j = 0, . . . , p − 1, we have

\frac{\partial Q}{\partial \beta_j} = \sum_{i=1}^{n} -2\big(y_i - f(\beta, x_i)\big) \frac{\partial f(\beta, x_i)}{\partial \beta_j} \qquad (10.9)

so at \beta = \hat{\beta}_{LS}, we obtain

\sum_{i=1}^{n} y_i \left[ \frac{\partial f(\beta, x_i)}{\partial \beta_j} \right]_{\beta = \hat{\beta}_{LS}} - \sum_{i=1}^{n} f(\hat{\beta}_{LS}, x_i) \left[ \frac{\partial f(\beta, x_i)}{\partial \beta_j} \right]_{\beta = \hat{\beta}_{LS}} = 0

or in matrix notation

F (β̂ LS , X)t [y − f (β̂ LS , X)] = 0p (10.10)

with F (β̂ LS , X) the n × p matrix of partial derivatives:


F_{ij} = \left[ \frac{\partial f(\beta, x_i)}{\partial \beta_j} \right]_{\beta = \hat{\beta}_{LS}} \qquad (10.11)

Note that in linear regression

f (β, xi ) = xti β = β0 + β1 xi1 + . . . + βp−1 xi,p−1

such that Fij = xij (with Fi0 = 1). The normal equations then reduce to

X t [y − X β̂ LS ] = 0p

from which we find the well-known solution (2.10):

β̂ LS = (X t X)−1 X t y.

For the exponential regression model (10.4), the normal equations lead to:

\sum_{i=1}^{n} y_i e^{\hat{\beta}_1 x_i} - \hat{\beta}_0 \sum_{i=1}^{n} e^{2\hat{\beta}_1 x_i} = 0

\sum_{i=1}^{n} x_i y_i e^{\hat{\beta}_1 x_i} - \hat{\beta}_0 \sum_{i=1}^{n} x_i e^{2\hat{\beta}_1 x_i} = 0

which is clearly a system that is not linear in β̂0 and β̂1 . It is even very likely that
multiple solutions of (10.10) exist, which expresses the occurrence of many local
minima of Q(β). We could proceed by finding a solution of (10.10) numerically,
but it is more practical to apply direct numerical search procedures to Q(β).

10.3 Numerical algorithms


Numerical search procedures proceed iteratively. Starting with an initial estimate \hat{\beta}^{(0)}, an improved estimate \hat{\beta}^{(1)} is searched for such that

Q(\hat{\beta}^{(1)}) < Q(\hat{\beta}^{(0)}).

We discuss a few popular methods.

10.3.1 Steepest descent


The method of steepest descent moves in the opposite direction of the gradient vector d(\beta), whose elements are

d(\beta)_j = \frac{\partial Q(\beta)}{\partial \beta_j}.

Thus

\hat{\beta}^{(1)} = \hat{\beta}^{(0)} - M_0\, d(\hat{\beta}^{(0)})

with M_0 small enough such that Q(\hat{\beta}^{(1)}) < Q(\hat{\beta}^{(0)}). Next,

\hat{\beta}^{(2)} = \hat{\beta}^{(1)} - M_1\, d(\hat{\beta}^{(1)})

with M_1 small enough such that Q(\hat{\beta}^{(2)}) < Q(\hat{\beta}^{(1)}). This approach is illustrated in Figure 14.8.
From (10.9) we know that

d(β) = −2F (β, X)t [y − f (β, X)].

Preferably, the partial derivatives defining F are derived analytically; otherwise numerical approximations should be used. The general form of the steepest descent method thus becomes

\hat{\beta}^{(k+1)} = \hat{\beta}^{(k)} + M_k\, F(\hat{\beta}^{(k)}, X)^t\, [y - f(\hat{\beta}^{(k)}, X)]

with M_k such that Q(\hat{\beta}^{(k+1)}) < Q(\hat{\beta}^{(k)}). The iteration is stopped at convergence, i.e. when \|\hat{\beta}^{(k+1)} - \hat{\beta}^{(k)}\| or |Q(\hat{\beta}^{(k+1)}) - Q(\hat{\beta}^{(k)})| is sufficiently small.

10.3.2 The Gauss-Newton procedure
The Gauss-Newton method starts with a Taylor expansion of f(\beta, X) around \hat{\beta}^{(0)}:

f(\beta, X) \approx f(\hat{\beta}^{(0)}, X) + \sum_{j=0}^{p-1} \left[ \frac{\partial f(\beta, X)}{\partial \beta_j} \right]_{\beta = \hat{\beta}^{(0)}} (\beta_j - \hat{\beta}_j^{(0)})

\approx f(\hat{\beta}^{(0)}, X) + F(\hat{\beta}^{(0)}, X)\, [\beta - \hat{\beta}^{(0)}]

and then approximates the nonlinear regression model (10.2) with

y = f(\hat{\beta}^{(0)}, X) + F(\hat{\beta}^{(0)}, X)\, [\beta - \hat{\beta}^{(0)}] + e. \qquad (10.12)

Let y - f(\hat{\beta}^{(0)}, X) = y^{(0)} and

\beta - \hat{\beta}^{(0)} = \delta^{(0)} \qquad (10.13)

then equation (10.12) is equivalent to the linear model (in \delta):

y^{(0)} = F(\hat{\beta}^{(0)}, X)\, \delta^{(0)} + e = F_{(0)}\, \delta^{(0)} + e.

The parameters \delta^{(0)} can now be estimated as

\hat{\delta}^{(0)} = (F_{(0)}^t F_{(0)})^{-1} F_{(0)}^t\, y^{(0)}

which, following (10.13), yields updated estimates for \hat{\beta}:

\hat{\beta}^{(1)} = \hat{\beta}^{(0)} + \hat{\delta}^{(0)} = \hat{\beta}^{(0)} + (F_{(0)}^t F_{(0)})^{-1} F_{(0)}^t\, \big(y - f(\hat{\beta}^{(0)}, X)\big)

or in a more general form as

\hat{\beta}^{(k+1)} = \hat{\beta}^{(k)} + M_k\, (F_{(k)}^t F_{(k)})^{-1} F_{(k)}^t\, \big(y - f(\hat{\beta}^{(k)}, X)\big)

with M_k small enough such that Q(\hat{\beta}^{(k+1)}) < Q(\hat{\beta}^{(k)}).
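To make the procedure concrete, the following R sketch performs a few Gauss-Newton iterations by hand for the exponential model (10.4), with the matrix F of partial derivatives derived analytically. It uses the patients data and starting values of Section 10.4; nls applies a refined version of this scheme by default.

# Gauss-Newton by hand for y = b0 * exp(b1 * x)
f  <- function(b, x) b[1] * exp(b[2] * x)
Fm <- function(b, x) cbind(exp(b[2] * x),               # df/db0
                           b[1] * x * exp(b[2] * x))    # df/db1

x <- days; y <- prognostic                 # patients data (Section 10.4)
b <- c(56.66, -0.038)                      # starting values from the linearized model

for (k in 1:10) {
  Fk <- Fm(b, x)
  delta <- solve(t(Fk) %*% Fk, t(Fk) %*% (y - f(b, x)))  # LS step of the linearized model
  b <- b + as.vector(delta)
}
b                                          # should be close to the nls estimates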

10.3.3 The Levenberg-Marquardt procedure

This algorithm combines steepest descent and Gauss-Newton by setting

\hat{\beta}^{(k+1)} = \hat{\beta}^{(k)} + (F_{(k)}^t F_{(k)} + M_k I_p)^{-1} F_{(k)}^t\, \big(y - f(\hat{\beta}^{(k)}, X)\big).

Initially, M_0 is very small, such as 10^{-8}. If Q(\hat{\beta}^{(1)}) < Q(\hat{\beta}^{(0)}) we proceed with M_1 = M_0/10, otherwise we set M_1 = 10 M_0 and try again. For small values of M_k, the Levenberg-Marquardt procedure is similar to Gauss-Newton, otherwise it approaches the steepest descent method.
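In R, nls uses a Gauss-Newton type algorithm by default. A Levenberg-Marquardt implementation is available, for instance, through the nlsLM function of the minpack.lm package (a sketch, assuming that package is installed and the patients data of Section 10.4 are attached):

library(minpack.lm)
patients.nlsLM <- nlsLM(prognostic ~ A * exp(B * days),
                        start = list(A = 56.6646, B = -0.03797))
summary(patients.nlsLM)    # should essentially reproduce the nls results of Section 10.4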

10.3.4 Starting values


For all these procedures, the choice of the initial estimator \hat{\beta}^{(0)} is very impor-
tant. A poor estimate will result in slow convergence, convergence to a local
minimum or even divergence.

Initial values can be obtained from:

• a priori knowledge about the range of the parameters

• selecting p representative observations, and solving the system y = f (β, X)


exactly for those points

• a grid search in the parameter space, and retaining the solution with
minimal Q(β)

• the least squares solution of the linearized model (if the model is intrinsi-
cally linear).

It is often desirable to try several initial values to make sure that the same
solution will be found.
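As an illustration of the grid search option, a crude search for starting values of the exponential model (10.4) could look as follows (a sketch using the patients data of Section 10.4; the grid limits are illustrative):

Q <- function(b) sum((prognostic - b[1] * exp(b[2] * days))^2)  # objective function

grid <- expand.grid(b0 = seq(10, 100, by = 10),
                    b1 = seq(-0.10, -0.01, by = 0.01))
Qvals <- apply(grid, 1, Q)
grid[which.min(Qvals), ]        # parameter combination with the smallest Q(beta)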

10.4 Example
A hospital administrator wished to predict the degree of long-term recovery
after discharge from the hospital for severely injured patients. The predictor
variable (X) is the number of days of hospitalization, and the response (Y ) is
a prognostic index for long-term recovery. Large values of this index reflect a
good prognosis. Data are available for n = 15 patients.

attach(patients)
days

[1] 2 5 7 10 14 19 26 31 34 38 45 52 53 60 65

prognostic

[1] 54 50 45 37 35 25 20 16 18 13 8 11 8 4 6

The nonlinear regression model (10.4) was considered. To obtain starting values
for the numerical procedure, the model was first linearized as in (10.6). This
yields \hat{\beta}_0' = 4.0371 and \hat{\beta}_1 = -0.03797, from which \hat{\beta}_0 = e^{\hat{\beta}_0'} = 56.6646 is ob-
tained. These initial estimates are now used as starting values for the nonlinear
estimation method.

coefficients(lm(log(prognostic)~days))

(Intercept) days
4.03715887 -0.03797418

patients.nls <- nls(prognostic~A*exp(B*days),


start=list(A=56.6646,B=-0.03797),trace=T)

56.08671 : 56.66460 -0.03797


49.46383 : 58.5578440 -0.0395328
49.4593 : 58.60548449 -0.03958469
49.4593 : 58.6065313 -0.0395864

patients.nls

Nonlinear regression model


model: prognostic ~ A * exp(B * days)

data: parent.frame()
A B
58.60653 -0.03959
residual sum-of-squares: 49.46

Number of iterations to convergence: 3


Achieved convergence tolerance: 9.007e-06

The resulting fit is illustrated in Figure 10.1.

plot(days,prognostic,xlim=c(0,71),ylim=c(0,59),xaxs="i")
A <- summary(patients.nls)$parameters[1]
B <- summary(patients.nls)$parameters[2]
xx <- seq(0,70,length=100)
yy <- A*exp(B*xx)
lines(xx,yy)

Figure 10.1: Scatter plot and fitted curve for the patients data set.

10.5 Inference about regression parameters
Exact inference procedures are not available for nonlinear regression, because in
general β̂ LS and β̂ ML are not normally distributed (even if the error terms are),
they are not unbiased and they do not have minimum variance. Fortunately,
when the error terms satisfy (10.3) and the sample size is sufficiently large, the
estimates are approximately normally distributed, unbiased and the variance-
covariance matrix of β̂ LS is estimated by

Σ̂(β̂ LS ) = σ̂ 2 (F t F )−1 (10.14)

with F the matrix of partial derivatives of f evaluated at β̂ LS , as defined


in (10.11). The estimate of the error term variance σ 2 is given by the mean
squared error
\hat{\sigma}^2 = \mathrm{MSE} = \frac{\mathrm{SSE}}{n-p} = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n-p} = \frac{\sum_{i=1}^{n} [y_i - f(\hat{\beta}_{LS}, x_i)]^2}{n-p}

exactly as for linear regression. For nonlinear regression, σ̂ 2 is not an unbiased


estimator of σ 2 , but the bias is small when n is large.

Approximate confidence intervals for a single \beta_j are then obtained from (10.14):

\hat{\beta}_j \pm t_{n-p,\alpha/2}\, s(\hat{\beta}_j).

Equivalently, tests concerning a single \beta_j are derived from the test statistic

t = \frac{\hat{\beta}_j - \beta_{j0}}{s(\hat{\beta}_j)} \;\approx_{H_0}\; t_{n-p}.
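In R, these approximate (Wald) intervals can be computed from the summary of the nls fit; the confint generic provides an alternative. A sketch for the patients model of Section 10.4:

est <- summary(patients.nls)$parameters    # estimates and standard errors
df  <- df.residual(patients.nls)           # residual degrees of freedom n - p

lower <- est[, "Estimate"] - qt(0.975, df) * est[, "Std. Error"]
upper <- est[, "Estimate"] + qt(0.975, df) * est[, "Std. Error"]
cbind(lower, upper)                        # approximate 95% confidence intervals

confint(patients.nls)                      # intervals via the confint generic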

Example.

summary(patients.nls,correlation = TRUE)

Formula: prognostic ~ A * exp(B * days)

Parameters:
Estimate Std. Error t value Pr(>|t|)
A 58.606531 1.472159 39.81 5.70e-15 ***

B -0.039586 0.001711 -23.13 6.01e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.951 on 13 degrees of freedom

Correlation of Parameter Estimates:


A
B -0.71

Number of iterations to convergence: 3


Achieved convergence tolerance: 9.007e-06

Model checking, e.g. using residual plots, also remains necessary. Figures 10.2 and 10.3 do not suggest departures from the model assumptions. When interpreting residual plots for nonlinear regression, it should be noted that the residuals do not necessarily sum to zero.

Figure 10.2: Residuals versus fitted values for patients data set.


Figure 10.3: Normal quantile plot for the residuals of the patients data set.

Chapter 11

Nonparametric regression

11.1 The nonparametric regression model


In linear and nonlinear regression the response function f (β, X) is well specified,
either as a linear or a nonlinear function of β. Although these functions are ap-
propriate for many applications, they usually have all orders of derivatives ev-
erywhere and consequently they do not allow modelling relationships that fluctuate a lot.

The nonparametric regression model states that

E(Y |x) = f (x)

without specifying f (x). We will concentrate on regression curve modelling,


where we have n independent observations (xi , yi ) which satisfy

yi = f (xi ) + εi (11.1)

with E(εi ) = 0 and Var(εi ) = σ 2 .

11.2 The lowess method


The lowess or loess method stands for locally weighted scatterplot smoothing.
Loess is also a deposit of fine clay or silt along a river valley, and thus is a surface
of sorts. The method produces a fitted value ŷ = fˆ(x) for a set of x-values in
the range of the data. Often the x-values for which the smooth is computed are

simply the data points xi , but other values can be considered as well, mainly if
the observed data are rather sparse in some region of x.

The general idea of lowess is to fit a polynomial regression in the neighborhood


of x, thereby giving less weight to observations whose x_i-values are far from x
and emphasizing the points that are close to x. The algorithm proceeds as
follows:

1. First the neighborhood or the span of the smoother has to be chosen. This
span 0 < s ≤ 1 represents the fraction of the data that will be included in
each fit. It corresponds to m = [sn] data values. Often s = 0.5 or s = 2/3
work well. The larger s, the smoother the results.

2. Then locally weighted regression is performed. First a window is selected


that contains the m data points whose xi -values are closest to x (using
the Euclidean distance). Let h = \max_{j=1}^{m} |x_j - x| be the distance to the farthest x_j.

(a) Each of the m observations in the window is given a weight

w_j = w_T\!\left( \frac{x_j - x}{h} \right)

with w_T the tricube weight function

w_T(z) = \begin{cases} (1 - |z|^3)^3 & \text{if } |z| < 1 \\ 0 & \text{if } |z| \geq 1 \end{cases}

(b) Next, weighted polynomial least squares is applied to the observations in the window, i.e. \sum_{j=1}^{m} w_j e_j^2 is minimized with e_j the residuals from a polynomial fit to the (x_j, y_j) (see also Section 6.4.3). Often a linear fit is used (or equivalently, the degree of the polynomial is k = 1). If the relationship between Y and X changes direction quickly, then a local quadratic fit (k = 2) might be preferred.

(c) The fitted value ŷ from this regression step is retained. Connecting
these fitted values for all x-values under consideration produces an
initial nonparametric regression estimate.

3. To make the estimate resistant to outliers (and hence to allow long-tailed


error distributions), the weights are adapted such that observations with a

large residual obtain a smaller weight. For each i = 1, . . . , n, the residuals
from the previous fit are computed ei = yi − ŷi and also their robust scale
estimate (the median absolute deviation)

mad = med(|ei − med(ei )|).

Robustness weights are then computed using the bisquare weight function

v_i = w_B(z_i) = \begin{cases} (1 - z_i^2)^2 & \text{if } |z_i| < 1 \\ 0 & \text{if } |z_i| \geq 1 \end{cases}

where the z_i are the standardized residuals

z_i = \frac{e_i}{6\,\mathrm{mad}}.
The tuning constant t = 6 is chosen because for normally distributed
errors, 6mad ≈ 4σ. Step 2 is now repeated, with compound weights vj wj
in the local weighted regression. This procedure can be iterated until the
ŷ converge, but usually one or two iterations are sufficient.
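The following R sketch carries out a single locally weighted fit of this type at a point x0, without the robustness step, to make the weighting explicit; in practice one simply calls lowess() or loess(), which implement the full algorithm (the function and variable names are illustrative):

local.fit <- function(x0, x, y, s = 2/3, k = 1) {
  m <- floor(s * length(x))                     # number of points in the window
  d <- abs(x - x0)
  nb <- order(d)[1:m]                           # the m nearest neighbours of x0
  h <- max(d[nb])                               # distance to the farthest point in the window
  w <- (1 - (d[nb]/h)^3)^3                      # tricube weights w_T((x_j - x0)/h)
  dat <- data.frame(xx = x[nb], yy = y[nb])
  fit <- lm(yy ~ poly(xx, degree = k, raw = TRUE), data = dat, weights = w)
  predict(fit, newdata = data.frame(xx = x0))   # the smoothed value at x0
}
# e.g. local.fit(x0 = 1, x = E, y = NOx) for the gas data of Section 11.4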

The different steps in the algorithm are depicted in Figure 14.15 for local linear
regression. Figure 11.1 shows the two different weight functions. In Figure 11.2
we see the difference between the nonrobust and the robust smoothed curve in
the presence of an outlier at xmin .
Figure 11.1: Biweight (wB ) and tricube (wT ) weight function.

Figure 11.2: Robust and nonrobust smoothing.


Remarks.

1. The lowess method extends the locally weighted average or the kernel
approach which sets k = 0 in Step 2(b). This implies that the fitted
value in x is obtained as the weighted average of the responses in the
neighborhood of x:

\hat{y} = \frac{\sum_{j=1}^{m} w_j y_j}{\sum_{j=1}^{m} w_j}.
When f (x) is nearly linear in the neighborhood of x, then both the local
average and the local regression will produce nearly unbiased estimates of
f (x), but the variance of the local regression will be smaller. Moreover,
when f (x) is substantially nonlinear, the polynomial approach will yield
less bias. This is illustrated in Figure 2.5.

2. The lowess method can easily be extended to response surface fitting when
there are several predictor variables. Using Euclidean distances, a neigh-
borhood of x is then defined as the region

\{x_i \in \mathbb{R}^p : \|x_i - x\| \leq h\}

with h = \max_{j=1}^{m} \|x_j - x\|.

3. The method also applies when the variance of the errors is not constant, but instead satisfies Var(\varepsilon_i) = \sigma^2/a_i or Var(\sqrt{a_i}\,\varepsilon_i) = \sigma^2 with the a_i known weights as in the generalized linear model (6.3). Then, the neighborhood weights w_j and the robustness weights v_j are replaced with a_j w_j resp. a_j v_j.

4. An important issue in scatterplot smoothing is the choice of the span.


This could be done visually: if the lowess curve based on an initial choice
is too rough, an increased value of s can be tried. If the curve looks too
smooth, a smaller value should be selected. In general, the larger s the
greater the bias and the smaller the variance of ŷ as an estimate of the
true regression function f (x).

A general approach is to minimize an estimate of the MSE of ŷ. For this,


the cross-validated PRESS (7.2) can be used
\mathrm{PRESS}_s = \sum_{i=1}^{n} (y_i - \hat{y}_{i(i)})^2

where ŷi(i) is the fitted value evaluated at xi for a locally weighted regres-
sion that omits the ith observation.
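A leave-one-out computation of PRESS_s for a few candidate spans could be sketched in R as follows (using loess on the gas data of Section 11.4; surface = "direct" ensures that a prediction at the omitted x-value is always returned):

press <- function(s, x, y, k = 2) {
  n <- length(y)
  pr <- numeric(n)
  for (i in 1:n) {
    dat <- data.frame(xx = x[-i], yy = y[-i])
    fit <- loess(yy ~ xx, data = dat, span = s, degree = k,
                 control = loess.control(surface = "direct"))
    pr[i] <- predict(fit, newdata = data.frame(xx = x[i]))
  }
  sum((y - pr)^2)                               # PRESS for span s
}
sapply(c(0.5, 2/3, 1), press, x = E, y = NOx)   # compare a few candidate spans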

11.3 Inference
Although the primary goal of scatterplot smoothing is to produce a visual sum-
mary of the relation between Y and X, inference is also useful:

• to obtain an idea of the variability of the estimated curve

• to determine whether it is really better than a (simpler) parametric fit.

We consider inference for the lowess smoother without robustness weights. Be-
cause the fitted values are derived from a weighted least squares regression, in

which the weights only depend on the x-values, they can be written as a linear
combination of the y-values:
\hat{y}_i = \sum_{j=1}^{n} s_{ij} y_j

or in matrix notation
ŷ = Sy

Note that observations which fall outside the span of the smoother for the ith
observation receive sij = 0.

Consequently, the variance of the fitted values is given by

Σ(ŷ) = SΣ(y)S t = σ 2 SS t

To estimate Σ(ŷ) we need an estimate of the error variance σ 2 , and hence, an


estimate of the degrees of freedom for the residual sum of squares.

One approach is to consider the analogy with linear least squares regression,
where ŷ = Hy with H = X(X t X)−1 X t the hat matrix and e = (In − H)y.
The number of parameters in this model is tr(H) = p, and the residual degrees
of freedom are tr(In − H) = n − p. By analogy, the ’equivalent’ number of
parameters for the lowess method is tr(S) and the residual degrees of freedom
are tr(In − S) = n − tr(S). The estimated error variance is therefore given by
\hat{\sigma}^2 = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - \mathrm{tr}(S)}

Alternatively one could also use n−tr(SS t ) which, for linear regression, is again
equal to n − tr(HH t ) = n − tr(H) because the hat matrix is symmetric and
idempotent. This definition is used in R.

A pointwise approximate 95% confidence interval for f(x_i) is then given by

\hat{y}_i \pm 2\hat{\sigma} \sqrt{(SS^t)_{ii}}.
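In R, pointwise intervals of this type can be obtained directly from a loess fit, since predict can also return standard errors. A sketch, using the gas.m fit of Section 11.4:

pr <- predict(gas.m, se = TRUE)       # fitted values and their standard errors
lower <- pr$fit - 2 * pr$se.fit
upper <- pr$fit + 2 * pr$se.fit

o <- order(E)
plot(E, NOx)
lines(E[o], pr$fit[o])
lines(E[o], lower[o], lty = 2)        # approximate pointwise 95% limits
lines(E[o], upper[o], lty = 2)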

An example is presented in Figure 14.17. Here, the equivalent number of pa-


rameters is tr(S) = 4.668 ≈ 5, which is about the same as a fourth-degree
polynomial regression.

Finally, we can construct approximate F-tests. To test for nonlinearity, the
reduced model is the linear regression model

yi = β0 + β1 xi + εi

with residual sum of squares SSE0 and residual degrees of freedom n − 2. Let
SSE1 be the residual sum of squares for the lowess fit. Then, in analogy to the
partial F-test (3.5) we compute

F = \frac{(\mathrm{SSE}_0 - \mathrm{SSE}_1)/(\mathrm{tr}(S) - 2)}{\mathrm{SSE}_1/(n - \mathrm{tr}(S))}

with tr(S)−2 and n−tr(S) degrees of freedom. A similar test can be performed
to compare a lowess fit with a quadratic or any other parametric model, or to
compare two lowess fits with different spans.

11.4 Example
The data set gas has 22 observations of two variables from an industrial ex-
periment that studied exhaust from an experimental one-cylinder engine. The
dependent variable, NOx, is the concentration of nitric oxide and nitrogen dioxide (expressed in µg of NOx per joule). The predictor is the equivalence ratio
E at which the engine was run, which is a measure of the richness of the air and
fuel mixture.

gas=read.table(file=path.expand(".\\Updates\\gas.txt"),header=TRUE)
attach(gas)

Because of the curvature in the data, we fit a local regression model using
locally quadratic fitting (k = 2), with span s = 2/3. If the parameter family is
not specified, the non-robust fit is used. If we want to include the robustness
weights, we should add family = "symmetric" in the function call.

attach(gas)
gas.m <- loess(NOx ~ E, span = 2/3, degree = 2)
gas.m

Call:
loess(formula = NOx ~ E, span = 2/3, degree = 2)

Number of Observations: 22

Equivalent Number of Parameters: 5.52
Residual Standard Error: 0.3404

The fitted values in the observed xi , the residuals for all the observations and
the smoothed curve can be obtained from

fitted(gas.m)
residuals(gas.m)
plot(E,NOx)
lines(loess.smooth(E,NOx,span=2/3,degree=2))
# or without creating the gas.m object
scatter.smooth(E, NOx, span = 2/3, degree = 2)

This plot function evaluates the lowess fit at 50 equally spaced points and
connects these fitted values by line segments. If we want to evaluate the curve
at other points, we set e.g.

z <- c(min(E), median(E), max(E))


predict(gas.m, z)

[1] 1.1964144 5.0687470 0.5236823

To check the model assumptions we make several residual plots. First we have
to check the properties of f (x) that are specified by the choice of s = 2/3 and
k = 2. A plot of the residuals against the predictor variable E does not show
any lack of fit.

scatter.smooth(E, residuals(gas.m), span = 1, degree = 1)
abline(h = 0, lty = 2)


Maybe we can allow a larger span, e.g. s = 1. However, the resulting residual
plot shows that there is a dependence of the residuals on E, so s = 1 is too
large.

gas.m.null <- update(gas.m, span = 1)


scatter.smooth(E, residuals(gas.m.null), span = 1, degree = 1)
abline(h = 0, lty = 2)

Next, we verify whether the residuals have constant variance, and whether they
are normally distributed. Both assumptions seem to be satisfied.

scatter.smooth(fitted(gas.m),sqrt(abs(residuals(gas.m))),
span = 1, degree = 1)
qqnorm(residuals(gas.m))
qqline(residuals(gas.m))

Finally we compare the gas.m and the gas.m.null fits.

anova(gas.m.null, gas.m)

Model 1: loess(formula = NOx ~ E, span = 1, degree = 2)


Model 2: loess(formula = NOx ~ E, span = 2/3, degree = 2)

Analysis of Variance: denominator df 15.66

ENP RSS F-value Pr(>F)


[1,] 3.49 4.8489
[2,] 5.52 1.7769 10.14 0.0008601 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The result is highly significant, as expected.

KU Leuven
Leuven Biostatistics and Statistical Bioinformatics Centre (L-BioStat)
Kapucijnenvoer 35 blok d - box 7001, 3000 Leuven
thomas.neyens@kuleuven.be
