Regression Analysis
Master of Statistics
2020-2021
Contents
Introduction
3 Statistical inference
3.1 Inference for individual parameters
3.2 Inference for several parameters
3.3 The overall F-test
3.4 Test for all parameters
3.5 A general linear hypothesis
3.6 Mean response and prediction
3.6.1 Inference about the mean response
3.6.2 Inference about the unknown response
3.7 Residual plots
4 Polynomial regression
4.1 One predictor variable
4.2 Several regressors and interaction terms
4.3 Estimation and inference
4.4 Example
4.5 Detecting curvature
4.5.1 Residual plots
4.5.2 Partial residual plots
5 Categorical predictors
5.1 One dichotomous predictor variable
5.1.1 Constructing the model
5.1.2 Estimation and inference
5.1.3 Adding interaction terms
5.2 Extensions
5.2.1 One polytomous predictor variable
5.2.2 More than one categorical variable
5.3 Piecewise linear regression
6 Transformations
6.1 The family of power and root transformations
6.2 Transforming proportions
6.3 Transformations in regression
6.3.1 Power transformation
6.3.2 Box-Cox transformation
6.4 Nonconstant variance
6.4.1 Detecting heteroscedasticity
6.4.2 Variance-stabilizing transformations
6.4.3 Weighted least squares regression
7 Variable selection methods
7.1 Reduction of explanatory variables
7.1.1 Surgical unit example
7.2 All-possible-regressions procedure for variable reduction
7.2.1 Rp² criterion
7.2.2 MSEp criterion
7.2.3 Mallows' Cp
7.2.4 Akaike's Information Criterion
7.2.5 PRESSp criterion
7.3 Stepwise regression
7.3.1 Backward elimination
7.3.2 Forward selection
7.3.3 Stepwise regression
7.4 Model validation
7.4.1 Collection of new data
7.4.2 Data splitting
8 Multicollinearity
8.1 The effects of multicollinearity
8.1.1 Uncorrelated predictor variables
8.1.2 Perfectly or highly correlated predictors
8.2 Multicollinearity diagnostics
8.2.1 Informal methods
8.2.2 Variance inflation factors
8.2.3 The eigenvalues of the correlation matrix
8.3 Multicollinearity remedies
8.3.1 Specific solutions
8.3.2 Principal component regression
8.3.3 Ridge regression
9.3 Single-case diagnostics
9.3.1 DFFITS
9.3.2 Cook's distance
9.3.3 DFBETAS
9.3.4 Examples
9.4 The LTS estimator
9.4.1 Parameter estimates
9.4.2 Computation
9.4.3 Reweighted LTS
9.5 The MCD estimator
9.5.1 Parameter estimates
9.5.2 Computation
9.5.3 Reweighted MCD-estimator
9.6 A robust R-squared
9.7 Model selection
Bibliography
Introduction
In its simplest form regression aims to model the relation between an input vari-
able X and an output or response variable Y . Contrary to a correlation analysis,
the regression model is asymmetric. It models the influence or effect of the
input or predictor variable X on the response variable Y . The regression model
allows us to evaluate to what extent the outcome Y changes due to a change
in the value of X. The regression model can then be used to predict Y from
X. Therefore, the input variable X is also called the independent variable, or
regressor, whereas the response variable Y is also called the dependent variable.
Since the observations will in general not satisfy this functional relation exactly,
the regression model will also include a stochastic component ε which expresses
the variation of the data points around the regression curve. A regression model
thus postulates that
Y = f(X) + ε,
where f describes the systematic relation with X.
We will especially study the general linear regression model, which is defined as
yi = β0 + β1 f1(xi1) + . . . + βl−1 fl−1(xi,l−1) + εi
with
E[εi] = 0
Var[εi] = σ²
E[εi εj] = 0 for all i ≠ j.
The first condition expresses that at each level of (X1 , . . . , Xl−1 ) the regression
curve represents the mean of the corresponding probability distribution of Y .
The second condition states that the probability distribution of Y at each level
of (X1 , . . . , Xl−1 ) has the same variance, namely σ 2 . The last condition implies
that the error terms are uncorrelated.
This general linear model includes:
2. simple regression: the first-order regression model with p = 2.
5. transformations in X or Y: g(Y) = log(Y), g(Y) = (Y^λ − 1)/λ, fj = log(Xj).
Note that this model is linear in β and not necessarily in the independent
variables Xj . An example of a nonlinear model is
yi = β0 + β1 e^(β2 xi) + εi.
Chapter 1
1.1 Examples
The ‘old faithful geyser’ is the most famous geyser in the Yellowstone National
Park (Wyoming, USA). Eruptions occur in intervals with length between 45
minutes and 125 minutes. An eruption lasts 1.5 to 5 minutes, during which 14,000 to 32,000 liters of boiling water are shot into the air to a height of 32 to 56
meters. It has been observed that there is a relation between the waiting time
until an eruption and the duration of that eruption. To examine this relation
both times (in minutes) have been recorded for 272 eruptions.
Questions that can be examined based on these data are: Is there indeed a strong
influence of waiting time on the following eruption time? Can the waiting time
be used to predict the length of the subsequent eruption? To answer these
questions we model the relation between the waiting and eruption times. First,
we graphically explore this relationship by making a scatterplot of the data.
[Scatterplot: eruption duration (eruptions, minutes) versus waiting time (waiting, minutes).]
Note that the predictor variable ’waiting time’ is plotted horizontally, while the
response variable ’eruption time’ is plotted vertically. The scatterplot clearly
reveals that longer waiting times result in longer eruption times. The main
pattern can at least approximately be represented by a line. However, it is also
clear that the relation between both times is far from perfect. We will see how
we can model the effect of the waiting time on the eruption time.
[Scatterplot: breakdown time (minutes) versus voltage (kV).]
A scatterplot of the data is again used to explore the relationship. From the
scatterplot we clearly see that the breakdown time decreases rapidly as the volt-
age increases. However, the type of relation between the two variables is difficult
to see from this plot. This is caused by the skewness in the response variable.
To improve the graphical representation of the data, we apply a logarithmic
transformation on the response variable.
[Scatterplot: logarithm of breakdown time versus voltage (kV).]
This scatterplot provides more insight into the data. We see a decrease of the breakdown time (on logarithmic scale) with increasing voltage, a pattern that can at least approximately be represented by a line.
Note that there is an important difference between both examples. In the first
example, the recorded values for both the waiting and eruption times are ob-
served values. In this example both the X and Y variable are thus random
variables. The observed measurement pairs can even be considered to be a
random sample from the joint distribution of the two variables. In the second
example, the voltage dose is chosen by the experimenters and the breakdown
time is recorded for these fixed doses. In this example, the Y variable is still
random, but the X variable is not. Consequently, in this case the paired dose-time measurements cannot be considered a random sample from a joint distribution. Regression modeling as
discussed next can be used for both types of data under suitable conditions.
yi = β0 + β1 xi + εi (1.1)
In this model the values xi are not necessarily values of the observed predic-
tor variable X, but can be values for any suitable function f (X). We assume
that X does not contain any random effect or measurement error. Note that
this assumption is naturally satisfied in an experimental setting as in the sec-
ond example where the values of the predictor are chosen and fixed by the
experimenter. In the case of an observational study where the values of X are
observed, as is the case for the response Y , it is far more difficult to satisfy this
assumption. In this case, it is up to the statistician and/or data collectors to
judge whether the X variable is observed with sufficient accuracy so that the
assumption is (approximately) satisfied. If there is considerable randomness in
the observation of X, then more complex models such as measurement error
models are needed.
E[εi] = 0 (1.2)
Var[εi] = σ² (1.3)
E[εi εj] = 0 for all i ≠ j (1.4)
for i = 1, . . . , n.
In the case that X is random, it is also assumed that the errors εi are indepen-
dent of X.
As the εi are random variables with zero mean, also Y is a random variable
that satisfies:
E[Y |X] = β0 + β1 X.
Here, E[Y |X] is a function of X that for each value X = x yields the mean of the
corresponding distribution of the response variable Y at X = x. Conditionally
on the observed values for X, this can also be written as:
E[Y |X = xi ] = β0 + β1 xi (1.5)
For the first-order regression model, where the X and Y variables in (1.1) correspond to the observed predictor variable and response variable respectively, this linear relation geometrically implies that we try to estimate a regression line
E[Y |X] = β0 + β1 X
The simple linear model in (1.1) contains three parameters β0 , β1 and σ. Note
that the two regression parameters β0 and β1 are inherent in the model (1.1)
while the scale parameter σ is a consequence of the second Gauss-Markov condi-
tion (1.3). These model parameters are unknown and need to be estimated from
the available data. A natural strategy is to estimate the regression parameters
such that the corresponding linear function fits the available data points as well
as possible. Otherwise stated, the estimation method should aim to keep the
errors as small as possible. Here, the errors corresponding to any candidate parameter values β0 and β1 are given by
ei(β0, β1) = yi − (β0 + β1 xi).
The first Gauss-Markov condition (1.2) implies that positive and negative errors
occur. To avoid that large positive and large negative errors can cancel each
other out in the estimation strategy, a function needs to be used that adds up
all errors regardless of their sign. The two most common functions to achieve
this goal are the absolute value and the square. Hence, the parameters β0 and
β1 can be estimated by minimizing the sum of the absolute errors:
(β̂0,LAD, β̂1,LAD) = argmin_{β0,β1} Σ_{i=1}^n |ei(β0, β1)| = argmin_{β0,β1} Σ_{i=1}^n |yi − (β0 + β1 xi)|. (1.6)
This estimator is called the least absolute deviations estimator. The other option
is to estimate the parameters β0 and β1 by minimizing the sum of the squared
errors:
(β̂0,LS, β̂1,LS) = argmin_{β0,β1} Σ_{i=1}^n ei²(β0, β1) = argmin_{β0,β1} Σ_{i=1}^n (yi − (β0 + β1 xi))². (1.7)
This estimator is called the least squares estimator. Both estimators have
their merits, but the least squares estimator is the standard estimator for the
regression parameters in linear models because it can be solved analytically and
it has some good (optimal) statistical properties that will be discussed later.
The residual sum of squares L(β0, β1) = Σ_{i=1}^n ei²(β0, β1) is called the objective function or loss function of the least squares estimator. Differentiating this objective function with respect to β0 and β1 and setting the partial derivatives equal to zero yields the normal equations.
The least squares estimators β̂0,LS and β̂1,LS for the simple regression model are
the solution of this system of equations. From (1.8) we find that
∂L(β0, β1)/∂β1 − x̄n ∂L(β0, β1)/∂β0 = 0
Note that for the existence of the least squares estimator it is required that
s²X > 0. This means that the values of the variable X cannot all be the same,
which is a natural condition. If all values of X are equal to each other, then
the data do not provide any information on how the value of the response Y
changes with changes in X.
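As a small illustration, the closed-form LS estimates β̂1 = sXY/s²X and β̂0 = ȳ − β̂1 x̄ can be computed directly in R and compared with lm(); the sketch below uses the Old Faithful data, available in R as the built-in data frame faithful.

x <- faithful$waiting
y <- faithful$eruptions
b1 <- cov(x, y) / var(x)        # slope: requires var(x) = s_X^2 > 0
b0 <- mean(y) - b1 * mean(x)    # intercept
c(b0, b1)
coef(lm(eruptions ~ waiting, data = faithful))   # identical values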
The graphical representation of this regression fit shows that the estimated
regression line represents well the main trend in the data.
[Scatterplot: eruptions versus waiting with the fitted LS regression line.]
Note that the intercept has a negative sign which is physically not a meaningful
value because an eruption time cannot be negative. This is an illustration of the
danger of extrapolation from a regression model. The data do not contain any
information about the length of eruption for very short waiting times (because
short waiting times do not occur in reality). Hence, the model cannot be used
to reliably predict what would happen with the eruption time after such small
waiting times. There is no reason why the model would still be valid in this
case, and in fact the unrealistic values predicted by the model indicate that it is not valid beyond the range of the observed data.
For the insulating fluid experiments, the simple regression model was estimated from the available data; the estimated slope is β̂1 = −0.51 (used in the interpretation below).
[Scatterplot: logarithm of breakdown time versus voltage with the fitted LS regression line.]
med[Y|X] = β0 + β1 X
in this case. Now, if the response Y in the simple linear model is the logarithm of the original measured response Ỹ, this means that med[log(Ỹ)|X] = β0 + β1 X, or equivalently,
med[Ỹ|X] = exp(β0) exp(β1 X).
Consequently,
med[Ỹ|X = x + 1] / med[Ỹ|X = x] = exp(β1),
or
med[Ỹ|X = x + 1] = exp(β1) med[Ỹ|X = x].
Hence, exp(β1 ) is the multiplicative change of the median of the measured re-
sponse Ỹ if the predictor X increases by one unit. A similar interpretation holds
for the intercept. In the insulating fluid example, we find that exp(−0.51) = 0.60
so with every unit increase in voltage the median breakdown point is only 60%
of what it was before. Otherwise stated, the median breakdown point decreases
by 40% for every unit increase in X. For example, the median breakdown point
at X = 32 kV is estimated at 15.2 minutes. The median breakdown point at
X = 33 kV then becomes 15.2 ∗ 0.6 = 9.1 minutes.
Similarly, when the predictor is log-transformed, E[Y|X] = β0 + β1 log(X), such that
E[Y|X = cx] − E[Y|X = x] = β1 log(c).
E[εi] = 0 (2.2)
Var[εi] = σ² (2.3)
E[εi εj] = 0 for all i ≠ j. (2.4)
As the εi are random with zero mean, also Y is a random variable that satisfies:
Conditionally on the observed values for X1 , . . . , Xp−1 , this can also be written
as:
E[Y |xi ] = β0 + β1 xi1 + β2 xi2 + . . . + βp−1 xi,p−1 (2.5)
with xi = (1, xi1 , . . . , xi,p−1 )t . Note that the first element of the x-vector is
1, which is the x-value for the intercept. For the first-order regression model (where the Xj in (2.1) correspond with the observed predictor variables), this
linear relation geometrically implies that we try to estimate a hyperplane in the
(X, Y )-space. With p = 2 we recover simple regression as a special case and
thus fit the regression line
E[Y |X] = β0 + β1 X.
hence indeed
βj = E(Y |xi(j+1) ) − E(Y |xi(j) ).
Often it is very convenient to write the general linear model (2.1) in matrix
form. Let the vectors y = (y1 , . . . , yn )t , ε = (ε1 , . . . , εn )t and the matrix X =
(x1 , x2 , . . . , xn )t , then (2.1) is equivalent to
y = Xβ + ε (2.6)
E[ε] = 0 (2.7)
Σ(ε) = σ 2 In . (2.8)
Here, Σ(ε) stands for the variance-covariance matrix of the errors, and In for
the n × n identity matrix.
Any parameter estimate β̂ = (β̂0, . . . , β̂p−1)ᵗ yields fitted values ŷi and residuals ei:
ei(β̂) = yi − ŷi = yi − xiᵗ β̂.
XᵗXβ = Xᵗy.
β̂LS = (XᵗX)⁻¹ Xᵗy. (2.10)
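A minimal sketch in R of (2.10), assuming the fuel.frame data used later in this chapter are available; the normal equations are solved directly rather than by explicitly inverting XᵗX.

library(SemiPar)                 # provides the fuel.frame data (assumed installed)
data(fuel.frame)
X <- cbind(1, fuel.frame$Weight, fuel.frame$Disp.)   # design matrix with intercept column
y <- fuel.frame$Fuel
beta.hat <- solve(t(X) %*% X, t(X) %*% y)            # solves X'X beta = X'y
beta.hat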
Figure 13.3 shows the LS objective function Σ_{i=1}^n ei²(β) for varying values of β
(here, for two regressors). If the rank of X is exactly p, as in Figure 13.3(a), we
see that this objective function is convex and hence yields a unique minimum
which can be derived analytically. If rank(X) < p as in Figure 13.3(b), there
are an infinite number of LS solutions. In practice, such a perfect linear rela-
tionship between the X-variables is not often encountered, but the X-variables
might be strongly correlated. This situation is known as multicollinearity. In
such a case, the LS fit is uniquely defined, but many other parameter estimates
β̂ attain a residual sum of squares which is close to the minimal value of β̂ LS
(see Figure 13.3(c)). Consequently, small changes in the data set may cause a
large change in the parameter estimates. Figure 13.2 illustrates these effects in
the data space.
If we define the hat matrix H = X(XᵗX)⁻¹Xᵗ and
M = In − H,
then the following relations hold for ŷ = (ŷ1 , . . . , ŷn )t and e = (e1 , . . . , en )t :
ŷ = Hy (2.14)
e = My (2.15)
e = Mε (2.16)
Σ(e) = σ² (In − H) = σ² M. (2.17)
Equations (2.14) and (2.15) are trivial and explain why the matrix H is called
the hat matrix. This hat matrix H is symmetric (Hᵗ = H) and idempotent: H² = HH = H. From (2.14) we also derive that Σ(ŷ) = H Σ(y) Hᵗ = σ² H, hence the diagonal elements hii of the hat matrix are always nonnegative. Other
properties of the hat matrix will be derived in Section 9.2.3.
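Continuing the previous sketch, the hat matrix and the properties just listed can be verified numerically:

H <- X %*% solve(t(X) %*% X) %*% t(X)   # hat matrix H = X (X'X)^{-1} X'
all.equal(H, t(H))                      # symmetric
all.equal(H %*% H, H)                   # idempotent
range(diag(H))                          # the leverages h_ii lie in [0, 1]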
The first two equations (2.18) and (2.19) follow from Xᵗe = XᵗMε = 0p,n ε = 0p. Moreover, ŷᵗe = β̂ᵗXᵗe = 0. These equations thus imply that the mean of
the least squares residuals is zero, and that the residuals are orthogonal to the
design matrix X as well as to the predicted values.
From (2.18) we can also deduce that the LS hyperplane passes through the mean of the data points. Indeed, as (1/n) Σi (yi − ŷi) = 0 we have that
ȳ = (1/n) Σi ŷi = (1/n) Σi (β̂0 + β̂1 xi1 + . . . + β̂p−1 xi,p−1) = β̂0 + β̂1 x̄1 + . . . + β̂p−1 x̄p−1,
or equivalently
β̂0 = ȳ − β̂1 x̄1 − . . . − β̂p−1 x̄p−1. (2.21)
As a result, the intercept of the LS fit will be zero if we first mean-center the data, by setting yi^c = yi − ȳ and xij^c = xij − x̄j for each i = 1, . . . , n and j = 1, . . . , p − 1. From (2.21) we see indeed that the intercept β̂0^c of the LS fit through the transformed data equals
β̂0^c = ȳ^c − β̂1^c x̄1^c − . . . − β̂p−1^c x̄p−1^c = 0.
Relations (2.19) and (2.20) can also be derived from the geometrical inter-
pretation of the least squares estimator. If we consider the observed y-vector
and the column vectors of X as points in Rn , the least squares estimate β̂ LS is
defined as the vector β which minimizes the Euclidean norm ky − Xβk. This
is because for any vector z ∈ Rn , it holds that
‖z‖ = (Σ_{i=1}^n zi²)^(1/2) = √(zᵗz).
The data frame fuel.frame (from the ‘SemiPar’ library) contains information on 60 cars. This data set contains 5 variables: Weight (the weight of the car
in pounds), Disp. (the engine displacement in liters), Mileage (gas mileage in
miles/gallon), Fuel (fuel consumption in gallons per 100 miles, it thus is equal
to 100/Mileage), and Type (a factor giving the general type of car, with levels:
Small, Sporty, Compact, Medium, Large, Van).
We want to predict the fuel consumption of a car by its weight and engine
displacement. The postulated model is:
Fueli = β0 + β1 Weighti + β2 Disp.i + εi
with εi ∼ N(0, σ²).
library(SemiPar)                # package that contains the fuel.frame data
data(fuel.frame)
attach(fuel.frame)
## help(fuel.frame)             # description of the variables
names(fuel.frame)
pairs(~ Fuel + Weight + Disp.)  # scatterplot matrix
[Scatterplot matrix of Fuel, Weight and Disp.]
The pairwise plots suggest that both Weight and Disp. are linearly related to
Fuel. The analysis yields:
Call:
lm(formula = Fuel ~ Weight + Disp.)
Residuals:
Min 1Q Median 3Q Max
-0.81089 -0.25586 0.01971 0.26734 0.98124
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.4789731 0.3417877 1.401 0.167
Weight 0.0012414 0.0001720 7.220 1.37e-09 ***
Disp. 0.0008544 0.0015743 0.543 0.589
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
with σ̂ = 0.39.
Squaring both sides of the equation and summing over all observations gives
Σ_{i=1}^n (yi − ȳ)² = Σ_{i=1}^n (ŷi − ȳ)² + Σ_{i=1}^n (yi − ŷi)². (2.27)
The cross-product term vanishes because of the Pythagorean theorem (see Fig-
ure 10.6). It can also be deduced from (2.18) and (2.20) by noting that
Σi (ŷi − ȳ)(yi − ŷi) = Σi (ŷi − ȳ) ei = Σi ŷi ei − ȳ Σi ei = 0.
Relation (2.27) is the ANOVA decomposition which says that the total variation
(SST) in the response y can be decomposed into an ‘explained’ component due
to the regression (SSR) and an ‘unexplained’ component due to the errors (SSE).
We thus have:
SST = SSR + SSE
MSR = SSR / (p − 1)
MSE = SSE / (n − p)
They are typically written in an ANOVA table as in Table 6.1.
The coefficient of determination R² = SSR/SST = 1 − SSE/SST measures the proportion of the total variation in the response y that is explained by the linear model (2.1) which includes the variables X1, . . . , Xp−1. By construction 0 ≤ R² ≤ 1. The minimum value 0 is attained when all ŷi = ȳ, i.e. when all β̂j = 0 for j = 1, . . . , p − 1. The maximum value 1 is attained when all
the observations fall exactly on the fitted regression surface, i.e. when yi = ŷi
for all cases i.
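In R, R² can be computed from the ANOVA decomposition and checked against the value reported by summary(); a small sketch for the fuel consumption model:

fit <- lm(Fuel ~ Weight + Disp., data = fuel.frame)
SSE <- sum(residuals(fit)^2)                             # unexplained variation
SST <- sum((fuel.frame$Fuel - mean(fuel.frame$Fuel))^2)  # total variation
1 - SSE / SST             # R^2 = SSR/SST = 1 - SSE/SST
summary(fit)$r.squared    # the same value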
Remarks:
2. A high value of R2 does not necessarily imply that the fitted model is
useful to make predictions.
4. If the general model (2.1) does not contain an intercept term, that is,
when β0 = 0, the ANOVA decomposition becomes:
Σ_{i=1}^n yi² = Σ_{i=1}^n ŷi² + Σ_{i=1}^n (yi − ŷi)².
R = cos(y, ŷ).
because the total sum of squares did not change. Note that
SSE(X1, X2) ≤ SSE(X1).
or equivalently
These definitions and formulas also hold when we include several regressors at
once. For example,
When we combine (2.32) and (2.34) we see that the total sum of squares (SST)
can be written as
We can thus decompose the SSR of the full model (here, with all 3 predictors)
into several extra sum of squares, as in Table 7.3. Note that the degrees of
freedom associated with each sum of squares is equal to the number of variables
that are added to the model.
summary(aov(Fuelfit))
for any vector v. Hence, if a regression estimator applied to the (xi, yi) yields β̂, then it is desirable that the estimator applied to the (xi, yi + xiᵗv) yields β̂ + v.
σ̂²LS(xi, cyi) = c² σ̂²LS(xi, yi).
This implies that the fit is essentially independent of the choice of measurement
unit for the response variable y. Also, if we apply e.g. a logarithmic transfor-
mation to the y, it does not really make a difference whether we use the natural
logarithm log(y) = ln(y) or log10 (y) as they only differ up to a constant factor.
This transformation is also used when we want to compare the regression coefficients
in common units. Consider e.g. the estimated regression plane:
with sj resp. sY the standard deviation of Xj resp. Y. Using (2.11) and (2.12), the model now becomes
(yi − ȳ)/sY = β1 (s1/sY) (xi1 − x̄1)/s1 + . . . + βp−1 (sp−1/sY) (xi,p−1 − x̄p−1)/sp−1 + εi/sY.
We could drop the intercept term from the model because the observations are mean-centered! If finally we divide each term by √(n − 1), we obtain the standardized regression model
for i = 1, . . . , n with
ε′i = εi / (√(n − 1) sY) (2.41)
β′j = (sj/sY) βj. (2.42)
The regression coefficients β′j are often called the standardized regression coefficients. Because of the correlation transformation their least squares estimates satisfy:
β̂′ = RXX⁻¹ rXY (2.43)
with RXX the correlation matrix of the regressors and rXY the vector of correlations between the regressors and the response.
To return to the estimates with respect to the original variables we use (2.42),
(2.21) and the equivariance properties of the least squares estimator:
β̂j = (sY/sj) β̂′j (2.44)
β̂0 = ȳ − β̂1 x̄1 − . . . − β̂p−1 x̄p−1. (2.45)
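A sketch of the correlation transformation in R: regressing the standardized response on the standardized regressors without intercept reproduces the standardized coefficients β̂′j = (sj/sY) β̂j (the common factor √(n − 1) cancels in the fit).

fit  <- lm(Fuel ~ Weight + Disp., data = fuel.frame)
zfit <- lm(scale(Fuel) ~ scale(Weight) + scale(Disp.) - 1, data = fuel.frame)
coef(zfit)   # standardized coefficients beta_j'
coef(fit)[-1] * c(sd(fuel.frame$Weight), sd(fuel.frame$Disp.)) /
  sd(fuel.frame$Fuel)   # (s_j / s_Y) * beta_j_hat, the same values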
Example.
Dwaine Studios Inc. operates portrait studios in 21 cities of medium size. They
are specialized in portraits of children. The company wants to investigate
whether sales in a community (Y , expressed in $1000) can be predicted from
the number of persons aged 16 or younger in that community (X1 in thousands
of persons) and the per capita personal income (X2 in $1000). Some data and
results are shown in Table 7.5. The standardized regression model yields:
In the latter model, β̂1 and β̂2 cannot be compared directly because the variables X1 and X2 are measured in different units. The standardized coefficients tell
us that an increase of one standard deviation of X1 when X2 is fixed leads to
a much larger increase in expected sales than if we fix X1 and increase X2 by
one standard deviation. We should however be cautious about this interpreta-
tion as also the correlation between X1 and X2 has an effect on the regression
coefficients.
Statistical inference
When we want to make inferences about β we assume that the errors are inde-
pendent and normally distributed, i.e.
ε ∼ Nn(0, σ² In). (3.1)
Then
y ∼ Nn(Xβ, σ² In) (3.2)
and
β̂LS ∼ Np(β, σ² (XᵗX)⁻¹). (3.3)
Note that (3.2) does not say that the {yi, i = 1, . . . , n} follow a common uni-
variate normal distribution. It says that at a certain x, the corresponding
response variable y is normally distributed. In particular, yi ∼ N (xti β, σ 2 ) for
i = 1, . . . , n. This is in general difficult to check as we often only have one
measurement for each xi . Normality of the residuals on the other hand can be
verified using residual plots (see Section 3.7).
Moreover it can be shown that
(n − p) s²/σ² ∼ χ²n−p
and that β̂j and s² are independent. Consequently,
(β̂j − βj)/s(β̂j) ∼ tn−p
H0 : β j = 0
H1 : βj ≠ 0
These t-values and their corresponding p-values are usually reported in the out-
put of an analysis with a statistical software package. If the p-value is smaller
than α, we reject the H0 hypothesis in favor of the alternative.
and reject H0 if 0 does not belong to CI(βj, α). Note that the quantile tn−p,α/2 satisfies
P(T > tn−p,α/2) = α/2 with T ∼ tn−p.
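In R these individual tests and intervals are directly available; a sketch for the fuel consumption model:

fit <- lm(Fuel ~ Weight + Disp., data = fuel.frame)
summary(fit)$coefficients                # estimates, s(beta_j_hat), t values, p-values
confint(fit, level = 0.95)               # beta_j_hat -/+ t_{n-p,alpha/2} s(beta_j_hat)
qt(1 - 0.05 / 2, df = df.residual(fit))  # the quantile t_{n-p,alpha/2} itself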
Remember that α is the probability of a type I error, i.e. P(H0 is rejected | H0 is correct) = α.
Example:
Suppose we fit a regression model with three slope parameters:
H0 : β2 = 0 and β3 = 0.
Hence,
P(H0 is rejected | H0 is correct) ≤ 1 − (1 − 2α) = 2α.
SCI(βj, α) = CI(βj, α/2).
In general, simultaneous confidence intervals for g parameters are given by
[β̂j − tn−p,α/(2g) s(β̂j), β̂j + tn−p,α/(2g) s(β̂j)].
Because the simultaneous confidence intervals are wider than the individual
confidence intervals, they define a much larger region in Rg . Therefore they
yield a larger type II error, i.e. the probability that the H0 hypothesis will not
be rejected although the alternative is true, will be larger. Equivalently, the probability to detect that H1 is correct will be smaller.
Another disadvantage of this procedure is that the correlation between the pa-
rameter estimates is not taken into account. A partial F-test, which is based
on Σ̂(β̂) attains the correct significance level. Under H0 we obtain the reduced
model:
yi = β0 + β1 xi1 + . . . + βp−q−1 xi,p−q−1 + εi .
Let SSEp−q denote the error sum of squares under this reduced model, i.e.
SSEp−q = SSE(X1 , . . . , Xp−q−1 ) and let SSEp be the error sum of squares under
the full model. Thus SSEp = SSE(X1 , . . . , Xp−1 ). Under condition (3.1) it can
be shown that
F = [(SSEp−q − SSEp)/q] / [SSEp/(n − p)] ∼H0 Fq,n−p (3.5)
This test statistic can as well be described using the extra sum of squares.
By (2.34) we have that
MSR(Xp−q, . . . , Xp−1 | X1, . . . , Xp−q−1) = (1/q) [SSR(Xp−q | X1, . . . , Xp−q−1) + SSR(Xp−q+1 | X1, . . . , Xp−q) + . . . + SSR(Xp−1 | X1, . . . , Xp−2)].
The Body Fat data study the relation of amount of body fat (Y ) to three
possible predictor variables: triceps skinfold thickness (X1 ), thigh circumference
(X2 ) and midarm circumference (X3 ). Measurements are taken on 20 healthy
women between 25 and 34 years old. Assume we want to test:
H0 : β2 = β3 = 0
H1 : not both β2 and β3 equal zero
As F2,16,0.05 = 3.63, the p-value of our test is 5% and we are at the boundary of
the decision rule. At the 1% significance level e.g. we would not reject the H0
hypothesis.
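In R the partial F-test is carried out by fitting the reduced and the full model and comparing them with anova(); a sketch for the body fat example, assuming the data are available as a data frame bodyfat with columns fat, triceps, thigh and midarm (hypothetical names):

reduced <- lm(fat ~ triceps, data = bodyfat)                   # model under H0
full    <- lm(fat ~ triceps + thigh + midarm, data = bodyfat)  # full model
anova(reduced, full)   # F = ((SSE_{p-q} - SSE_p)/q) / (SSE_p/(n-p))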
SSEp−q − SSEp = bqᵗ Vqq⁻¹ bq.
H0 : β1 = β2 = . . . = βp−1 = 0
H1 : not all βj equal zero
The test statistic is derived from the partial F-test (3.6) with q = p − 1:
F = MSR/MSE ∼H0 Fp−1,n−p (3.7)
From (2.28) and (3.7) it can easily be derived that this F-statistic is equivalent to:
F = [R²/(p − 1)] / [(1 − R²)/(n − p)].
Its value is usually reported in a statistical package:
Eα = {x ∈ Rᵖ | (x − β̂LS)ᵗ (XᵗX) (x − β̂LS) / (p s²) ≤ Fp,n−p,α}.
H0 : β = β0
H1 : β ≠ β0
Again, this procedure attains the correct significance level by employing the
covariance matrix of β̂. A drawback is that if the H0 hypothesis is rejected,
we cannot directly deduce statements about the individual parameters. This is
why individual or simultaneous confidence intervals for β0j are still useful. The
geometric differences between the two types of tests are illustrated in Figure 5.1.
If the correlation between the parameter estimates is large, the ellipsoid will be
more elongated, and tests based on confidence intervals will be too conservative.
So, it is very important to look at the correlation of the regression parameters.
summary(Fuelfit,correlation=TRUE)
The last part of the output now yields the correlation between the three param-
eter estimates for the fuel consumption data:
Correlation of Coefficients:
(Intercept) Weight
Weight -0.90
Disp. 0.47 -0.80
Clearly, the correlation between the coefficient estimates of Weight and Disp. is very high in absolute value, as in Figure 5.1. Hence, the large p-value for Disp. is not very informative.
This high correlation is due to a high correlation between Weight and Disp.:
cor(Weight,Disp.)
[1] 0.8032804
Here, cor(Weight, Disp.) = −cor(β̂1, β̂2), but this equality is not satisfied in general.
H0 : Cβ = 0 (3.8)
H1 : Cβ ≠ 0
Example 1:
H0 : β1 = β2 , β3 = 0
Example 2:
H0 : β1 = β2 = . . . = βp−1 = 0
Example 3:
H0 : βp−q = βp−q+1 = . . . = βp−1 = 0
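Such general linear hypotheses can be tested in R with linearHypothesis() from the car package (assuming it is installed); a sketch for Example 1, reusing the hypothetical body fat model from the earlier sketch:

library(car)
# H0: beta1 = beta2 and beta3 = 0, i.e. C beta = 0 with two rows in C
linearHypothesis(full, c("triceps - thigh = 0", "midarm = 0"))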
with
Var(ŷ0) = x0ᵗ Σ(β̂) x0.
A prediction interval for a new observation
y0 = x0ᵗ β + ε0
is constructed as follows. Consider the random variable ŷ0 − y0 = x0ᵗβ̂ − x0ᵗβ − ε0. It holds that
This interval is larger than the confidence interval for the mean response be-
cause it also includes the uncertainty given by ε0 .
We notice that the confidence and the prediction intervals become larger as we
move away from the mean of the data. This illustrates how dangerous it is to
draw conclusions about an observation with x-values outside the range of the
observed xi values. This is called extrapolation.
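In R both intervals are obtained with predict(); a sketch for the Old Faithful model at a waiting time of 80 minutes:

fit.of <- lm(eruptions ~ waiting, data = faithful)
x0 <- data.frame(waiting = 80)
predict(fit.of, x0, interval = "confidence")   # for the mean response E[Y | x0]
predict(fit.of, x0, interval = "prediction")   # for a new observation y0 (wider)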
1. non-normality
with hii the ith diagonal element of the hat matrix H. Note that the normality assumption can also be assessed formally through the Shapiro-Wilk statistic or the Kolmogorov-Smirnov test.
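A sketch of these checks in R, for the fuel consumption model:

fit <- lm(Fuel ~ Weight + Disp., data = fuel.frame)
r <- rstandard(fit)     # standardized residuals
shapiro.test(r)         # Shapiro-Wilk test of normality
qqnorm(r); qqline(r)    # normal quantile plot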
[Residual diagnostic plots: residuals and standardized residuals versus fitted values (Residuals vs Fitted), and residuals versus Weight and versus Disp.]
Polynomial regression
Note that the β2 coefficient in (4.1) is often denoted as β11 to indicate that it
is the parameter related to X12 . The response function of this quadratic model
is a parabola as in Figure 7.4:
Polynomial models may provide good fits, but one should be careful with ex-
trapolation! Figure 7.4 shows why extrapolation is dangerous: beyond x = 2
the curve is descending (a) or increasing (b), and this might not be appropriate.
Often, xi and xi² will be highly correlated. Example: generate xi ∼ N(5, 4) for i = 1, . . . , 40. Then cor(xi, xi²) = 0.97, but cor(xi − x̄, (xi − x̄)²) = −0.09.
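This effect is easy to reproduce in R (the seed below is arbitrary, so the correlations will only approximate the quoted values):

set.seed(123)
x <- rnorm(40, mean = 5, sd = 2)    # x_i ~ N(5, 4)
cor(x, x^2)                         # close to 1
cor(x - mean(x), (x - mean(x))^2)   # close to 0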
To avoid multicollinearity, it is thus advisable to center the regressors. We then
obtain the model
if we let x′i1 = xi1 − x̄1. To avoid carrying two parametrizations (with β and β′), we will
assume in this chapter that the regressors xi are centered, and we will denote
the corresponding regression parameters with β.
The centering of the regressors also protects against round-off errors when we solve the normal equations. In general, these involve among others the cross-products Σi xij xik and the sums of squares Σi xij². Here, this implies terms of the form Σi xij xij = Σi xij², Σi xij xij² = Σi xij³ and Σi xij² xij² = Σi xij⁴.
This is called a cubic regression model, with a response function shown in Figure
7.5.
Higher-order terms in xi are rarely used because the interpretation of the coefficients becomes difficult and the model may be erratic for interpolations and small extrapolations.
yi = β0 + β1 xi1 + β2 xi2 + β11 xi1² + β22 xi2² + β12 xi1 xi2 + εi. (4.5)
Figure 7.6 shows an example of such a quadratic response surface. This model
can easily be extended to a second-order model with three predictor variables.
Then (at most) three interaction terms can be included: β12 xi1 xi2 , β13 xi1 xi3
and β23 xi2 xi3 .
Consequently,
Hence, the effect of X1 for a given level of X2 depends on the level of X2 , and
vice versa. Figure 7.10 illustrates the effect on the response function when an
interaction term is present in the model. Picture (a) shows the response function
without an interaction term. Here,
We now see that the slope of the response function at X2 = 3 is larger than the
slope at X2 = 1. When both β1 and β2 are positive, we say that the interaction
is of reinforcement or synergistic type when β12 is positive. A negative β12 ,
as in Figure 7.10(c), yields an interaction effect of interference or antagonistic
type. A three-dimensional representation of these response functions is given in
Figure 7.11.
When a polynomial term of a given order is retained, then all related terms of
lower order are usually also retained in the model. This can be seen as follows.
Consider the quadratic model (4.1). Geometrically, the fitted parabola attains
its minimum at x = −β̂1/(2β̂11) if β̂11 is positive. If we drop β1 from the model, we force this minimum to lie at x = 0, which is usually an arbitrary restriction.
Often we want to express the final model in terms of the original (non-centered)
observations. This can be easily done. For the quadratic model e.g. we have for
the original variables, see (4.1):
4.4 Example
We consider the Power Cells data set, presented in Table 7.9. The response
variable (Y ) is the life of a power cell, measured in terms of the number of
The regressors were first centered and standardized in order to obtain the coded
values -1, 0 and 1 for each regressor. Note that this was possible because in this
experiment both regressors were controlled at three levels. Doing so, the corre-
lation between X1 and X12 was reduced from 0.991 to zero, and the correlation
between X2 and X22 from 0.986 to zero.
The researcher decided to fit the second-order polynomial model (4.5). The
output from R:
Call:
lm(formula = power ~ charge * temp + I(charge^2) + I(temp^2))
Residuals:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 162.84 16.61 9.805 0.000188 ***
charge -55.83 13.22 -4.224 0.008292 **
temp 75.50 13.22 5.712 0.002297 **
I(charge^2) 27.39 20.34 1.347 0.235856
I(temp^2) -10.61 20.34 -0.521 0.624352
charge:temp 11.50 16.19 0.710 0.509184
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(Powerfit))
Now, let us test whether the quadratic terms and the interaction term are
significant:
H0 : β11 = β22 = β12 = 0.
The partial F-statistic is F = [(1645.97 + 284.93 + 529.00)/3] / 1048.09 = 0.78. As F3,5,0.05 = 5.41,
we do not reject the H0 hypothesis and we can thus simplify the model.
4.5 Detecting curvature
Then
ei = yi − ŷi
   = (β0 − β̂0) + (β1 − β̂1) xi1 + β11 xi1² + εi
   = γ0 + γ1 (β̂0 + β̂1 xi1) + γ2 (β̂0 + β̂1 xi1)² + εi
for certain values of γ0 , γ1 and γ2 . This shows that both a (xi1 , ei ) and a (ŷi , ei )
plot will show a quadratic curve.
Example: assume e.g. that the true model includes a quadratic term in Xj , thus
We first approximate the true response curve with only a linear term in Xj . If
then the other estimated coefficients β̂k for k ≠ j are close to the true βk,
yi = β0 + β1 xi1 + . . . + βj g(xij ) + . . . + εi
as long as β̂k is close to βk for all k ≠ j if we fit the full linear model.
One useful property of partial residuals is that the LS regression of e′i versus xij yields a slope β̂j (which is the parameter estimate for Xj in the full model) and a zero intercept. Thus partial residual plots also provide information on the direction and magnitude of the linearity as well as the nonlinearity of Xj. Plots
of the residuals ei versus xij on the other hand have a zero slope and intercept,
due to (2.18) and (2.19), so they only show deviations from linearity.
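A sketch in R of a partial residual plot for the regressor Weight in the fuel consumption model; termplot() offers the same plot directly:

fit <- lm(Fuel ~ Weight + Disp., data = fuel.frame)
pres <- residuals(fit) + coef(fit)["Weight"] * fuel.frame$Weight  # e_i' = e_i + beta_j_hat x_ij
plot(fuel.frame$Weight, pres, ylab = "Partial residual")
abline(lm(pres ~ fuel.frame$Weight))    # slope equals coef(fit)["Weight"], intercept zero
## termplot(fit, partial.resid = TRUE)  # equivalent built-in plot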
In Figure 12.6 we see the three partial residual plots of a regression model with
three predictors, including the LS slope estimate. Figure 12.6(a) slightly sug-
gests a cubic regression in the variable Education, Figure 12.6(c) a quadratic
term in Percentage of Women, whereas the curve in Figure 12.6(b) shows that
a logarithmic transformation of Income could probably improve the fit.
Categorical predictors
In the simplest case, we have one continuous regressor and one categorical pre-
dictor that takes on only two different values.
Y : the number of months elapsed between the time the first firm adopted
the innovation and the time the given firm adopted the innovation
X1 : the size of the firm, measured by the amount of total assets, in million $
T2 : the type of the firm: stock company or mutual company
For a stock company (X2 = 1):
E[Y|X1] = β0 + β1 X1 + β2 = (β0 + β2) + β1 X1
and for a mutual company (X2 = 0):
E[Y|X1] = β0 + β1 X1 + 0 = β0 + β1 X1.
In the (X1, Y) space, both lines are thus parallel. The parameter β0 is the expected value for mutual companies at x1 = 0, and β2 indicates how much higher the elapsed time is for stock firms than for mutual firms, for any given size of firm, see Figure 11.1.
In general, β2 shows how much higher or lower the mean response line is for the
class coded 1 than the line for the class coded 0, for any given level of X1. Alternatively one can define
X2^(3) = 1 if firm i is a stock company, and X2^(3) = −1 if firm i is a mutual company,
which gives the model
yi = β0 + β1 xi1 + β2 xi2^(3) + εi. (5.3)
Now the difference between the expected time of a stock firm and the expected
time of a mutual firm at a given x1 is expressed by 2β2 ! The reference line
y = β0 + β1 x1 then lies in between the other two response functions.
Note that we only need one binary variable X2 although T2 has two levels. If
we would e.g. define
X2 = 1 if firm i is a stock company, and X2 = 0 if firm i is a mutual company,
together with
X3 = 0 if firm i is a stock company, and X3 = 1 if firm i is a mutual company,
then our design matrix would not have full rank, because xi2 + xi3 = 1, which equals the intercept column. Consequently the LS estimator would not be unique.
Example.
Assume that the economist is most interested in the effect of type of firm (T2) on the elapsed time and wishes to obtain a 95% confidence interval for the mean increase of the time of stock firms compared to mutual firms. Then it is
recommended to work with the binary variable X2^(1). In R we obtain
attach(firm)
mst <- lm(Months ~ Size + Type)
coefficients(summary(mst))
ŷi = 33.874 − 0.102 xi1 + 8.055 xi2^(1)
So, with 95% confidence, we conclude that on the average, stock companies tend
to adopt the innovation between 5 and 11 months later than mutual companies,
for any given size of firm. A scatter plot of the data and the two regression
lines are shown in Figure 5.1. This is the default coding scheme in R.
[Scatterplot of Months versus Size with the fitted regression lines for stock and mutual companies.]
model.matrix(mst)
options(contrasts = c("contr.helmert","contr.poly"))
mst2 <- lm(Months ~ Size + Type)
summary(mst2)
which yields
Call:
lm(formula = Months ~ Size + Type)
Residuals:
Min 1Q Median 3Q Max
-5.6915 -1.7036 -0.4385 1.9210 6.3406
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.901804 1.770041 21.413 9.78e-14 ***
Size -0.101742 0.008891 -11.443 2.07e-09 ***
Type1 4.027735 0.729553 5.521 3.74e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model.matrix(mst2)
Note that the fitted values and consequently also the residuals are the same whether we use X2^(1) or X2^(3). Hence, the MSE and the overall F-test yield the same results.
with X2 = X2^(1) defined in (5.1). The meaning of the regression coefficients now becomes, for a stock company (X2 = 1):
E[Y|X1] = β0 + β1 X1 + β2 + β3 X1 = (β0 + β2) + (β1 + β3) X1
and for a mutual company (X2 = 0):
E[Y|X1] = β0 + β1 X1.
Call:
lm(formula = Months ~ Size * Type)
Residuals:
Min 1Q Median 3Q Max
-5.7144 -1.7064 -0.4557 1.9311 6.3259
Coefficients:
Estimate Std. Error t value Pr(>|t|)
The large p-value for β3 confirms that the simplified regression model (5.2)
without interaction term is appropriate for this data set.
Remark:
1. The model with the interaction term (5.4) is almost identical to the model
in which we assume a separate regression line for both groups that are
defined by the dichotomous predictor variable. The only difference is
that model (5.4) assumes that the data points of both classes show the
same variability around their regression line. Doing so, tests about the
equality of the slopes and the intercepts become very easy to apply.
This model describes regression lines that have the same intercept but different slopes, which is a peculiar specification that is generally of no substantive interest. Similarly, the model which retains β3 but removes β1:
has a zero slope for the class coded X2 = 0 which is usually too restrictive.
With Y = tool wear, and X1 = tool speed, a first-order regression model is:
yi = β0 + β1 xi1 + β2 xi2 + β3 xi3 + β4 xi4 + εi (5.5)
The response functions are again lines with the same slope β1 for all values of
T2 , see Figure 11.5. The coefficients β2 , β3 and β4 indicate how much higher
(lower) the response functions are for tool models M1, M2 and M3 than for tool
model M4, for any given level of tool speed.
If interaction effects are present, the regression model (5.5) becomes:
yi = β0 + β1 xi1 + β2 xi2 + β3 xi3 + β4 xi4 + β5 xi1 xi2 + β6 xi1 xi3 + β7 xi1 xi4 + εi .
This model again implies that each tool model has its own regression line, with
different intercepts and slopes for the different tool models.
Without interaction terms the response functions are:

            X2 = 0                     X2 = 1
X3 = 0      β0 + β1 X1                 (β0 + β2) + β1 X1
X3 = 1      (β0 + β3) + β1 X1          (β0 + β2 + β3) + β1 X1

With interaction terms they become:

            X2 = 0                     X2 = 1
X3 = 0      β0 + β1 X1                 (β0 + β2) + (β1 + β4) X1
X3 = 1      (β0 + β3) + (β1 + β5) X1   (β0 + β2 + β3 + β6) + (β1 + β4 + β5) X1
Remarks.
• If all the explanatory variables are qualitative, the models are called anal-
ysis of variance models.
• If the model contains qualitative and quantitative regressors, but the main
variables of interest are the qualitative ones, it is called an analysis of
covariance model.
5.3 Piecewise linear regression
Indicator variables can also be used when the regression of Y on X follows a cer-
tain linear relation in some range of X but follows a different relation elsewhere.
E[Y|X1] = β0 + β1 X1 + β2 (X1 − 40) X2 + β3 X2
with X2 = I(X1 > 40). Then for X1 ≤ 40 the response function becomes
E[Y|X1] = β0 + β1 X1
so β2 represents the difference in the slopes, and β3 the difference in the mean responses at X1 = 40.
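A sketch of fitting the continuous version of this piecewise linear model (without the jump term β3 X2) in R, on simulated data; the variable names and the generated data are hypothetical:

set.seed(1)
x1 <- runif(100, 20, 60)
y  <- 10 + 0.5 * x1 + 1.2 * pmax(x1 - 40, 0) + rnorm(100)  # true slope change 1.2 at 40
fit.pw <- lm(y ~ x1 + I(pmax(x1 - 40, 0)))
coef(fit.pw)   # intercept, slope below 40, and the difference in slopes beta2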
Transformations
Figure 4.1 shows several of those power transformations. Note that this family
of transformations is monotone increasing in X, whereas the simple form (6.1)
is decreasing if λ < 0. Moreover, for λ = 0, transformation (6.1) would be use-
less (X 0 = 1), whereas the logarithmic transformation is often very appropriate.
The effect of a power transformation is the following:
1. if λ < 1 large values of X are compressed, whereas small values are more
spread out.
2. if λ > 1 the inverse effect takes place: large values are more dispersed,
whereas small values are compressed.
The first property makes the power transformation interesting if the distribu-
tion of X is skewed to the right, the second if X is skewed to the left (which
occurs less in practice). Consider e.g. the distribution of income in Figure 4.2.
The non-parametric density estimate clearly shows a right-tailed distribution.
The log of income on the other hand is more symmetric, as illustrated in Figure
4.3.
Remarks:
• The power transformation is not very effective when the ratio between the
largest and the smallest value is small. Example:
X : 2001 2002 2003 2004 2005
log(X) : 7.6014 7.6019 7.6024 7.6029 7.6034
Here, 2005/2001 = 1.002 ≈ 1. When we first subtract 2000 from the data,
we obtain a ratio of 5/1 = 5:
X − 2000 : 1 2 3 4 5
log(X − 2000) : 0 0.6931 1.0986 1.3863 1.6094
and then the logarithmic transformation has more effect.
The logit transformation (Figure 4.15) removes the boundaries of the scale,
spreads out the tails of the distribution and makes the transformed vari-
able symmetric around zero. It is the inverse of the cumulative distribution
function of the logistic distribution and is essential in logistic regression.
2. The probit transformation uses the inverse of the standard normal distri-
bution:
probit(P) = Φ⁻¹(P)
3. Also the arcsine square root transformation arcsin(√P) has a similar shape.
6.3.2 Box-Cox transformation
A more sophisticated approach to transform the response variable is the Box-
Cox transformation. The object of this transformation is to normalize the error
distribution, to stabilize the error variance and to straighten the relation be-
tween Y and X.
yi^(λ) = β0 + β1 xi1 + β2 xi2 + . . . + βp−1 xi,p−1 + εi
with εi i.i.d. N(0, σ²), and yi^(λ) defined by (6.2). Note that all the yi must be
positive, otherwise a constant should first be added. For a particular choice of
λ, the maximized log-likelihood (profile log-likelihood) is
log L(λ) = const − (n/2) log σ̂²(λ) + (λ − 1) Σ_{i=1}^n log yi
where σ̂²(λ) = Σi ei²(λ)/n and the ei(λ) are the least squares residuals from the LS fit of the model for that value of λ.
This stems from the fact that the likelihood-ratio statistic G = −2(log L(λ) − log L(λ̂)) asymptotically follows a χ²1 distribution. Usually a plot of
the log-likelihood log L(λ) versus λ is made, with the 95% confidence interval
for λ indicated, as in Figure 12.8. Then a value of λ̂ is selected which belongs
to this confidence interval and which coincides with rounded numbers such as
-1.5, -1, -0.5, 0, 0.5, 1, . . . .
Remark that the Box-Cox methodology assumes that there exists a λ such
that on the transformed scale the assumptions of the general linear model are
fulfilled. It is thus important to verify (using residual plots and normal quantile
plots) whether the proposed transformation indeed improved the appropriate-
ness of the model assumptions.
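In R the profile log-likelihood plot is produced by boxcox() from the MASS package; a sketch for the fuel consumption model:

library(MASS)
bc <- boxcox(lm(Fuel ~ Weight + Disp., data = fuel.frame),
             lambda = seq(-2, 2, by = 0.1))  # plots log L(lambda) with the 95% CI
bc$x[which.max(bc$y)]                        # lambda with maximal likelihood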
The second Gauss-Markov condition (2.3) states that the error variance is ev-
erywhere the same around the regression surface. Nonconstant error variance
is called heteroscedasticity. In that case the least squares estimator is still un-
biased and consistent, but the variances of the parameter estimates tend to be
large and thus affect the tests of hypothesis substantially. Also, s²(XᵗX)⁻¹ need no longer be an unbiased estimate of the covariance matrix of β̂LS.
Heteroscedasticity can often be seen through residual plots. If σi2 varies with
E[Y |xi ], then a plot of the residuals, which are estimates of εi , against the
fitted values ŷi , which are estimates of E[Y |xi ], might reveal that the residuals
are more spread out for some values of ŷi than for others. A typical example is
shown in Figure 6.2. Because the least-squares residuals have unequal variances
even in the homoscedastic case, it is preferable to use the standardized residuals.
Also plots of the absolute (standardized) residuals or the squared residuals versus
ŷi are often used. Since E[ei²] ≈ σi², we notice that the squared residual ei² is an estimator of the variance σi², and that the absolute residual |ei| is an estimate of the standard deviation σi.
Sometimes the variance of the errors varies with one or more of the regressors
X. Therefore residual plots versus each of the independent variables might also
be appropriate.
Assume that Yx has mean µx and variance σx² and that g is a function of Yx such that E[g(Yx)] can be well approximated with g(E[Yx]) = g(µx). The Taylor expansion of g(Yx) around µx gives:
g(Yx) ≈ g(µx) + g′(µx)(Yx − µx).
Consequently
Var[g(Yx)] ≈ [g′(µx)]² σx².
Examples.
6.4.3 Weighted least squares regression
Consider the model
yi = β0 + β1 xi1 + β2 xi2 + . . . + βp−1 xi,p−1 + εi (6.3)
for i = 1, . . . , n, with εi independent N(0, σi²). Moreover we assume that the variances σi² are known up to a constant of proportionality:
σi² = σ²/wi
with wi known weights, and σ² unknown. The ratio of two variances σj² and σk² is then indeed known, since σj²/σk² = wk/wj. Let W = diag(w1, . . . , wn); then
Σ(ε) = σ² W⁻¹. (6.4)
This generalized linear model (6.3) is equivalent to the general linear model:
√wi yi = β0 √wi + β1 √wi xi1 + β2 √wi xi2 + . . . + βp−1 √wi xi,p−1 + √wi εi
since the √wi εi are independent N(0, σ²). Or
y (W ) = X (W ) β + ε(W )
We thus apply the ordinary least squares estimator (OLS) on the weighted vari-
ables, where observations with a large variance get a small weight and those
with a small variance a large weight (wi ∼ 1/σi2 ).
It can be shown that β̂WLS is the BLUE estimator of β in the generalized linear model (6.3) that satisfies (6.4). The variance-covariance matrix of β̂WLS is given by
Σ(β̂WLS) = σ² (XᵗWX)⁻¹.
The residuals of the transformed model are
e^(W) = y^(W) − ŷ^(W) = W^(1/2) y − W^(1/2) X β̂WLS = W^(1/2) (y − ŷ) with ŷ = X β̂WLS, (6.5)
from which
Σ̂(β̂WLS) = σ̂² (XᵗWX)⁻¹ (6.6)
follows.
Estimation of the variance function.
Because the error variances σi2 or the weights wi are in general not known, we
are forced to estimate them. As the σi2 often vary with one or several predictor
variables or with the mean response E(yi ), the following procedure might be
helpful:
1. Fit the regression model by ordinary least squares (OLS) and analyze the
residuals.
2. Regress the squared residuals or the absolute residuals on the fitted values
or one or several independent variables. This makes sense because the
squared residuals e2i estimate the variances σi2 , while the absolute residuals
|ei | are estimates of the standard deviations σi .
3. Use the fitted values from the estimated variance or standard deviation
to obtain the weights wi .
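A sketch of this two-step procedure in R; it also shows a plausible construction of the weights weightblood used in the blood pressure example below (the text does not show how they were computed, so this construction is an assumption). The variables Bloodpr and Age are assumed to be attached as in that example.

ols  <- lm(Bloodpr ~ Age)              # step 1: ordinary least squares
sfit <- lm(abs(residuals(ols)) ~ Age)  # step 2: |e_i| estimates sigma_i
weightblood <- 1 / fitted(sfit)^2      # step 3: w_i proportional to 1/sigma_i^2
wls <- lm(Bloodpr ~ Age, weights = weightblood)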
Note that the estimated standard deviations of the coefficients, derived from (6.6),
are now only approximate, because the estimation of the variances σi2 has in-
troduced another source of variability. The approximation will often be quite
good when the sample size is not too small.
Call:
lm(formula = Bloodpr ~ Age)
Residuals:
Min 1Q Median 3Q Max
-16.4786 -5.7877 -0.0784 5.6117 19.7813
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 56.15693 3.99367 14.061 < 2e-16 ***
Age 0.58003 0.09695 5.983 2.05e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The scatter plot of the data, the plot of residuals versus age, and absolute
residuals versus age using OLS clearly demonstrate heteroscedasticity.
plot(Age,Bloodpr)
abline(lm(Bloodpr~Age))
resid <- residuals(lmba)
plot(Age,resid)
plot(Age,abs(resid))
[Plots for the blood pressure data: Bloodpr versus Age with the OLS line; residuals versus Age; absolute residuals versus Age.]
When we regress the absolute residuals versus Age we obtain the estimated
expected standard deviation:
Call:
lm(formula = abs(resid) ~ Age)
Residuals:
Min 1Q Median 3Q Max
-9.7639 -2.7882 -0.1587 3.0757 10.0350
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.54948 2.18692 -0.709 0.48179
Age 0.19817 0.05309 3.733 0.00047 ***
---
Call:
lm(formula = Bloodpr ~ Age, weights = weightblood)
Weighted Residuals:
Min 1Q Median 3Q Max
-2.0230 -0.9939 -0.0327 0.9250 2.2008
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 55.56577 2.52092 22.042 < 2e-16 ***
Age 0.59634 0.07924 7.526 7.19e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
which is not so different from the OLS line. Therefore, an extra reweighting step need not be considered. We see that the standard error of β̂1 has decreased from 0.097 in the OLS analysis to 0.079 in the WLS analysis. Consequently the inference about β1 has become more precise.
Finally we test whether the heteroscedasticity is now gone. Note that R computes the residuals as ei(β̂WLS) = yi − ŷi(β̂WLS). These will still show the heteroscedasticity. By (6.5) it is the residuals ei^(W)(β̂WLS) = √wi (yi − ŷi(β̂WLS)) which should have constant variance.
plot(Age,resid(wlmba))
plot(Age,resid(wlmba)*sqrt(weightblood),ylab="Weighted residuals")
[Plots: residuals resid(wlmba) versus Age, and weighted residuals versus Age.]
• the presence of explanatory variables that are not related to the response variable increases the variance of the predicted values and hence decreases the model's predictive ability.
On the other hand, omitting important variables (or latent explanatory vari-
ables) leads to biased estimates of the regression coefficients, the error variance,
the mean responses and predictions of new observations.
This is illustrated in Figure 4.3. If too many variables are selected, too much of
the redundancy in the x-variables is used and the solution becomes overfitted.
The regression equation will be very data dependent and gives poor prediction
results. If too few variables are retained, it is called underfitting which means
that the model is not large enough to capture the important variability in the
data. The optimal number of variables is usually found in between the two
extremes. It is therefore often a good idea to consider several ‘good’ subsets of
explanatory variables.
[Scatterplot matrix of Log.Survival, Liver, Enzyme, Prognostic and Blood.Clotting.]
We take the first half of the data as training data to build a regression model. Table 8.1 shows part of these data, which are displayed graphically in the scatterplot matrix above. The first-order linear model (full model) for the data yields:
Call:
lm(formula = Log.Survival ~ Blood.Clotting + Prognostic + Enzyme +
Liver)
Residuals:
Min 1Q Median 3Q Max
-0.43500 -0.17591 -0.02091 0.18400 0.56192
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.851948 0.266258 14.467 < 2e-16 ***
Blood.Clotting 0.083684 0.028833 2.902 0.00554 **
Prognostic 0.012665 0.002315 5.471 1.51e-06 ***
Enzyme 0.015632 0.002100 7.443 1.37e-09 ***
Liver 0.032161 0.051465 0.625 0.53493
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Rp² = 1 − SSEp / SST
Subsets with a high Rp² coefficient (or equivalently with a low SSEp) are considered good. We know that Rp² always increases if we include additional variables in the model. Therefore it makes no sense to maximize Rp², but we should find the point where adding more variables is not worthwhile because it leads to a very small increase in Rp².
Figure 8.4 contains a plot of the Rp² values versus p, and its maximum for each
number of parameters p in the model. From this plot it can be seen that the
inclusion of the fourth variable Liver does not lead to a large increase in the
explained variance. This might be surprising because the correlation between
Liver and Log.Survival is the largest among all the pairwise correlations with
the response variable. This indicates that X1 , X2 and X3 contain much of the
information presented by X4 .
Ra² = 1 − [SSE/(n − p)] / [SST/(n − 1)] = 1 − MSEp / [SST/(n − 1)].
Since SST remains constant over all regression models, considering the ad-
justed Ra2 is equivalent to looking at the mean squared error MSEp = σ̂² =
(1/(n − p)) ∑_{i=1}^n e²i,p with ei,p the residuals from a model with p parameters. Although
SSEp+1 = ∑ e²i,p+1 ≤ SSEp = ∑ e²i,p ,
the MSEp does not necessarily decrease when a variable is added, because the
denominator n − p decreases as well.
When we consider the MSE criterion we can thus look for the subset(s) with
minimal MSE, or whose MSE is very close to the minimum. Figure 8.5 shows
the MSEp plot for the surgical unit data. Again, the fourth explanatory variable
Liver appears not to be needed in the model.
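The Rp2 , adjusted Ra2 and Cp values for all subsets can be computed conveniently in R; a minimal sketch, assuming the leaps package is available and that the training half is stored as surg.train (the subset construction below is an assumption):
library(leaps)
surg.train <- surgicalunit[1:54, ]   # training half (assumed ordering)
surg.all <- regsubsets(Log.Survival ~ Blood.Clotting + Prognostic + Enzyme + Liver,
                       data = surg.train, nbest = 4)
s <- summary(surg.all)
s$rsq     # R2_p for each subset
s$adjr2   # adjusted R2 (equivalent to ranking by MSE_p)
s$cp      # Mallows' Cp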
Cp = SSEp /s² − (n − 2p)     (7.1)
with s2 the MSE from the ’largest’ model, presumed to be a reliable unbiased
estimate of the error variance σ 2 .
γp = E(SSEp )/σ² − (n − 2p)
from which (7.1) follows. In order to minimize the total mean squared error we
prefer models with small Cp value.
Note that there are several Cp statistics for each p, except for the Cp value when
all variables are included; denote that value by CP . For this model
CP = (n − P )MSEP /s² − n + 2P = P
since s2 = MSEP .
It can be shown that E(Cp ) ≈ p
for an adequate model. When the Cp values for all regression models are thus
plotted against p, the models with little bias will be close to the line Cp = p.
Models with substantial bias (due to the omission of some predictors) will fall
substantially above this line.
The Cp plot of the surgical unit data clearly shows that only the subset with the
first three variables has little bias. Here, the Cp value falls below the line Cp = p,
because SSE1,2,3 = 0.1099 is only slightly larger than SSE1,2,3,4 = 0.1098 and
consequently MSE1,2,3 = 0.00220 < MSE1,2,3,4 = 0.00224.
from which the criterion AICp = n ln(SSEp /n) + 2p (up to an additive constant) follows. Models with a small AICp are preferred.
The AIC criterion is used in R to perform stepwise regression (see Section 7.3).
In a forward search strategy, one starts with a model of e.g. p − 1 explanatory
variables, and then includes the variable that yields the largest reduction in the
AIC.
PRESSp = ∑_{i=1}^n (yi − ŷi(i) )²     (7.2)
Models with small PRESSp values (or PRESSp /n) are considered good candi-
date models. The prediction error di = yi − ŷi(i) is also called the deleted residual
for the ith observation. It can be shown to be equal to
di = ei /(1 − hii )     (7.3)
and thus can be computed without recomputing the regression function.
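In R the PRESSp statistic can thus be obtained from a single fit; a small sketch using relation (7.3):
press <- function(fit) {
  d <- residuals(fit) / (1 - hatvalues(fit))  # deleted residuals d_i
  sum(d^2)                                    # PRESS_p
}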
and thus equals the squared t-value for the parameter test H0 : βj = 0 versus
H1 : βj ≠ 0.
The variable for which this Fj∗ is smallest is the candidate for deletion. If this
Fj∗ value falls below a predetermined limit (e.g. F1,n−P,α , or equivalently if the corresponding
p-value is larger than α), this variable is deleted and the procedure starts over
with the P − 1 remaining variables. Otherwise the process is stopped.
A drawback of this method is that a variable can never come back in the model
A drawback of this method is that a variable can never come back in the model
once it is deleted.
To illustrate the stepwise procedures we use a random half-sample of the surgical
unit data.
set.seed(1)
surgicalunit2 <- surgicalunit[sample(1:108,54),]
attach(surgicalunit2)
surg.full <- lm(Log.Survival ~ Blood.Clotting + Prognostic + Enzyme + Liver)
summary(surg.full)
Call:
lm(formula = Log.Survival ~ Blood.Clotting + Prognostic + Enzyme +
    Liver)
Residuals:
Min 1Q Median 3Q Max
-0.60295 -0.20763 -0.01256 0.20764 0.57249
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.870175 0.278091 13.917 < 2e-16 ***
Blood.Clotting 0.070782 0.030310 2.335 0.0237 *
Prognostic 0.014444 0.002477 5.830 4.27e-07 ***
Enzyme 0.014245 0.002226 6.399 5.66e-08 ***
Liver 0.041781 0.052601 0.794 0.4308
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Because F4∗ = 0.066² = 0.0044 < F1,54−5;0.05 = 4.04 (or equivalently because
the corresponding p-value 0.948 > 0.05) the fourth variable Liver will be re-
moved.
The procedure stops because all p-values are smaller than the limit.
In R the function stepAIC performs backward selection automatically based on
the AIC criterion. The variable whose removal results in the largest decrease
in AIC is dropped from the model. The procedure stops when AIC cannot
decrease anymore.
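The backward run shown below can be reproduced with a call of the following form (a sketch; stepAIC is in the MASS package):
library(MASS)
surg.back <- stepAIC(surg.full, direction = "backward")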
Start: AIC=-136.94
Log.Survival ~ Blood.Clotting + Prognostic + Enzyme + Liver
Step: AIC=-138.25
Log.Survival ~ Blood.Clotting + Prognostic + Enzyme
Note that the column "Sum of Sq" indicates the extra sum of squares
SSR(Xj |X1 , . . . , Xj−1 , Xj+1 , . . . , Xp ) whereas RSS = SSEp−1 .
In forward selection, one starts from the simple regressions of the response on each Xj and computes
Fj∗ = MSR(Xj )/MSE(Xj )
and the variable with the largest Fj∗ value is the candidate for the first addition.
If this Fj∗ value exceeds a predetermined level (or the p-value is lower than α),
the Xj variable is added. Otherwise none of the regressors are considered to be
helpful in the prediction of the response variable.
Assume variable X7 is entered, then the next partial Fj∗ values are
Fj∗ = MSR(Xj |X7 )/MSE(Xj , X7 )
and again the variable with the largest Fj∗ -value is included (if it is large enough).
As with the backward selection procedure, none of the variables can be removed
once they are entered in the model.
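The forward run shown below can be obtained with a call of the following form (a sketch; the scope argument lists the candidate regressors, and direction = "both" yields the stepwise trace further down):
surg.null <- lm(Log.Survival ~ 1)
surg.fwd <- stepAIC(surg.null, direction = "forward",
                    scope = ~ Blood.Clotting + Prognostic + Enzyme + Liver)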
Start: AIC=-77.58
Log.Survival ~ 1
Step: AIC=-101.66
Log.Survival ~ Liver
Step: AIC=-111.69
Log.Survival ~ Liver + Enzyme
Step: AIC=-133.24
Log.Survival ~ Liver + Enzyme + Prognostic
Step: AIC=-136.94
Log.Survival ~ Liver + Enzyme + Prognostic + Blood.Clotting
Stepwise regression can also be performed by looking at the increase and de-
crease of the AIC when removing and adding variables:
Start: AIC=-77.58
Log.Survival ~ 1
Step: AIC=-101.66
Log.Survival ~ Liver
Step: AIC=-111.69
Log.Survival ~ Liver + Enzyme
Step: AIC=-136.94
Log.Survival ~ Liver + Enzyme + Prognostic + Blood.Clotting
Step: AIC=-138.25
Log.Survival ~ Enzyme + Prognostic + Blood.Clotting
• collection of new data to check the model and its predictive ability
• use of a holdout sample to check the model and its predictive ability
The purpose of collecting new data is to be able to examine whether the regression
model developed from the earlier data still applies to the new data. This
is in particular of interest for exploratory observational studies, as they also
involve model building.
• by re-estimating the final model using the new data and comparing the
estimated regression coefficients and other characteristics of the fitted
model
• by re-estimating from the new data all the ’good’ subset models that had
been considered to see whether the selected regression model is still the
preferred one.
• the mean squared prediction error MSEP = ∑_{i=1}^m (yi − ŷi )²/m
computes the mean of the squared prediction errors of the new data (of
size m), and should be compared with MSE. If MSEP is much larger
than MSE, one should rely on the MSEP as an indicator of how well the
selected regression model will predict in the future.
To obtain reliable results, the training set should be large enough (remember,
n > 5p), otherwise the variances of the regression coefficients will be too large.
If data splitting is impractical for small data sets, the PRESS criterion (7.2),
PRESSp = ∑_{i=1}^n (yi − ŷi(i) )² ,
can be used instead to assess the predictive ability of the model.
If a data set for an exploratory observational study is very large, it can even be
split into three parts: one for developing the regression model, the second for
estimating the parameters and the third for validation. This approach avoids
bias resulting from estimating the regression parameters from the same data set
used for developing the model. On the other hand this approach yields larger
variances of the parameter estimates.
In any case, once the model has been validated, it is customary to use the entire
data set for estimating the final regression model.
Multicollinearity
attach(crew)
crew
  crew.size bonus.pay crew.productivity
6         6         2                53
7         6         3                61
8         6         3                60
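The summaries and ANOVA tables below are based on the following fits (a sketch; the object names are inferred from the anova() calls):
crew.lm12 <- lm(crew.productivity ~ crew.size + bonus.pay)
crew.lm1 <- lm(crew.productivity ~ crew.size)
crew.lm2 <- lm(crew.productivity ~ bonus.pay)
summary(crew.lm12)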
Call:
lm(formula = crew.productivity ~ crew.size + bonus.pay)
Residuals:
1 2 3 4 5 6 7 8
1.625 -1.375 -1.625 1.375 -2.125 1.875 0.625 -0.375
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.3750 4.7405 0.079 0.940016
crew.size 5.3750 0.6638 8.097 0.000466 ***
bonus.pay 9.2500 1.3276 6.968 0.000937 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(crew.lm12)
Response: crew.productivity
Df Sum Sq Mean Sq F value Pr(>F)
crew.size 1 231.125 231.125 65.567 0.0004657 ***
bonus.pay 1 171.125 171.125 48.546 0.0009366 ***
Residuals 5 17.625 3.525
Call:
lm(formula = crew.productivity ~ crew.size)
Residuals:
Min 1Q Median 3Q Max
-6.750 -3.750 0.125 4.500 6.000
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 23.500 10.111 2.324 0.0591 .
crew.size 5.375 1.983 2.711 0.0351 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(crew.lm1)
Response: crew.productivity
Df Sum Sq Mean Sq F value Pr(>F)
crew.size 1 231.12 231.125 7.347 0.03508 *
Residuals 6 188.75 31.458
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Call:
lm(formula = crew.productivity ~ bonus.pay)
Residuals:
Min 1Q Median 3Q Max
-7.000 -4.688 -0.250 5.250 7.250
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.250 11.608 2.348 0.0572 .
bonus.pay 9.250 4.553 2.032 0.0885 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(crew.lm2)
Response: crew.productivity
Df Sum Sq Mean Sq F value Pr(>F)
bonus.pay 1 171.12 171.125 4.1276 0.08846 .
Residuals 6 248.75 41.458
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We see that β̂1 = 5.375, the regression coefficient for X1 , is the same whether
or not X2 is also included in the model. The same holds for β̂2 = 9.250. This
is a general result, which can be most easily deduced from the estimate of β in
the standardized regression model (2.40), β̂′ = RXX −1 rXY .
If all the X variables are uncorrelated, RXX = Ip−1 , and thus β̂j′ = rjY only
depends on Xj and Y . This remains true for the original coefficients
β̂j = (sY /sj ) β̂j′ = sjY /s²j .
Comparing the ANOVA tables above, we see that the regression sum of squares due to X1 and X2 together can be
split into the SSR due to X1 alone and the SSR due to X2 alone, when X1 and
X2 are uncorrelated.
Now, consider an example where two predictor variables are perfectly correlated,
as in Table 7.8.
Here, the response function is not unique. Both Ŷ = −87 + X1 + 18X2 and
Ŷ = −7 + 9X1 + 2X2 yield the same fitted values and (zero) residuals.
We will illustrate some of these effects on the Body Fat data, relating the
amount of body fat (Y ) to triceps skinfold thickness (X1 ), thigh circumference
(X2 ) and midarm circumference (X3 ), measured on 20 healthy women aged
25 to 34.
attach(bodyfat)
bodyfat
From the correlation matrix RXX we deduce that X1 and X2 are highly corre-
lated.
print(cor(bodyfat),digits=2)
The third variable X3 is not highly correlated with X1 and X2 separately, but if we regress
X3 on X1 and X2 we obtain R² = 0.98. We see that the regression coefficient
for a given predictor varies a lot depending on which of the other predictors are
present in the model, and can even change sign, as for β̂2 . Also, the standard
error of the parameter estimates increases considerably when more variables are
added to the model.
variables in model    β̂1       β̂2        s(β̂1 )   s(β̂2 )
X1                    0.8572   -         0.1288   -
X2                    -        0.8565    -        0.1100
X1 , X2               0.2224   0.6594    0.3034   0.2912
X1 , X2 , X3          4.3341   -2.8568   3.0155   2.5820
This example illustrates again that a regression coefficient reflects the marginal
or partial effect of a predictor on the response variable, given the other variables
in the model!
Call:
lm(formula = body.fat ~ triceps + thigh)
Residuals:
Min 1Q Median 3Q Max
-3.9469 -1.8807 0.1678 1.3367 4.0147
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -19.1742 8.3606 -2.293 0.0348 *
triceps 0.2224 0.3034 0.733 0.4737
thigh 0.6594 0.2912 2.265 0.0369 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Coefficients:
(Intercept) triceps
triceps 0.73
thigh -0.93 -0.92
The individual tests
H0 : β1 = 0 and H0 : β2 = 0
are however both in favor of the null hypothesis. If we use the Bonferroni method
at the α = 5% significance level, we see that both p-values (0.47 and 0.037) are
larger than 0.025 = α/2. Reconsidering Figure 5.1 from Chapter 3, we notice
that the correlation between β̂1 and β̂2 is very high in absolute value (−0.92),
which leads to a very tight confidence ellipse that excludes (0, 0). But the
two univariate confidence intervals contain 0.
VIFj = 1/(1 − Rj² )
with Rj² the coefficient of determination obtained by regressing Xj on the other predictor variables.
It can be shown that VIFj equals the jth diagonal element of the inverse corre-
lation matrix of X.
VIFj = (RXX −1 )jj     (8.2)
For the Body Fat data, we have indeed large variance inflation factors for all
three regressors: VIF1 = 708.84, VIF2 = 564.34 and VIF3 = 104.61.
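These values can be computed via (8.2); a sketch, assuming the third predictor column is named midarm:
RXX <- cor(cbind(triceps, thigh, midarm))
diag(solve(RXX))   # VIF_j = jth diagonal element of RXX^{-1}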
The variance inflation factors measure how much the variances of the estimated
regression coefficients are inflated compared to when the predictor variables are
not linearly related. This can be seen as follows: in the standardized regression
model (2.40),
Σ(β̂′) = (σ′)² ((X′)t X′)−1 = (σ′)² RXX −1
with (σ′)² the error variance of the transformed data. Because of (8.2) we derive
that Var(β̂j′ ) = (σ′)² VIFj . In terms of the original variables, this yields
Var(β̂j ) = (sY /sj )² Var(β̂j′ ) = (sY² /sj² ) · σ² /((n − 1) sY² ) · VIFj = σ² VIFj / ((n − 1) sj² ).
As another diagnostic tool we can look at the eigenvalues of RXX . Because the
correlation matrix is symmetric and positive semi-definite, it can be decomposed
as
RXX = P L P t = ∑_{j=1}^{p−1} λj vj vjt     (8.3)
To judge the eigenvalues with respect to their size, we use the equality
∑_{j=1}^{p−1} λj = tr(RXX ) = p − 1
so we can
• drop one or several predictor variables that are highly correlated with the
remaining variables
Relation (8.4) illustrates again that the presence of small eigenvalues yields a
large sampling variability. Hence, to reduce the variance of β̂′, we can decide to
eliminate the eigenvectors for which the corresponding eigenvalue is too small.
If λk+1 , . . . , λp−1 are sufficiently small, this corresponds to setting
(Z t Z)+ = ∑_{j=1}^{k} λj −1 vj vjt .
and defining
β̂ + = (Z t Z)+ Z t y′ .
Equivalently, for each observation the scores on the first k principal components are computed as
ti = (P̃k,p−1 )t zi .
Next, the response variable y′ is regressed onto the scores. We thus consider
the regression model
yi′ = ti t α + εi
Note that
T t T = P̃ t Z t Z P̃ = P̃ t RXX P̃ = (P̃k,p−1 )t Pp−1,p−1 Lp−1,p−1 (Pp−1,p−1 )t P̃p−1,k = L̃k,k
and that
(Z t Z)+ (Z t Z)(Z t Z)+ = (Z t Z)+ .
Consequently
E(β̂ + ) = (Z t Z)+ Z t E(Y ′) = (Z t Z)+ Z t Z β′ = β′ − ∑_{j=k+1}^{p−1} vj vjt β′ .
On the other hand, the variance of β̂ + has decreased:
Σ(β̂ + ) = (Z t Z)+ Z t Σ(Y ′) Z (Z t Z)+ = σ² (Z t Z)+
hence
Var(β̂l + ) = σ² ∑_{j=1}^{k} λj −1 vlj²
whereas
Var(β̂l ′ ) = σ² ∑_{j=1}^{p−1} λj −1 vlj² .
Remarks.
• The PCR method is particularly useful when n < p. When there are
more variables than observations, there is always perfect multicollinearity
because rank(X) ≤ min(n, p) = n < p.
• PCR selects components which contain most of the variation in the re-
gressors. More sophisticated methods such as Partial Least Squares Re-
gression (PLS) compute components that maximize their covariance with
the response variable, with the goal of retaining components that are more
informative with respect to the regression model.
A very important issue in PCR is the choice of k, the optimal number of principal
components that are retained in the analysis. Some popular strategies are the
following:
• minimizing the root mean squared error of prediction on a validation set,
RMSEPk = √( (1/m) ∑_{i=1}^m (yi − ŷi,k )² ),
with ŷi,k the fitted response value for the ith case based on a PCR regres-
sion with k components, and m the number of observations in the vali-
dation set. The RMSEPk curve for k = 1, . . . , kmax often has the shape
of the upper curve of Figure 4.3 (in Chapter 7, Section 7.1). Its minimal
value then determines the chosen number of components, see Figure 5.2.
In the standardized regression model, the least squares estimator solves the normal equations RXX β = rXY . The ridge estimator β∗ is obtained by adding a small constant c ≥ 0 to the diagonal, i.e. by solving (RXX + c Ip−1 ) β∗ = rXY .
With c = 0 the ridge and the least squares estimators coincide. When c > 0 the
ridge estimator is biased, but has less variability.
It can be shown that the bias of β ∗ increases with c, whereas the variance
(expressed as the trace of the variance-covariance matrix) decreases with c.
The mean squared error combines the bias and the variance of an estimator.
For an estimator β̂ of a univariate parameter β:
MSE(β̂) = E(β̂ − β)² = Var(β̂) + (E(β̂) − β)² .
It has been shown that for any data set there always exists a value of c such that
the ridge estimator β∗ has a smaller TMSE than the least squares estimator β̂ LS .
To determine the constant c we will consider the ridge trace method, and the
variance inflation factors. The ridge trace plots the evolution of the ridge stan-
dardized regression coefficients βj∗ for different values of c, usually between 0
and 1.
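A ridge trace can be produced, e.g., with lm.ridge from the MASS package; a sketch (again assuming the third predictor is named midarm, with lambda playing the role of c):
library(MASS)
bf.ridge <- lm.ridge(body.fat ~ triceps + thigh + midarm,
                     lambda = seq(0, 1, by = 0.01))
plot(bf.ridge)   # standardized coefficients beta_j* versus lambda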
The VIF values for ridge regression are defined as for OLS: they measure for
each coefficient how large the variance of β̂j∗ is relative to what the variance
would be if the predictors were uncorrelated. It can be shown that VIFj for
ridge regression equals the jth diagonal element of the matrix
(RXX + c I)−1 RXX (RXX + c I)−1 .
In the Body fat example (Table 10.3) we see that the VIF’s decrease rapidly as
c changes from 0 towards 1. The constant c is then chosen as the smallest value
where the plot and the VIF’s become stable. Here, a small value of c in this stable region was employed.
Also notice that the R2 value only decreased slightly: from 0.8014 to 0.7818.
Since the total sum of squares for the transformed variables equals
SST = ∑_{i=1}^n (yi′ − ȳ′)² = 1,
we have
R² = 1 − SSE = 1 − ∑_{i=1}^n (yi′ − ŷi′ )² .
Real data sets often contain outlying observations. Although a precise definition
of outliers is hard to give, they are characterized as the observations that do not
follow the pattern of the majority of the data. In regression, data points can be
split into four types: regular observations, vertical outliers, good leverage points
and bad leverage points.
For simple regression, these different types of observations are illustrated in Fig-
ure 9.1. It is well-known that the least-squares estimator β̂ LS is very sensitive
to vertical outliers and bad leverage points.
[Figure 9.1: regular data together with a vertical outlier, a good leverage point and a bad leverage point.]
library(MASS)
phones
$year
[1] 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69
[21] 70 71 72 73
$calls
[1] 4.4 4.7 4.7 5.9 6.6 7.3 8.1 8.8 10.6 12.0
[11] 13.5 14.9 16.1 21.2 119.0 124.0 142.0 159.0 182.0 212.0
[21] 43.0 24.0 27.0 29.0
This data set contains six remarkable vertical outliers. It turned out that from
1964 to 1969 another recording system was used, giving the total number of
minutes of these calls. The LS fit has clearly been affected by the outlying y-
values, as shown in Figure 9.2. The robust LTS method, which will be defined in
Section 9.4.1, avoids the outliers and nicely fits the linear model of the majority
of the data.
attach(phones)
plot(year,calls)
phones.lm <- lm(calls ~ year)
abline(phones.lm)
text(70,100,"LS")
library(robustbase)
phones.wlts <- ltsReg(calls~year,alpha=0.75)
abline(phones.wlts,lty=2)
text(67,30,"LTS")
The standardized robust residuals are defined as (yi − ŷi,R )/sR ,
with ŷi,R the fitted values obtained by applying a robust regression method,
and sR a robust measure of scale. If the majority of the data points follows
the general linear model with normal errors, these standardized robust residu-
als approximately lie in [−2, 2] with a confidence of 95% and in [−2.5, 2.5] with
a confidence of 99%. The robust LTS method nicely detects the outliers, as
shown in the residual plot in Figure 9.3. On the other hand, none of the obser-
vations stands out in the index plot of the standardized least squares residuals
(Figure 9.4).
[Figure 9.2: Telephone data set with LS and LTS fit superimposed.]
[Figure 9.3: Telephone data set: Index plot of the standardized robust residuals.]
[Figure 9.4: Telephone data set: Index plot of the standardized least squares residuals.]
di = yi − ŷi(i) = ei /(1 − hii )
with ŷi(i) the fitted value for case i when this case is excluded from the data set
used to estimate the regression coefficients. It can be shown that
s(di ) = s(i) / √(1 − hii )
and that
e∗i = di /s(di ) = ei / ( s(i) √(1 − hii ) ) ∼ tn−p−1 .     (9.2)
Hence, the e∗i are called the studentized residuals. They can be computed
without refitting the model each time an observation is deleted, by using the
relation
e∗i = ei √[ (n − p − 1) / ( SSE (1 − hii ) − e²i ) ].
Here also, a plot of the studentized residuals of the Telephone data (Figure 9.5)
does not pinpoint the outliers. This is due to the fact that the outliers are not
isolated here: deleting a single outlier does not change the fit drastically!
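The studentized residuals can be computed without refitting; a sketch using studres() from the MASS package loaded above (base R's rstudent() gives the same values):
phones.studres <- studres(phones.lm)   # studentized residuals e*_i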
The index plot with cut-off lines at ±2.5 is then obtained with
plot(phones.studres,ylim=c(-3,3))
abline(h=c(-2.5,2.5))
[Figure 9.5: Telephone data set: Index plot of the studentized residuals.]
9.2.1 Residuals
Bad leverage points, which are outlying observations in the predictor space that
do not follow the linear model of the majority of the data points, also have a
large influence on the classical LS estimator. Let us illustrate this effect on
the stars data set. These data form the Hertzsprung-Russell diagram of the
star cluster CYG OB1, which contains 47 stars in the direction of Cygnus. The
regressor X is the logarithm of the effective temperature at the surface of the
star, and the response Y is the logarithm of its light intensity.
If we plot the studentized LS residuals (Figure 9.7) we cannot detect any devi-
ating observation, but the four outliers stand out in the plot of the standardized
LTS residuals (Figure 9.8).
[Figure 9.6: Stars data set with LS and LTS fit superimposed.]
[Figure 9.7: Stars data set: Index plot of the studentized LS residuals.]
[Figure 9.8: Stars data set: Index plot of standardized LTS residuals.]
Therefore we will need a metric within the X-space to compute the distance of
each observation to the center of the data cloud. For (p−1)-dimensional vectors
xi = (xi1 , . . . , xi,p−1 )t the classical Mahalanobis distance is defined as
MD(xi ) = √[ (xi − x̄)t S −1 (xi − x̄) ]     (9.3)
with
x̄ = (1/n) ∑_{i=1}^n xi
the sample mean and S the empirical covariance matrix of the xi . Both the sample
mean and the sample covariance matrix are however non-robust: the mean will
be shifted towards the outliers, as the following example illustrates.
data(Animals)
Animals
body brain
Mountain beaver 1.350 8.1
Cow 465.000 423.0
Grey wolf 36.330 119.5
Goat 27.660 115.0
Guinea pig 1.040 5.5
Dipliodocus 11700.000 50.0
Asian elephant 2547.000 4603.0
Donkey 187.100 419.0
Horse 521.000 655.0
Potar monkey 10.000 115.0
Cat 3.300 25.6
Giraffe 529.000 680.0
Gorilla 207.000 406.0
Human 62.000 1320.0
African elephant 6654.000 5712.0
Triceratops 9400.000 70.0
Rhesus monkey 6.800 179.0
Kangaroo 35.000 56.0
Golden hamster 0.120 1.0
Mouse 0.023 0.4
Rabbit 2.500 12.1
Sheep 55.500 175.0
Jaguar 100.000 157.0
Chimpanzee 52.160 440.0
Rat 0.280 1.9
Brachiosaurus 87000.000 154.5
Mole 0.122 3.0
Pig 192.000 180.0
Three animals are clearly outlying: these are dinosaurs, with a small brain as
compared with a heavy body. We see that the classical mean, indicated by a plus
sign, is shifted towards the outliers. The covariance matrix can be visualized
by means of a tolerance ellipse (Figure 9.9).
[Figure 9.9: Body and brain weight for 28 animals with classical and robust tolerance ellipse superimposed.]
The robust tolerance ellipse is much smaller and essentially contains the majority of the data points. Here,
the robust distance is defined analogously to the Mahalanobis distance:
RD(xi ) = √[ (xi − μ̂R )t Σ̂R −1 (xi − μ̂R ) ]     (9.4)
where µ̂R and Σ̂R are robust estimates of the center µ and shape Σ of the
x-part of the data points. In Section 9.5 we will discuss the MCD estimator as
a highly robust estimator of location and shape.
[Figure 9.10: Animals data set: Robust distances versus Mahalanobis distances (distance-distance plot).]
For the stars data set, this yields Figure 9.12, on which we clearly see the giant
stars and star 7 as bad leverage points. Star 14 is a good leverage point, whereas
star 9 is found to be a vertical outlier.
[Figure 9.12: Stars data set: standardized LTS residuals versus robust distances computed by MCD.]
Consider the hat matrix
H = X(X t X)−1 X t
as defined in (2.13), which transforms the observed response vector y into its LS
estimate
ŷ = Hy
or equivalently
ŷi = hi1 y1 + hi2 y2 + . . . + hin yn .
(Note that the X matrix here includes a constant column of ones for the intercept
term.) The element hij of H thus measures the effect of the jth observation
on ŷi , and the diagonal element hii the effect of the ith observation on its own
prediction. A diagonal element hii = 0 indicates a point with no influence on
the fit. Since
∑_{i=1}^n hii = tr(H) = p,
the average leverage value consequently equals
h̄ii = p/n.
Moreover, since H is symmetric and idempotent, hii = ∑_{j=1}^n h²ij , so
0 ≤ hii and hii ≥ h²ii for all i = 1, . . . , n. Together this implies
0 ≤ hii ≤ 1.
These limits do not yet tell us when hii is large. Some authors suggest using
hii > 2p/n
as cut-off value. Note that, when hii = 1, hij = 0 for all j ≠ i, and consequently
ŷi = yi and ei = yi − ŷi = 0. The ith observation is thus so influential that
the LS fit passes through it. Moreover, the variance of the ith residual is then
zero:
s2 (ei ) = MSE(1 − hii ) = 0.
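In R the leverage values and the 2p/n cut-off are easily inspected; a sketch for the telephone data:
h <- hatvalues(phones.lm)      # diagonal elements h_ii of the hat matrix
p <- phones.lm$rank
which(h > 2 * p / length(h))   # cases exceeding the 2p/n cut-off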
9.3.1 DFFITS
A measure of the influence that case i has on the fitted value ŷi is
DFFITSi = (ŷi − ŷi(i) ) / √( s²(i) hii )     (9.6)
Since ŷi = xti β̂, the variance of the fitted value equals Var(ŷi ) = σ² hii .
In (9.6) the denominator is the estimated standard deviation of ŷi but the un-
known error variance is now estimated by the MSE obtained by omitting the ith
case from the data set. The DFFITS value thus measures the (standardized)
effect on the prediction when an observation is deleted.
Like other single-case diagnostics, e.g. the deleted residual (7.3), the DFFITS
values can be computed from the results of fitting the entire data set:
DFFITSi = e∗i √( hii /(1 − hii ) ) .
The DFFITSi value thus depends on the size of the studentized residual e∗i and
the leverage value hii , and will be large if either e∗i is large, or hii is large or
they are both large. A case is considered to be influential if
|DFFITSi | > 2 √(p/n).
so Di is the squared distance between ŷ and ŷ(i) , divided by ps². Since ŷ −
ŷ(i) = X(β̂ − β̂ (i) ), it is also equivalent to
Di = (β̂ − β̂ (i) )t (X t X) (β̂ − β̂ (i) ) / (p s²) .
Hence Di also measures the influence of the ith case on the regression coefficients.
As a rough cut-off, a case is considered influential if
Di > 1.
9.3.3 DFBETAS
The DFBETAS measure computes for each case i its influence on each regression
coefficient β̂j :
DFBETASij = (β̂j − β̂j(i) ) / √( s²(i) (X t X)−1 jj )     (9.10)
e <- residuals(phones.lm)
h <- hatvalues(phones.lm)            # leverage values h_ii
si <- lm.influence(phones.lm)$sigma  # leave-one-out scale estimates s_(i)
phones.dfbetas <- dfbetas(phones.lm)
phones.dffits <- h^0.5*e/(si*(1-h))  # DFFITS, see (9.6)
p <- phones.lm$rank
phones.stres <- stdres(phones.lm)    # standardized residuals (MASS)
phones.cd <- (1/p * phones.stres^2 * h)/(1 - h)   # Cook's distance
Next, we consider the outlier diagnostics for the Stars data set, listed in Table
4. We see that the giant stars 11, 20, 30 and 34 are not noticed by Cook’s
distance Di , but DFFITS and DFBETAS are more powerful.
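Base R also provides these diagnostics directly; a quick cross-check against the manual computations above (a sketch):
all.equal(unname(phones.cd), unname(cooks.distance(phones.lm)))
all.equal(unname(phones.dffits), unname(dffits(phones.lm)))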
β̂ LTS = argminβ̂ ∑_{i=1}^h (e²(β̂))i:n     (9.11)
with h an integer between [n + p + 1]/2 and n, and e²i:n the ith smallest squared
residual. For any candidate β̂ we thus rank the squared residuals from smallest
to largest, (e²(β̂))1:n ≤ (e²(β̂))2:n ≤ . . . ≤ (e²(β̂))n:n , and compute the sum
of the h smallest squared residuals. The LTS fit then corresponds to that β̂
which yields the smallest sum. The LTS estimator does not try to make all the
residuals as small as possible, but only the ‘majority’, where the ‘majority’ is
defined as h/n.
The breakdown value of an estimator T at a data set Z is defined as
ε∗n = ε∗n (T, Z) = min{ m/n : supZ′ ‖T (Z′)‖ = ∞ }
where Z′ = (X′, y′) ranges over all data sets obtained by replacing any m ob-
servations of Z = (X, y) by arbitrary points.
For the LTS estimator the breakdown value equals
ε∗n = (n − h + 1)/n
and is maximal for h = [(n + p + 1)/2]. Roughly speaking, the maximal break-
down value of 50% is obtained for h ≈ n/2. If we choose h = 0.75n, the
breakdown value is approximately 25% etc. If h gets closer to n, the LTS es-
timator approaches the LS estimator. The larger we choose h the better the
finite-sample efficiency of the LTS estimator will be, but the lower its resistance
towards outliers!
9.4.2 Computation
Contrary to the LS estimator, the objective function of the LTS estimator
∑_{i=1}^h (e²(β̂))i:n     (9.12)
is not convex and has many local minima. Therefore, one has to rely on ap-
proximate algorithms to compute the LTS estimator. Several approaches exist
which differ in speed and/or accuracy.
This is called hard rejection and produces a clear distinction between accepted
and rejected points. Next, a weighted LS fit is computed, which is equivalent
to applying OLS to the transformed observations (√wi xi , √wi yi ) as discussed in
Chapter 6, Section 6.4.3. If we denote the resulting parameter estimates as
β̂ RLTS , we can again compute the corresponding residuals ei (β̂ RLTS ) and the
scale estimate
sRLTS = √( ∑i wi e²i (β̂ RLTS ) / ( ∑i wi − p ) ).
In the R package robustbase the default output is actually the reweighted LTS.
• find the h observations out of n whose classical covariance matrix has the
lowest determinant
• then, µ̂0 is the average of those h observations, and Σ̂0 is the covariance
matrix of those h observations (multiplied by a consistency factor)
Consider the ellipsoid {x : (x − x̄h )t Sh −1 (x − x̄h ) ≤ c²} for some constant c,
where x̄h and Sh are the mean and covariance matrix of a subset of h observations.
Then it can be shown that the volume of this ellipsoid is
proportional to the square root of the determinant of Sh .
The breakdown value of the MCD estimator is ε∗n = (n−h+1)/n. For any scatter
matrix Σ̂, breakdown means that the largest eigenvalue becomes arbitrary large
or that the smallest eigenvalue becomes zero:
ε∗n (Σ̂, X) = min{ m/n : supX′ λmax (X′) / λmin (X′) = ∞ }
with X 0 obtained by replacing m points out of X. This implies that either the
tolerance ellipsoid explodes (i.e. becomes unbounded) or that it implodes (i.e.
is flattened to a lower dimension and deflated to a zero volume). Remember
that the determinant of a square matrix is equal to the product of its eigenvalues.
Finally the weighted mean and weighted covariance matrix are obtained:
μ̂1 = ( ∑_{i=1}^n wi xi ) / ( ∑_{i=1}^n wi )
Σ̂1 = ( ∑_{i=1}^n wi (xi − μ̂1 )(xi − μ̂1 )t ) / ( ∑_{i=1}^n wi − 1 )
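In R the (reweighted) MCD estimator is available as covMcd in the robustbase package; a sketch for the animals data, assuming the log scale suggested by the axes of Figure 9.9:
library(robustbase)
animals.mcd <- covMcd(log(Animals))
animals.mcd$center               # robust location estimate
plot(animals.mcd, which = "dd")  # distance-distance plot as in Figure 9.10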
R²LTS = 1 − s²LTS (X, y) / s²LTS (1, y) .
The denominator is equal to the squared univariate LTS or MCD scale estimator.
It is defined as the variance of the h-subset with the smallest variance, and can be
computed by an explicit O(n log n) algorithm: we first sort the (univariate)
observations and then compute the variance of each window of h successive points.
One should however be very cautious when outliers are detected. They are found
not to satisfy the linear model that is followed by the majority of the data points.
It is very important to investigate the reason why they differ: it can be due to
the fact that they indeed belong to another population and hence satisfy another
relation. But they can also point us to a model-misspecification. The inclusion
of a quadratic term or a transformation of a variable can e.g. accommodate the
outliers. Knowledge about the problem at hand is thus indispensable for model
building and outlier detection!
Nonlinear regression
The general linear regression model can be expressed as
yi = f (β, xi ) + εi (10.1)
with f (β, xi ) = E(yi ) = xti β a function that is linear in the regression parame-
ters β = (β0 , β1 , . . . , βp−1 )t . This general model includes polynomial regression
models, models with interaction terms, binary variables, and transformed vari-
ables as discussed in Chapters 4-6.
A nonlinear regression model has the same general form,
yi = f (β, xi ) + εi     (10.2)
but with a mean response function which is not linear in the unknown parame-
ters β = (β0 , β1 , . . . , βp−1 )t . The error terms εi are usually assumed to satisfy
the Gauss-Markov conditions (2.2)-(2.4) as for linear regression. Their expec-
tation is thus zero, they have constant variance and they are uncorrelated. In
matrix notation this nonlinear model is written as:
y = f (β, X) + ε
ε ∼ Nn (0, σ 2 In ). (10.3)
Examples.
Exponential regression models.
yi = β0 eβ1 xi + εi (10.4)
yi = β0 + β1 eβ2 xi + εi . (10.5)
Typical examples of (10.5) are growth curves where yi represents the length of
individuals at time xi = ti . Then, if β2 < 0, β0 is the maximum length, β0 + β1
the length at time 0 (thus β1 < 0), whereas β2 expresses the proportionality at
time t of the rate of growth y′(t) = β1 β2 eβ2 t to the remaining amount of growth
β0 − y(t) = −β1 eβ2 t .
yi = β0 / (1 + β1 eβ2 xi ) + εi .
This model is popular in population studies with yi the population size at time
xi = ti . With β2 < 0, β0 represents again the maximal population size.
The response function is now S-shaped. This response function is also used in
logistic regression to model the probability of success for a binary (Bernoulli)
outcome variable.
Taking logarithms of the response function of (10.4) gives
ln E(yi ) = ln β0 + β1 xi = β0′ + β1 xi ,
which is linear in β0′ and β1 . Model (10.6) could thus be analyzed with linear regression
techniques. However, it depends on the error terms whether (10.4) or (10.6)
should be studied. If we use (10.6), we assume that
ln yi = β0′ + β1 xi + εi     (10.7)
or equivalently that
yi = β0 eβ1 xi eεi = β0 eβ1 xi εi′
with the εi′ being lognormally distributed. This is not equivalent to (10.4), so
even in this situation we might prefer the nonlinear model (10.4) over (10.7).
Note that contrary to the general linear model (2.1), the number of predictor
variables is not necessarily the same as the number of regression parameters.
We denote the regression parameters β = (β0 , β1 , . . . , βp−1 )t as before, but now
the predictor variables xi = (xi1 , . . . , xiq )t have length q and do not (always)
include a constant element 1.
so at β = β̂ LS , we obtain
∑_{i=1}^n yi [∂f (β, xi )/∂βj ]β=β̂ LS − ∑_{i=1}^n f (β̂ LS , xi ) [∂f (β, xi )/∂βj ]β=β̂ LS = 0,   j = 0, . . . , p − 1,
or in matrix notation
F t [y − f (β̂ LS , X)] = 0p
with F the n × p matrix of partial derivatives evaluated at β̂ LS . For the linear model
f (β, xi ) = xti β, we have F = X such that Fij = xij (with Fi0 = 1). The normal equations then reduce to
X t [y − X β̂ LS ] = 0p
whose solution is the familiar
β̂ LS = (X t X)−1 X t y.
For the exponential regression model (10.4), the normal equations lead to:
∑_{i=1}^n yi eβ̂1 xi − β̂0 ∑_{i=1}^n e2β̂1 xi = 0
∑_{i=1}^n xi yi eβ̂1 xi − β̂0 ∑_{i=1}^n xi e2β̂1 xi = 0
with Mk such that Q(β̂ (k+1) ) < Q(β̂ (k) ). The iteration is stopped at conver-
gence, i.e. when ‖β̂ (k+1) − β̂ (k) ‖ or |Q(β̂ (k+1) ) − Q(β̂ (k) )| is sufficiently small.
with Mk small enough such that Q(β̂ (k+1) ) < Q(β̂ (k) ).
Initially, M0 is very small, such as 10−8 . If Q(β̂ (1) ) < Q(β̂ (0) ) we proceed
with M1 = M0 /10, otherwise we set M1 = 10 M0 and try again. For small
values of Mk , the Levenberg-Marquardt method is similar to Gauss-Newton; otherwise
it approaches the steepest descent method.
• a grid search in the parameter space, and retaining the solution with
minimal Q(β)
• the least squares solution of the linearized model (if the model is intrinsi-
cally linear).
It is often desirable to try several initial values to make sure that the same
solution will be found.
attach(patients)
days
[1] 2 5 7 10 14 19 26 31 34 38 45 52 53 60 65
prognostic
[1] 54 50 45 37 35 25 20 16 18 13 8 11 8 4 6
The nonlinear regression model (10.4) was considered. To obtain starting values
for the numerical procedure, the model was first linearized as in (10.6). This
yields β̂0′ = 4.0371 and β̂1 = −0.03797, from which β̂0 = eβ̂0′ = 56.6646 is ob-
tained. These initial estimates are now used as starting values for the nonlinear
estimation method.
coefficients(lm(log(prognostic)~days))
(Intercept) days
4.03715887 -0.03797418
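The nonlinear fit itself can then be obtained with nls; a sketch of the call behind patients.nls (parameter names A and B as in the summary further below):
patients.nls <- nls(prognostic ~ A * exp(B * days),
                    start = list(A = 56.6646, B = -0.03797))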
patients.nls
plot(days,prognostic,xlim=c(0,71),ylim=c(0,59),xaxs="i")
A <- summary(patients.nls)$parameters[1]
B <- summary(patients.nls)$parameters[2]
xx <- seq(0,70,length=100)
yy <- A*exp(B*xx)
lines(xx,yy)
[Figure 10.1: Scatter plot and fitted curve for the patients data set.]
Approximate confidence intervals for a single βj are then obtained from (10.14):
Equivalently, tests concerning a single βj are derived from the test statistic
t = (β̂j − βj0 ) / s(β̂j ) ≈H0 tn−p .
Example.
summary(patients.nls,correlation = TRUE)
Parameters:
Estimate Std. Error t value Pr(>|t|)
A 58.606531 1.472159 39.81 5.70e-15 ***
Model checking, e.g. using residual plots, also remains necessary. Figures 10.2
and 10.3 do not suggest departures from the model assumptions. When in-
terpreting residual plots for nonlinear regression, it should be noted that the
residuals do not necessarily sum to zero.
[Figure 10.2: Residuals versus fitted values for the patients data set.]
[Figure 10.3: Normal quantile plot for the residuals of the patients data set.]
Nonparametric regression
yi = f (xi ) + εi (11.1)
The x-values at which the curve is estimated are often
simply the data points xi , but other values can be considered as well, mainly if
the observed data are rather sparse in some region of x.
1. First the neighborhood or the span of the smoother has to be chosen. This
span 0 < s ≤ 1 represents the fraction of the data that will be included in
each fit. It corresponds to m = [sn] data values. Often s = 0.5 or s = 2/3
work well. The larger s, the smoother the results.
2. (a) For each x-value under consideration, the m nearest data points are
identified.
(b) Each of these m points receives a neighborhood weight wj = wT ((xj − x)/h)
from the tricube weight function wT , where h is the distance from x to the
farthest xj . A polynomial of degree k is then fitted by weighted least squares.
(c) The fitted value ŷ from this regression step is retained. Connecting
these fitted values for all x-values under consideration produces an
initial nonparametric regression estimate.
Robustness weights are then computed using the bisquare weight function
vi = wB (zi ) = (1 − zi² )²  if |zi | < 1,  and  vi = 0  if |zi | ≥ 1.
The different steps in the algorithm are depicted in Figure 14.15 for local linear
regression. Figure 11.1 shows the two different weight functions. In Figure 11.2
we see the difference between the nonrobust and the robust smoothed curve in
the presence of an outlier at xmin .
[Figure 11.1: the bisquare (wB ) and tricube (wT ) weight functions.]
1. The lowess method extends the locally weighted average or the kernel
approach which sets k = 0 in Step 2(b). This implies that the fitted
value in x is obtained as the weighted average of the responses in the
neighborhood of x:
ŷ = ∑_{j=1}^m wj yj / ∑_{j=1}^m wj .
When f (x) is nearly linear in the neighborhood of x, then both the local
average and the local regression will produce nearly unbiased estimates of
f (x), but the variance of the local regression will be smaller. Moreover,
when f (x) is substantially nonlinear, the polynomial approach will yield
less bias. This is illustrated in Figure 2.5.
2. With several predictors, the neighborhood of a focal point x is defined as
{xi ∈ Rp : ‖xi − x‖ ≤ h}
with h = maxj=1,...,m ‖xj − x‖.
3. The method also applies when the variance of the errors is not constant,
but instead satisfies Var(εi ) = σ²/ai or Var(√ai εi ) = σ² with the ai
known weights as in the generalized linear model (6.3). Then, the neigh-
borhood weights wj and the robustness weights vj are replaced with √aj wj
resp. √aj vj .
4. The span s can be selected by cross-validation, e.g. by minimizing
∑_{i=1}^n (yi − ŷi(i) )², where ŷi(i) is the fitted value evaluated at xi for a locally weighted regres-
sion that omits the ith observation.
11.3 Inference
Although the primary goal of scatterplot smoothing is to produce a visual sum-
mary of the relation between Y and X, inference is also useful:
We consider inference for the lowess smoother without robustness weights. Be-
cause the fitted values are derived from a weighted least squares regression, each
fitted value is a linear combination of the observations, ŷi = ∑_{j=1}^n sij yj ,
or in matrix notation
ŷ = Sy.
Note that observations falling outside the span of the smoother for the ith
observation receive sij = 0.
Consequently,
Σ(ŷ) = S Σ(y) S t = σ² S S t .
One approach is to consider the analogy with linear least squares regression,
where ŷ = Hy with H = X(X t X)−1 X t the hat matrix and e = (In − H)y.
The number of parameters in this model is tr(H) = p, and the residual degrees
of freedom are tr(In − H) = n − p. By analogy, the ’equivalent’ number of
parameters for the lowess method is tr(S) and the residual degrees of freedom
are tr(In − S) = n − tr(S). The estimated error variance is therefore given by
σ̂² = ∑_{i=1}^n (yi − ŷi )² / (n − tr(S)).
Alternatively one could also use n−tr(SS t ) which, for linear regression, is again
equal to n − tr(HH t ) = n − tr(H) because the hat matrix is symmetric and
idempotent. This definition is used in R.
yi = β0 + β1 xi + εi
with residual sum of squares SSE0 and residual degrees of freedom n − 2. Let
SSE1 be the residual sum of squares for the lowess fit. Then, in analogy to the
partial F-test (3.5) we compute
F ∗ = [ (SSE0 − SSE1 ) / (tr(S) − 2) ] / [ SSE1 / (n − tr(S)) ],
to be compared with an F-distribution with tr(S) − 2 and n − tr(S) degrees of freedom. A similar test can be performed
to compare a lowess fit with a quadratic or any other parametric model, or to
compare two lowess fits with different spans.
gas=read.table(file=path.expand(".\\Updates\\gas.txt"),header=TRUE)
attach(gas)
[Figure: scatter plot of NOx versus E for the gas data.]
Because of the curvature in the data, we fit a local regression model using
locally quadratic fitting (k = 2), with span s = 2/3. If the parameter family is
not specified, the non-robust fit is used. If we want to include the robustness
weights, we should add family = "symmetric" in the function call.
gas.m <- loess(NOx ~ E, span = 2/3, degree = 2)
gas.m
Call:
loess(formula = NOx ~ E, span = 2/3, degree = 2)
Number of Observations: 22
The fitted values in the observed xi , the residuals for all the observations and
the smoothed curve can be obtained from
fitted(gas.m)
residuals(gas.m)
plot(E,NOx)
lines(loess.smooth(E,NOx,span=2/3,degree=2))
# or without creating the gas.m object
scatter.smooth(E, NOx, span = 2/3, degree = 2)
[Figure: gas data with the lowess fit superimposed.]
This plot function evaluates the lowess fit at 50 equally spaced points and
connects these fitted values by line segments. If we want to evaluate the curve
at other points, we can call predict() on a grid of x-values, as sketched below.
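A sketch using predict(), evaluating the fit on a grid covering the observed E-values:
newE <- data.frame(E = seq(min(E), max(E), length = 101))
pred <- predict(gas.m, newdata = newE)
lines(newE$E, pred)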
To check the model assumptions we make several residual plots. First we have
to check the properties of f (x) that are specified by the choice of s = 2/3 and
k = 2. A plot of the residuals against the predictor variable E does not show
any lack of fit.
[Figure: residuals of gas.m versus E.]
Maybe we can allow a larger span, e.g. s = 1. However, the resulting residual
plot shows that there is a dependence of the residuals on E, so s = 1 is too
large.
[Figure: residuals versus E for a lowess fit with span s = 1, showing remaining dependence on E.]
scatter.smooth(fitted(gas.m),sqrt(abs(residuals(gas.m))),
span = 1, degree = 1)
qqnorm(residuals(gas.m))
qqline(residuals(gas.m))
[Figure: square-root absolute residuals versus fitted values for gas.m.]
[Figure: normal quantile plot of the residuals of gas.m.]
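The null model must be fitted first; a sketch, taking a large-span locally linear fit as the (nearly linear) reference model:
gas.m.null <- loess(NOx ~ E, span = 1, degree = 1)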
anova(gas.m.null, gas.m)
KU Leuven
Leuven Biostatistics and Statistical Bioinformatics Centre (L-BioStat)
Kapucijnenvoer 35 blok d - box 7001, 3000 Leuven
thomas.neyens@kuleuven.be