1 Introduction to Model Adequacy Checking
In the previous report we made the following assumptions about the multiple
regression model:
1. The relationship between the response y and the regressors X is approximately
linear.
2 Residual Analysis
2.1 Definition of Residuals
As defined in the previous report, the residual is

e = y − ŷ = y − Hy = (I − H)y.

e is the deviation between the observation y and the fitted value ŷ, and it
also measures the unexplained variability in the response variable. The residuals of
a sample are the differences between the observed values and the fitted values,
while the errors of a sample are the deviations of the observed values from the true
values. We can therefore regard the residuals as the observed values of the model
errors, since the errors themselves cannot be observed. The residuals
should thus show up any departures from the assumptions on the errors, which is
why residual analysis is effective for checking model inadequacy.
residuals can also be written as the expression that includes term as below,
So that
E(e) = 0, var(e) = (I − H)σ 2 = M σ 2
2
.
V ar(ei ) = (1 − hii )σ 2
Cov(ei , ej ) = −hij σ 2
Here hii and hij are elements of the hat matrix H, and hii is a measure of the
location of the ith point in x space, which means the variance of ei depends
on where the point xi lies, with 0 ≤ hii ≤ 1. Furthermore, the larger hii is, the
smaller Var(ei) is, so hii is called the leverage. When hii is close to 1, Var(ei) is
almost 0, so ŷi ≈ yi. This shows that when xi is far away from the
center x̄, the point (xi, yi) will pull the regression line toward itself. If
it has a great effect on the estimation of the parameters, it is called a high-leverage
case.
The approximate average variance of the residuals is estimated by the residual
mean square MSE, which is expressed as

MSE = SSE/(n − p) = e′e/(n − p),

E(MSE) = E[SSE/(n − k − 1)] = E[e′e/(n − k − 1)] = σ²,

σ̂² = MSE,

where p = k + 1 is the number of parameters in the model.
Under the assumptions, the errors are independent but the residuals are not; in
fact, the residuals have only n − p degrees of freedom. However, as long as the number
of observations n is much larger than the number of parameters p, this dependence
among the residuals has little effect on model adequacy checking.
Generally speaking, residual analysis plays an important role in
1. verifying the model's assumptions,
If the possibility of recording errors is ruled out, it is better to collect more
data or to use robust estimation to reduce the effect of influential points on the
estimation.
2.2.1 Standardized Residuals

di = ei/√MSE = ei/σ̂,  i = 1, · · · , n,

E(di) = 0,  Var(di) ≈ 1.
The standardized residuals (sometimes written ZREi) make the
residuals comparable. If di is large (di > 3, say), the observation is regarded as an
outlier. The standardized residuals simplify this determination, but they do
not solve the problem of unequal variance.
2.2.2 Studentized Residuals

If there is only one regressor, it is easy to show that the studentized residuals
are

ri = ei / √( MSE · [1 − (1/n + (xi − x̄)²/Sxx)] ).

Here Sxx = Σ_{i=1}^{n} (xi − x̄)².
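The quantities above translate directly into code. The report's examples use R; the following is a minimal Python sketch (made-up data, illustrative function name) that fits a simple regression and returns the raw, standardized, and internally studentized residuals:

```python
import math

def simple_ols_residuals(x, y):
    """Fit y = b0 + b1*x by least squares; return raw residuals e_i,
    leverages h_ii, standardized residuals d_i, and (internally)
    studentized residuals r_i."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    mse = sum(ei * ei for ei in e) / (n - 2)            # p = 2 parameters
    h = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]    # leverages h_ii
    d = [ei / math.sqrt(mse) for ei in e]               # standardized
    r = [ei / math.sqrt(mse * (1 - hi)) for ei, hi in zip(e, h)]
    return e, h, d, r

e, h, d, r = simple_ols_residuals([1, 2, 3, 4, 5], [2, 1, 4, 3, 6])
```

Since ri = di/√(1 − hii) and 0 < 1 − hii ≤ 1, each |ri| is at least as large as the corresponding |di|, and the leverages sum to p = 2.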
Example 1. The residuals for the model fitted to the pull strength of a wire bond
in a semiconductor manufacturing process are shown below.
Calculate the standardized and studentized residuals corresponding to e15
and e17. What can you conclude?
2.2.3 PRESS Residuals
Besides the standardized and studentized residuals, another improved residual
makes use of yi − ŷ(i), where ŷ(i) is the fitted value of the ith response
based on all observations except the ith one. This approach deletes the ith
observation so that the fitted value is not influenced by it, and the
resulting residual indicates whether the observation is an outlier.
We define the prediction error as

e(i) = yi − ŷ(i).
These prediction errors are called PRESS residuals (prediction error sum of
squares). The ith PRESS residual is
e(i) = ei / (1 − hii).
If hii is large, the PRESS residual will also be large and the point will be a
high-influence point. A large gap between the PRESS residual and the ordinary
residual indicates a point where the model fits the data well but predicts poorly.
Var(e(i)) = Var( ei/(1 − hii) ) = [1/(1 − hii)²] · σ²(1 − hii) = σ²/(1 − hii).
A standardized PRESS residual is

e(i)/√Var(e(i)) = [ei/(1 − hii)] / [σ/√(1 − hii)] = ei/√(σ²(1 − hii)).

Homoscedasticity implies that the variance is the same for all observations
(σi = σ for all i). The prediction error sum of squares is
PRESS = Σ_{i=1}^{n} e(i)² = Σ_{i=1}^{n} [ei/(1 − hii)]².
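The identity e(i) = ei/(1 − hii) means the PRESS residuals can be computed without refitting the model n times. The sketch below (Python, illustrative data and function names of my own) checks the shortcut against an actual leave-one-out refit for a simple linear regression:

```python
def simple_fit(x, y):
    """Ordinary least-squares coefficients (b0, b1) for y = b0 + b1*x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - b1 * xbar, b1

def press_residuals(x, y):
    """PRESS residuals via e_i / (1 - h_ii), with no refitting."""
    n = len(x)
    b0, b1 = simple_fit(x, y)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    out = []
    for xi, yi in zip(x, y):
        e = yi - (b0 + b1 * xi)                 # ordinary residual
        h = 1 / n + (xi - xbar) ** 2 / sxx      # leverage h_ii
        out.append(e / (1 - h))
    return out

x, y = [1, 2, 3, 4, 5], [2, 1, 4, 3, 6]
fast = press_residuals(x, y)
# brute force: delete point i, refit on the rest, predict y_i
slow = []
for i in range(len(x)):
    b0, b1 = simple_fit(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
    slow.append(y[i] - (b0 + b1 * x[i]))
press = sum(ei ** 2 for ei in fast)
```

Both routes give identical residuals, and PRESS is their sum of squares.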
2.2.4 R-student
Sometimes the R-student is also called the externally studentized residual, while
the studentized residuals discussed before are called internally studentized
residuals. Instead of the MSE used for the studentized residuals, R-student uses
S²(i), the residual variance estimated with the ith observation removed, so that

ti = ei / √( S²(i)(1 − hii) ).
2.3.1 Normal Probability Plot
If the errors come from a distribution with heavier tails than the normal, the
resulting outliers will pull the regression line toward themselves. In these
cases, other estimation techniques such as robust regression methods
should be considered.
The normal probability plot is a direct method to check the normality
assumption. The first step is to rank the residuals in increasing order
(e[1], e[2], · · · , e[n]), and then plot e[i] against the cumulative probability
Pi = (i − 1/2)/n on normal probability paper. This is usually called a P-P plot.
1. If the resulting points lie approximately on the diagonal line, the errors
are consistent with a normal distribution.
2. If the fitted curve is sharper than the diagonal line at both extremes, the
errors come from a distribution with lighter tails.
3. If the fitted curve is flatter than the diagonal line at both extremes, the
errors come from a distribution with heavier tails.
4. If the fitted curve rises or falls at the right extreme, the errors come from
a distribution that is positively or negatively skewed, respectively.
Similar to the P-P plot, the Q-Q plot plots e[i] against the quantiles of the
normal distribution.
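The plotting positions described above need nothing beyond the standard library. As an illustrative Python sketch (not part of the original report), the helper below pairs each ordered residual with the normal quantile of Pi = (i − 1/2)/n:

```python
from statistics import NormalDist

def normal_plot_points(residuals):
    """Pair the i-th smallest residual e_[i] with the standard normal
    quantile of the cumulative probability P_i = (i - 1/2)/n."""
    n = len(residuals)
    e_sorted = sorted(residuals)
    z = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(z, e_sorted))

pts = normal_plot_points([0.8, -1.2, 0.8, -1.2, 0.8])
```

If the (z, e) pairs fall close to a straight line, the normality assumption is plausible; systematic curvature at the extremes signals the tail behavior described in items 2–4 above.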
Residuals vs Fitted
We don't see any distinctive pattern in Case 1, but we see a parabola in
Case 2, where a non-linear relationship was not captured by the model
and shows up in the residuals.
Normal Q-Q
Scale-Location
This plot shows if residuals are spread equally along the ranges of pre-
dictors. This is how you can check the assumption of equal variance (ho-
moscedasticity). It’s good if you see a horizontal line with equally (randomly)
spread points.
Residuals vs Leverage
This plot helps us find influential cases (i.e., subjects), if any. Not all
outliers are influential in linear regression analysis (whatever outliers mean).
Even though an observation has extreme values, it might not be influential in
determining the regression line. That means the results wouldn't be much different
if we either included or excluded it from the analysis. Such points follow the trend
of the majority of cases and don't really matter; they are not influential.
On the other hand, some cases could be very influential even if they look
to be within a reasonable range of the values. They could be extreme cases
against a regression line and can alter the results if we exclude them from
analysis. Another way to put it is that they don’t get along with the trend
in the majority of the cases.
Unlike the other plots, this time patterns are not relevant. We watch out
for outlying values at the upper right corner or at the lower right corner.
Those spots are the places where cases can be influential against a regression
line. Look for cases outside of a dashed line, Cook’s distance. When cases
are outside of the Cook’s distance (meaning they have high Cook’s distance
scores), the cases are influential to the regression results. The regression
results will be altered if we exclude those cases.
Case 1 is the typical look when there is no influential case, or cases. You
can barely see Cook’s distance lines (a red dashed line) because all cases are
well inside of the Cook’s distance lines. In Case 2, a case is far beyond the
Cook’s distance lines (the other residuals appear clustered on the left because
the second plot is scaled to show larger area than the first plot). The plot
identified the influential observation as 49. If we exclude the 49th case from
the analysis, the slope coefficient changes from 2.14 to 2.68 and R2 from .757
to .851. Pretty big impact!
The four plots show potential problematic cases with the row numbers
of the data in the data set. If some cases are identified across all four plots,
you might want to take a close look at them individually. Is there anything
special for the subject? Or could it be simply errors in data entry?
So, what does having patterns in residuals mean for your research? It's
not just a go-or-stop sign. It tells you about your model and data. Your
current model might not be the best way to understand your data if there is
so much good stuff left in the data.
In that case, you may want to go back to your theory and hypotheses. Is
it really a linear relationship between the predictors and the outcome? You
may want to include a quadratic term, for example. A log transformation
may better represent the phenomena that you like to model. Or, is there any
important variable that you left out of your model? Other variables you
didn't include (e.g., age or gender) may play an important role in your model
and data. Or maybe your data were systematically biased when they were
collected. You may want to redesign the data collection methods.
3 Transformation and Weighting to Correct
Model Inadequacies
In the previous section, we mentioned that we would introduce methods
to solve the problem of unequal variance. Data transformation and
weighted least squares are two common methods that are useful for building
models without violations of the assumptions. In this section, we lay emphasis
on data transformation.
data
X Y
1 0.5 42.5
2 0.5 50.6
3 1.0 68.5
4 1.0 80.7
5 1.5 89.0
6 1.5 99.6
7 2.0 105.3
8 2.0 111.8
9 2.5 112.3
10 2.5 125.7
> lm<-lm(Y~X)
> X2<-sqrt(data$X)
> X2
[1] 0.7071068 0.7071068 1.0000000 1.0000000 1.2247449 1.2247449 1.4142136
[8] 1.4142136 1.5811388 1.5811388
> plot(Y~X2)
> lm2<-lm(Y~X2)
> summary(lm2)
Call:
lm(formula = Y ~ X2)
Residuals:
Min 1Q Median 3Q Max
-9.3221 -4.1884 -0.2367 4.1007 7.7200
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -10.328 7.892 -1.309 0.227
X2 83.453 6.444 12.951 1.2e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
3.2 Assumption of Constant Variance
One assumption of a regression model is that the errors have constant
variance. A common reason for a violation (Var(εi) ≠ Var(εj), i ≠ j) is that
the response y follows a distribution whose variance is related to the mean
(and the mean in turn is related to the regressors x). For example, if y follows
a Poisson distribution in a simple linear regression model, then
E(y) = Var(y) = λ, where λ is the parameter of the Poisson distribution and is
related to the regressor variable x.
As another example, suppose we want to analyse the relationship between
resident income and purchasing power, yi = β0 + β1 xi + εi. The variation among
low-income families is small because their main purchases are daily necessities.
However, the variation among high-income families is large because their
purchase range is wider, from automobiles to houses. This explains why the
error variance is unequal.
In another example, if y is a proportion, i.e., 0 ≤ yi ≤ 1, then the variance
of y is proportional to E(y)[1 − E(y)]. In such cases, a variance-stabilizing
transformation is useful.
Many factors can result in heteroscedasticity. When the sample data are
cross-sectional (different subjects observed at the same time or period), the
errors are more likely to have different variances.
To solve these problems, we can transform the response y and then
convert the predicted values back into the original units. Table 1 shows a
series of common variance-stabilizing transformations.
Table 1: Variance-stabilizing transformations

Relation of σ² to E(y)       Transformation
σ² ∝ constant                y′ = y (no transformation)
σ² ∝ E(y)                    y′ = √y
σ² ∝ E(y)[1 − E(y)]          y′ = sin⁻¹(√y)  (0 ≤ yi ≤ 1)
σ² ∝ [E(y)]²                 y′ = ln(y)
σ² ∝ [E(y)]³                 y′ = y^(−1/2)
σ² ∝ [E(y)]⁴                 y′ = y^(−1)
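The transformations in Table 1 are simple elementwise functions. As an illustrative Python sketch (the key names are mine, describing the relation of σ² to E(y)), they can be collected in a mapping:

```python
import math

# Variance-stabilizing transformations from Table 1.
# Each value maps y -> y'; apply only on the stated range of y.
transforms = {
    "constant":       lambda y: y,                        # no change
    "E(y)":           lambda y: math.sqrt(y),             # Poisson-like counts
    "E(y)[1 - E(y)]": lambda y: math.asin(math.sqrt(y)),  # proportions, 0<=y<=1
    "E(y)^2":         lambda y: math.log(y),
    "E(y)^3":         lambda y: y ** -0.5,
    "E(y)^4":         lambda y: 1.0 / y,
}
```

For example, counts whose variance is proportional to the mean are replaced by √y before refitting, and predictions are squared to return to the original units.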
3.2.2 Weighted Least Squares (WLS)
In many cases, the assumptions of uncorrelated and homoscedastic errors may
not be valid. Thus, alternative methods should be considered in order to
compute improved parameter estimates.
Cov() = σ 2 W −1
W = diag(w1 , w2 , ..., wn )
so that

W⁻¹ = diag(1/w1, 1/w2, ..., 1/wn).
Thus, the heteroscedastic variances can be written as Var(εi) = σ²/wi.
For the WLS approach, the response variable (y), the regressor matrix (X)
and the errors (ε) are transformed so that the transformed variables follow a
linear model with homoscedastic errors. For i = 1, 2, ..., n, with k regressors
in the model:

• Transformed errors: ε*i = √wi · εi

W^(1/2) y = W^(1/2) Xβ + W^(1/2) ε    (1)

where

W^(1/2) = diag(√w1, √w2, ..., √wn).
Regression coefficient estimates are obtained by minimizing the weighted sum
of squares (WSS):

WSS(β) = Σ_{i=1}^{n} êwi² = Σ_{i=1}^{n} wi (yi − xi′β)²,

where the residuals of a weighted least squares model are defined as

êwi = √wi (yi − xi′β),  i = 1, 2, ..., n.
Example 4. For a simple linear regression model

yi = β0 + β1 xi + εi,

the weighted least squares estimates can be obtained as follows:

1. Write the weighted sum of squares

WSS(β0, β1) = Σ_{i=1}^{n} wi (yi − β0 − β1 xi)².

2. Set the partial derivatives to zero:

∂WSS/∂β0 = −2 Σ_{i=1}^{n} wi (yi − β̂0 − β̂1 xi) = 0,
∂WSS/∂β1 = −2 Σ_{i=1}^{n} wi xi (yi − β̂0 − β̂1 xi) = 0,

which give the normal equations

β̂0 Σ wi + β̂1 Σ wi xi = Σ wi yi,
β̂0 Σ wi xi + β̂1 Σ wi xi² = Σ wi xi yi.

3. Finally, we solve these equations for the weighted least squares estimates
of β0 and β1.
For example, it can be shown that the weighted estimator for β1 is

β̂1w = Σ_{i=1}^{n} wi (xi − x̄w)(yi − ȳw) / Σ_{i=1}^{n} wi (xi − x̄w)²,

where x̄w = Σ wi xi / Σ wi and ȳw = Σ wi yi / Σ wi. From this we can find the
weighted least squares estimate of the intercept:

β̂0w = ȳw − β̂1w x̄w.
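These closed-form estimates translate directly into code. Below is an illustrative Python sketch (function name mine); with all weights equal it must reduce to ordinary least squares, which makes a convenient sanity check:

```python
def wls_simple(x, y, w):
    """Weighted least-squares estimates (b0, b1) for y = b0 + b1*x,
    using the weighted means and weighted cross-products."""
    sw = sum(w)
    xw = sum(wi * xi for wi, xi in zip(w, x)) / sw   # weighted mean of x
    yw = sum(wi * yi for wi, yi in zip(w, y)) / sw   # weighted mean of y
    b1 = (sum(wi * (xi - xw) * (yi - yw) for wi, xi, yi in zip(w, x, y))
          / sum(wi * (xi - xw) ** 2 for wi, xi in zip(w, x)))
    return yw - b1 * xw, b1

# with unit weights this is just OLS
b0, b1 = wls_simple([1, 2, 3, 4, 5], [2, 1, 4, 3, 6], [1, 1, 1, 1, 1])
```

A second property worth noting: if the data lie exactly on a line, the weighted estimates recover that line for any positive weights, since every weighted residual is zero.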
Example 5. Consider the dataset "cleaningwtd". We are trying to develop a
regression equation to model the relationship between Y, the number of rooms
cleaned, and X, the number of crews, and to predict the number of rooms
cleaned by 4 and 16 crews. In this case we take

wi = 1 / (standard deviation of Yi)²,

so that yi has variance σ²/wi, with σ² = 1.
> cleaningwtd <-read.table(file.choose(),header=TRUE,sep=",")
> cleaningwtd
Case Crews Rooms StdDev
1 1 16 51 12.000463
2 2 10 37 7.927123
3 3 12 37 7.289910
4 4 16 46 12.000463
5 5 16 45 12.000463
6 6 4 11 4.966555
7 7 2 6 3.000000
8 8 4 19 4.966555
9 9 6 29 4.690416
10 10 2 14 3.000000
11 11 12 47 7.289910
12 12 8 37 6.642665
13 13 16 60 12.000463
14 14 2 6 3.000000
15 15 2 11 3.000000
16 16 2 10 3.000000
17 17 6 19 4.690416
18 18 10 33 7.927123
19 19 16 46 12.000463
20 20 16 69 12.000463
21 21 10 41 7.927123
22 22 6 19 4.690416
23 23 2 6 3.000000
24 24 6 27 4.690416
25 25 10 35 7.927123
26 26 12 55 7.289910
27 27 4 15 4.966555
28 28 4 18 4.966555
29 29 16 72 12.000463
30 30 8 22 6.642665
31 31 10 55 7.927123
32 32 16 65 12.000463
33 33 6 26 4.690416
34 34 10 52 7.927123
35 35 12 55 7.289910
36 36 8 33 6.642665
37 37 10 38 7.927123
38 38 8 23 6.642665
39 39 8 38 6.642665
40 40 2 10 3.000000
41 41 16 65 12.000463
42 42 8 31 6.642665
43 43 8 33 6.642665
44 44 12 47 7.289910
45 45 10 42 7.927123
46 46 16 78 12.000463
47 47 2 6 3.000000
48 48 2 6 3.000000
49 49 8 40 6.642665
50 50 12 39 7.289910
51 51 4 9 4.966555
52 52 4 22 4.966555
53 53 12 41 7.289910
> attach(cleaningwtd)
> wm1 <- lm(Rooms~Crews,weights=1/StdDev^2)
> summary(wm1)
Call:
lm(formula = Rooms ~ Crews, weights = 1/StdDev^2)
Weighted Residuals:
Min 1Q Median 3Q Max
-1.43184 -0.82013 0.03909 0.69029 2.01030
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.8095 1.1158 0.725 0.471
Crews 3.8255 0.1788 21.400 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We would like to find 95% prediction intervals for Y when x = 4 and 16.
> predict(wm1,newdata=data.frame(Crews=c(4,16)),interval="prediction",level=0.95)
fit lwr upr
1 16.11133 13.71210 18.51056
2 62.01687 57.38601 66.64773
Warning message:
In predict.lm(wm1, newdata = data.frame(Crews = c(4, 16)), interval = "prediction",  :
  Assuming constant prediction variance even though model fit is weighted
Write the model as a multiple linear regression with two predictors and no
intercept, regressing ynew = √wi · yi on x1new = √wi and x2new = √wi · xi, as
in the transformation (1).
Call:
lm(formula = ynew ~ x1new + x2new - 1)
Residuals:
Min 1Q Median 3Q Max
-1.43184 -0.82013 0.03909 0.69029 2.01030
Coefficients:
Estimate Std. Error t value Pr(>|t|)
x1new 0.8095 1.1158 0.725 0.471
x2new 3.8255 0.1788 21.400 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
For a multiplicative model y = β0 e^(β1 x) ε, taking logarithms gives

ln y = ln β0 + β1 x + ln ε,

or

y′ = β0′ + β1 x + ε′,

where β0′ = ln β0.
4 Detection and Treatment of Outliers (Influential Observations)
As mentioned before, an outlier is an observation point that is distant from
the other observations; it may be, say, three or four standard deviations from
the mean (|di| > 3). An outlier may be due to variability in the measurement
or it may indicate experimental error; in the latter case it can simply be
removed from the data set. Taking advantage of the studentized or R-student
residuals, the residual plots against ŷi, and the normal probability plot,
we can find outliers easily. If an outlier does not come from a systematic error,
then it plays a more important role than other points because it may control
many key model properties.
We can compare the original values of the parameters or summary statistics
such as the t statistic, F statistic, R² and MSR with the values after removing
the outliers.
Outliers fall into two types: outliers in the response y and outliers in the
regressors xi. If the absolute value of R-student is greater than 3
(|ri*| > 3), the point can be flagged as an outlier and possibly deleted.
Sometimes a residual is large because the point is an influence point rather than
an outlier in the response. Such a high-influence point need not fit the
regression model badly, but it has a significant influence on the regression results.
As defined in paragraph 1.1, hii, the ith diagonal element of the hat matrix H,
is called the leverage of the ith observation because it is a standardized measure
of the distance of the ith observation from the center of the x space. It
can be written as

hii = xi′(X′X)⁻¹xi,

where xi′ is the ith row of the X matrix. The larger hii is, the smaller
Var(ei) is. Points with large leverage pull the regression line toward themselves.
If a high-leverage point lies almost on the line passing through
the remaining observations, it has no effect on the regression coefficients. The
mean value of hii is

h̄ = (1/n) Σ_{i=1}^{n} hii = p/n,

where Σ_{i=1}^{n} hii = rank(H) = rank(X) = p. If a leverage hii is 2 or 3 times
greater than h̄, it is regarded as a high leverage value. Observations with
large hat diagonals and large residuals are called influence points. Influence
points are not always outliers in the response, so leverage alone cannot
determine whether an influence point is an outlier.
It is desirable to consider both the location of the point in the x space
and the response variable when measuring influence. Therefore, we introduce
Cook's distance, a squared distance measure. It is named after the American
statistician R. Dennis Cook, who introduced the concept in 1977. The distance
measure can be expressed as

Di = (β̂(i) − β̂)′ X′X (β̂(i) − β̂) / (p · MSE),  i = 1, · · · , n,

where β̂ is the least-squares estimate based on all n points and β̂(i) is the
estimate obtained by deleting the ith point. It can also be written as

Di = (ri²/p) · hii/(1 − hii),

where ri is the studentized residual. From this expression, we can see that
Cook's distance reflects a combined effect of the leverage hii and the residual ei.
Points with large values of Di have great influence on the least-squares
estimates. However, it is complicated to set a cutoff for the size of Cook's
distance, and there are different opinions regarding what cut-off values
to use for spotting highly influential points. A simple operational guideline
is that a point with Di > 1 is regarded as influential, while a point with
Di < 0.5 is not. Others have suggested Di > 4/n as the cutoff.
R Codes:
library(foreign)
library(MASS)
# lm is a fitted model object, e.g. lm <- lm(y ~ x)
cd<-cooks.distance(lm)
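Cook's distance can also be computed directly from the ordinary residuals and leverages via the standard identity Di = (ri²/p) · hii/(1 − hii). The following is an illustrative Python sketch (inputs are made-up values from a small simple-regression fit with p = 2):

```python
def cooks_distance(residuals, leverages, mse, p):
    """Cook's distance D_i = (r_i^2 / p) * h_ii / (1 - h_ii), where
    r_i is the (internally) studentized residual."""
    out = []
    for e, h in zip(residuals, leverages):
        r_sq = e * e / (mse * (1 - h))    # squared studentized residual
        out.append((r_sq / p) * h / (1 - h))
    return out

# residuals and leverages from a small simple-regression fit (p = 2)
d = cooks_distance([0.8, -1.2, 0.8, -1.2, 0.8],
                   [0.6, 0.3, 0.2, 0.3, 0.6], mse=1.6, p=2)
```

Note how the high-leverage endpoints dominate: the same-sized residual produces a much larger Di where hii is large.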
Example 6. The table below lists the values of the hat matrix diagonals
hii and Cook's distance measure Di for the wire bond pull strength data.
Calculate D1.
5 Lack of Fit of the Regression Model
In the formal test for lack of fit, we assume the regression model satisfies
the requirements of normality, independence and constant variance, and we
only want to determine whether the relationship is a straight line or not.
Replicating experiments in regression analysis aims to make clear whether
there are any non-negligible factors other than x. Here replication means
performing ni separate experiments at the level x = xi and observing the
response each time, rather than performing one experiment and measuring the
response ni times; the residuals from the latter reflect only measurement
variation and are useless for judging the fitted model. If other
uncontrolled or non-negligible factors, including interactions, are missing from
the model, the fit may be poor; this is called lack of fit. In
this situation, even if hypothesis testing shows that the regression is
significant, it only indicates that the regressor x has an effect on the response
y, not that the fit is good. Thus, for a data set that
involves replicated data, we can use the test for lack of fit on the expected
function. Determining whether the model is good or not relies mainly on
residual analysis.
Residuals are composed of two parts: one part, called pure error, is random
and cannot be eliminated, while the other part, related to the model, is called
lack of fit.
The test for lack of fit judges whether a regression model should be accepted,
based on the relative sizes of the lack-of-fit and pure-error components. If the
lack of fit is significantly greater than the pure error, the model should be
rejected.
The sum of squares due to lack of fit is one component of a partition of the
error sum of squares in an analysis of variance, and it is used in the numerator
of an F-test of the null hypothesis that the proposed model fits well:

SSE = SSPE + SSLOF,

where SSPE is the sum of squares due to pure error and SSLOF is the sum of
squares due to lack of fit.
Notation:
xi, i = 1, · · · , m: the ith level of the regressor x.
ni: the number of replications at the ith level of x, with

Σ_{i=1}^{m} ni = n.
The cross-product term Σ_{i=1}^{m} Σ_{j=1}^{ni} (yij − ȳi)(ȳi − ŷi) equals 0,
so the error sum of squares decomposes as

SSPE = SSE(full model) = Σ_{i=1}^{m} Σ_{j=1}^{ni} (yij − ȳi)²,

SSLOF = SSE − SSPE = Σ_{i=1}^{m} ni (ȳi − ŷi)².

The factor ni and the single ŷi in the formula arise because all the yij at the
level xi have the same fitted value ŷi.
If the fitted values ŷi are close to the corresponding average responses ȳi,
then the lack of fit is approximately zero, which indicates that the regression
function is likely to be linear, and vice versa.
The degrees of freedom for pure error at each level xi are ni − 1 (similar to
SST), and the total number of degrees of freedom associated with the sum
of squares due to pure error is Σ_{i=1}^{m} (ni − 1) = n − m. The degrees of
freedom associated with SSLOF are m − 2, because the regressor has m levels and
two parameters must be estimated to obtain the ŷi.
The hypotheses are

H0: E(y) = β0 + β1 x  (the straight-line model is adequate)
H1: E(y) ≠ β0 + β1 x,

and the test statistic is

F0 = [SSLOF/(m − 2)] / [SSPE/(n − m)],

which under H0 follows an F distribution with m − 2 and n − m degrees of
freedom.
Example 7. Perform the F-test for lack of fit on the following data:
(90, 81), (90, 83), (79, 75), (66, 68), (66, 60), (66, 62), (51, 60), (51, 64), (35, 51), (35, 53)
> anova(lm(y ~ x), lm(y ~ as.factor(x)))
Model 1: y ~ x
Model 2: y ~ as.factor(x)
  Res.Df     RSS Df Sum of Sq      F Pr(>F)
1      8 118.441
2      5  46.667  3    71.775 2.5634  0.168
From the output above, we can conclude: SSE = 118.441 (with 8 df), SSPE =
46.667 (with 5 df), and SSLOF = 71.775 (with 3 df). Since the p-value 0.168 is
large, there is no significant lack of fit.