
MAT 3375 - Regression Analysis

Model Adequacy Checking and Diagnostics for


Leverage and Influence
Chapters 4, 5 and 6
Professor: Termeh Kousha
Fall 2016

1 Introduction to Model Adequacy Checking
In the previous report we made the following assumptions for the multiple regression model:

1. The relationship between the response y and the regressors X is approximately linear.

2. The errors εj (j = 1, · · · , n) follow the normal distribution N(0, σ²) and are uncorrelated.

If any assumption is violated, the fitted model may be unstable, and the standard summary statistics such as the t and F tests or R² are not enough to ensure model adequacy. In this report we present several methods, based mainly on the residuals, for diagnosing violations of the basic regression assumptions.

2 Residual Analysis
2.1 Definition of Residuals
As defined in the previous report, the residual vector is

e = y − ŷ = y − Hy = (I − H)y

The residual e is the deviation between the observation y and the fitted value ŷ, and it measures the unexplained variability in the response. Residuals are the differences between the observed values and the fitted values, while errors are the deviations of the observed values from the true mean response. Since the errors themselves cannot be observed, the residuals can be regarded as their observed counterparts. Therefore, the residuals should reveal any departures from the assumptions on the errors, which is why residual analysis is an effective way to check model inadequacy. The residuals can also be written in terms of the error vector ε:

e = Xβ + ε − X(X'X)⁻¹X'(Xβ + ε) = ε − Hε = (I − H)ε = Mε.

It follows that

E(e) = 0,   Var(e) = (I − H)σ² = Mσ²,

and in particular

Var(ei) = (1 − hii)σ²,   Cov(ei, ej) = −hij σ²  (i ≠ j).
Here hii and hij are elements of the hat matrix H. The diagonal element hii is a measure of the location of the ith point in x space, so the variance of ei depends on where the point xi lies, and 0 ≤ hii ≤ 1. The larger hii is, the smaller Var(ei) is, which is why hii is called the leverage. When hii is close to 1, Var(ei) is almost 0 and ŷi ≈ yi. In other words, when xi is far from the centre of the data x̄, the point (xi, yi) pulls the regression line toward itself. If such a point has a large effect on the parameter estimates, it is called a high-leverage point.
The approximate average variance of the residuals is estimated by the residual mean square MSE, which is

MSE = SSE/(n − p) = e'e/(n − p),

and since

E(MSE) = E[SSE/(n − k − 1)] = E[e'e/(n − k − 1)] = σ²,

we take σ̂² = MSE.
Under the assumptions the errors are independent, but the residuals are not: they have only n − p degrees of freedom. However, as long as the number of observations n is much larger than the number of parameters p, this lack of independence has little effect on model adequacy checking.
Generally speaking, residual analysis plays an important role in

1. verifying the model assumptions,

2. finding observations that are outliers or extreme values.

Data points (xi1, · · · , xip, yi) that have a large influence on the statistical inference are called influential points. We want each observation to carry some influence, but not too much, so that the parameter estimates are stable. Otherwise, removing the influential points would change the fitted model massively compared with the original model; in that case we should not trust the original model and should doubt whether it describes the true relationship between the response and the regressors. If an unusual observation is the result of a systematic error, such as a mistake in measurement, we can simply delete that record.

If the possibility of a recording error is ruled out, it is better to collect more data or to use robust estimation to reduce the effect of influential points on the estimates.

2.2 Methods for Scaling Residuals


It is convenient to find outliers and extreme values by using scaled residuals.
Here we will introduce four popular methods for scaling residuals.

2.2.1 Standardized Residuals


We have seen that MSE estimates the approximate average variance of the residuals, so a reasonable way of scaling the residuals is the standardized residuals

di = ei/√MSE = ei/σ̂,   i = 1, · · · , n,

with E(di) = 0 and Var(di) ≈ 1. The standardized residuals (sometimes written ZREi) make the residuals comparable. If di is large (|di| > 3, say), the corresponding observation is regarded as a potential outlier. The standardized residuals simplify this determination, but they do not solve the problem of unequal variance.
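For concreteness, here is a minimal R sketch of this calculation, assuming a fitted model object called fit (a hypothetical name):

e   <- resid(fit)                    # ordinary residuals
p   <- length(coef(fit))             # number of parameters
MSE <- sum(e^2) / (length(e) - p)    # residual mean square, sigma-hat^2
d   <- e / sqrt(MSE)                 # standardized residuals d_i
which(abs(d) > 3)                    # flag potential outliers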

2.2.2 Studentized Residuals


One important application of residuals is finding outliers based on their absolute values. However, the variance of an ordinary residual depends on hii, so it is not appropriate to compare the residuals directly. A second way of standardizing the residuals is the studentized residual

ei / (σ √(1 − hii)).

Because σ is unknown, we replace it with σ̂ = √MSE, so the studentized residuals are

ri = ei / √( MSE (1 − hii) ).

If there is only one regressor, it is easy to show that the studentized residuals are

ri = ei / √( MSE [ 1 − (1/n + (xi − x̄)²/Sxx) ] ),

where Sxx = Σ_{i=1}^n (xi − x̄)².

Compared with the standardized residuals di, the studentized residuals ri have constant variance Var(ri) = 1 regardless of the location of xi; in other words, they solve the problem of unequal variance. When the data set is large, the variance of the residuals stabilizes, so in many cases the standardized and studentized residuals convey equivalent information because there is little difference between them. However, the studentized residuals are better because they account for the leverage hii, so points that combine a large residual with a large hii stand out more clearly.

Example 1. The residuals for the model fitted to the pull strength data from a wire bond semiconductor manufacturing process are shown below.

A normal probability plot of these residuals is shown below. No severe deviations from normality are apparent, although the two largest residuals (e15 = 5.84 and e17 = 4.33) do not fall particularly close to a straight line drawn through the remaining residuals.

Calculate the standardized and studentized residuals corresponding to e15 and e17. What can you conclude?

2.2.3 PRESS Residuals
Besides the standardized and studentized residuals, another improved residual makes use of yi − ŷ(i), where ŷ(i) is the fitted value of the ith response based on all observations except the ith one. Because the ith observation is deleted, the fitted value ŷ(i) is not influenced by that observation, and the resulting residual is better suited to deciding whether the point is an outlier. We define the prediction error as

e(i) = yi − ŷ(i).

These prediction errors are called PRESS residuals (from prediction error sum of squares). The ith PRESS residual can be written as

e(i) = ei / (1 − hii).

If hii is large, the PRESS residual will be large compared with the ordinary residual, and the point is a candidate high-influence point: the model fits such a point well, but predicts it poorly when the point is left out.
The variance of a PRESS residual is

Var(e(i)) = Var( ei/(1 − hii) ) = σ²(1 − hii)/(1 − hii)² = σ²/(1 − hii),

so a standardized PRESS residual is

e(i)/√Var(e(i)) = [ei/(1 − hii)] / [σ/√(1 − hii)] = ei / (σ √(1 − hii)).
Homoscedasticity implies that the variance is the same for all observations (σi = σ for all i). The prediction error sum of squares is

PRESS = Σ_{i=1}^n e(i)² = Σ_{i=1}^n [ ei/(1 − hii) ]².

2.2.4 R-student
The R-student is also called the externally studentized residual, while the studentized residual discussed above is called the internally studentized residual. Instead of MSE, the R-student uses the variance estimate S(i)² obtained after deleting the ith observation:

S(i)² = [ (n − p)MSE − ei²/(1 − hii) ] / (n − p − 1),

and the R-student is given by

ri* = ti = ei / √( S(i)² (1 − hii) ).

Under the usual assumptions, ri* ∼ t_{n−p−1}.


The relationship between ri and ri* is

ri*² = ri² (n − p − 1)/(n − p − ri²).

In many situations ri* differs little from ri. However, when the ith observation is influential, the R-student is more sensitive to that point than the studentized residual.
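In R both kinds of scaled residuals are available directly; the sketch below assumes a fitted lm object called fit:

ri      <- rstandard(fit)   # internally studentized residuals r_i
ri.star <- rstudent(fit)    # externally studentized (R-student) residuals r_i*
which(abs(ri.star) > 3)     # observations worth a closer look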

2.3 Residual Plots


After calculating the various types of residuals, we can use graphical analysis of the residuals to investigate the adequacy of the fitted regression model and to check whether the assumptions are satisfied. In this report we introduce several residual plots, for example the normal probability plot and plots of the residuals against the fitted values ŷi, the regressors xi, and the time sequence.

2.3.1 Normal Probability Plot

Figure 1: Patterns for residual plots: (a) satisfactory; (b) funnel; (c) double bow; (d) nonlinear.

If the errors come from a distribution with heavier tails than the normal, the outliers they produce tend to pull the regression line toward themselves. In such cases other estimation techniques, such as robust regression, should be considered.

The normal probability plot is a direct method for checking the normality assumption. The first step is to rank the residuals in increasing order (e[1], e[2], · · · , e[n]) and then plot e[i] against the cumulative probability Pi = (i − 1/2)/n. This is usually called a P-P plot.

1. If the resulting points lie approximately on the diagonal line, the errors are consistent with the normal distribution.

2. If the fitted curve is sharper than the diagonal line at both extremes, the errors come from a distribution with lighter tails.

3. If the fitted curve is flatter than the diagonal line at both extremes, the errors come from a distribution with heavier tails.

4. If the fitted curve rises or falls sharply at the right extreme, the errors come from a distribution that is positively or negatively skewed, respectively.

Similar to the P-P plot, the Q-Q plot plots e[i] against the corresponding quantiles of the normal distribution.
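A quick way to produce such a plot in R (a sketch, assuming a fitted lm object called fit):

r <- rstandard(fit)   # internally studentized residuals
qqnorm(r)             # ordered residuals against normal quantiles
qqline(r)             # reference line through the quartiles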

2.3.2 Plot of Residuals against the Fitted Values ŷi


The plot of the residuals (ei, ri, di or ri*) against the fitted values ŷi is useful for detecting several kinds of model inadequacy. The vertical axis shows the residuals and the horizontal axis shows the fitted values.

1. In the ideal plot all points lie within a horizontal band.

2. If the plot looks like an outward-opening (or inward-opening) funnel, the variance of the errors is an increasing (or decreasing) function of y.

3. If the plot looks like a double bow, the variance of a binomial proportion near 0.5 is greater than the variance near 0 or 1.

4. If the plot looks like a curve, the model is nonlinear.

Example 2. The diagnostic plots show the residuals in four different ways. To look at all four plots at once rather than one by one (here lm is the fitted model object):

par(mfrow=c(2,2)) # Change the panel layout to 2 x 2


plot(lm)
par(mfrow=c(1,1)) # Change back to 1 x 1

Let's take a look at the first type of plot:

Residuals vs Fitted

This plot shows if residuals have non-linear patterns. There could be a


non-linear relationship between predictor variables and an outcome variable
and the pattern could show up in this plot if the model doesn’t capture
the non-linear relationship. If you find equally spread residuals around a
horizontal line without distinct patterns, that is a good indication you don’t
have non-linear relationships.
Let's look at residual plots from a good model and a bad model. The good-model data are simulated in a way that meets the regression assumptions very well, while the bad-model data are not:

We don’t see any distinctive pattern in Case 1, but we see a parabola in
Case 2, where the non-linear relationship was not explained by the model
and was left out in the residuals.

Normal Q-Q

This plot shows whether the residuals are normally distributed. Do the residuals follow the straight line well, or do they deviate severely? It is good if the residuals line up well along the straight dashed line.

Case 2 definitely concerns us. We would not be too concerned by Case 1, although the observation numbered 38 looks a little off. Let's look at the next plot while keeping in mind that 38 might be a potential problem.

Scale-Location

This plot shows if residuals are spread equally along the ranges of pre-
dictors. This is how you can check the assumption of equal variance (ho-
moscedasticity). It’s good if you see a horizontal line with equally (randomly)
spread points.

In Case 1 the residuals appear randomly spread. In Case 2, however, the residuals begin to spread wider along the x-axis as the fitted value passes about 5. Because the residuals spread wider and wider, the red smooth line is not horizontal and shows a steep angle in Case 2.

Residuals vs Leverage

This plot helps us find influential cases, if there are any. Not all outliers are influential in linear regression analysis. Even if an observation has extreme values, it might not be influential in determining the regression line: the results would not be much different whether we include or exclude it from the analysis. Such points follow the trend of the majority of cases and do not really matter; they are not influential. On the other hand, some cases can be very influential even though they appear to lie within a reasonable range of values. They can be extreme with respect to the regression line and can alter the results if we exclude them from the analysis. Another way to put it is that they do not get along with the trend of the majority of the cases.

Unlike the other plots, this time patterns are not relevant. We watch out for outlying values at the upper right corner or at the lower right corner; those spots are the places where cases can be influential against a regression line. Look for cases outside the dashed lines, which mark Cook's distance. When cases lie outside the Cook's distance lines (meaning they have high Cook's distance scores), they are influential for the regression results: the results will be altered if we exclude those cases.

Case 1 is the typical look when there is no influential case, or cases. You
can barely see Cook’s distance lines (a red dashed line) because all cases are
well inside of the Cook’s distance lines. In Case 2, a case is far beyond the
Cook’s distance lines (the other residuals appear clustered on the left because
the second plot is scaled to show larger area than the first plot). The plot
identified the influential observation as 49. If we exclude the 49th case from
the analysis, the slope coefficient changes from 2.14 to 2.68 and R2 from .757
to .851. Pretty big impact!

The four plots show potentially problematic cases together with their row numbers in the data set. If some cases are identified across all four plots, you might want to take a close look at them individually. Is there anything special about those subjects? Or could it simply be an error in data entry?

So, what does having patterns in the residuals mean for your research? It's not just a go-or-stop sign; it tells you about your model and data. Your current model might not be the best way to understand your data if there is so much unexplained structure left in the data.

In that case, you may want to go back to your theory and hypotheses. Is the relationship between the predictors and the outcome really linear? You may want to include a quadratic term, for example, or a log transformation may better represent the phenomenon you would like to model. Or is there an important variable that you left out of your model? Variables you did not include (e.g., age or gender) may play an important role in your model and data. Or perhaps your data were systematically biased during collection, in which case you may want to redesign the data collection methods.

2.3.3 Plot of Residuals against the Regressor


Here the horizontal axis is the regressor. Compared with the ordinary residuals, outliers are easier to identify from the studentized residuals. The possible patterns are the same as the four situations described above. If the horizontal axis is xj, patterns 2 or 3 indicate nonconstant variance, and pattern 4 indicates that the assumed relationship between y and xj is not correct.
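A minimal R sketch of such a plot, assuming a fitted lm object fit and a regressor vector xj taken from the model data (both hypothetical names):

plot(xj, rstandard(fit), xlab = "x_j", ylab = "studentized residual")
abline(h = 0, lty = 2)   # horizontal reference line at zero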

3 Transformation and Weighting to Correct Model Inadequacies
In the previous section we mentioned that we would introduce methods to deal with the problem of unequal variance. Data transformation and weighted least squares are two common methods that are useful for building models without violations of the assumptions. In this section we place the emphasis on data transformation.

3.1 Transformation for Nonlinear Relation Only


First we consider transformations for linearizing a nonlinear regression relation when the distribution of the error terms is reasonably close to normal and the error terms have approximately constant variance. In this situation, transformations on X should be attempted. Transformations on Y may not be desirable here, because a transformation on Y may change the shape of the distribution of the error terms and also lead to differing error term variances.
Example 3. In one study we have 10 participants; X represents the number of days of training received and Y the performance score in a battery of simulated sales situations. A scatter plot of these data is shown below. Clearly the regression relation appears to be curvilinear, so the simple regression model does not seem appropriate. We consider a square-root transformation X' = √X. In the scatter plot below, the same data are plotted with the predictor variable transformed to X'. Note that the scatter plot now shows a reasonably linear relation.

data
X Y
1 0.5 42.5
2 0.5 50.6
3 1.0 68.5
4 1.0 80.7
5 1.5 89.0
6 1.5 99.6
7 2.0 105.3
8 2.0 111.8
9 2.5 112.3
10 2.5 125.7
> lm<-lm(Y~X)
> X2<-sqrt(data$X)
> X2
[1] 0.7071068 0.7071068 1.0000000 1.0000000 1.2247449 1.2247449 1.4142136
[8] 1.4142136 1.5811388 1.5811388
> plot(Y~X2)
> lm2<-lm(Y~X2)
> summary(lm2)

Call:
lm(formula = Y ~ X2)

Residuals:
Min 1Q Median 3Q Max
-9.3221 -4.1884 -0.2367 4.1007 7.7200

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -10.328 7.892 -1.309 0.227
X2 83.453 6.444 12.951 1.2e-06 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

Residual standard error: 6.272 on 8 degrees of freedom


Multiple R-squared: 0.9545, Adjusted R-squared: 0.9488
F-statistic: 167.7 on 1 and 8 DF, p-value: 1.197e-06

We obtain the fitted regression function

Ŷ = −10.33 + 83.45X' = −10.33 + 83.45√X.

The plot of residuals shows no evidence of unequal error variance and no strong indication of substantial departures from normality.

3.2 Assumption of Constant Variance
One assumption of the regression model is that the errors have constant variance. A common reason for a violation (Var(εi) ≠ Var(εj), i ≠ j) is that the response y follows a distribution whose variance is related to its mean, and hence to the regressors x. For example, if y follows a Poisson distribution in a simple linear regression model, then E(y) = Var(y) = λ, where λ is the parameter of the Poisson distribution and is related to the regressor variable x.

As another example, suppose we want to analyse the relationship between residents' income and purchasing power, yi = β0 + β1 xi + εi. The variation among low-income families is small because their purchases are mainly daily necessities, while the variation among high-income families is large because their purchases range more widely, from automobiles to houses. This explains why the error variance is unequal.

In another example, if y is a proportion, i.e. 0 ≤ yi ≤ 1, then the variance of y is proportional to E(y)[1 − E(y)]. In such cases a variance-stabilizing transformation is useful.

Many factors can cause heteroscedasticity; in particular, when the sample consists of cross-sectional data (different subjects observed at the same time or period), the errors are more likely to have different variances.
To address these problems we can re-express the response y and then transform the predicted values back to the original units. Table 1 shows a series of common variance-stabilizing transformations. Two points should be noted about these transformations:

1. A mild transformation applied over a relatively narrow range of values (ymax/ymin < 3) has little effect.

2. A strong transformation applied over a wide range of values has a dramatic effect on the analysis.

After making a suitable transformation, y' is used as the study variable in the corresponding case. From the residual plot we can see the trend of the residuals against the fitted values ŷ. If the plot looks like pattern (b), (c) or (d) in Figure 1, it indicates unequal variance, and we can then try one of these transformations to obtain a better residual plot. The strength of the transformation needed depends on the amount of curvature between the study and explanatory variables.

Relation of σ² to E(y)        Transformation
σ² ∝ constant                 y' = y (no transformation)
σ² ∝ E(y)                     y' = √y (square root)
σ² ∝ E(y)[1 − E(y)]           y' = sin⁻¹(√y)  (0 ≤ yi ≤ 1)
σ² ∝ [E(y)]²                  y' = ln(y) (log)
σ² ∝ [E(y)]³                  y' = y^(−1/2)
σ² ∝ [E(y)]⁴                  y' = y^(−1)

Table 1: Useful Variance-Stabilizing Transformations

The transformations in Table 1 range from relatively mild to relatively strong: the square-root transformation is relatively mild and the reciprocal transformation is relatively strong.

3.2.1 Box-Cox Method for Appropriate Transformation


Choosing a transformation by inspecting residual plots and scatter diagrams of y against x is empirical. Here we introduce a more formal and objective way to transform a regression model: the Box-Cox method.

The Box-Cox method applies a power transformation y^λ to the response; λ and the regression parameters can be estimated simultaneously by maximum likelihood. An obvious problem is that as λ approaches zero, y^λ approaches 1, which is not useful. We can instead transform the response as

y^(λ) = (y^λ − 1)/λ   if λ ≠ 0,
y^(λ) = ln y          if λ = 0.

There is still a problem: as λ changes, the values of (y^λ − 1)/λ change dramatically in scale. The formula is therefore improved to

y^(λ) = (y^λ − 1)/(λ ẏ^(λ−1))   if λ ≠ 0,
y^(λ) = ẏ ln y                  if λ = 0,

where ẏ = exp[(1/n) Σ_{i=1}^n ln yi] is the geometric mean of the observations.
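In R, the profile likelihood for λ can be examined with the boxcox function from the MASS package; the sketch below assumes a data frame mydata with a positive response y and a regressor x (hypothetical names):

library(MASS)
bc     <- boxcox(y ~ x, data = mydata)   # profile log-likelihood over a grid of lambda
lambda <- bc$x[which.max(bc$y)]          # lambda maximizing the profile log-likelihood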

3.2.2 Weighted Least Squares (WLS)
In many cases the assumptions of uncorrelated and homoscedastic errors may not be valid, so alternative methods should be considered in order to compute improved parameter estimates.

One alternative is weighted least squares (WLS). This method redistributes the influence of the data points on the estimation of the parameters, and it is used when the model has heteroscedastic errors (non-constant variances). In the weighted least squares approach, observations with larger variances receive smaller weights, and conversely observations with smaller variances receive larger weights. With this method the variance-covariance matrix of the errors is replaced by a more general expression,

Cov(ε) = σ²W⁻¹,

where W is a positive definite matrix (Fahrmeir 2013).

W is defined as the diagonal matrix of weights as follows:

W = diag(w1 , w2 , ..., wn )
so that

W⁻¹ = diag(1/w1, 1/w2, ..., 1/wn).

Thus the heteroscedastic variances can be written as Var(εi) = σ²/wi.
For the WLS approach, the response variable (y), the regressor matrix (X) and the errors (ε) are transformed so that the transformed variables follow a linear model with homoscedastic errors. For i = 1, 2, ..., n, with k the number of regressors in the model:

• Transformed errors: εi* = √wi εi. This transformation induces constant variance, since Var(εi*) = Var(√wi εi) = wi · σ²/wi = σ².

• Transformed response variable: yi* = √wi yi.

• Transformed predictors: xik* = √wi xik.
Thus, the linear model can be rewritten as

W^(1/2) y = W^(1/2) Xβ + W^(1/2) ε,     (1)

where

W^(1/2) = diag(√w1, √w2, ..., √wn).
The regression coefficient estimates are obtained by minimizing the weighted sum of squares (WSS):

WSS(β) = Σ_{i=1}^n êwi² = Σ_{i=1}^n wi (yi − xi'β)²,

where the residuals of the weighted least squares model are defined as

êwi = √wi (yi − xi'β),   i = 1, 2, ..., n.
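Minimizing WSS(β) gives the closed form β̂ = (X'WX)⁻¹X'Wy. A small R sketch of this computation, assuming a design matrix X (with an intercept column), a response vector y and positive weights w (all hypothetical names):

W        <- diag(w)                                    # diagonal weight matrix
beta.wls <- solve(t(X) %*% W %*% X, t(X) %*% W %*% y)  # (X'WX)^{-1} X'W y
# The same estimates should come from coef(lm(y ~ X - 1, weights = w))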

Example 4. For the simple linear regression model

yi = β0 + β1 xi + εi,

the weighted least squares estimates can be obtained as follows:
1. We define the weighted sum of squares:

   WSS(β0, β1) = Σ_{i=1}^n wi (yi − β0 − β1 xi)².

2. We then differentiate WSS(β0, β1) with respect to β0 and β1 and set the derivatives equal to zero. Let Q = Σ_{i=1}^n wi (yi − β0 − β1 xi)². Then

   ∂Q/∂β0 = −2 Σ_{i=1}^n wi (yi − β̂0 − β̂1 xi) = 0,

   which gives the first normal equation

   β̂0 Σ_{i=1}^n wi + β̂1 Σ_{i=1}^n wi xi = Σ_{i=1}^n wi yi.

   The same procedure with respect to β1 gives

   β̂0 Σ_{i=1}^n wi xi + β̂1 Σ_{i=1}^n wi xi² = Σ_{i=1}^n wi xi yi.

3. Finally, we solve these normal equations for the weighted least squares estimates of β0 and β1. For example, it can be shown that the weighted estimator of β1 is

   β̂1w = Σ_{i=1}^n wi (xi − x̄w)(yi − ȳw) / Σ_{i=1}^n wi (xi − x̄w)²,

   where x̄w = Σ wi xi / Σ wi and ȳw = Σ wi yi / Σ wi. From this we obtain the weighted least squares estimate of the intercept,

   β̂0w = ȳw − β̂1w x̄w.
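These closed-form expressions are easy to check numerically; the following R sketch assumes numeric vectors x, y and positive weights w of the same length (hypothetical data):

xbar.w <- sum(w * x) / sum(w)                       # weighted mean of x
ybar.w <- sum(w * y) / sum(w)                       # weighted mean of y
b1w <- sum(w * (x - xbar.w) * (y - ybar.w)) / sum(w * (x - xbar.w)^2)
b0w <- ybar.w - b1w * xbar.w
coef(lm(y ~ x, weights = w))                        # should agree with c(b0w, b1w)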

Example 5. Consider the dataset "cleaningwtd". We want to develop a regression equation to model the relationship between the number of rooms cleaned, Y, and the number of crews, X, and then to predict the number of rooms cleaned by 4 and by 16 crews. In this case we take

wi = 1 / (standard deviation of Yi)²,

so that yi has variance σ²/wi, with σ² = 1.
cleaningwtd <-read.table(file.choose(),header=TRUE,sep=",")
> cleaningwtd
Case Crews Rooms StdDev
1 1 16 51 12.000463
2 2 10 37 7.927123
3 3 12 37 7.289910
4 4 16 46 12.000463
5 5 16 45 12.000463
6 6 4 11 4.966555
7 7 2 6 3.000000
8 8 4 19 4.966555
9 9 6 29 4.690416
10 10 2 14 3.000000
11 11 12 47 7.289910
12 12 8 37 6.642665
13 13 16 60 12.000463
14 14 2 6 3.000000
15 15 2 11 3.000000
16 16 2 10 3.000000
17 17 6 19 4.690416
18 18 10 33 7.927123
19 19 16 46 12.000463
20 20 16 69 12.000463
21 21 10 41 7.927123
22 22 6 19 4.690416
23 23 2 6 3.000000
24 24 6 27 4.690416
25 25 10 35 7.927123
26 26 12 55 7.289910

27 27 4 15 4.966555
28 28 4 18 4.966555
29 29 16 72 12.000463
30 30 8 22 6.642665
31 31 10 55 7.927123
32 32 16 65 12.000463
33 33 6 26 4.690416
34 34 10 52 7.927123
35 35 12 55 7.289910
36 36 8 33 6.642665
37 37 10 38 7.927123
38 38 8 23 6.642665
39 39 8 38 6.642665
40 40 2 10 3.000000
41 41 16 65 12.000463
42 42 8 31 6.642665
43 43 8 33 6.642665
44 44 12 47 7.289910
45 45 10 42 7.927123
46 46 16 78 12.000463
47 47 2 6 3.000000
48 48 2 6 3.000000
49 49 8 40 6.642665
50 50 12 39 7.289910
51 51 4 9 4.966555
52 52 4 22 4.966555
53 53 12 41 7.289910
> attach(cleaningwtd)
> wm1 <- lm(Rooms~Crews,weights=1/StdDev^2)
> summary(wm1)

Call:
lm(formula = Rooms ~ Crews, weights = 1/StdDev^2)

Weighted Residuals:
Min 1Q Median 3Q Max
-1.43184 -0.82013 0.03909 0.69029 2.01030

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.8095 1.1158 0.725 0.471
Crews 3.8255 0.1788 21.400 <2e-16 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

Residual standard error: 0.9648 on 51 degrees of freedom


Multiple R-squared: 0.8998, Adjusted R-squared: 0.8978
F-statistic: 458 on 1 and 51 DF, p-value: < 2.2e-16

We would like to find 95% prediction intervals for Y when x = 4 and x = 16.

> predict(wm1,newdata=data.frame(Crews=c(4,16)),interval="prediction",level=0.95)
fit lwr upr
1 16.11133 13.71210 18.51056
2 62.01687 57.38601 66.64773
Warning message:
In predict.lm(wm1, newdata = data.frame(Crews = c(4, 16)), interval = "prediction"
Assuming constant prediction variance even though model fit is weighted

Write the model as a multiple linear regression with two predictors and no
intercept.

> ynew <- Rooms/StdDev


> x1new <- 1/StdDev
> x2new <- Crews/StdDev
> wm1check <- lm(ynew~x1new + x2new - 1)
> summary(wm1check)

Call:
lm(formula = ynew ~ x1new + x2new - 1)

Residuals:
Min 1Q Median 3Q Max
-1.43184 -0.82013 0.03909 0.69029 2.01030

Coefficients:
Estimate Std. Error t value Pr(>|t|)

x1new 0.8095 1.1158 0.725 0.471
x2new 3.8255 0.1788 21.400 <2e-16 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

Residual standard error: 0.9648 on 51 degrees of freedom


Multiple R-squared: 0.9617, Adjusted R-squared: 0.9602
F-statistic: 639.6 on 2 and 51 DF, p-value: < 2.2e-16

3.3 Assumption of a Linear Relationship Between Response and Regressors
In some cases a nonlinear model can be linearized by a suitable transformation. Such nonlinear models are called intrinsically (or transformably) linear. The advantage of transforming a nonlinear function into a linear one is that the statistical tools developed for the linear regression model can then be applied. If the scatter diagram of y against x looks like the curve of a function we recognize, we can use the linearized form of that function to represent the data. For example, consider the exponential model

y = β0 e^(β1 x) ε.

Using the transformation y' = ln(y), the model becomes

ln y = ln β0 + β1 x + ln ε,

or

y' = β0' + β1 x + ε',

where β0' = ln β0 and ε' = ln ε.
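A brief R sketch of fitting this intrinsically linear model, assuming a data frame mydata with a positive response y and regressor x (hypothetical names):

fit.log <- lm(log(y) ~ x, data = mydata)   # regress ln(y) on x
b1 <- coef(fit.log)[2]                     # estimate of beta1
b0 <- exp(coef(fit.log)[1])                # back-transform: beta0 = exp(beta0')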

4 Detection and Treatment of Outliers (Influential Observations)
As mentioned before, an outlier is an observation that is distant from the other observations; it may be three or four standard deviations from the mean (|di| > 3, say). An outlier may be due to natural variability in the measurements, or it may indicate an experimental error; in the latter case it can simply be removed from the data set. Using the studentized or R-student residuals, the residual plots against ŷi, and the normal probability plot, outliers are easy to find. If an outlier is not the result of a systematic error, it can play a more important role than the other points because it may control many key model properties.

We can compare the original values of the parameter estimates and of summary statistics such as the t statistics, the F statistic, R² and MSR with the values obtained after removing the suspected outliers.
An observation may be outlying with respect to the response y or with respect to the regressors xi. If the absolute value of the R-student is greater than 3 (|ri*| > 3), the point can be flagged as an outlier and may be deleted. Sometimes a point has a large residual but is better regarded as an influential point than as an outlier in the response; such a high-influence point does not necessarily harm the regression model, but it can have a significant influence on the fitted regression.

As defined earlier, hii, the ith diagonal element of the hat matrix H, is called the leverage of the ith observation because it is a standardized measure of the distance of the ith observation from the centre of the x space. It can be written as

hii = xi'(X'X)⁻¹xi,

where xi' is the ith row of the X matrix. The larger hii is, the smaller Var(ei) is. Points with large leverage pull the regression line in their direction; however, if a high-leverage point lies almost on the line passing through the remaining observations, it has essentially no effect on the regression coefficients. The mean value of hii is

h̄ = (1/n) Σ_{i=1}^n hii = p/n,

since Σ_{i=1}^n hii = rank(H) = rank(X) = p. If a leverage hii is two or three times larger than h̄, the point is regarded as a high-leverage point. Observations with both large hat diagonals and large residuals are called influential points. Influential points are not always outliers with respect to the response, so leverage alone does not determine whether a point is an outlier.
It is desirable to consider both the location of the point in the x space and the response variable when measuring influence. We therefore introduce Cook's distance, a measure of the squared distance between the estimates obtained with and without the ith point. It is named after the American statistician R. Dennis Cook, who introduced the concept in 1977. The distance measure can be expressed as

Di = (β̂(i) − β̂)' X'X (β̂(i) − β̂) / (p MSE),   i = 1, · · · , n,

where β̂ is the least-squares estimate based on all n points and β̂(i) is the estimate obtained by deleting the ith point. It can also be written as

Di = (ri²/p) · Var(ŷi)/Var(ei) = (ri²/p) · hii/(1 − hii) = ei² hii / [p σ̂² (1 − hii)²].

From this expression we see that Cook's distance reflects the combined effect of the leverage hii and the residual ei.

Points with large values of Di have great influence on the least-squares estimates. However, it is difficult to set a standard for how large Di must be, and there are different opinions regarding what cut-off values to use for spotting highly influential points. A simple operational guideline is that a point with Di > 1 is regarded as influential, while a point with Di < 0.5 is not; others have suggested Di > 4/n as the cut-off.
R Codes:

library(foreign)
library(MASS)
cd<-cooks.distance(lm)
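A brief continuation of the code above (a sketch; lm is the fitted model object and cd the Cook's distances just computed):

h <- hatvalues(lm)         # leverages h_ii
p <- length(coef(lm)); n <- length(h)
which(h > 2 * p / n)       # high-leverage points: h_ii > 2*p/n
which(cd > 1)              # influential points by the D_i > 1 guideline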

Example 6. The table below lists the values of the hat matrix diagonals hii and of Cook's distance measure Di for the wire bond pull strength data. Calculate D1.

5 Lack of Fit of the Regression Model
In the formal test for lack of fit we assume that the regression model satisfies the requirements of normality, independence and constant variance, and we only want to determine whether the relationship is a straight line or not.

Replicating experiments in regression analysis aims to make clear whether there are any non-negligible factors other than x. Here replication means running ni separate experiments at the level x = xi and observing the response each time, rather than running a single experiment and measuring the response ni times; the residuals from the latter reflect only measurement variation and are useless for assessing the fitted model. If other uncontrolled or non-negligible factors, including interactions, affect the response, the fit may be poor, and this is called lack of fit. In that situation, even if the hypothesis test shows that the regression is significant, it only indicates that the regressor x has an effect on the response y; it does not indicate that the fit is good. Thus, for a data set that contains replicated observations, we can test the expected (mean) function for lack of fit. Whether the model is adequate is determined mainly through residual analysis.

The residuals are composed of two parts: one part, called pure error, is random and cannot be eliminated, while the other part is related to the model and is called lack of fit.

The test for lack of fit is used to judge whether a regression model should be accepted. The test is based on the relative size of the lack-of-fit and pure-error components: if the lack of fit is significantly greater than the pure error, the model should be rejected.

The sum of squares due to lack of fit is one component of a partition of the residual sum of squares in an analysis of variance, and it is used in the numerator of an F-test of the null hypothesis that the proposed model fits well:

SSE = SSPE + SSLOF,

where SSPE is the sum of squares due to pure error and SSLOF is the sum of squares due to lack of fit.

Notation:
xi, i = 1, · · · , m: the ith level of the regressor x.
ni: the number of replications at the ith level of x, with Σ_{i=1}^m ni = n.

yij, i = 1, · · · , m, j = 1, · · · , ni: the jth observation at xi.

The ijth residual can be decomposed as

yij − ŷi = (yij − ȳi) + (ȳi − ŷi).

The full model is

yij = µi + εij,   E(yij) = µi.

For the full model we estimate µi by µ̂i = ȳi, so

SSE(Full Model) = Σ_{i=1}^m Σ_{j=1}^{ni} (yij − ȳi)².

The reduced model is

yij = β0 + β1 xi + εij,   with fitted values ŷij = β̂0 + β̂1 xi.

Note that the reduced model is an ordinary simple linear regression model. Its residual sum of squares is

SSE = SSE(Reduced Model) = Σ_{i=1}^m Σ_{j=1}^{ni} (yij − β̂0 − β̂1 xi)² = Σ_{i=1}^m Σ_{j=1}^{ni} (yij − ŷi)²
    = Σ_{i=1}^m Σ_{j=1}^{ni} (yij − ȳi)² + Σ_{i=1}^m ni (ȳi − ŷi)².

The cross-product term Σ_{i=1}^m Σ_{j=1}^{ni} (yij − ȳi)(ȳi − ŷi) equals 0. The pure-error sum of squares is therefore

SSPE = SSE(Full Model) = Σ_{i=1}^m Σ_{j=1}^{ni} (yij − ȳi)²,

and the lack-of-fit sum of squares is

SSLOF = SSE − SSPE = Σ_{i=1}^m ni (ȳi − ŷi)².

The factor ni and the single fitted value ŷi appear in this formula because all the yij at the level xi have the same fitted value ŷi.

If the fitted values ŷi are close to the corresponding average responses ȳi, then the lack-of-fit sum of squares is approximately zero, which indicates that the regression function is likely to be linear, and vice versa.

The number of degrees of freedom for pure error at each level xi is ni − 1, so the total number of degrees of freedom associated with the sum of squares due to pure error is Σ_{i=1}^m (ni − 1) = n − m. The number of degrees of freedom associated with SSLOF is m − 2, because the regressor has m levels and two parameters must be estimated to obtain the fitted values ŷi.
The hypotheses are

H0: E(yi) = β0 + β1 xi   (the reduced model is adequate)
H1: E(yi) ≠ β0 + β1 xi   (the reduced model is not adequate).

The test statistic for lack of fit is

F0 = [SSLOF/(m − 2)] / [SSPE/(n − m)] = MSLOF / MSPE.

In theory, if the model fits well, the parameter estimates should not change much across replicated experiments, and the smaller SSLOF is, the better. If F0 > F(α, m−2, n−m), or equivalently if the P-value of F0 is less than α, the null hypothesis that the tentative model adequately describes the data is rejected; in other words, lack of fit exists in the model at level α. Otherwise there is no evidence of lack of fit.

Example 7. Perform the F-test for lack of fit on the following data:

(90, 81), (90, 83), (79, 75), (66, 68), (66, 60), (66, 62), (51, 60), (51, 64), (35, 51), (35, 53)

> x<-c(90,90,79,66,66,66,51,51,35,35)
> y<-c(81,83,75,68,60,62,60,64,51,53)
> lmred<-lm(y~x)              # reduced model (straight line)
> lmfull<-lm(y~as.factor(x))  # full model (one mean per x level)
> anova(lmred,lmfull)
Analysis of Variance Table

Model 1: y ~ x
Model 2: y ~ as.factor(x)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 8 118.441
2 5 46.667 3 71.775 2.5634 0.168

From the output above we conclude that SSE = 118.44 (with 8 df), SSLOF = 71.775 (with 3 df), and SSPE = 46.667 (with 5 df).
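As a check, the lack-of-fit F statistic can be recomputed by hand from these sums of squares (m = 5 distinct x levels, n = 10 observations):

SS.LOF <- 71.775; SS.PE <- 46.667
m <- 5; n <- 10
F0 <- (SS.LOF/(m - 2)) / (SS.PE/(n - m))   # = 2.5634, matching the anova() output
1 - pf(F0, m - 2, n - m)                   # P-value, about 0.168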
