
Chapter 4: Regression Assumptions and Residual/Error Diagnostics

When we fit a simple linear regression model to our sample data and use the estimated model to make predictions and statistical
inferences about a larger population, we make several assumptions that may or may not be correct for the data at hand. The key
theoretical assumptions about the linear regression model are:
• Linearity. The equation used to connect the expected value of the Y (dependent) variable to the different levels of the X (independent) variable describes the actual pattern of the data. In other words, we use a straight-line equation because we assume the "average" pattern in the data is indeed linear.
• Normality. The errors are normally distributed with a mean of 0.
• Homoskedasticity/constant variance. The errors have the same theoretical variance, regardless of the values of X and, thus, regardless of the expected value of Y. For a straight line, this means that the vertical variation of data points around the line has about the same magnitude everywhere.
• Independence. The errors are independent of each other (i.e., they form a random sample) and are independent of any time order in the data. A full treatment of dependency requires an introduction to time series analysis; the Durbin-Watson test is a common check for serial correlation.

This chapter presents various regression diagnostic procedures to assess these assumptions.
Under these assumptions, the same standard deviation holds at all values of the independent variable, so the error distribution curves at different values of the independent variable all look the same; they are simply translated along the x-axis according to the regression relationship. The only assumption that cannot be visualized in this way is independence, which will usually be satisfied if there is no temporal component and the experimenter did a proper job of designing the experiment.
A. CONSEQUENCES OF INVALID ASSUMPTIONS

Linearity. Using the wrong equation (such as using a straight line for curved data) is very serious. Predicted values will be wrong in
a biased manner, meaning that predicted values will systematically miss the true pattern of the expected value of Y as related to X.

Normality. If the errors do not have a normal distribution, it usually is not particularly serious. Simulation results have shown that regression inferences tend to be robust to nonnormality of the errors. In practice, the residuals may appear to be nonnormal when the wrong regression equation has been used. That said, if the error distribution is severely nonnormal, then inferential procedures can be very misleading; e.g., statistical intervals may be too wide or too narrow. It is not possible to check the assumption that the overall mean of the errors is equal to 0, because the least squares process forces the residuals to sum to 0. However, if the wrong equation is used and the predicted values are biased, the sample residuals will be patterned so that they may not average 0 at specific values of X.

Homoskedasticity. The principal consequence of nonconstant variance (i.e., where the variance is not the same at each level of X) is that prediction intervals for individual Y values will be wrong, because they are computed assuming constant variance. There is a small effect on the validity of t-test and F-test results, but generally regression inferences are robust with regard to the variance issue.

Independence. As noted earlier, this assumption will usually be satisfied if there is no temporal component and the experimenter
did a proper job of designing the experiment. However, if there is a time-ordering of the observations, then there is the possibility of
correlation between consecutive errors. If present, then this is indicative of a badly misspecified model and any subsequent
inference procedures will be severely misleading.
B. DIAGNOSING VALIDITY OF ASSUMPTIONS

Visualization plots and diagnostic measures used to detect types of disagreement between the observed data and an assumed
regression model have a long history. Central to many of these methods are residuals based on the fitted model. As will be shown,
there are more than just the raw residuals; i.e., the difference between the observed values and the fitted values. We briefly outline
the types of strategies that can be used to diagnose validity of each of the four assumptions. We will then discuss these procedures
in greater detail.

 DIAGNOSING WHETHER THE RIGHT TYPE OF EQUATION WAS USED

1. Examine a plot of residuals versus fitted values (predicted values). A curved pattern in the residuals versus fitted values plot indicates that the wrong type of equation has been used (a minimal plotting sketch appears after this list).

2. Use goodness-of-fit measures. For example, r² can be used as a rough goodness-of-fit measure, but it should by no means be the sole basis for judging the appropriateness of the model fit. The plot of residuals versus fitted values can also be used for assessing goodness of fit. Other measures, such as model selection criteria, are discussed in Chapter 5.

3. If the regression has repeated measurements at the same $x_i$ values, then you can perform a formal lack-of-fit test (also called a pure error test) in which the null hypothesis is that the type of equation used as the regression equation is correct. Failure to reject this null hypothesis is a good result, since it means that the regression equation is adequate.
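
As an illustration of item 1 (and of using r² as a rough measure from item 2), here is a minimal Python sketch, assuming hypothetical arrays `x` and `y` and the statsmodels and matplotlib libraries; the data values are made up purely for illustration.

```python
# Fit a simple linear regression and examine the residuals-versus-fitted plot.
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)        # hypothetical predictor
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8, 12.1, 14.2, 15.9])  # hypothetical response

X = sm.add_constant(x)             # design matrix with an intercept column
fit = sm.OLS(y, X).fit()           # ordinary least squares fit
print("r-squared:", fit.rsquared)  # rough goodness-of-fit measure

plt.scatter(fit.fittedvalues, fit.resid)  # a curved pattern suggests the wrong equation
plt.axhline(0, linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residuals versus fitted values")
plt.show()
```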
 DIAGNOSING WHETHER THE ERRORS HAVE A NORMAL DISTRIBUTION

1. Examine a histogram of the residuals to see if it appears to be bell-shaped, such as the residuals from the simulated data given in Figure (a). The difficulty is that the shape of a histogram may be difficult to judge unless the sample size is large.

2. Examine a normal probability plot of the residuals. Essentially, the ordered (standardized) residuals are plotted against the theoretical expected values for a sample from a standard normal population. A straight-line pattern in a normal probability plot (NPP) indicates that the assumption of normality is reasonable, such as the NPP given in Figure (b).

3. Do a hypothesis test in which the null hypothesis is that the errors have a normal distribution. Failure to reject this null hypothesis is a good result: it means that it is reasonable to assume that the errors have a normal distribution. Goodness-of-fit tests are also used for this purpose. (A sketch of items 2 and 3 follows this list.)
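
Continuing the sketch above (and reusing its fitted results object `fit`), the following illustrates items 2 and 3 with a normal probability plot and the Shapiro-Wilk test from scipy; Shapiro-Wilk is only one possible choice of normality test.

```python
# Normal probability (Q-Q) plot and a formal normality test of the residuals.
import matplotlib.pyplot as plt
import scipy.stats as stats

resid = fit.resid

# A roughly straight-line pattern supports the normality assumption.
stats.probplot(resid, dist="norm", plot=plt)
plt.show()

# Null hypothesis: the errors are normally distributed, so a large p-value is a good result.
w_stat, p_value = stats.shapiro(resid)
print(f"Shapiro-Wilk W = {w_stat:.3f}, p-value = {p_value:.3f}")
```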


 DIAGNOSING WHETHER OR NOT THE VARIANCE IS CONSTANT

1. Examine a plot of residuals versus fitted values. Obvious differences in the vertical spread of the residuals indicate nonconstant variance. The most typical pattern for nonconstant variance is a residuals versus fitted values plot that resembles a sideways cone.

2. Do a hypothesis test with the null hypothesis that the variance of the errors is the same for all values of the predictor variable(s). There are various statistical tests that can be used, such as the modified Levene test and Bartlett's test (a sketch of the modified Levene test follows this list).
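
As a sketch of item 2, one way to carry out the modified Levene (Brown-Forsythe) test is to split the residuals into low and high fitted-value groups and compare their spreads with scipy; splitting at the median fitted value is one common, but not the only, choice. This again reuses the fitted results object `fit` from the earlier sketch.

```python
# Modified Levene (Brown-Forsythe) test for constant variance of the errors.
import numpy as np
import scipy.stats as stats

resid = fit.resid
fitted = fit.fittedvalues

low = resid[fitted <= np.median(fitted)]   # residuals with small fitted values
high = resid[fitted > np.median(fitted)]   # residuals with large fitted values

# center="median" gives the modified (Brown-Forsythe) version of Levene's test.
stat, p_value = stats.levene(low, high, center="median")
print(f"Modified Levene statistic = {stat:.3f}, p-value = {p_value:.3f}")
```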
 DIAGNOSING INDEPENDENCE OF THE ERROR TERMS

1. Examine a plot of the residuals versus their observation order. Trends in this plot are indicative of a correlated error structure and, hence, dependency (a sketch of this check, together with the Durbin-Watson statistic, follows this list).
2. If the observations in the dataset represent successive time periods at which they were recorded and you suspect a temporal component to the study, then time series methods can be used for analyzing and remedying the situation.
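
A minimal sketch of both checks, assuming the observations are stored in time order and reusing the fitted results object `fit` from the earlier sketch; Durbin-Watson values near 2 suggest little first-order serial correlation.

```python
# Residuals in observation order plus the Durbin-Watson statistic.
import matplotlib.pyplot as plt
from statsmodels.stats.stattools import durbin_watson

plt.plot(fit.resid, marker="o")    # trends or cycles suggest correlated errors
plt.axhline(0, linestyle="--")
plt.xlabel("Observation order")
plt.ylabel("Residual")
plt.show()

print("Durbin-Watson statistic:", durbin_watson(fit.resid))
```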

C. HOW TO IDENTIFY AN OUTLIER IN REGRESSION

A value of $|r_i| > 3$ usually indicates that the observation can be considered an outlier. Other methods that can be used in outlier detection are discussed later. Plots of residuals versus fitted values are constructed by plotting all pairs of $r_i$ and $\hat{y}_i$ for the observed sample, with the residuals on the vertical axis. The idea is that the properties of the residuals should be the same for all different values of the predicted values. Specifically, as we move across the plot (across predicted values), the average of the residuals should always be about 0 and the vertical variation in the residuals should maintain about the same magnitude. Note that plots of $e_i$ versus $\hat{y}_i$ can also be used and typically appear similar to those based on the Studentized residuals, which are defined as $r_i = e_i/\sqrt{MSE}$.
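
A minimal sketch of applying the $|r_i| > 3$ rule with the definition $r_i = e_i/\sqrt{MSE}$ given above, again reusing the fitted results object `fit` from the earlier sketch.

```python
# Flag potential outliers using the (semi)Studentized residuals r_i = e_i / sqrt(MSE).
import numpy as np

mse = fit.mse_resid                     # residual mean square, SSE / (n - p)
r = fit.resid / np.sqrt(mse)            # r_i = e_i / sqrt(MSE)

outliers = np.where(np.abs(r) > 3)[0]   # |r_i| > 3 suggests a possible outlier
print("Flagged observations:", outliers)
```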

Ideal Appearance of Plots

The usual assumptions for regression imply that the pattern of deviations (errors) from the regression line should be similar regardless of the value of the predicted value (and the value of the X variable). The consequence is that a plot of residuals versus fitted values (or residuals versus an X variable) ideally has a random, "zero correlation" appearance.
Regression Model in Matrix Notation

$$Y = X\beta + \varepsilon \;\Rightarrow\; \hat{Y} = X\hat{\beta} = X(X'X)^{-1}X'Y$$

Hat / projection matrix: $H = P = X(X'X)^{-1}X'$, whose diagonal elements $h_{11}, h_{22}, \dots, h_{nn}$ are the leverages. Each leverage satisfies $0 \le h_{ii} \le 1$ and is compared with $2\bar{h} = 2p/n$, where $p$ is the number of parameters and $n$ the sample size. The cutoff $2\bar{h} = 2p/n$ is applicable only if the sample size is greater than the number of independent variables.

In simple linear regression we can also use the formulas

$$h_{ii} = p_{ii} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum (x_i - \bar{x})^2} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{SS_{xx}}
\qquad\text{and}\qquad
h_{ij} = p_{ij} = \frac{1}{n} + \frac{(x_i - \bar{x})(x_j - \bar{x})}{\sum (x_i - \bar{x})^2}.$$

Properties of the hat matrix

1. Symmetric: $H' = H$.
   Proof: $H' = [X(X'X)^{-1}X']' = X[(X'X)^{-1}]'X' = X(X'X)^{-1}X' = H$, since $(X'X)^{-1}$ is symmetric.

2. Idempotent: $HH = H$ (and hence $H^3 = H$, and so on).
   Proof: $HH = [X(X'X)^{-1}X'][X(X'X)^{-1}X'] = X(X'X)^{-1}(X'X)(X'X)^{-1}X' = X\,I\,(X'X)^{-1}X' = H$.
   Exercise: show that $(I - H)$ is also idempotent. (Hint: recall that $AA^{-1} = I$.)

3. $\hat{Y} = HY$.
   Proof: $HY = X(X'X)^{-1}X'Y = X\hat{\beta} = \hat{Y}$.
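
These properties can also be verified numerically. The following is a minimal sketch, assuming the design matrix `X`, response `y`, and fitted results `fit` from the earlier sketch; forming H explicitly like this is only sensible for small data sets, since statsmodels can compute the leverages directly.

```python
# Form the hat matrix H = X (X'X)^{-1} X' and check its properties.
import numpy as np

H = X @ np.linalg.inv(X.T @ X) @ X.T

print(np.allclose(H, H.T))                    # symmetric: H' = H
print(np.allclose(H @ H, H))                  # idempotent: HH = H
print(np.allclose(H @ y, fit.fittedvalues))   # HY = Y-hat

h = np.diag(H)                                # leverages h_ii, each between 0 and 1
n, p = X.shape                                # sample size and number of parameters
print("Leverages exceeding 2p/n:", np.where(h > 2 * p / n)[0])
```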
LEVERAGE, INFLUENCE AND OUTLIERS

• An outlier is a data point whose response y does not follow the general trend of the rest of the data.
• A data point has high leverage if it has "extreme" predictor x values. With a single predictor, an extreme x value is simply one that is particularly high or low. With multiple predictors, extreme x values may be particularly high or low for one or more predictors, or may be "unusual" combinations of predictor values (e.g., with two predictors that are positively correlated, an unusual combination of predictor values might be a high value of one predictor paired with a low value of the other predictor).

Note that — for our purposes — we consider a data point to be an outlier only if it is extreme with respect to the other y values, not
the x values.
A data point is influential if it excessively influences any part of a regression analysis, such as the predicted responses, the
estimated slope coefficients, or the hypothesis test results.
Outliers and high leverage data points have the potential to be influential, but we generally have to investigate further to determine
whether or not they are actually influential.

One advantage of the case in which we have only one predictor is that we can look at simple scatter plots in order to identify any
outliers and high leverage data points.

Let's take a look at a few examples that should help to clarify the distinction between the two types of extreme values.

Example 1
Based on the definitions above, do you think the following data set contains any outliers? Or, any high leverage data points?

All of the data points follow the general trend of the rest of the data, so there are no outliers (in the y direction). And, none of the
data points are extreme with respect to x, so there are no high leverage points. Overall, none of the data points would appear to be
influential with respect to the location of the best fitting line.
Example 2
Now, how about this example? Do you think the following data set contains any outliers? Or, any high leverage data points?

Of course! Because the red data point does not follow the general trend of the rest of the data, it would be considered an outlier.
However, this point does not have an extreme x value, so it does not have high leverage. Is the red data point influential? An easy
way to determine if the data point is influential is to find the best fitting line twice — once with the red data point included and once
with the red data point excluded.
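
A minimal sketch of this "fit it twice" check, reusing the hypothetical `x` and `y` from earlier and a hypothetical index `suspect` marking the questionable point: if the slope (or any other result of interest) changes substantially when the point is removed, the point is influential.

```python
# Compare the fitted slope with and without a suspect observation.
import numpy as np
import statsmodels.api as sm

suspect = 5                                   # hypothetical index of the red data point
mask = np.arange(len(y)) != suspect           # True for every observation except the suspect one

full_fit = sm.OLS(y, sm.add_constant(x)).fit()
reduced_fit = sm.OLS(y[mask], sm.add_constant(x[mask])).fit()

print("Slope with the point:   ", full_fit.params[1])
print("Slope without the point:", reduced_fit.params[1])
```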
Example 3
Now, how about this example? Do you think the following data set contains any outliers? Or, any high leverage data points?

In this case, the red data point does follow the general trend of the rest of the data. Therefore, it is not deemed an outlier here.
However, this point does have an extreme x value, so it does have high leverage. Is the red data point influential? It certainly
appears to be far removed from the rest of the data (in the x direction), but is that sufficient to make the data point influential in this
case?
Example 4
One last example! Do you think the following data set contains any outliers? Or, any high leverage data points?

In this case, the red data point is most certainly an outlier and has high leverage! The red data point does not follow the general
trend of the rest of the data and it also has an extreme x value. And, in this case the red data point is influential.
DIFFICULTIES POSSIBLY SEEN IN THE PLOTS
Three primary difficulties may show up in plots of residuals versus fitted values:
1. Outliers in the data.
2. The regression equation for the average does not have the right form.
3. The residual variance is not constant.

 Difficulty 1: Outliers in the Y Values


An unusual value for the Y variable will often lead to a large residual. Thus, an outlier may show up as an extreme point in the vertical direction of the residuals versus fitted values plot. The figure below gives a plot of data simulated from a simple linear regression model, in which the point at (6, 35) is an outlier, as reflected by its especially large residual. This is also confirmed by the plot of Studentized residuals versus fitted values, which is not shown here.

HOW TO CHECK FOR INFLUENTIAL OBSERVATIONS


• Leverage procedure.
• Jackknife / deleted residuals.
• Cook's distance.
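
All three quantities listed above are available from statsmodels' influence diagnostics. A minimal sketch, reusing the fitted results object `fit` from the earlier sketches:

```python
# Leverage, jackknife (deleted) residuals, and Cook's distance for each observation.
from statsmodels.stats.outliers_influence import OLSInfluence

infl = OLSInfluence(fit)

leverage = infl.hat_matrix_diag                   # leverages h_ii
deleted_resid = infl.resid_studentized_external   # jackknife / deleted (externally Studentized) residuals
cooks_d, cooks_p = infl.cooks_distance            # Cook's distance and accompanying p-values

print("Max leverage:        ", leverage.max())
print("Max |deleted resid|: ", abs(deleted_resid).max())
print("Max Cook's distance: ", cooks_d.max())
```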

 Difficulty 2: Wrong Mathematical Form of the Regression Equation

A curved appearance in a plot of residuals versus fitted values indicates that we used a regression equation that does not match
the curvature of the data. Thus, we have misspecified our model. Figures (c) and (d) show the case of nonlinearity in the residual
plots. When this happens, there is often a pattern to the data similar to that of an exponential or trigonometric function.

 Difficulty 3: Nonconstant Residual Variation

Many times, nonconstant residual variance leads to a sideways cone or funnel shape in the plot of residuals versus fitted values. Figures (b) and (d) show plots of such residuals versus fitted values. Nonconstant variance is noted by how the data "fan out" as the predicted value increases. In other words, the residual variance (i.e., the vertical range in the plot) increases as the size of the predicted value increases. This typically has a minimal impact on the regression estimates, but it is a feature of the data that must be taken into account when reporting the accuracy of the predictions, especially for inferential purposes.

D. DATA TRANSFORMATIONS

Transformations of the variables are used in regression to describe curvature and sometimes are also used to adjust for
nonconstant variance in the errors and the response variable. Below are some general guidelines when considering
transformations.

 What to Try?
When there is curvature in the data, there may be some theory in the literature of the subject matter that suggests an appropriate equation. Or, you might have to use trial-and-error data exploration to determine a model that fits the data. In the trial-and-error approach, you might try polynomial models or transformations of X and/or Y, such as square root, logarithmic, or reciprocal transformations. One of these will often improve the overall fit, but note that interpretations of any quantities will then be on the scale of the transformed variable(s).

 Transform X or Transform Y?

In the data exploration approach, if you transform Y, you will change the variance of Y and of the errors. You may wish to try common transformations of Y (e.g., $\log(Y)$, $\sqrt{Y}$, or $Y^{-1}$) when there is nonconstant variance and possible curvature in the data. Try transformations of X (e.g., $X^{-1}$, $X^2$, or $X^3$) when the data are curved but the variance looks to be constant in the original scale of Y.
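
As a minimal sketch of trying a transformation of Y (reusing the hypothetical `x` and `y` from earlier, and assuming all y values are positive), one can refit the model on log(Y) and re-examine the residual plots on the transformed scale.

```python
# Refit the regression with log(Y) as the response.
import numpy as np
import statsmodels.api as sm

log_fit = sm.OLS(np.log(y), sm.add_constant(x)).fit()

# Coefficients, fitted values, and intervals are now on the log(Y) scale,
# so the earlier residual diagnostics should be repeated on this fit.
print(log_fit.summary())
```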

 Why Might Logarithms Work?

Logarithms are often used because they are connected to common exponential growth and power curve relationships. The relationships sketched below are easily verified using the algebra of logarithms.
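
For reference, the standard relationships of this kind are the exponential growth and power curve models, which become straight-line relationships after taking logarithms:

```latex
% Exponential growth: taking the log of Y gives a straight line in X.
Y = \alpha e^{\beta X} \;\Longrightarrow\; \ln(Y) = \ln(\alpha) + \beta X

% Power curve: taking the log of both Y and X gives a straight line in ln(X).
Y = \alpha X^{\beta} \;\Longrightarrow\; \ln(Y) = \ln(\alpha) + \beta \ln(X)
```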
