Math 445
Chapter 11: Model Checking and Refinement

Rainfall data
In the rainfall data, we ended up leaving out case 28 (Death Valley) because it had a large residual and its altitude was the lowest in the data set. The resulting model is therefore not applicable to such low altitude locations. If case 28 had not been unusual, then we would not have been justified in omitting it.
Without #28 (Death Valley)
Coefficients(a)

                      Unstandardized Coefficients   Standardized
                      B            Std. Error       Beta          t         Sig.
(Constant)            -2.074       .525                           -3.951    .001
Altitude (ft)         .000725      .000241          4.647         3.012     .006
Latitude (degrees)    .093924      .014285          .773          6.575     .000
Rainshadow            -.431176     .059929          -.662         -7.195    .000
Altitude*Latitude     -.000019     .000006          -4.620        -2.959    .007

a. Dependent Variable: Log10(Precipitation)
R² = .80

Since there is an interaction between Altitude and Latitude, interpretation of the coefficients for these variables becomes a little complicated. However, we can interpret the effect of the Rainshadow variable in this model.
Case 28 is an example of an outlier, a case for which the model does not fit well. Outliers have large residuals. We are also interested in influential cases, cases whose omission changes the fitted model substantially. Influential cases may not be outliers. Least squares is sensitive to unusual cases and an influential case may “pull” the regression plane toward it so much that it does not have a large residual.

In simple linear regression, we could often identify influential cases simply from a scatterplot. In multiple regression, it may not be possible to see influential cases in pairwise scatterplots and we need additional tools.
Case-Influence statistics

Leverage: The leverage of a case is based only on the values of the explanatory variables. It measures the distance of the case from the mean for the explanatory variables (in multidimensional space). For one explanatory variable, the leverage is

$$h_i = \frac{1}{n-1}\left[\frac{X_i - \bar{X}}{s_X}\right]^2 + \frac{1}{n} = \frac{(X_i - \bar{X})^2}{\sum_{j=1}^{n} (X_j - \bar{X})^2} + \frac{1}{n}$$
With more than one explanatory variable, the leverage is a measure of distance in higher-dimensional space. The distance takes into account the joint variability of the variables – see Display 11.10 on p. 316.
High-leverage cases are easy to identify visually with only one explanatory variable, but become increasingly difficult to identify visually with more explanatory variables. Leverages are always between 1/n and 1. The average of all the leverages in a data set is always p/n, where p is the number of regression coefficients (the explanatory variables plus the intercept).

SPSS computes centered leverages (under the Linear Regression…Save button), even though it calls them simply “leverages.” The centered leverage is $h_i - 1/n$. Therefore, the centered leverage is between 0 and $1 - 1/n$.

Leverage measures the potential influence of a case. High-leverage cases have the potential to change the least squares fit substantially.
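With more than one explanatory variable, leverages are usually obtained as the diagonal of the "hat" matrix $H = X(X^TX)^{-1}X^T$, where $X$ is the design matrix including a column of ones for the intercept. The following is a minimal sketch of that calculation in Python with NumPy. The six data points are made up purely for illustration, and the variable names are not from the notes.

```python
import numpy as np

# Hypothetical data: 6 cases, 2 explanatory variables (values are made up).
X_raw = np.array([
    [1.0, 10.0],
    [2.0, 12.0],
    [3.0, 11.0],
    [4.0, 15.0],
    [5.0, 14.0],
    [12.0, 30.0],   # far from the other cases in both variables
])
n = X_raw.shape[0]

# Design matrix with an intercept column, so p counts every regression coefficient.
X = np.column_stack([np.ones(n), X_raw])
p = X.shape[1]

# Leverages are the diagonal elements of H = X (X'X)^{-1} X'.
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)

# Centered leverage (the quantity SPSS saves) subtracts 1/n from each leverage.
h_centered = h - 1.0 / n

print("leverages:         ", np.round(h, 3))
print("centered leverages:", np.round(h_centered, 3))
print("mean leverage:", h.mean(), "  p/n:", p / n)
```

In this sketch the last case should show by far the largest leverage, and the mean leverage should agree with p/n up to rounding error.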
Studentized residuals

While the true residuals (what we called the $\varepsilon_i$) all have the same standard deviation $\sigma$ in the regression model, the observed residuals $e_i$ don’t. Why not? Consider simple linear regression:
• True residual: $\varepsilon_i = Y_i - (\beta_0 + \beta_1 X_i)$
• Observed residual: $e_i = Y_i - (\hat{\beta}_0 + \hat{\beta}_1 X_i)$
First, we already know that the observed residuals tend to be smaller in size than the true residuals. That’s why we divide by n − 2 when we compute the standard deviation of the observed residuals to get an estimate of the standard deviation of the true residuals. The observed residuals tend to be smaller because the least squares line is the line that best fits the data, so the deviations from this line tend to be smaller than the deviations from the true line.

What do we mean when we say that the residuals do not all have the same standard deviation? How can a single value have a standard deviation? What we mean is: what is the standard deviation of the residuals at each $X_i$ over many simulated sets of data from the linear regression model with a fixed set of $X_i$’s? To carry out this simulation we would follow these steps. The $X_i$’s remain the same for every simulation.
1. Generate a set of $Y_i$’s where each $Y_i$ is from a normal distribution with mean $\beta_0 + \beta_1 X_i$ and standard deviation $\sigma$. That gives a set of n pairs of values $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$.
2. Fit the least squares line.
3. Compute the residuals.
4. Repeat steps 1-3 many times with a new set of $Y_i$’s each time.

Now look at the distribution of observed residuals for each $X_i$ and, in particular, compute the standard deviation of the observed residuals at each $X_i$. You will find that the standard deviations are different and that the standard deviation of the residuals for $X_i$’s far from $\bar{X}$ (high leverage values) is smaller than for $X_i$’s near $\bar{X}$ (low leverage values). In fact, it can be shown that the standard deviation of the residual at $X_i$ is

$$\mathrm{SD}(\text{Residual}_i) = \sigma\sqrt{1 - h_i}$$

where $h_i$ is the leverage. This formula applies to any multiple regression model, not just the simple linear regression model.
Example (simple linear regression): Suppose $Y_i$ is normal with mean $\mu(Y_i) = 1 + 2X_i$, i = 1, …, 5, and standard deviation $\sigma = 1$, and that the $X_i$’s are 1, 4, 5, 6 and 14. Here are the $Y_i$’s from one simulation: 3.42, 9.86, 10.05, 12.90, 27.38. The least squares line is $\hat{Y} = 1.73 + 1.83X$ and the residuals are: -0.145, 0.803, -0.844, 0.182, 0.004.
Repeating the simulation 10,000 times, here are the mean and standard deviation of the residuals at each $X_i$:

Xi     mean of residuals    std. dev. of residuals
1       0.008               0.737
4       0.002               0.869
5      -0.013               0.884
6       0.000               0.900
14      0.004               0.350
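The simulation described above can be sketched in a few lines of Python with NumPy. This is only one way to set it up, not the code used to produce the table: the random seed is arbitrary, so the printed values should be close to the tabled ones but will not match them digit for digit.

```python
import numpy as np

rng = np.random.default_rng(445)  # arbitrary seed, chosen only for reproducibility

x = np.array([1.0, 4.0, 5.0, 6.0, 14.0])   # the X_i's stay fixed in every simulation
beta0, beta1, sigma = 1.0, 2.0, 1.0        # true model: mean 1 + 2X, SD 1
n_sims = 10_000

X = np.column_stack([np.ones_like(x), x])  # design matrix with intercept
residuals = np.empty((n_sims, x.size))

for s in range(n_sims):
    # Step 1: generate Y_i from Normal(beta0 + beta1 * X_i, sigma)
    y = beta0 + beta1 * x + rng.normal(0.0, sigma, size=x.size)
    # Step 2: fit the least squares line
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    # Step 3: compute the residuals
    residuals[s] = y - X @ coef
    # Step 4 is the loop itself: repeat with a new set of Y_i's

res_mean = residuals.mean(axis=0)
res_sd = residuals.std(axis=0, ddof=1)
for xi, m, sd in zip(x, res_mean, res_sd):
    print(f"X = {xi:4.0f}   mean = {m:7.3f}   sd = {sd:.3f}")
```

The residual standard deviation at X = 14 should come out much smaller than at the other four points, which is the pattern the table shows.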
Calculate the leverages for these 5 $X_i$’s:

Use the formula for SD(Residual_i) given earlier to calculate the standard deviation of the residuals at each $X_i$. How do they match the values estimated from the simulation?
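One way to check your answers is the short Python sketch below. It applies the one-variable leverage formula and then $\mathrm{SD}(\text{Residual}_i) = \sigma\sqrt{1 - h_i}$ with $\sigma = 1$, and sets the results beside the simulated standard deviations copied from the table in these notes. The variable names are illustrative only.

```python
import numpy as np

x = np.array([1.0, 4.0, 5.0, 6.0, 14.0])
sigma = 1.0
n = x.size

# One-variable leverage: h_i = (X_i - Xbar)^2 / sum_j (X_j - Xbar)^2 + 1/n
dev_sq = (x - x.mean()) ** 2
h = dev_sq / dev_sq.sum() + 1.0 / n

# Theoretical residual SD at each X_i: sigma * sqrt(1 - h_i)
sd_theory = sigma * np.sqrt(1.0 - h)

# Standard deviations from the 10,000-simulation table in the notes
sd_simulated = np.array([0.737, 0.869, 0.884, 0.900, 0.350])

for xi, hi, sd_t, sd_s in zip(x, h, sd_theory, sd_simulated):
    print(f"X = {xi:4.0f}   h = {hi:.3f}   theory = {sd_t:.3f}   simulated = {sd_s:.3f}")
```

The leverages sum to 2, the number of regression coefficients in a simple linear regression, which is a quick sanity check on the calculation.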
Why are we so concerned about the standard deviation of the residuals at different $X_i$’s?
• Because a big residual is more unusual at a high-leverage point than at a low-leverage point. Therefore, standardizing the residuals by an estimate of their standard deviation is a better way to compare residuals. Since residuals always have mean 0, this means dividing each residual by an estimate of its standard deviation $\sigma\sqrt{1 - h_i}$. Since we don’t know $\sigma$, we replace it by $\hat{\sigma}$ (the square root of the mean square residual in the ANOVA table).
• The studentized residual is

$$\text{studres}_i = \frac{\text{res}_i}{\hat{\sigma}\sqrt{1 - h_i}}$$
• Studentized residuals are also sometimes called internally studentized residuals.
• In SPSS, they are called “studentized residuals” (under the Save button on the Linear Regression window).
A potential problem with the studentized residuals is that $\hat{\sigma}$ may be inflated if a residual is an outlier. Therefore, a modified version of the studentized residual is the externally studentized residual, called the studentized deleted residual in SPSS. $\hat{\sigma}$ is replaced by $\hat{\sigma}_{(i)}$, the estimated standard deviation of the residuals from the model fit with the ith observation omitted.

$$\text{studres}_i^{*} = \frac{\text{res}_i}{\hat{\sigma}_{(i)}\sqrt{1 - h_i}}$$
Internally and externally studentized residuals can be used in just the same way as the raw residuals: in residual plots, normal probability plots, etc. In fact, they are preferred to the raw residuals because the nonconstant variance of the raw residuals has been corrected for.

When examining studentized residuals, one should look for outliers. In addition, one can use the standard normal distribution as a rough guide for identifying unusual values: e.g., we expect about 5% of values to be less than -2 or greater than 2, and less than 1% to be outside the range -3 to 3.
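To make the two definitions concrete, here is a minimal Python sketch that computes both kinds of studentized residual for the five-point example used earlier in these notes (X = 1, 4, 5, 6, 14 and the one simulated set of Y values). The helper function `fit` is a name introduced here for illustration. The externally studentized residuals are obtained the slow but transparent way, by refitting the model with each case left out.

```python
import numpy as np

x = np.array([1.0, 4.0, 5.0, 6.0, 14.0])
y = np.array([3.42, 9.86, 10.05, 12.90, 27.38])
n = x.size
X = np.column_stack([np.ones(n), x])   # design matrix with intercept
p = X.shape[1]

def fit(Xm, yv):
    """Return least squares coefficients and sigma-hat (root mean square residual)."""
    coef, *_ = np.linalg.lstsq(Xm, yv, rcond=None)
    res = yv - Xm @ coef
    sigma_hat = np.sqrt(res @ res / (len(yv) - Xm.shape[1]))
    return coef, sigma_hat

coef, sigma_hat = fit(X, y)
res = y - X @ coef
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)   # leverages

# Internally studentized residuals: res_i / (sigma_hat * sqrt(1 - h_i))
studres = res / (sigma_hat * np.sqrt(1.0 - h))

# Externally studentized residuals: sigma_hat is replaced by sigma_hat_(i),
# estimated from the fit with case i omitted.
studres_ext = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    _, sigma_hat_i = fit(X[keep], y[keep])
    studres_ext[i] = res[i] / (sigma_hat_i * np.sqrt(1.0 - h[i]))

print("internal:", np.round(studres, 3))
print("external:", np.round(studres_ext, 3))
```

With only five cases the deleted fits have just two residual degrees of freedom, so the external values are quite variable here. The example is meant to show the mechanics, not a realistic outlier screen.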
Cook’s Distance

A more direct measure of the influence of an observation is Cook’s Distance, which measures how much the fitted values change when each observation is omitted. For case i,

$$D_i = \sum_{j=1}^{n} \frac{\left(\hat{Y}_{j(i)} - \hat{Y}_j\right)^2}{p\,\hat{\sigma}^2}$$

where p is the number of regression coefficients. The numerator of the above expression is what’s important; the denominator just standardizes the statistic. $\hat{Y}_j$ is the fitted value for case j when the whole data set is used to fit the model. $\hat{Y}_{j(i)}$ is the fitted value for case j when case i is omitted in fitting the model. So, for example, to calculate $D_1$ we omit case 1, fit the model, and calculate the fitted values for all observations including case 1. We then calculate the sum of the squared differences between these predicted values and the predicted values from the model fit to all the data.

A value of Cook’s D close to or greater than 1 is often considered to be indicative of an observation with large influence. While Cook’s D is a useful measure if the goal of the model is prediction, it is not as useful for seeing how a particular coefficient changes when an observation is omitted. However, it can be used to identify cases to check: omit a case with large Cook’s D and see how the coefficients of interest change.
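The definition of Cook’s Distance translates directly into a leave-one-out loop. The sketch below, in Python with NumPy, does this for the five-point example from earlier in these notes. It also computes the standard closed-form equivalent, $D_i = \frac{1}{p}\,\text{studres}_i^2\,\frac{h_i}{1 - h_i}$, which avoids refitting; that shortcut is not stated in these notes and is included only as a cross-check. The helper `ls_coef` is a name introduced here for illustration.

```python
import numpy as np

x = np.array([1.0, 4.0, 5.0, 6.0, 14.0])
y = np.array([3.42, 9.86, 10.05, 12.90, 27.38])
X = np.column_stack([np.ones(x.size), x])
n, p = X.shape                      # p = number of regression coefficients

def ls_coef(Xm, yv):
    """Least squares coefficients for response yv and design matrix Xm."""
    coef, *_ = np.linalg.lstsq(Xm, yv, rcond=None)
    return coef

coef = ls_coef(X, y)
fitted = X @ coef
res = y - fitted
sigma2_hat = res @ res / (n - p)    # mean square residual

# Definition: omit case i, refit, and compare fitted values for ALL cases.
cooks_d = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    coef_i = ls_coef(X[keep], y[keep])
    fitted_i = X @ coef_i           # includes the prediction for the omitted case
    cooks_d[i] = np.sum((fitted_i - fitted) ** 2) / (p * sigma2_hat)

# Cross-check with the closed form based on leverage and studentized residuals.
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
studres = res / np.sqrt(sigma2_hat * (1.0 - h))
cooks_d_shortcut = studres ** 2 * h / (p * (1.0 - h))

print("Cook's D (refitting):", np.round(cooks_d, 4))
print("Cook's D (shortcut): ", np.round(cooks_d_shortcut, 4))
```

The shortcut shows why both ingredients matter: a case needs a sizeable studentized residual and a sizeable leverage to produce a large Cook’s D.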
Other measures of influence

A number of other measures of influence have been proposed. However, some of these measures are redundant and it is not necessary to look at all of them. Two others that SPSS computes are DfFits, which measures how much the predicted value for case i changes when case i is omitted, and DfBetas, which measures how much the omission of case i changes each of the coefficients in the model (hence, for each case, there is a separate DfBetas value for each variable).
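As a rough illustration of what these two measures capture, the Python sketch below computes unscaled versions by leave-one-out refitting on the five-point example from earlier in these notes. Two assumptions are worth flagging: the sign convention (full-data value minus deleted-case value) is chosen here for convenience, and no standardization is applied, whereas statistical packages typically also offer standardized versions divided by a standard error. The names `dffit` and `dfbeta` are illustrative.

```python
import numpy as np

x = np.array([1.0, 4.0, 5.0, 6.0, 14.0])
y = np.array([3.42, 9.86, 10.05, 12.90, 27.38])
X = np.column_stack([np.ones(x.size), x])
n, p = X.shape

def ls_coef(Xm, yv):
    """Least squares coefficients for response yv and design matrix Xm."""
    coef, *_ = np.linalg.lstsq(Xm, yv, rcond=None)
    return coef

coef = ls_coef(X, y)
fitted = X @ coef

dffit = np.empty(n)          # one value per case
dfbeta = np.empty((n, p))    # one value per case and per coefficient
for i in range(n):
    keep = np.arange(n) != i
    coef_i = ls_coef(X[keep], y[keep])
    # Change in case i's own predicted value when case i is omitted
    dffit[i] = fitted[i] - X[i] @ coef_i
    # Change in each coefficient (intercept, slope) when case i is omitted
    dfbeta[i] = coef - coef_i

print("change in fitted value:", np.round(dffit, 4))
print("change in coefficients (rows = cases):")
print(np.round(dfbeta, 4))
```

Each row of `dfbeta` belongs to one case and each column to one coefficient, matching the description of a separate value for each variable.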
Rainfall data: model Logprec = Altitude + Latitude + Rainshadow + Altitude*Latitude

Results with all cases included:

Without case #28:
In Sec. 11.4.4, p. 320, Ramsey and Schafer suggest that if “the residual plot from a good inferential model fails to suggest any problems, there is generally no need to examine case influence statistics at all.” I would agree except that I would suggest that the residual plot should use the externally studentized residuals (=studentized deleted residuals).
We next examine two types of plots useful in refining models:

• Partial regression leverage plots (also called added-variable plots) are useful for visually identifying influential and high leverage points for each regression coefficient separately. These are not discussed in the text, but are easily available in SPSS.
• Partial residual plots (also called component-plus-residual plots) are useful for identifying nonlinear relationships in a multiple regression model. These are discussed in the text, but are not readily available in SPSS. They can be constructed in SPSS, but it’s rather tedious.
It might seem that simply plotting the response variable Y versus each explanatory variable would be adequate for assessing the relationship between Y and each X variable for a multiple regression model. However, these plots can be misleading because they do not control for the values of the other X variables. For example, an apparently strong relationship between Y and $X_1$ may disappear when other variables are included in the model. If the scatterplot of Y versus $X_1$ looks curved, it does not necessarily mean that a squared term will be necessary with the other X variables in the model. Similarly, a case that appears influential in the Y versus $X_1$ scatterplot may not be influential with the other X variables in the model and a case that doesn’t appear influential may turn out to be so with the other X variables in the model.

Plots of the residuals versus each X variable are also inadequate. They are better than Y versus X plots because they show only the unexplained variation in Y on the y-axis. However, the X variables are not adjusted for relationships with each other.

Partial regression leverage plots (not in text)

• A partial regression leverage plot (or added-variable plot) attempts to separate out the relationship between the response and any explanatory variable after adjusting for the other explanatory variables in the model.
The steps involved in creating the partial regression leverage plot for variable $X_1$ are:
1. Compute the residuals from the regression of Y on all the other X variables in the model except $X_1$.
2. Compute the residuals from the regression of $X_1$ on all the other X variables in the model.
3. Plot the first set of residuals on the y-axis against the second set on the x-axis.
Steps 1-3 are repeated for all the X variables in the model.

The partial regression leverage plot for $X_1$ looks at the relationship between Y and $X_1$ after adjusting for the other X variables. It turns out that the slope of the least squares line for this plot is exactly equal to $\hat{\beta}_1$, the coefficient on $X_1$ in the regression model with all the X’s in it. In addition, high leverage and influential cases for $\hat{\beta}_1$ can be identified from this scatterplot. This is the primary use of the partial regression leverage plots.
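The three steps, and the claim about the slope, can be checked numerically. The Python sketch below uses simulated data (entirely hypothetical, with three explanatory variables and made-up coefficients) and a small helper, `residuals`, introduced here for illustration. It computes the two sets of residuals and compares the slope of the line through them with the coefficient on $X_1$ from the full model.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed

# Hypothetical simulated data; x2 is deliberately correlated with x1.
n = 40
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)
x3 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.0 * x2 + 0.5 * x3 + rng.normal(scale=0.5, size=n)

def residuals(response, *predictors):
    """Residuals from the least squares regression of response on predictors plus an intercept."""
    Xm = np.column_stack([np.ones(len(response)), *predictors])
    coef, *_ = np.linalg.lstsq(Xm, response, rcond=None)
    return response - Xm @ coef

# Step 1: residuals from regressing Y on all X variables except X1
ry = residuals(y, x2, x3)
# Step 2: residuals from regressing X1 on the other X variables
rx = residuals(x1, x2, x3)
# Step 3: the plot is ry (y-axis) against rx (x-axis). Both sets of residuals
# have mean zero, so the least squares line through them passes through the origin.
slope = (rx @ ry) / (rx @ rx)

# Coefficient on X1 from the full multiple regression, for comparison
X_full = np.column_stack([np.ones(n), x1, x2, x3])
coef_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)

print("slope in added-variable plot:", slope)
print("beta1-hat from full model:   ", coef_full[1])
```

To draw the plot itself, pass `rx` and `ry` to any scatterplot routine. A case sitting far out along the x-axis of that plot has high leverage for $\hat{\beta}_1$ specifically.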
SPSS: Partial regression leverage plots for all X variables can be generated automatically in SPSS by selecting “Produce all partial plots” on the Plot menu for the Regression…Linear menu.
Partial residual plots

• A partial residual plot (or component-plus-residual plot) is constructed differently from a partial regression leverage plot, but also has the property that the slope of the least squares line through the plot is the coefficient for that variable in the multiple regression model with all the X variables included.
• Partial residual plots are better than partial regression leverage plots for identifying nonlinear relationships between Y and an X variable after adjusting for the other X variables in the model.
• If a clear nonlinear relationship is identified, possible solutions include adding the square of the X variable to the model, transforming the X variable, or transforming the Y variable.
• To construct the partial residual plot for $X_1$, follow the steps below. For the sake of this example, assume there are three other X variables in the model: $X_2$, $X_3$, $X_4$.
1. Regress Y on all the X variables to obtain $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \hat{\beta}_3 X_3 + \hat{\beta}_4 X_4$.
2. Compute the partial residuals for $X_1$ as $\text{pres} = Y - \hat{\beta}_0 - \hat{\beta}_2 X_2 - \hat{\beta}_3 X_3 - \hat{\beta}_4 X_4$.
3. Plot the partial residuals for $X_1$ (on the y-axis) against $X_1$ (on the x-axis).

Steps 1-3 should be repeated for $X_2$, $X_3$, $X_4$.
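Here is a minimal Python sketch of steps 1-3 for $X_1$, using simulated data with four explanatory variables (the data and coefficients are hypothetical). It also computes the partial residuals a second way, as the ordinary residual plus $\hat{\beta}_1 X_1$, which is algebraically the same quantity because the ordinary residual is $Y - \hat{Y}$. Finally, it checks that the least squares slope in the plot equals $\hat{\beta}_1$.

```python
import numpy as np

rng = np.random.default_rng(1)   # arbitrary seed

# Hypothetical simulated data with four explanatory variables.
n = 50
x1, x2, x3, x4 = rng.normal(size=(4, n))
y = 2.0 + 1.5 * x1 - 0.7 * x2 + 0.3 * x3 + 1.0 * x4 + rng.normal(scale=0.5, size=n)

# Step 1: regress Y on all the X variables.
X = np.column_stack([np.ones(n), x1, x2, x3, x4])
b, *_ = np.linalg.lstsq(X, y, rcond=None)   # b[0] is the intercept, b[1] goes with x1
res = y - X @ b                             # ordinary residuals from the full model

# Step 2: partial residuals for X1, removing the fitted effect of everything except X1.
pres = y - b[0] - b[2] * x2 - b[3] * x3 - b[4] * x4

# Equivalent calculation: ordinary residual plus the X1 component.
pres_alt = res + b[1] * x1

# Step 3: the plot is pres (y-axis) against x1 (x-axis).
# np.polyfit with degree 1 returns [slope, intercept] of the least squares line.
slope = np.polyfit(x1, pres, 1)[0]

print("max difference between the two forms:", np.max(np.abs(pres - pres_alt)))
print("slope in partial residual plot:", slope)
print("beta1-hat from full model:     ", b[1])
```

Because the simulated relationship is linear in every variable, the plot of `pres` against `x1` should show no curvature. Replacing `1.5 * x1` with a curved function of `x1` in the simulated response is an easy way to see what a nonlinear pattern looks like.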
Partial residual plots are also useful for identifying high leverage and influential cases.

SPSS does not automatically produce partial residual plots (recall that “partial plots” in the SPSS regression menu means “partial regression leverage plots”). It is somewhat of a hassle to produce these plots in SPSS manually, but it can be done by following steps 1-3. It is easier to replace step 2 by the equivalent calculation:

2. $\text{pres} = \text{res} + \hat{\beta}_1 X_1$, where res is the residual from the full model fit in step 1.

Thus, the steps are: fit the full model (step 1) and save the residuals as RES_1. Use Transform…Compute to compute the partial residuals as RES_1 + $\hat{\beta}_1 X_1$, where you will type in the value for $\hat{\beta}_1$ from the model fit in step 1. Plot the partial residuals versus $X_1$. Repeat for the other variables. A loess smooth can be added to the partial residual plot to help identify nonlinear relationships.

Below are both partial regression leverage plots and partial residual plots for the rainfall data, where log10(Precip) is regressed on Altitude, Latitude, and Rainshadow with no interaction. It might be best to look for nonlinear relationships before considering interactions, but certainly these plots can also be used for models with interactions. Case #28 has been omitted.
Coefficients(a)

                      Unstandardized Coefficients   Standardized
                      B            Std. Error       Beta          t         Sig.
(Constant)            -1.137       .479                           -2.372    .026
Altitude (ft)         .0000139     .0000167         .089          .832      .413
Latitude (degrees)    .06835       .01302           .562          5.250     .000
Rainshadow            -.40686      .06795           -.625         -5.988    .000

a. Dependent Variable: Log10(Precipitation)
[Figure: partial regression leverage plots and partial residual plots for the rainfall model above]