Testing the assumptions of linear regression

Quantitative models always rest on assumptions about the way the world works, and regression models are no exception. There are four principal assumptions which justify the use of linear regression models for purposes of prediction:

(i) linearity of the relationship between the dependent and independent variables;
(ii) independence of the errors (no serial correlation);
(iii) homoscedasticity (constant variance) of the errors, both (a) versus time and (b) versus the predictions (or versus any independent variable);
(iv) normality of the error distribution.

If any of these assumptions is violated (i.e., if there is nonlinearity, serial correlation, heteroscedasticity, and/or non-normality), then the forecasts, confidence intervals, and economic insights yielded by a regression model may be (at best) inefficient or (at worst) seriously biased or misleading.
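
All of the checks described below operate on the residuals of a fitted model. As a minimal sketch (assuming Python with pandas and statsmodels, and a hypothetical data file sales.csv with columns y, x1, and x2), the model and the quantities used in later checks might be obtained like this:

```python
# Minimal sketch: fit an OLS model and keep the pieces used by the
# diagnostic checks below. The file name and column names
# (sales.csv, y, x1, x2) are placeholders for your own data.
import pandas as pd
import statsmodels.api as sm

data = pd.read_csv("sales.csv")          # hypothetical dataset
X = sm.add_constant(data[["x1", "x2"]])  # independent variables plus intercept
y = data["y"]                            # dependent variable

model = sm.OLS(y, X).fit()
fitted = model.fittedvalues   # predicted values
resid = model.resid           # residuals used in the checks below
print(model.summary())        # summary includes the Durbin-Watson statistic
```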

Violations of linearity are extremely serious--if you fit a linear model to data which are nonlinearly related, your predictions are likely to be seriously in error, especially when you extrapolate beyond the range of the sample data.

How to detect: nonlinearity is usually most evident in a plot of observed versus predicted values or a plot of residuals versus predicted values, which are part of standard regression output. The points should be symmetrically distributed around a diagonal line in the former plot or around a horizontal line in the latter plot. Look carefully for evidence of a "bowed" pattern, indicating that the model makes systematic errors whenever it is making unusually large or small predictions.
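
Continuing the hypothetical sketch above (fitted, resid, and y come from that example), the two diagnostic plots might be drawn as follows:

```python
# Sketch of the two linearity diagnostics: observed vs. predicted,
# and residuals vs. predicted, using the fitted/resid/y objects from
# the hypothetical model fit shown earlier.
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

ax1.scatter(fitted, y)
ax1.plot([y.min(), y.max()], [y.min(), y.max()])  # diagonal reference line
ax1.set_xlabel("Predicted")
ax1.set_ylabel("Observed")

ax2.scatter(fitted, resid)
ax2.axhline(0)                                    # horizontal reference line
ax2.set_xlabel("Predicted")
ax2.set_ylabel("Residual")

plt.tight_layout()
plt.show()
```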

How to fix: consider applying a nonlinear transformation to the dependent and/or independent variables, if you can think of a transformation that seems appropriate. For example, if the data are strictly positive, a log transformation may be feasible. Another possibility to consider is adding another regressor which is a nonlinear function of one of the other variables. For example, if you have regressed Y on X, and the graph of residuals versus predicted suggests a parabolic curve, then it may make sense to regress Y on both X and X^2 (i.e., X-squared). The latter transformation is possible even when X and/or Y have negative values, whereas logging may not be.

Violations of independence are also very serious in time series regression models: serial correlation in the residuals means that there is room for improvement in the model, and extreme serial correlation is often a symptom of a badly mis-specified model, as we saw in the auto sales example. Serial correlation is also sometimes a byproduct of a violation of the linearity assumption--as in the case of a simple (i.e., straight) trend line fitted to data which are growing exponentially over time.

How to detect: the best test for residual autocorrelation is to look at an autocorrelation plot of the residuals. (If this is not part of the standard output for your regression procedure, you can save the RESIDUALS and use another procedure to plot the autocorrelations.) Ideally, most of the residual autocorrelations should fall within the 95% confidence bands around zero, which are located at roughly plus-or-minus 2 over the square root of n, where n is the sample size. Thus, if the sample size is 50, the autocorrelations should be between +/- 0.3; if the sample size is 100, they should be between +/- 0.2. Pay especially close attention to significant correlations at the first couple of lags and in the vicinity of the seasonal period, because these are probably not due to mere chance and are also fixable. The Durbin-Watson statistic provides a test for significant residual autocorrelation at lag 1: the DW stat is approximately equal to 2(1-a), where a is the lag-1 residual autocorrelation, so ideally it should be close to 2.0--say, between 1.4 and 2.6 for a sample size of 50.

How to fix: minor cases of positive serial correlation (say, lag-1 residual autocorrelation in the range 0.2 to 0.4, or a Durbin-Watson statistic between 1.2 and 1.6) indicate that there is some room for fine-tuning in the model. Consider adding lags of the dependent variable and/or lags of some of the independent variables.
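
Continuing the earlier hypothetical sketch, the residual autocorrelation plot and the Durbin-Watson statistic can be obtained as follows (assuming statsmodels; resid comes from the model fitted above):

```python
# Sketch: residual autocorrelation plot and Durbin-Watson statistic,
# using the resid series from the hypothetical model fitted earlier.
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.stats.stattools import durbin_watson

plot_acf(resid, lags=24)          # autocorrelations with 95% confidence bands
plt.show()

dw = durbin_watson(resid)
print(f"Durbin-Watson statistic: {dw:.2f}")   # ideally close to 2.0

# Rough rule of thumb from the text: bands at +/- 2/sqrt(n)
n = len(resid)
print(f"Approximate 95% band: +/- {2 / np.sqrt(n):.2f}")
```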

Or, if you have ARIMA options available, try adding an AR=1 or MA=1 term. (An AR=1 term in Statgraphics adds a lag of the dependent variable to the forecasting equation, whereas an MA=1 term adds a lag of the forecast error.) If there is significant correlation at lag 2, then a 2nd-order lag may be appropriate.

If there is significant negative correlation in the residuals (lag-1 autocorrelation more negative than -0.3, or a DW stat greater than 2.6), watch out for the possibility that you may have overdifferenced some of your variables. Differencing tends to drive autocorrelations in the negative direction, and too much differencing may lead to patterns of negative correlation that lagged variables cannot correct for.

If there is significant correlation at the seasonal period (e.g., at lag 4 for quarterly data or lag 12 for monthly data), this indicates that seasonality has not been properly accounted for in the model. Seasonality can be handled in a regression model in one of the following ways: (i) seasonally adjust the variables (if they are not already seasonally adjusted), or (ii) use seasonal lags and/or seasonally differenced variables (caution: be careful not to overdifference!), or (iii) add seasonal dummy variables to the model (i.e., indicator variables for different seasons of the year, such as MONTH=1 or QUARTER=2, etc.). The dummy-variable approach enables additive seasonal adjustment to be performed as part of the regression model: a different additive constant can be estimated for each season of the year. If the dependent variable has been logged, the seasonal adjustment is multiplicative. (Something else to watch out for: it is possible that although your dependent variable is already seasonally adjusted, some of your independent variables may not be, causing their seasonal patterns to leak into the forecasts.)

Major cases of serial correlation (a Durbin-Watson statistic well below 1.0, autocorrelations well above 0.5) usually indicate a fundamental structural problem in the model. You may wish to reconsider the transformations (if any) that have been applied to the dependent and independent variables. It may help to stationarize all variables through appropriate combinations of differencing, logging, and/or deflating.

Violations of homoscedasticity make it difficult to gauge the true standard deviation of the forecast errors, usually resulting in confidence intervals that are too wide or too narrow. In particular, if the variance of the errors is increasing over time, confidence intervals for out-of-sample predictions will tend to be unrealistically narrow. Heteroscedasticity may also have the effect of giving too much weight to a small subset of the data (namely, the subset where the error variance was largest) when estimating coefficients.
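
As a sketch of options (ii) and (iii) above--adding a lag of the dependent variable and seasonal dummy variables--using the hypothetical data frame from the earlier example (this assumes the data carry a DatetimeIndex so the quarter can be read off the index):

```python
# Sketch: add a lagged dependent variable and quarterly dummy variables
# to the hypothetical regression from the earlier example.
import pandas as pd
import statsmodels.api as sm

df = data.copy()                              # 'data' from the earlier sketch
df["y_lag1"] = df["y"].shift(1)               # lag of the dependent variable

dummies = pd.get_dummies(df.index.quarter,    # requires a DatetimeIndex
                         prefix="Q", drop_first=True).set_index(df.index)
df = pd.concat([df, dummies], axis=1).dropna()

X = sm.add_constant(df[["x1", "x2", "y_lag1"] + list(dummies.columns)])
seasonal_model = sm.OLS(df["y"].astype(float), X.astype(float)).fit()
print(seasonal_model.summary())
```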

How to detect: look at plots of residuals versus time and residuals versus predicted value, and be alert for evidence of residuals that are getting larger (i.e., more spread out) either as a function of time or as a function of the predicted value. (To be really thorough, you might also want to plot residuals versus some of the independent variables.)

How to fix: in time series models, heteroscedasticity often arises due to the effects of inflation and/or real compound growth, perhaps magnified by a multiplicative seasonal pattern. Some combination of logging and/or deflating will often stabilize the variance in this case. Stock market data may show periods of increased or decreased volatility over time--this is normal and is often modeled with so-called ARCH (auto-regressive conditional heteroscedasticity) models, in which the error variance is fitted by an autoregressive model. Such models are beyond the scope of this course; however, a simple fix would be to work with shorter intervals of data in which volatility is more nearly constant. Heteroscedasticity can also be a byproduct of a significant violation of the linearity and/or independence assumptions, in which case it may also be fixed as a byproduct of fixing those problems.

Violations of normality compromise the estimation of coefficients and the calculation of confidence intervals. Sometimes the error distribution is "skewed" by the presence of a few large outliers. Since parameter estimation is based on the minimization of squared error, a few extreme observations can exert a disproportionate influence on parameter estimates. Calculation of confidence intervals and various significance tests for coefficients are all based on the assumption of normally distributed errors. If the error distribution is significantly non-normal, confidence intervals may be too wide or too narrow.

How to detect: the best test for normally distributed errors is a normal probability plot of the residuals. This is a plot of the fractiles of the error distribution versus the fractiles of a normal distribution having the same mean and variance. If the distribution is normal, the points on this plot should fall close to the diagonal line. A bow-shaped pattern of deviations from the diagonal indicates that the residuals have excessive skewness (i.e., they are not symmetrically distributed, with too many large errors in the same direction). An S-shaped pattern of deviations indicates that the residuals have excessive kurtosis--i.e., there are either too many or too few large errors in both directions.
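
Continuing the hypothetical sketch, a normal probability plot of the residuals can be produced as follows (assuming scipy and the resid series from the earlier model fit):

```python
# Sketch: normal probability plot of the residuals from the hypothetical
# model fitted earlier; the points should hug the diagonal line.
import matplotlib.pyplot as plt
from scipy import stats

fig, ax = plt.subplots()
stats.probplot(resid, dist="norm", plot=ax)   # draws points plus a fitted line
ax.set_title("Normal probability plot of residuals")
plt.show()
```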

How to fix: violations of normality often arise either because (a) the distributions of the dependent and/or independent variables are themselves significantly non-normal, and/or (b) the linearity assumption is violated. In such cases, a nonlinear transformation of variables might cure both problems. In some cases, the problem with the residual distribution is mainly due to one or two very large errors. Such values should be scrutinized closely: are they genuine (i.e., not the result of data entry errors), are they explainable, are similar events likely to occur again in the future, and how influential are they in your model-fitting results? (The "influence measures" report is a guide to the relative influence of extreme observations.) If they are merely errors or if they can be explained as unique events not likely to be repeated, then you may have cause to remove them. In some cases, however, it may be that the extreme values in the data provide the most useful information about values of some of the coefficients and/or provide the most realistic guide to the magnitudes of forecast errors.

Source: http://www.duke.edu/~rnau/testing.htm

Normality

You also want to check that your data is normally distributed. To do this, you can construct histograms and "look" at the data to see its distribution. Often the histogram will include a line that depicts what the shape would look like if the distribution were truly normal (and you can "eyeball" how much the actual distribution deviates from this line). This histogram shows that age is normally distributed:
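
A histogram with a normal curve overlaid can be drawn like this (a sketch assuming matplotlib/scipy and a hypothetical age column in the data frame from the earlier example):

```python
# Sketch: histogram of a variable with a normal curve overlaid for an
# "eyeball" check. The column name 'age' is a placeholder.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

age = data["age"]                       # hypothetical column
plt.hist(age, bins=20, density=True)    # density=True so the curve is comparable

grid = np.linspace(age.min(), age.max(), 200)
plt.plot(grid, stats.norm.pdf(grid, loc=age.mean(), scale=age.std()))
plt.xlabel("Age")
plt.show()
```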

You can also construct a normal probability plot. In this plot, the actual scores are ranked and sorted, and an expected normal value is computed and compared with an actual normal value for each case. The expected normal value is the position a case with that rank holds in a normal distribution; the normal value is the position it holds in the actual distribution. Basically, you would like to see your actual values lining up along the diagonal that goes from lower left to upper right. This plot also shows that age is normally distributed:
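
To make the mechanics concrete, here is a small sketch of how the expected normal values behind such a plot could be computed by hand (assuming numpy/scipy, the hypothetical age series from above, and (rank - 0.5)/n as one common simple choice of plotting position):

```python
# Sketch: compute expected normal values from ranks, the quantity a
# normal probability plot compares against the actual sorted scores.
import numpy as np
from scipy import stats

scores = np.sort(np.asarray(age))             # actual values, ranked and sorted
n = len(scores)
ranks = np.arange(1, n + 1)
expected = stats.norm.ppf((ranks - 0.5) / n,  # expected normal value per rank
                          loc=scores.mean(), scale=scores.std())

# Plotting 'expected' against 'scores' should give points near the diagonal
# if the data are approximately normal.
```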

You can also test for normality within the regression analysis by looking at a plot of the "residuals." Residuals are the difference between obtained and predicted DV scores. (Residuals will be explained in more detail in a later section.) You might want to do the residual plot before graphing each variable separately, because if this residuals plot looks good, then you don't need to do the separate plots. If the data are normally distributed, then residuals should be normally distributed around each predicted DV score. If the data (and the residuals) are normally distributed, the residuals scatterplot will show the majority of residuals at the center of the plot for each value of the predicted score, with some residuals trailing off symmetrically from the center. Below is a residual plot of a regression where age of patient and time (in months since diagnosis) are used to predict breast tumor size. These data are not perfectly normally distributed in that the residuals above the zero line appear slightly more spread out than those below the zero line. Nevertheless, they do appear to be fairly normally distributed.

In addition to a graphic examination of the data, you can also statistically examine the data's normality. Specifically, statistical programs such as SPSS will calculate the skewness and kurtosis for each variable; an extreme value for either one would tell you that the data are not normally distributed. "Skewness" is a measure of how symmetrical the data are; a skewed variable is one whose mean is not in the middle of the distribution (i.e., the mean and median are quite different). "Kurtosis" has to do with how peaked the distribution is, either too peaked or too flat. "Extreme values" for skewness and kurtosis are values greater than +3 or less than -3. If any variable is not normally distributed, then you will probably want to transform it (which will be discussed in a later section). Checking for outliers will also help with the normality problem.
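
Outside of SPSS, the same statistics can be computed directly; here is a sketch assuming scipy and the hypothetical data frame from the earlier example:

```python
# Sketch: skewness and (excess) kurtosis for each numeric column of the
# hypothetical data frame; values far outside roughly +/-3 are suspect.
from scipy import stats

for col in data.select_dtypes("number"):
    skew = stats.skew(data[col].dropna())
    kurt = stats.kurtosis(data[col].dropna())   # excess kurtosis (normal = 0)
    print(f"{col}: skewness={skew:.2f}, kurtosis={kurt:.2f}")
```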

Linearity

Regression analysis also has an assumption of linearity. Linearity means that there is a straight-line relationship between the IVs and the DV. This assumption is important because regression analysis only tests for a linear relationship between the IVs and the DV; any nonlinear relationship between the IV and DV is ignored. You can test for linearity between an IV and the DV by looking at a bivariate scatterplot (i.e., a graph with the IV on one axis and the DV on the other). If the two variables are linearly related, the scatterplot will be oval. Looking at the above bivariate scatterplot, you can see that friends is linearly related to happiness: specifically, the more friends you have, the greater your level of happiness. However, you could also imagine that there could be a curvilinear relationship between friends and happiness, such that happiness increases with the number of friends to a point, and beyond that point, happiness declines with a larger number of friends. This is demonstrated by the graph below:

You can also test for linearity by using the residual plots described previously. This is because if the IVs and DV are linearly related, then the relationship between the residuals and the predicted DV scores will be linear. Nonlinearity is demonstrated when most of the residuals are above the zero line on the plot at some predicted values, and below the zero line at other predicted values. In other words, the overall shape of the plot will be curved, instead of rectangular. The following is a residuals plot produced when happiness was predicted from number of friends and age. As you can see, the data are not linear:

The following is an example of a residuals plot, again predicting happiness from friends and age. But in this case, the data are linear:

If your data are not linear, then you can usually make it linear by transforming the IVs or the DV so that there is a linear relationship between them.
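
As a sketch of one such transformation, a log transform of a skewed IV could look like this (this reuses the hypothetical data/x1/x2/y names from the earlier example and assumes x1 is strictly positive):

```python
# Sketch: log-transforming one IV and refitting, a common way to
# straighten a curved IV-DV relationship. The column names are the
# hypothetical ones from the earlier sketch; requires x1 > 0.
import numpy as np
import statsmodels.api as sm

data["log_x1"] = np.log(data["x1"])              # only valid for positive values
X = sm.add_constant(data[["log_x1", "x2"]])
transformed_model = sm.OLS(data["y"], X).fit()
print(transformed_model.summary())
```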

Sometimes transforming one variable won't work; the IV and DV are just not linearly related. If there is a curvilinear relationship between the DV and IV, you might want to dichotomize the IV, because a dichotomous variable can only have a linear relationship with another variable (if it has any relationship at all). Alternatively, if there is a curvilinear relationship between the IV and the DV, then you might need to include the square of the IV in the regression (this is also known as a quadratic regression).

The failure of linearity in regression will not invalidate your analysis so much as weaken it; the linear regression coefficient cannot fully capture the extent of a curvilinear relationship. If there is both a curvilinear and a linear relationship between the IV and DV, then the regression will at least capture the linear relationship.

Homoscedasticity

The assumption of homoscedasticity is that the residuals are approximately equal for all predicted DV scores. Another way of thinking of this is that the variability in scores for your IVs is the same at all values of the DV. You can check homoscedasticity by looking at the same residuals plot talked about in the linearity and normality sections. Data are homoscedastic if the residuals plot is the same width for all values of the predicted DV. Heteroscedasticity is usually shown by a cluster of points that is wider as the values for the predicted DV get larger. Alternatively, you can check for homoscedasticity by looking at a scatterplot between each IV and the DV; as with the residuals plot, you want the cluster of points to be approximately the same width all over. The following residuals plot shows data that are fairly homoscedastic. In fact, this residuals plot shows data that meet the assumptions of homoscedasticity, linearity, and normality (because the residual plot is rectangular, with a concentration of points along the center):
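
Returning to the quadratic-regression option mentioned above, here is a sketch of adding the square of an IV (the file name survey.csv is hypothetical; the happiness, friends, and age column names follow the text's example):

```python
# Sketch: adding the square of an IV (quadratic regression) to capture a
# curvilinear relationship, using hypothetical survey data as in the text.
import pandas as pd
import statsmodels.api as sm

survey = pd.read_csv("survey.csv")                     # hypothetical file
survey["friends_sq"] = survey["friends"] ** 2          # squared IV

X = sm.add_constant(survey[["friends", "friends_sq", "age"]])
quad_model = sm.OLS(survey["happiness"], X).fit()
print(quad_model.summary())                            # inspect the friends_sq coefficient
```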

Heteroscedasticity may occur when some variables are skewed and others are not. Thus, checking that your data are normally distributed should cut down on the problem of heteroscedasticity. Like the assumption of linearity, violation of the assumption of homoscedasticity does not invalidate your regression so much as weaken it.

Multicollinearity and Singularity

Multicollinearity is a condition in which the IVs are very highly correlated (.90 or greater), and singularity is when the IVs are perfectly correlated and one IV is a combination of one or more of the other IVs. Multicollinearity and singularity can be caused by high bivariate correlations (usually of .90 or greater) or by high multivariate correlations. High bivariate correlations are easy to spot by simply running correlations among your IVs. If you do have high bivariate correlations, your problem is easily solved by deleting one of the two variables, but you should check your programming first; often this is a mistake you made when you created the variables.

It's harder to spot high multivariate correlations. To do this, you need to calculate the SMC for each IV. SMC is the squared multiple correlation (R2) of the IV when it serves as the DV which is predicted by the rest of the IVs. Tolerance, a related concept, is calculated by 1-SMC: tolerance is the proportion of a variable's variance that is not accounted for by the other IVs in the equation. You don't need to worry too much about tolerance in that most programs will not allow a variable to enter the regression model if tolerance is too low.

Statistically, you do not want singularity or multicollinearity because calculation of the regression coefficients is done through matrix inversion. Consequently, if singularity exists, the inversion is impossible, and if multicollinearity exists, the inversion is unstable. Logically, you don't want multicollinearity or singularity because if they exist, then your IVs are redundant with one another. In such a case, one IV doesn't add any predictive value over another IV, but you do lose a degree of freedom. As such, having multicollinearity/singularity can weaken your analysis. In general, you probably wouldn't want to include two IVs that correlate with one another at .70 or greater.

Transformations
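
Both checks can be sketched as follows (assuming pandas/statsmodels and the hypothetical X matrix from the earlier model fit; tolerance is computed as 1-SMC, which is the same quantity as 1/VIF):

```python
# Sketch: bivariate correlations among the IVs, plus tolerance for each IV
# computed as 1 - SMC (equivalently 1 / VIF). Uses the hypothetical X matrix
# (with constant) from the earlier model-fitting example.
from statsmodels.stats.outliers_influence import variance_inflation_factor

ivs = X.drop(columns="const")        # just the IVs
print(ivs.corr())                    # look for bivariate correlations near .90

for i, name in enumerate(X.columns):
    if name == "const":
        continue
    vif = variance_inflation_factor(X.values, i)
    print(f"{name}: VIF={vif:.2f}, tolerance={1 / vif:.3f}")
```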
