
Testing the assumptions of linear regression

Quantitative models always rest on assumptions about the way the world works, and regression models are no exception. There are four principal assumptions which justify the use of linear regression models for purposes of prediction:

(i) linearity of the relationship between dependent and independent variables;
(ii) independence of the errors (no serial correlation);
(iii) homoscedasticity (constant variance) of the errors, both (a) versus time and (b) versus the predictions (or versus any independent variable);
(iv) normality of the error distribution.

If any of these assumptions is violated (i.e., if there is nonlinearity, serial correlation, heteroscedasticity, and/or non-normality), then the forecasts, confidence intervals, and economic insights yielded by a regression model may be (at best) inefficient or (at worst) seriously biased or misleading.

Violations of linearity are extremely serious: if you fit a linear model to data which are nonlinearly related, your predictions are likely to be seriously in error, especially when you extrapolate beyond the range of the sample data.

How to detect: nonlinearity is usually most evident in a plot of observed versus predicted values or a plot of residuals versus predicted values, which are part of standard regression output. The points should be symmetrically distributed around a diagonal line in the former plot, or around a horizontal line in the latter plot. Look carefully for evidence of a "bowed" pattern, indicating that the model makes systematic errors whenever it is making unusually large or small predictions.

How to fix: consider applying a nonlinear transformation to the dependent and/or independent variables, if you can think of a transformation that seems appropriate. For example, if the data are strictly positive, a log transformation may be feasible. Another possibility is to add another regressor which is a nonlinear function of one of the other variables: for example, if you have regressed Y on X, and the graph of residuals versus predicted values suggests a parabolic curve, then it may make sense to regress Y on both X and X^2 (i.e., X-squared). The latter transformation is possible even when X and/or Y have negative values, whereas logging may not be.

Violations of independence are also very serious in time series regression models: serial correlation in the residuals means that there is room for improvement in the model, and extreme serial correlation is often a symptom of a badly mis-specified model. Serial correlation is also sometimes a byproduct of a violation of the linearity assumption, as in the case of a simple (i.e., straight) trend line fitted to data which are growing exponentially over time, as we saw in the auto sales example.

How to detect: the best test for residual autocorrelation is to look at an autocorrelation plot of the residuals. (If this is not part of the standard output for your regression procedure, you can save the RESIDUALS and use another procedure to plot the autocorrelations.) Ideally, most of the residual autocorrelations should fall within the 95% confidence bands around zero, which are located at roughly plus-or-minus 2-over-the-square-root-of-n, where n is the sample size. Thus, if the sample size is 50, the autocorrelations should be between +/- 0.3; if the sample size is 100, they should be between +/- 0.2. Pay especially close attention to significant correlations at the first couple of lags and in the vicinity of the seasonal period, because these are probably not due to mere chance and are also fixable.

The Durbin-Watson statistic provides a test for significant residual autocorrelation at lag 1: the DW stat is approximately equal to 2(1-a), where a is the lag-1 residual autocorrelation, so ideally it should be close to 2.0--say, between 1.4 and 2.6 for a sample size of 50.
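To make the arithmetic concrete, here is a minimal NumPy sketch (with made-up AR(1)-style residuals; all names are illustrative) that computes the lag-1 residual autocorrelation, the Durbin-Watson statistic, and the rough 95% band of plus-or-minus 2-over-the-square-root-of-n described above:

```python
import numpy as np

def lag1_autocorr(resid):
    """Lag-1 autocorrelation of a residual series (deviations from the mean)."""
    r = np.asarray(resid, dtype=float) - np.mean(resid)
    return float(np.sum(r[1:] * r[:-1]) / np.sum(r * r))

def durbin_watson(resid):
    """DW statistic: sum of squared successive differences over sum of squares."""
    r = np.asarray(resid, dtype=float)
    return float(np.sum(np.diff(r) ** 2) / np.sum(r ** 2))

rng = np.random.default_rng(0)
# Hypothetical residuals with positive serial correlation (AR coefficient 0.5)
e = np.zeros(200)
for t in range(1, 200):
    e[t] = 0.5 * e[t - 1] + rng.normal()

a = lag1_autocorr(e)
dw = durbin_watson(e)
band = 2 / np.sqrt(len(e))  # rough 95% band around zero
```

Since a is built to be near 0.5, dw comes out well below 2 and close to 2(1-a), flagging positive serial correlation that the band test also catches.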

How to fix: minor cases of positive serial correlation (say, lag-1 residual autocorrelation in the range 0.2 to 0.4, or a Durbin-Watson statistic between 1.2 and 1.6) indicate that there is some room for fine-tuning in the model. Consider adding lags of the dependent variable and/or lags of some of the independent variables. Or, if you have ARIMA options available, try adding an AR=1 or MA=1 term. (An AR=1 term in Statgraphics adds a lag of the dependent variable to the forecasting equation, whereas an MA=1 term adds a lag of the forecast error.) If there is significant correlation at lag 2, then a 2nd-order lag may be appropriate.

Major cases of serial correlation (a Durbin-Watson statistic well below 1.0, autocorrelations well above 0.5) usually indicate a fundamental structural problem in the model. You may wish to reconsider the transformations (if any) that have been applied to the dependent and independent variables. It may help to stationarize all variables through appropriate combinations of differencing, logging, and/or deflating. If there is significant negative correlation in the residuals (lag-1 autocorrelation more negative than -0.3, or a DW stat greater than 2.6), watch out for the possibility that you may have overdifferenced some of your variables: differencing tends to drive autocorrelations in the negative direction, and too much differencing may lead to patterns of negative correlation that lagged variables cannot correct for.

If there is significant correlation at the seasonal period (e.g., at lag 4 for quarterly data or lag 12 for monthly data), this indicates that seasonality has not been properly accounted for in the model. Seasonality can be handled in a regression model in one of the following ways: (i) seasonally adjust the variables (if they are not already seasonally adjusted); (ii) use seasonal lags and/or seasonally differenced variables (caution: be careful not to overdifference!); or (iii) add seasonal dummy variables to the model, i.e., indicator variables for different seasons of the year, such as MONTH=1 or QUARTER=2. The dummy-variable approach enables additive seasonal adjustment to be performed as part of the regression model: a different additive constant can be estimated for each season of the year. If the dependent variable has been logged, the seasonal adjustment is multiplicative. (Something else to watch out for: it is possible that although your dependent variable is already seasonally adjusted, some of your independent variables may not be, causing their seasonal patterns to leak into the forecasts.)
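The seasonal dummy-variable approach described above can be sketched as follows. This is a minimal illustration with a made-up, noise-free quarterly series (so the recovered coefficients are exact); the dummy columns play the role of the QUARTER=2-style indicators:

```python
import numpy as np

# Hypothetical quarterly series: linear trend plus a fixed additive
# effect for each quarter (Q1..Q4), with no noise so the fit is exact.
n_years = 6
quarter = np.tile([0, 1, 2, 3], n_years)           # season index per observation
t = np.arange(4 * n_years, dtype=float)            # time trend
seasonal_effect = np.array([5.0, -2.0, 0.0, 3.0])  # true additive constants
y = 1.5 * t + seasonal_effect[quarter]

# Design matrix: trend, intercept, and dummies for Q2..Q4 only.
# (Q1 is the baseline; using all four dummies plus an intercept
# would make the matrix singular.)
X = np.column_stack([
    t,
    np.ones_like(t),
    (quarter == 1).astype(float),
    (quarter == 2).astype(float),
    (quarter == 3).astype(float),
])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ coef
```

Each dummy coefficient is that quarter's additive constant relative to the baseline quarter: here the intercept recovers 5.0 (the Q1 effect) and the Q2 dummy recovers -7.0 (that is, -2.0 minus 5.0).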

Violations of homoscedasticity make it difficult to gauge the true standard deviation of the forecast errors, usually resulting in confidence intervals that are too wide or too narrow. In particular, if the variance of the errors is increasing over time, confidence intervals for out-of-sample predictions will tend to be unrealistically narrow. Heteroscedasticity may also have the effect of giving too much weight to a small subset of the data (namely the subset where the error variance was largest) when estimating coefficients.

How to detect: look at plots of residuals versus time and residuals versus predicted value, and be alert for evidence of residuals that are getting larger (i.e., more spread-out) either as a function of time or as a function of the predicted value. (To be really thorough, you might also want to plot residuals versus some of the independent variables.)

How to fix: in time series models, heteroscedasticity often arises due to the effects of inflation and/or real compound growth, perhaps magnified by a multiplicative seasonal pattern. Some combination of logging and/or deflating will often stabilize the variance in this case. Stock market data may show periods of increased or decreased volatility over time; this is normal and is often modeled with so-called ARCH (auto-regressive conditional heteroscedasticity) models in which the error variance is fitted by an autoregressive model. Such models are beyond the scope of this course; however, a simple fix would be to work with shorter intervals of data in which volatility is more nearly constant. Heteroscedasticity can also be a byproduct of a significant violation of the linearity and/or independence assumptions, in which case it may also be fixed as a byproduct of fixing those problems.
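As a sketch of the logging fix described above, here is a made-up series with compound growth and multiplicative noise. A linear trend fitted on the raw scale leaves residuals that fan out over time, while the same fit on the log scale leaves a roughly constant spread (all names and parameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical series with compound growth and multiplicative noise:
# the error spread grows with the level of the series.
t = np.arange(100, dtype=float)
y = 100 * np.exp(0.03 * t) * rng.lognormal(mean=0.0, sigma=0.1, size=100)

def half_spread_ratio(series, trend_fit):
    """Std of detrended values in the second half over the first half."""
    resid = series - trend_fit
    return float(np.std(resid[50:]) / np.std(resid[:50]))

# Linear trend on the raw scale: residual spread grows over time.
raw_fit = np.polyval(np.polyfit(t, y, 1), t)
raw_ratio = half_spread_ratio(y, raw_fit)

# Linear trend on the log scale: spread is roughly constant.
log_y = np.log(y)
log_fit = np.polyval(np.polyfit(t, log_y, 1), t)
log_ratio = half_spread_ratio(log_y, log_fit)
```

On the raw scale the second-half residual spread is clearly larger than the first-half spread; after logging, the ratio is close to one.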
Violations of normality compromise the estimation of coefficients and the calculation of confidence intervals. Sometimes the error distribution is "skewed" by the presence of a few large outliers. Since parameter estimation is based on the minimization of squared error, a few extreme observations can exert a disproportionate influence on parameter estimates. Calculation of confidence intervals and various significance tests for coefficients are all based on the assumption of normally distributed errors. If the error distribution is significantly non-normal, confidence intervals may be too wide or too narrow.

How to detect: the best test for normally distributed errors is a normal probability plot of the residuals. This is a plot of the fractiles of the error distribution versus the fractiles of a normal distribution having the same mean and variance. If the distribution is normal, the points on this plot should fall close to the diagonal line. A bow-shaped pattern of deviations from the diagonal indicates that the residuals have excessive skewness (i.e., they are not symmetrically distributed, with too many large errors in the same direction). An S-shaped pattern of deviations indicates that the residuals have excessive kurtosis--i.e., there are either too many or too few large errors in both directions.
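Because parameter estimation minimizes squared error, a single extreme observation can dominate the fit, as noted above. A minimal sketch with made-up data containing one hypothetical data-entry error:

```python
import numpy as np

def slope(x, y):
    """Least-squares slope of y on x."""
    b, _ = np.polyfit(x, y, 1)
    return float(b)

# Hypothetical data lying exactly on y = 2x, plus one gross entry error.
x = np.arange(10, dtype=float)
y = 2.0 * x
y_bad = y.copy()
y_bad[9] = 90.0  # should have been 18

full_slope = slope(x, y_bad)           # pulled far away from 2 by one point
loo_slope = slope(x[:9], y_bad[:9])    # slope with the suspect point removed
```

Dropping the one suspect point restores the slope of 2 exactly, while keeping it nearly triples the estimate; this is the kind of disproportionate influence that an "influence measures" report is designed to flag.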

How to fix: violations of normality often arise either because (a) the distributions of the dependent and/or independent variables are themselves significantly non-normal, and/or (b) the linearity assumption is violated. In such cases, a nonlinear transformation of variables might cure both problems. In some cases, the problem with the residual distribution is mainly due to one or two very large errors. Such values should be scrutinized closely: are they genuine (i.e., not the result of data entry errors), are similar events likely to occur again in the future, and how influential are they in your model-fitting results? (The "influence measures" report is a guide to the relative influence of extreme observations.) If they are merely errors, or if they can be explained as unique events not likely to be repeated, then you may have cause to remove them. In some cases, however, it may be that the extreme values in the data provide the most useful information about values of some of the coefficients and/or provide the most realistic guide to the magnitudes of forecast errors.

Normality

You also want to check that your data is normally distributed. To do this, you can construct histograms and "look" at the data to see its distribution. Often the histogram will include a line that depicts what the shape would look like if the distribution were truly normal (and you can "eyeball" how much the actual distribution deviates from this line). This histogram shows that age is normally distributed:

You can also construct a normal probability plot. In this plot, the actual scores are ranked and sorted, and an expected normal value is computed and compared with an actual normal value for each case. The expected normal value is the position a case with that rank holds in a normal distribution; the normal value is the position it holds in the actual distribution. Basically, you would like to see your actual values lining up along the diagonal that goes from lower left to upper right. This plot also shows that age is normally distributed:
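The rank-and-compare procedure just described can be sketched with Python's standard library. The plotting-position formula below is one common convention; statistical packages differ in the exact formula they use, and the data are made up:

```python
from statistics import NormalDist

def normal_plot_points(data):
    """Pair each sorted observation with the position a case of that rank
    would hold in a standard normal distribution."""
    xs = sorted(data)
    n = len(xs)
    nd = NormalDist()
    # (rank - 0.375) / (n + 0.25) is a common plotting-position choice.
    expected = [nd.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]
    return list(zip(expected, xs))

# Hypothetical ages, roughly symmetric around 40
ages = [28, 31, 34, 36, 38, 40, 42, 44, 46, 49, 52]
points = normal_plot_points(ages)
```

Plotting the second coordinate against the first gives the normal probability plot; for roughly normal data the points line up along the lower-left-to-upper-right diagonal.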

You can also test for normality within the regression analysis by looking at a plot of the "residuals." Residuals are the difference between obtained and predicted DV scores. (Residuals will be explained in more detail in a later section.) If the data are normally distributed, then residuals should be normally distributed around each predicted DV score. If the data (and the residuals) are normally distributed, the residuals scatterplot will show the majority of residuals at the center of the plot for each value of the predicted score, with some residuals trailing off symmetrically from the center. You might want to do the residual plot before graphing each variable separately, because if this residuals plot looks good, then you don't need to do the separate plots. Below is a residual plot of a regression where age of patient and time (in months since diagnosis) are used to predict breast tumor size. These data are not perfectly normally distributed, in that the residuals above the zero line appear slightly more spread out than those below the zero line. Nevertheless, they do appear to be fairly normally distributed.

In addition to a graphic examination of the data, you can also statistically examine the data's normality. Specifically, statistical programs such as SPSS will calculate the skewness and kurtosis for each variable; an extreme value for either one would tell you that the data are not normally distributed. "Skewness" is a measure of how symmetrical the data are; a skewed variable is one whose mean is not in the middle of the distribution (i.e., the mean and median are quite different). "Kurtosis" has to do with how peaked the distribution is, either too peaked or too flat. "Extreme values" for skewness and kurtosis are values greater than +3 or less than -3. If any variable is not normally distributed, then you will probably want to transform it (which will be discussed in a later section). Checking for outliers will also help with the normality problem.

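The skewness and kurtosis statistics described above can be computed directly. This is a minimal sketch using the population-moment formulas; SPSS applies small-sample corrections, so its values will differ slightly, and the data are made up:

```python
import math

def skewness(xs):
    """Third standardized moment: 0 for symmetric data, positive when the
    right tail is longer."""
    n = len(xs)
    m = sum(xs) / n
    s = math.sqrt(sum((x - m) ** 2 for x in xs) / n)
    return sum(((x - m) / s) ** 3 for x in xs) / n

def excess_kurtosis(xs):
    """Fourth standardized moment minus 3, so a normal distribution scores 0:
    positive when too peaked, negative when too flat."""
    n = len(xs)
    m = sum(xs) / n
    s = math.sqrt(sum((x - m) ** 2 for x in xs) / n)
    return sum(((x - m) / s) ** 4 for x in xs) / n - 3

symmetric = [1, 2, 3, 4, 5, 6, 7]
right_skewed = [1, 1, 1, 2, 2, 3, 10]
```

The symmetric list scores zero skewness (its mean and median coincide) and negative excess kurtosis (it is flat), while the right-skewed list, whose mean sits well above its median, scores strongly positive skewness.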
Linearity

Regression analysis also has an assumption of linearity. Linearity means that there is a straight line relationship between the IVs and the DV. This assumption is important because regression analysis only tests for a linear relationship between the IVs and the DV; any nonlinear relationship between the IV and DV is ignored. You can test for linearity between an IV and the DV by looking at a bivariate scatterplot (i.e., a graph with the IV on one axis and the DV on the other). If the two variables are linearly related, the scatterplot will be oval. Looking at the above bivariate scatterplot, you can see that friends is linearly related to happiness. Specifically, the more friends you have, the greater your level of happiness. However, you could also imagine that there could be a curvilinear relationship between friends and happiness, such that happiness increases with the number of friends to a point, and beyond that point happiness declines with a larger number of friends. This is demonstrated by the graph below:

You can also test for linearity by using the residual plots described previously. This is because if the IVs and DV are linearly related, then the relationship between the residuals and the predicted DV scores will be linear. Nonlinearity is demonstrated when most of the residuals are above the zero line on the plot at some predicted values and below the zero line at other predicted values. In other words, the overall shape of the plot will be curved instead of rectangular. The following is a residuals plot produced when happiness was predicted from number of friends and age. As you can see, the data are not linear:
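The curved residual pattern described above is easy to reproduce numerically: fit a straight line to a made-up inverted-U relationship between friends and happiness, and the residuals fall below zero at the extreme predicted values and above zero in the middle. (All data here are hypothetical and noise-free for clarity.)

```python
import numpy as np

# Hypothetical curvilinear relationship: happiness rises with number of
# friends up to a point, then falls (an inverted U).
friends = np.arange(0, 21, dtype=float)
happiness = -(friends - 10.0) ** 2 + 100.0

# Fit a straight line anyway and inspect the residuals.
line = np.polyval(np.polyfit(friends, happiness, 1), friends)
resid = happiness - line

# With a curved relationship, residuals sit above zero for middle
# predicted values and below zero at the extremes.
ends_mean = resid[[0, 1, 2, -3, -2, -1]].mean()
middle_mean = resid[8:13].mean()

# Including the square of the IV (a quadratic fit) captures the curve.
quad = np.polyfit(friends, happiness, 2)
quad_fit = np.polyval(quad, friends)
```

The signed runs of residuals are the "curved instead of rectangular" shape the text describes; adding the squared term removes them entirely here.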

The following is an example of a residuals plot, again predicting happiness from friends and age. But in this case, the data are linear:

If your data are not linear, then you can usually make them linear by transforming the IVs or the DV so that there is a linear relationship between them. Sometimes transforming one variable won't work; the IV and DV are just not linearly related. If there is a curvilinear relationship between the DV and an IV, then you might need to include the square of the IV in the regression (this is also known as a quadratic regression). Alternatively, you might want to dichotomize the IV, because a dichotomous variable can only have a linear relationship with another variable (if it has any relationship at all). If there is both a curvilinear and a linear relationship between the IV and DV, then the regression will at least capture the linear relationship.

The failure of linearity in regression will not invalidate your analysis so much as weaken it: the linear regression coefficient cannot fully capture the extent of a curvilinear relationship.

Homoscedasticity

The assumption of homoscedasticity is that the residuals are approximately equal for all predicted DV scores. Another way of thinking of this is that the variability in scores for your IVs is the same at all values of the DV. You can check homoscedasticity by looking at the same residuals plot talked about in the linearity and normality sections. Data are homoscedastic if the residuals plot is the same width for all values of the predicted DV. Heteroscedasticity is usually shown by a cluster of points that gets wider as the values for the predicted DV get larger. Alternatively, you can check for homoscedasticity by looking at a scatterplot between each IV and the DV. As with the residuals plot, you want the cluster of points to be approximately the same width all over. The following residuals plot shows data that are fairly homoscedastic. In fact, this residuals plot shows data that meet the assumptions of homoscedasticity, linearity, and normality (because the residual plot is rectangular, with a concentration of points along the center):
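The width check described above can be made numeric by comparing the residual spread at low versus high predicted values. A minimal sketch with made-up data, one homoscedastic series and one fanning-out series:

```python
import numpy as np

rng = np.random.default_rng(7)

x = np.linspace(0, 10, 400)

# Homoscedastic: noise spread does not depend on the predicted value.
y_equal = 3.0 * x + rng.normal(0.0, 1.0, size=x.size)
# Heteroscedastic: noise spread grows with the predicted value.
y_fan = 3.0 * x + rng.normal(0.0, 1.0, size=x.size) * (0.2 + 0.5 * x)

def spread_ratio(x, y):
    """Residual std for the top third of fitted values over the bottom third."""
    fit = np.polyval(np.polyfit(x, y, 1), x)
    resid = y - fit
    order = np.argsort(fit)
    k = x.size // 3
    return float(np.std(resid[order[-k:]]) / np.std(resid[order[:k]]))

equal_ratio = spread_ratio(x, y_equal)   # near 1: same width all over
fan_ratio = spread_ratio(x, y_fan)       # well above 1: widening cluster
```

A ratio near one corresponds to the rectangular residual plot described above, while a ratio well above one is the numeric signature of the widening cluster of points.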

Heteroscedasticity may occur when some variables are skewed and others are not. Thus, checking that your data are normally distributed should cut down on the problem of heteroscedasticity. Like the assumption of linearity, violation of the assumption of homoscedasticity does not invalidate your regression so much as weaken it.

Multicollinearity and Singularity

Multicollinearity is a condition in which the IVs are very highly correlated (.90 or greater), and singularity is when the IVs are perfectly correlated and one IV is a combination of one or more of the other IVs. Multicollinearity and singularity can be caused by high bivariate correlations (usually of .90 or greater) or by high multivariate correlations. High bivariate correlations are easy to spot by simply running correlations among your IVs. If you do have high bivariate correlations, your problem is easily solved by deleting one of the two variables, but you should check your programming first; often this is a mistake made when you created the variables.
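Running correlations among your IVs, as suggested above, can be sketched as a simple scan that flags any pair at .90 or greater (the IVs here are made up, with one deliberately constructed as a near-copy of another):

```python
import numpy as np

rng = np.random.default_rng(3)
x1 = rng.normal(size=200)
x2 = 0.5 * x1 + rng.normal(size=200)       # moderately related to x1
x3 = x1 + rng.normal(scale=0.1, size=200)  # nearly identical to x1
ivs = {"x1": x1, "x2": x2, "x3": x3}

# Flag IV pairs whose bivariate correlation is .90 or greater.
names = list(ivs)
flagged = []
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        r = float(np.corrcoef(ivs[names[i]], ivs[names[j]])[0, 1])
        if abs(r) >= 0.90:
            flagged.append((names[i], names[j]))
```

Only the near-duplicate pair is flagged; the moderately related pair (correlation around .45) passes.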

It's harder to spot high multivariate correlations. To do this, you need to calculate the SMC for each IV. SMC is the squared multiple correlation (R^2) of the IV when it serves as the DV which is predicted by the rest of the IVs. Tolerance, a related concept, is calculated by 1-SMC. Tolerance is the proportion of a variable's variance that is not accounted for by the other IVs in the equation. You don't need to worry too much about tolerance, in that most programs will not allow a variable to enter the regression model if tolerance is too low.

Statistically, you do not want singularity or multicollinearity because calculation of the regression coefficients is done through matrix inversion. Consequently, if singularity exists, the inversion is impossible, and if multicollinearity exists, the inversion is unstable. Logically, you don't want multicollinearity or singularity because if they exist, then your IVs are redundant with one another. In such a case, one IV doesn't add any predictive value over another IV, but you do lose a degree of freedom. As such, having multicollinearity/singularity can weaken your analysis. In general, you probably wouldn't want to include two IVs that correlate with one another at .70 or greater.
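The SMC and tolerance calculations described above can be sketched directly: regress each IV on the remaining IVs, take the resulting R^2 as the SMC, and subtract from one. The IVs below are made up, with one constructed to be nearly redundant:

```python
import numpy as np

def tolerance(X, j):
    """Tolerance of IV j: 1 - SMC, where SMC is the R-squared from
    regressing column j on the remaining columns (plus an intercept)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([others, np.ones(len(y))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    ss_res = float(np.sum(resid ** 2))
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    smc = 1.0 - ss_res / ss_tot
    return 1.0 - smc

rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)                    # unrelated to x1
x3 = x1 + rng.normal(scale=0.05, size=100)   # nearly a copy of x1
X = np.column_stack([x1, x2, x3])

tol_x2 = tolerance(X, 1)   # near 1: x2 carries its own information
tol_x3 = tolerance(X, 2)   # near 0: x3 is redundant with x1
```

A tolerance near zero is exactly the condition under which most programs refuse to let the variable enter the model.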