
25 Collinearity

ADDING EXPLANATORY VARIABLES
REVIEW: REVISITING THE CAPM
DEPENDENCE AMONG SEVERAL VARIABLES
IMPACT OF LARGE COLLINEARITY
INTERPRETING A SLOPE
EFFECTS OF COLLINEARITY
STANDARD ERROR IN MULTIPLE REGRESSION
VARIANCE INFLATION FACTOR
REMEDIES FOR COLLINEARITY
SEEING HOW A SLOPE CAN CHANGE SIGN
SUMMARY
PITFALLS

7/27/07


The Capital Asset Pricing Model (CAPM, introduced in Chapter 21) makes strong claims about the explanatory variables that belong in a regression that explains the variation in stock returns. According to the CAPM, the regression ought to be very simple: one explanatory variable, the percentage changes in the value of the market.

We have avoided some of the details about these percentage changes. In this chapter we'll work with excess returns. Percentage changes ignore the interest that could have been earned on the money that's invested. Suppose you have $1,000. If you leave it in the bank at 6% annual interest (½% per month), then at the end of a month you have 1000 × 1.005 = $1,005, guaranteed. If you buy stock, the value of your investment depends on what happens in the market. If the stock goes up by more than ½%, you come out ahead by taking on the risk. If the stock rises by 5% that month, you can sell the investment for $1,050, put the money back in the bank, and come out $45 ahead. That $45 is the excess gain earned on the stock beyond what you could have made without taking a chance. That's really what the CAPM is about: the reward for taking risks.

The standard approach to calculating excess returns subtracts the interest paid on US government bonds, the so-called risk-free rate. To calculate excess percentage changes on stock, we subtract the risk-free rate during each period from the percentage changes on stock. The CAPM specifies a regression equation that describes the excess percentage changes:

Excess %Change in Stock = α + β (Excess %Change in the Market) + ε

Beta is the slope and, according to the CAPM, the intercept α = 0.

The CAPM continues to attract supporters, but numerous rivals such as those proposed by Fama and French have appeared.1 Rivals specify other explanatory variables. Whereas the CAPM says that it's good enough to use excess returns on the whole stock market, alternatives claim that it takes two, three, or more factors to represent the risks associated with the whole market. These alternatives are multiple regression models. The key question is: Which explanatory variables belong in a regression? The question becomes hard to answer when the predictors are related to each other.
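The arithmetic behind the $1,000 example can be sketched in a few lines of Python. This is only an illustration of the calculation described above; the helper name excess_pct_change and the variable names are our own and not part of the CAPM.

```python
# Sketch: excess percentage change = percentage change minus the risk-free rate.
# The helper name `excess_pct_change` is hypothetical, chosen for illustration.

def excess_pct_change(pct_change, risk_free_pct):
    """Return the percentage change earned beyond the risk-free rate."""
    return pct_change - risk_free_pct

investment = 1000.0        # dollars invested at the start of the month
monthly_rate = 6.0 / 12    # 6% annual interest is 0.5% per month
stock_change = 5.0         # the stock rises 5% during the month

bank_value = investment * (1 + monthly_rate / 100)    # risk-free outcome
stock_value = investment * (1 + stock_change / 100)   # risky outcome
excess_gain = stock_value - bank_value                # dollars earned by taking the risk

print(f"Bank: ${bank_value:,.2f}  Stock: ${stock_value:,.2f}  Excess gain: ${excess_gain:,.2f}")
print(f"Excess % change: {excess_pct_change(stock_change, monthly_rate):.1f}%")
```

Applying the same subtraction to every month of stock and market returns gives the excess percentage changes used in the rest of the chapter.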

1 Fama and French led the assault on beta in the 1990s with their Three-Factor Model. See, for example, E. F. Fama and K. R. French (1995), "Size and book-to-market factors in earnings and returns," Journal of Finance, 50, 131-155, and more recently J. L. Davis, E. F. Fama and K. R. French (2000), "Characteristics, covariances and average returns: 1929 to 1997," Journal of Finance, 55, 389-406.


Adding Explanatory Variables

Once your regression includes several explanatory variables, the important questions change. Deciding whether a slope is different from zero is easy: look at the confidence interval, t-statistic, or p-value. Figuring out which explanatory variables to consider and interpreting the resulting equation are hard. How do you know whether you have gotten all of the important explanatory variables?

The search for relevant explanatory variables encourages you to add other variables to your model. If the added explanatory variables are correlated with each other and with those already included in the regression, the picture often gets murky. In the previous chapter, we made sure to use explanatory variables that are weakly associated; for example, the correlation between the number of rooms in a house and climate is 0.33. Even so, the marginal and partial slopes differ in important ways.

The situation often gets messier. Chances are, in looking for a better model, you'll find several explanatory variables that measure the same thing. You're trying to explain variation in the response, and it's common for the explanations to be related. The easiest way to appreciate how collinearity, the association among the explanatory variables, affects regression is to see an example.

Review: Revisiting the CAPM


Let's do the CAPM right by using excess percentage changes. This scatterplot shows monthly excess percentage changes in the value of IBM stock versus the corresponding changes in the overall stock market for the 10 years 1996-2005.
[Scatterplot: IBM Excess %Chg (vertical axis) versus Market Excess %Chg (horizontal axis).]

Figure 25-1. Excess returns on IBM versus the market.

The following table summarizes this simple regression. Portions of the table colored orange rely on unverified conditions. It wouldn't be wise to interpret s_e, for example, until we've checked that the residuals have similar variances.


R²     0.383365
s_e    7.76636
n      120

Term                      Estimate    Std Error   t-statistic   p-value
Intercept                 0.5738378   0.714106    0.80          0.4233
Market Excess % Change    1.3062462   0.152508    8.57          <.0001

Table 25-1. CAPM regression for returns on IBM.

Conditions
1. Straight enough
2. No embarrassing lurking variable
3. Similar variances
4. Nearly normal

Let's check the conditions for using the SRM. The following graphs show that the residuals have similar variances and are nearly normal. The circled negative outlier gives the appearance of a lack of constant variance, but that's not enough to make us believe that the variances change. The normal quantile plot confirms that the residuals are nearly normal. (Also, a timeplot that is not shown reveals no pattern.)
[Plots: residuals versus Market Excess %Chg (left); histogram (Count) and normal quantile plot of the residuals (right).]

Figure 25-2. Residuals from the CAPM regression versus the explanatory variable (left) and in a normal quantile plot (right).

The most negative outlier happened in October 1999; the excess percentage change in IBM that month was -19% even though the market returned 6%. In the fall of 1999, build-to-order specialist Dell was booming but IBM had to withdraw its poor-selling line of desktop computers. IBM stock took a beating.

Now that we have verified the conditions of the SRM, we can move on to inference. Most interesting from the point of view of the CAPM, the estimated intercept is not significantly different from zero. Even though the estimate b0 is positive, the confidence interval for α includes zero. The 95% confidence interval for α is 0.5738 ± 2(0.7141) ≈ -0.9 to 2.0. The intercept α might be -0.5, 1.21, or any other value inside the confidence interval, including zero. The estimate lies only 0.8 standard errors from zero (t-statistic). Such deviations often happen by chance alone (p-value ≈ 0.4).
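The interval and t-statistic quoted above follow directly from the estimate and standard error in Table 25-1. The following sketch reproduces the arithmetic; it assumes the chapter's rule of thumb of 2 standard errors for a 95% interval, and the variable names are ours.

```python
# Sketch: approximate 95% confidence interval and t-statistic for the intercept,
# using the estimate and standard error reported in Table 25-1.

estimate = 0.5738378    # b0, the estimated intercept
std_error = 0.714106    # standard error of b0

# Rule of thumb: a 95% interval reaches 2 standard errors on either side.
lower = estimate - 2 * std_error
upper = estimate + 2 * std_error
t_stat = estimate / std_error     # distance from zero, in standard errors

contains_zero = lower < 0 < upper
print(f"95% CI: {lower:.1f} to {upper:.1f}")
print(f"t-statistic: {t_stat:.2f}")
print(f"Interval contains zero: {contains_zero}")
```

Because the interval contains zero, the data are consistent with the CAPM claim that the intercept is zero.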

Dependence among Several Variables


The simple regression specified by the CAPM leaves a lot of variation in the residuals: 62% of the variation in the excess percentage changes in IBM stock remains unexplained. Perhaps we can explain more variation with a different measure of the overall stock market.

We've been using returns on the whole market as if you bought stock in every company. Our measure of the whole market, the value-weighted index, assumes that you invest more in big companies than in small companies. If one company, say Google, makes up 5% of the total value of stocks, then our measure of the whole market assumes that 5% of your money is invested in Google as well. The well-known Dow-Jones industrial index works differently. This index tracks the value of stocks in 30 large companies. Rather than invest proportional to size, the Dow assumes that you divide your investment equally among these companies.

The following correlation matrix summarizes the association between these variables. A correlation matrix is a table that displays all of the pairwise correlations among a collection of variables. The variables in this example are the excess percentage changes on IBM, the US market, and the Dow-Jones index. We added Date as well to see if any of these time series trends up or down.

Correlation Matrix: Table that shows the correlations among a collection of variables.

                          IBM           Market        Dow-Jones
                          Excess %Chg   Excess %Chg   Excess %Chg   Date
IBM Excess % Change       1             0.6192        0.6488        -0.1562
Market Excess % Change    0.6192        1             0.8766        -0.0662
DJ Excess % Change        0.6488        0.8766        1             -0.1070
Date                      -0.1560       -0.0662       -0.1070       1

Table 25-2. Correlation matrix for excess stock returns.

Each number in the correlation matrix is the correlation corr(row, column) between the variables that label the margins of the table. The 1s along the diagonal remind you that the correlation of something with itself is 1. Also, the correlation matrix is symmetric. If you swap the rows and columns, you end up with the same thing since corr(x,y) = corr(y,x).

Square the Correlation: The R² in simple regression is the square of the correlation between x and y. You can interpret the square of any correlation as the R² of the regression of one variable on the other.
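Both properties, the symmetry of the correlation and the link between a squared correlation and R², can be checked numerically. The sketch below uses five made-up data points (not the stock data) and helper functions of our own naming.

```python
# Sketch: corr(x, y) equals corr(y, x), and its square is the R-squared of the
# simple regression of y on x. The five data points are made up for illustration.
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def corr(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def r_squared(x, y):
    """R-squared from the least squares regression of y on x."""
    mx, my = mean(x), mean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    intercept = my - slope * mx
    residual_ss = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    total_ss = sum((b - my) ** 2 for b in y)
    return 1 - residual_ss / total_ss

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]

r = corr(x, y)
print(f"corr(x, y) = {r:.4f}, corr(y, x) = {corr(y, x):.4f}")
print(f"r squared  = {r ** 2:.4f}, R-squared of y on x = {r_squared(x, y):.4f}")
```

The same relationship holds for the stock data: 0.6192² ≈ 0.383, which agrees with the R² reported in Table 25-1.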

The top row of this table shows the correlations of the response with the other variables. For example, excess percentage changes in IBM are more correlated with those in the Dow-Jones index than with those in the value-weighted market index (0.6488 vs. 0.6192). In retrospect, it's not too surprising that the Dow-Jones index is more correlated with IBM: IBM is one of the 30 companies in the Dow-Jones index! The table also shows a weak negative correlation with Date. Overall, returns for IBM are down slightly over these 10 years.

All of these correlations in the first row are smaller than the correlation between the Dow-Jones index and the whole market. The square of this correlation is r² = 0.8766² ≈ 0.77, so about 77% of the variation in the two explanatory variables is shared.

The table of correlations compactly summarizes the association between the variables, but plots can be more useful. You cannot judge the straight-enough condition, for instance, from a correlation. That's why someone decided to replace each correlation by the corresponding scatterplot. If you replace each corr(y,x) in Table 25-2 by the scatterplot of y on x, you have a scatterplot matrix. A scatterplot matrix is a table of scatterplots, with each cell of the table holding a scatterplot of the variable that labels the row versus the variable that labels the column. Like a correlation matrix, a scatterplot matrix has a lot of duplication: the scatterplot of y on x is the mirror image of the scatterplot of x on y. Because of the redundancies, we generally concentrate on the upper right half of the scatterplot matrix. The next figure shows the scatterplot matrix corresponding to the correlation matrix in Table 25-2.
Scatterplot Matrix: A table of scatterplots arranged as in a correlation matrix, but with corr(y,x) replaced by a scatterplot of y on x.

[Scatterplot matrix of IBM Excess %Chg, Market Excess %Chg, DJ Excess %Chg, and Date.]

Figure 25-3. Scatterplot matrix of excess percentage changes.

The scatterplot matrix puts names of the variables along the diagonal. These names label the axes. For example, IBM Excess %Chg labels the y-axis of the 3 plots in the first row (as well as the x-axis of the 3 plots in the first column). Similarly, Market Excess %Chg labels the x-axis of the 3 plots in the second column. Along the margins of the scatterplot matrix, the axes show the scales.

With Date as the last variable, the 4th column shows timeplots of each variable. We can see, for instance, that excess percentage changes in IBM trend downward (the correlation is -0.156), but the trend is small relative to the variation in the data.

The high correlation between excess changes in the Dow-Jones index and the whole market stands out. Points in a scatterplot of highly correlated variables concentrate along a diagonal, and a scatterplot matrix makes this pattern easy to spot. The narrow cluster of points along the diagonal draws your attention better than a number in the correlation matrix.

The following scatterplot shows three variables: Y, X1 and X2. Assume that we are planning to fit a multiple regression of Y on X1 and X2.

AYT


(a) Which pair of variables in the scatterplot matrix has the largest positive correlation? Does this large correlation indicate collinearity?2
(b) What is the maximum value, approximately, of X1?3

Impact of Large Collinearity


Collinearity (sometimes called by the longer name multicollinearity) refers to correlation between explanatory variables in a multiple regression. Collinearity complicates the interpretation of a multiple regression, but it does not violate any of the assumptions of the MRM. Explanatory variables are usually correlated to some degree.
Collinearity: Correlation among the explanatory variables in a multiple regression.

The following table summarizes the multiple regression of excess returns on IBM on those of the whole market and the Dow-Jones index. (This model meets the conditions of the MRM. We've omitted the checks of the conditions to give us more room to discuss collinearity.)

R²     0.431922
s_e    7.486097
n      120

Term                  Estimate    Std Error   t-statistic   p-value
Intercept             0.6244402   0.688522    0.91          0.3663
Market Excess %Chg    0.4594495   0.30547     1.50          0.1353
DJ Excess %Chg        0.9888452   0.312689    3.16          0.0020

Table 25-3. Summary of the two-predictor model.

2 The largest is between Y and X2 in the upper right corner. This correlation does not indicate collinearity because it's an association between the response and an explanatory variable.
3 A bit less than 80. It's easiest to see this from the top of the middle scale for X1 along the vertical axis.


Let's compare this multiple regression to the CAPM regression that uses only excess percentage changes in the whole market (Table 25-1). First of all, R² increased from 38.3% to 43.2%. That's a small change, however, when compared to the large correlation (0.65) between the excess returns on IBM and the Dow-Jones index. Instead of explaining another 0.65² ≈ 42% of the variation, the addition of excess percentage changes on the Dow-Jones index adds only 4.9%. Similarly s_e, the standard deviation of the residuals, barely nudges downward, from 7.77 to 7.49.
Big Statistic? t: more than 2 or less than -2. F: more than 4. Values of F < 4 are sometimes significant, but all those larger than 4 are significant.

Regardless of the small change in R², the F-statistic shows that this model as a whole explains statistically significant variation in excess percentage changes in IBM. The F-statistic is

\[ F = \frac{R^2}{1-R^2} \times \frac{n-1-q}{q} = \frac{0.431922}{1-0.431922} \times \frac{120-3}{2} \approx 44.5 \]


That's much bigger than 4 and so rejects H0: β1 = β2 = 0. A regression does not explain 43% of the variation in 120 cases with 2 unrelated explanatory variables. By rejecting H0: β1 = β2 = 0, the overall F-statistic says something is going on, but it does not indicate what that might be. It also doesn't tell us if this model explains statistically significantly more variation than the model with one predictor summarized in Table 25-1. After all, R² always goes up when you add another explanatory variable. To see if the added variable significantly improves the fit of the model, we've got to check its t-statistic. The estimated slope for the Dow-Jones index is b2 = 0.99, with t-statistic 3.16. Because t > 2, we reject H0: β2 = 0. Adding this variable improves the fit by a statistically significant amount.

Although adding the Dow-Jones index improves the fit, look what happens to the slope of the whole market. It's fallen to b1 = 0.46 and is not statistically significant because its t-statistic is 1.5. In the simple regression, its t-statistic was 8.6. Now it's less than 2, meaning that β1 could be zero. This once important explanatory variable isn't helping the fit anymore. That's troubling. How can a variable that is so highly correlated with the response and was statistically significant when used alone in a simple regression not improve the fit of a multiple regression?

q is the number of explanatory variables
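The F-statistic calculation from R² can be reproduced with a short function. This is a sketch of the formula used in this section, with the inputs taken from Table 25-3; the function name f_statistic is our own.

```python
# Sketch: overall F-statistic computed from R-squared, using the values
# reported for the two-predictor model in Table 25-3.

def f_statistic(r2, n, q):
    """F = [R^2 / (1 - R^2)] * [(n - 1 - q) / q] for q explanatory variables."""
    return (r2 / (1 - r2)) * ((n - 1 - q) / q)

r2 = 0.431922   # R-squared of the two-predictor model
n = 120         # number of months
q = 2           # number of explanatory variables

f_two = f_statistic(r2, n, q)
print(f"F = {f_two:.2f}")
print(f"Exceeds the rule-of-thumb threshold of 4: {f_two > 4}")
```

The function only restates the formula; judging significance when F is near 4 still requires a p-value from software.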

Under the Hood: The ANOVA Table and the F-statistic


We generally calculate the F-statistic from R² and compare the result to 4. Values of the F-statistic larger than 4 indicate that the estimated equation explains statistically significant variation in y, more than you'd expect by picking explanatory variables at random. In most of our examples, the F-statistic has been much larger than 4. When the F-statistic is near 4, you'll want to see a p-value. In some cases, the F-statistic can be less than 4 but still indicate a statistically significant fit (p-value less than 0.05).

To find the p-value for the F-statistic, you've got to look into one of the more daunting summaries of a multiple regression: the analysis of variance. The analysis of variance, or ANOVA, summarizes the ability of a regression model to describe the variation in the response. Here's the ANOVA for the two-predictor model in this chapter.

Source       Sum of Squares   DF    Mean Square   F Statistic   p-value
Regression   4985.333         2     2492.67       44.4788       <.0001
Residual     6556.872         117   56.04
Total        11542.205        119

Table 25-4. ANOVA for the two-predictor regression.

R² = 4985.333/11542.205
s_e² = 56.04

Several summaries from Table 25-3 are repeated, albeit in different form, in this table (R² and s_e²). Something that isn't found elsewhere is the p-value of the F-statistic. Because the F-statistic is so large in this example, the p-value is much less than 0.05.

Aliases
Sum of squares = SS
Regression SS = Model SS = Explained SS
Residual SS = Error SS = Unexplained SS

The ANOVA summary gives a detailed accounting of the variation in y. The ANOVA starts with the sum of squared deviations around the mean ȳ, typically called the Total SS (or even TSS). The ANOVA then divides this variation into two piles, one that is explained by the regression and one that remains in the residuals. For example,

Total SS = Regression SS + Residual SS
11542.205 = 4985.333 + 6556.872

Here are the definitions of these sums of squares, in symbols:

\[ \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} e_i^2 \]

That first equality is not obvious, but it's true. (It's a generalization of the Pythagorean Theorem for right triangles: a² + b² = c².) The Total SS splits into squared deviations in the fitted values (Regression SS) plus the sum of squared residuals (Residual SS). R² is the ratio

\[ R^2 = \frac{\text{Variation in } \hat{y}}{\text{Variation in } y} = \frac{\text{Regression SS}}{\text{Total SS}} = \frac{4985.333}{11542.205} \approx 0.431922 \]

The F-statistic adjusts R² for the sample size n and q, the number of explanatory variables in the estimated equation. The F-statistic amortizes the explained variation to take account of the number of explanatory variables and the number of cases. Mean Squares in the table are sums of squares divided by constants labeled DF (for degrees of freedom). The degrees of freedom for the regression is the number of explanatory variables, q. The degrees of freedom for the residuals is n minus the number of estimated parameters in the equation, n - 1 - q. Hence, the Mean Square for the residuals is s_e². The F-statistic is the ratio of these mean squares:

\[ F = \frac{\text{Mean Square Regression}}{\text{Mean Square Residual}} = \frac{\text{Regression Sum of Squares}/q}{\text{Residual Sum of Squares}/(n-1-q)}, \qquad \text{Mean Square Residual} = s_e^2 \]

On top, the numerator of the F-statistic measures the amount of variation in the fitted values relative to the number of explanatory variables q. The bottom is the amount of residual variation per degree of freedom in the residuals, s_e².
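The entries of the ANOVA table can be rebuilt from the two sums of squares alone. The sketch below does so for the values in Table 25-4; the variable names are ours, and the p-value is left out because it requires the F distribution from a statistics package.

```python
# Sketch: rebuild the ANOVA summaries from the sums of squares in Table 25-4.

regression_ss = 4985.333   # variation explained by the fitted values
residual_ss = 6556.872     # variation left in the residuals
n = 120                    # number of cases
q = 2                      # number of explanatory variables

total_ss = regression_ss + residual_ss
r2 = regression_ss / total_ss

ms_regression = regression_ss / q            # mean square for the regression
ms_residual = residual_ss / (n - 1 - q)      # mean square for the residuals, s_e^2
s_e = ms_residual ** 0.5                     # standard deviation of the residuals
f_stat = ms_regression / ms_residual

print(f"Total SS = {total_ss:.3f}")
print(f"R-squared = {r2:.6f}")
print(f"s_e^2 = {ms_residual:.2f}, s_e = {s_e:.3f}")
print(f"F = {f_stat:.2f}")
```

The results agree with R² and s_e in Table 25-3 and with the F-statistic computed from R², which shows that the ANOVA table repackages the same information.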

Interpreting a Slope
First, let's review the interpretation of the marginal slope in simple regression. The slope in the simple regression of the excess percentage change in IBM on the market measures how prices in IBM change when the market changes. The fitted model is (Table 25-1)

Estimated Excess %Change IBM = 0.6 + 1.31 Excess %Change Market

For example, if we pick months in which the market returns either 0% or 5%, the average percentage change in IBM in the second batch of months is about 1.31 × 5% = 6.55% larger than the average in the first batch.

The partial slope for the market index estimates something different. It's the coefficient of the same variable, but computed under different conditions. Let's compare returns in two batches of months. For the marginal slope, the market returns 0% in one batch and 5% in the other. We can't use these batches to interpret the partial slope. Instead, we've got to consider what happens to the other explanatory variable. The months in both batches must have the same changes in the Dow. That way we can isolate the effect of differences in one explanatory variable from differences in the other.

To interpret the partial slope for the market, we need to compare the excess percentage changes in IBM in months with different percentage changes in the market but the same percentage changes in the Dow-Jones index. Because of the high correlation between the Dow-Jones index and the market, that's hard to do. There aren't many months when one index stays the same but the other changes. This lack of information means that we're not going to get good estimates for the partial slopes when the explanatory variables are highly correlated.

Here's a list of things that happen when we add a second variable that is highly correlated with the explanatory variable that is already in the model:


Effects of Collinearity


Collinearity produces
1. Small increase in R²
2. Large changes in slopes
3. Big F, mediocre t
4. Increased SE

1. R² increases less than you'd expect. The correlation between the excess percentage changes in the Dow-Jones index and IBM is 0.65, but R² increases by only about 0.05 when we add this variable. We'd have expected an increase closer to 0.65² ≈ 0.4.

2. The slope for the explanatory variable already in the model changes dramatically. The marginal slope for the market index (beta in the CAPM regression) is 1.31. After we add the Dow-Jones index to the fit, the slope for the market drops to less than 0.5.

3. The overall F-statistic is more impressive than the individual t-statistics. The p-value for the F-statistic is off the charts (less than 0.0001 in Table 25-4), but only one explanatory variable appears useful. Neither explanatory variable appears as impressive as the overall fit (considering their t-statistics and p-values).

4. The standard errors for the partial slopes are larger than those of the marginal slopes. The standard error of the marginal slope of the market index is 0.15, but the standard error of the partial slope is 0.30, twice as large even though the estimated multiple regression explains more variation than the simple regression.

These puzzling effects are all consequences of collinearity, the correlation between the two explanatory variables. These happen to some extent when you add any variable to the regression that is correlated with other explanatory variables; it's just a question of degree. The effects of collinearity are pronounced in this example because of the redundancy between the two indices.

The correlation between the two indices is 0.88, so the two explanatory variables share much variation. It's no wonder R² increases only a little. The initial CAPM regression already explains much of the variation that is associated with changes in the Dow-Jones index. The addition of percentage changes on the Dow-Jones can't add very much.

Collinearity also explains why the F-statistic is statistically significant, but only one of the two explanatory variables is significant. The F-statistic says that this model explains more than a random share of the variation in the excess percentage changes in IBM. Either one of these variables explains more than 35% of the variation. Because the two are so similar, however, we aren't getting much from the second. The t-statistic for the partial slope of the market is not significant in the multiple regression because it measures the incremental contribution of this variable to a model that already has the other variable. Once a regression includes the Dow-Jones index, there's little that the market index can add.

Standard Error in Multiple Regression


The presence of highly correlated explanatory variables in the regression inflates their standard errors. This increase is regression's way of telling us that we've asked it to describe something that does not happen in the data. In a sense, we lack the data necessary to fit the equation. The partial slope measures the effects of differences in one explanatory variable when the other does not change, but that just does not happen in these data.

Let's review the standard error for the slope in simple regression. It's crucial that the values of the explanatory variable spread out. The standard error of the marginal slope is

\[ se(b_1) \approx \frac{s_e}{\sqrt{n}} \times \frac{1}{s_x} \]

More variation in the explanatory variable increases s_x and hence reduces se(b1). A scatterplot confirms the benefit of having variation in the explanatory variable.
[Scatterplot: IBM Excess %Chg (vertical axis) versus Market Excess %Chg (horizontal axis).]

Figure 25-4. Theres considerable variation in the market.

We get to see how the performance of IBM stock varies under a wide range of market conditions. The excess percentage changes in the market vary from less than -15% to near 10%. Simple regression gets to use all of this variation to estimate the marginal slope.

Multiple regression has to work with less. Multiple regression is confined to variation that is unique to each explanatory variable. To estimate the partial slope for the market index, multiple regression requires months that have the same return on the Dow-Jones index: months in which the whole market moves, but the Dow-Jones does not. Here's the same scatterplot, but highlighting two batches of months.
[Scatterplot: IBM Excess %Chg (vertical axis) versus Market Excess %Chg (horizontal axis), with two subsets of months highlighted.]

Figure 25-5. Subsets with different levels of returns on the Dow.

The two subsets identify months in which the Dow-Jones either lost (green +) or gained (red) about 5%. (For the 16 months in green, the Dow-Jones index lost 3 to 7%. For the 26 months in red, it gained from 3 to 7%.) The association between IBM and the whole market within these subsets is rather weak. Also, there's less variation in the market during these periods; the points cluster within relatively narrow portions of the x-axis. That's the variation left for estimating the partial slope. Because of the high correlation, little unique variation remains for estimating the partial slope in the multiple regression.

Variance Inflation Factor


The variance inflation factor shows how collinearity affects the standard error of a partial slope. The standard error for a marginal slope in simple regression is

\[ se(b_1) \approx \frac{s_e}{\sqrt{n}} \times \frac{1}{s_x} \]

In multiple regression, the standard error for the partial slope is

\[ se(b_1) \approx \frac{s_e}{\sqrt{n}} \times \frac{1}{s_{x_1}} \times \sqrt{VIF(X_1)} \]

The variance inflation factor (VIF) indicates how collinearity increases the standard error of a slope. For a multiple regression with two explanatory variables, the correlation between the two determines the variance inflation factor:

\[ VIF(X_j) = \frac{1}{1 - r^2}, \qquad r = \mathrm{corr}(x_1, x_2) \]

If the explanatory variables are uncorrelated, then VIF(X1) = 1; there's no collinearity. In this example, however, the correlation r = 0.88, so VIF(X1) = VIF(X2) = 1/(1 - r²) ≈ 4.4. As a result, collinearity roughly doubles the standard error because √VIF = √4.4 ≈ 2. (Both explanatory variables suffer the same way from collinearity.) If you compare the standard errors for the slopes of the market index in Table 25-1 and Table 25-3, that's what happened. The standard error of b1 in the simple regression is 0.15 versus 0.30 in the multiple regression.

Variance inflation factors are a handy measure for collinearity, particularly in regression models with several explanatory variables. Say you've got a model and the partial slope for an explanatory variable is not statistically significant. Its t-statistic is near zero with a wide confidence interval. Is this explanatory variable redundant (like using both the Dow-Jones and the whole market in the same regression), or is it simply unrelated to the response? To decide, look at its VIF. If the VIF is near 1, then there is little collinearity and this explanatory variable is unrelated to the response. If, on the other hand, the VIF is large (say 5 to 10 or more), then this variable is associated with other explanatory variables and partially redundant.

What's a Big VIF? There's no definitive rule. VIF = 1 when there is no collinearity. Whether a VIF > 1 indicates a problem depends on the standard error for the slopes. You could have a large VIF (say, 10 or more) and still get a narrow confidence interval. Generally, though, a VIF greater than 5 or 10 deserves a look.
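The doubling of the standard error can be checked against the reported tables. The sketch below uses the unrounded correlation 0.8766 from Table 25-2 (which gives a VIF of about 4.3 rather than the 4.4 obtained from the rounded 0.88) and also adjusts for the small drop in s_e between the two models. The function name is our own, and the exact form of the check is our reading of the standard error formulas in this section.

```python
# Sketch: variance inflation factor for two explanatory variables, and a check
# of how it links the standard errors in Table 25-1 and Table 25-3.
from math import sqrt

def vif_two_predictors(r):
    """VIF = 1 / (1 - r^2), where r = corr(x1, x2)."""
    return 1 / (1 - r ** 2)

r = 0.8766                 # correlation between the two indices (Table 25-2)
vif = vif_two_predictors(r)

se_simple = 0.152508       # SE of the market slope in the simple regression
s_e_simple = 7.76636       # residual SD in the simple regression
s_e_multiple = 7.486097    # residual SD in the multiple regression

# Rescale for the change in residual SD, then inflate by sqrt(VIF).
se_predicted = se_simple * (s_e_multiple / s_e_simple) * sqrt(vif)

print(f"VIF = {vif:.2f}, sqrt(VIF) = {sqrt(vif):.2f}")
print(f"Predicted SE of the partial slope = {se_predicted:.4f}")
```

The predicted value is very close to the standard error 0.30547 reported for the market slope in Table 25-3, so the inflation is accounted for by the correlation between the two indices.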


Under the Hood: Variance Inflation Factors in Larger Models


The VIF is easy to calculate in regression models with more than two explanatory variables. With two explanatory variables, all you need is the correlation corr(x1,x2). With more than two, you need another multiple regression. In general, to find the VIF for an explanatory variable, regress that explanatory variable on all of the other explanatory variables. For instance, to find the VIF for x1, regress it on the other explanatory variables. Denote the $R^2$ of that regression as $R_1^2$. The VIF for x1 is then

$$ VIF(X_1) = \frac{1}{1 - R_1^2} $$

That is, just replace the square of the correlation corr(x1,x2) by $R_1^2$. As $R_1^2$ increases, less unique variation remains in x1 and the effects of collinearity (and VIF) grow.

AYT

(a) Is it possible for collinearity between explanatory variables to reduce the standard error of a slope estimate, or must it always increase the standard error (and hence lead to longer confidence intervals)? [4]
(b) In the prior AYT, the correlation between X1 and X2 is 0.84. What's the VIF for X1? For X2? [5]
(c) The following table summarizes the estimated equation from the regression of Y on X1 and X2. Does it appear that collinearity is responsible for the insignificant effect of X1? [6]

Parameter Estimates
Term         Estimate     Std Error   t Ratio   Prob>|t|
Intercept    2.4404306    3.057573    0.80      0.4307
X1           0.0092054    0.133619    0.07      0.9455
X2           3.2182753    0.107313    29.99     <.0001
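The auxiliary-regression recipe for the VIF described in this Under the Hood section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the procedure used by any particular package. The helper names (solve_linear, r_squared, vif) and the small data set are hypothetical. The least squares fit solves the normal equations directly, which is adequate for a toy example but not numerically robust.

```python
from statistics import fmean


def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def r_squared(target, predictors):
    """R^2 of the least squares regression of target on predictors, with intercept."""
    rows = [[1.0, *vals] for vals in zip(*predictors)]
    k = len(rows[0])
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(row[i] * row[j] for row in rows) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(rows, target)) for i in range(k)]
    beta = solve_linear(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in rows]
    mean_y = fmean(target)
    sse = sum((y - f) ** 2 for y, f in zip(target, fitted))
    sst = sum((y - mean_y) ** 2 for y in target)
    return 1.0 - sse / sst


def vif(columns, j):
    """VIF for columns[j]: regress it on all the other explanatory variables."""
    others = [col for i, col in enumerate(columns) if i != j]
    return 1.0 / (1.0 - r_squared(columns[j], others))


# Hypothetical explanatory variables, made up for this illustration only.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [2, 1, 4, 3, 6, 5, 8, 7]
x3 = [5, 3, 8, 1, 9, 2, 7, 4]

for j, name in enumerate(["x1", "x2", "x3"]):
    print(name, round(vif([x1, x2, x3], j), 2))
```

With only two explanatory variables, the same function reproduces 1/(1 - r^2), because the $R^2$ of a simple regression is the squared correlation.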

[4] Collinearity always increases the SE. VIF = 1 when there is no correlation and can only increase.
[5] The VIF for both explanatory variables is 1/(1 - corr^2) = 1/(1 - 0.84^2) ≈ 3.4.
[6] No. Use the VIF to find the SE had the two explanatory variables been uncorrelated. If X1 and X2 were uncorrelated (with VIF = 1), the SE for X1 would be 0.133/√3.4 ≈ 0.072. The estimate b1 would still be much less than 1 standard error from zero. It's just not going to contribute to the fit.

Remedies for Collinearity


Collinearity among explanatory variables in multiple regression means that partial slopes may be rather different from marginal slopes. Collinearity not only makes multiple regression harder to interpret, it also blurs the estimates because it increases standard errors. You end up with wider confidence intervals and t-statistics that are closer to zero. What should you do about it? There's a range of possible actions.

Nothing. That's the favorite remedy chosen by our students, too. Actually, it's not necessarily a bad choice so long as you are aware of the presence of the collinearity. If you plan to use the model to predict new cases that resemble the observed data, then collinearity doesn't matter. You can have a very predictive model, with a large $R^2$, even with substantial collinearity. Be cautious in making predictions; it's easy to extrapolate from collinear data. Make sure that you're predicting under conditions that mimic the dependence among your predictors.

Remove redundant predictors. Next to nothing, this one is also easy. The catch is deciding which to remove. If the t-statistic of an explanatory variable is close to zero, then that variable is not adding to your model. You'll know that the small t is the result of collinearity if the explanatory variable has a large marginal correlation with the response but no effect in the multiple regression (like the market in our example). The VIF works well for this purpose.

Re-express your predictors. Collinearity means that several explanatory variables measure a common attribute. If you know enough of the context, you can generally think of a way to combine the explanatory variables into an index. For example, the Consumer Price Index (CPI) blends prices for a collection of products. This index replaces a collection of variables that all measure prices and are highly collinear. Similarly, empirical courses in social sciences frequently introduce a variable called socio-economic status (SES, or something close to this). This variable combines income, education, status, and the like: variables that tend to be correlated with one another.

In this example, there's a simple way to re-express the explanatory variables to remove most of the collinearity. Each index tracks how the stock market moves in slightly different ways. The similarity (both have the same units) suggests that we average the two and use this as an explanatory variable. At the same time, the Dow-Jones index and the whole market index are different, so let's retain the difference between the two as a second explanatory variable. Here's the multiple regression with these two variables in place of the original pair.
R2    0.431922
se    7.486097
n     120

Term                   Estimate     Std Error   t-statistic   p-value
Intercept              0.6244402    0.688522    0.91          0.3663
Average % Change       1.4482947    0.153714    9.42          <.0001
Difference % Change    0.2646978    0.299393    0.88          0.3784

Table 25-5. Multiple regression with rearranged explanatory variables.

This equation has the same $R^2$ and $s_e$ as obtained in the previous fit (Table 25-3) because you can get one equation from the other.

$$
\begin{aligned}
\text{Estimated IBM Excess \%} &= 0.624 + 1.448\,(\text{Ave \% Chg}) + 0.265\,(\text{Diff \% Chg}) \\
&= 0.624 + 1.448\,(\text{Market} + \text{Dow})/2 + 0.265\,(\text{Dow} - \text{Market}) \\
&= 0.624 + (1.448/2 - 0.265)\,\text{Market} + (1.448/2 + 0.265)\,\text{Dow} \\
&= 0.624 + 0.459\,\text{Market} + 0.989\,\text{Dow}
\end{aligned}
$$

However, the average and difference are almost uncorrelated (correlation = -0.05). There's basically no collinearity clouding our interpretation. The average of the indices is really useful; the difference is not. Knowing the difference between the excess percentage changes in the overall market and the Dow-Jones in a month doesn't add anything once you know the average of the two.
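The algebra that converts the slopes on the average and difference back to slopes on the original indices is easy to check numerically. The sketch below assumes, as the arithmetic in the text implies, that the average is (Market + Dow)/2 and the difference is Dow minus Market. The function name original_slopes is hypothetical.

```python
def original_slopes(b_avg, b_diff):
    """Convert slopes on (average, difference) into slopes on (Market, Dow).

    Assumes average = (Market + Dow) / 2 and difference = Dow - Market,
    so b_avg * average + b_diff * difference expands to
    (b_avg / 2 - b_diff) * Market + (b_avg / 2 + b_diff) * Dow.
    """
    b_market = b_avg / 2 - b_diff
    b_dow = b_avg / 2 + b_diff
    return b_market, b_dow


# Estimates taken from Table 25-5.
b_market, b_dow = original_slopes(1.4482947, 0.2646978)
print(f"Market slope = {b_market:.3f}, Dow slope = {b_dow:.3f}")
```

Because the two parameterizations are linear transformations of each other, the fitted values are identical; only the interpretation of the coefficients and their standard errors changes.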

4M Market Segmentation
A marketing research company performed a study to measure interest in a new type of cellular phone. The firm obtained a sample of 75 consumers. After a preliminary discussion, representatives showed the phone to each consumer individually. A questionnaire then asked each consumer to rate their likelihood of purchase on a scale of 1 to 10, with 1 implying little chance of purchase and 10 indicating almost certain purchase. In addition to the rating, the marketing firm measured two characteristics of the consumers: age (in years) and income (in thousands of dollars). Here are the correlations among these variables and the scatterplot matrix.
          Rating    Age       Income
Rating    1.0000    0.5867    0.8845
Age       0.5867    1.0000    0.8286
Income    0.8845    0.8286    1.0000

Table 25-6. Correlation matrix.

The diagonal patterns throughout the scatterplot matrix reinforce the fact that everything in this data is correlated with everything else.
Figure 25-6. Scatterplot matrix of the three variables (Rating, Age, and Income).

Having collected the data, the marketing group needed to make a recommendation. It has resources to place advertisements in one of two possible magazines. Both magazines appeal to affluent consumers with high incomes, but subscribers to one of the magazines average 20 years older than subscribers to the other.
Motivation

State the business issue and decision that needs to be made.
Should the marketing group concentrate the initial advertising on the magazine with younger or older readers? The whole industry will be watching the launch, so management wants a large burst of initial sales.

Method

Plan your analysis. Identify the explanatory variables and the response.
I'll use a multiple regression of the rating on both age and income to isolate the effects of age from those of income.

Relate regression to the business decision.
The partial slopes for age and income will tell me how these two factors combine to influence the rating. I am interested in how consumers at high incomes react, so I am particularly interested in the partial slope for age.

Describe the sample.
The data come from an experiment that showed a sample of 75 consumers the new phone design. These should be representative. Each consumer saw the device separately to avoid introducing dependence from how the reaction of one might influence others.

Use a plot to make sure your method makes sense.
The scatterplots of the response in the scatterplot matrix (Figure 25-6, top row) look straight enough. There's quite a bit of collinearity, however, between the two explanatory variables.

Verify the big-picture condition.
Straight-enough. Seems OK from these plots. There's no evident bending and the plots indicate clear dependence.
No embarrassing lurking factor. Hopefully, the marketing survey got a good mix of men and women. It would be embarrassing to find out that all of the younger customers in the survey were women, and the older customers were men. The best remedy for this is to check that the survey participants were randomly selected from our target consumer profile.

Mechanics

Check the additional conditions on the errors by examining the residuals.
Similar variances. The residuals seem like a simple random swarm, with no tendency for changing variation.

[Plot of Rating Residual versus Rating Predicted.]

Nearly normal. The normal quantile plot of the residuals looks good. There's one lone unusually negative residual, but this outlier does not seem leveraged, so it has little effect on the slopes.

[Normal Quantile Plot of the residuals.]

If there are no severe violations of the conditions, summarize the overall fit of the model.
Here's the summary of my fitted model.

R2    0.8507
se    0.6444
n     75

Check that the model overall explains significant variation in y.
The F-statistic is F = (0.85/0.15) × (75 - 3)/2 ≈ 204; I can clearly reject H0 that both slopes are zero. This model explains variation in the ratings.

Term         Estimate    Std Error   t stat    p-value
Intercept    0.5177      0.3521      1.47      0.1459
Age          -0.0716     0.0125      -5.74     <.0001
Income       0.1007      0.0064      15.63     <.0001

Show the estimated equation.
The estimated equation is
Est Rating ≈ 0.5 - 0.07 Age + 0.10 Income
Each predictor explains significant variation. Both add to a simple regression with the other as one predictor. The confidence interval for either slope does not include zero.

Build confidence intervals for the relevant parameters.
The rough 95% confidence interval for age is
-0.0716 ± 2 × 0.0125 = -0.0966 to -0.0466 rating points/year
So, for consumers with comparable incomes, a 20-year age difference gives an effect of
20 × [-0.0966 to -0.0466] = -1.932 to -0.932 rating points

With all of these other details, don't forget to round to the relevant precision.
Twenty times the standard error of this slope is 20 × 0.0125 = 0.25, so my message interval is 0.93 to 1.93. It might be easier to say 1 to 2 points.

Message

Answer the question.
The company should place the ads in the magazine appealing to younger customers. I am 95% confident that a younger affluent audience would assign on average a rating of 1 to 2 points higher than the older audience. The magazine with the younger readers finds a more receptive audience.

Don't run from collinearity. Someone at your presentation may be aware of the positive marginal correlation between age and rating, so it's better for you to address the issue.
Although it is true that older customers on average rate our product higher than younger customers, this tendency is an artifact of income. Older customers have larger incomes on average than younger customers. When comparing two customers with comparable incomes, I expect the younger customer to rate our product more highly.

Note important caveats that are relevant to the question at hand.
My analysis presumes that customers who assign a higher rating will indeed be more likely to purchase our product. Also, my model does not take into account other consumer attributes, such as sex or the amount the consumer uses a cell phone. These might affect the rating in addition to age and income.
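The arithmetic in the Mechanics step of this 4M example can be retraced with a few lines of Python. This is only a sketch of the calculations quoted in the text: the function names are ours, the rough interval uses a multiplier of 2 rather than an exact t quantile, and the F-statistic uses the rounded $R^2$ of 0.85, as the text does.

```python
def f_statistic(r2, n, k):
    """Overall F-statistic for a regression with k explanatory variables."""
    return (r2 / (1 - r2)) * (n - k - 1) / k


def rough_interval(estimate, se, multiplier=2.0):
    """Rough 95% confidence interval: estimate plus or minus 2 standard errors."""
    return estimate - multiplier * se, estimate + multiplier * se


f = f_statistic(0.85, n=75, k=2)            # about 204, as in the text

low, high = rough_interval(-0.0716, 0.0125)  # partial slope for age and its SE
years = 20                                   # age gap between the two readerships
effect_low, effect_high = years * low, years * high

print(f"F = {f:.0f}")
print(f"Slope for age: {low:.4f} to {high:.4f} rating points/year")
print(f"20-year effect: {effect_low:.3f} to {effect_high:.3f} rating points")
```

The whole interval for the 20-year effect lies below zero, which is why the message can say that the older audience rates the product 1 to 2 points lower at comparable incomes.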

Seeing How a Slope Can Change Sign


In the 4M example, the slope for age reverses sign when adjusted for differences in income. It makes sense when you think about it (only younger customers with money to spend find the product appealing). There's also a nice plot that shows what is going on. This scatterplot graphs rating on age, with the least squares line.
Figure 25-7. Scatterplot of rating on age.

The slope of the simple regression is clearly positive: on average, older customers rate the product more highly. To see what happens when we compare customers with comparable incomes, let's identify consumers with similar levels of income and limit the fit to within those groups. We'll pick out 3 subsets of consumers. This table gives the income, counts, and colors for each subset.

Income                Number of Cases   Color
Less than $60,000     14                Green
$70,000-80,000        17                Blue
More than $100,000    11                Orange

Table 25-7. Subsets with similar income.


The following scatterplot adds a simple regression within each of these subsets. The slopes are closer to zero and all are negative.
Figure 25-8. Finding the partial slope in a scatterplot.

That's what the multiple regression is telling us: among customers with similar incomes, younger customers are the biggest fans.
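The sign reversal can also be anticipated from the correlations in Table 25-6 alone. The sketch below applies the two-predictor slope formula from this chapter's Formulas section without the ratio of standard deviations, which gives standardized slopes. These have the same signs as the partial slopes. The variable names are ours, and the $R^2$ check relies on the standard identity that, for standardized variables, $R^2$ is the sum of each standardized slope times the matching correlation with the response.

```python
# Correlations from Table 25-6.
r_rating_age = 0.5867
r_rating_income = 0.8845
r_age_income = 0.8286

denom = 1 - r_age_income ** 2   # shrinks toward zero as collinearity grows

# Standardized partial slopes (two-predictor formula without the SD ratio).
beta_age = (r_rating_age - r_rating_income * r_age_income) / denom
beta_income = (r_rating_income - r_rating_age * r_age_income) / denom

# R^2 implied by the correlations, and the VIF shared by age and income.
r2 = beta_age * r_rating_age + beta_income * r_rating_income
vif = 1 / denom

print(f"standardized slope for age    = {beta_age:.3f}")
print(f"standardized slope for income = {beta_income:.3f}")
print(f"implied R^2 = {r2:.3f}, VIF = {vif:.2f}")
```

Although the marginal correlation between rating and age is positive, the standardized partial slope for age is negative, and the implied $R^2$ agrees with the 0.8507 reported for the fitted model up to rounding of the correlations.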


Summary
Correlation between explanatory variables in a multiple regression produces collinearity. Severe cases of collinearity lead to surprising coefficients with wide confidence intervals. The variance inflation factor quantifies the extent to which collinearity increases the standard error of a slope. You can detect collinearity in the correlation matrix and scatterplot matrix. These tables summarize the pairwise associations with numbers or plots. Remedies for severe collinearity include combining variables or removing one of a highly correlated pair.

Index

analysis of variance, 25-9
collinearity, 25-7
correlation matrix, 25-5
scatterplot matrix, 25-5
variance inflation factor, 25-13

Formulas
Estimates in multiple regression
If you ever have to compute the slopes by hand, here are the formulas for a multiple regression with 2 explanatory variables:

$$ b_1 = \frac{s_y}{s_{x_1}} \times \frac{corr(y, x_1) - corr(y, x_2)\,corr(x_1, x_2)}{1 - corr(x_1, x_2)^2} $$

$$ b_2 = \frac{s_y}{s_{x_2}} \times \frac{corr(y, x_2) - corr(y, x_1)\,corr(x_1, x_2)}{1 - corr(x_1, x_2)^2} $$

The ratio of the standard deviations attaches correct units to the estimates. If the explanatory variables are uncorrelated, the slope for each reduces to the marginal slope in the associated simple regression of y on x1 or y on x2. At the other extreme, if the correlation between x1 and x2 is ±1, you divide by zero and the multiple regression is not uniquely defined (too much collinearity). Once you have the slopes, the intercept is easy:

$$ b_0 = \bar{Y} - b_1 \bar{X}_1 - b_2 \bar{X}_2 $$

Standard error of a slope in multiple regression

$$ se(b_1) \approx \frac{s_e}{\sqrt{n}} \times \frac{1}{s_x} \times \sqrt{VIF(X_1)} $$

Variance inflation factor (VIF)
In multiple regression with two explanatory variables, the VIF for both x1 and x2 is

$$ VIF(X_j) = \frac{1}{1 - r^2}, \qquad r = corr(x_1, x_2) $$
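The slope and intercept formulas above translate directly into code. The following Python sketch is one way to write them using only the standard library; the function names and the tiny data set are made up for illustration. The response is constructed as exactly 1 + 2 x1 + 3 x2, so the formulas should recover those coefficients up to floating-point error.

```python
from math import sqrt
from statistics import fmean, stdev


def corr(u, v):
    """Sample correlation between two equal-length sequences."""
    mu, mv = fmean(u), fmean(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / sqrt(suu * svv)


def two_predictor_fit(y, x1, x2):
    """Least squares intercept and slopes from correlations and SDs."""
    r_y1, r_y2, r_12 = corr(y, x1), corr(y, x2), corr(x1, x2)
    denom = 1 - r_12 ** 2   # zero when corr(x1, x2) is +1 or -1
    b1 = stdev(y) / stdev(x1) * (r_y1 - r_y2 * r_12) / denom
    b2 = stdev(y) / stdev(x2) * (r_y2 - r_y1 * r_12) / denom
    b0 = fmean(y) - b1 * fmean(x1) - b2 * fmean(x2)
    return b0, b1, b2


# Hypothetical data: x1 and x2 are correlated but not perfectly collinear.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 6]
y = [1 + 2 * a + 3 * b for a, b in zip(x1, x2)]

b0, b1, b2 = two_predictor_fit(y, x1, x2)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, b2 = {b2:.3f}")
```

If x1 and x2 were perfectly correlated, denom would be zero and the function would raise ZeroDivisionError, mirroring the remark that the regression is then not uniquely defined.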


Best Practices
Start your regression analysis by looking at plots. Tools like the scatterplot matrix make it easy to quickly skim over the plots between y and your explanatory variables. These views also help you see the association between the explanatory variables.

Keep track of time in the scatterplot matrix. There's a lot going on in multiple regression. When the variables are time series, include the time plots as the last column in your scatterplot matrix. These plots will show you aligned time trends in every variable.

Recognize whether you need a partial slope or a marginal slope. If you need the partial slope, as in the 4M example, you're going to have to deal with the collinearity. The marginal slope answers a different question.

Use the F-statistic for the overall model and a t-statistic for each explanatory variable. The overall F-statistic tells you about the whole model, not any one explanatory variable. If you have a question about the whole model, look at the F-statistic. If you have a question about a single explanatory variable, look at the t-statistic.

Learn to recognize the presence of collinearity. When you see the slopes in your model change as you add or remove a variable, recognize that you've got collinearity in the model. Make sure you know why. Variance inflation factors provide a concise numerical summary of the effects of the collinearity.

Don't run from collinearity; try to understand it. In the 4M example of this chapter, we could have gotten rid of the collinearity by removing one of the variables. The simple regression of the rating on the age doesn't show any effects of collinearity. But it also would have led to the wrong conclusion. In examples like the stock market illustration, you can see that you have two variables that measure almost the same thing. In such cases, it does make sense to combine them or perhaps remove one.

Pitfalls
Remove variables at the first sign of collinearity. Collinearity does not violate any assumption of the MRM. The MRM lets you pick values of the predictors however you choose. In many cases, the collinearity is an important part of the model: the partial and marginal slopes are simply different.

Think that multiple regression is just several simple regressions glued together. It's really quite different. The only time that the slopes in the associated simple regressions match the partial slopes in multiple regression is when the explanatory variables are uncorrelated.

Cherry picking. It's common for experienced regression modelers to dive right into the t-statistics, looking for any explanatory variable that's statistically significant and tossing the rest. That's really hasty. Stop and check the plots and F-statistic first. Think about the relationships among the explanatory variables.

About the Data

The data for stocks, as usual, come from CRSP, the Center for Research in Security Prices. We used interest rates on the 30-day Treasury Bill as the risk-free rate in the examples. These risk-free rates also come from CRSP. The data on consumer product ratings come from a case study developed for the statistics course offered in the MBA program at Wharton.

Software Hints
The software commands for building a multiple regression are essentially those used for building a model with one explanatory variable. All you need to do is select several columns as explanatory variables rather than just one.

Excel

To fit a multiple regression, follow the menu commands
Tools > Data Analysis > Regression
(If you don't see this option in your Tools menu, you will need to add these commands. See the Software Hints in Chapter 19.) Selecting several columns as X variables produces a multiple regression analysis.

Minitab

The menu sequence
Stat > Regression > Regression
constructs a multiple regression if several variables are chosen as explanatory variables.

JMP

The menu sequence
Analyze > Fit Model
constructs a multiple regression if two or more variables are entered as explanatory variables. Click the Run Model button to obtain a summary of the least squares regression. The summary window combines the now familiar numerical summary statistics as well as several plots.