

Multicollinearity is a problem which occurs if one of the columns of the X matrix is exactly or nearly a linear combination of the other columns. Exact multicollinearity is rare, but could happen, for example, if we include a dummy (0-1) variable for "Male", another one for "Female", and a column of ones.

More typically, multicollinearity will be approximate, arising from the fact that our explanatory variables are correlated with each other (i.e. they essentially measure the same thing). For example, if we try to describe consumption in households (y) in terms of income (x1) and net worth (x2), then it will be hard to identify the separate effects of x1 and x2 on y. The estimated regression coefficients b1 and b2 will be hard to interpret. The variances of b1 and b2 will be very large, so the corresponding t-statistics will tend to be insignificant, even though the F for the model as a whole is significant and R2 is high. Further, the coefficient of x1, and the corresponding t-statistic, may change dramatically if the seemingly insignificant variable x2 is deleted from the model.

For a numerical example, consider a data set on the monthly sales of backyard satellite antennas (y) in nine randomly selected districts, together with the number of households (x1) in the district, and the number of owner-occupied households (x2) in the district. (Both x1 and x2 are measured in units of 10,000 households.) The multiple regression of y on x1 and x2 indicates that neither variable is linearly related to y. However, R2 = .9279, and the overall F test is highly significant, indicating that at least one of x1 and x2 is linearly related to y.


Satellite Antenna Sales

District   Sales (y)   #Households (x1)   #Owner-Occupied Households (x2)
   1           50            14                       11
   2           73            28                       18
   3           32            10                        5
   4          121            30                       20
   5          156            48                       30
   6           98            30                       21
   7           62            20                       15
   8           51            16                       11
   9           80            25                       17
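The multiple regression described above can be reproduced with plain NumPy. This is a minimal sketch, using the data from the table; the variable names are my own, and the 5% critical value t(.975, 6) = 2.447 is supplied for reference.

```python
import numpy as np

# Satellite antenna data from the table above
y  = np.array([50, 73, 32, 121, 156, 98, 62, 51, 80], dtype=float)
x1 = np.array([14, 28, 10, 30, 48, 30, 20, 16, 25], dtype=float)
x2 = np.array([11, 18, 5, 20, 30, 21, 15, 11, 17], dtype=float)

n = len(y)
X = np.column_stack([np.ones(n), x1, x2])   # design matrix with intercept
b = np.linalg.solve(X.T @ X, X.T @ y)       # least squares estimates

resid = y - X @ b
sse   = resid @ resid
sst   = ((y - y.mean()) ** 2).sum()
R2    = 1 - sse / sst                       # coefficient of determination

# Overall F test of H0: beta1 = beta2 = 0
p = 2                                       # number of slopes
F = (R2 / p) / ((1 - R2) / (n - p - 1))

# t-statistics for the individual slopes
s2 = sse / (n - p - 1)                      # estimate of the error variance
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
t  = b / se

print(f"R^2 = {R2:.4f}, F = {F:.1f}, t1 = {t[1]:.2f}, t2 = {t[2]:.2f}")
```

Running this shows R2 of about .928 and a large, highly significant F, yet both slope t-statistics fall well below the critical value 2.447, exactly the pattern described above.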

The reason why the results of the two t-tests are so different from the result of the F-test is that collinearity has destroyed the t-tests by strongly reducing their power. The Pearson correlation coefficient between x1 and x2 is r = .985, so the two variables are highly collinear. A simple regression of y on x1 gives a t-statistic for b1 of 9.35 (highly significant), while a simple regression of y on x2 gives a t-statistic for b2 of 8.62 (also highly significant). Note also that the R2 values for these two simple regressions are .9259 and .9139, respectively, both of which are almost as high as the multiple R2 for the full model, .9279.
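The two simple regressions can be checked the same way. This sketch uses the standard identity t = r sqrt(n-2) / sqrt(1-r^2) for the slope t-statistic in a simple regression; the helper name `slope_t` is my own.

```python
import numpy as np

y  = np.array([50, 73, 32, 121, 156, 98, 62, 51, 80], dtype=float)
x1 = np.array([14, 28, 10, 30, 48, 30, 20, 16, 25], dtype=float)
x2 = np.array([11, 18, 5, 20, 30, 21, 15, 11, 17], dtype=float)
n  = len(y)

def slope_t(x, y):
    """t-statistic for the slope in a simple regression of y on x."""
    r = np.corrcoef(x, y)[0, 1]
    return r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)

r12 = np.corrcoef(x1, x2)[0, 1]                 # collinearity between regressors
print(f"r(x1, x2) = {r12:.3f}")                 # about .985
print(f"t for b1 in y ~ x1: {slope_t(x1, y):.2f}")   # about 9.35
print(f"t for b2 in y ~ x2: {slope_t(x2, y):.2f}")   # about 8.62
print(f"R^2 (y ~ x1) = {np.corrcoef(x1, y)[0, 1] ** 2:.4f}")  # about .9259
print(f"R^2 (y ~ x2) = {np.corrcoef(x2, y)[0, 1] ** 2:.4f}")  # about .9139
```

Each variable is strongly significant on its own; only when both enter together do the t-tests collapse.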

To get some mathematical insight into the general problem, we use the spectral decomposition (Jobson, p. 576) to write

    (X'X)^{-1} = sum_{i=0}^{p} lambda_i^{-1} p_i p_i',

where the lambda_i are the eigenvalues of X'X and P = [p_0, ..., p_p] is an orthogonal matrix of eigenvectors of X'X. If there is exact multicollinearity, then for some (p+1)x1 vector v != 0, we must have Xv = 0, so that v is an eigenvector of X'X, and the corresponding eigenvalue is zero. Therefore, one of the lambda_i must be zero. In this case, X'X is not invertible, since (X'X)^{-1} would have to satisfy

    (X'X)^{-1}(X'X)v = 0,

that is, v = 0, which is ruled out by the definition of v.

Our computer will (hopefully) be unable to calculate the least squares estimator b, since b is no longer uniquely defined, and (X'X)^{-1} does not exist. Due to roundoff and other numerical errors, however, some packages will be able to carry out their calculations without any obvious catastrophe (e.g. dividing by zero), and therefore they will produce output, which will be completely inappropriate and useless.
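The Male/Female dummy example from the start of the handout makes this concrete. The data below are made up purely for illustration; the point is that male + female equals the intercept column, so X'X has a zero eigenvalue and X is rank deficient.

```python
import numpy as np

# Hypothetical data: intercept plus dummies for Male and Female.
# male + female = 1 = intercept column, an exact linear dependence.
male   = np.array([1, 0, 1, 1, 0, 0], dtype=float)
female = 1 - male
X = np.column_stack([np.ones(6), male, female])

XtX = X.T @ X
eigvals = np.linalg.eigvalsh(XtX)                # sorted ascending
print("eigenvalues of X'X:", eigvals)            # smallest is 0 (up to rounding)
print("rank of X:", np.linalg.matrix_rank(X))    # 2, not 3

# v = (1, -1, -1) satisfies Xv = 0, so it is an eigenvector of X'X
# with eigenvalue zero
v = np.array([1.0, -1.0, -1.0])
print("X @ v =", X @ v)                          # all zeros
```

Calling `np.linalg.inv(XtX)` here would raise an error (or return numerically meaningless values), mirroring the warning above about packages that plough on regardless.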

If there is approximate multicollinearity, then one or more of the lambda_i will be very close to zero, so that the entries of

    (X'X)^{-1} = sum_{i=0}^{p} lambda_i^{-1} p_i p_i'

will be very large. Since var(b_j) = sigma_u^2 (X'X)^{-1}_{jj}, we see that approximate multicollinearity tends to inflate the estimated variance of b_j for one or more (perhaps all) j. As a result, the t-statistics will tend to be insignificant. The overall F is not adversely affected by multicollinearity, so it may be significant even if none of the individual b_j is. It can also be shown that the prediction variance (incurred in "predicting" either the response surface or a future value of y at a particular value of the explanatory variables) will not be disastrously affected by multicollinearity, as long as the prediction point obeys the same approximate multicollinearities as the columns of X.
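The spectral decomposition can be verified numerically on the satellite data. This sketch assumes the intercept column is included in X, so there are three eigenvalues; it rebuilds (X'X)^{-1} from the eigenvalues and eigenvectors and shows the huge spread between the largest and smallest eigenvalue.

```python
import numpy as np

y  = np.array([50, 73, 32, 121, 156, 98, 62, 51, 80], dtype=float)
x1 = np.array([14, 28, 10, 30, 48, 30, 20, 16, 25], dtype=float)
x2 = np.array([11, 18, 5, 20, 30, 21, 15, 11, 17], dtype=float)
X  = np.column_stack([np.ones(9), x1, x2])

XtX = X.T @ X
lam, P = np.linalg.eigh(XtX)       # eigenvalues (ascending) and eigenvectors

# (X'X)^{-1} rebuilt from the spectral decomposition:
# sum over i of lam_i^{-1} p_i p_i'
inv_spectral = sum(P[:, i:i + 1] @ P[:, i:i + 1].T / lam[i] for i in range(3))
print("matches np.linalg.inv:", np.allclose(inv_spectral, np.linalg.inv(XtX)))

# The near-zero eigenvalue dominates the sum through its reciprocal,
# inflating the diagonal entries of (X'X)^{-1} and hence
# var(b_j) = sigma_u^2 (X'X)^{-1}_{jj}.
print("eigenvalues:", lam)
print("lambda_max / lambda_min:", lam[-1] / lam[0])
```

The tiny smallest eigenvalue is the numerical fingerprint of the collinearity between x1 and x2; its reciprocal is what blows up the coefficient variances.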

Keep in mind, though, that multicollinearity often arises because we are trying to use too many explanatory variables. This tends to inflate the prediction variance. (See the handout on model selection.)

So, although the effect of multicollinearity on the predictions may not be disastrous, we will still typically be able to improve the quality of the predictions by using fewer variables.

In my opinion, the best remedy for multicollinearity is to use fewer variables. This can be achieved by a combination of thinking about the problem, transformation and combination of variables, and model selection. Two methods of diagnosing multicollinearity in a given data set are (1) Look at the Pearson correlation coefficients of all pairs of explanatory variables; (2) Look at the ratio lambda_max/lambda_min of the largest to the smallest eigenvalue of (X'X)^{-1}.
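Both diagnostics can be packaged in a few lines. A sketch, with a function name of my own choosing; note that the eigenvalue ratio is the same for X'X and (X'X)^{-1}, since inversion reciprocates the eigenvalues.

```python
import numpy as np

def collinearity_diagnostics(Z):
    """Z: n x p matrix of explanatory variables (no intercept column).

    Returns the pairwise Pearson correlation matrix and the eigenvalue
    ratio lambda_max/lambda_min of X'X (equal to that of (X'X)^{-1}).
    """
    corr = np.corrcoef(Z, rowvar=False)
    X = np.column_stack([np.ones(len(Z)), Z])   # add the intercept column
    lam = np.linalg.eigvalsh(X.T @ X)           # sorted ascending
    return corr, lam[-1] / lam[0]

# Satellite antenna data
x1 = np.array([14, 28, 10, 30, 48, 30, 20, 16, 25], dtype=float)
x2 = np.array([11, 18, 5, 20, 30, 21, 15, 11, 17], dtype=float)
corr, ratio = collinearity_diagnostics(np.column_stack([x1, x2]))
print("pairwise correlations:\n", corr)   # off-diagonal about .985
print("eigenvalue ratio:", ratio)         # very large -> multicollinearity
```

A pairwise correlation near 1 or a very large eigenvalue ratio flags the problem; for this data set both diagnostics fire.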

For those who insist on working with a multicollinear data set, there are biased estimation techniques (e.g. ridge regression) which may have a lower mean squared error than least squares.