
Applied Econometrics Academic Year 2013-2014

Week 6: The Linear Probability Model and Heteroskedasticity Problem Set Solutions

Exercise 1
Using the dataset crime1.dta, estimate the following equation in order to predict the probability of an individual being arrested during the year 1986:

$$\mathit{arr86} = \alpha + \beta_1\,\mathit{pcnv} + \beta_2\,\mathit{avgsen} + \beta_3\,\mathit{tottime} + \beta_4\,\mathit{ptime86} + \beta_5\,\mathit{qemp86} \qquad (1)$$

where arr86 is a dummy variable equal to one if the individual was arrested during 1986 and zero otherwise, pcnv is the percentage of previous arrests that were followed by a conviction, avgsen is the average length of sentences received in the past, tottime is the total number of months spent in prison since the age of 18, ptime86 is the number of months spent in prison during 1986, and qemp86 is the number of quarters the person was employed during 1986.

a.) Estimate the equation and interpret the results.

The output of the regression is the following:

Variable     Coefficient    p-value
pcnv         -0,1624448      0,00%
avgsen        0,0061127     34,40%
tottime      -0,0022616     65,00%
ptime86      -0,0219664      0,00%
qemp86       -0,0428294      0,00%
intercept     0,4406154      0,00%

The coefficients on avgsen and tottime are not statistically significant. A one-unit increase in pcnv reduces the probability of being arrested by 0,16 (16 percentage points); a one-unit increase in ptime86 (one additional month in prison during 1986) reduces the probability of being arrested by 0,02; and a one-unit increase in qemp86 (one additional quarter of regular employment) reduces it by 0,04.

b.) Test the null hypothesis that the two variables avgsen and tottime are jointly not significant.

The null hypothesis can be tested with the usual F test: F(2, 2719) = 1,06, with a p-value of 0,3467. Given this p-value, we cannot reject the null hypothesis that the two variables are jointly not significant.

The linear probability model nevertheless has some problems we must highlight:

- In this specific model, a person who spends the whole year in prison (ptime86 = 12) should have a probability of being arrested equal to zero, since she is already in prison. However, if ptime86 = 12 and all other variables are zero, the predicted probability of being arrested is 0,441 - 0,022 × 12 = 0,177, which is greater than zero. If we instead start from the unconditional probability of being arrested (27,7%, the sample mean of arr86), we obtain 0,277 - 0,022 × 12 = 0,013, a value very close to zero but still not equal to zero.
- Predicted probabilities are not guaranteed to lie in the interval [0, 1], as proper probabilities must (in this specific model, however, the fitted values have a minimum of 0,0066431 and a maximum of 0,5576897).
- The errors are always heteroskedastic, because the variance of a binary dependent variable depends on the regressors; as a result, inference based on the usual OLS standard errors is invalid.
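As a quick check of the arithmetic in the ptime86 = 12 example, the following minimal Python sketch plugs the estimated coefficients from the output above into the fitted equation. The dictionary `coef` and the helper `predicted_probability` are illustrative names, not part of the original solution; the script uses the unrounded coefficients, so it agrees with the figures in the text only up to rounding.

```python
# Estimated LPM coefficients, copied from the regression output above.
coef = {
    "intercept": 0.4406154,
    "pcnv": -0.1624448,
    "avgsen": 0.0061127,
    "tottime": -0.0022616,
    "ptime86": -0.0219664,
    "qemp86": -0.0428294,
}


def predicted_probability(**regressors):
    """Fitted LPM value: intercept plus sum of coefficient * regressor.

    Regressors that are not passed are treated as zero.
    """
    return coef["intercept"] + sum(
        coef[name] * value for name, value in regressors.items()
    )


# Twelve months in prison during 1986, all other regressors equal to zero.
p_full_year = predicted_probability(ptime86=12)

# Same twelve-month effect, applied to the sample mean of arr86 (27.7%).
p_from_mean = 0.277 + coef["ptime86"] * 12

print(round(p_full_year, 3), round(p_from_mean, 3))
```

Both values are strictly positive, which is the point made in the text: the model cannot force the probability to zero for someone who is in prison for the whole year.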

Exercise 2
Following the work of Estrella and Mishkin in their 1998 paper "Predicting U.S. Recessions: Financial Variables as Leading Indicators", published in The Review of Economics and Statistics, and using the data in the dataset USmacro.dta, estimate the following equation in order to build an econometric model for predicting recessions in the US:

$$\mathit{recession}_t = \alpha + \beta_1\,\mathit{spread10}_{t-1} + \beta_2\,\mathit{stockret}_{t-1} + \beta_3\,\mathit{unemp}_{t-1} + \beta_4\,\mathit{mfgsurvey}_{t-1} \qquad (2)$$

a.) Estimate the given equation and interpret the results.

The estimation output is:

Variable     Coefficient    Std. Err.      t       p-value
spread10     -0.0352234     0.0128879    -2.73     0.007
stockret     -0.8130781     0.1637266    -4.97     0.000
unemp         0.0054597     0.0104905     0.52     0.603
mfgsurvey    -0.0042127     0.0018119    -2.33     0.021
cons          0.2946165     0.120155      2.45     0.015

The results indicate that the term spread and the returns on the stock market are significant predictors of recessions; the level of the unemployment rate is not significant even at the 10% level, while the value of the MFG survey index is significant at the 5% level. Here the dependent variable is a dummy variable, so we are estimating what is called a linear probability model: in our example, the coefficients indicate how a change in the corresponding explanatory variable affects the probability of a recession in the US. In particular, an increase in the term spread reduces the probability of a recession, and the same is true for the returns on the stock market; also, the higher the value of the MFG survey index, the smaller the probability of a recession.

b.) Obtain the fitted values of the regression, and comment on your findings.

If we look at the fitted values of the previous regression, we find that some of them are negative; however, these fitted values are meant to be probabilities, so they should always lie inside the interval [0, 1]. This is a common problem with the linear probability model: the fitted values sometimes fall outside the [0, 1] interval and are therefore not proper probabilities.

c.) Test the null hypothesis that the variables unemp and mfgsurvey are jointly not significant at the 1% level.

The null hypothesis can be tested with the usual F test; the F statistic we get for this test is F(2, 200) = 3.28, with a p-value of 3.95%: the null hypothesis cannot be rejected at the given significance level of 1%.
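To see how a fitted value can leave the [0, 1] interval, the sketch below evaluates the estimated equation at one set of regressor values. The coefficients are those reported in part a.); the input values are purely hypothetical (they are not taken from the dataset, and the units of each variable are an assumption), chosen only to illustrate a steep yield curve, rising stock prices and a strong manufacturing survey.

```python
# Estimated coefficients from the output in part a.).
coef = {
    "cons": 0.2946165,
    "spread10": -0.0352234,
    "stockret": -0.8130781,
    "unemp": 0.0054597,
    "mfgsurvey": -0.0042127,
}


def fitted_recession_probability(spread10, stockret, unemp, mfgsurvey):
    """Fitted value of the linear probability model for one observation."""
    return (
        coef["cons"]
        + coef["spread10"] * spread10
        + coef["stockret"] * stockret
        + coef["unemp"] * unemp
        + coef["mfgsurvey"] * mfgsurvey
    )


# HYPOTHETICAL lagged regressor values, for illustration only.
boom = fitted_recession_probability(
    spread10=3.0, stockret=0.10, unemp=5.0, mfgsurvey=60.0
)

# A proper probability must satisfy 0 <= p <= 1.
is_valid_probability = 0.0 <= boom <= 1.0

print(round(boom, 4), is_valid_probability)
```

With these inputs the fitted value is negative, so it cannot be read as a probability; nothing in the linear specification prevents this.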

Exercise 3
Using the dataset vote1.dta, estimate the following equation:

$$\mathit{voteA} = \alpha + \beta_1\,\mathit{lexpendA} + \beta_2\,\mathit{lexpendB} + \beta_3\,\mathit{prtystrA} \qquad (3)$$

a.) Estimate the equation and interpret your results.

The regression gives the following results:

voteA = 45,0789 + 6,083316 lexpendA - 6,615417 lexpendB + 0,1519574 prtystrA

The constant and the coefficients on lexpendA and lexpendB are statistically significant at the 1% level, while the coefficient on prtystrA is significant at the 5% level.

b.) Test the null hypothesis of homoskedasticity in the residuals.

In order to test the null hypothesis of homoskedasticity, we can use the Breusch-Pagan test; this test requires us to:

- estimate the original regression and obtain the residuals;
- square the residuals;
- regress the squared residuals on the explanatory variables of the original regression;
- take the R² of this auxiliary regression: under the null hypothesis of homoskedasticity, n R² (where n is the number of observations) follows a χ² distribution with degrees of freedom equal to the number of independent variables in the auxiliary regression;
- if the p-value is smaller than the chosen significance level, reject the null hypothesis of homoskedasticity in favour of the alternative hypothesis of heteroskedasticity.

In our example, we have:

n                      173
R²                     5,45%
χ² = n R²              9,4285
degrees of freedom     3
p-value                2,41%
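The figures in the table can be checked from n and R² alone. The sketch below is a minimal illustration using only the Python standard library; it relies on the closed-form upper-tail probability of a χ² variable with 3 degrees of freedom, and the function name `chi2_sf_3df` is a hypothetical helper, not a library call.

```python
import math


def chi2_sf_3df(x):
    """Upper-tail probability P(X > x) for a chi-square with 3 degrees of freedom.

    Closed form: erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2).
    """
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


n = 173          # number of observations
r2_aux = 0.0545  # R-squared of the auxiliary regression (5.45%)

lm_stat = n * r2_aux            # Breusch-Pagan LM statistic, n * R^2
p_value = chi2_sf_3df(lm_stat)  # 3 regressors in the auxiliary regression

print(round(lm_stat, 4), round(100 * p_value, 2))
```

The statistic and p-value should agree with the table up to rounding.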

According to this test, the null hypothesis can be rejected at the 5% significance level. The test can also be carried out as an F test for multiple restrictions on the auxiliary regression of the squared residuals on the explanatory variables of the original model. Using the R² form of the F test, the F statistic can be written as:

$$F = \frac{R^2_{\hat{u}^2}/q}{\left(1 - R^2_{\hat{u}^2}\right)/(n-k)}$$

where $R^2_{\hat{u}^2}$ is the R² of the auxiliary regression of the squared residuals on the explanatory variables of the original model, q is the number of explanatory variables in the auxiliary regression, and n - k is its residual degrees of freedom. This is precisely the usual F statistic for the overall significance of the regression, which Stata reports as part of the output of the auxiliary regression. In our example, we have F(3, 169) = 3.25, with a p-value of 2,33%, so the null hypothesis can be rejected at the 5% level but not at the 1% level.

Stata uses a slightly different version of the test, the so-called Cook-Weisberg formulation of the Breusch-Pagan test for heteroskedasticity. This test uses the following statistic:

$$\text{BP test statistic} = \frac{ESS_a/2}{\left(RSS/n\right)^2}$$

where ESS_a is the explained sum of squares from the auxiliary regression of the squared residuals on the explanatory variables of the original model, RSS is the residual sum of squares of the original regression, and n is the number of observations. Under the null hypothesis of homoskedasticity, this statistic has a χ² distribution with degrees of freedom equal to the number of independent variables in the auxiliary regression. This different formulation of the test implies a linear relationship between the log of the residual variance and the independent variables, while the previous version of the test assumes a linear relationship between the residual variance itself and the independent variables. In our example, we get χ²(3) = 11.82, with a p-value of 0.8%: we can therefore reject the null hypothesis of homoskedasticity at the 1% level.

Another test of the null hypothesis of homoskedasticity is the White test. This test requires us to:

- estimate the original regression and obtain the residuals;
- square the residuals;
- regress the squared residuals on all the explanatory variables of the original regression, their squares, and their cross products;
- take the R² of this auxiliary regression: under the null hypothesis of homoskedasticity, n R² (where n is the number of observations) follows a χ² distribution with degrees of freedom equal to the number of independent variables in the auxiliary regression;
- if the p-value is smaller than the chosen significance level, reject the null hypothesis of homoskedasticity in favour of the alternative hypothesis of heteroskedasticity.

Compared to the Breusch-Pagan test, the White test adds the squares of the independent variables and their cross products to the auxiliary regression. In our example we have:

n                      173
R²                     13,44%
χ² = n R²              23,2512
degrees of freedom     9
p-value                0,57%

This means that the null hypothesis of homoskedasticity can be rejected at the 1% level.

An alternative (restricted) version of the White test is the following:

- estimate the original regression and obtain the residuals and the fitted values;
- square the residuals;
- regress the squared residuals on the fitted values and the squares of the fitted values;
- take the R² of this auxiliary regression: under the null hypothesis of homoskedasticity, n R² (where n is the number of observations) follows a χ² distribution with two degrees of freedom;
- if the p-value is smaller than the chosen significance level, reject the null hypothesis of homoskedasticity in favour of the alternative hypothesis of heteroskedasticity.

Using this version of the test in our example gives the following results:

n                      173
R²                     5,33%
χ² = n R²              9,2209
degrees of freedom     2
p-value                0,99%
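Both versions of the White test reduce to comparing n R² with a χ² distribution, so the two tables can be cross-checked with one helper. The sketch below is a minimal standard-library illustration; `chi2_sf` is a hypothetical helper implementing the closed-form upper-tail probability for integer degrees of freedom, and the count of auxiliary regressors assumes the three explanatory variables of equation (3).

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square with integer df >= 1."""
    half = x / 2
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{j=0}^{df/2-1} (x/2)^j / j!
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= half / j
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) + exp(-x/2) * sum_{j=1}^{(df-1)/2} (x/2)^(j-1/2) / Gamma(j+1/2)
    total = 0.0
    term = math.sqrt(half) / math.gamma(1.5)
    for j in range(1, (df - 1) // 2 + 1):
        if j > 1:
            term *= half / (j - 0.5)
        total += term
    return math.erfc(math.sqrt(half)) + math.exp(-half) * total


n, k = 173, 3  # observations; explanatory variables in the original model

# Full White test: k levels + k squares + k(k-1)/2 cross products.
df_full = k + k + k * (k - 1) // 2
p_full = chi2_sf(n * 0.1344, df_full)

# Restricted White test: fitted values and their squares, so 2 regressors.
p_restricted = chi2_sf(n * 0.0533, 2)

print(df_full, round(100 * p_full, 2), round(100 * p_restricted, 2))
```

Both p-values fall below 1%, in line with the two tables.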

The null hypothesis can also be rejected according to this test. The homoskedasticity assumption is violated: this implies that the estimated standard errors of our coefficients, obtained with standard OLS estimation, are biased and, as a consequence, the statistical inference on the parameters is invalid.

c.) In light of the findings from b.), discuss the statistical significance of the estimated coefficients, using heteroskedasticity-robust standard errors.

In b.) we found that the homoskedasticity assumption is violated: in order to carry out statistical inference on the estimated coefficients, we need to estimate White's heteroskedasticity-consistent standard errors and use them to compute our t statistics. Using these new standard errors, we have:

Variable     Coefficient    Std. Err.    t stat.    p-value
lexpendA       6,08332      0,51460      11,820     0,000%
lexpendB      -6,61542      0,33146     -19,960     0,000%
prtystrA       0,15196      0,05601       2,710     0,700%
c             45,07893      4,06835      11,080     0,000%

All parameters are now significant not only at the 5% level, as before, but also at the more restrictive 1% level; in a.), by contrast, prtystrA was significant only at the 5% level.
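The robust t statistics in the table are simply each coefficient divided by its heteroskedasticity-robust standard error, which the short sketch below recomputes. The p-values here use a standard normal approximation (an assumption made to stay within the Python standard library), so they can differ slightly from the reported ones, which are presumably based on the t distribution.

```python
import math

# (coefficient, robust standard error) from the table above.
robust = {
    "lexpendA": (6.08332, 0.51460),
    "lexpendB": (-6.61542, 0.33146),
    "prtystrA": (0.15196, 0.05601),
    "c": (45.07893, 4.06835),
}


def t_and_p(coefficient, std_err):
    """Return the t statistic and a two-sided p-value (normal approximation)."""
    t = coefficient / std_err
    # Two-sided normal tail: 2 * (1 - Phi(|t|)) = erfc(|t| / sqrt(2)).
    p = math.erfc(abs(t) / math.sqrt(2))
    return t, p


results = {name: t_and_p(b, se) for name, (b, se) in robust.items()}

for name, (t, p) in results.items():
    print(f"{name:10s} t = {t:8.3f}   p ~ {100 * p:.3f}%")
```

Under this approximation every p-value is below 1%, consistent with the conclusion that all parameters are significant at the 1% level.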