Problem Set #4
Nathaniel Higgins
Read 4.1-4.3. Hand in answers to
C4.5(i) and (ii)
C4.7(i)-(iv)
C4.8(i)-(v)
C4.9(i) and (iii)
C4.10(i)-(vi)
The following model can be used to study whether campaign expenditures affect election
outcomes:
$$voteA = \beta_0 + \beta_1 \log(expendA) + \beta_2 \log(expendB) + \beta_3 prtystrA + u,$$
where voteA is the percentage of the vote received by Candidate A, expendA and expendB
are campaign expenditures by Candidates A and B, and prtystrA is a measure of party
strength for Candidate A (the percentage of the most recent presidential vote that went
to A's party).
What is the interpretation of $\beta_1$?
When expenditure by Candidate A's campaign increases by 1%, the percentage of the
vote that Candidate A receives is predicted to increase by approximately $\beta_1/100$
percentage points, holding expendB and prtystrA fixed.
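The "divide by 100" rule for a level-log model is an approximation to the exact change in the dependent variable. A minimal sketch comparing the two is below; the coefficient value is hypothetical and chosen only for illustration (it is not an estimate from any data set), and the function names are my own.

```python
import math

def vote_change_exact(beta1: float, pct_increase: float) -> float:
    """Exact change in voteA (percentage points) when expendA rises by pct_increase percent."""
    return beta1 * math.log(1.0 + pct_increase / 100.0)

def vote_change_approx(beta1: float, pct_increase: float) -> float:
    """Rule-of-thumb change: (beta1 / 100) times the percent change in expendA."""
    return beta1 / 100.0 * pct_increase

# Hypothetical coefficient on log(expendA); an assumption for illustration only.
beta1_hypothetical = 6.0

exact = vote_change_exact(beta1_hypothetical, 1.0)
approx = vote_change_approx(beta1_hypothetical, 1.0)
print(f"exact: {exact:.5f}  approx: {approx:.5f}")
```

For a 1% change the two numbers are nearly identical; the gap widens as the percentage change in spending gets larger.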
Use the data in MLB1.RAW for this exercise.
Use the model estimated in equation (4.31) and drop the variable rbisyr. What happens to
the statistical significance of hrunsyr? What about the size of the coefficient on hrunsyr?
The model estimated in equation (4.31) yields:

log(salary) = 11.19 + 0.0689years + 0.0126gamesyr + 0.00098bavg + 0.0144hrunsyr + 0.0108rbisyr

N = 353 R squared = 0.6278.
When we drop the variable rbisyr we get:

log(salary) = 11.02091 + 0.0677325years + 0.0157595gamesyr + 0.0014185bavg + 0.0359434hrunsyr

N = 353 R squared = 0.6254.
hrunsyr goes from insignificant (t-stat of 0.90) to significant at better than the 1%
level (t-stat of 4.96). The coefficient grows by a factor of about 2.5 (from 0.0144 to 0.0359).
Add the variables runsyr (runs per year), fldperc (fielding percentage), and sbasesyr
(stolen bases per year) to the model from part(i). Which of these factors are individually
significant?
Only runsyr is individually statistically significant at conventional levels (at the 5%
level). It is also the most economically significant variable (it has the largest coefficient
by an order of magnitude).
Refer to the example used in Section 4.4. You will use the data set TWOYEAR.RAW.
The variable phsrank is the person's high school percentile. (A higher number is better.
For example, 90 means you are ranked better than 90 percent of your graduating class.)
Find the smallest, largest, and average phsrank in the sample.
This is a piece of cake with the sum (summarize) command in Stata. The smallest value of
phsrank is 0, the largest is 99, and the average (mean) value of phsrank is 56.16. The average
high school rank of individuals in the data set is above the 50th percentile (interesting; if
this is a random sample of high school graduates, would you expect a value like this?)
Add phsrank to equation (4.26) and report the OLS estimates in the usual form. Is
phsrank statistically significant? How much is 10 percentage points of high school rank
worth in terms of wage?
When I add phsrank to equation (4.26) and run the regression I get the following results:

log(wage) = 1.46 - 0.01jc + 0.08totcoll + 0.00exper + 0.00phsrank

N = 6,763 R squared = 0.22.
phsrank is not statistically significant at conventional (5%) levels. The t-statistic is 1.27,
which is less than the 1.96 it would need to be in order to reject the null hypothesis that
$\beta_{phsrank} = 0$ against the two-sided alternative hypothesis that
$\beta_{phsrank} \neq 0$ at the 0.05 level of significance. You can also see this from the
p-value, which is greater than 0.05.
Ten percentage points of high school rank is worth an increase of 10 x 0.0003032 =
0.003032 in log(wage). That is, since high school rank is already expressed in percentage
terms, we only need to multiply the coefficient by 10 to see how much log(wage) increases
by when we increase phsrank by 10. We could leave it there, but it's not very useful
to interpret things in terms of log(wage) units. We can easily interpret this change of
0.003032 log(wage) units in terms of percentage increase in wage instead (since wage is
logged on the left-hand side of the regression equation). To do this, we multiply by 100.
Therefore, a 10 percentage point increase in phsrank increases wage by approximately
0.3032 percent. Not that much! High school really didn't matter (which is good for me,
because I spent most of high school figuring out creative ways to get in trouble).
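As a quick check on the "multiply by 100" step, the sketch below compares that approximation with the exact percentage change implied by a change in log(wage). It uses only the coefficient reported above; the variable names are my own.

```python
import math

coef_phsrank = 0.0003032   # coefficient on phsrank reported in the text
delta_rank = 10            # change in phsrank, in percentile points

delta_log_wage = coef_phsrank * delta_rank

# Approximation used in the text: 100 times the change in log(wage).
approx_pct = 100 * delta_log_wage

# Exact percentage change in wage implied by the same change in log(wage).
exact_pct = 100 * (math.exp(delta_log_wage) - 1)

print(f"approximate: {approx_pct:.4f}%  exact: {exact_pct:.4f}%")
```

For a change this small the approximation and the exact figure agree to about three decimal places, so nothing in the conclusion changes.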
Does adding phsrank to (4.26) substantively change the conclusions on the returns to
two- and four-year colleges? Explain.
Adding phsrank to (4.26) (i.e. to a regression model that includes junior college credits,
total college credits, and work experience) does not seem to make much difference. The
total variation explained by all the variables in each model is very similar, the magnitudes
of the coefficients (i.e. their absolute values) are very similar between the regressions,
and the standard errors (and hence the t-statistics) are essentially unchanged.
The data set contains a variable called id. Explain why if you add id to equation (4.17)
or (4.26) you expect it to be statistically insignificant. What is the two-sided p-value?
By using the describe command in Stata I can see that the id variable is nothing
but an ID number, which should have absolutely nothing to do with a person's wage.
Therefore, if I add it to any regression model explaining wage, I should hope like hell
that it doesn't correlate very highly with wage. And it doesn't (p-value of 0.587).
The data set 401KSUBS.RAW contains information on net financial wealth (nettfa),
age of the survey respondent (age), annual family income (inc), family size (fsize), and
participation in certain pension plans for people in the United States. The wealth and
income variables are both recorded in thousands of dollars. For this question, use only
the data for single-person households (so fsize=1).
How many single-person households are there in the data set?
I used three commands to determine that out of 9,275 responses in the data set, 2,017 of
them came from individuals with a family size of 1. The only necessary command was: sum
if fsize==1. I then see that there were 2,017 observations in the data set with fsize==1.
I also wondered if any of the observations of fsize were missing (which would indicate
the possibility that there were more single-person households that I could not observe).
To find this out, I typed describe to find out that there were 9,275 total observations
in the data set, then typed sum fsize to determine that there were 9,275 observations of
the variable fsize, i.e. there were no missing values. Just some bonus knowledge for you.
Use OLS to estimate the model
$$nettfa = \beta_0 + \beta_1 inc + \beta_2 age + u,$$
and report the results using the usual format. Be sure to use only the single-person
households in the sample. Interpret the slope coefficients. Are there any surprises in the
slope estimates?
To run this regression I used the command reg nettfa inc age if fsize == 1. When
I did so I obtained the results:

nettfa = -43.04 + 0.80inc + 0.84age

N = 2,017 R squared = 0.12.
When annual income increases by $1,000, we predict that net financial assets will increase
by $800, holding age fixed (which makes some sense: we would be surprised if income increased
net assets by more than the income increase). When age increases by one year, net financial
assets are predicted to increase by $840, holding income fixed. We expect individuals to be
accumulating wealth as they age.
Does the intercept from the regression in part(ii) have an interesting meaning? Explain.
The intercept is the predicted net financial assets of a zero-year-old with zero income,
which is not a meaningful case. 'Nuff said.
Find the p-value for the test $H_0: \beta_2 = 1$ against $H_1: \beta_2 < 1$. Do you reject
$H_0$ at the 1% significance level?
The p-value of a test is the probability, computed under the null, of getting a test
statistic at least as extreme as the one you obtained. So in this case, we want to know: if
the null is true (if $\beta_2 = 1$), what is the probability of getting a t-stat as large (in
absolute value) as the t-stat we observe? First, we have to calculate the t-stat. The t-stat
under the null is:
$$t = \frac{0.8426563 - 1}{se(\hat{\beta}_2)} = -1.7099435.$$
Because the alternative is $H_1: \beta_2 < 1$, the p-value is the probability of getting a
t-stat smaller than -1.7099435, which by the symmetry of the t-distribution equals the
probability of getting one bigger than 1.7099435. If we were testing a two-sided hypothesis
(i.e. if the alternative hypothesis were $H_1: \beta_2 - 1 \neq 0$ instead of
$H_1: \beta_2 - 1 < 0$) then we would double the value we are about to obtain. We want to find:
$$P(T > 1.7099435),$$
where T is a random draw from the t-distribution with 2,014 degrees of freedom. Let's
find this exact value using Stata:
scalar pval = ttail(2014,1.7099435).
When I do this I obtain: pval = .04371517. The p-value of 0.04 tells me that we would
not reject the null hypothesis at the 1% significance level (we would reject the null at
the 5% significance level, but not at the 1% level).
Note that if you did not have Stata (or your Stata does not have the ttail function) you
could come close to this value simply by comparing the absolute value of the t-statistic
we obtained (1.7099435) to the critical values in Table G.2.
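Another way to get close without Stata is to use the standard normal distribution, which a t-distribution with 2,014 degrees of freedom closely resembles. The sketch below is an approximation under that assumption, not a replacement for ttail; it also backs out the standard error implied by the two numbers reported above, which the text itself does not print. All names are my own.

```python
import math

beta_hat = 0.8426563      # estimated coefficient on age, from the text
null_value = 1.0          # value of beta_2 under H0
t_reported = -1.7099435   # t-statistic for H0: beta_2 = 1 (negative because beta_hat < 1)

# Standard error implied by the reported estimate and t-statistic.
se_implied = (beta_hat - null_value) / t_reported

def normal_lower_tail(z: float) -> float:
    """P(Z < z) for a standard normal Z, via the complementary error function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

# One-sided p-value for H1: beta_2 < 1, using the normal approximation to the t.
p_one_sided = normal_lower_tail(t_reported)

print(f"implied se: {se_implied:.4f}")
print(f"approximate one-sided p-value: {p_one_sided:.4f}")
print("reject at 5%:", p_one_sided < 0.05, " reject at 1%:", p_one_sided < 0.01)
```

The approximate p-value lands within a hair of the Stata figure and leads to the same decision at both significance levels.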
Use the data in DISCRIM.RAW to answer this question.
Use OLS to estimate the model
$$\log(psoda) = \beta_0 + \beta_1 prpblck + \beta_2 \log(income) + \beta_3 prppov + u,$$
and report the results in the usual form. Is $\hat{\beta}_1$ statistically different from
zero at the 5% level against a two-sided alternative? What about at the 1% level?
When I run the regression in Stata I obtain:

log(psoda) = -1.46 + 0.07prpblck + 0.14 log(income) + 0.38prppov

N = 401 R squared = 0.09.

$\hat{\beta}_1$ is statistically different from zero at the 5% level, but not at the 1%
level (p-value of 0.018, which is less than 0.05, but greater than 0.01).
To the regression in part(i), add the variable log(hseval). Interpret its coefficient and
report the two-sided p-value for $H_0: \beta_{\log(hseval)} = 0$.
When I add the variable log(hseval) to the regression above, I get a coefficient of
0.1213056 and a p-value that rounds to 0.000. I thus reject the null hypothesis that the true
coefficient on log(hseval) is zero at the 1% significance level. The coefficient tells me that
when log(hseval) increases by one unit, log(psoda) is predicted to increase by about 0.12
units, holding the other variables fixed. To change this result from units of logged
variables to units of the variables themselves, I don't need to do anything to the
coefficient (since both the independent variable hseval and the dependent variable psoda are
logged). Therefore, I can say that when median house value in a zip code increases by 1%, the
price of a medium soda in that same zip code is predicted to increase by 0.12%.
Use the data in ELEM94_95 to answer this question. The findings can be compared with
those in Table 4.1. The dependent variable lavgsal is the log of average teacher salary
and bs is the ratio of average benefits to average salary (by school).
Run the simple regression of lavgsal on bs. Is the estimated slope statistically different
from zero? Is it statistically different from -1?
When I run the simple regression I get the following results:

lavgsal = 10.75 - 0.80bs

N = 1,848 R squared = 0.02.
The t-statistic for the null hypothesis that $\beta_{bs} = 0$ is -5.31 (p-value 0.00), so we
reject the null hypothesis that $\beta_{bs} = 0$ against the two-sided alternative at better
than the 1% level. By inspecting the confidence interval, we can see that -1 is inside the 95%
confidence interval. This leads us to conclude that we cannot reject the null hypothesis
$\beta_{bs} = -1$ at the 5% significance level.
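The confidence-interval argument can be reproduced roughly from the rounded numbers reported above. The sketch below backs out the standard error from the coefficient and its t-statistic, then uses the large-sample critical value 1.96; because the inputs are rounded, the interval is only approximate and will not match Stata's output digit for digit. The names are my own.

```python
coef_bs = -0.80   # slope from the simple regression, as rounded in the text
t_zero = -5.31    # reported t-statistic for H0: beta_bs = 0
crit_95 = 1.96    # large-sample two-sided 5% critical value

# Standard error implied by the rounded coefficient and t-statistic (roughly 0.15).
se_implied = coef_bs / t_zero

# Approximate 95% confidence interval for beta_bs.
ci_low = coef_bs - crit_95 * se_implied
ci_high = coef_bs + crit_95 * se_implied

# t-statistic for H0: beta_bs = -1.
t_minus_one = (coef_bs - (-1.0)) / se_implied

print(f"95% CI: [{ci_low:.3f}, {ci_high:.3f}]")
print(f"t-stat for H0: beta_bs = -1: {t_minus_one:.3f}")
```

Since -1 falls inside the interval, the t-statistic for that null is below 1.96 in absolute value, which is the same conclusion stated two ways.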
Add the variables lenrol and lstaff to the regression from part(i). What happens to the
coefficient on bs? How does the situation compare with that in Table 4.1?
When I add lenrol and lstaff to the regression from part(i), I get:

lavgsal = 13.95 - 0.61bs - 0.03lenrol - 0.71lstaff

N = 1,848 R squared = 0.48.
The coefficient on bs has gone from -0.80 to -0.61 (i.e. it has decreased in magnitude
by about 24%) while the t-statistic remains largely unchanged. This is exactly what we
see in columns (1) and (2) of Table 4.1.
How come the standard error on the bs coefficient is smaller in part(ii) than in part(i)?
The standard error of the coefficient is smaller because the unexplained variation (the
variation of u) is substantially smaller. The addition of two more variables has explained
substantially more of the variation in lavgsal than the previous model (compare the
R-squared values of the two models to see this). Of course, adding two variables to a
model has the chance to cause problems with multicollinearity. If the two variables that
were included are highly correlated with bs, this could have the effect of increasing the
standard error of $\hat{\beta}_{bs}$. The effect of reducing the unexplained variation in
the model outweighs the collinearity effect, in this case. This makes sense when you observe
that the correlation between bs and lenrol is 0.02 and the correlation between bs and lstaff
is 0.04 (both relatively small).
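The two competing effects can be read off the standard textbook expression for the standard error of an OLS slope. The notation here is the usual one and is not taken from this problem set: $\hat{\sigma}$ is the estimated standard deviation of u, $SST_{bs}$ is the total sample variation in bs, and $R^2_{bs}$ is the R-squared from regressing bs on the other explanatory variables.

```latex
\[
  se(\hat{\beta}_{bs}) \;=\; \frac{\hat{\sigma}}{\sqrt{SST_{bs}\,\bigl(1 - R^2_{bs}\bigr)}}
\]
```

Adding lenrol and lstaff lowers $\hat{\sigma}$ (the numerator), while the small correlations with bs suggest $R^2_{bs}$ stays close to zero, so the denominator barely changes.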
How come the coefficient on lstaff is negative? Is it large in magnitude?
The coefficient on lstaff suggests that the relationship between number of staff and
average teacher salary is negative, controlling for enrollment size and ratio of benefits to
salary. This suggests that when more staff are added to a school, each teacher is paid
less, on average. The magnitude of the coefficient is relatively large: when the number
of staff increases by 10%, the average salary decreases by about 7%.
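Reading a log-log coefficient as an elasticity is exact only for very small changes, so for a 10% change it is worth comparing the shortcut with the exact implied change. A minimal sketch using the rounded coefficient from the regression above (variable names are my own):

```python
coef_lstaff = -0.71   # coefficient on lstaff, as rounded in the text
pct_staff = 10.0      # percent increase in staff

# Elasticity shortcut: percent change in salary = coefficient * percent change in staff.
approx_pct = coef_lstaff * pct_staff

# Exact percent change implied by the log-log functional form.
exact_pct = 100 * ((1 + pct_staff / 100) ** coef_lstaff - 1)

print(f"approximate: {approx_pct:.2f}%  exact: {exact_pct:.2f}%")
```

The exact figure is somewhat smaller in magnitude than the shortcut suggests, but "about 7%" remains a fair summary.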
Now add the variable lunch to the regression. Holding other factors fixed, are teachers
being compensated for teaching students from disadvantaged backgrounds? Explain.
When I run this new regression I obtain:

lavgsal = 13.83 - 0.52bs - 0.03lenrol - 0.69lstaff - 0.00lunch

N = 1,848 R squared = 0.49.
Teachers are not being compensated for teaching students from disadvantaged backgrounds.
The presence of students from disadvantaged backgrounds is indicated by a
higher proportion of students who qualify for a lunch subsidy. When the proportion
increases by 0.10, the average salary is predicted to decline by about 0.7%, holding the
other factors fixed.
Overall, is the pattern of results that you find with ELEM94_95 consistent with the pattern
in Table 4.1?
Yes. The magnitude of the effects declines as more variables are added, although the signs
and levels of significance remain the same. Even though this exercise uses a different
data set, the results of the model appear robust. This is good! The last column of Table
4.1 is not comparable to the regressions we have run in the exercise (elementary school
students cannot drop out of school).