
102B - Introduction to Econometrics Winter Term 2012/13

Paolo Pin
ppin@stanford.edu

Stanford, March 21st 2013

Final Exam
Instructions
PLEASE REVIEW before starting.
The exam lasts for 3 hours from 7pm to 10pm.
This test has 4 questions, and a total of 20 subquestions. Each of those subquestions is worth
5 points, for a total of 100 points for the whole exam.
You are allowed one letter-sized double-sided page of notes, but otherwise this is a closed-book
exam.
You are allowed to use a calculator.
Show your work! No credit will be given for correct answers if you do not justify your
argument.
Please make sure that your handwriting is legible!
In essay questions, be precise but brief. If a correct reply is hidden among wrong, or
irrelevant, arguments, you will not get full credit.
We will grade only what is written in your blue books. There should be plenty of space for
all your calculations.
If time is running short, set up the problem without final calculations.
If you cannot solve one subquestion, do not skip the rest of the question. Attempt to solve
each subquestion.
Please answer each question in a separate blue book! Hand in one blue book for each
question, even if you are not attempting to solve it.

Please do not turn this page over until you are instructed
to do so!

Question 1 (25 points)

We consider a dataset of 1388 births to women in the United States in 1988. We run the
regression
$$\widehat{\mathit{bwght}} = \underset{(0.57)}{119.8} - \underset{(0.090)}{0.514}\,\mathit{cigs},$$

where bwght is the birth weight, in ounces, and cigs is the number of cigarettes smoked per
day by the mother while pregnant.
(a) Can we think of the coefficient on cigs, -0.514, as the causal effect of mothers smoking
during pregnancy on birth weight? Why, or why not?
No. Omitted variables cause bias.
(b) Predict the weight of a child born from a mother that smokes a pack per day (20
cigarettes).
We don't need to know about causal effects to make predictions. The
expected weight will be 119.8 - 20*0.514 = 109.52 ounces.
(c) The average number of cigs is 2.0. What is the average of bwght?
119.8 - 0.514*2 = 118.77 (the OLS regression line passes through the sample means).
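The arithmetic in (b) and (c) can be checked with a short sketch. The helper name `predict_bwght` is ours and purely illustrative; the coefficients are the ones reported in the fitted equation.

```python
# Fitted line reported in the exam: bwght_hat = 119.8 - 0.514 * cigs (ounces).
INTERCEPT = 119.8
SLOPE = -0.514


def predict_bwght(cigs: float) -> float:
    """Predicted birth weight in ounces (hypothetical helper name)."""
    return INTERCEPT + SLOPE * cigs


# (b) one pack per day
pack_prediction = predict_bwght(20)

# (c) OLS with a constant passes through the sample means, so the mean of
# bwght equals the prediction evaluated at the mean of cigs.
mean_bwght = predict_bwght(2.0)

print(round(pack_prediction, 2))
print(round(mean_bwght, 2))
```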
(d) cigs is self-reported by the mothers, who might want to under-report (to appear more
health-conscious, for example). Suppose that each mother under-reported cigs by 2,
meaning that for each mother i, her true number of cigarettes smoked per day is cigs*_i =
cigs_i + 2.
If you observed cigs*, the mother's true number of cigarettes smoked per day, and ran
a regression of bwght on cigs* and a constant, what would be the slope and intercept?
Show your work.
Substituting cigs_i = cigs*_i - 2 into the fitted line gives
bwght = 119.8 - 0.514 (cigs* - 2) = (119.8 + 0.514*2) - 0.514 cigs*.
Slope remains unchanged at -0.514.
Intercept = 119.8 + 0.514*2 = 120.8
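As a sketch of why adding a constant to a regressor moves only the intercept, the following refits a simple regression on made-up numbers (not the exam's dataset) before and after shifting the regressor by 2. The function `ols_simple` is our own helper, not something used in the exam.

```python
def ols_simple(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Made-up illustrative data, NOT the exam's dataset.
cigs_reported = [0, 0, 5, 10, 20, 30]
bwght = [125, 118, 116, 112, 110, 104]

a_rep, b_rep = ols_simple(cigs_reported, bwght)
cigs_true = [c + 2 for c in cigs_reported]  # true value = reported + 2
a_true, b_true = ols_simple(cigs_true, bwght)

# The same shift applied to the coefficients reported in the exam.
exam_intercept_true = 119.8 - 2 * (-0.514)

print(b_rep, b_true)              # identical slopes
print(a_true, a_rep - 2 * b_rep)  # new intercept = old intercept - 2 * slope
print(round(exam_intercept_true, 3))
```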
(e) Now we add the variable faminc (family income) and obtain
$$\widehat{\mathit{bwght}} = \underset{(1.05)}{117.0} + \underset{(0.029)}{0.093}\,\mathit{faminc} - \underset{(0.092)}{0.463}\,\mathit{cigs}.$$

What is the role of faminc? Why does the estimated effect of cigs increase?
faminc is a control variable that is correlated with healthy nutrition, so it is positively
correlated with bwght. The coefficient on cigs rises from -0.514 to -0.463 (it becomes smaller
in magnitude) because faminc is also negatively correlated with cigs: when faminc was omitted,
cigs picked up part of the effect of low family income on birth weight.
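The link between the two regressions can be made concrete with the standard omitted-variable algebra. This is a sketch in our own notation: $\tilde{\delta}$ denotes the slope from an auxiliary regression of faminc on cigs in the same sample, which the exam does not report, and the result is only approximate because the printed coefficients are rounded.

```latex
\tilde{\beta}_{cigs} = \hat{\beta}_{cigs} + \hat{\beta}_{faminc}\,\tilde{\delta}
\quad\Longrightarrow\quad
-0.514 = -0.463 + 0.093\,\tilde{\delta}
\quad\Longrightarrow\quad
\tilde{\delta} = \frac{-0.514 + 0.463}{0.093} \approx -0.55 .
```

The implied $\tilde{\delta}$ is negative, which is what a negative correlation between family income and smoking requires.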

Question 2 (25 points)

We have a dataset of panel data on Congressional campaign expenditures, in 1988 and 1990.
We check 186 observations (i.e. electoral districts which are smaller than States) where
the winner of 1988 elections is still running in the 1990 elections.
(a) We run the following OLS regression for the incumbents
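$$\widehat{\mathit{vote90}} = \underset{(4.47)}{22.46} + \underset{(0.064)}{0.320}\,\mathit{prtystr} + \underset{(1.23)}{5.32}\,\mathit{democ} + \underset{(0.41)}{0.132}\,\mathit{incshr90} + \underset{(0.74)}{0.136}\,\mathit{vote88},$$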

where:
vote90 is the percentage of votes that the incumbent gets in 1990 elections,
prtystr is the percentage of votes that her/his party gets in presidential 1988 elections in that State,
democ is a dummy which is 1 if the incumbent is Democrat (Democrats won in
both Congressional elections of 1988 and 1990, even if a Republican President was
elected in 1988: George H. W. Bush),
incshr90 is the percentage of electoral campaign expenditures of the incumbent,
with respect to the electoral campaign expenditures of the challenger, in the same
district in the 1990 elections,
vote88 is the percentage of votes that the incumbent got in 1988 elections.
Comment on the estimators of the five regressors in this regression.
democ is, in relative terms, the most important variable for an election in which Democrats
won at the national level: it counts more than a 15 percentage point difference in prtystr, and more
than a 40 percentage point difference in incshr90.
The other main determinant of the electoral results is prtystr, which is positive and
highly significant.
incshr90 is not significant, but we will see that it may suffer from biases that we will
correct with a panel approach.
vote88 is not significant at any conventional significance level.
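The significance statements above can be reproduced from the reported numbers. This sketch pairs each coefficient with the standard error printed beneath it (in order) and uses the large-sample 5% critical value of 1.96; the dictionary layout and variable names are ours.

```python
# Coefficients and standard errors as reported in the exam's regression.
estimates = {
    "const":    (22.46, 4.47),
    "prtystr":  (0.320, 0.064),
    "democ":    (5.32, 1.23),
    "incshr90": (0.132, 0.41),
    "vote88":   (0.136, 0.74),
}

CRITICAL_5PCT = 1.96  # two-sided 5% critical value, normal approximation

t_stats = {name: coef / se for name, (coef, se) in estimates.items()}
significant = {name: abs(t) > CRITICAL_5PCT for name, t in t_stats.items()}

# How many percentage points of another regressor match the democ coefficient?
democ_in_prtystr_points = estimates["democ"][0] / estimates["prtystr"][0]
democ_in_incshr_points = estimates["democ"][0] / estimates["incshr90"][0]

for name, t in t_stats.items():
    print(f"{name:9s} t = {t:6.2f}  significant at 5%: {significant[name]}")
print(round(democ_in_prtystr_points, 1), round(democ_in_incshr_points, 1))
```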
(b) Why do you think that the variable vote88 has a low t-value?
Suppose that we are interested in estimating the causal effect of incshr90 on vote90.
Knowing that vote88 and prtystr have a sample correlation of 0.44, do you think that
vote88 could be used in a different way?

vote88 is correlated with prtystr, because they are two variables from elections in 1988,
and they both refer to votes for the party of the incumbent. This makes vote88 redundant, when prtystr is considered, in explaining vote90 (they both capture the political
sentiment of the district in 1988).
Although we are not interested in the causal effect of either of them, they cannot be used as instrumental variables for incshr90: there is no plausible explanation for why they would
affect vote90 only indirectly, through incshr90.
(c) Now we run another regression on the same dataset, of the following form
$$\widehat{\mathit{votedif}} = 2.68 + 0.218\,\mathit{incshrdif},$$
with standard errors printed as (0.037) and (0.62),

where:
votedif = vote90 - vote88,
and incshrdif = incshr90 - incshr88.
What regression are we running? What is the interpretation of the constant? Why can
we avoid many other variables? Could there be an omitted variable bias?
We are running a first-difference (change) specification, which is possible because we observe
each district at two points in time. The constant captures the difference in the year-specific effects. Any
state-specific or district-specific effect that does not change between 1988 and 1990 is canceled out by
taking differences. If we believe that there are no other effects (hard to justify),
this regression estimates the causal effect of incshr on vote, so we don't need
other variables. It is, however, arguable that many economic and political variables could
have changed between 1988 and 1990 in a way that is heterogeneous across
states and/or districts (for example, one party may have governed locally extremely well or
badly in between), and such changes would cause omitted variable bias.
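A minimal sketch of why differencing removes time-invariant effects, using deterministic made-up data (not the exam's dataset). The true slope, the year effect, and the way the district effect is built are all our assumptions, chosen so that the district effect is correlated with spending.

```python
def ols_simple(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


TRUE_SLOPE = 0.2     # assumed effect of incshr on vote (made up)
YEAR_EFFECT = -3.0   # assumed common shift between 1988 and 1990 (made up)

districts = list(range(50))
fixed_effect = [40 + 0.5 * i for i in districts]  # time-invariant district effect
incshr88 = [40 + i for i in districts]            # correlated with the effect
incshr90 = [x + (i % 5) - 2 for i, x in zip(districts, incshr88)]

vote88 = [a + TRUE_SLOPE * x for a, x in zip(fixed_effect, incshr88)]
vote90 = [a + YEAR_EFFECT + TRUE_SLOPE * x
          for a, x in zip(fixed_effect, incshr90)]

# Cross-section for 1990 only: the district effect is an omitted variable.
_, slope_cross = ols_simple(incshr90, vote90)

# First differences: the district effect cancels out.
votedif = [v90 - v88 for v90, v88 in zip(vote90, vote88)]
incshrdif = [x90 - x88 for x90, x88 in zip(incshr90, incshr88)]
const_fd, slope_fd = ols_simple(incshrdif, votedif)

print(slope_cross)         # biased away from TRUE_SLOPE
print(const_fd, slope_fd)  # year effect and true slope
```

In this construction the constant of the differenced regression equals the common year effect, which is the sense in which it captures the difference in the year-specific effects.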
(d) In the 1990 elections Democrats won on average with higher percentages than in 1988, so
we can suppose that there was a different aggregate trend (at National level) in favor of
them, in the two years. In the regression from point (c) are we capturing this aggregate
effect?
Yes, because the constant captures all time fixed effects.
(e) Suppose now that in State X the local Governor, of party Y, performed extremely badly in the
local administration between 1988 and 1990, causing a loss in terms of votes for the
candidates of party Y in State X. Would this effect be captured by the regression from
point (c)?
No, because this introduces a change which is both state-specific and time-specific.

Question 3 (25 points)

Let rent be the average monthly rent paid on rental units in one of 64 college towns in the
United States. Let pop denote the total city population, avginc the average city income,
and pctstu the student population as a percent of the total population (so, if students were
10% of the population, pctstu would take on the value 10). We want to check how rents are
influenced by the student population.
(a) The first result from an OLS regression for year 1990 is
$$\widehat{\mathit{logrent}} = \underset{(0.844)}{0.43} + \underset{(0.039)}{0.066}\,\mathit{logpop} + \underset{(0.081)}{0.507}\,\mathit{logavginc} + \underset{(0.0017)}{0.0056}\,\mathit{pctstu},$$

where log in front of the name of a variable means that we are considering its log.
Interpret the coefficient on pctstu. How will predicted rents change in a town if the
student population increases from 12% to 13% of the population?
A one percentage point increase in the share of students in the population increases predicted rents
by about 0.56% (100*0.0056), which is a lot, and this effect is significant at the 1% level
(t = 0.0056/0.0017, about 3.3).
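The size of this effect and its t-statistic can be checked directly. The sketch below compares the usual log-level approximation with the exact percentage change implied by the log model; the variable names are ours.

```python
import math

coef_pctstu = 0.0056
se_pctstu = 0.0017

# Usual approximation: 100 * coefficient percent per one-point increase.
approx_pct_change = 100 * coef_pctstu
# Exact change implied by a model in log(rent).
exact_pct_change = 100 * (math.exp(coef_pctstu) - 1)
t_stat = coef_pctstu / se_pctstu

print(round(approx_pct_change, 2), round(exact_pct_change, 4), round(t_stat, 2))
```

For a coefficient this small the approximation and the exact figure are nearly identical.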
(b) What bias could be present in this regression?
There are clearly many sources of omitted variable bias, e.g. how nice the place is. Moreover,
there is simultaneous causality, as there is whenever prices are determined in a market with
demand and supply.
(c) Suppose that we had also data for 1980, for the same 64 towns. We build a panel data
set for the two years and run a regression with town and year fixed effects, so run the
following regression:
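$$\mathit{logrent}_{it} = \beta_0 + \beta_1\,\mathit{logpop}_{it} + \beta_2\,\mathit{logavginc}_{it} + \beta_3\,\mathit{pctstu}_{it} + \gamma_2 D2_i + \gamma_3 D3_i + \dots + \gamma_{64} D64_i + \delta_2 B2_{1990} + u_{it},$$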
where
D2_i is a binary indicator equal to 1 if a town is the second town in the dataset, 0
otherwise,
...
D64_i is a binary indicator equal to 1 if a town is the 64th town in the dataset, 0
otherwise,
B2_1990 is a binary indicator equal to 1 if the year is 1990, 0 otherwise.
Interpret the constant. Which of the biases from point (b) have been solved? Which
haven't?
The constant is the predicted log rent in town 1 in 1980 if the town had a value of 0 for all
of the regressors. With town fixed effects and year fixed effects included, the effect
of any omitted variable that is constant across all towns within a given year,
or constant within a town across the two years, is absorbed, so it no longer causes
omitted variable bias. Any variable that satisfies neither of these two
conditions (e.g. one town is more polluted in 1990 than in 1980) would still
give omitted variable bias. To solve the simultaneous causality problem we would need an
instrumental variable (e.g. taxes on rents).
(d) Someone wants to see how rent depends on whether a town is near the ocean. They want
to gather data on how many miles a town is from the ocean, add it to their dataset, and
regress rent on the town's distance from the ocean, including both years of rent data.
Would they want to include town fixed effects? Why or why not?
They should not include town fixed effects: the distance variable is fixed across time,
so it is not possible to estimate its coefficient if town fixed effects are included, because of
perfect multicollinearity.
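The multicollinearity can be written out explicitly. As a sketch in our own notation, let $d_j$ be the (time-invariant) distance of town $j$ from the ocean; then the distance regressor is an exact linear combination of the constant and the town dummies:

```latex
\mathit{dist}_i = d_1 + (d_2 - d_1)\,D2_i + (d_3 - d_1)\,D3_i + \dots + (d_{64} - d_1)\,D64_i .
```

Because this holds for every observation in both years, the coefficient on distance cannot be separated from the town effects.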
(e) Someone working with the dataset accidentally duplicates it, so that their new dataset
has two identical copies of each observation. So, whereas the initial dataset had one
observation for each town in each year, the new dataset has two identical observations
for each town and each year. They run the same regression as in point (c). How will
their coefficients change? How will their standard errors change if they don't cluster
their standard errors at the town level? How would their standard errors change if they
do cluster at the town level?
Coefficients should stay the same. Unclustered standard errors would decrease (by a factor of
about the square root of 2, because the apparent sample size doubles); clustered
standard errors would stay essentially the same, because the number of towns (clusters) is unchanged.
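The effect of duplication on unclustered standard errors can be illustrated with a simple regression on made-up data (not the rental dataset). This sketch uses the conventional homoskedasticity-only standard error as a stand-in for any unclustered estimator; `ols_with_se` is our own helper.

```python
import math


def ols_with_se(x, y):
    """Simple regression: return (slope, conventional SE of the slope)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    ssr = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(ssr / (n - 2) / sxx)
    return slope, se


# Made-up data, not the exam's dataset.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

slope, se = ols_with_se(x, y)
slope_dup, se_dup = ols_with_se(x + x, y + y)  # every observation appears twice

n = len(x)
expected_ratio = math.sqrt((n - 2) / (2 * n - 2))

print(slope, slope_dup)            # identical coefficients
print(se_dup / se, expected_ratio)  # SE shrinks by a known factor
```

For large n the ratio approaches one over the square root of 2, even though no new information has been added; this is why the unclustered standard errors are misleading here.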

Question 4 (25 points)

In betting on basketball games the point-spread bet is a specific bet that is made in the
following way: the event considered is whether team A wins over team B by at least x
points, where x is the point spread; every possible bet (i.e. the event happens or not) is
paid equally (e.g. for every 1$ bet the winner receives 95 cents, clearly less than 1$, to make
a profit for the bookmaker). It is the bookmaker who decides ex ante the point spread
x.
We have data on 553 college basketball games where a point-spread bet was proposed in
Las Vegas.
(a) We run the following OLS regression and F-test in Stata

. reg sprdcvr favhome neutral fav25 und25 , robust

Linear regression                               Number of obs     =        553
                                                F(4, 548)         =       0.48
                                                Prob > F          =     0.7489
                                                R-squared         =     0.0034
                                                Root MSE          =     .50118

------------------------------------------------------------------------------
             |               Robust
     sprdcvr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     favhome |   .0345911   .0498231     0.69   0.488    -.0632766    .1324588
     neutral |    .117618   .0931229     1.26   0.207    -.0653036    .3005396
       fav25 |  -.0234674   .0503824    -0.47   0.642    -.1224336    .0754988
       und25 |   .0178728   .0900634     0.20   0.843    -.1590389    .1947846
       _cons |   .4895665   .0448511    10.92   0.000     .4014655    .5776675
------------------------------------------------------------------------------

. test favhome neutral fav25 und25

 ( 1)  favhome = 0
 ( 2)  neutral = 0
 ( 3)  fav25 = 0
 ( 4)  und25 = 0

       F(  4,   548) =    0.48
            Prob > F =    0.7489

where:
sprdcvr is a dummy which is equal to 1 if the spread is covered (i.e. the event
occurs),
favhome is a dummy which is equal to 1 if the favorite team plays at home,
neutral is a dummy which is equal to 1 if the match is played at a neutral site,
fav25 is a dummy which is equal to 1 if the favorite team is in the top 25 of
that season,
und25 is a dummy which is equal to 1 if the other team (the underdog) is in the
top 25 of that season.
Explain all the t-statistics, and the result of the F-test. Is the outcome surprising?
None of the variables is significant, either individually (the t-statistics) or jointly (the F-test). This is what
you would expect from a good bookmaker, who sets a point spread such that the two
events are equally likely and unpredictable. If one variable were significant, a bettor
could use it to make a profitable forecast (remember that for forecasts we don't care
about omitted variables).
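The t-statistics in the table are simply each coefficient divided by its robust standard error, which the sketch below recomputes. The last check, whether the constant differs from 0.5, is our own addition and is not part of the Stata output.

```python
# Coefficients and robust standard errors copied from the Stata table.
table = {
    "favhome": (0.0345911, 0.0498231),
    "neutral": (0.117618, 0.0931229),
    "fav25":   (-0.0234674, 0.0503824),
    "und25":   (0.0178728, 0.0900634),
    "_cons":   (0.4895665, 0.0448511),
}

t_stats = {name: coef / se for name, (coef, se) in table.items()}

# Extra check (ours): is the constant statistically different from 0.5?
cons, cons_se = table["_cons"]
t_cons_vs_half = (cons - 0.5) / cons_se

for name, t in t_stats.items():
    print(f"{name:8s} t = {t:6.2f}")
print(f"t for H0: _cons = 0.5 -> {t_cons_vs_half:.2f}")
```

A constant that cannot be distinguished from 0.5, together with insignificant slopes, is what equally likely outcomes would look like in this regression.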
(b) Now consider the following output in Stata

. probit favwin spread

Iteration 0:   log likelihood = -302.74988
Iteration 1:   log likelihood = -264.91454
Iteration 2:   log likelihood = -263.56319
Iteration 3:   log likelihood = -263.56219
Iteration 4:   log likelihood = -263.56219

Probit regression                               Number of obs     =        553
                                                LR chi2(1)        =      78.38
                                                Prob > chi2       =     0.0000
Log likelihood = -263.56219                     Pseudo R2         =     0.1294

------------------------------------------------------------------------------
      favwin |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      spread |    .092463   .0121811     7.59   0.000     .0685885    .1163374
       _cons |  -.0105926   .1037469    -0.10   0.919    -.2139329    .1927476
------------------------------------------------------------------------------

where
favwin is a dummy which is equal to 1 if the favorite team actually wins,
spread is the point spread x decided by the bookmaker.
What is Stata reporting in the first 5 lines starting with Iteration?
Stata is maximizing the probit log likelihood of the favorite winning, given the point spread,
and reports the value of the log likelihood at each iteration.
This is done by numerical approximation because there is no closed-form analytical
solution.
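The iteration log also ties in with the summary statistics in the header. As a sketch, assuming that iteration 0 is the fit of the constant-only model (which is consistent with the reported numbers), the LR statistic and the pseudo R2 can be recovered from the first and last log likelihoods:

```python
# Log likelihoods copied from the Stata iteration log.
log_likelihoods = [-302.74988, -264.91454, -263.56319, -263.56219, -263.56219]

ll_start = log_likelihoods[0]   # iteration 0 (assumed: constant-only model)
ll_final = log_likelihoods[-1]  # value at convergence

# The maximization should never lower the log likelihood.
never_decreases = all(
    later >= earlier
    for earlier, later in zip(log_likelihoods, log_likelihoods[1:])
)

lr_chi2 = 2 * (ll_final - ll_start)
pseudo_r2 = 1 - ll_final / ll_start

print(never_decreases)
print(round(lr_chi2, 2), round(pseudo_r2, 4))
```

The algorithm stops when the log likelihood no longer improves, which is why iterations 3 and 4 show the same value.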
(c) Consider again the output in (b). For which value of the point spread is the probability
that the favorite team wins equal to 1/2?
There are two ways to see this. First, clearly the two teams are equally likely to win if the
spread is set to 0. Second, and more formally, the cumulative normal distribution Φ(z)
is 0.5 for z = 0. Since the estimated constant (-0.0106) is not significantly different from
zero, z = 0 happens (if we consider the range given by the standard
errors) for spread = 0.
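The fitted probabilities can be evaluated with the standard normal CDF. This is a sketch using the point estimates from the probit table; the helper names `norm_cdf` and `prob_favwin` are ours.

```python
import math

B_CONS = -0.0105926
B_SPREAD = 0.092463


def norm_cdf(z: float) -> float:
    """Standard normal CDF, written with math.erf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_favwin(spread: float) -> float:
    """Fitted probit probability that the favorite wins (hypothetical helper)."""
    return norm_cdf(B_CONS + B_SPREAD * spread)


p_at_zero = prob_favwin(0.0)
# Spread at which the probit index is exactly zero, using point estimates only.
break_even_spread = -B_CONS / B_SPREAD

print(round(p_at_zero, 4), round(break_even_spread, 4))
```

Using the point estimates alone, the probability is one half at a spread slightly above zero; the difference from zero is well within sampling error.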
(d) Are there any advantages or disadvantages to using the probit model over OLS?
One advantage of using the probit model is that it constrains the predicted probabilities to be
between 0 and 1, so we do not obtain nonsensical probabilities less than 0 or above
1.
One disadvantage of the probit model is that the coefficients are not as readily
interpretable as an increase in probability: we must use the cumulative normal
distribution Φ.
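Both points can be seen numerically. The sketch below evaluates the fitted probit at a few spreads chosen by us for illustration: the probabilities stay inside the unit interval, and the effect of one more point of spread (the normal density at the index times the coefficient) is not constant.

```python
import math

B_CONS = -0.0105926
B_SPREAD = 0.092463


def norm_cdf(z: float) -> float:
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def norm_pdf(z: float) -> float:
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


spreads = [0, 5, 10, 20, 40]  # illustrative values, chosen by us
probs = [norm_cdf(B_CONS + B_SPREAD * s) for s in spreads]
# Marginal effect of one more point of spread: density at the index times beta.
marginal = [norm_pdf(B_CONS + B_SPREAD * s) * B_SPREAD for s in spreads]

for s, p, m in zip(spreads, probs, marginal):
    print(f"spread = {s:2d}  P(favwin) = {p:.4f}  marginal effect = {m:.4f}")
```

In a linear probability model the marginal effect would be the same number at every spread, and the fitted value could exceed 1 for large spreads.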
(e) In the linear probability model presented in part (a), does the model imply that if
the favorite team plays at a neutral site instead of at home, then it is more likely
that the favorite team will win?
No. The dependent variable in part (a) is whether the spread is covered, not whether the
favorite wins, and the spread is set endogenously after all the conditions are known. The
estimates only mean that in those matches where the site is neutral, compared to those where the
favorite team is at home, the spread is on average set in such a way that it has an
8 percentage point higher probability of being covered (0.118 - 0.035).
