
The Economics and Econometrics of Evidence-Based Policy
Lecture 6: Instrumental Variables

Daniel Kaliski

21 February 2022
Last Lecture

▶ Randomised Controlled Trials (RCTs)

▶ Counterfactuals

▶ Treatment Effects
In this Lecture

▶ Instrumental Variables
▶ Identification
▶ Inference
▶ Applications

Reading: Angrist, Imbens & Rubin (1996, JASA, pp. 1-13); Murray (2006, JEP); Angrist & Krueger (2001, JEP); Keane (2010, JoE, pp. 4-7); Wolpin (2013, pp. 110-3)
The Idea

▶ We have some variable of interest $X_i$ which is endogenous ($\mathrm{Cov}(X_i, \eta_i) \neq 0$):
$$Y_i = \beta_0 + \beta_1 X_i + \eta_i, \tag{1}$$
▶ Estimating this equation by OLS will result in a biased estimator of $\beta_1$. But suppose we have another variable $Z_i$ such that
$$\mathrm{Cov}(Z_i, \eta_i) = 0 \quad \text{(Instrument Validity)}$$
$$\mathrm{Cov}(X_i, Z_i) \neq 0 \quad \text{(Instrument Relevance)}$$
The Idea

▶ Then we can estimate
$$X_i = \gamma_0 + \gamma_1 Z_i + v_i, \tag{2}$$
▶ The predicted values $\hat{X}_i = \hat{\gamma}_0 + \hat{\gamma}_1 Z_i$ isolate the part of the variation in $X_i$ that is uncorrelated with $\eta_i$ (since $Z_i$ is uncorrelated with $\eta_i$)
▶ Consequence: we can obtain a consistent estimator of $\beta_1$ by first estimating the “first-stage” (2), obtaining the predictions $\hat{X}_i$, and then estimating
$$Y_i = \beta_0 + \beta_1 \hat{X}_i + \eta_i, \tag{3}$$

▶ NB: don’t do this sequentially by hand! Packages in R, Stata, Matlab and Python do both steps simultaneously, and the standard errors reported by a manually run second stage are incorrect
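The two-stage logic can be illustrated with a small simulation. This is only a sketch under assumed values (true slope 2, first-stage slope 0.5, errors correlated through the hypothetical factor 0.8), not the lecture's own code. It runs the two stages by hand purely to show the mechanics, and it compares second-stage residuals built from the fitted values with residuals built from the actual regressor, which is why hand-rolled standard errors go wrong.

```python
import numpy as np

def ols(y, X):
    """OLS coefficients for y = X b + e (X must already contain a constant)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(0)
n = 100_000
beta1 = 2.0                                  # true coefficient (assumed for the demo)

z = rng.normal(size=n)                       # instrument, independent of eta
eta = rng.normal(size=n)                     # structural error
v = 0.8 * eta + rng.normal(size=n)           # first-stage error, correlated with eta
x = 1.0 + 0.5 * z + v                        # equation (2)
y = 1.0 + beta1 * x + eta                    # equation (1)

const = np.ones(n)
X = np.column_stack([const, x])
Z = np.column_stack([const, z])

b_ols = ols(y, X)[1]                         # biased, since Cov(x, eta) != 0

x_hat = Z @ ols(x, Z)                        # first-stage fitted values
coef_2sls = ols(y, np.column_stack([const, x_hat]))
b_2sls = coef_2sls[1]

# Residuals used for standard errors must be built from the actual x, not x_hat.
resid_naive = y - np.column_stack([const, x_hat]) @ coef_2sls
resid_correct = y - X @ coef_2sls

print(f"OLS slope:  {b_ols:.3f}")
print(f"2SLS slope: {b_2sls:.3f}")
print(f"residual variance, naive vs correct: "
      f"{resid_naive.var():.2f} vs {resid_correct.var():.2f}")
```

In this design the OLS slope sits well above the true value while the 2SLS slope is close to it; the two residual variances differ, so a second stage run as an ordinary regression reports the wrong standard errors even though its point estimate is fine.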
Instrumental Variables: Origins
▶ (See Angrist & Krueger (2001, JEP).) Suppose we ran the regression
$$\ln Q_{jt} = \alpha_0 + \alpha_1 \ln P_{jt} + v_{jt}, \tag{4}$$
▶ How do we know whether we’re estimating the price elasticity of demand or supply?
▶ Solution: find a “demand shifter” or “supply shifter” that affects only one curve and not the other, “shifting” the entire curve:
$$\ln P_{jt} = \delta_0 + \delta_1 Z_{jt} + \xi_{jt}, \tag{5}$$
▶ E.g. suppose $Z_{jt}$ is the withdrawal of a firm from market $j$ at time $t$. This affects the supply curve but not consumers’ preferences, which are in $v_{jt}$ (so $\mathrm{Cov}(Z_{jt}, v_{jt}) = 0$). The shift of the supply curve “traces out” the demand curve in that it changes the price “exogenously” - i.e. presents consumers with different menus of prices that don’t depend on their characteristics
Sketch Proof of IV Consistency

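▶ This can be found in any standard econometrics textbook (e.g. Wooldridge, Greene). We end up obtaining
$$\operatorname{plim} \hat{\beta}_{IV} = \frac{\mathrm{Cov}(Y, Z)}{\mathrm{Cov}(X, Z)} = \frac{\mathrm{Cov}(\beta X + \eta, Z)}{\mathrm{Cov}(X, Z)} = \beta + \frac{\mathrm{Cov}(\eta, Z)}{\mathrm{Cov}(X, Z)} = \beta, \tag{6}$$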
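The decomposition in (6) also holds exactly in any finite sample if population covariances are replaced by sample covariances. The sketch below, with assumed parameter values (β = 1.5, first-stage slope 0.7), checks that the covariance-ratio estimator equals β plus the ratio Cov(η, Z)/Cov(X, Z), and that this second term is small when the instrument is valid and relevant.

```python
import numpy as np

def cov(a, b):
    """Sample covariance of two 1-D arrays."""
    return np.mean((a - a.mean()) * (b - b.mean()))

rng = np.random.default_rng(1)
n = 200_000
beta = 1.5                                   # true coefficient (assumed for the demo)

z = rng.normal(size=n)                       # valid instrument: drawn independently of eta
eta = rng.normal(size=n)
x = 0.7 * z + 0.6 * eta + rng.normal(size=n) # endogenous regressor
y = beta * x + eta

beta_iv = cov(y, z) / cov(x, z)              # sample analogue of Cov(Y, Z) / Cov(X, Z)
bias_term = cov(eta, z) / cov(x, z)          # sample analogue of Cov(eta, Z) / Cov(X, Z)

print(f"beta_iv          = {beta_iv:.4f}")
print(f"beta + bias_term = {beta + bias_term:.4f}")
print(f"bias_term        = {bias_term:.4f}")
```

The first two printed numbers agree up to floating-point error because the sample covariance is linear in its first argument; consistency then amounts to the bias term vanishing as the sample grows.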
Potential Outcomes (Again)

▶ Yi1 − Yi0 is the effect of the intervention on the ith individual

▶ Problem: we only observe one of Yi0 , Yi1 for a given individual (no time machines)
▶ Define Di = 1 if an individual is actually exposed to an
intervention; Di = 0 otherwise
▶ Additionally let Zi be a process by which individuals are
assigned to treatment
▶ This can differ from Di : individuals can be assigned to
treatment but not take up the treatment (imperfect
compliance)
Potential Outcomes (Again)

▶ Yi1 − Yi0 is the effect of the intervention on the ith individual

▶ But: suppose this varies from individual to individual

▶ That is, not just the levels (the unobserved untreated states $Y_{i0}$) but also the slopes (the reactions to treatment) are randomly distributed:
$$Y_i = E[Y_{i0}] + \beta_{1i} D_i + (Y_{i0} - E[Y_{i0}]), \qquad E[\beta_{1i}] = E[Y_{i1} - Y_{i0}], \tag{7}$$
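Angrist, Imbens & Rubin (1996)

▶ This paper provides additional assumptions under which we can interpret IV estimands as causal effects
▶ Suppose $D_i, Z_i \in \{0, 1\}$
▶ Stable Unit Treatment Value Assumption (SUTVA), where $\mathbf{Z}$ and $\mathbf{D}$ stack all individuals’ assignments and treatments:
$$Z_i = Z_i' \implies D_i(\mathbf{Z}) = D_i(\mathbf{Z}'), \qquad Z_i = Z_i' \text{ and } D_i = D_i' \implies Y_i(\mathbf{D}, \mathbf{Z}) = Y_i(\mathbf{D}', \mathbf{Z}'), \tag{8}$$
▶ I.e. no spillovers. No one’s assignment or treatment status affects anyone else’s treatment status or outcome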
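Angrist, Imbens & Rubin (1996)

▶ Provides additional assumptions under which we can interpret IV estimands as causal effects
▶ Assignment to treatment Zi is random (note that it is not necessary that Di is random): for any two assignment vectors $\mathbf{c}$ and $\mathbf{c}'$ with the same number of units assigned to treatment,
$$\Pr(\mathbf{Z} = \mathbf{c}) = \Pr(\mathbf{Z} = \mathbf{c}'), \tag{9}$$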


Angrist, Imbens & Rubin (1996)

▶ Provides additional assumptions under which we can interpret IV estimands as causal effects
▶ Monotonicity: Zi only ever makes it more likely, or only ever makes it less likely, that you access the intervention Di:
$$D_i(1) \geq D_i(0) \quad \forall i = 1, \ldots, N, \tag{10}$$


▶ This rules out Zi making Di = 1 more likely for some
individuals and less likely for others
▶ E.g. teenagers being assigned to an anti-smoking programme
might violate this
Angrist, Imbens & Rubin (1996)

▶ Summary:
“A variable Z is an instrumental variable for the causal effect of
D on Y if: [1] its average effect on D is nonzero, [2] it satisfies
the exclusion restriction and [3] the monotonicity assumption, [4] it
is randomly (or ignorably) assigned, and [5] SUTVA holds”
▶ We will see in what sense we recover a causal effect shortly
LATE

▶ Under these assumptions, the instrumental variables estimand can be written
$$\frac{E[Y_i(D_i(1), 1) - Y_i(D_i(0), 0)]}{E[D_i(1) - D_i(0)]} = E[Y_i(1) - Y_i(0) \mid D_i(1) - D_i(0) = 1], \tag{11}$$
▶ The most important bit for us to interpret this object is the conditioning on $D_i(1) - D_i(0) = 1$
▶ This is the subset of individuals who would not have been
exposed to D had it not been for being “nudged” by Z
▶ Technical term: compliers
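A simulation can make the role of compliers concrete. The sketch below assumes hypothetical population shares (20% always-takers, 30% never-takers, 50% compliers) and hypothetical treatment effects by type (4, 0 and 1 respectively); none of these numbers come from the paper. It computes the ratio in (11) from simulated data and compares it with the average effect among compliers and in the whole population.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400_000

# Hypothetical compliance types; no defiers, so monotonicity holds.
types = rng.choice(["always", "never", "complier"], size=n, p=[0.2, 0.3, 0.5])
z = rng.integers(0, 2, size=n)                 # randomly assigned binary instrument

d0 = (types == "always").astype(int)           # D_i(0): treated without the nudge
d1 = (types != "never").astype(int)            # D_i(1) >= D_i(0) for everyone
d = np.where(z == 1, d1, d0)                   # observed treatment

# Heterogeneous treatment effects by type (assumed values).
effect = np.select([types == "always", types == "never"], [4.0, 0.0], default=1.0)
y0 = rng.normal(size=n) + 1.0 * (types == "always")   # untreated outcome differs by type
y = y0 + effect * d                            # Z affects Y only through D (exclusion)

# Ratio of the effect of Z on Y to the effect of Z on D.
wald = (y[z == 1].mean() - y[z == 0].mean()) / (d[z == 1].mean() - d[z == 0].mean())
late = effect[types == "complier"].mean()      # average effect among compliers
ate = effect.mean()                            # average effect in the population

print(f"Wald/IV estimate: {wald:.3f}")
print(f"Complier effect:  {late:.3f}")
print(f"Population ATE:   {ate:.3f}")
```

In this design the ratio tracks the complier effect rather than the population average, because always-takers and never-takers contribute nothing to the difference in treatment rates across values of Z.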
LATE

Source: Angrist, Imbens & Rubin (1996)


LATE

▶ The monotonicity assumption implies that there are no “defiers” - only always-takers, never-takers and compliers
▶ E[Yi (1) − Yi (0)|Di (1) − Di (0) = 1] is the “Local Average
Treatment Effect” (LATE)
▶ Instead of recovering the Average Treatment Effect for the
population, we recover it for the subset of individuals who are
caused to be treated by Z (compliers)
▶ AIR go on to discuss consequences of the violation of some of
the necessary assumptions on what is recovered by the IV
estimator
Example: French & Song (2014)

▶ Question: what is the effect of being granted disability insurance on the subsequent probability of being in work?
▶ Di = 1 if an applicant to disability insurance is granted DI, 0
otherwise
▶ Zi : random assignment of judges of varying degrees of
leniency in DI cases
▶ Yi : whether individual is working 3 years later, or earnings 3
or 5 years later
Example: French & Song (2014)

“Relative to Bound’s (1989) classic study on earnings of rejected
DI applicants, we make the following key improvement. We
address the fact that those who are denied benefits are potentially
different than those who are allowed. Using Social Security
administrative data, we exploit the assignment of DI cases to
administrative law judges (ALJs), an assignment which is
essentially random. We document large differences in allowance
rates across judges, and show that these differences are unrelated
to the health or earnings potential of DI applicants”.
Example: French & Song (2014)

▶ Careful to note that judges aren’t randomly assigned in the population:
▶ Where you live determines where your case is heard, and this is related to the probability of DI application & result (e.g. mining towns)
▶ FS therefore condition on venue and day. Assignment to a
judge is random within venues and days
Example: French & Song (2014)
Possible Exam Q

i. Explain why French & Song (2014) have to condition on venue and day for their instrument to be “as good as random”.
ii. What are the implications of the other instrumental variable assumptions in this setting?
Example: French & Song (2014)

▶ Some lessons so far:

▶ Arguing for instrument validity can be quite involved - you might have to argue that E[u|Z, X] = 0 for some X and not just E[u|Z] = 0
▶ Present OLS results alongside IV results
▶ Check whether other observables that affect your outcome and/or intervention of interest affect your instrument! (Similar to the “balance tests” from last lecture)
Weak Instruments
▶ For the past 25 years, Cov(X, Z) ≠ 0 has been known to be insufficient for reliable estimation and inference in an IV regression
▶ Reason: if it is nonzero but not “big enough”, t-stats do not have the “right” distribution
▶ In fact, the bias of the IV estimator can exceed the bias of the OLS estimator!
▶ Accordingly, we need to report the F-stats from the first-stage regression (or even better, the Cragg-Donald statistic) and show that they are “big enough”
▶ Stock, Wright & Yogo (2002) discuss the derivation and exact
representation of the threshold values
▶ Conventional wisdom: for most common case of 1 instrument
and 1 endogenous variable, F > 10 is a good threshold
▶ If not, instruments are deemed “weak”
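For the single-instrument case, the first-stage F-statistic is just the squared t-statistic on the instrument. The sketch below computes it by hand under homoskedastic errors; `first_stage_f` is a hypothetical helper written for this illustration, and the two first-stage slopes (0.5 and 0.02) are assumed values chosen to contrast a strong and a weak instrument.

```python
import numpy as np

def first_stage_f(x, z):
    """F-statistic for H0: gamma1 = 0 in x = gamma0 + gamma1 * z + v.

    Assumes one instrument and homoskedastic errors, so F equals t squared.
    """
    n = len(x)
    Z = np.column_stack([np.ones(n), z])
    coef = np.linalg.lstsq(Z, x, rcond=None)[0]
    resid = x - Z @ coef
    sigma2 = resid @ resid / (n - 2)                   # error variance estimate
    var_gamma1 = sigma2 / np.sum((z - z.mean()) ** 2)  # variance of the slope estimate
    return coef[1] ** 2 / var_gamma1

rng = np.random.default_rng(3)
n = 500
z = rng.normal(size=n)
v = rng.normal(size=n)

f_strong = first_stage_f(0.5 * z + v, z)    # instrument moves x a lot
f_weak = first_stage_f(0.02 * z + v, z)     # instrument barely moves x

print(f"first-stage F, strong instrument: {f_strong:.1f}")
print(f"first-stage F, weak instrument:   {f_weak:.1f}")
```

In applied work one would normally take this statistic from the IV routine of a statistics package, which can also report versions that are robust to heteroskedasticity.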
Reporting Weak Instruments Tests: Bernal & Keane (2011)
Multiple Instruments: Bernal & Keane (2011)

▶ We can also have “too many” instruments

▶ Bernal & Keane (2011) run into this problem because they
have local rules for childcare benefits as instruments, and
many different regimes, each of which contributes an
instrument
▶ Just due to noise in our instruments, as we add in
progressively more, we can end up adding the OLS bias back
in
▶ Limiting argument: as $R^2 \to 1$ in the first stage, $\hat{\beta}_{IV} \to \hat{\beta}_{OLS}$
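The many-instruments problem can be seen in a small Monte Carlo. The sketch below is an illustration under assumed values (sample size 200, true slope 1, one relevant instrument, and either 0 or 99 pure-noise instruments); it is not Bernal & Keane's setup. The noise instruments let the first stage overfit, so the fitted values inherit part of the endogenous variation and the 2SLS estimate drifts toward OLS.

```python
import numpy as np

def tsls_slope(y, x, Z):
    """2SLS slope of y on x using the columns of Z as instruments (constant added here)."""
    n = len(y)
    Zc = np.column_stack([np.ones(n), Z])
    x_hat = Zc @ np.linalg.lstsq(Zc, x, rcond=None)[0]
    Xh = np.column_stack([np.ones(n), x_hat])
    return np.linalg.lstsq(Xh, y, rcond=None)[0][1]

def median_estimates(n_noise, reps=300, n=200, beta=1.0, seed=4):
    """Median 2SLS and OLS slopes across simulated samples."""
    rng = np.random.default_rng(seed)
    iv_draws, ols_draws = [], []
    for _ in range(reps):
        z1 = rng.normal(size=n)                        # the one relevant instrument
        noise = rng.normal(size=(n, n_noise))          # irrelevant instruments
        eta = rng.normal(size=n)
        x = 0.3 * z1 + 0.8 * eta + rng.normal(size=n)  # endogenous regressor
        y = beta * x + eta
        iv_draws.append(tsls_slope(y, x, np.column_stack([z1, noise])))
        ols_draws.append(np.polyfit(x, y, 1)[0])       # OLS slope
    return float(np.median(iv_draws)), float(np.median(ols_draws))

med_one, med_ols = median_estimates(n_noise=0)
med_many, _ = median_estimates(n_noise=99)

print(f"median OLS slope:                 {med_ols:.3f}")
print(f"median 2SLS slope, 1 instrument:  {med_one:.3f}")
print(f"median 2SLS slope, 100 instruments: {med_many:.3f}")
```

With 100 instruments and 200 observations the first-stage fit is high purely through overfitting, which is the finite-sample counterpart of the limiting argument above.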
LATEs & The Relationship Between Estimates from Different IVs

▶ This discussion will be largely based on Wolpin (2013), pp. 107-113
▶ We will need to briefly describe two papers to understand the
discussion of them:
▶ Angrist & Krueger (1991) on birth-month cohorts and the
return to education
▶ Butcher & Case (1994) on sibling differences and the return to
education
Angrist & Krueger (1991)

▶ Famous paper, now mostly famous for illustrating problems with weak and/or many instruments
▶ Y : log weekly earnings for 41-50 year olds
▶ D: additional year of schooling
▶ Z: quarter of birth
▶ Key: Birthday on 01/09 =⇒ 1 more year of schooling at age
16 than if birthday falls on 02/09
▶ Monotonicity =⇒ ?
Wolpin on Angrist & Krueger’s Birth-Quarter Instrument

▶ Consider who the “always-takers” are: children who will obtain more schooling regardless of birthdate (higher returns to education)
▶ Divide population into two “ability” types: always takers have
ability µ1 > µ2 , the ability of the compliers & never-takers
▶ Proportion π1 high-ability, & π2 low-ability; π2 = 1 − π1

▶ Mean adult earnings of younger children:

π1 f (S0 + 1, µ1 ) + (1 − π1 )f (S0 + 1, µ2 ), (12)

▶ And of older children:

π1 f (S0 + 1, µ1 ) + (1 − π1 )f (S0 , µ2 ) (13)


Wolpin on Angrist & Krueger’s Birth-Quarter Instrument
▶ Mean adult earnings of younger children:

π1 f (S0 + 1, µ1 ) + (1 − π1 )f (S0 + 1, µ2 ), (14)


▶ And of older children:

π1 f (S0 + 1, µ1 ) + (1 − π1 )f (S0 , µ2 ), (15)


▶ Difference in mean earnings:

△E[W ] = (1 − π1 )[f (S0 + 1, µ2 ) − f (S0 , µ2 )], (16)


▶ Difference in mean schooling:

△E[S] = π1 × 0 + (1 − π1 ) × 1 = 1 − π1 , (17)
▶ Wald estimator/LATE:
$$\frac{\Delta E[W]}{\Delta E[S]} = f(S_0 + 1, \mu_2) - f(S_0, \mu_2), \tag{18}$$
Wolpin on Angrist & Krueger’s Birth-Quarter Instrument

▶ A simple economic model shows us that, under relatively mild assumptions about which children attain more schooling, the LATE is
$$f(S_0 + 1, \mu_2) - f(S_0, \mu_2),$$
▶ This is the return to an additional year of schooling for those
induced to attain it by the accident of birth quarter
Wolpin on Angrist & Krueger’s Birth-Quarter Instrument

▶ Since the higher-“ability” students will attain that additional year anyway:
▶ f (S0 + 1, µ2 ) − f (S0 , µ2 ) is for those whose private return is
low enough for them to “drift into” an additional year of
schooling but not so low that they drop out regardless
▶ We only rule out students being made more likely to drop out
if they are “moved by instrument” - i.e. they are assigned a
quarter of birth that would give them a longer schooling
period, but are induced by this fact to drop out (defiers)
▶ Contrast: those who drop out no matter what the instrument
does are never-takers

▶ Angrist & Krueger (1991) find that the private return to an additional year of education is about 7% in 1970 and 10% in 1980 (for men aged 41-50 in those years)
Butcher & Case (1994)

▶ Use a different strategy, but still an IV

▶ Y : earnings again

▶ D: additional year of schooling


▶ Z: Whether firstborn child in household is female or male
▶ Mechanisms discussed in Wolpin & orig. paper
▶ Note that validity of instrument questioned by Rosenzweig &
Wolpin (2000) - presence of sisters depends on fertility
▶ Monotonicity =⇒ ?
Wolpin on Butcher & Case’s Sibling-Gender Instrument
▶ Mean adult earnings of girls with firstborn brothers:

π1 f (S0 , µ1 ) + (1 − π1 )f (S0 , µ2 ), (19)


▶ And of girls with firstborn sisters:

π1 f (S0 + 1, µ1 ) + (1 − π1 )f (S0 , µ2 ), (20)


▶ Difference in mean earnings:

△E[W ] = π1 [f (S0 + 1, µ1 ) − f (S0 , µ1 )], (21)


▶ Difference in mean schooling:

△E[S] = π1 × 1 + (1 − π1 ) × 0 = π1 , (22)
▶ Wald estimator/LATE:
$$\frac{\Delta E[W]}{\Delta E[S]} = f(S_0 + 1, \mu_1) - f(S_0, \mu_1), \tag{23}$$
Wolpin on Butcher & Case’s Sibling-Gender Instrument

▶ A simple economic model shows us that, under relatively mild assumptions about which children attain more schooling, the LATE is
$$f(S_0 + 1, \mu_1) - f(S_0, \mu_1),$$
▶ This is the return to an additional year of schooling for those
induced not to attain it by the presence of a firstborn brother
Wolpin on Butcher & Case’s Sibling-Gender Instrument

▶ $f(S_0 + 1, \mu_1) - f(S_0, \mu_1)$ is the change for those whose ability is high enough that in the absence of additional costs they would obtain an additional year of schooling, but are in the event prevented from doing so:
▶ i.e. relatively higher-ability types
▶ low-ability types are “always takers” who don’t obtain the
additional year regardless - instrument “moves in the opposite
direction”

▶ So the two instruments estimate different parameters, even though in principle both return “unbiased estimators of ‘the’ return to schooling”!
▶ Result: the Butcher & Case instrument will overstate the return to schooling for the overall population, while the Angrist & Krueger instrument will understate it
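The comparison of the two Wald estimators can be worked through numerically. The sketch below plugs assumed values into equations (12)-(13) and (19)-(20): a hypothetical log-earnings function f(S, µ) = µS, so that the return to a year of schooling is simply µ, with µ1 = 0.12, µ2 = 0.06, π1 = 0.4 and S0 = 11. These numbers are illustrative only and do not come from Wolpin or from either paper.

```python
# Assumed, purely illustrative parameter values.
MU1, MU2 = 0.12, 0.06      # returns for high- and low-"ability" types, MU1 > MU2
PI1 = 0.4                  # share of high-ability types
S0 = 11                    # baseline years of schooling

def f(s, mu):
    """Hypothetical log-earnings function: linear in schooling with slope mu."""
    return mu * s

# Angrist & Krueger: only low-ability types are moved by quarter of birth.
younger = PI1 * f(S0 + 1, MU1) + (1 - PI1) * f(S0 + 1, MU2)   # equation (12)
older = PI1 * f(S0 + 1, MU1) + (1 - PI1) * f(S0, MU2)         # equation (13)
wald_ak = (younger - older) / (1 - PI1)

# Butcher & Case: only high-ability types are moved by sibling gender.
brothers = PI1 * f(S0, MU1) + (1 - PI1) * f(S0, MU2)          # equation (19)
sisters = PI1 * f(S0 + 1, MU1) + (1 - PI1) * f(S0, MU2)       # equation (20)
wald_bc = (sisters - brothers) / PI1

average_return = PI1 * MU1 + (1 - PI1) * MU2                  # population average return

print(f"Quarter-of-birth Wald estimate: {wald_ak:.3f}")
print(f"Sibling-gender Wald estimate:   {wald_bc:.3f}")
print(f"Population average return:      {average_return:.3f}")
```

Under these assumptions the quarter-of-birth ratio recovers µ2 and the sibling-gender ratio recovers µ1, so the population average return lies strictly between the two estimates.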
Keane (2010, JoE) on Lottery IVs

▶ What about if you have a truly random lottery as an IV?

▶ Angrist (1990): exploits Vietnam draft lottery to examine impact of military service on earnings
▶ Y : earnings yet again!

▶ D: military service

▶ Z: randomly allocated “draft lottery” number


Keane (2010, JoE) on Lottery IVs

▶ Consider individuals who serve in the military despite not being drafted (always-takers)
▶ They expect a positive return from military service (by
revealed preference)
▶ In original paper (Angrist, 1990) - find negative effect of
service on earnings
▶ So among compliers, the effect is negative, while the effect is
positive among always-takers!
▶ The Local Average Treatment Effect (LATE) is not a good
guide to “the” effect of military service on earnings!
Keane (2010, JoE) on Lottery IVs

▶ Moreover, the specific mechanism behind the reduction in earnings matters for external validity
▶ The reduction could come from
▶ Return to military experience < return to civilian work
experience
▶ Draft interrupts schooling
▶ Disability/PTSD from Vietnam

▶ Which of these three mechanisms is strongest has different implications for public policy
▶ E.g. a peacetime corps along the same lines outside of wartime
may have different effects - perhaps even positive
Keane (2010, JoE) on Lottery IVs

▶ Moreover no. 2(!): the specific mechanism behind the reduction in earnings matters for monotonicity
▶ Schooling could either be decreased (via interruption, as per
previous slide), or increased (via tuition benefits upon leaving
the military)
▶ If we use our seemingly “perfect” lottery as an IV for schooling
to identify the effect of schooling on earnings, monotonicity
won’t hold - rendering the estimate uninterpretable!
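To see why a failure of monotonicity makes the estimate hard to interpret, consider a stylised simulation in which the instrument raises treatment for some people and lowers it for others. All numbers are assumptions for illustration (40% compliers with effect 1, 20% defiers with effect 3, 40% never-takers); they are not estimates from the draft-lottery literature.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 400_000

# Hypothetical types: the instrument pushes compliers into treatment
# and pushes defiers out of it, so monotonicity fails.
types = rng.choice(["complier", "defier", "never"], size=n, p=[0.4, 0.2, 0.4])
z = rng.integers(0, 2, size=n)                          # randomly assigned instrument

d = np.where(types == "complier", z,
             np.where(types == "defier", 1 - z, 0))     # observed treatment

effect = np.where(types == "defier", 3.0, 1.0)          # every individual effect is positive
y = rng.normal(size=n) + effect * d

itt_y = y[z == 1].mean() - y[z == 0].mean()             # effect of Z on Y
itt_d = d[z == 1].mean() - d[z == 0].mean()             # effect of Z on D
wald = itt_y / itt_d

print(f"Effect of Z on D: {itt_d:.3f}")
print(f"Effect of Z on Y: {itt_y:.3f}")
print(f"Wald/IV estimate: {wald:.3f}")
```

Every individual treatment effect in this design is positive, yet the ratio comes out negative, because the defiers' larger effects enter the numerator with the opposite sign. A random lottery guarantees independence of the instrument, not monotonicity.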
Remedies for Limitations of LATEs?

▶ There is no simple answer, but it is good to be aware and explicitly think through how IV estimates relate to the population
▶ Some researchers combine IVs with explicit economic models
to extrapolate
▶ We will return to some of these issues in the last lecture when
we discuss the Marginal Treatment Effect (MTE)
Nonexaminable Machine-Learning Aside II

▶ Note that the first-stage problem of obtaining $\hat{X}_i$ is a prediction problem
▶ Lots of recent interest in improving statistical tools for
prediction
▶ Survey of some integration between traditional IV methods &
machine-learning approaches in Mullainathan & Spiess (2017)
Nonexaminable Machine-Learning Aside II

▶ Caution: may introduce unknown biases into estimation (see Rosenzweig & Wolpin discussion of IVs later in module, & “forbidden regression” problem later in this lecture)
▶ In general, being sure you have an unbiased estimator in large samples is much harder when you move to a nonlinear or nonparametric setting
▶ (Nonlinear: Y = g(X, ε) where g(.) is some pre-specified, nonlinear function (e.g. Logit or Probit); Nonparametric: g(.) is left unspecified and we are instead estimating a quantity such as E[Y |X = x] without reference to particular parameters)
Nonlinear IV

▶ Suppose we are interested in a nonlinear relationship, e.g.
$$Y_i = 1[\beta X_i \geq \eta_i], \tag{24}$$
▶ I.e. we have $Y_i = 1$ if $\beta X_i \geq \eta_i$ and 0 otherwise. We have to specify the distribution of $\eta_i$ to estimate (Normal for Probit, Logistic for Logit)
▶ Does this change our estimation?
Nonlinear IV: Forbidden Regression

▶ What we shouldn’t do is a “forbidden regression”, i.e. estimating a nonlinear first stage such as
$$X_i = g(\pi Z_i \geq v_i), \tag{25}$$
▶ Followed by taking the predictions $\hat{X}_i$, which are a nonlinear function of $Z_i$, and using them in any second stage
▶ This is because, in general, $E[g(X)] \neq g(E[X])$ (Jensen’s Inequality)
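The inequality behind the forbidden regression is easy to check numerically. The sketch below uses the logistic function as a stand-in for a generic nonlinear g and an assumed distribution for X (Normal with mean 1 and standard deviation 2); both choices are illustrative. It compares the average of g(X) with g evaluated at the average of X.

```python
import numpy as np

def g(x):
    """Logistic function, standing in for a generic nonlinear link."""
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(6)
x = rng.normal(loc=1.0, scale=2.0, size=200_000)   # assumed distribution of X

mean_of_g = g(x).mean()     # sample analogue of E[g(X)]
g_of_mean = g(x.mean())     # g evaluated at the sample analogue of E[X]

print(f"E[g(X)] is approximately {mean_of_g:.3f}")
print(f"g(E[X]) is approximately {g_of_mean:.3f}")
```

The two quantities differ noticeably here, which is why replacing a regressor by its predicted value inside a nonlinear function does not, in general, recover the expectation one is after.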
Nonlinear IV: Forbidden Regression

▶ What we ought to do instead involves, e.g., partial identification, which we’ll cover later, but also a number of approaches beyond the scope of this course
▶ If we’re just interested in the partial difference in averages, we
can just do as French & Song (2014) do and estimate
standard linear model, which will deliver a consistent
estimator under the standard assumptions
▶ We won’t be able to interpret coefficients outside [−1, 1], but
this rarely happens in practice
Instrumental Variables Checklist
▶ Estimate the equation of interest by OLS. Why is this
estimator biased?
▶ Check if your prospective instrument is weak. All your clever
arguments will have been wasted if they are for the validity of
a weak instrument!
▶ Can you argue that your prospective instrument reproduces a
“natural experiment”? (“As-good-as-random” assignment)
▶ Are there likely to be spillovers between different units?
(SUTVA)
▶ Does your instrument push individuals only in one direction
with respect to your explanatory variable of interest?
(Monotonicity)
▶ What is the relationship between the LATE your instrumental
variables estimator recovers and a more general hypothetical
intervention that may have motivated your study?
Summary

▶ We have to argue carefully that we have an instrument that is “as good as random”
▶ Even if an instrument is valid, we have to check that it isn’t
“weak”
▶ Even if an instrument is valid & strong, IVs only recover an
average treatment effect for a particular part of the
distribution - hence “local”
▶ This is not going to stop us from using IVs! Finding a novel
instrument can make a researcher’s career
